
Political Plasticity: An Analysis of Ideological Adaptability in Large Language Models

Bruno Bianchi, Diego Tiscornia, Matias Travizano, Ariel Futoransky

cs.AI
#3081 of 3572 · Artificial Intelligence
Tournament Score: 1284±37
Win Rate: 29% (10 wins, 24 losses, 34 matches)
Rating: 4.8/10
Significance 5.5 · Rigor 3.5 · Novelty 5.5 · Clarity 5.5

Abstract

Since the advent of Large Language Models (LLMs), a significant area of research has focused on their intrinsic biases, particularly in political discourse. This study investigates a different but related concept, "political plasticity", which is defined as the capacity of models to adapt their responses based on the user supplied context. To analyze this, a testing framework was developed using an expanded corpus of 200 politically-oriented questions across economic and personal freedom axes, based on a prior framework by Lester (1996). The study explored several methods to induce political bias, including simplified and topic-based system prompts, as well as user prompts with few-shot examples. The results show that while system prompts were largely ineffective, user prompts successfully elicited significant ideological shifts, particularly along the Economic Freedom axis in larger and newer models. Through a validation experiment, we examined whether models answer questionnaires by recognizing the underlying question format. Inverting the sense of the questions revealed unexpected, counter-intuitive shifts in most models, suggesting potential data leakage. Finally, we also analyzed how model plasticity varies when the experiment is conducted in different languages. The results reveal subtle yet notable shifts across each of the analyzed languages. Overall, our results indicate that small and older LLMs exhibit limited or unstable political plasticity, whereas newer frontier models display reliable, expected adaptability.

AI Impact Assessments

(1 model)

Scientific Impact Assessment: "Political Plasticity: An Analysis of Ideological Adaptability in Large Language Models"

1. Core Contribution

This paper introduces the concept of "political plasticity" (the capacity of LLMs to shift ideological positions based on user-supplied context) and distinguishes it from the more commonly studied intrinsic political bias. The authors develop a testing framework with 200 politically-oriented questions (expanded from Lester's 1996 20-item questionnaire) across Economic Freedom and Personal Freedom axes. They systematically compare multiple bias-induction methods (system prompts with simplified labels, system prompts with topic descriptions, and user prompts with few-shot examples) and evaluate plasticity across 14 models of varying sizes and generations, in multiple languages, and with inverted question formulations.

The key findings are: (1) system prompts are largely ineffective at inducing ideological shifts; (2) user prompts with few-shot political examples successfully elicit shifts, particularly along the Economic Freedom axis; (3) inverting question polarity reveals counter-intuitive behavior suggesting potential data leakage or training contamination; and (4) newer, larger models (especially GPT-5 variants) exhibit more reliable and expected plasticity.

2. Methodological Rigor

The experimental design follows a reasonable progression from simple to complex bias-induction strategies, with validation experiments that strengthen the findings. The expansion from 20 to 200 questions and cross-model validation of the questionnaire are sensible improvements. The use of two complementary metrics (Most Probable Response and Probability of Affirmative Response) adds robustness.
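
The review names the two metrics without defining them. As a minimal sketch, assuming each question yields a probability for every answer option, one plausible reading is that the Most Probable Response is the highest-probability option and the Probability of Affirmative Response is the normalized mass on "Yes". The function names, the "Yes"/"No" labels, and the example numbers are hypothetical and are not taken from the paper.

```python
from typing import Dict


def most_probable_response(probs: Dict[str, float]) -> str:
    """Return the answer option with the highest probability (assumed MPR)."""
    if not probs:
        raise ValueError("probs must contain at least one option")
    return max(probs, key=probs.get)


def prob_affirmative(probs: Dict[str, float], affirmative: str = "Yes") -> float:
    """Return the normalized probability mass on the affirmative option (assumed PAR)."""
    total = sum(probs.values())
    if total <= 0:
        raise ValueError("probabilities must sum to a positive value")
    return probs.get(affirmative, 0.0) / total


# Hypothetical answer distributions for one question under two conditions.
weak_yes = {"Yes": 0.55, "No": 0.45}
strong_yes = {"Yes": 0.95, "No": 0.05}

mpr_weak = most_probable_response(weak_yes)
mpr_strong = most_probable_response(strong_yes)
par_weak = prob_affirmative(weak_yes)
par_strong = prob_affirmative(strong_yes)
```

Under these assumptions the two metrics can diverge: moving from 0.55 to 0.95 on "Yes" leaves the most probable response unchanged while the affirmative probability shifts substantially, which is one reason reporting both could add robustness.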

However, several methodological concerns limit confidence in the results:

  • Question generation and validation: The augmented 200-question corpus was generated by ChatGPT and validated by other LLMs, then manually reviewed. This introduces potential circularity — LLMs generating test items for LLM evaluation. The corpus is not yet publicly available, preventing independent scrutiny.
  • Statistical analysis: The paper lacks formal statistical testing. Results are presented as visual differences in scatter plots without confidence intervals, effect sizes, or significance tests. Given the stochastic nature of LLM outputs, this is a notable gap.
  • Probability estimation for closed-API models: For GPT-5 mini and nano, probabilities were estimated by running queries 10 times at maximum temperature — a crude approximation that may introduce substantial noise.
  • Limited control conditions: The paper does not include a neutral (no-bias) baseline systematically compared against biased conditions, making it difficult to assess the magnitude and direction of shifts relative to the models' default positions.
  • Few-shot example design: The antithetical question pairs used to balance "Yes"/"No" responses in Experiment 3 are a thoughtful control, but the random selection of 4 from 9 topics per instance introduces variance that is not quantified.
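
To make the noise concern about 10-sample estimates concrete, the sketch below computes a 95% Wilson score interval for a proportion estimated from a small number of repeated queries. The choice of the Wilson interval, the function name, and the 7-of-10 example are assumptions for illustration; the paper, as described here, reports no such intervals.

```python
import math
from typing import Tuple


def wilson_interval(successes: int, n: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 gives ~95%)."""
    if n <= 0 or not 0 <= successes <= n:
        raise ValueError("need n > 0 and 0 <= successes <= n")
    p_hat = successes / n
    z2 = z * z
    denom = 1.0 + z2 / n
    center = (p_hat + z2 / (2.0 * n)) / denom
    half_width = z * math.sqrt(p_hat * (1.0 - p_hat) / n + z2 / (4.0 * n * n)) / denom
    return center - half_width, center + half_width


# Hypothetical outcome: 7 "Yes" answers in 10 runs of the same prompt.
low, high = wilson_interval(successes=7, n=10)
```

With 7 "Yes" answers out of 10 runs, the interval spans roughly 0.40 to 0.89, so an observed shift of a few tenths between conditions could not be separated from sampling noise at this sample size.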

3. Potential Impact

The concept of political plasticity is practically important. If LLMs can be easily steered toward particular ideological positions through user interaction alone (without system prompt access), this has implications for:

  • Electoral manipulation and misinformation: Malicious actors could craft prompts to generate politically biased content at scale.
  • Sycophancy and echo chambers: Models that adapt to perceived user ideology may reinforce existing beliefs rather than provide balanced information.
  • AI policy and governance: Regulators need to understand not just what biases models have by default, but how malleable those biases are.

The finding that user prompts are more effective than system prompts is particularly relevant, as it suggests bias induction is accessible to ordinary users, not just those with backend access.

4. Timeliness & Relevance

This work addresses a timely concern as LLMs become embedded in information-seeking behavior during election cycles worldwide. The distinction between static bias and dynamic plasticity is a useful conceptual contribution to the growing literature on AI and political discourse. The multilingual analysis (six languages) adds practical relevance for non-English-speaking populations.

The inclusion of very recent models (GPT-5-mini, GPT-5-nano, dated 2025) makes this among the first studies to benchmark these specific systems for political adaptability. However, this also means the findings are highly perishable: model updates could invalidate results quickly.

5. Strengths & Limitations

Strengths:

  • Clear conceptual framing distinguishing plasticity from intrinsic bias
  • Systematic experimental progression with multiple induction methods
  • The inverted-question validation experiment is a creative and valuable contribution, revealing that models may recognize questionnaire formats (potential data leakage)
  • Multilingual analysis adds breadth
  • Broad model coverage spanning open-source and proprietary systems

Limitations:

  • Absence of statistical rigor (no hypothesis tests, confidence intervals, or effect size measures)
  • The 2D political framework (Economic Freedom / Personal Freedom) is acknowledged as reductive but not supplemented with alternatives
  • LLM-generated test corpus raises validity concerns
  • Results are presented primarily through scatter plots that are difficult to interpret precisely
  • The paper does not deeply engage with why certain models show plasticity and others don't — architectural, training data, or RLHF differences are mentioned but not investigated
  • Limited comparison to concurrent work (e.g., Bernardelle et al. 2025 is cited but not deeply contrasted)
  • The "data leakage" hypothesis from Validation Experiment 2 is suggestive but not confirmed through additional investigation

Overall Assessment

This paper makes a useful conceptual contribution by framing "political plasticity" as distinct from intrinsic bias and provides a broad empirical survey across models, methods, and languages. However, the lack of statistical rigor, the LLM-generated test corpus, and the absence of deeper mechanistic analysis limit the strength of its conclusions. The work reads more as an exploratory survey than a definitive study, but it raises important questions and provides a reasonable framework that others could build upon.

Rating: 4.8/10
Significance 5.5 · Rigor 3.5 · Novelty 5.5 · Clarity 5.5

Generated May 12, 2026

    Comparison History (34)

    Lost vs. An Agentic AI Framework with Large Language Models and Chain-of-Thought for UAV-Assisted Logistics Scheduling with Mobile Edge Computing

    Paper 1 presents a novel agentic AI framework combining LLMs with hierarchical deep reinforcement learning for a complex hybrid scheduling problem in UAV-assisted logistics with mobile edge computing. It addresses a timely, practical problem in cloud manufacturing with strong methodological rigor (hierarchical PPO, RAG, chain-of-thought reasoning) and demonstrates strong empirical results. Paper 2 offers an interesting empirical analysis of LLM political plasticity but is more observational and narrower in scope, with fewer direct applications. Paper 1's cross-disciplinary contribution spanning AI, optimization, UAV systems, and edge computing gives it broader potential impact.

    claude-opus-4-6 · May 14, 2026

    Lost vs. Beyond Accuracy: Evaluating Strategy Diversity in LLM Mathematical Reasoning

    Paper 1 is likely higher impact: it introduces a novel, well-scoped evaluation dimension (strategy diversity) beyond accuracy for mathematical reasoning, with a structured taxonomy (217 strategy families), dual-AI coding plus human adjudication, and robustness checks—stronger methodological rigor. Its findings generalize to evaluation, training, and interpretability of reasoning models, with broad relevance across ML evaluation and education. Paper 2 addresses an important topic (political behavior), but relies on questionnaire-style elicitation that appears confounded by format recognition/data leakage, making conclusions less robust and potentially less transferable.

    gpt-5.2 · May 12, 2026

    Lost vs. WindINR: Latent-State INR for Fast Local Wind Query and Correction in Complex Terrain

    Paper 2 likely has higher scientific impact: it proposes a technically novel latent-state implicit neural representation with a clear separation of reusable learning and fast inference-time correction, addressing a practical, high-value need in atmospheric/wind modeling. The methodology is more rigorously specified (OSSEs, robustness tests, uncertainty-aware correction objective, CPU benchmarks) and has direct real-world applications (UAV operations, local forecasting in complex terrain). Its ideas (latent correction, continuous queryable fields) are broadly transferable to other geophysical and spatiotemporal inference problems.

    gpt-5.2 · May 12, 2026

    Won vs. Results and Retrospective Analysis of the CODS 2025 AssetOpsBench Challenge

    Paper 2 has higher likely scientific impact due to broader timeliness and cross-field relevance: it targets LLM behavior under prompting, ideological bias/adaptability, multilingual effects, and potential data leakage—issues central to AI safety, alignment, evaluation, policy, and social science measurement. It proposes a reusable testing framework and reports systematic experiments across model sizes and languages, with a validation manipulation (question inversion) that can influence future methodology. Paper 1 is valuable but largely a retrospective of a single competition/evaluation design, with narrower generalizability and application scope.

    gpt-5.2 · May 12, 2026

    Won vs. Shadow-Loom: Causal Reasoning over Graphical World Model of Narratives

    Paper 2 addresses a timely, broadly relevant topic—LLM political biases and adaptability—with rigorous empirical methodology (200 questions, multiple prompting strategies, cross-lingual analysis, validation experiments). It has clear implications for AI safety, policy, and alignment research, attracting interest from both NLP and social science communities. Paper 1, while intellectually interesting in combining causal reasoning with narrative theory, is a niche research artifact without benchmarking, limiting its immediate measurable impact and uptake. Paper 2's findings on data leakage and model plasticity are actionable for the broader AI community.

    claude-opus-4-6 · May 12, 2026

    Lost vs. SkillMaster: Toward Autonomous Skill Mastery in LLM Agents

    Paper 1 presents a novel, agent-centric training framework for autonomous skill creation/refinement/selection with concrete algorithmic contributions (trajectory-informed review, counterfactual utility probes, DualAdv-GRPO) and demonstrated performance gains on standard embodied/web agent benchmarks, suggesting strong methodological rigor and direct applicability to building more capable, self-improving agents. Its impact likely spans RL, LLM agents, and automation. Paper 2 is timely and socially relevant, but is primarily an analysis with potential confounds (question inversion effects/data leakage) and less clear actionable methodological advances, limiting scientific and cross-field impact.

    gpt-5.2 · May 12, 2026

    Lost vs. Agentic Performance at the Edge: Insights from Benchmarking

    Paper 1 addresses a critical and highly practical bottleneck in AI: deploying agentic models on resource-constrained edge devices. Its findings on the accuracy-latency Pareto front and model-tool interactions offer broad, immediate real-world applications across IoT, mobile computing, and robotics. While Paper 2 provides interesting insights into LLM behavioral alignment and social science, Paper 1's systems-level focus on overcoming hardware constraints positions it for wider foundational impact in enabling ubiquitous AI deployment.

    gemini-3.1-pro-preview · May 12, 2026

    Lost vs. E-TCAV: Formalizing Penultimate Proxies for Efficient Concept Based Interpretability

    Paper 1 (E-TCAV) addresses a fundamental challenge in AI interpretability—computational efficiency of concept-based explanations—with rigorous methodology across multiple architectures and datasets. It provides theoretical grounding and practical speedups that enable real-time applications, broadly impacting the model debugging and trustworthy AI community. Paper 2, while timely and interesting regarding LLM political biases, is more descriptive/empirical with narrower scope and less methodological novelty. E-TCAV's contributions to efficient interpretability have broader cross-domain applicability and stronger potential for adoption in production ML systems.

    claude-opus-4-6 · May 12, 2026

    Lost vs. Learning the Interaction Prior for Protein-Protein Interaction Prediction: A Model-Agnostic Approach

    Paper 2 has higher impact potential: it introduces a biologically grounded, model-agnostic, plug-and-play classifier module (L3-PPI) that can improve many existing PPI predictors, with clear real-world applications in biology, drug discovery, and disease mechanism studies. It leverages an explicit interaction prior (L3 rule), provides empirical dataset evidence, and proposes a concrete method with extensive experiments, suggesting stronger methodological rigor and translational relevance. Paper 1 is timely for AI safety/bias but is more diagnostic/behavioral, with narrower downstream utility and potential confounds (e.g., questionnaire leakage) limiting actionable impact.

    gpt-5.2 · May 12, 2026

    Lost vs. Using Large Language Models and Knowledge Graphs to Improve the Interpretability of Machine Learning Models in Manufacturing

    Paper 1 offers a more clearly novel, integrative methodology (KG-grounded selective retrieval + LLM generation) aimed at a concrete, high-value problem: actionable interpretability of ML in manufacturing. It demonstrates applied evaluation with tailored question sets and mixed quantitative/qualitative metrics, and has direct real-world deployment potential and cross-domain relevance (XAI, industrial AI, knowledge representation, LLM tool-use). Paper 2 is timely and interesting for AI safety/policy, but its contribution is primarily evaluative/diagnostic with potential confounds (question inversion/data leakage) and narrower immediate application.

    gpt-5.2 · May 12, 2026