AMEL: Accumulated Message Effects on LLM Judgments
Sid-ali Temkit
Abstract
Large language models are routinely used as automated evaluators: to review code, moderate content, or score outputs, often with many items passing through one conversation. We ask whether the polarity of prior conversation history biases subsequent judgments, an effect we call the accumulated message effect on LLM judgments (AMEL). Across 75,898 API calls to 11 models from 4 providers (OpenAI, Anthropic, Google, and four open-source models), we present identical test items in isolation or following histories saturated with predominantly positive or negative evaluations. Models shift toward the conversation's prevailing polarity (d = -0.17, p < 10^-46). The effect concentrates on items where the model is genuinely uncertain at baseline (d = -0.34 for high-entropy items, vs d = -0.15 when the baseline is deterministic). Bias does not grow with context length: 5 prior turns and 50 produce the same shift (Spearman |r| < 0.01; OLS slope p = 0.80). And there is a negativity asymmetry: paired per item, negative histories induce 1.62x more bias than positive (t = 13.46, p < 10^-39, n = 2,481). Scaling helps but does not solve it (Anthropic: Haiku -0.22 to Opus -0.17; OpenAI: Nano -0.34 to GPT-5.2 -0.17). Three follow-ups narrow the mechanism. The token probability distribution shifts continuously, not at a threshold. The negativity asymmetry has both token-level and semantic components, though attributing the balance is exploratory at our sample sizes. Position does not matter: five biased turns anywhere in a 50-turn history produce the same shift. The simplest fix for evaluation pipelines is a fresh context per item; when batching is unavoidable, balancing the history helps.
AI Impact Assessments
(1 models)Scientific Impact Assessment: AMEL — Accumulated Message Effects on LLM Judgments
1. Core Contribution
This paper identifies and rigorously quantifies a specific bias in LLMs used as evaluators: when multiple items are assessed within a single conversation, the polarity of prior evaluations systematically shifts subsequent judgments. The authors term this the "Accumulated Message Effect on LLM Judgments" (AMEL). The key distinction from related phenomena (sycophancy, few-shot label bias, anchoring) is clearly articulated: no user opinion is expressed, and the model conforms to *its own* prior evaluation pattern rather than to formatted label demonstrations or explicit anchors.
The paper delivers several well-quantified findings: (1) a moderate overall effect (d = −0.17), (2) concentration on uncertain items (d = −0.34 for high-entropy items), (3) rapid saturation (5 turns ≈ 50 turns), (4) negativity asymmetry (1.62× paired ratio), and (5) scaling helps but doesn't eliminate the effect. These are complemented by three mechanistic experiments probing continuous probability shifts, token-vs-semantic sources of negativity bias, and positional independence.
2. Methodological Rigor
The experimental design is notably thorough for a bias characterization study. The scale — 75,898 API calls across 11 models from 4 providers — provides strong statistical power and cross-model generalizability. Several methodological choices demonstrate care:
However, there are notable limitations in rigor. The mechanistic experiments (Sections 5.1–5.3) cover only 1–2 models and one domain, making the mechanistic conclusions considerably weaker than the main effect characterization. The flipped-framing experiment's per-model differences are explicitly underpowered (n = 21 items), and the authors appropriately label these as "exploratory." The temperature sensitivity check covers a single model. The Gemini data is partial due to API quotas. The author-coded item labels lack inter-rater validation, though the empirical-entropy stratification largely mitigates this concern.
3. Potential Impact
Practical impact is likely the strongest dimension. LLM-as-judge is now standard practice across industry (code review, content moderation, exam grading, RLHF reward modeling). The finding that batched evaluation within a single conversation introduces systematic bias — particularly on the most uncertain (and thus most consequential) items — has immediate operational implications. The actionable recommendation ("fresh context per item") is simple and directly implementable.
Research impact is moderate. The paper contributes to the growing literature on LLM cognitive biases and LLM-as-judge reliability. The negativity asymmetry finding connects to existing work on negative bias in LLMs (Braun, Cheung et al., Yu et al.), and the saturation finding connects to drift equilibria (Dongre et al.). However, the paper is primarily empirical characterization rather than mechanistic insight or novel mitigation — it documents the problem thoroughly but offers only obvious fixes (fresh contexts, balanced batching).
The connection to concurrent work on conversational inertia (Simhi et al., Wan et al.) positions AMEL as one instantiation of a broader phenomenon, which somewhat reduces its novelty but increases its relevance as a convergent finding.
4. Timeliness & Relevance
Highly timely. LLM-as-judge is experiencing rapid adoption, and the cost pressures that motivate batched evaluation (multiple items per conversation) are real. The paper addresses a practical vulnerability that many production systems likely exhibit. The breadth of model coverage, including frontier models from 2026 (GPT-5.2, Claude Opus 4.6), ensures the findings are current.
5. Strengths & Limitations
Key Strengths:
Key Limitations:
Additional Observations
The finding that Qwen3 4B shows *contrarian* bias (d = +0.19) is intriguing and suggests instruction-tuning choices can overcorrect for conversational inertia, potentially introducing a different kind of unreliability. The Gemini Pro > Flash reversal in bias magnitude, potentially linked to thinking tokens, deserves follow-up with adequate sample sizes.
The congruent/incongruent analysis (Section 4.8) adds nuance but the qualitative examples (Table 5) show alarming behavior: GPT-4.1 Nano justifying clearly inappropriate comments as "appropriate" after negative context history, suggesting AMEL can produce not just statistical shifts but qualitatively wrong evaluations.
Overall, this is a solid empirical contribution to an important practical problem. Its impact will be strongest in informing best practices for LLM evaluation pipelines, though the lack of novel mitigation methods and limited mechanistic depth constrain its scientific contribution.
Generated May 22, 2026
Comparison History (19)
Paper 2 addresses a ubiquitous and critical issue in modern AI: the reliability of LLMs as automated evaluators. Its massive empirical analysis across major models exposes a fundamental in-context bias that immediately impacts how AI evaluation pipelines are designed, giving it broader and more immediate real-world relevance than Paper 1's safe RL framework.
Paper 2 (AMEL) has higher scientific impact due to its broader applicability, methodological rigor, and practical relevance. It identifies a systematic bias affecting all LLM-as-judge pipelines across 11 models and 4 providers with ~76K API calls, offering actionable mitigation strategies. This affects numerous downstream applications (code review, content moderation, evaluation benchmarks). Paper 1 is a narrow single-speech case study with limited generalizability, comparing specific tools for political speech analysis. AMEL's findings are more fundamental, timely, and widely relevant to the rapidly growing LLM evaluation ecosystem.
Paper 2 (AMEL) addresses a fundamental bias in LLM-as-judge evaluation pipelines that affects a vast range of applications (code review, content moderation, output scoring). Its findings—that conversational history biases subsequent judgments, with negativity asymmetry and entropy-dependent effects—have immediate, broad implications for anyone using LLMs as evaluators. The rigorous experimental design (75,898 API calls, 11 models, 4 providers) and actionable mitigation advice (fresh context per item) give it high practical impact. Paper 1 introduces a valuable but more niche benchmark for T2I prompting proficiency, with narrower applicability.
While Paper 1 provides valuable empirical insights into LLM-as-a-judge biases, Paper 2 tackles a critical bottleneck in the highly active field of reasoning agents. By demonstrating that complex training pipelines can be replaced with a simple, scalable self-evolution method (GRPO + self-distillation) to achieve state-of-the-art results, Paper 2 has immense potential to reshape how open-source reasoning models are trained.
Gated DeltaNet-2 introduces a fundamental architectural innovation in linear attention mechanisms—decoupling erase and write gates—with strong empirical results across language modeling, reasoning, and retrieval benchmarks. It generalizes multiple existing architectures (Gated DeltaNet, KDA), provides theoretical grounding (fast-weight update view, chunkwise algorithm), and addresses a core challenge in efficient sequence modeling. Its impact spans architecture design, efficient inference, and long-context modeling. Paper 1 identifies an interesting LLM evaluation bias (AMEL) but is more narrowly scoped to evaluation practices with a straightforward mitigation (fresh context per item), limiting its broader scientific impact.
Paper 1 addresses a fundamental and widely relevant bias in LLM-as-judge pipelines, backed by a large-scale empirical study (75,898 API calls, 11 models, 4 providers). Given the explosive adoption of LLMs as automated evaluators across NLP, software engineering, and content moderation, this finding has immediate broad impact. The rigorous quantification of the accumulated message effect, negativity asymmetry, and practical mitigation advice makes it highly actionable. Paper 2 proposes a domain-specific architecture for circular manufacturing—valuable but narrower in scope and audience, with evaluation limited to two use cases.
Paper 2 likely has higher scientific impact due to broad, immediate relevance: it identifies and quantifies a systematic bias in LLM-as-judge workflows across many models/providers with a large-scale experimental design (75,898 calls), clear effect sizes, and actionable mitigation guidance. This impacts evaluation, moderation, benchmarking, and agent pipelines across fields. Paper 1 is novel and useful for prompt optimization, but is narrower in scope (aggregate-only Bayesian optimization for system prompts) and its evidence is more task/budget-specific, making generalization and downstream impact less certain.
Paper 1 addresses a critical, highly timely issue (biases in LLM-as-a-judge systems) with a rigorous, large-scale empirical study. Its findings have immediate and broad implications for AI evaluation pipelines across numerous domains. While Paper 2 presents a useful benchmark for knowledge graph integration, its scope and potential breadth of impact are much more niche compared to the widespread, cross-disciplinary reliance on LLM evaluators.
Paper 2 (AMEL) has higher potential scientific impact due to its rigorous empirical methodology (75,898 API calls across 11 models), quantifiable and reproducible findings about a systematic bias in LLM-as-judge pipelines, and broad applicability across any field using LLMs for automated evaluation. It identifies a concrete, measurable problem with practical mitigations. Given the explosive adoption of LLMs as evaluators in research and industry, this bias characterization is both timely and consequential. Paper 1, while valuable, is a qualitative interview study with 24 participants offering primarily descriptive insights about AI's impact on workplace culture.
Paper 2 likely has higher scientific impact due to timeliness and broad relevance: it identifies a systematic bias in LLM-as-judge settings that directly affects widely deployed evaluation, moderation, and benchmarking pipelines. It demonstrates the effect across many models/providers with large-scale experiments, quantifies key moderators (uncertainty, negativity asymmetry, context length), and proposes actionable mitigations. The findings generalize across fields using LLM evaluation (NLP, software engineering, HCI, safety). Paper 1 is novel within assurance-case confidence semantics but is narrower in audience and application domain.
Paper 1 identifies and quantifies a fundamental bias in LLMs used as evaluators, a practice now ubiquitous across AI research. Its rigorous methodology, large-scale evaluation across multiple models, and broad applicability to any field using 'LLM-as-a-judge' give it a wider scientific impact compared to Paper 2, which presents a highly effective but more specialized applied system for enterprise workflow automation.
Paper 2 (AMEL) addresses a fundamental and broadly relevant issue affecting the rapidly growing use of LLMs as automated evaluators across many domains. Its rigorous experimental design (75,898 API calls, 11 models, 4 providers) identifies a systematic bias with clear practical implications for any LLM evaluation pipeline. The findings—negativity asymmetry, entropy-dependent effects, and actionable mitigations—are immediately applicable across AI safety, content moderation, code review, and benchmarking. Paper 1, while methodologically interesting, addresses a narrower domain (Holocaust oral history archives) with less generalizable impact.
Paper 2 has higher potential impact due to its broad, timely relevance to the rapidly expanding use of LLMs as evaluators in industry and research. It introduces a general, actionable bias phenomenon (AMEL) with clear operational mitigations, and tests it at large scale across many models/providers with strong statistical evidence and mechanistic follow-ups, indicating solid rigor and replicability. The findings affect evaluation methodology, safety/moderation, benchmarking, and deployment practices across fields. Paper 1 is innovative and valuable for digital humanities, but its direct applicability and cross-domain spillover are narrower.
Paper 2 is more novel and broadly impactful: it contributes theoretical decomposition results and a principled algorithm (Fisher-SEP) for combining simulators with real-world experimentation under confounding/drift, a core sim-to-real problem spanning robotics, operations, healthcare, and RL. Its results (identifiability limits, passive-learning reachability gap, and variance-minimizing experimental design) are likely to generalize and influence methodology and practice. Paper 1 is timely and rigorously measured with clear practical implications for LLM evaluation pipelines, but its scope is narrower and mainly diagnostic/mitigative within LLM-based judging.
Paper 1 addresses a fundamental problem in sequential decision-making—bridging sim-to-real gaps—with novel theoretical contributions (extended simulation lemma, value gap decomposition, reachability bounds) and a principled algorithm (Fisher-SEP). It has broad applicability across operations research, reinforcement learning, healthcare, and supply chains. Paper 2 identifies an important but relatively narrow bias (AMEL) in LLM evaluators, offering useful empirical findings and practical mitigations (fresh context per item). While timely, its contributions are primarily observational and specific to LLM evaluation pipelines. Paper 1's theoretical depth, methodological rigor, and cross-domain applicability give it greater long-term scientific impact.
AMEL identifies a fundamental and previously undercharacterized bias in LLM-as-judge paradigms, which is now a widespread practice across NLP research, content moderation, and code review. The finding that conversational history systematically shifts LLM judgments has broad implications for any pipeline using LLMs as evaluators, affecting reproducibility across many fields. The large-scale empirical rigor (75K+ API calls, 11 models, 4 providers) and actionable mitigation advice give it high practical relevance. PALS addresses an important but more niche systems optimization problem (energy-efficient LLM serving) with incremental engineering contributions over existing power-capping and scheduling work.
Paper 2 identifies a fundamental bias in LLMs-as-judges, a widely used paradigm across AI research and industry. Its extensive, rigorous evaluation across 11 models provides critical insights into context-induced biases and negativity asymmetry. While Paper 1 offers a valuable system optimization for agent latency, Paper 2's findings have broader, immediate implications for the reliability of AI evaluation pipelines and general LLM behavior.
Paper 1 introduces a novel paradigm—adapting the runtime harness rather than model weights—demonstrating large improvements (88.5% average relative) across 18 models and 7 environments with strong transferability. This reframes agent improvement methodology and has broad practical applicability. Paper 2 identifies an important but relatively narrower bias (AMEL) in LLM-as-judge settings with a modest effect size (d=-0.17) and a straightforward mitigation (fresh context per item). While rigorous and useful, Paper 1's conceptual contribution and demonstrated breadth of impact position it for higher scientific influence.
Paper 2 likely has higher scientific impact: it identifies and quantifies a broadly relevant, underappreciated bias in LLM-based evaluation pipelines across many models/providers with large-scale, controlled experiments and clear practical mitigations. Its implications span ML evaluation, alignment, HCI, and any domain using LLM judges, making the breadth and timeliness very high. Paper 1 is a solid, application-specific advance for AV stress testing with good engineering novelty, but its impact is narrower to autonomous driving simulation and depends more on benchmark/ecosystem adoption.