AMEL: Accumulated Message Effects on LLM Judgments

Sid-ali Temkit

May 21, 2026

arXiv:2605.22714v1 PDF

cs.AI(primary)cs.CLcs.LG

#827of 2292·Artificial Intelligence

#827 of 2292 · Artificial Intelligence

Tournament Score

1441±48

10501800

74%

Win Rate

Wins

Losses

Matches

Rating

6.5/ 10

Significance7

Rigor7.5

Novelty5.5

Clarity8

Tournament Score

1441±48

10501800

74%

Win Rate

Wins

Losses

Matches

Rating

6.5/ 10

Significance

Rigor

Novelty

Clarity

Abstract

Large language models are routinely used as automated evaluators: to review code, moderate content, or score outputs, often with many items passing through one conversation. We ask whether the polarity of prior conversation history biases subsequent judgments, an effect we call the accumulated message effect on LLM judgments (AMEL). Across 75,898 API calls to 11 models from 4 providers (OpenAI, Anthropic, Google, and four open-source models), we present identical test items in isolation or following histories saturated with predominantly positive or negative evaluations. Models shift toward the conversation's prevailing polarity (d = -0.17, p < 10^-46). The effect concentrates on items where the model is genuinely uncertain at baseline (d = -0.34 for high-entropy items, vs d = -0.15 when the baseline is deterministic). Bias does not grow with context length: 5 prior turns and 50 produce the same shift (Spearman |r| < 0.01; OLS slope p = 0.80). And there is a negativity asymmetry: paired per item, negative histories induce 1.62x more bias than positive (t = 13.46, p < 10^-39, n = 2,481). Scaling helps but does not solve it (Anthropic: Haiku -0.22 to Opus -0.17; OpenAI: Nano -0.34 to GPT-5.2 -0.17). Three follow-ups narrow the mechanism. The token probability distribution shifts continuously, not at a threshold. The negativity asymmetry has both token-level and semantic components, though attributing the balance is exploratory at our sample sizes. Position does not matter: five biased turns anywhere in a 50-turn history produce the same shift. The simplest fix for evaluation pipelines is a fresh context per item; when batching is unavoidable, balancing the history helps.

AI Impact Assessments

(1 models)

Scientific Impact Assessment: AMEL — Accumulated Message Effects on LLM Judgments

1. Core Contribution

This paper identifies and rigorously quantifies a specific bias in LLMs used as evaluators: when multiple items are assessed within a single conversation, the polarity of prior evaluations systematically shifts subsequent judgments. The authors term this the "Accumulated Message Effect on LLM Judgments" (AMEL). The key distinction from related phenomena (sycophancy, few-shot label bias, anchoring) is clearly articulated: no user opinion is expressed, and the model conforms to *its own* prior evaluation pattern rather than to formatted label demonstrations or explicit anchors.

The paper delivers several well-quantified findings: (1) a moderate overall effect (d = −0.17), (2) concentration on uncertain items (d = −0.34 for high-entropy items), (3) rapid saturation (5 turns ≈ 50 turns), (4) negativity asymmetry (1.62× paired ratio), and (5) scaling helps but doesn't eliminate the effect. These are complemented by three mechanistic experiments probing continuous probability shifts, token-vs-semantic sources of negativity bias, and positional independence.

2. Methodological Rigor

The experimental design is notably thorough for a bias characterization study. The scale — 75,898 API calls across 11 models from 4 providers — provides strong statistical power and cross-model generalizability. Several methodological choices demonstrate care:

Bonferroni correction with a factor of 21 matching the exact number of primary tests prevents p-hacking concerns.

Empirical entropy stratification (Section 4.9) supersedes the author-coded ambiguity labels, providing a more objective measure of model uncertainty that strengthens the "uncertainty predicts susceptibility" claim.

Mixed-effects modeling (Section 4.10) accounts for hierarchical nesting, finding low ICC (0.031) — confirming AMEL is universal, not model-specific.

Parser v2 development after discovering asymmetric extraction patterns shows intellectual honesty; the transparent reporting of how v1→v2 changed results strengthens credibility.

Multiple deduplication strategies for the Qwen3 30B data incident, with transparent reporting of how each affects results.

However, there are notable limitations in rigor. The mechanistic experiments (Sections 5.1–5.3) cover only 1–2 models and one domain, making the mechanistic conclusions considerably weaker than the main effect characterization. The flipped-framing experiment's per-model differences are explicitly underpowered (n = 21 items), and the authors appropriately label these as "exploratory." The temperature sensitivity check covers a single model. The Gemini data is partial due to API quotas. The author-coded item labels lack inter-rater validation, though the empirical-entropy stratification largely mitigates this concern.

3. Potential Impact

Practical impact is likely the strongest dimension. LLM-as-judge is now standard practice across industry (code review, content moderation, exam grading, RLHF reward modeling). The finding that batched evaluation within a single conversation introduces systematic bias — particularly on the most uncertain (and thus most consequential) items — has immediate operational implications. The actionable recommendation ("fresh context per item") is simple and directly implementable.

Research impact is moderate. The paper contributes to the growing literature on LLM cognitive biases and LLM-as-judge reliability. The negativity asymmetry finding connects to existing work on negative bias in LLMs (Braun, Cheung et al., Yu et al.), and the saturation finding connects to drift equilibria (Dongre et al.). However, the paper is primarily empirical characterization rather than mechanistic insight or novel mitigation — it documents the problem thoroughly but offers only obvious fixes (fresh contexts, balanced batching).

The connection to concurrent work on conversational inertia (Simhi et al., Wan et al.) positions AMEL as one instantiation of a broader phenomenon, which somewhat reduces its novelty but increases its relevance as a convergent finding.

4. Timeliness & Relevance

Highly timely. LLM-as-judge is experiencing rapid adoption, and the cost pressures that motivate batched evaluation (multiple items per conversation) are real. The paper addresses a practical vulnerability that many production systems likely exhibit. The breadth of model coverage, including frontier models from 2026 (GPT-5.2, Claude Opus 4.6), ensures the findings are current.

5. Strengths & Limitations

Key Strengths:

Exceptional breadth: 11 models, 4 providers, 3 domains, systematic variation of context length, polarity, and item difficulty.

Clean experimental design with well-defined bias score metric.

The non-accumulation finding (saturation at 5 turns) is surprising and practically important — it means even minimal biased history carries the full effect.

The empirical-entropy stratification is a methodological contribution — using the model's own baseline behavior rather than subjective item labels.

Excellent transparency: data quality incidents, parser asymmetries, and alternative deduplication strategies all openly reported.

Complete reproducibility package (code, data, seeds).

Key Limitations:

Binary yes/no judgments only — the most common real-world LLM evaluations use Likert scales or comparative rankings.

The mechanistic experiments are underpowered and cover too few models to support strong causal claims.

No novel mitigation beyond obvious recommendations (fresh contexts, balanced batching).

The neutral condition finding (50/50 history also shifts toward "no") is interesting but unresolved — the authors acknowledge they cannot distinguish "any history" from "any evaluative history" without a non-evaluative control.

Single-author item categorization without inter-rater validation.

The paper is quite long and could benefit from tighter presentation; many results are incremental refinements of the same basic finding.

Additional Observations

The finding that Qwen3 4B shows *contrarian* bias (d = +0.19) is intriguing and suggests instruction-tuning choices can overcorrect for conversational inertia, potentially introducing a different kind of unreliability. The Gemini Pro > Flash reversal in bias magnitude, potentially linked to thinking tokens, deserves follow-up with adequate sample sizes.

The congruent/incongruent analysis (Section 4.8) adds nuance but the qualitative examples (Table 5) show alarming behavior: GPT-4.1 Nano justifying clearly inappropriate comments as "appropriate" after negative context history, suggesting AMEL can produce not just statistical shifts but qualitatively wrong evaluations.

Overall, this is a solid empirical contribution to an important practical problem. Its impact will be strongest in informing best practices for LLM evaluation pipelines, though the lack of novel mitigation methods and limited mechanistic depth constrain its scientific contribution.

Rating:6.5/ 10

Significance 7Rigor 7.5Novelty 5.5Clarity 8

Generated May 22, 2026

Comparison History (19)

vs. Implicit Safety Alignment from Crowd Preferences

gemini-3.15/22/2026

Paper 2 addresses a ubiquitous and critical issue in modern AI: the reliability of LLMs as automated evaluators. Its massive empirical analysis across major models exposes a fundamental in-context bias that immediately impacts how AI evaluation pipelines are designed, giving it broader and more immediate real-world relevance than Paper 1's safe RL framework.

vs. Beyond Acoustic Emotion Recognition: Multimodal Pathos Analysis in Political Speech Using LLM-Based and Acoustic Emotion Models

claude-opus-4.65/22/2026

Paper 2 (AMEL) has higher scientific impact due to its broader applicability, methodological rigor, and practical relevance. It identifies a systematic bias affecting all LLM-as-judge pipelines across 11 models and 4 providers with ~76K API calls, offering actionable mitigation strategies. This affects numerous downstream applications (code review, content moderation, evaluation benchmarks). Paper 1 is a narrow single-speech case study with limited generalizability, comparing specific tools for political speech analysis. AMEL's findings are more fundamental, timely, and widely relevant to the rapidly growing LLM evaluation ecosystem.

vs. AtelierEval: Agentic Evaluation of Humans & LLMs as Text-to-Image Prompters

claude-opus-4.65/22/2026

Paper 2 (AMEL) addresses a fundamental bias in LLM-as-judge evaluation pipelines that affects a vast range of applications (code review, content moderation, output scoring). Its findings—that conversational history biases subsequent judgments, with negativity asymmetry and entropy-dependent effects—have immediate, broad implications for anyone using LLMs as evaluators. The rigorous experimental design (75,898 API calls, 11 models, 4 providers) and actionable mitigation advice (fresh context per item) give it high practical impact. Paper 1 introduces a valuable but more niche benchmark for T2I prompting proficiency, with narrower applicability.

vs. Search-E1: Self-Distillation Drives Self-Evolution in Search-Augmented Reasoning

gemini-3.15/22/2026

While Paper 1 provides valuable empirical insights into LLM-as-a-judge biases, Paper 2 tackles a critical bottleneck in the highly active field of reasoning agents. By demonstrating that complex training pipelines can be replaced with a simple, scalable self-evolution method (GRPO + self-distillation) to achieve state-of-the-art results, Paper 2 has immense potential to reshape how open-source reasoning models are trained.

vs. Gated DeltaNet-2: Decoupling Erase and Write in Linear Attention

claude-opus-4.65/22/2026

Gated DeltaNet-2 introduces a fundamental architectural innovation in linear attention mechanisms—decoupling erase and write gates—with strong empirical results across language modeling, reasoning, and retrieval benchmarks. It generalizes multiple existing architectures (Gated DeltaNet, KDA), provides theoretical grounding (fast-weight update view, chunkwise algorithm), and addresses a core challenge in efficient sequence modeling. Its impact spans architecture design, efficient inference, and long-context modeling. Paper 1 identifies an interesting LLM evaluation bias (AMEL) but is more narrowly scoped to evaluation practices with a straightforward mitigation (fresh context per item), limiting its broader scientific impact.

vs. KAPPS: A knowledge-based CPPS Architecture for the Circular Factory

claude-opus-4.65/22/2026

Paper 1 addresses a fundamental and widely relevant bias in LLM-as-judge pipelines, backed by a large-scale empirical study (75,898 API calls, 11 models, 4 providers). Given the explosive adoption of LLMs as automated evaluators across NLP, software engineering, and content moderation, this finding has immediate broad impact. The rigorous quantification of the accumulated message effect, negativity asymmetry, and practical mitigation advice makes it highly actionable. Paper 2 proposes a domain-specific architecture for circular manufacturing—valuable but narrower in scope and audience, with evaluation limited to two use cases.

vs. Embedding by Elicitation: Dynamic Representations for Bayesian Optimization of System Prompts

gpt-5.25/22/2026

Paper 2 likely has higher scientific impact due to broad, immediate relevance: it identifies and quantifies a systematic bias in LLM-as-judge workflows across many models/providers with a large-scale experimental design (75,898 calls), clear effect sizes, and actionable mitigation guidance. This impacts evaluation, moderation, benchmarking, and agent pipelines across fields. Paper 1 is novel and useful for prompt optimization, but is narrower in scope (aggregate-only Bayesian optimization for system prompts) and its evidence is more task/budget-specific, making generalization and downstream impact less certain.

vs. Evaluation of Pipelines for Data Integration into Knowledge Graphs

gemini-3.15/22/2026

Paper 1 addresses a critical, highly timely issue (biases in LLM-as-a-judge systems) with a rigorous, large-scale empirical study. Its findings have immediate and broad implications for AI evaluation pipelines across numerous domains. While Paper 2 presents a useful benchmark for knowledge graph integration, its scope and potential breadth of impact are much more niche compared to the widespread, cross-disciplinary reliance on LLM evaluators.

vs. Beyond the Org Chart: AI and the Transformation of Invisible Work

claude-opus-4.65/22/2026

Paper 2 (AMEL) has higher potential scientific impact due to its rigorous empirical methodology (75,898 API calls across 11 models), quantifiable and reproducible findings about a systematic bias in LLM-as-judge pipelines, and broad applicability across any field using LLMs for automated evaluation. It identifies a concrete, measurable problem with practical mitigations. Given the explosive adoption of LLMs as evaluators in research and industry, this bias characterization is both timely and consequential. Paper 1, while valuable, is a qualitative interview study with 24 participants offering primarily descriptive insights about AI's impact on workplace culture.

vs. Towards a compositional semantics for quantitative confidence assessment in assurance arguments

gpt-5.25/22/2026

Paper 2 likely has higher scientific impact due to timeliness and broad relevance: it identifies a systematic bias in LLM-as-judge settings that directly affects widely deployed evaluation, moderation, and benchmarking pipelines. It demonstrates the effect across many models/providers with large-scale experiments, quantifies key moderators (uncertainty, negativity asymmetry, context length), and proposes actionable mitigations. The findings generalize across fields using LLM evaluation (NLP, software engineering, HCI, safety). Paper 1 is novel within assurance-case confidence semantics but is narrower in audience and application domain.

vs. GraphMind: From Operational Traces to Self-Evolving Workflow Automation

gemini-3.15/22/2026

Paper 1 identifies and quantifies a fundamental bias in LLMs used as evaluators, a practice now ubiquitous across AI research. Its rigorous methodology, large-scale evaluation across multiple models, and broad applicability to any field using 'LLM-as-a-judge' give it a wider scientific impact compared to Paper 2, which presents a highly effective but more specialized applied system for enterprise workflow automation.

vs. The Shape of Testimony: A Scalable Framework for Oral History Archive Comparison

claude-opus-4.65/22/2026

Paper 2 (AMEL) addresses a fundamental and broadly relevant issue affecting the rapidly growing use of LLMs as automated evaluators across many domains. Its rigorous experimental design (75,898 API calls, 11 models, 4 providers) identifies a systematic bias with clear practical implications for any LLM evaluation pipeline. The findings—negativity asymmetry, entropy-dependent effects, and actionable mitigations—are immediately applicable across AI safety, content moderation, code review, and benchmarking. Paper 1, while methodologically interesting, addresses a narrower domain (Holocaust oral history archives) with less generalizable impact.

vs. The Shape of Testimony: A Scalable Framework for Oral History Archive Comparison

gpt-5.25/22/2026

Paper 2 has higher potential impact due to its broad, timely relevance to the rapidly expanding use of LLMs as evaluators in industry and research. It introduces a general, actionable bias phenomenon (AMEL) with clear operational mitigations, and tests it at large scale across many models/providers with strong statistical evidence and mechanistic follow-ups, indicating solid rigor and replicability. The findings affect evaluation methodology, safety/moderation, benchmarking, and deployment practices across fields. Paper 1 is innovative and valuable for digital humanities, but its direct applicability and cross-domain spillover are narrower.

vs. Mind the Sim-to-Real Gap & Think Like a Scientist

gpt-5.25/22/2026

Paper 2 is more novel and broadly impactful: it contributes theoretical decomposition results and a principled algorithm (Fisher-SEP) for combining simulators with real-world experimentation under confounding/drift, a core sim-to-real problem spanning robotics, operations, healthcare, and RL. Its results (identifiability limits, passive-learning reachability gap, and variance-minimizing experimental design) are likely to generalize and influence methodology and practice. Paper 1 is timely and rigorously measured with clear practical implications for LLM evaluation pipelines, but its scope is narrower and mainly diagnostic/mitigative within LLM-based judging.

vs. Mind the Sim-to-Real Gap & Think Like a Scientist

claude-opus-4.65/22/2026

Paper 1 addresses a fundamental problem in sequential decision-making—bridging sim-to-real gaps—with novel theoretical contributions (extended simulation lemma, value gap decomposition, reachability bounds) and a principled algorithm (Fisher-SEP). It has broad applicability across operations research, reinforcement learning, healthcare, and supply chains. Paper 2 identifies an important but relatively narrow bias (AMEL) in LLM evaluators, offering useful empirical findings and practical mitigations (fresh context per item). While timely, its contributions are primarily observational and specific to LLM evaluation pipelines. Paper 1's theoretical depth, methodological rigor, and cross-domain applicability give it greater long-term scientific impact.

vs. PALS: Power-Aware LLM Serving for Mixture-of-Experts Models

claude-opus-4.65/22/2026

AMEL identifies a fundamental and previously undercharacterized bias in LLM-as-judge paradigms, which is now a widespread practice across NLP research, content moderation, and code review. The finding that conversational history systematically shifts LLM judgments has broad implications for any pipeline using LLMs as evaluators, affecting reproducibility across many fields. The large-scale empirical rigor (75K+ API calls, 11 models, 4 providers) and actionable mitigation advice give it high practical relevance. PALS addresses an important but more niche systems optimization problem (energy-efficient LLM serving) with incremental engineering contributions over existing power-capping and scheduling work.

vs. IdleSpec: Exploiting Idle Time via Speculative Planning for LLM Agents

gemini-3.15/22/2026

Paper 2 identifies a fundamental bias in LLMs-as-judges, a widely used paradigm across AI research and industry. Its extensive, rigorous evaluation across 11 models provides critical insights into context-induced biases and negativity asymmetry. While Paper 1 offers a valuable system optimization for agent latency, Paper 2's findings have broader, immediate implications for the reliability of AI evaluation pipelines and general LLM behavior.

vs. Adapting the Interface, Not the Model: Runtime Harness Adaptation for Deterministic LLM Agents

claude-opus-4.65/22/2026

Paper 1 introduces a novel paradigm—adapting the runtime harness rather than model weights—demonstrating large improvements (88.5% average relative) across 18 models and 7 environments with strong transferability. This reframes agent improvement methodology and has broad practical applicability. Paper 2 identifies an important but relatively narrower bias (AMEL) in LLM-as-judge settings with a modest effect size (d=-0.17) and a straightforward mitigation (fresh context per item). While rigorous and useful, Paper 1's conceptual contribution and demonstrated breadth of impact position it for higher scientific influence.

vs. ScenePilot: Controllable Boundary-Driven Critical Scenario Generation for Autonomous Driving

gpt-5.25/22/2026

Paper 2 likely has higher scientific impact: it identifies and quantifies a broadly relevant, underappreciated bias in LLM-based evaluation pipelines across many models/providers with large-scale, controlled experiments and clear practical mitigations. Its implications span ML evaluation, alignment, HCI, and any domain using LLM judges, making the breadth and timeliness very high. Paper 1 is a solid, application-specific advance for AV stress testing with good engineering novelty, but its impact is narrower to autonomous driving simulation and depends more on benchmark/ecosystem adoption.