Self-Correction as Feedback Control: Error Dynamics, Stability Thresholds, and Prompt Interventions in LLMs

Aofan Liu, Jingxiang Meng

Apr 24, 2026

arXiv:2604.22273v2

cs.AI(primary)

#118 of 2292 · Artificial Intelligence

Tournament Score: 1536 ± 34 (scale 1050–1800)

Win Rate: 76%

Rating: 5.8 / 10

Significance 6 · Rigor 5.5 · Novelty 5 · Clarity 7.5

Abstract

Iterative self-correction is increasingly deployed in agentic LLM systems, yet whether repeated refinement improves or degrades performance remains inconsistent across models. We recast self-correction as a closed-loop feedback-control problem in which the same model is both controller and plant, and analyze its error dynamics via a two-state Markov model over {Correct, Incorrect}, parameterized by the Error Introduction Rate (EIR) and Error Correction Rate (ECR). The model yields a directly measurable stability threshold -- iterate only when ECR/EIR > Acc/(1-Acc) -- in which EIR acts as a stability margin and prompting becomes lightweight controller design. Empirically, across 7 models and 3 datasets (GSM8K, MATH, StrategyQA), a sharp near-zero EIR boundary (< 0.5%) cleanly separates beneficial from harmful self-correction: only o3-mini (+3.4 pp), Claude Opus 4.6 (+0.6 pp), and o4-mini (+/-0 pp) stay non-degrading, while GPT-5 and four others lose accuracy. A verify-first prompt intervention then provides causal evidence: it drives GPT-4o-mini's EIR from 2% to 0% and converts a -6.2 pp degradation into +0.2 pp (paired McNemar, p<10^{-4}), with negligible change on already-sub-threshold models -- exactly as the diagnostic predicts. A complementary analysis of adaptive self-consistency (ASC) shows it halts harmful refinement at a 3.8 pp confidence-elicitation cost, exposing a two-tier capability structure: prompt-level EIR suppression prevents degradation, whereas ECR enhancement -- plausibly training-level -- is required for genuine gains. Self-correction should thus be treated not as a default behavior but as a control decision governed by measurable error dynamics.

AI Impact Assessments

(1 model)

Scientific Impact Assessment

1. Core Contribution

This paper recasts LLM iterative self-correction as a closed-loop feedback control problem, modeling correctness evolution as a two-state Markov chain parameterized by Error Introduction Rate (EIR) and Error Correction Rate (ECR). The central insight is an operationalizable stability threshold: self-correction is beneficial only when ECR/EIR > Acc/(1−Acc). The paper identifies near-zero EIR (≲0.5%) as the sharp empirical boundary separating beneficial from harmful self-correction, validates this causally with a "verify-first" prompt intervention, and articulates a two-tier capability model distinguishing prompt-level EIR suppression from training-level ECR enhancement.

The core novelty is not the Markov formalism itself—two-state Markov chains are elementary—but rather its operationalization into a practical diagnostic. The EIR/ECR decomposition provides a more actionable lens than simply tracking accuracy curves, and the verify-first intervention demonstrates that this framing has engineering utility.
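As a rough illustration of the threshold described above, the sketch below iterates a two-state chain over {Correct, Incorrect}. It assumes EIR is the per-round probability that a correct answer becomes incorrect and ECR the probability that an incorrect answer becomes correct, both held constant across rounds. The function names and numeric rates are illustrative and are not taken from the paper.

```python
def step_accuracy(acc: float, eir: float, ecr: float) -> float:
    """Accuracy after one refinement round of the two-state chain."""
    return acc * (1.0 - eir) + (1.0 - acc) * ecr


def should_iterate(acc: float, eir: float, ecr: float) -> bool:
    """Threshold ECR/EIR > Acc/(1-Acc), cross-multiplied to avoid division by zero."""
    return ecr * (1.0 - acc) > eir * acc


def steady_state_accuracy(eir: float, ecr: float) -> float:
    """Stationary accuracy ECR / (ECR + EIR); requires ECR + EIR > 0."""
    return ecr / (ecr + eir)


def simulate(acc0: float, eir: float, ecr: float, rounds: int) -> list:
    """Accuracy trajectory over a fixed number of refinement rounds."""
    trajectory = [acc0]
    for _ in range(rounds):
        trajectory.append(step_accuracy(trajectory[-1], eir, ecr))
    return trajectory


if __name__ == "__main__":
    # Hypothetical rates, chosen only to show both regimes.
    for eir in (0.02, 0.001):
        print(eir, should_iterate(0.90, eir, 0.05), simulate(0.90, eir, 0.05, 4))
```

Under these assumptions, a model starting at 90% accuracy with ECR = 5% degrades when EIR = 2% and improves when EIR = 0.1%, which is the qualitative pattern the near-zero EIR boundary describes.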

2. Methodological Rigor

Strengths: The experimental design is reasonably thorough: 7 models across 4 capability tiers, 3 datasets, and 4 refinement iterations. The verify-first ablation is well-designed as a causal probe—it targets EIR specifically and produces the predicted differential effect (large impact on high-EIR models, negligible on low-EIR models), with appropriate statistical testing (McNemar's test, paired bootstrap CIs).
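For readers unfamiliar with the paired test mentioned above, here is a minimal sketch of an exact (binomial) McNemar test on per-problem correctness under two prompts. This is the textbook exact form and may differ from the variant the authors used. The helper names and the toy outcome lists are hypothetical.

```python
from math import comb


def discordant_counts(baseline, intervention):
    """Count problems where exactly one of the two conditions is correct."""
    b = sum(1 for x, y in zip(baseline, intervention) if x and not y)
    c = sum(1 for x, y in zip(baseline, intervention) if not x and y)
    return b, c


def mcnemar_exact(b: int, c: int) -> float:
    """Two-sided exact McNemar p-value from the discordant pair counts."""
    n = b + c
    if n == 0:
        return 1.0  # no discordant pairs, no evidence of a difference
    k = min(b, c)
    # Under H0 each discordant pair falls either way with probability 1/2.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2.0 * tail)


if __name__ == "__main__":
    # Hypothetical outcomes: True means the final answer was correct.
    baseline = [True] * 5 + [False] * 10
    intervention = [True] * 15
    b, c = discordant_counts(baseline, intervention)
    print(b, c, mcnemar_exact(b, c))
```

Only the discordant pairs enter the statistic, which is why a paired test can detect a shift even when aggregate accuracies look close.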

Concerns: Several methodological issues weaken the claims:

Sample sizes are modest: 500 GSM8K problems, 400 MATH, 200 StrategyQA. When EIR is near zero, the actual number of error-introduction events is extremely small (e.g., 0–2 events out of ~460 correct answers), making EIR estimates unreliable. The paper claims EIR = 0.0% for o3-mini across all iterations, but with 500 problems this could mask a true EIR of ~0.5% that simply didn't manifest.

The Markov model assumes stationarity, yet the authors observe non-stationarity (GPT-4o-mini's EIR rising from 1.3% to 3.8%). This undermines the theoretical framework's applicability precisely where it matters most—for degrading models.

Only 4 iterations are tested. The convergence and steady-state claims remain largely theoretical extrapolations rather than empirical observations.

The "theorems" are trivial—they are standard properties of two-state Markov chains (stationary distribution, geometric convergence). Calling these "Theorem 1–3" somewhat overclaims their contribution.

Cross-dataset analysis is incomplete: detailed EIR/ECR dynamics are reported primarily for GSM8K; MATH and StrategyQA results are mentioned briefly. The o3-mini StrategyQA anomaly (47% accuracy due to extraction issues) raises concerns about evaluation robustness.
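The sample-size concern above can be made concrete with a one-sided exact (Clopper-Pearson) upper bound on EIR. The sketch below is an illustration and not an analysis from the paper. It assumes error-introduction events are independent across problems, and the figure of roughly 460 initially correct answers is taken from the example in the concern above.

```python
from math import comb


def binom_cdf(k: int, n: int, p: float) -> float:
    """P(X <= k) for X ~ Binomial(n, p); intended for small k."""
    return sum(comb(n, i) * p ** i * (1.0 - p) ** (n - i) for i in range(k + 1))


def eir_upper_bound(events: int, n_correct: int, alpha: float = 0.05) -> float:
    """One-sided (1 - alpha) Clopper-Pearson upper bound on the true EIR."""
    if events >= n_correct:
        return 1.0
    lo, hi = 0.0, 1.0
    for _ in range(100):  # bisection; the CDF is decreasing in p
        mid = (lo + hi) / 2.0
        if binom_cdf(events, n_correct, mid) > alpha:
            lo = mid
        else:
            hi = mid
    return hi


if __name__ == "__main__":
    for events in (0, 1, 2):
        print(events, f"{eir_upper_bound(events, 460):.4%}")
```

With zero observed events among 460 initially correct answers, the 95% upper bound is roughly 0.65%, so an observed EIR of 0.0% does not by itself rule out a true EIR near the 0.5% boundary.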

3. Potential Impact

The paper addresses a genuinely practical problem: deciding when to deploy self-correction loops in production LLM systems. The EIR-based diagnostic is simple to compute and could inform deployment decisions in agentic systems. The verify-first prompt is a low-cost intervention that practitioners could immediately adopt.
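A minimal sketch of how such a diagnostic might be computed on a calibration set is shown below. It assumes per-problem correctness is recorded before and after a single refinement round. The function names and the toy data are hypothetical and do not come from the paper.

```python
def estimate_rates(before, after):
    """Estimate (EIR, ECR) from paired correctness flags for one refinement round.

    EIR = share of initially correct answers that became incorrect.
    ECR = share of initially incorrect answers that became correct.
    A rate is None when its denominator is empty.
    """
    n_correct = sum(1 for b in before if b)
    n_incorrect = len(before) - n_correct
    introduced = sum(1 for b, a in zip(before, after) if b and not a)
    corrected = sum(1 for b, a in zip(before, after) if not b and a)
    eir = introduced / n_correct if n_correct else None
    ecr = corrected / n_incorrect if n_incorrect else None
    return eir, ecr


def refinement_helped(before, after) -> bool:
    """True when the round fixed more answers than it broke.

    On the measured sample this matches ECR/EIR > Acc/(1-Acc),
    because ECR * (1 - Acc) = corrected / N and EIR * Acc = introduced / N.
    """
    introduced = sum(1 for b, a in zip(before, after) if b and not a)
    corrected = sum(1 for b, a in zip(before, after) if not b and a)
    return corrected > introduced


if __name__ == "__main__":
    # Hypothetical calibration outcomes: True means the answer was correct.
    before = [True] * 8 + [False] * 2
    after = [True] * 7 + [False] + [True] * 2
    print(estimate_rates(before, after), refinement_helped(before, after))
```

Comparing raw counts rather than ratios avoids division by zero when no errors are introduced, which is the regime the review identifies as the most interesting one.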

The two-tier capability model (EIR suppression vs. ECR enhancement) provides useful conceptual vocabulary for the field, cleanly distinguishing what prompt engineering can achieve from what requires training-level changes. This framing could influence how researchers think about self-correction capabilities.

However, the impact may be bounded by several factors: (1) the rapidly evolving model landscape means specific EIR thresholds may become outdated quickly; (2) the binary {Correct, Incorrect} state space limits applicability to open-ended generation; (3) the practical recommendation ("measure EIR on a calibration set before deploying self-correction") is somewhat obvious once stated; (4) the connection to control theory, while metaphorically appealing, doesn't leverage actual control-theoretic tools beyond the basic Markov formulation.

4. Timeliness & Relevance

The paper is highly timely. Agentic LLM systems with self-correction loops are proliferating rapidly, and the question of when self-correction helps versus hurts is practically urgent. The inclusion of very recent models (GPT-5, Claude Opus 4.6, o4-mini) enhances relevance. The finding that GPT-5 degrades under self-correction despite frontier capability is particularly noteworthy and counter-intuitive.

The paper also connects to the broader test-time compute scaling literature, where understanding when to allocate compute to refinement versus other strategies (like self-consistency) is an active research question.

5. Strengths & Limitations

Key Strengths:

Clean, actionable framing that transforms an empirical observation (self-correction sometimes hurts) into a measurable diagnostic

The verify-first ablation provides compelling causal evidence for EIR as the operative variable

Strong contrast cases (GPT-5 vs. Opus 4.6) effectively illustrate the EIR threshold

The two-tier capability model is a useful conceptual contribution

Practical recommendations are concrete and immediately applicable

Key Limitations:

Theoretical contribution is overstated—the Markov model is elementary, and the real contribution is empirical/operational

Small effective sample sizes for near-zero EIR estimates undermine precision

The "control theory" framing is largely metaphorical; no actual controller synthesis, stability margins in the control-theoretic sense, or robustness analysis is performed

ASC appears underdeveloped: it hurts accuracy by 3.8 pp due to confidence elicitation, and the paper essentially acknowledges that it is not a practical solution

Limited to tasks with binary correctness; the framework doesn't extend naturally to generation quality

Single evaluation run per configuration (no variance estimates across random seeds or problem subsets)

Additional Observations

The paper's comparison showing Self-Consistency outperforming iterative refinement at matched compute (93.4% vs. 86.6%) is a useful practical finding, though not novel. The "accuracy-correction paradox" terminology effectively captures an important phenomenon, though the underlying pool-size asymmetry argument has been informally noted in prior work.

The paper would benefit from a more honest positioning: it is primarily an empirical contribution with a simple but useful mathematical framework, rather than a deep theoretical contribution to either control theory or Markov chain analysis.

Rating: 5.8 / 10

Significance 6 · Rigor 5.5 · Novelty 5 · Clarity 7.5

Generated May 5, 2026

Comparison History (29)

vs. Beyond Meta-Reasoning: Metacognitive Consolidation for Self-Improving LLM Reasoning

gemini-3 · 5/5/2026

Paper 2 provides a fundamental theoretical framework (control theory) to explain and predict LLM self-correction, addressing a major inconsistency in the field. By establishing a mathematically grounded and empirically validated stability threshold, it offers a rigorous, actionable metric for system design. This bridges theoretical analysis with practical application, likely driving broader and more foundational impact than the architectural improvements proposed in Paper 1.

vs. Beyond Meta-Reasoning: Metacognitive Consolidation for Self-Improving LLM Reasoning

gemini-3 · 5/5/2026

Paper 1 offers a rigorous, theoretically grounded framework by mapping LLM self-correction to a feedback control problem. By establishing measurable stability thresholds (EIR/ECR), it resolves inconsistencies in current literature regarding when self-correction actually works. Its combination of mathematical modeling, causal empirical testing, and actionable diagnostics gives it exceptional methodological rigor and broad applicability, likely making a larger fundamental impact than the architectural approach of Paper 2.

vs. LACE: Lattice Attention for Cross-thread Exploration

gemini-3 · 5/5/2026

Paper 1 provides a novel theoretical framework by applying closed-loop feedback control to LLM self-correction. By establishing measurable stability thresholds (EIR/ECR), it offers a principled explanation for inconsistent self-correction performance across models. This fundamental theoretical insight, combined with strong empirical validation, has broad implications for the design and evaluation of autonomous AI agents, likely yielding a higher and more enduring scientific impact than the specific architectural intervention proposed in Paper 2.

vs. Pushing the Limits of On-Device Streaming ASR: A Compact, High-Accuracy English Model for Low-Latency Inference

claude-opus-4.6 · 5/5/2026

Paper 2 introduces a novel theoretical framework (feedback control theory applied to LLM self-correction) with broad applicability across the rapidly growing field of agentic AI systems. It provides a measurable stability threshold, actionable diagnostic criteria, and causal evidence via prompt interventions—offering both theoretical depth and practical guidance. Paper 1, while technically solid, is primarily an engineering optimization study for a specific ASR deployment scenario with incremental contributions. Paper 2's cross-cutting insights into when self-correction helps vs. harms have wider implications for LLM system design, making it more likely to influence future research directions.

vs. Improving Human Performance with Value-Aware Interventions: A Case Study in Chess

gpt-5.2 · 5/5/2026

Paper 2 likely has higher impact due to timeliness and breadth: it targets widely deployed LLM agent behaviors and provides a simple, measurable stability criterion (ECR/EIR threshold) with strong empirical validation across multiple models/datasets plus causal prompt intervention evidence. The feedback-control framing is novel and potentially general across iterative reasoning, tool use, and agent loops, enabling practical diagnostics and design guidance. Paper 1 is rigorous and valuable but more domain-specific (chess, human-in-the-loop overrides) with narrower immediate cross-field relevance.

vs. When Agents Evolve, Institutions Follow

gpt-5.2 · 5/5/2026

Paper 1 offers a clearer methodological contribution: a formal feedback-control framing with a simple, testable Markov error-dynamics model that yields an explicit stability threshold and actionable diagnostics (EIR/ECR) validated across models/datasets, plus a causal prompt intervention with strong statistics. This combination of theory + measurable quantities + prescriptive guidance is likely to generalize widely across LLM agent design and evaluation. Paper 2 is timely and potentially broad, but appears more exploratory/empirical with a metaphor-to-architecture mapping that may be harder to formalize and reproduce as a general scientific principle.

vs. Claw-Eval: Toward Trustworthy Evaluation of Autonomous Agents

gpt-5.2 · 5/5/2026

Paper 1 likely has higher impact due to a substantial, reusable benchmark artifact (300 tasks, trajectory-aware evidence, fine-grained rubrics) that can become community infrastructure for evaluating agent reliability, safety, and robustness across modalities—high real-world relevance and broad applicability. Its methodological contribution (multi-channel auditing, multi-trial metrics) addresses pressing evaluation failures in deployed agents. Paper 2 is novel and timely with a clear theoretical framing and actionable prompting insight, but its scope is narrower (self-correction on a few NLP benchmarks) and may have less cross-field and tooling impact than a widely adoptable evaluation suite.

vs. Claw-Eval: Toward Trustworthy Evaluation of Autonomous Agents

gpt-5.2 · 5/5/2026

Paper 1 likely has higher impact: it delivers a substantial, reusable evaluation infrastructure (300 tasks, trajectory-aware evidence, 2,159 rubric items) directly addressing urgent gaps in agent benchmarking (safety/robustness, multimodality, interaction paradigms). This enables broad, cross-model and cross-lab comparability and can influence deployment standards and research directions across AI safety, HCI, and agent systems. Paper 2 is novel and rigorous with a useful control-theoretic framing, but its scope is narrower (self-correction on select datasets) and may translate more as a diagnostic/prompting guideline than a field-wide benchmark resource.

vs. Belief-Guided Inference Control for Large Language Model Services via Verifiable Observations

gemini-3 · 5/5/2026

Paper 2 addresses a critical open problem in LLMs (inconsistent self-correction) by elegantly framing it as a feedback control problem. By establishing measurable stability thresholds (EIR/ECR), it provides both a strong theoretical foundation and actionable prompt interventions. This novel cross-disciplinary approach will likely broadly influence how agentic loops are designed and evaluated, offering deeper scientific impact than Paper 1's practical but more conventional cost-routing POMDP framework.

vs. AgentFloor: How Far Up the tool use Ladder Can Small Open-Weight Models Go?

gpt-5.2 · 5/5/2026

Paper 2 likely has higher impact due to broad, immediate applicability: it introduces a benchmark (AgentFloor) and large-scale evaluation corpus that can standardize comparisons, drive model routing decisions, and influence both research and production agent design. Releasing tasks, harness, sweeps, and runs increases reuse and citation potential across academia and industry. While Paper 1 offers a novel control-theoretic framing and useful diagnostic for self-correction, its scope is narrower (self-correction dynamics) and may be less broadly adopted than a widely usable benchmark for tool-using agents.

vs. MARL-GPT: Foundation Model for Multi-Agent Reinforcement Learning

gpt-5.2 · 5/5/2026

Paper 2 is more likely to have higher scientific impact: it introduces a novel, general feedback-control framing of LLM self-correction with a simple, testable stability criterion, validated across multiple models/datasets with causal prompt interventions and statistical testing. Its applications are immediate for agentic LLM pipelines, evaluation, and safety/reliability, and the insights generalize across tasks and model families. Paper 1 is ambitious and practically relevant, but resembles scaling/aggregation of existing offline RL + transformer ideas; impact depends heavily on reproducibility, compute access, and whether it materially advances MARL beyond dataset scale.

vs. Hive: A Multi-Agent Infrastructure for Algorithm- and Task-Level Scaling

claude-opus-4.6 · 5/5/2026

Paper 2 provides a novel theoretical framework recasting LLM self-correction as feedback control, yielding a measurable stability threshold (ECR/EIR > Acc/(1-Acc)) that offers actionable diagnostic criteria. Its interdisciplinary contribution bridging control theory and LLM behavior, empirical validation across multiple models/datasets, and causal prompt intervention evidence make it broadly impactful. It addresses a fundamental question about when self-correction helps vs. harms, relevant to all agentic LLM deployments. Paper 1, while practically useful for multi-agent infrastructure optimization, is more incremental engineering with narrower speedup improvements.

vs. A Collective Variational Principle Unifying Bayesian Inference, Game Theory, and Thermodynamics

gpt-5.2 · 5/5/2026

Paper 1 has higher likely impact due to strong methodological rigor and immediate applicability to a fast-moving, high-stakes domain (LLM agents). It introduces a concrete, testable control-theoretic model with measurable quantities (EIR/ECR), derives an actionable stability threshold, and provides multi-model, multi-dataset evidence plus causal intervention with statistical testing. Its guidance directly affects how practitioners design self-correction/verification loops, with broad relevance to reliability, safety, and evaluation. Paper 2 is ambitious and cross-disciplinary, but such unification frameworks often face harder empirical falsification and slower uptake unless the validations are exceptionally definitive.

vs. SCALE-LoRA: Auditing Post-Retrieval LoRA Composition with Residual Merging and View Reliability

gemini-3 · 5/5/2026

Paper 2 offers a fundamental paradigm shift by formalizing LLM self-correction as a feedback-control problem. By introducing quantifiable metrics (EIR, ECR) and a mathematical stability threshold, it resolves widespread empirical inconsistencies in agentic LLM performance. This theoretical grounding, validated across top-tier models, provides actionable insights for prompt engineering and model training. While Paper 1 presents a solid technical solution for LoRA composition, Paper 2 addresses a ubiquitous, highly debated problem in reasoning and autonomous agents, guaranteeing wider interdisciplinary and practical impact across the rapidly growing field of agentic AI.

vs. Hidden Biases in Conditioning Autoregressive Models

claude-opus-4.6 · 5/5/2026

Paper 1 establishes fundamental computational complexity results (NP-hardness, #P-hardness) for exact conditioning in autoregressive models, providing theoretical foundations that will remain relevant as long as these models are used. These hardness results formalize widely-held intuitions and have broad implications across NLP, music generation, and any constrained generation task. Paper 2 offers a useful practical framework for self-correction but is more incremental—recasting an empirical phenomenon via a simple Markov model with limited novelty. Paper 1's theoretical contributions have broader, more lasting impact across multiple fields.

vs. Accelerating battery research with an AI interface between FINALES and Kadi4Mat

gemini-3 · 5/5/2026

Paper 2 addresses a critical, timely issue in AI—LLM self-correction degradation—using a highly novel and rigorous control-theory framework. Its theoretical and empirical contributions have broad applicability across the rapidly expanding field of agentic AI. In contrast, Paper 1 offers a valuable but more niche methodological contribution tied to specific data management ecosystems within battery materials science, limiting its cross-disciplinary impact.

vs. When Roles Fail: Epistemic Constraints on Advocate Role Fidelity in LLM-Based Political Statement Analysis

gpt-5.2 · 5/5/2026

Paper 1 is likely higher impact: it introduces a principled control-theoretic framing of LLM self-correction with a measurable stability threshold (ECR/EIR) that generalizes across tasks/models and yields actionable interventions (verify-first) with strong causal evidence. The methodology links theory to empirical diagnostics and provides broadly applicable guidance for agentic LLM design, affecting reliability, evaluation, and deployment across many domains. Paper 2 is timely and useful for political multi-agent pipelines, but its scope is narrower and more application-specific, with less cross-field generality than a feedback-control stability framework.

vs. Process Reward Agents for Steering Knowledge-Intensive Reasoning

gemini-3 · 5/5/2026

Paper 2 offers a highly novel theoretical framework by casting LLM self-correction as a feedback-control problem, addressing a major inconsistency in agentic systems. Its derivation of measurable stability thresholds provides deep foundational insights into error dynamics, applicable across all LLMs. While Paper 1 presents a strong practical method for process rewards in specific domains, Paper 2's theoretical grounding and causal analysis of self-correction have broader, field-wide implications for AI reliability and agent design.

vs. RePAIR: Interactive Machine Unlearning through Prompt-Aware Model Repair

gpt-5.2 · 5/5/2026

Paper 2 likely has higher impact: it introduces a new user-facing paradigm (interactive unlearning at inference time) plus a concrete, efficient method (STAMP with low-rank acceleration) that enables practical on-device model editing—highly timely given privacy, safety, and regulatory pressures. The approach has broad applications (data erasure, misinformation, harmful content) and crosses security/privacy/ML systems. Paper 1 offers a valuable theoretical framing and diagnostic for self-correction, but its primary contribution is analysis and prompting guidance with narrower downstream leverage compared to scalable unlearning capabilities.

vs. Prompt Optimization Is a Coin Flip: Diagnosing When It Helps in Compound AI Systems

claude-opus-4.6 · 5/5/2026

Paper 1 offers a more novel and rigorous theoretical framework by recasting LLM self-correction as a feedback-control problem with a measurable stability threshold (ECR/EIR), validated across 7 models and 3 datasets with causal evidence from prompt interventions. It provides actionable, principled guidance for a widely-used agentic LLM technique. Paper 2 provides useful diagnostics for prompt optimization but addresses a narrower scope with more incremental insights (interaction effects are null, success depends on output format). Paper 1's control-theoretic framing has broader interdisciplinary appeal and deeper implications for LLM system design.