Confidence-Orchestrated Self-Evolution against Uncertain LLM Feedback

Bowen Wei, Nan Wang, Yuqing Zhou, Jinhao Pan, Ziwei Zhu

May 27, 2026

arXiv:2605.28010v1

cs.AI (primary)

#1277 of 2682 · Artificial Intelligence

Tournament Score: 1414 ± 48 (scale 1050 to 1800)

Win Rate: 60%

Rating: 4/10

Significance 5 · Rigor 3 · Novelty 4.5 · Clarity 7

Abstract

Self-evolving large language models (LLMs) learn by generating their own training tasks and solutions, reducing reliance on human-curated supervision. However, in many reasoning domains, the model must also validate generated tasks and judge generated answers to obtain training signals. This creates a training-signal challenge: erroneous self-judgments become erroneous gradient updates. Existing approaches either rely on external verifiers, which limits generality, or treat noisy self-generated feedback as supervision. We propose COSE (Confidence-Orchestrated Self-Evolution), which uses the LLM's intrinsic confidence as a lightweight uncertainty signal to modulate learning. COSE introduces confidence-weighted PPO updates and confidence-prioritized replay. Across 19 held-out benchmarks and four Qwen/Llama backbones (0.6B--4B), COSE consistently improves over base models and achieves the best average performance in general reasoning and mathematics, while remaining competitive on code. Code and data are available at https://anonymous.4open.science/r/COSE_-B5C2.

AI Impact Assessments

(1 model)

Scientific Impact Assessment: COSE (Confidence-Orchestrated Self-Evolution against Uncertain LLM Feedback)

1. Core Contribution

COSE addresses the problem of noisy self-generated feedback in self-evolving LLMs—systems where the same model generates tasks, validates them, solves them, and judges solutions. The key insight is that erroneous self-judgments become erroneous gradient updates, creating a compounding error problem. COSE proposes two mechanisms: (1) confidence-weighted PPO, which reduces gradient contributions from uncertain Validator/Judge feedback, and (2) confidence-prioritized replay, which biases sampling toward confidently validated, intermediate-difficulty questions. Confidence is estimated via normalized token-level entropy averaged across the feedback sequence. The idea is simple, intuitive, and computationally lightweight.
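The confidence signal described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: it assumes confidence is one minus the mean token entropy, with each token's entropy divided by its maximum `log(vocab_size)`. The function names and the `replay_priority` formula (confidence times a term that peaks at a 50% solve rate) are hypothetical choices made only to show the idea of favoring confidently validated, intermediate-difficulty questions.

```python
import math
from typing import Sequence


def token_entropy(probs: Sequence[float]) -> float:
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def sequence_confidence(token_dists: Sequence[Sequence[float]]) -> float:
    """Confidence in [0, 1]: one minus the mean normalized token entropy.

    Assumption: each token's entropy is divided by log(vocab_size),
    the entropy of a uniform distribution over the same vocabulary.
    """
    if not token_dists:
        raise ValueError("need at least one token distribution")
    normalized = []
    for probs in token_dists:
        max_entropy = math.log(len(probs))
        if max_entropy > 0.0:
            normalized.append(token_entropy(probs) / max_entropy)
        else:
            normalized.append(0.0)  # single-token vocabulary: no uncertainty
    return 1.0 - sum(normalized) / len(normalized)


def replay_priority(confidence: float, solve_rate: float) -> float:
    """Hypothetical replay priority (not taken from the paper).

    4 * p * (1 - p) peaks at solve_rate = 0.5 and vanishes at 0 and 1,
    so questions that are too easy or too hard are sampled less often.
    """
    return confidence * 4.0 * solve_rate * (1.0 - solve_rate)


if __name__ == "__main__":
    peaked = [0.97, 0.01, 0.01, 0.01]   # model is nearly certain
    flat = [0.25, 0.25, 0.25, 0.25]     # model is maximally uncertain
    print(sequence_confidence([peaked, peaked]))
    print(sequence_confidence([peaked, flat]))
    print(replay_priority(0.9, 0.5), replay_priority(0.9, 0.95))
```

A feedback sequence made of peaked distributions scores higher than one containing flat distributions, which is the property the weighting and replay mechanisms rely on.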

2. Methodological Rigor

Serious evaluation concerns. The most significant issue is the evaluation methodology and the resulting numbers. The paper uses `gpt-4.1-nano` as an LLM judge rather than standard exact-match or established evaluation harnesses. This produces results that are extremely difficult to credit:

Qwen3-0.6B achieves 95.00% on MMLU (up from a base score of 44.00%), 82.67% on GPQA, and 97.00% on LiveBench Reasoning through self-evolution alone. For context, GPQA is designed to be "graduate-level Google-proof," and no 0.6B model has been credibly reported near these levels.

Llama-3.2-3B-Instruct under MAE reaches 95.00% on MMLU, 81.00% on GPQA, and 93.00% on LiveBench-R. These would be state-of-the-art results even for models 10–100× larger.

The fact that *all* self-evolution methods achieve these extraordinary numbers suggests systematic evaluation inflation rather than genuine capability gains.

The paper does not acknowledge or discuss the implausibility of these numbers, nor does it cross-validate against standard evaluation protocols (e.g., lm-evaluation-harness with exact match). This undermines the paper's central empirical claims. If the LLM judge is giving generous credit for detailed reasoning outputs regardless of final answer correctness, then the entire comparison between methods could be measuring output verbosity/style rather than reasoning capability.
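The cross-validation suggested here could look roughly like the sketch below. It assumes that a final answer has already been extracted from each model output and that the judge's per-item verdicts are available as booleans; the function names and the normalization rule are illustrative and are not taken from the paper or from lm-evaluation-harness.

```python
from typing import Dict, Sequence


def normalize_answer(text: str) -> str:
    """Lowercase, collapse whitespace, and trim surrounding punctuation."""
    return " ".join(text.lower().split()).strip(" .,:;!?")


def exact_match(prediction: str, reference: str) -> bool:
    """True when the normalized prediction equals the normalized reference."""
    return normalize_answer(prediction) == normalize_answer(reference)


def judge_vs_exact_match(
    predictions: Sequence[str],
    references: Sequence[str],
    judge_verdicts: Sequence[bool],
) -> Dict[str, float]:
    """Compare LLM-judge verdicts with exact-match scoring on the same items."""
    if not (len(predictions) == len(references) == len(judge_verdicts)):
        raise ValueError("all inputs must have the same length")
    if not predictions:
        raise ValueError("need at least one item")
    n = len(predictions)
    em = [exact_match(p, r) for p, r in zip(predictions, references)]
    return {
        "judge_accuracy": sum(judge_verdicts) / n,
        "exact_match_accuracy": sum(em) / n,
        # Share of items where both scoring methods give the same verdict.
        "agreement": sum(j == e for j, e in zip(judge_verdicts, em)) / n,
        # Share of items the judge accepts although exact match rejects them.
        "judge_only_rate": sum(j and not e for j, e in zip(judge_verdicts, em)) / n,
    }


if __name__ == "__main__":
    report = judge_vs_exact_match(
        predictions=["B", "42.", "Paris"],
        references=["b", "42", "London"],
        judge_verdicts=[True, True, True],
    )
    print(report)
```

A large `judge_only_rate` would be consistent with the concern that the judge rewards reasoning style rather than a correct final answer.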

Design and ablation quality. Setting evaluation aside, the experimental design is otherwise thorough: four backbones (0.6B–4B), 19 benchmarks across three domains, three baselines, and a clean ablation study. The ablation (Table 5) clearly shows confidence-weighted PPO as the dominant contributor, and the hyperparameter study (Tables 6–7) demonstrates reasonable robustness to batch size and confidence signal choice. The per-benchmark training dynamics in the appendix are commendably detailed.

3. Potential Impact

If the evaluation concerns are resolved and the gains hold under standard protocols, confidence-weighted self-evolution could be a practical and broadly applicable technique. The approach is model-agnostic, requires no external verifiers, and adds minimal overhead. It would be particularly valuable for domains lacking executable verification (e.g., open-ended reasoning, humanities, social sciences). However, the current evidence is insufficient to establish this impact with confidence.

The conceptual contribution—treating noisy self-feedback as a per-sample weighting problem rather than a filtering problem—is a useful framing that could influence future work on self-improving systems regardless of COSE's specific implementation.

4. Timeliness & Relevance

The paper addresses a timely problem. Self-evolving and self-play methods for LLM training are rapidly developing (AZR, R-Zero, MAE, SPICE), and the noisy feedback bottleneck is a genuine and recognized challenge. The paper is well-positioned in the literature progression from SFT → RLVR → self-evolution with verifiers → self-evolution with LLM judges, and COSE fills a natural gap. The focus on small models (0.6B–4B) is relevant for resource-constrained settings.

5. Strengths & Limitations

Strengths:

Clear problem identification and well-motivated solution

Lightweight mechanism (entropy-based confidence) with no external dependencies

Comprehensive experimental coverage across models and benchmarks

Clean ablation separating PPO weighting from replay priority contributions

Detailed training dynamics analysis (Appendix F) showing per-benchmark behavior

Code and data availability pledged

Limitations:

Critical: evaluation methodology produces implausible results that undermine all quantitative claims. No cross-validation with standard evaluation harnesses is provided.

Limited novelty: entropy-based confidence estimation, sample weighting, and prioritized replay are all well-established techniques. The combination is sensible but incremental.

Small model scale only (≤4B parameters). The paper's own limitations section notes that larger models may behave differently, and this is important since the confidence-weighting argument may change qualitatively at scale.

No analysis of confidence calibration: the paper claims confidence need not be calibrated, but provides no empirical evidence that entropy correlates with judgment correctness in practice.

Curriculum drift: the TriviaQA regression on Qwen3-4B and the acknowledged inability to handle distribution mismatch between self-generated curriculum and evaluation are meaningful limitations.

The confidence floor of 0.1 means even maximally uncertain feedback still contributes 10% gradient weight; no justification for this specific threshold is provided beyond preserving exploration.
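To make the effect of the 0.1 floor concrete, here is a minimal sketch of a confidence-weighted clipped PPO surrogate. It assumes the weight is `max(floor, confidence)`; the paper may combine the floor and the confidence differently, and the names `floored_weight` and `weighted_ppo_loss` are hypothetical.

```python
from typing import Sequence


def floored_weight(confidence: float, floor: float = 0.1) -> float:
    """Clamp confidence into [floor, 1] so no sample is silenced completely."""
    return max(floor, min(1.0, confidence))


def weighted_ppo_loss(
    ratios: Sequence[float],
    advantages: Sequence[float],
    confidences: Sequence[float],
    clip_eps: float = 0.2,
    floor: float = 0.1,
) -> float:
    """Mean negative clipped surrogate, scaled per sample by floored confidence.

    ratios are the new-policy / old-policy probability ratios.
    """
    if not (len(ratios) == len(advantages) == len(confidences)):
        raise ValueError("all inputs must have the same length")
    if not ratios:
        raise ValueError("need at least one sample")
    total = 0.0
    for ratio, adv, conf in zip(ratios, advantages, confidences):
        clipped = max(1.0 - clip_eps, min(1.0 + clip_eps, ratio))
        surrogate = min(ratio * adv, clipped * adv)  # standard PPO-clip term
        total += floored_weight(conf, floor) * surrogate
    return -total / len(ratios)


if __name__ == "__main__":
    # Same sample, judged with full confidence versus zero confidence.
    print(weighted_ppo_loss([1.0], [1.0], [1.0]))
    print(weighted_ppo_loss([1.0], [1.0], [0.0]))
```

Under this assumption, a sample judged with zero confidence still contributes one tenth of the loss it would contribute at full confidence, which is the behavior the limitation above questions.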

Additional Observations

The paper would be substantially strengthened by: (1) reporting results under standard exact-match evaluation to validate the LLM-judge numbers, (2) analyzing the empirical correlation between confidence scores and actual judgment correctness, and (3) testing on at least one model ≥7B parameters. Without resolving the evaluation concerns, it is impossible to determine whether COSE genuinely outperforms baselines or whether the apparent gains reflect evaluation artifacts.
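Recommendation (2) could be checked with a ranking statistic such as AUROC, sketched below. This assumes a labeled sample in which each Judge verdict has been marked correct or incorrect against ground truth; such labels are an assumption of this example and are not described in the review. The function name `confidence_auroc` is illustrative.

```python
from typing import Sequence


def confidence_auroc(confidences: Sequence[float], correct: Sequence[bool]) -> float:
    """Probability that a correct judgment has higher confidence than an
    incorrect one (ties count as one half).

    0.5 means confidence carries no ranking information; 1.0 means it
    separates correct from incorrect judgments perfectly.
    """
    if len(confidences) != len(correct):
        raise ValueError("inputs must have the same length")
    pos = [c for c, ok in zip(confidences, correct) if ok]
    neg = [c for c, ok in zip(confidences, correct) if not ok]
    if not pos or not neg:
        raise ValueError("need at least one correct and one incorrect judgment")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


if __name__ == "__main__":
    conf = [0.9, 0.8, 0.3, 0.2]
    labels = [True, True, False, False]
    print(confidence_auroc(conf, labels))
```

Because AUROC depends only on the ordering of confidence scores, it tests the weaker claim that confidence is informative without requiring it to be calibrated.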

Rating: 4/10

Significance 5 · Rigor 3 · Novelty 4.5 · Clarity 7

Generated May 28, 2026

Comparison History (15)

vs. Look on Demand: A Cognitive Scheduling Framework for Visual Evidence Acquisition in Multimodal Reasoning

gpt-5.2 · 5/28/2026

Paper 2 is likely to have higher scientific impact due to stronger novelty and broader cross-field relevance: it introduces a general cognitive scheduling principle for when to acquire visual evidence, addressing structural limitations of both caption-then-reason and end-to-end VLMs. This idea is timely for multimodal agents and has clear real-world applications (interactive perception, robotics, document/UI understanding) and broader impact spanning vision, NLP, and agentic planning. Paper 1 is methodologically solid and useful, but confidence-weighted RL updates/replay are more incremental and narrower in scope, with impact mainly within self-training/RLHF-style LLM training.

vs. Verifiable Benchmarking of Long-Horizon Spatial Biology

gpt-5.2 · 5/28/2026

Paper 2 likely has higher impact: it introduces a rigorously validated, multi-technology benchmark for end-to-end scientific reasoning over complex spatial biology data, addressing an urgent evaluation gap for real-world scientific agents. Its deterministic grading, hardened claims, and broad modality coverage increase methodological rigor and reproducibility, and the benchmark can catalyze progress across ML, computational biology, and bioinformatics. Paper 1 is a solid algorithmic improvement for self-training LLMs, but confidence-weighted updates are a narrower methodological contribution with more incremental novelty and less domain-shaping infrastructure effect.

vs. Automatic Layer Selection for Hallucination Detection

gpt-5.2 · 5/28/2026

Paper 1 likely has higher impact: it proposes a general, scalable training paradigm (confidence-modulated self-evolution with PPO and replay) that directly improves LLM capability across many benchmarks and backbones, affecting core model training practice. Its novelty is in leveraging intrinsic confidence to mitigate noisy self-feedback, a key bottleneck for autonomous improvement, with broad applications (reasoning, math, potentially other domains) and strong empirical coverage. Paper 2 is elegant and practical but narrower (hallucination detection), with impact mainly in evaluation/monitoring rather than improving underlying model competence.

vs. AIBuildAI-2: A Knowledge-Enhanced Agent for Automatically Building AI Models

claude-opus-4.6 · 5/28/2026

AIBuildAI-2 addresses a broader and more transformative problem—automating AI model development for non-experts, particularly scientists—with strong empirical validation (top rankings on MLE-Bench and competitive with human experts). Its knowledge-enhanced agent architecture with evolving knowledge systems has wider cross-disciplinary impact potential. While COSE makes a solid contribution to self-evolving LLMs with confidence-based learning, it represents a more incremental improvement within the LLM training paradigm. AIBuildAI-2's practical applications for democratizing AI across scientific domains give it higher potential impact.

vs. SwarmHarness: Skill-Based Task Routing via Decentralized Incentive-Aligned AI Agent Networks

gpt-5.2 · 5/28/2026

Paper 2 (COSE) has higher likely scientific impact: it proposes a concrete, broadly applicable training method for self-improving LLMs using confidence-weighted RL updates and replay, evaluated across many benchmarks and multiple backbones with released code/data—suggesting strong methodological rigor and near-term adoptability. Its contributions directly target a timely core problem (learning from uncertain self-feedback) relevant to many LLM training pipelines. Paper 1 is ambitious and potentially impactful, but relies on more speculative protocol/economic assumptions and lacks demonstrated empirical validation in the abstract, making near-term scientific uptake less certain.

vs. PetroBench: A Benchmark for Large Language Models in Petroleum Engineering

gemini-3.1 · 5/28/2026

Paper 2 presents a foundational methodological advancement in LLM training (self-evolution using intrinsic confidence), addressing the critical challenge of noisy self-generated feedback without relying on external verifiers. Its extensive evaluation across 19 benchmarks demonstrates broad applicability to general reasoning, mathematics, and code. In contrast, Paper 1 introduces a domain-specific benchmark for petroleum engineering. While practically useful for that specific industry, Paper 2 has a significantly wider breadth of impact and higher potential to influence the broader AI and machine learning research community.

vs. Plant, Persist, Trigger: Sleeper Attack on Large Language Model Agents

claude-opus-4.6 · 5/28/2026

Paper 2 introduces a novel and practically significant threat model ('Sleeper Attack') for LLM agents that formalizes cross-interaction persistence of adversarial content—a largely unexplored attack surface. This has broad implications for AI safety and security as LLM agents become widely deployed. The comprehensive benchmark (1,896 instances, multiple attack strategies, seven LLMs) demonstrates methodological rigor. Paper 1, while solid, represents an incremental improvement in self-evolving LLMs using confidence signals. Paper 2's novelty in identifying a new class of vulnerabilities and its timeliness given rapid LLM agent adoption give it higher potential impact.

vs. GONDOR to the Rescue: Satisficing Planning with Low Memory

gpt-5.2 · 5/28/2026

Paper 1 likely has higher impact: it tackles a timely, broadly relevant bottleneck in self-improving LLM training—learning from uncertain self-generated feedback—using a general, lightweight confidence-based mechanism (confidence-weighted PPO and prioritized replay). It is evaluated across many benchmarks and multiple popular backbones, suggesting robustness and wide applicability to LLM alignment, reasoning, and RLHF-like training. Paper 2 is methodologically solid and useful for memory-constrained planning, but its scope is narrower and likely impacts a more specialized community.

vs. Global Policy-Space Response Oracles for Two-Player Zero-Sum Games

gpt-5.2 · 5/28/2026

Paper 1 has higher potential impact due to strong timeliness and breadth: reliable self-improvement/self-training of LLMs is a central, fast-moving problem with broad applicability across reasoning, math, and code. COSE’s use of intrinsic confidence to mitigate noisy self-feedback is a simple, general mechanism that could be adopted widely and combined with many training pipelines, and it is validated across many benchmarks and multiple model families. Paper 2 is rigorous and valuable for game-theoretic RL, but its scope and immediate cross-field adoption are narrower than LLM self-evolution.

vs. Mitigating Object Hallucinations in Vision-Language Models through Region-Aware Attention Recalibration

gpt-5.2 · 5/28/2026

Paper 1 likely has higher scientific impact due to broader applicability and timeliness: COSE addresses a fundamental bottleneck in self-improving LLMs (learning under uncertain self-feedback) with a general, lightweight mechanism (confidence-weighted RL updates + replay) demonstrated across 19 benchmarks and multiple backbones. This can influence training paradigms for many domains beyond a single failure mode. Paper 2 is valuable and practical, but is narrower (object hallucination in LVLMs) and primarily an inference-time attention adjustment, with impact concentrated in multimodal generation reliability.

vs. GlobalDentBench: A Multinational Benchmark for Evaluating LLM Clinical Reasoning in Dentistry with Expert Calibration

claude-opus-4.6 · 5/28/2026

GlobalDentBench has higher estimated scientific impact due to its broader interdisciplinary relevance spanning AI and healthcare, its novel contribution as the first multinational dental benchmark (8,978 questions across 88 countries, 14 specialties), and its critical safety findings (31% unsafe rate, 4.51% irreversible harm risk). These results have immediate implications for clinical AI deployment policy. The benchmark fills a clear gap in medical AI evaluation. While COSE offers meaningful methodological contributions to LLM self-evolution, its incremental improvements on existing paradigms and narrower scope limit its broader impact compared to GlobalDentBench's patient-safety implications.

vs. Can Broad Biomedical Knowledge be Contextualized into Scenario-Grounded Propositions?

gemini-3.1 · 5/28/2026

Paper 1 addresses a fundamental bottleneck in AI: the reliance on human-curated data and external verifiers for LLM training. By enabling self-evolving LLMs to reliably use intrinsic confidence to mitigate noisy self-feedback, it offers a scalable solution to the AI 'data wall' problem. This methodological innovation has broad, domain-agnostic impact across all fields utilizing generative AI. While Paper 2 presents a strong, applied framework for biomedical discovery, Paper 1's foundational contribution to autonomous AI capability development gives it a wider and more immediate transformational impact across the broader scientific landscape.

vs. Hylos: Operability Contracts for Model-Native Spatial Intelligence

gpt-5.2 · 5/28/2026

Paper 1 has higher likely impact due to a concrete, broadly applicable training method for self-improving LLMs, validated with extensive experiments across 19 benchmarks and multiple backbones, and offering an immediately adoptable technique (confidence-weighted PPO + prioritized replay). It is timely for scalable RLHF/RLAIF-like regimes and can influence many downstream reasoning/math applications. Paper 2 presents an important systems/position vision for operable 3D generation, but with limited empirical validation and narrower near-term adoption; its impact may be longer-term and contingent on community uptake.

vs. Exploiting Local Dynamics Regularity for Reusable Skills in Offline Hierarchical RL

gemini-3.1 · 5/28/2026

Paper 1 addresses a highly critical and timely bottleneck in the rapidly expanding field of Large Language Models: reducing reliance on human-curated data through self-evolution. Its approach of using intrinsic confidence to mitigate noisy self-generated feedback is novel and widely applicable across various LLM domains. While Paper 2 presents a solid advancement in hierarchical reinforcement learning, the current explosion of LLM applications and the foundational importance of scalable self-improvement methods give Paper 1 a significantly broader and more immediate potential scientific impact.

vs. MedGuideX: Internalizing Decision Logic from Executable Guidelines into Large Language Models for Clinical Reasoning

gpt-5.2 · 5/28/2026

Paper 2 has higher potential impact due to a clearer, high-stakes real-world application (clinical decision support) and a novel, broadly reusable pipeline that converts clinical practice guidelines into executable logic to generate factual/counterfactual supervision. This directly targets reliability and faithfulness—key barriers to deployment—supported by benchmark gains and physician evaluation. Paper 1 is methodologically interesting and broadly applicable, but leverages model confidence (often poorly calibrated) as a training signal and appears more incremental relative to existing uncertainty-weighting and replay ideas. MedGuideX is also timely given regulatory and safety pressure in medical AI.