Confidence-Orchestrated Self-Evolution against Uncertain LLM Feedback
Bowen Wei, Nan Wang, Yuqing Zhou, Jinhao Pan, Ziwei Zhu
Abstract
Self-evolving large language models (LLMs) learn by generating their own training tasks and solutions, reducing reliance on human-curated supervision. However, in many reasoning domains, the model must also validate generated tasks and judge generated answers to obtain training signals. This creates a training-signal challenge: erroneous self-judgments become erroneous gradient updates. Existing approaches either rely on external verifiers, which limits generality, or treat noisy self-generated feedback as supervision. We propose COSE (Confidence-Orchestrated Self-Evolution), which uses the LLM's intrinsic confidence as a lightweight uncertainty signal to modulate learning. COSE introduces confidence-weighted PPO updates and confidence-prioritized replay. Across 19 held-out benchmarks and four Qwen/Llama backbones (0.6B–4B), COSE consistently improves over base models and achieves the best average performance in general reasoning and mathematics, while remaining competitive on code. Code and data are available at https://anonymous.4open.science/r/COSE_-B5C2.
AI Impact Assessments
Scientific Impact Assessment: COSE (Confidence-Orchestrated Self-Evolution against Uncertain LLM Feedback)
1. Core Contribution
COSE addresses the problem of noisy self-generated feedback in self-evolving LLMs—systems where the same model generates tasks, validates them, solves them, and judges solutions. The key insight is that erroneous self-judgments become erroneous gradient updates, creating a compounding error problem. COSE proposes two mechanisms: (1) confidence-weighted PPO, which reduces gradient contributions from uncertain Validator/Judge feedback, and (2) confidence-prioritized replay, which biases sampling toward confidently validated, intermediate-difficulty questions. Confidence is estimated via normalized token-level entropy averaged across the feedback sequence. The idea is simple, intuitive, and computationally lightweight.
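The three ingredients named above (entropy-based confidence, confidence-weighted PPO, and confidence-prioritized replay) can be illustrated with a minimal Python sketch. This is not the paper's implementation. The mapping from entropy to confidence (`1 - mean normalized entropy`), the linear scaling of the clipped surrogate by confidence, and the `replay_priority` formula are all assumptions chosen for illustration, and every function name is hypothetical.

```python
import math


def normalized_entropy(probs):
    """Shannon entropy of one token distribution divided by log(vocab size), so it lies in [0, 1]."""
    k = len(probs)
    if k < 2:
        return 0.0
    h = -sum(p * math.log(p) for p in probs if p > 0.0)
    return h / math.log(k)


def sequence_confidence(token_dists):
    """Assumed mapping: confidence = 1 - mean normalized entropy over the feedback tokens."""
    if not token_dists:
        raise ValueError("need at least one token distribution")
    mean_h = sum(normalized_entropy(d) for d in token_dists) / len(token_dists)
    return 1.0 - mean_h


def weighted_clipped_surrogate(ratio, advantage, confidence, clip_eps=0.2):
    """Standard PPO clipped objective for one sample, scaled by confidence (assumed linear weighting)."""
    clipped_ratio = max(1.0 - clip_eps, min(1.0 + clip_eps, ratio))
    return confidence * min(ratio * advantage, clipped_ratio * advantage)


def replay_priority(confidence, solve_rate):
    """Hypothetical priority: largest for confident validation and a solve rate near 0.5."""
    difficulty_term = 1.0 - abs(2.0 * solve_rate - 1.0)
    return confidence * difficulty_term


if __name__ == "__main__":
    peaked = [[0.97, 0.01, 0.01, 0.01]] * 3   # Judge is nearly certain at every token
    flat = [[0.25, 0.25, 0.25, 0.25]] * 3     # Judge is maximally uncertain
    for name, dists in (("peaked", peaked), ("flat", flat)):
        c = sequence_confidence(dists)
        print(name, round(c, 3), round(weighted_clipped_surrogate(1.1, 1.0, c), 3))
```

Under these assumptions, a sample judged with a flat token distribution contributes almost nothing to the objective, while a confidently judged sample keeps its full clipped surrogate value.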
2. Methodological Rigor
Serious evaluation concerns. The most significant issue is the evaluation methodology and the resulting numbers. The paper uses `gpt-4.1-nano` as an LLM judge rather than standard exact-match or established evaluation harnesses. This produces results that are extremely difficult to credit.
The paper does not acknowledge or discuss the implausibility of these numbers, nor does it cross-validate against standard evaluation protocols (e.g., lm-evaluation-harness with exact match). This undermines the paper's central empirical claims. If the LLM judge is giving generous credit for detailed reasoning outputs regardless of final answer correctness, then the entire comparison between methods could be measuring output verbosity/style rather than reasoning capability.
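The cross-check against exact match described above could look roughly like the following Python sketch. It is illustrative only: the record layout `(prediction, reference, judge_says_correct)`, the crude `normalize_answer` rule, and all function names are assumptions, not part of the paper or of lm-evaluation-harness.

```python
def normalize_answer(text):
    """Deliberately crude normalizer: lowercase, trim whitespace, drop a trailing period."""
    return text.strip().lower().rstrip(".").strip()


def exact_match(prediction, reference):
    """True when the normalized prediction equals the normalized reference."""
    return normalize_answer(prediction) == normalize_answer(reference)


def compare_protocols(records):
    """Compare LLM-judge verdicts with exact match.

    records: list of (prediction, reference, judge_says_correct) tuples (hypothetical layout).
    """
    n = len(records)
    if n == 0:
        raise ValueError("need at least one record")
    em = [exact_match(p, r) for p, r, _ in records]
    judge = [bool(j) for _, _, j in records]
    return {
        "exact_match_acc": sum(em) / n,
        "judge_acc": sum(judge) / n,
        # Fraction of items where both protocols give the same verdict.
        "agreement": sum(e == j for e, j in zip(em, judge)) / n,
        # Fraction of items the judge accepts but exact match rejects.
        "judge_only_credit": sum(j and not e for e, j in zip(em, judge)) / n,
    }


if __name__ == "__main__":
    toy = [
        ("42", "42", True),
        ("The answer is 7", "7", True),  # judge accepts, exact match rejects
        ("3.", "3", True),
        ("5", "6", False),
    ]
    print(compare_protocols(toy))
```

A large `judge_only_credit` would not by itself prove that the judge is too generous, since exact match also rejects correct but differently formatted answers. It would, however, flag the items worth inspecting by hand.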
Design and ablation quality. Setting evaluation aside, the experimental design is thorough: four backbones (0.6B–4B), 19 benchmarks across three domains, three baselines, and a clean ablation study. The ablation (Table 5) clearly shows confidence-weighted PPO as the dominant contributor, and the hyperparameter study (Tables 6–7) demonstrates reasonable robustness to batch size and confidence signal choice. The per-benchmark training dynamics in the appendix are commendably detailed.
3. Potential Impact
If the evaluation concerns are resolved and the gains hold under standard protocols, confidence-weighted self-evolution could be a practical and broadly applicable technique. The approach is model-agnostic, requires no external verifiers, and adds minimal overhead. It would be particularly valuable for domains lacking executable verification (e.g., open-ended reasoning, humanities, social sciences). However, the current evidence is insufficient to establish this impact with confidence.
The conceptual contribution—treating noisy self-feedback as a per-sample weighting problem rather than a filtering problem—is a useful framing that could influence future work on self-improving systems regardless of COSE's specific implementation.
4. Timeliness & Relevance
The paper addresses a timely problem. Self-evolving and self-play methods for LLM training are rapidly developing (AZR, R-Zero, MAE, SPICE), and the noisy feedback bottleneck is a genuine and recognized challenge. The paper is well-positioned in the literature progression from SFT → RLVR → self-evolution with verifiers → self-evolution with LLM judges, and COSE fills a natural gap. The focus on small models (0.6B–4B) is relevant for resource-constrained settings.
5. Strengths & Limitations
Strengths:
Limitations:
Additional Observations
The paper would be substantially strengthened by: (1) reporting results under standard exact-match evaluation to validate the LLM-judge numbers, (2) analyzing the empirical correlation between confidence scores and actual judgment correctness, and (3) testing on at least one model ≥7B parameters. Without resolving the evaluation concerns, it is impossible to determine whether COSE genuinely outperforms baselines or whether the apparent gains reflect evaluation artifacts.
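Recommendation (2) could be carried out with an analysis along the lines of the Python sketch below. It assumes a small set of Judge outputs has been labeled for correctness by some trusted procedure. The input layout, the equal-width binning, and the function names are hypothetical.

```python
import math


def pearson(xs, ys):
    """Pearson correlation; with binary ys this is the point-biserial correlation."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length sequences with at least two items")
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(vx * vy)


def accuracy_by_confidence_bin(confidences, correct, n_bins=5):
    """Return (bin_low, bin_high, accuracy_or_None, count) for equal-width bins on [0, 1]."""
    hits = [0] * n_bins
    counts = [0] * n_bins
    for c, ok in zip(confidences, correct):
        b = min(int(c * n_bins), n_bins - 1)  # a confidence of 1.0 falls in the last bin
        counts[b] += 1
        hits[b] += int(bool(ok))
    return [
        (b / n_bins, (b + 1) / n_bins, hits[b] / counts[b] if counts[b] else None, counts[b])
        for b in range(n_bins)
    ]


if __name__ == "__main__":
    conf = [0.9, 0.8, 0.3, 0.2]   # toy confidence scores
    ok = [1, 1, 0, 0]             # toy correctness labels for the same judgments
    print(round(pearson(conf, ok), 3))
    print(accuracy_by_confidence_bin(conf, ok, n_bins=2))
```

If confidence is a useful weighting signal, the correlation should be clearly positive and per-bin accuracy should rise with confidence. A flat or noisy profile would suggest the weights are not tracking judgment quality.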
Generated May 28, 2026
Comparison History (15)
Paper 2 is likely to have higher scientific impact due to stronger novelty and broader cross-field relevance: it introduces a general cognitive scheduling principle for when to acquire visual evidence, addressing structural limitations of both caption-then-reason and end-to-end VLMs. This idea is timely for multimodal agents and has clear real-world applications (interactive perception, robotics, document/UI understanding) and broader impact spanning vision, NLP, and agentic planning. Paper 1 is methodologically solid and useful, but confidence-weighted RL updates/replay are more incremental and narrower in scope, with impact mainly within self-training/RLHF-style LLM training.
Paper 2 likely has higher impact: it introduces a rigorously validated, multi-technology benchmark for end-to-end scientific reasoning over complex spatial biology data, addressing an urgent evaluation gap for real-world scientific agents. Its deterministic grading, hardened claims, and broad modality coverage increase methodological rigor and reproducibility, and the benchmark can catalyze progress across ML, computational biology, and bioinformatics. Paper 1 is a solid algorithmic improvement for self-training LLMs, but confidence-weighted updates are a narrower methodological contribution with more incremental novelty and less domain-shaping infrastructure effect.
Paper 1 likely has higher impact: it proposes a general, scalable training paradigm (confidence-modulated self-evolution with PPO and replay) that directly improves LLM capability across many benchmarks and backbones, affecting core model training practice. Its novelty is in leveraging intrinsic confidence to mitigate noisy self-feedback, a key bottleneck for autonomous improvement, with broad applications (reasoning, math, potentially other domains) and strong empirical coverage. Paper 2 is elegant and practical but narrower (hallucination detection), with impact mainly in evaluation/monitoring rather than improving underlying model competence.
AIBuildAI-2 addresses a broader and more transformative problem—automating AI model development for non-experts, particularly scientists—with strong empirical validation (top rankings on MLE-Bench and competitive with human experts). Its knowledge-enhanced agent architecture with evolving knowledge systems has wider cross-disciplinary impact potential. While COSE makes a solid contribution to self-evolving LLMs with confidence-based learning, it represents a more incremental improvement within the LLM training paradigm. AIBuildAI-2's practical applications for democratizing AI across scientific domains give it higher potential impact.
Paper 2 (COSE) has higher likely scientific impact: it proposes a concrete, broadly applicable training method for self-improving LLMs using confidence-weighted RL updates and replay, evaluated across many benchmarks and multiple backbones with released code/data—suggesting strong methodological rigor and near-term adoptability. Its contributions directly target a timely core problem (learning from uncertain self-feedback) relevant to many LLM training pipelines. Paper 1 is ambitious and potentially impactful, but relies on more speculative protocol/economic assumptions and lacks demonstrated empirical validation in the abstract, making near-term scientific uptake less certain.
Paper 2 presents a foundational methodological advancement in LLM training (self-evolution using intrinsic confidence), addressing the critical challenge of noisy self-generated feedback without relying on external verifiers. Its extensive evaluation across 19 benchmarks demonstrates broad applicability to general reasoning, mathematics, and code. In contrast, Paper 1 introduces a domain-specific benchmark for petroleum engineering. While practically useful for that specific industry, Paper 2 has a significantly wider breadth of impact and higher potential to influence the broader AI and machine learning research community.
Paper 2 introduces a novel and practically significant threat model ('Sleeper Attack') for LLM agents that formalizes cross-interaction persistence of adversarial content—a largely unexplored attack surface. This has broad implications for AI safety and security as LLM agents become widely deployed. The comprehensive benchmark (1,896 instances, multiple attack strategies, seven LLMs) demonstrates methodological rigor. Paper 1, while solid, represents an incremental improvement in self-evolving LLMs using confidence signals. Paper 2's novelty in identifying a new class of vulnerabilities and its timeliness given rapid LLM agent adoption give it higher potential impact.
Paper 1 likely has higher impact: it tackles a timely, broadly relevant bottleneck in self-improving LLM training—learning from uncertain self-generated feedback—using a general, lightweight confidence-based mechanism (confidence-weighted PPO and prioritized replay). It is evaluated across many benchmarks and multiple popular backbones, suggesting robustness and wide applicability to LLM alignment, reasoning, and RLHF-like training. Paper 2 is methodologically solid and useful for memory-constrained planning, but its scope is narrower and likely impacts a more specialized community.
Paper 1 has higher potential impact due to strong timeliness and breadth: reliable self-improvement/self-training of LLMs is a central, fast-moving problem with broad applicability across reasoning, math, and code. COSE’s use of intrinsic confidence to mitigate noisy self-feedback is a simple, general mechanism that could be adopted widely and combined with many training pipelines, and it is validated across many benchmarks and multiple model families. Paper 2 is rigorous and valuable for game-theoretic RL, but its scope and immediate cross-field adoption are narrower than LLM self-evolution.
Paper 1 likely has higher scientific impact due to broader applicability and timeliness: COSE addresses a fundamental bottleneck in self-improving LLMs (learning under uncertain self-feedback) with a general, lightweight mechanism (confidence-weighted RL updates + replay) demonstrated across 19 benchmarks and multiple backbones. This can influence training paradigms for many domains beyond a single failure mode. Paper 2 is valuable and practical, but is narrower (object hallucination in LVLMs) and primarily an inference-time attention adjustment, with impact concentrated in multimodal generation reliability.
GlobalDentBench has higher estimated scientific impact due to its broader interdisciplinary relevance spanning AI and healthcare, its novel contribution as the first multinational dental benchmark (8,978 questions across 88 countries, 14 specialties), and its critical safety findings (31% unsafe rate, 4.51% irreversible harm risk). These results have immediate implications for clinical AI deployment policy. The benchmark fills a clear gap in medical AI evaluation. While COSE offers meaningful methodological contributions to LLM self-evolution, its incremental improvements on existing paradigms and narrower scope limit its broader impact compared to GlobalDentBench's patient-safety implications.
Paper 1 addresses a fundamental bottleneck in AI: the reliance on human-curated data and external verifiers for LLM training. By enabling self-evolving LLMs to reliably use intrinsic confidence to mitigate noisy self-feedback, it offers a scalable solution to the AI 'data wall' problem. This methodological innovation has broad, domain-agnostic impact across all fields utilizing generative AI. While Paper 2 presents a strong, applied framework for biomedical discovery, Paper 1's foundational contribution to autonomous AI capability development gives it a wider and more immediate transformational impact across the broader scientific landscape.
Paper 1 has higher likely impact due to a concrete, broadly applicable training method for self-improving LLMs, validated with extensive experiments across 19 benchmarks and multiple backbones, and offering an immediately adoptable technique (confidence-weighted PPO + prioritized replay). It is timely for scalable RLHF/RLAIF-like regimes and can influence many downstream reasoning/math applications. Paper 2 presents an important systems/position vision for operable 3D generation, but with limited empirical validation and narrower near-term adoption; its impact may be longer-term and contingent on community uptake.
Paper 1 addresses a highly critical and timely bottleneck in the rapidly expanding field of Large Language Models: reducing reliance on human-curated data through self-evolution. Its approach of using intrinsic confidence to mitigate noisy self-generated feedback is novel and widely applicable across various LLM domains. While Paper 2 presents a solid advancement in hierarchical reinforcement learning, the current explosion of LLM applications and the foundational importance of scalable self-improvement methods give Paper 1 a significantly broader and more immediate potential scientific impact.
Paper 2 has higher potential impact due to a clearer, high-stakes real-world application (clinical decision support) and a novel, broadly reusable pipeline that converts clinical practice guidelines into executable logic to generate factual/counterfactual supervision. This directly targets reliability and faithfulness—key barriers to deployment—supported by benchmark gains and physician evaluation. Paper 1 is methodologically interesting and broadly applicable, but leverages model confidence (often poorly calibrated) as a training signal and appears more incremental relative to existing uncertainty-weighting and replay ideas. MedGuideX is also timely given regulatory and safety pressure in medical AI.