EVE-Agent: Evidence-Verifiable Self-Evolving Agents

Yamato Arai, Yuma Ichikawa

May 21, 2026

arXiv:2605.22905v1

cs.AI (primary), cs.CL

#441 of 2682 · Artificial Intelligence

Tournament Score: 1486 ± 45 (scale 1050 to 1800)

Win Rate: 74%

Rating: 6/10

Significance 6.5 · Rigor 6 · Novelty 6.5 · Clarity 7.5

Abstract

Self-evolving agents should not train on examples they cannot justify. Data-free self-evolving search agents offer a scalable route to systems that generate their own questions, answer them, and improve from their own feedback without human annotations. Yet, without verifiable evidence, this loop can reward fluent but unsupported examples, turning the self-generated curriculum into an opaque and potentially unreliable training signal. We argue that evidence verifiability is a prerequisite for trustworthy self-evolution in search agents: each generated instance should include not only an answer but also a source-grounded span whose contribution to that answer can be measured. We introduce EVE-Agent, an Evidence-Verifiable Self-Evolving Agent that operationalizes this principle through a modification to the proposer-solver framework. The proposer generates a question, an answer, and a verbatim evidence span. An evidence verifier then rewards the span according to the marginal accuracy gain when the evidence is provided. This produces a training signal that favors evidence that genuinely helps answer the question, without requiring oracle answers, human labels, or external annotations. EVE-Agent leaves the backbone model, retriever, search tool, and optimization framework unchanged. Experiments show that EVE-Agent substantially improves evidence-grounded correctness over prior self-evolving search agents. The resulting curriculum is not merely self-generated but auditable by construction: each training example carries an inspectable source span that explains why it should be trusted.

AI Impact Assessments

(1 model)

Scientific Impact Assessment: EVE-Agent: Evidence-Verifiable Self-Evolving Agents

1. Core Contribution

EVE-Agent addresses a specific and well-motivated gap in data-free self-evolving search agents: the lack of evidence verification in the training loop. Prior self-evolving frameworks (e.g., Dr. Zero) reward proposers based on question difficulty but never check whether the generated answer is actually grounded in retrievable source text. EVE-Agent introduces an evidence verifier that measures the marginal accuracy gain when a proposer-emitted evidence span is provided to the solver versus withheld. This creates a causal, annotation-free signal that rewards evidence spans only when they genuinely help the solver answer correctly.

The key design insight is elegant: by comparing solver accuracy with and without evidence (both in single-turn, search-disabled mode), the system isolates the informational contribution of the span itself, without requiring oracle labels, human annotations, or external knowledge graphs. The modification is deliberately minimal—only the reward function changes; the backbone, retriever, search tool, and optimization framework (HRPO/GRPO) remain identical.
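The with/without comparison described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: `solver` is a hypothetical callable standing in for the single-turn, search-disabled solver, `m` is an assumed number of sampled attempts per condition, and case-insensitive string equality stands in for whatever correctness check the paper uses.

```python
from typing import Callable, Optional

# Hypothetical solver interface: (question, optional evidence) -> predicted answer.
Solver = Callable[[str, Optional[str]], str]


def accuracy(solver: Solver, question: str, answer: str,
             evidence: Optional[str], m: int) -> float:
    """Fraction of m sampled attempts whose prediction matches the answer."""
    hits = sum(
        solver(question, evidence).strip().lower() == answer.strip().lower()
        for _ in range(m)
    )
    return hits / m


def marginal_gain(solver: Solver, question: str, answer: str,
                  evidence: str, m: int = 8) -> float:
    """Accuracy with the evidence span minus accuracy without it."""
    with_evidence = accuracy(solver, question, answer, evidence, m)
    without_evidence = accuracy(solver, question, answer, None, m)
    return with_evidence - without_evidence


def toy_solver(question: str, evidence: Optional[str]) -> str:
    """Toy stand-in: answers correctly only when the evidence names the answer."""
    if evidence is not None and "Paris" in evidence:
        return "Paris"
    return "unknown"


helpful = marginal_gain(toy_solver, "Capital of France?", "Paris",
                        "Paris is the capital of France.")
unhelpful = marginal_gain(toy_solver, "Capital of France?", "Paris",
                          "France is in Europe.")
```

In this toy setup the span that contains the answer earns the full gain and the irrelevant span earns none. A question the solver could already answer without evidence would also earn no gain, since both accuracies would be equal.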

2. Methodological Rigor

The paper demonstrates solid methodological discipline:

Formal grounding: The evidence verifier is formalized precisely (Eq. 11), with a proven unbiased Monte Carlo estimator (Proposition B.2) and bounded variance (≤ 1/(2m)). The inherited difficulty reward is also formally characterized (Lemma B.1) with a closed-form expectation and unique interior maximizer. These aren't deep theoretical results, but they provide clean foundations.
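A bound of this form can be derived in a few lines. The sketch below assumes, without confirmation from the paper, that the estimator is the difference of two independent sample means, each over m independent Bernoulli correctness indicators. The symbols \hat{V}, X_i^{+}, X_i^{-}, p_{+}, and p_{-} are introduced here for illustration only.

```latex
\begin{align*}
\hat{V} &= \frac{1}{m}\sum_{i=1}^{m} X_i^{+} - \frac{1}{m}\sum_{i=1}^{m} X_i^{-},
\qquad X_i^{+} \sim \mathrm{Bernoulli}(p_{+}),\quad X_i^{-} \sim \mathrm{Bernoulli}(p_{-}) \\
\mathbb{E}[\hat{V}] &= p_{+} - p_{-} \\
\operatorname{Var}(\hat{V}) &= \frac{p_{+}(1-p_{+})}{m} + \frac{p_{-}(1-p_{-})}{m}
\;\le\; \frac{1}{4m} + \frac{1}{4m} = \frac{1}{2m}
\end{align*}
```

The inequality uses p(1-p) ≤ 1/4 for any p in [0, 1]. Under these assumptions the estimator is unbiased for the marginal accuracy gain and its variance shrinks linearly in m.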

Controlled comparison: The experimental design carefully isolates the evidence verifier's contribution. EVE-Agent and Dr. Zero share the same backbone (Qwen2.5-3B-Instruct), retrieval corpus (FlashRAG Wikipedia-2018), search tool configuration, compute budget (single 8×B200 node, 50 steps per phase), and optimization framework. This matched comparison is a strength.

Evaluation metrics: The three-metric framework (answer EM, evidence score, joint correctness) is well-chosen. The joint metric—requiring both correct answer AND supporting evidence—is the most diagnostic and shows the largest relative improvements.
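To make the relationship among the three metrics concrete, here is a small sketch. It assumes per-instance evidence judgments are already available as booleans (in the paper they come from a GPT-4.1 judge, and may not be binary), and it uses a simplified exact match. The function and field names are hypothetical.

```python
from typing import Sequence


def exact_match(prediction: str, gold: str) -> bool:
    """Case- and whitespace-insensitive string equality (a simplified EM)."""
    return " ".join(prediction.lower().split()) == " ".join(gold.lower().split())


def score(predictions: Sequence[str], golds: Sequence[str],
          evidence_ok: Sequence[bool]) -> dict:
    """Return answer EM, evidence score, and joint correctness as rates."""
    if not (len(predictions) == len(golds) == len(evidence_ok)):
        raise ValueError("inputs must have the same length")
    n = len(predictions)
    em = [exact_match(p, g) for p, g in zip(predictions, golds)]
    # Joint correctness requires a correct answer AND supporting evidence.
    joint = [a and e for a, e in zip(em, evidence_ok)]
    return {
        "answer_em": sum(em) / n,
        "evidence": sum(evidence_ok) / n,
        "joint": sum(joint) / n,
    }


metrics = score(
    predictions=["Paris", "berlin ", "Rome", "Oslo"],
    golds=["Paris", "Berlin", "Madrid", "Oslo"],
    evidence_ok=[True, False, True, True],
)
```

Because the joint metric is a per-instance conjunction, it can never exceed either of the other two rates, which is why it is the strictest of the three.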

Weaknesses in rigor: The evidence quality evaluation relies on GPT-4.1 as an external judge, which introduces its own biases and is not itself validated. No inter-annotator agreement or comparison with human judgments is reported. The experiments use only one backbone model (3B parameters), so generalization across model scales is unknown. Statistical significance tests are absent—given that some margins are small (e.g., MuSiQue), this matters. The optional cluster bandit selector is described in detail but never ablated, making its contribution unclear.

3. Potential Impact

Immediate applications: The evidence-verifiability principle has clear practical value for deploying search agents where auditability matters—legal research, medical QA, fact-checking systems. The audit trail (each training example carries an inspectable source span) addresses a real trust concern.

Broader influence: The paper's framing—"self-evolving agents should not train on examples they cannot justify"—articulates a useful design principle that could influence how the community thinks about self-play and self-improvement loops beyond search agents. The marginal-accuracy-gain formulation for evidence scoring could be adapted to other domains where grounding is important.

Limitations to impact: The absolute performance numbers are modest. Average answer EM is 0.221 (EVE-Agent) vs. 0.115 (Dr. Zero)—improved but still low. The joint answer-and-evidence rate averages 0.167, meaning roughly 83% of instances fail the strictest criterion. The approach is demonstrated only on a 3B model with a specific retrieval setup, limiting immediate practical deployment claims.
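The arithmetic behind these figures, using only the averages quoted above:

```latex
\begin{align*}
\text{absolute gain in answer EM} &= 0.221 - 0.115 = 0.106 \\
\text{relative gain in answer EM} &= \frac{0.221}{0.115} \approx 1.92\times \\
\text{share failing the joint criterion} &= 1 - 0.167 = 0.833 \approx 83\%
\end{align*}
```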

4. Timeliness & Relevance

The paper is highly timely. Self-evolving and self-play training paradigms are rapidly gaining traction (Absolute Zero, R-Zero, Dr. Zero, SAGE), and retrieval-augmented RL agents (Search-R1, R1-Searcher) represent an active frontier. The evidence-grounding gap identified here is a genuine vulnerability in existing systems—one that becomes more concerning as these systems scale. The paper correctly identifies that search-based QA lacks the clean verification mechanisms available in code/math domains, and proposes a principled alternative.

5. Strengths & Limitations

Key Strengths:

Clean problem formulation with a well-defined gap in prior work (Section 4.2's diagnostic is convincing)

Minimal, modular design—changing only the reward while preserving all other infrastructure makes the contribution easy to adopt

The evidence verifier requires no external annotations, maintaining the data-free property

Formal properties (unbiasedness, bounded variance) are stated and proved

The "auditable by construction" framing adds practical value

Notable Weaknesses:

Single model scale (3B)—no evidence this works at larger scales where memorization patterns differ

GPT-4.1 judge is unvalidated; evidence quality evaluation could be circular or biased

No ablation of the cluster bandit selector, despite detailed description

Brevity bonus (Eq. 15) and coefficient choices (λ_V=0.5, λ_B=0.1) appear ad hoc with no sensitivity analysis

The two-phase training schedule (freeze solver → train proposer → freeze proposer → train solver) is pragmatic but limits the co-adaptation that might further improve performance

Absolute performance remains low, suggesting the approach helps but doesn't solve the fundamental challenge

Limited to exact-match evaluation; no analysis of partial credit or answer quality beyond binary correctness

The paper references several works dated 2026, suggesting either a future publication date or unconventional citation practices that complicate situating the work in the literature
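The sensitivity analysis noted as missing for λ_V and λ_B could take a shape like the sketch below. Everything here is an assumption for illustration: the additive form of the reward, the component values, and the candidate names are hypothetical, and the paper's actual combination (including Eq. 15) may differ.

```python
from itertools import product


def total_reward(difficulty: float, verifier: float, brevity: float,
                 lam_v: float = 0.5, lam_b: float = 0.1) -> float:
    """ASSUMED additive form; the paper's actual combination may differ."""
    return difficulty + lam_v * verifier + lam_b * brevity


# Hypothetical per-candidate reward components: (difficulty, verifier, brevity).
candidates = {
    "grounded_long": (0.6, 0.8, 0.1),
    "grounded_short": (0.5, 0.7, 0.9),
    "ungrounded": (0.9, 0.0, 0.5),
}


def best_candidate(lam_v: float, lam_b: float) -> str:
    """Name of the candidate with the highest total reward for given weights."""
    return max(candidates,
               key=lambda name: total_reward(*candidates[name], lam_v, lam_b))


# Sweep a small grid to see where the preferred candidate flips.
sweep = {
    (lam_v, lam_b): best_candidate(lam_v, lam_b)
    for lam_v, lam_b in product([0.0, 0.5, 1.0], [0.0, 0.1, 0.5])
}
```

In this toy grid the preferred candidate changes with the weights: with λ_V = 0 the ungrounded candidate wins, at the reported setting (0.5, 0.1) the grounded long span wins, and a larger λ_B tips the choice to the shorter span. A real analysis would report downstream metrics across such a grid rather than toy rankings.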

Summary

EVE-Agent makes a focused, well-motivated contribution by introducing evidence verifiability into self-evolving search agent loops. The design is clean, the controlled comparison is fair, and the empirical improvements on the joint answer-and-evidence metric are substantial in relative terms. The paper's primary limitation is the narrow experimental scope (single model, single corpus, no ablation of all components) and modest absolute performance. It represents a useful incremental advance that articulates an important principle for the growing self-evolution paradigm.

Rating: 6/10

Significance 6.5 · Rigor 6 · Novelty 6.5 · Clarity 7.5

Generated May 25, 2026

Comparison History (27)

vs. Proper Scoring Rules for Agentic Uncertainty Quantification

gpt-5.2 · 5/26/2026

Paper 2 likely has higher scientific impact: it targets a timely, high-stakes problem (trustworthy self-improvement in agentic search) and proposes a broadly applicable, tool- and model-agnostic mechanism (evidence-verifiable training signal via marginal utility of evidence) with clear real-world relevance (auditable, grounded agents). If validated empirically, it can influence training paradigms across IR, RL/self-training, and agent design. Paper 1 is methodologically rigorous and novel in evaluation theory, but its impact is narrower (metrics/UQ evaluation) and more incremental to practice.

vs. Learning to Reason Efficiently with A* Post-Training

gemini-3.1 · 5/26/2026

Paper 2 addresses a critical bottleneck in LLM reasoning by integrating classical A* search algorithms into post-training. Its empirical results—enabling 1B-3B models to outperform much larger state-of-the-art models—demonstrate massive potential impact. Bridging classical search with process reward models aligns perfectly with the current frontier of 'System 2' AI reasoning, offering broader methodological implications and immediate real-world efficiency gains compared to the specialized agent self-evolution framework in Paper 1.

vs. CoRe-Code: Collaborative Reinforcement Learning for Code Generation

gemini-3.1 · 5/26/2026

While Paper 1 offers strong methodological improvements for code generation, Paper 2 tackles a more fundamental bottleneck in AI: the unreliability and hallucination risks of self-evolving agents training on synthetic data. By introducing an unsupervised, evidence-verifiable reward mechanism, EVE-Agent addresses core issues of trustworthiness, auditability, and scalable oversight. This approach to verifiable self-evolution has broader, cross-disciplinary implications for the safe and reliable deployment of autonomous AI systems.

vs. A Signal-Language Foundation Model for Broad-Spectrum Cardiovascular Assessment from Routine Electrocardiography

gemini-3.1 · 5/26/2026

Paper 1 presents a massive-scale foundation model for medical AI, trained on millions of ECGs. Its robust evaluation across 89 clinical tasks and multiple external cohorts demonstrates unprecedented generalization and real-world applicability for opportunistic screening and rare disease detection. While Paper 2 offers a valuable methodological improvement for LLM agents, Paper 1's potential to directly impact clinical practice, improve diagnostic capabilities, and save lives gives it a substantially higher scientific and societal impact.

vs. When Planning Fails Despite Correct Execution: On Epistemic Calibration for LLM-Based Multi-Agent Systems

claude-opus-4.6 · 5/25/2026

EVE-Agent addresses a fundamental challenge in self-evolving AI agents—trustworthy self-improvement without human supervision—by introducing evidence verifiability as a core principle. This has broader impact across retrieval-augmented generation, autonomous agents, and AI safety/trustworthiness. The approach is more generalizable (applicable to any self-evolving agent framework) and addresses the timely concern of AI hallucination and groundedness. Paper 1, while addressing a valid problem in multi-agent planning, targets a narrower issue (epistemic miscalibration) with more modest improvements (9.75%) and less broad applicability.

vs. Co-ReAct: Rubrics as Step-Level Collaborators for ReAct Agents

gpt-5.2 · 5/25/2026

Paper 1 has higher estimated scientific impact due to a more novel and broadly applicable principle: making self-evolution data generation evidence-verifiable via a measurable marginal-utility signal for retrieved spans, enabling auditable, label-free improvement. This directly addresses a central reliability failure mode (self-reinforced hallucinated curricula) and can generalize across search/RAG, agent training, and safety/verification. Paper 2 is timely and useful for improving ReAct trajectories, but rubric-guided inference and learned evaluators are a more incremental extension with narrower cross-field reach and potentially higher sensitivity to rubric quality/domain shift.

vs. SkillOpt: Executive Strategy for Self-Evolving Agent Skills

claude-opus-4.6 · 5/25/2026

SkillOpt demonstrates broader empirical impact across 52 evaluation cells spanning six benchmarks, seven models, and three execution harnesses, consistently outperforming all competitors. Its framing of text-space skill optimization as analogous to weight-space optimization introduces a novel paradigm with wide applicability to any LLM agent. The transferability results across models and environments further strengthen its practical impact. While EVE-Agent addresses an important trustworthiness problem in self-evolving search agents, its scope is narrower (search agents only), and its contribution is more incremental—adding an evidence verification step to existing proposer-solver frameworks.

vs. Is Capability a Liability? More Capable Language Models Make Worse Forecasts When It Matters Most

claude-opus-4.6 · 5/25/2026

Paper 1 documents a counterintuitive and consequential finding—inverse scaling in LLM forecasting on critical tasks like epidemics and financial crises—with broad implications for how the field evaluates and deploys LLMs. It challenges the default assumption that more capable models are uniformly better, proposes a new benchmark, and identifies specific failure modes (tail risk underestimation) with real-world safety implications. Its methodological rigor (synthetic + real-world replication, per-quantile decomposition, within-family ablation) and relevance to high-stakes domains give it broader cross-disciplinary impact than Paper 2, which presents an incremental improvement to self-evolving agent training.

vs. Ontological Knowledge Blocks: Executable Compliance and Profile-Based Validation for Trustworthy AI Systems

gpt-5.2 · 5/25/2026

Paper 2 is likely to have higher scientific impact due to stronger novelty and broader, timely relevance: it proposes an evidence-verifiable training signal for self-evolving agents, a central problem in current LLM agent research, and is broadly applicable across retrieval-augmented generation, autonomous agents, and trustworthy AI. Its approach is model/tool-agnostic and targets a rapidly growing research area, increasing adoption potential. Paper 1 is rigorous and practically useful for compliance engineering, but its impact is more domain-specific (governance/semantic-web compliance tooling) and may diffuse more slowly across core ML/agent research.

vs. EDGE-OPD: Internalizing Privileged Context with Evidence Guided On-Policy Distillation

gpt-5.2 · 5/25/2026

Paper 2 is likely higher impact due to broader applicability and timeliness: evidence-verifiable self-evolution targets a central reliability problem in agentic LLM systems (self-training without hallucinated rewards) and provides an auditable training signal usable across many retrieval/search agent settings. The approach is conceptually novel (marginal utility-based evidence verification), has clear real-world utility for trustworthy QA/search, and can influence multiple fields (agents, RAG, self-training, evaluation). Paper 1 is methodologically interesting but more specialized to OPD/privileged-context distillation and narrower in scope.

vs. Foundation Protocol: A Coordination Layer for Agentic Society

gemini-3.1 · 5/25/2026

Paper 1 presents a concrete, methodologically rigorous solution to a critical problem in AI agent training (evidence verifiability) backed by empirical validation. In contrast, Paper 2 proposes a broad, conceptual framework for agent coordination that, while visionary, lacks the empirical grounding and immediate technical applicability of Paper 1. Consequently, Paper 1 is much more likely to drive tangible, near-term scientific follow-up, implementation, and citations in the rapidly evolving field of autonomous agents.

vs. Agentic Proving for Program Verification

gemini-3.1 · 5/25/2026

Paper 2 introduces a novel and scalable methodology for evidence-verifiable self-evolution in agents, addressing the critical issue of hallucination and opaque training signals in LLMs. This foundational contribution has broader applicability and potential for real-world impact across various domains compared to Paper 1, which primarily serves as an empirical evaluation of an existing model on a specific program verification benchmark.

vs. SPACENUM: Revisiting Spatial Numerical Understanding in VLMs

gpt-5.2 · 5/25/2026

Paper 2 is likely higher impact due to a more broadly applicable and timely contribution: making self-evolving LLM/search agents evidence-verifiable without human labels. The core idea (rewarding evidence by marginal utility) is a general mechanism that can transfer across domains, improving trustworthiness, auditability, and training stability—key blockers for real-world deployment. It also proposes an actionable framework change rather than primarily diagnosing failures. Paper 1 is valuable and rigorous as an evaluation/diagnostic benchmark, but its impact is narrower to VLM spatial numeracy and may be more incremental.

vs. Inductive Deductive Synthesis: Enabling AI to Generate Formally Verified Systems

claude-opus-4.6 · 5/25/2026

Paper 2 (IDS) addresses the fundamental challenge of automated formal verification of distributed systems—a problem with enormous practical significance. It demonstrates a concrete, measurable breakthrough: achieving 7/7 on specifications where SOTA agents manage only 2/7, at ~200x speedup over expert effort. This bridges the long-standing gap between AI code generation and formal correctness guarantees, with broad implications for software reliability. Paper 1 (EVE-Agent) makes a meaningful but more incremental contribution to self-evolving agents with evidence verification. While valuable, its impact is narrower compared to enabling AI-driven formally verified system synthesis.

vs. GENSTRAT: Toward a Science of Strategic Reasoning in Large Language Models

claude-opus-4.6 · 5/25/2026

EVE-Agent addresses a fundamental and broadly applicable challenge in self-evolving AI agents—ensuring evidence verifiability in self-generated training data. This has wide implications for trustworthy AI, retrieval-augmented generation, and scalable agent training without human supervision. Paper 2 (GENSTRAT) contributes a valuable benchmark for strategic reasoning in LLMs, but benchmarks tend to have narrower and more transient impact. EVE-Agent's principle that self-evolving agents must justify their training data introduces a methodological contribution with lasting influence across multiple agent paradigms, making it more impactful overall.

vs. CP or DP? Why Not Both: A Case Study in the Partial Shop Scheduling Problem

claude-opus-4.6 · 5/25/2026

EVE-Agent addresses a critical and timely problem in AI—trustworthy self-evolution of language agents—with broad applicability across NLP, information retrieval, and AI safety. Its evidence-verifiability framework introduces a novel principle (marginal accuracy gain from evidence spans) that could influence how self-improving AI systems are designed. Paper 1, while technically sound in combining CP and DP, is narrower in scope (scheduling), acknowledges it's not competitive with state-of-the-art, and primarily demonstrates feasibility rather than advancing performance. Paper 2's relevance to the rapidly growing LLM agent ecosystem gives it significantly broader potential impact.

vs. One Policy, Infinite NPCs: Persona-Traceable Shared RL Policies for Scalable Game Agents

gemini-3.1 · 5/25/2026

Paper 1 tackles a fundamental challenge in AI—trustworthy self-evolution of LLM agents—by introducing an evidence-verifiable framework that reduces hallucination without human data. This addresses critical bottlenecks in foundation models and has broad applications across reasoning, search, and autonomous systems. Paper 2, while methodologically strong, focuses on the narrower domain of NPC scaling in gaming environments, limiting its broader scientific applicability compared to the foundational advancements in Paper 1.

vs. Parallel Context Compaction for Long-Horizon LLM Agent Serving

gpt-5.2 · 5/25/2026

Paper 1 is more conceptually novel: it introduces an evidence-verifiability principle and a concrete training signal (marginal utility of evidence spans) for data-free self-evolving agents, addressing a core reliability/alignment bottleneck in self-improvement. This has broad implications for trustworthy agent training, retrieval-augmented generation, and auditable curricula, with likely cross-field impact (IR, NLP, ML safety). Paper 2 is timely and useful systems work for serving long-horizon agents, but is more incremental/engineering-focused and narrower in scientific generality.

vs. The Deterministic Horizon: Impossibility Results as Design Specifications for Trustworthy AI Systems

gemini-3.1 · 5/25/2026

Paper 2 establishes fundamental theoretical limits (impossibility results) for AI architectures, defining a computable 'Deterministic Horizon' for reasoning depth. Foundational theoretical bounds that span multiple AI subfields offer broader, paradigm-shifting scientific impact compared to the algorithmic improvements for search agents presented in Paper 1.

vs. Is Capability a Liability? More Capable Language Models Make Worse Forecasts When It Matters Most

gemini-3.1 · 5/25/2026

Paper 2 identifies a critical 'inverse scaling' failure mode in LLMs for high-stakes forecasting tasks, challenging the prevailing 'scale is all you need' paradigm. Its implications for finance, epidemiology, and AI safety are profound, highlighting how increased capabilities can paradoxically increase tail risk. While Paper 1 offers a valuable methodological improvement for self-evolving agents, Paper 2's discovery of a systematic flaw in advanced models across broad, real-world domains gives it a wider scientific and societal impact.