EXG: Self-Evolving Agents with Experience Graphs

Yuxin Jin, Siyuan Zhang, Hanchen Wang, Lu Qin, Ying Zhang, Wenjie Zhang

May 18, 2026

arXiv:2605.17721v1

cs.AI (primary)

#782 of 2292 · Artificial Intelligence

Tournament Score: 1445 ± 42 (scale: 1050 to 1800)

Win Rate: 59%

Rating: 6.3 / 10

Significance 6.5 · Rigor 5.8 · Novelty 6.5 · Clarity 7.5

Abstract

Large language model (LLM)-based agents have demonstrated strong capabilities in complex reasoning and problem solving through multi-step interactions, yet most deployed agents remain behaviorally static, with knowledge acquired during execution rarely translating into systematic improvement over time. In response, a growing line of work on self-evolving agents explores how agents can improve through experience during deployment, but most existing approaches either rely on ad hoc reflection limited to single-task correction or adopt unstructured memory that accumulates fragmented experience with delayed usability. To address this limitation, we introduce EXG, an experience graph framework for self-evolving agents that explicitly organizes accumulated successes and failures into a structured, relational representation. EXG is the first experience graph designed for self-evolving agents, supporting both online, real-time graph growth during execution for immediate cross-task experience reuse, and offline reuse of a consolidated experience graph as an external memory module. This design also enables EXG to serve as a plug-and-play component for existing self-evolving agents, organizing prior experience into a unified experience graph and improving both solution quality and resource efficiency as deployment progresses. Extensive experiments across code generation and reasoning benchmarks show that EXG attains more favorable performance-efficiency trade-offs than reflection- and memory-based baselines in both online and offline evaluations. Our results suggest that structuring experience as a graph provides a principled foundation for scalable and transferable self-evolving agent behavior.

AI Impact Assessments

(1 model)

Scientific Impact Assessment: EXG: Self-Evolving Agents with Experience Graphs

1. Core Contribution

EXG proposes a graph-based framework for organizing LLM agent experience into a structured, relational representation that supports both online (real-time, cross-task) and offline (frozen memory) self-evolution. The key innovation is the experience graph itself—a heterogeneous graph with case nodes (golden/warning), task anchor nodes, and three edge types (contain, similarity, correction). Cases are abstracted from interaction trajectories, and the graph grows incrementally during deployment. Experience is retrieved through a multi-source algorithm combining task-local anchors, semantic seed expansion, and corrective traces, followed by relevance-aware reranking with one-hop propagation. The framework is designed as a plug-and-play module compatible with existing self-evolving agents like Reflexion and SE-Agent.
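To make the structure described above concrete, here is a minimal Python sketch of a heterogeneous experience graph with the node and edge types named in this assessment. It is only an illustration under stated assumptions: the class and method names, the scoring weights, and the one-hop propagation rule are hypothetical and not taken from the paper, and semantic relevance is passed in as precomputed scores instead of being computed with an encoder.

```python
from collections import defaultdict
from dataclasses import dataclass, field

EDGE_TYPES = ("contain", "similarity", "correction")


@dataclass
class Case:
    case_id: str
    task_id: str   # task anchor this case belongs to
    kind: str      # "golden" (success) or "warning" (failure)
    summary: str


@dataclass
class ExperienceGraph:
    cases: dict = field(default_factory=dict)
    # edge type -> source node id -> set of target node ids
    edges: dict = field(
        default_factory=lambda: {t: defaultdict(set) for t in EDGE_TYPES}
    )

    def add_case(self, case):
        """Register a case and link it to its task anchor (contain edge)."""
        self.cases[case.case_id] = case
        self.edges["contain"][case.task_id].add(case.case_id)

    def add_similarity(self, a, b):
        """Undirected similarity edge between two cases."""
        self.edges["similarity"][a].add(b)
        self.edges["similarity"][b].add(a)

    def add_correction(self, warning_id, golden_id):
        """Directed correction edge from a failure to the case that fixed it."""
        self.edges["correction"][warning_id].add(golden_id)

    def retrieve(self, task_id, seed_scores, top_k=3, hop_weight=0.5):
        """Rank cases from three sources (all weights are assumptions).

        seed_scores maps case_id -> semantic relevance, assumed precomputed.
        """
        scores = defaultdict(float)
        # Source 1: cases already attached to this task's anchor.
        for cid in self.edges["contain"].get(task_id, ()):
            scores[cid] += 1.0
        # Source 2: semantic seeds.
        for cid, s in seed_scores.items():
            scores[cid] += s
        # Source 3 and reranking: propagate each score one hop along
        # similarity and correction edges, using a snapshot so that
        # propagated mass is not propagated again.
        for cid, s in dict(scores).items():
            for etype in ("similarity", "correction"):
                for nb in self.edges[etype].get(cid, ()):
                    scores[nb] += hop_weight * s
        ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
        return [cid for cid, _ in ranked[:top_k]]


# Hypothetical usage: a failure seed pulls in its linked fix.
g = ExperienceGraph()
g.add_case(Case("w1", "t1", "warning", "off-by-one in loop bound"))
g.add_case(Case("g1", "t1", "golden", "iterate over range(n + 1)"))
g.add_correction("w1", "g1")
retrieved = g.retrieve("t3", {"w1": 0.8})
```

In this toy example the query task has no local cases, so the ranking is driven by the semantic seed `w1`, and the correction edge surfaces `g1` with a discounted score.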

The problem addressed—how to systematically accumulate, structure, and reuse agent experience across tasks—is genuine and practically important. The specific gap targeted is that online methods confine experience to single tasks while offline methods require costly post-processing, and EXG bridges both regimes with a unified representation.

2. Methodological Rigor

The framework is well-formalized with clear definitions of trajectories, cases, graph structure, retrieval, and reranking. The algorithms are presented with sufficient detail for reproduction. The experimental protocol is reasonable: four benchmarks (HumanEval, EvalPlus, MuSiQue, HotpotQA), multiple model scales (Qwen3-1.7B through Qwen-Max), and both online and offline evaluations.

However, several methodological concerns arise:

Limited retry budget. Each task is allowed only two attempts (one initial + one retry). This is a relatively constrained setting that may favor EXG's first-attempt guidance but doesn't fully stress-test the framework under deeper interaction horizons. The paper acknowledges this indirectly by noting SE-Agent is simplified to a revision-only variant.

Baseline selection. The comparison is primarily against Reflexion and SE-Agent variants. The offline comparison is limited to ExpeL only. More recent memory-based approaches (e.g., SAGE, AgentEvolver) are discussed in related work but not compared experimentally. The baseline coverage, while reasonable, could be broader.

Statistical reporting. Results are presented as single-run numbers without confidence intervals or variance across runs. Given the stochasticity of LLM outputs, this is a notable gap.

Evaluation metrics. Pass@1 and pass@2 are standard but limited. For reasoning tasks, partial credit or reasoning quality metrics would provide richer signal. The efficiency metrics (LLM calls, latency) are appreciated and add significant value.
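The reporting gap noted above could be closed with a small amount of bookkeeping. The sketch below computes the standard unbiased pass@k estimate per task and then summarizes a benchmark score across repeated runs as a mean and sample standard deviation. It assumes independent samples per task, which may differ from the paper's initial-plus-retry protocol, and the function names and example numbers are hypothetical rather than taken from the paper.

```python
import math
import statistics


def pass_at_k(n, c, k):
    """Unbiased pass@k for one task: n samples drawn, c of them correct."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # Probability that at least one of k samples drawn without
    # replacement from the n samples is correct.
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)


def benchmark_pass_at_k(task_results, k):
    """Mean pass@k over tasks; task_results is a list of (n, c) pairs."""
    return statistics.mean(pass_at_k(n, c, k) for n, c in task_results)


def summarize_runs(run_scores):
    """Mean and sample standard deviation over independent runs."""
    mean = statistics.mean(run_scores)
    std = statistics.stdev(run_scores) if len(run_scores) > 1 else 0.0
    return mean, std


# Hypothetical data: three runs (seeds), three tasks each, two samples per task.
runs = [
    [(2, 1), (2, 0), (2, 2)],
    [(2, 2), (2, 0), (2, 2)],
    [(2, 1), (2, 1), (2, 2)],
]
per_run_pass1 = [benchmark_pass_at_k(r, k=1) for r in runs]
mean_pass1, std_pass1 = summarize_runs(per_run_pass1)
print(f"pass@1 = {mean_pass1:.3f} +/- {std_pass1:.3f} over {len(runs)} runs")
```

Reporting the spread alongside the mean makes it possible to judge whether a gap between two methods exceeds run-to-run noise.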

3. Potential Impact

The work has several dimensions of potential impact:

Practical utility. The plug-and-play nature of EXG is compelling—it can augment existing agents without architectural changes. The demonstrated efficiency gains (up to 45.7% fewer LLM calls, 30.5% lower latency) are practically significant for deployment cost.

Conceptual contribution. Framing experience as a graph with typed relations (similarity, correction, containment) provides a principled abstraction that could influence how the community thinks about agent memory. The correction edge concept—explicitly linking failures to their fixes—is particularly intuitive and powerful, as confirmed by the ablation showing it provides the most critical signal.

Scalability questions. The paper shows approximately linear graph growth with tasks and stable local connectivity (~18-20 neighbors per case), which is encouraging. However, evaluation is limited to hundreds of tasks. Real-world deployment at thousands or millions of tasks would stress both retrieval efficiency and graph quality in ways not yet tested.

Transferability. The offline reuse results suggest the graph can generalize to unseen tasks, though gains are modest and EXG sometimes falls slightly below ExpeL, for example on HumanEval (0.879 vs 0.909). Cross-domain transfer is not evaluated.

4. Timeliness & Relevance

This work addresses a timely need. As LLM agents are increasingly deployed in production settings, the inability to learn from deployment experience is a recognized bottleneck. The shift from static to self-evolving agents is an active research frontier, and EXG contributes a concrete architectural proposal. The focus on inference-time evolution (no parameter updates) aligns with practical deployment constraints where model retraining is expensive or infeasible.

The paper also responds to the proliferation of ad hoc memory mechanisms by proposing a more principled, graph-structured alternative. This is relevant as the field matures from proof-of-concept agents to scalable systems.

5. Strengths & Limitations

Key Strengths:

Clean, well-formalized design with clear algorithmic specifications

Unified framework bridging online and offline self-evolution—conceptually elegant

Strong efficiency improvements alongside accuracy gains, not just accuracy alone

Comprehensive ablation studies demonstrating contribution of each graph component

The correction edge mechanism is well-motivated and empirically validated as the most impactful structural element

Learning curve analysis (Appendix C.2) convincingly shows experience compounding over time

Token usage analysis reveals that EXG shifts computation from redundant output generation to structured input guidance

Notable Limitations:

The 2-attempt budget is restrictive; deeper interaction horizons remain untested

No statistical significance testing or variance reporting

Limited offline baselines (only ExpeL)

Graph maintenance at scale (thousands of tasks, graph pruning, staleness) is unexplored

The similarity edge construction relies on MiniLM embeddings—the quality ceiling is bounded by this encoder

Domain-specific case abstraction (error signatures, failure types) may not generalize seamlessly across diverse task types

The paper doesn't address potential negative transfer—when retrieved experience misleads rather than helps

The offline mode sometimes underperforms ExpeL, which is partially attributed to fewer reflection traces and raises questions about its robustness

Additional Observations

The paper is well-written with clear figures, though framing the contribution as the "first experience graph for self-evolving agents" is a strong claim that depends on how narrowly one defines the space. The graph statistics analysis (Table 4) is insightful: showing that correction edges are sparse but high-impact provides useful design intuition. The 150%+ relative improvement on pass@1 for small models is notable but should be contextualized: absolute performance for Qwen3-1.7B starts very low, making relative gains appear dramatic.
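The point about relative versus absolute gains can be checked with simple arithmetic. The sketch below uses made-up scores, not figures from the paper, to show how the same absolute improvement yields very different relative improvements depending on the baseline; the function name is hypothetical.

```python
def gains(baseline, improved):
    """Return (absolute gain, relative gain) for two scores in [0, 1]."""
    absolute = improved - baseline
    relative = absolute / baseline
    return absolute, relative


# Hypothetical scores: both models gain 15 points of pass@1.
small_abs, small_rel = gains(0.10, 0.25)   # weak baseline
large_abs, large_rel = gains(0.60, 0.75)   # strong baseline

print(f"small model: +{small_abs:.2f} absolute, +{small_rel:.0%} relative")
print(f"large model: +{large_abs:.2f} absolute, +{large_rel:.0%} relative")
```

Under these assumed numbers the weak baseline shows a 150% relative gain and the strong baseline a 25% relative gain, even though both improve by the same 0.15 in absolute terms.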

Rating: 6.3 / 10

Significance 6.5 · Rigor 5.8 · Novelty 6.5 · Clarity 7.5

Generated May 19, 2026

Comparison History (17)

vs. MindLoom: Composing Thought Modes for Frontier-Level Reasoning Data Synthesis

claude-opus-4.6 · 5/22/2026

MindLoom addresses a fundamental challenge in LLM training—generating high-quality frontier-level reasoning data—which has broad impact across the entire LLM ecosystem. Its novel decomposition of reasoning difficulty into composable 'thought modes' offers a principled, generalizable framework evaluated across 9 benchmarks, 5 STEM disciplines, and multiple model families. The open-sourced implementation enhances reproducibility and adoption. While EXG contributes a useful structured experience graph for self-evolving agents, MindLoom's impact is broader: improving reasoning data synthesis benefits all downstream model training, making it more foundational and widely applicable.

vs. The Impact of AI Usage and Informativeness on Skill Development in Logical Reasoning

gpt-5.2 · 5/22/2026

Paper 1 likely has higher scientific impact due to a more novel technical contribution (structured experience graphs for self-evolving LLM agents), clear methodology with benchmarked performance/efficiency gains, and broad applicability across agentic systems, continual learning, memory/knowledge representation, and software automation. Its timeliness is high given rapid adoption of deployable agents and the need for scalable improvement mechanisms. Paper 2 is important and relevant for human-AI learning and policy, but its impact may be narrower and more context-dependent (specific task/experimental setting) and less likely to generalize into widely reusable methods or systems.

vs. MindLoom: Composing Thought Modes for Frontier-Level Reasoning Data Synthesis

claude-opus-4.6 · 5/22/2026

EXG introduces a novel and broadly applicable framework for self-evolving agents using experience graphs, addressing a fundamental limitation of LLM-based agents (inability to systematically learn from deployment experience). Its plug-and-play design, dual online/offline functionality, and applicability across diverse agent architectures give it broader impact potential. While MindLoom makes solid contributions to reasoning data synthesis, it addresses a more specific problem (training data generation for STEM reasoning). EXG's concept of structured experience accumulation has wider implications for agent architectures, continual learning, and autonomous systems.

vs. GeoX: Mastering Geospatial Reasoning Through Self-Play and Verifiable Rewards

gpt-5.2 · 5/20/2026

Paper 2 (GeoX) is likely higher impact due to stronger novelty and timeliness: self-play with executable programs and verifiable rewards for multimodal geospatial reasoning reduces dependence on costly annotations and targets a high-value, under-served domain. It offers clear real-world applications (remote sensing, mapping, disaster response) and broader cross-field relevance (VLMs, RL, program synthesis, geospatial AI). Releasing a benchmark further amplifies adoption and reproducibility. Paper 1 is valuable but more incremental within crowded agent-memory/experience-structuring work and may have less immediate domain-specific payoff.

vs. PRISM: A Benchmark for Programmatic Spatial-Temporal Reasoning

gpt-5.2 · 5/20/2026

Paper 2 (EXG) likely has higher impact: it proposes a general, reusable framework for self-evolving LLM agents via structured experience graphs, addressing a timely and widely relevant problem (continual improvement during deployment). Its plug-and-play design and demonstrated performance–efficiency gains across multiple agent tasks suggest broad applicability across reasoning, coding, and autonomous systems. Paper 1 (PRISM) is a strong benchmark contribution, but its impact is narrower (programmatic video/code spatial-temporal evaluation) and mainly advances evaluation methodology rather than a broadly deployable capability.

vs. Reasoning Before Diagnosis: Physician-Inspired Structured Thinking for ECG Classification

claude-opus-4.6 · 5/19/2026

EXG addresses a fundamental challenge in LLM-based agents—systematic learning from experience—with broad applicability across domains (code generation, reasoning, and beyond). Its plug-and-play experience graph framework introduces a principled, general-purpose architecture for self-evolving agents that could influence the entire agent ecosystem. While CardioThink is innovative in ECG diagnosis with structured clinical reasoning, its impact is more domain-specific. EXG's broader applicability, novelty as the first experience graph for self-evolving agents, and timeliness given the rapid growth of agentic AI give it higher potential cross-field impact.

vs. From Prompts to Protocols: An AI Agent for Laboratory Automation

claude-opus-4.6 · 5/19/2026

Paper 1 addresses a high-impact practical problem—automating laboratory workflows via natural language—with demonstrated results across three scientific domains (chemistry, biology, materials science), a 97% success rate, and an order-of-magnitude reduction in interface actions. It bridges AI and experimental science, enabling broader adoption of lab automation by non-programmers. Paper 2 presents a solid contribution to LLM agent self-improvement via experience graphs, but operates in a more incremental space (agent memory/reflection) with narrower immediate real-world applications. Paper 1's interdisciplinary impact and direct acceleration of scientific discovery give it higher potential.

vs. XDomainBench: Diagnosing Reasoning Collapse in High-Dimensional Scientific Knowledge Composition

gpt-5.2 · 5/19/2026

Paper 2 is likely to have higher scientific impact because it introduces a large, systematically constructed benchmark (8,598 interactive sessions, 20 domains) that diagnoses a broadly relevant failure mode—reasoning collapse under interdisciplinary composition—in realistic multi-turn workflows. Benchmarks often become community standards, enabling reproducible evaluation, model comparison, and targeted method development across AI4Science, agentic systems, and general LLM reasoning. Paper 1’s experience-graph memory is a useful engineering contribution, but its impact may be narrower and more dependent on specific agent setups, whereas Paper 2 can shape evaluation practice across many subfields.

vs. Safety Geometry Collapse in Multimodal LLMs and Adaptive Drift Correction

gemini-3.1 · 5/19/2026

Paper 1 addresses a critical vulnerability in multimodal LLMs (safety alignment across modalities) using a novel geometric perspective. Its theoretical depth in identifying 'Safety Geometry Collapse' and practical, training-free intervention (ReGap) offer profound implications for AI safety. While Paper 2 presents a useful framework for agent memory, Paper 1's focus on fundamental safety alignment in widely deployed foundational models provides a higher potential for broad, immediate impact in both mechanistic understanding and practical deployment.

vs. Holistic Evaluation and Failure Diagnosis of AI Agents

gpt-5.2 · 5/19/2026

Paper 2 likely has higher impact: it introduces a new agent capability (self-evolution via structured experience graphs) with direct real-world relevance to long-lived deployed agents, offering a reusable component that can generalize across tasks and systems. The experience-graph idea is broadly applicable (memory, continual learning, tool use, retrieval, agent engineering) and timely as production agents need improvement over time. Paper 1 is methodologically strong and valuable for evaluation/diagnosis, but its primary contribution is to benchmarking and measurement rather than expanding agent functionality, which may limit breadth of downstream impact.

vs. SCICONVBENCH: Benchmarking LLMs on Multi-Turn Clarification for Task Formulation in Computational Science

gemini-3.1 · 5/19/2026

Paper 2 introduces a foundational framework (Experience Graphs) for self-evolving LLM agents, addressing a critical bottleneck in AI: the inability of agents to systematically learn from experience over time. Its plug-and-play nature and broad applicability across general reasoning and coding tasks give it significantly wider cross-disciplinary impact. In contrast, Paper 1, while valuable, is a domain-specific benchmark constrained to computational science. The generalized architectural innovation in Paper 2 promises broader adoption, higher potential for downstream real-world applications, and greater overall scientific impact across the AI community.

vs. Reasoning Compression with Mixed-Policy Distillation

claude-opus-4.6 · 5/19/2026

EXG introduces a novel architectural framework (experience graphs) for self-evolving agents that addresses a fundamental limitation in LLM-based agent systems. It offers broader impact across multiple domains (code generation, reasoning), introduces a new paradigm for structured experience reuse that is plug-and-play compatible with existing systems, and tackles the foundational problem of agent learning over time. While Paper 2 (MPD) presents a useful distillation technique for reasoning compression, it is more incremental—combining known approaches (on/off-policy distillation) and demonstrates results on a single model size. EXG's contribution is more architecturally novel with wider applicability and longer-term research implications.

vs. GIM: Evaluating models via tasks that integrate multiple cognitive domains

claude-opus-4.6 · 5/19/2026

EXG introduces a novel, principled framework (experience graphs) for self-evolving LLM agents that addresses a fundamental limitation—structuring agent experience for reuse across tasks. It offers both online and offline modes, plug-and-play compatibility, and demonstrates broad applicability across code generation and reasoning tasks. While GIM contributes a well-designed benchmark with IRT modeling and insightful test-time compute analysis, benchmarks tend to have shorter-lived impact as they saturate. EXG's architectural contribution to agent learning has broader potential to influence future agent design and enable cumulative improvement, a key challenge in the field.

vs. Belief Engine: Configurable and Inspectable Stance Dynamics in Multi-Agent LLM Deliberation

gpt-5.2 · 5/19/2026

Paper 2 has higher potential impact due to stronger novelty and cross-field relevance: it introduces an explicit, inspectable belief-update mechanism (log-odds with interpretable controls) that directly addresses auditability/causality in LLM deliberation—key for social science, HCI, AI safety, and governance. Its method is more principled and falsifiable (stance dynamics tied to parameters and evidence trails) and maps to real-world needs like transparent negotiation/decision support. Paper 1 is useful engineering for continual improvement, but graph-structured experience/memory is a more incremental extension with narrower conceptual reach.

vs. Latent Action Reparameterization for Efficient Agent Inference

gemini-3.1 · 5/19/2026

Paper 2 addresses a fundamental bottleneck in autonomous agents—continuous learning and long-term memory—by proposing a structured experience graph. This enables self-evolving capabilities with broader implications for general AI and cross-task transfer. Paper 1 offers a valuable optimization for inference efficiency via latent actions, but self-evolution and systematic improvement over time (Paper 2) represent a more profound paradigm shift with higher potential impact across various AI domains.

vs. An Algebraic Exposition of the Theory of Dyadic Morality

claude-opus-4.6 · 5/19/2026

EXG addresses a highly active and practically important area—self-evolving LLM agents—with a concrete, experimentally validated framework. Its structured experience graph is novel, broadly applicable across tasks, and directly improves deployed AI systems. Paper 2, while intellectually interesting in formalizing moral psychology algebraically, addresses a narrower interdisciplinary niche with more speculative applications (neurosymbolic moral reasoning in AI). Paper 1's extensive empirical validation, plug-and-play design, and relevance to the booming LLM agent ecosystem give it substantially broader near-term impact and adoption potential.

vs. Can We Trust AI-Inferred User States. A Psychometric Framework for Validating the Reliability of Users States Classification by LLMs in Operational Environments

gpt-5.2 · 5/19/2026

Paper 2 likely has higher scientific impact due to its broader cross-domain relevance and timeliness: it introduces a replicable psychometric framework to validate LLM-derived user-state measures, addressing a critical reliability gap for real-world deployment in adaptive systems. The work directly informs responsible AI practice and evaluation standards, and its findings (only 31/213 metrics reliable) are broadly actionable across HCI, NLP, psychometrics, and applied ML. Paper 1 is innovative and useful for agent systems, but its impact is more concentrated within LLM-agent research and subject to fast-moving incremental competition.