AgentCL: Toward Rigorous Evaluation of Continual Learning in Language Agents

Yiheng Shu, Bernal Jiménez Gutiérrez, Saisri Padmaja Jonnalagedda, Yuguang Yao, Huan Sun, Yu Su

Jun 1, 2026

arXiv:2606.02461v2

cs.AI (primary), cs.CL

#685 of 3355 · Artificial Intelligence

Tournament Score: 1468 ± 45 (axis range 1050 to 1800)

Win Rate: 73%

Rating: 6.5 / 10

Significance 7 · Rigor 6 · Novelty 6.5 · Clarity 7.5

Abstract

Language agents spend substantial inference time solving individual tasks, yet the experience acquired in one episode is often underutilized in future episodes. Continual learning expects an agent to accumulate reusable experience across a stream of tasks, improve over time, and avoid interference from irrelevant experiences. Unfortunately, existing benchmarks struggle to evaluate continual learning in language agents rigorously. Most efforts focus on retrieval and reasoning over long-context conversations or documents, while recent lifelong-adaptation benchmarks often rely on naive task streams with limited analysis of cross-task relationships, making it difficult to understand what an agent learns and reuses over time. This paper presents an evaluation framework AgentCL for continual learning in agents, centered on controlled task streams and metrics for transfer gains. AgentCL constructs compositional streams where earlier sub-solutions, evidence, or workflows are intentionally reusable in later tasks, and contrasts them with naive streams where such reusability is not guaranteed. We use the benchmark to evaluate non-parametric memory designs for continual learning. To diagnose how memory design choices affect continual learning, we develop MemProbe, a probing method that stores interactions, insights, and skills, while filtering unreliable experiences during consolidation. Empirical analysis across coding, deep research, and language understanding/reasoning tasks shows that naive streams offer limited ability to distinguish memory designs, whereas controlled streams more clearly distinguish their plasticity. Meanwhile, naive and held-out settings often yield limited gains and can expose memory-induced degradation. These results highlight the need for stronger memory designs that balance plasticity and stable reuse.

AI Impact Assessments

(1 model)

Scientific Impact Assessment: AgentCL

1. Core Contribution

AgentCL addresses a genuine gap in evaluating continual learning (CL) for language agents. The central insight is that existing benchmarks either focus on long-context retrieval/reasoning or stream tasks naively without controlling cross-task relationships, making it impossible to attribute performance changes to genuine knowledge reuse versus incidental overlap. The paper makes three linked contributions: (1) a formal evaluation framework distinguishing *compositional streams* (where subtask solutions are intentionally reusable) from *naive streams* (no guaranteed reusability); (2) CL-specific metrics (Plasticity Gain, Stability Gain, Generalization Gain) derived from a two-pass evaluation protocol; and (3) MemProbe, a diagnostic memory method that decomposes experience into interaction, insight, and skill memories with quality-aware consolidation.

The key finding—that naive streams compress method differences while compositional streams amplify them—is both intuitive and empirically validated, providing a concrete methodological recommendation for future CL benchmarks.

2. Methodological Rigor

Strengths in evaluation design: The two-pass protocol is well-motivated. The first pass (read+write) measures plasticity; the second pass (read-only with frozen memory) isolates stability after consolidation. This is a clean experimental design that separates confounds more effectively than single-pass evaluation. The held-out evaluation using HumanEval-Pro tasks after BigCodeBench-Lite-Pro streams is a reasonable generalization test.
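The two-pass protocol described above can be illustrated with a small Python toy. Everything in it is hypothetical: the `Memory` class, the `toy_agent`, the task schema, and the choice to replay the same stream in the second pass are assumptions made for illustration, not the paper's implementation. The accuracy differences it computes are likewise not the paper's exact PG/SG definitions.

```python
from typing import Dict, List, Optional

Task = Dict[str, str]  # hypothetical schema: {"id", "needs", "answer"}


class Memory:
    """Toy key-value memory; `frozen` blocks writes in the second pass."""

    def __init__(self) -> None:
        self.store: Dict[str, str] = {}
        self.frozen = False

    def read(self, key: str) -> Optional[str]:
        return self.store.get(key)

    def write(self, key: str, value: str) -> None:
        if not self.frozen:
            self.store[key] = value


def toy_agent(task: Task, memory: Optional[Memory]) -> str:
    """Hypothetical agent: succeeds if the task needs no sub-solution,
    or if the needed sub-solution is already in memory."""
    needed = task["needs"]
    if needed == "" or (memory is not None and memory.read(needed) is not None):
        return task["answer"]
    return "wrong"


def run_pass(stream: List[Task], memory: Optional[Memory], write: bool) -> float:
    """Run one pass over the stream and return accuracy."""
    correct = 0
    for task in stream:
        pred = toy_agent(task, memory)
        ok = pred == task["answer"]
        correct += ok
        if write and ok and memory is not None:
            memory.write(task["id"], pred)  # store only successful solutions
    return correct / len(stream)


# Compositional toy stream: task A depends on C, which only appears later.
stream = [
    {"id": "A", "needs": "C", "answer": "a"},
    {"id": "B", "needs": "", "answer": "b"},
    {"id": "C", "needs": "B", "answer": "c"},
    {"id": "D", "needs": "C", "answer": "d"},
]

baseline = run_pass(stream, memory=None, write=False)  # no memory at all
mem = Memory()
pass1 = run_pass(stream, mem, write=True)              # read + write
mem.frozen = True
pass2 = run_pass(stream, mem, write=False)             # read-only, frozen memory

print(f"baseline={baseline:.2f} pass1={pass1:.2f} pass2={pass2:.2f}")
print(f"gain in pass 1 over baseline: {pass1 - baseline:+.2f}")
print(f"gain in pass 2 over baseline: {pass2 - baseline:+.2f}")
```

In this toy, task A fails during the first pass because its prerequisite has not been stored yet, and succeeds in the second pass once memory is frozen. That is one way a read-only replay can separate what was learned from when it was learned.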

Concerns:

The compositional streams are constructed with known reuse relationships, which somewhat biases the evaluation toward methods that perform well on retrieval-augmented composition. While the paper acknowledges this is by design, it limits ecological validity—real-world task streams rarely have such clean compositional structure.

BrowseComp+ subtasks are synthesized by GPT-5.2, introducing potential artifacts. Although quality controls are described (Jaccard similarity checks, verification prompts), the subtask generation relies on having gold answers and evidence documents, which is a privileged construction not available in practice.

Statistical significance is addressed through bootstrap CIs, but the confidence intervals on BrowseComp+ (±10pp for many comparisons) are quite wide, suggesting the 100-task sample may be insufficient for reliable conclusions.

Only one LLM backbone is tested per domain (Qwen3.5-35B-A3B for coding, gpt-oss-120b for deep research), limiting conclusions about whether findings generalize across model capabilities.

Results are averaged over only three runs for coding tasks, and BrowseComp+ appears to use single runs.
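To make the sample-size concern concrete, here is a minimal sketch of a percentile bootstrap over binary task outcomes. The 50% accuracy, the 2000 resamples, and the percentile variant are assumptions chosen for illustration; the paper may use a different bootstrap procedure and different accuracies.

```python
import random
from typing import List, Tuple


def bootstrap_ci(outcomes: List[int], n_boot: int = 2000,
                 alpha: float = 0.05, seed: int = 0) -> Tuple[float, float]:
    """Percentile bootstrap CI for mean accuracy over 0/1 task outcomes."""
    rng = random.Random(seed)
    n = len(outcomes)
    # Resample tasks with replacement and record the mean of each resample.
    means = sorted(sum(rng.choices(outcomes, k=n)) / n for _ in range(n_boot))
    lo = means[int((alpha / 2) * n_boot)]
    hi = means[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi


# Hypothetical outcomes: 50% accuracy on 100 tasks versus 400 tasks.
outcomes_100 = [1] * 50 + [0] * 50
outcomes_400 = [1] * 200 + [0] * 200

lo_100, hi_100 = bootstrap_ci(outcomes_100)
lo_400, hi_400 = bootstrap_ci(outcomes_400)
half_width_100 = (hi_100 - lo_100) / 2
half_width_400 = (hi_400 - lo_400) / 2

print(f"n=100: 95% CI half-width = {half_width_100 * 100:.1f}pp")
print(f"n=400: 95% CI half-width = {half_width_400 * 100:.1f}pp")
```

For comparison, the normal approximation gives a half-width of 1.96 × sqrt(0.25 / 100) ≈ 0.098, or about 10pp, for 100 tasks at 50% accuracy. This matches the scale of the intervals noted above, and quadrupling the sample size roughly halves the width.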

3. Potential Impact

Benchmark utility: The framework fills a real need. As language agents are deployed more widely in long-running settings, understanding whether memory helps or hurts is critical. The finding that memory frequently causes *degradation* on naive and held-out tasks (negative GG values across most methods in Table 10) is an important negative result that should influence practitioners.

Methodological influence: The distinction between compositional and naive streams could become standard practice in CL benchmark design. The PG/SG/GG metrics provide a vocabulary for discussing CL properties that is currently missing from agent evaluation.

Limitations of impact scope: The paper focuses exclusively on non-parametric memory, explicitly deferring parametric/training-based CL to future work. This is a significant scope limitation given that parameter-efficient fine-tuning and in-context learning represent major CL paradigms. Additionally, the benchmark covers only three domains (coding, deep research, language understanding), and the compositional streams are available for only a subset of these.

4. Timeliness & Relevance

The paper is highly timely. The agent community is rapidly scaling inference-time compute, and the question of whether experience compounds across episodes is becoming urgent. Several concurrent works (Evo-Memory, LifelongAgentBench, SWE-Bench-CL, Continual Learning Bench) address related problems, positioning this work within an active research front. AgentCL's emphasis on controlled task relationships and CL-specific metrics differentiates it from these concurrent efforts.

5. Strengths & Limitations

Key strengths:

The core observation—that uncontrolled task streams cannot distinguish memory methods—is well-supported empirically across multiple domains (Figures 3-4, Tables 2-3).

The plasticity-stability tradeoff analysis is nuanced. Methods like ExpRAG achieve +32pp PG on compositional BrowseComp+ but only +1pp on naive streams, clearly demonstrating that gains are setting-dependent.

MemProbe ablations (Table 4) usefully isolate the contribution of different memory types, showing that interaction, insight, and skill memories serve complementary roles primarily visible in compositional streams.

The case study (Table 15) provides concrete positive/negative transfer examples that illuminate failure modes.
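The three memory types and the quality-aware consolidation step mentioned above can be pictured with a short Python sketch. This is not MemProbe's implementation: the `Experience` record, the `succeeded` flag used as the reliability signal, and the `consolidate` function are all hypothetical names and design choices introduced here for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List

MEMORY_KINDS = ("interaction", "insight", "skill")


@dataclass
class Experience:
    kind: str        # one of MEMORY_KINDS
    content: str     # stored text (trajectory snippet, lesson, or procedure)
    succeeded: bool  # hypothetical reliability signal from the episode


def consolidate(buffer: List[Experience]) -> Dict[str, List[str]]:
    """Keep only experiences from successful episodes, grouped by kind.

    Assumption: reliability is approximated by episode success; a real
    system could use a verifier or a confidence score instead.
    """
    store: Dict[str, List[str]] = {kind: [] for kind in MEMORY_KINDS}
    for exp in buffer:
        if exp.kind not in store:
            raise ValueError(f"unknown memory kind: {exp.kind}")
        if exp.succeeded:
            store[exp.kind].append(exp.content)
    return store


buffer = [
    Experience("interaction", "task 12: parsed CSV with csv.DictReader", True),
    Experience("insight", "check empty input before indexing", True),
    Experience("skill", "def safe_head(xs): return xs[0] if xs else None", True),
    Experience("insight", "always retry the same query", False),  # filtered out
]

store = consolidate(buffer)
for kind in MEMORY_KINDS:
    print(kind, len(store[kind]))
```

Filtering at consolidation time, rather than at retrieval time, keeps unreliable entries out of memory entirely. That is one plausible reading of "quality-aware consolidation", though the paper's actual mechanism may differ.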

Notable weaknesses:

The paper doesn't establish whether compositional streams reflect realistic deployment scenarios or are primarily diagnostic tools. The practical utility of the framework depends on this distinction.

MemProbe is presented as both a diagnostic probe and a competitive method, creating a dual role that somewhat muddies the evaluation narrative. As a method, it consistently outperforms alternatives on compositional streams, but the paper doesn't sufficiently discuss whether this advantage comes from architectural choices or from being tuned on the evaluation framework it was designed alongside.

The generalization results are uniformly disappointing (most GG values are negative), but the paper offers limited insight into *why* or what alternative designs might help.

Missing comparison with parametric CL baselines means we cannot assess whether non-parametric memory is the right paradigm at all.

Overall Assessment

AgentCL makes a solid contribution to evaluation methodology for an increasingly important problem. The compositional vs. naive stream distinction and CL-specific metrics are well-conceived and empirically validated. However, the scope is somewhat narrow (non-parametric memory only), the statistical power on some benchmarks is limited, and the practical implications beyond "current methods are insufficient" remain underdeveloped. The work is likely to influence benchmark design practices in the agent CL community but represents an incremental rather than transformative advance.

Rating: 6.5 / 10

Significance 7 · Rigor 6 · Novelty 6.5 · Clarity 7.5

Generated Jun 3, 2026

Comparison History (22)

vs. LC-ERD: Mining Latent Logic for Self-Evolving Reasoning via Consistency-Regulated Reward Decomposition

gpt-5.2 · 6/5/2026

Paper 2 (AgentCL) likely has higher impact because it provides a rigorous, general evaluation framework and diagnostics for continual learning in language agents—an area with immediate, broad relevance as agentic systems proliferate. Benchmarks and metrics often become community standards, influencing many subsequent methods and enabling fair comparison across memory/continual-learning designs. Its controlled task streams and transfer metrics improve methodological rigor and interpretability across domains (coding, research, reasoning). Paper 1 is novel but more specialized to a particular self-alignment/reward-decomposition approach, with harder-to-validate assumptions and potentially narrower adoption.

vs. Reasoning Structure of Large Language Models

gemini-3.1 · 6/3/2026

Paper 2 addresses the critical and highly timely challenge of evaluating Large Reasoning Models (LRMs). By transforming opaque reasoning traces into verifiable, measurable topological graphs, it offers a novel and rigorous methodology to analyze test-time compute. This structural approach to assessing reasoning efficiency has broader potential impact on understanding and improving state-of-the-art LLMs compared to Paper 1's focus on continual learning benchmarks for agents.

vs. Imaginative Perception Tokens Enhance Spatial Reasoning in Multimodal Language Models

gemini-3.1 · 6/3/2026

Paper 2 introduces a novel evaluation framework and benchmark (AgentCL) for continual learning in language agents. Benchmarks and rigorous evaluation frameworks typically have a broader and longer-lasting scientific impact by setting new standards for a rapidly growing field, whereas Paper 1 proposes a more specific methodological improvement for spatial reasoning in vision-language models.

vs. LEAP: Supercharging LLMs for Formal Mathematics with Agentic Frameworks

claude-opus-4.6 · 6/3/2026

LEAP demonstrates remarkable concrete results—solving all 12 Putnam 2025 problems, achieving 70% on IMO-style problems (surpassing specialized systems), and formalizing research-level proofs. It introduces both a novel agentic framework and a new benchmark (Lean-IMO-Bench). The work bridges informal and formal mathematical reasoning with immediate practical applications in automated theorem proving and mathematical research. AgentCL makes valuable contributions to continual learning evaluation methodology, but its impact is more incremental—primarily diagnostic rather than achieving breakthrough performance. LEAP's results are more transformative with broader cross-disciplinary implications.

vs. Forget Attention: Importance-Aware Attention Is All You Need

gpt-5.2 · 6/3/2026

Paper 2 likely has higher impact: it introduces a rigorous evaluation framework (AgentCL) plus diagnostic tooling (MemProbe) for continual learning in language agents, addressing a broadly recognized measurement gap. Benchmarks can influence many future papers across agent memory, adaptation, and evaluation, with immediate real-world relevance as agents are deployed in long-running settings. Paper 1 is a promising modeling contribution with good results and practical implementation (stock SDPA), but its impact is narrower to hybrid attention/SSM architecture design and may compete with many fast-moving alternatives.

vs. TriEval: A Resource-Efficient Pipeline for LLM Bias, Toxicity, and Truthfulness Assessment

gemini-3.1 · 6/3/2026

Paper 1 addresses a cutting-edge and complex challenge in AI research—continual learning in language agents—by introducing a rigorous methodological framework and probing method. Its focus on understanding memory and plasticity pushes the theoretical boundaries of the field. While Paper 2 offers a practical and accessible tool for LLM evaluation, it represents more of an engineering and resource-efficiency contribution rather than a fundamental scientific advancement, making Paper 1 more likely to drive future scientific innovation.

vs. Effect of Demographic Bias on Skin Lesion Classification

gpt-5.2 · 6/3/2026

Paper 2 likely has higher impact due to greater novelty and breadth: it proposes a rigorous evaluation framework (AgentCL) and diagnostic tool (MemProbe) for continual learning in language agents, a timely and fast-moving area with broad relevance to ML, NLP, and agent systems. Benchmarking frameworks can become community standards, enabling reproducible comparisons and accelerating progress across many applications (coding, research, reasoning). Paper 1 addresses an important clinical fairness issue, but its methods (ResNet variants, balancing/adv/adversarial schemes) are more incremental and narrower in scope to dermatology imaging datasets.

vs. Ryze: Evidence-Enriched Data Synthesis from Biomedical Papers

gemini-3.1 · 6/3/2026

Paper 1 offers higher scientific impact by directly addressing a critical data bottleneck in biomedical AI. Its automated data synthesis pipeline and resulting open-source VLM provide immediate, high-value utility for real-world scientific research. While Paper 2 introduces a valuable evaluation framework for continual learning, Paper 1's combination of methodological innovation in evidence extraction and its demonstrable performance gains over state-of-the-art proprietary models present a more tangible and immediate breakthrough in applied AI for science.

vs. Towards Non-Monotonic Entailment in Propositional Defeasible Standpoint Logic

claude-opus-4.6 · 6/3/2026

AgentCL addresses a timely and high-impact problem at the intersection of continual learning and language agents—a rapidly growing area. It introduces a practical evaluation framework (AgentCL) and diagnostic tool (MemProbe) with broad applicability across coding, research, and reasoning tasks. Its relevance to the booming LLM agent ecosystem gives it wide potential adoption. Paper 2, while technically rigorous, advances a niche area of non-monotonic reasoning in a specific modal logic fragment, limiting its breadth of impact and real-world applicability compared to Paper 1's contributions to the active AI agents community.

vs. Perceive Before Reasoning: A Pre-Reasoning Perception Framework for Efficient and Reliable Proactive Mobile Agents

gpt-5.2 · 6/3/2026

Paper 2 likely has higher scientific impact because it proposes a rigorous, general evaluation framework (AgentCL) and diagnostics (MemProbe) for continual learning in language agents—an area with broad relevance across agent architectures and application domains. By addressing a core bottleneck (benchmark/metric validity) with controlled task streams and transfer metrics, it can reshape how the field measures progress, influencing many subsequent methods. Paper 1 is a useful systems contribution for proactive mobile agents, improving efficiency and false triggers, but its scope is narrower and more application-specific.

vs. Controllable User Simulation

gpt-5.2 · 6/3/2026

Paper 1 has higher potential impact due to a deeper conceptual contribution: it reframes controllable user simulation as a causal inference/off-policy evaluation problem, identifies a previously underappreciated structural bias (look-ahead bias from post-hoc labels), and provides theoretical results (causal consistency conditions, variance explosion under policy shift) plus practical mitigations with empirical validation. This can influence evaluation methodology broadly across conversational AI, RLHF-style policy changes, and simulation-based testing. Paper 2 is timely and useful as a benchmark/framework, but is more incremental and narrower in methodological novelty.

vs. Large AI Models in Dental Healthcare: From General-Purpose Systems to Domain-Specific Foundation Models

claude-opus-4.6 · 6/3/2026

Paper 1 addresses a fundamental methodological gap in evaluating continual learning for language agents—a rapidly growing field. Its contribution of controlled evaluation frameworks (AgentCL) and diagnostic tools (MemProbe) provides infrastructure that can influence how the broader AI/ML community benchmarks agent learning. Paper 2, while useful as a scoping review of AI in dentistry, is primarily a literature synthesis without novel methods or models. Its impact is domain-specific and incremental. Paper 1's novelty in formalizing continual learning evaluation for agents, combined with the timeliness of the agent paradigm, gives it broader and deeper potential impact.

vs. Toward a Modular Architecture for Embedded AI Agent Systems at the Edge

gpt-5.2 · 6/3/2026

Paper 1 is likely to have higher scientific impact because it introduces a more novel and methodologically rigorous evaluation framework (controlled compositional task streams + transfer metrics) and a diagnostic tool (MemProbe) with empirical validation across multiple task domains. This directly advances how the community measures continual learning in language agents, a timely and broadly relevant bottleneck for agent research. Paper 2 addresses an important application area (edge/embedded agents) with potentially wide real-world relevance, but it is primarily an architectural/reference proposal with less empirical rigor, making its scientific impact more dependent on downstream adoption.

vs. Probe Before You Edit: Probing-Guided Molecular Optimization for LLM Agents in Structure-Based Drug Design

gpt-5.2 · 6/3/2026

Paper 2 likely has higher scientific impact due to broader, more general relevance: a rigorous evaluation framework for continual learning in language agents applies across many domains and can become a standard benchmark, influencing both academia and industry. Its controlled task streams and transfer metrics address a timely, widely recognized gap, enabling reproducible comparisons of memory/learning methods. Paper 1 is innovative and high-value for structure-based drug design, but its impact is narrower (specialized to molecular optimization pipelines and specific objectives/benchmarks) and may depend more on downstream experimental validation for real-world adoption.

vs. Proof-Refactor: Refactoring Generated Formal Proofs into Modular Artifacts

claude-opus-4.6 · 6/3/2026

AgentCL addresses a fundamental and broadly applicable challenge—rigorous evaluation of continual learning in language agents—proposing a comprehensive framework with controlled task streams, transfer metrics, and diagnostic tools (MemProbe). Its breadth of impact spans coding, research, and reasoning tasks, establishing evaluation methodology for a rapidly growing field. Paper 2, while novel in applying process-guided refactoring to formal proofs, addresses a narrower problem (proof readability/modularity in Lean) with more limited applicability. AgentCL's contributions to benchmarking and memory design evaluation have broader implications for the entire language agent community.

vs. LAP: An Agent-to-Instrument Protocol for Autonomous Science

claude-opus-4.6 · 6/3/2026

LAP addresses a critical infrastructure gap in autonomous science by proposing a standardized protocol for agent-to-instrument communication, complementing existing protocols (MCP, A2A). Its potential impact spans all experimental sciences adopting self-driving laboratories—a rapidly growing field. The protocol's design addressing safety, measurement provenance, and cross-lab federation could become foundational infrastructure. While AgentCL contributes a useful evaluation framework for continual learning in agents, it is more incremental—improving benchmarking methodology rather than enabling new capabilities. LAP's broader cross-disciplinary relevance and timeliness in the autonomous lab revolution give it higher impact potential.

vs. Iteris: Agentic Research Loops for Computational Mathematics

gpt-5.2 · 6/3/2026

Paper 2 likely has higher impact: it introduces a rigorous evaluation framework (AgentCL) and diagnostics (MemProbe) for continual learning in language agents, a timely and broadly relevant area as agents become widely deployed. Benchmarks and metrics can become community standards, influencing many subsequent methods across NLP, agent systems, and ML evaluation, with clear real-world implications for long-lived assistants. Paper 1 is novel and exciting, showing agentic AI aiding computational mathematics on two open problems, but its impact may be narrower (computational math) and currently depends on expert correction/validation, limiting immediate scalability.

vs. SkillSmith: Co-Evolving Skills and Tools for Self-Improving Agent Systems

gemini-3.1 · 6/3/2026

Paper 2 introduces a foundational evaluation framework for a critical but poorly measured area: continual learning in language agents. By providing controlled task streams and rigorous transfer metrics, it addresses a fundamental methodological gap. Evaluation benchmarks typically achieve broader scientific impact and higher citation rates than specific algorithmic architectures (Paper 1), as they set the standard for future research in the field.

vs. Hedge-Bench: Benchmarking Agents on Hard, Realistic Tasks Pertaining to Financial Reasoning

gemini-3.1 · 6/3/2026

Paper 2 addresses continual learning, a fundamental and pervasive challenge in AI, offering a rigorous framework applicable across multiple domains like coding and reasoning. While Paper 1 provides a valuable, high-quality benchmark, its scope is domain-specific to finance. Therefore, Paper 2 has greater potential for broad cross-disciplinary impact and foundational methodological advancement in agentic AI.

vs. Where Do Deep-Research Agents Go Wrong? Span-Level Error Localization in Agent Trajectories

claude-opus-4.6 · 6/3/2026

Paper 1 addresses a more novel and immediately impactful problem: understanding *where* deep-research agents fail within their trajectories, rather than just whether they fail. The TELBench benchmark and DRIFT framework provide concrete, actionable tools for improving agent reliability with demonstrated 30pp improvements. As AI agents become increasingly deployed, process-level debugging is critically needed. Paper 2 tackles continual learning evaluation for agents—an important but more incremental contribution, with findings that largely confirm known challenges (memory degradation, need for better designs) without offering strong solutions. Paper 1's novelty and practical applicability give it higher impact potential.