AgentCL: Toward Rigorous Evaluation of Continual Learning in Language Agents
Yiheng Shu, Bernal Jiménez Gutiérrez, Saisri Padmaja Jonnalagedda, Yuguang Yao, Huan Sun, Yu Su
Abstract
Language agents spend substantial inference time solving individual tasks, yet the experience acquired in one episode is often underutilized in future episodes. Continual learning expects an agent to accumulate reusable experience across a stream of tasks, improve over time, and avoid interference from irrelevant experiences. Unfortunately, existing benchmarks struggle to evaluate continual learning in language agents rigorously. Most efforts focus on retrieval and reasoning over long-context conversations or documents, while recent lifelong-adaptation benchmarks often rely on naive task streams with limited analysis of cross-task relationships, making it difficult to understand what an agent learns and reuses over time. This paper presents AgentCL, an evaluation framework for continual learning in agents, centered on controlled task streams and metrics for transfer gains. AgentCL constructs compositional streams where earlier sub-solutions, evidence, or workflows are intentionally reusable in later tasks, and contrasts them with naive streams where such reusability is not guaranteed. We use the benchmark to evaluate non-parametric memory designs for continual learning. To diagnose how memory design choices affect continual learning, we develop MemProbe, a probing method that stores interactions, insights, and skills, while filtering unreliable experiences during consolidation. Empirical analysis across coding, deep research, and language understanding/reasoning tasks shows that naive streams offer limited ability to distinguish memory designs, whereas controlled streams more clearly distinguish their plasticity. Meanwhile, naive and held-out settings often yield limited gains and can expose memory-induced degradation. These results highlight the need for stronger memory designs that balance plasticity and stable reuse.
AI Impact Assessments
Scientific Impact Assessment: AgentCL
1. Core Contribution
AgentCL addresses a genuine gap in evaluating continual learning (CL) for language agents. The central insight is that existing benchmarks either focus on long-context retrieval/reasoning or stream tasks naively without controlling cross-task relationships, making it impossible to attribute performance changes to genuine knowledge reuse versus incidental overlap. The paper makes three linked contributions: (1) a formal evaluation framework distinguishing *compositional streams* (where subtask solutions are intentionally reusable) from *naive streams* (no guaranteed reusability); (2) CL-specific metrics (Plasticity Gain, Stability Gain, Generalization Gain) derived from a two-pass evaluation protocol; and (3) MemProbe, a diagnostic memory method that decomposes experience into interaction, insight, and skill memories with quality-aware consolidation.
The key finding—that naive streams compress method differences while compositional streams amplify them—is both intuitive and empirically validated, providing a concrete methodological recommendation for future CL benchmarks.
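The contrast between the two stream types can be sketched in a few lines of Python. This is an illustrative toy, not the paper's construction procedure: the `Task` class, its `reuses` field, the task names, and the `reusable_fraction` helper are all assumptions introduced here to show what "intentionally reusable" versus "not guaranteed" might mean for task ordering.

```python
import random
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Task:
    name: str
    # Hypothetical field: names of earlier tasks whose solutions this task can reuse.
    reuses: Tuple[str, ...] = ()


def compositional_stream(base: List[Task], composed: List[Task]) -> List[Task]:
    """Place every reusable sub-task before the task that depends on it."""
    stream = list(base)
    seen = {t.name for t in stream}
    for task in composed:
        missing = [r for r in task.reuses if r not in seen]
        if missing:
            raise ValueError(f"{task.name} needs {missing} earlier in the stream")
        stream.append(task)
        seen.add(task.name)
    return stream


def naive_stream(tasks: List[Task], seed: int = 0) -> List[Task]:
    """Shuffle tasks; nothing guarantees that dependencies come first."""
    shuffled = list(tasks)
    random.Random(seed).shuffle(shuffled)
    return shuffled


def reusable_fraction(stream: List[Task]) -> float:
    """Share of tasks whose declared dependencies all appeared earlier."""
    if not stream:
        return 0.0
    seen = set()
    hits = 0
    for task in stream:
        if task.reuses and all(r in seen for r in task.reuses):
            hits += 1
        seen.add(task.name)
    return hits / len(stream)


# Toy task names, chosen only for illustration.
base = [Task("parse_csv"), Task("plot_series")]
composed = [Task("csv_report", reuses=("parse_csv", "plot_series"))]
controlled = compositional_stream(base, composed)
shuffled = naive_stream(base + composed, seed=1)
print(reusable_fraction(controlled), reusable_fraction(shuffled))
```

In the controlled stream the composed task always finds its prerequisites in earlier positions, so any gain on it can be attributed to reuse. In the shuffled stream the same task may arrive first, in which case memory has nothing relevant to offer.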
2. Methodological Rigor
Strengths in evaluation design: The two-pass protocol is well-motivated. The first pass (read+write) measures plasticity; the second pass (read-only with frozen memory) isolates stability after consolidation. This is a clean experimental design that separates confounds more effectively than single-pass evaluation. The held-out evaluation using HumanEval-Pro tasks after BigCodeBench-Lite-Pro streams is a reasonable generalization test.
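The two-pass protocol described above can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the `Memory` class, the `toy_solve` agent, the task strings, and in particular the two "gain" values, which are simple differences against a no-memory baseline and are not the paper's definitions of Plasticity Gain or Stability Gain.

```python
from typing import Callable, Dict, List, Optional, Tuple


class Memory:
    """Toy key-value memory that can be frozen to make it read-only."""

    def __init__(self) -> None:
        self.items: Dict[str, str] = {}
        self.frozen = False

    def read(self, key: str) -> Optional[str]:
        return self.items.get(key)

    def write(self, key: str, value: str) -> None:
        if self.frozen:
            raise RuntimeError("memory is frozen")
        self.items[key] = value


Solver = Callable[[str, Memory], Tuple[float, str]]


def run_pass(stream: List[str], solve: Solver, memory: Memory, write: bool) -> float:
    """Run the stream once and return the mean score."""
    scores = []
    for task in stream:
        score, experience = solve(task, memory)
        scores.append(score)
        if write:
            memory.write(task, experience)
    return sum(scores) / len(scores)


def two_pass_eval(stream: List[str], solve: Solver) -> Dict[str, float]:
    # Baseline: empty memory that is never written (assumed reference point).
    baseline = run_pass(stream, solve, Memory(), write=False)
    memory = Memory()
    first = run_pass(stream, solve, memory, write=True)    # pass 1: read + write
    memory.frozen = True
    second = run_pass(stream, solve, memory, write=False)  # pass 2: read-only
    return {
        "baseline": baseline,
        "first_pass": first,
        "second_pass": second,
        "first_pass_gain": first - baseline,    # assumed stand-in, not the paper's PG
        "second_pass_gain": second - baseline,  # assumed stand-in, not the paper's SG
    }


def toy_solve(task: str, memory: Memory) -> Tuple[float, str]:
    """Hypothetical agent: composite tasks succeed only if every part is in memory."""
    parts = task.split("+")
    if len(parts) == 1:
        return 1.0, f"solution({task})"
    solved = all(memory.read(p) is not None for p in parts)
    return (1.0 if solved else 0.0), f"solution({task})"


result = two_pass_eval(["a", "b", "a+b"], toy_solve)
print(result)
```

Freezing the memory before the second pass is the step that matters: any score change between passes then reflects what was consolidated during the first pass, not new writes made during re-evaluation.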
Concerns:
3. Potential Impact
Benchmark utility: The framework fills a real need. As language agents are deployed more widely in long-running settings, understanding whether memory helps or hurts is critical. The finding that memory frequently causes *degradation* on naive and held-out tasks (negative GG values across most methods in Table 10) is an important negative result that should influence practitioners.
Methodological influence: The distinction between compositional and naive streams could become standard practice in CL benchmark design. The PG/SG/GG metrics provide a vocabulary for discussing CL properties that is currently missing from agent evaluation.
Limitations of impact scope: The paper focuses exclusively on non-parametric memory, explicitly deferring parametric/training-based CL to future work. This is a significant scope limitation given that parameter-efficient fine-tuning and in-context learning represent major CL paradigms. Additionally, the benchmark covers only three domains (coding, deep research, language understanding), and the compositional streams are available for only a subset of these.
4. Timeliness & Relevance
The paper is highly timely. The agent community is rapidly scaling inference-time compute, and the question of whether experience compounds across episodes is becoming urgent. Several concurrent works (Evo-Memory, LifelongAgentBench, SWE-Bench-CL, Continual Learning Bench) address related problems, positioning this work within an active research front. AgentCL's emphasis on controlled task relationships and CL-specific metrics differentiates it from these concurrent efforts.
5. Strengths & Limitations
Key strengths:
Notable weaknesses:
Overall Assessment
AgentCL makes a solid contribution to evaluation methodology for an increasingly important problem. The compositional vs. naive stream distinction and CL-specific metrics are well-conceived and empirically validated. However, the scope is somewhat narrow (non-parametric memory only), the statistical power on some benchmarks is limited, and the practical implications beyond "current methods are insufficient" remain underdeveloped. The work is likely to influence benchmark design practices in the agent CL community but represents an incremental rather than transformative advance.
Generated Jun 3, 2026
Comparison History (22)
Paper 2 (AgentCL) likely has higher impact because it provides a rigorous, general evaluation framework and diagnostics for continual learning in language agents—an area with immediate, broad relevance as agentic systems proliferate. Benchmarks and metrics often become community standards, influencing many subsequent methods and enabling fair comparison across memory/continual-learning designs. Its controlled task streams and transfer metrics improve methodological rigor and interpretability across domains (coding, research, reasoning). Paper 1 is novel but more specialized to a particular self-alignment/reward-decomposition approach, with harder-to-validate assumptions and potentially narrower adoption.
Paper 2 addresses the critical and highly timely challenge of evaluating Large Reasoning Models (LRMs). By transforming opaque reasoning traces into verifiable, measurable topological graphs, it offers a novel and rigorous methodology to analyze test-time compute. This structural approach to assessing reasoning efficiency has broader potential impact on understanding and improving state-of-the-art LLMs compared to Paper 1's focus on continual learning benchmarks for agents.
Paper 2 introduces a novel evaluation framework and benchmark (AgentCL) for continual learning in language agents. Benchmarks and rigorous evaluation frameworks typically have a broader and longer-lasting scientific impact by setting new standards for a rapidly growing field, whereas Paper 1 proposes a more specific methodological improvement for spatial reasoning in vision-language models.
LEAP demonstrates remarkable concrete results—solving all 12 Putnam 2025 problems, achieving 70% on IMO-style problems (surpassing specialized systems), and formalizing research-level proofs. It introduces both a novel agentic framework and a new benchmark (Lean-IMO-Bench). The work bridges informal and formal mathematical reasoning with immediate practical applications in automated theorem proving and mathematical research. AgentCL makes valuable contributions to continual learning evaluation methodology, but its impact is more incremental—primarily diagnostic rather than achieving breakthrough performance. LEAP's results are more transformative with broader cross-disciplinary implications.
Paper 2 likely has higher impact: it introduces a rigorous evaluation framework (AgentCL) plus diagnostic tooling (MemProbe) for continual learning in language agents, addressing a broadly recognized measurement gap. Benchmarks can influence many future papers across agent memory, adaptation, and evaluation, with immediate real-world relevance as agents are deployed in long-running settings. Paper 1 is a promising modeling contribution with good results and practical implementation (stock SDPA), but its impact is narrower to hybrid attention/SSM architecture design and may compete with many fast-moving alternatives.
Paper 1 addresses a cutting-edge and complex challenge in AI research—continual learning in language agents—by introducing a rigorous methodological framework and probing method. Its focus on understanding memory and plasticity pushes the theoretical boundaries of the field. While Paper 2 offers a practical and accessible tool for LLM evaluation, it represents more of an engineering and resource-efficiency contribution rather than a fundamental scientific advancement, making Paper 1 more likely to drive future scientific innovation.
Paper 2 likely has higher impact due to greater novelty and breadth: it proposes a rigorous evaluation framework (AgentCL) and diagnostic tool (MemProbe) for continual learning in language agents, a timely and fast-moving area with broad relevance to ML, NLP, and agent systems. Benchmarking frameworks can become community standards, enabling reproducible comparisons and accelerating progress across many applications (coding, research, reasoning). Paper 1 addresses an important clinical fairness issue, but its methods (ResNet variants, balancing/adversarial schemes) are more incremental and narrower in scope, limited to dermatology imaging datasets.
Paper 1 offers higher scientific impact by directly addressing a critical data bottleneck in biomedical AI. Its automated data synthesis pipeline and resulting open-source VLM provide immediate, high-value utility for real-world scientific research. While Paper 2 introduces a valuable evaluation framework for continual learning, Paper 1's combination of methodological innovation in evidence extraction and its demonstrable performance gains over state-of-the-art proprietary models present a more tangible and immediate breakthrough in applied AI for science.
AgentCL addresses a timely and high-impact problem at the intersection of continual learning and language agents—a rapidly growing area. It introduces a practical evaluation framework (AgentCL) and diagnostic tool (MemProbe) with broad applicability across coding, research, and reasoning tasks. Its relevance to the booming LLM agent ecosystem gives it wide potential adoption. Paper 2, while technically rigorous, advances a niche area of non-monotonic reasoning in a specific modal logic fragment, limiting its breadth of impact and real-world applicability compared to Paper 1's contributions to the active AI agents community.
Paper 2 likely has higher scientific impact because it proposes a rigorous, general evaluation framework (AgentCL) and diagnostics (MemProbe) for continual learning in language agents—an area with broad relevance across agent architectures and application domains. By addressing a core bottleneck (benchmark/metric validity) with controlled task streams and transfer metrics, it can reshape how the field measures progress, influencing many subsequent methods. Paper 1 is a useful systems contribution for proactive mobile agents, improving efficiency and false triggers, but its scope is narrower and more application-specific.
Paper 1 has higher potential impact due to a deeper conceptual contribution: it reframes controllable user simulation as a causal inference/off-policy evaluation problem, identifies a previously underappreciated structural bias (look-ahead bias from post-hoc labels), and provides theoretical results (causal consistency conditions, variance explosion under policy shift) plus practical mitigations with empirical validation. This can influence evaluation methodology broadly across conversational AI, RLHF-style policy changes, and simulation-based testing. Paper 2 is timely and useful as a benchmark/framework, but is more incremental and narrower in methodological novelty.
Paper 1 addresses a fundamental methodological gap in evaluating continual learning for language agents—a rapidly growing field. Its contribution of controlled evaluation frameworks (AgentCL) and diagnostic tools (MemProbe) provides infrastructure that can influence how the broader AI/ML community benchmarks agent learning. Paper 2, while useful as a scoping review of AI in dentistry, is primarily a literature synthesis without novel methods or models. Its impact is domain-specific and incremental. Paper 1's novelty in formalizing continual learning evaluation for agents, combined with the timeliness of the agent paradigm, gives it broader and deeper potential impact.
Paper 1 is likely to have higher scientific impact because it introduces a more novel and methodologically rigorous evaluation framework (controlled compositional task streams + transfer metrics) and a diagnostic tool (MemProbe) with empirical validation across multiple task domains. This directly advances how the community measures continual learning in language agents, a timely and broadly relevant bottleneck for agent research. Paper 2 addresses an important application area (edge/embedded agents) with potentially wide real-world relevance, but it is primarily an architectural/reference proposal with less empirical rigor, making its scientific impact more dependent on downstream adoption.
Paper 2 likely has higher scientific impact due to broader, more general relevance: a rigorous evaluation framework for continual learning in language agents applies across many domains and can become a standard benchmark, influencing both academia and industry. Its controlled task streams and transfer metrics address a timely, widely recognized gap, enabling reproducible comparisons of memory/learning methods. Paper 1 is innovative and high-value for structure-based drug design, but its impact is narrower (specialized to molecular optimization pipelines and specific objectives/benchmarks) and may depend more on downstream experimental validation for real-world adoption.
AgentCL addresses a fundamental and broadly applicable challenge—rigorous evaluation of continual learning in language agents—proposing a comprehensive framework with controlled task streams, transfer metrics, and diagnostic tools (MemProbe). Its breadth of impact spans coding, research, and reasoning tasks, establishing evaluation methodology for a rapidly growing field. Paper 2, while novel in applying process-guided refactoring to formal proofs, addresses a narrower problem (proof readability/modularity in Lean) with more limited applicability. AgentCL's contributions to benchmarking and memory design evaluation have broader implications for the entire language agent community.
LAP addresses a critical infrastructure gap in autonomous science by proposing a standardized protocol for agent-to-instrument communication, complementing existing protocols (MCP, A2A). Its potential impact spans all experimental sciences adopting self-driving laboratories—a rapidly growing field. The protocol's design addressing safety, measurement provenance, and cross-lab federation could become foundational infrastructure. While AgentCL contributes a useful evaluation framework for continual learning in agents, it is more incremental—improving benchmarking methodology rather than enabling new capabilities. LAP's broader cross-disciplinary relevance and timeliness in the autonomous lab revolution give it higher impact potential.
Paper 2 likely has higher impact: it introduces a rigorous evaluation framework (AgentCL) and diagnostics (MemProbe) for continual learning in language agents, a timely and broadly relevant area as agents become widely deployed. Benchmarks and metrics can become community standards, influencing many subsequent methods across NLP, agent systems, and ML evaluation, with clear real-world implications for long-lived assistants. Paper 1 is novel and exciting, showing agentic AI aiding computational mathematics on two open problems, but its impact may be narrower (computational math) and currently depends on expert correction/validation, limiting immediate scalability.
Paper 2 introduces a foundational evaluation framework for a critical but poorly measured area: continual learning in language agents. By providing controlled task streams and rigorous transfer metrics, it addresses a fundamental methodological gap. Evaluation benchmarks typically achieve broader scientific impact and higher citation rates than specific algorithmic architectures (Paper 1), as they set the standard for future research in the field.
Paper 2 addresses continual learning, a fundamental and pervasive challenge in AI, offering a rigorous framework applicable across multiple domains like coding and reasoning. While Paper 1 provides a valuable, high-quality benchmark, its scope is domain-specific to finance. Therefore, Paper 2 has greater potential for broad cross-disciplinary impact and foundational methodological advancement in agentic AI.
Paper 1 addresses a more novel and immediately impactful problem: understanding *where* deep-research agents fail within their trajectories, rather than just whether they fail. The TELBench benchmark and DRIFT framework provide concrete, actionable tools for improving agent reliability with demonstrated 30pp improvements. As AI agents become increasingly deployed, process-level debugging is critically needed. Paper 2 tackles continual learning evaluation for agents—an important but more incremental contribution, with findings that largely confirm known challenges (memory degradation, need for better designs) without offering strong solutions. Paper 1's novelty and practical applicability give it higher impact potential.