AGENTCL: Toward Rigorous Evaluation of Continual Learning in Language Agents

Yiheng Shu, Bernal Jiménez Gutiérrez, Saisri Padmaja Jonnalagedda, Yuguang Yao, Huan Sun, Yu Su

Jun 1, 2026

Frozen v1: this version was superseded on arXiv by v2. Stats below reflect the state at freeze time and will not change.

#1199 of 3404 · Artificial Intelligence

Tournament Score: 1435 ± 44 (scale shown: 1050 to 1800)

Win Rate: 63%

Rating: 6.3 / 10

Significance 6.5 · Rigor 5.8 · Novelty 6.5 · Clarity 7.5

Abstract

Language agents spend substantial inference time solving individual tasks, yet the experience acquired in one episode is often underutilized in future episodes. Continual learning expects an agent to accumulate reusable experience across a stream of tasks, improve over time, and avoid interference from irrelevant experiences. Unfortunately, existing benchmarks struggle to evaluate continual learning in language agents rigorously. Most efforts focus on retrieval and reasoning over long-context conversations or documents, while recent lifelong-adaptation benchmarks often rely on naive task streams with limited analysis of cross-task relationships, making it difficult to understand what an agent learns and reuses over time. This paper presents an evaluation framework AgentCL for continual learning in agents, centered on controlled task streams and metrics for transfer gains. AGENTCL constructs compositional streams where earlier sub-solutions, evidence, or workflows are intentionally reusable in later tasks, and contrasts them with naive streams where such reusability is not guaranteed. We use the benchmark to evaluate non-parametric memory designs for continual learning. To diagnose how memory design choices affect continual learning, we develop MemProbe, a probing method that stores interactions, insights, and skills, while filtering unreliable experiences during consolidation. Empirical analysis across coding, deep research, and language understanding/reasoning tasks shows that naive streams offer limited ability to distinguish memory designs, whereas controlled streams more clearly distinguish their plasticity. Meanwhile, naive and held-out settings often yield limited gains and can expose memory-induced degradation. These results highlight the need for stronger memory designs that balance plasticity and stable reuse.

AI Impact Assessments

(1 model)

Scientific Impact Assessment: AGENTCL

1. Core Contribution

AGENTCL introduces an evaluation framework for continual learning (CL) in language agents, addressing a genuine gap: existing benchmarks either focus on long-context retrieval/reasoning or stream tasks naively without controlling cross-task relationships. The paper's central insight is that task stream design fundamentally determines whether a benchmark can discriminate among memory methods. The framework contributes three things: (1) a distinction between "naive" streams (arbitrary task ordering) and "compositional" streams (where earlier subtasks provide intentionally reusable sub-solutions for later complex tasks); (2) a two-pass evaluation protocol that decouples plasticity, stability, and generalization gains through well-defined metrics (PG, SG, GG); and (3) MEMPROBE, a diagnostic non-parametric memory method with multi-view storage (interaction, insight, skill) and quality-aware consolidation.
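As a rough illustration of how a two-pass protocol can yield three separate gains, the sketch below assumes each gain is a difference in success rate against a memoryless baseline. The paper's exact definitions of PG, SG, and GG are not reproduced in this assessment, so the formulas, the `gains` function, and the sample outcomes are all hypothetical.

```python
from statistics import mean

def success_rate(outcomes):
    """Fraction of tasks solved, given a list of 0/1 outcomes."""
    return mean(outcomes)

def gains(baseline, first_pass, second_pass, heldout_base, heldout_mem):
    """Assumed definitions (NOT taken from the paper), in percentage points:
    PG: read-write first pass vs. memoryless baseline on the stream
    SG: read-only second pass vs. memoryless baseline on the stream
    GG: memory-equipped vs. memoryless agent on held-out tasks
    """
    pg = 100 * (success_rate(first_pass) - success_rate(baseline))
    sg = 100 * (success_rate(second_pass) - success_rate(baseline))
    gg = 100 * (success_rate(heldout_mem) - success_rate(heldout_base))
    return {"PG": round(pg, 1), "SG": round(sg, 1), "GG": round(gg, 1)}

# Made-up outcomes, for illustration only.
result = gains(
    baseline=[1, 0, 0, 0, 1],
    first_pass=[1, 1, 0, 1, 1],
    second_pass=[1, 1, 1, 1, 1],
    heldout_base=[1, 1, 0, 0],
    heldout_mem=[1, 0, 0, 0],
)
print(result)
```

Under these assumed definitions, a method can post a large positive PG on the stream while GG is negative on held-out tasks, which is the pattern the assessment later describes.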

The problem addressed is timely and meaningful: agents spend significant compute on each episode yet rarely accumulate transferable knowledge. However, the contribution is primarily evaluative and diagnostic rather than providing a strong new solution—the paper explicitly frames MEMPROBE as a "probing method" rather than a state-of-the-art approach.

2. Methodological Rigor

The experimental design is thoughtful in several respects. The two-pass protocol (read-write first pass, read-only second pass) cleanly separates what memory contributes during accumulation versus after consolidation. The held-out evaluation adds a generalization dimension often missing from CL benchmarks. Quality control for synthesized BrowseComp+ subtasks includes Jaccard similarity analysis, evidence overlap statistics, and multi-property verification.
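For readers unfamiliar with the Jaccard check mentioned above, here is a minimal sketch. It assumes the similarity is computed over lowercase whitespace tokens of two task descriptions; the assessment does not say what units the paper actually compares, and the example strings are invented.

```python
def jaccard(a, b):
    """Jaccard similarity: size of the intersection over size of the union."""
    set_a, set_b = set(a), set(b)
    if not set_a and not set_b:
        return 1.0  # convention: two empty sets count as identical
    return len(set_a & set_b) / len(set_a | set_b)

def token_jaccard(text_a, text_b):
    """Token-level Jaccard over lowercase whitespace tokens (an assumption)."""
    return jaccard(text_a.lower().split(), text_b.lower().split())

# Invented subtask and parent-task strings, for illustration only.
sub = "which city hosted the 1996 summer games"
full = "which city hosted the 1996 summer games and who lit the cauldron"
score = token_jaccard(sub, full)
print(f"{score:.3f}")
```

A high score between a subtask and its parent task would suggest the subtask leaks most of the parent's wording, while a very low score would suggest little overlap to reuse.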

However, there are notable limitations in rigor:

Limited backbone coverage: Most experiments use Qwen3.5-35B-A3B, and deep research uses gpt-oss-120b. There is no systematic study of whether the findings hold across model families or scales.

Three runs only: For coding tasks, averages over three runs are reported, but the bootstrap CIs in the appendix are wide (often ±7-10 percentage points), making many pairwise comparisons non-significant.

Limited statistical power: The authors acknowledge this indirectly through Table 9, where even on compositional streams, only 34.1% of pairwise CIs exclude zero.

Top-k=2 uniformly: Using a fixed retrieval budget across all methods may disadvantage some designs.

No parametric CL methods evaluated: The paper acknowledges this limitation, but it still reduces the comprehensiveness of the benchmark assessment.
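The concerns about wide confidence intervals and pairwise comparisons can be made concrete with a percentile bootstrap. The sketch below is one common way to do this, resampling each method's per-task outcomes independently; it is not necessarily the procedure used in the paper's appendix, and the function names and data are hypothetical.

```python
import random

def bootstrap_diff_ci(a, b, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for mean(a) - mean(b).

    Each group is resampled independently with replacement
    (an assumption; a paired design would resample task indices instead).
    """
    rng = random.Random(seed)
    diffs = []
    for _ in range(n_boot):
        resample_a = rng.choices(a, k=len(a))
        resample_b = rng.choices(b, k=len(b))
        diffs.append(sum(resample_a) / len(a) - sum(resample_b) / len(b))
    diffs.sort()
    lo = diffs[int((alpha / 2) * n_boot)]
    hi = diffs[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi

def excludes_zero(ci):
    """True when the interval lies entirely above or entirely below zero."""
    lo, hi = ci
    return lo > 0 or hi < 0

# Invented per-task outcomes for two memory methods.
method_a = [1, 1, 0, 1, 0, 1, 1, 0, 1, 1]
method_b = [1, 0, 0, 1, 0, 0, 1, 0, 1, 0]
ci = bootstrap_diff_ci(method_a, method_b)
print(ci, excludes_zero(ci))
```

With few tasks or runs per method, intervals like this tend to be wide, so a pairwise difference is only called significant here when the interval excludes zero.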

3. Potential Impact

The framework addresses a real need in the emerging area of agentic CL. The key finding—that naive streams compress performance differences while compositional streams amplify them—is actionable for benchmark designers. The concrete recommendation that future CL benchmarks should "publish task streams more transparently and discuss relationships between tasks" could influence benchmark design norms.

Practical impact areas include:

Benchmark design methodology: The naive vs. compositional stream contrast provides a template for other domains.

Memory system evaluation: The plasticity-stability tradeoff findings (e.g., methods achieving +40 PG on compositional streams but negative GG on held-out tasks) highlight that current non-parametric memory designs are far from solving the problem.

Agent development: The insight that memory-induced degradation is common could redirect engineering efforts.

The impact is somewhat limited by the scope of environments tested (coding, deep research, BabyAI) and the focus exclusively on non-parametric memory. The benchmark does not yet cover more complex multi-tool environments, collaborative settings, or long-horizon planning tasks where CL might matter most.

4. Timeliness & Relevance

This work is highly timely. The explosion of language agent research (2024-2026) has outpaced rigorous evaluation methodology. The concurrent appearance of Continual Learning Bench, SWE-Bench-CL, and Evo-Memory confirms the community recognizes this gap. AGENTCL differentiates itself through the controlled stream design and formal metric decomposition, which is a meaningful advance over simply streaming existing datasets.

The paper arrives at a moment when the field needs principled evaluation more than incremental method improvements, making the framework-first approach appropriate.

5. Strengths & Limitations

Strengths:

Clean formalization of the CL evaluation problem for agents with well-motivated metrics

The compositional vs. naive stream contrast is intuitive yet surprisingly unexplored

Comprehensive coverage of non-parametric memory designs (7 methods + memoryless baseline)

The finding about discriminative power is empirically convincing across multiple domains

Quality control for synthesized data is thorough

Public data release enhances reproducibility

Limitations:

The compositional streams are relatively simple (subtask → complex task), not capturing more nuanced forms of knowledge transfer (e.g., analogical, procedural generalization across domains)

MEMPROBE, while useful as a probe, is not particularly novel—it's essentially multi-view memory with quality filtering

The paper does not address how to automatically identify or construct compositional relationships in arbitrary task domains, limiting scalability

Generalization gains are consistently negative or near-zero across all methods, yet the paper offers limited analysis of why or how this might be addressed

The BrowseComp+ subtasks are synthesized by GPT-5.2 using gold evidence—this privileged construction process raises questions about how naturalistic the resulting compositionality is

Limited analysis of failure modes beyond the two case studies

Additional Observations

The paper makes a meta-scientific contribution by demonstrating that evaluation methodology matters as much as method development in emerging fields. The variance analysis (comparing standard deviations across streams) is a simple but effective way to argue for benchmark design quality. The token usage statistics (Table 6) show that compositional streams can actually reduce inference cost, an underappreciated practical benefit.
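The variance comparison described above amounts to asking how far apart the methods land on each stream. A minimal sketch, using invented scores and hypothetical method names rather than the paper's numbers:

```python
from statistics import stdev

def spread(scores_by_method):
    """Sample standard deviation of scores across memory methods on one stream."""
    return stdev(list(scores_by_method.values()))

# Invented success rates (percent); NOT results from the paper.
naive = {"method_a": 41.0, "method_b": 43.0, "method_c": 42.0}
compositional = {"method_a": 35.0, "method_b": 60.0, "method_c": 48.0}

naive_spread = spread(naive)
compositional_spread = spread(compositional)
print(naive_spread, compositional_spread)
```

A larger spread on one stream means the methods are easier to tell apart there, which is the sense in which a stream design has more discriminative power.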

The work would benefit from a stronger theoretical framework connecting compositional structure to expected transfer bounds, and from evaluation of how stream length affects the observed patterns.

Rating: 6.3 / 10

Significance 6.5 · Rigor 5.8 · Novelty 6.5 · Clarity 7.5

Generated Jun 2, 2026

Comparison History (24)

vs. From Out-of-Distribution Detection to Hallucination Detection: A Geometric View

gpt-5.2 · 6/5/2026

Paper 2 likely has higher impact because it introduces a rigorous evaluation framework and benchmark (AgentCL) plus diagnostic tooling (MemProbe) for continual learning in language agents—a timely, broadly relevant area for agentic systems. Benchmarks and metrics often become field standards, shaping research directions across LLM agents, memory systems, and continual learning. Paper 1 offers a useful reframing of hallucination detection via OOD methods and training-free detectors, but it is more narrow in scope and may overlap with existing uncertainty/OOD-based safety work. Paper 2’s methodological contribution and cross-field applicability are stronger.

vs. When Should LLMs Be Less Specific? Selective Abstraction for Reliable Long-Form Text Generation

gpt-5.2 · 6/5/2026

Paper 2 likely has higher impact: it introduces a broadly applicable, novel reliability mechanism (selective abstraction) for long-form generation that moves beyond binary abstention, with clear real-world value in safety-critical deployment. It formalizes risk/coverage tradeoffs and provides an end-to-end evaluation pipeline with quantitative gains on established benchmarks, suggesting methodological rigor and immediate usability across many LLM applications. Paper 1 is valuable but mainly advances benchmarking/evaluation for continual-learning agents; its impact depends on community adoption and future memory-method advances, and is narrower in near-term application.

vs. ClinicalMC: A Benchmark for Multi-Course Clinical Decision-Making with Large Language Models

claude-opus-4.6 · 6/3/2026

Paper 1 (AGENTCL) addresses a fundamental challenge in language agent research—rigorous evaluation of continual learning—with broader applicability across AI. It introduces a novel evaluation framework with controlled task streams, transfer metrics, and a diagnostic probing method (MemProbe), contributing methodological innovations applicable to the growing field of language agents. Paper 2 (ClinicalMC) is a valuable domain-specific benchmark for clinical decision-making but is narrower in scope. AGENTCL's contributions to understanding memory, plasticity, and knowledge reuse in agents have wider cross-field impact and address a timely, foundational problem.

vs. Reasoning Structure of Large Language Models

gemini-3.1 · 6/3/2026

Paper 1 introduces a highly novel and timely methodology for evaluating Large Reasoning Models by converting unstructured reasoning traces into verifiable graphs. Given the recent surge in reasoning-focused LLMs, moving beyond superficial metrics like accuracy to analyze the topological structure and efficiency of reasoning addresses a critical bottleneck in the field. While Paper 2 presents a valuable benchmark for continual learning in agents, Paper 1's approach has broader foundational implications for understanding and diagnosing the core cognitive mechanics of modern AI models.

vs. Hidden Thoughts Are Not Secret: Reasoning Trace Exposure in LLMs

gemini-3.1 · 6/2/2026

Paper 1 addresses an extremely timely and critical issue in modern LLM deployment: the security and intellectual property of hidden reasoning traces (e.g., OpenAI's o1). By demonstrating that these traces can be extracted via prompting, it has immediate, widespread implications for AI safety, model distillation, and commercial API design. While Paper 2 offers a valuable benchmark for continual learning in agents, Paper 1's findings represent a more fundamental discovery about model vulnerabilities with broader real-world consequences.

vs. Evaluation of Baseline Methods for IDD-based SSD External Memory Search

claude-opus-4.6 · 6/2/2026

Paper 1 addresses the timely and rapidly growing field of language agents and continual learning, proposing a novel evaluation framework (AgentCL) with controlled task streams and diagnostic tools (MemProbe). It has broader impact across AI/NLP communities, tackles a fundamental challenge in deploying LLM-based agents, and introduces methodological contributions (compositional streams, transfer metrics) that could shape future research. Paper 2 addresses a narrower, more incremental topic—evaluating simple baseline methods for external memory search—with limited novelty and a smaller potential audience in classical AI search.

vs. RL-ACRGNet: Reinforcement Learning-Based Chest Radiology Report Generation Network

gemini-3.1 · 6/2/2026

Paper 2 introduces a novel benchmark and evaluation framework for continual learning in LLM agents, a highly active and rapidly growing field. Foundational evaluation frameworks and benchmarks typically generate broader scientific impact and citations across the AI community than incremental methodological improvements in specific applications, as seen in Paper 1's marginal metric gains in radiology report generation.

vs. Insights Generator: Systematic Corpus-Level Trace Diagnostics for LLM Agents

gpt-5.2 · 6/2/2026

Paper 2 (AGENTCL) has higher estimated impact because it proposes a rigorous, controlled evaluation framework and metrics for continual learning in language agents—an enabling contribution likely to be reused broadly across labs and agent architectures. Its controlled task streams address a core methodological gap (confounded/naive streams), improving rigor and comparability, and it spans multiple domains (coding, research, reasoning). Paper 1 is practically valuable for debugging and shows strong applied gains, but is more tool/system-specific and likely narrower in long-term, cross-field adoption than a benchmark+methodology that can standardize future work.

vs. Adversarial Feeds Steer LLM Agent Decisions Against Their Defaults

claude-opus-4.6 · 6/2/2026

Paper 1 identifies a novel, practically important attack surface—the feed/ranking layer upstream of LLM agents—that current safety evaluations entirely overlook. It introduces a rigorous causal protocol, demonstrates strong empirical results across multiple models and labs, characterizes dose-response dynamics, and proposes mitigations. This has immediate implications for AI safety, deployment practices, and policy. Paper 2 contributes a useful evaluation framework for continual learning in agents, but is more incremental—refining benchmarks rather than uncovering a fundamentally new vulnerability. Paper 1's findings are more broadly actionable and timely given rapid LLM agent deployment.

vs. An Enigma of Artificial Reason: Investigating the Production-Evaluation Gap in Large Reasoning Models

gemini-3.1 · 6/2/2026

Paper 2 investigates a fundamental cognitive and architectural question regarding frontier reasoning models, exposing a critical 'production-evaluation gap.' Its use of rigorous mechanistic interpretability techniques, such as causal patching and linear probes, to explain the 'answer confirmation bias' provides deep scientific insights into model behavior. While Paper 1 offers a valuable benchmarking tool for continual learning agents, Paper 2's findings challenge dominant training paradigms for large language models, giving it broader implications, higher timeliness, and significantly greater potential impact across the AI alignment and capabilities communities.

vs. Evaluating Bivariate Causal Statements Based on Mutual Compatibility

gpt-5.2 · 6/2/2026

Paper 1 likely has higher impact due to timeliness and broad applicability: rigorous continual-learning evaluation for language agents addresses a central, fast-moving bottleneck in agent research and can directly shape benchmark practices and memory-system design across coding, research, and reasoning agents. Its controlled task-stream construction plus diagnostic tooling (MemProbe) offers a concrete methodology likely to be adopted widely. Paper 2 is novel and theoretically interesting for causal-claim vetting, but its impact may be narrower (bivariate/linear settings, stronger modeling assumptions) and less immediately actionable across mainstream ML systems.

vs. Transferring Information Across Interventions in Causal Bayesian Optimization

gemini-3.1 · 6/2/2026

Paper 1 offers a foundational methodological advancement at the intersection of causal inference and Bayesian optimization. By introducing a graph-coupled causal kernel with rigorous theoretical guarantees (e.g., logarithmic information-gain and regret bounds), it provides a robust mathematical framework for experimental design. This ensures long-lasting, cross-disciplinary impact in any field requiring expensive optimizations. While Paper 2 is highly timely for LLM agents, Paper 1's theoretical rigor and broad applicability to general scientific discovery give it a higher potential for sustained scientific impact.

vs. Aligned but Fragile: Enhancing LLM Safety Robustness via Zeroth-Order Optimization

gemini-3.1 · 6/2/2026

Paper 1 addresses a critical and highly timely challenge: the fragility of LLM safety alignment against practical perturbations like quantization. Its novel application of zeroth-order optimization to enhance robustness offers significant real-world utility for safe LLM deployment. While Paper 2 provides a valuable evaluation framework for continual learning in agents, the immediate necessity and broader applicability of robust safety alignment give Paper 1 a higher potential for widespread scientific and practical impact.

vs. OpenClawBench: Benchmarking Process-side Anomalies in Real-world Agent Execution Trajectories

claude-opus-4.6 · 6/2/2026

Paper 1 (AgentCL) addresses a more fundamental and broadly impactful problem—continual learning in language agents—which is a core challenge as agents become more prevalent. Its evaluation framework with controlled task streams, transfer metrics, and the MemProbe diagnostic method provides reusable infrastructure for the growing agent research community. Paper 2 (OpenClawBench) tackles the important but narrower problem of process-side anomaly detection in agent executions. While valuable, its scope is more limited to runtime monitoring/auditing. AgentCL's broader applicability across coding, research, and reasoning tasks, combined with its methodological contributions to understanding memory and knowledge transfer, gives it higher potential for cross-field impact.

vs. A Mathematical Conflict Framework for Contextual Data Modulation

claude-opus-4.6 · 6/2/2026

Paper 2 addresses a concrete, timely problem—continual learning in language agents—with a practical evaluation framework (AgentCL), diagnostic tools (MemProbe), and empirical validation across multiple task domains. It connects to the rapidly growing field of LLM-based agents and provides actionable benchmarks. Paper 1 proposes an abstract mathematical framework for representing conflict in data modulation but lacks empirical validation, concrete instantiations, or demonstrated applications, limiting its immediate and foreseeable impact. Paper 2's relevance to the active AI agent community gives it significantly broader potential influence.

vs. LLM-Evolved Pattern Generators for Optimal Classical Planning

gpt-5.2 · 6/2/2026

Paper 2 is more novel and potentially higher impact: it introduces the first learned, domain-dependent heuristics that are admissible by construction for optimal planning, combining LLM-driven program synthesis with established admissible composition (saturated cost partitioning). This bridges modern LLM synthesis with classical guarantees, enabling real-world deployment where optimality matters (robotics, logistics, verification) while improving speed and maintaining coverage. Paper 1 is valuable and timely as an evaluation framework for continual learning in agents, but benchmarks/probing tools typically have narrower downstream impact than a new guarantee-preserving method that can directly improve core planning performance.

vs. PassNet: Scaling Large Language Models for Graph Compiler Pass Generation

claude-opus-4.6 · 6/2/2026

PassNet addresses a concrete, high-impact problem at the intersection of LLMs and compiler optimization with a comprehensive ecosystem (dataset, benchmark, metrics, and tooling). It demonstrates clear practical value with measurable speedups, introduces a novel abstraction (pass generation vs. kernel generation), and provides publicly available infrastructure that can serve as live training data. AGENTCL contributes a useful evaluation framework for continual learning in agents, but is primarily a benchmarking/diagnostic contribution with narrower scope. PassNet's combination of novelty, practical impact on real-world compilation, and scalable infrastructure gives it broader and deeper potential influence.

vs. EVA-Net: Subject-Independent EEG Motor Decoding with Video-Derived Motor Priors

gemini-3.1 · 6/2/2026

AgentCL addresses a critical bottleneck in the rapidly expanding field of language agents by providing a rigorous evaluation framework for continual learning. Its introduction of controlled task streams and the MemProbe diagnostic tool will likely become foundational for future research on agentic memory and lifelong learning. While Paper 2 presents an innovative multimodal approach for BCIs, Paper 1 has broader applicability and aligns with highly active, field-wide efforts to build autonomous, continually improving AI systems.

vs. Advanced Mathematics Learning Behavior Prediction and Academic Early Warning Model Based on Multimodal Data Analysis

gpt-5.2 · 6/2/2026

Paper 1 likely has higher scientific impact due to stronger novelty and broader relevance: it proposes a rigorous evaluation framework for continual learning in language agents, a timely core problem in AI, with controlled task streams and diagnostic metrics that could become standard for benchmarking memory and adaptation methods. Its contributions can influence multiple subfields (LLM agents, continual learning, benchmarking, memory systems) and shape future method development. Paper 2 targets an important applied education problem, but appears more domain-specific and uses comparatively established techniques (knowledge graphs + temporal/GNN modeling), limiting breadth of impact.

vs. Bayesian Spectral Emotion Transition Discovery from Multi-Annotator Disagreement

gemini-3.1 · 6/2/2026

While Paper 1 presents a highly rigorous statistical approach to emotion dynamics, Paper 2 targets a more pressing and widely applicable challenge: continual learning in Large Language Model (LLM) agents. Developing rigorous evaluation frameworks for agent memory and lifelong learning is critical for the advancement of autonomous AI systems across diverse domains like coding, research, and reasoning. The broad relevance, extreme timeliness, and potential to standardize evaluation in the rapidly growing field of LLM agents give Paper 2 a higher potential for widespread scientific impact.