Continual Learning Bench: Evaluating Frontier AI Systems in Real-World Stateful Environments
Parth Asawa, Christopher M. Glaze, Gabriel Orlanski, Ramya Ramakrishnan, Benji Xu, Asim Biswal, Vincent Sunn Chen, Frederic Sala
Abstract
Continual learning, the ability of AI systems to improve through sequential experience, has attracted substantial interest, but no high-quality benchmark exists to evaluate it. We introduce Continual Learning Bench (CL-Bench), the first difficult, expert-validated benchmark designed to measure whether LLM-based systems genuinely improve with experience. CL-Bench spans six diverse domains (software engineering, signal processing, disease outbreak forecasting, database querying, strategic game-playing, and demand forecasting), each validated by domain experts and designed so that tasks share a learnable latent structure (codebase layout, disease outbreak dynamics, opponent strategies) that a stateful system can discover online but a stateless one cannot. We evaluate frontier models across several agent architectures, from naive in-context learning (ICL) to dedicated memory systems, introducing a gain metric to isolate learning from prior capabilities. We find that these systems leave headroom for improved continual learning: agents frequently overfit to immediate observations or fail to reuse knowledge across instances, and dedicated memory systems do not fix this -- in fact, naive ICL outperforms systems dedicated to memory management. CL-Bench is the first benchmark to evaluate continual learning across diverse real-world domains with expert-validated tasks and isolate online learning from underlying model capability, showing a need for better continual learning systems.
AI Impact Assessments
Scientific Impact Assessment: Continual Learning Bench
1. Core Contribution
CL-Bench addresses a genuine gap in the evaluation landscape for LLM-based systems: there is no rigorous benchmark that measures whether AI agents truly improve through sequential experience in realistic settings. The paper introduces six expert-validated tasks across diverse domains (software engineering, signal processing, epidemiology, database querying, poker, demand forecasting), each engineered with latent structure that is discoverable only through online interaction—not from pretraining knowledge. The key conceptual contribution is the gain metric, which isolates learning-from-experience by comparing stateful system performance against a stateless baseline of the same system on identical instances. This is a clean and well-motivated design choice that addresses a fundamental confound: high absolute performance could reflect strong priors rather than genuine online adaptation.
The benchmark's three admission criteria—headroom, shared latent structure, and learning mechanism—are well-articulated and provide a principled framework for task design that could guide future benchmark construction beyond this specific effort.
2. Methodological Rigor
Strengths in design: The separation of stateful vs. stateless evaluation is methodologically sound and addresses the core identification problem in measuring online learning. The normalization scheme (dividing by system-specific headroom for gain, by a fixed external reference for reward) is thoughtful—it prevents tasks where the baseline is near-optimal from contributing negligible signal and makes reward scores submission-independent.
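A minimal sketch of the two normalizations described above may make the distinction concrete. The exact formulas are not reproduced in this assessment, so the definitions below are assumptions: headroom is taken as a task ceiling minus the system's own stateless mean, and reward is divided by a fixed reference score. The names `normalized_gain`, `normalized_reward`, `ceiling`, and `reference` are hypothetical.

```python
from statistics import mean


def normalized_gain(stateful, stateless, ceiling):
    """Gain of the stateful run over the stateless baseline, as a fraction of headroom.

    Assumption: headroom = ceiling - mean(stateless), i.e. it depends on the
    system being evaluated. The paper's exact definition may differ.
    """
    if len(stateful) != len(stateless):
        raise ValueError("stateful and stateless runs must cover the same instances")
    baseline = mean(stateless)
    headroom = ceiling - baseline
    if headroom <= 0:
        raise ValueError("stateless baseline leaves no headroom on this task")
    return (mean(stateful) - baseline) / headroom


def normalized_reward(rewards, reference):
    """Mean reward divided by a fixed external reference (assumed positive).

    Because the reference does not depend on the submitted system, the score
    is comparable across submissions.
    """
    if reference <= 0:
        raise ValueError("reference must be positive")
    return mean(rewards) / reference
```

Under these assumed definitions, a system whose stateless baseline is already close to the ceiling has a small denominator, so even a small absolute improvement registers as a meaningful normalized gain.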
The expert validation pipeline (2-3 domain experts per task, structured axes of realism/reusable knowledge/learning improvement) adds credibility, though the paper could have been more transparent about expert selection and inter-rater agreement.
Concerns: The stability-plasticity decomposition in Section 5.1/Appendix C, while interesting, relies on variant boundaries as a partition—a somewhat coarse proxy. The decomposition assumes boundary instances cleanly separate "retention" from "adaptation," but in practice, knowledge decay within variants could confound the plasticity term.
With 5 rollouts per task and potentially high variance (some CIs are wide, e.g., ICL + GPT-5.4 at ±9.1% normalized gain), statistical power for distinguishing systems is limited. The paper acknowledges this but does not provide formal significance tests between systems. Codex is evaluated with only a single run, making its results unreliable for ranking.
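To illustrate the kind of analysis this concern points to, the sketch below computes a Student-t confidence interval half-width for five rollouts and an exact two-sample permutation test on the difference of means. It is not the paper's procedure; the function names are hypothetical, and treating rollouts as independent samples is an assumption.

```python
from itertools import combinations
from math import sqrt
from statistics import mean, stdev

# Two-sided 95% Student-t critical value for 4 degrees of freedom (5 rollouts).
T_CRIT_95_DF4 = 2.776


def ci_half_width(rollouts):
    """95% confidence-interval half-width for the mean of exactly five rollouts."""
    if len(rollouts) != 5:
        raise ValueError("the hard-coded critical value assumes exactly 5 rollouts")
    return T_CRIT_95_DF4 * stdev(rollouts) / sqrt(len(rollouts))


def permutation_p_value(a, b):
    """Exact two-sided permutation test on the absolute difference of means.

    Enumerates every way of splitting the pooled scores into groups of the
    original sizes and counts splits at least as extreme as the observed one.
    """
    pooled = list(a) + list(b)
    n_a, n_b = len(a), len(b)
    total = sum(pooled)
    observed = abs(mean(a) - mean(b))
    hits = 0
    count = 0
    for idx in combinations(range(len(pooled)), n_a):
        sum_a = sum(pooled[i] for i in idx)
        diff = abs(sum_a / n_a - (total - sum_a) / n_b)
        count += 1
        if diff >= observed - 1e-12:  # small tolerance for float rounding
            hits += 1
    return hits / count
```

With five rollouts per system there are only 252 possible splits, so the smallest attainable two-sided p-value is 2/252 (about 0.008). Small differences between systems are therefore hard to establish at this sample size.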
The tasks themselves range from 12 to 120 instances—relatively short horizons. Whether findings generalize to longer deployment settings remains unclear.
3. Potential Impact
Benchmark impact: CL-Bench fills a clear niche. The community currently lacks a standardized way to evaluate whether memory-augmented agents, context compaction methods, or test-time training approaches actually produce better online learners. This benchmark provides that. Its open-source nature and clear task admission criteria position it for community adoption and extension.
Finding impact: The headline empirical result, that naive ICL outperforms dedicated memory systems like Mem0 and ACE, is provocative and practically important. It challenges the assumption that more sophisticated memory architectures automatically yield better continual learning. The finding that ACE costs 7.6× as much as Gemini Flash ICL while trailing it on gain is a stark cost-effectiveness result.
Adjacent field impact: The benchmark design principles (hidden latent structure, concept drift, gain metrics) could influence evaluation methodology in continual RL, online learning, and adaptive systems more broadly. The plasticity-stability decomposition adapted for LLM agents connects classical continual learning theory to modern agentic systems.
4. Timeliness & Relevance
This is highly timely. The deployment of persistent LLM agents (coding assistants that work across sessions, data analysts that learn organizational context) is accelerating. Yet evaluation has lagged—most agent benchmarks test isolated task performance. The gap between what practitioners build (stateful, long-horizon agents) and what researchers evaluate (stateless, single-task performance) is exactly what CL-Bench targets.
The paper also arrives at an inflection point where memory-augmented and test-time training approaches are proliferating rapidly, making standardized evaluation infrastructure urgent.
5. Strengths & Limitations
Key Strengths:
Notable Limitations:
Overall Assessment
CL-Bench makes a valuable contribution by providing the first rigorous evaluation framework for online learning in LLM-based agents. The gain metric and task design criteria are its strongest conceptual contributions. The empirical finding that ICL dominates dedicated memory systems is important and actionable. The benchmark's main limitations—small task count, missing parametric methods, short horizons—are acknowledged and positioned as community growth opportunities. This is a solid benchmark paper that addresses a real evaluation gap at an opportune moment.
Generated Jun 5, 2026
Comparison History (20)
Paper 2 likely has higher impact because it introduces an expert-validated, multi-domain benchmark that can become a shared standard for evaluating continual learning in frontier LLM agents, shaping future research and enabling reproducible comparisons. Its breadth (six real-world domains), methodological contribution (stateful vs stateless design, gain metric isolating learning), and timeliness (clear evidence of headroom and counterintuitive findings about memory systems) make it broadly influential. Paper 1 is innovative and application-relevant, but is a narrower algorithmic contribution with impact more confined to agent memory/adaptation methods.
Paper 2 likely has higher impact because it introduces an expert-validated, multi-domain benchmark that can become shared infrastructure for evaluating continual learning in frontier LLM agents. Benchmarks with clear metrics (including a gain metric isolating online learning) tend to catalyze broad, comparable progress across academia and industry, affecting many subfields and enabling reproducibility. It is timely given interest in stateful agents and real-world deployment. Paper 1 is novel and useful but is narrower (a specific RL failure mode and fix) and may have more limited cross-domain uptake than a widely adopted benchmark.
Paper 1 introduces a comprehensive, expert-validated benchmark for continual learning in LLMs, a critical and rapidly growing area of AI. Benchmarks that establish standard evaluation metrics for new capabilities typically become foundational, driving extensive future research and accumulating high citations. While Paper 2 offers valuable insights into LLM interpretability, Paper 1's broad applicability across diverse real-world domains and its potential to guide the development of new agent architectures give it a higher ceiling for field-wide impact.
Paper 2 introduces a novel benchmark for continual learning in frontier AI systems across diverse domains. Addressing the evaluation of LLMs in continual learning is a critical and timely challenge in AI. This broad applicability and foundational contribution to AI evaluation give it a significantly higher potential scientific impact compared to Paper 1, which focuses on a more narrow, domain-specific application in infrastructure inspection.
Paper 2 (CL-Bench) likely has higher scientific impact because it introduces a broadly applicable, expert-validated benchmark targeting a core unsolved capability (continual learning) and provides evaluation methodology (gain metric) that can shape research agendas across many subfields and agent designs. Benchmarks often become community standards, enabling comparable progress and influencing model/agent development widely. Paper 1 (Vortex) is highly timely and valuable for systems/serving and sparse attention iteration, but its impact is narrower (serving-stack-dependent, primarily LLM inference optimization) and more engineering-specific.
Paper 2 has higher potential impact: it proposes a novel, technically specified zero-knowledge verification architecture for frontier-model training with clear governance and security applications, potentially enabling enforceable regulation and auditing across the AI ecosystem. Its approach is timely and broadly relevant across cryptography, distributed systems, ML infrastructure, and policy. While methodological feasibility remains to be proven, it lays out concrete protocol components, proof types, and an explicit research agenda. Paper 1 is a strong benchmark contribution, but its impact is narrower and incremental relative to the rapidly growing benchmark landscape.
Continual learning in LLMs is a critical bottleneck for developing autonomous AI agents. By providing the first expert-validated benchmark across diverse domains, Paper 1 establishes a foundational evaluation metric that will likely drive broad research in AI memory and learning systems. While Paper 2 offers a valuable methodological improvement for multimodal time-series foundation models, Paper 1 addresses a more universally recognized challenge in the rapidly expanding and highly impactful field of frontier AI systems, giving it a higher potential for broad scientific impact.
Paper 2 likely has higher impact: it introduces a broadly useful, expert-validated benchmark for evaluating continual learning in frontier AI systems across six real-world domains, with a clear metric to separate online learning from base capability. Benchmarks often become field standards, enabling reproducible comparisons and accelerating progress across many subareas. Its findings (naive ICL outperforming memory systems) are timely and directly actionable for AI research. Paper 1 is innovative in combining LLM-driven decisions with spatial agent-based epidemiological simulation, but its impact is more domain-specific and depends strongly on validation of LLM behavioral realism.
Paper 1 addresses continual learning in frontier LLMs, a highly relevant and rapidly growing field. By introducing a diverse, expert-validated benchmark, it is likely to drive significant future research and have broad applications across AI. In contrast, Paper 2 focuses on a specialized algorithmic improvement for classical search problems, which, while methodologically sound, has a much narrower scope and potential impact.
Paper 1 introduces a concrete, expert-validated benchmark (CL-Bench) addressing a critical gap in evaluating continual learning for LLM-based systems. It provides empirical findings across six domains with actionable metrics, making it immediately useful to the research community. Paper 2 proposes a theoretical motivational architecture for conversational AGI but remains largely speculative, lacking empirical validation. Benchmarks tend to have outsized impact by shaping research directions, and CL-Bench addresses a timely, well-defined problem with rigorous methodology, whereas Paper 2's contributions are more conceptual and harder to validate.
Paper 2 addresses a critical and universal challenge in AI: evaluating models that surpass human comprehension. By introducing a novel adversarial evaluation framework, it provides a scalable solution for future AI benchmarking across all domains. While Paper 1 offers a valuable benchmark for the specific subfield of continual learning, Paper 2 has a broader, more fundamental impact on how the entire field will measure AI progress and ensure safety as models approach superintelligence.
Paper 2 (CL-Bench) is likely higher impact: it introduces a broad, expert-validated benchmark across six real-world domains, a clear gain metric to disentangle online learning from base capability, and produces actionable findings about current agent/memory approaches. Benchmarks often catalyze community-wide progress, enabling reproducible comparison and shaping research agendas across fields (agents, evaluation, continual learning, safety). Paper 1 offers an interesting interpretable-by-design architecture with efficiency benefits, but architectural proposals face higher adoption risk and narrower immediate applicability than a widely usable evaluation suite.
Paper 1 likely has higher impact: it introduces an expert-validated, multi-domain benchmark targeting a central unsolved capability for frontier AI (online continual learning), along with an evaluation metric and surprising findings that naive ICL can outperform explicit memory systems. Benchmarks often become community standards, shaping research agendas across ML, agents, and domain applications, making it timely and broadly influential. Paper 2 is innovative and theoretically grounded with a strong application (optimal power flow), but its scope is narrower (constrained optimization architectures) and may impact a more specialized community.
Paper 2 likely has higher impact: it introduces a broad, expert-validated benchmark spanning six real-world domains, providing an immediately reusable standard for evaluating and comparing continual-learning agents. Benchmarks often drive field-wide progress, influence model development, and enable reproducible measurement across labs and subfields. Its “gain” metric and finding that naive ICL can beat memory-augmented systems are timely and actionable for frontier-agent design. Paper 1 is methodologically rigorous and novel for RLVR causal decomposition, but is narrower in scope/application and primarily affects a specific alignment/RLVR evaluation niche.
While Paper 1 presents an innovative self-reconfiguration method for LLM agents, Paper 2 introduces a much-needed, expert-validated benchmark for continual learning. Benchmarks like CL-Bench typically have broader scientific impact as they standardize evaluation, expose critical flaws in current systems, and catalyze field-wide future research across multiple domains.
Paper 1 exposes a fundamental, paradigm-shifting flaw in current AI alignment techniques. By analytically and causally demonstrating the 'Safety Paradox'—where enhanced safety mechanisms inherently create new vulnerabilities—it addresses a critical, immediate bottleneck in AI safety. This theoretical and empirical breakthrough is likely to force widespread structural changes in how frontier models are aligned, yielding a higher fundamental scientific impact than the benchmarking utility provided by Paper 2.
Paper 1 likely has higher impact: it introduces a challenging, expert-validated benchmark for continual learning in stateful, real-world settings across six domains, plus a metric to disentangle online learning from base model capability. Benchmarks often become community standards, shaping evaluation and driving progress broadly across ML/AI and agent research. Its negative/diagnostic findings (memory systems not helping; ICL strong) are timely for frontier LLM agents and can redirect research. Paper 2 offers a useful conceptual taxonomy and some diffusion experiments, but its scope is narrower and more incremental relative to existing knowledge-infusion work.
Paper 1 likely has higher impact: it introduces a broad, expert-validated benchmark for continual learning in LLM-based agents across six real-world domains, plus an evaluation methodology (gain metric) that can become a community standard. Benchmarks in fast-moving frontier AI tend to drive widespread follow-on work, enable reproducibility, and influence many subfields (agents, memory, evaluation, safety). Paper 2 is novel and rigorous with guaranteed-admissible learned heuristics, but its scope is narrower (optimal planning/search) and may affect a smaller research community despite strong technical contribution.
Paper 1 introduces a novel benchmark (CL-Bench) addressing a fundamental gap in evaluating continual learning for LLM-based systems—a timely and broadly impactful contribution given the rapid advancement of frontier AI. It spans six diverse domains, provides expert validation, and reveals important findings about memory systems' limitations. Its breadth of impact across the AI community is significantly larger than Paper 2, which applies a specific DRL algorithm to pharmaceutical inventory management—a more incremental, domain-specific contribution with narrower impact, despite its practical value.
While Paper 1 provides highly timely empirical data on a critical environmental issue, Paper 2 introduces a foundational benchmark for continual learning in AI. In the rapidly advancing AI field, comprehensive, expert-validated benchmarks typically drive extensive future research, establishing the standard for evaluating new models. Consequently, Paper 2 is likely to generate a massive citation volume and broadly shape future AI development methodologies.