FORGE: Self-Evolving Agent Memory With No Weight Updates via Population Broadcast
Igor Bogdanov, Chung-Horng Lung, Thomas Kunz, Jie Gao, Adrian Taylor, Marzia Zaman
Abstract
Can LLM agents improve decision-making through self-generated memory without gradient updates? We propose FORGE (Failure-Optimized Reflective Graduation and Evolution), a staged, population-based protocol that evolves prompt-injected natural-language memory for hierarchical ReAct agents. FORGE wraps a Reflexion-style inner loop, where a dedicated reflection agent (using the same underlying LLM, no distillation from a stronger model) converts failed trajectories into reusable knowledge artifacts: textual heuristics (Rules), few-shot demonstrations (Examples), or both (Mixed), with an outer loop that propagates the best-performing instance's memory to the population between stages and freezes converged instances via a graduation criterion. We evaluate on CybORG CAGE-2, a stochastic network-defense POMDP at a 30-step horizon against the B-line attacker, where all four tested LLM families (Gemini-2.5-Flash-Lite, Grok-4-Fast, Llama-4-Maverick, Qwen3-235B) exhibit strongly negative, heavy-tailed zero-shot rewards. Compared against both a zero-shot baseline and a Reflexion baseline (isolated single-stream learning), FORGE improves average evaluation return by 1.7-7.7 over zero-shot and by 29-72% over Reflexion in all 12 model-representation conditions, reducing major-failure rates (below ) to as low as 1%. We find that (1) population broadcast is the critical mechanism, with a no-graduation ablation confirming that broadcast carries the performance gains while graduation primarily saves compute; (2) Examples achieves the strongest returns for three of four models, while Rules offers the best cost-reliability profile with 40% fewer tokens; and (3) weaker baseline models benefit disproportionately, suggesting FORGE may mitigate capability gaps rather than amplify strong models. All evidence is confined to CAGE-2 B-line; cross-family findings are directional evidence.
AI Impact Assessments
Scientific Impact Assessment of FORGE: Self-Evolving Agent Memory With No Weight Updates via Population Broadcast
1. Core Contribution
FORGE introduces a population-based, gradient-free protocol for evolving prompt-injected natural-language memory in LLM agents. The key novelty lies in combining three mechanisms: (1) a Reflexion-style inner loop that converts failed trajectories into reusable knowledge artifacts (Rules, Examples, or Mixed), (2) champion broadcast that propagates the best-performing instance's memory to the entire population between stages, and (3) a graduation criterion that freezes converged instances. The paper explicitly frames this as adapting Population-Based Training (PBT) from weight space to prompt space, which is a conceptually clean and well-motivated transfer.
The central insight — that single-stream verbal self-improvement lacks selection pressure and can accumulate counterproductive artifacts — is well-articulated. The population broadcast mechanism addresses this by providing a form of evolutionary selection over textual artifacts, which is genuinely novel in the context of long-horizon stochastic POMDPs.
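The three mechanisms described above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the names `Instance`, `run_stage`, and `evolve`, the toy rollout and reflection functions, and both thresholds are hypothetical. In particular, the sketch ranks instances by mean stage return as a stand-in for the paper's checkpoint evaluations, and it treats graduation as a simple score threshold, which is only an assumption about how the criterion might look.

```python
from dataclasses import dataclass, field


@dataclass
class Instance:
    memory: list = field(default_factory=list)  # natural-language artifacts
    score: float = float("-inf")                # latest stage score
    graduated: bool = False                     # frozen instances stop learning


def run_stage(inst, attempts, rollout, reflect, fail_threshold):
    """Inner loop: act, then reflect on each failed attempt."""
    returns = []
    for _ in range(attempts):
        ret, trajectory = rollout(inst.memory)
        returns.append(ret)
        if ret < fail_threshold:  # a failure triggers reflection
            inst.memory = inst.memory + [reflect(trajectory)]
    inst.score = sum(returns) / len(returns)


def evolve(pop_size, stages, attempts, rollout, reflect,
           fail_threshold, grad_threshold):
    """Outer loop: broadcast the champion's memory between stages."""
    population = [Instance() for _ in range(pop_size)]
    for _ in range(stages):
        for inst in population:
            if not inst.graduated:
                run_stage(inst, attempts, rollout, reflect, fail_threshold)
        champion = max(population, key=lambda i: i.score)
        for inst in population:
            if inst.graduated:
                continue
            if inst.score >= grad_threshold:
                inst.graduated = True                # freeze this instance
            else:
                inst.memory = list(champion.memory)  # champion broadcast
    return max(population, key=lambda i: i.score)


# Toy stand-ins (hypothetical): each stored lesson adds 2.0 to the return.
def toy_rollout(memory):
    return -10.0 + 2.0 * len(memory), "trajectory placeholder"


def toy_reflect(trajectory):
    return "lesson learned from " + trajectory


best = evolve(pop_size=4, stages=3, attempts=3,
              rollout=toy_rollout, reflect=toy_reflect,
              fail_threshold=-1.0, grad_threshold=-2.0)
print(best.score, len(best.memory), best.graduated)
```

With a deterministic toy rollout every instance behaves identically, so the broadcast step has no visible effect here; selection only matters once rollouts are stochastic and instances accumulate different artifacts.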
2. Methodological Rigor
Strengths: The experimental design is commendable in several respects. The paper includes proper ablations (no-graduation variant, Reflexion baseline, zero-shot baseline), a sensitivity sweep over the failure trigger threshold, and evaluates across four distinct LLM families. The distinction between checkpoint evaluations (used for selection) and post-session evaluations (used for reporting) is methodologically clean. The total scale — 116 experiments, 2,640 episodes, ~12.4B tokens — represents substantial computational investment.
Concerns: The primary weakness is evaluation scope. All evidence comes from a single environment (CybORG CAGE-2), a single attacker type (B-line), and a single horizon (30 steps). The authors are transparent about this limitation, but it fundamentally constrains the strength of claims about the protocol's generality. The non-Gemini models receive only 3-4 FORGE sessions per representation, making cross-family conclusions genuinely "directional" as claimed.
The statistical reporting is adequate but could be stronger — standard errors of the mean are shown in figures, but formal significance tests are absent. Given the high variance in returns (SDs often exceeding 30-50), some of the claimed improvements, particularly for models with fewer sessions, may not be statistically robust.
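As one illustration of the kind of formal test that could accompany the reported standard errors, the sketch below computes the standard error of the mean and a Welch t-statistic with Welch-Satterthwaite degrees of freedom for two samples of episode returns. The function names and the sample values are hypothetical and are not data from the paper; with heavy-tailed returns, a bootstrap or rank-based test may be a better fit than a t-test.

```python
import math
import statistics


def sem(sample):
    """Standard error of the mean, using the sample standard deviation."""
    return statistics.stdev(sample) / math.sqrt(len(sample))


def welch_t(a, b):
    """Welch's t-statistic and approximate degrees of freedom.

    Does not assume equal variances, which suits comparisons where one
    condition is much noisier than the other.
    """
    var_a = statistics.variance(a) / len(a)
    var_b = statistics.variance(b) / len(b)
    t = (statistics.mean(a) - statistics.mean(b)) / math.sqrt(var_a + var_b)
    df = (var_a + var_b) ** 2 / (
        var_a ** 2 / (len(a) - 1) + var_b ** 2 / (len(b) - 1)
    )
    return t, df


# Hypothetical per-episode returns for two conditions (not from the paper).
method_returns = [-20.0, -35.0, -15.0, -60.0, -25.0]
baseline_returns = [-45.0, -80.0, -30.0, -120.0, -55.0]

t_stat, dof = welch_t(method_returns, baseline_returns)
print(f"SEM (method) = {sem(method_returns):.2f}")
print(f"Welch t = {t_stat:.2f} with about {dof:.1f} degrees of freedom")
```

The t-statistic would then be compared against a t-distribution with the computed degrees of freedom to obtain a p-value.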
The failure trigger sensitivity analysis reveals that τ = −11.0 actually outperforms the chosen τ = −1.1, which undermines confidence in the hyperparameter choices and raises questions about how much additional tuning could shift the results.
3. Potential Impact
Practical relevance: The work addresses a real need — adapting LLM agents to stochastic sequential environments without fine-tuning. This is relevant for deployment scenarios where model weights are inaccessible (API-only access) or where fine-tuning is prohibitively expensive. The cyber defense application domain adds practical motivation.
Broader influence: The population broadcast mechanism is domain-agnostic in principle and could transfer to other agentic settings (robotics, game playing, workflow automation). The controlled comparison of memory representations (Rules vs. Examples vs. Mixed) provides actionable guidance: Rules offer ~40% token savings with competitive performance, while Examples achieve slightly better returns. This cost-reliability tradeoff analysis is practically useful.
Limitations on impact: The gap between FORGE's best evaluation mean (~−24.5) and the DRL top score (−3.47) remains large. While a single checkpoint reached −3.60, this appears to be an outlier rather than representative of reliable performance. The protocol requires running 10 parallel instances over 6 stages with 3 attempts each — a substantial compute budget that may limit adoption for cost-sensitive applications.
4. Timeliness & Relevance
The paper addresses a timely bottleneck: how to improve LLM agent performance in complex environments without gradient updates. This sits at the intersection of several active research threads — prompt-only self-improvement, test-time adaptation, and agentic memory systems. The positioning relative to Reflexion, Voyager, ExpeL, CLIN, Dynamic Cheatsheet, and ACE is thorough and well-argued.
The choice of CybORG CAGE-2 as a testbed is both a strength (it provides a rigorous, well-benchmarked stochastic POMDP) and a limitation (it is a niche domain that may limit the paper's audience). The finding that weaker models benefit disproportionately is timely given the rapid proliferation of LLMs of varying capability levels.
5. Strengths & Limitations
Key Strengths:
Notable Weaknesses:
Missing comparisons: The paper would benefit from comparing against simpler baselines like best-of-N sampling (running N independent zero-shot episodes and selecting the best), which would help quantify how much of the improvement comes from memory evolution versus simple selection pressure.
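A best-of-N baseline of the kind suggested above could be sketched as follows. The function `best_of_n` and the toy episode are hypothetical, and the Gaussian return distribution is only a placeholder for real zero-shot rollouts. Because the maximum of N samples is an optimistic statistic, a fair comparison would presumably re-evaluate whatever is selected on fresh episodes.

```python
import random


def best_of_n(run_episode, n, seed=0):
    """Run n independent episodes; return (best return, mean return)."""
    rng = random.Random(seed)
    returns = [run_episode(rng) for _ in range(n)]
    return max(returns), sum(returns) / n


# Hypothetical zero-shot episode: a noisy, strongly negative return.
def toy_episode(rng):
    return rng.gauss(-40.0, 20.0)


best, mean = best_of_n(toy_episode, n=10)
print(f"best of 10: {best:.1f}, mean of 10: {mean:.1f}")
```

The gap between the best and the mean return indicates how much apparent improvement selection alone can produce without any memory evolution.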
Additional Observations
The paper is well-written with clear figures and comprehensive appendices. The artifact availability (Zenodo + GitHub) supports reproducibility. The token cost analysis is a valuable practical contribution often missing from similar work. However, the paper could benefit from a clearer discussion of failure modes — when does FORGE fail to improve, and why do some sessions produce significantly worse results (e.g., Gemini Rules Session 6 at −50.2 vs. Session 4 at −16.3)?
Generated May 18, 2026
Comparison History (36)
Paper 2 demonstrates higher potential impact through superior methodological rigor and advancements in autonomous agent memory. While Paper 1 offers an interesting proof-of-concept for KG-based metacognition, its evaluation is limited to a small sample size (90 queries). Paper 2 tackles the critical challenge of test-time compute and agent memory evolution without weight updates, utilizing comprehensive ablations across four LLM families in a complex POMDP environment. Its insights into population broadcast mechanics provide foundational knowledge for scaling self-improving, multi-agent AI systems.
Paper 1 (FORGE) introduces a more novel and broadly applicable framework—population-based memory evolution for LLM agents without weight updates—addressing a fundamental challenge in agent learning. It demonstrates rigorous evaluation across 4 LLM families with ablations confirming mechanism contributions. Paper 2 (PPR-GDE) addresses diversity collapse in RL for open-ended generation, which is relevant but more incremental, limited to role-playing tasks, and builds on well-established RLHF/preference optimization paradigms. FORGE's approach to emergent agent memory has broader potential impact across agentic AI applications.
Paper 2 (LMAC) addresses a fundamental challenge in multi-agent reinforcement learning—communication under partial observability—with broader applicability across diverse MARL benchmarks. Its contribution of using LLMs to design communication protocols is novel and generalizable, potentially impacting robotics, autonomous systems, and distributed AI. Paper 1 (FORGE), while methodologically interesting, is evaluated only on a single environment (CAGE-2 B-line) with acknowledged limited generalizability. Paper 2's broader experimental validation and wider applicability to the large MARL community give it higher potential impact.
Paper 2 is likely higher impact: it introduces a large, released, standardized benchmark targeting a major industrial domain with proprietary-documentation grounding and end-to-end workflow evaluation, enabling broad, reproducible comparisons and serving as shared infrastructure for many follow-on studies. Its “Execution Wall” finding is a timely, actionable diagnostic for deploying LLM agents in real systems and is likely to influence evaluation methodology across domains. Paper 1 is novel and shows strong gains, but evidence is confined to a single environment/attacker setting, limiting generality and cross-field uptake.
Paper 1 presents a concrete, novel algorithmic protocol (population-broadcast self-evolving memory with no weight updates) with substantial empirical gains on a challenging, stochastic long-horizon cyber-defense benchmark, plus ablations identifying key mechanisms—supporting methodological rigor and near-term applicability. Its approach is timely for agent reliability and can plausibly transfer across agentic tasks where prompt-memory is used. Paper 2 is timely and potentially broadly influential conceptually, but as a position paper it offers limited empirical validation and actionable implementation detail, making near-term scientific/engineering impact less certain.
FORGE introduces a novel, broadly applicable framework for improving LLM agent performance through self-evolving memory without weight updates—a paradigm with wide applicability across agentic AI tasks. Its population-based broadcast mechanism is a genuinely new contribution with clear ablation evidence. Paper 1, while addressing an important gap (logicality in scientific reasoning), is more incremental, focused narrowly on physics QA benchmarks, and primarily contributes a data curation methodology rather than a fundamentally new technique. FORGE's cross-model generality and practical implications for resource-efficient agent improvement give it broader potential impact.
Paper 1 focuses on automatic equation discovery (symbolic regression), a fundamental challenge in 'AI for Science' with broad applicability across physics, biology, and chemistry. Its ability to reliably extract symbolic laws from data promises direct impact on scientific discovery. While Paper 2 presents an interesting population-based memory evolution for agents, its empirical validation is strictly confined to a single cybersecurity POMDP environment, significantly limiting its proven generalizability and immediate cross-disciplinary scientific impact compared to Paper 1.
Paper 2 (FORGE) presents a concrete, empirically validated method for improving LLM agent performance without weight updates, demonstrating clear quantitative gains across multiple model families on a challenging benchmark. Its contributions—population-based memory evolution, broadcast mechanisms, and representation comparisons—are immediately actionable and broadly applicable to the growing field of LLM agents. Paper 1 (SEED) introduces a conceptual framework for experimental design representation that, while intellectually interesting, is more niche, relies on a lightweight feasibility test rather than rigorous empirical validation, and addresses a narrower audience primarily concerned with experimental methodology for AI systems.
Paper 2 has higher potential impact due to a clearer, broadly applicable framing (personalization failures as commitment/constraint management, not recall) and a method that is model-agnostic and evaluable via verifiable guarantees (“zero failures within validator scope”) with explicit trade-offs (availability vs safety, recall limits). The approach generalizes across many personalized LLM products (assistants, agents, long-context, memory tools) and is timely amid deployment concerns. Paper 1 is useful and novel for agent self-improvement, but evidence is confined to a single benchmark/attacker setting and may generalize less beyond agent RL-style tasks.
Paper 1 introduces a training-free, plug-and-play memory module applicable to a wide range of LLMs without requiring gradient updates. Its broad evaluation across various model sizes, modalities, and general benchmarks (code, QA) suggests high versatility and ease of adoption. In contrast, Paper 2 presents a prompt-based memory evolution framework evaluated on a single, highly specific cybersecurity environment. Consequently, Paper 1 possesses significantly higher breadth of impact and potential for widespread integration into existing LLM architectures.
Paper 2 likely has higher scientific impact: it proposes a unified, intervention-aware framework for clinical trajectory prediction that connects forecasting, counterfactual estimation, and policy evaluation while explicitly treating treatment/observation feedback—central limitations in current clinical AI. Its potential real-world applications (decision-grade evidence, treatment policy stress-testing, safer learning health systems) are broad and timely, spanning medicine, causal inference, time-series modeling, and health policy. Paper 1 is novel and empirically solid within LLM agent training-without-updates, but its evidence is confined to a single benchmark setting, limiting breadth and external validity.
Paper 2 is a comprehensive survey covering AI for inverse PDE problems—a foundational topic spanning medical imaging, geophysics, materials science, and aerodynamics. Its breadth of impact across multiple scientific and engineering fields, combined with its role as the first unified systematic review of this area, gives it high citation potential and broad relevance. Paper 1, while methodologically interesting, addresses a narrow domain (a single cybersecurity benchmark) with explicitly acknowledged limited generalizability, restricting its broader scientific impact.
Paper 1 introduces a novel, generalizable framework (FORGE) for evolving LLM agent memory without weight updates, demonstrating systematic methodology across multiple LLM families with ablation studies. Its contributions to population-based prompt evolution and memory propagation have broader applicability beyond the specific benchmark. Paper 2, while practically impactful at Baidu Maps scale, is more of an engineering contribution with incremental improvements to an existing production system. Paper 1's methodological novelty, cross-model generalizability, and potential to influence the growing field of LLM agent self-improvement give it higher scientific impact potential.
Paper 1 addresses a core challenge in AI—enhancing LLM agent decision-making and memory without gradient updates—which has broad applicability across complex environments. Its rigorous evaluation across multiple LLM families and substantial performance gains over strong baselines indicate high methodological rigor and significant potential impact. Paper 2, while novel in combining LLMs with Fuzzy Cognitive Maps for political science modeling, focuses on a more niche methodology and application, likely limiting its broader scientific influence compared to the foundational agent improvements in Paper 1.
Paper 1 introduces a novel, weight-free self-improvement protocol (population broadcast + graduation) for LLM agents and demonstrates sizable, consistent gains across multiple model families in a challenging, stochastic long-horizon cyber-defense POMDP. If robust, this could generalize broadly to agent training/operation where finetuning is infeasible, impacting RL/agents, security automation, and prompt-based learning methods. Paper 2 is a valuable benchmark with clear applications, but its impact is more domain-scoped (Chinese gaming vertical) and primarily evaluative rather than proposing a broadly reusable learning mechanism.
Paper 1 presents a highly novel, generalized approach for LLM agent self-improvement without weight updates, demonstrating significant quantitative gains across multiple state-of-the-art models. Its population-based memory broadcast technique has broad implications across the rapidly growing field of AI agents. In contrast, Paper 2, while highly useful for the EDA community, focuses on benchmark maintenance for RTL generation, giving it a much narrower scope and applicability compared to the fundamental algorithmic advancements proposed in Paper 1.
Paper 2 (FORGE) has higher potential scientific impact due to a more novel algorithmic contribution (population-broadcast, self-evolving memory without weight updates) and broader relevance to agent learning, continual adaptation, and robustness across model families. Its results target a challenging, stochastic, long-horizon POMDP domain (cyber defense) with clear performance gains and ablations identifying key mechanisms, suggesting methodological rigor. Paper 1 (PRISM) is highly applicable and timely for enterprise reliability, but is more of a systems/engineering framework with narrower scientific generality and heavier dependence on platform-specific evaluation and LLM-as-judge metrics.
Paper 1 presents a highly impactful approach to automated neural architecture search, demonstrating progress toward recursive self-improvement where AI designs better AI. Its discovery of architectures that outperform strong baselines like Llama 3.2 at the 1B scale has broad implications for foundation model development across all domains. In contrast, Paper 2 offers a valuable but narrower contribution regarding agent memory, explicitly noting its evidence is confined to a single specific benchmark (CAGE-2 B-line). Paper 1's generalizability and potential to shift the paradigm of foundation model design give it significantly higher scientific impact.
Paper 2 targets competitive programming, a highly rigorous and widely recognized benchmark for complex LLM reasoning. By establishing a new state-of-the-art across multiple established datasets using a novel graph-structured knowledge network for continuous learning, it demonstrates broader applicability and stronger methodological validation than Paper 1, which evaluates its approach on a single specific network-defense environment.
Paper 2 is a position paper that sets a broad research agenda for 'Metacognitive AI', addressing fundamental challenges like efficiency, accuracy, and security across multiple AI domains. While Paper 1 offers a strong, specific algorithmic improvement for LLM agents, Paper 2's conceptual framework, backed by a community software tool, has a higher potential for widespread adoption and paradigm-shifting impact across the broader AI landscape.