FORGE: Self-Evolving Agent Memory With No Weight Updates via Population Broadcast

Igor Bogdanov, Chung-Horng Lung, Thomas Kunz, Jie Gao, Adrian Taylor, Marzia Zaman

May 15, 2026

arXiv:2605.16233v1 PDF

cs.AI (primary), cs.CL, cs.LG, cs.MA, eess.SY

#1518 of 2292 · Artificial Intelligence

Tournament Score: 1371 ± 36 (axis range 1050 to 1800)

Win Rate: 44%

Rating: 5.5/10

Significance 5.5 · Rigor 6 · Novelty 5.5 · Clarity 7.5

Abstract

Can LLM agents improve decision-making through self-generated memory without gradient updates? We propose FORGE (Failure-Optimized Reflective Graduation and Evolution), a staged, population-based protocol that evolves prompt-injected natural-language memory for hierarchical ReAct agents. FORGE wraps a Reflexion-style inner loop, where a dedicated reflection agent (using the same underlying LLM, no distillation from a stronger model) converts failed trajectories into reusable knowledge artifacts: textual heuristics (Rules), few-shot demonstrations (Examples), or both (Mixed), with an outer loop that propagates the best-performing instance's memory to the population between stages and freezes converged instances via a graduation criterion. We evaluate on CybORG CAGE-2, a stochastic network-defense POMDP at a 30-step horizon against the B-line attacker, where all four tested LLM families (Gemini-2.5-Flash-Lite, Grok-4-Fast, Llama-4-Maverick, Qwen3-235B) exhibit strongly negative, heavy-tailed zero-shot rewards. Compared against both a zero-shot baseline and a Reflexion baseline (isolated single-stream learning), FORGE improves average evaluation return by 1.7-7.7$\times$ over zero-shot and by 29-72% over Reflexion in all 12 model-representation conditions, reducing major-failure rates (returns below $-100$) to as low as $\sim$1%. We find that (1) population broadcast is the critical mechanism, with a no-graduation ablation confirming that broadcast carries the performance gains while graduation primarily saves compute; (2) Examples achieves the strongest returns for three of four models, while Rules offers the best cost-reliability profile with $\sim$40% fewer tokens; and (3) weaker baseline models benefit disproportionately, suggesting FORGE may mitigate capability gaps rather than amplify strong models. All evidence is confined to CAGE-2 B-line; cross-family findings are directional evidence.

AI Impact Assessments

(1 model)

Scientific Impact Assessment: FORGE — Self-Evolving Agent Memory With No Weight Updates via Population Broadcast

1. Core Contribution

FORGE introduces a population-based, gradient-free protocol for evolving prompt-injected natural-language memory in LLM agents. The key novelty lies in combining three mechanisms: (1) a Reflexion-style inner loop that converts failed trajectories into reusable knowledge artifacts (Rules, Examples, or Mixed), (2) champion broadcast that propagates the best-performing instance's memory to the entire population between stages, and (3) a graduation criterion that freezes converged instances. The paper explicitly frames this as adapting Population-Based Training (PBT) from weight space to prompt space, which is a conceptually clean and well-motivated transfer.

The central insight — that single-stream verbal self-improvement lacks selection pressure and can accumulate counterproductive artifacts — is well-articulated. The population broadcast mechanism addresses this by providing a form of evolutionary selection over textual artifacts, which is genuinely novel in the context of long-horizon stochastic POMDPs.
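The three mechanisms described in this section can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the `Instance` class, the `run_forge` function, and the `run_episode` and `reflect` callables are hypothetical names, and selecting the champion by mean training return is an assumption (the paper reportedly uses separate checkpoint evaluations for selection). The default values for stages, attempts, trigger threshold, and graduation threshold are the ones quoted elsewhere in this assessment.

```python
from dataclasses import dataclass, field


@dataclass
class Instance:
    """One population member: prompt-injected memory plus its status."""
    memory: list = field(default_factory=list)  # text artifacts (rules/examples)
    graduated: bool = False
    score: float = float("-inf")


def run_forge(population, run_episode, reflect, stages=6, attempts=3,
              trigger=-1.1, graduation=-15.0):
    """Staged loop: Reflexion-style inner loop, then champion broadcast.

    run_episode(memory) -> (episode_return, trajectory)   # hypothetical
    reflect(trajectory, memory) -> new text artifact      # hypothetical
    """
    for _ in range(stages):
        for inst in population:
            if inst.graduated:
                continue  # graduated instances are frozen and skip training
            returns = []
            for _ in range(attempts):
                ret, trajectory = run_episode(inst.memory)
                returns.append(ret)
                if ret < trigger:
                    # Failed episode: convert the trajectory into an artifact.
                    inst.memory.append(reflect(trajectory, inst.memory))
            # Assumption: score an instance by its mean return this stage.
            inst.score = sum(returns) / len(returns)
            if inst.score >= graduation:
                inst.graduated = True
        # Champion broadcast: overwrite every non-graduated memory with a
        # copy of the champion's memory (a destructive full replacement).
        champion = max(population, key=lambda i: i.score)
        for inst in population:
            if not inst.graduated and inst is not champion:
                inst.memory = list(champion.memory)
    return max(population, key=lambda i: i.score)
```

Removing the broadcast step from this sketch leaves a set of isolated Reflexion streams, which is the baseline the paper compares against.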

2. Methodological Rigor

Strengths: The experimental design is commendable in several respects. The paper includes proper ablations (no-graduation variant, Reflexion baseline, zero-shot baseline), a sensitivity sweep over the failure trigger threshold, and evaluates across four distinct LLM families. The distinction between checkpoint evaluations (used for selection) and post-session evaluations (used for reporting) is methodologically clean. The total scale — 116 experiments, 2,640 episodes, ~12.4B tokens — represents substantial computational investment.

Concerns: The primary weakness is evaluation scope. All evidence comes from a single environment (CybORG CAGE-2), a single attacker type (B-line), and a single horizon (30 steps). The authors are transparent about this limitation, but it fundamentally constrains the strength of claims about the protocol's generality. The non-Gemini models receive only 3-4 FORGE sessions per representation, making cross-family conclusions genuinely "directional" as claimed.

The statistical reporting is adequate but could be stronger — standard errors of the mean are shown in figures, but formal significance tests are absent. Given the high variance in returns (SDs often exceeding 30-50), some of the claimed improvements, particularly for models with fewer sessions, may not be statistically robust.
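One lightweight way to supplement standard errors with an interval estimate is a percentile bootstrap on the difference in mean return between two conditions. The sketch below is illustrative only: the function names are hypothetical, it uses only the Python standard library, and it contains no data from the paper. The bootstrap is a swapped-in suggestion here, not a method the authors report using.

```python
import random
import statistics


def sem(values):
    """Standard error of the mean: sample SD divided by sqrt(n)."""
    return statistics.stdev(values) / len(values) ** 0.5


def bootstrap_diff_ci(a, b, n_boot=10_000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for mean(a) - mean(b).

    a and b are lists of per-episode (or per-session) returns for two
    conditions, e.g. a population method versus a single-stream baseline.
    """
    rng = random.Random(seed)
    diffs = []
    for _ in range(n_boot):
        # Resample each condition with replacement at its original size.
        resampled_a = [rng.choice(a) for _ in a]
        resampled_b = [rng.choice(b) for _ in b]
        diffs.append(statistics.fmean(resampled_a) - statistics.fmean(resampled_b))
    diffs.sort()
    lower = diffs[int((alpha / 2) * n_boot)]
    upper = diffs[int((1 - alpha / 2) * n_boot) - 1]
    return lower, upper
```

If the resulting interval excludes zero, the improvement is less likely to be an artifact of return variance. With only 3-4 sessions per condition, the interval will be wide and should be read cautiously.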

The failure trigger sensitivity analysis reveals that τ = −11.0 actually outperforms the chosen τ = −1.1, which undermines confidence in the hyperparameter choices and raises questions about how much additional tuning could shift the results.

3. Potential Impact

Practical relevance: The work addresses a real need — adapting LLM agents to stochastic sequential environments without fine-tuning. This is relevant for deployment scenarios where model weights are inaccessible (API-only access) or where fine-tuning is prohibitively expensive. The cyber defense application domain adds practical motivation.

Broader influence: The population broadcast mechanism is domain-agnostic in principle and could transfer to other agentic settings (robotics, game playing, workflow automation). The controlled comparison of memory representations (Rules vs. Examples vs. Mixed) provides actionable guidance: Rules offer ~40% token savings with competitive performance, while Examples achieve slightly better returns. This cost-reliability tradeoff analysis is practically useful.

Limitations on impact: The gap between FORGE's best evaluation mean (~−24.5) and the DRL top score (−3.47) remains large. While a single checkpoint reached −3.60, this appears to be an outlier rather than representative of reliable performance. The protocol requires running 10 parallel instances over 6 stages with 3 attempts each — a substantial compute budget that may limit adoption for cost-sensitive applications.
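To put the compute budget in concrete terms, the following back-of-the-envelope sketch counts training episodes per session. It reads "3 attempts each" as three training episodes per instance per stage, which is an assumption. It gives an upper bound only, because graduated instances stop training early, and it excludes evaluation episodes. The function names are hypothetical.

```python
def max_training_episodes(instances=10, stages=6, attempts=3):
    """Upper bound on training episodes in one session (no graduation)."""
    return instances * stages * attempts


def max_env_steps(horizon=30, **budget):
    """Upper bound on environment steps at the given episode horizon."""
    return horizon * max_training_episodes(**budget)


print(max_training_episodes())  # 10 * 6 * 3 = 180
print(max_env_steps())          # 180 * 30 = 5400
```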

4. Timeliness & Relevance

The paper addresses a timely bottleneck: how to improve LLM agent performance in complex environments without gradient updates. This sits at the intersection of several active research threads — prompt-only self-improvement, test-time adaptation, and agentic memory systems. The positioning relative to Reflexion, Voyager, ExpeL, CLIN, Dynamic Cheatsheet, and ACE is thorough and well-argued.

The choice of CybORG CAGE-2 as a testbed is both a strength (it provides a rigorous, well-benchmarked stochastic POMDP) and a limitation (it is a niche domain that may limit audience). The finding that weaker models benefit disproportionately is timely given the rapid proliferation of LLMs of varying capability levels.

5. Strengths & Limitations

Key Strengths:

Clean conceptual framework: PBT → prompt space is an intuitive and well-executed transfer

Thorough ablation design isolating broadcast vs. graduation contributions

Multi-model evaluation across four LLM families

Honest and precise scoping of claims (all caveats explicitly stated)

Complete raw data in appendices enabling full reproducibility verification

The finding that broadcast (not graduation) drives performance is a clean, actionable insight

Notable Weaknesses:

Single-domain evaluation severely limits generalizability claims

The hierarchical ReAct agent architecture introduces many design choices (Planner/Analyst/ActionChooser decomposition, prompt structures) whose interactions with FORGE are not disentangled

No comparison with parameter-efficient fine-tuning or TextGrad, acknowledged but still a gap

Champion broadcast is destructive (full memory replacement), discarding potentially valuable diversity — the paper acknowledges this but doesn't explore alternatives

The graduation threshold θ = −15 and other hyperparameters appear chosen without systematic tuning, and the trigger threshold analysis suggests the chosen values are suboptimal

Missing comparisons: The paper would benefit from comparing against simpler baselines like best-of-N sampling (running N independent zero-shot episodes and selecting the best), which would help quantify how much of the improvement comes from memory evolution versus simple selection pressure.
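The suggested best-of-N baseline could look roughly like the sketch below. `run_zero_shot_episode` is a hypothetical callable that runs one memory-free episode and returns its total reward; nothing here comes from the paper's code.

```python
def best_of_n(run_zero_shot_episode, n):
    """Run n independent zero-shot episodes and keep the best return.

    Returns (best_return, all_returns) so the selection effect can be
    compared against the mean of the same n episodes.
    """
    if n < 1:
        raise ValueError("n must be at least 1")
    returns = [run_zero_shot_episode() for _ in range(n)]
    return max(returns), returns
```

For this comparison to be fair, n would need to match the number of episodes FORGE consumes. Note also that in a stochastic environment the best of n past episodes is not a reusable policy, so the gap between this number and FORGE's held-out evaluation return is the quantity of interest.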

Additional Observations

The paper is well-written with clear figures and comprehensive appendices. The artifact availability (Zenodo + GitHub) supports reproducibility. The token cost analysis is a valuable practical contribution often missing from similar work. However, the paper could benefit from a clearer discussion of failure modes — when does FORGE fail to improve, and why do some sessions produce significantly worse results (e.g., Gemini Rules Session 6 at −50.2 vs. Session 4 at −16.3)?

Rating: 5.5/10

Significance 5.5 · Rigor 6 · Novelty 5.5 · Clarity 7.5

Generated May 18, 2026

Comparison History (36)

vs. Enhancing Metacognitive AI: Knowledge-Graph Population with Graph-Theoretic LLM Enrichment

gemini-3.1 · 5/19/2026

Paper 2 demonstrates higher potential impact through superior methodological rigor and advancements in autonomous agent memory. While Paper 1 offers an interesting proof-of-concept for KG-based metacognition, its evaluation is limited to a small sample size (90 queries). Paper 2 tackles the critical challenge of test-time compute and agent memory evolution without weight updates, utilizing comprehensive ablations across four LLM families in a complex POMDP environment. Its insights into population broadcast mechanics provide foundational knowledge for scaling self-improving, multi-agent AI systems.

vs. Pairwise Preference Reward and Group-Based Diversity Enhancement for Superior Open-Ended Generation

claude-opus-4.6 · 5/19/2026

Paper 1 (FORGE) introduces a more novel and broadly applicable framework—population-based memory evolution for LLM agents without weight updates—addressing a fundamental challenge in agent learning. It demonstrates rigorous evaluation across 4 LLM families with ablations confirming mechanism contributions. Paper 2 (PPR-GDE) addresses diversity collapse in RL for open-ended generation, which is relevant but more incremental, limited to role-playing tasks, and builds on well-established RLHF/preference optimization paradigms. FORGE's approach to emergent agent memory has broader potential impact across agentic AI applications.

vs. LLM-Guided Communication for Cooperative Multi-Agent Reinforcement Learning

claude-opus-4.6 · 5/19/2026

Paper 2 (LMAC) addresses a fundamental challenge in multi-agent reinforcement learning—communication under partial observability—with broader applicability across diverse MARL benchmarks. Its contribution of using LLMs to design communication protocols is novel and generalizable, potentially impacting robotics, autonomous systems, and distributed AI. Paper 1 (FORGE), while methodologically interesting, is evaluated only on a single environment (CAGE-2 B-line) with acknowledged limited generalizability. Paper 2's broader experimental validation and wider applicability to the large MARL community give it higher potential impact.

vs. TeleCom-Bench: How Far Are Large Language Models from Industrial Telecommunication Applications?

gpt-5.2 · 5/19/2026

Paper 2 is likely higher impact: it introduces a large, released, standardized benchmark targeting a major industrial domain with proprietary-documentation grounding and end-to-end workflow evaluation, enabling broad, reproducible comparisons and serving as shared infrastructure for many follow-on studies. Its “Execution Wall” finding is a timely, actionable diagnostic for deploying LLM agents in real systems and is likely to influence evaluation methodology across domains. Paper 1 is novel and shows strong gains, but evidence is confined to a single environment/attacker setting, limiting generality and cross-field uptake.

vs. Position: A Three-Layer Probabilistic Assume-Guarantee Architecture Is Structurally Required for Safe LLM Agent Deployment

gpt-5.2 · 5/19/2026

Paper 1 presents a concrete, novel algorithmic protocol (population-broadcast self-evolving memory with no weight updates) with substantial empirical gains on a challenging, stochastic long-horizon cyber-defense benchmark, plus ablations identifying key mechanisms—supporting methodological rigor and near-term applicability. Its approach is timely for agent reliability and can plausibly transfer across agentic tasks where prompt-memory is used. Paper 2 is timely and potentially broadly influential conceptually, but as a position paper it offers limited empirical validation and actionable implementation detail, making near-term scientific/engineering impact less certain.

vs. Scientific Logicality Enriched Methodology for LLM Reasoning: A Practice in Physics

claude-opus-4.6 · 5/19/2026

FORGE introduces a novel, broadly applicable framework for improving LLM agent performance through self-evolving memory without weight updates—a paradigm with wide applicability across agentic AI tasks. Its population-based broadcast mechanism is a genuinely new contribution with clear ablation evidence. Paper 1, while addressing an important gap (logicality in scientific reasoning), is more incremental, focused narrowly on physics QA benchmarks, and primarily contributes a data curation methodology rather than a fundamentally new technique. FORGE's cross-model generality and practical implications for resource-efficient agent improvement give it broader potential impact.

vs. STRIDE: A Self-Reflective Agent Framework for Reliable Automatic Equation Discovery

gemini-3.1 · 5/19/2026

Paper 1 focuses on automatic equation discovery (symbolic regression), a fundamental challenge in 'AI for Science' with broad applicability across physics, biology, and chemistry. Its ability to reliably extract symbolic laws from data promises direct impact on scientific discovery. While Paper 2 presents an interesting population-based memory evolution for agents, its empirical validation is strictly confined to a single cybersecurity POMDP environment, significantly limiting its proven generalizability and immediate cross-disciplinary scientific impact compared to Paper 1.

vs. Agents for Experiments, Experiments for Agents: A Design Grammar for AI-Enabled Experimental Science

claude-opus-4.6 · 5/19/2026

Paper 2 (FORGE) presents a concrete, empirically validated method for improving LLM agent performance without weight updates, demonstrating clear quantitative gains across multiple model families on a challenging benchmark. Its contributions—population-based memory evolution, broadcast mechanisms, and representation comparisons—are immediately actionable and broadly applicable to the growing field of LLM agents. Paper 1 (SEED) introduces a conceptual framework for experimental design representation that, while intellectually interesting, is more niche, relies on a lightweight feasibility test rather than rigorous empirical validation, and addresses a narrower audience primarily concerned with experimental methodology for AI systems.

vs. Recall Isn't Enough: Bounding Commitments in Personalized Language Systems

gpt-5.2 · 5/19/2026

Paper 2 has higher potential impact due to a clearer, broadly applicable framing (personalization failures as commitment/constraint management, not recall) and a method that is model-agnostic and evaluable via verifiable guarantees (“zero failures within validator scope”) with explicit trade-offs (availability vs safety, recall limits). The approach generalizes across many personalized LLM products (assistants, agents, long-context, memory tools) and is timely amid deployment concerns. Paper 1 is useful and novel for agent self-improvement, but evidence is confined to a single benchmark/attacker setting and may generalize less beyond agent RL-style tasks.

vs. NGM: A Plug-and-Play Training-Free Memory Module for LLMs

gemini-3.1 · 5/19/2026

Paper 1 introduces a training-free, plug-and-play memory module applicable to a wide range of LLMs without requiring gradient updates. Its broad evaluation across various model sizes, modalities, and general benchmarks (code, QA) suggests high versatility and ease of adoption. In contrast, Paper 2 presents a prompt-based memory evolution framework evaluated on a single, highly specific cybersecurity environment. Consequently, Paper 1 possesses significantly higher breadth of impact and potential for widespread integration into existing LLM architectures.

vs. From Static Risk to Dynamic Trajectories: Toward World-Model-Inspired Clinical Prediction

gpt-5.2 · 5/19/2026

Paper 2 likely has higher scientific impact: it proposes a unified, intervention-aware framework for clinical trajectory prediction that connects forecasting, counterfactual estimation, and policy evaluation while explicitly treating treatment/observation feedback—central limitations in current clinical AI. Its potential real-world applications (decision-grade evidence, treatment policy stress-testing, safer learning health systems) are broad and timely, spanning medicine, causal inference, time-series modeling, and health policy. Paper 1 is novel and empirically solid within LLM agent training-without-updates, but its evidence is confined to a single benchmark setting, limiting breadth and external validity.

vs. Harnessing AI for Inverse Partial Differential Equation Problems: Past, Present, and Prospects

claude-opus-4.6 · 5/19/2026

Paper 2 is a comprehensive survey covering AI for inverse PDE problems—a foundational topic spanning medical imaging, geophysics, materials science, and aerodynamics. Its breadth of impact across multiple scientific and engineering fields, combined with its role as the first unified systematic review of this area, gives it high citation potential and broad relevance. Paper 1, while methodologically interesting, addresses a narrow domain (a single cybersecurity benchmark) with explicitly acknowledged limited generalizability, restricting its broader scientific impact.

vs. DuIVRS-2: An LLM-based Interactive Voice Response System for Large-scale POI Attribute Acquisition

claude-opus-4.6 · 5/19/2026

Paper 1 introduces a novel, generalizable framework (FORGE) for evolving LLM agent memory without weight updates, demonstrating systematic methodology across multiple LLM families with ablation studies. Its contributions to population-based prompt evolution and memory propagation have broader applicability beyond the specific benchmark. Paper 2, while practically impactful at Baidu Maps scale, is more of an engineering contribution with incremental improvements to an existing production system. Paper 1's methodological novelty, cross-model generalizability, and potential to influence the growing field of LLM agent self-improvement give it higher scientific impact potential.

vs. Agentic Chunking and Bayesian De-chunking of AI Generated Fuzzy Cognitive Maps: A Model of the Thucydides Trap

gemini-3.1 · 5/19/2026

Paper 1 addresses a core challenge in AI—enhancing LLM agent decision-making and memory without gradient updates—which has broad applicability across complex environments. Its rigorous evaluation across multiple LLM families and substantial performance gains over strong baselines indicate high methodological rigor and significant potential impact. Paper 2, while novel in combining LLMs with Fuzzy Cognitive Maps for political science modeling, focuses on a more niche methodology and application, likely limiting its broader scientific influence compared to the foundational agent improvements in Paper 1.

vs. SVFSearch: A Multimodal Knowledge-Intensive Benchmark for Short-Video Frame Search in the Gaming Vertical Domain

gpt-5.2 · 5/19/2026

Paper 1 introduces a novel, weight-free self-improvement protocol (population broadcast + graduation) for LLM agents and demonstrates sizable, consistent gains across multiple model families in a challenging, stochastic long-horizon cyber-defense POMDP. If robust, this could generalize broadly to agent training/operation where finetuning is infeasible, impacting RL/agents, security automation, and prompt-based learning methods. Paper 2 is a valuable benchmark with clear applications, but its impact is more domain-scoped (Chinese gaming vertical) and primarily evaluative rather than proposing a broadly reusable learning mechanism.

vs. RTL-BenchMT: Dynamic Maintenance of RTL Generation Benchmark Through Agent-Assisted Analysis and Revision

gemini-3.1 · 5/18/2026

Paper 1 presents a highly novel, generalized approach for LLM agent self-improvement without weight updates, demonstrating significant quantitative gains across multiple state-of-the-art models. Its population-based memory broadcast technique has broad implications across the rapidly growing field of AI agents. In contrast, Paper 2, while highly useful for the EDA community, focuses on benchmark maintenance for RTL generation, giving it a much narrower scope and applicability compared to the fundamental algorithmic advancements proposed in Paper 1.

vs. PRISM: Prompt Reliability via Iterative Simulation and Monitoring for Enterprise Conversational AI

gpt-5.2 · 5/18/2026

Paper 2 (FORGE) has higher potential scientific impact due to a more novel algorithmic contribution (population-broadcast, self-evolving memory without weight updates) and broader relevance to agent learning, continual adaptation, and robustness across model families. Its results target a challenging, stochastic, long-horizon POMDP domain (cyber defense) with clear performance gains and ablations identifying key mechanisms, suggesting methodological rigor. Paper 1 (PRISM) is highly applicable and timely for enterprise reliability, but is more of a systems/engineering framework with narrower scientific generality and heavier dependence on platform-specific evaluation and LLM-as-judge metrics.

vs. Agentic Discovery of Neural Architectures: AIRA-Compose and AIRA-Design

gemini-3.1 · 5/18/2026

Paper 1 presents a highly impactful approach to automated neural architecture search, demonstrating progress toward recursive self-improvement where AI designs better AI. Its discovery of architectures that outperform strong baselines like Llama 3.2 at the 1B scale has broad implications for foundation model development across all domains. In contrast, Paper 2 offers a valuable but narrower contribution regarding agent memory, explicitly noting its evidence is confined to a single specific benchmark (CAGE-2 B-line). Paper 1's generalizability and potential to shift the paradigm of foundation model design give it significantly higher scientific impact.

vs. Solvita: Enhancing Large Language Models for Competitive Programming via Agentic Evolution

gemini-3.1 · 5/18/2026

Paper 2 targets competitive programming, a highly rigorous and widely recognized benchmark for complex LLM reasoning. By establishing a new state-of-the-art across multiple established datasets using a novel graph-structured knowledge network for continuous learning, it demonstrates broader applicability and stronger methodological validation than Paper 1, which evaluates its approach on a single specific network-defense environment.

vs. Position: Artificial Intelligence Needs Meta Intelligence -- the Case for Metacognitive AI

gemini-3.1 · 5/18/2026

Paper 2 is a position paper that sets a broad research agenda for 'Metacognitive AI', addressing fundamental challenges like efficiency, accuracy, and security across multiple AI domains. While Paper 1 offers a strong, specific algorithmic improvement for LLM agents, Paper 2's conceptual framework, backed by a community software tool, has a higher potential for widespread adoption and paradigm-shifting impact across the broader AI landscape.