MemQ: Integrating Q-Learning into Self-Evolving Memory Agents over Provenance DAGs

Junwei Liao, Haoting Shi, Ruiwen Zhou, Jiaqian Wang, Shengtao Zhang, Wei Zhang, Weinan Zhang, Ying Wen

May 8, 2026

arXiv:2605.08374v1

cs.AI (primary)


#205 of 2292 · Artificial Intelligence

Tournament Score: 1517±47 (scale 1050 to 1800)

Win Rate: 89%

Rating: 6.8/10

Significance 7 · Rigor 6.5 · Novelty 7.5 · Clarity 7.5


Abstract

Episodic memory allows LLM agents to accumulate and retrieve experience, but current methods treat each memory independently, i.e., evaluating retrieval quality in isolation without accounting for the dependency chains through which memories enable the creation of future memories. We introduce MemQ, which applies TD($λ$) eligibility traces to memory Q-values, propagating credit backward through a provenance DAG that records which memories were retrieved when each new memory was created. Credit weight decays as $(γλ)^{d}$ with DAG depth $d$, replacing temporal distance with structural proximity. We formalize the setting as an Exogenous-Context MDP, whose factored transition decouples the exogenous task stream from the endogenous memory store. Across six benchmarks, spanning OS interaction, function calling, code generation, multimodal reasoning, embodied reasoning, and expert-level QA, MemQ achieves the highest success rate on all six in generalization evaluation and runtime learning, with gains largest on multi-step tasks that produce deep and relevant provenance chains (up to +5.7 pp) and smallest on single-step classification (+0.77 pp) where single-step updates already suffice. We further study how $γ$ and $λ$ interact with the EC-MDP structure, providing principled guidance for parameter selection and future research. Code will be available soon.

AI Impact Assessments

(1 model)

Scientific Impact Assessment: MemQ

1. Core Contribution

MemQ addresses a genuine and previously underexplored problem: multi-step credit assignment in episodic memory for LLM agents. The key insight is that memories are not independent—they form dependency chains where early memories enable the creation of later ones, which in turn contribute to task success. Existing methods like MemRL update each memory's value in isolation (γ=0, single-step EMA), missing these causal chains entirely.
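For contrast with the multi-step scheme, the single-step baseline described above can be sketched as follows. This is only an illustration of a γ=0 exponential-moving-average update, not MemRL's actual code: the function name `ema_update`, the step size `alpha`, and the memory ids are all assumptions made for the example.

```python
def ema_update(q: float, reward: float, alpha: float = 0.1) -> float:
    """Single-step EMA: move q toward the observed reward.

    With gamma = 0 there is no bootstrapping, so nothing flows to ancestors.
    """
    return q + alpha * (reward - q)


# Hypothetical memory store: memory id -> Q-value.
q_values = {"m_root": 0.5, "m_child": 0.5}

# "m_child" was retrieved for a successful task (reward 1.0).
# "m_root" helped create "m_child" but was not retrieved, so it is untouched.
q_values["m_child"] = ema_update(q_values["m_child"], reward=1.0)
```

In this sketch the foundational memory `m_root` receives no credit, which is the gap the review attributes to isolated updates.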

The paper makes three interconnected contributions: (1) formalizing the setting as an Exogenous-Context MDP (EC-MDP) that cleanly separates the exogenous task stream from the endogenous memory store; (2) introducing the provenance DAG, a data structure tracking which memories were retrieved when each new memory was created; and (3) applying TD(λ) eligibility traces over this DAG structure, where credit decays as (γλ)^d with DAG depth d rather than temporal distance. This is a conceptually clean translation of a well-understood RL mechanism into a novel structural domain.
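A minimal sketch of depth-decayed credit over a provenance DAG may make contribution (3) more concrete. It is not the authors' implementation (their code is not yet released): the `parents` mapping, the step size `alpha`, the per-memory error term `reward - q[m]`, and the default values of `gamma` and `lam` are assumptions chosen for illustration. Only the `(γλ)^d` weighting and the depth cap come from the text.

```python
from collections import deque


def propagate_credit(parents, q, retrieved, reward,
                     gamma=0.5, lam=0.9, alpha=0.1, max_depth=4):
    """Walk the provenance DAG backward from the retrieved memories.

    Breadth-first search assigns each ancestor its shortest depth d, and the
    update at that memory is scaled by (gamma * lam) ** d.
    Simplification: each memory uses its own error (reward - q[m]).
    """
    depth = {m: 0 for m in retrieved}
    queue = deque(retrieved)
    while queue:
        m = queue.popleft()
        d = depth[m]
        weight = (gamma * lam) ** d
        q[m] += alpha * weight * (reward - q[m])
        if d == max_depth:
            continue  # do not expand beyond the depth cap
        for p in parents.get(m, ()):
            if p not in depth:  # first visit in BFS is the shortest depth
                depth[p] = d + 1
                queue.append(p)
    return depth


# Hypothetical chain: m1 was retrieved when m2 was created, m2 when m3 was.
parents = {"m3": ["m2"], "m2": ["m1"], "m1": []}
q = {"m1": 0.5, "m2": 0.5, "m3": 0.5}
depths = propagate_credit(parents, q, retrieved=["m3"], reward=1.0)
```

With these illustrative settings the weights at depths 0, 1, and 2 are 1, 0.45, and 0.2025, so the root memory still receives some credit, just less than the memory that was directly retrieved.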

2. Methodological Rigor

Strengths: The EC-MDP formalization is well-motivated and mathematically sound. The factored transition kernel (Eq. 1) cleanly justifies why the exogenous task stream introduces variance without causal signal, providing principled guidance for parameter selection. The conditional independence result (Eq. 2) is a useful property that simplifies learning. The connection between DAG depth and temporal step count in classical TD(λ) is natural and well-articulated.
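The assessment does not reproduce Eq. 1, so the following is only one plausible shape for a factored transition of this kind, written under assumed notation: $c_t$ for the task context, $M_t$ for the memory store, $a_t$ for the agent's action, and the labels $P_{\text{exo}}$ and $P_{\text{endo}}$ are all hypothetical.

```latex
P(c_{t+1}, M_{t+1} \mid c_t, M_t, a_t)
  = \underbrace{P_{\text{exo}}(c_{t+1} \mid c_t)}_{\text{task stream}}
    \;\cdot\;
    \underbrace{P_{\text{endo}}(M_{t+1} \mid M_t, c_t, a_t)}_{\text{memory store}}
```

Under a factorization of this form, the next task does not depend on the memory store or the action given the current task, which is consistent with the review's reading that the task stream contributes variance but no causal signal about memory quality.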

Concerns: The independent contribution assumption (Eq. 3) — decomposing set-level Q-values as averages of per-memory Q-values — is a strong approximation that ignores synergies and redundancies between memories. The paper does not empirically validate this assumption or analyze when it might fail. The monotonic memory growth assumption (M_{t+1} ⊇ M_t) is acknowledged as a limitation but sidesteps important practical concerns about memory capacity.
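The averaging approximation mentioned above can be illustrated in a few lines. This is a sketch of the general idea only; the function name `set_q` and the example values are assumptions, not the paper's Eq. 3.

```python
def set_q(q, retrieved):
    """Approximate the value of a retrieved set as the mean of per-memory Q-values."""
    return sum(q[m] for m in retrieved) / len(retrieved)


# Hypothetical per-memory values.
q = {"a": 0.8, "a_copy": 0.8, "b": 0.4}

redundant_pair = set_q(q, ["a", "a_copy"])   # two near-duplicate memories
complementary_pair = set_q(q, ["a", "b"])    # two different memories
```

Because the estimate depends only on the individual values, the near-duplicate pair scores higher than the complementary pair even though the second memory in it adds no new information. That is the kind of redundancy or synergy effect the averaging assumption cannot represent.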

The experimental design covers six benchmarks across diverse domains, which is commendable. However, several issues weaken confidence: (a) hyperparameters vary substantially across benchmarks (γ ranges from 0.3 to 0.5, λ from 0.7 to 0.99), and it's unclear how sensitive results are to these choices beyond the ablation studies on two benchmarks; (b) some improvements are within or near standard error bounds (e.g., GPQA tie, MMMU Pro +0.77 pp); (c) the paper reports 3 seeds, which is minimal for claims of statistical significance.

3. Potential Impact

The work addresses a real bottleneck in the emerging field of self-evolving LLM agents. As agents are increasingly deployed in long-horizon, multi-step settings, the ability to properly credit foundational memories that indirectly contribute to success becomes critical. The provenance DAG concept is general and could be adopted by other memory management systems.

Practical applicability: The method is non-parametric (no gradient updates to the LLM), making it compatible with any frozen LLM backbone. The overhead is bounded (BFS to depth D=4), and the implementation appears straightforward. However, the reliance on per-benchmark hyperparameter tuning may limit out-of-the-box adoption.

Broader influence: The EC-MDP formalism could prove useful beyond MemQ, providing a general framework for analyzing agent systems where external task distributions are independent of internal state evolution. The insight about γ trusting structure while λ distrusts noise is elegant and could guide future work.

4. Timeliness & Relevance

This paper is highly timely. The LLM agent memory management space is rapidly expanding (the related work section alone cites ~15 papers from 2025-2026). The credit assignment problem is a natural next frontier after single-step value estimation methods like MemRL. The paper positions itself well against this evolving landscape, building directly on MemRL while addressing its key limitation.

5. Strengths & Limitations

Key Strengths:

Clean problem identification: the multi-step credit assignment gap is real and clearly articulated

Elegant theoretical framework: the EC-MDP and provenance DAG are natural abstractions

Broad evaluation: six diverse benchmarks with multiple baselines

Interpretable ablations: the γ/λ analysis provides genuine insight about when and why provenance-based credit helps

The correlation between improvement magnitude and provenance chain depth is a strong validation signal

Notable Weaknesses:

The independent contribution assumption (Eq. 3) is unvalidated and potentially brittle for highly correlated memory sets

Hyperparameters are heavily benchmark-specific with no automated tuning mechanism proposed

Marginal gains on some benchmarks (GPQA, MMMU Pro) weaken the universality claim

No analysis of computational overhead or wall-clock time comparisons

Code is not yet available ("will be available soon"), limiting reproducibility assessment

The paper does not explore failure modes or cases where provenance-based credit assignment could hurt performance

The BFS depth D=4 is fixed across all experiments without justification

Missing comparisons: The paper does not compare against Memento 2 (Wang, 2026), which uses the Reflected MDP formalism that the EC-MDP explicitly generalizes. This would have been a natural and informative baseline.

6. Additional Observations

The gains are most convincing on LLAB, LiveCodeBench, ERQA, and BFCL — tasks with genuine multi-step structure. The narrative that "improvements scale with provenance chain depth" is compelling and well-supported. However, the paper could have been stronger with a direct measurement of actual provenance chain depths across benchmarks, correlating depth distributions with improvement magnitudes.

The initial Q-value inheritance scheme (averaging parent Q-values) is sensible but introduces a potential issue: if early memories have poorly calibrated Q-values, this bias propagates to all descendants. The paper does not discuss convergence properties or stability guarantees.
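The propagation concern can be seen in a small sketch of parent-averaged initialization. The function name `inherit_q`, the `default` value for parentless memories, and the numbers are assumptions for illustration; the paper's exact scheme may differ.

```python
def inherit_q(q, parent_ids, default=0.0):
    """Initial Q for a new memory: mean of its parents' Q-values.

    Falls back to `default` when the memory has no parents.
    """
    if not parent_ids:
        return default
    return sum(q[p] for p in parent_ids) / len(parent_ids)


# Suppose the root's value is overestimated early on.
q = {"root": 0.9}
q["child"] = inherit_q(q, ["root"])
q["grandchild"] = inherit_q(q, ["child"])
```

In this sketch every descendant along the chain starts at the same inflated value as the root, and only later updates can correct it.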

Rating: 6.8/10

Significance 7 · Rigor 6.5 · Novelty 7.5 · Clarity 7.5

Generated May 12, 2026

Comparison History (18)

vs. $δ$-mem: Efficient Online Memory for Large Language Models

claude-opus-4.6 · 5/16/2026

MemQ introduces a more theoretically novel framework by formalizing memory management as an Exogenous-Context MDP and applying TD(λ) eligibility traces over provenance DAGs—a principled integration of reinforcement learning with memory systems that opens new research directions. It addresses a fundamental limitation (ignoring dependency chains in memory) with a well-grounded formalism, demonstrates broad impact across six diverse benchmarks, and provides principled parameter selection guidance. While δ-mem offers practical engineering value with its compact memory state, MemQ's theoretical contributions and broader applicability across agent paradigms suggest higher long-term scientific impact.

vs. PnP-Corrector: A Universal Correction Framework for Coupled Spatiotemporal Forecasting

gemini-3.1 · 5/16/2026

Paper 1 addresses a critical bottleneck in coupled spatiotemporal forecasting with profound implications for climate modeling and physical sciences. By significantly reducing compounding errors in long-term global ocean-atmosphere forecasts, it offers high real-world impact for predicting climate evolution. While Paper 2 presents an innovative approach to LLM agent memory, Paper 1's universal framework tackles a fundamental, high-stakes problem in physical simulations with broad interdisciplinary applications and urgent global relevance.

vs. Label Effects: Shared Heuristic Reliance in Trust Assessment by Humans and LLM-as-a-Judge

gemini-3.1 · 5/16/2026

Paper 2 proposes a highly innovative methodological advancement by integrating Q-learning and eligibility traces into LLM memory architectures via provenance DAGs. This foundational contribution to credit assignment in memory agents demonstrates significant empirical gains across diverse domains (OS, code, embodied reasoning). While Paper 1 provides valuable insights into evaluation biases, Paper 2's architectural breakthrough for self-evolving agents offers broader applicability and pushes the boundaries of autonomous agent capabilities.

vs. Cycle-Consistent Search: Question Reconstructability as a Proxy Reward for Search Agent Training

claude-opus-4.6 · 5/16/2026

MemQ introduces a novel theoretical framework (Exogenous-Context MDP) and a principled credit assignment mechanism for memory-augmented LLM agents using TD(λ) over provenance DAGs. It demonstrates strong results across six diverse benchmarks, suggesting broad applicability. Paper 2's cycle-consistency idea for search agents is creative but more incremental—it adapts existing cycle-consistency concepts to a narrower problem and only matches (not exceeds) supervised baselines. MemQ's contribution to the foundational question of how LLM agents should manage and learn from episodic memory has broader long-term impact across agent architectures.

vs. Beyond Cooperative Simulators: Generating Realistic User Personas for Robust Evaluation of LLM Agents

claude-opus-4.6 · 5/16/2026

MemQ introduces a novel theoretical framework (Exogenous-Context MDP with provenance DAGs) that fundamentally rethinks how memory systems in LLM agents should work, moving beyond independent memory evaluation to credit assignment through dependency chains. It demonstrates consistent improvements across six diverse benchmarks, suggesting broad applicability. While Paper 2 addresses an important practical problem (realistic user simulation) with solid results, its contribution is more incremental—improving simulator fidelity—whereas MemQ's integration of RL credit assignment into memory architectures opens a new research direction with deeper theoretical foundations and wider potential impact across agent systems.

vs. Prompt Optimization Enables Stable Algorithmic Collusion in LLM Agents

gpt-5.2 · 5/16/2026

Paper 2 likely has higher impact: it introduces a generally applicable, principled RL-style credit assignment mechanism for LLM memory using provenance DAGs (novelty), formalizes the problem (EC-MDP) supporting methodological rigor, and demonstrates consistent gains across six diverse benchmarks, suggesting broad cross-domain utility. Its approach can improve many agentic systems (tool use, coding, embodied tasks), making real-world applications strong and timely. Paper 1 is important for AI safety/market design but is narrower in domain (duopoly simulations) and may be more sensitive to setup/prompting specifics.

vs. Marrying Generative Model of Healthcare Events with Digital Twin of Social Determinants of Health for Disease Reasoning

claude-opus-4.6 · 5/16/2026

Paper 1 addresses a fundamental gap in disease modeling by integrating social determinants of health with multi-organ sensor data through a novel geometric diffusion framework. Its application to UK Biobank with hundreds of thousands of subjects demonstrates significant scale and clinical relevance. The combination of digital twins, generative models, and SDoH proxies for disease trajectory reasoning has broad implications for personalized medicine and public health. Paper 2, while technically interesting in applying RL to memory management in LLM agents, addresses a narrower problem with more incremental gains and less transformative potential across scientific fields.

vs. WorldMAP: Bootstrapping Vision-Language Navigation Trajectory Prediction with Generative World Models

claude-opus-4.6 · 5/16/2026

MemQ introduces a more fundamental and broadly applicable contribution: formalizing memory management in LLM agents as an Exogenous-Context MDP and applying TD(λ) credit assignment over provenance DAGs. This bridges reinforcement learning theory with LLM agent memory in a novel way, with demonstrated impact across six diverse benchmarks. WorldMAP, while strong in its specific domain (vision-language navigation), addresses a narrower problem. MemQ's theoretical framework (EC-MDP formalization, structural credit propagation) has broader potential to influence agent memory research across many domains.

vs. GFT: From Imitation to Reward Fine-Tuning with Unbiased Group Advantages and Dynamic Coefficient Rectification

gemini-3.1 · 5/16/2026

Paper 1 introduces a highly innovative approach to episodic memory in LLM agents by formalizing memory dependency as a provenance DAG and applying TD-lambda for credit assignment. This novel structural approach addresses a critical bottleneck in long-term agent reasoning. While Paper 2 offers valuable improvements to LLM post-training, Paper 1's conceptual leap in agent architecture and strong empirical performance across diverse, complex benchmarks suggest a higher potential to inspire new research directions in autonomous agents.

vs. ACIArena: Toward Unified Evaluation for Agent Cascading Injection

gpt-5.2 · 5/12/2026

Paper 1 proposes a novel learning mechanism (TD(λ) credit assignment over a provenance DAG) with a formal EC-MDP framing and demonstrates consistent performance gains across diverse agent benchmarks, suggesting broad applicability and a reusable algorithmic contribution to agent memory and continual learning. Its methodological rigor (formalization + multi-domain evaluation + parameter analysis) and direct performance impact make it likely to influence future agent architectures. Paper 2 is timely and valuable for security benchmarking, but primarily provides an evaluation framework/dataset; its impact may be narrower and depends on adoption rather than a broadly generalizable algorithmic advance.

vs. Explicit Trait Inference for Multi-Agent Coordination

claude-opus-4.6 · 5/12/2026

MemQ introduces a more novel theoretical framework by formalizing memory management as an Exogenous-Context MDP and applying TD(λ) eligibility traces over provenance DAGs—a creative bridge between reinforcement learning and LLM memory systems. It addresses a fundamental limitation (memory interdependency) with principled methodology, demonstrates broad impact across six diverse benchmarks, and provides theoretical grounding that could influence future memory architecture research. While ETI is valuable, its psychological trait-based approach is more incremental and narrower in scope compared to MemQ's foundational contribution to self-evolving agent memory.

vs. SkillClaw: Let Skills Evolve Collectively with Agentic Evolver

gpt-5.2 · 5/12/2026

Paper 1 is more methodologically rigorous and conceptually novel: it formalizes memory learning as an Exogenous-Context MDP and introduces TD(λ) credit assignment over a provenance DAG, a clear algorithmic contribution with principled analysis (γ, λ interactions) and broad applicability to many agent settings. It reports gains across six diverse benchmarks, suggesting breadth and reproducibility potential. Paper 2 targets an important deployment problem (multi-user skill evolution) with strong real-world relevance, but the abstract is less specific about the learning algorithm, guarantees, and evaluation breadth (single benchmark), making its scientific contribution appear less grounded and generalizable.

vs. ADAPT: Benchmarking Commonsense Planning under Unspecified Affordance Constraints

claude-opus-4.6 · 5/12/2026

MemQ introduces a novel theoretical framework (Exogenous-Context MDP) and a principled method for propagating credit through memory provenance DAGs using TD(λ) eligibility traces—bridging reinforcement learning and episodic memory for LLM agents. It demonstrates broad impact across six diverse benchmarks, offers deeper theoretical contributions (formalization, parameter analysis), and addresses a fundamental limitation in memory-augmented agents. Paper 2, while valuable, introduces a more narrowly scoped benchmark and plug-and-play module for affordance reasoning in embodied agents, with less theoretical novelty and narrower cross-domain applicability.

vs. Do We Really Need to Approach the Entire Pareto Front in Many-Objective Bayesian Optimisation?

gemini-3.1 · 5/12/2026

Paper 1 addresses the highly active and rapidly growing field of LLM agents, introducing a novel integration of Q-learning to episodic memory over provenance DAGs. Its broad empirical success across diverse, modern benchmarks (e.g., multimodal reasoning, OS interaction) suggests significant applicability and impact in AI. Paper 2, while methodologically rigorous and practically useful, tackles a more specialized niche in many-objective Bayesian optimization, which typically commands a narrower audience and lower overall scientific impact compared to advancements in foundational agentic AI.

vs. Not All Turns Matter: Credit Assignment for Multi-Turn Jailbreaking

gpt-5.2 · 5/12/2026

Paper 1 offers a broadly novel and general mechanism—TD(λ) credit assignment over provenance DAGs in self-evolving memory for LLM agents—grounded in a formal EC-MDP formulation and evaluated across six diverse agentic benchmarks, suggesting wide applicability beyond a single domain. Its method is likely reusable for many memory/retrieval-based agent systems and connects cleanly to established RL theory, strengthening methodological rigor and cross-field impact. Paper 2 is timely and practically relevant for AI safety, but is narrower (jailbreaking-specific) and more heuristic, limiting breadth and long-term generalizability.

vs. expo: Exploration-prioritized policy optimization via adaptive kl regulation and gaussian curriculum sampling

claude-opus-4.6 · 5/12/2026

MemQ introduces a fundamentally novel framework connecting reinforcement learning credit assignment (TD(λ) eligibility traces) with memory management in LLM agents through provenance DAGs and a new formalism (EC-MDP). This represents a deeper conceptual contribution with broad applicability across diverse agent tasks (6 heterogeneous benchmarks). Paper 2, while solid, presents incremental improvements to GRPO (adaptive KL and curriculum sampling) that are relatively straightforward extensions limited to mathematical reasoning. MemQ's novelty in structural credit assignment for memory systems opens new research directions, giving it higher long-term impact potential.

vs. LoopVLA: Learning Sufficiency in Recurrent Refinement for Vision-Language-Action Models

gemini-3.1 · 5/12/2026

Paper 2 introduces a highly innovative approach to episodic memory in LLM agents by applying Q-learning over provenance DAGs for credit assignment. This fundamentally addresses a major bottleneck in agentic memory systems. Its broad applicability across six diverse domains (OS interaction, coding, QA, embodied reasoning) and strong theoretical formalization (EC-MDP) suggest a significantly wider scientific impact than Paper 1, which is more narrowly focused on efficiency improvements in robotic manipulation.

vs. Offline Policy Optimization with Posterior Sampling

claude-opus-4.6 · 5/12/2026

MemQ introduces a novel framework combining Q-learning with memory provenance DAGs for LLM agents, addressing an underexplored area at the intersection of RL and LLM memory systems. Its breadth of evaluation across six diverse benchmarks, the formalization of EC-MDPs, and practical applicability to the rapidly growing LLM agent ecosystem give it high timeliness and broad impact potential. Paper 2, while theoretically sound, addresses a more incremental improvement in offline RL with Bayesian methods—a well-studied area—offering less novelty and narrower cross-field impact.