DMF: A Deterministic Memory Framework for Conversational AI Agents

Matteo Stabile, Enrico Zimuel

Jun 2, 2026

arXiv:2606.03463v1

cs.AI (primary), cs.CL

#2173 of 3355 · Artificial Intelligence

Tournament Score: 1369 ± 44 (scale 1050 to 1800)

Win Rate: 50%

Rating: 5/10

Significance 5.5 · Rigor 4 · Novelty 4.5 · Clarity 7

Abstract

Conversational AI agents require memory systems that are both scalable and semantically coherent across long interaction horizons. Existing approaches rely predominantly on large language model (LLM)-based summarisation at write time, which introduces non-determinism, escalating token costs, and opacity in pruning decisions. We present the Deterministic Memory Framework (DMF), a CPU-first approach that replaces generative memory compression with a fully deterministic pipeline grounded in classical NLP analysis, vector geometry, and mathematical scoring. DMF assigns each conversational interaction a Survival Score $\Omega$ computed from deterministic content signals, conversational cues, and structured provenance, combined through a logistic projection. An interaction-count decay law, denoted as $\Omega_{\mathrm{eff}}(\Delta n)$, governs how relevance evolves as new turns arrive, where $\Delta n$ is the number of newer interactions rather than wall-clock time, preserving full determinism. We present the mathematical formulation of DMF, its structured recall pipeline, the pruning decision procedure, and the evaluation protocol. Experiments are conducted on a purpose-built benchmark using the LoCoMo and LongMemEval datasets. We compare DMF against Mem0, a popular memory layer for AI agents. DMF achieves comparable accuracy while using zero tokens to prepare the memory context and 5x to 242x fewer tokens over the entire conversation. These results show that it is possible to eliminate LLM calls from the memory-management loop, reducing token costs to nearly zero and enabling deterministic memory systems for conversational AI agents.

AI Impact Assessments

(1 model)

Scientific Impact Assessment: DMF — A Deterministic Memory Framework for Conversational AI Agents

1. Core Contribution

DMF proposes replacing LLM-based memory management in conversational AI agents with a fully deterministic, CPU-first pipeline. The central idea is that memory scoring, pruning, archival, and retrieval can be accomplished without any generative model calls by combining classical NLP features (POS-based information density, VADER sentiment, named entity counts), vector similarity, and a mathematically defined survival score with interaction-count-based exponential decay.

The key novelty lies in the *composition* of these ideas: a logistic-projected survival score Ω combining content, operational, and provenance channels; score-dependent inertia modulating decay rates so high-value memories persist longer; interaction-count (rather than wall-clock) decay for full determinism; and a structured multi-channel recall pipeline with answerability-aware reranking. The framework eliminates LLM calls from the memory management loop entirely, achieving zero token cost for memory operations.
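The scoring idea described above can be made concrete with a minimal Python sketch. It is not the paper's formulation: the review does not reproduce the exact equations, so the weighted sum, the logistic midpoint, the form of the score-dependent inertia, and every numeric value below are assumptions chosen only to illustrate the shape of the mechanism. The names `Params`, `survival_score`, and `effective_score` are hypothetical.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Params:
    # All values are illustrative placeholders, not the paper's calibration.
    alpha: float = 1.0   # weight of the content channel
    beta: float = 0.5    # weight of the operational (conversational-cue) channel
    gamma: float = 0.5   # weight of the provenance channel
    x0: float = 1.0      # logistic midpoint
    lam: float = 0.05    # base decay rate per newer interaction
    eta: float = 4.0     # inertia strength: higher scores decay more slowly


def survival_score(content, operational, provenance, p=Params()):
    """Logistic projection of a weighted channel sum into (0, 1)."""
    x = p.alpha * content + p.beta * operational + p.gamma * provenance
    return 1.0 / (1.0 + math.exp(-(x - p.x0)))


def effective_score(omega, delta_n, p=Params()):
    """Exponential decay in delta_n, the number of newer interactions.

    The rate is divided by (1 + eta * omega), an assumed form of
    score-dependent inertia: a high-scoring memory decays more slowly.
    """
    if delta_n < 0:
        raise ValueError("delta_n must be non-negative")
    rate = p.lam / (1.0 + p.eta * omega)
    return omega * math.exp(-rate * delta_n)


if __name__ == "__main__":
    strong = survival_score(2.0, 1.0, 1.0)   # information-dense turn
    weak = survival_score(0.2, 0.0, 0.0)     # small-talk turn
    for dn in (0, 10, 50):
        print(dn, round(effective_score(strong, dn), 3),
              round(effective_score(weak, dn), 3))
```

Because `effective_score` depends only on the stored score and the count of newer turns, replaying the same conversation always yields the same values, whatever the wall-clock timing. That is the determinism property the review attributes to the interaction-count decay.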

2. Methodological Rigor

Strengths in formulation: The mathematical framework is clearly presented. The survival score derivation, decay law, pruning mechanisms, and calibration examples are well-specified and reproducible. The interaction-count decay choice is well-motivated — it ensures the memory state is a pure function of the conversation sequence, eliminating temporal non-determinism.

Weaknesses in evaluation: The experimental evaluation has significant limitations:

Comparison baseline: Only Mem0 is compared. MemGPT, A-MEM, ReadAgent, MemoryBank, and full-context baselines are discussed in related work but not benchmarked. This makes it difficult to position DMF's accuracy in the broader landscape.

LongMemEval-10 sample size: Using only 10 randomly sampled questions per category (60 total) is statistically underpowered. With binary outcomes and n=10, confidence intervals are extremely wide (~±30%). The reported 7% advantage of Mem0 is well within noise.

LLM-as-judge methodology: The judge prompt is quite lenient (partial credit, paraphrase acceptance, 14-day date tolerance, 50% duration tolerance). While shared across both systems, this leniency may mask quality differences. No inter-annotator agreement or judge reliability analysis is provided.

No statistical significance testing: Neither confidence intervals nor significance tests are reported for any metric.

Single embedding model: All experiments use BAAI/bge-small-en-v1.5. The paper acknowledges this as future work but it limits generalizability claims.

LoCoMo temporal results for Mem0: Mem0 scores 0.15 on temporal reasoning, which seems anomalously low and raises questions about whether the Mem0 baseline was optimally configured.
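The sample-size concern above can be checked with a short calculation. The sketch below computes a Wilson score interval for a binomial proportion, one standard choice for small samples; the counts used in the example (5 of 10, 30 of 60) are hypothetical and are not results from the paper.

```python
import math


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 is roughly 95%)."""
    if n <= 0 or not 0 <= successes <= n:
        raise ValueError("need n > 0 and 0 <= successes <= n")
    p_hat = successes / n
    denom = 1.0 + z * z / n
    centre = (p_hat + z * z / (2 * n)) / denom
    half = z * math.sqrt(p_hat * (1 - p_hat) / n + z * z / (4 * n * n)) / denom
    return centre - half, centre + half


if __name__ == "__main__":
    # Hypothetical counts: one category (n=10) versus the pooled set (n=60).
    for k, n in ((5, 10), (30, 60)):
        lo, hi = wilson_interval(k, n)
        print(f"{k}/{n}: [{lo:.3f}, {hi:.3f}]")
```

At n=10 the interval spans roughly half of the unit range, so a per-category difference of a few points between two systems cannot be separated from sampling noise. Pooling all 60 questions narrows the interval but still leaves it wider than the 7% gap the review cites.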

3. Potential Impact

Token cost reduction is the most compelling practical contribution. The 5–242× reduction in total token usage is substantial and directly translates to cost savings in production deployments. For organizations running conversational agents at scale, this could represent significant operational savings.

Determinism and auditability address a real pain point in production AI systems. The ability to reproduce memory states exactly from conversation sequences is valuable for debugging, compliance, and testing.

CPU-first deployment lowers the infrastructure barrier, enabling memory management on commodity hardware without GPU requirements or API calls.

However, the practical impact may be limited by several factors: (1) the approach is currently English-only; (2) the reliance on rule-based NLP (spaCy POS tags, VADER sentiment, keyword matching) may not capture nuanced semantic content that LLM-based approaches handle naturally; (3) the large number of tunable hyperparameters (α, β, γ, δ, x₀, λ, η, numerous threshold and bonus values) creates a complex configuration surface that may require domain-specific tuning.

4. Timeliness & Relevance

The paper addresses a genuinely important and timely problem. As LLM-based agents move into production, the cost and opacity of memory management become practical bottlenecks. The token cost explosion in long-horizon conversations is a recognized challenge. The push toward deterministic, auditable AI systems aligns with emerging regulatory requirements.

The framing of "zero LLM tokens for memory management" is compelling as a design philosophy, even if the overall system still requires LLM calls for final answer generation. The work contributes to the growing literature on making AI systems more efficient and predictable.

5. Strengths & Limitations

Key Strengths:

Clear, well-specified mathematical framework with full reproducibility potential

Open-source implementation and benchmarks

Dramatic token cost reduction (zero memory-management tokens)

Principled design decisions (interaction-count decay, source-canonical archival, recall-time interpretation)

The social floor heuristic and topic-supersession mechanisms show thoughtful engineering

Notable Limitations:

The feature extraction pipeline (POS ratios, VADER sentiment, entity counts) represents a significant step backward in semantic understanding compared to LLM-based extraction. The paper does not adequately address whether these simple features can capture the nuanced semantics that matter for memory retention in complex conversations.

Many design choices are justified by "design intent" rather than empirical ablation. Why these specific weights? Why these threshold values? No ablation studies are presented.

The benchmark comparison is thin — one baseline, small sample sizes, no statistical rigor.

The paper is very long (21 pages) relative to its empirical contribution, with extensive space devoted to implementation details that could be in supplementary material.

The claim of "comparable accuracy" is overstated given the evaluation limitations. On LongMemEval-10, Mem0 actually outperforms DMF overall.

No analysis of failure modes or cases where deterministic features systematically miss important content.

6. Additional Observations

The paper reads more as a systems paper with a thorough technical specification than as a research paper with rigorous empirical validation. The mathematical framework, while clearly presented, is largely a composition of well-known techniques (logistic regression, exponential decay, cosine similarity, rule-based NLP). The scientific contribution is in demonstrating that this composition can achieve competitive performance at dramatically lower cost, but this claim needs stronger empirical support.

The future work section is extensive and honest about limitations, which is commendable. The shared-memory and multilingual extensions could significantly broaden applicability.

Rating: 5/10

Significance 5.5 · Rigor 4 · Novelty 4.5 · Clarity 7

Generated Jun 3, 2026

Comparison History (20)

vs. An Infectious Disease Spread Simulation Based on Large Language Model Decision Making

gemini-3.1 · 6/6/2026

Paper 1 addresses a critical bottleneck in the rapidly expanding field of AI agents: the high computational cost and non-determinism of LLM-based memory management. By proposing a highly efficient, CPU-first deterministic framework that drastically reduces token usage (up to 242x), it offers immediate, scalable, and highly impactful real-world applications for conversational AI. While Paper 2 presents an interesting interdisciplinary application of LLMs in epidemiology, Paper 1's methodological innovation and broad implications for AI engineering give it a higher potential for widespread scientific and industrial impact.

vs. BiasGRPO: Stabilizing Bias Mitigation in High-Variance Reward Landscapes via Group-Relative Policy Optimization

gpt-5.2 · 6/6/2026

Paper 2 is likely to have higher impact: it targets bias mitigation/alignment, a timely and widely relevant problem with broad downstream applications. BiasGRPO offers a clear methodological innovation (group-relative baseline removing critic dependence) that can generalize to other high-variance RLHF settings beyond bias. It also reports improvements over major baselines (DPO, PPO) and releases reusable assets (bias reward model, dataset extension), increasing adoption. Paper 1 is practical and cost-reducing, but its deterministic, classical-NLP memory pipeline may be more incremental and narrower in research reach.

vs. Integrating Mechanistic and Data-Driven Models for Neurological Disorders through Differentiable Programming

gemini-3.1 · 6/6/2026

Paper 1 integrates AI, physics-based modeling, and neuroscience to address critical healthcare challenges like Alzheimer's and brain tumors. Its hybrid approach offers profound real-world clinical applications and broad multidisciplinary impact. In contrast, Paper 2 provides significant engineering and efficiency improvements for conversational AI, but its scope is narrower and primarily focused on computational cost reduction rather than transformative real-world health applications.

vs. I Know What You Meme, Even If it Emerged Today: Understanding Evolving Memes through Open-World Knowledge Acquisition

claude-opus-4.6 · 6/5/2026

Paper 1 (DMF) addresses a fundamental infrastructure challenge in conversational AI—memory management—with a novel deterministic, token-free approach that dramatically reduces costs (5x-242x fewer tokens) while maintaining accuracy. This has broad practical impact across all conversational AI systems and introduces a paradigm shift away from LLM-dependent memory. Paper 2 tackles meme understanding with a retrieve-and-reason framework, which is more niche in scope. While solid, its impact is narrower, limited to content moderation and meme analysis. DMF's methodological innovation and cost reduction implications make it more broadly impactful.

vs. Uncertainty Aware Functional Behavior Prediction and Material Fatigue Assessment for Circular Factory

claude-opus-4.6 · 6/5/2026

Paper 2 (DMF) addresses a broadly relevant problem in conversational AI—memory management for LLM-based agents—with a novel deterministic approach that eliminates LLM calls from the memory loop, achieving dramatic token cost reductions (5x-242x). This has immediate, wide-reaching practical impact given the explosive growth of LLM-based agents. Paper 1 tackles a narrower domain (circular factory reliability for angle grinders) with a competent but incremental combination of existing techniques (CNN-LSTM, FEA, S-N curves). While rigorous, its applicability is limited to specific manufacturing contexts, whereas Paper 2's framework generalizes across all conversational AI applications.

vs. MulFeRL: Enhancing Reinforcement Learning with Verbal Feedback in a Multi-turn Loop

claude-opus-4.6 · 6/5/2026

MulFeRL addresses a fundamental limitation in RLVR—sparse, uninformative rewards for failed samples—by introducing verbal feedback in a multi-turn loop with progress credit assignment. This contributes to the rapidly growing and high-impact area of LLM reasoning improvement via RL, with broad applicability across reasoning domains. Paper 2 (DMF) offers a practical engineering contribution for conversational memory management with significant cost savings, but its scope is narrower (memory management optimization) and its novelty is more incremental, replacing LLM summarization with classical NLP techniques. MulFeRL's methodological innovation in combining feedback-guided regeneration with reinforcement learning has broader implications for the field.

vs. Structure Enables Effective Self-Localization of Errors in LLMs

gpt-5.2 · 6/5/2026

Paper 2 likely has higher scientific impact due to greater novelty and breadth: it proposes a generally applicable structured-reasoning and iterative backtracking framework (Thought-ICS) for error localization and self-correction, a central unsolved problem in LLM reliability. The approach is timely, aligns with verification/correction research, and can transfer across tasks (math, logic, tool use) and communities (NLP, alignment, HCI). Paper 1 is practically valuable (deterministic, low-cost memory), but is more engineering-focused and may have narrower conceptual impact beyond agent memory systems.

vs. BigFinanceBench: A Workflow-Grounded Benchmark for Financial-Research Agents

gemini-3.1 · 6/3/2026

Paper 1 introduces a novel, domain-agnostic architectural paradigm that directly addresses critical bottlenecks in conversational AI: cost, scalability, and non-determinism in memory management. By eliminating LLM calls for memory processes and achieving up to 242x token reduction, DMF offers massive, immediate real-world utility across the entire AI agent ecosystem. Paper 2 provides a valuable but domain-specific benchmark (finance). While important for evaluation, Paper 1's foundational methodological shift in agent memory architecture presents a significantly broader and more transformative scientific impact.

vs. Don't Ask the LLM to Track Freshness: A Deterministic Recipe for Memory Conflict Resolution

gemini-3.1 · 6/3/2026

Paper 1 introduces a comprehensive deterministic memory framework that completely removes LLMs from the memory-management loop, drastically reducing token costs while maintaining accuracy. This offers broader utility, scalability, and cost-efficiency for conversational AI agents compared to Paper 2, which focuses on a specific sub-problem (conflict resolution) using a narrower heuristic approach. Paper 1's methodology represents a more substantial innovation with wider real-world applications.

vs. Handoff Debt: The Rediscovery Cost When Coding Agents Take Over Interrupted Tasks

claude-opus-4.6 · 6/3/2026

Paper 1 introduces a novel evaluation dimension ('handoff debt') for coding agents that addresses a real gap between benchmarks and practice. The concept of measuring task resumption costs across agents is original and broadly applicable to the growing field of AI-assisted software engineering. Paper 2, while practically useful in reducing token costs for conversational memory, is more incremental—replacing LLM summarization with classical NLP is a known direction. Paper 1's framing and protocol have greater potential to reshape how the community evaluates and designs coding agents, giving it broader methodological impact.

vs. NBQ: Next-Best-Question for Dynamic Profiling

gemini-3.1 · 6/3/2026

Paper 2 tackles a critical bottleneck in modern conversational AI: the high token costs and non-determinism of LLM-based memory systems. By proposing a CPU-first, deterministic memory framework that reduces token costs by up to 242x while maintaining comparable accuracy, it offers massive scalability benefits for AI agents across all domains. While Paper 1 presents a solid framework for dynamic profiling and reciprocal matching, Paper 2's fundamental challenge to the prevailing generative memory paradigm has broader applicability, addresses a more urgent cost/efficiency problem in the field, and is likely to see wider adoption in agent architectures.

vs. TSQAgent: Rating Time Series Data Quality via Dedicated Agentic Reasoning

gpt-5.2 · 6/3/2026

Paper 1 likely has higher scientific impact due to stronger novelty (agentic, tool-augmented, quantitative TS quality assessment) and broader applicability: time-series quality affects many domains (health, finance, IoT, climate) and can improve downstream modeling/data efficiency. It also contributes a new benchmark (TSQBench) and analyzes core LLM limitations, increasing methodological value and timeliness for LLM evaluation in structured data tasks. Paper 2 is practical and timely for agent systems, but its core ideas rely more on engineering a deterministic pipeline and appear narrower in cross-field scientific reach.

vs. Cross-Lingual Token Arbitrage: Optimizing Code Agent Context Windows via Local LLM Preprocessing

gemini-3.1 · 6/3/2026

Paper 2 offers a broader and more paradigm-shifting contribution by challenging the prevalent use of LLMs for memory management. Its deterministic, CPU-first approach achieves staggering token reductions (up to 242x) while maintaining accuracy, which has profound implications for the scalability, cost, and interpretability of all conversational AI agents, far exceeding the narrower cross-lingual coding scope of Paper 1.

vs. WISE-HAR: A Generalizable Ensemble Deep Learning Framework for WiFi-Based Human Activity Recognition

claude-opus-4.6 · 6/3/2026

Paper 2 (DMF) addresses a fundamental and timely problem in conversational AI—memory management for LLM-based agents—with a novel deterministic approach that eliminates LLM calls from the memory loop, achieving 5x-242x token cost reduction. This has broad applicability across all conversational AI systems, offers a paradigm shift from generative to deterministic memory management, and directly addresses scalability/cost concerns critical to the field. Paper 1 applies existing CNN architectures and standard ensemble/augmentation techniques to WiFi-HAR with incremental improvements on a small-scale 3-class problem, representing more applied engineering than fundamental contribution.

vs. Inducing Reasoning Primitives from Agent Traces

gemini-3.1 · 6/3/2026

Paper 2 addresses a fundamental challenge in LLM agents—autonomous skill acquisition—by automatically inducing reusable reasoning primitives from successful traces. This method not only improves reasoning capabilities but also generalizes across multiple tasks, offering a potentially transformative approach to agent self-improvement. Paper 1 offers a highly practical optimization for agent memory costs and determinism, but Paper 2's contribution to automated reasoning decomposition and dynamic tool creation is likely to have a broader and more profound impact on the development of advanced autonomous agents.

vs. Community-Aware Assessment of Social Textual Engagement and Resonance: A Human-Centric Perspective on User-Generated Content Evaluation

claude-opus-4.6 · 6/3/2026

Paper 1 (DMF) addresses a fundamental and widely applicable problem in conversational AI—memory management—with a novel deterministic, cost-efficient approach that eliminates LLM calls from the memory loop. It demonstrates dramatic token cost reductions (5x-242x) while maintaining accuracy, offering immediate practical impact for the rapidly growing AI agent ecosystem. Paper 2 (CASTER/MEDEA) introduces an interesting social resonance evaluation paradigm, but targets a narrower application domain (UGC quality assessment) and relies on more incremental innovations (CoT variants, persona simulation). DMF's contribution is more foundational and broadly applicable across conversational AI systems.

vs. Solipsistic Superintelligence is Unlikely to be Cooperative

claude-opus-4.6 · 6/3/2026

Paper 1 addresses a fundamental, paradigm-level challenge in AI safety and alignment—how superintelligent systems can be designed for cooperation rather than solipsistic optimization. This has broad implications across AI safety, multi-agent systems, institutional design, and policy. Its conceptual framework (self-undermining property, non-solipsistic design) could reshape how the field approaches AI development. Paper 2, while practically useful in reducing token costs for conversational memory, is a more incremental engineering contribution with narrower scope. The timeliness of Paper 1's alignment concerns and its breadth of interdisciplinary impact give it substantially higher potential scientific influence.

vs. Large AI Models in Dental Healthcare: From General-Purpose Systems to Domain-Specific Foundation Models

gemini-3.1 · 6/3/2026

Paper 2 presents a novel, deterministic framework for conversational AI memory that eliminates LLM calls during memory management, addressing critical bottlenecks in cost and scalability. Its massive reduction in token usage (up to 242x) with comparable accuracy offers broad, immediate impact across the entire AI agent industry. In contrast, Paper 1 is a systematic review limited to the specific domain of dental healthcare, offering synthesis rather than fundamental methodological innovation.

vs. PropLLM: Propagation-Aware Scene Reconstruction for Network Fault Diagnosis

gpt-5.2 · 6/3/2026

Paper 2 (PropLLM) has higher estimated impact: it introduces a novel propagation-aware, hop-by-hop causal reconstruction paradigm combining LLM reasoning with verifiable KG evidence and a new attention mechanism (TCPA) encoding causal/topological priors. The work targets an important real-world domain (network fault diagnosis) with demonstrated gains on multiple real datasets, including reduced hallucinations—key for deployability. Its ideas (causal tracing, evidence-grounded LLMs, prior-guided attention) are broadly transferable to other diagnosis/monitoring settings. Paper 1 is practical and cost-saving but more incremental, relying on classical deterministic heuristics.

vs. RoleCDE: Benchmarking and Mitigating Role-Alignment Trade-offs in Role-Playing Agents

claude-opus-4.6 · 6/3/2026

RoleCDE introduces a novel benchmark addressing a previously underexplored problem—role-alignment value conflicts in role-playing agents—and reveals a significant 'Role Value Decoupling' phenomenon with broad implications for AI safety and alignment research. Its large-scale benchmark (24k instances), systematic evaluation across multiple LLMs, and demonstration that fine-tuning can mitigate the identified issues provide both diagnostic and prescriptive contributions. Paper 2, while practically useful in reducing token costs for memory management, addresses a more incremental engineering optimization with narrower conceptual novelty.