MEMENTO: Leveraging Web as a Learning Signal for Low-Data Domains

Ashutosh Ojha, Vinay Aggarwal, Ashutosh Srivastava, Siddharth Yedlapati, Yaman K Singla, Jitendra Ajmera

May 28, 2026

arXiv:2605.29795v1

cs.AI (primary)

#1049 of 2821 · Artificial Intelligence

Tournament Score: 1437 ± 49 (scale 1050 to 1800)

Win Rate: 62%

Rating: 6/10

Significance 6.5 · Rigor 5.5 · Novelty 7 · Clarity 7.5

Abstract

Real-world tasks often lack large labeled datasets, motivating extensive work on learning in low-data regimes. However, existing approaches such as few-shot prompting, instruction tuning, and synthetic data generation continue to treat labeled or pseudo-labeled data as the primary learning signal. In contrast, human practitioners acquire expertise through repeated, self-directed interaction with the open web, progressively refining both domain knowledge and search strategies. We propose MEMENTO, a framework that treats the web as a learning signal rather than a stateless retrieval interface. MEMENTO operates at two levels: within each session, it conducts iterative web exploration via an Adaptive Exploration Tree (AET) that decomposes tasks into evolving questions and reflects on intermediate findings; across sessions, it accumulates experience through dual-channel memory, separating declarative knowledge (facts) from procedural knowledge (search strategies). This design enables agents to learn reusable research strategies and domain expertise from trajectories of web interaction without additional model training. We evaluate MEMENTO on two low-data professional domains: sales automation and legal research. Our empirical results show consistent improvements in performance over ReAct-based baselines (+25.6% on sales automation and +36.5% on legal research), demonstrating that the web can serve as a scalable learning source for acquiring task-specific expertise in data-scarce settings.

AI Impact Assessments

(1 model)

Scientific Impact Assessment: MEMENTO

1. Core Contribution

MEMENTO proposes a framework that treats the open web as a persistent learning signal rather than a stateless retrieval interface for LLM agents operating in low-data domains. The key insight is that human experts learn not just facts but also *how to search* through repeated web interaction. The framework has two novel architectural elements: (1) an Adaptive Exploration Tree (AET) for within-session iterative web research with reflection-driven question decomposition, and (2) a dual-channel cross-session memory that separates declarative knowledge (domain facts) from procedural knowledge (search strategies, decomposition rules, web action rules). This separation is grounded in the ACT-R cognitive architecture's declarative/procedural distinction.

The problem addressed—acquiring domain expertise in data-scarce professional settings without model fine-tuning—is practically important. The solution is creative: rather than treating web access as a one-shot retrieval tool (as in RAG or ReAct), MEMENTO accumulates transferable expertise across sessions through human-readable memory artifacts.
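The declarative/procedural separation described above can be pictured as two plain-text stores that persist across sessions. The following is a minimal sketch only: the class name `DualChannelMemory`, its methods, and the example entries are hypothetical illustrations, not the paper's implementation or its memory schema.

```python
from dataclasses import dataclass, field


@dataclass
class DualChannelMemory:
    """Hypothetical cross-session store; all names here are illustrative."""

    declarative: list[str] = field(default_factory=list)  # domain facts
    procedural: list[str] = field(default_factory=list)   # search strategies and rules

    def _store(self, channel: str) -> list[str]:
        # Route an entry to the list that backs the requested channel.
        if channel == "declarative":
            return self.declarative
        if channel == "procedural":
            return self.procedural
        raise ValueError(f"unknown channel: {channel}")

    def add(self, channel: str, entry: str) -> None:
        # Skip exact duplicates so repeated sessions do not bloat the store.
        store = self._store(channel)
        if entry not in store:
            store.append(entry)

    def render(self) -> str:
        # Human-readable text that could be prepended to an agent prompt.
        lines = ["Known facts:"]
        lines += [f"- {e}" for e in self.declarative]
        lines.append("Search strategies:")
        lines += [f"- {e}" for e in self.procedural]
        return "\n".join(lines)


memory = DualChannelMemory()
memory.add("declarative", "Example fact learned in session 1")
memory.add("procedural", "Example rule: check primary sources before summaries")
memory.add("procedural", "Example rule: check primary sources before summaries")  # duplicate, ignored
print(memory.render())
```

Keeping both channels as readable strings mirrors the assessment's point that the memory artifacts are human-readable and auditable; how the actual system extracts, scores, or retrieves entries is not specified here.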

2. Methodological Rigor

Strengths in experimental design:

Temporal leakage prevention is carefully handled: web search cutoffs are enforced (6 months for sales, 2 years for legal), and cases where the agent could trivially retrieve outcomes are filtered out.

The baseline hierarchy is well-structured, progressively adding components (closed-book → ICL → ReAct → ReAct+memory → AET → full MEMENTO), enabling clean ablation of contributions.

Cross-model validation using both an open-source (Qwen) and proprietary (GPT-5-mini) backbone strengthens generalizability claims.

Bootstrap confidence intervals are reported (Appendix G).
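The temporal leakage control listed first among these strengths can be illustrated with a simple date filter. This is a sketch under assumptions: the assessment gives only the window lengths (6 months for sales, 2 years for legal), so the way the cutoff is computed here (reference date minus the window), the function name `filter_by_cutoff`, the result format, and the example URLs are all hypothetical.

```python
from datetime import date, timedelta

# Window lengths from the assessment; the day counts are approximations.
CUTOFF_WINDOWS = {"sales": timedelta(days=182), "legal": timedelta(days=730)}


def filter_by_cutoff(results, domain, reference_date):
    """Keep only results published on or before reference_date minus the domain window."""
    cutoff = reference_date - CUTOFF_WINDOWS[domain]
    return [r for r in results if r["published"] <= cutoff]


# Made-up search results for illustration only.
results = [
    {"url": "https://example.com/old", "published": date(2023, 1, 15)},
    {"url": "https://example.com/recent", "published": date(2025, 3, 1)},
]
kept = filter_by_cutoff(results, "legal", reference_date=date(2025, 6, 1))
print([r["url"] for r in kept])
```

A filter like this addresses only documents with reliable publication dates; the assessment notes that cases where the agent could trivially retrieve the outcome were filtered out separately.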

Weaknesses and concerns:

LLM-as-judge evaluation is used for both tasks. For sales automation, the 0-5 rubric is subjective and potentially noisy; no human evaluation or inter-annotator agreement is reported. For legal research, binary accuracy is more objective but evaluated via LLM judge rather than string matching.

Small evaluation scale: Only 120 test samples per domain, with 60 training samples. While this is intentional (low-data regime), it limits statistical power. Some confidence intervals in Table 8 overlap between methods.

Two domains only: The authors acknowledge this limitation. The sales and legal domains, while structurally different, don't demonstrate breadth across, say, medical, scientific, or technical domains.

Reproducibility concerns: Code is not released, and the system involves complex multi-component orchestration with many moving parts (four memory stores, wave-based decomposition, reflection agents, multiple LLM calls). The prompts are described but not fully provided.

The data leakage observation (zero-shot Qwen matches MEMENTO on legal research at 0.808) is concerning and somewhat undermines a clean interpretation of the results. The authors acknowledge this, but it complicates the narrative.
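To make the statistical-power concern concrete, the following sketch computes a percentile bootstrap confidence interval for mean accuracy on a 120-item test set. It is an illustration only: the paper's exact bootstrap procedure is not described in this assessment, the function `bootstrap_ci` is hypothetical, and the scores are synthetic rather than the paper's data.

```python
import random


def bootstrap_ci(scores, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean of `scores`."""
    rng = random.Random(seed)
    n = len(scores)
    # Resample with replacement and record the mean of each resample.
    means = sorted(
        sum(rng.choices(scores, k=n)) / n for _ in range(n_resamples)
    )
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi


# Synthetic binary outcomes for 120 test items (not the paper's data).
scores = [1] * 90 + [0] * 30
low, high = bootstrap_ci(scores)
print(f"mean={sum(scores) / len(scores):.3f}, 95% CI=({low:.3f}, {high:.3f})")
```

With only 120 binary outcomes the interval spans several accuracy points on either side of the mean, which is why overlapping intervals between methods are plausible at this scale.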

3. Potential Impact

Practical applications are clear: sales automation, legal research, and potentially any professional domain where labeled data is scarce but web resources are abundant. The training-free nature (no weight updates) makes deployment attractive, and human-readable memory stores improve auditability.

Broader influence: The conceptual framing of "web as learning signal" rather than "web as retrieval tool" is a meaningful paradigm shift for LLM agent design. If validated more broadly, this could influence how agentic systems are designed for professional knowledge work. The procedural/declarative memory separation could be adopted by other agent frameworks.

Efficiency gains are notable: training reduces search queries by ~20%, LLM calls by ~19%, and processing time by ~28% on the legal task, while improving quality—a rare simultaneous improvement in both dimensions.

4. Timeliness & Relevance

The paper is highly timely. Deep Research agents (OpenAI, Google, Perplexity) represent a major industry trend, but all are episodic. MEMENTO directly addresses this limitation with cross-session learning. The low-data regime is a persistent bottleneck for enterprise AI adoption in specialized domains. The paper also connects to the growing interest in memory-augmented and self-improving agents.

5. Strengths & Limitations

Key Strengths:

Compelling cognitive science grounding (ACT-R) for the dual-memory architecture, with empirical validation showing procedural memory carries ~85% of gains—matching the theoretical prediction about production compilation.

The efficiency analysis is convincing: the system simultaneously improves quality and reduces compute, suggesting genuine learning rather than brute-force search expansion.

Clean ablation design isolating AET, procedural memory, and declarative memory contributions.

The finding that procedural knowledge matters far more than declarative knowledge is a genuinely interesting empirical insight with implications for agent design.

Notable Weaknesses:

Computational cost: Training requires ~6-8 hours on 32 A100 GPUs for 60 samples. The system issues hundreds of search queries and LLM calls per sample. This is expensive relative to the baselines.

GPT-5-mini gains are modest: On sales automation, MEMENTO+GPT-5-mini improves over ReAct+GPT-5-mini by only 4.8% (0.547 vs 0.522), far less than the Qwen improvements. This suggests the framework's value may diminish with stronger base models.

Memory freezing at inference: The system cannot continue learning after deployment, which limits its practical utility in dynamic domains.

No comparison to prompt optimization baselines like GEPA or MIPROv2 on the same tasks—these are discussed in related work but not benchmarked against.

Web dependency: Performance is bounded by web coverage, which may be poor for truly niche or confidential domains.

The declarative memory's near-zero contribution (Table 4) raises questions about whether the dual-channel architecture is truly necessary or if procedural memory alone suffices.
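The 4.8% figure quoted above for GPT-5-mini is a relative improvement, not an absolute one. A short check using the two scores given in the assessment (the helper name `relative_gain` is illustrative):

```python
def relative_gain(new, baseline):
    """Relative improvement of `new` over `baseline`, as a fraction."""
    return (new - baseline) / baseline


# Scores quoted in the assessment for sales automation with GPT-5-mini.
memento, react = 0.547, 0.522
print(f"absolute gain: {memento - react:.3f}")                 # 0.025
print(f"relative gain: {relative_gain(memento, react):.1%}")   # 4.8%
```

The absolute difference is 0.025 on the normalized score, so the relative framing makes the gain look larger than the raw gap.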

Summary

MEMENTO introduces a conceptually appealing framework with a well-motivated cognitive science foundation. The idea of treating web interaction trajectories as learning signals is novel and practically relevant. However, the empirical validation is limited in scope (two domains, small test sets, LLM-judge evaluation), gains with stronger models are modest, and the computational overhead is substantial. The procedural memory finding is the paper's strongest empirical contribution. The work opens an interesting research direction but needs broader validation to establish its generality.

Rating: 6/10

Significance 6.5 · Rigor 5.5 · Novelty 7 · Clarity 7.5

Generated May 29, 2026

Comparison History (13)

vs. Towards Human-Like Interactive Speech Recognition With Agentic Correction and Semantic Evaluation

gemini-3.1 · 5/29/2026

Paper 1 proposes a fundamental paradigm shift in ASR from single-pass to multi-turn interactive systems, introducing a novel semantic evaluation metric that addresses the critical flaws of traditional token-level metrics like WER. By redefining how speech recognition is evaluated and integrated with LLMs, it offers foundational contributions that could reshape human-computer voice interaction across numerous fields. While Paper 2 provides an effective agentic web-exploration framework, Paper 1's systemic changes to core ASR methodology have a higher potential for broad and lasting scientific impact.

vs. On the Geometry of Games and their Solvers

gemini-3.1 · 5/29/2026

Paper 2 addresses the highly timely challenge of learning in low-data regimes by utilizing the web as an active learning signal rather than just a retrieval tool. Its dual-channel memory approach for LLM agents offers immediate, broad real-world applications across various professional domains (e.g., legal, sales). While Paper 1 provides strong theoretical contributions to game theory, Paper 2's framework has a higher potential for rapid adoption and widespread practical impact in the current AI landscape.

vs. Teaching Values to Machines: Simulating Human-Like Behavior in LLMs

gpt-5.2 · 5/29/2026

Paper 2 has higher potential impact due to its broader cross-disciplinary relevance (AI alignment, computational social science, psychology), strong timeliness around value alignment and human behavior simulation, and large-scale empirical methodology (5M+ questionnaire items) grounded in validated psychological instruments. Its results could influence evaluation standards, agent design, and policy-facing simulations. Paper 1 is novel and practically useful for low-data professional domains via web-interaction memory, but its impact is likely narrower (agentic retrieval/automation) and may be more incremental relative to rapidly evolving web-agent frameworks.

vs. PEAM: Parametric Embodied Agent Memory through Contrastive Internalization of Experience in Minecraft

gemini-3.1 · 5/29/2026

Paper 1 presents a highly scalable, training-free approach for agents to learn from the open web, addressing the critical bottleneck of data scarcity in professional domains like law and sales. Its broad applicability to real-world knowledge work and immediate economic relevance gives it higher potential impact compared to Paper 2's focus on embodied learning within a simulated Minecraft environment.

vs. Locally Coherent, Globally Incoherent: Bounding Compositional Incoherence in Multi-Component LLM Agents

gpt-5.2 · 5/29/2026

Paper 1 offers a more novel and rigorous contribution: it formalizes a fundamental failure mode in multi-component probabilistic LLM agent compositions (local coherence not implying global coherence), introduces a computable diagnostic (compositional residual), provides theoretical characterization (product-structure dichotomy), and proposes principled repairs and monitoring (projection method, anytime-valid e-process). This targets a core reliability/safety issue with broad relevance to agentic AI, ensembling, decision-making, and probabilistic reasoning. Paper 2 is practically useful and timely, but is closer to an engineering framework around web interaction and memory with narrower methodological novelty.

vs. Meta-Cognitive Memory Policy Optimization for Long-Horizon LLM Agents

gemini-3.1 · 5/29/2026

Paper 2 addresses a fundamental and pervasive bottleneck in LLMs—memory degradation in long-horizon reasoning—by introducing a novel self-supervised metric (Belief Entropy). Its ability to scale effectively to 1.75M-token contexts offers broader theoretical implications and impact across all LLM applications. In contrast, Paper 1 presents a highly practical but more application-specific framework tailored to data-scarce domains.

vs. Adopt $\neq$ Adapt: Longitudinal Analyses of LLM Conversations in the Wild

gpt-5.2 · 5/29/2026

Paper 2 is more likely to have higher scientific impact: it proposes a novel, generalizable agent framework (web-as-learning-signal with session-level adaptive exploration plus cross-session dual-memory) and demonstrates sizable gains in two professional low-data domains, suggesting clear real-world utility. Its methodological contribution can transfer across many tasks involving web interaction and agentic reasoning, broadening cross-field impact. Paper 1 is rigorous and valuable for HCI/LLM-user behavior and dataset bias, but it is primarily descriptive/diagnostic with narrower downstream leverage compared to a reusable algorithmic framework.

vs. It's All About Speed: AI's Impact on Workflow in Music Production

gemini-3.1 · 5/29/2026

Paper 2 demonstrates higher scientific impact due to its broad applicability, technical innovation, and strong empirical results. While Paper 1 provides a valuable qualitative ethnographic study of AI in a specific domain (music production), Paper 2 introduces a novel, scalable AI agent framework (MEMENTO) that addresses a fundamental challenge in machine learning: reasoning in data-scarce environments. By enabling agents to learn procedural and declarative knowledge directly from web interactions, Paper 2 offers significant advancements in LLM architecture and has widespread, scalable applications across numerous professional fields like law and sales.

vs. Robust and Generalizable Safety Steering for Text-to-Image Diffusion Transformers

claude-opus-4.6 · 5/29/2026

MEMENTO introduces a novel paradigm shift—treating the web as a learning signal rather than a retrieval interface—with broad applicability across data-scarce domains. Its dual-channel memory architecture (declarative/procedural) and adaptive exploration tree represent genuinely new ideas with potential impact across NLP, information retrieval, and AI agents research. While SafeDIG addresses an important safety problem with solid methodology, it is more incremental within the narrower T2I safety steering niche. MEMENTO's framework is more generalizable, timely (given the rise of agentic AI), and addresses the fundamental challenge of learning without labeled data.

vs. Tiny but Trusted: Efficient Vision-Language Reasoning for Time-Series Anomaly Detection

claude-opus-4.6 · 5/29/2026

MEMENTO introduces a more novel and broadly applicable paradigm—treating the web as a continuous learning signal with dual-channel memory—that could reshape how AI agents acquire expertise in low-data domains. Its framework is generalizable across many fields beyond the two evaluated. Paper 2, while solid, addresses a narrower problem (time-series anomaly detection with VLMs) and primarily contributes a benchmark and fine-tuned model. MEMENTO's conceptual innovation of web-as-learning-signal and its agentic memory architecture have greater potential to influence multiple research directions.

vs. Defending LLM-based Multi-Agent Systems Against Cooperative Attacks with Sentence-Level Rectification

claude-opus-4.6 · 5/29/2026

MEMENTO introduces a more novel paradigm shift—treating the web as a learning signal rather than a retrieval interface—with a creative dual-channel memory architecture separating declarative and procedural knowledge. It addresses the fundamental and broadly relevant problem of learning in low-data regimes with strong empirical results (+25-36% improvements). Paper 2 addresses an important but more incremental security concern in multi-agent systems. While both are well-motivated, MEMENTO's framework has broader applicability across domains and introduces a more transformative conceptual contribution that could influence how AI agents learn from unstructured information sources.

vs. Multi-Adapter Representation Interventions via Energy Calibration

claude-opus-4.6 · 5/29/2026

Paper 2 (MARI) addresses a fundamental challenge in LLM alignment—sample-adaptive representation intervention—with a novel energy-based gating and multi-adapter mechanism. It offers broader impact across the LLM safety/alignment community, demonstrates results across diverse model families and scales, and provides a reusable methodology applicable to many alignment objectives. Paper 1 (MEMENTO) presents a creative web-as-learning-signal framework but is evaluated on only two narrow domains with improvements over relatively basic baselines (ReAct), limiting its demonstrated generalizability and broader scientific influence.

vs. BioRefusalAudit: Auditing Biosecurity Refusal Depth Using General and Domain-Fine-Tuned Sparse Autoencoders

claude-opus-4.6 · 5/29/2026

MEMENTO introduces a novel framework treating the web as a learning signal with dual-channel memory, addressing the fundamental and broadly relevant problem of learning in low-data regimes. It demonstrates substantial improvements (+25.6% and +36.5%) across professional domains and proposes a reusable architectural paradigm (AET + declarative/procedural memory) applicable across many tasks. Paper 2, while addressing the important topic of biosecurity refusal auditing, is preliminary work conducted over a hackathon weekend with limited scope (small prompt sets, primarily Gemma-family models, consumer hardware constraints), and its findings, though interesting, are more diagnostic than transformative in advancing the field.