PEEK: Context Map as an Orientation Cache for Long-Context LLM Agents

Zhuohan Gu, Qizheng Zhang, Omar Khattab, Samuel Madden

May 19, 2026

arXiv:2605.19932v1

cs.AI (primary) · cs.CL · cs.LG

#418 of 2292 · Artificial Intelligence

Tournament Score: 1482 ± 45 (scale 1050 to 1800)

Win Rate: 72%

Rating: 7.2 / 10

Significance 7.5 · Rigor 7 · Novelty 7.5 · Clarity 8

Abstract

Large language model (LLM) agents increasingly operate over long and recurring external contexts, like document corpora and code repositories. Across invocations, existing approaches preserve either the agent's trajectory, passive access to raw material, or task-level strategies. None of them preserves what we argue is most needed for repeated same-context workloads: reusable orientation knowledge (e.g., what the context contains, how it is organized, and which entities, constants, and schemas have historically been useful) about the recurring context itself. We introduce PEEK, a system that caches and maintains this orientation knowledge as a context map: a small, constant-sized artifact in the agent's prompt that gives it a persistent peek into the external context. The map is maintained by a programmable cache policy with three modules: a Distiller that extracts transferable knowledge from inference-time signals, a Cartographer that translates it into structured edits, and a priority-based Evictor that enforces a fixed token budget. On long-context reasoning and information aggregation, PEEK improves over strong baselines by 6.3-34.0% while using 93-145 fewer iterations and incurring 1.7-5.8x lower cost than the state-of-the-art prompt-learning framework, ACE. On context learning, PEEK improves solving rate and rubric accuracy by 6.0-14.0% and 7.8-12.1%, respectively, at 1.4x lower cost than ACE. These gains generalize across LMs and agent architectures, including OpenAI Codex, a production-grade coding agent. Together, these results show that a context map helps long-context LLM agents interact with recurring external contexts more accurately and efficiently.

AI Impact Assessments

(1 model)

Scientific Impact Assessment: PEEK — Context Map as an Orientation Cache for Long-Context LLM Agents

1. Core Contribution

PEEK introduces the concept of a context map: a bounded, prompt-resident artifact that caches reusable "orientation knowledge" about a recurring external context (e.g., what a corpus contains, how it's organized, key entities and schemas). The key insight is that when an LLM agent repeatedly queries the same large external context across different tasks, it wastes substantial effort re-discovering structural and organizational knowledge each time. PEEK fills what the authors identify as an unoccupied quadrant in the design space: *active external-context state* management — as opposed to active agent/task state (prompt learning), passive external-context access (RAG, compaction), or passive agent state (shared chat, history compaction).

The system implements a three-module programmable cache policy: a Distiller that extracts transferable contextual knowledge from execution trajectories, a Cartographer that translates these into structured edits (ADD/DELETE/REPLACE), and a priority-based Evictor that enforces a fixed token budget. The separation of concerns (extraction vs. editing vs. eviction) is directly validated through ablations.
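As a rough illustration of the Cartographer and Evictor roles described above, the sketch below applies structured ADD/DELETE/REPLACE edits to a small map and then evicts the lowest-priority entries until a token budget is met. It is not the authors' implementation: the `ContextMap` and `Entry` classes, the edit dictionary format, the priority values, and the whitespace-based `count_tokens` are all assumptions made for this example, and the Distiller (an LLM call in the paper) is left out, so the edits are supplied by hand.

```python
from dataclasses import dataclass, field


def count_tokens(text: str) -> int:
    # Stand-in tokenizer (assumption): one token per whitespace-separated word.
    return len(text.split())


@dataclass
class Entry:
    text: str
    priority: float  # hypothetical score; higher means more worth keeping


@dataclass
class ContextMap:
    budget: int = 1024  # token budget B (1024 is the default cited in the review)
    entries: dict[str, Entry] = field(default_factory=dict)

    def size(self) -> int:
        return sum(count_tokens(e.text) for e in self.entries.values())

    def apply(self, edits: list[dict]) -> list[str]:
        # Cartographer-style structured edits: ADD, REPLACE, DELETE.
        for edit in edits:
            op, key = edit["op"], edit["key"]
            if op in ("ADD", "REPLACE"):
                self.entries[key] = Entry(edit["text"], edit.get("priority", 1.0))
            elif op == "DELETE":
                self.entries.pop(key, None)
            else:
                raise ValueError(f"unknown op: {op}")
        return self.evict()

    def evict(self) -> list[str]:
        # Evictor: drop lowest-priority entries until the map fits the budget.
        evicted = []
        while self.entries and self.size() > self.budget:
            victim = min(self.entries, key=lambda k: self.entries[k].priority)
            del self.entries[victim]
            evicted.append(victim)
        return evicted

    def render(self) -> str:
        # Text that would be placed in the agent's prompt.
        return "\n".join(f"- {k}: {e.text}" for k, e in self.entries.items())


# Tiny budget so that eviction is visible in the example.
cmap = ContextMap(budget=8)
evicted = cmap.apply([
    {"op": "ADD", "key": "layout", "text": "records are one per line", "priority": 3.0},
    {"op": "ADD", "key": "labels", "text": "four topic labels", "priority": 2.0},
    {"op": "ADD", "key": "noise", "text": "first query was slow", "priority": 0.5},
])
print(cmap.render())
print("evicted:", evicted)
```

With the 8-token budget used here, the three additions total 12 tokens under the toy tokenizer, so the lowest-priority entry is dropped and the two higher-priority entries remain. A real system would need an actual tokenizer and a principled way to assign priorities, neither of which this sketch attempts.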

2. Methodological Rigor

Strengths in evaluation design:

Experiments span two complementary task categories: reasoning/aggregation (OOLONG) and context learning (CL-bench), both natively structured for the target scenario of multiple tasks over the same context.

Comprehensive baselines including RAG, shared chat, compaction (MemAgent), and the SOTA prompt-learning framework ACE, all implemented on the same RLM backbone for fair comparison.

Generalization is tested across three LMs (GPT-5-mini, GPT-5.5, Qwen3-Coder-Next) and two agent architectures (RLM, Codex), which is commendable.

Detailed cost breakdowns (Tables 4-7) provide transparency about where computational resources are spent.

Ablations isolate contributions of eviction, the Distiller-Cartographer separation, and cache size.

Concerns:

The Distiller and Cartographer are themselves LLM calls, meaning PEEK relies on the LLM's ability to extract transferable knowledge from trajectories. This is acknowledged but not deeply analyzed — the quality of distillation is a black box.

The paper acknowledges that existing benchmarks poorly fit the recurring-context scenario (Appendix D details three datasets that didn't work), raising questions about ecological validity. The authors are transparent about this limitation and call for better benchmarks.

Ground truth is stated to be unavailable to the Distiller by default, but the prompts (Appendix E.3) include ground truth fields marked as "not applicable," leaving ambiguity about whether some configurations do use it.

The default token budget B=1024 is not tuned, and the ablation shows B=1024 is not uniformly best (B=512 beats it on AGNews), suggesting some sensitivity to this hyperparameter.

3. Potential Impact

Practical relevance: The scenario of repeatedly querying the same large context (enterprise analytics over feedback corpora, code repositories, legal document sets) is genuinely common in production LLM deployments. PEEK addresses this with a lightweight, model-agnostic mechanism that doesn't require fine-tuning or architectural changes.

Systems perspective: The framing of the context map as a cache with explicit eviction policies, budget constraints, and update mechanisms bridges computer systems concepts with LLM agent design. This could inspire a broader line of work on principled caching abstractions for LLM agents.

Cost implications: The 1.7-5.8× cost reduction compared to ACE while achieving better quality is practically significant, especially at the frontier model price points documented ($5-30/M tokens for GPT-5.5).

Broader influence: The taxonomy in Figure 2 (active/passive × agent-state/context-state) provides a useful conceptual framework for the field. The identification of "orientation knowledge" as a distinct, cacheable category of information is a valuable conceptual contribution.
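To make the cost comparison under "Cost implications" concrete, the sketch below shows how a per-run dollar cost and a cost ratio between two methods could be computed from token counts. It assumes the quoted $5-30/M range means $5 per million input tokens and $30 per million output tokens, which the assessment does not state explicitly. The token counts are invented purely for illustration and are not taken from the paper; `run_cost_usd` is a hypothetical helper.

```python
def run_cost_usd(input_tokens: int, output_tokens: int,
                 input_price_per_m: float = 5.0,
                 output_price_per_m: float = 30.0) -> float:
    """Dollar cost of one run given token counts and per-million-token prices.

    The default prices are an assumed reading of the "$5-30/M tokens" figure.
    """
    total = input_tokens * input_price_per_m + output_tokens * output_price_per_m
    return total / 1_000_000


# Hypothetical token counts, NOT measurements from the paper.
baseline_cost = run_cost_usd(input_tokens=4_000_000, output_tokens=200_000)
with_map_cost = run_cost_usd(input_tokens=1_000_000, output_tokens=100_000)
cost_ratio = baseline_cost / with_map_cost

print(f"baseline: ${baseline_cost:.2f}")
print(f"with map: ${with_map_cost:.2f}")
print(f"ratio:    {cost_ratio:.2f}x")
```

The point of the sketch is only that a method which needs fewer iterations reads and writes fewer tokens, so its cost falls roughly in proportion at fixed prices; the actual breakdowns are in the paper's Tables 4-7.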

4. Timeliness & Relevance

This work is highly timely. The proliferation of coding agents (Claude Code, Codex, Cursor) and long-context enterprise applications creates immediate demand for efficient repeated-context interaction. The paper directly engages with production systems (OpenAI Codex) and contemporary models (GPT-5.5, released April 2026). The distinction from KV-cache optimization — which operates at the model level rather than the agent level — is well-articulated and positions PEEK as complementary to, rather than competing with, ongoing systems-level work.

5. Strengths & Limitations

Key Strengths:

Clean conceptual contribution: The context map abstraction and its positioning in the design space is clearly articulated and fills a genuine gap.

Strong empirical results: Consistent improvements across all benchmarks, models, and agent architectures, with meaningful margins (6-34% on OOLONG, 6-14% on CL-bench).

Efficiency: PEEK achieves better results with fewer iterations *and* lower cost, sitting on the Pareto frontier across all benchmarks.

Thorough negative results: Appendix B.2 documents five approaches that didn't work, providing valuable guidance for future researchers.

Reproducibility: Full prompts, cost breakdowns, and model identifiers are provided.

Notable Weaknesses:

Limited benchmark diversity: Only two benchmark families, with the authors themselves acknowledging the need for better-suited benchmarks.

LLM-dependent cache policy: The Distiller and Cartographer are LLM calls themselves, creating a recursive dependency on LLM quality that isn't deeply analyzed.

No analysis of map content quality: There's limited systematic analysis of what the context maps actually learn to contain across different contexts and how content quality degrades or improves.

Scalability questions: The paper shows that m ≤ 4 evolution steps are sufficient, but it is unclear how the system performs over much longer task sequences (hundreds of queries) or when the external context itself changes over time.

No comparison with structured indexing: The paper doesn't compare against pre-computed structural indexes (e.g., automatically generated table of contents, entity extraction pipelines) which could serve similar orientation purposes without requiring trajectory analysis.

Overall Assessment

PEEK makes a clean, well-motivated contribution to the increasingly important problem of efficient agent interaction with recurring contexts. The conceptual framing is strong, the empirical evidence is thorough within its scope, and the practical implications are significant. The main limitations concern benchmark diversity and depth of analysis of the cache mechanism itself. This work is likely to influence how the community thinks about persistent state management for LLM agents.

Rating: 7.2 / 10

Significance 7.5 · Rigor 7 · Novelty 7.5 · Clarity 8

Generated May 20, 2026

Comparison History (18)

vs. Open-World Evaluations for Measuring Frontier AI Capabilities

gpt-5.2 · 5/21/2026

Paper 1 proposes a concrete, novel system (a persistent, budgeted “context map” cache with explicit modules) and reports substantial efficiency and accuracy gains across tasks, models, and even a production coding agent—suggesting strong real-world applicability and methodological rigor with measurable benchmarks. Its ideas could broadly influence long-context agent design, memory/caching strategies, and tooling for recurring corpora. Paper 2 is timely and valuable conceptually for evaluation practice, but is largely a survey/framework with limited empirical depth (small-sample qualitative case study), making near-term scientific impact less certain.

vs. DeepWeb-Bench: A Deep Research Benchmark Demanding Massive Cross-Source Evidence and Long-Horizon Derivation

gpt-5.2 · 5/21/2026

Paper 2 likely has higher scientific impact because a public, auditable benchmark for deep web research can become a widely adopted standard across labs, driving measurable progress and enabling comparative evaluation over time. Its methodology (capability taxonomy, provenance records, disclosure levels, cross-source checks, multi-model analysis) directly addresses a timely gap: existing benchmarks saturating for frontier systems. While Paper 1 is novel and practically useful for recurring-context agents, it is more system-specific and may see narrower adoption than a broadly applicable benchmark shaping evaluation and research agendas.

vs. Probing Embodied LLMs: When Higher Observation Fidelity Hurts Problem Solving

gpt-5.2 · 5/20/2026

Paper 2 likely has higher impact: it introduces a broadly applicable, reusable system concept (constant-sized “context map” + cache policy) for long-context agents, with clear efficiency/cost gains and demonstrated generalization across LMs and a production-grade coding agent—strong real-world applicability and timeliness as long-context workflows proliferate. Paper 1 is novel and insightful for embodied evaluation, but appears narrower (specific task/setup) and more diagnostic than enabling; impact may be concentrated in embodied AI methodology rather than across many agent deployments.

vs. Memory-Guided Tree Search with Cross-Branch Knowledge Transfer for LLM Solver Synthesis

gemini-3.1 · 5/20/2026

While both papers introduce innovative memory mechanisms for LLMs, Paper 1 (PEEK) has higher potential scientific impact due to its broader applicability. PEEK addresses a universal bottleneck in LLM agents: efficiently handling recurring long contexts like codebases and document corpora. Its 'context map' approach yields significant performance gains and up to 5.8x cost reductions over SOTA. While Paper 2 presents a rigorous approach for combinatorial optimization, Paper 1's methodology can be integrated into almost any general-purpose agentic workflow, promising wider adoption across diverse domains and stronger immediate relevance to the growing ecosystem of LLM applications.

vs. MOCHA: Multi-Objective Chebyshev Annealing for Agent Skill Optimization

gemini-3.1 · 5/20/2026

Paper 1 introduces a novel architectural paradigm (context maps) to solve a critical bottleneck in LLM agents: efficiently handling long, recurring contexts. By reducing costs up to 5.8x and significantly improving accuracy, it offers high real-world applicability for scaling agentic systems. Paper 2 presents a rigorous optimization method for agent skills, but Paper 1's contribution to agent memory and context management addresses a more fundamental and widely pressing challenge in the field, likely leading to broader adoption and architectural impact.

vs. Harnessing Agentic Evolution

gemini-3.1 · 5/20/2026

Paper 2 addresses a more fundamental challenge in AI: agentic evolution and open-ended optimization. By introducing a meta-agent to refine the evolutionary process itself, it advances the pursuit of self-improving systems. This has broader theoretical implications across scientific discovery and algorithmic optimization compared to Paper 1, which offers a highly practical but more narrowly scoped caching mechanism for long-context LLM interactions.

vs. Learning Quantifiable Visual Explanations Without Ground-Truth

claude-opus-4.6 · 5/20/2026

PEEK addresses the timely and high-impact problem of improving LLM agents operating over long contexts, which is central to the rapidly growing field of AI agents. It introduces a novel concept (context maps as orientation caches), demonstrates strong empirical gains across multiple benchmarks and architectures including production systems (OpenAI Codex), and offers practical efficiency improvements (lower cost, fewer iterations). Paper 2 contributes a useful XAI metric and method, but operates in a more mature and narrower subfield with less transformative potential. PEEK's broader applicability to the booming LLM agent ecosystem gives it higher impact potential.

vs. Learning to Hand Off: Provably Convergent Workflow Learning under Interface Constraints

claude-opus-4.6 · 5/20/2026

Paper 2 makes a stronger theoretical contribution by formalizing interface-constrained semi-MDPs and providing the first finite-sample guarantee for neural Q-learning under decentralized partial observability. This foundational result—lifting AIS to multi-agent SMDPs with provable convergence bounds—has broader impact across multi-agent systems, decentralized learning, and LLM pipelines spanning trust boundaries. While Paper 1 presents a practical and effective caching system (PEEK) with solid empirical gains, its contribution is more engineering-focused. Paper 2's novel theoretical framework with clean decomposable error bounds opens new research directions in decentralized multi-agent learning.

vs. Trustworthy Agent Network: Trust in Agent Networks Must Be Baked In, Not Bolted On

claude-opus-4.6 · 5/20/2026

PEEK presents a novel, concrete system with strong empirical results (6-34% improvements, significant cost reductions) addressing a practical problem in LLM agent design. It introduces a well-defined architectural contribution (context maps with Distiller/Cartographer/Evictor modules) validated across multiple benchmarks and agent architectures including production systems. Paper 2 is a vision paper proposing a conceptual framework for trustworthy agent networks without empirical validation. While timely and relevant, vision papers typically have lower immediate scientific impact than systems papers with demonstrated results and reproducible methodology.

vs. NGM: A Plug-and-Play Training-Free Memory Module for LLMs

claude-opus-4.6 · 5/20/2026

PEEK introduces a novel conceptual framework—orientation knowledge caching via context maps—that addresses a fundamental and increasingly important problem in LLM agent systems operating over recurring contexts. Its improvements are substantially larger (6-34% vs 0.5-1.2 points), it demonstrates significant cost/efficiency gains (1.7-5.8x lower cost), and it generalizes across architectures including production systems like OpenAI Codex. The concept of reusable orientation knowledge is a fresh abstraction with broad applicability. NGM, while useful, offers modest incremental gains through a relatively straightforward n-gram averaging technique with limited conceptual novelty.

vs. Agentic Trading: When LLM Agents Meet Financial Markets

gemini-3.1 · 5/20/2026

Paper 2 introduces a novel, domain-agnostic architectural improvement (context mapping) that addresses a critical bottleneck in LLM efficiency and long-context reasoning. Its broad applicability across various agent architectures and tasks yields higher potential scientific impact than Paper 1, which serves as a valuable but niche critical audit restricted to the intersection of LLMs and financial trading.

vs. AutoResearchClaw: Self-Reinforcing Autonomous Research with Human-AI Collaboration

gpt-5.2 · 5/20/2026

Paper 1 (PEEK) introduces a clear, novel abstraction—persistent, constant-sized “orientation knowledge” cached as a context map—with a concrete cache policy (distill/cartograph/evict) and strong, efficiency-focused results across models and agent architectures, including a production coding agent. Its methodological contribution is crisp and likely reusable across many long-context, recurring-context applications (codebases, corpora, enterprise knowledge). Paper 2 targets an important area, but resembles an integration of known autonomy components (debate, self-healing loops, HITL) and its impact may be more benchmark- and system-specific.

vs. Evaluating the Utility of Personal Health Records in Personalized Health AI

gemini-3.1 · 5/20/2026

Paper 2 proposes a fundamental architectural innovation (PEEK) for LLM agents dealing with long-context environments. Its introduction of a context map as an orientation cache solves a core efficiency and reasoning bottleneck in modern AI systems, offering broad utility across multiple domains like software engineering and information retrieval. While Paper 1 is a valuable empirical evaluation of LLMs in healthcare, Paper 2's methodological advancements in core AI agent design are likely to have a wider, more foundational scientific impact and spur more downstream research.

vs. Generative-Evaluative Agreement: A Necessary Validity Criterion for LLM-Enabled Adaptive Assessment

claude-opus-4.6 · 5/20/2026

PEEK introduces a novel and practical system architecture for long-context LLM agents with strong empirical results (6-34% improvements, significant cost reductions) across multiple benchmarks and agent architectures including production systems. It addresses a widely relevant problem in the rapidly growing LLM agent ecosystem. Paper 2 introduces an interesting validity criterion (GEA) for LLM-based assessment, but its scope is narrower (educational assessment), the empirical study is preliminary (single measurement), and the findings (r=0.698, systematic bias) primarily highlight limitations without providing strong solutions. PEEK's broader applicability and demonstrated practical gains suggest higher impact.

vs. Capturing LLM Capabilities via Evidence-Calibrated Query Clustering

claude-opus-4.6 · 5/20/2026

PEEK introduces a novel and practical architectural concept—context maps as reusable orientation caches for LLM agents—addressing a fundamental gap in how agents interact with recurring contexts. It demonstrates substantial improvements across multiple tasks, architectures, and real production systems (OpenAI Codex), with strong cost efficiency gains. The breadth of applications (document reasoning, code repositories, context learning) and the generalizable framework design give it wider impact potential. Paper 2 contributes a useful evaluation methodology but addresses a narrower problem (LLM capability clustering/ranking) with more limited downstream applications.

vs. Causal Evidence for Attention Head Imbalance in Modality Conflict Hallucination

gemini-3.1 · 5/20/2026

Paper 2 provides fundamental mechanistic insights into the internal workings of multimodal LLMs, specifically addressing the critical issue of modality-conflict hallucinations. By uncovering the causal roles of specific attention heads and proposing a targeted intervention, it advances our theoretical understanding and offers a principled solution. While Paper 1 presents a highly practical and efficient system for long-context agents, Paper 2's focus on interpretability and internal mechanisms is likely to spur deeper, foundational research across the rapidly growing field of multimodal architectures.

vs. Swimming with Whales: Analysis of Power Imbalances in Stake-Weighted Governance

gpt-5.2 · 5/20/2026

Paper 2 (PEEK) is likely to have higher scientific impact due to stronger timeliness and broad applicability: efficient long-context LLM agents are a central, fast-moving problem with immediate relevance to many domains (software engineering, IR, HCI, NLP systems). The proposed context-map cache is a novel systems idea with clear real-world deployment potential and reported sizable efficiency/accuracy gains across models and agent setups. Paper 1 addresses an important but narrower blockchain-governance niche; its contributions are rigorous and valuable, but the impact is likely more field-specific and less broadly transferable.

vs. Not Every Rubric Teaches Equally: Policy-Aware Rubric Rewards for RLVR

gemini-3.1 · 5/20/2026

Paper 1 addresses a critical bottleneck in the highly impactful area of reinforcement learning for LLM post-training. By dynamically adapting reward weights in GRPO, it directly improves the efficiency and alignment of foundation models, which is currently a central focus of the field. While Paper 2 offers strong practical benefits for agent workflows, Paper 1 provides a foundational algorithmic improvement to model training that could broadly influence how future reasoning models are aligned.