PEEK: Context Map as an Orientation Cache for Long-Context LLM Agents
Zhuohan Gu, Qizheng Zhang, Omar Khattab, Samuel Madden
Abstract
Large language model (LLM) agents increasingly operate over long and recurring external contexts, like document corpora and code repositories. Across invocations, existing approaches preserve either the agent's trajectory, passive access to raw material, or task-level strategies. None of them preserves what we argue is most needed for repeated same-context workloads: reusable orientation knowledge (e.g., what the context contains, how it is organized, and which entities, constants, and schemas have historically been useful) about the recurring context itself. We introduce PEEK, a system that caches and maintains this orientation knowledge as a context map: a small, constant-sized artifact in the agent's prompt that gives it a persistent peek into the external context. The map is maintained by a programmable cache policy with three modules: a Distiller that extracts transferable knowledge from inference-time signals, a Cartographer that translates it into structured edits, and a priority-based Evictor that enforces a fixed token budget. On long-context reasoning and information aggregation, PEEK improves over strong baselines by 6.3-34.0% while using 93-145 fewer iterations and incurring 1.7-5.8x lower cost than the state-of-the-art prompt-learning framework, ACE. On context learning, PEEK improves solving rate and rubric accuracy by 6.0-14.0% and 7.8-12.1%, respectively, at 1.4x lower cost than ACE. These gains generalize across LMs and agent architectures, including OpenAI Codex, a production-grade coding agent. Together, these results show that a context map helps long-context LLM agents interact with recurring external contexts more accurately and efficiently.
AI Impact Assessments
Scientific Impact Assessment: PEEK (Context Map as an Orientation Cache for Long-Context LLM Agents)
1. Core Contribution
PEEK introduces the concept of a context map: a bounded, prompt-resident artifact that caches reusable "orientation knowledge" about a recurring external context (e.g., what a corpus contains, how it is organized, and its key entities and schemas). The key insight is that when an LLM agent repeatedly queries the same large external context across different tasks, it wastes substantial effort rediscovering structural and organizational knowledge each time. PEEK fills what the authors identify as an unoccupied quadrant in the design space: active management of external-context state, as opposed to active agent/task state (prompt learning), passive external-context access (RAG, compaction), or passive agent state (shared chat, history compaction).
The system implements a three-module programmable cache policy: a Distiller that extracts transferable contextual knowledge from execution trajectories, a Cartographer that translates these into structured edits (ADD/DELETE/REPLACE), and a priority-based Evictor that enforces a fixed token budget. The separation of concerns (extraction vs. editing vs. eviction) is directly validated through ablations.
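To make the edit-then-evict flow concrete, here is a minimal Python sketch of a budgeted context map. It is an illustration under stated assumptions, not the authors' implementation: the names `ContextMap`, `Entry`, and `count_tokens`, the whitespace token count, the numeric priority field, and the upsert semantics for ADD/REPLACE are all hypothetical. The Distiller and Cartographer are LLM-driven in PEEK, so the edits here are hand-written stand-ins for their output.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

# (operation, key, text, priority); a hypothetical edit format for this sketch.
Edit = Tuple[str, str, str, float]


def count_tokens(text: str) -> int:
    """Crude whitespace token count, assumed here in place of a real tokenizer."""
    return len(text.split())


@dataclass
class Entry:
    text: str
    priority: float  # assumed usefulness score; higher survives eviction longer


@dataclass
class ContextMap:
    budget: int  # fixed token budget for the whole map
    entries: Dict[str, Entry] = field(default_factory=dict)

    def size(self) -> int:
        """Total tokens currently held by the map."""
        return sum(count_tokens(e.text) for e in self.entries.values())

    def apply(self, edits: List[Edit]) -> List[str]:
        """Apply structured edits, then evict until the budget holds.

        Returns the keys that were evicted.
        """
        for op, key, text, priority in edits:
            if op in ("ADD", "REPLACE"):
                # Simplification: both operations upsert the entry.
                self.entries[key] = Entry(text, priority)
            elif op == "DELETE":
                self.entries.pop(key, None)
            else:
                raise ValueError(f"unknown edit operation: {op}")
        return self.evict()

    def evict(self) -> List[str]:
        """Drop lowest-priority entries while the map exceeds its budget."""
        evicted: List[str] = []
        while self.entries and self.size() > self.budget:
            victim = min(self.entries, key=lambda k: self.entries[k].priority)
            del self.entries[victim]
            evicted.append(victim)
        return evicted

    def render(self) -> str:
        """Serialize the map as the text block that would sit in the prompt."""
        return "\n".join(f"- {k}: {e.text}" for k, e in self.entries.items())


if __name__ == "__main__":
    cmap = ContextMap(budget=6)
    cmap.apply([
        ("ADD", "schema", "orders table has id total", 0.9),
        ("ADD", "note", "readme is outdated", 0.2),
    ])
    print(cmap.render())
```

In this sketch the two additions total eight tokens against a budget of six, so the lower-priority `note` entry is evicted and the map stays within its bound. How PEEK actually assigns priorities and counts tokens is not specified in this summary.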
2. Methodological Rigor
Strengths in evaluation design:
Concerns:
3. Potential Impact
Practical relevance: The scenario of repeatedly querying the same large context (enterprise analytics over feedback corpora, code repositories, legal document sets) is genuinely common in production LLM deployments. PEEK addresses this with a lightweight, model-agnostic mechanism that doesn't require fine-tuning or architectural changes.
Systems perspective: The framing of the context map as a cache with explicit eviction policies, budget constraints, and update mechanisms bridges computer systems concepts with LLM agent design. This could inspire a broader line of work on principled caching abstractions for LLM agents.
Cost implications: The 1.7-5.8× cost reduction compared to ACE while achieving better quality is practically significant, especially at the frontier model price points documented ($5-30/M tokens for GPT-5.5).
Broader influence: The taxonomy in Figure 2 (active/passive × agent-state/context-state) provides a useful conceptual framework for the field. The identification of "orientation knowledge" as a distinct, cacheable category of information is a valuable conceptual contribution.
4. Timeliness & Relevance
This work is highly timely. The proliferation of coding agents (Claude Code, Codex, Cursor) and long-context enterprise applications creates immediate demand for efficient repeated-context interaction. The paper directly engages with production systems (OpenAI Codex) and contemporary models (GPT-5.5, released April 2026). The distinction from KV-cache optimization — which operates at the model level rather than the agent level — is well-articulated and positions PEEK as complementary to, rather than competing with, ongoing systems-level work.
5. Strengths & Limitations
Key Strengths:
Notable Weaknesses:
Overall Assessment
PEEK makes a clean, well-motivated contribution to the increasingly important problem of efficient agent interaction with recurring contexts. The conceptual framing is strong, the empirical evidence is thorough within its scope, and the practical implications are significant. The main limitations concern benchmark diversity and depth of analysis of the cache mechanism itself. This work is likely to influence how the community thinks about persistent state management for LLM agents.
Generated May 20, 2026
Comparison History (18)
Paper 1 proposes a concrete, novel system (a persistent, budgeted “context map” cache with explicit modules) and reports substantial efficiency and accuracy gains across tasks, models, and even a production coding agent—suggesting strong real-world applicability and methodological rigor with measurable benchmarks. Its ideas could broadly influence long-context agent design, memory/caching strategies, and tooling for recurring corpora. Paper 2 is timely and valuable conceptually for evaluation practice, but is largely a survey/framework with limited empirical depth (small-sample qualitative case study), making near-term scientific impact less certain.
Paper 2 likely has higher scientific impact because a public, auditable benchmark for deep web research can become a widely adopted standard across labs, driving measurable progress and enabling comparative evaluation over time. Its methodology (capability taxonomy, provenance records, disclosure levels, cross-source checks, multi-model analysis) directly addresses a timely gap: existing benchmarks saturating for frontier systems. While Paper 1 is novel and practically useful for recurring-context agents, it is more system-specific and may see narrower adoption than a broadly applicable benchmark shaping evaluation and research agendas.
Paper 2 likely has higher impact: it introduces a broadly applicable, reusable system concept (constant-sized “context map” + cache policy) for long-context agents, with clear efficiency/cost gains and demonstrated generalization across LMs and a production-grade coding agent—strong real-world applicability and timeliness as long-context workflows proliferate. Paper 1 is novel and insightful for embodied evaluation, but appears narrower (specific task/setup) and more diagnostic than enabling; impact may be concentrated in embodied AI methodology rather than across many agent deployments.
While both papers introduce innovative memory mechanisms for LLMs, Paper 1 (PEEK) has higher potential scientific impact due to its broader applicability. PEEK addresses a universal bottleneck in LLM agents: efficiently handling recurring long contexts like codebases and document corpora. Its 'context map' approach yields significant performance gains and up to 5.8x cost reductions over SOTA. While Paper 2 presents a rigorous approach for combinatorial optimization, Paper 1's methodology can be integrated into almost any general-purpose agentic workflow, promising wider adoption across diverse domains and stronger immediate relevance to the growing ecosystem of LLM applications.
Paper 1 introduces a novel architectural paradigm (context maps) to solve a critical bottleneck in LLM agents: efficiently handling long, recurring contexts. By reducing costs up to 5.8x and significantly improving accuracy, it offers high real-world applicability for scaling agentic systems. Paper 2 presents a rigorous optimization method for agent skills, but Paper 1's contribution to agent memory and context management addresses a more fundamental and widely pressing challenge in the field, likely leading to broader adoption and architectural impact.
Paper 2 addresses a more fundamental challenge in AI: agentic evolution and open-ended optimization. By introducing a meta-agent to refine the evolutionary process itself, it advances the pursuit of self-improving systems. This has broader theoretical implications across scientific discovery and algorithmic optimization compared to Paper 1, which offers a highly practical but more narrowly scoped caching mechanism for long-context LLM interactions.
PEEK addresses the timely and high-impact problem of improving LLM agents operating over long contexts, which is central to the rapidly growing field of AI agents. It introduces a novel concept (context maps as orientation caches), demonstrates strong empirical gains across multiple benchmarks and architectures including production systems (OpenAI Codex), and offers practical efficiency improvements (lower cost, fewer iterations). Paper 2 contributes a useful XAI metric and method, but operates in a more mature and narrower subfield with less transformative potential. PEEK's broader applicability to the booming LLM agent ecosystem gives it higher impact potential.
Paper 2 makes a stronger theoretical contribution by formalizing interface-constrained semi-MDPs and providing the first finite-sample guarantee for neural Q-learning under decentralized partial observability. This foundational result—lifting AIS to multi-agent SMDPs with provable convergence bounds—has broader impact across multi-agent systems, decentralized learning, and LLM pipelines spanning trust boundaries. While Paper 1 presents a practical and effective caching system (PEEK) with solid empirical gains, its contribution is more engineering-focused. Paper 2's novel theoretical framework with clean decomposable error bounds opens new research directions in decentralized multi-agent learning.
PEEK presents a novel, concrete system with strong empirical results (6-34% improvements, significant cost reductions) addressing a practical problem in LLM agent design. It introduces a well-defined architectural contribution (context maps with Distiller/Cartographer/Evictor modules) validated across multiple benchmarks and agent architectures including production systems. Paper 2 is a vision paper proposing a conceptual framework for trustworthy agent networks without empirical validation. While timely and relevant, vision papers typically have lower immediate scientific impact than systems papers with demonstrated results and reproducible methodology.
PEEK introduces a novel conceptual framework—orientation knowledge caching via context maps—that addresses a fundamental and increasingly important problem in LLM agent systems operating over recurring contexts. Its improvements are substantially larger (6-34% vs 0.5-1.2 points), it demonstrates significant cost/efficiency gains (1.7-5.8x lower cost), and it generalizes across architectures including production systems like OpenAI Codex. The concept of reusable orientation knowledge is a fresh abstraction with broad applicability. NGM, while useful, offers modest incremental gains through a relatively straightforward n-gram averaging technique with limited conceptual novelty.
Paper 2 introduces a novel, domain-agnostic architectural improvement (context mapping) that addresses a critical bottleneck in LLM efficiency and long-context reasoning. Its broad applicability across various agent architectures and tasks yields higher potential scientific impact than Paper 1, which serves as a valuable but niche critical audit restricted to the intersection of LLMs and financial trading.
Paper 1 (PEEK) introduces a clear, novel abstraction—persistent, constant-sized “orientation knowledge” cached as a context map—with a concrete cache policy (distill/cartograph/evict) and strong, efficiency-focused results across models and agent architectures, including a production coding agent. Its methodological contribution is crisp and likely reusable across many long-context, recurring-context applications (codebases, corpora, enterprise knowledge). Paper 2 targets an important area, but resembles an integration of known autonomy components (debate, self-healing loops, HITL) and its impact may be more benchmark- and system-specific.
Paper 2 proposes a fundamental architectural innovation (PEEK) for LLM agents dealing with long-context environments. Its introduction of a context map as an orientation cache solves a core efficiency and reasoning bottleneck in modern AI systems, offering broad utility across multiple domains like software engineering and information retrieval. While Paper 1 is a valuable empirical evaluation of LLMs in healthcare, Paper 2's methodological advancements in core AI agent design are likely to have a wider, more foundational scientific impact and spur more downstream research.
PEEK introduces a novel and practical system architecture for long-context LLM agents with strong empirical results (6-34% improvements, significant cost reductions) across multiple benchmarks and agent architectures including production systems. It addresses a widely relevant problem in the rapidly growing LLM agent ecosystem. Paper 2 introduces an interesting validity criterion (GEA) for LLM-based assessment, but its scope is narrower (educational assessment), the empirical study is preliminary (single measurement), and the findings (r=0.698, systematic bias) primarily highlight limitations without providing strong solutions. PEEK's broader applicability and demonstrated practical gains suggest higher impact.
PEEK introduces a novel and practical architectural concept—context maps as reusable orientation caches for LLM agents—addressing a fundamental gap in how agents interact with recurring contexts. It demonstrates substantial improvements across multiple tasks, architectures, and real production systems (OpenAI Codex), with strong cost efficiency gains. The breadth of applications (document reasoning, code repositories, context learning) and the generalizable framework design give it wider impact potential. Paper 2 contributes a useful evaluation methodology but addresses a narrower problem (LLM capability clustering/ranking) with more limited downstream applications.
Paper 2 provides fundamental mechanistic insights into the internal workings of multimodal LLMs, specifically addressing the critical issue of modality-conflict hallucinations. By uncovering the causal roles of specific attention heads and proposing a targeted intervention, it advances our theoretical understanding and offers a principled solution. While Paper 1 presents a highly practical and efficient system for long-context agents, Paper 2's focus on interpretability and internal mechanisms is likely to spur deeper, foundational research across the rapidly growing field of multimodal architectures.
Paper 2 (PEEK) is likely to have higher scientific impact due to stronger timeliness and broad applicability: efficient long-context LLM agents are a central, fast-moving problem with immediate relevance to many domains (software engineering, IR, HCI, NLP systems). The proposed context-map cache is a novel systems idea with clear real-world deployment potential and reported sizable efficiency/accuracy gains across models and agent setups. Paper 1 addresses an important but narrower blockchain-governance niche; its contributions are rigorous and valuable, but the impact is likely more field-specific and less broadly transferable.
Paper 1 addresses a critical bottleneck in the highly impactful area of reinforcement learning for LLM post-training. By dynamically adapting reward weights in GRPO, it directly improves the efficiency and alignment of foundation models, which is currently a central focus of the field. While Paper 2 offers strong practical benefits for agent workflows, Paper 1 provides a foundational algorithmic improvement to model training that could broadly influence how future reasoning models are aligned.