Nathan Ordonez, Thomas Parnell
Retrieval-augmented and agentic workloads repeatedly prefill recurring predictable structured inputs (which we call "spans") such as documents and code files. Yet, prefix caching in engines such as vLLM cannot reuse their KV entries unless they share identical prefixes with another request, while Position-Independent Caching (PIC) implementations within production-grade inference servers typically either require substantial server code changes or keep KV state outside the server, incurring host-to-device transfer overhead. We present Minimalistic PIC (MiniPIC): a minimal, flexible and fast vLLM design built from two ingredients: positional-encoding-free KV cache and user-controlled cache-reuse primitives. MiniPIC stores unrotated K vectors in the KV cache, applies RoPE to K tiles inside attention using per-request logical positions, and exposes three user-facing and token-level primitives: block-aligned padding, span separator (SSep), and prompt depend (PDep), that modify hashing behavior and effective block-level causal attention structure. With fewer than 100 lines of core-engine changes plus a custom attention backend, these primitives are sufficient to realize multiple PIC methods, including Block-Attention, EPIC, and Prompt Cache, within the same running vLLM instance, while natively integrating with KV cache CPU offload implementations. On 2WikiMultihopQA, MiniPIC with interleaved scheduling improves prefill throughput by 49% over baseline vLLM, reduces cached-span time-to-first-token by up to two orders of magnitude, preserves the linear prefill scaling of uncached spans, and incurs only 5.7% worst-case overhead.
MiniPIC addresses a concrete systems problem in LLM inference: how to efficiently reuse KV cache entries for recurring prompt segments (spans) that appear at different positions across requests. The key insight is that position-independent caching (PIC) can be achieved with minimal engine modifications by (1) storing unrotated K vectors in the KV cache and applying RoPE inside the attention kernel at serving time, and (2) exposing three user-facing, token-level primitives (block-aligned padding, SSep, and PDep) that modify block-hashing behavior and the effective block-level attention structure to control cache reuse.
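To make point (1) concrete, here is a minimal pure-Python sketch of why an unrotated K cache is position-independent. It is an illustration only, not MiniPIC's attention backend: the helper names (rope, scores), the 4-dimensional toy vectors, and the positions are all assumptions chosen for the example. It relies on the standard RoPE property that the rotated query-key dot product depends only on the relative offset between the two positions.

```python
import math

def rope(vec, pos, base=10000.0):
    """Rotate each pair (vec[i], vec[i+1]) by the angle pos * base**(-i/d)."""
    d = len(vec)
    out = []
    for i in range(0, d, 2):
        theta = pos * base ** (-i / d)
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out += [x * c - y * s, x * s + y * c]
    return out

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def scores(query, q_pos, cached_k, start_pos):
    """Attention logits where cached K is rotated at read time,
    using the logical positions of the current request."""
    rq = rope(query, q_pos)
    return [dot(rq, rope(k, start_pos + j)) for j, k in enumerate(cached_k)]

# Toy position-free cache for one span: K vectors stored unrotated.
span_k_cache = [[0.3, -1.2, 0.7, 0.5], [1.1, 0.4, -0.6, 0.9]]
q = [0.2, 0.8, -0.5, 1.0]

# The same cached span sits at offset 0 in one request and 100 in another;
# the query is 5 tokens after the span start in both cases.
scores_a = scores(q, q_pos=5, cached_k=span_k_cache, start_pos=0)
scores_b = scores(q, q_pos=105, cached_k=span_k_cache, start_pos=100)

# Contrast: a cache that stored K already rotated for position 0 and was
# reused unchanged at offset 100 would produce a different logit.
stale = dot(rope(q, 105), rope(span_k_cache[0], 0))
```

In this sketch scores_a and scores_b agree up to floating-point error, while stale does not match scores_b[0]. That mismatch is the reason a cache holding already-rotated K can only be reused at the position where it was computed.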
The paper frames this as a separation-of-responsibilities problem: the inference engine provides position-free storage and hash-chain manipulation, while span layout, padding, and scheduling policies are delegated to the user or orchestration layer. This is a clean architectural decision that contrasts with prior approaches (SPNL's CIDRA algorithm, MEPIC's chunk cache coordinator) that embed substantial new subsystems within the engine.
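The division of labor described above can be sketched with a toy block-hash chain. Everything in this example is an assumption made for illustration: the token ids PAD and SSEP, the block size, the layout helper, and above all the rule that a separator at the start of a block resets the parent hash. The summary only says that the primitives modify hashing behavior; it does not specify the mechanism, and this is not vLLM's hashing code.

```python
import hashlib

BLOCK_SIZE = 4
PAD, SSEP = -1, -2  # hypothetical special-token ids

def pad_to_block(tokens, block_size=BLOCK_SIZE):
    """User-side step: pad a segment so it ends on a block boundary."""
    remainder = len(tokens) % block_size
    return tokens + [PAD] * ((block_size - remainder) % block_size)

def layout(prefix, spans):
    """User-side step: place each span in its own block-aligned region."""
    tokens = pad_to_block(prefix)
    for span in spans:
        tokens += pad_to_block([SSEP] + span)
    return tokens

def block_hashes(tokens, block_size=BLOCK_SIZE, reset_on_ssep=True):
    """Engine-side step: chained hashes over full blocks.
    Assumed rule: a block starting with SSEP forgets its parent hash."""
    hashes, parent = [], b""
    for start in range(0, len(tokens) - block_size + 1, block_size):
        block = tokens[start:start + block_size]
        if reset_on_ssep and block[0] == SSEP:
            parent = b""
        digest = hashlib.sha256(parent + repr(block).encode()).digest()
        hashes.append(digest.hex()[:12])
        parent = digest
    return hashes

doc = [11, 12, 13, 14, 15]   # a recurring span (two blocks once laid out)
other = [31, 32]             # an unrelated span

req_a = layout([1, 2, 3], [doc])
req_b = layout([7, 8, 9, 10, 21], [other, doc])

# With the reset, the blocks of `doc` hash identically in both requests.
reused = block_hashes(req_a)[-2:] == block_hashes(req_b)[-2:]

# With plain prefix chaining, the differing prefixes change every hash.
plain = (block_hashes(req_a, reset_on_ssep=False)[-2:]
         == block_hashes(req_b, reset_on_ssep=False)[-2:])
```

Here reused is True and plain is False: the span's blocks become cache hits regardless of what precedes them, and the only engine-side change in the sketch is the single reset branch, with padding and layout left to the caller.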
The experimental evaluation is focused but somewhat narrow.
Practical impact (high): This work has immediate practical value for RAG and agentic LLM serving. The open-source implementation on IBM's vLLM fork, combined with the minimal 78 LOC core footprint, makes adoption straightforward. The fact that it composes with vLLM's existing CPU offload infrastructure is a practical advantage over external caching approaches.
Systems design impact (moderate-high): The paper makes a compelling architectural argument that PIC should be decomposed into position-free storage (engine responsibility) and reuse policy (user/orchestrator responsibility). This principle could influence how future inference engines are designed, and could be adopted by SGLang, TensorRT-LLM, and other systems.
Research impact (moderate): MiniPIC provides a common substrate on which multiple PIC algorithms (Block-Attention, EPIC, Prompt Cache) can be expressed through the same primitives. This could accelerate PIC research by lowering the implementation barrier. However, the algorithmic contribution is incremental: the idea of position-free KV caches already exists in prior work (MEPIC, DeepSeek-V2).
This paper is highly timely. RAG and agentic workloads are dominant deployment patterns for LLMs, and prefill cost is a major bottleneck as context lengths grow. The gap between academic PIC proposals and production-ready implementations is well-known, and MiniPIC directly addresses this. The growing adoption of vLLM as the de facto open-source inference server makes vLLM-specific contributions particularly impactful.
MiniPIC is a well-executed systems paper that makes a practical and architecturally clean contribution to LLM inference. Its main novelty lies not in any single technique (position-free KV caches, hash-chain manipulation, interleaved scheduling) but in their synthesis into a minimal, composable design that achieves strong performance with remarkably few lines of code. The evaluation, while limited in breadth, demonstrates clear advantages. The work is most impactful as a practical systems contribution rather than a fundamental algorithmic advance.
Generated Jun 12, 2026
Paper 1 introduces a novel conceptual framework by formulating neural model editing as a reinforcement learning problem. While Paper 2 offers a highly practical and timely systems-level optimization for LLM inference, Paper 1's approach has broader scientific implications for AI safety, bias mitigation, and machine unlearning, potentially opening a new subfield of automated model editing that transcends manual algorithm design.
Paper 1 has higher estimated impact: it introduces a highly practical, low-code-change position-independent caching design for vLLM with clear, immediate real-world benefits (large throughput/TTFT gains) and broad relevance to rapidly growing RAG/agentic inference workloads. The approach is novel in its minimalist integration (unrotated K + in-attention RoPE + user primitives) and is likely to be adopted by industry/OSS stacks, amplifying impact. Paper 2 offers interesting geometric analysis and metrics for continual learning, but is primarily explanatory on limited benchmarks and less directly enabling.
MiniPIC addresses a critical practical problem in LLM inference serving—efficient KV cache reuse for retrieval-augmented and agentic workloads—with a minimal, elegant solution (<100 LOC changes to vLLM). It demonstrates significant throughput improvements (49%) and latency reductions (up to 100x for cached spans) on a production-grade system. Its breadth of impact is larger: it unifies multiple PIC methods, integrates with existing infrastructure, and addresses a bottleneck affecting widespread LLM deployment. Paper 2 provides interesting geometric insights for transformer optimization but is narrower in scope (GPT-2 pretraining) and more incremental in its contributions.
Paper 1 addresses a fundamental capability of LLMs (latent reasoning) by bridging on-policy reinforcement learning with mechanistic interpretability. This explores new paradigms in how AI models learn to reason internally, offering high theoretical novelty and broad implications for future reasoning architectures. Paper 2, while offering a highly practical and elegant systems-level optimization for KV caching, represents an engineering contribution with narrower fundamental scientific impact compared to the algorithmic and interpretability advancements of Paper 1.
Paper 1 offers a highly timely and widely applicable solution to a critical bottleneck in LLM deployment (KV caching for RAG and agents). By enabling Position-Independent Caching with minimal code changes in vLLM, it drastically improves throughput and latency for real-world AI workloads. While Paper 2 provides a valuable methodological advance for molecular diffusion in drug discovery, Paper 1's immediate relevance to the massive, fast-growing ecosystem of LLM inference gives it a significantly broader and more immediate potential impact across the AI community.
Paper 1 has higher potential impact due to a novel, broadly applicable systems contribution to LLM inference: position-independent KV reuse with minimal engine changes, clear primitives, and strong reported throughput/latency gains integrated into a major serving stack (vLLM). It is timely given retrieval/agent workloads and could influence both research and production serving across many domains. Paper 2 addresses a valuable Mars-geomorphology application, but appears narrower in scope, with limited innovation beyond applying standard segmentation/GAN augmentation and a negative result (synthetic data not helping), reducing expected cross-field and methodological impact.
Paper 2 addresses a fundamental challenge in AI for Science by enabling the discovery of governing equations from noisy, high-dimensional data. Its theoretical guarantees for identifiability and broad applicability across diverse fields like neuroscience and physics give it a foundational scientific impact. While Paper 1 is highly practical and timely for LLM deployment, it is primarily a systems engineering optimization with a narrower scope of impact.
Paper 1 tackles a critical bottleneck in LLM inference (KV cache reuse in RAG and agentic workflows) with a highly elegant, low-code solution in a dominant framework (vLLM). Given the massive scale of current LLM deployments, its significant performance gains (up to 100x faster time-to-first-token for cached spans) offer immediate, wide-reaching practical and system-level impact. Paper 2 presents a solid, though more incremental, improvement (~5% gain) in spatio-temporal forecasting.
DiffSlack addresses a fundamental and broadly applicable challenge—enforcing nonlinear inequality constraints in neural networks—with a novel differentiable projection layer approach. Its potential impact spans multiple fields (robotics, control, optimization, constrained learning) and tackles a core limitation of neural networks. MiniPIC, while technically sound and practically useful, is more narrowly focused on KV cache optimization for a specific inference engine (vLLM). DiffSlack's methodological contribution (learnable slack variables + Gauss-Newton projection) is more generalizable and addresses a deeper scientific problem with validated real-world experiments.
Paper 2 likely has higher scientific impact: it introduces a highly practical, low-intrusion systems technique (position-independent KV caching in <100 LOC) that can be adopted quickly in widely used inference stacks (e.g., vLLM), directly improving throughput/latency for retrieval-augmented and agentic LLM workloads—an urgent, high-demand area. It demonstrates concrete speedups and integrates with existing features (CPU offload), suggesting strong real-world deployment potential and broad impact across LLM serving, systems, and applications. Paper 1 is novel for curriculum learning but may see narrower uptake and less immediate deployability.