Back to Rankings

Service-Induced Congestion in Memory-Constrained LLM Serving

Ruicheng Ao, Jing Dong, Gan Luo, David Simchi-Levi

Jun 14, 2026arXiv:2606.15555v1
math.OCcs.AIcs.LGstat.ML
Share
#5 of 151 · math.OC
Tournament Score
1576±38
11001650
91%
Win Rate
42
Wins
4
Losses
46
Matches
Rating
8.5/ 10
Significance9
Rigor9.5
Novelty8.5
Clarity8

Abstract

In large language model (LLM) serving, each request accumulates persistent graphics processing unit (GPU) memory during service as its key-value cache grows with every generated token. Under high concurrency, aggregate memory usage therefore increases endogenously over time: the service process itself creates future capacity pressure. When memory capacity is exceeded, systems evict active requests, discarding cached state and restarting them later, which wastes computation and reduces throughput. We develop a discrete-time dynamical model of memory-constrained LLM inference that captures admission, memory growth, and eviction under continuous batching. In the saturated-input regime, the system admits both eviction-free fixed points and limit cycles with evictions. For homogeneous workloads, we show that the eviction-free equilibrium is unstable and that, except for a Lebesgue-measure-zero exact-capture set, the system converges to a unique worst-case limit cycle that is asymptotically stable outside this exceptional set, with throughput losses as large as 50%. For heterogeneous workloads, we prove a stability criterion in the two-class common-input setting and explain how the survival-polynomial mechanism generalizes to multiple classes and heterogeneous-input lengths. Under an input-dominated scaling regime, coprime decoding lengths stabilize the eviction-free equilibrium, while non-coprime lengths create synchronized modes that drive instability. These results characterize when workload heterogeneity desynchronizes completions and helps stabilize memory-constrained serving. More broadly, we identify service-induced congestion as a structural instability mechanism and derive scheduling design principles for sustaining high throughput.

AI Impact Assessments

(1 models)

Scientific Impact Assessment

Core Contribution

This paper identifies and formalizes "service-induced congestion" — a novel congestion mechanism in LLM serving where the service process itself generates future capacity pressure through growing KV-cache memory. Unlike classical queueing models where jobs consume fixed resources, LLM requests accumulate GPU memory progressively during decoding. The authors develop a discrete-time dynamical systems model capturing the interaction among admission, memory growth, and eviction under continuous batching.

The central theoretical results are: (1) For homogeneous workloads, the eviction-free equilibrium is unstable under greedy admission, and the system converges (outside a measure-zero set) to a worst-case limit cycle with up to 50% throughput loss (Theorem 2). (2) For heterogeneous workloads, stability is governed by number-theoretic structure — specifically gcd(l₁,₁, ..., l₁,K) — where coprime decoding lengths stabilize the system through completion desynchronization (Theorem 3).

Methodological Rigor

The mathematical analysis is exceptionally thorough. The proof of the homogeneous instability result (Theorem 2) combines multiple novel techniques: spread monotonicity arguments, relative-position representations on circular pipelines, block-end sampling, normalized mass analysis, and support-loss cascades. The authors carefully handle the measure-zero exceptional set where trajectories could be captured by intermediate balanced orbits.

The heterogeneous analysis (Theorem 3) elegantly connects LLM serving stability to number theory through characteristic polynomials of linear recurrences. The proof that gcd > 1 implies instability uses the Implicit Function Theorem to show unit-circle roots of the limiting polynomial drift outside the unit disk for finite input lengths. The global convergence proof for the coprime case constructs an explicit Lyapunov function and bounds eviction perturbations as O(1/l₀).

The finite-input stability threshold (Proposition 4) showing l₀,min ~ l₁,₂³ provides practical quantitative guidance. The 101-page appendix contains complete proofs with careful handling of edge cases, branch analysis, and spectral perturbation theory.

Potential Impact

Theoretical impact: This work opens a new direction in queueing theory — systems where resource consumption grows endogenously during service. The dynamical systems approach, combining spread amplification, spectral analysis, and number-theoretic stability conditions, provides tools applicable beyond LLM serving to any stateful service with growing resource footprints.

Practical impact: The two proposed policies — rate-limited admission and request mixing — are immediately actionable. Rate-limited admission using the analytically derived eviction-free rate x* as a per-iteration cap eliminated evictions entirely in Vidur simulations (Table 3: 2,232 → 0 evictions, 28.3% throughput improvement). Request mixing reduced evictions by 94-98% in both Vidur and real-GPU experiments (Tables 1-2). The design principle that decoding-length coprimality aids stability provides a concrete diagnostic for system operators.

Systems implications: The insight that {128, 256, 512} output tiers (gcd=128) could be adjusted to {127, 255, 509} (gcd=1) to improve stability is a low-overhead engineering lever. The observation that eviction serves as an early warning signal for structural instability transitions is operationally valuable.

Timeliness & Relevance

This work addresses a critical bottleneck in current LLM infrastructure. As context windows grow (100K+ tokens), agentic workflows accumulate state, and GPU memory remains constrained, KV-cache management becomes increasingly important. The paper arrives when production systems (vLLM, SGLang, TensorRT-LLM) are actively engineering around these exact eviction and memory-pressure issues. The theoretical framework provides principled alternatives to ad hoc heuristics.

Strengths

1. Novel conceptual contribution: Service-induced congestion is a genuinely new phenomenon that classical queueing theory does not capture. The formalization is clean and the terminology ("eviction cascade," "survival polynomial") is well-chosen.

2. Mathematical depth with practical relevance: The paper achieves an unusual combination — deep theoretical results (spectral analysis, Lyapunov stability, number-theoretic conditions) that translate directly into actionable policies validated on real hardware.

3. Complete characterization: The homogeneous case is essentially fully characterized — fixed points, all limit cycles, their stability, throughput bounds, and the cascade mechanism. The heterogeneous case provides a sharp dichotomy.

4. Multi-level validation: Results are validated across model-based simulation, Vidur (high-fidelity simulator with <5% error), and real A800 GPU experiments with SGLang, providing strong evidence that the theoretical mechanisms survive practical implementation details.

5. Rich supporting analysis: The throughput envelope (Propositions 6-8), finite-input threshold scaling, and pulse-cycle construction provide additional structural insights.

Limitations

1. Deterministic decoding lengths: Real LLM outputs have highly variable lengths. The authors acknowledge this but the extension to stochastic lengths — which could either help (desynchronization) or hurt (tail risk) — remains open.

2. Large-input scaling regime: The heterogeneous stability results require l₀ sufficiently large. The cubic threshold l₀,min ~ l₁,₂³ means that for long decoding sequences with moderate inputs, the coprime benefit may not manifest.

3. Simplified memory model: The model assumes memory grows by exactly one token per step and abstracts away prefill costs, memory fragmentation, and prefix sharing. PagedAttention's non-contiguous allocation may alter synchronization patterns.

4. Single-GPU scope: Modern deployments use tensor parallelism, pipeline parallelism, and disaggregated serving, which redistribute memory pressure across devices.

5. Proportional admission assumption: The multi-class analysis requires proportional class admission, which may not match practical schedulers that prioritize by deadline or priority class.

Despite these limitations, this paper makes a foundational contribution by identifying and rigorously analyzing a structural instability mechanism that is both theoretically novel and practically consequential for a rapidly growing infrastructure domain.

Rating:8.5/ 10
Significance 9Rigor 9.5Novelty 8.5Clarity 8

Generated Jun 16, 2026

Comparison History (46)

Wonvs. Contextual Robust Optimization for AI Data Center Scheduling with Statistical Guarantees

Paper 1 identifies a fundamental structural instability mechanism ('service-induced congestion') in LLM serving systems, providing rigorous dynamical systems analysis with novel theoretical contributions (limit cycles, stability criteria, coprimality conditions). This characterizes a previously unrecognized phenomenon affecting all memory-constrained LLM serving systems, with broad implications for scheduling design. Paper 2, while practically useful, applies existing robust optimization techniques to data center scheduling with incremental improvements (~5.57% cost reduction). Paper 1's theoretical depth, novelty of the identified phenomenon, and potential to reshape LLM system design give it higher impact potential.

claude-opus-4-6·Jun 17, 2026
Wonvs. Koopman Lifting with Certified Error Bounds for Joint Inference in Nonlinear Networks

Paper 2 addresses a critical, highly timely bottleneck in modern AI: GPU memory management in LLM serving. By applying dynamical systems theory to model KV-cache growth as a source of structural instability, it offers a highly novel perspective on a massive real-world problem. While Paper 1 provides rigorous theoretical advances for nonlinear network inference, Paper 2's potential to directly influence the design of LLM serving infrastructure—saving significant compute and boosting throughput in a booming, resource-constrained industry—gives it broader and more immediate scientific and practical impact.

gemini-3.1-pro-preview·Jun 17, 2026
Wonvs. Accelerated Implicit GDA Schemes: Theoretical Guarantees and Application to Proximal Augmented Lagrangian Methods

Paper 2 addresses a critical and highly timely problem in AI infrastructure: memory bottlenecks and KV cache growth in LLM serving. By formally modeling 'service-induced congestion' and identifying it as a structural instability, it offers theoretical insights with immediate, high-impact practical applications for optimizing large-scale LLM deployment. While Paper 1 provides solid theoretical advancements in optimization, Paper 2's focus on a major, real-world bottleneck in a rapidly growing field gives it greater potential for immediate scientific and industrial impact.

gemini-3.1-pro-preview·Jun 16, 2026
Wonvs. Bidirectional SDDP with dimension-free complexity for solving strongly convex stochastic dynamic programming equations

Paper 2 targets an urgent, high-impact real-world problem in LLM deployment: GPU memory dynamics and throughput collapse under concurrency. It introduces a dynamical-systems model with rigorous stability/limit-cycle results and actionable scheduling principles, likely influencing systems design and operations across ML serving, queueing/control, and applied theory. Its breadth and timeliness in the rapidly growing LLM serving ecosystem suggest strong near-term uptake. Paper 1 is technically novel (dimension-free complexity for BSDDP under strong convexity) but is more specialized to stochastic optimization theory with narrower immediate industrial impact.

gpt-5.2·Jun 16, 2026
Wonvs. Branch and Price for Railway Crew Scheduling: Benchmark Instances and Computational Study

Paper 1 introduces a novel theoretical framework identifying 'service-induced congestion' as a fundamental instability mechanism in LLM serving systems—a timely and rapidly growing field. It provides rigorous dynamical systems analysis with practical scheduling design principles, potentially impacting the massive LLM deployment ecosystem. Paper 2, while methodologically solid, addresses a well-studied optimization problem (railway crew scheduling) with incremental contributions (benchmark instances, computational study of known techniques). Paper 1's novelty, timeliness given the LLM boom, and broader cross-disciplinary relevance (systems, ML, queueing theory) give it higher impact potential.

claude-opus-4-6·Jun 16, 2026
Wonvs. Mean field games with option to buy information

Paper 2 addresses a critical and highly timely bottleneck in Large Language Model (LLM) serving: GPU memory constraints and throughput optimization. Its findings directly translate to significant real-world applications in AI infrastructure, offering actionable scheduling principles to prevent severe throughput losses. Paper 1 offers a solid theoretical advancement in mean field games, but Paper 2 has a much broader immediate relevance, higher industry applicability, and targets a rapidly expanding field with massive compute implications.

gemini-3.1-pro-preview·Jun 16, 2026
Wonvs. Submodular Optimization with Applications to Decision and Control

Paper 2 identifies a novel structural phenomenon—service-induced congestion—in LLM serving, providing rigorous dynamical systems analysis of a highly timely problem. Its results (instability of eviction-free equilibria, 50% throughput losses, scheduling design principles) have immediate practical impact for the rapidly growing LLM deployment ecosystem. Paper 1 is a comprehensive and well-structured survey of submodular optimization in control, but as a survey it synthesizes existing knowledge rather than introducing fundamentally new results. Paper 2's novelty, timeliness given the LLM explosion, and actionable design principles give it higher potential impact.

claude-opus-4-6·Jun 16, 2026
Wonvs. Complete Trajectory Tracking for Polynomial Differential Variational Inequalities: A Moment-SOS Based Structure-Aware Approach

Paper 2 targets an urgent, high-impact real-world problem in LLM infrastructure: GPU memory–driven instability during serving. It proposes a clear dynamical model with provable phase behavior (fixed points vs. limit cycles), stability results, and actionable scheduling principles, with implications for systems design, queueing/control theory, and ML deployment. The topic is highly timely and broadly relevant across AI systems and cloud operations. Paper 1 is mathematically novel for PDVIs and offers a sophisticated SOS-based algorithm, but its applicability is narrower and impact is likely concentrated within optimization/control and nonsmooth dynamics communities.

gpt-5.2·Jun 16, 2026
Wonvs. Sharpness characterizes Hill functions

Paper 1 targets an urgent, fast-evolving bottleneck in LLM deployment: GPU-memory instability under high concurrency. It offers a dynamical-systems model, provable stability/limit-cycle results, and concrete scheduling/design principles with potentially large throughput gains, making it timely with clear real-world applicability across ML systems, networking/queueing, and performance engineering. Paper 2 provides an elegant, rigorous characterization theorem in mathematical biology/control, but its immediate practical leverage and breadth beyond theory are narrower. Overall, Paper 1 is more likely to drive near-term adoption and follow-on work.

gpt-5.2·Jun 16, 2026
Wonvs. Benchmarking Deep Time Series Models for Equity Portfolios

Paper 2 has higher likely impact: it introduces a novel, rigorous dynamical-systems theory for a timely, widely relevant bottleneck in LLM inference (KV-cache memory under continuous batching), with provable stability/limit-cycle results and actionable scheduling principles. Its applications span most large-scale AI deployment stacks and can influence system design across industry and research. Paper 1 is solid and methodologically careful but is primarily a finance benchmarking framework with narrower cross-field reach and limited standalone practical payoff (net Sharpe negative under realistic costs), making its broader scientific impact less likely.

gpt-5.2·Jun 16, 2026