Ruicheng Ao, Jing Dong, Gan Luo, David Simchi-Levi
In large language model (LLM) serving, each request accumulates persistent graphics processing unit (GPU) memory during service as its key-value cache grows with every generated token. Under high concurrency, aggregate memory usage therefore increases endogenously over time: the service process itself creates future capacity pressure. When memory capacity is exceeded, systems evict active requests, discarding cached state and restarting them later, which wastes computation and reduces throughput. We develop a discrete-time dynamical model of memory-constrained LLM inference that captures admission, memory growth, and eviction under continuous batching. In the saturated-input regime, the system admits both eviction-free fixed points and limit cycles with evictions. For homogeneous workloads, we show that the eviction-free equilibrium is unstable and that, except for a Lebesgue-measure-zero exact-capture set, the system converges to a unique worst-case limit cycle that is asymptotically stable outside this exceptional set, with throughput losses as large as 50%. For heterogeneous workloads, we prove a stability criterion in the two-class common-input setting and explain how the survival-polynomial mechanism generalizes to multiple classes and heterogeneous-input lengths. Under an input-dominated scaling regime, coprime decoding lengths stabilize the eviction-free equilibrium, while non-coprime lengths create synchronized modes that drive instability. These results characterize when workload heterogeneity desynchronizes completions and helps stabilize memory-constrained serving. More broadly, we identify service-induced congestion as a structural instability mechanism and derive scheduling design principles for sustaining high throughput.
This paper identifies and formalizes "service-induced congestion" — a novel congestion mechanism in LLM serving where the service process itself generates future capacity pressure through growing KV-cache memory. Unlike classical queueing models where jobs consume fixed resources, LLM requests accumulate GPU memory progressively during decoding. The authors develop a discrete-time dynamical systems model capturing the interaction among admission, memory growth, and eviction under continuous batching.
The central theoretical results are: (1) For homogeneous workloads, the eviction-free equilibrium is unstable under greedy admission, and the system converges (outside a measure-zero set) to a worst-case limit cycle with up to 50% throughput loss (Theorem 2). (2) For heterogeneous workloads, stability is governed by number-theoretic structure — specifically gcd(l₁,₁, ..., l₁,K) — where coprime decoding lengths stabilize the system through completion desynchronization (Theorem 3).
The mathematical analysis is exceptionally thorough. The proof of the homogeneous instability result (Theorem 2) combines multiple novel techniques: spread monotonicity arguments, relative-position representations on circular pipelines, block-end sampling, normalized mass analysis, and support-loss cascades. The authors carefully handle the measure-zero exceptional set where trajectories could be captured by intermediate balanced orbits.
The heterogeneous analysis (Theorem 3) elegantly connects LLM serving stability to number theory through characteristic polynomials of linear recurrences. The proof that gcd > 1 implies instability uses the Implicit Function Theorem to show unit-circle roots of the limiting polynomial drift outside the unit disk for finite input lengths. The global convergence proof for the coprime case constructs an explicit Lyapunov function and bounds eviction perturbations as O(1/l₀).
The finite-input stability threshold (Proposition 4) showing l₀,min ~ l₁,₂³ provides practical quantitative guidance. The 101-page appendix contains complete proofs with careful handling of edge cases, branch analysis, and spectral perturbation theory.
Theoretical impact: This work opens a new direction in queueing theory — systems where resource consumption grows endogenously during service. The dynamical systems approach, combining spread amplification, spectral analysis, and number-theoretic stability conditions, provides tools applicable beyond LLM serving to any stateful service with growing resource footprints.
Practical impact: The two proposed policies — rate-limited admission and request mixing — are immediately actionable. Rate-limited admission using the analytically derived eviction-free rate x* as a per-iteration cap eliminated evictions entirely in Vidur simulations (Table 3: 2,232 → 0 evictions, 28.3% throughput improvement). Request mixing reduced evictions by 94-98% in both Vidur and real-GPU experiments (Tables 1-2). The design principle that decoding-length coprimality aids stability provides a concrete diagnostic for system operators.
Systems implications: The insight that {128, 256, 512} output tiers (gcd=128) could be adjusted to {127, 255, 509} (gcd=1) to improve stability is a low-overhead engineering lever. The observation that eviction serves as an early warning signal for structural instability transitions is operationally valuable.
This work addresses a critical bottleneck in current LLM infrastructure. As context windows grow (100K+ tokens), agentic workflows accumulate state, and GPU memory remains constrained, KV-cache management becomes increasingly important. The paper arrives when production systems (vLLM, SGLang, TensorRT-LLM) are actively engineering around these exact eviction and memory-pressure issues. The theoretical framework provides principled alternatives to ad hoc heuristics.
1. Novel conceptual contribution: Service-induced congestion is a genuinely new phenomenon that classical queueing theory does not capture. The formalization is clean and the terminology ("eviction cascade," "survival polynomial") is well-chosen.
2. Mathematical depth with practical relevance: The paper achieves an unusual combination — deep theoretical results (spectral analysis, Lyapunov stability, number-theoretic conditions) that translate directly into actionable policies validated on real hardware.
3. Complete characterization: The homogeneous case is essentially fully characterized — fixed points, all limit cycles, their stability, throughput bounds, and the cascade mechanism. The heterogeneous case provides a sharp dichotomy.
4. Multi-level validation: Results are validated across model-based simulation, Vidur (high-fidelity simulator with <5% error), and real A800 GPU experiments with SGLang, providing strong evidence that the theoretical mechanisms survive practical implementation details.
5. Rich supporting analysis: The throughput envelope (Propositions 6-8), finite-input threshold scaling, and pulse-cycle construction provide additional structural insights.
1. Deterministic decoding lengths: Real LLM outputs have highly variable lengths. The authors acknowledge this but the extension to stochastic lengths — which could either help (desynchronization) or hurt (tail risk) — remains open.
2. Large-input scaling regime: The heterogeneous stability results require l₀ sufficiently large. The cubic threshold l₀,min ~ l₁,₂³ means that for long decoding sequences with moderate inputs, the coprime benefit may not manifest.
3. Simplified memory model: The model assumes memory grows by exactly one token per step and abstracts away prefill costs, memory fragmentation, and prefix sharing. PagedAttention's non-contiguous allocation may alter synchronization patterns.
4. Single-GPU scope: Modern deployments use tensor parallelism, pipeline parallelism, and disaggregated serving, which redistribute memory pressure across devices.
5. Proportional admission assumption: The multi-class analysis requires proportional class admission, which may not match practical schedulers that prioritize by deadline or priority class.
Despite these limitations, this paper makes a foundational contribution by identifying and rigorously analyzing a structural instability mechanism that is both theoretically novel and practically consequential for a rapidly growing infrastructure domain.
Generated Jun 16, 2026
Paper 1 identifies a fundamental structural instability mechanism ('service-induced congestion') in LLM serving systems, providing rigorous dynamical systems analysis with novel theoretical contributions (limit cycles, stability criteria, coprimality conditions). This characterizes a previously unrecognized phenomenon affecting all memory-constrained LLM serving systems, with broad implications for scheduling design. Paper 2, while practically useful, applies existing robust optimization techniques to data center scheduling with incremental improvements (~5.57% cost reduction). Paper 1's theoretical depth, novelty of the identified phenomenon, and potential to reshape LLM system design give it higher impact potential.
Paper 2 addresses a critical, highly timely bottleneck in modern AI: GPU memory management in LLM serving. By applying dynamical systems theory to model KV-cache growth as a source of structural instability, it offers a highly novel perspective on a massive real-world problem. While Paper 1 provides rigorous theoretical advances for nonlinear network inference, Paper 2's potential to directly influence the design of LLM serving infrastructure—saving significant compute and boosting throughput in a booming, resource-constrained industry—gives it broader and more immediate scientific and practical impact.
Paper 2 addresses a critical and highly timely problem in AI infrastructure: memory bottlenecks and KV cache growth in LLM serving. By formally modeling 'service-induced congestion' and identifying it as a structural instability, it offers theoretical insights with immediate, high-impact practical applications for optimizing large-scale LLM deployment. While Paper 1 provides solid theoretical advancements in optimization, Paper 2's focus on a major, real-world bottleneck in a rapidly growing field gives it greater potential for immediate scientific and industrial impact.
Paper 2 targets an urgent, high-impact real-world problem in LLM deployment: GPU memory dynamics and throughput collapse under concurrency. It introduces a dynamical-systems model with rigorous stability/limit-cycle results and actionable scheduling principles, likely influencing systems design and operations across ML serving, queueing/control, and applied theory. Its breadth and timeliness in the rapidly growing LLM serving ecosystem suggest strong near-term uptake. Paper 1 is technically novel (dimension-free complexity for BSDDP under strong convexity) but is more specialized to stochastic optimization theory with narrower immediate industrial impact.
Paper 1 introduces a novel theoretical framework identifying 'service-induced congestion' as a fundamental instability mechanism in LLM serving systems—a timely and rapidly growing field. It provides rigorous dynamical systems analysis with practical scheduling design principles, potentially impacting the massive LLM deployment ecosystem. Paper 2, while methodologically solid, addresses a well-studied optimization problem (railway crew scheduling) with incremental contributions (benchmark instances, computational study of known techniques). Paper 1's novelty, timeliness given the LLM boom, and broader cross-disciplinary relevance (systems, ML, queueing theory) give it higher impact potential.
Paper 2 addresses a critical and highly timely bottleneck in Large Language Model (LLM) serving: GPU memory constraints and throughput optimization. Its findings directly translate to significant real-world applications in AI infrastructure, offering actionable scheduling principles to prevent severe throughput losses. Paper 1 offers a solid theoretical advancement in mean field games, but Paper 2 has a much broader immediate relevance, higher industry applicability, and targets a rapidly expanding field with massive compute implications.
Paper 2 identifies a novel structural phenomenon—service-induced congestion—in LLM serving, providing rigorous dynamical systems analysis of a highly timely problem. Its results (instability of eviction-free equilibria, 50% throughput losses, scheduling design principles) have immediate practical impact for the rapidly growing LLM deployment ecosystem. Paper 1 is a comprehensive and well-structured survey of submodular optimization in control, but as a survey it synthesizes existing knowledge rather than introducing fundamentally new results. Paper 2's novelty, timeliness given the LLM explosion, and actionable design principles give it higher potential impact.
Paper 2 targets an urgent, high-impact real-world problem in LLM infrastructure: GPU memory–driven instability during serving. It proposes a clear dynamical model with provable phase behavior (fixed points vs. limit cycles), stability results, and actionable scheduling principles, with implications for systems design, queueing/control theory, and ML deployment. The topic is highly timely and broadly relevant across AI systems and cloud operations. Paper 1 is mathematically novel for PDVIs and offers a sophisticated SOS-based algorithm, but its applicability is narrower and impact is likely concentrated within optimization/control and nonsmooth dynamics communities.
Paper 1 targets an urgent, fast-evolving bottleneck in LLM deployment: GPU-memory instability under high concurrency. It offers a dynamical-systems model, provable stability/limit-cycle results, and concrete scheduling/design principles with potentially large throughput gains, making it timely with clear real-world applicability across ML systems, networking/queueing, and performance engineering. Paper 2 provides an elegant, rigorous characterization theorem in mathematical biology/control, but its immediate practical leverage and breadth beyond theory are narrower. Overall, Paper 1 is more likely to drive near-term adoption and follow-on work.
Paper 2 has higher likely impact: it introduces a novel, rigorous dynamical-systems theory for a timely, widely relevant bottleneck in LLM inference (KV-cache memory under continuous batching), with provable stability/limit-cycle results and actionable scheduling principles. Its applications span most large-scale AI deployment stacks and can influence system design across industry and research. Paper 1 is solid and methodologically careful but is primarily a finance benchmarking framework with narrower cross-field reach and limited standalone practical payoff (net Sharpe negative under realistic costs), making its broader scientific impact less likely.