Algorithm for Contextual Queueing Bandits with Rate-Optimal Queue Length Regret

Seoungbin Bae, Dabeen Lee

Jun 8, 2026arXiv:2606.09668v1

cs.LG

#4479of 5669·cs.LG

#4479 of 5669 · cs.LG

Tournament Score

1322±44

10501750

41%

Win Rate

Wins

Losses

Matches

Rating

7/ 10

Significance7

Rigor8.5

Novelty6.5

Clarity8

Abstract

Contextual queueing bandits provide a framework for learning to schedule heterogeneous jobs under unknown context-dependent service rates. Under stochastic contexts, existing algorithms achieve $\widetilde{\mathcal{O}}(T^{-1/4})$ queue length regret, defined as the expected difference between the learner's and oracle's queue lengths at horizon $T$ . In this paper, we improve this rate to $\widetilde{\mathcal{O}}(T^{-1/2})$ . The key observation is that random exploration is needed only up to a carefully chosen cutoff round, rather than throughout the entire horizon. We propose CQB- $η$ -2, a three-phase algorithm: (i) pure random exploration to construct an initial estimator, (ii) $η$ -random exploration combined with a UCB rule to continue learning while maintaining negative drift, and (iii) pure UCB after the exploration cutoff. Our proof decomposes the queue length regret at the cutoff round. Before the cutoff, negative drift suppresses queue length differences caused by suboptimal choices. After the cutoff, the first two phases provide sufficient random exploration samples, ensuring that UCB decisions incur small departure-rate gaps. Combining these two bounds yields queue length regret of order $\widetilde{\mathcal{O}}(T^{-1/2})$ . We further prove a minimax lower bound of order $Ω(T^{-1/2})$ . The proof constructs two hard instances that are statistically indistinguishable up to the final service decision, and uses a queue-specific coupling argument to convert the resulting testing error into queue length regret. Together, our upper and lower bounds characterize the minimax dependence on the horizon $T$ up to logarithmic factors.

AI Impact Assessments

(1 models)

Scientific Impact Assessment

Core Contribution

This paper addresses contextual queueing bandits—a framework combining queueing theory with contextual bandits for learning to schedule heterogeneous jobs under unknown context-dependent service rates. The main contribution is twofold:

1. Improved upper bound: The authors propose CQB-η-2, a three-phase algorithm achieving Õ(T^{-1/2}) queue length regret, improving upon the prior Õ(T^{-1/4}) rate of Bae et al. (2026a,b).

2. Matching lower bound: They prove a minimax lower bound of Ω(T^{-1/2}), establishing that the T-dependence is tight up to logarithmic factors.

The key algorithmic insight is elegant: random exploration need not persist throughout the entire horizon. Instead, sufficient exploration can be front-loaded, after which a pure UCB rule suffices. This three-phase structure—(i) pure exploration, (ii) mixed exploration+UCB, (iii) pure UCB—is natural but the authors show it yields a qualitative improvement in the rate.

Methodological Rigor

The technical approach is sound and carefully executed. The proof strategy exploits the policy-switching decomposition (from prior work) but applies it in a novel way by splitting the regret at the exploration cutoff point τ₂:

Before cutoff: The departure-rate gap is bounded naively (≤1), but negative drift ensures that queue-length differences created in early rounds decay exponentially by horizon T. This is captured in Lemma 9, which provides an exponentially decaying bound on the "survival probability" of queue-length discrepancies.

After cutoff: The effect of queue-length differences is bounded trivially (≤1), but the Θ(T) random exploration samples collected in Phases 1-2 ensure departure-rate gaps of order T^{-1/2} (Lemma 8).

This decomposition is the heart of the improvement and represents a genuine analytical advance over prior work, which used monotone bounds requiring exploration throughout.

The lower bound construction is also rigorous. Two hard instances (ν⁺_T, ν⁻_T) are constructed with complementary optimal server sets, differing in a single coordinate of the unknown parameter. The gap pT - qT = (1-λ-ε)/(4√T) is calibrated so that the instances are statistically indistinguishable up to round T-1, while a wrong server choice in round T-1 creates a queue-length difference. The coupling argument (Lemma 14) showing the oracle queue is pathwise dominated is clean, and the application of the Bretagnolle-Huber inequality is standard but appropriately adapted to the queueing setting.

Potential Impact

Within queueing bandits: This paper settles the minimax rate (in T) for contextual queueing bandits under stochastic contexts—a clean theoretical resolution. The principle that exploration can be truncated while maintaining optimal performance could influence algorithm design in related queueing-bandit variants.

Broader applicability: The framework is motivated by real systems including cloud computing, call centers, and multi-LLM routing/scheduling. As LLM serving becomes increasingly important (with heterogeneous query types and multiple model backends), the contextual queueing bandit model gains practical relevance. However, the paper's assumptions (single queue, logistic service model, i.i.d. contexts, known traffic slackness ε) may limit direct applicability.

Methodological influence: The idea of splitting regret analysis at an exploration cutoff and using drift-based arguments to control pre-cutoff effects could transfer to other online learning problems with state-dependent dynamics. The lower bound technique—using queue-specific coupling to convert testing error to queue-length regret—is a useful template.

Timeliness & Relevance

The paper is timely given the growing interest in (a) learning-augmented queueing systems, (b) LLM scheduling/routing, and (c) contextual bandits with system-level performance metrics beyond cumulative reward. The contextual queueing bandit model sits at the intersection of these areas. Closing the gap between T^{-1/4} and T^{-1/2} is a natural and important theoretical question.

Strengths

1. Complete characterization: Both matching upper and lower bounds in T, resolving the open question about the optimal rate.

2. Clean algorithmic principle: The three-phase structure with exploration cutoff is intuitive and well-motivated.

3. Insightful proof technique: The split of the policy-switching decomposition at the cutoff point, with different bounding strategies before and after, is the key innovation.

4. Well-written: The paper clearly explains the intuition behind the improvement and how it differs from prior work.

5. Negative drift exploitation: The analysis carefully leverages the queueing structure (negative drift under the oracle) to bound the transient effects of early suboptimal decisions.

Limitations

1. Dependence on problem parameters: The upper bound has complex dependence on d, κ, σ₀, ε, λ. The lower bound only matches in T; the authors acknowledge that matching d and κ dependence remains open.

2. Known parameters: The algorithm requires knowledge of ε (traffic slackness), κ (logistic smoothness), σ₀ (feature diversity), and λ (arrival rate). In practice, these may be unknown.

3. Phase transition calibration: The theoretical choices of τ₁ and τ₂ involve large constants and problem-dependent parameters; the experiments use heuristic values (τ₁ = 0.2T, τ₂ = 0.3T).

4. Limited experimental evaluation: Only synthetic experiments with T=2000 and a small number of runs (10). No real-world validation or comparison at scale.

5. Single queue assumption: The model considers a single queue with K servers; multi-queue extensions would be more practically relevant.

6. Incremental over prior work: While the rate improvement is significant, the algorithmic and analytical framework builds heavily on Bae et al. (2026a,b), with the main innovation being the exploration cutoff idea.

Overall Assessment

This is a solid theoretical contribution that cleanly resolves the minimax rate for contextual queueing bandits. The matching upper and lower bounds represent a complete answer to a natural question, and the proof techniques—particularly the regret decomposition at the exploration cutoff—are insightful. The practical impact is moderate given the restrictive assumptions, but the theoretical contribution is clear and well-executed.

Rating:7/ 10

Significance 7Rigor 8.5Novelty 6.5Clarity 8

Generated Jun 9, 2026

Comparison History (17)

Wonvs. Population-Aware Physics-Informed Neural Particle Flow for Bayesian Update

Paper 2 achieves a tight minimax-optimal result (matching upper and lower bounds) for contextual queueing bandits, improving the convergence rate from T^{-1/4} to T^{-1/2}. This is a clean, fundamental theoretical contribution that resolves the optimal rate for a well-defined problem, with broad implications for online learning and queueing theory. Paper 1 offers an incremental improvement to physics-informed neural particle flow by adding population-aware encodings—a useful but relatively niche contribution to Bayesian filtering. Paper 2's methodological rigor, novelty in proving matching bounds, and broader theoretical impact give it the edge.

claude-opus-4-6·Jun 10, 2026

Wonvs. BrainSurgery: Reproducible and Reliable Declarative Weight Manipulations for Model Editing and Upcycling

Paper 2 presents a fundamental theoretical contribution to contextual queueing bandits, achieving rate-optimal queue length regret (improving from T^{-1/4} to T^{-1/2}) with matching minimax lower bounds. This closes a gap in the literature with rigorous mathematical analysis and novel algorithmic design. Paper 1, while practically useful, is primarily an engineering tool for checkpoint manipulation—a systems contribution with narrower intellectual impact. Paper 2's theoretical advances in online learning and queueing theory have broader implications across operations research, machine learning theory, and scheduling applications.

claude-opus-4-6·Jun 9, 2026

Lostvs. Safe-RULE: Safe Reinforcement UnLEarning

Paper 2 introduces a highly timely and novel paradigm (safe reinforcement unlearning) addressing critical security vulnerabilities in offline safe RL. Its focus on safety and defense against data poisoning provides broad real-world applicability in safety-critical systems like robotics, offering higher potential impact across fields than Paper 1's rigorous but narrower theoretical improvement in queueing bandits.

gemini-3.1-pro-preview·Jun 9, 2026

Wonvs. An Agency-Transferring Model-Free Policy Enhancement Technique

Paper 1 likely has higher scientific impact due to a clear, rate-optimal theoretical advance: improving queue-length regret from ~T^{-1/4} to ~T^{-1/2} and matching it with a minimax lower bound, tightly characterizing the problem’s statistical limits. The three-phase algorithm and queue-specific coupling lower-bound technique are novel and methodologically rigorous, and results are broadly relevant to bandits, online learning, and stochastic networks/scheduling. Paper 2 targets an important practical RL setting and shows promising empirical performance, but the idea resembles existing safe/baseline-guided or residual RL paradigms and its theory depends on stronger assumptions, making its incremental novelty and rigor less clear.

gpt-5.2·Jun 9, 2026

Lostvs. PRISM: Topology-Aware Cross-Modal Imputation for Modality-Deficient Federated Graph Learning

PRISM addresses a novel, practical problem (client-level modality deficiency in federated graph learning) with a comprehensive framework validated across six datasets. It combines multimodal learning, federated learning, and graph neural networks—three highly active research areas—giving it broad cross-field impact and strong real-world applicability (e.g., e-commerce, recommendation systems). Paper 2 provides a tight theoretical contribution (matching upper/lower bounds for contextual queueing bandits), which is elegant but narrow in scope, primarily advancing a specific theoretical frontier in online learning/queueing theory with limited immediate practical applications.

claude-opus-4-6·Jun 9, 2026

Lostvs. On the Convergence of Multicalibration Gradient Boosting

Paper 2 likely has higher impact: it provides theoretical convergence guarantees for multicalibration gradient boosting, a practically deployed, web-scale method at the intersection of ML, fairness, and optimization. The results (sublinear and linear convergence under assumptions, plus adaptive variants) are broadly relevant to algorithm design and trustworthy ML, with immediate real-world applicability. Paper 1 is methodologically strong and rate-optimal, but is more specialized to contextual queueing bandits and may affect a narrower community despite its sharp minimax characterization.

gpt-5.2·Jun 9, 2026

Lostvs. SUSD: Structured Unsupervised Skill Discovery through State Factorization

Paper 2 (SUSD) introduces a novel framework for unsupervised skill discovery that addresses fundamental limitations of existing MI-based and distance-maximizing methods through state factorization. It has broader impact potential across reinforcement learning, robotics, and multi-entity control, with practical applications in hierarchical RL and compositional task solving. While Paper 1 achieves a tight minimax-optimal rate for contextual queueing bandits (a significant theoretical contribution), it addresses a narrower problem. Paper 2's approach is more widely applicable, timely given interest in foundation models for RL, and offers both conceptual novelty and practical utility with released code.

claude-opus-4-6·Jun 9, 2026

Lostvs. Symbolic Regression for Shared Expressions: Introducing Partial Parameter Sharing

Paper 1 offers a highly applicable advancement in symbolic regression, directly enhancing AI-driven scientific discovery. By allowing partial parameter sharing, it enables researchers to isolate universal laws from category-specific effects, which has broad implications across physics, biology, and chemistry. While Paper 2 provides a rigorous and optimal theoretical bound for queueing bandits, Paper 1's potential for cross-disciplinary impact and real-world empirical scientific modeling gives it a wider and potentially more transformative scientific footprint.

gemini-3.1-pro-preview·Jun 9, 2026

Wonvs. Loss-Guided Adaptive Scale Refinement for Molecular Force Prediction

Paper 2 presents a fundamental theoretical breakthrough in reinforcement learning and queueing theory by closing the gap in queue length regret bounds. By providing an algorithm that achieves an improved rate of O(T^{-1/2}) and proving a matching minimax lower bound, it offers a complete and rigorous theoretical characterization. This mathematical finality and methodological rigor typically result in a broader and more lasting foundational impact compared to Paper 1, which focuses on an empirical architecture improvement evaluated on a minimal testbed in molecular machine learning.

gemini-3.1-pro-preview·Jun 9, 2026

Lostvs. Escaping the KL Agreement Trap in On-Policy Distillation

Paper 1 addresses a practical and timely problem in LLM distillation, identifying a novel failure mode ('KL agreement trap') and proposing a simple, effective solution (KAT) with strong empirical results across multiple benchmarks. Given the massive current interest in LLM training and distillation, this work has broad applicability and immediate practical impact. Paper 2 makes a rigorous theoretical contribution by achieving rate-optimal queue length regret with matching lower bounds, but its impact is more niche, limited to the queueing bandits community. Paper 1's relevance to the rapidly growing LLM field gives it higher potential impact.

claude-opus-4-6·Jun 9, 2026

#4479of 5669·cs.LG

#4479 of 5669 · cs.LG

Tournament Score

1322±44

10501750

41%

Win Rate

Wins

Losses

Matches

Rating

7/ 10

Significance7

Rigor8.5

Novelty6.5

Clarity8