Online Pandora's Box for Contextual LLM Cascading

Alexandre Belloni, Yan Chen, Yehua Wei

Jun 5, 2026arXiv:2606.07392v1

cs.AIcs.LGecon.EMstat.ML

#899of 3489·Artificial Intelligence

#899 of 3489 · Artificial Intelligence

Tournament Score

1452±46

10501800

68%

Win Rate

Wins

Losses

Matches

Rating

7.2/ 10

Significance7.5

Rigor8

Novelty7.5

Clarity7.5

Abstract

Motivated by Large Language Model (LLM) cascading, we propose an online contextual Pandora's Box model for adaptively querying and selecting LLM APIs. In each period, a decision-maker observes a request context and faces a two-phase decision problem. In the query phase, the decision-maker sequentially queries APIs, where each query reveals a generated output and the decision-maker incurs an (output-dependent) cost. In the selection phase, the decision-maker selects one of the generated outputs to deploy and observes only the downstream reward of the deployed output. This output-mediated feedback structure differs from classical online contextual Pandora's Box models, in which opening a box directly reveals its reward. Rather than estimating the full conditional output and cost distributions of each API, we directly model the reservation index and develop a learning approach for the query phase. Specifically, we impose a parametric structure on the contextual reservation index functions induced by the classical Weitzman's policy. Our policy combines generalized method of moments (GMM) type estimation of these reservation indices with UCB-style confidence bounds for both these indices and the shared output-level reward evaluator. Under regularity conditions, we prove that the resulting policy achieves dimension-dependent $\widetilde O(\sqrt T)$ cumulative regret over a horizon of $T$ periods.

AI Impact Assessments

(1 models)

Scientific Impact Assessment

Core Contribution

This paper formulates LLM cascading as an online contextual Pandora's Box problem with a novel two-phase structure: a query phase (which APIs to call and when to stop) and a selection phase (which generated output to deploy). The key structural novelty is the output-mediated feedback: querying a box reveals a stochastic output-cost pair rather than directly revealing a reward, and the downstream reward is observed only for the selected output. This departs from classical Pandora's Box formulations where opening a box directly reveals its value.

The paper's main algorithmic contribution is the COSMOS policy, which combines GMM-type estimation of contextual reservation indices (parameterized as generalized linear functions of context features) with UCB-style confidence bounds. Rather than estimating the full conditional output distributions of each API, the approach directly models and learns the Weitzman reservation indices—a practically motivated and theoretically elegant shortcut that avoids the curse of dimensionality in distribution estimation.

Methodological Rigor

The theoretical analysis is thorough and well-structured. Several aspects deserve highlight:

1. Regret Decomposition (Theorem 1): Under optimism, the per-period regret cleanly separates into reward estimation error (for selected outputs only) and index estimation errors (for queried APIs only). This decomposition is both intuitive and technically powerful, enabling separate treatment of the two learning challenges.

2. M-estimation approach for indices: The paper avoids standard GMM curvature requirements by formulating index estimation as M-estimation, where the loss function's first-order condition recovers the moment equation. Lemma 1 establishes population curvature that depends on the prediction error $\psi(x)^\top(\rho - \rho_a)$ rather than requiring full-rank identification of the parameter vector—a meaningful relaxation.

3. Known vs. Unknown Reward: The staged presentation (Section 4 for known reward, Section 5 for unknown) is pedagogically effective. The unknown-reward case introduces genuine technical challenges: the plug-in perturbation (Lemma 4) creates a mismatch between selected outputs (informing reward learning) and queried outputs (entering index estimation), necessitating Assumption 5 (anti-concentration of reward features).

4. Regret bound: The final $\tilde{O}([d + A(p_A + \sqrt{dp_A})]\sqrt{T})$ bound is dimension-dependent but achieves the optimal $\sqrt{T}$ rate. This improves upon the $\tilde{O}(T^{5/6})$ rate of Atsidakou et al. (2024) for contextual Pandora's Box, though under stronger regularity assumptions (Assumption 2).

Potential Impact

Practical relevance: The LLM cascading motivation is timely and commercially significant. Organizations managing portfolios of LLM APIs face exactly this cost-quality tradeoff, and the paper provides a principled framework with formal guarantees. The model naturally captures token-based pricing (Remark 1) and the shared reward structure across APIs.

Theoretical contributions: The paper bridges several literatures—Pandora's Box, contextual bandits, GMM estimation—in a non-trivial way. The output-mediated feedback structure is a genuine modeling innovation that could influence future work on sequential search with intermediate observations. The connection between Weitzman indices and GMM moment conditions opens new directions for index-based learning in sequential decision problems.

Limitations of practical applicability: The generalized linear parametric assumptions on both the reward function and reservation indices may be restrictive in practice, where LLM output quality and costs may exhibit complex, non-linear context dependencies. The paper acknowledges this and suggests neural extensions as future work.

Timeliness & Relevance

The paper addresses a pressing need in the rapidly growing LLM deployment ecosystem. As organizations increasingly use multiple LLM APIs with heterogeneous cost-quality profiles, principled approaches to adaptive API selection become critical. The formulation captures a real operational decision that existing heuristic cascading methods (FrugalGPT, etc.) address without formal guarantees.

Strengths

1. Clean problem formulation that captures the essential structure of LLM cascading while remaining analytically tractable.

2. Direct index modeling avoids estimating full output distributions—a significant practical and theoretical advantage.

3. Self-correcting exploration: The regret decomposition ensures that frequently queried boxes have shrinking confidence radii, while infrequently queried boxes contribute negligibly to regret.

4. Comprehensive analysis covering both known and unknown reward regimes, with clear identification of where each assumption is needed.

5. The example in Section 2.1 effectively demonstrates why a one-shot API-as-arm bandit formulation incurs linear regret, justifying the sequential search framework.

Limitations

1. Assumption 2 (local mass around reservation thresholds) is non-trivial and may not hold when some APIs consistently dominate others across all contexts, as the probability mass above the reservation index could vanish.

2. Assumption 5 (anti-concentration) is somewhat restrictive—the authors acknowledge it fails for predictable contexts with certain feature structures (Remark 4).

3. No experiments: The paper is purely theoretical. Empirical validation on real LLM cascading scenarios would substantially strengthen the contribution and help assess the practical relevance of the parametric assumptions.

4. Single query per box per period: The restriction that each API is queried at most once per request, while consistent with existing cascading literature, limits applicability to settings where re-querying (e.g., with different prompts) might be valuable.

5. Dimension dependence: The $A \cdot p_A$ factor in the regret bound could be large when managing many APIs with high-dimensional context features.

Overall Assessment

This is a well-executed theoretical paper that establishes a novel and well-motivated connection between LLM cascading and online contextual Pandora's Box problems. The technical contributions—particularly the regret decomposition under optimism and the GMM-based index learning approach—are solid and could influence both the Pandora's Box and LLM systems literatures. The absence of empirical validation is a notable gap, but the theoretical foundations are rigorous and the problem formulation is compelling.

Rating:7.2/ 10

Significance 7.5Rigor 8Novelty 7.5Clarity 7.5

Generated Jun 8, 2026

Comparison History (19)

Lostvs. Next-Token Prediction Learns Generalisable Representations of Sleep Physiology

Paper 2 likely has higher scientific impact: it introduces a scalable next-token prediction objective for multi-modal physiological time series and demonstrates strong empirical gains on large real-world data (20k+ PSGs, 8 modalities), data-efficiency, and cross-domain generalization (e.g., AFib detection). This broad applicability across healthcare domains and timely relevance to foundation models suggests wide uptake. Paper 1 is novel and methodologically rigorous with regret guarantees for LLM cascading, but its impact is narrower (decision-theoretic API querying) and may be more specialized to online learning/operations settings.

gpt-5.2·Jun 9, 2026

Lostvs. Anything2Skill: Compiling External Knowledge into Reusable Skills for Agents

Paper 1 addresses a critical bottleneck in AI agent capabilities by proposing a method to compile unstructured knowledge into reusable procedural skills, effectively extending RAG. Its highly practical framework and strong empirical results suggest broad applicability and immediate relevance to the rapidly growing field of autonomous agents. While Paper 2 offers strong theoretical contributions to LLM API routing, Paper 1's paradigm shift from declarative retrieval to executable skill installation gives it higher potential for widespread adoption and transformative real-world impact.

gemini-3.1-pro-preview·Jun 9, 2026

Wonvs. Simulate, Reason, Decide: Scientific Reasoning with LLMs for Simulation-Driven Decision Making

Paper 1 presents a novel theoretical framework combining online learning, Pandora's Box theory, and LLM cascading with rigorous regret bounds. It introduces a new feedback structure (output-mediated), develops GMM-based estimation for reservation indices, and proves √T regret guarantees. This provides foundational algorithmic contributions applicable beyond LLMs to broader sequential decision-making. Paper 2 introduces a useful engineering framework (MechSim) for simulator reasoning but is more incremental, combining existing ideas (neuro-symbolic reasoning, structured schemas) without comparable theoretical depth or generalizable methodological innovation.

claude-opus-4-6·Jun 8, 2026

Wonvs. TRACE: A Temporal Conditional Estimation for Multimodal Time Series Foundation Models

Paper 1 presents a novel theoretical framework combining online learning, Pandora's Box theory, and LLM cascading with rigorous regret bounds. It introduces a fundamentally new problem formulation (output-mediated feedback in contextual Pandora's Box) with strong methodological contributions (GMM-based reservation index estimation, UCB confidence bounds) and provable guarantees. Paper 2, while addressing a practical problem of multimodal missingness, offers a more incremental contribution—conditional estimation for handling missing modalities—building on existing foundation model pipelines. Paper 1's theoretical novelty and its direct relevance to the rapidly growing LLM deployment ecosystem give it broader and deeper potential impact.

claude-opus-4-6·Jun 8, 2026

Lostvs. Dynamic Thinking-Token Selection for Efficient Reasoning in Large Reasoning Models

Paper 2 addresses a critical and highly timely bottleneck in Large Language Models: the massive memory footprint and computational overhead of reasoning traces (KV cache). Its dynamic token selection method offers immediate, practical improvements for LLM inference efficiency. Paper 1 presents rigorous theoretical work on LLM cascading with proven regret bounds, but Paper 2's direct optimization of LRM deployment is likely to have a broader and more immediate real-world impact across industry and applied AI research.

gemini-3.1-pro-preview·Jun 8, 2026

Wonvs. Act As a Real Researcher: A Suite of Benchmarks Evaluating Frontier LLMs and Agentic Harnesses in Research Lifecycle

Paper 2 introduces a novel online decision-making formulation tailored to LLM cascading with output-mediated feedback, along with a theoretically grounded algorithm (GMM + UCB) and regret guarantees. This methodological rigor and generality make it likely to influence work in online learning, bandits, operations research, and practical LLM system orchestration (cost/quality trade-offs). Paper 1 is timely and useful as an evaluation benchmark for agentic research behavior, but its impact is more concentrated in benchmarking/assessment and may be less foundational than a new model + provable learning results.

gpt-5.2·Jun 8, 2026

Wonvs. The Sim-to-Real Gap of Foundation Model Agents: A Unified MDP Perspective

Paper 2 offers a more technically novel and rigorous contribution: it introduces a new online contextual Pandora’s Box variant tailored to LLM cascading with output-mediated (bandit) feedback, proposes a concrete learning algorithm (GMM + UCB) and provides regret guarantees. It is timely and directly applicable to real-world LLM routing/cost-quality tradeoffs, with potential impact across online learning, operations research, and systems. Paper 1 is valuable as a unifying perspective and agenda-setting piece, but appears less methodologically concrete and likely yields fewer immediate, citable technical results.

gpt-5.2·Jun 8, 2026

Wonvs. DuMate-DeepResearch: An Auditable Multi-Agent System with Recursive Search and Rubric-Grounded Reasoning

Paper 2 has higher estimated impact due to a clearer methodological contribution (a new online learning model with output-mediated feedback and provable ~O(sqrt(T)) regret), stronger rigor via formal assumptions/theory, and broader applicability beyond deep-research agents (e.g., adaptive model/API selection, bandits, decision-making under costs). It is timely for LLM routing/cascading and offers principled guarantees useful across ML, OR, and economics. Paper 1 is application-driven and competitive on benchmarks, but is more system-engineering focused and likely less generalizable without comparable theoretical grounding.

gpt-5.2·Jun 8, 2026

Wonvs. Mutation Without Variation: Convergence Dynamics in LLM-Driven Program Evolution

Paper 2 has higher potential impact: it introduces a timely, practically motivated online learning framework for LLM API cascading with a novel output-mediated feedback structure, and provides methodological rigor via a principled policy (GMM + UCB) with provable regret bounds. Its applications (adaptive model selection, cost-quality tradeoffs) are immediate and broad across ML systems, bandits/online decision-making, and operations research. Paper 1 is insightful and novel diagnostically for LLM-driven program evolution, but is more specialized and primarily descriptive, with narrower direct real-world deployment implications and less formal theoretical grounding.

gpt-5.2·Jun 8, 2026

Lostvs. How AI Agents Reshape Knowledge Work: Autonomy, Efficiency, and Scope

Paper 2 leverages large-scale production data to empirically demonstrate how autonomous AI agents transform knowledge work. Its findings on efficiency, cost reduction, and shifting work scope have broad interdisciplinary impact across HCI, economics, and AI strategy. While Paper 1 provides rigorous theoretical bounds for LLM routing, its impact is confined to a specific algorithmic niche, making Paper 2 the clear winner in terms of broader scientific and societal significance.

gemini-3.1-pro-preview·Jun 8, 2026

#899of 3489·Artificial Intelligence

#899 of 3489 · Artificial Intelligence

Tournament Score

1452±46

10501800

68%

Win Rate

Wins

Losses

Matches

Rating

7.2/ 10

Significance7.5

Rigor8

Novelty7.5

Clarity7.5