Minimax Optimal Variance-Aware Regret Bounds for Multinomial Logistic MDPs

Pierre Boudart, Pierre Gaillard, Alessandro Rudi

May 19, 2026

arXiv:2605.19768v1 PDF

cs.AI(primary)cs.LGmath.STstat.ML

#1088of 2292·Artificial Intelligence

#1088 of 2292 · Artificial Intelligence

Tournament Score

1419±42

10501800

59%

Win Rate

Wins

Losses

Matches

Rating

7.8/ 10

Significance8.5

Rigor8

Novelty7.5

Clarity7.5

Tournament Score

1419±42

10501800

59%

Win Rate

Wins

Losses

Matches

Rating

7.8/ 10

Significance

Rigor

Novelty

Clarity

Abstract

We study reinforcement learning for episodic Markov Decision Processes (MDPs) whose transitions are modelled by a multinomial logistic (MNL) model. Existing algorithms for MNL mixture MDPs yield a regret of $\smash{\tilde{O}(dH^2\sqrt{T})}$ (Li et al., 2024), where $d$ is the feature dimension, $H$ the episode length, and $T$ the number of episodes. Inspired by the logistic bandit literature (Abeille et al., 2021; Faury et al., 2022; Boudart et al., 2026), we introduce a problem-dependent constant $\barσ\_T \leq 1/2$ , measuring the normalised average variance of the optimal downstream value function along the learner's trajectory. We propose an algorithm achieving a regret of $\smash{\tilde{O}(dH^2\barσ\_T\sqrt{T})}$ , which recovers the existing bound in the worst case and improves upon it for structured MDPs. For instance, for KL-constrained robust MDPs, $\barσ\_T = O(H^{-1})$ , reducing the horizon dependence by a factor $H$ . We further establish a matching $\smash{Ω(dH^2\barσ\_T\sqrt{T})}$ lower bound, proving minimax optimality (up to logarithmic factors) and fully characterising the regret complexity of MNL mixture MDPs for the first time.

AI Impact Assessments

(1 models)

Scientific Impact Assessment: Minimax Optimal Variance-Aware Regret Bounds for Multinomial Logistic MDPs

1. Core Contribution

This paper resolves an open problem in the regret complexity of episodic MDPs with multinomial logistic (MNL) transitions. The authors introduce a problem-dependent constant $\bar{\sigma}_T \leq 1/2$ , measuring the normalized average variance of the optimal downstream value function along the learner's trajectory. They design an algorithm (LIVAROT) achieving $\tilde{O}(dH^2\bar{\sigma}_T\sqrt{T})$ regret and prove a matching $\Omega(dH^2\bar{\sigma}_T\sqrt{T})$ lower bound. This is the first time upper and lower bounds coincide (up to log factors) for MNL mixture MDPs, fully characterizing the regret complexity of this setting.

The key insight is that the non-linearity of the softmax transition model—previously viewed as a curse requiring the problematic constant $\kappa$ —can be turned into an advantage. When transition uncertainty has little effect on downstream rewards (either because optimal dynamics are nearly deterministic or because reachable states have similar values), $\bar{\sigma}_T$ becomes small and the regret improves. This bridges the gap between the $\tilde{O}(dH^2\sqrt{T})$ upper bound of Li et al. (2024) and the $\Omega(dH^{3/2}\sqrt{T})$ lower bound of Park et al. (2024), showing the latter is a special case where $\bar{\sigma}_T = \Theta(H^{-1/2})$ .

2. Methodological Rigor

The paper is technically rigorous with carefully constructed proofs. Several aspects stand out:

Upper bound analysis: The algorithm employs an explore-then-learn paradigm, where the exploration phase constructs confidence sets small enough to leverage self-concordance of the logistic loss at constant cost. The regret analysis involves a sophisticated Taylor expansion of the softmax function, carefully separating variance terms from estimation error terms. The proof handles the non-concavity of the optimistic value function computation and correctly accounts for the martingale structure of the regret decomposition.

Lower bound construction: The authors adapt the hard MDP instance of Park et al. (2024) and compute the resulting $\bar{\sigma}_T$ , showing it equals $\Theta(H^{-1/2})$ on this instance. This elegantly demonstrates that the existing $\Omega(dH^{3/2}\sqrt{T})$ lower bound is a special case of their general result. The computation in Equation (12) using Bernoulli inequality bounds is clean and convincing.

Self-concordance exploitation: The paper carefully leverages the generalized self-concordance property of the logistic loss (Sun and Tran-Dinh, 2019), requiring the initial exploration to ensure confidence sets have diameter at most $1/(3\sqrt{2})$ . This constant-factor control is essential but well-justified.

One methodological concern is that the lower bound relies on a specific instance where $\bar{\sigma}_T$ is computed independently of the learner's trajectory. Since $\bar{\sigma}_T$ in the upper bound depends on the learner's trajectory, there is a subtle asymmetry. The authors acknowledge this limitation and note that under the optimal policy $π^{*} \pi^*$ , $\mathbb{E}_{\pi^*}[\sigma_T^2] \leq 1/(4H)$ (Proposition 7), suggesting potential for further improvement.

3. Potential Impact

Theoretical impact: This paper provides the first complete characterization of the regret complexity of MNL mixture MDPs. The introduction of $\bar{\sigma}_T$ as a problem-dependent complexity measure is a conceptual advance that may influence how researchers think about structured RL problems. The clean extension from bandit non-linearity constants ( $\kappa_*$ ) to MDP variance measures demonstrates a principled transfer of ideas from bandits to RL.

Practical relevance: The application to KL-constrained robust MDPs (Proposition 5) is particularly impactful. Robust MDPs are widely used in practice (robotics, healthcare, operations research), and showing that $\bar{\sigma}_T = O(H^{-1})$ in this setting—yielding a factor- $H$ improvement—is a significant practical insight. The connection between the robust MDP worst-case kernel and the MNL transition model is natural and well-articulated.

Broader influence: The variance-aware framework may inspire similar analyses for other structured RL settings with non-linear function approximation (e.g., neural network-based transitions). The connection to variance-based bounds in online learning (Cesa-Bianchi et al., 2007; Gaillard et al., 2014) enriches the cross-pollination between these fields.

4. Timeliness & Relevance

The paper addresses a current bottleneck: existing algorithms for MNL MDPs are strictly worst-case and do not adapt to problem structure. With growing interest in function approximation beyond linear models and in robust RL, this work is timely. The MNL model is arguably the most natural parametric model for transition probabilities (ensuring non-negativity and normalization), making theoretical characterization of this setting practically important.

5. Strengths & Limitations

Strengths:

Achieves the first matching upper and lower bounds for MNL mixture MDPs

The problem-dependent constant

\bar{\sigma}_T

is intuitive and well-motivated

Clean connection to bandit literature and robust MDPs

Recovers prior worst-case bounds while strictly improving in structured settings

Limitations:

The exploration phase requires

\tau = O(\text{poly}(\kappa, d, B))

episodes, which can be large. Though this only affects lower-order terms, it may limit practical applicability

The non-concave optimization in the optimistic value computation (Equation 5) is solved only approximately via Frank-Wolfe; theoretical guarantees for this approximation are not provided

The trajectory-dependent definition of

\bar{\sigma}_T

is somewhat unusual for a "problem-dependent" constant; an environment-only characterization would be cleaner

Experiments are limited to a specific hard instance with

d = 2

; broader empirical validation on realistic problems is absent

The constant

\nu_{\max}

in lower-order terms requires finite action spaces and finite suboptimality gaps

Rating:7.8/ 10

Significance 8.5Rigor 8Novelty 7.5Clarity 7.5

Generated May 20, 2026

Comparison History (17)

vs. Meta-Soft: Leveraging Composable Meta-Tokens for Context-Preserving KV Cache Compression

claude-opus-4.65/22/2026

Paper 2 establishes minimax optimal regret bounds for MNL mixture MDPs, providing both upper and lower bounds that fully characterize the regret complexity for the first time. This is a fundamental theoretical contribution with lasting impact in reinforcement learning theory. The introduction of the variance-aware constant and matching lower bound represents a clean, definitive result. Paper 1 proposes an incremental engineering improvement to KV cache compression with heuristic components, addressing a practically relevant but narrower problem without the same level of theoretical rigor or generalizability across the field.

vs. Search-E1: Self-Distillation Drives Self-Evolution in Search-Augmented Reasoning

gemini-3.15/22/2026

Paper 1 addresses a highly timely and widely researched problem in AI (search-augmented reasoning in LLMs) by proposing a simplified, effective self-evolution method. Its practical applicability to building competent AI agents and its strong empirical performance on standard benchmarks give it broader appeal and higher potential for immediate real-world impact and citations compared to Paper 2, which offers a robust but specialized theoretical contribution to reinforcement learning.

vs. DecisionBench: A Benchmark for Emergent Delegation in Long-Horizon Agentic Workflows

claude-opus-4.65/20/2026

Paper 1 makes a fundamental theoretical contribution to reinforcement learning by establishing minimax optimal variance-aware regret bounds for MNL mixture MDPs, with both upper and lower bounds that fully characterize the regret complexity. This resolves an open theoretical question with rigorous proofs and has broad implications across RL theory. Paper 2 introduces a useful benchmark for LLM delegation workflows, but benchmarks tend to have more transient impact, and its key finding—that quality is indistinguishable across conditions—limits its immediate practical influence. Paper 1's theoretical results are more enduring and foundational.

vs. Embedding by Elicitation: Dynamic Representations for Bayesian Optimization of System Prompts

gpt-5.25/20/2026

Paper 1 is more likely to have higher near-term scientific and real-world impact: it tackles a timely, widely encountered industry problem (optimizing system prompts under aggregate-only metrics) and proposes a novel LLM-in-the-loop Bayesian optimization pipeline with adaptive, interpretable representations, which is broadly applicable across LLM deployment, HCI, and optimization. Paper 2 is methodologically very rigorous and theoretically valuable (minimax-optimal regret with variance-aware constants), but its impact is narrower to specialized RL theory for MNL MDPs and may translate more slowly to practice.

vs. Not Every Rubric Teaches Equally: Policy-Aware Rubric Rewards for RLVR

claude-opus-4.65/20/2026

Paper 1 establishes minimax optimal regret bounds for MNL mixture MDPs with a matching lower bound, fully characterizing the regret complexity of this problem class for the first time. This represents a fundamental theoretical contribution to reinforcement learning theory with clean, definitive results. Paper 2 presents a practical but incremental improvement to rubric-based reward aggregation in RLHF/RLVR, addressing a narrower engineering problem. While useful, Paper 1's theoretical completeness (tight upper and lower bounds) and its connections across structured MDPs give it broader and more lasting scientific impact.

vs. Explainable Wastewater Digital Twins: Adaptive Context-Conditioned Structured Simulators with Self-Falsifying Decision Support

gemini-3.15/20/2026

Paper 1 demonstrates a high potential for immediate, real-world impact by bridging AI and environmental engineering. Its focus on safety-critical digital twins for wastewater treatment directly addresses pressing sustainability and climate challenges (energy efficiency, N2O reduction). By validating its explainable, self-falsifying AI pipeline on full-scale real-world plants, it offers immense practical value. While Paper 2 provides excellent fundamental theoretical contributions to RL (closing a regret bound gap), Paper 1's broader interdisciplinary applicability and timely environmental focus give it a higher estimated overall scientific and societal impact.

vs. GraphMind: From Operational Traces to Self-Evolving Workflow Automation

gpt-5.25/20/2026

Paper 1 likely has higher scientific impact due to a novel end-to-end, closed-loop workflow automation system (trace-to-graph construction, LLM-guided execution, and reinforcement-based self-evolution) demonstrated in multiple real production services with strong expert-evaluated gains—indicating immediate real-world applicability and broad relevance across AIOps, enterprise automation, knowledge graphs, and LLM agents. Paper 2 is methodologically rigorous and timely for theory RL, but its impact is narrower (specialized MNL-structured MDPs) and primarily theoretical without demonstrated downstream deployment leverage.

vs. Using Aristotle API for AI-Assisted Theorem Proving in Lean 4: A Formalisation Case Study of the Grasshopper Problem

gemini-3.15/20/2026

Paper 1 makes fundamental theoretical contributions to reinforcement learning by establishing minimax optimal regret bounds and fully characterizing the regret complexity for MNL mixture MDPs. This rigorous mathematical advancement has broad implications for RL algorithm design. In contrast, Paper 2 is a narrower case study on a specific AI theorem proving attempt, highlighting current limitations rather than providing a generalized, foundational breakthrough.

vs. CatalyticMLLM: A Graph-Text Multimodal Large Language Model for Catalytic Materials

gemini-3.15/20/2026

Paper 1 presents a highly impactful interdisciplinary application of multimodal large language models to materials science, specifically targeting the discovery and design of catalytic materials. Unifying property prediction and inverse design addresses a major bottleneck in AI-driven material discovery. Its potential for real-world applications in clean energy and chemistry gives it a broader and more tangible scientific impact compared to Paper 2, which, while mathematically rigorous, focuses on highly specialized theoretical bounds within reinforcement learning.

vs. TOBench: A Task-Oriented Omni-Modal Benchmark for Real-World Tool-Using Agents

gpt-5.25/20/2026

Paper 2 likely has higher scientific impact due to its strong real-world applicability and timeliness: an executable, tool-integrated, omni-modal benchmark with closed-loop verification directly targets a central bottleneck in evaluating and improving agentic systems. Such benchmarks often become shared infrastructure, enabling broad, cross-field impact (LLMs/agents, HCI, software engineering, multimodal learning, evaluation). Paper 1 is methodologically rigorous and novel in regret characterization for MNL MDPs, but is more specialized with narrower immediate adoption outside RL theory.

vs. EngiAI: A Multi-Agent Framework and Benchmark Suite for LLM-Driven Engineering Design

claude-opus-4.65/20/2026

Paper 2 establishes minimax optimal regret bounds for MNL mixture MDPs, introducing a variance-aware constant that tightens existing bounds and proving matching lower bounds. This is a fundamental theoretical contribution that fully characterizes the regret complexity of an important RL problem class for the first time. Its methodological rigor (matching upper and lower bounds) and broad applicability to structured MDPs (e.g., robust MDPs) give it lasting theoretical impact. Paper 1, while practically useful, is primarily an engineering benchmark/framework contribution with more incremental novelty and narrower impact scope.

vs. Generative Auto-Bidding with Unified Modeling and Exploration

gpt-5.25/20/2026

Paper 2 likely has higher scientific impact: it provides a new variance-aware, problem-dependent regret bound with a matching lower bound, yielding a first full characterization of regret complexity for MNL mixture MDPs. This is methodologically rigorous, theoretically novel, and broadly relevant across RL, bandits, and structured MDPs, with durable value as a foundational result. Paper 1 shows strong applied impact in ad bidding and impressive real-world gains, but its techniques are more domain-specific and may generalize less broadly than a minimax-optimal theory result.

vs. Generalization or Memorization? Brittleness Testing for Chess-Trained Language Models

gpt-5.25/20/2026

Paper 2 likely has higher scientific impact: it delivers a new variance-aware regret bound with a matching lower bound, establishing (near) minimax-optimal regret for MNL logistic MDPs and fully characterizing complexity—an enduring theoretical contribution broadly relevant to RL, bandits, and decision-making under structured models. Its methodological rigor (upper/lower bounds) and generality suggest wide reuse and follow-on work. Paper 1 is timely and useful for auditing chess-LLMs and verifier-in-the-loop evaluation, but its domain-specific empirical focus may limit breadth and longevity compared to a foundational RL theory result.

vs. Efficient Elicitation of Collective Disagreements

gemini-3.15/20/2026

Paper 2 establishes minimax optimal regret bounds and fully characterizes the regret complexity for MNL mixture MDPs for the first time. Resolving fundamental theoretical bounds in reinforcement learning typically yields high scientific impact due to the field's size and the foundational nature of the contribution. While Paper 1 introduces an interesting framework for voting theory, Paper 2's rigorous mathematical guarantees and tight lower bounds represent a more definitive and impactful advancement in a highly active research area.

vs. Pairwise Preference Reward and Group-Based Diversity Enhancement for Superior Open-Ended Generation

gpt-5.25/20/2026

Paper 2 likely has higher scientific impact: it provides a principled theoretical advance (variance-aware regret bound with matching lower bound) that fully characterizes regret complexity for MNL mixture MDPs and is minimax optimal. This is methodologically rigorous and broadly reusable across RL theory, bandits, and structured/robust MDPs, with clear timeliness given growing interest in generalized linear/structured RL. Paper 1 is practically relevant for LLM alignment and diversity, but appears more incremental and domain-specific, with impact depending on empirical adoption and evaluation robustness.

vs. Scientific Logicality Enriched Methodology for LLM Reasoning: A Practice in Physics

gpt-5.25/20/2026

Paper 1 offers a theoretically novel and rigorous contribution: variance-aware regret bounds for MNL MDPs with a problem-dependent constant, plus a matching lower bound establishing (near) minimax optimality and fully characterizing regret complexity—high impact within RL theory and potentially downstream algorithm design. Paper 2 is timely and applied (LLM logicality in physics) with open code, but the contribution appears more methodological/dataset-centric and its generality and rigor may be harder to validate or transfer broadly beyond evaluation/training practices.

vs. WebGameBench: Requirement-to-Application Evaluation for Coding Agents via Browser-Native Games

gpt-5.25/20/2026

Paper 2 likely has higher scientific impact due to strong timeliness and broad applicability: it provides a scalable, end-to-end benchmark for coding agents focused on delivered application behavior, with real-browser runtime evaluation and partial human validation. This can directly influence how agentic coding systems are trained, compared, and deployed across academia and industry, affecting multiple subfields (LLM agents, software engineering, HCI, evaluation). Paper 1 is technically rigorous and novel in theory (variance-aware minimax regret for MNL MDPs) but its impact is narrower and more incremental within RL theory.