Minimax Optimal Variance-Aware Regret Bounds for Multinomial Logistic MDPs
Pierre Boudart, Pierre Gaillard, Alessandro Rudi
Abstract
We study reinforcement learning for episodic Markov Decision Processes (MDPs) whose transitions are modelled by a multinomial logistic (MNL) model. Existing algorithms for MNL mixture MDPs yield a regret of (Li et al., 2024), where is the feature dimension, the episode length, and the number of episodes. Inspired by the logistic bandit literature (Abeille et al., 2021; Faury et al., 2022; Boudart et al., 2026), we introduce a problem-dependent constant , measuring the normalised average variance of the optimal downstream value function along the learner's trajectory. We propose an algorithm achieving a regret of , which recovers the existing bound in the worst case and improves upon it for structured MDPs. For instance, for KL-constrained robust MDPs, , reducing the horizon dependence by a factor . We further establish a matching lower bound, proving minimax optimality (up to logarithmic factors) and fully characterising the regret complexity of MNL mixture MDPs for the first time.
AI Impact Assessments
(1 models)Scientific Impact Assessment: Minimax Optimal Variance-Aware Regret Bounds for Multinomial Logistic MDPs
1. Core Contribution
This paper resolves an open problem in the regret complexity of episodic MDPs with multinomial logistic (MNL) transitions. The authors introduce a problem-dependent constant , measuring the normalized average variance of the optimal downstream value function along the learner's trajectory. They design an algorithm (LIVAROT) achieving regret and prove a matching lower bound. This is the first time upper and lower bounds coincide (up to log factors) for MNL mixture MDPs, fully characterizing the regret complexity of this setting.
The key insight is that the non-linearity of the softmax transition model—previously viewed as a curse requiring the problematic constant —can be turned into an advantage. When transition uncertainty has little effect on downstream rewards (either because optimal dynamics are nearly deterministic or because reachable states have similar values), becomes small and the regret improves. This bridges the gap between the upper bound of Li et al. (2024) and the lower bound of Park et al. (2024), showing the latter is a special case where .
2. Methodological Rigor
The paper is technically rigorous with carefully constructed proofs. Several aspects stand out:
Upper bound analysis: The algorithm employs an explore-then-learn paradigm, where the exploration phase constructs confidence sets small enough to leverage self-concordance of the logistic loss at constant cost. The regret analysis involves a sophisticated Taylor expansion of the softmax function, carefully separating variance terms from estimation error terms. The proof handles the non-concavity of the optimistic value function computation and correctly accounts for the martingale structure of the regret decomposition.
Lower bound construction: The authors adapt the hard MDP instance of Park et al. (2024) and compute the resulting , showing it equals on this instance. This elegantly demonstrates that the existing lower bound is a special case of their general result. The computation in Equation (12) using Bernoulli inequality bounds is clean and convincing.
Self-concordance exploitation: The paper carefully leverages the generalized self-concordance property of the logistic loss (Sun and Tran-Dinh, 2019), requiring the initial exploration to ensure confidence sets have diameter at most . This constant-factor control is essential but well-justified.
One methodological concern is that the lower bound relies on a specific instance where is computed independently of the learner's trajectory. Since in the upper bound depends on the learner's trajectory, there is a subtle asymmetry. The authors acknowledge this limitation and note that under the optimal policy , (Proposition 7), suggesting potential for further improvement.
3. Potential Impact
Theoretical impact: This paper provides the first complete characterization of the regret complexity of MNL mixture MDPs. The introduction of as a problem-dependent complexity measure is a conceptual advance that may influence how researchers think about structured RL problems. The clean extension from bandit non-linearity constants () to MDP variance measures demonstrates a principled transfer of ideas from bandits to RL.
Practical relevance: The application to KL-constrained robust MDPs (Proposition 5) is particularly impactful. Robust MDPs are widely used in practice (robotics, healthcare, operations research), and showing that in this setting—yielding a factor- improvement—is a significant practical insight. The connection between the robust MDP worst-case kernel and the MNL transition model is natural and well-articulated.
Broader influence: The variance-aware framework may inspire similar analyses for other structured RL settings with non-linear function approximation (e.g., neural network-based transitions). The connection to variance-based bounds in online learning (Cesa-Bianchi et al., 2007; Gaillard et al., 2014) enriches the cross-pollination between these fields.
4. Timeliness & Relevance
The paper addresses a current bottleneck: existing algorithms for MNL MDPs are strictly worst-case and do not adapt to problem structure. With growing interest in function approximation beyond linear models and in robust RL, this work is timely. The MNL model is arguably the most natural parametric model for transition probabilities (ensuring non-negativity and normalization), making theoretical characterization of this setting practically important.
5. Strengths & Limitations
Strengths:
Limitations:
Generated May 20, 2026
Comparison History (17)
Paper 2 establishes minimax optimal regret bounds for MNL mixture MDPs, providing both upper and lower bounds that fully characterize the regret complexity for the first time. This is a fundamental theoretical contribution with lasting impact in reinforcement learning theory. The introduction of the variance-aware constant and matching lower bound represents a clean, definitive result. Paper 1 proposes an incremental engineering improvement to KV cache compression with heuristic components, addressing a practically relevant but narrower problem without the same level of theoretical rigor or generalizability across the field.
Paper 1 addresses a highly timely and widely researched problem in AI (search-augmented reasoning in LLMs) by proposing a simplified, effective self-evolution method. Its practical applicability to building competent AI agents and its strong empirical performance on standard benchmarks give it broader appeal and higher potential for immediate real-world impact and citations compared to Paper 2, which offers a robust but specialized theoretical contribution to reinforcement learning.
Paper 1 makes a fundamental theoretical contribution to reinforcement learning by establishing minimax optimal variance-aware regret bounds for MNL mixture MDPs, with both upper and lower bounds that fully characterize the regret complexity. This resolves an open theoretical question with rigorous proofs and has broad implications across RL theory. Paper 2 introduces a useful benchmark for LLM delegation workflows, but benchmarks tend to have more transient impact, and its key finding—that quality is indistinguishable across conditions—limits its immediate practical influence. Paper 1's theoretical results are more enduring and foundational.
Paper 1 is more likely to have higher near-term scientific and real-world impact: it tackles a timely, widely encountered industry problem (optimizing system prompts under aggregate-only metrics) and proposes a novel LLM-in-the-loop Bayesian optimization pipeline with adaptive, interpretable representations, which is broadly applicable across LLM deployment, HCI, and optimization. Paper 2 is methodologically very rigorous and theoretically valuable (minimax-optimal regret with variance-aware constants), but its impact is narrower to specialized RL theory for MNL MDPs and may translate more slowly to practice.
Paper 1 establishes minimax optimal regret bounds for MNL mixture MDPs with a matching lower bound, fully characterizing the regret complexity of this problem class for the first time. This represents a fundamental theoretical contribution to reinforcement learning theory with clean, definitive results. Paper 2 presents a practical but incremental improvement to rubric-based reward aggregation in RLHF/RLVR, addressing a narrower engineering problem. While useful, Paper 1's theoretical completeness (tight upper and lower bounds) and its connections across structured MDPs give it broader and more lasting scientific impact.
Paper 1 demonstrates a high potential for immediate, real-world impact by bridging AI and environmental engineering. Its focus on safety-critical digital twins for wastewater treatment directly addresses pressing sustainability and climate challenges (energy efficiency, N2O reduction). By validating its explainable, self-falsifying AI pipeline on full-scale real-world plants, it offers immense practical value. While Paper 2 provides excellent fundamental theoretical contributions to RL (closing a regret bound gap), Paper 1's broader interdisciplinary applicability and timely environmental focus give it a higher estimated overall scientific and societal impact.
Paper 1 likely has higher scientific impact due to a novel end-to-end, closed-loop workflow automation system (trace-to-graph construction, LLM-guided execution, and reinforcement-based self-evolution) demonstrated in multiple real production services with strong expert-evaluated gains—indicating immediate real-world applicability and broad relevance across AIOps, enterprise automation, knowledge graphs, and LLM agents. Paper 2 is methodologically rigorous and timely for theory RL, but its impact is narrower (specialized MNL-structured MDPs) and primarily theoretical without demonstrated downstream deployment leverage.
Paper 1 makes fundamental theoretical contributions to reinforcement learning by establishing minimax optimal regret bounds and fully characterizing the regret complexity for MNL mixture MDPs. This rigorous mathematical advancement has broad implications for RL algorithm design. In contrast, Paper 2 is a narrower case study on a specific AI theorem proving attempt, highlighting current limitations rather than providing a generalized, foundational breakthrough.
Paper 1 presents a highly impactful interdisciplinary application of multimodal large language models to materials science, specifically targeting the discovery and design of catalytic materials. Unifying property prediction and inverse design addresses a major bottleneck in AI-driven material discovery. Its potential for real-world applications in clean energy and chemistry gives it a broader and more tangible scientific impact compared to Paper 2, which, while mathematically rigorous, focuses on highly specialized theoretical bounds within reinforcement learning.
Paper 2 likely has higher scientific impact due to its strong real-world applicability and timeliness: an executable, tool-integrated, omni-modal benchmark with closed-loop verification directly targets a central bottleneck in evaluating and improving agentic systems. Such benchmarks often become shared infrastructure, enabling broad, cross-field impact (LLMs/agents, HCI, software engineering, multimodal learning, evaluation). Paper 1 is methodologically rigorous and novel in regret characterization for MNL MDPs, but is more specialized with narrower immediate adoption outside RL theory.
Paper 2 establishes minimax optimal regret bounds for MNL mixture MDPs, introducing a variance-aware constant that tightens existing bounds and proving matching lower bounds. This is a fundamental theoretical contribution that fully characterizes the regret complexity of an important RL problem class for the first time. Its methodological rigor (matching upper and lower bounds) and broad applicability to structured MDPs (e.g., robust MDPs) give it lasting theoretical impact. Paper 1, while practically useful, is primarily an engineering benchmark/framework contribution with more incremental novelty and narrower impact scope.
Paper 2 likely has higher scientific impact: it provides a new variance-aware, problem-dependent regret bound with a matching lower bound, yielding a first full characterization of regret complexity for MNL mixture MDPs. This is methodologically rigorous, theoretically novel, and broadly relevant across RL, bandits, and structured MDPs, with durable value as a foundational result. Paper 1 shows strong applied impact in ad bidding and impressive real-world gains, but its techniques are more domain-specific and may generalize less broadly than a minimax-optimal theory result.
Paper 2 likely has higher scientific impact: it delivers a new variance-aware regret bound with a matching lower bound, establishing (near) minimax-optimal regret for MNL logistic MDPs and fully characterizing complexity—an enduring theoretical contribution broadly relevant to RL, bandits, and decision-making under structured models. Its methodological rigor (upper/lower bounds) and generality suggest wide reuse and follow-on work. Paper 1 is timely and useful for auditing chess-LLMs and verifier-in-the-loop evaluation, but its domain-specific empirical focus may limit breadth and longevity compared to a foundational RL theory result.
Paper 2 establishes minimax optimal regret bounds and fully characterizes the regret complexity for MNL mixture MDPs for the first time. Resolving fundamental theoretical bounds in reinforcement learning typically yields high scientific impact due to the field's size and the foundational nature of the contribution. While Paper 1 introduces an interesting framework for voting theory, Paper 2's rigorous mathematical guarantees and tight lower bounds represent a more definitive and impactful advancement in a highly active research area.
Paper 2 likely has higher scientific impact: it provides a principled theoretical advance (variance-aware regret bound with matching lower bound) that fully characterizes regret complexity for MNL mixture MDPs and is minimax optimal. This is methodologically rigorous and broadly reusable across RL theory, bandits, and structured/robust MDPs, with clear timeliness given growing interest in generalized linear/structured RL. Paper 1 is practically relevant for LLM alignment and diversity, but appears more incremental and domain-specific, with impact depending on empirical adoption and evaluation robustness.
Paper 1 offers a theoretically novel and rigorous contribution: variance-aware regret bounds for MNL MDPs with a problem-dependent constant, plus a matching lower bound establishing (near) minimax optimality and fully characterizing regret complexity—high impact within RL theory and potentially downstream algorithm design. Paper 2 is timely and applied (LLM logicality in physics) with open code, but the contribution appears more methodological/dataset-centric and its generality and rigor may be harder to validate or transfer broadly beyond evaluation/training practices.
Paper 2 likely has higher scientific impact due to strong timeliness and broad applicability: it provides a scalable, end-to-end benchmark for coding agents focused on delivered application behavior, with real-browser runtime evaluation and partial human validation. This can directly influence how agentic coding systems are trained, compared, and deployed across academia and industry, affecting multiple subfields (LLM agents, software engineering, HCI, evaluation). Paper 1 is technically rigorous and novel in theory (variance-aware minimax regret for MNL MDPs) but its impact is narrower and more incremental within RL theory.