Linfeng Cao, Ming Shi, Ness B. Shroff
Personalized decision-making in multi-objective bandits requires learning user-specific trade-offs among competing objectives. Since arm utility depends on both unknown rewards and unknown preferences, existing methods infer preferences only from utility feedback, entangling preference learning with reward exploration. In practice, however, users often reveal their priorities through proactive conversational queries (e.g., "cheap and clean hotel"), yet this structured signal is not leveraged. We formalize a proactive query-based framework in which user queries provide structured preference signals. Modeling these signals via a Plackett-Luce subset choice model, we show that query-only learning is insufficient due to a fundamental shift-invariance barrier. To resolve this, we introduce MO-PQUCB, a hybrid algorithm that integrates query-based preference anchoring with bandit feedback through shift-invariant regularization and dual-exploration UCB. We prove that proactive queries accelerate preference estimation and yield improved regret scaling over prior preference-aware MO-MAB methods. Under corrupted queries, we further characterize statistical limits and design a robust estimator achieving near-optimal performance when the corruption is sparse. Experiments validate both theoretical and practical gains.
This paper introduces a framework for integrating proactive user queries (e.g., "cheap and clean hotel") into personalized multi-objective multi-armed bandits (MO-MAB). The key contributions are threefold: (1) formalizing proactive query feedback via a Plackett-Luce (PL) subset choice model, (2) identifying a fundamental shift-invariance barrier showing that query-only preference learning is insufficient for optimal decision-making (Theorem 1), and (3) designing MO-PQUCB, a hybrid algorithm that resolves this barrier by combining PL-based preference anchoring with bandit utility feedback through carefully designed shift-invariant regularization.
The shift-invariance insight is the paper's cleanest theoretical contribution: since PL models are invariant to additive shifts of the parameter vector, query-derived estimates recover preferences only up to a constant—but when arms have different total reward sums, this unidentified constant changes arm rankings. This is an elegant observation that directly motivates the hybrid design.
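The barrier can be illustrated with a small numerical sketch. It assumes a linear utility (inner product of a preference vector and an arm's reward vector), which is an assumption of this sketch and is not quoted from the paper. The arm names, reward values, and function names are hypothetical.

```python
import math

def pl_choice_probs(theta):
    """Plackett-Luce probability of choosing each objective from the full set."""
    m = max(theta)  # subtract the max for numerical stability
    weights = [math.exp(t - m) for t in theta]
    total = sum(weights)
    return [w / total for w in weights]

def utility(pref, rewards):
    """Assumed linear utility: inner product of preference and reward vectors."""
    return sum(p * r for p, r in zip(pref, rewards))

def best_arm(pref, arms):
    """Name of the arm with the highest assumed utility."""
    return max(arms, key=lambda name: utility(pref, arms[name]))

theta = [1.0, 0.0]                   # hypothetical preference parameters
shifted = [t + 2.0 for t in theta]   # same parameters plus a constant shift

# Hypothetical arms whose reward vectors have different sums (1.0 vs 1.4).
arms = {"A": [0.9, 0.1], "B": [0.5, 0.9]}

# The PL choice probabilities cannot tell theta and shifted apart...
probs_equal = all(
    math.isclose(p, q)
    for p, q in zip(pl_choice_probs(theta), pl_choice_probs(shifted))
)

# ...but the best arm under the assumed linear utility differs.
best_original = best_arm(theta, arms)
best_shifted = best_arm(shifted, arms)
```

Here both parameter vectors induce identical query distributions, yet the shift adds 2.0 times each arm's reward sum to its utility, which is enough to swap the ranking of the two arms.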
The theoretical framework is comprehensive and technically sound. The regularization matrix U_λ = I - (1/D)11^⊤ + λI is well-motivated: it restricts regularization to the subspace orthogonal to the all-ones vector (where QE provides information) while leaving the shift direction for bandit feedback to resolve. This geometric alignment between the estimator structure and the identifiability gap is a noteworthy design choice.
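A minimal sketch of this geometry, using the matrix exactly as written above with hypothetical values D = 3 and λ = 0.1: the all-ones (shift) direction is penalized only by λ, while any direction orthogonal to it is penalized by 1 + λ.

```python
def shift_invariant_regularizer(D, lam):
    """Build U_lambda = I - (1/D) 1 1^T + lam * I entry by entry."""
    return [
        [(1.0 + lam if i == j else 0.0) - 1.0 / D for j in range(D)]
        for i in range(D)
    ]

def matvec(M, v):
    """Plain matrix-vector product on nested lists."""
    return [sum(m_ij * v_j for m_ij, v_j in zip(row, v)) for row in M]

D, lam = 3, 0.1                      # hypothetical dimension and lambda
U = shift_invariant_regularizer(D, lam)

shift_dir = [1.0] * D                # all-ones direction (the unidentified shift)
contrast = [1.0, -1.0, 0.0]          # sums to zero, so orthogonal to all-ones

U_shift = matvec(U, shift_dir)       # approximately lam * shift_dir
U_contrast = matvec(U, contrast)     # approximately (1 + lam) * contrast
```

With a small λ, the regularizer therefore leaves the shift direction almost unconstrained, which is the direction the review says is left to bandit feedback.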
The regret analysis yields O(N√(T log T)), improving by a factor of √(log T) over the prior PRUCB method's O(N√T · log T). While modest, this improvement is cleanly attributed to the proactive QE signal. The proof machinery combines standard tools (matrix Bernstein inequality, self-normalized bounds) with PL-specific analysis (comparison graph Laplacians, spectral gap bounds).
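To give a sense of scale, the sketch below reads the two rates as N√(T log T) for MO-PQUCB and N√T · log T for PRUCB, the only pair of readings consistent with the stated √(log T) factor. Constants hidden by the O-notation are ignored, and the values of N and T are hypothetical.

```python
import math

def prior_rate(N, T):
    """Assumed PRUCB rate N * sqrt(T) * log(T), constants dropped."""
    return N * math.sqrt(T) * math.log(T)

def improved_rate(N, T):
    """Assumed MO-PQUCB rate N * sqrt(T * log(T)), constants dropped."""
    return N * math.sqrt(T * math.log(T))

N, T = 5, 10**6                      # hypothetical problem size and horizon
ratio = prior_rate(N, T) / improved_rate(N, T)
expected_ratio = math.sqrt(math.log(T))   # the sqrt(log T) factor
```

At T = 10^6 the factor √(log T) is under 4, which matches the review's description of the improvement as modest.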
The corruption analysis (Section 6) is a valuable extension. The lower bound (Theorem 3) identifies a sharp phase transition at ε = 1/2, the group-wise Lasso estimator is well-suited to the structured corruption model, and the regret degrades gracefully as O(√(εT log T)). The observation that setting α = O(ε + log⁻¹(T)) automatically shifts reliance from corrupted queries to bandit feedback is practically appealing.
Practical relevance: The framework directly addresses how modern LLM-mediated conversational systems can be formally integrated into sequential decision-making. The LLM experiments (Section 7.4), though preliminary, demonstrate feasibility with multiple production models (Gemini, GPT-OSS, Llama, Qwen, DeepSeek).
Broader influence: The paper bridges conversational AI and bandit theory in a principled way. The proactive query paradigm—where users initiate preference signals rather than responding to system prompts—is more aligned with modern conversational interfaces than prior passive/system-driven approaches. This could influence recommendation systems, dialogue planning, and interactive decision support.
Cross-field connections: The PL-based preference elicitation connects to the RLHF literature, and the corruption model relates to Byzantine-robust learning, potentially seeding cross-pollination.
Highly timely. LLM-based conversational agents are ubiquitous, and formalizing how structured preference signals extracted from natural language can accelerate sequential learning addresses a genuine gap. The paper positions itself well at the intersection of multi-objective optimization, preference learning, and conversational AI.
This paper makes a solid theoretical contribution by formalizing proactive conversational queries for preference-aware MO-MAB, identifying a clean identifiability barrier, and designing an algorithm that provably benefits from structured query feedback. The theoretical treatment is thorough and the corruption analysis adds practical value. However, the quantitative improvement is modest (√log T factor), and the gap between the formal PL model and actual natural language processing remains significant.
Generated Jun 9, 2026
Paper 1 addresses a critical bottleneck in time series foundation models by introducing a highly efficient, CPU-deployable 7M-parameter model. Its ability to incorporate covariates and perform zero-shot forecasting offers massive real-world utility across diverse industries. While Paper 2 provides rigorous theoretical advancements in bandit learning, Paper 1's combination of efficiency, broad applicability, and timeliness in the foundation model landscape gives it a higher potential for widespread scientific and practical impact.
Paper 2 has higher estimated scientific impact due to its stronger methodological rigor (formal framework, impossibility result, regret guarantees, robustness under corrupted queries) and broad applicability across recommender systems, conversational AI, online learning, and preference elicitation. The proactive query signal is a timely, realistic interaction modality, and the work advances theory and practice with clear performance guarantees. Paper 1 is practically valuable for efficient speech-to-LLM adaptation, but its main contribution is an engineering/distillation recipe with impact concentrated in multimodal LLM deployment, and it is more likely to be overtaken by fast-moving model-integration trends.
Paper 2 likely has higher scientific impact due to stronger timeliness and broader cross-field relevance: leveraging foundation models for drift detection/diagnosis in online task-free continual learning connects continual learning, OOD/drift monitoring, and multimodal LLM-based system orchestration. Its modular “detect + diagnose + adapt strategy” framing is readily applicable to real-world streaming ML deployments. Paper 1 is methodologically rigorous with provable regret improvements, but its impact is narrower (multi-objective bandits with conversational preference signals) and more specialized, despite clear novelty.
Paper 2 introduces a foundational benchmark for a highly active research area: autonomous LLM agents. Benchmarks like iOSWorld drive the direction of the field, serving as standard evaluation metrics that accumulate citations rapidly from both academia and industry. While Paper 1 offers strong theoretical contributions to multi-objective bandits, its impact will likely remain within a narrower subfield of reinforcement learning. Paper 2's focus on personalization and multi-app interaction on a native mobile simulator addresses an immediate, widespread bottleneck in agentic AI evaluation, granting it significantly higher potential for broad scientific impact.
Paper 1 introduces a novel framework combining proactive conversational queries with multi-objective bandits, addressing a fundamental shift-invariance barrier with theoretical guarantees. It bridges conversational AI and bandit optimization—two highly active fields—creating broader interdisciplinary impact. The robustness analysis under corrupted queries adds practical relevance. Paper 2 makes solid contributions to non-stationary contextual bandits with constraints but operates within a more incremental extension of existing linear bandit frameworks. Paper 1's novelty in formalizing query-based preference learning and its applicability to recommendation/dialogue systems gives it higher potential impact.
Paper 1 likely has higher scientific impact due to its cross-disciplinary novelty: a unifying framework (PCPL) for learning driven by physical perturbation contrasts, spanning and extending equilibrium/frequency propagation, and demonstrated in distinct physical substrates (mechanical networks, photonic circuits) including an analog computation primitive. This opens broad real-world applications in autonomous hardware learning, neuromorphic/photonic computing, and inverse problems, with potential impact across physics, ML, and engineering. Paper 2 is methodologically rigorous with provable regret improvements, but is more domain-specific to bandits and preference elicitation.
Paper 1 bridges reinforcement learning and conversational AI, a highly timely intersection given the rise of interactive LLM agents. By incorporating proactive conversational queries into multi-objective bandits, it offers a novel, highly applicable solution to personalized recommendation systems. While Paper 2 provides rigorous theoretical advancements for conformal risk control, Paper 1 has broader appeal, more immediate real-world applications across user-facing AI systems, and aligns perfectly with current trends in human-AI interaction.
Paper 2 likely has higher scientific impact due to timeliness and broad applicability: it demonstrates a practical, reproducible path to training a 120B sparse MoE on a single 8-GPU node via an integrated recipe (reversible recurrence, state-preserving scaling, and optimizer-state reduction). This can materially lower barriers to large-model research and deployment across many domains. Paper 1 is methodologically rigorous and novel in bandits with proactive preference queries, but its impact is narrower (multi-objective decision-making). Paper 2’s open release and systems-level scalability advances are poised for wider, faster adoption.
Paper 1 demonstrates higher scientific impact potential due to its direct clinical applicability—using routine laboratory data already collected during cancer treatment to predict 162 complications weeks to months before onset, without requiring additional infrastructure. The large-scale validation (3,905 patients, 2.7M measurements) across multiple cancers and external datasets (MIMIC-IV, MMRF CoMMpass) shows robustness and generalizability. This addresses a pressing real-world clinical need in oncology. Paper 2, while theoretically rigorous in advancing multi-objective bandit theory with proactive queries, addresses a narrower algorithmic problem with less immediate broad-field impact.
Paper 2 addresses a critical bottleneck in RLVR training of LLMs—the computational cost of long-context rollouts—which is highly timely given the rapid growth of reasoning LLMs. It provides a principled framework (sparse-to-dense mismatch analysis) with practical speedups (2-2.4x) validated across multiple model scales and domains. The breadth of applicability to the booming LLM training ecosystem gives it wider immediate impact. Paper 1, while theoretically rigorous and novel in its multi-objective bandit formulation, addresses a more niche problem with narrower potential adoption.