Bridging the Detection-to-Abstention Gap in Reasoning Models under Insufficient Information
Renjie Gu, Jiaxu Li, Yihao Wang, Yun Yue, Hansong Xiao, Yefei Chen, Yuan Wang, Chunxiao Guo
Abstract
We highlight a failure mode of large reasoning models on questions with insufficient information: models may recognize that a problem is under-specified, yet still continue reasoning and produce unsupported final answers instead of abstaining. We formalize this mismatch as the detection-to-abstention gap, where detected insufficiency fails to translate into final abstention. This gap is especially concerning in high-risk domains such as medical AI, where answers based on incomplete evidence can be more harmful than refusal. To close this gap, we propose Judge-Then-Solve (JTS), a trajectory-level reasoning-control framework that trains models to make an explicit answerability commitment before solution generation. Rather than treating abstention as a final-answer style, JTS casts it as a control decision: the model either proceeds to solve or terminates early based on its answerability judgment. We instantiate this policy through supervised warm-up and missing-premise reinforcement learning with consistency and length-shaping rewards. Experiments on dense and MoE reasoning models show that JTS substantially improves reliable abstention across datasets and pushes Abstention@Detection (A@D) to near-saturation, indicating that models not only detect missing information but also act on that detection. By terminating unanswerable trajectories immediately after the answerability judgment, JTS reduces unnecessary reasoning and improves inference efficiency when continued deliberation would amplify unsupported assumptions. We also observe that missing-premise training can alter reasoning behavior on difficult but answerable problems, reducing unproductive self-reflection. These results suggest that abstention under insufficient information is a key form of reasoning control for deploying reasoning models safely and efficiently.
AI Impact Assessments
Scientific Impact Assessment: "Bridging the Detection-to-Abstention Gap in Reasoning Models under Insufficient Information"
1. Core Contribution
This paper formalizes a specific failure mode in large reasoning models (LRMs): the detection-to-abstention gap, where models recognize that a question lacks sufficient information during intermediate reasoning but nonetheless proceed to generate an unsupported answer. The key insight is that abstention failure is not purely a detection problem—models often *detect* missing premises but fail to *act* on that detection. The authors propose Judge-Then-Solve (JTS), a trajectory-level reasoning-control framework that inserts an explicit answerability judgment gate (`` block) at the beginning of the reasoning trajectory. This decomposes inference into a binary judgment phase (ANSWERABLE vs. UNANSWERABLE) followed by conditional continuation or early termination. The framework is instantiated through supervised warm-up on JTS-formatted trajectories and GRPO-based reinforcement learning with a multi-component reward (format, consistency, task, and conditional length shaping).
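To make the gate-then-continue structure and the four reward components concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the authors' implementation: the `<judge>` tag, the `Final answer:` marker, the equal weights, the word budget, and the choice to apply length shaping only to correct, consistent abstentions are all hypothetical, since the summary above does not specify them.

```python
import re
from dataclasses import dataclass
from typing import Optional

# Hypothetical gate markup; the actual tag name is not given in the source.
JUDGE_RE = re.compile(r"^\s*<judge>\s*(ANSWERABLE|UNANSWERABLE)\s*</judge>")
# Hypothetical marker for a committed final answer.
ANSWER_RE = re.compile(r"Final answer:\s*(.+)")


@dataclass
class Parsed:
    verdict: Optional[str]  # None when the gate is missing or malformed
    body: str               # everything generated after the gate


def parse(trajectory: str) -> Parsed:
    """Split a trajectory into the answerability verdict and the remaining text."""
    m = JUDGE_RE.match(trajectory)
    if m is None:
        return Parsed(None, trajectory)
    return Parsed(m.group(1), trajectory[m.end():].strip())


def final_answer(body: str) -> Optional[str]:
    """Return the committed answer, or None if the body does not give one."""
    m = ANSWER_RE.search(body)
    return m.group(1).strip() if m else None


def jts_reward(trajectory: str, answerable: bool,
               gold: Optional[str] = None, length_budget: int = 50) -> float:
    """Toy composite reward: format + consistency + task + conditional length."""
    p = parse(trajectory)
    if p.verdict is None:
        return 0.0  # no valid gate: format failure, nothing else is scored
    fmt = 1.0
    ans = final_answer(p.body)
    says_answerable = p.verdict == "ANSWERABLE"
    # Consistency: an ANSWERABLE verdict must be followed by an answer,
    # an UNANSWERABLE verdict must not be.
    consistency = 1.0 if says_answerable == (ans is not None) else 0.0
    if answerable:
        task = 1.0 if (says_answerable and ans == gold) else 0.0
        length = 0.0  # assumption: no length pressure on answerable questions
    else:
        task = 0.0 if says_answerable else 1.0
        n_words = len(p.body.split())
        # Assumption: shorter text is rewarded only for consistent abstentions.
        shaped = max(0.0, 1.0 - n_words / length_budget)
        length = shaped if (task and consistency) else 0.0
    return 0.25 * (fmt + consistency + task + length)
```

Under these assumptions, a trajectory that judges a question UNANSWERABLE and then stops scores higher than one that issues the same verdict but goes on to produce a final answer, which is the behavior the detection-to-abstention gap describes.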
2. Methodological Rigor
Strengths in methodology:
Weaknesses:
3. Potential Impact
Direct applications: The medical AI use case (MedIQ evaluation) is compelling and timely. In clinical decision support, generating a confident but unsupported answer when patient information is incomplete could lead to real harm. JTS-style abstention provides a concrete mechanism for safer deployment.
Broader implications:
Limitations on impact:
4. Timeliness & Relevance
This paper addresses a highly relevant problem. The concurrent release of AbstentionBench (cited as [17]) and the Missing-Premise dataset [8] demonstrates growing community interest in abstention reliability. The "overthinking" phenomenon in reasoning models has been widely observed but poorly formalized. This work arrives at a time when reasoning models (DeepSeek-R1, Qwen3, o1/o3) are being rapidly deployed, making the safety implications immediate.
The framing as a "reasoning control" problem rather than purely a classification problem aligns with emerging interest in controllable generation and inference-time compute allocation.
5. Strengths & Limitations
Key strengths:
Notable weaknesses:
Additional Observations
The paper's framing contribution may be more impactful than the method itself. The detection-to-abstention gap is a clean, measurable concept that can be adopted by the broader community regardless of whether JTS becomes the standard solution. The conditional length shaping reward is a practical contribution that could be applied to other reasoning control problems beyond abstention.
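As a rough illustration of why the gap is easy to measure, the sketch below computes Abstention@Detection under one plausible reading: among responses whose reasoning flagged missing information, the fraction whose final output abstained. This definition is an assumption inferred from the metric's name and the abstract; the paper's exact formula, and how detection and abstention are labeled, are not given here. The function name and the `(detected, abstained)` record layout are hypothetical.

```python
from typing import Iterable, Optional, Tuple


def abstention_at_detection(
    records: Iterable[Tuple[bool, bool]]
) -> Optional[float]:
    """Fraction of detected-insufficiency cases that end in abstention.

    Each record is (detected, abstained):
      detected  - the reasoning trace noted that information is missing
      abstained - the final response declined to answer
    Returns None when no case was detected, since the rate is undefined.
    """
    outcomes = [abstained for detected, abstained in records if detected]
    if not outcomes:
        return None
    return sum(outcomes) / len(outcomes)


# Five hypothetical responses to unanswerable questions.
records = [
    (True, True),    # detected and abstained
    (True, False),   # detected but answered anyway (the gap)
    (True, True),
    (False, False),  # never detected; excluded from the denominator
    (True, False),
]
a_at_d = abstention_at_detection(records)
```

On these toy records the rate is 2 of 4 detected cases, so half of the detections fail to turn into abstention. Conditioning on detection is what separates this quantity from a plain abstention rate, which would also penalize cases the model never recognized as under-specified.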
The auxiliary finding about reduced self-reflection on hard problems deserves deeper investigation but is appropriately positioned as preliminary.
Generated May 28, 2026
Comparison History (15)
Paper 1 identifies a fundamental flaw in LLM reasoning (the detection-to-abstention gap) and proposes a novel, generalizable methodological framework (Judge-Then-Solve) to address it. While Paper 2 provides high practical value in the medical domain, Paper 1's contributions have broader implications for the foundational development of safe, reliable, and efficient reasoning models across all high-stakes domains.
Paper 1 targets a broadly observed, safety-critical failure mode in reasoning LMs (recognize insufficiency yet answer anyway) and introduces a general, model-agnostic control framework (Judge-Then-Solve) with reinforcement-learning shaping that improves abstention reliability and inference efficiency. Its applications span many high-stakes domains (medical, legal, decision support) and align with timely concerns about trustworthy reasoning. Paper 2 is rigorous with useful theory, but is more domain-specific (ad auctions) and likely narrower in cross-field impact, making Paper 1’s potential scientific impact higher.
CaMBRAIN introduces a fundamentally new architecture paradigm for EEG processing—the first causal SSM for real-time continuous EEG inference—addressing critical scalability limitations of attention-based models. It achieves SOTA across 3 datasets with >10x throughput gains, enabling practical real-time clinical monitoring of variable-length signals. This has broad impact across neuroscience, clinical medicine, and BCI applications. Paper 1, while addressing an important safety concern (detection-to-abstention gap), is more incremental in scope, primarily refining LLM reasoning behavior. CaMBRAIN's architectural innovation and direct clinical applicability give it broader and more transformative potential impact.
Paper 2 likely has higher impact: it identifies a robust, previously under-measured failure mode (in-group bias via targeting rather than action type) across multiple model families, with clear statistical evidence and strong implications for deploying LLM agents in multi-agent, persistent settings. Its findings generalize across architectures and connect to social psychology, fairness, auditing, and AI governance—broad cross-field relevance and timeliness as agentic systems proliferate. Paper 1 is valuable for safety/reliability in reasoning models, but its contribution is more narrowly scoped to abstention control and may be closer to incremental training/control refinements.
Paper 2 introduces a fundamentally novel insight—that reasoning models inherently function as context compressors—which reframes two active research areas (chain-of-thought reasoning and context compression) under a unified lens. This conceptual bridge has broader impact across NLP, enabling practical inference acceleration without specialized modules. The strong empirical gains (17-23% improvements) and the paradigm's simplicity increase adoption potential. Paper 1 addresses an important but narrower problem (abstention under insufficient information) with a well-engineered but more incremental solution, limiting its cross-field impact compared to Paper 2's broader theoretical contribution.
Paper 2 addresses a fundamental and critical issue in large reasoning models—knowing when to abstain under insufficient information. This has broad, sweeping implications for AI safety, reliability, and efficiency across numerous high-stakes domains, including medicine. Paper 1 presents a solid methodological improvement for a specific clinical task (IBD prediction from ICD codes), but its scope and potential impact across different fields are much narrower compared to the foundational AI safety advancements proposed in Paper 2.
Paper 2 addresses a critical safety and reliability issue in large reasoning models: knowing when to abstain. Given the widespread and rapid deployment of LLMs in high-stakes domains like medicine, solving the detection-to-abstention gap offers broader real-world applications, higher timeliness, and cross-disciplinary impact compared to the narrower control theory advancements presented in Paper 1.
Paper 2 likely has higher scientific impact due to a broadly applicable and timely problem—reliable abstention under insufficient information—central to safe deployment of reasoning models across high-stakes domains. It introduces a clear failure mode (detection-to-abstention gap), formalizes a measurable objective (A@D), and proposes a general training/control framework (Judge-Then-Solve) with supervised + RL components validated on multiple model classes. The contribution is not tied to a specific modality or benchmark, offering wider cross-field relevance and practical safety/efficiency implications than Paper 1’s more specialized audio-visual multi-hop setting.
Paper 2 addresses a specific, well-defined failure mode in reasoning models with a concrete, testable solution (JTS framework) and demonstrates empirical results. It has immediate practical implications for AI safety in high-risk domains like medical AI. While Paper 1 tackles the important topic of AGI measurement, it is more of a conceptual/taxonomic framework without demonstrated empirical validation. Paper 2's methodological rigor, actionable contributions (reinforcement learning approach, measurable metrics like A@D), and direct relevance to current deployed systems give it higher near-term scientific impact and citability.
Paper 2 addresses a critical safety and reliability flaw in LLM reasoning, with broad applicability across high-stakes domains like healthcare and law. Its Judge-Then-Solve framework improves both safety and inference efficiency for general AI systems. While Paper 1 introduces a highly rigorous and valuable benchmark for industrial Text-to-CAD, Paper 2's focus on fundamental AI reasoning control offers significantly wider multidisciplinary impact and timely relevance to the broader AI community.
Paper 1 introduces a foundational benchmark for a rapidly growing field (interactive multimodal agents). By exposing a severe performance ceiling (19.43% average accuracy) in current state-of-the-art models, EgoBench is highly likely to become a standard testbed that drives future research in agentic AI, robotics, and tool use. While Paper 2 addresses an important safety alignment problem (abstention), benchmarks that define capability bottlenecks typically have a broader and longer-lasting scientific impact across the community.
Paper 1 introduces a foundational programming paradigm for LLM agents by bridging programming language concepts (type-checking, compiler diagnostics) with agentic execution. This system-level approach to agent safety and control flow has a broader potential impact on how future AI agent frameworks are built compared to Paper 2, which offers a valuable but narrower alignment technique for improving model abstention.
Paper 2 has higher likely impact due to strong timeliness and real-world relevance (safe deployment, medical/high-stakes settings), a clearly defined and broadly applicable failure mode (detection-to-abstention gap), and a general control framework (Judge-Then-Solve) that can transfer across model families and tasks. Its methodological contribution (trajectory-level control with RL objectives and efficiency gains) targets a central problem in reasoning LLM reliability. Paper 1 is solid and technical for dialogue RL, but its impact is narrower (multi-turn dialogue + simulator alignment) and more contingent on simulator quality and deployment setting.
Paper 1 targets a broadly observed, safety-critical failure mode in reasoning LMs—detecting missing information yet still answering—and formalizes it (detection-to-abstention gap) with a general control framework (Judge-Then-Solve) applicable beyond medicine. It offers a clear, widely transferable mechanism (explicit answerability commitment + RL shaping) with efficiency benefits and likely relevance across many deployments of reasoning models. Paper 2 is timely and rigorous but more domain- and setting-specific (medical tool ensembles/selection), making its cross-field breadth and generality somewhat narrower.
Paper 1 addresses a critical safety and reliability flaw in large reasoning models, which has broad implications across high-stakes fields like medicine. Its focus on LLM reasoning control offers higher timeliness, broader cross-disciplinary impact, and wider real-world applicability compared to the specialized neural combinatorial optimization focus of Paper 2.