Better Accuracies, Worse Reasoning: A Step-Level Audit of Medical Chain-of-Thought Distillation
Zhaoyang Jiang, Xuanqi Peng, Fei Teng, Zhizhong Fu, Yunsoo Kim, Jiacong Mi, Zicheng Li, Honghan Wu
Abstract
Chain-of-thought (CoT) distillation trains a smaller model to imitate a teacher's reasoning trace, but it is typically evaluated by final-answer metrics including accuracy. We ask whether gains in answer quality are accompanied by improvements in the trace. In medical QA, where short answer options can leave a richer clinical justification under-specified, a Qwen3-8B student distilled from a DeepSeek-V3-family teacher improves on MedQA-USMLE answer metrics (SC@64 74.7% to 84.4%; expected calibration error (ECE) 0.096 to 0.034). Yet under a Kimi-K2.6 style-blind LLM-judge audit, its error rate over non-abstained steps rises from 30.6% to 50.3%. In this primary medical setting, answer quality and trace factuality move in opposite directions. This before-after pattern persists across evaluators, teacher strengths, student scales and families, medical benchmarks, and style, segmentation, and answer-correctness controls. A 150-step blinded audit by a clinical expert reproduces the same ordering. Boundary checks narrow the scope of the claim: the risk appears when a compact answer under-constrains the rationale and a capable student can imitate expert-like form without reliably grounding each local claim. Standard answer metrics and aggregate hedging rates do not reveal the shift. When such traces are released or reused, answer-level metrics alone are insufficient.
AI Impact Assessments
Scientific Impact Assessment (1 model)
Core Contribution
This paper identifies and rigorously documents a previously uncharacterized failure mode in chain-of-thought (CoT) distillation: answer-level metrics (accuracy, calibration) can improve while the factual reliability of intermediate reasoning steps simultaneously degrades. The authors term this the "accuracy–reasoning split." Using a Qwen3-8B student distilled from a DeepSeek-V3-family teacher on MedQA-USMLE, they show SC@64 accuracy rising from 74.7% to 84.4% while step-level error rates (judged by Kimi-K2.6) increase from 30.6% to 50.3%. The central insight is that in domains where compact answer labels under-constrain rich rationales—medical diagnosis being the paradigmatic case—distillation teaches students to produce expert-like discourse form without reliably grounding each local factual claim.
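The step-level metric quoted above is an error rate over non-abstained steps. A minimal Python sketch of that kind of computation follows, assuming each judged step carries one of three labels. The label names ("correct", "error", "abstain"), the function name, and the toy data are all hypothetical and are not taken from the paper.

```python
from collections import Counter


def step_error_rate(labels):
    """Share of judged steps labeled 'error', ignoring abstentions.

    `labels` is a list of per-step judge verdicts. The three label
    strings are an assumption made for this sketch.
    """
    counts = Counter(labels)
    judged = counts["correct"] + counts["error"]  # abstained steps are excluded
    if judged == 0:
        raise ValueError("no non-abstained steps to score")
    return counts["error"] / judged


# Toy verdicts for the same questions before and after distillation.
base_labels = ["correct", "correct", "error", "abstain", "correct"]
distilled_labels = ["error", "correct", "error", "abstain", "correct"]

base_rate = step_error_rate(base_labels)            # 1 error out of 4 judged steps
distilled_rate = step_error_rate(distilled_labels)  # 2 errors out of 4 judged steps
gap_pp = 100 * (distilled_rate - base_rate)         # gap in percentage points

print(f"base={base_rate:.3f} distilled={distilled_rate:.3f} gap={gap_pp:+.1f}pp")
```

Excluding abstentions from the denominator means a judge that abstains more often on one model does not mechanically lower that model's error rate, which is presumably why the metric is stated that way.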
This contribution addresses a genuine blind spot in how distilled reasoning models are evaluated and deployed, particularly relevant as CoT traces are increasingly treated as readable justifications, safety monitoring surfaces, and recursive training data.
Methodological Rigor
The experimental design is exceptionally thorough in its controls and robustness checks. The authors systematically address nearly every plausible confound:
Style confounds: A style-blind judge prompt instructs evaluation of factual content only, ignoring tone and confidence markers. A tone-swap control (rewriting base steps as prose and distilled steps as outlines) preserves the +18.6pp gap. Alternative segmentation strategies (sentence chunks, fixed word windows) maintain the direction.
Judge dependence: All three LLM judges in the panel (Kimi-K2.6, GLM-4-32B, Hunyuan-A13B) show the same directional effect, with effect sizes ranging from +7.7 to +19.8pp. A 150-step blinded clinical expert audit reproduces the ordering. Inter-judge agreement is moderate (κ=0.434), which the authors appropriately acknowledge.
Answer-key visibility: An answer-blind judge variant preserves the overall gap (+14.2pp), addressing concerns that the gold answer biases step-level judgment.
Conditioning on correctness: Even restricting to questions where both models answer correctly, the step-error gap persists (+12.1pp, p=1.9×10⁻³⁰).
Boundary diagnostics: The authors carefully delineate where the effect does and doesn't appear. GSM8K and ARC-Challenge serve as negative controls, and Mistral-7B (a weaker base) reverses the pattern—SFT helps both answers and traces when the student lacks baseline competence.
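For readers unfamiliar with the κ statistic cited under judge dependence, the sketch below computes pairwise Cohen's kappa between two judges in plain Python. This is an illustration only: the assessment does not say whether the reported κ is pairwise Cohen's kappa, an average over pairs, or a multi-rater statistic such as Fleiss' kappa, and the judge labels here are invented toy data.

```python
def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two raters labeling the same items.

    kappa = (p_o - p_e) / (1 - p_e), where p_o is observed agreement and
    p_e is the agreement expected from each rater's marginal label rates.
    """
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two equal-length, non-empty label lists")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    categories = set(labels_a) | set(labels_b)
    expected = sum(
        (labels_a.count(c) / n) * (labels_b.count(c) / n) for c in categories
    )
    if expected == 1.0:
        # Both raters used a single identical label; kappa is undefined,
        # so this sketch returns 1.0 by convention.
        return 1.0
    return (observed - expected) / (1 - expected)


# Toy step verdicts: 1 = step judged erroneous, 0 = step judged correct.
judge_a = [1, 1, 0, 0, 1, 0, 1, 0]
judge_b = [1, 0, 0, 0, 1, 1, 1, 0]

kappa = cohen_kappa(judge_a, judge_b)
print(f"kappa={kappa:.3f}")  # observed 0.75, expected 0.50, so kappa is 0.500
```

Kappa discounts the agreement two judges would reach by chance given how often each uses each label, so it can be moderate even when raw agreement looks high.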
The mechanistic investigations (answer-first rationalization check, hedge-direction localization, internal probes) add depth without overreaching. The answer-commitment trajectory analysis convincingly rules out simple post-hoc rationalization. The authors are appropriately cautious about their probe results (AUROC ≈0.82), noting position confounds.
A notable weakness is the reliance on LLM judges for the primary step-level labels. While the 150-step human audit is valuable, it is small relative to the thousands of judged steps. The absolute error rates should be interpreted cautiously, though the paired before-after comparison is more robust. The paper acknowledges this limitation explicitly.
Potential Impact
Immediate practical implications: This work has direct consequences for how medical AI systems are evaluated and deployed. Organizations releasing distilled medical reasoning models—or using CoT traces for oversight, training data curation, or clinical decision support—should not rely solely on answer-level metrics.
Training data contamination: Perhaps the sharpest implication is for recursive distillation pipelines where one model's traces become another's supervision. Even answer-filtered traces carry substantially noisier rationales, creating a potential degradation spiral.
Process supervision research: The finding challenges assumptions in the process-reward model (PRM) literature, which treats steps as meaningful objects for supervision. If distillation degrades step quality while improving answers, naively generated step-level labels from distilled models could poison PRM training.
AI safety and monitoring: The paper connects directly to recent work on CoT monitoring as a safety mechanism (Korbak et al., 2025; Baker et al., 2025). If distillation can produce fluent but factually degraded traces, monitoring schemes that rely on trace quality become less reliable precisely when they look most successful by answer metrics.
Broader ML methodology: The concept of "process-answer coupling"—how strongly answer supervision constrains rationale quality—could influence evaluation practices beyond medicine in any domain where justifications matter (legal reasoning, scientific analysis, financial risk assessment).
Timeliness & Relevance
This paper is exceptionally timely. CoT distillation is a widely used approach for creating compact reasoning models (DeepSeek-R1 distillation, Qwen3, etc.), and medical AI deployment is accelerating. The gap between how these models are evaluated (answer metrics) and how they are consumed (traces read by clinicians) represents a pressing safety concern. The recent focus on CoT monitoring for AI safety (the 2025 preprints cited in the paper) makes this work immediately relevant to ongoing policy and technical discussions.
Strengths
1. Comprehensive experimental coverage: The sheer number of controlled comparisons—across judges, student scales/families, teacher strengths, medical benchmarks, segmentation rules, style controls—makes the core finding very robust.
2. Precise boundary delineation: Rather than overclaiming universality, the paper carefully identifies the regime where the split occurs (capable student, compact answer, rich rationale domain) and where it doesn't.
3. Mechanistic depth: The answer-gain decomposition (Section 5) elegantly explains *how* accuracy improves via vote-boundary redistribution while traces degrade, connecting the two metrics conceptually.
4. Intellectual honesty: The limitations section is substantive, and boundary diagnostics are presented with the same rigor as the positive findings.
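The vote-boundary point in strength 3 can be illustrated with a toy majority-vote calculation. The sketch below assumes SC@k means self-consistency accuracy, that is, the accuracy of the majority answer over k sampled generations per question. It is not the paper's Section 5 decomposition. The function names are hypothetical, and the data uses k=5 rather than 64 for readability.

```python
from collections import Counter


def majority_answer(samples):
    """Most frequent sampled answer (the examples below contain no ties)."""
    return Counter(samples).most_common(1)[0][0]


def sc_accuracy(questions):
    """Share of questions whose majority-vote answer matches the gold label."""
    hits = sum(majority_answer(samples) == gold for samples, gold in questions)
    return hits / len(questions)


def sample_accuracy(questions):
    """Share of individual sampled answers that match the gold label."""
    correct = sum(s == gold for samples, gold in questions for s in samples)
    total = sum(len(samples) for samples, _ in questions)
    return correct / total


# Each entry is (sampled answers, gold answer). Both sets contain 7 correct
# votes out of 10; only their distribution across questions differs.
before = [
    (["A", "A", "A", "A", "A"], "A"),  # 5/5 correct, majority correct
    (["B", "B", "C", "C", "C"], "B"),  # 2/5 correct, majority wrong
]
after = [
    (["A", "A", "A", "A", "B"], "A"),  # 4/5 correct, majority still correct
    (["B", "B", "B", "C", "C"], "B"),  # 3/5 correct, majority now correct
]

print(sample_accuracy(before), sc_accuracy(before))
print(sample_accuracy(after), sc_accuracy(after))
```

In this toy case the per-sample accuracy is identical before and after, yet majority-vote accuracy doubles because one vote moved from a question with a comfortable margin to a question sitting just below the majority threshold. Majority-vote accuracy therefore depends on where correct votes fall, not only on how many there are.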
Limitations
1. LLM judge dependence: Despite extensive controls, the primary signal comes from LLM judges whose absolute calibration is uncertain. The 150-step human audit is too small for subgroup analyses.
2. Single teacher family: The DeepSeek-V3 teacher family is the primary source; varying teacher architecture more broadly would strengthen the claim.
3. No mitigation proposed: The paper diagnoses but does not solve the problem. While this is acknowledged, even preliminary mitigation experiments would increase practical value.
4. Multiple-choice limitation: All medical benchmarks are MCQ-based. The finding may differ with open-ended clinical generation, where answer-level evaluation is itself richer.
5. Causal mechanism remains incomplete: While two deflationary explanations are ruled out, the paper does not fully explain *why* distillation degrades traces at the training-dynamics level.
Overall Assessment
This is a carefully executed diagnostic study that identifies a consequential failure mode at the intersection of knowledge distillation, medical AI safety, and LLM evaluation methodology. The finding that standard metrics systematically miss trace degradation is important for the field. The work is notable for its thoroughness of controls rather than methodological novelty per se—the experimental design is the contribution. It should influence evaluation practices for distilled reasoning models, particularly in high-stakes domains.
Generated May 28, 2026
Comparison History (16)
Paper 2 exposes a critical flaw in current LLM evaluation paradigms by revealing that improved final-answer accuracy in CoT distillation can mask degrading reasoning quality, especially in high-stakes domains like medicine. This finding challenges standard metrics and has broad implications for AI safety, evaluation methodology, and clinical applications. While Paper 1 presents a useful efficiency improvement for reasoning tasks, Paper 2's fundamental critique of how reasoning traces are assessed is likely to drive more significant shifts in how the field evaluates and deploys reasoning models.
Paper 1 exposes a critical flaw in Chain-of-Thought distillation, revealing that improved accuracy can mask degrading reasoning quality. This challenges fundamental assumptions in LLM evaluation and AI safety, extending its impact far beyond the medical domain. While Paper 2 offers an excellent domain-specific tool for drug discovery, Paper 1 addresses a core, highly timely methodological issue in foundation model training. Its exposure of the accuracy versus reasoning divergence will likely force a paradigm shift in how the broader AI community evaluates and distills reasoning models, giving it higher widespread scientific impact.
Paper 1 addresses a critical and underexplored failure mode of chain-of-thought distillation: that improved answer accuracy can mask degraded reasoning quality. This finding has broad implications for AI safety, trustworthiness, and deployment in high-stakes domains like medicine. The rigorous multi-dimensional evaluation (multiple evaluators, scales, benchmarks, clinical expert validation) and the counterintuitive finding that accuracy and reasoning quality diverge make it highly impactful. Paper 2, while solid, offers a more incremental contribution to retrieval-augmented reasoning with a planning mechanism. Paper 1's warning about relying on answer-level metrics alone is more broadly consequential.
Paper 2 addresses a fundamental and timely problem in AI safety for healthcare: chain-of-thought distillation can improve answer accuracy while degrading reasoning quality. This has broad implications for deploying LLMs in high-stakes domains, challenges widely-used evaluation practices, and could reshape how the community evaluates distilled models. The finding that accuracy and reasoning faithfulness can diverge is a critical insight with immediate practical consequences. Paper 1, while technically sound, represents an incremental improvement to adversarial attack efficiency rather than revealing a fundamentally new problem, limiting its broader impact.
Paper 2 has higher impact potential because it identifies a timely, broadly relevant failure mode in chain-of-thought distillation: improved answer metrics can coincide with degraded step-level factuality. The work is methodologically rigorous (multiple evaluators, controls, model families/scales, benchmarks, and clinician audit) and has immediate real-world implications for medical QA safety, evaluation, and deployment norms across LLM applications. Paper 1 is a solid incremental optimization framework within multimodal sentiment analysis, but its scope and cross-field implications are narrower.
Paper 2 reveals a fundamental and broadly consequential flaw in chain-of-thought distillation: answer accuracy can improve while reasoning quality degrades. This finding challenges core assumptions in the rapidly growing field of knowledge distillation and LLM reasoning, with critical implications for safety-sensitive domains like medicine. Its methodological rigor (multiple evaluators, clinical expert validation, extensive controls) and generalizable insight make it likely to influence evaluation practices across the entire LLM community. Paper 1, while practically valuable as an engineering contribution at Uber, is more narrowly scoped to enterprise AI security and MCP-based agent monitoring.
Paper 2 is likely higher impact: it introduces a principled latent-space framing (refusal suppression as probe evasion), yielding a generalizable methodology and a stronger attack (controlled evasion past the boundary) validated across 15 diverse models with state-of-the-art results. Its real-world relevance is immediate for AI safety, red-teaming, and defense design, with cross-field impact spanning interpretability, adversarial ML, and alignment. Paper 1 is valuable and timely but is more domain-specific (medical QA/CoT auditing) and primarily diagnostic rather than offering broadly applicable new mechanisms or tools.
Paper 2 exposes a critical flaw in LLM evaluation, demonstrating that standard accuracy metrics in CoT distillation mask deteriorating reasoning quality in high-stakes domains. This challenges fundamental assumptions in AI safety and evaluation, offering broad implications across the entire deep learning community. While Paper 1 presents an innovative and highly efficient SSM approach for EEG monitoring, Paper 2's findings address a pressing, widespread issue in the rapidly expanding field of LLM reasoning, likely triggering a broader paradigm shift in model evaluation.
Paper 1 addresses a fundamental and timely problem in AI safety and evaluation: that chain-of-thought distillation can improve answer accuracy while degrading reasoning quality. This finding has broad implications across all domains using CoT distillation, especially safety-critical fields like medicine. The methodological rigor (multiple evaluators, clinical expert validation, boundary checks, extensive controls) and the counterintuitive nature of the finding make it highly impactful. Paper 2, while technically solid, is more incremental—combining existing techniques (edge-cloud splitting, curriculum learning) for a specific engineering problem with narrower conceptual impact.
Paper 1 is more likely to have higher impact: it proposes a novel, actionable LLM-agent framework grounded in simulated patient dynamics with a clear training curriculum (SFT + behavior cloning + RL) and demonstrates improved safety/value on real ICU trajectories, making it closer to deployable clinical decision support and relevant to broader agentic/world-model research. Paper 2 is timely and rigorous as an auditing study, but it is primarily diagnostic (highlighting a failure mode) rather than offering a new method that enables new capabilities, so its downstream practical impact may be narrower.
Paper 1 offers higher scientific impact by exposing a critical, counter-intuitive flaw in modern LLM training: CoT distillation improves final accuracy but degrades actual reasoning quality. This empirical discovery fundamentally challenges current evaluation paradigms for LLMs, especially in high-stakes domains like medicine. Its rigorous methodology, including clinical expert audits, ensures robustness. In contrast, Paper 2 proposes a valuable but primarily administrative governance framework for AI deployment. Paper 1's findings will likely drive immediate, foundational shifts in how researchers evaluate, train, and distill reasoning models across the broader AI community.
Paper 2 offers a more fundamental mechanistic insight into why chain-of-thought works, revealing that local co-occurrence rather than logical derivation drives much of the gain. This challenges core assumptions about CoT reasoning and has broad implications across all CoT applications and model families. Paper 1, while important for medical AI safety, addresses a narrower domain-specific concern about distillation quality. Paper 2's findings could reshape how the field thinks about prompting, reasoning evaluation, and model interpretability, giving it broader and more transformative potential impact.
Paper 2 likely has higher scientific impact because it identifies a safety-critical, counterintuitive failure mode: chain-of-thought distillation can improve medical QA accuracy and calibration while degrading step-level factual correctness of the rationale, validated across models, benchmarks, evaluators, controls, and a clinician audit. This challenges common evaluation practice and has immediate implications for deployment, auditing standards, and policy around releasing/reusing rationales. Paper 1 is a useful engineering framework for multi-agent RL optimization, but its impact is narrower and more incremental relative to rapidly evolving LLM training/tooling ecosystems.
Paper 2 exposes a critical vulnerability in Chain-of-Thought distillation evaluation, particularly in high-stakes medical domains. By demonstrating that improved final-answer accuracy can mask severely degraded reasoning quality, it challenges standard AI evaluation paradigms. This has profound implications for AI safety, interpretability, and the deployment of LLMs in healthcare. While Paper 1 offers a strong technical method for reasoning calibration, Paper 2's findings have broader real-world relevance, wider applicability across domains, and a more urgent impact on how we assess LLM reliability.
Paper 2 identifies a critical and counterintuitive flaw in Chain-of-Thought distillation—that final answer accuracy can improve while reasoning step factuality degrades. This challenges fundamental assumptions in LLM evaluation and distillation methodologies, especially in high-stakes domains like medicine. While Paper 1 introduces a valuable benchmark for agent personalization, Paper 2's findings have broader theoretical and safety implications for how the field assesses and trusts LLM reasoning traces, giving it higher potential scientific impact.
Paper 1 addresses a fundamental and timely problem in AI safety and reliability: showing that chain-of-thought distillation can improve answer accuracy while degrading reasoning quality. This 'opposite directions' finding has broad implications for the rapidly growing field of model distillation and deployment, especially in high-stakes medical domains. The rigorous multi-dimensional evaluation (multiple models, benchmarks, clinical expert validation) and the actionable warning about relying solely on answer-level metrics make it highly impactful. Paper 2 contributes a useful benchmark for travel planning agents but addresses a narrower application domain with more incremental advancement.