Detecting Unfaithful Chain-of-Thought via Circuit-Guided Internal-External Discrepancy
Xu Shen, Zhen Tan, Song Wang, Pingjun Hong, Rui Miao, Xin Wang, Tianlong Chen
Abstract
Chain-of-thought (CoT) reasoning improves the problem-solving ability of large language models (LLMs), but generated reasoning traces may not faithfully reflect the model's actual decision process. Existing CoT unfaithfulness detectors mainly rely on external signals from generated rationales, such as textual plausibility or answer consistency, while overlooking evidence from the model's internal computation. Although recent circuit tracing methods provide a way to obtain model-internal evidence by tracing how information flows through model components during reasoning, constructing full reasoning circuits for long CoTs is costly and difficult to scale. To address these challenges, we propose the Circuit-guided Internal-External Discrepancy Scorer (CIE-Scorer), a framework for instance-level CoT unfaithfulness detection. The key idea is that faithful reasoning traces should align with the model's computational process, whereas unfaithful traces may diverge from it. CIE-Scorer efficiently traces compact sentence-level circuits from informative reasoning tokens, constructs internal and external reasoning graphs, and measures their discrepancy using the Fused Gromov-Wasserstein distance. Experiments on four datasets from FaithCoT-Bench show that CIE-Scorer achieves state-of-the-art performance while reducing the cost of circuit construction, demonstrating the effectiveness of combining mechanistic interpretability signals with external reasoning traces for CoT unfaithfulness detection.
AI Impact Assessments
Scientific Impact Assessment: Detecting Unfaithful Chain-of-Thought via Circuit-Guided Internal-External Discrepancy
1. Core Contribution
This paper introduces CIE-Scorer, a framework for detecting unfaithful chain-of-thought (CoT) reasoning in LLMs by comparing the model's internal computational process (obtained via circuit tracing) with its externally displayed reasoning trace. The key insight is that faithful reasoning should show alignment between internal circuits and external text, while unfaithful reasoning should exhibit discrepancy. The framework operates through three stages: (1) selecting informative tokens via entropy and counterfactual necessity scores, (2) constructing compact sentence-level circuits and external reasoning graphs, and (3) measuring their discrepancy using Fused Gromov-Wasserstein (FGW) distance.
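To make stage (1) concrete, here is a minimal sketch of entropy-based token selection. It is an illustration under stated assumptions, not the paper's implementation: the function names, the toy distributions, and the top-k rule are all hypothetical, and the counterfactual necessity score is omitted because it requires access to the model.

```python
import math

def token_entropy(probs):
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def select_informative_tokens(distributions, k):
    """Return positions of the k highest-entropy tokens, in original order.

    Assumption: "informative" is approximated by high predictive entropy
    alone; the paper also uses a counterfactual necessity score.
    """
    scored = [(token_entropy(p), i) for i, p in enumerate(distributions)]
    # Sort by descending entropy, breaking ties by earlier position.
    top = sorted(scored, key=lambda s: (-s[0], s[1]))[:k]
    return sorted(i for _, i in top)

# Toy next-token distributions over a 4-token vocabulary (hypothetical values).
dists = [
    [0.97, 0.01, 0.01, 0.01],  # confident prediction, low entropy
    [0.25, 0.25, 0.25, 0.25],  # uniform, maximum entropy
    [0.70, 0.10, 0.10, 0.10],
    [0.40, 0.40, 0.10, 0.10],
]
selected = select_informative_tokens(dists, k=2)
print(selected)
```

In this toy case the uniform distribution and the two-way split are selected, which matches the intuition that circuits are traced only from positions where the model's prediction is least determined.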
The novelty lies in the explicit bridging of mechanistic interpretability (circuit tracing) with faithfulness evaluation — a connection that has been conceptually motivated but not previously operationalized at this level. The use of FGW distance to jointly measure node-feature and structural discrepancy between internal and external reasoning graphs is a technically interesting formulation.
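The joint node-feature and structural comparison can be illustrated by evaluating the FGW objective at a fixed coupling. This is a sketch under assumptions: the `alpha` weighting convention, the squared structural loss, and the toy graphs are illustrative choices rather than the paper's settings. The actual FGW distance is the minimum of this objective over all valid couplings, which requires an optimal transport solver that is not shown here.

```python
def fgw_objective(M, C1, C2, T, alpha=0.5):
    """FGW cost of a fixed coupling T between two graphs.

    M[i][j] : feature distance between node i (graph 1) and node j (graph 2)
    C1, C2  : intra-graph structure matrices (e.g. shortest-path distances)
    T[i][j] : mass transported from node i to node j
    alpha   : trade-off (assumed convention): 0 = features only, 1 = structure only
    """
    n, m = len(C1), len(C2)
    # Linear (Wasserstein) term: cost of matching node features.
    feature = sum(M[i][j] * T[i][j] for i in range(n) for j in range(m))
    # Quadratic (Gromov-Wasserstein) term: distortion of pairwise structure.
    structure = sum(
        (C1[i][k] - C2[j][l]) ** 2 * T[i][j] * T[k][l]
        for i in range(n) for j in range(m)
        for k in range(n) for l in range(m)
    )
    return (1 - alpha) * feature + alpha * structure

# Two-node toy graphs with uniform node weights and an identity coupling.
T = [[0.5, 0.0], [0.0, 0.5]]
C_internal = [[0.0, 1.0], [1.0, 0.0]]

# Case 1: external graph has the same structure and matching features.
aligned = fgw_objective([[0.0, 1.0], [1.0, 0.0]], C_internal,
                        [[0.0, 1.0], [1.0, 0.0]], T)

# Case 2: external graph has stretched structure and dissimilar features.
mismatched = fgw_objective([[1.0, 1.0], [1.0, 1.0]], C_internal,
                           [[0.0, 3.0], [3.0, 0.0]], T)
print(aligned, mismatched)
```

The aligned pair costs nothing, while the mismatched pair is penalized through both terms. That is the behavior a discrepancy score between internal and external reasoning graphs needs.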
2. Methodological Rigor
Strengths in methodology:
Methodological concerns:
3. Potential Impact
This work addresses a genuinely important problem: as CoT reasoning becomes central to LLM deployment, verifying that reasoning traces actually reflect model computation (rather than being post-hoc rationalizations) is critical for AI safety and trustworthiness. The framework could influence:
However, the practical impact is currently limited by the white-box requirement (acknowledged in limitations) and reliance on transcoder-based circuit tracing infrastructure that exists only for select models.
4. Timeliness & Relevance
This paper is highly timely. The explosion of reasoning-focused LLMs (o1, o3, DeepSeek-R1, Gemini 2.5 Pro) has made CoT faithfulness a pressing concern. Simultaneously, Anthropic's circuit tracing work (cited as [2]) has recently made mechanistic interpretability more accessible. CIE-Scorer sits at this intersection, making it relevant to both the interpretability and safety communities.
The work also arrives at a moment when there's growing skepticism about CoT as explanation (multiple 2024-2025 papers questioning CoT faithfulness), making detection tools particularly needed.
5. Strengths & Limitations
Key Strengths:
Notable Weaknesses:
Additional Observations
The paper is generally well written and clearly structured. The formalization of the problem and the pipeline components are presented with sufficient detail for reproduction. Code availability strengthens reproducibility. The FGW distance, while technically sophisticated, adds complexity that may limit adoption compared to simpler approaches. The coupling matrix visualization (Figure 6) provides intuitive evidence but shows only single examples.
Generated May 26, 2026
Comparison History (24)
Paper 2 likely has higher scientific impact due to greater novelty and breadth: it bridges mechanistic interpretability (circuit tracing) with practical CoT faithfulness detection via a scalable, instance-level internal–external discrepancy metric (FGW distance). This connects to pressing concerns about LLM reliability and evaluation, with applications across domains using CoT. Paper 1 is strong and useful (span-level input ambiguity attribution via Shapley values) but is a more incremental extension of established attribution concepts and is narrower in scope (input ambiguity/UQ) compared to broad relevance of CoT faithfulness and interpretability-informed auditing.
Paper 1 proposes a foundational rethinking of data management for AI agent memory, formalizing a new workload (GEM) with correctness conditions and proving limitations of existing paradigms. This has broader impact across databases, AI agents, and systems research, defining a new research area. Paper 2 makes a solid but more incremental contribution to CoT faithfulness detection using circuit tracing. While technically interesting, it addresses a narrower problem. Paper 1's vision-setting nature, formal foundations, and identification of a new data-management workload class give it higher potential for long-term cross-disciplinary impact as AI agents become prevalent.
Paper 2 introduces a fundamentally new evaluation paradigm (AgingBench) addressing an overlooked but critical problem—long-term reliability degradation of deployed AI agents. This opens an entirely new research direction (agent lifespan engineering) with broad practical implications for real-world AI deployment. Its taxonomy of aging mechanisms and diagnostic framework are highly novel and timely as persistent AI agents proliferate. Paper 1, while technically solid and combining mechanistic interpretability with CoT faithfulness detection in a novel way, addresses a narrower problem within an already active research area, limiting its breadth of impact.
Paper 2 addresses a critical bottleneck in LLM trustworthiness—Chain-of-Thought faithfulness—by bridging mechanistic interpretability with external outputs. Its computationally efficient circuit-tracing approach using Fused Gromov-Wasserstein distance offers high methodological rigor. While Paper 1 introduces a valuable multimodal benchmark, Paper 2's foundational contribution to AI safety and alignment has broader theoretical implications and higher potential impact across the widespread deployment of reasoning models.
Paper 1 addresses a fundamental issue in foundational LLMs (CoT unfaithfulness) by innovatively combining mechanistic interpretability with external traces. Its methodological novelty and broad applicability to AI safety and alignment across all fields give it a much higher potential scientific impact than Paper 2, which primarily applies standard fine-tuning techniques to create a domain-specific model for Geology.
Paper 2 likely has higher scientific impact due to stronger real-world applicability and broader relevance: it addresses deployment-critical efficiency for large multimodal VLMs while explicitly preserving CoT reasoning, a central capability in current systems. Its contributions (pivot-token awareness, cross-modal activation considerations, global budgeted structured pruning) are timely and useful across many VLM deployments and tasks. Paper 1 is novel in combining mechanistic interpretability with unfaithfulness detection, but its immediate applications are narrower and depend on circuit-tracing assumptions and benchmarks.
Paper 1 is more likely to have higher near-term scientific impact: it introduces a concrete, novel method combining mechanistic interpretability (compact circuit tracing) with external rationale signals and a principled discrepancy metric (Fused Gromov–Wasserstein), and reports SOTA results on multiple benchmarks with efficiency gains—suggesting methodological rigor and immediate applicability to LLM reliability/safety. Paper 2 is a valuable conceptual synthesis, potentially broad in philosophical influence, but is less methodologically testable and offers fewer actionable, evaluable contributions, making impact less certain and typically slower to materialize.
Paper 2 addresses a critical, timely issue in AI governance by empirically auditing over 2 million models to reveal the systemic decay of ethical constraints across open-weight AI lineages. Its formalization of the 'governance horizon' and actionable policy insights give it profound, cross-disciplinary implications for AI regulation, policy, and safety, offering broader real-world impact compared to the technical improvements in LLM interpretability presented in Paper 1.
Paper 1 is more scientifically novel and broadly impactful: it introduces a new instance-level detector combining mechanistic interpretability (compact circuit tracing) with external rationale structure via a principled discrepancy metric (Fused Gromov–Wasserstein). It targets a timely, central reliability issue in LLMs and is evaluated on multiple benchmarks with clear methodological contributions. Paper 2 is highly useful engineering for geospatial scalability, but appears more application/workflow-focused with narrower cross-field impact and less methodological novelty (agentic translation + tooling adaptations) than Paper 1.
Paper 1 introduces a novel framework (CIE-Scorer) that bridges mechanistic interpretability with CoT faithfulness detection—a fundamental problem in LLM trustworthiness. Its combination of circuit tracing with graph-based discrepancy measurement (Fused Gromov-Wasserstein distance) is methodologically innovative and addresses scalability challenges. Paper 2 contributes a useful benchmark for coding agents but is more incremental, as multi-turn evaluation is a natural extension of existing benchmarks. Paper 1's approach has broader impact potential across interpretability, alignment, and safety research, whereas Paper 2's impact is more narrowly scoped to coding agent evaluation.
Paper 1 addresses a critical and timely problem in AI safety/alignment—detecting unfaithful chain-of-thought reasoning in LLMs—with a novel approach combining mechanistic interpretability (circuit tracing) with external reasoning signals using Fused Gromov-Wasserstein distance. This is highly innovative, methodologically rigorous, and relevant to the rapidly growing field of LLM trustworthiness. Paper 2 presents a systematic but relatively incremental feature-based analysis for speech-based mental health assessment using established methods (XGBoost, SHAP, LIME) and known acoustic features. While clinically relevant, it lacks the novelty and broad impact potential of Paper 1.
Paper 2 addresses a fundamental bottleneck in AI safety and alignment—the faithfulness of LLM reasoning. By combining mechanistic interpretability with external traces, it offers a scalable, rigorous approach to understanding model internals. While Paper 1 is highly timely and practically useful for the emerging field of Generative Engine Optimization (GEO), Paper 2 has deeper theoretical implications and broader scientific impact across core AI research, safety, and trustworthiness.
Paper 2 addresses a critical and highly timely issue in LLM reasoning (Chain-of-Thought unfaithfulness) by innovating at the intersection of mechanistic interpretability and optimal transport. This offers broad, foundational impact for AI safety and reliability. In contrast, Paper 1 presents a solid but more incremental algorithmic combination (MARL and evolutionary algorithms) targeted at a specific, narrower application domain (air combat).
Paper 2 likely has higher impact: it tackles a central, timely LLM reliability problem (CoT faithfulness) with a novel hybrid of mechanistic interpretability (compact circuit tracing) and graph discrepancy (Fused Gromov–Wasserstein), and demonstrates scalable improvements across multiple benchmarks. Its applications span safety, evaluation, alignment, and deployment across domains using LLMs, giving broad cross-field relevance. Paper 1 is innovative neuro-symbolic search over KGs with clear biomedical utility, but its impact may be more domain-specific and dependent on KG coverage/quality.
Paper 1 addresses a critical and timely problem—detecting unfaithful chain-of-thought reasoning in LLMs—with a novel approach combining mechanistic interpretability (circuit tracing) with external reasoning signals. It introduces a principled framework (CIE-Scorer) with rigorous methodology using Fused Gromov-Wasserstein distance and demonstrates SOTA results. This has broad implications for AI safety and trustworthiness. Paper 2 offers an interesting empirical finding about context injection in multi-agent systems, but its scope is narrower (software design tasks) and the contribution is more observational than methodological, limiting its broader impact across fields.
Paper 2 likely has higher impact due to broader applicability and nearer-term real-world deployment: executable “skill programs” that intervene in agent loops can improve performance across web search, math, and coding, and can be used at inference, post-training, and self-improvement. This makes it timely for agentic LLM systems and widely transferable across domains. Paper 1 is novel and rigorous in mechanistic interpretability for CoT faithfulness, but its impact is narrower (focused on unfaithful CoT detection) and depends on circuit-tracing infrastructure and specific benchmarks.
Paper 2 offers higher scientific impact due to its deep methodological innovation in bridging mechanistic interpretability with external behavioral analysis. While Paper 1 presents a useful behavioral evaluation framework, Paper 2 tackles the critical AI safety problem of unfaithful Chain-of-Thought by examining internal computational circuits rather than just behavioral proxies. Its novel use of Fused Gromov-Wasserstein distance to measure internal-external discrepancy provides a highly rigorous, scalable solution to a fundamental bottleneck in LLM alignment and transparency, likely influencing both theoretical interpretability research and practical safety evaluations.
Paper 2 addresses a fundamental and broadly relevant problem in LLM trustworthiness—detecting unfaithful chain-of-thought reasoning—by innovatively combining mechanistic interpretability (circuit tracing) with external reasoning signals. This bridges two active research areas (interpretability and CoT faithfulness) with a principled mathematical framework (Fused Gromov-Wasserstein distance). The problem has wide implications for AI safety and reliability across all LLM applications. Paper 1, while technically solid, addresses a narrower problem (proactive task scheduling) with more incremental contributions combining existing techniques (GRPO, LoRA, existing RL methods).
Paper 1 (BET) addresses a fundamental and timely problem in LRM efficiency with broad practical impact. Its framework for adaptive compute allocation based on solvability rather than difficulty alone is novel and well-validated across seven benchmarks and three models, with strong zero-shot transfer. The ~55% token reduction with performance improvements has immediate real-world deployment implications. Paper 2 (CIE-Scorer) contributes meaningfully to CoT faithfulness detection using mechanistic interpretability, but addresses a narrower problem with less immediate practical applicability and is evaluated on fewer benchmarks from a single benchmark suite.
Paper 2 (Echo) demonstrates higher potential scientific impact due to its practical applicability and validated real-world results in a production environment, showing a significant 10% improvement in code completion acceptance rates. It addresses the fundamental and timely challenge of continuous learning from deployment data, which is broadly applicable across AI agent systems. While Paper 1 makes a solid contribution to CoT faithfulness detection using mechanistic interpretability, it addresses a more niche problem. Paper 2's framework for leveraging user refinement as training signals has broader implications for how AI systems are trained and improved at scale.