What Makes Chain-of-Thought Work at Probe Time? Local Co-occurrence Rather Than Global Derivation

Xiang Wang, Wei Wei

May 26, 2026

arXiv:2605.26795v1

cs.AI (primary)

#214 of 2682 · Artificial Intelligence

Tournament Score

1523 ± 46 (axis range 1050 to 1800)

Win Rate: 83%

Rating: 6.8/10

Significance 7 · Rigor 7.5 · Novelty 7 · Clarity 8

Abstract

Chain-of-thought (CoT) prompting reliably improves language-model accuracy, but which properties of a rationale text drive the improvement is poorly understood. Prior work has largely studied generation-time behavior. We instead ask a probe-time question: given a fixed rationale in context, what in that text changes the answer? We identify two complementary sources of the gain. First, even a globally word-shuffled rationale substantially outperforms the no-rationale baseline, indicating a strong lexical activation effect. More importantly, the additional gain from structured text appears to arise less from sentence-level logical ordering and more from short-range token adjacency. Preserving contiguous windows of just $n^\star = 2$–$3$ tokens recovers most of the remaining gain toward full CoT performance. Supporting experiments rule out copying of explicit answer declarations or answer values, as well as full grammatical realization, as primary drivers. Further generalization experiments show that the qualitative pattern remains stable across multiple model families, parameter scales, and datasets. These results support a local co-occurrence activation (LCA) account of probe-time CoT, in which the observed gains appear to arise primarily from lexical activation and short-range token co-occurrence rather than sentence-level logical derivation.

AI Impact Assessments

(1 model)

Scientific Impact Assessment

Core Contribution

This paper investigates a specific and underexplored question: when a chain-of-thought (CoT) rationale is placed into the context of a language model, what textual properties of that rationale actually drive the downstream accuracy improvement? The authors decouple CoT into generation-time and probe-time effects, focusing exclusively on the latter. Their central finding is that probe-time CoT gains arise from two mechanisms: (1) lexical activation — merely having domain-relevant tokens present (even globally shuffled) substantially outperforms the no-rationale baseline, and (2) local co-occurrence activation — preserving contiguous windows of only n*=2–3 tokens recovers most of the remaining gap between a word bag and full CoT. Sentence-level logical ordering contributes minimally.

The paper introduces a well-defined gap-recovery metric GR(n) and a systematic n-gram perturbation protocol that cleanly quantifies how much structure is needed. This is a conceptually simple but insightful experimental paradigm.
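The review does not reproduce the definition of GR(n). One natural reading, assumed here, is the fraction of the accuracy gap between the globally shuffled rationale (the word bag) and the full rationale that is recovered when windows of n tokens are kept intact. The function names, the 0.8 threshold used to pick a smallest sufficient window, and the accuracies in the usage example are all hypothetical and are not taken from the paper.

```python
from typing import Dict, Optional


def gap_recovery(acc_n: float, acc_bag: float, acc_full: float) -> float:
    """Assumed form of GR(n): share of the bag-to-full accuracy gap
    recovered when contiguous windows of n tokens are preserved."""
    gap = acc_full - acc_bag
    if gap <= 0:
        raise ValueError("full-CoT accuracy must exceed the word-bag baseline")
    return (acc_n - acc_bag) / gap


def smallest_sufficient_window(
    acc_by_n: Dict[int, float],
    acc_bag: float,
    acc_full: float,
    threshold: float = 0.8,  # hypothetical cutoff, not from the paper
) -> Optional[int]:
    """Return the smallest n whose gap recovery reaches the threshold."""
    for n in sorted(acc_by_n):
        if gap_recovery(acc_by_n[n], acc_bag, acc_full) >= threshold:
            return n
    return None


if __name__ == "__main__":
    # Made-up accuracies for illustration only; not results from the paper.
    acc_by_n = {1: 0.60, 2: 0.74, 3: 0.77, 5: 0.79}
    for n, acc in acc_by_n.items():
        print(n, round(gap_recovery(acc, acc_bag=0.60, acc_full=0.80), 2))
    print(smallest_sufficient_window(acc_by_n, acc_bag=0.60, acc_full=0.80))
```

Under this reading, GR(1) is 0 by construction (n = 1 is the word bag itself) and GR at the full rationale length is 1, so the metric isolates the contribution of structure beyond lexical activation.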

Methodological Rigor

The experimental design is carefully constructed. The generator-probe separation is critical: by freezing the rationale and only studying how the probe model reads it, the authors cleanly isolate probe-time effects from generation-time reasoning dynamics. The automatic stripping of answer-declaration sentences before perturbation prevents trivial answer leakage.

The control experiments are thorough and well-chosen:

Answer-stripping and value removal rule out simple copying mechanisms

Concept compression tests whether full grammatical structure is necessary

Wikipedia passage injection controls for generic topical overlap

Question-stem perturbation confirms that the low-n recovery is specific to CoT-style text

Tail-sweep experiments demonstrate that useful information is distributed throughout the rationale

Statistical rigor is adequate: results are pooled across three random seeds (1500 examples per condition), and McNemar tests are applied at α=0.05. The use of four generator-probe configurations spanning open/closed-source models and different scales strengthens the claims.
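As a rough illustration of the paired comparison mentioned above, the sketch below implements the exact (binomial) form of McNemar's test using only the standard library. The review does not say whether the authors used the exact or the chi-square variant, so the choice here is an assumption, and `mcnemar_exact` is a hypothetical helper name.

```python
from math import comb
from typing import Sequence, Tuple


def mcnemar_exact(
    correct_a: Sequence[bool], correct_b: Sequence[bool]
) -> Tuple[int, int, float]:
    """Two-sided exact McNemar test on paired per-example correctness.

    Returns (b, c, p_value), where b counts examples that condition A got
    right and condition B got wrong, and c counts the reverse.
    """
    if len(correct_a) != len(correct_b):
        raise ValueError("both conditions must cover the same examples")
    b = sum(1 for a, bb in zip(correct_a, correct_b) if a and not bb)
    c = sum(1 for a, bb in zip(correct_a, correct_b) if not a and bb)
    n = b + c
    if n == 0:
        return b, c, 1.0  # no discordant pairs, no evidence of a difference
    # Under H0 each discordant pair is a fair coin flip: Binomial(n, 0.5).
    tail = sum(comb(n, k) for k in range(min(b, c) + 1)) / 2 ** n
    return b, c, min(1.0, 2 * tail)


if __name__ == "__main__":
    # Toy paired outcomes, for illustration only.
    cond_a = [True] * 8 + [False] * 2
    cond_b = [False] * 8 + [True] * 2
    b, c, p = mcnemar_exact(cond_a, cond_b)
    print(b, c, p, "significant" if p < 0.05 else "not significant")
```

Only the discordant pairs enter the test, which is why it suits comparisons where the same examples are scored under two perturbation conditions.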

However, there are methodological concerns. The n-gram perturbation protocol uses non-overlapping blocks, meaning the boundaries between blocks are somewhat arbitrary. The paper does not explore whether overlapping windows or different block boundary strategies would change the picture. Additionally, the "concept compression" intervention uses an LLM to generate compressed phrases, which introduces a potential confound — the LLM may preserve information in ways that are not fully controlled.
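The non-overlapping block protocol described above can be sketched as follows. This is a minimal illustration under stated assumptions: tokens are produced by whitespace splitting (the paper's tokenizer is not specified in this review), blocks start at position 0, and `block_shuffle` is a hypothetical name rather than the authors' implementation.

```python
import random
from typing import List


def block_shuffle(tokens: List[str], n: int, seed: int = 0) -> List[str]:
    """Cut tokens into consecutive non-overlapping blocks of n tokens,
    shuffle the blocks, and keep the token order inside each block."""
    if n < 1:
        raise ValueError("n must be at least 1")
    blocks = [tokens[i:i + n] for i in range(0, len(tokens), n)]
    random.Random(seed).shuffle(blocks)
    return [tok for block in blocks for tok in block]


if __name__ == "__main__":
    rationale = "the probability of A and B equals the product of both".split()
    print(" ".join(block_shuffle(rationale, n=1)))  # global word shuffle
    print(" ".join(block_shuffle(rationale, n=3)))  # 3-token windows kept
```

With n = 1 this reduces to the global word shuffle, and with n at least the rationale length the text is left unchanged. Because the first block always starts at position 0, which token pairs end up adjacent depends on where the boundaries fall, which is the arbitrariness the concern above refers to.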

Potential Impact

This work has implications for several areas:

1. Understanding CoT mechanisms: The finding that probe-time CoT gains are predominantly local challenges the narrative that CoT works by enabling step-by-step logical reasoning. This reframes our understanding of what the model is actually doing when it "reads" a rationale.

2. Prompt engineering and compression: If 2-3 token windows suffice, this has practical implications for rationale compression, potentially enabling much shorter prompts without sacrificing performance.

3. Faithfulness research: The results add nuance to the faithfulness debate — if the model doesn't need the logical structure, then the question of whether CoT is "faithful" to internal reasoning becomes even more complex.

4. Theoretical work on transformers: The finding is compatible with theories about transformers as local pattern matchers and may inform theoretical models of in-context learning.

However, the impact is somewhat bounded by the scope: probe-time CoT is only half the story. The paper explicitly does not address generation-time effects, where the model produces reasoning tokens that may genuinely influence subsequent generation through serial computation. The distinction between probe-time (reading a fixed rationale) and generation-time (producing reasoning tokens autoregressively) is crucial — the paper's findings do not necessarily imply that CoT reasoning during generation is also driven by local co-occurrence.

Timeliness & Relevance

This paper is highly timely. As reasoning models (o1, DeepSeek-R1, QwQ) become central to AI capabilities, understanding *why* CoT works is increasingly important. The community has been grappling with questions about CoT faithfulness, and this paper provides a concrete mechanistic insight. The finding that local co-occurrence rather than global derivation drives probe-time gains is relevant to ongoing debates about whether LLMs truly "reason" or exploit statistical patterns.

Strengths

1. Clean experimental framework: The generator-probe separation and systematic n-gram protocol are elegant and reproducible.

2. Comprehensive controls: The paper anticipates and addresses multiple alternative explanations with targeted experiments.

3. Generalization: Results hold across 4 model configurations, 3 datasets, multiple scales (0.8B–35B), and both multiple-choice and open-ended generation settings.

4. Clear quantitative metric: GR(n) provides a principled way to measure the contribution of different structural levels.

5. Provocative but well-supported claim: The finding that n*=2–3 suffices is surprising and memorable.

Limitations

1. Probe-time only: The paper's scope is explicitly limited to reading fixed rationales, not generating them. The title's framing as "what makes CoT work" may overstate the contribution, since generation-time reasoning is arguably the more important and interesting case.

2. Task coverage: The evaluation covers three multiple-choice benchmarks plus two math benchmarks. While these are diverse, tasks requiring genuinely long chains of multi-step reasoning (e.g., complex multi-hop reasoning, program synthesis) are not tested. The LogiQA results already hint that logic tasks may require larger windows.

3. Mechanistic explanation gap: The paper identifies *what* matters (local co-occurrence) but offers limited insight into *why* — what computational mechanisms in transformers produce this behavior? The LCA account is descriptive rather than explanatory.

4. Token-level vs. semantic-level: The n-gram protocol operates at the token level, but the relevant unit might be semantic (e.g., "P(A∩B)" is multiple tokens but one semantic unit). The effective window in semantic terms might be even smaller or qualitatively different.

5. Potential confound in shuffling: Block shuffling preserves within-block order but also preserves the total token count and frequency distribution. It's unclear whether position effects (e.g., recency bias) interact with the perturbation in ways not fully controlled.
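The shuffling concern in the last item can be made concrete with a small check. The sketch below is a hypothetical diagnostic, not something the paper reports: it shuffles non-overlapping blocks, confirms that the token multiset is unchanged, and records how far each block moved, which is the quantity a position or recency effect would depend on. Whitespace tokenization and the function name are assumptions.

```python
import random
from collections import Counter
from typing import List, Tuple


def shuffle_blocks_with_positions(
    tokens: List[str], n: int, seed: int = 0
) -> Tuple[List[str], List[int]]:
    """Shuffle non-overlapping n-token blocks and report, for each output
    slot, how many block positions its content moved (new minus old)."""
    blocks = [tokens[i:i + n] for i in range(0, len(tokens), n)]
    order = list(range(len(blocks)))  # order[new_slot] == original block index
    random.Random(seed).shuffle(order)
    shuffled = [tok for idx in order for tok in blocks[idx]]
    displacement = [new - old for new, old in enumerate(order)]
    return shuffled, displacement


if __name__ == "__main__":
    tokens = "so the total cost is twelve plus eight which gives twenty".split()
    shuffled, disp = shuffle_blocks_with_positions(tokens, n=2)
    # Token counts and frequencies are invariant under block shuffling.
    print(Counter(shuffled) == Counter(tokens))
    # Mean absolute displacement, in block positions.
    print(sum(abs(d) for d in disp) / len(disp))
```

Logging displacement alongside accuracy would allow a check of whether examples whose key blocks land near the end of the context behave differently from those whose key blocks land near the start.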

Overall Assessment

This is a well-executed empirical study that makes a clear, provocative, and well-supported claim about a specific aspect of CoT prompting. The probe-time perspective is genuinely underexplored, and the finding that local co-occurrence dominates is both surprising and informative. The experimental design is thorough, with good controls and generalization. The main limitation is scope — the findings apply to reading fixed rationales, not to the full CoT phenomenon including generation — and the lack of a deeper mechanistic explanation. Nevertheless, this paper makes a meaningful contribution to understanding how language models process reasoning text and should stimulate further investigation.

Rating: 6.8/10

Significance 7 · Rigor 7.5 · Novelty 7 · Clarity 8

Generated May 27, 2026

Comparison History (24)

vs. Better Accuracies, Worse Reasoning: A Step-Level Audit of Medical Chain-of-Thought Distillation

claude-opus-4.6 · 5/28/2026

Paper 2 offers a more fundamental mechanistic insight into why chain-of-thought works, revealing that local co-occurrence rather than logical derivation drives much of the gain. This challenges core assumptions about CoT reasoning and has broad implications across all CoT applications and model families. Paper 1, while important for medical AI safety, addresses a narrower domain-specific concern about distillation quality. Paper 2's findings could reshape how the field thinks about prompting, reasoning evaluation, and model interpretability, giving it broader and more transformative potential impact.

vs. Credit Assignment with Resets in Language Model Reasoning

claude-opus-4.6 · 5/27/2026

Paper 2 has higher potential scientific impact because it challenges a fundamental assumption about *why* chain-of-thought reasoning works in LLMs, revealing that local co-occurrence patterns rather than logical derivation drive much of the gain. This mechanistic insight has broad implications across the entire field of LLM reasoning, prompting, and interpretability. While Paper 1 offers a solid engineering contribution (better credit assignment for RL fine-tuning), Paper 2's finding is more surprising, more broadly applicable, and likely to reshape how researchers think about reasoning in language models, spurring significant follow-up work.

vs. Breaking $\textit{Winner-Takes-All}$: Cooperative Policy Optimization Improves Diverse LLM Reasoning

claude-opus-4.6 · 5/27/2026

Paper 2 proposes a novel training paradigm (GCPO) that directly addresses a well-known problem (exploration collapse in RLVR) with a principled cooperative optimization framework. It offers actionable improvements to LLM reasoning training with demonstrated gains in both accuracy and diversity. Paper 1 provides interesting mechanistic insights into why CoT works (local co-occurrence rather than logical derivation), which is analytically valuable but primarily diagnostic rather than constructive. Paper 2's practical method for improving LLM reasoning has broader applicability, stronger immediate real-world impact, and addresses a timely bottleneck in the rapidly growing RLVR field.

vs. PolyFusionAgent: A Multimodal Foundation Model and Autonomous AI Assistant for Polymer Property Prediction and Inverse Design

gpt-5.2 · 5/27/2026

Paper 2 is more novel and broadly impactful: it provides a mechanistic, probe-time explanation for CoT benefits (local co-occurrence/lexical activation) with controlled ablations and cross-model/dataset generalization, directly informing LLM evaluation, prompting, interpretability, and safety. Its timeliness is high given widespread CoT use. Paper 1 has strong applied potential for polymer discovery, but impact may be narrower (materials domain) and depends on data/benchmark/validation quality and real-world experimental follow-through.

vs. MemFail: Stress-Testing Failure Modes of LLM Memory Systems

claude-opus-4.6 · 5/27/2026

Paper 1 offers a more fundamental and surprising scientific insight about why Chain-of-Thought prompting works, challenging prevailing assumptions that logical reasoning structure drives CoT gains. Its finding that local token co-occurrence rather than global derivation explains most benefits is counterintuitive and has broad implications for understanding LLM reasoning mechanisms, prompt engineering, and interpretability research. Paper 2, while useful, provides a more incremental engineering contribution—a diagnostic benchmark for memory systems—with narrower scope. Paper 1's cross-model generalization strengthens its impact potential across the field.

vs. 2-ASP(Q) programs with weak constraints: Complexity and efficient implementation

gpt-5.2 · 5/27/2026

Paper 2 likely has higher impact: it addresses a timely, central question in modern ML (why CoT works) with broadly relevant, model- and dataset-spanning evidence, and its LCA account can influence prompting, evaluation, interpretability, and training across many NLP/AI subfields. Its findings may change how practitioners use rationales and how researchers theorize reasoning in LLMs. Paper 1 is rigorous and valuable for ASP theory/implementation, but its impact is narrower to logic programming and complexity/solver communities, with fewer cross-field applications.

vs. Internalizing Safety Understanding in Large Reasoning Models via Verification

gpt-5.2 · 5/27/2026

Paper 2 likely has higher impact: it proposes a concrete training framework (SInternal) to internalize safety via verification, with clear real-world applicability to reducing jailbreak vulnerability and improving alignment pipelines. The approach is timely and broadly relevant across AI safety, RLHF/RLAIF, and LLM deployment, and claims improved out-of-domain robustness plus better RL initialization—potentially influential for practice and future research. Paper 1 offers valuable mechanistic insight into CoT probe-time effects, but its applications are more indirect and narrower in immediate deployment impact.

vs. Boosting Knowledge Graph Foundation Models via Enhanced Negative Sampling

claude-opus-4.6 · 5/27/2026

Paper 1 offers a fundamentally novel mechanistic insight into why chain-of-thought prompting works, challenging the widely held assumption that logical derivation drives CoT gains. This finding has broad implications for LLM interpretability, prompt engineering, and understanding of transformer reasoning. Its rigorous experimental design across multiple models, scales, and datasets strengthens its contribution. Paper 2, while solid and practical, proposes an incremental improvement (adaptive negative sampling) to an existing paradigm in KG completion—a narrower subfield with less transformative potential. Paper 1's findings could reshape how researchers think about LLM reasoning mechanisms.

vs. Alignment Tampering: How Reinforcement Learning from Human Feedback Is Exploited to Optimize Misaligned Biases

gpt-5.2 · 5/27/2026

Paper 1 has higher estimated impact due to stronger real-world relevance and urgency: it identifies a structural vulnerability in the dominant RLHF alignment pipeline, with demonstrated amplification of harmful behaviors and difficult mitigation. This is novel and actionable for AI safety, governance, and deployment practices, likely influencing both research directions and industry training protocols. Paper 2 offers a valuable mechanistic insight into probe-time CoT effects, but its implications are more interpretive and narrower in immediate application than a security-like failure mode in alignment methods.

vs. Position: AI Safety Requires Effective Controllability

claude-opus-4.6 · 5/27/2026

Paper 2 provides a novel mechanistic insight into why chain-of-thought prompting works, revealing that local token co-occurrence rather than logical reasoning drives much of the gain. This fundamentally challenges prevailing assumptions about CoT and has broad implications for understanding LLM reasoning, prompt engineering, and interpretability research. Paper 1 addresses an important AI safety topic but is more of a position/framework paper proposing architectural principles and a benchmark, which, while valuable, offers less surprising empirical insight. Paper 2's findings are more likely to redirect significant research efforts across the NLP community.

vs. Reasoning, Code, or Both? How Large Language Models Handle Variations in Math Questions

claude-opus-4.6 · 5/27/2026

Paper 1 offers a more novel and fundamental insight into how chain-of-thought prompting works, revealing that local token co-occurrence rather than logical derivation drives much of the gain. This mechanistic understanding challenges prevailing assumptions about CoT and has broad implications for prompt engineering, interpretability research, and model design across the field. Paper 2, while useful, is a relatively incremental empirical comparison of known methods (CoT, PAL, SBSC) on a specific benchmark with non-significant results, offering limited new theoretical insight or methodological innovation.

vs. LELA: An End-to-end LLM-based Entity Linking Framework with Zero-shot Domain Adaptation

gemini-3.1 · 5/27/2026

Paper 2 investigates the fundamental mechanisms behind Chain-of-Thought prompting, offering a novel and potentially paradigm-shifting insight that local co-occurrence, rather than logical derivation, drives performance gains. This fundamental discovery has broad implications for LLM architecture and prompt engineering across many fields. In contrast, Paper 1 presents a practical engineering contribution (a Python library for entity linking), which is highly useful but narrower in scope and less likely to drive foundational scientific shifts compared to Paper 2's theoretical contributions.

vs. A governance horizon for ethical-use constraints in open-weight AI models

claude-opus-4.6 · 5/27/2026

Paper 2 addresses a fundamental question about why chain-of-thought prompting works in LLMs, revealing that local co-occurrence rather than logical reasoning drives much of the gain. This challenges widely-held assumptions about CoT and has broad implications for interpretability, prompt engineering, and understanding of LLM reasoning across the entire AI/NLP community. Paper 1 provides valuable empirical governance analysis of Hugging Face model licensing, but addresses a narrower policy/infrastructure audience. Paper 2's mechanistic insight into a ubiquitously used technique will likely influence more research directions and has higher citation potential.

vs. Beyond a Single Direction: Chain-of-Thought Disrupts Simple Steering of Refusal

gemini-3.1 · 5/27/2026

Paper 2 fundamentally challenges the prevailing assumption that Chain-of-Thought efficacy stems from logical derivation, demonstrating it relies primarily on short-range token co-occurrence. This insight has broad, paradigm-shifting implications for understanding LLM reasoning mechanisms across the entire field. Paper 1 offers valuable insights into AI safety and refusal mechanisms via activation steering, but its scope is narrower and more specialized compared to the foundational nature of Paper 2's findings.

vs. MobileExplorer: Accelerating On-Device Inference for Mobile GUI Agents via Online Exploration

gpt-5.2 · 5/27/2026

Paper 2 has higher potential scientific impact because it offers a novel, general mechanistic explanation for why chain-of-thought improves performance at probe time, with broad relevance to prompting, interpretability, evaluation, and model design across NLP/ML. Its methodology (controlled rationale perturbations, window preservation, multi-model/dataset generalization, and ruling out confounds) suggests strong rigor and portability of conclusions. Paper 1 is timely and practically useful for on-device GUI agents, but its impact is narrower (mobile agent deployment/latency engineering) and more incremental relative to existing systems optimization work.

vs. Fundamental Limitation in Explaining AI

gpt-5.2 · 5/27/2026

Paper 1 offers a concrete, testable mechanistic account of CoT’s probe-time benefits (local co-occurrence/lexical activation) supported by controlled ablations and robustness across models/datasets, making it both methodologically rigorous and immediately actionable for prompting, evaluation, and interpretability research. Its findings can reshape how CoT is understood and used in practice. Paper 2 is timely and potentially broad for governance, but its impact hinges on the generality/assumptions of the formalization; such impossibility-style results often translate less directly into empirical practice. Overall, Paper 1 is likelier to drive near-term scientific and applied follow-up work.

vs. Traceable Knowledge Graph Reasoning Enables LLM-Assisted Decision Support for Industrial VOCs in the Steel Industry

gemini-3.1 · 5/27/2026

Paper 2 investigates fundamental mechanisms of Chain-of-Thought prompting, a ubiquitous technique in AI. Its counterintuitive findings challenge existing assumptions about LLM reasoning and have broad implications across all domains utilizing LLMs. In contrast, Paper 1 presents a highly specialized application for the steel industry. While valuable for environmental informatics, its scope of scientific impact is much narrower than the generalized insights into language model behavior provided by Paper 2.

vs. Maat: The Agentic Legal Research Assistant for Competition Protection

gpt-5.2 · 5/27/2026

Paper 1 offers a broadly relevant, mechanistic insight into why chain-of-thought helps at inference: most gains can come from lexical activation and very local token co-occurrence rather than global logical structure. This is novel, challenges common assumptions, and is likely to influence future prompting, interpretability, and evaluation across many LLM applications and model families. Its controlled probe-time methodology and cross-model/dataset validation support rigor and generality. Paper 2 is impactful for a specific domain (competition law) and is more application-engineering focused, with narrower cross-field scientific reach.

vs. AGORA: Adapter-Grounded Observation-Action Retention for Inference-Free Prompt Compression in LLM Agents

gemini-3.1 · 5/27/2026

Paper 1 fundamentally challenges current assumptions about how Chain-of-Thought prompting works, showing that performance gains stem from local token co-occurrence rather than logical derivation. This provides deep theoretical insights into LLM 'reasoning' mechanisms, potentially reshaping future research in model interpretability and prompting. Paper 2 addresses a specific engineering challenge in LLM agents (prompt compression); while highly practical, its impact is narrower compared to the foundational revelations in Paper 1.

vs. Counteraction-Aware Multi-Teacher On-Policy Distillation for General Capability Recovery with Domain Preservation

gpt-5.2 · 5/27/2026

Paper 1 offers a more novel and broadly relevant scientific insight: it disentangles why chain-of-thought helps at probe time and provides evidence that local token co-occurrence drives much of the benefit, challenging common assumptions about logical derivation. This has wide impact across prompting, interpretability, evaluation, and dataset design, and is timely given heavy reliance on CoT in LLM research. Paper 2 is practically valuable for distillation/specialization pipelines, but is more incremental within an established engineering line and narrower in cross-field influence.