Jiayu Yang, Chao Chen, Shengen Wu, Yinhong Liu, Yuxuan Fan, Lujundong Li, Songning Lai, Chengwei Qin
Latent chain-of-thought compresses reasoning by replacing visible reasoning traces with continuous hidden-state recurrence, but existing formulations are difficult to optimize with standard on-policy reinforcement learning (RL) and hard to interpret causally. Our key insight is that a single pair of explicit boundary tokens can address both issues at once: discrete entry and exit anchors make the latent block compatible with standard on-policy RL, and the same anchors offer a natural foothold for mechanistic analysis. Motivated by this, we propose SWITCH, a switchable latent reasoning framework. The model emits `<swi>` to enter latent mode and `</swi>` to exit. Because the boundaries are ordinary discrete tokens, the GRPO policy ratio is well-defined at every decision point. The same anchors also expose the latent steps to direct probing and causal intervention. We train the model with a visible-to-latent curriculum and a Switch-GRPO objective that propagates gradients through recurrent latent computation. SWITCH consistently outperforms prior hidden-state-recurrence latent reasoning approaches at similar scale. Mechanistic analysis through the boundary tokens further reveals three findings: (i) `<swi>` is a sharply localised, learned switching policy rather than a stylistic artefact; (ii) the latent step it opens performs problem-specific, causally important computation rather than acting as an inert placeholder; and (iii) that computation is concentrated at a single hidden-state transition on entry. Together, these results show that hidden-state-recurrence latent reasoning is both RL-trainable and open to direct mechanistic analysis, including of how on-policy RL itself improves the model from the inside.
SWITCH introduces a deceptively simple but effective idea: bracket latent reasoning blocks with explicit discrete boundary tokens (`<swi>` / `</swi>`), enabling two previously elusive capabilities for hidden-state-recurrence latent chain-of-thought: (1) compatibility with standard on-policy RL (GRPO), and (2) mechanistic interpretability via the same anchors. The key insight is that because latent positions emit no tokens, they have no policy density, making GRPO undefined inside latent blocks. By making the *entry and exit* decisions discrete tokens while leaving the interior continuous, the policy ratio becomes well-defined at all decision points. The paper packages this into a three-phase training pipeline (SFT for boundary placement, curriculum for latent replacement, Switch-GRPO for RL optimization) and delivers a thorough mechanistic analysis.
Training pipeline. The three-phase approach is well-motivated. Phase 1 uses entropy-based annotation (from SwiReasoning) to identify where latent blocks should go. Phase 2's parallel curriculum schedule is an interesting design choice, with a clear explanation of why the sequential alternative fails (the model can satisfy the loss without computing in latent space). Phase 3's Switch-GRPO is technically sound — the factorization of rollout likelihood over text positions only, with latent positions contributing via frozen KV cache, is a clean solution.
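To make the "text positions only" factorization concrete, here is a minimal sketch of a GRPO-style clipped surrogate that skips latent positions. It is an illustration under stated assumptions, not the paper's implementation: the function names, the `is_text` mask, the per-rollout scalar advantage, and the `clip_eps` default are all hypothetical, and the sketch says nothing about how gradients reach the latent states.

```python
import math


def group_advantages(rewards):
    """Group-normalised advantages: (reward - group mean) / group std."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = math.sqrt(var)
    # Small constant (an assumption) avoids division by zero for uniform groups.
    return [(r - mean) / (std + 1e-6) for r in rewards]


def switch_grpo_surrogate(new_logps, old_logps, is_text, advantage, clip_eps=0.2):
    """Clipped policy-ratio surrogate averaged over text positions only.

    Latent positions (is_text[i] is False) emit no token, so they carry no
    log-probability; their entries in new_logps / old_logps are ignored.
    Boundary tokens count as ordinary text positions.
    """
    terms = []
    for new_lp, old_lp, text in zip(new_logps, old_logps, is_text):
        if not text:
            continue  # no policy density at latent positions
        ratio = math.exp(new_lp - old_lp)
        clipped = min(max(ratio, 1.0 - clip_eps), 1.0 + clip_eps)
        terms.append(min(ratio * advantage, clipped * advantage))
    return sum(terms) / len(terms) if terms else 0.0
```

In this sketch a rollout such as `[<swi>, latent, latent, </swi>, answer]` would use `is_text = [True, False, False, True, True]`, so the ratio is evaluated at both boundary decisions and at the answer token but never inside the block.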
Experimental design. The paper re-implements all baselines on the same Qwen3-8B base model with matched data and decoding settings, which is commendable. The +25.7 point improvement over CoLaR (the strongest Coconut-style baseline) on MATH-500 is substantial. However, there's an important nuance: SWITCH uses 1721 average visible tokens while Coconut-style baselines use ~10 tokens. This is not a fair token-efficiency comparison — SWITCH retains substantial visible CoT outside the latent blocks, making it more of a hybrid system than a pure latent reasoning approach.
Mechanistic analysis. The three-pronged analysis (teacher-forced statistics, linear probing, causal intervention) is the paper's strongest methodological contribution. The causal intervention showing 66.7-point accuracy collapse when zeroing latent states on the diagnostic subset, versus only 9.5 points for same-norm random vectors, convincingly establishes that the latent computation is specific and task-relevant. The finding that computation concentrates at the first hidden-state transition (with exit probability ≈1.0 at all subsequent steps) is mechanistically revealing, though it also raises questions about whether multi-step latent blocks are genuinely useful.
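The two intervention conditions compared above can be sketched as follows. This is a hedged illustration only: the helper names are hypothetical, the latent state is modelled as a plain list of floats, and the paper's actual sampling scheme for the "same-norm random" control is not specified here (a rescaled Gaussian draw is assumed).

```python
import math
import random


def l2_norm(vec):
    """Euclidean norm of a vector given as a list of floats."""
    return math.sqrt(sum(x * x for x in vec))


def zero_ablate(latent):
    """Replace the latent state with an all-zero vector of the same size."""
    return [0.0] * len(latent)


def same_norm_random(latent, rng):
    """Random-direction control: Gaussian noise rescaled to the latent's norm."""
    noise = [rng.gauss(0.0, 1.0) for _ in latent]
    scale = l2_norm(latent) / l2_norm(noise)
    return [scale * x for x in noise]


def accuracy_drop(baseline_correct, intervened_correct):
    """Accuracy lost under an intervention, in percentage points."""
    n = len(baseline_correct)
    return 100.0 * (sum(baseline_correct) - sum(intervened_correct)) / n
```

The same-norm control matters because it separates "the latent state has the wrong content" from "the latent state has the wrong magnitude": both conditions remove the original vector, but only zeroing also removes its norm.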
Immediate impact. The paper resolves a concrete technical barrier: prior work had argued hidden-state recurrence was incompatible with on-policy RL, pushing the field toward vocabulary mixtures. SWITCH provides a constructive counterexample, potentially reopening the hidden-state recurrence direction for RL-based optimization.
Interpretability contribution. Using the boundary tokens as interpretability anchors is a useful paradigm. The finding that RL sharpens the switching policy (halving switch rate while boosting latent-conditional accuracy by 12.6 points) provides rare insight into *what* RL changes mechanistically.
Limitations on breadth of impact. The work is restricted to 8B models on math benchmarks (MATH-500, GSM8K). The generalization to other reasoning domains, larger scales, or different base architectures is untested. The mechanistic finding that useful computation concentrates at one transition step suggests the current latent blocks may be underutilized, limiting the compression advantage.
This paper addresses a timely gap. The latent CoT space is rapidly evolving (Coconut, CODI, Latent-GRPO, SofT-GRPO all from 2025-2026), and the question of whether hidden-state recurrence can be RL-trained has been explicitly raised by competitors. The interpretability angle also arrives at the right moment — as latent reasoning methods proliferate, understanding *what* happens inside latent blocks becomes increasingly important.
The paper's framing positions SWITCH as advancing hidden-state recurrence latent reasoning, but in practice it's closer to a hybrid explicit/latent system where most reasoning happens in visible text. The mechanistic insight that latent computation collapses to one useful transition is important but somewhat undermines the paper's own premise about "recurrent" computation. The reward hacking phenomenon (Fig. 9) is a valuable contribution to the RL-for-reasoning literature.
Generated Jun 12, 2026
Paper 1 introduces a highly novel, tangible framework (SWITCH) that solves critical optimization and interpretability bottlenecks in latent chain-of-thought reasoning using on-policy RL. While Paper 2 provides valuable empirical insights into RL post-training, Paper 1 proposes a fundamental methodological innovation (Switch-GRPO) that enables new capabilities in hidden-state recurrence, which is currently at the frontier of LLM reasoning research. Its combination of a new architectural approach with mechanistic interpretability gives it higher potential for widespread adoption and subsequent follow-up research.
Paper 1 introduces a fundamentally new theoretical framework (simulatable processes) that broadens the foundational PAC learning model, addressing a core open question about learning under dependent data. It provides novel connections between VC dimension, Kolmogorov complexity, and computational complexity, with broad implications across learning theory. Paper 2 presents a solid engineering contribution to latent reasoning in LLMs, but is more incremental—improving on existing hidden-state recurrence methods with boundary tokens. Paper 1's conceptual breadth and theoretical depth give it higher long-term scientific impact across multiple fields.
Paper 1 addresses a fundamental capability of LLMs (latent reasoning) by bridging on-policy reinforcement learning with mechanistic interpretability. This explores new paradigms in how AI models learn to reason internally, offering high theoretical novelty and broad implications for future reasoning architectures. Paper 2, while offering a highly practical and elegant systems-level optimization for KV caching, represents an engineering contribution with narrower fundamental scientific impact compared to the algorithmic and interpretability advancements of Paper 1.
Paper 2 introduces a novel framework (SWITCH) that addresses fundamental challenges in latent reasoning for LLMs—making hidden-state recurrence compatible with standard on-policy RL and interpretable through mechanistic analysis. This has broader impact across AI/ML, touching reasoning, interpretability, and RL training paradigms. Paper 1, while valuable as a data resource for chemical perturbation transcriptomics, is more incremental (harmonizing existing datasets) and serves a narrower community. Paper 2's methodological innovation and mechanistic insights into latent computation have wider applicability and timeliness given the current focus on reasoning in LLMs.
Paper 2 addresses the highly active area of latent reasoning in LLMs, combining practical RL training improvements with mechanistic interpretability. The SWITCH framework offers immediate applicability to the rapidly growing field of reasoning models, with clear engineering contributions (boundary tokens enabling standard on-policy RL) and scientific insights (mechanistic analysis of hidden-state recurrence). Paper 1, while theoretically rigorous in providing certified prediction horizons for equivariant world models, addresses a narrower niche. Paper 2's broader relevance to the LLM reasoning community, timeliness, and dual practical/interpretability contributions give it higher estimated impact.
Paper 1 has higher impact potential due to a more novel and timely contribution: making hidden-state recurrence latent reasoning compatible with standard on-policy RL via explicit boundary tokens, while also enabling mechanistic causal analysis. This targets a central, fast-moving problem in LLM reasoning efficiency and interpretability, with broad applicability across RLHF-style training, agentic reasoning, and safety/interpretability research. The methodological framing (well-defined policy ratios, curriculum, Switch-GRPO, causal probes) suggests strong rigor and generalizability. Paper 2 is useful but more incremental within established unsupervised GNN clustering/self-training paradigms and has narrower cross-field reach.
Paper 2 addresses fundamental challenges in AI by improving the trainability and interpretability of latent chain-of-thought reasoning in large language models. Given the massive scale and rapid advancement of LLM research, methodologies that enhance reasoning efficiency and provide mechanistic insights have broader, more transformative implications across the field compared to the domain-specific ecological application presented in Paper 1.
Paper 2 introduces SWITCH, a novel framework that solves two fundamental problems in latent reasoning—RL trainability and interpretability—with a single elegant mechanism (boundary tokens). It bridges latent chain-of-thought reasoning with standard on-policy RL, opening a new research direction with broad implications for efficient reasoning in LLMs. The mechanistic analysis adds scientific depth. Paper 1, while offering useful empirical insights about module-specific manifold constraints, is more incremental—it characterizes geometry preferences for a specific optimizer (Manifold Muon) on GPT-2, with narrower scope and applicability.
Paper 2 proposes a general unifying theory (SIM) for interpretable machine learning grounded in Lagrangian mechanics, which addresses a fundamental gap across the entire interpretability field. Its breadth of impact spans traditional, concept-based, and mechanistic interpretability, offering both theoretical foundations and practical design principles. While Paper 1 makes solid contributions to latent reasoning with RL and mechanistic analysis, it addresses a narrower problem. Paper 2's potential to unify a fragmented discipline, inform curricula, and reshape how interpretability methods are designed gives it broader and longer-lasting scientific impact.
Paper 1 is more novel and broadly impactful: it introduces a simple but powerful interface (boundary tokens) that simultaneously resolves an optimization barrier (on-policy RL ratios for latent recurrence) and enables mechanistic/causal analysis of latent reasoning. This bridges RLHF-style training, interpretability, and reasoning efficiency—areas with wide cross-field relevance and strong timeliness. Paper 2 is methodologically solid and useful for diffusion inference, but the reported gains are modest and the contribution is more incremental/engineering-focused, with narrower impact compared to a general framework for RL-trainable latent reasoning.