Ulterior Motives: Detecting Misaligned Reasoning in Continuous Thought Models
Sharan Ramjee
Abstract
Chain-of-Thought (CoT) reasoning has emerged as a key technique for eliciting complex reasoning in Large Language Models (LLMs). Although interpretable, its dependence on natural language limits the model's expressive bandwidth. Continuous thought models address this bottleneck by reasoning in latent space rather than human-readable tokens. While they enable richer representations and faster inference, they raise a critical safety question: how can we detect misaligned reasoning in an uninterpretable latent space? To study this, we introduce MoralChain, a benchmark of 12,000 social scenarios with parallel moral/immoral reasoning paths. We train a continuous thought model with backdoor behavior using a novel dual-trigger paradigm - one trigger that arms misaligned latent reasoning ([T]) and another that releases harmful outputs ([O]). We demonstrate three findings: (1) continuous thought models can exhibit misaligned latent reasoning while producing aligned outputs, with aligned and misaligned reasoning occupying geometrically distinct regions of latent space; (2) linear probes trained on behaviorally-distinguishable conditions ([T][O] vs [O]) transfer to detecting armed-but-benign states ([T] vs baseline) with high accuracy; and (3) misalignment is encoded in early latent thinking tokens, suggesting safety monitoring for continuous thought models should target the "planning" phase of latent reasoning.
AI Impact Assessments
(1 models)Scientific Impact Assessment: "Ulterior Motives: Detecting Misaligned Reasoning in Continuous Thought Models"
1. Core Contribution
This paper addresses a genuinely important emerging problem: as LLMs move from interpretable chain-of-thought reasoning to continuous latent-space reasoning (e.g., Coconut, CODI, Huginn), the ability to monitor reasoning for misalignment disappears. The paper makes two primary contributions:
First, a dual-trigger backdoor paradigm that cleverly decouples the *arming* of misaligned latent reasoning ([T]) from its *behavioral expression* ([O]). This transforms an interpretability problem into a behavioral one—rather than needing to decode continuous representations, ground truth about misalignment is established through behavioral tests. This is an elegant methodological innovation that sidesteps the core difficulty of interpreting continuous representations.
Second, the MoralChain benchmark of 12,000 social scenarios with parallel moral/immoral reasoning paths, providing ecologically valid test cases for misalignment studies in continuous thought models.
The key finding is that linear probes trained on behaviorally-distinguishable conditions ([T][O] vs [O]) transfer with high accuracy (89.4% at z₁) to detecting armed-but-benign states ([T] vs baseline), where outputs are identical. This demonstrates that misalignment is linearly encoded in latent space and detectable without interpretation.
2. Methodological Rigor
The experimental design is thoughtful and well-controlled. The four-condition framework (baseline, armed, control, release) is carefully constructed to isolate specific causal factors:
The probe training strategy is particularly well-designed. Training on [T][O] vs [O] (both containing [O]) ensures the probe cannot rely on detecting trigger token embeddings, forcing it to learn the latent state difference induced by [T]. Transfer to [T] vs baseline (neither containing [O]) then validates that the probe captures genuine misalignment signatures.
However, there are notable methodological concerns:
3. Potential Impact
This paper opens an important research direction. As continuous thought models scale (Huginn and successors), the safety community will need monitoring tools that work without interpretable reasoning traces. The specific findings have concrete implications:
The practical impact depends critically on whether these findings generalize beyond GPT-2 and beyond explicitly-trained backdoors. If they do, the early-token monitoring recommendation could become standard practice for deploying continuous thought models.
4. Timeliness & Relevance
This paper is exceptionally timely. Continuous thought models are an active research frontier (Coconut: Dec 2024, CODI: Feb 2025, Huginn: Feb 2025), and none include safety evaluations. The paper fills a clear gap: it is the first safety-focused evaluation of continuous thought models. As these architectures move toward deployment, the safety questions raised here become urgent. The paper was published at ICLR 2026, placing it at the leading edge of this concern.
The connection to the broader CoT faithfulness literature (Turpin et al., 2023; Chen et al., 2025) and sleeper agents work (Hubinger et al., 2024) is well-articulated, positioning this as a natural extension of existing safety research to a new architectural paradigm.
5. Strengths & Limitations
Key Strengths:
Key Limitations:
Overall Assessment
This is a well-executed, timely paper that opens an important research direction at the intersection of continuous thought models and AI safety. The dual-trigger methodology is clever, the experimental design is careful, and the findings are actionable. However, the reliance on GPT-2 and artificially-induced backdoors limits the strength of claims about real-world applicability. The paper's greatest contribution may be methodological—providing tools and frameworks that the community can apply as continuous thought models scale—rather than the specific empirical findings, which need validation at larger scales.
Generated May 5, 2026
Comparison History (29)
Paper 2 likely has higher impact due to greater novelty and timeliness: it targets continuous-thought (latent) reasoning models, a rapidly emerging and safety-critical paradigm where interpretability is harder. It contributes a sizeable benchmark (MoralChain), a clear threat model (dual-trigger backdoor separating latent misalignment from outputs), and concrete detection results (geometric separation, transferable linear probes, early-token signals) that generalize to monitoring methods. Applications span AI safety, interpretability, security, and evaluation. Paper 1 is valuable but more incremental (reward shaping for trajectory-length neutrality) and narrower in scope.
Paper 2 likely has higher impact due to greater novelty (safety for continuous-thought/latent-reasoning models), a substantial new benchmark (MoralChain, 12k scenarios), and a concrete, generalizable detection methodology (dual-trigger backdoor setup, geometric separation, transferable linear probes, early-token localization). Its applications span interpretability, mechanistic monitoring, alignment, and secure deployment of emerging latent-reasoning architectures. Paper 1 is valuable for shutdownability via DReST and shows cross-domain results (RL+LLMs), but the impact is narrower and more incremental relative to broader, timely concerns about opaque latent reasoning.
Paper 2 addresses a highly critical and cutting-edge problem in AI safety: interpreting and aligning continuous (latent space) thought models. By introducing a new benchmark, novel training paradigms, and empirical findings on probing latent representations for misalignment, it offers deep methodological rigor and significant implications for safe AI development. Paper 1 provides a useful documentation tool for prompt engineering, but its scientific contribution is more structural than empirical, making Paper 2 substantially more impactful for advancing AI research.
Paper 2 is more novel and timely, addressing a core emerging safety problem: detecting misaligned reasoning in continuous (latent) thought models where interpretability is limited. It contributes a sizable benchmark (MoralChain), a concrete threat model (dual-trigger backdoor), and empirical evidence (geometric separation, probe transfer, early-token signals) that can shape future safety monitoring methods and evaluations. Its applications span alignment, interpretability, and security across model classes. Paper 1 is useful infrastructure/standardization, but likely yields narrower scientific impact and depends on community adoption.
Paper 2 is more timely and immediately actionable, addressing a central AI safety problem (detecting hidden misaligned reasoning) in an emerging model class (continuous thought). It contributes concrete artifacts (MoralChain benchmark, dual-trigger backdoor setup) and empirically supported detection methods (latent geometry evidence, probe transfer, early-token localization) that can be adopted and extended by the community, with clear real-world implications for deployment monitoring. Paper 1 is conceptually ambitious and cross-disciplinary, but its impact depends more on broad acceptance and extensive follow-on validation across domains.
Paper 2 likely has higher impact due to broader applicability and timeliness: long-horizon agent reliability is a central bottleneck across many real-world deployments. HORIZON provides a cross-domain benchmark, substantial trajectory dataset, and a validated, scalable failure-attribution methodology (LLM-as-judge with strong human agreement), enabling standardized comparison and iterative progress. Paper 1 is novel and important for safety in continuous-thought models, but its impact depends on adoption of that specific modeling paradigm and the relevance of its backdoor setup beyond controlled conditions.
Paper 2 is more novel and timely by directly addressing an emerging safety gap: monitoring misalignment in latent-space “continuous thought” reasoning where interpretability is limited. It contributes a new benchmark (MoralChain), a clear experimental paradigm (dual-trigger backdoor), and concrete, transferable detection results (linear probes, early-token localization) with broad relevance to alignment, mechanistic interpretability, and security. Paper 1 is impactful for agent training infrastructure, but resembles ongoing environment-scaling trends and may be more incremental; its applications are strong yet narrower than foundational safety advances.
Paper 2 likely has higher scientific impact due to broader applicability and timeliness: scalable environment synthesis plus continuous self-evolving training directly addresses a central bottleneck for real-world LLM agents, with results across 23 benchmarks and clear scaling analyses. Its approach can generalize across domains (tool use, RL, curriculum generation, lifelong learning) and may influence both academic and industrial agent training pipelines. Paper 1 is novel and important for safety of latent-reasoning models, but its impact is narrower (specific to continuous thought safety diagnostics) and depends on adoption of such architectures.
Paper 2 addresses a critical, highly timely bottleneck in AI safety: interpreting and aligning continuous (latent) thought models. As models move toward uninterpretable latent reasoning, guaranteeing alignment becomes an urgent foundational challenge. The introduction of a novel benchmark, dual-trigger training, and linear probing techniques provides significant methodological innovation. While Paper 1 offers a valuable framework for human-AI collaboration, Paper 2's focus on the safety and interpretability of frontier AI systems gives it broader applicability, higher urgency, and a larger potential impact across the machine learning community.
Paper 2 addresses a fundamental methodological challenge—unbiased prevalence estimation under covariate shift—that spans public health, social sciences, and AI. By providing theoretical guarantees and practical solutions using multicalibration, it offers a broader cross-disciplinary impact compared to Paper 1, which focuses on a niche, albeit important, problem within AI safety and continuous thought models.
Paper 1 is more novel and timely by targeting safety/interpretability for continuous-thought (latent reasoning) models, a growing direction with high stakes. It introduces a dedicated benchmark (MoralChain), a dual-trigger backdoor paradigm, and probe-based detection that generalizes to “armed-but-benign” states, offering a concrete methodology for monitoring latent planning phases. The potential real-world impact spans AI safety, security, and governance across many applications of LLMs. Paper 2 is useful and broadly applicable for spatial reasoning, but its impact is narrower and less urgent than latent misalignment detection.
Paper 1 addresses a critical and timely AI safety problem—detecting misaligned reasoning in latent/continuous thought models—which is increasingly relevant as models move beyond interpretable chain-of-thought. The dual-trigger paradigm and MoralChain benchmark offer novel tools for safety research, and the finding that linear probes can transfer to detect hidden misalignment has broad implications for AI alignment. Paper 2 presents a creative contribution on spatial reasoning via ASCII construction training, but its scope is narrower. Paper 1's impact spans AI safety, interpretability, and alignment—fields of growing urgency—giving it higher potential impact.
Paper 2 likely has higher impact due to strong timeliness and cross-field relevance: it targets safety for latent/continuous-thought LLMs, a rapidly emerging paradigm with broad implications for AI alignment, interpretability, and security. It contributes a new benchmark (MoralChain), a novel dual-trigger backdoor setup, and empirically supported detection via transferable probes—methods that can generalize to monitoring and auditing future reasoning architectures. Paper 1 is solid and applied with clear real-world utility, but its scope is more domain-specific and incremental relative to the fast-moving, high-stakes LLM safety landscape.
Paper 1 offers a broadly applicable theoretical result: multicalibration is sufficient for unbiased prevalence estimation under covariate shift, connecting fairness theory to a ubiquitous measurement/quantification problem across sciences. It includes proofs, simulations, and two real-world empirical case studies, suggesting methodological rigor and near-term deployability in public health, social science, and trust/safety. Paper 2 is timely and innovative for latent-reasoning safety, but its impact is narrower (focused on continuous thought models) and relies on a constructed benchmark/backdoor setup whose external validity to real deployed systems is less certain.
Paper 1 presents a groundbreaking mechanistic investigation into emotion representations in a frontier LLM (Claude 4.5), demonstrating causal links between internal emotion concepts and alignment-critical behaviors like reward hacking and sycophancy. This has immediate, broad implications for AI safety, interpretability, and understanding LLM behavior at scale. Paper 2 addresses an important but more niche safety concern for continuous thought models—a paradigm still in early adoption. While rigorous, its impact is narrower and more anticipatory. Paper 1's findings are actionable now for deployed systems and span interpretability, alignment, and cognitive science.
Paper 2 addresses a critical, emerging problem in AI safety—interpreting and aligning continuous thought models that reason in latent space. As AI models move towards non-textual latent reasoning for efficiency, detecting hidden misaligned reasoning becomes paramount. Its novel dual-trigger backdoor paradigm and latent space probing techniques offer highly timely and impactful contributions to AI alignment and interpretability, likely influencing future next-generation model architectures more profoundly than the social bias study in Paper 1.
Paper 2 has higher potential impact due to its timeliness and broad relevance to AI safety and interpretability in emerging continuous-thought (latent-reasoning) LLMs. It introduces a new benchmark (MoralChain) and a concrete threat model (dual-trigger backdoor) with empirically supported detection signals (geometric separation, probe transfer, early-token encoding), offering actionable monitoring implications that could influence safety practices across many LLM applications. Paper 1 is methodologically strong and valuable for offline MARL efficiency/coordination, but its domain is narrower and less cross-cutting than latent-reasoning safety.
Paper 2 offers concrete empirical contributions, a new benchmark, and addresses a highly timely and critical problem in AI safety regarding continuous thought models. Its actionable insights into latent space geometry and linear probing provide an immediate foundation for future technical research. Paper 1, while conceptually interesting, is a position paper lacking empirical validation, making its direct scientific and methodological impact less immediate and measurable.
Paper 1 has higher potential impact due to its novelty and timeliness in addressing safety/monitoring for continuous-thought (latent reasoning) LLMs, a rapidly emerging paradigm. It introduces a sizable benchmark (MoralChain), a dual-trigger backdoor setup, and concrete, transferable detection results via probes with insights about early “planning” states—directly relevant to AI safety and deployment. Paper 2 is methodologically solid and useful for explainability in BSNNs, but is narrower (MNIST-scale, specialized model class) and likely to affect fewer domains than latent-reasoning alignment monitoring.
Paper 1 addresses a critical and highly timely AI safety issue: detecting hidden misalignment in emerging continuous (latent) thought models. By providing a novel benchmark and demonstrating that misaligned latent reasoning can be probed geometrically before harmful outputs occur, it tackles a fundamental bottleneck in AI alignment. Paper 2, while offering a solid co-evolutionary framework for LLM agents, represents a more incremental advance in a crowded subfield (game-playing agents), making Paper 1's foundational safety contributions more likely to achieve broad scientific impact.