Dewi Gould, Francis Rhys Ward, Anders Cairns Woodruff, Rauno Arike, Josh Hills, Alex Serrano, Ida Caspary, Jason Ross Brown
Many efforts to ensure frontier AI models are safe rely on monitoring their chain-of-thought (CoT) reasoning. If models become able to perform sufficiently complex reasoning internally, without explicit thinking tokens, this would undermine such oversight. We measure how well frontier models reason without CoT across a suite of over 30,000 questions spanning 43 benchmarks in domains including math, coding, puzzles, causality, theory-of-mind, and strategic reasoning. To compare models against humans, we estimate the 50%-task-completion time horizon (TH): the human time required for tasks a model completes with a 50% success rate. We complement this with a 50% reasoning token horizon: the minimum number of o3-mini reasoning tokens needed for tasks a model solves with a 50% success rate. We find that the no-CoT TH of frontier models has been doubling roughly every year over the past six years, with GPT-5.5's TH reaching over 3 minutes and its reasoning token horizon exceeding 1,500 tokens. Our median estimates predict that frontier no-CoT THs could exceed 7 minutes by 2028, and 25 minutes by 2030, though these projections carry substantial uncertainty. We recommend that frontier developers track this capability explicitly.
This paper introduces a systematic methodology for tracking the latent reasoning capabilities of frontier AI models—their ability to solve tasks *without* externalizing reasoning in chain-of-thought tokens. The central contribution is threefold: (a) a benchmark suite of 43 tasks (30,000+ questions) spanning diverse domains; (b) two complementary metrics—a 50% task-completion time horizon (TH) anchored to human solve times, and a 50% reasoning token horizon anchored to o3-mini's minimum reasoning tokens; and (c) a temporal trend analysis showing no-CoT THs have doubled approximately every 373 days over six years.
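As a rough illustration of how a 50% time horizon can be read off from task outcomes, the sketch below fits a logistic curve of success against log2 human solve time and solves for the 50% crossing. This is an assumed, simplified procedure in the spirit of the Kwa et al. approach referenced in this review, not the paper's actual pipeline (which also models uncertainty in human times). The function names and the toy data are hypothetical.

```python
import math

def fit_logistic_1d(xs, ys, steps=2000, lr=0.5):
    """Fit P(success) = sigmoid(a + b * x) by gradient ascent on the mean log-likelihood."""
    a, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        grad_a = grad_b = 0.0
        for x, y in zip(xs, ys):
            p = 1.0 / (1.0 + math.exp(-(a + b * x)))
            grad_a += (y - p) / n
            grad_b += (y - p) * x / n
        a += lr * grad_a
        b += lr * grad_b
    return a, b

def time_horizon(human_minutes, successes, target=0.5):
    """Human time (minutes) at which the fitted success probability equals `target`."""
    logs = [math.log2(t) for t in human_minutes]
    centre = sum(logs) / len(logs)  # centring the feature keeps the fit well conditioned
    a, b = fit_logistic_1d([x - centre for x in logs], successes)
    logit = math.log(target / (1.0 - target))  # 0 when target is 50%
    return 2.0 ** (centre + (logit - a) / b)

# Hypothetical toy data: one task per human solve time, 1 = solved without CoT.
toy_minutes = [0.25, 0.5, 1, 2, 4, 8, 16, 32]
toy_success = [1, 1, 1, 0, 1, 0, 0, 0]

th = time_horizon(toy_minutes, toy_success)
print(f"50% time horizon: {th:.2f} minutes")  # about 2.83 (= 2**1.5, by symmetry of the toy data)
```

The same shape of calculation would apply to the reasoning token horizon if the x-axis were o3-mini reasoning tokens instead of human minutes, again under the assumption of a logistic relationship.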
The paper directly addresses a critical question for AI safety: if CoT monitoring is a key safety mechanism, how much reasoning can models perform *without* CoT, and how fast is this capability growing? This framing is original and practically important.
Direct impact on AI safety policy: This work provides quantitative evidence that models' ability to reason without observable CoT is growing predictably. This directly informs the viability of CoT monitoring as a safety strategy. The comparison showing no-CoT THs doubling at roughly half the rate of with-CoT THs (373 vs. 182 days) is policy-relevant: it suggests a growing but bounded gap between what models can do overtly versus covertly.
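The "half the rate" comparison follows directly from the two doubling times quoted above. A minimal sketch, assuming pure exponential growth with a constant doubling time (the function name and the 365.25-day year are illustrative choices, not taken from the paper):

```python
def annual_growth_factor(doubling_days, days_per_year=365.25):
    """Multiplicative growth in time horizon over one year, given a doubling time in days."""
    return 2.0 ** (days_per_year / doubling_days)

no_cot = annual_growth_factor(373)    # no-CoT doubling time from the review
with_cot = annual_growth_factor(182)  # with-CoT doubling time from the review

print(f"no-CoT:   x{no_cot:.2f} per year")
print(f"with-CoT: x{with_cot:.2f} per year")
print(f"doubling-time ratio: {373 / 182:.2f}")
```

Under this assumption the no-CoT horizon grows by a little under 2x per year and the with-CoT horizon by roughly 4x per year, which is what a doubling time about twice as long implies.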
Benchmark contribution: The 43-benchmark suite in the Inspect framework, covering math, coding, puzzles, safety-relevant tasks (steganography, scheming detection, sabotage), and novel synthetic tasks (causal reasoning, hash chains), is a substantial resource. The inclusion of safety-relevant benchmarks is particularly valuable.
Scaling analysis: The open-weight analysis (35 models) revealing that doubling TH requires 4.2× total parameters or 1.3× layers, with MoE models scaling much more slowly than dense models (8.1× vs. 2.2× parameter increase per TH doubling), provides actionable architectural insights.
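One way to read these "k× per TH doubling" figures is as power-law exponents: if doubling TH requires multiplying a quantity N by k, then TH scales as N raised to 1 / log2(k). The sketch below does that conversion. It assumes a clean power-law relationship, which the review does not state explicitly, and the function and dictionary names are hypothetical.

```python
import math

def scaling_exponent(factor_per_doubling):
    """Exponent alpha in TH ~ N**alpha, given the factor in N needed to double TH."""
    return 1.0 / math.log2(factor_per_doubling)

# Factors per TH doubling as quoted in the review.
factors = {
    "total parameters": 4.2,
    "layers": 1.3,
    "MoE parameters": 8.1,
    "dense parameters": 2.2,
}

exponents = {name: scaling_exponent(k) for name, k in factors.items()}
for name, alpha in exponents.items():
    print(f"{name}: TH ~ N^{alpha:.2f}")
```

On this reading the exponents come out at roughly 0.48 for total parameters, 2.64 for layers, 0.33 for MoE parameters, and 0.88 for dense parameters, so a smaller factor per doubling corresponds to a steeper dependence of TH on that quantity.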
This paper is exceptionally timely. CoT monitoring is currently one of the primary proposed safety mechanisms for frontier AI systems, and multiple labs are investing in it. Understanding its limitations—specifically, how much reasoning models can perform without observable CoT—is among the most pressing questions in AI safety. The paper's recommendation that frontier labs explicitly track no-CoT capabilities is practical and well-motivated.
Comparison to prior art: This work substantively extends Kwa et al.'s TH methodology by adding uncertainty modeling for human times, introducing the reasoning token anchor, and focusing specifically on the no-CoT regime. The comparison between the two trend lines (Figure 2) is one of the paper's most valuable contributions.
This is a well-executed empirical study addressing a timely and important question. While the wide confidence intervals limit precise forecasting, the directional finding—that no-CoT capabilities are growing exponentially and have reached non-trivial levels—is robust and consequential. The work's greatest contribution is establishing a measurement framework and baseline that can be tracked over time, rather than any single projection.
Generated Jun 8, 2026
While Paper 1 offers a strong algorithmic improvement for handling LLM knowledge conflicts, Paper 2 has broader potential impact due to its critical implications for AI safety and oversight. By quantifying the scaling of no-chain-of-thought reasoning, it exposes a fundamental vulnerability in current monitoring paradigms. Its predictive time horizons and large-scale empirical analysis across numerous benchmarks provide essential insights for future AI alignment research, policy-making, and understanding the trajectory of frontier models.
Hypnos introduces a novel foundation model paradigm (next-token prediction) for multi-modal physiological signals, demonstrating strong results across sleep medicine, cardiology, and neurology with significantly less labeled data. It offers broad healthcare applications, methodological innovation (RQ-Transformer with residual vector quantization across 8 modalities), and cross-domain generalization. Paper 2, while timely for AI safety, is primarily a measurement/benchmarking study with trend extrapolation rather than a methodological contribution, and its impact is narrower, focused on AI safety monitoring policy rather than advancing scientific methodology.
Paper 1 introduces a genuinely novel paradigm—using images as the primary reasoning medium for LLMs—which challenges fundamental assumptions about how reasoning should be represented. It demonstrates practical benefits (reduced token usage, maintained/improved performance) across multiple benchmarks and opens a new research direction with broad implications for multimodal AI efficiency and reasoning. Paper 2, while timely and relevant to AI safety, is primarily a measurement/forecasting study that documents trends rather than introducing a new methodology, limiting its potential to spawn follow-up research and applications.
Paper 1 addresses a critical AI safety concern—models reasoning without chain-of-thought, undermining oversight mechanisms—with a rigorous large-scale evaluation (30,000+ questions, 43 benchmarks) and introduces novel metrics (time horizons, reasoning token horizons). Its findings have broad implications for AI governance, safety policy, and alignment research, with actionable recommendations for frontier developers. Paper 2, while practical, presents an incremental engineering contribution to RAG systems with domain-specific results. Paper 1's timeliness, safety relevance, and cross-disciplinary impact give it substantially higher potential scientific influence.
Paper 2 likely has higher scientific impact due to broader relevance and timeliness: it introduces a scalable evaluation framework (no-CoT task-completion time horizons) across 43 benchmarks and 30k+ questions, with implications for AI safety governance, monitoring, and capability forecasting across many model families. Its findings (doubling trends, projections) are actionable for policymakers and developers. Paper 1 is innovative and rigorous systems work with clear practical value for RAG latency, but its impact is narrower (RAG prefill optimization on specific hardware/serving stacks) and more incremental relative to the wide cross-field significance of Paper 2.
Paper 2 introduces a comprehensive, expertly-crafted benchmark that addresses a universal bottleneck in AI: the gap between standard benchmark performance and real-world economic utility. By providing a standardized evaluation framework for long-horizon agentic tasks, it is likely to become a foundational metric driving both research and industry applications. While Paper 1 addresses a critical AI safety concern regarding internalized reasoning, Paper 2's broad, field-defining utility gives it a wider potential scientific and economic impact.
Paper 2 introduces a novel, quantitative metric (no-CoT task-completion time horizon) for measuring a concrete AI safety concern—models reasoning internally without observable chain-of-thought. It provides empirical measurements across 43 benchmarks with 30,000+ questions, offers actionable projections, and directly addresses a critical gap in AI safety monitoring. Paper 1 proposes a conceptual framework mapping sim-to-real gaps to foundation model agents using existing MDP formalism, but is more of a position/agenda paper without substantial new empirical contributions. Paper 2's timeliness, methodological rigor, and direct policy relevance give it higher impact potential.
Paper 2 addresses a critical and urgent issue in AI safety: the ability of frontier models to perform complex reasoning without observable chain-of-thought. Its large empirical scale (over 30,000 questions across 43 benchmarks) and its standardized capability metrics (time horizon and reasoning token horizon) provide foundational tools for future AI capability tracking and policy-making. While Paper 1 introduces a highly novel quantum-inspired optimization approach for LLM reasoning, Paper 2's direct relevance to AI alignment, oversight, and scaling laws gives it broader and more immediate impact across the AI research community.
Paper 1 addresses a fundamental AI safety concern—the ability of frontier models to reason without chain-of-thought, which could undermine oversight mechanisms. It introduces novel metrics (time horizons, reasoning token horizons), provides large-scale empirical measurements across 43 benchmarks, and offers actionable forecasts for the AI safety community. Its implications span AI governance, policy, and alignment research. Paper 2, while practically useful for efficiency optimization, addresses a narrower technical problem (overthinking in LRMs) with incremental methodology. Paper 1's broader safety implications and policy relevance give it higher potential impact.
Paper 2 addresses a critical AI safety question—whether frontier models can reason without chain-of-thought, undermining oversight mechanisms. It introduces novel metrics (time horizons, reasoning token horizons), covers 43 benchmarks with 30,000+ questions, and provides actionable forecasts relevant to AI governance. Its breadth of impact spans AI safety, policy, and capability evaluation, making it highly timely. Paper 1, while technically sound, addresses a narrower optimization topic (continuous local search for SAT) with incremental empirical findings and more limited audience and real-world implications.