Think Fast: Estimating No-CoT Task-Completion Time Horizons of Frontier AI Models

Dewi Gould, Francis Rhys Ward, Anders Cairns Woodruff, Rauno Arike, Josh Hills, Alex Serrano, Ida Caspary, Jason Ross Brown

cs.AI
#259 of 3489 · Artificial Intelligence
Tournament Score: 1515±44
Win Rate: 80% (20 wins, 5 losses, 25 matches)
Rating: 7.5/10
Significance 8.5 · Rigor 7 · Novelty 7 · Clarity 8

Abstract

Many efforts to ensure frontier AI models are safe rely on monitoring their chain-of-thought (CoT) reasoning. If models become able to perform sufficiently complex reasoning internally, without explicit thinking tokens, this would undermine such oversight. We measure how well frontier models reason without CoT across a suite of over 30,000 questions spanning 43 benchmarks in domains including math, coding, puzzles, causality, theory-of-mind, and strategic reasoning. To compare models against humans, we estimate the 50%-task-completion time horizon (TH): the human time required for tasks a model completes with 50% success rate. We complement this with a 50% reasoning token horizon: the minimum number of o3-mini reasoning tokens needed for tasks a model solves with 50% success rate. We find that the no-CoT 50% TH of frontier models has been doubling roughly every year over the past six years, with GPT-5.5's TH reaching over 3 minutes and reasoning token horizon exceeding 1,500 tokens. Our median estimates predict that frontier no-CoT THs could exceed 7 minutes by 2028, and 25 minutes by 2030, though these projections carry substantial uncertainty. We recommend frontier developers track this explicitly.

AI Impact Assessments

(1 model)

Scientific Impact Assessment: "Think Fast: Estimating No-CoT Task-Completion Time Horizons of Frontier AI Models"

1. Core Contribution

This paper introduces a systematic methodology for tracking the latent reasoning capabilities of frontier AI models—their ability to solve tasks *without* externalizing reasoning in chain-of-thought tokens. The central contribution is threefold: (a) a benchmark suite of 43 tasks (30,000+ questions) spanning diverse domains; (b) two complementary metrics—a 50% task-completion time horizon (TH) anchored to human solve times, and a 50% reasoning token horizon anchored to o3-mini's minimum reasoning tokens; and (c) a temporal trend analysis showing no-CoT THs have doubled approximately every 373 days over six years.

The paper directly addresses a critical question for AI safety: if CoT monitoring is a key safety mechanism, how much reasoning can models perform *without* CoT, and how fast is this capability growing? This framing is original and practically important.
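As a rough illustration of what a 50% time horizon means computationally, the sketch below fits a logistic curve of success against log human solve time and reads off the time at which predicted success crosses 50%. This follows the general recipe of Kwa et al. rather than this paper's exact procedure, which the review does not spell out. The function name `fit_time_horizon`, the Newton solver, and the toy data are all assumptions made for illustration.

```python
import numpy as np

def fit_time_horizon(human_minutes, solved, iters=50):
    """Return the human time (minutes) at which fitted success probability is 50%.

    Fits P(success) = sigmoid(a + b * log2(minutes)) by Newton's method.
    Hypothetical helper for illustration; not the paper's implementation.
    """
    x = np.log2(np.asarray(human_minutes, dtype=float))
    y = np.asarray(solved, dtype=float)
    X = np.column_stack([np.ones_like(x), x])  # intercept + log-time feature
    w = np.zeros(2)
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(-(X @ w)))
        grad = X.T @ (y - p)
        # Tiny diagonal term keeps the Hessian invertible; it does not move the optimum.
        hess = (X * (p * (1.0 - p))[:, None]).T @ X + 1e-9 * np.eye(2)
        w = w + np.linalg.solve(hess, grad)
    a, b = w
    # Success probability is 50% where a + b * log2(t) = 0.
    return float(2.0 ** (-a / b))


# Toy data (assumed): 20 attempts per human-time level, success falling with task length.
minutes = [0.25, 0.5, 1, 2, 4, 8, 16]
successes = [19, 18, 15, 10, 5, 2, 1]
times = np.repeat(minutes, 20)
solved = np.concatenate([np.r_[np.ones(k), np.zeros(20 - k)] for k in successes])
print(fit_time_horizon(times, solved))
```

The toy data is symmetric around the 2-minute level, so the fitted horizon comes out at about 2 minutes. Real data with perfectly separable outcomes would need a regularized fit, which this sketch omits.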

2. Methodological Rigor

Strengths in methodology:

  • The hierarchical bootstrap procedure propagating uncertainty from four sources (benchmarks, problems, model runs, and human time estimates) is well-designed and represents a genuine methodological advance over Kwa et al.'s approach.
  • The calibration of LLM-estimated human solve times using in-context learning with real timing data (achieving MALR = 0.55, ~1.7× average error) is carefully validated across 20 benchmarks with real data.
  • The dual-anchor approach (human time + reasoning tokens) provides independent validation of the trend, with both anchors showing consistent exponential growth.
  • Extensive robustness checks: leave-one-out model sensitivity, restricting to real human times only, alternative noise models, different benchmark subsets, and comparison of exponential vs. linear vs. hyperbolic fits.
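The hierarchical bootstrap mentioned in the first bullet can be sketched in a simplified two-level form: resample benchmarks, then resample problems within each drawn benchmark. The paper reportedly propagates four sources of uncertainty; model runs and human-time noise are left out here. The function name, data layout, and the choice of mean accuracy as the statistic are assumptions for illustration only.

```python
import numpy as np

def hierarchical_bootstrap(results, statistic, n_boot=1000, seed=0):
    """Two-level bootstrap over benchmarks and problems (illustrative only).

    results: dict mapping benchmark name -> 1-D sequence of per-problem outcomes.
    Returns the 2.5th, 50th and 97.5th percentiles of the resampled statistic.
    """
    rng = np.random.default_rng(seed)
    names = list(results)
    draws = np.empty(n_boot)
    for i in range(n_boot):
        # Level 1: resample benchmarks with replacement.
        chosen = rng.choice(len(names), size=len(names), replace=True)
        sample = []
        for j in chosen:
            scores = np.asarray(results[names[j]], dtype=float)
            # Level 2: resample problems within the chosen benchmark.
            idx = rng.integers(0, len(scores), size=len(scores))
            sample.append(scores[idx])
        draws[i] = statistic(np.concatenate(sample))
    return np.percentile(draws, [2.5, 50, 97.5])


# Toy outcomes (assumed): 1 = solved without CoT, 0 = not solved.
toy = {"math": [1, 0, 1, 1], "puzzles": [0, 0, 1], "coding": [1, 1, 0, 0, 1]}
lo, med, hi = hierarchical_bootstrap(toy, np.mean)
print(lo, med, hi)
```

In a fuller version, the statistic would be a time-horizon fit rather than mean accuracy, and each replicate would also redraw model runs and perturb the human time estimates.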
Methodological concerns:

  • The 95% CI on the TH doubling time is very wide [167, 691 days], spanning a 4× range. This substantially limits the predictive power of forward projections.
  • The exclusion of GPT-2 and GPT-3 from trend fits (due to near-chance performance) is justified but means the trend relies on only ~10 models spanning roughly 4 years of the claimed 6-year trend.
  • The paper acknowledges but does not fully resolve the conflation between latent reasoning, memorization, and heuristic pattern-matching. The benchmarks include some synthetic tasks designed to isolate reasoning, but many existing benchmarks may reward retrieval rather than computation.
  • Elicitation of no-CoT behavior is inherently difficult, and the authors acknowledge this may systematically underestimate latent reasoning capabilities.
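To make the first concern concrete, the sketch below converts a doubling time into a growth factor over a fixed period, using the point estimate (373 days) and the two ends of the reported interval (167 and 691 days). It assumes pure exponential growth and a fixed starting TH, so it will not reproduce the paper's own projection intervals, which come from its bootstrap. The function name and the four-year window are illustrative choices.

```python
def growth_factor(days_ahead, doubling_days):
    """Multiplicative growth after days_ahead, assuming a constant doubling time."""
    return 2.0 ** (days_ahead / doubling_days)


horizon_days = 4 * 365  # illustrative four-year window
for label, doubling in [("fast end", 167), ("point estimate", 373), ("slow end", 691)]:
    factor = growth_factor(horizon_days, doubling)
    print(f"{label}: doubling every {doubling} days -> x{factor:.1f} over four years")
```

Over four years the implied growth ranges from about 4× at the slow end to over 400× at the fast end, with about 15× at the point estimate. A 4× range in doubling time therefore turns into a spread of roughly two orders of magnitude in the projection.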
3. Potential Impact

Direct impact on AI safety policy: This work provides quantitative evidence that models' ability to reason without observable CoT is growing predictably. This directly informs the viability of CoT monitoring as a safety strategy. The comparison showing no-CoT THs doubling at roughly half the rate of with-CoT THs (373 vs. 182 days) is policy-relevant: it suggests that covert capability is growing, but more slowly than overt capability, so the gap between what models can do overtly and covertly widens over time.
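The two doubling times quoted above imply a simple expression for the gap. As a sketch that assumes both trends are exactly exponential and continue unchanged (with $t$ in days and the subscripts introduced here for illustration), the ratio of with-CoT to no-CoT time horizons evolves as:

```latex
\[
\frac{\mathrm{TH}_{\text{CoT}}(t)}{\mathrm{TH}_{\text{no-CoT}}(t)}
  = \frac{\mathrm{TH}_{\text{CoT}}(0)}{\mathrm{TH}_{\text{no-CoT}}(0)}
    \cdot 2^{\,t\left(\frac{1}{182} - \frac{1}{373}\right)},
\qquad
T_{\text{gap}} = \left(\frac{1}{182} - \frac{1}{373}\right)^{-1}
  = \frac{182 \cdot 373}{373 - 182} \approx 355 \text{ days}.
\]
```

Under these assumptions the ratio itself doubles roughly every 355 days. Given the wide interval on the no-CoT doubling time, this figure is indicative at best.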

Benchmark contribution: The 43-benchmark suite in the Inspect framework, covering math, coding, puzzles, safety-relevant tasks (steganography, scheming detection, sabotage), and novel synthetic tasks (causal reasoning, hash chains), is a substantial resource. The inclusion of safety-relevant benchmarks is particularly valuable.

Scaling analysis: The open-weight analysis (35 models) revealing that doubling TH requires 4.2× total parameters or 1.3× layers, with MoE models scaling much more slowly than dense models (8.1× vs. 2.2× parameter increase per TH doubling), provides actionable architectural insights.
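The "factor per TH doubling" figures can be restated as power-law exponents, which some readers may find easier to compare. The sketch assumes TH scales as a power of the quantity in question (TH ∝ N^alpha), an assumption the review does not state explicitly. The function name is hypothetical.

```python
import math

def th_scaling_exponent(factor_per_doubling):
    """Exponent alpha in TH ∝ N**alpha, given TH doubles when N grows by this factor."""
    return math.log(2.0) / math.log(factor_per_doubling)


# Factors taken from the review; the power-law form itself is an assumption.
for label, factor in [("total parameters", 4.2), ("dense parameters", 2.2),
                      ("MoE parameters", 8.1), ("layers", 1.3)]:
    print(f"{label}: alpha = {th_scaling_exponent(factor):.2f}")
```

Read this way, the dense-model exponent (about 0.88) is well over twice the MoE exponent (about 0.33), which is the same contrast the review draws in terms of parameter multiples.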

4. Timeliness & Relevance

This paper is exceptionally timely. CoT monitoring is currently one of the primary proposed safety mechanisms for frontier AI systems, and multiple labs are investing in it. Understanding its limitations—specifically, how much reasoning models can perform without observable CoT—is among the most pressing questions in AI safety. The paper's recommendation that frontier labs explicitly track no-CoT capabilities is practical and well-motivated.

5. Strengths & Limitations

Key strengths:

  • Addresses a genuinely important and under-measured quantity for AI safety
  • Large-scale evaluation across 14 frontier models and 35 open-weight models
  • Careful uncertainty quantification with multiple robustness checks
  • Novel reasoning token horizon metric provides model-grounded complement to human time anchors
  • Practical recommendations with clear policy implications
  • The filler token and question repeat ablations provide additional mechanistic insight
Notable limitations:

  • Wide confidence intervals significantly limit the utility of forward projections (the 2030 projection spans 9 to 615 minutes)
  • The benchmark suite, while diverse, may not adequately capture the types of reasoning most relevant to safety concerns (e.g., long-horizon planning for deceptive behavior)
  • Human time estimation relies heavily on LLM-generated estimates for 11 benchmarks, introducing systematic uncertainty that is difficult to fully characterize
  • The trend analysis necessarily assumes continued exponential growth, which may not hold as architectural paradigms shift
  • Some of the strongest recent models' performance on safety-relevant tasks (SHADE monitoring, steganography) suggests these tasks may be more about pattern recognition than deep reasoning
Comparison to prior art: This work substantively extends Kwa et al.'s TH methodology by adding uncertainty modeling for human times, introducing the reasoning token anchor, and focusing specifically on the no-CoT regime. The comparison between the two trend lines (Figure 2) is one of the paper's most valuable contributions.

Overall Assessment

This is a well-executed empirical study addressing a timely and important question. While the wide confidence intervals limit precise forecasting, the directional finding—that no-CoT capabilities are growing exponentially and have reached non-trivial levels—is robust and consequential. The work's greatest contribution is establishing a measurement framework and baseline that can be tracked over time, rather than any single projection.

Rating: 7.5/10
Significance 8.5 · Rigor 7 · Novelty 7 · Clarity 8

Generated Jun 8, 2026

Comparison History (25)

Won vs. From Context-Aware to Conflict-Aware: Generalizing Contrastive Decoding for Knowledge Conflict in LLMs

While Paper 1 offers a strong algorithmic improvement for handling LLM knowledge conflicts, Paper 2 has broader potential impact due to its critical implications for AI safety and oversight. By quantifying the scaling of no-chain-of-thought reasoning, it exposes a fundamental vulnerability in current monitoring paradigms. Its predictive time horizons and large-scale empirical analysis across numerous benchmarks provide essential insights for future AI alignment research, policy-making, and understanding the trajectory of frontier models.

gemini-3.1-pro-preview · Jun 10, 2026

Lost vs. Next-Token Prediction Learns Generalisable Representations of Sleep Physiology

Hypnos introduces a novel foundation model paradigm (next-token prediction) for multi-modal physiological signals, demonstrating strong results across sleep medicine, cardiology, and neurology with significantly less labeled data. It offers broad healthcare applications, methodological innovation (RQ-Transformer with residual vector quantization across 8 modalities), and cross-domain generalization. Paper 2, while timely for AI safety, is primarily a measurement/benchmarking study with trend extrapolation rather than a methodological contribution, and its impact is narrower, focused on AI safety monitoring policy rather than advancing scientific methodology.

claude-opus-4-6 · Jun 9, 2026

Lost vs. Optical Reasoning: Rethinking Images as an Expressive Reasoning Medium Beyond Text

Paper 1 introduces a genuinely novel paradigm—using images as the primary reasoning medium for LLMs—which challenges fundamental assumptions about how reasoning should be represented. It demonstrates practical benefits (reduced token usage, maintained/improved performance) across multiple benchmarks and opens a new research direction with broad implications for multimodal AI efficiency and reasoning. Paper 2, while timely and relevant to AI safety, is primarily a measurement/forecasting study that documents trends rather than introducing a new methodology, limiting its potential to spawn follow-up research and applications.

claude-opus-4-6 · Jun 9, 2026

Won vs. Anything2Skill: Compiling External Knowledge into Reusable Skills for Agents

Paper 1 addresses a critical AI safety concern—models reasoning without chain-of-thought, undermining oversight mechanisms—with a rigorous large-scale evaluation (30,000+ questions, 43 benchmarks) and introduces novel metrics (time horizons, reasoning token horizons). Its findings have broad implications for AI governance, safety policy, and alignment research, with actionable recommendations for frontier developers. Paper 2, while practical, presents an incremental engineering contribution to RAG systems with domain-specific results. Paper 1's timeliness, safety relevance, and cross-disciplinary impact give it substantially higher potential scientific influence.

claude-opus-4-6 · Jun 9, 2026

Won vs. SIFT: Selective-Index For Fast Compute of RAG Prefill by Exploiting Attention Invariance

Paper 2 likely has higher scientific impact due to broader relevance and timeliness: it introduces a scalable evaluation framework (no-CoT task-completion time horizons) across 43 benchmarks and 30k+ questions, with implications for AI safety governance, monitoring, and capability forecasting across many model families. Its findings (doubling trends, projections) are actionable for policymakers and developers. Paper 1 is innovative and rigorous systems work with clear practical value for RAG latency, but its impact is narrower (RAG prefill optimization on specific hardware/serving stacks) and more incremental relative to the wide cross-field significance of Paper 2.

gpt-5.2 · Jun 9, 2026

Lost vs. Agents' Last Exam

Paper 2 introduces a comprehensive, expertly-crafted benchmark that addresses a universal bottleneck in AI: the gap between standard benchmark performance and real-world economic utility. By providing a standardized evaluation framework for long-horizon agentic tasks, it is likely to become a foundational metric driving both research and industry applications. While Paper 1 addresses a critical AI safety concern regarding internalized reasoning, Paper 2's broad, field-defining utility gives it a wider potential scientific and economic impact.

gemini-3.1-pro-preview · Jun 8, 2026

Won vs. The Sim-to-Real Gap of Foundation Model Agents: A Unified MDP Perspective

Paper 2 introduces a novel, quantitative metric (no-CoT task-completion time horizon) for measuring a concrete AI safety concern—models reasoning internally without observable chain-of-thought. It provides empirical measurements across 43 benchmarks with 30,000+ questions, offers actionable projections, and directly addresses a critical gap in AI safety monitoring. Paper 1 proposes a conceptual framework mapping sim-to-real gaps to foundation model agents using existing MDP formalism, but is more of a position/agenda paper without substantial new empirical contributions. Paper 2's timeliness, methodological rigor, and direct policy relevance give it higher impact potential.

claude-opus-4-6 · Jun 8, 2026

Won vs. Quantum-Inspired Trace-Augmented Evidence Selection for Reasoning over Structured Hypothesis Spaces

Paper 2 addresses a critical and urgent issue in AI safety: the ability of frontier models to perform complex reasoning without observable Chain-of-Thought. Its massive empirical scale (30,000 questions across 43 benchmarks) and introduction of standardized capability metrics (Time Horizon and reasoning token horizon) provide foundational tools for future AI capability tracking and policy-making. While Paper 1 introduces a highly novel quantum-inspired optimization approach for LLM reasoning, Paper 2's direct relevance to AI alignment, oversight, and scaling laws gives it broader and more immediate impact across the AI research community.

gemini-3.1-pro-preview · Jun 8, 2026

Won vs. DyCon: Dynamic Reasoning Control via Evolving Difficulty Modeling

Paper 1 addresses a fundamental AI safety concern—the ability of frontier models to reason without chain-of-thought, which could undermine oversight mechanisms. It introduces novel metrics (time horizons, reasoning token horizons), provides large-scale empirical measurements across 43 benchmarks, and offers actionable forecasts for the AI safety community. Its implications span AI governance, policy, and alignment research. Paper 2, while practically useful for efficiency optimization, addresses a narrower technical problem (overthinking in LRMs) with incremental methodology. Paper 1's broader safety implications and policy relevance give it higher potential impact.

claude-opus-4-6 · Jun 8, 2026

Won vs. A Study of Parallel Continuous Local Search

Paper 2 addresses a critical AI safety question—whether frontier models can reason without chain-of-thought, undermining oversight mechanisms. It introduces novel metrics (time horizons, reasoning token horizons), covers 43 benchmarks with 30,000+ questions, and provides actionable forecasts relevant to AI governance. Its breadth of impact spans AI safety, policy, and capability evaluation, making it highly timely. Paper 1, while technically sound, addresses a narrower optimization topic (continuous local search for SAT) with incremental empirical findings and more limited audience and real-world implications.

claude-opus-4-6 · Jun 8, 2026