Mosh Levy, Yoav Goldberg, Asa Cooper Stickland
Trust in an AI system is often anchored by explanations of how it works, which one then uses to forecast its behavior on new inputs. For large reasoning models (LRMs), this conventional route is particularly difficult to follow: explanation methods for single token generations do not naturally generalize to long trajectories, and the trajectories themselves are often not faithful when read as natural language. We propose an alternative that bypasses the explanation step: treat behavior forecasting as a learnable task and train Behavior Forecasters that operate on a single reasoning trajectory to make the same forecasts one would typically seek from an explanation. The forecaster's training data is obtained by querying the LRM with no human annotation, and its inference is done in a single forward pass. We instantiate this approach on two tasks: how likely the LRM is to repeat its answer on re-runs, and how removing parts of the input changes its answer. We evaluate this approach on both tasks across three diverse reasoning datasets and find that trained Behavior Forecasters are more accurate than GPT-5.4 and Claude Opus-4.6 reading the same trajectories as naive readers, at a small fraction of their inference cost. We find that fine-tuning the backbone end-to-end and initializing it from the target LRM are each necessary for strong performance. These results show that the reasoning trajectory carries information about the LRM's future behavior that goes beyond what naive reading conveys.
The paper reframes the problem of understanding LRM behavior: rather than trying to explain *why* a model produced an answer (via interpretability methods), it proposes training external "Behavior Forecasters" that predict *what the model would do* under related conditions. Specifically, given a single reasoning trajectory (prompt, chain-of-thought, answer), a trained forecaster predicts (1) rerun consistency — the probability the LRM reproduces the same answer on reruns, and (2) counterfactual sensitivity — how much removing each input segment changes the answer probability.
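The two forecasting targets described above can be made concrete with a small sketch. This is an illustration under stated assumptions, not the paper's labeling code: the function names are hypothetical, invalid runs are represented as `None`, and counterfactual sensitivity is assumed to be the drop in the probability of the reference answer after a segment is removed.

```python
def rerun_consistency(reference_answer, rerun_answers):
    """Fraction of valid reruns that reproduce the reference answer.

    Assumption: a run that produced no parseable answer is passed as None
    and is excluded from the denominator.
    """
    valid = [a for a in rerun_answers if a is not None]
    if not valid:
        raise ValueError("no valid reruns to estimate consistency from")
    return sum(a == reference_answer for a in valid) / len(valid)


def counterfactual_sensitivity(reference_answer, original_runs, perturbed_runs):
    """Change in reference-answer probability when one input segment is removed.

    Assumption: sensitivity is measured as (probability on the original prompt)
    minus (probability on the prompt with the segment removed).
    """
    p_original = rerun_consistency(reference_answer, original_runs)
    p_perturbed = rerun_consistency(reference_answer, perturbed_runs)
    return p_original - p_perturbed


# Hypothetical example: 10 reruns of the original prompt, 10 with a segment removed.
original = ["A"] * 8 + ["B"] * 2
perturbed = ["A"] * 3 + ["B"] * 7
label_consistency = rerun_consistency("A", original)
label_sensitivity = counterfactual_sensitivity("A", original, perturbed)
```

A trained forecaster would be asked to predict values like `label_consistency` and `label_sensitivity` from the single reference trajectory, without running the reruns at inference time.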
The key insight is that reasoning trajectories encode information about the LRM's computational process that is not accessible through naive reading of the text (by humans or frontier LLMs), but *is* learnable through supervised training. This sidesteps the well-documented unfaithfulness problem of chain-of-thought reasoning, treating trajectories as data with latent patterns rather than natural language to be interpreted.
The experimental design is generally sound. The paper evaluates across three diverse reasoning datasets (FEVEROUS, RuleTaker, TreeCut), two target LRMs (OLMo-3-7B-Think, Qwen3.5-2B), and includes appropriate baselines (frontier LLMs as naive readers, single-location probes, random predictions). The use of Spearman correlation as the primary metric is appropriate for the regression targets.
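For readers unfamiliar with the metric, Spearman correlation is the Pearson correlation of the ranks, so it rewards a forecaster for ordering examples correctly even when its raw scores are miscalibrated. The following is a minimal standard-library sketch; the helper names are illustrative and it is not taken from the paper's evaluation code.

```python
def average_ranks(values):
    """1-based ranks, with tied values sharing the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the whole run of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Plain Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / (vx * vy) ** 0.5


def spearman(predictions, labels):
    """Spearman rank correlation: Pearson correlation applied to ranks."""
    return pearson(average_ranks(predictions), average_ranks(labels))
```

Because only ranks matter, `spearman([1, 2, 3, 4], [1, 4, 9, 16])` is a perfect correlation even though the relationship is nonlinear.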
Several methodological choices strengthen credibility: (a) using OLMo as the primary target because its training data is fully released, enabling verification of data contamination; (b) cluster-bootstrap confidence intervals and paired permutation tests for the main comparisons; (c) systematic ablations covering input arrangement, initialization, and training regime.
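The statistical procedures named in (b) can be sketched as follows. This is a generic illustration rather than the paper's implementation: the clustering unit (here a hypothetical prompt ID), the number of resamples, and the percentile interval are all assumptions.

```python
import random


def cluster_bootstrap_ci(clusters, statistic, n_boot=2000, alpha=0.05, seed=0):
    """Percentile CI from resampling whole clusters with replacement.

    clusters: dict mapping a cluster ID (assumed: one prompt) to its list of items.
    statistic: function taking the pooled list of items and returning a float.
    """
    rng = random.Random(seed)
    ids = list(clusters)
    stats = []
    for _ in range(n_boot):
        sample = []
        # Draw cluster IDs, then keep every item of each drawn cluster together.
        for cid in rng.choices(ids, k=len(ids)):
            sample.extend(clusters[cid])
        stats.append(statistic(sample))
    stats.sort()
    lo = stats[int((alpha / 2) * n_boot)]
    hi = stats[min(n_boot - 1, int((1 - alpha / 2) * n_boot))]
    return lo, hi


def paired_permutation_pvalue(diffs, n_perm=10000, seed=0):
    """Two-sided paired permutation test on per-item score differences.

    Under the null hypothesis the sign of each paired difference is arbitrary,
    so signs are flipped at random and the mean is compared to the observed one.
    """
    rng = random.Random(seed)
    observed = abs(sum(diffs) / len(diffs))
    hits = 0
    for _ in range(n_perm):
        flipped = sum(d if rng.random() < 0.5 else -d for d in diffs)
        if abs(flipped / len(diffs)) >= observed - 1e-12:
            hits += 1
    # Add-one correction keeps the estimated p-value strictly positive.
    return (hits + 1) / (n_perm + 1)
```

Resampling clusters rather than individual items matters when several evaluation items share one prompt, since those items are not independent and an item-level bootstrap would understate the interval width.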
However, there are notable concerns. The label construction requires 10 LRM runs per prompt (and 10 more per perturbation for counterfactual sensitivity), so ground-truth labels remain expensive despite the paper's framing around efficiency. The filtering criteria (≥5 valid runs, ≥70% consistency for counterfactual sensitivity) introduce selection bias: the forecaster is trained and evaluated primarily on "well-behaved" inputs. The rerun consistency results on TreeCut are weak (Spearman 0.140 for the forecaster vs. -0.043 for Claude), suggesting the approach may struggle on certain domains. The statistical analysis for the transfer experiments acknowledges low power (only 3 targets, sign-test p=0.125), so those claims should be treated as preliminary.
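Two of these concerns are easy to make concrete. The sketch below assumes that the filter is applied to the reruns of the original prompt and that the reported sign test is one-sided with all three targets agreeing in direction; neither detail is spelled out in this summary, and the function names are hypothetical.

```python
from math import comb


def keep_example(rerun_answers, reference_answer, min_valid=5, min_consistency=None):
    """Apply the reported filter: enough valid runs and, optionally, enough consistency.

    Assumption: invalid runs are passed as None; min_consistency=0.7 would be
    used only when building counterfactual-sensitivity labels.
    """
    valid = [a for a in rerun_answers if a is not None]
    if len(valid) < min_valid:
        return False
    if min_consistency is None:
        return True
    consistency = sum(a == reference_answer for a in valid) / len(valid)
    return consistency >= min_consistency


def sign_test_pvalue(n_positive, n_total):
    """One-sided exact sign test: P(X >= n_positive) for X ~ Binomial(n_total, 0.5)."""
    tail = sum(comb(n_total, k) for k in range(n_positive, n_total + 1))
    return tail / 2 ** n_total


# With 3 targets, even unanimous agreement cannot go below (1/2)**3.
best_possible_p = sign_test_pvalue(3, 3)  # 0.125
```

Under these assumptions, 0.125 is the smallest p-value a three-target sign test can produce, which is why the transfer results cannot reach conventional significance regardless of effect size. The filter also shows where the selection bias enters: any prompt on which the LRM is unstable is dropped before the forecaster ever sees it.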
Practical applications: The approach could enable deployment-time decisions (abstention, human routing, flagging) at scale without expensive resampling. This directly addresses a real operational need as LRMs are deployed in safety-critical settings. The EU AI Act's human oversight requirements (cited in the paper) create regulatory demand for such capabilities.
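As a rough illustration of such a deployment-time decision, a forecaster's predicted rerun consistency could be mapped to an action with simple thresholds. The thresholds and action names below are hypothetical and are not taken from the paper; in practice they would need to be tuned on held-out data.

```python
def route(predicted_consistency, answer_threshold=0.9, review_threshold=0.5):
    """Map a forecasted rerun consistency in [0, 1] to a deployment action.

    Both thresholds are illustrative assumptions, not values from the paper.
    """
    if not 0.0 <= predicted_consistency <= 1.0:
        raise ValueError("predicted consistency must lie in [0, 1]")
    if predicted_consistency >= answer_threshold:
        return "answer"        # the LRM is expected to repeat this answer
    if predicted_consistency >= review_threshold:
        return "human_review"  # uncertain: escalate to a person
    return "abstain"           # the answer is likely unstable across reruns


# Hypothetical forecasts for three requests.
decisions = [route(p) for p in (0.97, 0.72, 0.31)]
```

The appeal of this setup is that the forecast comes from one forward pass over an existing trajectory, so the routing decision does not require resampling the LRM.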
Conceptual contribution: The framing of behavior forecasting as a learnable task, separate from explanation, is a meaningful conceptual advance. It opens a research direction where any automatically labelable behavioral property becomes a candidate for cheap forecasting. The finding that reasoning trajectories encode non-surface information about future behavior has implications for AI safety and monitoring.
Adjacent fields: The approach could influence AI safety (monitoring for deceptive reasoning), deployment engineering (quality assurance pipelines), and the broader interpretability community by providing an alternative paradigm to mechanistic understanding.
This paper addresses a genuine bottleneck. LRMs (o1, R1, etc.) are being rapidly deployed, yet interpretability tools designed for single-forward-pass models don't naturally extend to long reasoning trajectories. The growing evidence that chain-of-thought is often unfaithful, which the paper cites extensively, makes naive reading unreliable. The paper arrives at an opportune moment, when the community needs practical tools for LRM oversight.
The finding that a frozen backbone with only head training performs poorly while full fine-tuning succeeds suggests the forecaster learns to transform the LRM's representations rather than simply reading off existing signals. This is scientifically interesting but raises questions about what exactly is being learned — is it genuine behavioral prediction or dataset-specific patterns?
The paper's connection to the faithfulness literature is well-drawn, and the hint-sensitivity transfer experiment (Table 3) provides a nice bridge to the sycophancy/unfaithfulness evaluation paradigm.
Overall, this is a well-executed paper that introduces a practical and conceptually clean framework. The results are convincing within scope, though the scope itself (small models, three datasets, two behavioral properties) leaves significant questions about generalizability to real deployment scenarios.
Generated Jun 11, 2026
Paper 2 likely has higher scientific impact due to a major new million-scale dataset (TouchThinker-1M) across many objects/scenarios/sensors plus an open-world benchmark, enabling broad follow-on research in robotics, multimodal learning, and embodied AI. The combination of large-scale resources and an action-aware representation addresses clear bottlenecks and is readily reusable, boosting real-world applicability and cross-field impact. Paper 1 is novel for LRM trust/behavior prediction and may be impactful in interpretability, but it is narrower in scope and lacks the same community-enabling artifacts.
Paper 1 bridges the gap between machine learning and systems engineering, addressing a critical bottleneck in deploying multi-agent LLM systems. By integrating infrastructure state directly into the agent orchestration process via reinforcement learning, it offers massive real-world applicability and efficiency gains. While Paper 2 tackles the important issue of AI explainability, Paper 1's interdisciplinary approach and immediate potential to scale complex AI pipelines give it a broader and more quantifiable systemic impact.
Paper 2 is likely higher impact: it introduces a broadly applicable, scalable alternative to explanation-based interpretability—training behavior forecasters from self-generated data—to predict model stability and counterfactual sensitivity. This is timely for trustworthy LRMs, reduces reliance on human labels, and offers clear real-world utility (deployment monitoring, uncertainty, robustness) across many domains and model types. While Paper 1 is novel and rigorous for Theory of Mind prompting with strong benchmark gains, its applicability is narrower and more task-specific than Paper 2’s general framework for forecasting model behavior.
Paper 2 introduces a fundamentally novel paradigm for AI interpretability—bypassing explanation to directly forecast model behavior—which has broad applicability across all LRM applications. Its conceptual innovation (behavior forecasting as a learnable task) opens a new research direction in AI safety/trustworthiness with cross-domain impact. Paper 1, while technically solid, addresses a narrower clinical domain (pulmonary diagnosis) with an incremental combination of knowledge graphs and reinforcement learning. Paper 2's framework is more generalizable, timely given rapid LRM deployment, and likely to inspire significant follow-up work across multiple fields.
Paper 2 introduces a novel paradigm for AI trustworthiness—treating behavior forecasting as a learnable task that bypasses traditional explanation methods for large reasoning models. This addresses a timely, high-impact problem given the rapid deployment of LRMs. The approach is broadly applicable across AI safety and interpretability, offers practical efficiency gains over frontier models, and opens a new research direction. Paper 1, while technically rigorous, addresses a narrower problem (adversarial attacks on data summarization) with more incremental contributions to an established subfield of adversarial robustness.
Paper 2 has higher likely scientific impact: it targets a central, timely bottleneck (long-context efficiency) with broad applicability beyond math reasoning (any long-context LLM deployment). The two-stage SA→SWA conversion plus on-policy RL adaptation offers a practical pathway to retrofit existing models, with clear real-world systems benefits (linear attention at higher accuracy). The data–architecture mismatch hypothesis and empirical demonstration that RL shifts conclusions about SWA viability is broadly relevant to architecture-aware training. Paper 1 is novel for interpretability/behavior prediction but has narrower immediate deployment impact.
Paper 1 introduces a fundamentally novel paradigm—treating behavior forecasting of LRMs as a learnable task that bypasses traditional explanation methods. This addresses a critical and broadly applicable problem (AI trust and interpretability) with a creative, generalizable approach. It offers methodological innovation (training forecasters from LRM trajectories without human annotation) and reveals that reasoning trajectories encode more information than naive reading captures. Paper 2, while practically useful, is more incremental—combining existing ideas (multi-agent debate, claim decomposition, code generation) into a domain-specific pipeline for financial QA. Paper 1's breadth of impact across AI safety, interpretability, and reasoning is significantly wider.
Paper 2 is likely higher impact due to a more broadly applicable, scalable paradigm: learning to forecast model behavior from trajectories without human labels. This can transfer across domains (reliability, robustness, interpretability, evaluation, safety) and offers practical deployment value via single-pass, low-cost inference. It also introduces a general training recipe (self-generated supervision, end-to-end finetuning, initialization from target LRM) validated across multiple datasets and tasks. Paper 1 is novel and timely for multi-turn safety diagnostics, but is narrower (CoT-specific, hazard scenario/attack setup) and more evaluation-focused than enabling new capabilities.
Paper 1 presents a fundamental conceptual shift in AI interpretability and safety by bypassing traditional explanations to directly forecast model behavior via a learned task. As Large Reasoning Models become ubiquitous, establishing trust and predicting their behavior on novel inputs is a critical bottleneck. This approach has broad implications across the entire AI ecosystem. Paper 2 is highly valuable for AI-driven scientific computing (AI4Science), but its impact is more narrowly focused on computational physics and engineering, making Paper 1 more broadly impactful across diverse domains relying on foundation models.
Paper 1 introduces a fundamentally novel paradigm for AI interpretability and trust by treating behavior forecasting as a learnable task, bypassing traditional and often unfaithful explanations. This conceptual shift has profound implications for AI safety, alignment, and evaluation of Large Reasoning Models. While Paper 2 offers a highly practical and timely systems optimization for KV cache management, Paper 1's approach has broader theoretical impact across the AI community by redefining how we understand and predict complex model behavior.