Is Capability a Liability? More Capable Language Models Make Worse Forecasts When It Matters Most
Nick Merrill, Jaeho Lee, Ezra Karger
Abstract
We document inverse scaling in LLMs on forecasting problems whose underlying time series exhibit superlinear growth and tail risk of regime change, a structure common in finance and epidemiology. On these tasks, more capable models produce worse distributional forecasts. The pattern appears on ForecastBench-Sim (FBSim), a contamination-free, simulated-world benchmark we release, in forecasting synthetic SIR epidemics with a matched linear control, and replicates in real-world datasets on COVID-19, measles, housing markets, and hyperinflation. A per-quantile decomposition shows the failure concentrates at the upper tail, which more capable models shift upward to track aggressive extrapolations of growth, while the lower tail stays put. A within-family study of Llama-3.1 shows that both model scale and post-training independently contribute to this effect. Domain knowledge does not reliably rescue calibration. This inverse scaling does not appear on the single-threshold metrics common in LLM forecasting benchmarks: single-threshold scoring at conventional cutoffs misses the upper-tail cost, while tail-inclusive scoring reverses the sign of the capability-accuracy relationship on the same outputs. We recommend that LLM forecasting evaluations use continuous (and unbounded) measures of accuracy alongside bounded binary threshold metrics.
AI Impact Assessments
Scientific Impact Assessment
Core Contribution
This paper identifies and characterizes a structurally identifiable class of inverse scaling in LLMs: distributional forecasting on time series exhibiting superlinear growth with tail risk of regime change. The central finding is that more capable models produce *worse* distributional forecasts on such tasks, with the failure localized to the upper tail of predictive distributions. Critically, this inverse scaling is invisible under single-threshold metrics (Brier score) commonly used in LLM forecasting benchmarks — the same outputs that show degradation under CRPS show *improvement* under Brier, reversing the sign of the capability-accuracy relationship.
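To make the metric reversal concrete, here is a minimal sketch with made-up numbers. The two sample sets, the outcome, and the threshold are illustrative assumptions, not data from the paper, and the helper names `brier_at_threshold` and `crps_ensemble` are hypothetical. The sketch scores the same forecasts twice: once on a single binary threshold, once with the standard ensemble form of CRPS.

```python
from itertools import product

def brier_at_threshold(samples, outcome, threshold):
    """Brier score for the binary event 'value exceeds threshold'."""
    p = sum(x > threshold for x in samples) / len(samples)
    occurred = 1.0 if outcome > threshold else 0.0
    return (p - occurred) ** 2

def crps_ensemble(samples, outcome):
    """Ensemble CRPS: E|X - y| - 0.5 * E|X - X'| (lower is better)."""
    n = len(samples)
    accuracy = sum(abs(x - outcome) for x in samples) / n
    spread = sum(abs(a - b) for a, b in product(samples, repeat=2)) / (n * n)
    return accuracy - 0.5 * spread

# Illustrative values only (assumed, not taken from the paper).
outcome, threshold = 100.0, 90.0
modest = [70, 85, 100, 115, 130]        # symmetric around the outcome
stretched = [95, 110, 150, 300, 600]    # upper tail pushed far upward

brier_modest = brier_at_threshold(modest, outcome, threshold)
brier_stretched = brier_at_threshold(stretched, outcome, threshold)
crps_modest = crps_ensemble(modest, outcome)
crps_stretched = crps_ensemble(stretched, outcome)

print("Brier:", brier_modest, brier_stretched)
print("CRPS: ", crps_modest, crps_stretched)
```

In this toy case the stretched forecast wins on the threshold metric because every sample clears the cutoff, yet loses on CRPS because the distance of its upper tail from the outcome is counted in full.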
The paper makes three interlocking contributions: (1) a new contamination-free benchmark (FBSim) built on procedurally generated strategy-game rollouts, (2) identification and mechanistic isolation of upper-tail overcommitment as the failure mode, and (3) a methodological critique of current LLM forecasting evaluation practices that rely exclusively on threshold-based metrics.
Methodological Rigor
The experimental design is notably thorough and multi-layered. The authors employ a progressive isolation strategy:
1. Discovery on FBSim (contamination-free, procedurally generated)
2. Mechanism isolation via synthetic SIR vs. linear control, demonstrating that the effect requires superlinear growth combined with regime change and does not arise from crashes alone
3. Within-family 2×2 on Llama-3.1 ({70B, 405B} × {base, instruct}), cleanly separating scale from post-training effects
4. Real-world replication across four domains (COVID-19, housing, hyperinflation, measles)
5. Natural distribution test on 35 pre-vaccine measles seasons (1,339 state-seasons), a sample that is not selected on severity
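The contrast in step 2 can be sketched as follows. This is a rough illustration only: it assumes a deterministic discrete-time SIR model, and the parameter values and function names (`simulate_sir`, `simulate_linear`) are hypothetical, not the authors' generator. It shows the structural difference the control is meant to isolate: the SIR series grows superlinearly and then turns over, while the control grows by a constant increment.

```python
def simulate_sir(population=10_000, initial_infected=10,
                 beta=0.4, gamma=0.1, steps=120):
    """Deterministic discrete-time SIR; returns the infected count per step."""
    s = float(population - initial_infected)
    i = float(initial_infected)
    infected = [i]
    for _ in range(steps):
        # New infections cannot exceed the remaining susceptibles.
        new_infections = min(s, beta * s * i / population)
        recoveries = gamma * i
        s -= new_infections
        i += new_infections - recoveries
        infected.append(i)
    return infected

def simulate_linear(start=10.0, slope=5.0, steps=120):
    """Linear control: the same horizon with a constant increment."""
    return [start + slope * t for t in range(steps + 1)]

sir = simulate_sir()
linear = simulate_linear()
peak_step = sir.index(max(sir))

print("SIR peak at step", peak_step, "with", round(max(sir)), "infected")
print("SIR final value:", round(sir[-1], 2))
print("Linear final value:", linear[-1])
```

A forecaster that extrapolates the early SIR growth rate overshoots once the susceptible pool is depleted. The linear series has no such regime change, which is why it works as a matched control.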
The statistical approach is appropriate: Spearman rank correlations with bootstrap CIs, exact permutation tests for small within-lineage subsets, Wilcoxon signed-rank for the paired 2×2 design. The robustness checks are extensive — leave-one-provider-out, one-model-per-lineage collapse, reasoning-mode partitioning, template exclusion, and threshold sweeps for the Brier-CRPS reversal.
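As an illustration of the first of these tools, the sketch below computes a Spearman rank correlation with a percentile bootstrap confidence interval in plain Python. The capability scores and CRPS values are invented for the example, and the resampling scheme (resampling models with replacement, percentile interval) is an assumption. The review does not say which bootstrap variant the authors use.

```python
import random

def average_ranks(values):
    """Ranks starting at 1, with ties given their average rank."""
    order = sorted(range(len(values)), key=lambda k: values[k])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks

def spearman(xs, ys):
    """Pearson correlation of the ranks; None if either side is constant."""
    rx, ry = average_ranks(xs), average_ranks(ys)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    if vx == 0 or vy == 0:
        return None
    return cov / (vx * vy) ** 0.5

def bootstrap_ci(xs, ys, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval, resampling (x, y) pairs together."""
    rng = random.Random(seed)
    n = len(xs)
    stats = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        rho = spearman([xs[k] for k in idx], [ys[k] for k in idx])
        if rho is not None:  # skip degenerate resamples
            stats.append(rho)
    stats.sort()
    lower = stats[int((alpha / 2) * len(stats))]
    upper = stats[min(len(stats) - 1, int((1 - alpha / 2) * len(stats)))]
    return lower, upper

# Hypothetical values: higher capability paired with higher (worse) CRPS.
capability = [100, 110, 120, 130, 140, 150, 160, 170]
crps = [5.0, 6.5, 6.0, 8.0, 9.5, 9.0, 12.0, 14.0]

rho = spearman(capability, crps)
ci_low, ci_high = bootstrap_ci(capability, crps)
print(f"rho = {rho:.3f}, 95% CI = [{ci_low:.3f}, {ci_high:.3f}]")
```

A positive correlation between capability and a lower-is-better score such as CRPS is what inverse scaling looks like in this framing.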
The per-quantile pinball decomposition is particularly effective at localizing the failure: the p90 swings from strongly positive to strongly negative correlation with capability at long horizons, while p10 remains flat. This indicates that the mechanism is an asymmetric upper-tail shift rather than general miscalibration.
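For readers unfamiliar with the pinball (quantile) loss, the following sketch shows how a per-quantile breakdown localizes an upper-tail shift. The forecasts and the outcome are made-up numbers, and `pinball_loss` and `per_quantile` are hypothetical helper names. Only the loss formula itself is standard.

```python
def pinball_loss(q, predicted, outcome):
    """Quantile loss at level q (lower is better)."""
    diff = outcome - predicted
    return max(q * diff, (q - 1) * diff)

def per_quantile(forecast, outcome):
    """Loss at each forecast quantile, keyed by quantile level."""
    return {q: pinball_loss(q, value, outcome) for q, value in forecast.items()}

# Illustrative numbers only (assumed, not taken from the paper).
outcome = 100.0
baseline = {0.1: 60.0, 0.5: 100.0, 0.9: 150.0}
shifted = {0.1: 60.0, 0.5: 105.0, 0.9: 300.0}   # only the upper tail moves much

loss_baseline = per_quantile(baseline, outcome)
loss_shifted = per_quantile(shifted, outcome)

for q in sorted(baseline):
    print(f"p{int(q * 100)}: baseline={loss_baseline[q]:.2f} "
          f"shifted={loss_shifted[q]:.2f}")
```

The p10 loss is identical for both forecasts while the p90 loss roughly quadruples. Scoring each quantile separately therefore shows where a distributional forecast went wrong, which a single aggregate score cannot do.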
One methodological concern: the cross-family analysis uses ECI as a capability proxy, which conflates multiple axes (scale, training data, RLHF, architecture). The authors address this reasonably with the Llama 2×2 and within-lineage replications, but the ECI-based analysis should be interpreted cautiously. The paper is honest about this limitation.
The conservative choice to clamp reasoning models to minimal effort (with reasoning-on showing stronger inversion) strengthens the result's credibility.
Potential Impact
The practical implications are significant. The domains where this failure mode appears — epidemiology, housing markets, hyperinflation — are precisely where distributional tail accuracy matters most for policy decisions. The finding that LLM-driven epidemic forecasts will systematically miscalibrate the upper tail during regime change is directly relevant to public health surveillance, especially given active research applying LLMs to disease forecasting (Du et al., 2025) and ongoing measles resurgence.
The methodological recommendation — that LLM forecasting benchmarks should report continuous proper scoring rules (CRPS, log score) alongside threshold metrics — has immediate implications for benchmark design. The demonstration that metric choice can *reverse the sign* of the capability-performance relationship (not just inflate/deflate it, as in Schaeffer et al. 2023) is a stronger version of an already influential observation.
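A standard identity helps explain why the two kinds of metric can disagree. The notation is introduced here for illustration and is not the paper's: F is the forecast CDF, y the realized value, and t a threshold, with the binary event written as "the outcome is at most t".

```latex
\[
\mathrm{BS}_t(F, y) = \bigl(F(t) - \mathbf{1}\{y \le t\}\bigr)^2,
\qquad
\mathrm{CRPS}(F, y) = \int_{-\infty}^{\infty} \mathrm{BS}_t(F, y)\,\mathrm{d}t .
\]
```

CRPS is the threshold Brier score integrated over every possible threshold. A benchmark that scores at one conventional cutoff samples a single point of the integrand, so forecast mass placed far above that cutoff can leave the Brier score unchanged while still adding to CRPS.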
The FBSim benchmark itself contributes a contamination-free evaluation environment for distributional forecasting, addressing a persistent concern in LLM evaluation.
Timeliness & Relevance
This paper arrives at an opportune moment. LLMs are increasingly deployed for forecasting in high-stakes domains, and the AI safety community is actively seeking structurally identifiable failure modes. The finding that inverse scaling persists at frontier capability (not U-shaped as in Wei et al. 2023) and is amplified by post-training is particularly relevant as the field scales models further. Current LLM forecasting benchmarks (ForecastBench, KalshiBench) report only binary metrics — this paper provides a concrete, well-evidenced argument for why that's insufficient.
Overall Assessment
This is a well-executed empirical paper that identifies a practically important and structurally clean failure mode in LLM forecasting. The progressive evidence architecture — from contamination-free benchmark through controlled mechanism isolation to real-world replication — is exemplary. The metric-reversal finding alone has significant implications for how the field evaluates LLM forecasting systems. The work is directly relevant to safety-critical applications and provides actionable recommendations for benchmark design.
Generated May 25, 2026
Comparison History (14)
Paper 1 presents a specific, well-documented empirical finding (inverse scaling in LLM forecasting) with clear methodology, reproducible benchmarks, and actionable recommendations. It addresses a timely problem with direct implications for high-stakes domains (finance, epidemiology). Paper 2 is ambitious in scope but reads as a thesis-level collection of loosely connected theoretical results; its breadth sacrifices depth, and several claims (e.g., the 'Deterministic Horizon') require extraordinary validation. Paper 1's focused, falsifiable contribution with released benchmarks is more likely to influence evaluation practices and downstream research.
Paper 2 identifies a fundamental and counterintuitive failure mode of LLMs—inverse scaling in forecasting under superlinear growth and tail risk—with broad implications across AI safety, finance, epidemiology, and evaluation methodology. Its finding that more capable models perform worse where it matters most challenges prevailing assumptions about scaling benefits, affecting how the entire field benchmarks and deploys LLMs. Paper 1, while solid applied work combining reasoning and RL for crystal generation, represents an incremental advance within a narrower materials science niche. Paper 2's cross-disciplinary relevance and methodological critique of standard evaluation practices give it wider and more lasting impact.
Paper 1 documents a counterintuitive and consequential finding (inverse scaling in LLM forecasting on critical tasks like epidemics and financial crises) with broad implications for how the field evaluates and deploys LLMs. It challenges the default assumption that more capable models are uniformly better, proposes a new benchmark, and identifies a specific failure mode (upper-tail miscalibration) with real-world safety implications. Its methodological rigor (synthetic + real-world replication, per-quantile decomposition, within-family ablation) and relevance to high-stakes domains give it broader cross-disciplinary impact than Paper 2, which presents an incremental improvement to self-evolving agent training.
Paper 2 identifies a fundamental and counterintuitive failure mode of LLMs—inverse scaling in forecasting under tail risk and superlinear growth—with broad implications across finance, epidemiology, and AI safety. It challenges the prevailing assumption that more capable models are uniformly better, provides mechanistic decomposition, and offers actionable recommendations for evaluation methodology. This has wider cross-disciplinary impact and addresses a critical blind spot in LLM evaluation. Paper 1, while practically useful, is more incremental—compiling workflows into weights is well-established, and the contribution is primarily engineering-focused with narrower scope.
Paper 2 likely has higher impact: it introduces a general, procedurally generated, contamination-resistant benchmark for strategic reasoning with a principled capability-profile decomposition and a new stability/jaggedness metric. This framework is broadly applicable to evaluating LLMs as agents across economics, game theory, security, and multi-agent AI, and is timely given deployment in markets and negotiations. Paper 1 identifies an important inverse-scaling failure mode in forecasting and improves evaluation practice, but its domain scope is narrower (time-series/tail risk) and the main contribution is diagnostic rather than a widely extensible evaluation paradigm.
Paper 2 identifies a critical, counter-intuitive vulnerability (inverse scaling) in LLM forecasting on high-stakes tasks like epidemiology and finance. It exposes fundamental flaws in current evaluation metrics and introduces a new contamination-free benchmark. Its implications for AI safety, real-world deployment, and evaluation methodology are profound. In contrast, Paper 1 presents a valuable but incremental algorithmic improvement for search-augmented reasoning pipelines. The broad relevance, safety implications, and novel empirical findings of Paper 2 give it higher potential scientific impact.
Paper 1 presents a novel empirical finding (inverse scaling in LLM forecasting) with rigorous methodology, a new benchmark (FBSim), replication across multiple real-world domains, and actionable recommendations for evaluation practices. It directly challenges assumptions about capability scaling in a high-stakes application area (forecasting), with immediate implications for AI safety and deployment. Paper 2 offers a theoretical framework for AI accountability boundaries that, while relevant, is more conceptual and narrower in empirical grounding. Paper 1's combination of surprising empirical results, methodological contribution, and broad applicability gives it higher impact potential.
Paper 1 identifies a critical 'inverse scaling' vulnerability in LLMs for high-stakes forecasting (e.g., finance, epidemiology), showing that more capable models perform worse on tail risks. This challenges prevailing scaling assumptions and evaluation metrics, offering broad implications for AI safety, benchmarking, and real-world deployment. Paper 2 presents a valuable framework for formal proof optimization, but its impact is more narrowly confined to the automated theorem proving and formal math communities. Paper 1's broader relevance across multiple domains makes its potential impact higher.
Paper 1 identifies a fundamental and counterintuitive failure mode of LLMs—inverse scaling on forecasting tasks with superlinear growth and tail risk—with implications across finance, epidemiology, and AI safety. It introduces a new benchmark, provides rigorous mechanistic analysis (per-quantile decomposition, within-family scaling studies), and challenges prevailing assumptions that more capable models are universally better. The finding that standard evaluation metrics miss this failure has broad methodological implications for the entire LLM evaluation community. Paper 2 presents a useful engineering contribution for modular LLM specialization but is more incremental in nature.
Paper 1 is likely higher impact: it identifies a surprising inverse-scaling failure mode in LLM probabilistic forecasting under superlinear growth and regime-change tail risk, supported by a new contamination-free benchmark plus replication on multiple real datasets. The finding has immediate implications for evaluating and deploying LLMs in high-stakes domains (finance, epidemiology) and critiques prevalent benchmark metrics, influencing methodology across AI forecasting research. Paper 2 is a solid hybrid CP+DP integration but is a case study on one scheduling problem and is explicitly not competitive with state-of-the-art, limiting novelty and broader adoption.
Paper 1 reveals a critical inverse scaling phenomenon in LLM forecasting, specifically in high-stakes domains like finance and epidemiology. Uncovering that more capable models perform worse on tail risks challenges current scaling laws and AI safety assumptions. While Paper 2 offers a valuable technical optimization for KV cache compression, Paper 1's findings have broader, more profound implications for the evaluation, reliability, and deployment of advanced AI systems in critical real-world applications.
Paper 1 identifies a novel, counterintuitive inverse scaling phenomenon in critical domains like finance and epidemiology. By exposing a fundamental flaw in how advanced LLMs handle tail-risk and challenging standard evaluation metrics, it offers broader implications for AI safety, benchmarking, and real-world deployment compared to Paper 2's specialized security defense tool.
Paper 1 identifies a critical inverse scaling phenomenon in LLMs applied to high-stakes forecasting (finance, epidemiology), revealing that more capable models perform worse on tail risks. This has broad implications for AI safety, model evaluation, and deployment in real-world scenarios. Paper 2, while methodologically rigorous, focuses on a narrower, domain-specific issue regarding format-schema interactions in knowledge graph extraction, limiting its broader scientific and practical impact compared to the fundamental limitations exposed in Paper 1.
Paper 2 (TerminalWorld) likely has higher scientific impact due to its broadly reusable, scalable benchmark generation engine built from large-scale real terminal recordings, yielding a substantial public resource (1,530 tasks) directly applicable to evaluating and improving agentic systems. Its real-world relevance and timeliness (agent reliability, software automation), together with its weak correlation with existing benchmarks, suggest it measures a distinct, important capability, potentially reshaping evaluation practice across academia and industry. Paper 1 is novel and important for forecasting calibration, but is narrower in scope and primarily influences evaluation methodology within forecasting/LLM reliability.