Is Capability a Liability? More Capable Language Models Make Worse Forecasts When It Matters Most

Nick Merrill, Jaeho Lee, Ezra Karger

May 21, 2026

arXiv:2605.22672v2

cs.AI (primary)

#353 of 2320 · Artificial Intelligence

Tournament Score: 1493 ± 47 (scale 1050 to 1800)

Win Rate: 86%

Rating: 7.8 / 10

Significance 8.5 · Rigor 8 · Novelty 7.5 · Clarity 8

Abstract

We document inverse scaling in LLMs on forecasting problems whose underlying time series exhibit superlinear growth and tail risk of regime change, a structure common in finance and epidemiology. On these tasks, more capable models produce worse distributional forecasts. The pattern appears on ForecastBench-Sim (FBSim), a contamination-free, simulated-world benchmark we release, in forecasting synthetic SIR epidemics with a matched linear control, and replicates in real-world datasets on COVID-19, measles, housing markets, and hyperinflation. A per-quantile decomposition shows the failure concentrates at the upper tail, which more capable models shift upward to track aggressive extrapolations of growth, while the lower tail stays put. A within-family study of Llama-3.1 shows that both model scale and post-training independently contribute to this effect. Domain knowledge does not reliably rescue calibration. This inverse scaling does not appear on the single-threshold metrics common in LLM forecasting benchmarks: single-threshold scoring at conventional cutoffs misses the upper-tail cost, while tail-inclusive scoring reverses the sign of the capability-accuracy relationship on the same outputs. We recommend that LLM forecasting evaluations use continuous (and unbounded) measures of accuracy alongside bounded binary threshold metrics.

AI Impact Assessments

(1 model)

Scientific Impact Assessment

Core Contribution

This paper characterizes a structurally identifiable class of inverse scaling in LLMs: distributional forecasting on time series that exhibit superlinear growth with tail risk of regime change. The central finding is that more capable models produce *worse* distributional forecasts on such tasks, with the failure localized to the upper tail of the predictive distribution. Critically, this inverse scaling is invisible under the single-threshold metrics (Brier score) commonly used in LLM forecasting benchmarks: the same outputs that degrade with capability under CRPS *improve* under Brier, reversing the sign of the capability-accuracy relationship.

The paper makes three interlocking contributions: (1) a new contamination-free benchmark (FBSim) built on procedurally generated strategy-game rollouts, (2) identification and mechanistic isolation of upper-tail overcommitment as the failure mode, and (3) a methodological critique of current LLM forecasting evaluation practices that rely exclusively on threshold-based metrics.
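The Brier/CRPS sign reversal described above can be illustrated with a toy calculation. The sketch below is not the paper's scoring code: the two forecasts, the outcome, and the threshold are hypothetical numbers chosen for illustration, and each forecast is assumed to be a piecewise-linear CDF given as (value, cumulative probability) knots. It relies on the standard identity that CRPS is the integral of the threshold Brier score over all thresholds, so a forecast can look better at one cutoff and still be worse overall.

```python
from bisect import bisect_right

def cdf(knots, x):
    """Piecewise-linear CDF through (value, cumulative probability) knots."""
    xs = [value for value, _ in knots]
    ps = [prob for _, prob in knots]
    if x <= xs[0]:
        return ps[0]
    if x >= xs[-1]:
        return ps[-1]
    i = bisect_right(xs, x) - 1
    frac = (x - xs[i]) / (xs[i + 1] - xs[i])
    return ps[i] + frac * (ps[i + 1] - ps[i])

def brier(knots, threshold, outcome):
    """Brier score for the binary question 'will the value exceed threshold?'."""
    p_exceed = 1.0 - cdf(knots, threshold)
    happened = 1.0 if outcome > threshold else 0.0
    return (p_exceed - happened) ** 2

def crps(knots, outcome, lo, hi, steps=20000):
    """CRPS as the integral of (F(x) - 1[x >= outcome])^2, midpoint rule on [lo, hi]."""
    width = (hi - lo) / steps
    total = 0.0
    for k in range(steps):
        x = lo + (k + 0.5) * width
        step_fn = 1.0 if x >= outcome else 0.0
        total += (cdf(knots, x) - step_fn) ** 2 * width
    return total

# Hypothetical forecasts: same lower tail, but "aggressive" pushes its upper tail far out.
cautious = [(0, 0.0), (50, 0.1), (80, 0.5), (120, 0.9), (200, 1.0)]
aggressive = [(0, 0.0), (50, 0.1), (90, 0.5), (300, 0.9), (600, 1.0)]
outcome = 60      # hypothetical realized value after growth stalls
threshold = 55    # hypothetical single cutoff used for Brier scoring

brier_cautious = brier(cautious, threshold, outcome)
brier_aggressive = brier(aggressive, threshold, outcome)
crps_cautious = crps(cautious, outcome, 0, 600)
crps_aggressive = crps(aggressive, outcome, 0, 600)

print(f"Brier: cautious={brier_cautious:.4f} aggressive={brier_aggressive:.4f}")
print(f"CRPS:  cautious={crps_cautious:.2f} aggressive={crps_aggressive:.2f}")
```

In this toy setup the aggressive forecast scores slightly better on the single-threshold Brier score and considerably worse on CRPS, because the cost of its stretched upper tail only shows up at thresholds far above the cutoff.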

Methodological Rigor

The experimental design is notably thorough and multi-layered. The authors employ a progressive isolation strategy:

1. Discovery on FBSim (contamination-free, procedurally generated)

2. Mechanism isolation via synthetic SIR vs. linear control — demonstrating that superlinear growth + regime change is necessary, not crashes alone

3. Within-family 2×2 on Llama-3.1 ({70B, 405B} × {base, instruct}), cleanly separating scale from post-training effects

4. Real-world replication across four domains (COVID-19, housing, hyperinflation, measles)

5. Natural distribution test on 35 pre-vaccine measles seasons (1,339 state-seasons), which is unselected on severity
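The contrast in step 2 between a synthetic SIR epidemic and a linear control can be sketched as follows. This is a minimal discrete-time SIR model, not the paper's generator; the population size, `beta`, `gamma`, the cutoff, and the horizon are all assumed values, and `extrapolate_growth` is a hypothetical naive forecaster that simply carries the latest growth ratio forward.

```python
def simulate_sir(population, beta, gamma, initial_infected, steps):
    """Discrete-time SIR model; returns new infections per step."""
    susceptible = float(population - initial_infected)
    infected = float(initial_infected)
    incidence = []
    for _ in range(steps):
        new_infections = beta * susceptible * infected / population
        recoveries = gamma * infected
        susceptible -= new_infections
        infected += new_infections - recoveries
        incidence.append(new_infections)
    return incidence

def linear_control(start, slope, steps):
    """Matched control series with constant increments and no regime change."""
    return [start + slope * t for t in range(steps)]

def extrapolate_growth(history, horizon):
    """Naive forecast: carry the latest step-over-step growth ratio forward."""
    ratio = history[-1] / history[-2]
    return history[-1] * ratio ** horizon

# All parameters below are assumptions chosen for illustration.
epidemic = simulate_sir(1_000_000, beta=0.5, gamma=0.2, initial_infected=10, steps=120)
control = linear_control(100.0, 25.0, 120)

peak_step = epidemic.index(max(epidemic))
cutoff, horizon = 30, 30
naive = extrapolate_growth(epidemic[: cutoff + 1], horizon)
actual = epidemic[cutoff + horizon]

print(f"peak at step {peak_step}")
print(f"naive extrapolation: {naive:.0f}, actual incidence: {actual:.0f}")
```

A forecaster that sees only the growth phase and extends it overshoots badly once susceptible depletion turns the curve over, whereas the same extension is harmless on the linear control. That asymmetry is the structure the review credits the synthetic experiment with isolating.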

The statistical approach is appropriate: Spearman rank correlations with bootstrap CIs, exact permutation tests for small within-lineage subsets, Wilcoxon signed-rank for the paired 2×2 design. The robustness checks are extensive — leave-one-provider-out, one-model-per-lineage collapse, reasoning-mode partitioning, template exclusion, and threshold sweeps for the Brier-CRPS reversal.
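For readers unfamiliar with the first of these tools, a Spearman rank correlation with a percentile bootstrap confidence interval can be sketched in plain Python. The capability scores and CRPS values below are made-up numbers, not the paper's data, and the resampling settings (`n_boot`, `alpha`, `seed`) are assumptions.

```python
import random

def average_ranks(values):
    """Ranks starting at 1, with ties assigned their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        avg = (pos + end) / 2 + 1
        for k in range(pos, end + 1):
            ranks[order[k]] = avg
        pos = end + 1
    return ranks

def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks (None if degenerate)."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    if vx == 0 or vy == 0:
        return None
    return cov / (vx * vy) ** 0.5

def bootstrap_ci(x, y, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval, resampling (x, y) pairs with replacement."""
    rng = random.Random(seed)
    n = len(x)
    stats = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        rho = spearman([x[i] for i in idx], [y[i] for i in idx])
        if rho is not None:  # skip resamples with no rank variation
            stats.append(rho)
    stats.sort()
    lo = stats[int((alpha / 2) * len(stats))]
    hi = stats[int((1 - alpha / 2) * len(stats)) - 1]
    return lo, hi

# Hypothetical panel: higher capability paired with higher (worse) CRPS.
capability = [100, 110, 118, 125, 131, 138, 144, 150]
crps_scores = [12.1, 12.9, 12.4, 14.0, 15.2, 14.8, 16.9, 18.3]

rho = spearman(capability, crps_scores)
ci_low, ci_high = bootstrap_ci(capability, crps_scores)
print(f"rho={rho:.3f}, 95% CI=({ci_low:.3f}, {ci_high:.3f})")
```

With a small panel the interval is wide, which is one reason the within-lineage permutation tests and leave-one-out checks matter as complements.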

The per-quantile pinball decomposition is particularly effective at localizing the failure: the p90 correlation with capability swings from strongly positive to strongly negative at long horizons, while the p10 correlation remains flat. This points to an asymmetric upper-tail shift rather than general miscalibration.
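A per-quantile decomposition of this kind can be sketched with the standard pinball (quantile) loss. The quantile levels are an assumption (the review names only p10 and p90), and the forecasts and outcomes are invented to mimic the described pattern: identical lower tails, with one forecaster stretching the upper tail.

```python
LEVELS = (0.1, 0.25, 0.5, 0.75, 0.9)  # assumed quantile levels

def pinball(level, forecast, outcome):
    """Pinball loss of a single quantile forecast at the given level."""
    diff = outcome - forecast
    return max(level * diff, (level - 1.0) * diff)

def per_quantile_loss(cases, outcomes):
    """Mean pinball loss at each level across forecast cases."""
    n = len(outcomes)
    return {
        level: sum(pinball(level, case[j], y) for case, y in zip(cases, outcomes)) / n
        for j, level in enumerate(LEVELS)
    }

# Hypothetical five-quantile forecasts for two cases; the outcomes land low.
cautious_cases = [[40, 55, 70, 90, 120], [45, 60, 80, 100, 130]]
aggressive_cases = [[40, 58, 75, 110, 320], [45, 62, 85, 120, 350]]
outcomes = [60, 65]

loss_cautious = per_quantile_loss(cautious_cases, outcomes)
loss_aggressive = per_quantile_loss(aggressive_cases, outcomes)
gap = {level: loss_aggressive[level] - loss_cautious[level] for level in LEVELS}

for level in LEVELS:
    print(f"p{int(level * 100)}: extra loss of aggressive forecast = {gap[level]:+.3f}")
```

Scoring each level separately shows where an aggregate score is being lost: in this toy case the gap is zero at p10 and largest at p90.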

One methodological concern: the cross-family analysis uses ECI as a capability proxy, which conflates multiple axes (scale, training data, RLHF, architecture). The authors address this reasonably with the Llama 2×2 and within-lineage replications, but the ECI-based analysis should be interpreted cautiously. The paper is honest about this limitation.

The conservative choice to clamp reasoning models to minimal effort (with reasoning-on showing stronger inversion) strengthens the result's credibility.

Potential Impact

The practical implications are significant. The domains where this failure mode appears (epidemiology, housing markets, hyperinflation) are precisely where distributional tail accuracy matters most for policy decisions. The finding implies that LLM-driven epidemic forecasts are likely to miscalibrate the upper tail during regime change, which is directly relevant to public health surveillance, especially given active research applying LLMs to disease forecasting (Du et al., 2025) and the ongoing measles resurgence.

The methodological recommendation — that LLM forecasting benchmarks should report continuous proper scoring rules (CRPS, log score) alongside threshold metrics — has immediate implications for benchmark design. The demonstration that metric choice can *reverse the sign* of the capability-performance relationship (not just inflate/deflate it, as in Schaeffer et al. 2023) is a stronger version of an already influential observation.

The FBSim benchmark itself contributes a contamination-free evaluation environment for distributional forecasting, addressing a persistent concern in LLM evaluation.

Timeliness & Relevance

This paper arrives at an opportune moment. LLMs are increasingly deployed for forecasting in high-stakes domains, and the AI safety community is actively seeking structurally identifiable failure modes. The finding that inverse scaling persists at frontier capability (not U-shaped as in Wei et al. 2023) and is amplified by post-training is particularly relevant as the field scales models further. Current LLM forecasting benchmarks (ForecastBench, KalshiBench) report only binary metrics — this paper provides a concrete, well-evidenced argument for why that's insufficient.

Strengths

Multi-level evidence architecture: Discovery → mechanism isolation → controlled ablation → real-world replication → natural distribution test. Each layer addresses a specific alternative explanation.

The metric reversal finding is the paper's most impactful contribution: identical outputs yield opposite conclusions under different proper scoring rules. This has immediate implications for benchmark design.

The Llama 2×2 cleanly demonstrates that both scale and post-training independently contribute and compound, ruling out attribution to any single axis.

The informative negative (influenza shows no inversion) demonstrates structural specificity rather than a generic disease-data artifact.

Extensive robustness appendices (provider fixed effects, lineage collapse, reasoning mode, aggregation robustness) preemptively address likely reviewer concerns.

Reproducibility: Code, data, and scored outputs are provided.

Limitations & Weaknesses

Domain selection bias: Three of four real-world domains were selected *because* regime change occurred. The measles cohort partially mitigates this, but the ex-ante deployment frequency of the failure remains unclear.

ECI as capability proxy: Despite mitigations, the cross-family analysis cannot fully disentangle capability from provider-specific training choices. The N=28 model panel, while reasonable, is small for rank correlations.

Knowledge-calibration gap unexplained: The most intriguing finding (models can articulate regime-change alternatives but discard them) lacks mechanistic explanation — deferred to future work.

Elicitation format: Five-quantile elicitation with piecewise-linear CDF reconstruction is a specific choice that may interact with the finding. The Llama 2×2 uses a different elicitation (LLMTime continuation), partially addressing this.

No proposed solution: The paper identifies the problem and recommends evaluation changes but offers no training-time intervention. The finding that "overcommitment is unlikely to be corrected by scale alone" is demonstrated but no alternative path is proposed.

Effect sizes on some real-world domains are noisy due to small N (12 hyperinflation episodes, 19 housing metros).
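The elicitation-format limitation noted above can be made concrete. The sketch below rebuilds a piecewise-linear CDF from five quantiles and shows that the implied upper-tail probability depends on how the CDF is extended beyond the outermost quantiles. The quantile levels, the `tail_scale` extension rule, and the example values are all assumptions; the paper's actual reconstruction may differ.

```python
from bisect import bisect_right

LEVELS = [0.1, 0.25, 0.5, 0.75, 0.9]  # assumed quantile levels

def build_knots(values, tail_scale):
    """Add endpoints at probability 0 and 1 beyond the outermost quantiles.

    tail_scale (> 0) is a hypothetical rule: each tail spans tail_scale times
    the gap between the two nearest elicited quantiles.
    """
    lower = values[0] - tail_scale * (values[1] - values[0])
    upper = values[-1] + tail_scale * (values[-1] - values[-2])
    return [(lower, 0.0)] + list(zip(values, LEVELS)) + [(upper, 1.0)]

def exceedance(knots, x):
    """P(Y > x) under the piecewise-linear CDF through the knots."""
    xs = [value for value, _ in knots]
    ps = [prob for _, prob in knots]
    if x <= xs[0]:
        return 1.0
    if x >= xs[-1]:
        return 0.0
    i = bisect_right(xs, x) - 1
    frac = (x - xs[i]) / (xs[i + 1] - xs[i])
    return 1.0 - (ps[i] + frac * (ps[i + 1] - ps[i]))

quantiles = [40, 60, 90, 150, 300]     # hypothetical elicited quantiles
narrow = build_knots(quantiles, 0.5)   # upper endpoint at 375
wide = build_knots(quantiles, 2.0)     # upper endpoint at 600

p_narrow = exceedance(narrow, 350)
p_wide = exceedance(wide, 350)
print(f"P(Y > 350): narrow tails={p_narrow:.4f}, wide tails={p_wide:.4f}")
```

The same five numbers imply different tail probabilities under different extension rules, which is why a replication with a different elicitation method, such as the LLMTime continuation used for the Llama 2×2, is a useful check.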

Overall Assessment

This is a well-executed empirical paper that identifies a practically important and structurally clean failure mode in LLM forecasting. The progressive evidence architecture — from contamination-free benchmark through controlled mechanism isolation to real-world replication — is exemplary. The metric-reversal finding alone has significant implications for how the field evaluates LLM forecasting systems. The work is directly relevant to safety-critical applications and provides actionable recommendations for benchmark design.

Rating: 7.8 / 10

Significance 8.5 · Rigor 8 · Novelty 7.5 · Clarity 8

Generated May 25, 2026

Comparison History (14)

vs. The Deterministic Horizon: Impossibility Results as Design Specifications for Trustworthy AI Systems

claude-opus-4.6 · 5/25/2026

Paper 1 presents a specific, well-documented empirical finding (inverse scaling in LLM forecasting) with clear methodology, reproducible benchmarks, and actionable recommendations. It addresses a timely problem with direct implications for high-stakes domains (finance, epidemiology). Paper 2 is ambitious in scope but reads as a thesis-level collection of loosely connected theoretical results; its breadth sacrifices depth, and several claims (e.g., the 'Deterministic Horizon') require extraordinary validation. Paper 1's focused, falsifiable contribution with released benchmarks is more likely to influence evaluation practices and downstream research.

vs. CrystalReasoner: Reasoning and RL for Property-Conditioned Crystal Structure Generation

claude-opus-4.6 · 5/25/2026

Paper 2 identifies a fundamental and counterintuitive failure mode of LLMs—inverse scaling in forecasting under superlinear growth and tail risk—with broad implications across AI safety, finance, epidemiology, and evaluation methodology. Its finding that more capable models perform worse where it matters most challenges prevailing assumptions about scaling benefits, affecting how the entire field benchmarks and deploys LLMs. Paper 1, while solid applied work combining reasoning and RL for crystal generation, represents an incremental advance within a narrower materials science niche. Paper 2's cross-disciplinary relevance and methodological critique of standard evaluation practices give it wider and more lasting impact.

vs. EVE-Agent: Evidence-Verifiable Self-Evolving Agents

claude-opus-4.6 · 5/25/2026

Paper 1 documents a counterintuitive and consequential finding—inverse scaling in LLM forecasting on critical tasks like epidemics and financial crises—with broad implications for how the field evaluates and deploys LLMs. It challenges the default assumption that more capable models are uniformly better, proposes a new benchmark, and identifies specific failure modes (tail risk underestimation) with real-world safety implications. Its methodological rigor (synthetic + real-world replication, per-quantile decomposition, within-family ablation) and relevance to high-stakes domains give it broader cross-disciplinary impact than Paper 2, which presents an incremental improvement to self-evolving agent training.

vs. Compiling Agentic Workflows into LLM Weights: Near-Frontier Quality at Two Orders of Magnitude Less Cost

claude-opus-4.6 · 5/25/2026

Paper 2 identifies a fundamental and counterintuitive failure mode of LLMs—inverse scaling in forecasting under tail risk and superlinear growth—with broad implications across finance, epidemiology, and AI safety. It challenges the prevailing assumption that more capable models are uniformly better, provides mechanistic decomposition, and offers actionable recommendations for evaluation methodology. This has wider cross-disciplinary impact and addresses a critical blind spot in LLM evaluation. Paper 1, while practically useful, is more incremental—compiling workflows into weights is well-established, and the contribution is primarily engineering-focused with narrower scope.

vs. GENSTRAT: Toward a Science of Strategic Reasoning in Large Language Models

gpt-5.2 · 5/25/2026

Paper 2 likely has higher impact: it introduces a general, procedurally generated, contamination-resistant benchmark for strategic reasoning with a principled capability-profile decomposition and a new stability/jaggedness metric. This framework is broadly applicable to evaluating LLMs as agents across economics, game theory, security, and multi-agent AI, and is timely given deployment in markets and negotiations. Paper 1 identifies an important inverse-scaling failure mode in forecasting and improves evaluation practice, but its domain scope is narrower (time-series/tail risk) and the main contribution is diagnostic rather than a widely extensible evaluation paradigm.

vs. Search-E1: Self-Distillation Drives Self-Evolution in Search-Augmented Reasoning

gemini-3.1 · 5/25/2026

Paper 2 identifies a critical, counter-intuitive vulnerability (inverse scaling) in LLM forecasting on high-stakes tasks like epidemiology and finance. It exposes fundamental flaws in current evaluation metrics and introduces a new contamination-free benchmark. Its implications for AI safety, real-world deployment, and evaluation methodology are profound. In contrast, Paper 1 presents a valuable but incremental algorithmic improvement for search-augmented reasoning pipelines. The broad relevance, safety implications, and novel empirical findings of Paper 2 give it higher potential scientific impact.

vs. Redrawing the AI Map: A Theory of Accountability Boundaries in Agentic Ecosystems

claude-opus-4.6 · 5/25/2026

Paper 1 presents a novel empirical finding (inverse scaling in LLM forecasting) with rigorous methodology, a new benchmark (FBSim), replication across multiple real-world domains, and actionable recommendations for evaluation practices. It directly challenges assumptions about capability scaling in a high-stakes application area (forecasting), with immediate implications for AI safety and deployment. Paper 2 offers a theoretical framework for AI accountability boundaries that, while relevant, is more conceptual and narrower in empirical grounding. Paper 1's combination of surprising empirical results, methodological contribution, and broad applicability gives it higher impact potential.

vs. ImProver 2: Iteratively Self-Improving LMs for Neurosymbolic Proof Optimization

gemini-3.1 · 5/25/2026

Paper 1 identifies a critical 'inverse scaling' vulnerability in LLMs for high-stakes forecasting (e.g., finance, epidemiology), showing that more capable models perform worse on tail risks. This challenges prevailing scaling assumptions and evaluation metrics, offering broad implications for AI safety, benchmarking, and real-world deployment. Paper 2 presents a valuable framework for formal proof optimization, but its impact is more narrowly confined to the automated theorem proving and formal math communities. Paper 1's broader relevance across multiple domains makes its potential impact higher.

vs. Skill Weaving: Efficient LLM Improvement via Modular Skillpacks

claude-opus-4.6 · 5/25/2026

Paper 1 identifies a fundamental and counterintuitive failure mode of LLMs—inverse scaling on forecasting tasks with superlinear growth and tail risk—with implications across finance, epidemiology, and AI safety. It introduces a new benchmark, provides rigorous mechanistic analysis (per-quantile decomposition, within-family scaling studies), and challenges prevailing assumptions that more capable models are universally better. The finding that standard evaluation metrics miss this failure has broad methodological implications for the entire LLM evaluation community. Paper 2 presents a useful engineering contribution for modular LLM specialization but is more incremental in nature.

vs. CP or DP? Why Not Both: A Case Study in the Partial Shop Scheduling Problem

gpt-5.2 · 5/25/2026

Paper 1 is likely higher impact: it identifies a surprising inverse-scaling failure mode in LLM probabilistic forecasting under superlinear growth and regime-change tail risk, supported by a new contamination-free benchmark plus replication on multiple real datasets. The finding has immediate implications for evaluating and deploying LLMs in high-stakes domains (finance, epidemiology) and critiques prevalent benchmark metrics, influencing methodology across AI forecasting research. Paper 2 is a solid hybrid CP+DP integration but is a case study on one scheduling problem and is explicitly not competitive with state-of-the-art, limiting novelty and broader adoption.

vs. Meta-Soft: Leveraging Composable Meta-Tokens for Context-Preserving KV Cache Compression

gemini-3.1 · 5/25/2026

Paper 1 reveals a critical inverse scaling phenomenon in LLM forecasting, specifically in high-stakes domains like finance and epidemiology. Uncovering that more capable models perform worse on tail risks challenges current scaling laws and AI safety assumptions. While Paper 2 offers a valuable technical optimization for KV cache compression, Paper 1's findings have broader, more profound implications for the evaluation, reliability, and deployment of advanced AI systems in critical real-world applications.

vs. MemAudit: Post-hoc Auditing of Poisoned Agent Memory via Causal Attribution and Structural Anomaly Detection

gemini-3.1 · 5/25/2026

Paper 1 identifies a novel, counterintuitive inverse scaling phenomenon in critical domains like finance and epidemiology. By exposing a fundamental flaw in how advanced LLMs handle tail-risk and challenging standard evaluation metrics, it offers broader implications for AI safety, benchmarking, and real-world deployment compared to Paper 2's specialized security defense tool.

vs. Format-Constraint Coupling in Knowledge Graph Construction from Statistical Tables

gemini-3.1 · 5/25/2026

Paper 1 identifies a critical inverse scaling phenomenon in LLMs applied to high-stakes forecasting (finance, epidemiology), revealing that more capable models perform worse on tail risks. This has broad implications for AI safety, model evaluation, and deployment in real-world scenarios. Paper 2, while methodologically rigorous, focuses on a narrower, domain-specific issue regarding format-schema interactions in knowledge graph extraction, limiting its broader scientific and practical impact compared to the fundamental limitations exposed in Paper 1.

vs. TerminalWorld: Benchmarking Agents on Real-World Terminal Tasks

gpt-5.2 · 5/25/2026

Paper 2 (TerminalWorld) likely has higher scientific impact due to its broadly reusable, scalable benchmark generation engine built from large-scale real terminal recordings, yielding a substantial public resource (1,530 tasks) directly applicable to evaluating and improving agentic systems. Its real-world relevance and timeliness (agent reliability, software automation) plus weak correlation with existing benchmarks suggests it measures a distinct, important capability, potentially reshaping evaluation practice across academia and industry. Paper 1 is novel and important for forecasting calibration, but is narrower in scope and primarily influences evaluation methodology within forecasting/LLM reliability.