Is Capability a Liability? More Capable Language Models Make Worse Forecasts When It Matters Most

Nick Merrill, Jaeho Lee, Ezra Karger

May 21, 2026

arXiv:2605.22672v1

cs.AI (primary)

#314 of 2292 · Artificial Intelligence

Tournament Score: 1498 ± 49 (axis range 1050 to 1800)

Win Rate: 83%

Rating: 7.8 / 10

Significance 8.5 · Rigor 7.5 · Novelty 8 · Clarity 7.5

Abstract

We document inverse scaling in LLMs on forecasting problems whose underlying time series exhibit superlinear growth and tail risk of regime change, a structure common in finance and epidemiology. On these tasks, more capable models produce worse distributional forecasts. The pattern appears on ForecastBench-Sim (FBSim), a contamination-free, simulated-world benchmark we release, in forecasting synthetic SIR epidemics with a matched linear control, and replicates in real-world datasets on COVID-19, measles, housing markets, and hyperinflation. A per-quantile decomposition shows the failure concentrates at the upper tail, which more capable models shift upward to track aggressive extrapolations of growth, while the lower tail stays put. A within-family study of Llama-3.1 shows that both model scale and post-training independently contribute to this effect. Domain knowledge does not reliably rescue calibration. This inverse scaling does not appear on single-threshold metrics common in LLM forecasting benchmarks, reversing the sign of the capability-accuracy relationship on identical outputs. Single-threshold scoring at conventional cutoffs misses the upper-tail cost; tail-inclusive scoring reverses the sign of the capability-accuracy relationship on the same outputs. We recommend that LLM forecasting evaluations use continuous (and unbounded) measures of accuracy alongside bounded binary threshold metrics.

AI Impact Assessments (1 model)

Scientific Impact Assessment

Core Contribution

This paper identifies a structurally specific class of inverse scaling in LLMs: distributional forecasting on time series exhibiting superlinear growth followed by regime change. The key finding is that more capable models produce *worse* distributional forecasts on these tasks, and critically, this degradation is invisible under single-threshold binary metrics (like Brier score) commonly used in LLM forecasting benchmarks. The paper introduces ForecastBench-Sim (FBSim), a contamination-free benchmark built on procedurally generated strategy-game rollouts, and demonstrates the phenomenon across synthetic and real-world domains.

The contribution is dual: (1) an empirical phenomenon — "competence-driven overcommitment" where capable models aggressively extrapolate growth trajectories in their upper tails — and (2) a methodological critique showing that scoring rule choice can *reverse the sign* of capability-accuracy relationships on identical outputs.

Methodological Rigor

The experimental design is notably thorough and multi-layered:

Causal identification strategy. The authors address the key challenge that cross-family ECI correlations conflate many variables by deploying three complementary approaches: provider fixed effects, within-lineage replication (OpenAI N=10, p=.008), and a controlled 2×2 within-family experiment on Llama-3.1 ({70B, 405B} × {base, instruct}). The 2×2 design cleanly isolates scale and post-training contributions and shows they compound — a particularly convincing result.

Mechanism isolation. The synthetic SIR vs. linear control experiment is well-designed: both share crash structure, but only the superlinear-growth series produces inverse scaling (non-overlapping CIs). This rules out crash-alone as the trigger and pinpoints the combination of superlinear growth + regime change.
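The contrast described above can be sketched in a few lines. This is a minimal illustration, assuming a discrete-time SIR update and a linear control that ramps to the same peak and then reuses the epidemic's post-peak decline. The function names, parameter values, and matching rule are hypothetical and are not taken from the paper's generator.

```python
def sir_incidence(beta=0.4, gamma=0.1, population=10_000,
                  initial_infected=10, steps=60):
    """Discrete-time SIR model; returns new infections at each step.

    All parameter values are illustrative assumptions.
    """
    susceptible = float(population - initial_infected)
    infected = float(initial_infected)
    incidence = []
    for _ in range(steps):
        new_infections = beta * susceptible * infected / population
        recoveries = gamma * infected
        susceptible -= new_infections
        infected += new_infections - recoveries
        incidence.append(new_infections)
    return incidence


def matched_linear_control(series):
    """Linear ramp to the same peak, followed by the same post-peak values.

    This matching rule is an assumption made for illustration: it keeps the
    crash identical and changes only the shape of the growth phase.
    """
    peak_step = max(range(len(series)), key=series.__getitem__)
    peak = series[peak_step]
    ramp = [peak * (t + 1) / (peak_step + 1) for t in range(peak_step + 1)]
    return ramp + series[peak_step + 1:]


sir = sir_incidence()
linear = matched_linear_control(sir)
```

Both series peak at the same step and share the same decline, so any difference in forecast quality between them can be attributed to the shape of the growth phase rather than to the crash.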

Per-quantile decomposition. The pinball-loss analysis localizes the failure to the upper tail (p90), showing it's not generic miscalibration but asymmetric overcommitment. The calibration and sharpness analysis in Appendix E further confirms distributions widen asymmetrically rather than simply shifting.
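For readers unfamiliar with the pinball (quantile) loss, the sketch below shows how a per-quantile decomposition can localize an error to one tail. The two forecasts and the outcome are invented numbers chosen for illustration, not values from the paper, and the helper names are hypothetical.

```python
def pinball_loss(quantile_forecast, outcome, tau):
    """Standard pinball loss for a single quantile level tau in (0, 1)."""
    diff = outcome - quantile_forecast
    return tau * diff if diff >= 0 else (tau - 1) * diff


def per_quantile_losses(forecast, outcome):
    """Map each quantile level to its pinball loss.

    `forecast` is a dict of {tau: predicted quantile value}.
    """
    return {tau: pinball_loss(q, outcome, tau) for tau, q in forecast.items()}


# Hypothetical forecasts issued just before a regime change (a crash to 90).
cautious = {0.1: 80, 0.5: 100, 0.9: 130}
aggressive = {0.1: 80, 0.5: 110, 0.9: 300}  # upper tail extrapolates growth

cautious_losses = per_quantile_losses(cautious, outcome=90)
aggressive_losses = per_quantile_losses(aggressive, outcome=90)
```

In this toy case the p10 loss is identical for both forecasts (1 each), while the p90 loss is about 21 for the aggressive forecast against 4 for the cautious one, which is the kind of asymmetry the decomposition is meant to expose.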

Statistical rigor. Bootstrap CIs, exact permutation tests for small N, Wilcoxon signed-rank tests for paired comparisons, and multiple robustness checks (leave-one-provider-out, one-per-lineage collapse, reasoning-mode partition) are appropriate. The authors are transparent about limitations at small N (hyperinflation N=12, housing N=19).
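To illustrate why exact permutation tests suit panels this small, the sketch below enumerates every reordering of one variable and counts how often the correlation is at least as extreme as the observed one. Pearson correlation is used here only for simplicity; the test statistic the authors actually used is not stated in this assessment, and the function names are hypothetical.

```python
from itertools import permutations
from statistics import mean


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def exact_permutation_pvalue(x, y):
    """Two-sided exact p-value: share of permutations of y whose absolute
    correlation with x is at least the observed absolute correlation.

    Enumerates len(y)! orderings, so it is only practical for small N.
    """
    observed = abs(pearson(x, y))
    extreme = total = 0
    for shuffled in permutations(y):
        total += 1
        # Small tolerance guards against floating-point ties.
        if abs(pearson(x, shuffled)) >= observed - 1e-12:
            extreme += 1
    return extreme / total
```

Because every ordering is enumerated, the p-value is exact rather than a Monte Carlo estimate, at the cost of factorial growth in the number of permutations.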

Potential weaknesses in rigor: The cross-family analysis relies on ECI as a capability measure, which aggregates many axes. While the within-family ablation addresses this partially, alternative capability orderings could yield different patterns. The reasoning-effort clamping to "minimal" for time-series experiments is conservative but means the full reasoning-on results are only partially characterized. Domain selection for three of four real-world datasets is post-hoc (selected on known regime change), though the 35-season measles cohort is genuinely ex-ante.

Potential Impact

Immediate practical implications. The paper directly challenges the adequacy of current LLM forecasting evaluation practices. Major benchmarks (ForecastBench, KalshiBench) use exclusively binary/threshold metrics. The finding that metric choice reverses the sign of capability scaling — not merely attenuates it — is a strong call to action for the evaluation community. The recommendation to use CRPS or log scores alongside Brier is concrete and actionable.
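A toy example can show how a threshold metric and a continuous metric may rank the same two forecasts in opposite order. The sketch below scores sample-based forecasts with a Brier score at a single cutoff and with the standard sample estimator of the CRPS. The samples, outcome, and threshold are invented for illustration and are not taken from the paper.

```python
def brier_at_threshold(samples, outcome, threshold):
    """Brier score for the binary event 'value exceeds threshold'."""
    prob = sum(x > threshold for x in samples) / len(samples)
    event = 1.0 if outcome > threshold else 0.0
    return (prob - event) ** 2


def crps_from_samples(samples, outcome):
    """Sample estimator of CRPS: E|X - y| - 0.5 * E|X - X'|."""
    n = len(samples)
    accuracy = sum(abs(x - outcome) for x in samples) / n
    spread = sum(abs(a - b) for a in samples for b in samples) / (n * n)
    return accuracy - 0.5 * spread


# Hypothetical forecasts; the realized value after the regime change is 90.
outcome, threshold = 90, 100
forecast_a = [80, 95, 105, 110, 120]   # diffuse, no extreme tail
forecast_b = [85, 90, 95, 98, 400]     # tight body, one extreme upper sample

brier_a = brier_at_threshold(forecast_a, outcome, threshold)
brier_b = brier_at_threshold(forecast_b, outcome, threshold)
crps_a = crps_from_samples(forecast_a, outcome)
crps_b = crps_from_samples(forecast_b, outcome)
```

With these invented samples the threshold score favors the second forecast (0.04 against 0.36) while the CRPS favors the first (8.4 against roughly 14.6): the cutoff at 100 cannot see how far above it the upper tail extends.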

Domain-specific consequences. The identified failure mode affects exactly the domains where tail accuracy matters most: epidemic forecasting (resource allocation, intervention timing), financial risk management (VaR, expected shortfall), and monetary policy. The measles finding is particularly timely given resurgent outbreaks. The demonstration that LLMs systematically miscalibrate the upper tail during regime change in epidemic forecasting should concern the growing community applying LLMs to disease surveillance (e.g., Du et al., 2025 in Nature Computational Science).

Broader ML implications. The finding extends the inverse scaling literature in an important direction: unlike prior adversarial or prompt-dependent examples, this inverse scaling is structurally identifiable, persists across families and prompting interventions, and hasn't resolved at frontier scale. The knowledge-calibration gap (models correctly identify hyperinflation crises in 46/48 probes yet fail to translate this into calibrated tails) raises fundamental questions about how LLMs translate retrieved knowledge into probabilistic outputs.

Training implications. The within-family result showing post-training (RLHF/instruction tuning) independently amplifies the failure suggests current training objectives may actively worsen distributional calibration on tail-sensitive tasks.

Timeliness & Relevance

The paper is exceptionally timely on multiple fronts: (1) LLMs are being actively deployed for forecasting in epidemiology and finance; (2) the AI safety community is increasingly concerned about inverse scaling and capability-risk relationships; (3) evaluation methodology for LLM forecasting is actively being developed; (4) measles resurgence makes epidemic forecasting a live public health concern. The gap between what benchmarks certify and what actually matters in deployment is a current bottleneck this paper directly addresses.

Strengths & Limitations

Key strengths:

Multi-domain replication with consistent mechanism across synthetic and real-world data

The metric-reversal finding (same outputs, opposite conclusions under different proper scoring rules) is striking and practically important

FBSim as a contamination-free benchmark contribution

The within-family 2×2 is a methodologically clean contribution

Extensive robustness analysis and transparent limitations section

The knowledge-calibration gap is a genuinely novel and provocative finding

Notable limitations:

The cross-family panel has N=28-29 models, limiting statistical power for subgroup analyses

Three of four real-world domains are selected on known regime change

The paper cannot explain *why* recoverable priors fail to translate into calibrated tails — this mechanistic gap is acknowledged but unresolved

FBSim's game-world setting may not generalize to all policy-relevant forecasting contexts

The parse-rate floor at ~ECI 115 means the inverse scaling is documented only among models capable enough to produce structured forecasts

Overall Assessment

This is a well-executed empirical paper with a clear, important finding that has immediate practical implications for LLM forecasting evaluation and deployment. The combination of a new benchmark, synthetic mechanism isolation, real-world replication, controlled within-family ablation, and the metric-reversal demonstration constitutes a thorough and convincing contribution. The main finding, that capability is a liability for distributional forecasting under regime change and that standard metrics hide this, deserves serious attention from both the LLM evaluation community and domain practitioners deploying LLMs for consequential forecasting.

Rating: 7.8 / 10

Significance 8.5 · Rigor 7.5 · Novelty 8 · Clarity 7.5

Generated May 22, 2026

Comparison History (23)

vs. Perception or Prejudice: Can MLLMs Go Beyond First Impressions of Personality?

gemini-3.1 · 5/22/2026

Paper 2 identifies a critical, counter-intuitive inverse scaling phenomenon in high-stakes forecasting domains like epidemiology and finance. Discovering that increased model capabilities degrade performance on tail-risk predictions addresses urgent AI safety and reliability concerns. Its profound implications for real-world decision-making and LLM evaluation methodology give it a broader, more significant potential scientific impact than Paper 1's narrower focus on personality perception.

vs. PopuLoRA: Co-Evolving LLM Populations for Reasoning Self-Play

claude-opus-4.6 · 5/22/2026

PopuLoRA introduces a novel population-based co-evolutionary framework for LLM post-training that addresses a fundamental limitation of single-agent self-play (self-calibration to easy problems). It demonstrates consistent improvements across 10 benchmarks in both code and math reasoning, with practical LoRA weight-space evolution operators. Paper 2 makes an important observation about inverse scaling in forecasting with tail risk, but is more diagnostic/evaluative in nature. PopuLoRA's methodological contribution—combining population-based training, asymmetric self-play, and weight-space evolution—opens broader research directions for scalable LLM training and has more transformative potential.

vs. SGR-Bench: Benchmarking Search Agents on State-Gated Retrieval

claude-opus-4.6 · 5/22/2026

Paper 1 identifies a fundamental and counterintuitive failure mode of LLMs—inverse scaling on forecasting tasks with superlinear growth and tail risk—with broad implications for high-stakes domains like finance and epidemiology. It challenges prevailing assumptions that more capable models are uniformly better, provides mechanistic analysis, and offers actionable evaluation recommendations. The finding is validated across synthetic and real-world datasets. Paper 2 introduces a useful but narrower benchmark for web retrieval agents. While well-executed, it addresses a more specialized problem with less paradigm-shifting potential.

vs. SMDD-Bench: Can LLMs Solve Real-World Small Molecule Drug Design Tasks?

gpt-5.2 · 5/22/2026

Paper 1 identifies a counterintuitive inverse-scaling failure mode in LLM probabilistic forecasting under superlinear growth and regime-change tail risk, validates it across simulated and multiple real-world domains, and shows how common benchmark metrics can mask the issue—directly impacting evaluation practice and deployment safety in high-stakes settings (finance, epidemiology). Its methodological contributions (tail-focused decomposition, within-family scale vs post-training analysis, metric critique) are broadly applicable beyond forecasting. Paper 2 is valuable as a standardized drug-design agent benchmark, but its impact is more field-specific and primarily infrastructural.

vs. WorkstreamBench: Evaluating LLM Agents on End-to-End Spreadsheet Tasks in Finance

gpt-5.2 · 5/22/2026

Paper 2 likely has higher scientific impact: it identifies a counterintuitive inverse-scaling failure mode in LLM forecasting under superlinear growth and regime-change tail risk, validates it across simulated and multiple real-world domains, and pinpoints evaluation-metric artifacts that can flip conclusions. This is novel, timely, and broadly relevant to ML evaluation, calibration, and decision-making in high-stakes settings (finance, epidemiology). Paper 1 is valuable but more domain- and benchmark-specific, primarily extending evaluation for spreadsheet-agent workflows rather than uncovering a generalizable modeling pathology.

vs. A Causal Argumentation Method for Explainability of Machine Learning Models

gemini-3.1 · 5/22/2026

Paper 1 identifies a critical and counterintuitive vulnerability in modern LLMs (inverse scaling on tail risks), challenging current assumptions about AI capabilities. Its findings have immediate, high-stakes implications for real-world applications in epidemiology and finance. Furthermore, introducing a new benchmark and highlighting flaws in standard evaluation metrics provides broad value to the AI community. While Paper 2 offers a solid methodological contribution to XAI, Paper 1 addresses a more urgent, timely issue with broader cross-disciplinary impact.

vs. Trace2Skill: Verifier-Guided Skill Evolution for Long-Context EDA Agents

claude-opus-4.6 · 5/22/2026

Paper 1 documents a fundamental and counterintuitive finding—inverse scaling in LLM forecasting on critical tasks involving tail risk and superlinear growth—with broad implications across AI safety, finance, epidemiology, and evaluation methodology. It challenges prevailing assumptions that more capable models are universally better, proposes actionable evaluation recommendations, and spans multiple domains. Paper 2, while technically solid, addresses a narrower problem (Verilog design agents) with more limited cross-field impact. Paper 1's findings are more likely to reshape how the community evaluates and deploys LLMs in high-stakes forecasting settings.

vs. A Subjective Logic-based method for runtime confidence updates in safety arguments

claude-opus-4.6 · 5/22/2026

Paper 2 documents a counterintuitive and broadly impactful finding—inverse scaling in LLMs for forecasting tasks with tail risk—supported by both synthetic and real-world datasets across multiple domains. It challenges prevailing assumptions that more capable models are uniformly better, has immediate implications for AI safety, finance, epidemiology, and LLM evaluation methodology, and provides actionable recommendations. Paper 1, while rigorous, addresses a more niche topic (runtime safety case confidence updates using Subjective Logic) with narrower applicability primarily to safety-critical systems engineering.

vs. FLUID: From Ephemeral IDs to Multimodal Semantic Codes for Industrial-Scale Livestreaming Recommendation

gpt-5.2 · 5/22/2026

Paper 2 has higher likely scientific impact: it identifies a counterintuitive, general failure mode (inverse scaling) in LLM forecasting under superlinear growth and regime-change tail risk, spanning multiple domains (finance, epidemiology, macro) and tying the effect to evaluation methodology. It contributes a new benchmark (ForecastBench-Sim), decompositions pinpointing where errors arise (upper-tail), and evidence across model scale and post-training, making it broadly relevant to ML reliability, evaluation, and deployment. Paper 1 is strong and proven in production but is more application-specific to livestream recommendation.

vs. Beyond the Org Chart: AI and the Transformation of Invisible Work

gpt-5.2 · 5/22/2026

Paper 1 has higher likely scientific impact due to clear technical novelty (inverse scaling in LLM forecasting under superlinear growth/regime-change tail risk), a new benchmark (ForecastBench-Sim), multi-domain empirical replication (epidemiology, finance, housing, inflation), and a methodological contribution (per-quantile tail decomposition; critique of threshold metrics). It is timely for LLM deployment in high-stakes forecasting and offers actionable evaluation guidance. Paper 2 is valuable HCI/org research with practical relevance, but its qualitative scope (n=24, single firm) limits generalizability and cross-field methodological influence.

vs. Towards a compositional semantics for quantitative confidence assessment in assurance arguments

claude-opus-4.6 · 5/22/2026

Paper 1 documents a surprising and consequential finding—inverse scaling in LLMs on forecasting tasks with superlinear growth and tail risk—with broad implications for AI safety, finance, epidemiology, and LLM evaluation methodology. It introduces a new benchmark, provides rigorous decomposition of the failure mode, replicates across synthetic and real-world domains, and offers actionable recommendations for evaluation practices. Its timeliness is high given widespread LLM deployment in forecasting. Paper 2 makes a solid but more niche contribution to assurance argument semantics using Subjective Logic, with narrower applicability primarily within safety engineering.

vs. ExComm: Exploration-Stage Communication for Error-Resilient Agentic Test-Time Scaling

gemini-3.1 · 5/22/2026

Paper 2 identifies a critical inverse scaling phenomenon in high-stakes LLM forecasting (e.g., finance, epidemiology), challenging the assumption that greater capability yields better results. By exposing how standard metrics mask tail-risk failures, it offers profound implications for AI safety, evaluation methodology, and cross-disciplinary applications, granting it broader scientific impact than Paper 1's algorithmic improvement to agentic workflows.

vs. Scaling Observation-aware Planning in Uncertain Domains

claude-opus-4.6 · 5/22/2026

Paper 2 addresses a timely and broadly impactful question about LLM reliability in high-stakes forecasting domains (finance, epidemiology). Its finding of inverse scaling—where more capable models perform worse on tail risks—challenges prevailing assumptions and has immediate implications for AI safety, deployment policy, and benchmark design. It introduces a new benchmark, demonstrates the effect across multiple real-world domains, and provides actionable recommendations. Paper 1 makes solid contributions to POMDP planning scalability but targets a narrower community. Paper 2's cross-disciplinary relevance and timeliness give it higher potential impact.

vs. Skill Weaving: Efficient LLM Improvement via Modular Skillpacks

gpt-5.2 · 5/22/2026

Paper 2 likely has higher impact: it identifies a counterintuitive inverse-scaling failure mode in high-stakes forecasting with tail risk, supported by a new contamination-free benchmark plus replication on multiple real-world domains. The methodological contribution (per-quantile error analysis, within-family scaling/post-training study, and metric critique showing sign reversals) can reshape evaluation practice across forecasting, AI safety, finance, and epidemiology. Paper 1 is practically valuable for modular specialization and efficiency, but it extends an active line (adapters/deltas/compression) and its impact is narrower to deployment/engineering compared with Paper 2’s cross-field implications and timely relevance.

vs. Von Neumann Networks

gemini-3.1 · 5/22/2026

Paper 2 introduces a fundamentally novel neural architecture bridging cellular automata, neural operators, and deep learning. This theoretical and structural breakthrough offers potential universality and parameter efficiency, suggesting a foundational paradigm shift in network design. While Paper 1 is highly timely and practically important for LLM evaluation, Paper 2's proposed architecture has a much higher ceiling for broad, revolutionary impact across all of artificial intelligence.

vs. Generative Recursive Reasoning

gpt-5.2 · 5/22/2026

Paper 2 likely has higher impact: it introduces a broadly applicable modeling framework (GRAM) that generalizes recursive reasoning to probabilistic multi-trajectory computation, enabling inference-time scaling, multimodal hypothesis exploration, and both conditional and unconditional generation. This is a novel architectural/training contribution with potential to influence reasoning, planning, generative modeling, and efficient test-time compute across many domains. Paper 1 is timely and valuable for evaluation and safety in forecasting, but it is more domain-specific (forecasting/tail risk) and primarily diagnostic/recommendational rather than proposing a new general modeling paradigm.

vs. Generative Recursive Reasoning

gemini-3.1 · 5/22/2026

Paper 2 introduces a fundamental architectural innovation in neural reasoning by enabling probabilistic multi-trajectory computation in latent space. This shift from deterministic, autoregressive models to stochastic recursive models has profound implications for the future design of general AI reasoning systems. While Paper 1 provides valuable empirical insights into current LLM limitations in specific forecasting scenarios, Paper 2 offers a broader, foundational methodology that could widely influence how extended computation and constraint satisfaction are approached across the field of machine learning.

vs. AgentCo-op: Retrieval-Based Synthesis of Interoperable Multi-Agent Workflows

claude-opus-4.6 · 5/22/2026

Paper 2 identifies a fundamental and counterintuitive phenomenon—inverse scaling in LLM forecasting for superlinear/tail-risk scenarios—with broad implications across finance, epidemiology, and AI safety. It challenges the prevailing assumption that more capable models are universally better, introduces a contamination-free benchmark, and provides actionable recommendations for evaluation methodology. This finding has high novelty, broad cross-disciplinary relevance, and timely importance as LLMs are increasingly deployed for real-world forecasting. Paper 1, while solid engineering work on multi-agent workflows, is more incremental in its contribution to the existing agentic AI design literature.

vs. MPDocBench-Parse: Benchmarking Practical Multi-page Document Parsing

gpt-5.2 · 5/22/2026

Paper 2 has higher potential impact: it identifies an unexpected inverse-scaling failure mode in more capable LLMs on high-stakes forecasting with tail risk and regime change, validates it across simulated and multiple real-world domains, and shows common evaluation metrics can mask the problem—implying immediate changes to benchmarking and deployment practice. This is timely given rapid LLM adoption in decision-making. Paper 1 is a valuable benchmark contribution for multi-page document parsing, but its impact is more incremental and primarily within document AI, whereas Paper 2 affects broader ML evaluation, safety, and applied forecasting fields.

vs. PALS: Power-Aware LLM Serving for Mixture-of-Experts Models

gpt-5.2 · 5/22/2026

Paper 2 likely has higher scientific impact: it identifies a counterintuitive inverse-scaling failure mode in LLM forecasting under superlinear growth and tail-regime risk, introduces a new benchmark, and shows replication across simulated and multiple real-world domains. The findings affect evaluation methodology (tail-inclusive scoring) and practical deployment in high-stakes settings (finance/epidemiology), with broad relevance to ML safety, calibration, and benchmarking. Paper 1 is a solid systems contribution with clear datacenter applicability, but its novelty and cross-field reach are narrower and more incremental compared to the conceptual and evaluative implications of Paper 2.