Is Capability a Liability? More Capable Language Models Make Worse Forecasts When It Matters Most
Nick Merrill, Jaeho Lee, Ezra Karger
Abstract
We document inverse scaling in LLMs on forecasting problems whose underlying time series exhibit superlinear growth and tail risk of regime change, a structure common in finance and epidemiology. On these tasks, more capable models produce worse distributional forecasts. The pattern appears on ForecastBench-Sim (FBSim), a contamination-free, simulated-world benchmark we release, and in forecasts of synthetic SIR epidemics with a matched linear control, and it replicates in real-world datasets on COVID-19, measles, housing markets, and hyperinflation. A per-quantile decomposition shows the failure concentrates at the upper tail, which more capable models shift upward to track aggressive extrapolations of growth, while the lower tail stays put. A within-family study of Llama-3.1 shows that both model scale and post-training independently contribute to this effect. Domain knowledge does not reliably rescue calibration. This inverse scaling does not appear on the single-threshold metrics common in LLM forecasting benchmarks: single-threshold scoring at conventional cutoffs misses the upper-tail cost, while tail-inclusive scoring reverses the sign of the capability-accuracy relationship on the same outputs. We recommend that LLM forecasting evaluations use continuous (and unbounded) measures of accuracy alongside bounded binary threshold metrics.
AI Impact Assessments
Scientific Impact Assessment
Core Contribution
This paper identifies a structurally specific class of inverse scaling in LLMs: distributional forecasting on time series exhibiting superlinear growth followed by regime change. The key finding is that more capable models produce *worse* distributional forecasts on these tasks, and critically, this degradation is invisible under single-threshold binary metrics (like Brier score) commonly used in LLM forecasting benchmarks. The paper introduces ForecastBench-Sim (FBSim), a contamination-free benchmark built on procedurally generated strategy-game rollouts, and demonstrates the phenomenon across synthetic and real-world domains.
The contribution is twofold: (1) an empirical phenomenon, "competence-driven overcommitment," in which capable models aggressively extrapolate growth trajectories in their upper tails, and (2) a methodological critique showing that scoring-rule choice can *reverse the sign* of the capability-accuracy relationship on identical outputs.
Methodological Rigor
The experimental design is notably thorough and multi-layered:
Causal identification strategy. The authors address the key challenge that cross-family ECI correlations conflate many variables by deploying three complementary approaches: provider fixed effects, within-lineage replication (OpenAI N=10, p=.008), and a controlled 2×2 within-family experiment on Llama-3.1 ({70B, 405B} × {base, instruct}). The 2×2 design cleanly isolates scale and post-training contributions and shows they compound — a particularly convincing result.
Mechanism isolation. The synthetic SIR vs. linear control experiment is well-designed: both share crash structure, but only the superlinear-growth series produces inverse scaling (non-overlapping CIs). This rules out crash-alone as the trigger and pinpoints the combination of superlinear growth + regime change.
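To make the contrast between the two synthetic series concrete, here is a minimal sketch of a discrete-time SIR epidemic next to a linear-growth series with a similar crash. It is not the paper's generator: the function names, the parameters (`beta`, `gamma`, `slope`, `decay`), and the step counts are all hypothetical choices for illustration.

```python
def simulate_sir(beta=0.4, gamma=0.1, population=10_000.0,
                 initial_infected=10.0, steps=120):
    """Discrete-time SIR model; returns the infected count at each step.

    All parameter values are illustrative assumptions, not the paper's.
    """
    s, i, r = population - initial_infected, initial_infected, 0.0
    infected = []
    for _ in range(steps):
        new_infections = beta * s * i / population  # superlinear early growth
        new_recoveries = gamma * i
        s -= new_infections
        i += new_infections - new_recoveries
        r += new_recoveries
        infected.append(i)
    return infected


def linear_then_crash(slope=50.0, peak_step=40, decay=0.9, steps=120):
    """Hypothetical control: constant increments, then a geometric crash."""
    series = []
    value = 0.0
    for t in range(steps):
        if t < peak_step:
            value += slope   # linear growth phase
        else:
            value *= decay   # crash phase, same qualitative shape as SIR decline
        series.append(value)
    return series


sir = simulate_sir()
control = linear_then_crash()
print("SIR early growth ratio:", sir[1] / sir[0])
print("Control early increments:", control[1] - control[0], control[2] - control[1])
```

In this sketch both series rise and then collapse, but only the SIR series grows by a roughly constant ratio per step before its peak. Extrapolating that ratio is what inflates an upper-tail forecast, whereas extrapolating the control's constant increment does not.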
Per-quantile decomposition. The pinball-loss analysis localizes the failure to the upper tail (p90), showing it's not generic miscalibration but asymmetric overcommitment. The calibration and sharpness analysis in Appendix E further confirms distributions widen asymmetrically rather than simply shifting.
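The per-quantile view can be illustrated with the standard pinball (quantile) loss, which for a realized value y and a predicted quantile q at level tau is tau * (y - q) when y >= q and (1 - tau) * (q - y) otherwise. The sketch below uses made-up forecasts, not values from the paper, to show how an inflated p90 raises the loss at that quantile alone.

```python
def pinball_loss(y_true: float, q_pred: float, tau: float) -> float:
    """Pinball loss for a single predicted quantile at level tau."""
    diff = y_true - q_pred
    return tau * diff if diff >= 0 else (tau - 1.0) * diff


def per_quantile_losses(y_true: float, quantile_forecast: dict) -> dict:
    """Map each quantile level to its pinball loss."""
    return {tau: pinball_loss(y_true, q, tau)
            for tau, q in quantile_forecast.items()}


# Hypothetical forecasts for one target; the realized value is 100.
realized = 100.0
cautious = {0.1: 80.0, 0.5: 110.0, 0.9: 150.0}
aggressive = {0.1: 80.0, 0.5: 110.0, 0.9: 400.0}  # only the upper tail moves

cautious_losses = per_quantile_losses(realized, cautious)
aggressive_losses = per_quantile_losses(realized, aggressive)

for tau in (0.1, 0.5, 0.9):
    print(tau, round(cautious_losses[tau], 3), round(aggressive_losses[tau], 3))
```

Here the p10 and p50 losses are identical for the two forecasts, while the p90 loss rises from about 5 to about 30. That is the asymmetric signature the decomposition is designed to expose.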
Statistical rigor. Bootstrap CIs, exact permutation tests for small N, Wilcoxon signed-rank tests for paired comparisons, and multiple robustness checks (leave-one-provider-out, one-per-lineage collapse, reasoning-mode partition) are appropriate. The authors are transparent about limitations at small N (hyperinflation N=12, housing N=19).
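As a rough illustration of why exact permutation tests suit small samples, the sketch below enumerates every reordering of the outcome values and counts how often the association is at least as strong as observed. The data are invented, and the sum-of-products statistic is one simple choice; the paper's actual statistic and sidedness are not specified here.

```python
from itertools import permutations


def exact_permutation_pvalue(x, y):
    """One-sided exact p-value for a positive association between x and y.

    Uses the sum of products as the statistic, which orders permutations
    the same way as the Pearson correlation. Feasible only for small N,
    since it visits all N! reorderings of y.
    """
    observed = sum(a * b for a, b in zip(x, y))
    at_least_as_extreme = 0
    total = 0
    for perm in permutations(y):
        total += 1
        stat = sum(a * b for a, b in zip(x, perm))
        if stat >= observed - 1e-12:  # tolerance for float round-off
            at_least_as_extreme += 1
    return at_least_as_extreme / total


# Hypothetical data: capability rank vs. a forecast loss for six models.
capability = [1, 2, 3, 4, 5, 6]
loss = [0.8, 1.1, 1.5, 2.0, 2.9, 4.2]

p_value = exact_permutation_pvalue(capability, loss)
print("exact one-sided p-value:", p_value)
```

Because the invented losses increase strictly with capability, only the identity ordering reaches the observed statistic, so the p-value is 1/720. With N = 6 that is the smallest p-value this one-sided test can produce, which is why conclusions at very small N remain coarse.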
Potential weaknesses in rigor: The cross-family analysis relies on ECI as a capability measure, which aggregates many axes. While the within-family ablation addresses this partially, alternative capability orderings could yield different patterns. The reasoning-effort clamping to "minimal" for time-series experiments is conservative but means the full reasoning-on results are only partially characterized. Domain selection for three of four real-world datasets is post-hoc (selected on known regime change), though the 35-season measles cohort is genuinely ex-ante.
Potential Impact
Immediate practical implications. The paper directly challenges the adequacy of current LLM forecasting evaluation practices. Major benchmarks (ForecastBench, KalshiBench) use exclusively binary/threshold metrics. The finding that metric choice reverses the sign of capability scaling — not merely attenuates it — is a strong call to action for the evaluation community. The recommendation to use CRPS or log scores alongside Brier is concrete and actionable.
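The sign reversal can be shown on a toy case. The sketch below scores two invented sample-based forecasts twice: with a Brier score on a single threshold event, and with the CRPS estimated from samples as E|X - y| - 0.5 * E|X - X'|. The samples, threshold, and realized value are assumptions chosen to make the effect visible, not data from the paper.

```python
def brier_score(samples, threshold, realized):
    """Brier score for the event 'value exceeds threshold'."""
    prob = sum(1 for s in samples if s > threshold) / len(samples)
    outcome = 1.0 if realized > threshold else 0.0
    return (prob - outcome) ** 2


def crps_from_samples(samples, realized):
    """Sample-based CRPS estimate: E|X - y| - 0.5 * E|X - X'|."""
    n = len(samples)
    term_obs = sum(abs(s - realized) for s in samples) / n
    term_spread = sum(abs(a - b) for a in samples for b in samples) / (n * n)
    return term_obs - 0.5 * term_spread


# Hypothetical setup: realized value 100, threshold 120 (event did not occur).
realized, threshold = 100.0, 120.0
forecast_a = [90.0, 100.0, 110.0, 125.0, 130.0]   # modest upper tail
forecast_b = [95.0, 100.0, 105.0, 110.0, 1000.0]  # one extreme extrapolation

brier_a = brier_score(forecast_a, threshold, realized)
brier_b = brier_score(forecast_b, threshold, realized)
crps_a = crps_from_samples(forecast_a, realized)
crps_b = crps_from_samples(forecast_b, realized)

print("Brier:", round(brier_a, 3), round(brier_b, 3))  # lower is better
print("CRPS: ", round(crps_a, 3), round(crps_b, 3))    # lower is better
```

Forecast B looks better under the threshold metric, because it puts less probability above 120, yet it is far worse under CRPS, because its single extreme sample is penalized in proportion to its distance from the outcome. A bounded threshold score cannot see how far above the cutoff the tail extends.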
Domain-specific consequences. The identified failure mode affects exactly the domains where tail accuracy matters most: epidemic forecasting (resource allocation, intervention timing), financial risk management (VaR, expected shortfall), and monetary policy. The measles finding is particularly timely given resurgent outbreaks. The demonstration that LLMs systematically miscalibrate the upper tail during regime change in epidemic forecasting should concern the growing community applying LLMs to disease surveillance (e.g., Du et al., 2025 in Nature Computational Science).
Broader ML implications. The finding extends the inverse scaling literature in an important direction: unlike prior adversarial or prompt-dependent examples, this inverse scaling is structurally identifiable, persists across families and prompting interventions, and hasn't resolved at frontier scale. The knowledge-calibration gap (models correctly identify hyperinflation crises in 46/48 probes yet fail to translate this into calibrated tails) raises fundamental questions about how LLMs translate retrieved knowledge into probabilistic outputs.
Training implications. The within-family result showing post-training (RLHF/instruction tuning) independently amplifies the failure suggests current training objectives may actively worsen distributional calibration on tail-sensitive tasks.
Timeliness & Relevance
The paper is exceptionally timely on multiple fronts: (1) LLMs are being actively deployed for forecasting in epidemiology and finance; (2) the AI safety community is increasingly concerned about inverse scaling and capability-risk relationships; (3) evaluation methodology for LLM forecasting is actively being developed; (4) measles resurgence makes epidemic forecasting a live public health concern. The gap between what benchmarks certify and what actually matters in deployment is a current bottleneck this paper directly addresses.
Strengths & Limitations
Key strengths:
Notable limitations:
Overall Assessment
This is a well-executed empirical paper with a clear, important finding that has immediate practical implications for LLM forecasting evaluation and deployment. The combination of a new benchmark, synthetic mechanism isolation, real-world replication, controlled within-family ablation, and the metric-reversal demonstration constitutes a thorough and convincing contribution. The main finding — that capability is liability for distributional forecasting under regime change, and that standard metrics hide this — deserves serious attention from both the LLM evaluation community and domain practitioners deploying LLMs for consequential forecasting.
Generated May 22, 2026
Comparison History (23)
Paper 2 identifies a critical, counter-intuitive inverse scaling phenomenon in high-stakes forecasting domains like epidemiology and finance. Discovering that increased model capabilities degrade performance on tail-risk predictions addresses urgent AI safety and reliability concerns. Its profound implications for real-world decision-making and LLM evaluation methodology give it a broader, more significant potential scientific impact than Paper 1's narrower focus on personality perception.
PopuLoRA introduces a novel population-based co-evolutionary framework for LLM post-training that addresses a fundamental limitation of single-agent self-play (self-calibration to easy problems). It demonstrates consistent improvements across 10 benchmarks in both code and math reasoning, with practical LoRA weight-space evolution operators. Paper 2 makes an important observation about inverse scaling in forecasting with tail risk, but is more diagnostic/evaluative in nature. PopuLoRA's methodological contribution—combining population-based training, asymmetric self-play, and weight-space evolution—opens broader research directions for scalable LLM training and has more transformative potential.
Paper 1 identifies a fundamental and counterintuitive failure mode of LLMs—inverse scaling on forecasting tasks with superlinear growth and tail risk—with broad implications for high-stakes domains like finance and epidemiology. It challenges prevailing assumptions that more capable models are uniformly better, provides mechanistic analysis, and offers actionable evaluation recommendations. The finding is validated across synthetic and real-world datasets. Paper 2 introduces a useful but narrower benchmark for web retrieval agents. While well-executed, it addresses a more specialized problem with less paradigm-shifting potential.
Paper 1 identifies a counterintuitive inverse-scaling failure mode in LLM probabilistic forecasting under superlinear growth and regime-change tail risk, validates it across simulated and multiple real-world domains, and shows how common benchmark metrics can mask the issue—directly impacting evaluation practice and deployment safety in high-stakes settings (finance, epidemiology). Its methodological contributions (tail-focused decomposition, within-family scale vs post-training analysis, metric critique) are broadly applicable beyond forecasting. Paper 2 is valuable as a standardized drug-design agent benchmark, but its impact is more field-specific and primarily infrastructural.
Paper 2 likely has higher scientific impact: it identifies a counterintuitive inverse-scaling failure mode in LLM forecasting under superlinear growth and regime-change tail risk, validates it across simulated and multiple real-world domains, and pinpoints evaluation-metric artifacts that can flip conclusions. This is novel, timely, and broadly relevant to ML evaluation, calibration, and decision-making in high-stakes settings (finance, epidemiology). Paper 1 is valuable but more domain- and benchmark-specific, primarily extending evaluation for spreadsheet-agent workflows rather than uncovering a generalizable modeling pathology.
Paper 1 identifies a critical and counterintuitive vulnerability in modern LLMs (inverse scaling on tail risks), challenging current assumptions about AI capabilities. Its findings have immediate, high-stakes implications for real-world applications in epidemiology and finance. Furthermore, introducing a new benchmark and highlighting flaws in standard evaluation metrics provides broad value to the AI community. While Paper 2 offers a solid methodological contribution to XAI, Paper 1 addresses a more urgent, timely issue with broader cross-disciplinary impact.
Paper 1 documents a fundamental and counterintuitive finding—inverse scaling in LLM forecasting on critical tasks involving tail risk and superlinear growth—with broad implications across AI safety, finance, epidemiology, and evaluation methodology. It challenges prevailing assumptions that more capable models are universally better, proposes actionable evaluation recommendations, and spans multiple domains. Paper 2, while technically solid, addresses a narrower problem (Verilog design agents) with more limited cross-field impact. Paper 1's findings are more likely to reshape how the community evaluates and deploys LLMs in high-stakes forecasting settings.
Paper 2 documents a counterintuitive and broadly impactful finding—inverse scaling in LLMs for forecasting tasks with tail risk—supported by both synthetic and real-world datasets across multiple domains. It challenges prevailing assumptions that more capable models are uniformly better, has immediate implications for AI safety, finance, epidemiology, and LLM evaluation methodology, and provides actionable recommendations. Paper 1, while rigorous, addresses a more niche topic (runtime safety case confidence updates using Subjective Logic) with narrower applicability primarily to safety-critical systems engineering.
Paper 2 has higher likely scientific impact: it identifies a counterintuitive, general failure mode (inverse scaling) in LLM forecasting under superlinear growth and regime-change tail risk, spanning multiple domains (finance, epidemiology, macro) and tying the effect to evaluation methodology. It contributes a new benchmark (ForecastBench-Sim), decompositions pinpointing where errors arise (upper-tail), and evidence across model scale and post-training, making it broadly relevant to ML reliability, evaluation, and deployment. Paper 1 is strong and proven in production but is more application-specific to livestream recommendation.
Paper 1 has higher likely scientific impact due to clear technical novelty (inverse scaling in LLM forecasting under superlinear growth/regime-change tail risk), a new benchmark (ForecastBench-Sim), multi-domain empirical replication (epidemiology, finance, housing, inflation), and a methodological contribution (per-quantile tail decomposition; critique of threshold metrics). It is timely for LLM deployment in high-stakes forecasting and offers actionable evaluation guidance. Paper 2 is valuable HCI/org research with practical relevance, but its qualitative scope (n=24, single firm) limits generalizability and cross-field methodological influence.
Paper 1 documents a surprising and consequential finding—inverse scaling in LLMs on forecasting tasks with superlinear growth and tail risk—with broad implications for AI safety, finance, epidemiology, and LLM evaluation methodology. It introduces a new benchmark, provides rigorous decomposition of the failure mode, replicates across synthetic and real-world domains, and offers actionable recommendations for evaluation practices. Its timeliness is high given widespread LLM deployment in forecasting. Paper 2 makes a solid but more niche contribution to assurance argument semantics using Subjective Logic, with narrower applicability primarily within safety engineering.
Paper 2 identifies a critical inverse scaling phenomenon in high-stakes LLM forecasting (e.g., finance, epidemiology), challenging the assumption that greater capability yields better results. By exposing how standard metrics mask tail-risk failures, it offers profound implications for AI safety, evaluation methodology, and cross-disciplinary applications, granting it broader scientific impact than Paper 1's algorithmic improvement to agentic workflows.
Paper 2 addresses a timely and broadly impactful question about LLM reliability in high-stakes forecasting domains (finance, epidemiology). Its finding of inverse scaling—where more capable models perform worse on tail risks—challenges prevailing assumptions and has immediate implications for AI safety, deployment policy, and benchmark design. It introduces a new benchmark, demonstrates the effect across multiple real-world domains, and provides actionable recommendations. Paper 1 makes solid contributions to POMDP planning scalability but targets a narrower community. Paper 2's cross-disciplinary relevance and timeliness give it higher potential impact.
Paper 2 likely has higher impact: it identifies a counterintuitive inverse-scaling failure mode in high-stakes forecasting with tail risk, supported by a new contamination-free benchmark plus replication on multiple real-world domains. The methodological contribution (per-quantile error analysis, within-family scaling/post-training study, and metric critique showing sign reversals) can reshape evaluation practice across forecasting, AI safety, finance, and epidemiology. Paper 1 is practically valuable for modular specialization and efficiency, but it extends an active line (adapters/deltas/compression) and its impact is narrower to deployment/engineering compared with Paper 2’s cross-field implications and timely relevance.
Paper 2 introduces a fundamentally novel neural architecture bridging cellular automata, neural operators, and deep learning. This theoretical and structural breakthrough offers potential universality and parameter efficiency, suggesting a foundational paradigm shift in network design. While Paper 1 is highly timely and practically important for LLM evaluation, Paper 2's proposed architecture has a much higher ceiling for broad, revolutionary impact across all of artificial intelligence.
Paper 2 likely has higher impact: it introduces a broadly applicable modeling framework (GRAM) that generalizes recursive reasoning to probabilistic multi-trajectory computation, enabling inference-time scaling, multimodal hypothesis exploration, and both conditional and unconditional generation. This is a novel architectural/training contribution with potential to influence reasoning, planning, generative modeling, and efficient test-time compute across many domains. Paper 1 is timely and valuable for evaluation and safety in forecasting, but it is more domain-specific (forecasting/tail risk) and primarily diagnostic/recommendational rather than proposing a new general modeling paradigm.
Paper 2 introduces a fundamental architectural innovation in neural reasoning by enabling probabilistic multi-trajectory computation in latent space. This shift from deterministic, autoregressive models to stochastic recursive models has profound implications for the future design of general AI reasoning systems. While Paper 1 provides valuable empirical insights into current LLM limitations in specific forecasting scenarios, Paper 2 offers a broader, foundational methodology that could widely influence how extended computation and constraint satisfaction are approached across the field of machine learning.
Paper 2 identifies a fundamental and counterintuitive phenomenon—inverse scaling in LLM forecasting for superlinear/tail-risk scenarios—with broad implications across finance, epidemiology, and AI safety. It challenges the prevailing assumption that more capable models are universally better, introduces a contamination-free benchmark, and provides actionable recommendations for evaluation methodology. This finding has high novelty, broad cross-disciplinary relevance, and timely importance as LLMs are increasingly deployed for real-world forecasting. Paper 1, while solid engineering work on multi-agent workflows, is more incremental in its contribution to the existing agentic AI design literature.
Paper 2 has higher potential impact: it identifies an unexpected inverse-scaling failure mode in more capable LLMs on high-stakes forecasting with tail risk and regime change, validates it across simulated and multiple real-world domains, and shows common evaluation metrics can mask the problem—implying immediate changes to benchmarking and deployment practice. This is timely given rapid LLM adoption in decision-making. Paper 1 is a valuable benchmark contribution for multi-page document parsing, but its impact is more incremental and primarily within document AI, whereas Paper 2 affects broader ML evaluation, safety, and applied forecasting fields.
Paper 2 likely has higher scientific impact: it identifies a counterintuitive inverse-scaling failure mode in LLM forecasting under superlinear growth and tail-regime risk, introduces a new benchmark, and shows replication across simulated and multiple real-world domains. The findings affect evaluation methodology (tail-inclusive scoring) and practical deployment in high-stakes settings (finance/epidemiology), with broad relevance to ML safety, calibration, and benchmarking. Paper 1 is a solid systems contribution with clear datacenter applicability, but its novelty and cross-field reach are narrower and more incremental compared to the conceptual and evaluative implications of Paper 2.