When Do Autoregressive Sequence Models Forecast Physical Wavefields? A Controlled Study on Synthetic Seismograms

Waleed Esmail, Stuart Russell, Jana Klinge, Alexander Kappes, Christine Thomas

Jun 9, 2026 · arXiv:2606.10868v1

cs.LG · astro-ph.IM

#4712 of 5669 · cs.LG

Tournament Score: 1306 ± 42

Win Rate: 30%

Rating: 5.5 / 10

Significance 5 · Rigor 7.5 · Novelty 4.5 · Clarity 8.5

Abstract

Long-horizon autoregressive forecasting of oscillatory physical signals, such as seismograms, gravitational-wave strain, and similar wavefields, is limited by error accumulation: as a causal model is fed its own outputs over hundreds of steps, small per-step errors compound into phase drift that pointwise metrics fail to detect. We ask when such rollout stays stable, using synthetic three-component seismograms as a physically structured testbed and the SeismoGPT autoregressive forecaster as the model under study. Through controlled, intra-architecture ablations evaluated on free-running rollout with paired significance tests, we isolate the contribution of each design choice. Multi-token prediction is the dominant stabilizer, accounting for almost the entire improvement over a single-token baseline (+0.040 median NCC); a horizon-embedding hybrid prediction head and a cross-horizon STFT-magnitude coherence loss each add a small but consistent further gain. Performance depends sharply on a context-ratio threshold near one, roughly the full P-S interval of observed signal, below which rollout generalization collapses. The dominant residual failure is a polarity inversion that a magnitude-based spectral loss cannot, by construction, penalize, identifying phase-aware objectives as the natural next step. We frame this as a controlled study of rollout stability on oscillatory wavefields, not a benchmark of forecasting architectures.

AI Impact Assessments

(1 model)

Scientific Impact Assessment

Core Contribution

This paper conducts a controlled ablation study examining the factors that stabilize long-horizon autoregressive rollout of oscillatory physical signals, using synthetic three-component seismograms as a testbed and the SeismoGPT autoregressive forecaster as the model. The central finding is that multi-token prediction (MTP) is the dominant stabilizer, accounting for nearly all improvement over a single-token baseline (+0.040 out of +0.045 total median NCC gain). Two additional components—a horizon-embedding hybrid prediction head and a cross-horizon STFT-magnitude coherence loss—provide small but statistically significant additive gains (+0.009 and +0.005 NCC, respectively). The paper also identifies a sharp context-ratio threshold (ρ ≈ 1, corresponding to the full P-S seismic interval) below which rollout quality collapses, and diagnoses the dominant residual failure mode as polarity inversion that magnitude-based spectral losses cannot, by construction, penalize.

The paper is explicit—commendably so—that it is *not* proposing a new architecture or benchmarking across model families, but rather performing a careful within-architecture dissection of what drives rollout stability.
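The error-accumulation mechanism that motivates the study can be illustrated with a toy example. The sketch below is not SeismoGPT and not the paper's data: it assumes a hypothetical two-tap oscillator "model" whose frequency is off by 1%, rolls it forward on its own outputs, and compares a zero-lag normalized correlation (used here as a stand-in for NCC) on an early and a late window. All names and parameter values are illustrative assumptions.

```python
import math

def true_signal(n, omega):
    """Ground-truth sinusoid sampled at integer time steps."""
    return [math.sin(omega * t) for t in range(n)]

def free_running_rollout(context, n_steps, omega_model):
    """Roll a two-tap oscillator forward, feeding each output back as input.

    The recursion x[t+1] = 2*cos(w)*x[t] - x[t-1] reproduces a sinusoid of
    frequency w exactly, so any mismatch in w shows up purely as phase drift.
    """
    out = list(context)
    c = 2.0 * math.cos(omega_model)
    for _ in range(n_steps):
        out.append(c * out[-1] - out[-2])
    return out

def ncc(a, b):
    """Zero-lag normalized correlation between two equal-length windows."""
    num = sum(x * y for x, y in zip(a, b))
    den = math.sqrt(sum(x * x for x in a) * sum(y * y for y in b))
    return num / den

omega_true = 0.200
omega_model = 0.202   # hypothetical 1% per-step frequency error
n_total = 800

truth = true_signal(n_total, omega_true)
# Only the first two samples are observed; everything after is model output.
rollout = free_running_rollout(truth[:2], n_total - 2, omega_model)

ncc_early = ncc(truth[:100], rollout[:100])
ncc_late = ncc(truth[700:], rollout[700:])
print(f"NCC, steps 0-99:    {ncc_early:.3f}")
print(f"NCC, steps 700-799: {ncc_late:.3f}")
```

In this toy setting the amplitude of the rollout stays close to the truth throughout, yet the late-window correlation drops sharply because the two sinusoids are about a quarter cycle apart by step 800. That is the kind of phase drift the review refers to.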

Methodological Rigor

The experimental methodology is notably disciplined for this type of study:

Paired evaluation: All 10,000 test events are shared across configurations, enabling paired Wilcoxon signed-rank tests with bootstrap confidence intervals. This is substantially more rigorous than typical ablation studies that report only aggregate means.

Multiple complementary metrics: NCC (phase/timing), SRR (amplitude fidelity), and PSD error (spectral shape) separate different failure modes, revealing that auxiliary losses improve spectral shape even when they slightly hurt NCC—a nuance that single-metric studies would miss.

Honest limitation disclosure: The authors repeatedly acknowledge that each configuration is trained only once (no seed variation), that confidence intervals reflect test-event variability rather than training stochasticity, and that the small component effects (+0.005 and +0.009) should be interpreted cautiously. This transparency is exemplary.

However, the single-seed limitation is a genuine weakness. The two smaller effects are of the same order of magnitude as training noise in many deep learning settings, so without multi-seed experiments they could be artifacts of a particular initialization. The authors acknowledge this, but it remains a significant gap. Additionally, the study is confined to a single synthetic dataset and a single architecture family (causal transformers), which limits any generalizability claims.
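The paired protocol described above can be sketched as follows. This is a minimal illustration on synthetic stand-in scores, not the paper's results: the per-event values, the sample size of 1,000, and the helper names `wilcoxon_signed_rank` and `bootstrap_median_ci` are all assumptions made for the example. The signed-rank test uses the large-sample normal approximation and assumes no ties; in practice a library routine such as `scipy.stats.wilcoxon` would likely be used instead.

```python
import math
import numpy as np

def wilcoxon_signed_rank(diff):
    """Two-sided Wilcoxon signed-rank test (normal approximation).

    Assumes no ties among |diff|, which holds for continuous scores;
    exact zeros are dropped before ranking.
    """
    d = np.asarray(diff, dtype=float)
    d = d[d != 0.0]
    n = d.size
    ranks = np.argsort(np.argsort(np.abs(d))) + 1   # ranks 1..n of |d|
    w_plus = float(ranks[d > 0].sum())              # sum of positive ranks
    mean = n * (n + 1) / 4.0
    std = math.sqrt(n * (n + 1) * (2 * n + 1) / 24.0)
    z = (w_plus - mean) / std
    p = math.erfc(abs(z) / math.sqrt(2.0))          # two-sided p-value
    return w_plus, z, p

def bootstrap_median_ci(diff, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the median paired difference."""
    rng = np.random.default_rng(seed)
    d = np.asarray(diff, dtype=float)
    idx = rng.integers(0, d.size, size=(n_boot, d.size))  # resample events
    medians = np.median(d[idx], axis=1)
    lo, hi = np.quantile(medians, [alpha / 2, 1 - alpha / 2])
    return float(lo), float(hi)

# Synthetic stand-in scores: the same events evaluated under two configurations.
rng = np.random.default_rng(42)
n_events = 1000
baseline = rng.normal(0.60, 0.15, n_events)
variant = baseline + rng.normal(0.04, 0.05, n_events)

diff = variant - baseline            # pairing removes per-event difficulty
w_plus, z, p_value = wilcoxon_signed_rank(diff)
ci_low, ci_high = bootstrap_median_ci(diff)
print(f"median diff = {np.median(diff):+.4f}, 95% CI = [{ci_low:+.4f}, {ci_high:+.4f}]")
print(f"z = {z:.2f}, p = {p_value:.2e}")
```

Note that an interval of this kind captures variability across test events only. It says nothing about how much the median difference would move under a different training seed, which is the gap the review highlights.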

Potential Impact

The paper's impact operates on several levels:

1. Practical guidance for wavefield forecasting: The finding that MTP dominates rollout stability is actionable for practitioners working on seismograms, gravitational waves, or other oscillatory signals. The context-ratio threshold provides a concrete operational rule.

2. Diagnostic framework: The decomposition of rollout failure into phase drift versus amplitude decay, and the identification of polarity inversion as the dominant residual failure, provides a diagnostic vocabulary that could transfer to other autoregressive forecasting domains (weather, fluid dynamics, audio).

3. Loss function design insight: The clear demonstration that magnitude-based spectral losses *cannot by construction* correct phase errors is a useful negative result that should redirect loss-function engineering toward phase-aware objectives (anti-wrapping losses, complex STFT terms).

4. Cross-domain relevance: While framed in seismology, the findings about MTP stabilization of oscillatory rollout are potentially relevant to audio generation, biomedical signal forecasting (ECG/EEG), and neural PDE solvers—though this remains speculative without empirical validation.

The impact is somewhat bounded by the narrow scope: one model family, one synthetic dataset, no comparison to external baselines (Mamba, direct multi-horizon forecasters, etc.).
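The loss-design point in item 3 follows from a simple identity: negating a signal negates its STFT, which leaves the magnitude unchanged. The sketch below checks this numerically on a toy damped sinusoid. The framing parameters, the mean-squared form of the losses, and the helper names are assumptions for illustration and are not taken from the paper.

```python
import numpy as np

def stft(x, frame=64, hop=16):
    """Complex STFT via Hann-windowed frames and a real FFT."""
    window = np.hanning(frame)
    starts = range(0, len(x) - frame + 1, hop)
    frames = np.stack([x[s:s + frame] * window for s in starts])
    return np.fft.rfft(frames, axis=1)

def ncc(a, b):
    """Zero-lag normalized correlation."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

t = np.arange(512)
target = np.sin(0.2 * t) * np.exp(-t / 300.0)   # toy damped oscillation
flipped = -target                                # polarity-inverted forecast

S_target = stft(target)
S_flipped = stft(flipped)

# Magnitude-only spectral loss: blind to the sign flip.
mag_loss = float(np.mean((np.abs(S_flipped) - np.abs(S_target)) ** 2))

# Complex spectral loss: compares real and imaginary parts, so it sees the flip.
complex_loss = float(np.mean(np.abs(S_flipped - S_target) ** 2))

print(f"NCC(target, flipped)  = {ncc(target, flipped):+.3f}")
print(f"magnitude-STFT loss   = {mag_loss:.3e}")
print(f"complex-STFT loss     = {complex_loss:.3e}")
```

A forecast with the worst possible correlation thus incurs essentially no magnitude-based penalty, while a complex-valued term does register the error. This is one simple instance of the phase-aware objectives the review points toward.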

Timeliness & Relevance

The paper addresses a genuine and growing need. Autoregressive transformers are increasingly applied to physical time series, and the rollout stability problem is a recognized bottleneck. Multi-token prediction has gained significant attention through DeepSeek-V3 and Gloeckle et al. (2024) in language modeling, and quantifying its effect in a different domain (physical wavefields) is timely. The connection between phase drift and exposure bias in oscillatory signals is underexplored, making this a relevant niche contribution.

Strengths

1. Exceptional transparency: The paper is unusually honest about what it does and does not show. Claims are carefully scoped, limitations are front-loaded, and the authors explicitly distinguish within-run effects from fully established findings.

2. Clean experimental design: The matched-ablation framework with paired tests is a model for how ablation studies should be conducted.

3. Failure mode analysis: The polarity-inversion diagnosis is mechanistically grounded (magnitude STFT invariance to sign) and directly motivates a concrete research direction.

4. Physical interpretability of the context-ratio threshold: Linking the ρ ≈ 1 threshold to the P-S interval gives the finding physical meaning beyond a raw number.

Limitations

1. Single training seed: The most critical weakness. Small effects (+0.005, +0.009 NCC) may not survive multi-seed evaluation.

2. Synthetic data only: Real seismograms have noise, instrument response, and complexity that synthetic data lacks. Transferability is unestablished.

3. No external baselines: Without comparison to state-space models, direct forecasters, or other architectures, the findings are confined to the SeismoGPT family.

4. Incremental novelty: The dominant finding (MTP helps) validates existing work; the two smaller findings are modest in magnitude and uncertain in robustness.

5. No formal analysis: The paper offers empirical observations about why MTP and coherence losses help, but no theoretical framework for rollout stability in oscillatory signals.

Overall Assessment

This is a carefully executed, honestly scoped ablation study that makes a useful empirical contribution to understanding autoregressive rollout on oscillatory signals. Its main value lies in the rigor of the experimental protocol and the diagnostic clarity of the failure analysis, rather than in large-magnitude discoveries. The dominant finding—that MTP is the key stabilizer—is solid but somewhat expected; the secondary findings are interesting but insufficiently robust (single seed). The paper would benefit significantly from multi-seed validation and at least one external baseline comparison.

Rating: 5.5 / 10

Significance 5 · Rigor 7.5 · Novelty 4.5 · Clarity 8.5

Generated Jun 10, 2026

Comparison History (23)

Lost vs. Finding Multiple Interpretations in Datasets

Paper 1 addresses a fundamental problem in machine learning interpretability and scientific discovery (the Rashomon effect), offering a methodology applicable across diverse domains to extract varied insights. Paper 2, while methodologically rigorous, is highly specialized, focusing specifically on autoregressive forecasting stability for oscillatory physical signals, which limits its broader impact across disciplines.

gemini-3.1-pro-preview·Jun 11, 2026

Lost vs. RCAP: Robust, Class-Aware, Probabilistic Dynamic Dataset Pruning

Paper 1 proposes a broadly applicable, novel dynamic dataset pruning method with class-aware closed-form allocation and adaptive sampling, demonstrating strong empirical gains (including improved worst-group accuracy and large speedups) across multiple datasets, models, and training paradigms—suggesting high real-world utility and broad ML impact. Paper 2 is a careful, rigorous controlled study that yields valuable insights for autoregressive rollout stability in oscillatory signals, but its scope is narrower (primarily sequence forecasting for wavefields) and more diagnostic than transformative. Overall, Paper 1 is likely to have wider and more immediate impact.

gpt-5.2·Jun 11, 2026

Won vs. Time-Series Foundation Model Embeddings for Remaining Useful Life Estimation

Paper 2 has higher impact potential due to clearer methodological rigor and broader scientific relevance. It tackles a fundamental limitation of autoregressive rollout for oscillatory physical signals (error accumulation/phase drift), uses a controlled synthetic testbed, performs intra-architecture ablations with paired significance tests, and yields actionable mechanistic insights (multi-token prediction dominance, context-ratio threshold, phase-aware loss need). These findings can generalize to seismology, gravitational-wave analysis, and other wavefield forecasting domains. Paper 1 is timely and useful for predictive maintenance but is more application-focused and leverages an existing foundation model with a lightweight head, offering less conceptual novelty.

gpt-5.2·Jun 11, 2026

Lost vs. Efficient Multinomial Logistic Bandit via Frequent Directions

Paper 2 introduces a novel algorithmic contribution (EOFD-MLogB) with rigorous theoretical guarantees (regret bounds) and significant computational improvements for multinomial logistic bandits, a fundamental problem in online learning. It has broader applicability across recommendation systems, clinical trials, and other sequential decision-making domains. Paper 1, while methodologically sound, is a controlled ablation study of an existing architecture (SeismoGPT) on synthetic data, explicitly framed as 'not a benchmark of forecasting architectures,' limiting its novelty and breadth of impact. Paper 2's theoretical contributions and algorithmic innovation offer more lasting and broadly applicable scientific impact.

claude-opus-4-6·Jun 11, 2026

Lost vs. Claw-SWE-Bench: A Benchmark for Evaluating OpenClaw-style Agent Harnesses on Coding Tasks

Paper 2 likely has higher impact: it introduces a broadly useful benchmark + standardized adapter protocol enabling fair, reproducible evaluation of coding agents across models, harnesses, languages, and cost—an immediate need with wide community adoption potential. Its methodology emphasizes controlled contracts, future-commit cleanup, and systematic sweeps quantifying harness vs model effects, supporting rigor and practical relevance. Paper 1 is a solid controlled ablation study with clear insights for oscillatory forecasting, but it is narrower in scope (synthetic seismograms, one architecture) and less likely to drive cross-field tooling adoption.

gpt-5.2·Jun 11, 2026

Lost vs. Robust Regression of General ReLUs with Queries

Paper 2 has higher potential impact due to a clearer theoretical breakthrough: the first computationally efficient robust learner for general ReLUs in an interactive/query setting with near-optimal query complexity, plus complementary lower bounds showing necessity of queries for improvements over passive learning. This advances core learning theory (active/interactive learning, agnostic regression, complexity separations) with broad relevance across theoretical CS and foundations of ML. Paper 1 is a careful, useful controlled ablation study for autoregressive wavefield forecasting, but its contributions are more incremental and domain-specific, with narrower cross-field impact.

gpt-5.2·Jun 10, 2026

Won vs. Learning Doubly Sparse Explicitly Conditioned Transforms

Paper 2 is likely to have higher impact: it tackles a timely, broadly relevant problem (stability of long-horizon autoregressive rollouts on oscillatory physical signals) and provides controlled ablations with significance testing and clear, generalizable findings (multi-token prediction as dominant stabilizer; context-ratio threshold; identification of phase/polarity failure modes). These insights transfer across domains using sequence models (geophysics, climate/waves, gravitational waves, time-series ML). Paper 1 is novel and useful for sparse transform learning, but its impact is narrower and more specialized despite methodological contributions.

gpt-5.2·Jun 10, 2026

Lost vs. Trio: Learning Time-Series Forecasting with Temporal-Spatial-Sample Attention and Structural Causal Priors

Paper 1 introduces a novel forecasting architecture (temporal–spatial–sample attention) plus a structural causal synthetic task generator (TS-SCM) aimed at learning transferable priors, which could influence both methodology and benchmarking beyond a single domain. Its applications span general multivariate time-series settings (industrial/public) and ties to causal modeling, giving broader cross-field reach and timeliness. Paper 2 is rigorous and insightful but primarily a controlled diagnostic study on oscillatory wavefields with incremental design findings; its impact is likely narrower and more domain-specific despite strong methodological care.

gpt-5.2·Jun 10, 2026

Lost vs. Does Order Matter : Connecting The Law of Robustness to Robust Generalization

Paper 2 addresses a fundamental open problem explicitly posed by Bubeck and Sellke (2021), connecting the Law of Robustness to robust generalization via Rademacher complexity bounds. This theoretical contribution has broad implications across machine learning theory, adversarial robustness, and generalization theory. Paper 1, while methodologically careful, is a controlled empirical study on a specific architecture (SeismoGPT) for a narrower application domain (seismogram forecasting), and explicitly frames itself as not a benchmark contribution. Paper 2's theoretical results are more generalizable and foundational.

claude-opus-4-6·Jun 10, 2026

Lost vs. TRACE: A Unified Rollout Budget Allocation Framework for Efficient Agentic Reinforcement Learning

Paper 1 proposes a broadly applicable, novel rollout-budget allocation framework (TRACE) for multi-turn agentic RL that operates at both prompt and prefix levels, directly addressing a key bottleneck in RLVR (low reward contrast under fixed sampling budgets). It introduces a generalizable success-probability predictor and tree-structured adaptive sampling with demonstrated efficiency/accuracy gains on standard agentic benchmarks, making it timely and likely reusable across LLM-based RL systems. Paper 2 is methodologically rigorous and insightful but is primarily a controlled diagnostic study on a narrower synthetic seismogram setting, with more limited immediate cross-domain impact.

gpt-5.2·Jun 10, 2026
