Waleed Esmail, Stuart Russell, Jana Klinge, Alexander Kappes, Christine Thomas
Long-horizon autoregressive forecasting of oscillatory physical signals, such as seismograms, gravitational-wave strain, and similar wavefields, is limited by error accumulation: as a causal model is fed its own outputs over hundreds of steps, small per-step errors compound into phase drift that pointwise metrics fail to detect. We ask when such rollout stays stable, using synthetic three-component seismograms as a physically structured testbed and the SeismoGPT autoregressive forecaster as the model under study. Through controlled, intra-architecture ablations evaluated on free-running rollout with paired significance tests, we isolate the contribution of each design choice. Multi-token prediction is the dominant stabilizer, accounting for almost the entire improvement over a single-token baseline (+0.040 median NCC); a horizon-embedding hybrid prediction head and a cross-horizon STFT-magnitude coherence loss each add a small but consistent further gain. Performance depends sharply on a context-ratio threshold near one, roughly the full P-S interval of observed signal, below which rollout generalization collapses. The dominant residual failure is a polarity inversion that a magnitude-based spectral loss cannot, by construction, penalize, identifying phase-aware objectives as the natural next step. We frame this as a controlled study of rollout stability on oscillatory wavefields, not a benchmark of forecasting architectures.
This paper conducts a controlled ablation study examining the factors that stabilize long-horizon autoregressive rollout of oscillatory physical signals, using synthetic three-component seismograms as a testbed and the SeismoGPT autoregressive forecaster as the model. The central finding is that multi-token prediction (MTP) is the dominant stabilizer, accounting for nearly all improvement over a single-token baseline (+0.040 out of +0.045 total median NCC gain). Two additional components—a horizon-embedding hybrid prediction head and a cross-horizon STFT-magnitude coherence loss—provide small but statistically significant additive gains (+0.009 and +0.005 NCC, respectively). The paper also identifies a sharp context-ratio threshold (ρ ≈ 1, corresponding to the full P-S seismic interval) below which rollout quality collapses, and diagnoses the dominant residual failure mode as polarity inversion that magnitude-based spectral losses cannot, by construction, penalize.
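For readers unfamiliar with the metric, the following is a minimal sketch of a normalized cross-correlation (NCC) score and its median over a set of traces. It assumes a zero-lag, mean-removed definition; the paper's exact NCC definition is not given in this review, and the function names `ncc` and `median_ncc` and the toy signals are illustrative only.

```python
import math
import statistics


def ncc(pred, target):
    """Zero-lag, mean-removed normalized cross-correlation (assumed definition)."""
    if len(pred) != len(target):
        raise ValueError("traces must have the same length")
    mean_p = sum(pred) / len(pred)
    mean_t = sum(target) / len(target)
    num = sum((p - mean_p) * (t - mean_t) for p, t in zip(pred, target))
    energy_p = sum((p - mean_p) ** 2 for p in pred)
    energy_t = sum((t - mean_t) ** 2 for t in target)
    den = math.sqrt(energy_p * energy_t)
    return num / den if den > 0 else 0.0


def median_ncc(preds, targets):
    """Median of per-trace NCC scores."""
    return statistics.median(ncc(p, t) for p, t in zip(preds, targets))


# Toy signals (hypothetical, not data from the paper): 5 cycles over 200 samples.
n = 200
target = [math.sin(2 * math.pi * 5 * k / n) for k in range(n)]
perfect = list(target)                       # ideal forecast
flipped = [-v for v in target]               # polarity inversion
quarter = [math.sin(2 * math.pi * 5 * k / n + math.pi / 2) for k in range(n)]  # 90-degree phase drift

scores = [ncc(p, target) for p in (perfect, flipped, quarter)]
print(scores)
```

Under this definition a perfect forecast scores close to 1, a quarter-cycle phase drift scores close to 0, and a polarity-inverted forecast scores close to -1, which is why phase errors weigh heavily on the NCC figures quoted above.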
The paper is explicit—commendably so—that it is *not* proposing a new architecture or benchmarking across model families, but rather performing a careful within-architecture dissection of what drives rollout stability.
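The experimental methodology is notably disciplined for this type of study: the ablations are matched within a single architecture, evaluation is performed on free-running rollout, and differences are assessed with paired significance tests.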
However, the single-seed limitation is a genuine weakness. The two smaller effects (+0.009 and +0.005 NCC) are of the same order of magnitude as seed-to-seed training noise in many deep learning settings. Without multi-seed experiments, these effects could be artifacts of a particular initialization. The authors acknowledge this, but it remains a significant gap. Additionally, the study is confined to a single synthetic dataset and a single architecture family (causal transformers), which limits how far the findings can be generalized.
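To make the distinction between paired testing and seed variance concrete, here is a minimal sketch of one possible paired test: a sign-flip permutation test on per-trace NCC differences between two model variants. The specific test used in the paper is not stated in this review, so this is a stand-in. The function name `paired_sign_flip_test` and the toy scores are hypothetical, not results from the paper.

```python
import random
import statistics


def paired_sign_flip_test(baseline, variant, n_perm=2000, seed=0):
    """Two-sided sign-flip permutation test on the median paired difference."""
    diffs = [v - b for b, v in zip(baseline, variant)]
    observed = statistics.median(diffs)
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_perm):
        # Under the null hypothesis each paired difference is equally likely
        # to have either sign, so flip the signs at random.
        flipped = [d if rng.random() < 0.5 else -d for d in diffs]
        if abs(statistics.median(flipped)) >= abs(observed):
            hits += 1
    # Add-one correction keeps the estimated p-value strictly positive.
    p_value = (hits + 1) / (n_perm + 1)
    return observed, p_value


# Hypothetical per-trace NCC scores for 200 test traces (illustration only).
toy_rng = random.Random(1)
baseline_scores = [toy_rng.uniform(0.5, 0.9) for _ in range(200)]
variant_scores = [b + 0.04 + toy_rng.gauss(0.0, 0.02) for b in baseline_scores]

effect, p = paired_sign_flip_test(baseline_scores, variant_scores)
print(effect, p)
```

A test of this kind only quantifies trace-to-trace variability for one trained pair of models. It says nothing about whether the difference would persist under a different initialization, which is the gap the single-seed criticism points to.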
The paper's impact operates on several levels:
1. Practical guidance for wavefield forecasting: The finding that MTP dominates rollout stability is actionable for practitioners working on seismograms, gravitational waves, or other oscillatory signals. The context-ratio threshold provides a concrete operational rule.
2. Diagnostic framework: The decomposition of rollout failure into phase drift versus amplitude decay, and the identification of polarity inversion as the dominant residual failure, provides a diagnostic vocabulary that could transfer to other autoregressive forecasting domains (weather, fluid dynamics, audio).
3. Loss function design insight: The clear demonstration that magnitude-based spectral losses *cannot by construction* correct phase errors is a useful negative result that should redirect loss-function engineering toward phase-aware objectives (anti-wrapping losses, complex STFT terms).
4. Cross-domain relevance: While framed in seismology, the findings about MTP stabilization of oscillatory rollout are potentially relevant to audio generation, biomedical signal forecasting (ECG/EEG), and neural PDE solvers—though this remains speculative without empirical validation.
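The negative result about magnitude-based spectral losses in the list above can be checked numerically. The sketch below compares a generic STFT-magnitude loss with a plain waveform error on a polarity-inverted copy of a toy signal. It is not the paper's cross-horizon coherence loss; the frame length, hop size, Hann window, and all function names are assumptions chosen for illustration.

```python
import cmath
import math


def stft_magnitude(x, frame=32, hop=16):
    """Magnitudes of a Hann-windowed short-time DFT (naive, for illustration)."""
    window = [0.5 - 0.5 * math.cos(2 * math.pi * n / frame) for n in range(frame)]
    frames = []
    for start in range(0, len(x) - frame + 1, hop):
        seg = [x[start + n] * window[n] for n in range(frame)]
        mags = []
        for k in range(frame // 2 + 1):
            coeff = sum(
                seg[n] * cmath.exp(-2j * math.pi * k * n / frame)
                for n in range(frame)
            )
            mags.append(abs(coeff))  # the sign of the signal is discarded here
        frames.append(mags)
    return frames


def magnitude_loss(pred, target):
    """Mean absolute difference between STFT magnitudes."""
    a, b = stft_magnitude(pred), stft_magnitude(target)
    diffs = [abs(u - v) for fa, fb in zip(a, b) for u, v in zip(fa, fb)]
    return sum(diffs) / len(diffs)


def waveform_mse(pred, target):
    """Mean squared error in the time domain."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(target)


# Toy decaying oscillation (hypothetical, not data from the paper).
target = [math.exp(-k / 80) * math.sin(2 * math.pi * k / 16) for k in range(128)]
inverted = [-v for v in target]  # polarity inversion

print("magnitude loss:", magnitude_loss(inverted, target))
print("waveform MSE:  ", waveform_mse(inverted, target))
```

Because the STFT is linear, negating the signal negates every coefficient and leaves each magnitude unchanged, so the magnitude loss is essentially zero for the inverted forecast even though the waveform error is large. A loss that keeps the complex coefficients or their phase would not share this blind spot.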
The impact is somewhat bounded by the narrow scope: one model family, one synthetic dataset, no comparison to external baselines (Mamba, direct multi-horizon forecasters, etc.).
The paper addresses a genuine and growing need. Autoregressive transformers are increasingly applied to physical time series, and the rollout stability problem is a recognized bottleneck. Multi-token prediction has gained significant attention through DeepSeek-V3 and Gloeckle et al. (2024) in language modeling, and quantifying its effect in a different domain (physical wavefields) is timely. The connection between phase drift and exposure bias in oscillatory signals is underexplored, making this a relevant niche contribution.
Strengths:
1. Exceptional transparency: The paper is unusually honest about what it does and does not show. Claims are carefully scoped, limitations are front-loaded, and the authors explicitly distinguish within-run effects from fully established findings.
2. Clean experimental design: The matched-ablation framework with paired tests is a model for how ablation studies should be conducted.
3. Failure mode analysis: The polarity-inversion diagnosis is mechanistically grounded (magnitude STFT invariance to sign) and directly motivates a concrete research direction.
4. Physical interpretability of the context-ratio threshold: Linking the ρ ≈ 1 threshold to the P-S interval gives the finding physical meaning beyond a raw number.
Weaknesses:
1. Single training seed: The most critical weakness. Small effects (+0.005, +0.009 NCC) may not survive multi-seed evaluation.
2. Synthetic data only: Real seismograms have noise, instrument response, and complexity that synthetic data lacks. Transferability is unestablished.
3. No external baselines: Without comparison to state-space models, direct forecasters, or other architectures, the findings are confined to the SeismoGPT family.
4. Incremental novelty: The dominant finding (MTP helps) validates existing work; the two smaller findings are modest in magnitude and uncertain in robustness.
5. No formal analysis: The paper offers empirical observations about why MTP and coherence losses help, but no theoretical framework for rollout stability in oscillatory signals.
This is a carefully executed, honestly scoped ablation study that makes a useful empirical contribution to understanding autoregressive rollout on oscillatory signals. Its main value lies in the rigor of the experimental protocol and the diagnostic clarity of the failure analysis, rather than in large-magnitude discoveries. The dominant finding—that MTP is the key stabilizer—is solid but somewhat expected; the secondary findings are interesting but insufficiently robust (single seed). The paper would benefit significantly from multi-seed validation and at least one external baseline comparison.
Generated Jun 10, 2026
Paper 1 addresses a fundamental problem in machine learning interpretability and scientific discovery (the Rashomon effect), offering a methodology applicable across diverse domains to extract varied insights. Paper 2, while methodologically rigorous, is highly specialized, focusing specifically on autoregressive forecasting stability for oscillatory physical signals, which limits its broader impact across disciplines.
Paper 1 proposes a broadly applicable, novel dynamic dataset pruning method with class-aware closed-form allocation and adaptive sampling, demonstrating strong empirical gains (including improved worst-group accuracy and large speedups) across multiple datasets, models, and training paradigms—suggesting high real-world utility and broad ML impact. Paper 2 is a careful, rigorous controlled study that yields valuable insights for autoregressive rollout stability in oscillatory signals, but its scope is narrower (primarily sequence forecasting for wavefields) and more diagnostic than transformative. Overall, Paper 1 is likely to have wider and more immediate impact.
Paper 2 has higher impact potential due to clearer methodological rigor and broader scientific relevance. It tackles a fundamental limitation of autoregressive rollout for oscillatory physical signals (error accumulation/phase drift), uses a controlled synthetic testbed, performs intra-architecture ablations with paired significance tests, and yields actionable mechanistic insights (multi-token prediction dominance, context-ratio threshold, phase-aware loss need). These findings can generalize to seismology, gravitational-wave analysis, and other wavefield forecasting domains. Paper 1 is timely and useful for predictive maintenance but is more application-focused and leverages an existing foundation model with a lightweight head, offering less conceptual novelty.
Paper 2 introduces a novel algorithmic contribution (EOFD-MLogB) with rigorous theoretical guarantees (regret bounds) and significant computational improvements for multinomial logistic bandits, a fundamental problem in online learning. It has broader applicability across recommendation systems, clinical trials, and other sequential decision-making domains. Paper 1, while methodologically sound, is a controlled ablation study of an existing architecture (SeismoGPT) on synthetic data, explicitly framed as 'not a benchmark of forecasting architectures,' limiting its novelty and breadth of impact. Paper 2's theoretical contributions and algorithmic innovation offer more lasting and broadly applicable scientific impact.
Paper 2 likely has higher impact: it introduces a broadly useful benchmark + standardized adapter protocol enabling fair, reproducible evaluation of coding agents across models, harnesses, languages, and cost—an immediate need with wide community adoption potential. Its methodology emphasizes controlled contracts, future-commit cleanup, and systematic sweeps quantifying harness vs model effects, supporting rigor and practical relevance. Paper 1 is a solid controlled ablation study with clear insights for oscillatory forecasting, but it is narrower in scope (synthetic seismograms, one architecture) and less likely to drive cross-field tooling adoption.
Paper 2 has higher potential impact due to a clearer theoretical breakthrough: the first computationally efficient robust learner for general ReLUs in an interactive/query setting with near-optimal query complexity, plus complementary lower bounds showing necessity of queries for improvements over passive learning. This advances core learning theory (active/interactive learning, agnostic regression, complexity separations) with broad relevance across theoretical CS and foundations of ML. Paper 1 is a careful, useful controlled ablation study for autoregressive wavefield forecasting, but its contributions are more incremental and domain-specific, with narrower cross-field impact.
Paper 2 is likely to have higher impact: it tackles a timely, broadly relevant problem (stability of long-horizon autoregressive rollouts on oscillatory physical signals) and provides controlled ablations with significance testing and clear, generalizable findings (multi-token prediction as dominant stabilizer; context-ratio threshold; identification of phase/polarity failure modes). These insights transfer across domains using sequence models (geophysics, climate/waves, gravitational waves, time-series ML). Paper 1 is novel and useful for sparse transform learning, but its impact is narrower and more specialized despite methodological contributions.
Paper 1 introduces a novel forecasting architecture (temporal–spatial–sample attention) plus a structural causal synthetic task generator (TS-SCM) aimed at learning transferable priors, which could influence both methodology and benchmarking beyond a single domain. Its applications span general multivariate time-series settings (industrial/public) and ties to causal modeling, giving broader cross-field reach and timeliness. Paper 2 is rigorous and insightful but primarily a controlled diagnostic study on oscillatory wavefields with incremental design findings; its impact is likely narrower and more domain-specific despite strong methodological care.
Paper 2 addresses a fundamental open problem explicitly posed by Bubeck and Sellke (2021), connecting the Law of Robustness to robust generalization via Rademacher complexity bounds. This theoretical contribution has broad implications across machine learning theory, adversarial robustness, and generalization theory. Paper 1, while methodologically careful, is a controlled empirical study on a specific architecture (SeismoGPT) for a narrower application domain (seismogram forecasting), and explicitly frames itself as not a benchmark contribution. Paper 2's theoretical results are more generalizable and foundational.
Paper 1 proposes a broadly applicable, novel rollout-budget allocation framework (TRACE) for multi-turn agentic RL that operates at both prompt and prefix levels, directly addressing a key bottleneck in RLVR (low reward contrast under fixed sampling budgets). It introduces a generalizable success-probability predictor and tree-structured adaptive sampling with demonstrated efficiency/accuracy gains on standard agentic benchmarks, making it timely and likely reusable across LLM-based RL systems. Paper 2 is methodologically rigorous and insightful but is primarily a controlled diagnostic study on a narrower synthetic seismogram setting, with more limited immediate cross-domain impact.