Lost in the Non-convex Loss Landscape: How to Fine-tune the Large Time Series Model?

Xu Zhang, Peang Wang, Wei Wang

Jun 7, 2026 · arXiv:2606.08578v1

cs.LG

#1659 of 5669 · cs.LG

Tournament Score: 1447 ± 44

Win Rate: 67%

Rating: 6.8 / 10

Significance 6.5 · Rigor 6.5 · Novelty 5.5 · Clarity 7.5

Abstract

Recently, large time series models (LTSMs) have gained increasing attention due to their similarities to large language models, including flexible context length, scalability, and task generality, outperforming advanced task-specific models. However, prior studies indicate that pre-trained LTSMs may exhibit a poorly conditioned non-convex loss landscape, leading to limited trainability. As a result, direct fine-tuning tends to cause overfitting and suboptimal performance, sometimes even worse than training from scratch, substantially diminishing the benefits of pre-training. To overcome this limitation, we propose Smoothed Full Fine-tuning (SFF), a novel fine-tuning technology. Specifically, we construct an auxiliary LTSM via random initialization to obtain a smoother loss landscape, and then linearly interpolate its weights with those of the pre-trained model to smooth the original landscape. This process improves trainability while preserving pre-trained knowledge, thereby enabling more effective downstream fine-tuning. From an optimization perspective, SFF perturbs sharp minima without significantly harming flat regions, facilitating escape from poor local basins toward smoother and more generalizable solutions. Extensive experiments on benchmark datasets demonstrate consistent improvements across eight representative LTSMs, including Timer, TimesFM, MOMENT, UniTS, MOIRAI, Chronos, TTMs, and Sundial, on diverse downstream tasks. The code is available at the link: https://github.com/Meteor-Stars/SFF.

AI Impact Assessments

(1 model)

Scientific Impact Assessment: "Lost in the Non-convex Loss Landscape: How to Fine-tune the Large Time Series Model?"

1. Core Contribution

The paper identifies that pre-trained large time series models (LTSMs) often converge to sharp minima during pre-training, yielding poorly conditioned, non-convex loss landscapes that hinder downstream fine-tuning. The proposed solution, Smoothed Full Fine-tuning (SFF), is elegantly simple: linearly interpolate the weights of the pre-trained model with those of a randomly initialized copy (using standard Kaiming/Xavier initialization) before fine-tuning. The key insight is that randomly initialized models reside in flat regions of the loss landscape, and weight interpolation can smooth sharp regions of the pre-trained model without harming already-flat regions. This is a pre-processing step requiring only a few lines of code, no additional memory, and no computational overhead during training.
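The interpolation step described above can be sketched in a few lines. This is a minimal illustration in plain Python, not the authors' implementation: the function name `interpolate_weights`, the toy parameter names, and the Gaussian stand-in for Kaiming/Xavier initialization are all hypothetical, and the convention that `alpha` weights the pre-trained endpoint is an assumption, since the review does not say which endpoint the coefficient applies to.

```python
import random


def interpolate_weights(pretrained, random_init, alpha):
    """Blend two weight dicts: alpha * pretrained + (1 - alpha) * random_init.

    Assumption: alpha weights the pre-trained endpoint; the source does not
    state which endpoint the coefficient applies to.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    if pretrained.keys() != random_init.keys():
        raise ValueError("both weight dicts must share the same parameter names")
    smoothed = {}
    for name, w_pre in pretrained.items():
        w_rand = random_init[name]
        if len(w_pre) != len(w_rand):
            raise ValueError(f"size mismatch for parameter {name!r}")
        # Element-wise linear interpolation between the two endpoints.
        smoothed[name] = [alpha * p + (1.0 - alpha) * r for p, r in zip(w_pre, w_rand)]
    return smoothed


# Toy stand-ins for real checkpoints (hypothetical names and values).
rng = random.Random(0)
pretrained = {"layer.weight": [0.8, -1.2, 0.4], "layer.bias": [0.1]}
# A small Gaussian draw stands in for a Kaiming/Xavier-initialized copy.
random_init = {name: [rng.gauss(0.0, 0.02) for _ in w] for name, w in pretrained.items()}

smoothed = interpolate_weights(pretrained, random_init, alpha=0.7)
```

With a PyTorch model, the same blend would presumably be applied to each floating-point tensor returned by `state_dict()` and written back with `load_state_dict()` before fine-tuning starts. Because it happens once, up front, the training loop itself is unchanged.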

2. Methodological Rigor

Theoretical analysis: The paper provides a Hessian-based analysis showing that interpolation reduces the maximum eigenvalue of the Hessian at sharp minima (via convex combination of Hessians) while preserving flatness at already-flat minima. The argument relies on a local quadratic approximation and the assumption that the Hessian of the interpolated point can be approximated as a convex combination of endpoint Hessians (Eq. 5). This is a non-trivial assumption—it holds exactly only for quadratic losses, not for deep networks in general—and the paper acknowledges this implicitly by using "≈" and "≲" rather than strict equalities. The connection to Fort & Scherlis (2019) regarding initialization smoothness (Tr(H)/||H||_F >> 1) adds theoretical grounding, though the overall analysis remains more heuristic than rigorous.
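The Hessian argument summarized above can be written out as follows. This is a sketch of the reasoning as the review describes it, not a reproduction of the paper's Eq. 5. The symbols \theta_{\mathrm{pre}}, \theta_{\mathrm{rand}}, and the convention that \alpha weights the pre-trained endpoint are assumed notation.

```latex
% Assumed notation (not taken from the paper):
%   \theta_{\mathrm{pre}}  : pre-trained weights
%   \theta_{\mathrm{rand}} : randomly initialized weights
%   \alpha \in [0,1]       : weight on the pre-trained endpoint
\begin{align}
\theta_\alpha &= \alpha\,\theta_{\mathrm{pre}} + (1-\alpha)\,\theta_{\mathrm{rand}}, \\
H(\theta_\alpha) &\approx \alpha\,H(\theta_{\mathrm{pre}}) + (1-\alpha)\,H(\theta_{\mathrm{rand}}), \\
\lambda_{\max}\big(H(\theta_\alpha)\big) &\lesssim
  \alpha\,\lambda_{\max}\big(H(\theta_{\mathrm{pre}})\big)
  + (1-\alpha)\,\lambda_{\max}\big(H(\theta_{\mathrm{rand}})\big).
\end{align}
```

The last line follows from the second because the largest eigenvalue is a convex function of a symmetric matrix. If the random endpoint is flatter than the pre-trained one, the bound falls below \lambda_{\max}(H(\theta_{\mathrm{pre}})) for any \alpha < 1. If both endpoints are already flat, the bound stays small. The whole chain depends on the second line, which is the approximation the review flags as heuristic.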

Empirical evaluation: The experiments are comprehensive. SFF is tested across eight LTSMs spanning four architectural families (encoder-only, decoder-only, encoder-decoder, MLP-only) and scales from 3MB to 3.8GB. The evaluation covers forecasting (8 datasets, multiple data proportions from 1% to 100%), anomaly detection (250 datasets), and imputation tasks. Multiple random seeds are used throughout. The comparisons against FF, LP, LP-FF, SAM, SWA, Mixout, L2-SP, LoRA, and label smoothing are thorough. Loss landscape visualizations provide intuitive support.

Potential concerns: The Hessian convex combination approximation (Eq. 5) is the weakest theoretical link—it is generally inaccurate for highly non-linear functions over large parameter distances. The paper would benefit from empirical Hessian spectrum analysis before and after interpolation. Additionally, the hyperparameter α is selected from {0.3, 0.5, 0.7, 0.9}, requiring validation-based tuning, which somewhat undermines the "zero overhead" claim. The paper also does not explore whether the observed improvements diminish as LTSMs become better pre-trained on more diverse data.
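The tuning cost mentioned above can be made concrete with a small sketch. The helper `select_alpha` and the toy loss are hypothetical. In a real pipeline, each call to the validation function would stand for one full interpolate, fine-tune, and evaluate cycle, which is where the extra cost comes from.

```python
def select_alpha(candidates, validation_loss):
    """Return (best_alpha, losses), picking the candidate with the lowest validation loss."""
    # One evaluation per candidate: with real models, each is a separate fine-tuning run.
    losses = {alpha: validation_loss(alpha) for alpha in candidates}
    best_alpha = min(losses, key=losses.get)
    return best_alpha, losses


def toy_validation_loss(alpha):
    """Hypothetical stand-in for 'interpolate, fine-tune, evaluate on held-out data'."""
    return (alpha - 0.7) ** 2


# Candidate grid as reported in the review.
best_alpha, losses = select_alpha([0.3, 0.5, 0.7, 0.9], toy_validation_loss)
print(best_alpha)  # the toy loss is minimized at 0.7 by construction
```

Under this reading, a four-value grid means up to four fine-tuning runs per model and dataset, even though each individual run carries no extra cost.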

3. Potential Impact

Practical impact: SFF's simplicity is its greatest strength—it can be implemented in ~3 lines of PyTorch and applied as a universal preprocessing step before any fine-tuning procedure. This makes adoption trivial for practitioners working with any pre-trained time series model. The consistent improvements across diverse architectures and scales suggest broad applicability.

Broader implications: The observation that pre-trained LTSMs may exhibit worse fine-tuning performance than training from scratch (due to loss landscape conditioning) is an important finding for the foundation model community. If this phenomenon generalizes beyond time series to other modalities (as the authors suggest), SFF or variants could become standard practice. The paper opens a new research direction: loss-landscape-aware fine-tuning strategies for foundation models.

Limitations in scope: The improvements, while consistent, are often modest (3-7% MSE reduction over FF for Timer). The method is most impactful in low-data regimes and for models that genuinely suffer from sharp minima. As pre-training practices improve, the problem SFF addresses may become less severe.

4. Timeliness & Relevance

This work is highly timely. Large time series models (Timer, TimesFM, MOMENT, Chronos, etc.) have emerged rapidly in 2023-2024, and the community is actively seeking effective fine-tuning strategies. The paper addresses a genuine bottleneck: many practitioners have observed that direct fine-tuning of these models yields disappointing results. The connection to loss landscape theory provides principled understanding rather than ad hoc solutions.

5. Strengths & Limitations

Key Strengths:

Simplicity and zero overhead: The method adds no parameters, memory, or training cost—only a one-time weight interpolation before fine-tuning.

Universality: Consistent improvements across 8 diverse LTSMs, 3 tasks, multiple data regimes, and various architectures demonstrate robustness.

Important empirical observation: The finding that pre-trained LTSMs can underperform training from scratch due to loss landscape conditioning is valuable to the community.

Thorough experimental coverage: 250 anomaly detection datasets, multiple data proportions, multiple seeds, and comparisons against nine alternatives (FF, LP, LP-FF, SAM, SWA, Mixout, L2-SP, LoRA, and label smoothing).

Published at ICLR 2026, indicating strong peer validation.

Notable Weaknesses:

Theoretical gaps: The Hessian convex combination approximation is not validated empirically and may not hold for large interpolation distances in deep networks.

Modest improvements in some settings: Gains diminish with more training data and for already well-conditioned models.

Limited novelty in the technique itself: Weight interpolation is well-studied; the novelty lies primarily in the application context and the observation about LTSM loss landscapes.

No analysis of when/why pre-training produces sharp minima: The paper observes the phenomenon but doesn't investigate root causes (batch size, learning rate, data distribution).

α selection: While robust across a range, still requires validation-based tuning per model/dataset.

6. Additional Observations

The paper's strongest contribution may be the empirical finding itself—that LTSMs suffer from sharp minima post-pretraining—rather than the specific solution. This observation could motivate better pre-training practices (e.g., using SAM during pre-training) that address the root cause. The connection to model soups and continual learning literature is appropriately discussed, with clear differentiation of goals and pipelines.

Rating: 6.8 / 10

Significance 6.5 · Rigor 6.5 · Novelty 5.5 · Clarity 7.5

Generated Jun 9, 2026

Comparison History (18)

Won vs. Graph Mamba Operator: A Latent Simulator for Interacting Particle Systems

Paper 1 addresses a critical bottleneck (fine-tuning trainability) in the rapidly emerging field of Large Time Series Models. Its proposed solution is elegant, methodologically sound, and extensively validated across eight different foundation models. This broad applicability and relevance to foundation model adaptation give it a higher potential for widespread impact compared to Paper 2's domain-specific architectural contribution for dynamical systems.

gemini-3.1-pro-preview · Jun 9, 2026

Won vs. ERBench: A Benchmark and Testsuite for Equation Discovery Algorithms

Paper 2 introduces a concrete, broadly applicable fine-tuning method (SFF) targeting a timely and widely relevant problem—improving trainability and generalization of large pre-trained time-series foundation models. It is positioned for immediate real-world use across many downstream forecasting/TS tasks and claims validation across eight major LTSMs, suggesting methodological breadth and practical robustness. Paper 1 is valuable infrastructure (a benchmark) for symbolic regression, but its impact is likely narrower and more incremental unless it becomes a widely adopted standard. Overall, Paper 2 has higher cross-field and near-term impact potential.

gpt-5.2 · Jun 9, 2026

Lost vs. Differentially Private Synthetic Data via APIs 4: Tabular Data

Paper 2 likely has higher impact: it advances a timely, high-stakes area (differentially private synthetic tabular data) with broad applicability in healthcare, finance, and public-sector data sharing. Extending Private Evolution to tabular data while removing reliance on large foundation models is a notable innovation with clear practicality (much faster) and addresses an important unmet need (high-order correlations). The methodological framing (DP guarantees + extensive evaluation) and cross-field relevance of DP data release give it wider potential reach than Paper 1’s more domain-specific fine-tuning technique for large time-series models.

gpt-5.2 · Jun 9, 2026

Won vs. When Are Neural Interaction Discoveries Real? Identifiability, Recoverability, and a Pre-Fit Diagnostic

Paper 1 addresses a highly practical and timely problem—fine-tuning large time series models—with a simple, broadly applicable method (SFF) validated across eight major LTSMs. Given the rapid growth of foundation models for time series, this work has immediate wide applicability and addresses a critical barrier (overfitting during fine-tuning). Paper 2 makes important theoretical contributions on identifiability of neural interaction discoveries, but its scope is narrower (specific model class, interaction recovery), limiting its breadth of impact despite strong methodological rigor.

claude-opus-4-6 · Jun 9, 2026

Won vs. Autonomous Aerial Manipulation via Contextual Contrastive Meta Reinforcement Learning

Paper 2 likely has higher scientific impact due to broader applicability and timeliness: improving fine-tuning of large time-series foundation models affects many domains (finance, healthcare, IoT, energy) and can be adopted widely with minimal hardware constraints. The proposed SFF method is simple, general, and empirically validated across eight prominent LTSMs, suggesting strong methodological breadth and reproducibility (code released). Paper 1 is innovative and impactful for aerial robotics, but its impact is narrower, more dependent on specific hardware/simulation-to-real assumptions, and harder to transfer broadly across fields.

gpt-5.2 · Jun 9, 2026

Won vs. A Joint Finite-Sample Certificate for Adaptive Selective Conformal Risk Control

Paper 2 addresses a broadly relevant problem—fine-tuning large time series models—with a simple, practical, and generalizable method (SFF) applicable across eight major models and diverse tasks. Its novelty in smoothing non-convex loss landscapes via weight interpolation has wide applicability beyond time series to other foundation model domains. Paper 1, while rigorous and technically strong, targets a narrower niche (conformal risk control certificates for selective prediction) with more limited audience and applicability, and its gains are explicitly regime-scoped and not universal. Paper 2's breadth of impact and timeliness in the foundation model era give it higher potential impact.

claude-opus-4-6 · Jun 9, 2026

Won vs. Benchmarking Counterfactual Prediction in Epidemic Time Series with Time-Varying Interventions

Paper 1 addresses a critical bottleneck (poorly conditioned loss landscapes) in the rapidly growing field of Large Time Series Models. By providing a novel, model-agnostic fine-tuning method that demonstrably improves 8 state-of-the-art foundation models, it offers broad methodological impact across diverse time-series applications. While Paper 2 provides a valuable benchmark for causal inference, Paper 1's algorithmic innovation has more immediate and widespread applicability across the machine learning and forecasting communities.

gemini-3.1-pro-preview · Jun 9, 2026

Won vs. The Confidence Trap: Calibration Attacks for Graph Neural Networks

Paper 2 addresses a highly timely and broadly applicable challenge: fine-tuning Large Time Series Models (LTSMs). As foundation models for time series gain rapid traction across industries (finance, healthcare, forecasting), a generalizable fine-tuning technique that improves performance across numerous state-of-the-art models (e.g., TimesFM, Chronos) promises wider adoption and more immediate practical impact. In contrast, Paper 1's focus on adversarial calibration attacks for GNNs, while rigorous, caters to a more specialized subfield within adversarial machine learning.

gemini-3.1-pro-preview · Jun 9, 2026

Won vs. De novo molecular generation with optical property preconditioning at the token level

Paper 1 proposes a novel optimization technique to solve a fundamental trainability issue in Large Time Series Models. Its broad applicability across eight state-of-the-art models and various downstream tasks ensures wide impact across multiple domains relying on time series analysis. In contrast, Paper 2 focuses on a highly specific application (OLED molecular generation) and primarily serves as a benchmark analysis rather than introducing a broadly applicable novel methodology.

gemini-3.1-pro-preview · Jun 9, 2026

Lost vs. A spectral audit framework reveals task-dependent aperiodic reliance across EEG and ECG deep learning

Paper 1 likely has higher scientific impact: it introduces a broadly applicable, methodologically rigorous auditing framework (decomposition + phase-preserving interventions + sham controls + simulations) that exposes a major, under-addressed confound (aperiodic 1/f reliance) affecting interpretability and validity of physiological DL across EEG and ECG, including foundation models. This has immediate real-world implications for clinical deployment, biomarker discovery, and standards for model evaluation. Paper 2 presents a useful optimization technique for fine-tuning LTSMs, but it is more incremental relative to existing smoothing/interpolation fine-tuning ideas and its impact may be narrower and less domain-critical.

gpt-5.2 · Jun 9, 2026
