CITRAS-FM: Tiny Time Series Foundation Model for Covariate-Informed Zero-Shot Forecasting

Yosuke Yamaguchi, Issei Suemitsu, Yuki Kajihara, Wenpeng Wei

Jun 9, 2026 · arXiv:2606.10798v1

cs.LG

#2497 of 5669 · cs.LG

Tournament Score: 1416 ± 42

Win Rate: 53%

Rating: 5.8 / 10

Significance 5.5 · Rigor 5.5 · Novelty 5 · Clarity 7.5

Abstract

Pretrained time series foundation models (TSFMs) have enabled zero-shot forecasting on unseen target series. However, existing TSFMs often incur high computational cost and provide limited support for diverse variable types, often failing to account for covariates that exogenously influence target variability. To address these challenges, we propose CITRAS-FM, a tiny 7M-parameter TSFM that supports univariate, multivariate, and covariate-informed zero-shot forecasting with real-time CPU inference. Built on a patch-based, decoder-only Transformer, CITRAS-FM introduces Shifted Attention into the cross-variate module to effectively exploit known covariates accessible throughout the forecast horizon. Moreover, to enable covariate-aware pretraining despite the scarcity of covariate-rich corpora, we propose CovSynth, which synthesizes realistic covariates from decomposed components of target series. Experiments on fev-bench, spanning 100 tasks across various settings, demonstrate that CITRAS-FM achieves state-of-the-art zero-shot accuracy among sub-10M TSFMs while delivering sub-0.1-second CPU inference, offering a strong balance between forecasting accuracy and real-time deployability.

AI Impact Assessments

(1 model)

Scientific Impact Assessment: CITRAS-FM

1. Core Contribution

CITRAS-FM addresses two practical limitations of existing time series foundation models (TSFMs): (1) high computational cost that precludes deployment on resource-constrained devices, and (2) limited support for covariates in zero-shot forecasting settings. The paper proposes a 7M-parameter TSFM that supports univariate, multivariate, and covariate-informed zero-shot forecasting with sub-0.1-second CPU inference.

Two specific technical contributions stand out:

Shifted Attention: A modification to the cross-variate attention module that enables the model to attend to known covariates one patch step ahead, providing forecast-horizon information to target predictions. This is a simplification of the KV Shift mechanism from the authors' prior work (CITRAS), making the temporal alignment of covariates more straightforward.

CovSynth: A synthetic covariate generation method that addresses the scarcity of covariate-rich pretraining data by decomposing target series via STL and constructing pseudo-covariates (event, long-term, periodic) from the residual component. This is a pragmatic solution to a genuine data bottleneck.
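The Shifted Attention idea described above can be illustrated with a small masking sketch. This is not the paper's implementation: it assumes that a standard causal mask lets target patch i attend to patches j ≤ i, and that the shift simply relaxes this to j ≤ i + 1 for covariate patches. The function names, the `shift` parameter, and the toy shapes are all hypothetical.

```python
import numpy as np

def causal_mask(num_queries: int, num_keys: int) -> np.ndarray:
    """Query patch i may attend to key patch j only when j <= i."""
    q_idx = np.arange(num_queries)[:, None]
    k_idx = np.arange(num_keys)[None, :]
    return k_idx <= q_idx

def shifted_mask(num_queries: int, num_keys: int, shift: int = 1) -> np.ndarray:
    """Assumed Shifted Attention mask: query patch i may attend to covariate
    patch j when j <= i + shift, i.e. one patch ahead for shift=1."""
    q_idx = np.arange(num_queries)[:, None]
    k_idx = np.arange(num_keys)[None, :]
    return k_idx <= q_idx + shift

def masked_attention(q: np.ndarray, k: np.ndarray, v: np.ndarray, mask: np.ndarray):
    """Single-head scaled dot-product attention with a boolean mask."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores = np.where(mask, scores, -np.inf)  # block disallowed query/key pairs
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v, weights

# Toy example: 3 target patches (queries) and 4 covariate patches (keys),
# where the 4th covariate patch lies in the forecast horizon.
rng = np.random.default_rng(0)
dim = 8
target_q = rng.normal(size=(3, dim))
covariate_k = rng.normal(size=(4, dim))
covariate_v = rng.normal(size=(4, dim))

mask = shifted_mask(num_queries=3, num_keys=4)
output, weights = masked_attention(target_q, covariate_k, covariate_v, mask)
```

Under these assumptions the last target patch receives nonzero attention weight on the covariate patch that covers the forecast horizon, which a plain causal mask would block.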

2. Methodological Rigor

The evaluation uses fev-bench, a comprehensive benchmark with 100 tasks, which is a strength compared to cherry-picked datasets. The categorization into fev-all, fev-cov, fev-multi, and fev-uni provides useful granularity. The choice of Scaled Quantile Loss and skill scores relative to SeasonalNaive is standard and appropriate.
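For readers unfamiliar with these metrics, the following is a minimal sketch of how a scaled quantile loss and a baseline-relative skill score are commonly computed. It assumes the pinball loss is averaged over quantile levels and scaled by the in-sample seasonal-naive error, and that the skill score is one minus the geometric mean of per-task error ratios. The exact definitions used by the benchmark may differ, and all function names here are hypothetical.

```python
import numpy as np

def pinball_loss(y_true: np.ndarray, y_pred: np.ndarray, q: float) -> float:
    """Mean quantile (pinball) loss at quantile level q."""
    diff = y_true - y_pred
    return float(np.mean(np.maximum(q * diff, (q - 1.0) * diff)))

def seasonal_naive_scale(history: np.ndarray, season: int) -> float:
    """In-sample mean absolute error of the seasonal-naive forecast."""
    return float(np.mean(np.abs(history[season:] - history[:-season])))

def scaled_quantile_loss(y_true, quantile_preds, history, season) -> float:
    """Average pinball loss over quantile levels, divided by the seasonal-naive scale.

    quantile_preds maps a quantile level (e.g. 0.1) to a forecast array.
    """
    scale = seasonal_naive_scale(history, season)
    losses = [pinball_loss(y_true, pred, q) for q, pred in quantile_preds.items()]
    return float(np.mean(losses) / scale)

def skill_score(model_errors, baseline_errors) -> float:
    """Assumed aggregation: 1 minus the geometric mean of per-task error ratios."""
    ratios = np.asarray(model_errors, dtype=float) / np.asarray(baseline_errors, dtype=float)
    return float(1.0 - np.exp(np.mean(np.log(ratios))))
```

Under this convention a skill score of 0 means parity with the baseline, and positive values mean lower error than the baseline.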

However, several methodological concerns arise:

Ablation study is thin: Only two ablations (removing Shifted Attention and CovSynth) are presented. The contributions of other design choices (causal scaling, SwiGLU, pre-layer normalization, the specific pretraining data mixture ratios) are not isolated. The improvement from CovSynth is modest (1.8 points on fev-cov), and removing it actually improves fev-uni by 0.7 points, which suggests possible negative transfer in non-covariate settings that the paper does not discuss.

Pretraining data mixture: The model trains on three diverse datasets (TSMixup with 11B points, Cauker-generated with 4B points, Gift-Eval with 19B points), sampled with equal probability. The contribution of this data mixture versus the architectural innovations is unclear.

Limited statistical analysis: No confidence intervals, significance tests, or variance across runs are reported, making it difficult to assess whether the improvements over KAIROS mini (3.5 points on fev-all) are statistically meaningful.

Efficiency comparison (Table III) is limited to a single dataset (Application) with only 5 models. The comparison would be more convincing across multiple covariate-informed tasks.
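One way to address the missing statistical analysis noted above would be a paired bootstrap over benchmark tasks. The sketch below is only an illustration: the per-task scores are synthetic values generated in the code, not results from the paper, and `paired_bootstrap_ci` is a hypothetical helper.

```python
import numpy as np

def paired_bootstrap_ci(scores_a, scores_b, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the mean per-task difference (a - b).

    Tasks are resampled with replacement, keeping each task's pair of scores together.
    """
    rng = np.random.default_rng(seed)
    diffs = np.asarray(scores_a, dtype=float) - np.asarray(scores_b, dtype=float)
    n = diffs.size
    idx = rng.integers(0, n, size=(n_resamples, n))  # resampled task indices
    boot_means = diffs[idx].mean(axis=1)
    lo, hi = np.quantile(boot_means, [alpha / 2, 1 - alpha / 2])
    return float(diffs.mean()), float(lo), float(hi)

# Synthetic per-task skill scores for two hypothetical models on 100 tasks.
rng = np.random.default_rng(42)
model_b = rng.normal(0.30, 0.10, size=100)
model_a = model_b + rng.normal(0.035, 0.05, size=100)

mean_diff, ci_low, ci_high = paired_bootstrap_ci(model_a, model_b)
print(f"mean difference = {mean_diff:.3f}, 95% CI = [{ci_low:.3f}, {ci_high:.3f}]")
```

If the interval excludes zero, the average gap is unlikely to be explained by which tasks happen to be in the benchmark; this says nothing about variance across training runs, which would need repeated pretraining.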

3. Potential Impact

The practical impact potential is notable:

Edge/on-device deployment: The sub-0.1-second CPU inference makes this model viable for IoT, server monitoring, and manufacturing scenarios where GPU access is unavailable. This addresses a real deployment gap.

Covariate utilization in zero-shot settings: As Table I shows, very few TSFMs support all variable types in zero-shot mode. CITRAS-FM is the smallest model to do so; the other such model named here, Chronos-2, has 120M parameters. This is a meaningful practical contribution.

CovSynth as a general technique: The idea of synthesizing covariates from target decomposition could be adopted by other TSFMs facing the same data scarcity problem, though its generalizability beyond this specific model hasn't been validated.

The impact is somewhat bounded by the fact that the model doesn't outperform larger models overall—Chronos-2, TiRex, and TimesFM-2.5 all achieve higher fev-all scores. The contribution is primarily in the efficiency-accuracy tradeoff space, which is important but more niche than a general accuracy improvement.

4. Timeliness & Relevance

The paper is highly timely. TSFMs are a rapidly evolving area, and the push toward efficient, deployable models is a current priority. The covariate handling problem is genuine—most TSFMs indeed ignore covariates, which limits their practical utility. The paper appears in the context of concurrent work (Chronos-2, COSMIC, Toto) all released in 2025, positioning it well in the current discourse.

The acceptance to EUSIPCO 2026 (a signal processing conference rather than a top ML venue) somewhat limits its visibility, though the arXiv availability helps.

5. Strengths & Limitations

Strengths:

Clear problem framing with a well-defined niche (tiny + covariate-aware + zero-shot)

Practical validation on a comprehensive benchmark (100 tasks)

The efficiency-accuracy tradeoff is well-demonstrated, particularly the Application dataset experiment

CovSynth is a creative solution to the covariate data scarcity problem

Well-written paper with clear notation and architecture description

Limitations:

The Shifted Attention mechanism is incremental over the prior KV Shift in CITRAS—it's a simplification rather than a fundamentally new idea

CovSynth's synthetic covariates are derived from the target itself (via STL residuals), which creates a somewhat circular relationship. Real covariates often contain information orthogonal to what's derivable from the target. The paper doesn't analyze how well CovSynth-pretrained models transfer to genuinely exogenous covariates

The model underperforms larger models significantly on fev-multi (54.2 vs 57.9 for Chronos-2) and fev-uni (31.3 vs 37.0), showing that the efficiency gains come with meaningful accuracy costs

Pretraining uses a single V100 GPU, but the total pretraining time is not reported

No analysis of failure cases or limitations of the covariate synthesis approach

The paper doesn't explore scaling laws—would a 15M or 30M version substantially close the gap to larger models?
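The circularity concern about CovSynth raised in the limitations above can be made concrete with a toy sketch. It does not reproduce the paper's method: a moving-average decomposition stands in for STL, and the "event" covariate is simply a thresholded residual, which is an assumed construction. The names `simple_decompose` and `event_covariate` are hypothetical.

```python
import numpy as np

def simple_decompose(y: np.ndarray, period: int):
    """Stand-in for STL: moving-average trend plus a mean seasonal profile."""
    kernel = np.ones(period) / period
    left = period // 2
    right = period - 1 - left
    padded = np.pad(y, (left, right), mode="edge")   # keep output length equal to len(y)
    trend = np.convolve(padded, kernel, mode="valid")
    detrended = y - trend
    profile = np.array([detrended[p::period].mean() for p in range(period)])
    seasonal = np.tile(profile, len(y) // period + 1)[: len(y)]
    residual = y - trend - seasonal
    return trend, seasonal, residual

def event_covariate(residual: np.ndarray, z_thresh: float = 3.0) -> np.ndarray:
    """Assumed pseudo-covariate: 1.0 where the residual is an outlier, else 0.0."""
    z = (residual - residual.mean()) / residual.std()
    return (np.abs(z) > z_thresh).astype(float)

# Synthetic target: level + daily-style cycle + noise, with two injected spikes.
rng = np.random.default_rng(0)
t = np.arange(240)
target = 10.0 + np.sin(2 * np.pi * t / 24) + rng.normal(0.0, 0.1, size=t.size)
target[[50, 120]] += 5.0

trend, seasonal, residual = simple_decompose(target, period=24)
events = event_covariate(residual)
```

The flagged events are a deterministic function of the target alone, so a covariate built this way carries no information beyond what the target already contains. A genuinely exogenous covariate, such as a promotion calendar, could carry information the target history does not.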

Additional Observations

The paper builds directly on the authors' prior CITRAS work, making it somewhat incremental. The transition from supervised to foundation model is the main advance. The contribution would be strengthened by demonstrating CovSynth's utility when applied to other TSFMs, and by more thorough analysis of when covariate information helps versus hurts.

Rating: 5.8 / 10

Significance 5.5 · Rigor 5.5 · Novelty 5 · Clarity 7.5

Generated Jun 10, 2026

Comparison History (19)

Won vs. HAMNO: A Hierarchical Adaptive Multi-scale Neural Operator with Physics-Informed Learning for Dynamical Systems

Paper 1 addresses a critical bottleneck in time series foundation models by enabling high accuracy and covariate integration with minimal computational overhead (7M parameters, CPU inference). This offers immense practical utility and real-world deployability across diverse industries compared to Paper 2, which, while methodologically rigorous and impactful for scientific machine learning, targets a more specialized domain of PDE solving.

gemini-3.1-pro-preview · Jun 11, 2026

Lost vs. Bootstrapped Monitoring: Leveraging Transparent Reasoning to Oversee Stronger AI Agents

Paper 1 addresses a fundamental and timely challenge in AI safety—how to maintain oversight of increasingly capable AI systems. The bootstrapped monitoring framework introduces a novel and practically important concept: using untrusted but transparent reasoning chains to bridge capability gaps in AI oversight. This has broad implications for AI alignment and governance as models scale. Paper 2, while solid engineering work on efficient time series models, represents more incremental progress in a crowded field. The AI safety/control problem Paper 1 tackles is more urgent and cross-cutting, with higher potential to influence policy and practice.

claude-opus-4-6 · Jun 11, 2026

Won vs. From Uniform to Learned Graph Priors: Diffusion for Structure Discovery

Paper 1 presents a highly efficient, tiny (7M parameter) time series foundation model that supports real-time CPU inference and novel covariate integration. Its focus on zero-shot forecasting with low computational cost addresses a major bottleneck in deploying foundation models, giving it massive potential for broad, real-world applications across diverse industries. While Paper 2 offers a strong methodological advance in graph structure discovery using diffusion priors, Paper 1's timely contribution to efficient foundation models and immediate practical deployability suggest a broader and higher overall scientific impact.

gemini-3.1-pro-preview · Jun 11, 2026

Lost vs. On Subquadratic Architectures: From Applications to Principles

Paper 1 has higher potential scientific impact because it addresses a fundamental challenge in deep learning: developing scalable subquadratic alternatives to Transformers. By rigorously comparing leading architectures (xLSTM, Mamba-2) across diverse, complex domains (code, time-series) and providing a unified theoretical formulation explaining xLSTM's superiority in state tracking, it offers broad foundational insights. Paper 2 is highly practical and efficient but focuses on a narrower niche (tiny time-series forecasting models), making its potential impact more domain-specific compared to the general architectural implications of Paper 1.

gemini-3.1-pro-preview · Jun 11, 2026

Lost vs. Reliable Error Estimation for PINNs: Lower and Upper A Posteriori Bounds

Paper 1 addresses a fundamental gap in PINN reliability by providing rigorous two-sided error bounds—both lower and upper—which is novel and important for trustworthy scientific computing. The theoretical contribution (computable a posteriori certificates without access to exact solutions) has broad implications for the growing PINN community and could influence how neural network-based PDE/ODE solvers are validated and certified. Paper 2, while practically useful, is more incremental—a smaller foundation model with covariate support—competing in a crowded TSFM landscape. Paper 1's methodological rigor and foundational nature give it higher long-term scientific impact.

claude-opus-4-6 · Jun 11, 2026

Lost vs. APPO: Agentic Procedural Policy Optimization

Paper 2 (APPO) is likely to have higher scientific impact due to broader cross-field relevance and timeliness: fine-grained credit assignment and branching in agentic RL directly targets a central bottleneck in LLM-agent training and can generalize across tool use, planning, and sequence decision-making. The method introduces a principled branching score and advantage scaling, evaluated on 13 benchmarks with consistent gains, suggesting methodological rigor and adoption potential. Paper 1 is valuable and practical, but its impact is narrower (time-series forecasting) and more incremental within TSFMs.

gpt-5.2 · Jun 11, 2026

Lost vs. Can we trust our models? Epistemic calibration in second-order classification

Paper 1 introduces a fundamentally new theoretical concept (epistemic calibration) that addresses a critical gap in uncertainty quantification for machine learning. It provides formal definitions, impossibility theorems, consistent estimators, and broad experimental validation. This foundational contribution has potential to reshape how epistemic uncertainty is evaluated across many domains, especially high-stakes applications. Paper 2, while practically useful, is more incremental—proposing a smaller, efficient time series model with covariate support, representing engineering optimization rather than conceptual innovation. Paper 1's theoretical depth and broad applicability give it higher long-term scientific impact.

claude-opus-4-6 · Jun 10, 2026

Won vs. Efficient Reasoning on the Edge

CITRAS-FM addresses a more specific and underexplored gap in time series foundation models—covariate-informed zero-shot forecasting with a tiny model. Its novel contributions (Shifted Attention, CovSynth for synthetic covariate generation, and comprehensive benchmarking on 100 tasks) represent meaningful methodological innovations. Paper 2, while practical, primarily combines existing techniques (LoRA, budget forcing via RL, adapter switching) for on-device reasoning without introducing fundamentally new methods. Paper 1's contributions to the rapidly growing TSFM field, especially enabling covariate-aware pretraining and achieving SOTA with only 7M parameters, have broader research implications.

claude-opus-4-6 · Jun 10, 2026

Won vs. How Much Capacity Does EEG Denoising Need? Ultra-Compact Networks reveal Benchmark Saturation and Metric-Utility Gap

Paper 1 addresses a broad and highly impactful domain (time-series forecasting) with a novel, ultra-compact foundation model capable of zero-shot covariate-informed predictions. Its ability to run in real-time on CPUs and its innovative data synthesis method (CovSynth) offer vast real-world applications across finance, logistics, and healthcare. While Paper 2 provides crucial methodological insights for the EEG/BCI community, Paper 1's contributions have wider cross-disciplinary applicability and immediate practical utility in edge deployment.

gemini-3.1-pro-preview · Jun 10, 2026

Won vs. Provably Efficient Personalized Multi-Objective Bandits with Proactive Conversational Queries

Paper 1 addresses a critical bottleneck in time series foundation models by introducing a highly efficient, CPU-deployable 7M-parameter model. Its ability to incorporate covariates and perform zero-shot forecasting offers massive real-world utility across diverse industries. While Paper 2 provides rigorous theoretical advancements in bandit learning, Paper 1's combination of efficiency, broad applicability, and timeliness in the foundation model landscape gives it a higher potential for widespread scientific and practical impact.

gemini-3.1-pro-preview · Jun 10, 2026
