GITCO: Gated Inference-Time Context Optimization in TSFMs
Manya Pandey, Dhruv Kumar, Murari Mandal, Saurabh Deshpande
Abstract
Patch-based Time Series Foundation Models (TSFMs) suffer from context poisoning: structurally anomalous patches capture disproportionate attention and silently degrade zero-shot forecast quality. We propose improving TSFM accuracy at inference time by optimizing the input context rather than modifying model weights. We present GITCO (Gated Inference-Time Context Optimization), a lightweight framework with three components (Gate, Router, and Critic) that selectively identifies and suppresses harmful patches without any parameter updates. Evaluated on TimesFM 2.5 across 53 GIFT-Eval datasets under K-fold cross-validation, GITCO achieves an average MASE reduction of 1.95% while capturing 89.9% of the improvement upper bound. We introduce context sensitivity profiles as a new characterizable property of TSFMs: the mapping from time series meta-features to expected accuracy improvement under inference-time context intervention, shaped jointly by model architecture and the statistical structure of the data.
AI Impact Assessments
(1 model)
Scientific Impact Assessment: GITCO (Gated Inference-Time Context Optimization in TSFMs)
1. Core Contribution
GITCO introduces a lightweight, inference-time framework for improving zero-shot forecasting accuracy of frozen, patch-based Time Series Foundation Models (TSFMs). The key insight is that structurally anomalous patches within the input context window can disproportionately capture attention and silently degrade forecast quality — a phenomenon the authors term "context poisoning." Rather than modifying model weights, GITCO operates entirely on the input: a Gate decides whether to intervene, a Router selects among three expert probes, and a Critic identifies and smooths the most disruptive patch via a simple moving average.
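The gate-then-smooth flow described above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function names (`moving_average`, `patch_scores`, `optimize_context`), the parameters `patch_len`, `window`, and `gate_threshold`, and the robust-deviation score are all assumptions made for the example. The Router is omitted (a single hypothetical probe stands in for the three expert probes), and a fixed threshold stands in for the Gate, whose actual form is not given in this review.

```python
import numpy as np

def moving_average(x, window):
    """Centered simple moving average with edge padding (odd window assumed)."""
    pad = window // 2
    padded = np.pad(x, pad, mode="edge")
    kernel = np.ones(window) / window
    return np.convolve(padded, kernel, mode="valid")[: len(x)]

def patch_scores(context, patch_len):
    """Hypothetical probe: robust deviation of each patch's peak amplitude."""
    patches = context.reshape(-1, patch_len)
    energy = np.abs(patches - np.median(context)).max(axis=1)
    med = np.median(energy)
    mad = np.median(np.abs(energy - med)) + 1e-8  # avoid division by zero
    return (energy - med) / mad

def optimize_context(context, patch_len=32, window=5, gate_threshold=3.0):
    """Smooth the single most anomalous patch, or leave the context untouched."""
    context = np.asarray(context, dtype=float)
    assert len(context) % patch_len == 0, "context must be a whole number of patches"
    scores = patch_scores(context, patch_len)
    worst = int(np.argmax(scores))
    if scores[worst] < gate_threshold:
        # Stand-in gate: nothing looks disruptive enough, so do not intervene.
        return context.copy(), None
    out = context.copy()
    lo, hi = worst * patch_len, (worst + 1) * patch_len
    # Stand-in critic action: replace only the flagged patch with its smoothed version.
    out[lo:hi] = moving_average(context, window)[lo:hi]
    return out, worst
```

The model itself is never touched: only the returned context would be passed to the frozen forecaster in place of the original.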
The paper also introduces the concept of context sensitivity profiles (Φ_M): the mapping from time series meta-features to expected improvement under inference-time context intervention, conditioned on model architecture. This is framed as a characterizable, model-specific property, supported by contrasting results: on TimesFM 2.5 a gate can be learned, whereas on Chronos2 no gate could be learned from the same feature vocabulary.
2. Methodological Rigor
Strengths in evaluation design: The authors employ K=11-fold cross-validation across 53 GIFT-Eval datasets, which provides reasonable statistical rigor for the gating and routing decisions. The use of sliding-window evaluation with stride-1 extraction and capped window counts is sensible. The Captured Improvement Ratio (CIR) metric is well-motivated as a value-weighted measure that accounts for asymmetric intervention costs.
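As a rough illustration of stride-1 window extraction with a cap on the window count, consider the sketch below. The function name `sliding_windows`, its parameters, and the choice to keep an evenly spaced subset when the cap is exceeded are assumptions made for this example; the review does not say how the paper selects windows once the cap applies.

```python
import numpy as np

def sliding_windows(series, context_len, horizon, max_windows=None):
    """Return (context, target) pairs extracted with stride 1, optionally capped."""
    series = np.asarray(series, dtype=float)
    total = context_len + horizon
    n = len(series) - total + 1  # number of stride-1 start positions
    if n <= 0:
        return []
    starts = np.arange(n)
    if max_windows is not None and n > max_windows:
        # Assumption: when capped, keep an evenly spaced subset of start positions.
        starts = np.linspace(0, n - 1, max_windows).round().astype(int)
    return [
        (series[s : s + context_len], series[s + context_len : s + total])
        for s in starts
    ]
```

Capping keeps long series from dominating the evaluation, at the cost of discarding some windows.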
Concerns:
3. Potential Impact
The paper addresses a real problem: frozen TSFMs in production cannot be retrained per-deployment, so input-side interventions are practical. The idea of treating input context quality as an optimization target is conceptually appealing and aligns with the broader trend of test-time compute scaling in NLP.
However, the practical impact is constrained by several factors:
The concept of context sensitivity profiles is potentially more impactful as a diagnostic tool for understanding and comparing TSFM architectures, though it is only sketched here rather than deeply developed.
4. Timeliness & Relevance
The paper is timely. TSFMs are an active area with models like TimesFM, Chronos, Moirai, and others rapidly emerging. The question of how to improve these models at inference time without retraining is practically relevant for enterprise deployments. The connection to test-time compute scaling in LLMs (Snell et al., 2024) is apt, though the analogy is somewhat loose — chain-of-thought and self-consistency operate on reasoning processes, while GITCO operates on signal preprocessing.
The GIFT-Eval benchmark choice is appropriate and current. The focus on zero-shot evaluation reflects realistic deployment scenarios.
5. Strengths & Limitations
Key Strengths:
Notable Limitations:
6. Additional Observations
The paper's positioning at the intersection of test-time compute scaling and time series forecasting is strategically interesting. However, the actual mechanism (detect bad patch → smooth it) is closer to classical signal preprocessing than to the sophisticated inference-time reasoning strategies in NLP. The conceptual framing somewhat oversells the technical contribution.
The CIR metric, while useful, is self-referential: it measures how well the system captures improvement defined by its own oracle, which uses the same three probes. This makes the reported 89.9% less impressive than it initially appears.
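To make the self-reference concrete, here is one plausible reading of a captured-improvement ratio, sketched under stated assumptions. The exact CIR definition is not reproduced in this review, so the formula below (realized improvement divided by the improvement of a per-window oracle over the same probes) and the name `captured_improvement_ratio` are illustrative guesses, not the paper's definition.

```python
import numpy as np

def captured_improvement_ratio(baseline_err, probe_err, choice):
    """
    baseline_err: shape (n_windows,), error with the unmodified context.
    probe_err:    shape (n_windows, n_probes), error after each probe's intervention.
    choice:       shape (n_windows,), chosen probe index, or -1 for no intervention.
    """
    baseline_err = np.asarray(baseline_err, dtype=float)
    probe_err = np.asarray(probe_err, dtype=float)
    choice = np.asarray(choice)
    delta = baseline_err[:, None] - probe_err  # positive means the probe helped
    # Oracle: best probe per window, intervening only when it helps.
    oracle = np.maximum(delta.max(axis=1), 0.0).sum()
    acted = choice >= 0
    # Realized gain counts harmful interventions as negative contributions.
    realized = delta[acted, choice[acted]].sum()
    return realized / oracle if oracle > 0 else float("nan")
```

Under this reading the denominator is bounded by what the three probes can achieve, so a high ratio says the gating and routing are good at choosing among those probes, not that the probes themselves are close to the best possible intervention.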
Generated Jun 5, 2026
Comparison History (19)
Paper 2 introduces a critical benchmark for long-horizon AI agents, a rapidly expanding and highly relevant field. As persistent AI assistants become ubiquitous, evaluating their ability to handle nuanced, contradictory, or complementary memories is essential. Benchmarks in this area tend to drive significant follow-up research and shape the development of future models. While Paper 1 offers a valuable methodological improvement for time series models, Paper 2 addresses a fundamental capability gap in the broader and currently more impactful domain of Large Language Model agents.
Agents' Last Exam (ALE) addresses a fundamental gap between AI benchmark performance and real-world economic impact, introducing a comprehensive, living benchmark with 250+ industry experts across 55 subfields. Its breadth of impact spans virtually all non-physical industries, and it tackles the timely, high-stakes question of AI deployment relevance. GITCO, while methodologically sound, is narrowly focused on improving time series foundation models via inference-time context optimization—a useful but incremental contribution within a specific subfield. ALE's potential to reshape how AI systems are evaluated for economic value gives it substantially broader impact.
Paper 2 addresses the highly timely and broadly impactful problem of AI-generated content attribution, proposing a novel mechanism leveraging internal LLM representations for self-recognition and fingerprinting. This has immediate real-world applications in content provenance, AI safety, and regulation. The 98% accuracy with no quality degradation is compelling. Paper 1, while technically solid, addresses a narrower problem (context optimization for time series foundation models) with incremental improvements (~2% MASE reduction) on a specific model, limiting its breadth of impact across the broader ML community.
Paper 1 addresses a critically timely and high-visibility topic—the environmental footprint of AI-driven hyperscale data centers—with novel facility-level empirical data covering 403 US data centers. Its finding that HDC carbon intensity is 48% above the national grid average has immediate policy relevance and broad societal impact across energy, environmental, and technology domains. Paper 2, while technically sound, presents an incremental improvement (~1.95% MASE reduction) to a specific time series forecasting framework, limiting its breadth of impact to a narrower ML audience.
Paper 1 addresses a critical bottleneck in the widespread deployment of agentic AI: balancing scalable autonomy with safety and human oversight. Its framework for gradual, earned autonomy has massive cross-disciplinary implications for AI alignment, governance, and human-computer interaction. While Paper 2 is methodologically rigorous and presents strong empirical results, its scope is much narrower, primarily impacting the specialized subfield of time series forecasting.
Paper 2 tackles a critical bottleneck in deploying Large Language Models—efficient long-context generation via sparse attention. By providing a system that accelerates algorithm prototyping and achieves significant throughput gains on massive models (up to 229B parameters) and modern hardware, it addresses a highly active and impactful research area. While Paper 1 offers a novel approach for time series models, the breadth of impact, timeliness, and real-world applicability of LLM serving optimization give Paper 2 a higher potential for widespread scientific impact.
Security of AI agents against prompt injection is a critical, highly timely issue with massive real-world implications. Paper 2 provides foundational security guarantees for Computer Use Agents, addressing a fundamental barrier to their safe deployment. This offers broader cross-disciplinary impact (AI and cybersecurity) compared to Paper 1's narrower focus on time series forecasting optimization.
Paper 1 introduces a novel framework (TBS) that bridges cognitive science and multi-agent simulation by separating internal reasoning from public expression, offering broad interdisciplinary impact across computational social science, opinion dynamics, and AI. It addresses fundamental questions about social deliberation mechanisms. Paper 2, while technically solid, addresses a narrower engineering problem (context poisoning in time series foundation models) with incremental improvements (~1.95% MASE reduction). Paper 1's conceptual contribution—making internal-to-public expression pathways observable—has greater potential to influence multiple research communities and inspire new methodological directions.
Paper 1 introduces a novel inference-time optimization framework and a new theoretical property (context sensitivity profiles) for time series foundation models, offering fundamental methodological contributions. Paper 2, while practically valuable, primarily benchmarks existing agentic LLM techniques for a specific application, offering less foundational scientific innovation.
AgentProcessBench addresses a more broadly impactful problem—step-level verification for tool-using LLM agents—which is central to the rapidly growing field of AI agents. It introduces the first benchmark of its kind with substantial human annotations (8,509 labeled steps), enabling reproducible research across the community. Its insights on process-level supervision complementing outcome supervision have broad implications for test-time scaling and reward modeling. Paper 1, while technically sound, addresses a narrower optimization for a specific class of time series models (TSFMs) with modest improvements (+1.95% MASE), limiting its breadth of impact.
Paper 2 addresses a fundamental vulnerability (context poisoning) in Time Series Foundation Models, proposing a lightweight, training-free inference-time optimization. Because time series forecasting applies universally across domains (finance, healthcare, climate), improving zero-shot accuracy without parameter updates offers broader, more immediate real-world impact compared to Paper 1's focus on the important but more specialized domain of hardware verification.
Paper 2 has higher potential impact due to a novel, broadly applicable inference-time intervention for time-series foundation models that improves accuracy without retraining, making it practical for real deployments. It introduces a clear problem (context poisoning), a concrete method (Gate/Router/Critic), and quantifies gains across many datasets, suggesting methodological rigor and generality. The added concept of context sensitivity profiles could influence both evaluation and future TSFM design. Paper 1 is valuable as a diagnostic benchmark, but benchmarks typically yield narrower downstream impact unless widely adopted.
Paper 2 addresses a critical bottleneck in the highly active field of multi-agent LLM systems: wasted computation and failure diagnosis. Its observability framework offers broad applicability for improving efficiency, reducing costs, and enhancing the reliability of complex AI systems. While Paper 1 provides a useful optimization for time series models, Paper 2's focus on foundational diagnostics for LLM agents likely yields wider, more immediate cross-disciplinary impact and addresses pressing real-world scalability challenges.
Paper 2 addresses a critical bottleneck in autonomous agents (latency and cost of explicit chain-of-thought) by introducing latent reasoning and a generative world model. This approach has broad implications for efficient, real-time multimodal agents and implicit reasoning, offering significant efficiency gains (75% fewer tokens). Paper 1 offers a useful but more niche inference-time optimization for time series models with relatively modest performance improvements, making Paper 2's methodological innovation and potential cross-field impact substantially higher.
Paper 1 addresses knowledge editing in Large Language Models, a highly active field with broad cross-disciplinary implications for AI safety and reliability. By formally coupling propagation and preservation pressures, it introduces a novel theoretical framework to a critical bottleneck. Paper 2's focus on Time Series Foundation Models is highly practical but targets a narrower application domain, giving Paper 1 a higher potential for widespread scientific impact and broader real-world relevance.
Paper 2 demonstrates higher potential scientific impact because it addresses a systemic issue in causal discovery: the validity of its evaluation benchmarks. By analyzing the consistency of 11 popular benchmarks against over 38,000 domain papers, it exposes flaws affecting the entire field, especially emerging LLM-based methods. While Paper 1 offers a valuable inference-time optimization for time-series models, Paper 2's foundational critique is likely to influence how all future causal discovery methods are evaluated, forcing methodological shifts and ensuring higher long-term relevance across multiple scientific disciplines.
PolarMem introduces a more novel conceptual contribution—negative/polarized memory for VLMs—which addresses a fundamental gap in how memory systems handle absence and logical exclusion. Its breadth of evaluation (8 backbones, 6 benchmarks) and the generality of the framework (training-free, applicable to any frozen VLM) suggest wider impact across multimodal AI. GITCO, while practical, addresses a narrower problem (context poisoning in time series foundation models) with incremental improvements (~2% MASE reduction) on a single model family, limiting its breadth and transformative potential.
GITCO introduces a novel inference-time optimization framework addressing a newly identified problem (context poisoning) in time series foundation models, with rigorous evaluation across 53 datasets. It also introduces 'context sensitivity profiles' as a new characterizable property. While Paper 2 contributes a useful benchmark and training improvements for LLM math reasoning, the space is crowded with similar benchmarks. GITCO's approach of optimizing input context without weight updates is more innovative and has broader applicability across the growing TSFM ecosystem.
Paper 2 likely has higher scientific impact due to broader applicability and timeliness: hierarchical skill consolidation and self-evolving agents address a central bottleneck in modern agentic AI and can transfer across many domains (tool use, robotics, web agents). The reported gains across multiple environments and backbones suggest generality, and the framework could influence downstream system design. Paper 1 is novel and rigorous for TSFMs and offers practical inference-time robustness, but its impact is narrower (forecasting foundation models) and the absolute improvement is modest, limiting cross-field breadth.