GITCO: Gated Inference-Time Context Optimization in TSFMs

Manya Pandey, Dhruv Kumar, Murari Mandal, Saurabh Deshpande

Jun 3, 2026

arXiv:2606.05332v1

cs.AI (primary)

#2811 of 3355 · Artificial Intelligence

Tournament Score: 1305 ± 48 (scale 1050 to 1800)

Win Rate: 21%

Rating: 4.5/10

Significance 4.5 · Rigor 5 · Novelty 6 · Clarity 7

Abstract

Patch-based Time Series Foundation Models (TSFMs) suffer from context poisoning: structurally anomalous patches capture disproportionate attention and silently degrade zero-shot forecast quality. We propose improving TSFM accuracy at inference time by optimizing the input context rather than modifying model weights. We present GITCO (Gated Inference-Time Context Optimization), a lightweight framework with three components (Gate, Router, and Critic) that selectively identifies and suppresses harmful patches without any parameter updates. Evaluated on TimesFM 2.5 across 53 GIFT-Eval datasets under K-fold cross-validation, GITCO achieves an average MASE reduction of 1.95% while capturing 89.9% of the improvement upper bound. We introduce context sensitivity profiles as a new characterizable property of TSFMs: the mapping from time series meta-features to expected accuracy improvement under inference-time context intervention, shaped jointly by model architecture and the statistical structure of the data.

AI Impact Assessments

(1 model)

Scientific Impact Assessment: GITCO — Gated Inference-Time Context Optimization in TSFMs

1. Core Contribution

GITCO introduces a lightweight, inference-time framework for improving zero-shot forecasting accuracy of frozen, patch-based Time Series Foundation Models (TSFMs). The key insight is that structurally anomalous patches within the input context window can disproportionately capture attention and silently degrade forecast quality — a phenomenon the authors term "context poisoning." Rather than modifying model weights, GITCO operates entirely on the input: a Gate decides whether to intervene, a Router selects among three expert probes, and a Critic identifies and smooths the most disruptive patch via a simple moving average.
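To make the intervention concrete, the sketch below smooths one patch of a context window with a 5-point simple moving average and leaves everything else untouched. It is an illustration only: the function names, the centered window, the use of neighboring values outside the patch, and the truncated window at the series edges are all assumptions, not details confirmed by the paper. The gate decision and the patch index are passed in as plain arguments because the Gate, Router, and Critic classifiers themselves are not specified here.

```python
from typing import List, Sequence


def smooth_patch(context: Sequence[float], patch_idx: int,
                 patch_len: int, window: int = 5) -> List[float]:
    """Return a copy of `context` with one patch replaced by a centered SMA.

    Values outside the selected patch are left untouched. Near the series
    edges the averaging window is truncated (an assumption of this sketch).
    """
    if window < 1 or window % 2 == 0:
        raise ValueError("window must be a positive odd integer")
    start = patch_idx * patch_len
    if patch_idx < 0 or start >= len(context):
        raise IndexError("patch_idx is outside the context")
    end = min(start + patch_len, len(context))
    half = window // 2
    out = list(context)
    for i in range(start, end):
        lo = max(0, i - half)
        hi = min(len(context), i + half + 1)
        # Average over the ORIGINAL values so smoothing does not compound.
        out[i] = sum(context[lo:hi]) / (hi - lo)
    return out


def gated_intervention(context: Sequence[float], gate_fires: bool,
                       patch_idx: int, patch_len: int) -> List[float]:
    """Apply the smoothing only when a (hypothetical) gate decision is True."""
    if not gate_fires:
        return list(context)
    return smooth_patch(context, patch_idx, patch_len)
```

With a context of twelve ones containing a single spike of 50 in the second patch of length 4, smoothing that patch spreads the spike across its four positions, while a closed gate returns the context unchanged.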

The paper also introduces the concept of context sensitivity profiles (Φ_M) — the mapping from time series meta-features to expected improvement under inference-time context intervention, conditioned on model architecture. This is framed as a characterizable, model-specific property, supported by the contrasting results on TimesFM 2.5 (learnable gate) versus Chronos2 (no learnable gate from the same feature vocabulary).

2. Methodological Rigor

Strengths in evaluation design: The authors employ K=11-fold cross-validation across 53 GIFT-Eval datasets, which provides reasonable statistical rigor for the gating and routing decisions. The use of sliding-window evaluation with stride-1 extraction and capped window counts is sensible. The Captured Improvement Ratio (CIR) metric is well-motivated as a value-weighted measure that accounts for asymmetric intervention costs.
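The stride-1 extraction with a capped window count can be pictured with the following sketch. The helper name, its parameters, and in particular the rule of keeping the most recent windows when the cap is exceeded are assumptions made for illustration; the review does not state how the authors apply the cap.

```python
from typing import List, Sequence, Tuple


def extract_windows(series: Sequence[float], context_len: int, horizon: int,
                    max_windows: int) -> List[Tuple[List[float], List[float]]]:
    """Stride-1 (context, target) pairs, keeping at most `max_windows`.

    When more windows exist than the cap allows, the most recent ones are
    kept; that selection rule is an assumption of this sketch.
    """
    if context_len < 1 or horizon < 1 or max_windows < 1:
        raise ValueError("context_len, horizon and max_windows must be >= 1")
    span = context_len + horizon
    n_windows = len(series) - span + 1
    if n_windows <= 0:
        return []  # series too short for even one window
    first = max(0, n_windows - max_windows)
    windows = []
    for start in range(first, n_windows):
        ctx = list(series[start:start + context_len])
        tgt = list(series[start + context_len:start + span])
        windows.append((ctx, tgt))
    return windows
```

For a series of ten points with a context of 4, a horizon of 2, and a cap of 3, this yields the three latest of the five possible windows.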

Concerns:

The improvement magnitude is modest: +1.95% mean MASE reduction across all 53 datasets. While the CIR of 89.9% sounds impressive, it is defined relative to a constrained oracle (three probes, SMA denoising), which itself represents a narrow intervention space. The absolute gains are small enough that they may not be practically significant in many deployment contexts.

The Router's 3-class accuracy of 33.3% ± 28.4% is at chance level for three classes (1/3 ≈ 33.3%). The authors argue this doesn't matter because the improvement landscape is "flat," but this undermines the claimed modularity and the narrative that routing is a meaningful component. The ablation (Table 2) partially supports the authors' argument (Router Only achieves Σ∆% = +42.16%, though at low precision), yet the interaction between Gate and Router contributions is not cleanly disentangled.

The denoising operator (5-point SMA on a single patch) is extremely simple. While the authors claim ablations show localization matters more than filter complexity, no alternative operators are tested in the main paper.

The Chronos2 negative result is intellectually interesting but also raises questions about generalizability. The framework essentially only "works" on one model in this evaluation.

Statistical significance tests are absent. With 53 datasets and modest effect sizes, it is unclear how robust these improvements are to dataset composition changes.
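One way the missing significance check could be run on paired per-dataset results is a sign-flip permutation test on the mean MASE difference. The sketch below is a generic illustration, not something taken from the paper: the function name and its defaults are hypothetical, and the input is assumed to be one difference (baseline MASE minus intervened MASE) per dataset.

```python
import random
from typing import Sequence


def sign_flip_p_value(diffs: Sequence[float], n_perm: int = 2000,
                      seed: int = 0) -> float:
    """Two-sided paired sign-flip permutation test on the mean difference.

    diffs[i] is baseline MASE minus intervened MASE on dataset i.
    Under the null of no effect each difference is equally likely to carry
    either sign, so signs are flipped at random to build the null.
    """
    if not diffs:
        raise ValueError("diffs must be non-empty")
    rng = random.Random(seed)
    n = len(diffs)
    observed = abs(sum(diffs) / n)
    hits = 0
    for _ in range(n_perm):
        flipped = sum(d if rng.random() < 0.5 else -d for d in diffs)
        if abs(flipped / n) >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly positive.
    return (hits + 1) / (n_perm + 1)
```

A small p-value would indicate that the average gain is unlikely to arise from symmetric noise across datasets; it would not by itself address sensitivity to which datasets are included, which would need a resampling check over the dataset pool.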

3. Potential Impact

The paper addresses a real problem: frozen TSFMs in production cannot be retrained per-deployment, so input-side interventions are practical. The idea of treating input context quality as an optimization target is conceptually appealing and aligns with the broader trend of test-time compute scaling in NLP.

However, the practical impact is constrained by several factors:

The framework is validated on only one model with positive results.

The improvement margins are small in absolute terms.

The meta-feature vocabulary and gate/router classifiers may need architecture-specific re-derivation for each new TSFM, limiting plug-and-play deployability.

The intervention space (single-patch SMA smoothing) is narrow.

The concept of context sensitivity profiles is potentially more impactful as a diagnostic tool for understanding and comparing TSFM architectures, though it is only sketched here rather than deeply developed.

4. Timeliness & Relevance

The paper is timely. TSFMs are an active area with models like TimesFM, Chronos, Moirai, and others rapidly emerging. The question of how to improve these models at inference time without retraining is practically relevant for enterprise deployments. The connection to test-time compute scaling in LLMs (Snell et al., 2024) is apt, though the analogy is somewhat loose — chain-of-thought and self-consistency operate on reasoning processes, while GITCO operates on signal preprocessing.

The GIFT-Eval benchmark choice is appropriate and current. The focus on zero-shot evaluation reflects realistic deployment scenarios.

5. Strengths & Limitations

Key Strengths:

Novel framing: The idea of inference-time context optimization for TSFMs is genuinely new and opens a research direction. The "context poisoning" formulation is intuitive and well-motivated.

Principled gating design: The asymmetric loss formulation and Gating Primacy Principle are well-reasoned. The recognition that false positives are more costly than false negatives is a practical insight.

Honest reporting: The Chronos2 negative result and the Router's low accuracy are reported transparently, which strengthens credibility.

Reproducibility: Code is available, evaluation uses a public benchmark, and the methodology is clearly described.

Notable Limitations:

Single positive result: Only TimesFM 2.5 shows deployable improvements. N=1 for architecture validation is insufficient to claim generality.

Small effect sizes: 1.95% mean MASE improvement without significance testing leaves practical relevance uncertain.

Narrow intervention space: One patch, one filter. The framework's ceiling is low by design.

Context sensitivity profiles are underdeveloped: Introduced as a contribution but only demonstrated via a binary contrast (learnable vs. not learnable) rather than systematically characterized.

Missing baselines: No comparison with other input preprocessing methods (e.g., robust scaling, outlier removal, wavelet denoising) or with the concurrent work by Hua et al. (2026) on diversified inference.

Workshop-length paper: The 5-page format necessarily limits depth, but several claims (e.g., "over 50% of series show marginal improvability") lack supporting detail.

6. Additional Observations

The paper's positioning at the intersection of test-time compute scaling and time series forecasting is strategically interesting. However, the actual mechanism (detect bad patch → smooth it) is closer to classical signal preprocessing than to the sophisticated inference-time reasoning strategies in NLP. The conceptual framing somewhat oversells the technical contribution.

The CIR metric, while useful, is self-referential: it measures how well the system captures improvement defined by its own oracle, which uses the same three probes. This makes 89.9% less impressive than it initially appears.
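The self-referential character of the metric is easier to see in code. The review does not reproduce the paper's CIR formula, so the sketch below assumes one plausible form: total realized improvement divided by the total improvement of an oracle that picks the best of the same probes or abstains. The function name and the data layout are hypothetical.

```python
from typing import Optional, Sequence


def captured_improvement_ratio(
    probe_gains: Sequence[Sequence[float]],
    chosen: Sequence[Optional[int]],
) -> float:
    """Ratio of realized to oracle improvement over a set of series.

    probe_gains[i][k]: baseline error minus error after applying probe k
                       to series i (positive means the probe helps).
    chosen[i]: probe index picked by the system, or None if the gate abstains.
    """
    if len(probe_gains) != len(chosen):
        raise ValueError("probe_gains and chosen must have the same length")
    realized = 0.0
    oracle = 0.0
    for gains, pick in zip(probe_gains, chosen):
        # The oracle may also abstain, so its gain is never negative.
        oracle += max(0.0, max(gains))
        if pick is not None:
            # A wrong intervention contributes a negative realized gain.
            realized += gains[pick]
    if oracle == 0.0:
        raise ValueError("oracle improvement is zero; ratio is undefined")
    return realized / oracle
```

Because the denominator is computed from the same probe set the system chooses from, a high ratio says the Gate and Router use the available probes well, not that the probes cover the improvements a broader intervention space might reach.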

Rating: 4.5/10

Significance 4.5 · Rigor 5 · Novelty 6 · Clarity 7

Generated Jun 5, 2026

Comparison History (19)

vs. SubtleMemory: A Benchmark for Fine-Grained Relational Memory Discrimination in Long-Horizon AI Agents

gemini-3.1 · 6/6/2026

Paper 2 introduces a critical benchmark for long-horizon AI agents, a rapidly expanding and highly relevant field. As persistent AI assistants become ubiquitous, evaluating their ability to handle nuanced, contradictory, or complementary memories is essential. Benchmarks in this area tend to drive significant follow-up research and shape the development of future models. While Paper 1 offers a valuable methodological improvement for time series models, Paper 2 addresses a fundamental capability gap in the broader and currently more impactful domain of Large Language Model agents.

vs. Agents' Last Exam

claude-opus-4.6 · 6/6/2026

Agents' Last Exam (ALE) addresses a fundamental gap between AI benchmark performance and real-world economic impact, introducing a comprehensive, living benchmark with 250+ industry experts across 55 subfields. Its breadth of impact spans virtually all non-physical industries, and it tackles the timely, high-stakes question of AI deployment relevance. GITCO, while methodologically sound, is narrowly focused on improving time series foundation models via inference-time context optimization—a useful but incremental contribution within a specific subfield. ALE's potential to reshape how AI systems are evaluated for economic value gives it substantially broader impact.

vs. LLM Self-Recognition: Steering and Retrieving Activation Signatures

claude-opus-4.6 · 6/6/2026

Paper 2 addresses the highly timely and broadly impactful problem of AI-generated content attribution, proposing a novel mechanism leveraging internal LLM representations for self-recognition and fingerprinting. This has immediate real-world applications in content provenance, AI safety, and regulation. The 98% accuracy with no quality degradation is compelling. Paper 1, while technically solid, addresses a narrower problem (context optimization for time series foundation models) with incremental improvements (~2% MASE reduction) on a specific model, limiting its breadth of impact across the broader ML community.

vs. Assessing the Carbon Emissions and Energy Consumption of U.S. Hyperscale Data Centers

claude-opus-4.6 · 6/6/2026

Paper 1 addresses a critically timely and high-visibility topic—the environmental footprint of AI-driven hyperscale data centers—with novel facility-level empirical data covering 403 US data centers. Its finding that HDC carbon intensity is 48% above the national grid average has immediate policy relevance and broad societal impact across energy, environmental, and technology domains. Paper 2, while technically sound, presents an incremental improvement (~1.95% MASE reduction) to a specific time series forecasting framework, limiting its breadth of impact to a narrower ML audience.

vs. The Digital Apprentice: A Framework for Human-Directed Agentic AI Development

gemini-3.1 · 6/6/2026

Paper 1 addresses a critical bottleneck in the widespread deployment of agentic AI: balancing scalable autonomy with safety and human oversight. Its framework for gradual, earned autonomy has massive cross-disciplinary implications for AI alignment, governance, and human-computer interaction. While Paper 2 is methodologically rigorous and presents strong empirical results, its scope is much narrower, primarily impacting the specialized subfield of time series forecasting.

vs. Vortex: Efficient and Programmable Sparse Attention Serving for AI Agents

gemini-3.1 · 6/6/2026

Paper 2 tackles a critical bottleneck in deploying Large Language Models—efficient long-context generation via sparse attention. By providing a system that accelerates algorithm prototyping and achieves significant throughput gains on massive models (up to 229B parameters) and modern hardware, it addresses a highly active and impactful research area. While Paper 1 offers a novel approach for time series models, the breadth of impact, timeliness, and real-world applicability of LLM serving optimization give Paper 2 a higher potential for widespread scientific impact.

vs. CaMeLs Can Use Computers Too: System-level Security for Computer Use Agents

gemini-3.1 · 6/6/2026

Security of AI agents against prompt injection is a critical, highly timely issue with massive real-world implications. Paper 2 provides foundational security guarantees for Computer Use Agents, addressing a fundamental barrier to their safe deployment. This offers broader cross-disciplinary impact (AI and cybersecurity) compared to Paper 1's narrower focus on time series forecasting optimization.

vs. Think-Before-Speak: From Internal Evaluation to Public Expression in Multi-Agent Social Simulation

claude-opus-4.6 · 6/6/2026

Paper 1 introduces a novel framework (TBS) that bridges cognitive science and multi-agent simulation by separating internal reasoning from public expression, offering broad interdisciplinary impact across computational social science, opinion dynamics, and AI. It addresses fundamental questions about social deliberation mechanisms. Paper 2, while technically solid, addresses a narrower engineering problem (context poisoning in time series foundation models) with incremental improvements (~1.95% MASE reduction). Paper 1's conceptual contribution—making internal-to-public expression pathways observable—has greater potential to influence multiple research communities and inspire new methodological directions.

vs. Evaluating Agentic Configuration Repair for Computer Networks

gemini-3.1 · 6/5/2026

Paper 1 introduces a novel inference-time optimization framework and a new theoretical property (context sensitivity profiles) for time series foundation models, offering fundamental methodological contributions. Paper 2, while practically valuable, primarily benchmarks existing agentic LLM techniques for a specific application, offering less foundational scientific innovation.

vs. AgentProcessBench: Diagnosing Step-Level Process Quality in Tool-Using Agents

claude-opus-4.6 · 6/5/2026

AgentProcessBench addresses a more broadly impactful problem—step-level verification for tool-using LLM agents—which is central to the rapidly growing field of AI agents. It introduces the first benchmark of its kind with substantial human annotations (8,509 labeled steps), enabling reproducible research across the community. Its insights on process-level supervision complementing outcome supervision have broad implications for test-time scaling and reward modeling. Paper 1, while technically sound, addresses a narrower optimization for a specific class of time series models (TSFMs) with modest improvements (+1.95% MASE), limiting its breadth of impact.

vs. LLM4Cov: Execution-Aware Agentic Learning for High-coverage Testbench Generation

gemini-3.1 · 6/5/2026

Paper 2 addresses a fundamental vulnerability (context poisoning) in Time Series Foundation Models, proposing a lightweight, training-free inference-time optimization. Because time series forecasting applies universally across domains (finance, healthcare, climate), improving zero-shot accuracy without parameter updates offers broader, more immediate real-world impact compared to Paper 1's focus on the important but more specialized domain of hardware verification.

vs. VAMPS: Visual-Assisted Mathematical Problem Solving Benchmark

gpt-5.2 · 6/5/2026

Paper 2 has higher potential impact due to a novel, broadly applicable inference-time intervention for time-series foundation models that improves accuracy without retraining, making it practical for real deployments. It introduces a clear problem (context poisoning), a concrete method (Gate/Router/Critic), and quantifies gains across many datasets, suggesting methodological rigor and generality. The added concept of context sensitivity profiles could influence both evaluation and future TSFM design. Paper 1 is valuable as a diagnostic benchmark, but benchmarks typically yield narrower downstream impact unless widely adopted.

vs. Early Diagnosis of Wasted Computation in Multi-Agent LLM Systems via Failure-Aware Observability

gemini-3.1 · 6/5/2026

Paper 2 addresses a critical bottleneck in the highly active field of multi-agent LLM systems: wasted computation and failure diagnosis. Its observability framework offers broad applicability for improving efficiency, reducing costs, and enhancing the reliability of complex AI systems. While Paper 1 provides a useful optimization for time series models, Paper 2's focus on foundational diagnostics for LLM agents likely yields wider, more immediate cross-disciplinary impact and addresses pressing real-world scalability challenges.

vs. MIRAGE: Mobile Agents with Implicit Reasoning and Generative World Models

gemini-3.1 · 6/5/2026

Paper 2 addresses a critical bottleneck in autonomous agents (latency and cost of explicit chain-of-thought) by introducing latent reasoning and a generative world model. This approach has broad implications for efficient, real-time multimodal agents and implicit reasoning, offering significant efficiency gains (75% fewer tokens). Paper 1 offers a useful but more niche inference-time optimization for time series models with relatively modest performance improvements, making Paper 2's methodological innovation and potential cross-field impact substantially higher.

vs. Revisiting Ripple Effects in Knowledge Editing through Pressure-Aware Joint Neighborhood Optimization

gemini-3.1 · 6/5/2026

Paper 1 addresses knowledge editing in Large Language Models, a highly active field with broad cross-disciplinary implications for AI safety and reliability. By formally coupling propagation and preservation pressures, it introduces a novel theoretical framework to a critical bottleneck. Paper 2's focus on Time Series Foundation Models is highly practical but targets a narrower application domain, giving Paper 1 a higher potential for widespread scientific impact and broader real-world relevance.

vs. Consistency evaluation of benchmarks used for causal discovery

gemini-3.1 · 6/5/2026

Paper 2 demonstrates higher potential scientific impact because it addresses a systemic issue in causal discovery: the validity of its evaluation benchmarks. By analyzing the consistency of 11 popular benchmarks against over 38,000 domain papers, it exposes flaws affecting the entire field, especially emerging LLM-based methods. While Paper 1 offers a valuable inference-time optimization for time-series models, Paper 2's foundational critique is likely to influence how all future causal discovery methods are evaluated, forcing methodological shifts and ensuring higher long-term relevance across multiple scientific disciplines.

vs. PolarMem: A Training-Free Polarized Latent Graph Memory for Verifiable Vision-Language Models

claude-opus-4.6 · 6/5/2026

PolarMem introduces a more novel conceptual contribution—negative/polarized memory for VLMs—which addresses a fundamental gap in how memory systems handle absence and logical exclusion. Its breadth of evaluation (8 backbones, 6 benchmarks) and the generality of the framework (training-free, applicable to any frozen VLM) suggest wider impact across multimodal AI. GITCO, while practical, addresses a narrower problem (context poisoning in time series foundation models) with incremental improvements (~2% MASE reduction) on a single model family, limiting its breadth and transformative potential.

vs. PyraMathBench: Evaluating and Improving Mathematical Capability in Large Language Models

claude-opus-4.6 · 6/5/2026

GITCO introduces a novel inference-time optimization framework addressing a newly identified problem (context poisoning) in time series foundation models, with rigorous evaluation across 53 datasets. It also introduces 'context sensitivity profiles' as a new characterizable property. While Paper 2 contributes a useful benchmark and training improvements for LLM math reasoning, the space is crowded with similar benchmarks. GITCO's approach of optimizing input context without weight updates is more innovative and has broader applicability across the growing TSFM ecosystem.

vs. SkillPyramid: A Hierarchical Skill Consolidation Framework for Self-Evolving Agents

gpt-5.2 · 6/5/2026

Paper 2 likely has higher scientific impact due to broader applicability and timeliness: hierarchical skill consolidation and self-evolving agents address a central bottleneck in modern agentic AI and can transfer across many domains (tool use, robotics, web agents). The reported gains across multiple environments and backbones suggest generality, and the framework could influence downstream system design. Paper 1 is novel and rigorous for TSFMs and offers practical inference-time robustness, but its impact is narrower (forecasting foundation models) and the absolute improvement is modest, limiting cross-field breadth.