From XXLTraffic to EvoXXLTraffic: Scaling Traffic Forecasting to Sensor-Evolving Networks

Du Yin, Hao Xue, Arian Prabowo, Shuang Ao, Flora Salim

May 28, 2026

arXiv:2605.29768v1

cs.AI (primary)

#1510 of 2821 · Artificial Intelligence

Tournament Score: 1400 ± 48 (scale 1050 to 1800)

Win Rate: 56%

Rating: 6.5 / 10

Significance 7 · Rigor 6 · Novelty 6.5 · Clarity 5.5

Abstract

Existing traffic forecasting benchmarks assume a fixed sensor set, but real road-sensor networks grow continuously as the road network changes year by year. We introduce the XXLTraffic dataset family, which spans up to 27 years of California PeMS and Transport for NSW data. The fixed-sensor subsets of XXLTraffic support extremely long forecasting with multi-year gaps and standard hourly / daily long-horizon forecasting. We extend it to EvoXXLTraffic, a sensor-evolving reorganization that exposes per-year active sensors, yearly traffic-flow matrices, and yearly graph snapshots across nine PeMS districts, with growth ratios ranging from +305% to over +10,000%. We define a yearly streaming forecasting protocol on EvoXXLTraffic in which each calendar year is a continual task, and benchmark a wide range of representative baselines drawn from static spatio-temporal GNNs, naïve online schemes, evolving-graph continual methods, and retrieval / test-time methods. We find that our ultra-large evolutionary dataset better reflects the real world, and many state-of-the-art (SOTA) results no longer work. Our dataset complements existing benchmarks by enabling more realistic forecasting under ultra-long evolutionary road networks.

AI Impact Assessments

(1 model)

Scientific Impact Assessment: "From XXLTraffic to EvoXXLTraffic: Scaling Traffic Forecasting to Sensor-Evolving Networks"

1. Core Contribution

This paper introduces two interconnected dataset contributions for traffic forecasting: XXLTraffic, a fixed-sensor dataset family spanning up to 27 years of California PeMS and NSW Transport data, and EvoXXLTraffic, a sensor-evolving reorganization that exposes yearly active sensor sets, traffic-flow matrices, and graph snapshots across nine PeMS districts. The key novelty lies in moving beyond the fixed-sensor assumption that dominates existing traffic forecasting benchmarks. EvoXXLTraffic captures real-world sensor network growth (ranging from +305% to over +10,000%) and defines a yearly streaming forecasting protocol where each calendar year constitutes a continual learning task.

The paper identifies a practical but largely ignored problem: real traffic sensor networks are not static. Sensors are added, removed, and the graph topology evolves. By formalizing this as a benchmark problem with clear definitions (sensor evolution, cold-start sensors, gap-based forecasting), the authors provide a structured framework for future research.
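To make the growth ratios and the notion of cold-start sensors concrete, here is a minimal Python sketch. It assumes growth is measured relative to the first year's sensor count and that a cold-start sensor is one appearing in a year's active set for the first time; the paper's exact definitions may differ. The function names, sensor IDs, and years are hypothetical.

```python
def growth_ratio(first_count: int, last_count: int) -> float:
    """Percentage growth from the first to the last yearly sensor count."""
    return (last_count - first_count) / first_count * 100.0


def new_sensors_per_year(active_by_year: dict) -> dict:
    """For each year, the sensors active that year that were never active before."""
    seen = set()
    result = {}
    for year in sorted(active_by_year):
        result[year] = active_by_year[year] - seen
        seen |= active_by_year[year]
    return result


# Hypothetical toy network: IDs and years are made up for illustration.
active = {
    2001: {"s1", "s2"},
    2002: {"s1", "s2", "s3"},
    2003: {"s1", "s3", "s4", "s5", "s6", "s7", "s8", "s9"},
}
cold_start = new_sensors_per_year(active)
ratio = growth_ratio(len(active[2001]), len(active[2003]))
print(f"growth: +{ratio:.0f}%, new in 2003: {sorted(cold_start[2003])}")
```

Under this reading, a reported growth of +305% would mean the final sensor count is roughly four times the initial one.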

2. Methodological Rigor

The dataset construction pipeline is thoroughly described, covering raw data collection, preprocessing (forward/backward fill, zero fallback), yearly graph construction via Haversine distance with Gaussian kernel thresholding, and sensor alignment across years. The protocol is well-defined: 60/20/20 chronological splits, z-score normalization, fixed input/output lengths, and clear evaluation metrics.
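The graph-construction step can be sketched as follows. This is an illustrative reconstruction rather than the authors' code: it assumes the common recipe w_ij = exp(-d_ij^2 / sigma^2) with sigma set to the standard deviation of pairwise distances and a cutoff of 0.1. Both choices, along with the function names and coordinates, are assumptions.

```python
import math
from statistics import pstdev

EARTH_RADIUS_KM = 6371.0


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def gaussian_adjacency(coords, threshold=0.1):
    """Weighted adjacency w_ij = exp(-d_ij^2 / sigma^2), zeroed below `threshold`.

    sigma is the population std of off-diagonal distances (an assumption).
    """
    n = len(coords)
    dist = [[haversine_km(*coords[i], *coords[j]) for j in range(n)] for i in range(n)]
    off_diag = [dist[i][j] for i in range(n) for j in range(n) if i != j]
    sigma = pstdev(off_diag) if off_diag else 0.0
    if sigma == 0.0:
        raise ValueError("need at least three sensors with differing pairwise distances")
    adj = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            w = math.exp(-(dist[i][j] ** 2) / (sigma ** 2))
            adj[i][j] = w if w >= threshold else 0.0
    return adj


# Hypothetical sensors: two about 1 km apart, a third about 50 km north.
coords = [(34.05, -118.25), (34.06, -118.25), (34.50, -118.25)]
adj = gaussian_adjacency(coords)
```

In a sensor-evolving setting, a function like this would presumably be run once per year on that year's active sensors to produce the yearly graph snapshot.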

The benchmarking is extensive, covering 13 baselines across four categories: static STGNNs (DCRNN, ASTGCN, TGCN), naïve schemes (Pretrain, Retrain, Online-NN, Online-AN), evolving-graph continual methods (TrafficStream, PECPM, STKEC, EAC), and retrieval/test-time methods (STRAP, ST-TTC). Results are reported with mean±std over five seeds, lending statistical credibility. The dual evaluation—all-sensor and new-sensor (cold-start)—is a particularly strong design choice that reveals complementary failure modes.
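A small Python sketch of the dual evaluation shows how an all-sensor score can mask a cold-start failure. The data are made up, three seeds stand in for the five used in the paper, and the sample standard deviation is used because the source does not say which estimator was reported.

```python
from statistics import mean, stdev


def mae(y_true, y_pred, sensor_ids, keep=None):
    """Mean absolute error over sensors; `keep` restricts to a subset of sensor IDs."""
    errs = [
        abs(t - p)
        for sid, t_row, p_row in zip(sensor_ids, y_true, y_pred)
        if keep is None or sid in keep
        for t, p in zip(t_row, p_row)
    ]
    return mean(errs)


# Hypothetical toy data: one row of values per sensor.
sensor_ids = ["old1", "old2", "new1"]
new_sensors = {"new1"}  # sensors that first appear in the current year
y_true = [[10, 12], [20, 22], [30, 32]]
preds_by_seed = {
    0: [[11, 12], [20, 21], [36, 38]],
    1: [[10, 13], [21, 22], [34, 36]],
    2: [[10, 12], [20, 22], [35, 37]],
}

all_mae = [mae(y_true, p, sensor_ids) for p in preds_by_seed.values()]
new_mae = [mae(y_true, p, sensor_ids, keep=new_sensors) for p in preds_by_seed.values()]

print(f"all-sensor MAE: {mean(all_mae):.2f} ± {stdev(all_mae):.2f}")
print(f"new-sensor MAE: {mean(new_mae):.2f} ± {stdev(new_mae):.2f}")
```

In this toy case the error on the new sensor is several times the all-sensor average, which is the kind of gap the second metric is meant to expose.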

However, there are methodological concerns. The fixed-sensor subsets use only 10% subsampling of the data with a fixed seed, justified by Table 9 showing similar rankings with 50% and 100%—but this is demonstrated only for MICN on one dataset, which is insufficient to generalize. The backbone architecture (GCN+TCN, hidden width 64) is relatively simple, and it's unclear whether the conclusions would hold with more expressive architectures. The learning rate of 0.03 for AdamW is unusually high and may disadvantage certain methods.

3. Potential Impact

Dataset contribution: This is the paper's strongest dimension. The traffic forecasting community has long relied on datasets like METR-LA (4 months, 207 sensors) and PEMS-BAY (6 months, 325 sensors). XXLTraffic/EvoXXLTraffic is a substantial scaling in both temporal coverage (~27 years) and the number of sensors (up to ~19,672 across all districts). The sensor-evolving dimension fills a genuine gap.

Benchmark findings: The key finding that Online-AN (simple yearly fine-tuning on all active sensors) dominates sophisticated continual learning methods is both surprising and practically important. It suggests the field's existing evolving-graph methods were designed for an overly benign regime (small proportions of new nodes). The identification of "small initial graph + large yearly delta + node-indexed parameterization" as the failure regime is a useful diagnostic for future method design.
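The failure regime can be illustrated with a deliberately tiny toy in Python. It is not the paper's GCN+TCN backbone or its Online-AN implementation: the "model" is one mean-flow parameter per sensor, `fine_tune` is a hypothetical warm-start update, and all numbers are invented. It only shows why a node-indexed model frozen after the first year has nothing to offer sensors that appear later.

```python
from statistics import mean


def fine_tune(params, train, lr=0.5):
    """Warm-start update: nudge existing per-sensor parameters toward this
    year's data and initialise parameters for sensors seen for the first time."""
    updated = dict(params)
    for sid, vals in train.items():
        target = mean(vals)
        if sid in updated:
            updated[sid] += lr * (target - updated[sid])
        else:
            updated[sid] = target
    return updated


def predict(params, sid):
    """Use the sensor's own parameter, or a global fallback for unseen sensors."""
    return params.get(sid, mean(list(params.values())))


def yearly_mae(params, test):
    """Mean absolute error over all test values of all sensors in `test`."""
    errs = [abs(v - predict(params, sid)) for sid, vals in test.items() for v in vals]
    return mean(errs)


# Hypothetical stream: {year: {sensor_id: (train_values, test_values)}}.
stream = {
    2001: {"a": ([100, 102], [101])},
    2002: {"a": ([100, 104], [103]), "b": ([400, 410], [405])},
    2003: {"a": ([104, 106], [105]), "b": ([410, 420], [415]), "c": ([900, 910], [905])},
}

frozen = None   # "Pretrain"-style: fitted on the first year, never updated
online = {}     # "Online-AN"-style: updated every year on all active sensors
results = {}
for year in sorted(stream):
    train = {sid: tr for sid, (tr, _) in stream[year].items()}
    test = {sid: te for sid, (_, te) in stream[year].items()}
    online = fine_tune(online, train)
    if frozen is None:
        frozen = dict(online)
    results[year] = (yearly_mae(frozen, test), yearly_mae(online, test))
```

With a small initial graph and large yearly additions, the frozen model's error is dominated by sensors it has no parameter for, while the yearly update keeps pace.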

Broader influence: The dataset could impact adjacent areas including continual learning, cold-start problems in graph learning, urban computing, and foundation models for spatio-temporal data. The gap-forecasting setup (1–2 year gaps between observation and prediction) is relevant for infrastructure planning scenarios.
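The gap-forecasting setup amounts to leaving a block of unobserved steps between the input window and the target window. A minimal Python sketch of the index arithmetic follows; the window lengths, daily resolution, and function name are hypothetical and not taken from the paper.

```python
def gap_windows(n_steps, input_len, gap, horizon, stride=1):
    """Yield (input_range, target_range) index pairs with `gap` unobserved
    steps between the end of the input and the start of the target."""
    total = input_len + gap + horizon
    for start in range(0, n_steps - total + 1, stride):
        in_end = start + input_len
        tgt_start = in_end + gap
        yield range(start, in_end), range(tgt_start, tgt_start + horizon)


# Hypothetical daily series covering three years, with a one-year gap.
windows = list(gap_windows(n_steps=3 * 365, input_len=96, gap=365, horizon=96))
first_input, first_target = windows[0]
print(len(windows), first_input, first_target)
```

A longer gap shrinks the number of usable windows for a fixed series length, which is one reason multi-year gaps require multi-decade data.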

4. Timeliness & Relevance

The paper addresses a timely need. As spatio-temporal foundation models emerge, there is growing demand for large-scale, long-duration, realistic benchmarks. Existing benchmarks are saturating—many methods achieve similar performance on METR-LA/PEMS-BAY. The evolving-graph setting is particularly relevant as cities worldwide continue expanding sensor infrastructure. The connection to continual learning, an active research frontier, adds timeliness.

5. Strengths & Limitations

Strengths:

Unprecedented scale: 27 years, 9 districts, up to ~19,672 sensors, capturing real-world growth dynamics

Well-formulated problem taxonomy distinguishing fixed-sensor and evolving-sensor settings with clear mathematical definitions

Comprehensive benchmarking with both all-sensor and cold-start evaluation revealing complementary insights

The surprising dominance of Online-AN challenges assumptions and redirects research attention

Training efficiency analysis (Figure 6) showing expensive continual methods are Pareto-dominated

Open-source commitment with code and data

Limitations:

The paper is primarily a dataset/benchmark paper with no proposed method. While this is acceptable, the analytical insights could go deeper—e.g., why does Online-AN work so well? What structural properties of traffic data make sophisticated continual learning unnecessary?

The gap-forecasting experiments (Tables 7-8) use time-series baselines that may not be optimal for this setting. No method specifically designed for non-contiguous forecasting is tested.

The zero-fallback imputation for missing values conflates genuinely zero-flow sensors with missing data, potentially introducing bias.

The yearly granularity for tasks may be too coarse—sensor installations happen continuously, and a finer-grained streaming protocol might reveal different dynamics.

The paper is an extension of a SIGSPATIAL 2025 conference paper, and while the EvoXXLTraffic extension is substantial, some portions (gap/hourly/daily forecasting) are directly retained.

PEMS11 results show many outliers/missing entries, suggesting data quality issues that are insufficiently addressed.

The paper does not discuss potential biases from the geographic focus (California + NSW) or how findings generalize to other regions.
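The zero-fallback limitation noted above can be seen in a short Python sketch. It assumes the fill order is forward fill, then backward fill, then zero, which the source implies but does not spell out; missing values are represented as `None`, and the function name is hypothetical.

```python
def fill_missing(series):
    """Forward fill, then backward fill, then zero fallback (missing = None)."""
    filled = list(series)
    last = None
    for i, v in enumerate(filled):  # forward fill from the last observed value
        if v is None:
            filled[i] = last
        else:
            last = v
    nxt = None
    for i in range(len(filled) - 1, -1, -1):  # backward fill any leading gap
        if filled[i] is None:
            filled[i] = nxt
        else:
            nxt = filled[i]
    # Zero fallback: only reached when the whole series was missing.
    return [0.0 if v is None else v for v in filled]


partly_missing = fill_missing([None, 5.0, None, 7.0, None])
fully_missing = fill_missing([None, None, None])
truly_zero = fill_missing([0.0, 0.0, 0.0])

# One possible mitigation: keep an observed/missing mask alongside the values.
mask = [v is not None for v in [None, None, None]]
```

After filling, `fully_missing` and `truly_zero` are identical, so a downstream model cannot tell a dead sensor from an empty road unless something like the mask is kept.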

Additional observations: The paper's tables are extremely dense (Tables 12-15 span multiple pages), making it difficult to extract high-level patterns. A more concise presentation with summary statistics and detailed tables in appendices would improve readability. The connection between the fixed-sensor and evolving-sensor experiments could be made more explicit—do the gap-forecasting insights transfer to the streaming setting?

Rating: 6.5 / 10

Significance 7 · Rigor 6 · Novelty 6.5 · Clarity 5.5

Generated May 29, 2026

Comparison History (16)

vs. EvoMD-LLM: Learning the Language of Species Evolution in Reactive Molecular Dynamics

gemini-3.1 · 5/29/2026

Paper 2 presents a highly novel approach by adapting LLMs to model complex, dynamic physical processes in molecular dynamics. Its cross-disciplinary potential to impact chemistry, materials science, and AI-for-science gives it a broader and more fundamental scientific footprint. While Paper 1 introduces an impressive and necessary dataset for traffic forecasting, Paper 2's methodological innovation in bridging linguistic models with temporal physical simulations offers wider foundational implications.

vs. The Chain Holds, the Answer Folds: Trace-Answer Dissociation in Reasoning Models Under Adversarial Pressure

claude-opus-4.6 · 5/29/2026

Paper 2 identifies a novel, previously undocumented failure mode ('unfaithful capitulation') in reasoning LLMs where the chain-of-thought remains correct but the final answer flips under adversarial pressure. This has immediate, broad implications for AI safety, alignment, and deployment of reasoning models in real-world multi-turn settings. The finding is methodologically rigorous with causal evidence and cross-model validation. Paper 1 contributes a valuable benchmark dataset for traffic forecasting under evolving sensor networks, but its impact is more domain-specific and incremental compared to the fundamental insight about reasoning model reliability in Paper 2.

vs. Tailoring the Curriculum: Student-Centered Reasoning Distillation via Dynamic Data-Model Compatibility

claude-opus-4.6 · 5/29/2026

Paper 1 introduces a novel, large-scale benchmark (XXLTraffic/EvoXXLTraffic) addressing a fundamental gap in traffic forecasting research—sensor-evolving networks over decades. This creates lasting infrastructure for the community, reveals that existing SOTA methods fail under realistic conditions, and opens new research directions in continual/evolving graph learning. Paper 2, while methodologically sound, offers an incremental contribution (data selection metric for distillation) in an already crowded space. Benchmark papers with novel problem formulations and publicly released datasets tend to have broader, longer-lasting impact across multiple research communities.

vs. Paper Agents, Paper Gains: An Empirical Analysis of DeFi Investment Agents

gemini-3.1 · 5/29/2026

Paper 2 introduces a foundational, large-scale benchmark dataset for spatio-temporal forecasting that addresses a fundamental flaw in existing research (fixed vs. evolving sensor networks). By demonstrating that current state-of-the-art models fail under realistic, evolving conditions, it is likely to drive sustained methodological innovation across time series forecasting, continual learning, and graph neural networks. Paper 1 provides a valuable empirical reality check on a timely speculative bubble (AI in DeFi), but its impact is narrower and more temporal compared to establishing a new benchmark in machine learning.

vs. A Matter of TASTE: Improving Coverage and Difficulty of Agent Benchmarks

gemini-3.1 · 5/29/2026

Paper 2 addresses a critical and highly timely challenge in AI: the rapid saturation of LLM agent benchmarks. By introducing an automated, scalable method to generate diverse tool-use tasks, it has broad implications for evaluating future autonomous systems. While Paper 1 provides a highly valuable dataset for spatio-temporal forecasting, Paper 2's focus on LLM agents gives it a significantly wider potential impact across numerous fast-paced AI subfields.

vs. CIVIC: End-to-End Sequence Compactness for Efficient Vision-Language Models

claude-opus-4.6 · 5/29/2026

Paper 1 introduces a novel, large-scale benchmark (XXLTraffic/EvoXXLTraffic) addressing a fundamental and previously overlooked gap in traffic forecasting: sensor-evolving networks over ultra-long time spans. This creates a new research paradigm for continual/evolving graph learning with broad impact across transportation, urban computing, and continual learning communities. The finding that many SOTA methods fail under realistic conditions is highly impactful. Paper 2, while technically solid, is an incremental efficiency improvement for VLMs focused on token reduction, a crowded research area with many competing approaches, and is architecture-specific (Qwen3-VL), limiting its breadth of impact.

vs. BenchTrace: A Benchmark for Testing Reflection Ability and Controlled Evolution in LLM Agents

gpt-5.2 · 5/29/2026

Paper 2 has higher likely scientific impact due to broader, timely relevance to LLM agents and alignment with a fast-moving area (reflection/self-improvement). It introduces a targeted, model-agnostic evaluation framework, a new metric (FAR), and controlled simulations that enable diagnosing failure modes beyond aggregate task scores—useful across many agent architectures and application domains. Paper 1 is strong and rigorous, with a valuable large-scale evolving-graph dataset for traffic forecasting, but its impact is more domain-specific (spatio-temporal forecasting/transport) and less cross-field than a general benchmark for agent self-evolution.

vs. Uncertainty-Aware Transfer Learning for Cross-Building Energy Forecasting: Toward Robust and Scalable District-Level Energy Management

gpt-5.2 · 5/29/2026

Paper 2 likely has higher impact due to a major, timely benchmark contribution: an ultra-large, multi-decade, sensor-evolving traffic dataset plus a streaming continual-forecasting protocol that exposes a realistic failure mode of current SOTA. This can reshape evaluation practices and spur new methods across spatio-temporal ML, continual learning, and dynamic graph forecasting, with broad applicability to real transportation systems. Paper 1 is solid and rigorous (uncertainty + transfer + TRI), but is narrower in scope (two-building case study) and less likely to become a widely adopted community benchmark.

vs. Physics Is All You Need? A Case Study in Physicist-Supervised AI Development of Scientific Software

gemini-3.1 · 5/29/2026

Paper 1 introduces a massive, realistic dataset and benchmark spanning up to 27 years, addressing a critical flaw in existing traffic forecasting models (fixed sensor sets). Large-scale datasets often drive significant methodological advances and garner high citations. While Paper 2 explores a timely topic (AI agents in science), its methodological rigor is limited by being an N=1 case study, whereas Paper 1 provides a rigorous, broadly applicable benchmark that directly challenges and will likely evolve the current state-of-the-art in continual learning and spatio-temporal forecasting.

vs. Beyond Attack Success Rate: Temporal Logit Observability for LLM Safety Failures

claude-opus-4.6 · 5/29/2026

Paper 1 introduces a novel diagnostic framework (TLO) that addresses a fundamental limitation in LLM safety evaluation—moving beyond binary ASR to temporal, mechanistic understanding of jailbreak failures. It offers both theoretical insight and practical utility (early-stop defense), is training-free, and is broadly applicable across models and attack types. Paper 2 contributes a valuable large-scale benchmark for traffic forecasting with evolving sensors, but its impact is more domain-specific. Paper 1's relevance to the rapidly growing AI safety field, methodological novelty, and cross-cutting applicability give it higher potential impact.

vs. CubePart: An Open-Vocabulary Part-Controllable 3D Generator

gemini-3.1 · 5/29/2026

Paper 2 introduces a massive, multi-decade benchmark dataset that breaks existing assumptions in traffic forecasting by incorporating sensor-evolving networks. By demonstrating that current state-of-the-art models fail on this realistic setup, it forces a paradigm shift in spatio-temporal modeling and continual learning. While Paper 1 offers a highly useful generative AI tool for 3D modeling, fundamental benchmark datasets like the one in Paper 2 typically drive broader, long-lasting methodological advancements across an entire subfield.

vs. GPS-Enhanced Tourist Mobility Modeling with Seasonal Spatial Priors and LLM-Based Activity Chain Generation

claude-opus-4.6 · 5/29/2026

Paper 1 introduces a large-scale, multi-decade benchmark dataset (XXLTraffic/EvoXXLTraffic) addressing a fundamental gap in traffic forecasting research—evolving sensor networks. This has broader impact by enabling more realistic evaluation of spatiotemporal models, challenging existing SOTA methods, and providing a community resource spanning 27 years across multiple districts. Its contributions (new dataset, evaluation protocol, comprehensive benchmarking) have high reuse potential. Paper 2 presents a domain-specific simulation framework for tourist mobility in Tokyo with narrower applicability, less methodological novelty (combining existing techniques like LLMs with GPS priors), and more limited generalizability.

vs. Prompt Codebooks: Discrete Compositional Optimization for Language Model Instruction Refinement

gemini-3.1 · 5/29/2026

Paper 1 addresses the highly active field of Large Language Models, proposing a novel compositional prompt optimization framework that can be broadly applied across various LLM-based agentic workflows. Its potential for real-world application and cross-domain impact is significantly higher than Paper 2, which introduces a valuable but domain-specific dataset for traffic forecasting.

vs. MUSE: Benchmarking Manufacturable, Functional, and Assemblable Text-to-CAD Generation

gemini-3.1 · 5/29/2026

Paper 1 bridges a critical gap between generative AI and real-world manufacturing by benchmarking Text-to-CAD models on functionality and assemblability. This has broad, transformative potential across industrial design, mechanical engineering, and AI. Paper 2 presents a valuable and realistic dataset for traffic forecasting, but its impact is relatively confined to the niche of spatio-temporal modeling and urban planning.

vs. The Importance of Out-of-Band Metadata for Safe Autonomous Agents: The Redpanda Agentic Data Plane

claude-opus-4.6 · 5/29/2026

Paper 2 introduces a novel, large-scale benchmark dataset (XXLTraffic/EvoXXLTraffic) addressing a fundamental gap in traffic forecasting research—the unrealistic assumption of fixed sensor sets. Spanning up to 27 years of real-world data with a well-defined streaming evaluation protocol, it enables more realistic research in spatio-temporal forecasting, continual learning, and evolving graph methods. Its broad applicability across multiple ML subfields and potential to reshape benchmarking standards gives it higher scientific impact. Paper 1, while practically relevant, is more of an engineering architecture for AI agent governance with a narrower, industry-focused contribution and less methodological novelty.

vs. AlphaTransit: Learning to Design City-scale Transit Routes

claude-opus-4.6 · 5/29/2026

Paper 1 introduces a large-scale, multi-decade benchmark (XXLTraffic/EvoXXLTraffic) addressing a fundamental gap in traffic forecasting research: sensor network evolution over time. Benchmarks that expose limitations of SOTA methods tend to have broad, lasting impact by redirecting an entire research community. The dataset spans 27 years across multiple districts, enabling new research directions in continual learning, evolving graphs, and realistic traffic forecasting. Paper 2, while methodologically interesting in applying AlphaZero-style planning to transit design, is evaluated on a single city benchmark and represents a more incremental application of existing techniques (MCTS + neural networks) to a specific problem.