From XXLTraffic to EvoXXLTraffic: Scaling Traffic Forecasting to Sensor-Evolving Networks
Du Yin, Hao Xue, Arian Prabowo, Shuang Ao, Flora Salim
Abstract
Existing traffic forecasting benchmarks assume a fixed sensor set, but real road-sensor networks grow continuously as the road network changes year by year. We introduce the XXLTraffic dataset family, which spans up to 27 years of California PeMS and Transport for NSW data. The fixed-sensor subsets of XXLTraffic support extremely long forecasting with multi-year gaps and standard hourly / daily long-horizon forecasting. We extend it to EvoXXLTraffic, a sensor-evolving reorganization that exposes per-year active sensors, yearly traffic-flow matrices, and yearly graph snapshots across nine PeMS districts, with growth ratios ranging from +305% to over +10,000%. We define a yearly streaming forecasting protocol on EvoXXLTraffic in which each calendar year is a continual task, and benchmark a wide range of representative baselines drawn from static spatio-temporal GNNs, naïve online schemes, evolving-graph continual methods, and retrieval / test-time methods. We find that our ultra-large evolutionary dataset better reflects real-world conditions, and that many state-of-the-art (SOTA) methods no longer perform well on it. Our dataset complements existing benchmarks by enabling more realistic forecasting under ultra-long evolutionary road networks.
AI Impact Assessments
Scientific Impact Assessment: "From XXLTraffic to EvoXXLTraffic: Scaling Traffic Forecasting to Sensor-Evolving Networks"
1. Core Contribution
This paper introduces two interconnected dataset contributions for traffic forecasting: XXLTraffic, a fixed-sensor dataset family spanning up to 27 years of California PeMS and NSW Transport data, and EvoXXLTraffic, a sensor-evolving reorganization that exposes yearly active sensor sets, traffic-flow matrices, and graph snapshots across nine PeMS districts. The key novelty lies in moving beyond the fixed-sensor assumption that dominates existing traffic forecasting benchmarks. EvoXXLTraffic captures real-world sensor network growth (ranging from +305% to over +10,000%) and defines a yearly streaming forecasting protocol where each calendar year constitutes a continual learning task.
The paper identifies a practical but largely ignored problem: real traffic sensor networks are not static. Sensors are added and removed, and the graph topology evolves. By formalizing this as a benchmark problem with clear definitions (sensor evolution, cold-start sensors, gap-based forecasting), the authors provide a structured framework for future research.
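To make the notions of sensor evolution and cold-start sensors concrete, the following is a minimal sketch of the bookkeeping they imply. The input layout (a dict from year to a set of sensor IDs), the function name `sensor_evolution`, and the choice to call a sensor "new" when it is absent from the previous year are all assumptions made for illustration; the paper's exact definitions may differ.

```python
def sensor_evolution(active_by_year):
    """Summarize how the active sensor set changes from year to year.

    active_by_year: dict mapping year -> set of sensor IDs (hypothetical layout).
    Returns one record per year, in chronological order.
    """
    years = sorted(active_by_year)
    first_count = len(active_by_year[years[0]])
    records = []
    previous = set()
    for year in years:
        current = active_by_year[year]
        records.append({
            "year": year,
            "active": len(current),
            # Assumption: a cold-start sensor is one absent in the previous year.
            "new": sorted(current - previous),
            "removed": sorted(previous - current),
            # Growth relative to the first year, in percent.
            "growth_pct": 100.0 * (len(current) - first_count) / first_count,
        })
        previous = current
    return records


# Toy data, not taken from the dataset.
toy = {
    2001: {"s1", "s2"},
    2002: {"s1", "s2", "s3", "s4"},
    2003: {"s2", "s3", "s4", "s5", "s6", "s7"},
}
summary = sensor_evolution(toy)
```

In this toy example the network grows from two to six sensors, so the final record reports a growth of +200% together with three cold-start sensors and one removed sensor.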
2. Methodological Rigor
The dataset construction pipeline is thoroughly described, covering raw data collection, preprocessing (forward/backward fill, zero fallback), yearly graph construction via Haversine distance with Gaussian kernel thresholding, and sensor alignment across years. The protocol is well-defined: 60/20/20 chronological splits, z-score normalization, fixed input/output lengths, and clear evaluation metrics.
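The graph construction and preprocessing steps named above can be sketched as follows. This is an illustrative sketch only: the kernel form exp(-d²/σ²), the parameters `sigma_km` and `threshold`, the use of training-split statistics for the z-score, and every function name are assumptions, since the review does not state the paper's exact choices.

```python
import math
import statistics

EARTH_RADIUS_KM = 6371.0


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    # Clamp guards against rounding pushing sqrt(a) marginally above 1.
    return 2 * EARTH_RADIUS_KM * math.asin(min(1.0, math.sqrt(a)))


def gaussian_adjacency(coords, sigma_km, threshold):
    """Weighted adjacency: w_ij = exp(-d_ij^2 / sigma^2), zeroed when below threshold."""
    n = len(coords)
    adj = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            d = haversine_km(*coords[i], *coords[j])
            w = math.exp(-(d ** 2) / (sigma_km ** 2))
            adj[i][j] = w if w >= threshold else 0.0
    return adj


def chronological_split(series, train_pct=60, val_pct=20):
    """Split a time-ordered sequence into train/val/test without shuffling."""
    n = len(series)
    i = n * train_pct // 100
    j = n * (train_pct + val_pct) // 100
    return series[:i], series[i:j], series[j:]


def zscore(values, mean, std):
    """Standardize values with a given mean and standard deviation."""
    return [(v - mean) / std for v in values]


# Toy coordinates: two nearby sensors and one several hundred kilometres away.
coords = [(34.05, -118.25), (34.06, -118.25), (37.77, -122.42)]
adj = gaussian_adjacency(coords, sigma_km=5.0, threshold=0.1)

# Toy flow series; statistics come from the training split only (an assumption).
flow = [float(v) for v in range(10)]
train, val, test = chronological_split(flow)
mu, sd = statistics.mean(train), statistics.pstdev(train)
train_z = zscore(train, mu, sd)
```

With these toy inputs the two nearby sensors stay connected, while the distant sensor is cut off by the threshold. Rebuilding the matrix once per year from that year's active sensors would yield the yearly graph snapshots.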
The benchmarking is extensive, covering 13 baselines across four categories: static STGNNs (DCRNN, ASTGCN, TGCN), naïve schemes (Pretrain, Retrain, Online-NN, Online-AN), evolving-graph continual methods (TrafficStream, PECPM, STKEC, EAC), and retrieval/test-time methods (STRAP, ST-TTC). Results are reported as mean±std over five seeds, lending statistical credibility. The dual evaluation, over all sensors and over new (cold-start) sensors only, is a particularly strong design choice that reveals complementary failure modes.
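The dual evaluation and the mean±std reporting can be illustrated with a small sketch. The data layout (dicts from sensor ID to value lists), the helper names, the use of MAE as the metric, and the use of the sample standard deviation are assumptions for illustration; the values are toy numbers, not results from the paper.

```python
import statistics


def mae(y_true, y_pred, sensors):
    """Mean absolute error restricted to the given sensors."""
    errors = [abs(t - p) for s in sensors for t, p in zip(y_true[s], y_pred[s])]
    return sum(errors) / len(errors)


def mean_std(values):
    """Mean and sample standard deviation across seeds (assumed convention)."""
    return statistics.mean(values), statistics.stdev(values)


# Toy ground truth: s1 existed last year, s2 is a cold-start sensor.
y_true = {"s1": [10.0, 12.0], "s2": [20.0, 22.0]}
all_sensors = ["s1", "s2"]
new_sensors = ["s2"]

# One prediction dict per random seed (three seeds here for brevity).
preds_by_seed = [
    {"s1": [11.0, 12.0], "s2": [24.0, 26.0]},
    {"s1": [10.0, 13.0], "s2": [23.0, 25.0]},
    {"s1": [9.0, 12.0], "s2": [25.0, 27.0]},
]

all_mean, all_std = mean_std([mae(y_true, p, all_sensors) for p in preds_by_seed])
new_mean, new_std = mean_std([mae(y_true, p, new_sensors) for p in preds_by_seed])
```

In this toy case the new-sensor error is larger than the all-sensor error, which is the kind of gap an all-sensor average alone would hide.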
However, there are methodological concerns. The fixed-sensor subsets subsample only 10% of the data with a fixed seed. This is justified by Table 9, which shows similar rankings at 50% and 100%, but only for MICN on one dataset, which is insufficient to generalize. The backbone architecture (GCN+TCN, hidden width 64) is relatively simple, and it is unclear whether the conclusions would hold with more expressive architectures. The learning rate of 0.03 for AdamW is unusually high and may disadvantage certain methods.
3. Potential Impact
Dataset contribution: This is the paper's strongest dimension. The traffic forecasting community has long relied on datasets like METR-LA (4 months, 207 sensors) and PEMS-BAY (6 months, 325 sensors). XXLTraffic/EvoXXLTraffic scales up substantially in both temporal coverage (~27 years) and the number of sensors (up to ~19,672 across all districts). The sensor-evolving dimension fills a genuine gap.
Benchmark findings: The key finding that Online-AN (simple yearly fine-tuning on all active sensors) dominates sophisticated continual learning methods is both surprising and practically important. It suggests the field's existing evolving-graph methods were designed for an overly benign regime (small proportions of new nodes). The identification of "small initial graph + large yearly delta + node-indexed parameterization" as the failure regime is a useful diagnostic for future method design.
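The contrast between a model that is never updated and one that is updated every year on all active sensors can be sketched with a toy streaming loop. Everything here is a hypothetical stand-in: the "model" is a per-sensor mean with a global fallback, not a neural forecaster, and because it has no notion of warm-starting, yearly refitting only loosely approximates the fine-tuning that Online-AN performs. The function names and data are invented for illustration.

```python
def fit_mean_model(train_by_sensor):
    """Toy stand-in for a forecaster: one mean per sensor plus a global fallback."""
    per_sensor = {s: sum(v) / len(v) for s, v in train_by_sensor.items()}
    all_values = [x for v in train_by_sensor.values() for x in v]
    return {"per_sensor": per_sensor, "fallback": sum(all_values) / len(all_values)}


def predict(model, sensor):
    """Sensors unseen at fit time fall back to the global mean."""
    return model["per_sensor"].get(sensor, model["fallback"])


def yearly_mae(model, test_by_sensor):
    """Mean absolute error over every test value of every sensor in one year."""
    errors = [abs(x - predict(model, s)) for s, v in test_by_sensor.items() for x in v]
    return sum(errors) / len(errors)


def run_stream(years, refit_every_year):
    """years: chronological list of (train_by_sensor, test_by_sensor) pairs."""
    model = None
    scores = []
    for train, test in years:
        # Fit once on the first year, or again every year on all active sensors.
        if model is None or refit_every_year:
            model = fit_mean_model(train)
        scores.append(yearly_mae(model, test))
    return scores


# Toy stream: sensor "b" appears in the second year with a very different level.
years = [
    ({"a": [10.0, 10.0]}, {"a": [10.0]}),
    ({"a": [10.0, 10.0], "b": [50.0, 50.0]}, {"a": [10.0], "b": [50.0]}),
]
frozen_scores = run_stream(years, refit_every_year=False)
online_scores = run_stream(years, refit_every_year=True)
```

The frozen model has no parameters for sensor "b", so its second-year error is dominated by the cold-start sensor, while the yearly-updated model absorbs it. When the yearly delta is large relative to the initial graph, most sensors are in the position of "b".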
Broader influence: The dataset could impact adjacent areas including continual learning, cold-start problems in graph learning, urban computing, and foundation models for spatio-temporal data. The gap-forecasting setup (1–2 year gaps between observation and prediction) is relevant for infrastructure planning scenarios.
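The gap-forecasting setup amounts to leaving a block of unobserved time steps between the input window and the prediction window. A minimal sketch of the slicing follows; the function name, the parameters, and the toy lengths are assumptions, and in the dataset the gap would correspond to one or two years of time steps rather than the two steps used here.

```python
def gap_windows(series, input_len, gap, output_len, stride=1):
    """Slice (input, target) pairs where the target starts `gap` steps after the input ends."""
    samples = []
    total = input_len + gap + output_len
    for start in range(0, len(series) - total + 1, stride):
        x = series[start:start + input_len]
        y_start = start + input_len + gap
        y = series[y_start:y_start + output_len]
        samples.append((x, y))
    return samples


# Toy series of ten time steps.
series = list(range(10))
samples = gap_windows(series, input_len=3, gap=2, output_len=2)
```

Each sample skips the gap entirely, so the model never sees the values that lie between the last observation and the first forecast step.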
4. Timeliness & Relevance
The paper addresses a timely need. As spatio-temporal foundation models emerge, there is growing demand for large-scale, long-duration, realistic benchmarks. Existing benchmarks are saturating: many methods achieve similar performance on METR-LA/PEMS-BAY. The evolving-graph setting is particularly relevant as cities worldwide continue expanding sensor infrastructure. The connection to continual learning, an active research frontier, adds timeliness.
5. Strengths & Limitations
Strengths:
Limitations:
Additional observations: The paper's tables are extremely dense (Tables 12-15 span multiple pages), making it difficult to extract high-level patterns. A more concise presentation, with summary statistics in the main text and the detailed tables moved to appendices, would improve readability. The connection between the fixed-sensor and evolving-sensor experiments could also be made more explicit: do the gap-forecasting insights transfer to the streaming setting?
Generated May 29, 2026
Comparison History (16)
Paper 2 presents a highly novel approach by adapting LLMs to model complex, dynamic physical processes in molecular dynamics. Its cross-disciplinary potential to impact chemistry, materials science, and AI-for-science gives it a broader and more fundamental scientific footprint. While Paper 1 introduces an impressive and necessary dataset for traffic forecasting, Paper 2's methodological innovation in bridging linguistic models with temporal physical simulations offers wider foundational implications.
Paper 2 identifies a novel, previously undocumented failure mode ('unfaithful capitulation') in reasoning LLMs where the chain-of-thought remains correct but the final answer flips under adversarial pressure. This has immediate, broad implications for AI safety, alignment, and deployment of reasoning models in real-world multi-turn settings. The finding is methodologically rigorous with causal evidence and cross-model validation. Paper 1 contributes a valuable benchmark dataset for traffic forecasting under evolving sensor networks, but its impact is more domain-specific and incremental compared to the fundamental insight about reasoning model reliability in Paper 2.
Paper 1 introduces a novel, large-scale benchmark (XXLTraffic/EvoXXLTraffic) addressing a fundamental gap in traffic forecasting research—sensor-evolving networks over decades. This creates lasting infrastructure for the community, reveals that existing SOTA methods fail under realistic conditions, and opens new research directions in continual/evolving graph learning. Paper 2, while methodologically sound, offers an incremental contribution (data selection metric for distillation) in an already crowded space. Benchmark papers with novel problem formulations and publicly released datasets tend to have broader, longer-lasting impact across multiple research communities.
Paper 2 introduces a foundational, large-scale benchmark dataset for spatio-temporal forecasting that addresses a fundamental flaw in existing research (fixed vs. evolving sensor networks). By demonstrating that current state-of-the-art models fail under realistic, evolving conditions, it is likely to drive sustained methodological innovation across time series forecasting, continual learning, and graph neural networks. Paper 1 provides a valuable empirical reality check on a timely speculative bubble (AI in DeFi), but its impact is narrower and more temporal compared to establishing a new benchmark in machine learning.
Paper 2 addresses a critical and highly timely challenge in AI: the rapid saturation of LLM agent benchmarks. By introducing an automated, scalable method to generate diverse tool-use tasks, it has broad implications for evaluating future autonomous systems. While Paper 1 provides a highly valuable dataset for spatio-temporal forecasting, Paper 2's focus on LLM agents gives it a significantly wider potential impact across numerous fast-paced AI subfields.
Paper 1 introduces a novel, large-scale benchmark (XXLTraffic/EvoXXLTraffic) addressing a fundamental and previously overlooked gap in traffic forecasting: sensor-evolving networks over ultra-long time spans. This creates a new research paradigm for continual/evolving graph learning with broad impact across transportation, urban computing, and continual learning communities. The finding that many SOTA methods fail under realistic conditions is highly impactful. Paper 2, while technically solid, is an incremental efficiency improvement for VLMs focused on token reduction, a crowded research area with many competing approaches, and is architecture-specific (Qwen3-VL), limiting its breadth of impact.
Paper 2 has higher likely scientific impact due to broader, timely relevance to LLM agents and alignment with a fast-moving area (reflection/self-improvement). It introduces a targeted, model-agnostic evaluation framework, a new metric (FAR), and controlled simulations that enable diagnosing failure modes beyond aggregate task scores—useful across many agent architectures and application domains. Paper 1 is strong and rigorous, with a valuable large-scale evolving-graph dataset for traffic forecasting, but its impact is more domain-specific (spatio-temporal forecasting/transport) and less cross-field than a general benchmark for agent self-evolution.
Paper 2 likely has higher impact due to a major, timely benchmark contribution: an ultra-large, multi-decade, sensor-evolving traffic dataset plus a streaming continual-forecasting protocol that exposes a realistic failure mode of current SOTA. This can reshape evaluation practices and spur new methods across spatio-temporal ML, continual learning, and dynamic graph forecasting, with broad applicability to real transportation systems. Paper 1 is solid and rigorous (uncertainty + transfer + TRI), but is narrower in scope (two-building case study) and less likely to become a widely adopted community benchmark.
Paper 1 introduces a massive, realistic dataset and benchmark spanning up to 27 years, addressing a critical flaw in existing traffic forecasting models (fixed sensor sets). Large-scale datasets often drive significant methodological advances and garner high citations. While Paper 2 explores a timely topic (AI agents in science), its methodological rigor is limited by being an N=1 case study, whereas Paper 1 provides a rigorous, broadly applicable benchmark that directly challenges and will likely evolve the current state-of-the-art in continual learning and spatio-temporal forecasting.
Paper 1 introduces a novel diagnostic framework (TLO) that addresses a fundamental limitation in LLM safety evaluation—moving beyond binary ASR to temporal, mechanistic understanding of jailbreak failures. It offers both theoretical insight and practical utility (early-stop defense), is training-free, and is broadly applicable across models and attack types. Paper 2 contributes a valuable large-scale benchmark for traffic forecasting with evolving sensors, but its impact is more domain-specific. Paper 1's relevance to the rapidly growing AI safety field, methodological novelty, and cross-cutting applicability give it higher potential impact.
Paper 2 introduces a massive, multi-decade benchmark dataset that breaks existing assumptions in traffic forecasting by incorporating sensor-evolving networks. By demonstrating that current state-of-the-art models fail on this realistic setup, it forces a paradigm shift in spatio-temporal modeling and continual learning. While Paper 1 offers a highly useful generative AI tool for 3D modeling, fundamental benchmark datasets like the one in Paper 2 typically drive broader, long-lasting methodological advancements across an entire subfield.
Paper 1 introduces a large-scale, multi-decade benchmark dataset (XXLTraffic/EvoXXLTraffic) addressing a fundamental gap in traffic forecasting research—evolving sensor networks. This has broader impact by enabling more realistic evaluation of spatiotemporal models, challenging existing SOTA methods, and providing a community resource spanning 27 years across multiple districts. Its contributions (new dataset, evaluation protocol, comprehensive benchmarking) have high reuse potential. Paper 2 presents a domain-specific simulation framework for tourist mobility in Tokyo with narrower applicability, less methodological novelty (combining existing techniques like LLMs with GPS priors), and more limited generalizability.
Paper 1 addresses the highly active field of Large Language Models, proposing a novel compositional prompt optimization framework that can be broadly applied across various LLM-based agentic workflows. Its potential for real-world application and cross-domain impact is significantly higher than Paper 2, which introduces a valuable but domain-specific dataset for traffic forecasting.
Paper 1 bridges a critical gap between generative AI and real-world manufacturing by benchmarking Text-to-CAD models on functionality and assemblability. This has broad, transformative potential across industrial design, mechanical engineering, and AI. Paper 2 presents a valuable and realistic dataset for traffic forecasting, but its impact is relatively confined to the niche of spatio-temporal modeling and urban planning.
Paper 2 introduces a novel, large-scale benchmark dataset (XXLTraffic/EvoXXLTraffic) addressing a fundamental gap in traffic forecasting research—the unrealistic assumption of fixed sensor sets. Spanning up to 27 years of real-world data with a well-defined streaming evaluation protocol, it enables more realistic research in spatio-temporal forecasting, continual learning, and evolving graph methods. Its broad applicability across multiple ML subfields and potential to reshape benchmarking standards gives it higher scientific impact. Paper 1, while practically relevant, is more of an engineering architecture for AI agent governance with a narrower, industry-focused contribution and less methodological novelty.
Paper 1 introduces a large-scale, multi-decade benchmark (XXLTraffic/EvoXXLTraffic) addressing a fundamental gap in traffic forecasting research: sensor network evolution over time. Benchmarks that expose limitations of SOTA methods tend to have broad, lasting impact by redirecting an entire research community. The dataset spans 27 years across multiple districts, enabling new research directions in continual learning, evolving graphs, and realistic traffic forecasting. Paper 2, while methodologically interesting in applying AlphaZero-style planning to transit design, is evaluated on a single city benchmark and represents a more incremental application of existing techniques (MCTS + neural networks) to a specific problem.