Kaijie Xu, Anqi Wang, Xilin Dai
Probabilistic forecasting models are increasingly deployed on multivariate systems with distinct channel physics and operational constraints, but existing benchmarks evaluate neither property at scale. Public canonical multivariate benchmarks cap out at 2,000 channels, while power-system benchmarks either lack temporal structure or probabilistic evaluation. We introduce PowerPhase, a probabilistic forecasting benchmark built on six transmission grids ranging from 2,000 to 36,964 jointly forecasted channels, more than an order of magnitude beyond popular canonical multivariate benchmarks. Each target trajectory is the output of an AC power-flow solve, and PowerPhase ships with constraint-aware metrics, including Safety_mBrier, NECV, and CVaR-alpha, that complement CRPS and Distortion. Across eight baselines and three seeds, distributional accuracy and constraint satisfaction rank models differently, a trade-off we term safety-fidelity. We further propose PowerForge, a scenario-based quantile forecaster with type-specific decoding heads and a causal bridge between variable groups, which achieves the best average rank on every grid.
This paper makes two intertwined contributions: (1) PowerPhase, a benchmark for probabilistic multivariate time series forecasting on transmission-scale power grids with up to 36,964 jointly forecasted channels, and (2) PowerForge, a scenario-based quantile forecaster with physics-informed architectural priors designed for this regime. The key conceptual contribution is the identification and formalization of the safety–fidelity trade-off — the observation that distributional accuracy (CRPS) and operational constraint satisfaction (voltage-band compliance) rank models differently. This is operationally meaningful: a model with excellent CRPS but poor voltage violation detection could be genuinely dangerous in grid operations.
The benchmark fills a genuine gap at the intersection of ML forecasting benchmarks (which max out at ~2,000 channels with no physical constraints) and power systems benchmarks (which lack temporal structure or probabilistic evaluation). PowerPhase ships with constraint-aware metrics (Safety_mBrier, NECV, CVaR_α) alongside standard CRPS and Distortion.
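To make the distinction between distributional accuracy and constraint satisfaction concrete, here is a minimal sketch, assuming NumPy and scenario (sample) forecasts for a single voltage channel. `crps_from_samples` is the standard sample-based CRPS estimator. `band_violation_brier` is a plain Brier score on the event "voltage leaves the band"; it is an illustrative stand-in and not the paper's Safety_mBrier definition, which this review does not reproduce. The band limits of 0.95 and 1.05 p.u. and all function names are assumptions.

```python
import numpy as np

def crps_from_samples(samples, y):
    """Sample-based CRPS estimate: E|X - y| - 0.5 * E|X - X'|."""
    samples = np.asarray(samples, dtype=float)
    accuracy = np.mean(np.abs(samples - y))
    spread = 0.5 * np.mean(np.abs(samples[:, None] - samples[None, :]))
    return float(accuracy - spread)

def band_violation_brier(samples, y, v_min=0.95, v_max=1.05):
    """Brier score for the event 'value lies outside [v_min, v_max]'.

    The band limits are assumed defaults, not values taken from the paper.
    """
    samples = np.asarray(samples, dtype=float)
    # Forecast probability of a violation = share of scenarios outside the band.
    p_violation = np.mean((samples < v_min) | (samples > v_max))
    outcome = float(y < v_min or y > v_max)
    return float((p_violation - outcome) ** 2)

# Hypothetical toy case: the realized voltage slightly exceeds the upper limit.
y_true = 1.06
sharp = np.full(4, 1.04)                    # tight forecast, just inside the band
wide = np.array([0.90, 0.90, 1.20, 1.20])   # diffuse forecast, all mass outside

for name, s in [("sharp", sharp), ("wide", wide)]:
    print(name, crps_from_samples(s, y_true), band_violation_brier(s, y_true))
```

In this toy case the sharp forecast has the lower CRPS but the worse violation score, while the wide forecast shows the reverse. This is the kind of rank disagreement the safety-fidelity framing describes.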
Benchmark construction is reasonably well-documented. The data generation pipeline uses real German TSO load/renewable traces from OPSD, five archetypal load profiles, and AC power-flow solves via pandapower on six standard test networks. The process is transparent but synthetic — load profiles are constructed from national aggregates with Gaussian perturbations, fixed power factors, and deterministic profile assignments. This is a significant caveat: real power systems exhibit far more complex spatial correlations, topology changes, contingencies, and measurement noise.
Experimental evaluation is adequate: eight baselines spanning density-based, scenario-based, and statistical families; three seeds; rolling-origin testing with 10 windows per grid. However, 10 test windows per grid are too few to support robust statistical conclusions, and the standard deviations in Table 2 are sometimes large relative to the differences between models.
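One way to check whether a gap between two models survives so few windows is a paired bootstrap over per-window scores. The sketch below assumes NumPy and that per-window CRPS values are available for both models on the same windows. The function name, the resampling settings, and the toy numbers are hypothetical and are not taken from the paper.

```python
import numpy as np

def paired_bootstrap_ci(scores_a, scores_b, n_boot=10_000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean paired difference (A minus B)."""
    diffs = np.asarray(scores_a, dtype=float) - np.asarray(scores_b, dtype=float)
    rng = np.random.default_rng(seed)
    # Resample windows with replacement, keeping each A/B pair together.
    idx = rng.integers(0, diffs.size, size=(n_boot, diffs.size))
    boot_means = diffs[idx].mean(axis=1)
    lo, hi = np.quantile(boot_means, [alpha / 2, 1 - alpha / 2])
    return float(diffs.mean()), float(lo), float(hi)

# Placeholder per-window scores for 10 windows (synthetic, not paper results).
rng = np.random.default_rng(1)
model_a = 0.30 + 0.05 * rng.standard_normal(10)
model_b = 0.28 + 0.05 * rng.standard_normal(10)
mean_diff, lo, hi = paired_bootstrap_ci(model_a, model_b)
print(f"mean diff = {mean_diff:.3f}, 95% CI = [{lo:.3f}, {hi:.3f}]")
```

If the interval contains zero, the 10 windows do not separate the two models. With so few windows the interval itself is only a rough guide.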
PowerForge's design reflects sensible engineering for the domain: the anchor-delta parameterization removes diurnal patterns, type-specific heads match variable supports, the causal P,Q→V,θ bridge encodes known power-flow directionality, and the low-rank global mixer achieves sub-quadratic cross-channel interaction. The ablation study (Table 3) systematically isolates component contributions; anchor-delta is the dominant factor, with CRPS degrading by 84% when it is removed.
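The sub-quadratic claim for the global mixer can be illustrated with a generic low-rank mixing layer. This is a minimal sketch, assuming NumPy; it is not PowerForge's actual layer, whose details this review does not give. Channels are pooled into K tokens, the tokens are mixed, and the result is broadcast back, so cost scales with N·K rather than N². All names, shapes, and the `tanh` nonlinearity are assumptions.

```python
import numpy as np

def low_rank_mix(h, w_pool, w_mix, w_broadcast):
    """Cross-channel mixing through K summary tokens.

    h:           (N, d) per-channel features
    w_pool:      (K, N) pools N channels into K tokens
    w_mix:       (K, K) mixes information between tokens
    w_broadcast: (N, K) sends token information back to every channel
    """
    tokens = w_pool @ h                 # (K, d), costs O(N * K * d)
    tokens = np.tanh(w_mix @ tokens)    # (K, d), costs O(K * K * d)
    return h + w_broadcast @ tokens     # (N, d), residual update of rank <= K

# Hypothetical sizes: 2,000 channels, 64 tokens, 16 features per channel.
n_channels, n_tokens, d_model = 2000, 64, 16
rng = np.random.default_rng(0)
h = rng.standard_normal((n_channels, d_model))
w_pool = rng.standard_normal((n_tokens, n_channels)) / np.sqrt(n_channels)
w_mix = rng.standard_normal((n_tokens, n_tokens)) / np.sqrt(n_tokens)
w_broadcast = rng.standard_normal((n_channels, n_tokens)) / np.sqrt(n_tokens)

out = low_rank_mix(h, w_pool, w_mix, w_broadcast)
print(out.shape)
```

The layer stores two N×K matrices instead of one N×N interaction matrix. At 36,964 channels and K=64 that is roughly 4.7 million weights rather than about 1.4 billion.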
Concerns: The comparison may not be entirely fair. Deep baselines use GluonTS default configurations, while PowerForge is specifically designed and tuned for this benchmark. TACTiS-2 is trained "on its first stage only due to compute constraints," which may disadvantage it. The paper does not report wall-clock training times or memory consumption, which would be critical for assessing practical scalability claims.
For the power systems community: The benchmark could catalyze development of ML forecasting methods that respect operational constraints. The safety-fidelity trade-off framing is practically important and could influence how grid operators evaluate forecasting tools.
For the ML forecasting community: PowerPhase pushes the scale frontier for multivariate probabilistic benchmarks by an order of magnitude. The constraint-aware evaluation protocol is a template for other physically constrained domains (chemical processes, structural engineering, etc.).
Practical limitations: The synthetic nature of PowerPhase reduces its direct operational relevance. Real PMU data with measurement noise, missing data, topology changes, and true constraint violations would be far more compelling. The paper acknowledges this, but the limitation substantially weakens the benchmark's authority as a proxy for real grid operations.
The paper addresses a timely need. Renewable integration is increasing grid stochasticity, making probabilistic forecasting operationally critical, and the ML community's push toward foundation models for time series creates demand for large-scale structured benchmarks. Concurrent work (PFΔ, OPFData, PSML, PowerGraph) shows that this space is filling rapidly, but PowerPhase remains differentiated by combining temporal structure, probabilistic evaluation, and scale.
The paper's framing around "massive-variate" forecasting is somewhat inflated: the channel count is 4× the number of buses, and many channels (especially V at PV buses) have very constrained dynamics because they come out of the power-flow solve. The effective dimensionality is likely much lower than the raw channel count suggests. That the low-rank mixer works well with only K=64 tokens supports this interpretation.
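The effective-dimensionality point could be checked directly on the benchmark data. The sketch below assumes NumPy and a (time steps × channels) array; it counts how many principal components are needed to explain a chosen share of the variance. The 99% threshold, the function name, and the synthetic three-factor data are assumptions for illustration, not measurements from PowerPhase.

```python
import numpy as np

def effective_dim(x, var_threshold=0.99):
    """Number of principal components needed to reach var_threshold of variance."""
    x = np.asarray(x, dtype=float)
    centered = x - x.mean(axis=0, keepdims=True)
    singular_values = np.linalg.svd(centered, compute_uv=False)
    variance = singular_values ** 2
    cumulative = np.cumsum(variance) / variance.sum()
    # First index where the cumulative share reaches the threshold, as a count.
    return int(np.searchsorted(cumulative, var_threshold) + 1)

# Synthetic stand-in: 400 channels driven by only 3 latent factors plus small noise.
rng = np.random.default_rng(0)
latent = rng.standard_normal((1000, 3))
loadings = rng.standard_normal((3, 400))
series = latent @ loadings + 0.01 * rng.standard_normal((1000, 400))

print(series.shape[1], "channels, effective dimension:", effective_dim(series))
```

A result far below the raw channel count on the real grids would support the reviewer's reading. A result close to the channel count would weaken it.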
The qualitative analysis (Figure 3) is informative but rests on hand-picked single-channel examples. A more systematic visualization would strengthen the narrative.
Generated Jun 12, 2026
Paper 2 has higher potential impact: it introduces a uniquely large-scale (up to 36,964 channels) probabilistic forecasting benchmark grounded in AC power-flow physics, plus new constraint-aware evaluation metrics that expose a safety–fidelity trade-off—highly relevant for real-world power-system operations. It also contributes a new model (PowerForge) tailored to heterogeneous variables and constraints. This combination of novel benchmark + metrics + method targets a critical infrastructure domain with broad relevance to time-series ML, uncertainty quantification, and safety-aware decision-making. Paper 1 is valuable but mainly consolidates existing WHAR work.
Paper 2 introduces a fundamental theoretical framework ('loss shift' and 'Bayes quotients') to transfer learning, an area that underpins modern machine learning. While Paper 1 provides an impressive domain-specific benchmark and method for power systems, Paper 2's insights into representation learning and loss functions have the potential for broader impact across all fields applying machine learning, influencing both theoretical understanding and practical model design.
Paper 1 is likely higher impact due to a large-scale, domain-grounded benchmark (up to ~37k channels) with physically meaningful AC power-flow targets and new constraint-aware probabilistic metrics, enabling broad, reproducible evaluation and shifting practice toward safety-critical assessment. Benchmarks and metrics often catalyze sustained follow-on work across forecasting, uncertainty quantification, and power/energy systems. Paper 2 is novel (manifold-guided conditional tabular diffusion with inference-time generalization) and broadly applicable, but its impact hinges more on adoption versus Paper 1’s concrete infrastructure and direct relevance to high-stakes grid operations.
Paper 1 provides a comprehensive survey unifying three critical bottlenecks in LLM training efficiency—data, memory, and compute—under a constraint-centric framework. Given the massive interest in LLM efficiency across academia and industry, this survey has broad applicability and timeliness. Paper 2, while technically solid in introducing a new benchmark and method for power system forecasting, addresses a more niche domain. The survey's potential to shape research directions across the entire LLM training ecosystem gives it substantially broader impact across multiple fields and larger audience reach.
Paper 2 has higher likely impact: it targets a broad, timely bottleneck (efficient inference for large reasoning models) with immediate applicability across many LLM deployments. It contributes both an algorithmic fix (step-aware entropy temperature scaling to recover accuracy under NVFP4) and a systems advance (new small-batch CUDA kernel) with strong reported latency gains, increasing adoption potential. Paper 1 is novel and rigorous but is more domain-specific (power systems forecasting benchmarks/metrics), likely narrowing breadth and immediate cross-field uptake.
Paper 2 likely has higher impact: it delivers a large-scale, realistic benchmark (up to ~37k channels) with physically grounded targets (AC power-flow) and introduces constraint-aware probabilistic metrics, enabling standardized evaluation of safety-critical forecasting at unprecedented scale. This can catalyze broad follow-on work across ML, time-series, energy systems, and risk-aware decision-making. It also proposes a competitive baseline model (PowerForge). Paper 1 is novel and rigorous but narrower (a specific optimizer bias) and its real-world gains appear regime-dependent, limiting immediate cross-field adoption.
Paper 1 introduces a massive-scale benchmark for time series forecasting in power systems, scaling an order of magnitude beyond existing datasets. By addressing critical real-world operational constraints and introducing novel metrics, it opens a significant new avenue for applied ML research. In contrast, Paper 2 presents an incremental self-learning GNN approach to graph clustering with results that are only competitive under specific conditions, leading to a lower potential scientific impact.
Paper 1 introduces a novel large-scale benchmark (PowerPhase) addressing a significant gap in probabilistic forecasting for power systems, with up to 36,964 channels—an order of magnitude beyond existing benchmarks. It identifies the safety-fidelity trade-off concept, proposes constraint-aware metrics, and introduces PowerForge. This has high practical impact for critical infrastructure and energy systems. Paper 2 presents an interesting RL-based framework for model editing but is more exploratory, with moderate results on established tasks (bias mitigation, unlearning) that already have effective specialized methods, limiting its comparative impact.
Paper 2 likely has higher impact due to a substantial new benchmark at unprecedented scale (up to ~37k channels) grounded in AC power-flow physics, plus constraint-aware probabilistic metrics that formalize a broadly relevant safety–fidelity trade-off. Its real-world application domain (power-system operations) is high-stakes and timely, and the proposed model (PowerForge) is evaluated across multiple grids, baselines, and seeds. Paper 1 is valuable for agent evaluation standardization, but its contribution is narrower and more tooling/protocol-centric, with less cross-domain societal impact.
SupraBench introduces the first systematic benchmark for evaluating LLMs in supramolecular chemistry, bridging AI and a critical chemistry subdomain with broad applications in drug delivery, materials science, and catalysis. It provides a curated corpus (SupraPMC), multiple task types, and reveals specific LLM failure modes, establishing a foundation for future research. Paper 2, while valuable for power systems forecasting, addresses a more niche application. SupraBench's cross-disciplinary novelty (AI + chemistry), the growing interest in LLMs for scientific reasoning, and its potential to accelerate supramolecular design give it broader and more timely impact.