Hongbo Wang
Scale buys interpolation; structure buys a certified horizon. A world model's average error says nothing about whether a particular prediction can be trusted, or for how long. For equivariant latent world models we give a computable, multi-step certificate of the predictable horizon: T-step rollout error is provably constant over each symmetry orbit (Theorem A) and stratified channel-by-channel by the predictor's Lyapunov spectrum, T_j(ε) ~ log(1/ε)/λ_j. The horizon is two-sided -- a matching lower bound makes approximate equivariance provably horizon-limited -- and the certificate is exclusive to structure: orbit-constant error characterizes equivariance, so no non-equivariant model has it at any scale. Empirically, on 40-D Lorenz-96 only the equivariant network recovers the full Lyapunov spectrum; dense and recurrent baselines fail. Because the spectrum is faithful, the certificate acts, a priori: under a fixed sensing budget an inflated certificate provably needs proportionally more budget, and the equivariant certificate meets a budget its inflated dense counterpart cannot -- with zero calibration data. The same read-out, unchanged, audits public pretrained world models training-free: TD-MPC2 checkpoints land on the certificate's own scope taxonomy -- calibrated where strongly expansive (ratio 0.94-1.02), optimistic where weakly expansive, correctly abstaining where contracting -- a map a deployed monitor replicates cell-by-cell, out-of-sample. Across the official 1M-317M multitask ladder, calibration does not improve with parameters. On V-JEPA 2-AC (1B, real robot data) the measured cross-check correctly overrides an over-promising tangent spectrum -- the cross-validated audit, not the raw number, is the deployable object. Scale buys interpolation, not a calibrated horizon.
This paper establishes a theoretical framework for certifying the predictable horizon of equivariant latent world models. The central claim is that while scaling neural networks improves average interpolation error, only structural inductive biases (equivariance) can provide *certified, per-situation, multi-step* guarantees on prediction trustworthiness. The key results are:
1. Theorem A: T-step rollout error is provably constant across symmetry orbits for equivariant models.
2. Theorem B + Proposition 6: A two-sided bound showing the certified horizon scales as T_j(ε) ~ log(1/ε)/λ_j, with a matching lower bound proving this is tight — approximate equivariance is provably horizon-limited.
3. Lemma 2: Orbit-constant error *characterizes* equivariance — no non-equivariant model can possess this certificate regardless of scale.
4. Noether hinge: Conserved/invariant channels enjoy unbounded horizons (linear rather than exponential error growth).
The paper also demonstrates training-free auditing of pretrained world models (TD-MPC2, V-JEPA 2-AC) using the same certificate machinery.
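The horizon scaling in item 2 can be made concrete with a short sketch. The function below evaluates only the leading-order law T_j(ε) ~ log(1/ε)/λ_j per channel; the function name, the example spectrum, and the choice to report non-expanding channels as infinite are assumptions made for illustration, and the paper's constants and prefactors are not reproduced.

```python
import math


def certified_horizons(lyapunov_spectrum, eps):
    """Leading-order horizon log(1/eps) / lambda_j for each channel.

    Channels with lambda_j <= 0 have no finite horizon under this scaling
    law and are reported as math.inf (an illustrative convention only).
    """
    if not 0.0 < eps < 1.0:
        raise ValueError("eps must lie strictly between 0 and 1")
    log_inv_eps = math.log(1.0 / eps)
    horizons = []
    for lam in lyapunov_spectrum:
        if lam > 0.0:
            horizons.append(log_inv_eps / lam)
        else:
            horizons.append(math.inf)
    return horizons


# Hypothetical spectrum: two expanding channels, one neutral, one contracting.
example = certified_horizons([1.0, 0.5, 0.0, -0.3], eps=math.exp(-2.0))
```

Halving an exponent doubles the horizon for that channel, which is the channel-by-channel stratification the summary refers to.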
The theoretical development is thorough and carefully constructed. The proofs are complete (Appendix B), building cleanly from assumptions (A1)-(A5) through composition closure to the main results. The two-sided nature of the horizon bound (Theorem B upper, Proposition 6 lower) is particularly strong — it moves beyond typical one-sided guarantees.
The scope conditions are admirably honest. Proposition 7 explicitly characterizes when the certificate is informative (λ₁ > 0) versus vacuous (λ₁ ≈ 0), and this is empirically validated (PushT interior as the predicted degenerate case, R² ≈ 0.02). The paper consistently labels inconclusive results as such (E11's Acrobot).
However, several caveats deserve attention. The C¹-closeness verification relies only on one-step L² error — acknowledged as a gap, especially in high dimensions. The dominated splitting hypothesis for Proposition 8 is assumed, not certified. The Noether hinge validation uses constructed equivariant teachers rather than naturally learned models. The prefactor distribution on real chaotic loops (median κ₁ ≈ 20, heavy tail) means the worst-case certificate can be substantially more conservative than the typical-case calibration ratios (0.83–1.02) suggest.
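The effect of the prefactor can be worked through under a generic bound. The form below, with initial error δ and tolerance ε, is an assumption chosen for illustration and is not quoted from the paper; it holds for λ₁ > 0.

```latex
\begin{align*}
\kappa_1\,\delta\,e^{\lambda_1 T} \le \varepsilon
\;&\Longleftrightarrow\;
T \le \frac{1}{\lambda_1}\left[\log\frac{\varepsilon}{\delta} - \log\kappa_1\right],\\
\kappa_1 = 20 \;&\Longrightarrow\; \log\kappa_1 \approx 3.0 .
\end{align*}
```

Under this assumed form, the median prefactor κ₁ ≈ 20 shortens the certified horizon by about 3/λ₁ time units relative to κ₁ = 1, and a heavy tail in κ₁ shortens it further.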
Immediate applications: The framework has clear utility for (1) scheduling re-observations in sensing-constrained settings (E12), (2) training-free auditing of pretrained world models (E13-E16), and (3) deployed monitoring with certificate-derived cadences (E15). The sensing-budget result (Proposition 9) is practically actionable: inflated certificates provably waste budget proportionally.
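The budget argument can be sketched numerically. The example assumes a simple cadence rule (re-observe once per certified horizon) and models inflation by a factor c as shrinking the horizon by c; both are illustrative assumptions, as are all names and numbers, and none of this is the paper's Proposition 9 itself.

```python
import math


def reobservations_needed(mission_length, horizon):
    """Re-observations required when observing once per certified horizon."""
    if horizon <= 0.0:
        raise ValueError("horizon must be positive")
    if mission_length < 0.0:
        raise ValueError("mission_length must be non-negative")
    return math.ceil(mission_length / horizon)


def meets_budget(mission_length, horizon, budget):
    """True if the cadence implied by the horizon fits within the budget."""
    return reobservations_needed(mission_length, horizon) <= budget


# Hypothetical numbers, chosen so the arithmetic is exact.
mission_length = 100.0
faithful_horizon = 4.0
inflation = 2.0  # assumed inflation factor c
inflated_horizon = faithful_horizon / inflation
budget = 30

faithful_cost = reobservations_needed(mission_length, faithful_horizon)
inflated_cost = reobservations_needed(mission_length, inflated_horizon)
```

In this toy setting the faithful certificate needs 25 re-observations and fits the budget of 30, while the inflated one needs 50 and does not.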
Broader influence: The "scale buys interpolation, structure buys a horizon" thesis, if validated at larger scales, could redirect architectural choices in world model design. The training-free audit capability for pretrained models (TD-MPC2, V-JEPA 2-AC) is particularly timely as foundation world models proliferate. The finding that calibration does not improve with parameters across the 1M-317M multitask ladder (E14) is a striking negative result about scaling.
Adjacent fields: The connection between Lyapunov spectra and certified horizons could influence numerical weather prediction, data assimilation, and dynamical systems more broadly. The epistemic exploration drive (E8) suggests applications in active learning.
The paper addresses a genuine gap: world models are increasingly deployed for planning (TD-MPC2, Dreamer variants, JEPA-based systems), yet there exists no principled way to certify *how far ahead* a specific prediction can be trusted. Existing equivariance certificates are single-shot; existing horizon analyses lack per-situation guarantees. The intersection was empty, and this paper fills it.
The training-free audit of foundation-scale models (V-JEPA 2-AC at 1B) is particularly timely. The finding that the cross-validated audit, not the raw spectrum, is the deployable object at this scale (the tangent spectrum over-promises T₁ ≈ 9 when actual divergence is bias-dominated) provides important practical guidance.
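For readers unfamiliar with how a tangent spectrum is obtained, the sketch below uses the standard QR re-orthonormalisation method on a sequence of one-step Jacobians. It is a generic textbook procedure, not the paper's read-out; the function name and the constant-Jacobian example are assumptions for illustration.

```python
import numpy as np


def lyapunov_spectrum(jacobians):
    """Estimate Lyapunov exponents (per step) from one-step Jacobians.

    Each Jacobian is applied to the current orthonormal frame, the result is
    re-orthonormalised by QR, and the log of |diag(R)| is accumulated.
    """
    jacobians = list(jacobians)
    if not jacobians:
        raise ValueError("need at least one Jacobian")
    n = jacobians[0].shape[0]
    frame = np.eye(n)
    log_sums = np.zeros(n)
    for jac in jacobians:
        frame, r = np.linalg.qr(jac @ frame)
        log_sums += np.log(np.abs(np.diag(r)))
    return log_sums / len(jacobians)


# Constant upper-triangular Jacobian: exponents are log 2 and log 0.5.
toy_jacobian = np.array([[2.0, 1.0], [0.0, 0.5]])
toy_spectrum = lyapunov_spectrum([toy_jacobian] * 50)
```

A spectrum computed this way describes tangent-space growth only, which is why a measured cross-check is needed when the actual divergence is dominated by model bias rather than by expansion.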
Strengths:
1. Tight, two-sided bounds: The matching lower bound (Proposition 6) elevates the certificate from an upper estimate to a characterization of the achievable horizon.
2. Extraordinary honesty: INCONCLUSIVE results are reported as such; every limitation is explicitly stated; prefactor distributions and heavy tails are disclosed rather than hidden.
3. Theory-experiment alignment: Each theoretical prediction has a corresponding empirical validation, including the predicted failure modes (degenerate branch, bias-dominated regime).
4. The structure-exclusive argument: Lemma 2's characterization — that orbit-constant error is equivalent to equivariance — makes the impossibility result for non-equivariant models a theorem rather than an empirical observation.
5. Practical applicability: The training-free audit of public checkpoints demonstrates the certificate works on models the authors did not train.
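The orbit-constant property behind items 1 and 4 can be illustrated on a toy system. The sketch assumes the cyclic-shift group acting on vectors, a shift-equivariant teacher built from a circulant matrix, and two student models (one circulant, one dense); all names, sizes, and noise levels are hypothetical. It shows one direction only: an equivariant model has constant T-step error across an orbit, while this particular dense model does not.

```python
import numpy as np


def circulant(kernel):
    """Circulant matrix C[i, j] = kernel[(i - j) mod n]; commutes with np.roll."""
    n = len(kernel)
    idx = (np.arange(n)[:, None] - np.arange(n)[None, :]) % n
    return kernel[idx]


def rollout(weights, x, steps):
    """Iterate the map x -> tanh(weights @ x) for a fixed number of steps."""
    for _ in range(steps):
        x = np.tanh(weights @ x)
    return x


def orbit_errors(model_w, teacher_w, x0, steps):
    """T-step rollout error at every cyclic shift of the initial state."""
    errors = []
    for shift in range(len(x0)):
        start = np.roll(x0, shift)
        diff = rollout(model_w, start, steps) - rollout(teacher_w, start, steps)
        errors.append(np.linalg.norm(diff))
    return np.array(errors)


rng = np.random.default_rng(0)
n, steps = 8, 3
kernel = rng.normal(size=n)
teacher = circulant(kernel)
equivariant_model = circulant(kernel + 0.05 * rng.normal(size=n))
dense_model = teacher + 0.05 * rng.normal(size=(n, n))
x0 = rng.normal(size=n)

equivariant_spread = np.ptp(orbit_errors(equivariant_model, teacher, x0, steps))
dense_spread = np.ptp(orbit_errors(dense_model, teacher, x0, steps))
```

The equivariant spread is zero up to floating-point rounding, because shifting the input shifts both rollouts identically and the norm ignores the shift. The dense model has no such guarantee, so its error depends on where in the orbit the rollout starts.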
Weaknesses:
1. Scale: All experiments run on 1-2 GPUs. The 40-D Lorenz-96 is the highest-dimensional chaotic system tested; the gap between this and real-world complexity is vast.
2. Assumption (A3): Requiring the group to be a genuine dynamical symmetry is restrictive. Many real systems have only approximate or partial symmetries.
3. Downstream value dilution: E11 and step93 show that the certificate's resolution may not align with task-relevant tolerances, limiting planning applications. The paper proves this boundary (Proposition 11) but cannot cross it.
4. Single author: While the paper is remarkably thorough, the breadth of claims would benefit from independent verification.
5. Pixel-domain limitations: The absolute multi-step accuracy on raw pixels is poor for all architectures at this scale — the certificate certifies something that isn't yet practically useful there.
6. Writing: The paper is extremely compressed, with a rhetorical style that can obscure rather than clarify. The sheer density of results, propositions, and experimental references makes extraction of core insights challenging.
This is a theoretically ambitious and carefully executed paper that establishes genuine new results at the intersection of equivariance and predictability horizons. The two-sided bounds, the exclusivity characterization, and the training-free audit capability are significant contributions. The honest gating and explicit scope limitations set a high standard. The main concerns are the small experimental scale, the restrictive dynamical symmetry assumption, and the dense presentation that may limit accessibility and adoption.
Generated Jun 12, 2026
PolyFlow addresses a practical, well-defined problem (constrained generation in safety-critical systems) with a clean, implementable solution that guarantees zero constraint violation while maintaining efficiency. It has immediate real-world applicability in planning and control tasks, clear methodological contributions (projection-free architecture, discrete-time flow formulation), and released code. Paper 2, while theoretically interesting in certifying prediction horizons for equivariant world models, is more niche, harder to parse, and its practical impact is narrower. PolyFlow's combination of safety guarantees, computational efficiency, and broad applicability gives it higher potential impact.
Paper 1 bridges rigorous theoretical guarantees (Lyapunov-based certificates) with large-scale empirical auditing of foundation world models (e.g., the 1B-parameter V-JEPA 2-AC). Providing certified predictability addresses a crucial safety and reliability bottleneck in AI. Its argument that structural priors (equivariance) are necessary for reliable forecasting, and that scale alone is insufficient, could amount to a significant shift in how world models are designed. While Paper 2 offers valuable geometric insights into diffusion models, Paper 1's combination of deep theory and immediate practical applicability to safe AI deployment gives it a broader and more transformative potential scientific impact.
Paper 2 introduces a fundamentally new theoretical framework connecting equivariance, Lyapunov spectra, and certified prediction horizons for world models, with broad implications across dynamical systems, robotics, and AI safety. Its provable guarantees (orbit-constant error, two-sided horizon bounds) and training-free auditing of pretrained models represent deeper conceptual contributions. Paper 1, while practically useful, offers incremental improvements (6.3% speedup) to speculative decoding for diffusion models. Paper 2's cross-disciplinary relevance (control theory, symmetry, trustworthy AI) and novel certification methodology suggest broader and more lasting scientific impact.
Paper 2 addresses the highly active area of latent reasoning in LLMs, combining practical RL training improvements with mechanistic interpretability. The SWITCH framework offers immediate applicability to the rapidly growing field of reasoning models, with clear engineering contributions (boundary tokens enabling standard on-policy RL) and scientific insights (mechanistic analysis of hidden-state recurrence). Paper 1, while theoretically rigorous in providing certified prediction horizons for equivariant world models, addresses a narrower niche. Paper 2's broader relevance to the LLM reasoning community, timeliness, and dual practical/interpretability contributions give it higher estimated impact.
Paper 1 introduces a fundamentally novel theoretical framework connecting equivariance, Lyapunov spectra, and certified prediction horizons for world models, with broad implications across machine learning, dynamical systems, and AI safety/deployment. Its theoretical contributions (provable certificates, two-sided bounds, characterization of equivariance) are deep and general, applying to pretrained models training-free. Paper 2 presents a solid applied contribution in ecological bioacoustics but is more incremental—combining known techniques (semi-supervised learning, knowledge distillation, active learning) for a specific domain with modest absolute performance. Paper 1's breadth of impact and theoretical novelty give it substantially higher potential impact.
Paper 1 introduces a unifying theoretical framework (SIM) for interpretable ML grounded in Lagrangian mechanics, addressing a fundamental gap in a critically important and rapidly growing field. Its breadth of impact is enormous—it spans traditional, concept-based, and mechanistic interpretability, offers pedagogical value, and provides a deductive methodology applicable across the entire interpretability discipline. Paper 2, while technically rigorous with novel certified horizon guarantees for equivariant world models, addresses a narrower problem (multi-step prediction trust in equivariant models). Paper 1's potential to restructure and unify a fragmented field gives it broader and deeper long-term scientific impact.
Paper 1 offers higher scientific impact by providing a rigorous, theoretically grounded certificate for the predictable horizon of world models. Bridging dynamical systems with deep learning, it solves a critical problem in AI safety: knowing when to trust a model's future predictions. Its ability to audit massive real-world models without retraining demonstrates profound practical utility for reliable autonomous systems. In contrast, Paper 2 provides valuable but narrower empirical insights into the parameter dynamics of on-policy distillation, which has a more limited theoretical and cross-disciplinary scope.
Paper 1 addresses a broadly impactful problem—making LLM post-training more transparent and controllable using interpretability—relevant to the massive and growing LLM alignment community. It offers a practical, actionable pipeline (data auditing, reward shaping) with immediate real-world applications for mitigating sycophancy and over-stylization. Paper 2, while theoretically rigorous with novel certified horizon guarantees for equivariant world models, targets a narrower audience (equivariant dynamics/world models). Paper 1's breadth of impact across AI safety, alignment, and practical ML deployment, combined with its timeliness given current LLM scaling trends, gives it higher estimated impact.
Paper 2 has higher potential scientific impact due to broader applicability and clearer downstream utility: recovering governing equations from noisy high-dimensional data directly serves scientific discovery across neuroscience, physics, and biology. Its multi-view contrastive setup targets a common real-world measurement regime and includes identifiability guarantees plus symbolic equation recovery, which is a strong, field-crossing contribution. Paper 1 is novel and rigorous in certifying predictability horizons for equivariant world models, but its impact is narrower (equivariance-dependent, auditing/prediction reliability in ML world models) and more specialized to symmetry-structured settings.
Paper 1 addresses a fundamental challenge in AI safety and foundation models by providing a computable predictability certificate for world models. Its theoretical guarantees, combined with successful empirical audits of large-scale public models, promise broad and immediate impact across AI and robotics. Paper 2, while mathematically rigorous, is highly specialized to geometric deep learning and physics-informed models, offering a narrower scope of application and impact.