When Do Local Score Models Extrapolate Across Size? A Diagnostic Theory and Benchmark

Wenjie Xi

Jun 8, 2026 · arXiv:2606.09705v1

cs.LG · cond-mat.stat-mech

#524 of 5669 · cs.LG

Tournament Score: 1502 ± 45

Win Rate: 72%

Rating: 7.2 / 10

Significance 7.5 · Rigor 7.5 · Novelty 7.5 · Clarity 8

Abstract

Scientific generative modeling often requires size transfer, where models trained on small systems are evaluated on larger ones. While translation-invariant architectures enable this evaluation, we show that architectural locality alone does not guarantee stable size extrapolation. Instead, stable extrapolation is governed by the quasi-locality of the Gaussian-smoothed score. Through Tweedie's formula, far-away perturbations can influence local score components via posterior covariance, meaning a local model succeeds only if its receptive field covers the smoothed score's response range. We formalize this mechanism, proving a size-uniform comparison theorem for local marginals under reverse diffusion. We also introduce Finite-Depth Local Flow (FDLF), a white-box diagnostic benchmark with exact scores, densities, and controllable response ranges. Empirically, we validate the interplay between spatial mixing, smoothed-score quasi-locality, and model receptive fields. Under spatial mixing, the smoothed score remains quasi-local relative to the receptive field, enabling stable extrapolation. Conversely, when spatial mixing weakens, the score's locality rapidly degrades, causing size transfer to fail.

AI Impact Assessments

(1 model)

Scientific Impact Assessment

Core Contribution

This paper addresses a fundamental question in scientific generative modeling: under what conditions can score-based diffusion models with local (finite receptive field) architectures, trained on small systems, reliably generate samples from larger systems? The central insight is that architectural locality (e.g., translation-invariant CNNs) is necessary but insufficient for stable size extrapolation. Instead, the governing factor is the quasi-locality of the Gaussian-smoothed score, which is determined by the posterior covariance structure under Tweedie's formula rather than the clean interaction graph.

The key equation linking smoothed-score response to posterior correlations (Eq. 9) shows that, for sites i ≠ j, the score Jacobian is ∂s_{L,σ}(x)_i/∂x_j = σ^{-4} Cov_{π^x}(X_{0,i}, X_{0,j}); it serves as the conceptual hinge of the paper. This elegantly explains why short-range clean interactions (like nearest-neighbor Ising) can produce long-range smoothed-score dependencies near critical points, where the posterior correlation length diverges.
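The Tweedie-covariance identity can be checked numerically on a two-site Gaussian toy model, sketched below. The setup is an assumption made here for illustration and is not the paper's benchmark: X_0 ~ N(0, Σ) is observed as x = X_0 + σε with ε ~ N(0, I), so both the smoothed score and the posterior covariance have closed forms. Differentiating Tweedie's formula s(x) = (E[X_0 | x] − x)/σ² in this convention gives σ^{-4} Cov(X_{0,i}, X_{0,j}) − σ^{-2} δ_ij, so the expression quoted above is best read as the off-diagonal (i ≠ j) statement, which is the part that carries the locality information. The helper names (`inv2`, `add_diag`) and the values of Σ and σ are hypothetical.

```python
# Sketch: check the Tweedie-covariance identity on a 2-site Gaussian toy model.
# Assumptions (not from the paper): X0 ~ N(0, Sigma), x = X0 + sigma * noise.

def inv2(m):
    """Inverse of a 2x2 matrix given as nested lists."""
    (a, b), (c, d) = m
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]

def add_diag(m, value):
    """Return m + value * I for a 2x2 matrix."""
    return [[m[i][j] + (value if i == j else 0.0) for j in range(2)]
            for i in range(2)]

sigma = 0.7                          # hypothetical noise level
Sigma = [[1.0, 0.6], [0.6, 1.5]]     # hypothetical clean covariance

# The smoothed density is N(0, Sigma + sigma^2 I), so the score is
# s(x) = -(Sigma + sigma^2 I)^{-1} x and its Jacobian is this constant matrix.
score_jacobian = [[-v for v in row] for row in inv2(add_diag(Sigma, sigma**2))]

# The posterior of X0 given x is Gaussian with covariance
# (Sigma^{-1} + sigma^{-2} I)^{-1}, independent of x.
posterior_cov = inv2(add_diag(inv2(Sigma), sigma**-2))

# Right-hand side of the identity: sigma^{-4} Cov - sigma^{-2} delta_ij.
rhs = [[posterior_cov[i][j] / sigma**4 - (1.0 / sigma**2 if i == j else 0.0)
        for j in range(2)] for i in range(2)]

max_gap = max(abs(score_jacobian[i][j] - rhs[i][j])
              for i in range(2) for j in range(2))
print("largest entrywise gap:", max_gap)
```

The printed gap should sit at floating-point round-off. In this toy case the off-diagonal Jacobian entry equals the off-diagonal posterior covariance rescaled by σ^{-4}, which is the mechanism the review describes.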

The paper makes three concrete contributions: (1) formalizing the response-tail diagnostic, (2) proving a size-uniform local-marginal comparison theorem (Theorem 2), and (3) introducing the FDLF benchmark with exact scores and controllable response ranges.
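One plausible way to turn the response-tail idea into a number is sketched below. The review does not reproduce the paper's exact definition, so everything here is an assumption: the tail is taken as the summed response magnitude beyond a radius, the response range as the smallest radius whose tail falls below a relative tolerance, and the two exponential profiles stand in for a well-mixed and a weakly-mixed system. The function names and the correlation lengths are hypothetical.

```python
import math

def response_tail(profile, radius):
    """Summed |response| at lattice distances strictly greater than `radius`.

    profile[d] is the response magnitude at distance d (for example, the
    magnitude of a score-Jacobian entry between sites a distance d apart).
    """
    return sum(abs(v) for v in profile[radius + 1:])

def response_range(profile, tol):
    """Smallest radius whose tail is at most `tol` times the total response."""
    if not profile:
        raise ValueError("profile must contain at least one entry")
    total = sum(abs(v) for v in profile)
    for radius in range(len(profile)):
        if response_tail(profile, radius) <= tol * total:
            return radius
    return len(profile) - 1  # unreachable for tol >= 0; kept for clarity

# Hypothetical profiles: exponential decay with short and long length scales.
distances = range(64)
short_profile = [math.exp(-d / 2.0) for d in distances]
long_profile = [math.exp(-d / 20.0) for d in distances]

short_range = response_range(short_profile, tol=0.01)
long_range = response_range(long_profile, tol=0.01)
print("short-range profile:", short_range)
print("long-range profile:", long_range)
```

Under these assumptions the slowly decaying profile needs a far larger radius to capture the same fraction of the response, which is the quantity a local model's receptive field has to cover.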

Methodological Rigor

Theoretical framework: The size-uniform comparison theorem (Theorem 2) is carefully constructed. The proof strategy — using backward Kolmogorov equations, Itô's formula along the learned process, and a dual-generator comparison — is mathematically sound. The key assumptions (dynamic quasi-locality, initial consistency, weighted on-rollout drift error) are physically motivated and clearly stated. The theorem is deliberately local, controlling bounded-Lipschitz distance on fixed patches uniformly in system size, which is the appropriate metric for thermodynamic-limit questions.

However, the theorem is *conditional*: it assumes dynamic quasi-locality (Assumption 4) of the exact reverse process, which is itself a strong condition whose verification for specific systems is non-trivial. The paper acknowledges this honestly — it is a comparison theorem rather than an unconditional guarantee.

Benchmark design (FDLF): The benchmark is well-motivated. Using normalizing-flow-based teachers with exact densities and scores addresses a genuine gap: existing benchmarks cannot isolate size-extrapolation mechanisms because exact scores are unavailable. The three-way diagnostic (positive control, receptive-field-limited, controlled failure) is a clean experimental design.

Experiments: The 2D continuous experiments (Figures 1-2) convincingly demonstrate the predicted three-regime behavior. The receptive-field sweep showing medium-range teachers improving with larger CNN radius while long-range stress teachers do not is particularly compelling. The Ising critical-point stress test (Figure 4) provides a physically grounded demonstration of the failure mechanism. However, the experimental scale is relatively modest (up to L=64 for 2D), and the paper uses only simple CNN architectures without testing more modern or practical architectures (transformers, message-passing networks with adaptive radii).

Potential Impact

Scientific generative modeling: The paper provides actionable diagnostic criteria for practitioners building local generative models for molecular dynamics, materials science, or lattice field theory. The message — check whether your model's receptive field covers the smoothed-score response range, not just the clean interaction range — is both practical and non-obvious.

Architecture design guidance: The connection between posterior correlation length and required receptive field provides principled guidance for choosing model depth/width in size-transfer applications, rather than relying on trial-and-error.
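The check recommended above can be sketched for a plain convolutional stack. The sketch assumes stride-1 convolutions with odd kernel sizes, for which each layer extends the receptive-field radius by dilation × (kernel_size − 1) / 2. The function names, the example layer lists, and the response range of 9 lattice sites are hypothetical and are not taken from the paper.

```python
def receptive_radius(layers):
    """Receptive-field radius of stacked stride-1 convolutions.

    `layers` is a sequence of (kernel_size, dilation) pairs. Odd kernel
    sizes are assumed so that the field is symmetric around each site.
    """
    radius = 0
    for kernel_size, dilation in layers:
        if kernel_size < 1 or kernel_size % 2 == 0:
            raise ValueError("kernel_size must be a positive odd integer")
        if dilation < 1:
            raise ValueError("dilation must be a positive integer")
        radius += dilation * (kernel_size - 1) // 2
    return radius

def covers(layers, response_range):
    """True if the receptive field reaches at least `response_range` sites."""
    return receptive_radius(layers) >= response_range

# Hypothetical architectures compared against a hypothetical response range.
shallow = [(3, 1)] * 4            # four 3-wide layers, radius 4
dilated = [(3, 1), (3, 2), (3, 4), (3, 8)]  # dilated stack, radius 15
target_range = 9

print("shallow radius:", receptive_radius(shallow), covers(shallow, target_range))
print("dilated radius:", receptive_radius(dilated), covers(dilated, target_range))
```

A check of this kind only compares lengths. It says nothing about whether the network actually learns the response inside its receptive field, so it is a necessary condition rather than a guarantee.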

Benchmark utility: FDLF fills a diagnostic gap. While not meant for realism, it offers a controlled testbed that any proposed size-transfer method should pass. This could become a standard sanity check.

Limitations on broader impact: The paper primarily addresses lattice systems with translation invariance. Extension to irregular geometries (molecules, proteins) is not straightforward. The connection to practical scientific applications (where one might use GNNs on molecular graphs) remains indirect.

Timeliness & Relevance

This paper is highly timely. Size transfer is a critical bottleneck in applying diffusion models to scientific problems — training on small molecules or lattices and deploying on larger ones. The rapid adoption of diffusion models in molecular generation (GeoDiff, torsional diffusion) and materials science makes this theoretical grounding valuable. The paper connects to active areas: score-based convergence theory, spatial mixing in statistical mechanics, and neural operator generalization.

Strengths

1. Clean conceptual contribution: The Tweedie-covariance link (Eq. 9) is simple, powerful, and likely to become a standard reference point for understanding locality in diffusion models.

2. Falsifiable predictions: The paper makes specific, testable predictions about when size transfer should succeed or fail, and then tests them systematically.

3. Principled benchmark: FDLF addresses a genuine methodological gap with exact ground truth.

4. Physical insight: The Ising critical-point example beautifully illustrates how phase transitions in the posterior, not the prior, determine extrapolation difficulty.

5. Honest framing: The paper is careful about what it claims and what it doesn't (conditional theorem, diagnostic rather than realism benchmark).

Limitations

1. Architectural scope: Only pure CNNs are tested. Modern scientific generative models use graph neural networks, transformers, or hybrid architectures. The paper mentions sliding-attention results in passing but doesn't present them.

2. Scale of experiments: Lattice sizes up to 64×64 and training for 1500 steps are modest. Practical scientific applications involve much larger systems and more complex distributions.

3. Gap between theory and practice: The comparison theorem requires verifying dynamic quasi-locality, which may be as difficult as the original problem for realistic systems. The theorem's utility is primarily conceptual rather than directly applicable.

4. Single-author limitations: The benchmark, while well-designed, would benefit from community validation and extension to more diverse settings.

5. No guidance on fixing failures: When size transfer fails (long-range response), the paper diagnoses the problem but doesn't propose solutions (e.g., multi-scale architectures, hierarchical approaches).

6. Discrete variable handling: The paper explicitly avoids learning scores for discrete variables, limiting applicability to mixed systems common in scientific modeling.

Overall Assessment

This is a theoretically clean and experimentally well-designed paper that identifies and formalizes an important mechanism governing size extrapolation in local score-based models. The central insight connecting smoothed-score locality to posterior covariance is both novel and practically relevant. The FDLF benchmark is a useful methodological contribution. The main limitations are the narrow architectural scope of experiments and the gap between the conditional theoretical guarantee and practical verification. The paper should influence how the community thinks about and evaluates size transfer in scientific generative modeling.

Rating: 7.2 / 10

Significance 7.5 · Rigor 7.5 · Novelty 7.5 · Clarity 8

Generated Jun 9, 2026

Comparison History (18)

Won vs. Unifying Local Communications and Local Updates for LLM Pretraining

Paper 2 is likely higher impact: it introduces a broadly applicable diagnostic theory for size extrapolation in score-based generative modeling, with formal results (size-uniform comparison theorem) and a controllable benchmark (FDLF) enabling reproducible evaluation across domains (physics, chemistry, scientific ML). Its core concept—quasi-locality of the Gaussian-smoothed score—clarifies when locality-based architectures generalize, addressing a timely, widely encountered failure mode. Paper 1 is practically valuable for distributed LLM training, but its impact is more specialized to systems/optimization for pretraining and may be superseded by engineering advances.

gpt-5.2 · Jun 10, 2026

Lost vs. A Unifying Lens on Supervised Fine-Tuning Through Target Distribution Design

Paper 2 likely has higher impact: it reframes a widely used and timely practice (supervised fine-tuning of large language models) with a unifying theoretical lens (target distribution design) and proposes an actionable method (Target-SFT) with broad applicability across models and tasks. Its relevance to current ML deployment and alignment pipelines, plus potential to influence many downstream training recipes, suggests wide cross-field adoption. Paper 1 is rigorous and novel but more specialized to score-based generative modeling and size extrapolation, limiting breadth of immediate real-world uptake.

gpt-5.2 · Jun 10, 2026

Won vs. Internalizing Geometric Law: Learning from Solver Residuals for Precision-Critical Generation

Paper 1 addresses a fundamental theoretical question about when generative models can extrapolate across system sizes, providing formal guarantees (size-uniform comparison theorem) and a diagnostic benchmark with exact solutions. This has broad implications across scientific generative modeling (molecular dynamics, materials science, etc.). Paper 2 solves a more specific applied problem (geometric constraint satisfaction in LLMs) with a practical but narrower contribution. Paper 1's theoretical framework for understanding locality, spatial mixing, and score quasi-locality provides deeper foundational insights that could influence multiple research directions in score-based generative modeling.

claude-opus-4-6 · Jun 9, 2026

Won vs. Your GFlowNet Secretly Learns an Optimal Transport Plan

Paper 2 addresses a critical bottleneck in scientific generative modeling: size extrapolation. By providing a theoretical framework explaining when local score models can generalize to larger systems, alongside a diagnostic benchmark, it offers actionable insights for a widespread problem in applied ML (e.g., modeling large molecules or materials). Paper 1 offers an elegant theoretical link between GFlowNets and Optimal Transport, but Paper 2's direct relevance to practical, large-scale scientific applications gives it broader potential impact across disciplines.

gemini-3.1-pro-preview · Jun 9, 2026

Won vs. BUDDY: BUdget-Driven DYnamic Depth Routing for Adaptive Large Language Model Inference

Paper 1 offers a foundational theoretical framework and diagnostic benchmark for a critical bottleneck in scientific AI (size extrapolation in generative models). While Paper 2 provides a valuable system-level optimization for LLM inference, Paper 1's rigorous mathematical approach to quasi-locality and spatial mixing offers deeper, longer-lasting scientific insights with profound implications for domains like molecular modeling and physics simulations.

gemini-3.1-pro-preview · Jun 9, 2026

Won vs. Toward Compiler World Models: Learning Latent Dynamics for Efficient Tensor Program Search

Paper 2 has higher potential impact due to its stronger conceptual novelty and breadth: it provides a diagnostic theory (quasi-locality of Gaussian-smoothed scores via Tweedie’s formula), formal guarantees (size-uniform comparison theorem), and a new benchmark (FDLF) with exact controllable ground truth. This combination can influence how size transfer is understood and evaluated across scientific generative modeling domains (physics, chemistry, PDEs), making it timely and broadly relevant. Paper 1 is impactful for compiler/ML systems, but is more incremental and narrower in cross-field reach despite strong empirical gains.

gpt-5.2 · Jun 9, 2026

Lost vs. Perturbative Contrastive Physical Learning

Paper 2 introduces a broad, general framework for physical learning that bridges machine learning and physical hardware (analog computing, photonics). Its ability to enable learning without explicit backpropagation offers significant real-world impact for future AI hardware. Paper 1, while methodologically rigorous and valuable for generative modeling theory, is more narrowly focused on the specific problem of size extrapolation in score-based models.

gemini-3.1-pro-preview · Jun 9, 2026

Won vs. Constrained user-item allocation for e-commerce marketing campaigns

Paper 2 addresses a fundamental theoretical question about when and why diffusion-based generative models can extrapolate across system sizes—a critical challenge in scientific machine learning. It provides rigorous theoretical contributions (size-uniform comparison theorem, formalization of quasi-locality conditions) and a diagnostic benchmark (FDLF) that will benefit researchers across physics, chemistry, and materials science who use generative models. Paper 1 solves a practical but narrower e-commerce optimization problem with relatively standard techniques (biclustering, local search, bandits). Paper 2's broader applicability, theoretical depth, and relevance to the rapidly growing field of scientific generative modeling give it higher impact potential.

claude-opus-4-6 · Jun 9, 2026

Lost vs. Causal Modeling of Selection in Evolution

Paper 2 addresses a fundamental gap in causal discovery by distinguishing static from evolutionary selection—a distinction with broad implications across biology, social science, and machine learning. Its introduction of a new graphical model for evolutionary selection, along with sound and complete identification procedures, provides foundational theoretical contributions applicable to diverse fields (immunology, epidemiology, social science). Paper 1, while technically rigorous with its analysis of score-based diffusion model extrapolation, addresses a more specialized problem within generative modeling. Paper 2's cross-disciplinary reach and conceptual novelty give it higher potential impact.

claude-opus-4-6 · Jun 9, 2026

Won vs. Escaping the KL Agreement Trap in On-Policy Distillation

Paper 2 addresses a fundamental challenge in scientific machine learning—extrapolating from small to large systems—with rigorous theoretical grounding. By formalizing the mechanisms of size transfer and providing a diagnostic benchmark, it offers foundational insights applicable across diverse fields like material science and computational chemistry. Paper 1 offers a valuable but more narrowly focused algorithmic improvement for LLM distillation, making Paper 2's potential breadth and depth of scientific impact significantly higher.

gemini-3.1-pro-preview · Jun 9, 2026
