Sergei Vorobyov, Eugene Ilyushin
Formal neural network verification -- proving that a network satisfies safety properties for \emph{all} inputs in a specified domain -- is bounded in practice by GPU memory: standard implementations of bound-propagation algorithms (IBP, CROWN, -CROWN) require weight and relaxation-coefficient matrices to reside entirely on one accelerator. We adapt two parallelism techniques originally developed for large-scale model training to the \texttt{auto\_LiRPA}\,/\,-CROWN verification framework. \textbf{Tensor Parallelism (TP)} shards both weight and -matrices across GPUs, achieving peak-memory reduction at ; soundness is confirmed on VNN-COMP 2022 MNIST-FC benchmarks, though bound tightness degrades with the number of sharded zones due to forced IBP substitution for intermediate bounds inside sharded zones. \textbf{Fully Sharded Data Parallelism (FSDP)} shards only weight matrices with a per-layer \texttt{AllGather}, producing bounds that are \emph{bitwise identical} to the single-GPU baseline: baseline memory drops by 80--90\%, peak memory by 34--39\% on wide MLPs. FSDP integrates cleanly with complete verification (-CROWN + Branch-and-Bound) and with convolutional layers (\texttt{BoundConv}); a complete \emph{unsat} result is obtained for CIFAR-100 ResNet-large (VNN-COMP 2024) under FSDP. Across all experiments the memory bottleneck in -CROWN+BaB mode proves to be per-neuron alpha tensors, not weight matrices, pointing to the key direction for future work.
This paper addresses the GPU memory bottleneck in formal neural network verification by adapting two parallelism strategies—Tensor Parallelism (TP) and Fully Sharded Data Parallelism (FSDP)—from large-scale model training to the auto_LiRPA/α,β-CROWN verification framework. The key insight is that bound-propagation algorithms (IBP, CROWN, α-CROWN) require weight and relaxation-coefficient matrices to reside entirely on a single GPU, limiting the scale of verifiable networks. The paper implements TP (via custom BoundLinearTP_Col/Row operators) and FSDP (via AllGather-based weight reconstruction), achieving ~2× peak memory reduction with TP and 34–39% peak / 80–90% baseline memory savings with FSDP. A notable secondary finding is that per-neuron alpha tensors—not weights—constitute the true memory bottleneck in complete verification (BaB) mode, redirecting future research priorities.
The experimental methodology is generally sound, with clear protocols and appropriate baselines. Several strengths stand out:
However, there are notable limitations in experimental scope:
The practical impact is moderate but targeted. The work addresses a real bottleneck acknowledged by the verification community—GPU memory limits on verifiable network size. The FSDP implementation in ~100 lines of Python is notably lightweight, lowering the adoption barrier. Integration with VNN-COMP benchmarks and the dominant α,β-CROWN framework increases practical relevance.
However, the paper's own central finding significantly tempers the impact: in the most practically important mode (α-CROWN+BaB for complete verification), weights are a negligible fraction of memory (<0.1%), with alpha tensors dominating. This means the proposed weight-sharding techniques provide meaningful savings only for incomplete verification of wide MLPs—a somewhat narrow use case. The paper honestly acknowledges this, but it means the contribution solves only a secondary bottleneck for the most important verification mode.
The TP approach is further limited by bound-tightness degradation that grows dramatically with network depth (deviations of ~750 for 256×6 models), making it impractical for realistic deep networks.
The work is timely in two respects: (1) neural network verification is increasingly important as AI systems enter safety-critical domains, and (2) the scaling challenge is well-recognized in the VNN-COMP community. The cross-pollination from distributed training techniques to verification is a natural and timely idea. However, the mismatch between the solved problem (weight memory) and the actual bottleneck (alpha tensors) somewhat diminishes the immediate practical relevance.
1. Honest diagnosis: The paper's most valuable contribution may be the empirical identification of alpha tensors as the true bottleneck—this diagnostic finding redirects community effort more effectively than the parallelism implementations themselves.
2. Engineering completeness: Both TP and FSDP are implemented end-to-end, including ONNX compatibility, convolutional layer support, and BaB integration.
3. Reproducibility: Clear experimental protocols, specific hardware descriptions, and integration with standard benchmarks support reproducibility.
4. Practical fixes: The Transformer compatibility fixes (BoundReshape, BoundConcat, BoundConstantOfShape) have independent value for the auto_LiRPA community.
1. Limited scale: P=2 only; the most interesting regime (P≥4, where diminishing returns are predicted) is untested.
2. Narrow applicability window: FSDP savings are meaningful only for incomplete verification of wide MLPs; TP is practical only for very shallow networks.
3. No wall-clock timing analysis: Communication overhead (AllGather latency) is never quantified in terms of verification time impact—only memory is measured.
4. The solved problem isn't the binding constraint: The paper demonstrates that for complete verification, its own contribution provides essentially no benefit, since alpha tensors dominate.
5. No comparison with alternative approaches: Simple approaches like gradient checkpointing, mixed precision, or CPU offloading for weights are not compared.
6. Limited novelty in technique: The parallelism strategies are directly adapted from well-known training paradigms (Megatron-LM, ZeRO) with relatively straightforward engineering modifications.
This is a competent engineering paper that brings distributed training techniques to neural network verification. The execution is clean and the experimental methodology is careful. However, the scientific novelty is limited—the techniques are straightforward adaptations—and the paper's own findings reveal that the addressed bottleneck (weight memory) is secondary to the actual bottleneck (alpha tensors) in the most important verification mode. The most impactful contribution is arguably the diagnostic finding about alpha tensor dominance, which is more of an empirical observation than a technical innovation. The work opens a research direction but doesn't yet deliver transformative capability.
Generated Jun 9, 2026
Paper 2 is likely higher impact: it introduces a broadly useful, high-performance fused kernel (Flash-GMM) with large, demonstrated gains (20× speed, 100× larger datasets) and an immediate downstream application improving ANN search quality/efficiency—relevant across ML systems, clustering, retrieval, and databases. The open-source Triton kernel increases adoption potential. Paper 1 is valuable for verification but targets a narrower community; its strongest method (FSDP) is an adaptation of existing training parallelism, and it identifies that alphas (not weights) are the main bottleneck, limiting near-term scaling wins.
Paper 1 introduces a fundamental algorithmic innovation in explainable AI by shifting feature occlusion from the input space to the parameter space. This addresses major biases and instability in existing attribution methods, offering broad interdisciplinary impact, particularly in high-stakes fields like healthcare. While Paper 2 presents valuable systems-level optimizations for neural network verification, Paper 1's approach provides a more foundational contribution to understanding and interpreting model behavior.
Paper 2 likely has higher impact: it introduces a broadly applicable systems/methods advance that scales formal neural network verification—an area with strong timeliness for safety-critical ML. Adapting TP/FSDP to verification and demonstrating bitwise-identical bounds with large memory reductions, plus integration with complete verification and modern CNN/ResNet benchmarks, shows solid methodological rigor and immediate utility. Its applicability spans many domains using verified ML (autonomy, security, control). Paper 1 is valuable as a clinical benchmark but is narrower in scope and its main finding is a modality limitation rather than enabling new predictive performance.
Paper 1 addresses a fundamental theoretical question about the relationship between data symmetries and conservation laws in neural network training, establishing rigorous mathematical results applicable across broad classes of architectures. Its novelty lies in connecting symmetry principles (reminiscent of Noether's theorem) to deep learning theory, with potential to influence both theoretical understanding and practical training strategies. Paper 2 is a solid engineering contribution applying known parallelism techniques to neural network verification, but its impact is more incremental and narrowly scoped to the verification community.
Paper 2 likely has higher impact: it targets a broad, timely bottleneck (LLM inference latency) with a simple, general architectural principle (backbone-first token) plus a tiny predictor that achieves measurable speedups with no quality loss across multiple model scales. This is immediately applicable to real-world deployment and could influence many decoding/serving systems. Paper 1 is rigorous and valuable for verification tooling, but its impact is narrower (formal verification community), and the main gains are memory engineering with limited scaling (TP tightness degradation; FSDP shifts bottleneck to alpha tensors).
Paper 2 addresses a highly critical and timely problem: compressing Large Language Models (LLMs) via post-training quantization. Its exact algorithm (PiSO) replaces standard heuristics, offering broad and immediate real-world applicability for efficient LLM deployment. While Paper 1 makes solid contributions to neural network verification, formal verification currently has a narrower audience and application scope compared to the massive ecosystem surrounding LLM optimization.
Paper 2 has higher impact potential due to strong real-world applicability and timeliness: scaling formal verification to multi-GPU directly addresses a major bottleneck for safety-critical deployment. Methodologically, it contributes engineering/algorithmic advances (TP and especially FSDP with bitwise-identical bounds), integrates with complete verification, and demonstrates on competitive benchmarks including a large ResNet case. Its insights about alpha-tensor memory bottlenecks guide future work. Paper 1 is novel and useful for interpretability, but is more descriptive/diagnostic and less likely to unlock immediate, broad operational capabilities than scalable verification.
Paper 1 addresses a critical bottleneck in AI safety—memory constraints in formal neural network verification. By adapting large-scale training parallelism techniques (TP, FSDP) to verification, it enables the safety assessment of much larger models. This has broad, high-stakes implications for deploying AI in safety-critical domains (e.g., autonomous driving, healthcare). In contrast, Paper 2 is highly domain-specific, focusing on sports analytics (football tactical phases). While methodologically sound, Paper 1's contribution to foundational AI safety ensures a significantly wider and more profound scientific impact.
Paper 2 presents a novel mathematical framework for constructing VAE latent spaces with prescribed non-Euclidean topology, addressing a fundamental limitation in generative modeling. It offers broad applicability across manifold types, provides a principled constructive approach with closed-form KL divergences, and has clear implications for representation learning across many domains. Paper 1, while technically solid, addresses a more narrow engineering problem (memory scaling for neural network verification) and identifies that the main bottleneck (alpha tensors) isn't even solved by the proposed methods, limiting its immediate impact.
Paper 2 likely has higher scientific impact due to broad relevance and timeliness: scaling formal neural network verification is a major bottleneck for deploying ML in safety-critical settings. Adapting TP and especially FSDP to verification yields large memory reductions while preserving bitwise-identical bounds, integrates with complete verification and convnets, and enables new benchmark results (e.g., CIFAR-100 ResNet-large unsat). The methodology is concrete, reproducible, and broadly applicable across verification tools and model classes. Paper 1 is novel and valuable for ecology, but its impact is more domain-specific.