Pietro Cagnasso, Eugene Belilovsky, Edouard Oyallon
Communication-efficient pre-training of LLMs is increasingly important as training draws on compute distributed across clusters, data centers, and lower-bandwidth links. Many practical methods reduce communication frequency but still rely on synchronous All-Reduce operations that maintain identical model states and tie progress to global collectives. This can become a bottleneck when bandwidth or worker speed is heterogeneous. We introduce GASLoC, a novel decentralized pre-training algorithm that generalizes the notion of communication acceleration to the recently popular "outer optimizer" to allow a practical gossip-based training framework that is compatible with adaptive optimizers, allows for local optimizer steps, and can utilize sparse randomized peer communication. Empirically, on a number of standard LLM training tasks, we demonstrate that GASLoC outperforms state-of-the-art decentralized algorithms in the single-step-per-communication setting for a number of topologies and, unlike existing decentralized methods in the LLM setting, achieves performance competitive with DiLoCo when utilizing multiple local steps. In the heterogeneous bandwidth setting, we show that GASLoC can significantly outperform DiLoCo.
GASLoC proposes a decentralized training algorithm for LLMs that unifies three previously separate ideas: (a) local SGD-style multiple gradient steps between communications, (b) sparse randomized gossip-based peer-to-peer communication, and (c) an outer optimizer with momentum that serves as communication acceleration. The key insight is that the "outer optimizer" popularized by DiLoCo can be reinterpreted as a communication acceleration mechanism from the gossip optimization literature. When the communication graph is complete, GASLoC exactly recovers DiLoCo; when the graph is sparse, the outer momentum reduces the communication complexity from χ to √χ (where χ is the inverse spectral gap of the communication graph), providing a principled bridge between federated/local SGD methods and decentralized gossip methods.
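To make the three ingredients concrete, the following toy Python sketch runs them on a scalar quadratic. It is not the paper's update rule: the objective, the heavy-ball form of the outer step, the choice to average both parameters and momentum buffers with a random peer, and all names (`local_steps`, `toy_round`, `outer_lr`, `outer_momentum`) are illustrative assumptions.

```python
import random


def local_steps(theta, target, lr=0.1, num_steps=5):
    """Run num_steps of gradient descent on f(theta) = 0.5 * (theta - target)**2."""
    for _ in range(num_steps):
        theta -= lr * (theta - target)
    return theta


def toy_round(thetas, momenta, target, rng, outer_lr=0.7, outer_momentum=0.5):
    """One round: local steps, outer momentum step, then random pairwise averaging."""
    n = len(thetas)
    for i in range(n):
        # Pseudo-gradient: how far the local steps moved this worker.
        delta = thetas[i] - local_steps(thetas[i], target)
        # Outer heavy-ball update applied to the pseudo-gradient (assumed form).
        momenta[i] = outer_momentum * momenta[i] + delta
        thetas[i] -= outer_lr * momenta[i]
    # Sparse communication: each worker averages its state with one random peer.
    order = list(range(n))
    rng.shuffle(order)
    for a, b in zip(order[0::2], order[1::2]):
        thetas[a] = thetas[b] = 0.5 * (thetas[a] + thetas[b])
        momenta[a] = momenta[b] = 0.5 * (momenta[a] + momenta[b])
    return thetas, momenta


rng = random.Random(0)
thetas = [float(i) for i in range(8)]  # workers start from different points
momenta = [0.0] * 8
for _ in range(60):
    thetas, momenta = toy_round(thetas, momenta, target=3.0, rng=rng)
# Largest remaining distance from the shared optimum across workers.
print(max(abs(t - 3.0) for t in thetas))
```

In this toy setting every worker shares the same objective, which mirrors the homogeneous-data assumption of the analysis. No worker ever takes part in a global collective; each one only talks to a single peer per round.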
The randomized 1-Peer and 2-Peer communication schemes are particularly elegant: by sampling random permutations rather than using fixed topologies, the expected inverse spectral gap stays O(1) regardless of network size, compared to O(n²) for fixed cycles or O(log n) for exponential graphs. This is formalized in Proposition 1 with clean closed-form expressions for first and second moments.
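The scaling gap between a fixed cycle and random pairing can be observed in a small simulation. The sketch below counts the averaging rounds needed to shrink the disagreement between workers by a factor of 100, which serves as a rough proxy for the inverse spectral gap. The pairing rule (shuffle, then average consecutive pairs), the uniform 1/3 ring weights, the ramp-shaped initial values, and the function names are all assumptions for illustration, not the paper's 1-Peer definition.

```python
import math
import random


def disagreement(x):
    """Root-mean-square distance of the worker values from their average."""
    mean = sum(x) / len(x)
    return math.sqrt(sum((v - mean) ** 2 for v in x) / len(x))


def ring_round(x):
    """Fixed ring: each worker averages itself with its two neighbours."""
    n = len(x)
    return [(x[i - 1] + x[i] + x[(i + 1) % n]) / 3.0 for i in range(n)]


def random_matching_round(x, rng):
    """Random pairing: shuffle workers, then average within consecutive pairs."""
    order = list(range(len(x)))
    rng.shuffle(order)
    out = list(x)
    for a, b in zip(order[0::2], order[1::2]):
        out[a] = out[b] = (x[a] + x[b]) / 2.0
    return out


def rounds_to_consensus(x0, step, tol=1e-2, max_rounds=100_000):
    """Count rounds until disagreement falls below tol times its initial value."""
    x, target = list(x0), tol * disagreement(x0)
    for t in range(1, max_rounds + 1):
        x = step(x)
        if disagreement(x) <= target:
            return t
    return max_rounds


rng = random.Random(0)
for n in (8, 16, 32):
    x0 = [float(i) for i in range(n)]  # ramp: worker i starts at value i
    ring = rounds_to_consensus(x0, ring_round)
    matching = rounds_to_consensus(x0, lambda x: random_matching_round(x, rng))
    print(n, ring, matching)
```

Under these assumptions the ring count should grow roughly fourfold each time n doubles, while the random-pairing count should stay nearly flat, even though each worker contacts only one peer per round.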
Theory: The convergence analysis (Proposition 2) is standard but thorough, covering four communication regimes (standard gossip, accelerated gossip, 1-Peer, 2-Peer) under a unified Lyapunov framework. The proof in Appendix A spans ~15 pages and handles stochastic gradients, L-smoothness, and variable local step counts. The key theoretical novelty—showing that outer momentum improves communication complexity from χ to √χ in the gossip setting—is cleanly established through Lemma 1.
Experiments: The empirical evaluation is reasonably comprehensive, covering 134M and 551M parameter Llama-3-style models on FineWeb, with 8/16/32 workers, multiple topologies (ring, complete, 1-Peer, 2-Peer), and both H=1 and H=30 regimes. The comparison against DDP, DiLoCo, DAdam, and the newly introduced Local-DAdam baseline is appropriate. The heterogeneous bandwidth experiments in Figure 3 provide practical motivation.
Weaknesses in rigor: The paper lacks error bars or repeated runs, which is acknowledged but remains a limitation. The convergence result assumes homogeneous data across workers, which, while matching the LLM pretraining setting, limits theoretical generality. The heterogeneous bandwidth experiments use simulated timing rather than actual heterogeneous hardware, though this is common practice. The model scales (134M, 551M) are modest by current LLM standards, leaving open questions about behavior at 7B+ parameters.
Practical relevance: As LLM training increasingly spans multiple data centers and heterogeneous hardware (including over-the-internet connections), methods that avoid global All-Reduce synchronization become critical. GASLoC's ability to (a) use sparse peer communication, (b) allow variable local steps per worker, and (c) maintain competitive loss makes it practically attractive for geo-distributed training scenarios.
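A back-of-the-envelope timing model shows why a single slow link hurts a global collective more than pairwise exchange. The sketch below is a deliberately crude assumption, not the paper's simulator: it treats an All-Reduce round as paced by the slowest link, treats a pairwise exchange as paced by the slower of the two links involved, and uses hypothetical bandwidth and payload values.

```python
def all_reduce_round_time(bandwidths, payload):
    """Simplified model: a global collective finishes at the pace of the slowest link."""
    return payload / min(bandwidths)


def pairwise_round_times(bandwidths, payload, pairs):
    """Each pair finishes at the pace of its own slower link; returns per-worker times."""
    times = [0.0] * len(bandwidths)
    for a, b in pairs:
        times[a] = times[b] = payload / min(bandwidths[a], bandwidths[b])
    return times


bandwidths = [10.0] * 7 + [1.0]  # hypothetical: seven fast links, one slow link
payload = 10.0                   # hypothetical payload, in bandwidth-times-time units
pairs = [(0, 1), (2, 3), (4, 5), (6, 7)]

sync_time = all_reduce_round_time(bandwidths, payload)
gossip_times = pairwise_round_times(bandwidths, payload, pairs)
mean_gossip_time = sum(gossip_times) / len(gossip_times)
print(sync_time, mean_gossip_time)
```

In this model only the slow worker and its current partner wait on the slow link, whereas the collective makes all eight workers wait. The model ignores that gossip may need more rounds to reach the same loss, which is the trade-off the experiments are meant to measure.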
Algorithmic insight: The conceptual unification of DiLoCo's outer optimizer with gossip communication acceleration is valuable. This reframing could influence how the community thinks about and designs communication-efficient training methods, potentially spawning further work on adaptive topology selection, asynchronous variants, or integration with gradient compression.
Broader influence: The fault tolerance properties (failed exchanges only affect nearby workers) and the bandwidth-straggler adaptation (Figure 2) address real deployment concerns. The simulated compute utilization analysis (Figure 9) for 70B models suggests the benefits would amplify at scale.
This paper is highly timely. DiLoCo and its variants have emerged as the dominant paradigm for communication-efficient LLM pretraining, and the community is actively exploring ways to reduce its remaining dependencies on global synchronization. The concurrent work on Decoupled DiLoCo, NoLoCo, and streaming DiLoCo all target similar concerns from different angles. GASLoC provides a principled gossip-based alternative that complements these efforts. The growing interest in training across heterogeneous clusters (cloud + on-premise, multiple regions) makes this work directly relevant to current infrastructure trends.
Additional observations: The hyperparameter sensitivity analysis (Appendix E) is a welcome addition showing that sparse communication doesn't substantially increase tuning difficulty. The paper is well-written with clear notation and good use of illustrative figures.
Generated Jun 10, 2026
ATLAS introduces a novel framework for automated scientific discovery that combines active learning with mechanistic modeling, with broad applicability across cognitive science and other scientific domains. Its potential to fundamentally change how experiments are designed and theories are discovered represents a paradigm-shifting contribution. While Paper 1 (GASLoC) makes a solid engineering contribution to distributed LLM training with practical benefits in heterogeneous settings, it is more incremental—combining existing ideas (gossip protocols, local updates, outer optimizers) in a useful but narrower way. ATLAS's cross-disciplinary impact and novelty in automating the scientific method give it higher long-term impact potential.
Paper 1 presents a novel and highly interdisciplinary framework that bridges neuroscience and AI by using brain fMRI signals to directly enhance LLM reasoning, moving beyond correlational analysis to causal guidance. This represents a fundamentally new paradigm (brain-guided AI) with broad implications across cognitive science, neuroscience, and AI alignment. The demonstrated improvements across 10 LLMs of varying scales and transfer across reasoning types strengthen its impact. Paper 2, while practically valuable for distributed LLM training efficiency, is more incremental—optimizing existing decentralized training paradigms rather than opening a new research direction.
Paper 1 addresses a critical scalability bottleneck in LLM pretraining—communication efficiency in distributed training—which is of immense practical importance given the trend toward larger models and heterogeneous compute. GASLoC introduces a novel decentralized algorithm unifying local updates with gossip-based communication, demonstrating strong empirical results. Paper 2 provides valuable mechanistic insights into SAE feature stability, but its impact is more niche, primarily relevant to the interpretability subcommunity. Paper 1's potential to enable more efficient large-scale training has broader and more immediate real-world impact across the field.
Paper 2 likely has higher impact: it tackles a central, timely bottleneck in LLM pretraining—communication and heterogeneity—relevant across industry and academia. GASLoC’s decentralized, gossip-based framework that works with adaptive optimizers and local steps offers broad real-world applicability to distributed training infrastructure and could influence many systems/optimization efforts. Methodologically it appears empirically validated across topologies and settings against strong baselines (DiLoCo). Paper 1 is novel for operator learning and interpretability, but is narrower in scope (PDE/operator tasks) and likely affects a smaller community.
Paper 2 addresses a critical and highly timely bottleneck in AI—communication overhead in distributed LLM pre-training. Its novel decentralized algorithm offers immediate, practical improvements for large-scale AI training across heterogeneous clusters, ensuring high real-world applicability and broad impact. Paper 1, while conceptually valuable, focuses on a narrower theoretical discussion of uncertainty in dynamical systems, which is less likely to drive widespread, immediate technological advances.
Paper 2 likely has higher impact due to a more broadly applicable, timely contribution to scaling LLM pretraining under real distributed-systems constraints. GASLoC proposes a novel decentralized algorithm compatible with adaptive optimizers, local steps, and sparse randomized communication, and shows empirical gains over strong baselines (including heterogeneous bandwidth settings). This targets a major bottleneck for frontier training and can influence both ML systems and optimization research. Paper 1 is rigorous and valuable as a cautionary interpretability result, but its impact is narrower (MoE pruning/interpretability methodology) and primarily negative/diagnostic rather than enabling new capabilities.
Paper 2 addresses a critical bottleneck in the highly resource-intensive field of LLM pre-training. By enabling efficient decentralized training across low-bandwidth and heterogeneous clusters, it dramatically reduces the infrastructure barriers for training large foundation models. While Paper 1 offers excellent efficiency gains and state-of-the-art results in image generation, solving communication bottlenecks in distributed LLM training currently has broader economic and scientific implications across the AI ecosystem, making Paper 2's real-world applicability and timeliness more impactful.
Paper 1 likely has higher scientific impact due to stronger novelty and broader, timely relevance: it proposes a decentralized, gossip-based LLM pretraining framework compatible with adaptive optimizers and local steps, directly addressing a major scaling bottleneck (communication/heterogeneity) in frontier model training. The potential applications span most large-scale distributed training settings across industry and research, with cross-field impact in distributed optimization/systems and ML. Paper 2 is a solid applied architecture contribution, but is more incremental (Transformer variants + preprocessing) and narrower in scope to EEG emotion recognition.
Paper 2 likely has higher scientific impact: it creates a large, public, leakage-audited clinical-genomic benchmark with locked tasks, splits, and an evaluation harness—an enabling resource that can standardize and accelerate work across oncology, ML, and biomarker development. It also clearly identifies a modality ceiling and sets concrete requirements for improved data collection (serial ctDNA), which can steer the field. Paper 1 is technically novel and timely for distributed LLM training, but impacts may be narrower to systems/ML training infrastructure and may compete with fast-moving proprietary implementations.
Paper 2 has higher estimated scientific impact because it surfaces and theoretically explains a broadly applicable training-dynamics mechanism (early negative weight drift from biased activations) that spans many architectures and losses, linking it to activation sparsity, accuracy cliffs, and spike pathologies, and proposes simple mitigations (squared/clipped activations). This kind of general insight can influence activation design, initialization, optimization, interpretability, and sparsity/efficiency work across fields. Paper 1 is timely and practically valuable for distributed LLM pretraining, but its impact is more specialized to systems/optimization setups and may be superseded by infrastructure-specific solutions.