Tenghui Ma, Jihu Guo, Wei Gao, Sitian Lu, Zhisheng Ye, Hanjing Wang, Dahua Lin
Hybrid parallelism underpins large-scale LLM training across tens of thousands of GPUs. At such scale, hardware failures on individual devices lead to performance skew across devices, diminishing overall training efficiency. Existing resilient systems overlook sequence length variability in datasets and device performance skew under hybrid parallelism. As a result, (1) iteration time fluctuations induced by sequence length variability can trigger spurious fail-slow detections, and (2) failures are mitigated through individual adaptations in hybrid parallelism, leading to unnecessary detection overhead and inefficient resilient training. To respond, this paper presents ResiHP, a resilient system that enables robust failure detection and fine-grained adaptation for hybrid parallel training. First, we develop a Detector to accurately identify failures. In particular, it employs a workload-aware execution time predictor that disentangles failures from iteration time fluctuations while remaining lightweight for online detection. Second, we design a Scheduler that dynamically adapts parallelism group sizes, model partitioning, and workload scheduling policies to improve training efficiency under failures. Experiments show that ResiHP improves training throughput by 1.04-4.39× compared with state-of-the-art resilient training systems under diverse failure scenarios in a 256-GPU cluster.
ResiHP addresses a critical operational challenge in large-scale LLM training: maintaining high throughput when hardware failures (both fail-stop and fail-slow) occur across thousands of GPUs. The paper makes two principal contributions:
Workload-aware failure detection: A lightweight detector that disentangles genuine fail-slow failures from iteration-time fluctuations caused by variable sequence lengths. The key insight is that sequence packing with self-attention's quadratic complexity creates legitimate execution time variability that existing detectors (notably Greyhound) misinterpret as failures. ResiHP models micro-batch execution time as T_MB ≈ αN + β∑l_i², then uses a DAG-based analytical simulator to predict healthy iteration times, filtering false positives before triggering expensive validation.
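To make the cost model above concrete, here is a minimal sketch that fits α and β by ordinary least squares and flags a micro-batch only when its measured time exceeds the workload-aware prediction by a tolerance. It assumes N is the total token count of a packed micro-batch and l_i are the lengths of the sequences packed into it. The function names, the 15% tolerance, and the synthetic timings are hypothetical illustrations, not taken from the paper, and the DAG-based simulator is not modeled here.

```python
import numpy as np

def features(seq_lens):
    """Return [N, sum(l_i^2)] for one packed micro-batch (N = total tokens, assumed)."""
    lens = np.asarray(seq_lens, dtype=float)
    return np.array([lens.sum(), np.square(lens).sum()])

def fit_cost_model(batches, times):
    """Least-squares fit of T_MB ~ alpha * N + beta * sum(l_i^2)."""
    X = np.stack([features(b) for b in batches])
    coef, *_ = np.linalg.lstsq(X, np.asarray(times, dtype=float), rcond=None)
    return float(coef[0]), float(coef[1])

def predict(alpha, beta, seq_lens):
    """Predicted healthy execution time for one micro-batch."""
    n_tokens, sum_sq = features(seq_lens)
    return alpha * n_tokens + beta * sum_sq

def is_suspect(alpha, beta, seq_lens, observed, tol=0.15):
    """Flag only if the observed time exceeds the prediction by more than tol (hypothetical)."""
    return observed > (1.0 + tol) * predict(alpha, beta, seq_lens)

# Synthetic "healthy" profile generated from made-up coefficients.
true_alpha, true_beta = 2e-5, 3e-9
batches = [[4096], [2048, 2048], [1024] * 4, [512] * 8, [3072, 1024], [2048], [1024, 512]]
times = [true_alpha * sum(b) + true_beta * sum(l * l for l in b) for b in batches]

alpha, beta = fit_cost_model(batches, times)

# Same token count, different packing: one long sequence is legitimately slower.
long_pack = predict(alpha, beta, [4096])
short_pack = predict(alpha, beta, [512] * 8)
print(f"one 4096-token sequence: {long_pack:.4f}  eight 512-token sequences: {short_pack:.4f}")
```

In this synthetic setup the two micro-batches carry the same number of tokens, yet the single long sequence is predicted to take noticeably longer. A detector that compares raw iteration times against a fixed baseline would treat that gap as a slowdown, whereas comparing against the per-batch prediction would not.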
Progressive hybrid-parallel adaptation: A scheduler that jointly adapts across TP, PP, and DP dimensions rather than optimizing each individually. This includes selective device exclusion within TP groups (preserving healthy devices rather than discarding entire groups), layer repartitioning across PP stages, and progress-aware workload migration across DP groups.
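The idea of selective device exclusion within a TP group can be sketched as follows. This is only an illustration under stated assumptions: it assumes TP degrees must be powers of two and simply keeps the largest power-of-two subset of healthy devices, returning the remainder as spares. The function name and the selection rule are hypothetical and are not the paper's actual policy.

```python
def shrink_tp_group(devices, unhealthy):
    """Keep the largest power-of-two subset of healthy devices (hypothetical rule).

    Returns (kept, spare): devices that stay in the TP group, and healthy
    devices left over because the group size must be a power of two.
    """
    healthy = [d for d in devices if d not in unhealthy]
    if not healthy:
        return [], []
    # Largest power of two that does not exceed the number of healthy devices.
    size = 1 << (len(healthy).bit_length() - 1)
    return healthy[:size], healthy[size:]

# An 8-way TP group in which device 3 has failed.
kept, spare = shrink_tp_group(list(range(8)), {3})
print("kept:", kept, "spare:", spare)
```

Under this rule a single failure in an 8-way group leaves a 4-way group plus three healthy spares, rather than discarding all eight devices. How the spare devices are reused is outside the scope of this sketch.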
Practical relevance is high. Large-scale LLM training is among the most expensive computational workloads today. Meta reported 178,000 wasted GPU hours during OPT-175B training, and recent measurements show 59.2% of large training jobs encounter fail-slow failures. Any system that meaningfully reduces failure-induced waste has substantial economic impact.
The 1.04–4.39× throughput improvement is significant, as is the ability to sustain training under the aggressive 30-minute failure frequency, where all baselines abort. This resilience to cascading failures is practically important for multi-week training runs.
Broader influence: The workload-aware detection approach could influence monitoring systems beyond LLM training — any distributed workload with input-dependent execution variability faces similar false-positive challenges. The progressive adaptation framework across parallelism dimensions provides a template for future resilient distributed systems.
Limitations on generalizability: The system is implemented on a specific internal training framework (similar to Megatron-LM). While the authors claim portability, the actual implementation effort for other frameworks (e.g., DeepSpeed, FSDP-based systems) is unstated. The restriction to power-of-two TP degrees limits flexibility, though this is a practical constraint of current GPU communication libraries.
This work is highly timely. As LLM training scales to tens of thousands of GPUs (MegaScale reports 10,000+ GPU training), failure management is a recognized bottleneck. The paper directly addresses limitations of very recent systems (Greyhound from USENIX ATC 2025, Adaptra from 2025), positioning it at the frontier of this rapidly evolving space.
The observation that sequence length variability confounds fail-slow detection is novel and practically important, especially as training moves toward longer contexts (32K–128K tokens) where attention cost variability becomes more pronounced.
ResiHP represents a solid systems contribution that advances the state of resilient LLM training through a well-motivated combination of workload-aware detection and progressive hybrid-parallel adaptation. The experimental evaluation is thorough within its scope, though larger-scale validation would strengthen claims. The work addresses a real and growing need in the LLM training ecosystem.
Generated May 8, 2026
ResiHP addresses a fundamental and growing challenge in large-scale LLM training—hardware failure resilience across thousands of GPUs—which is highly relevant given the explosive growth of LLM training infrastructure. It introduces novel techniques (workload-aware failure detection, dynamic hybrid parallelism adaptation) with broad applicability. Paper 1, while practically useful, is primarily a benchmarking study of LDPC offloading on a specific NVIDIA platform using existing tools (Sionna), offering incremental engineering insights rather than fundamental methodological contributions. ResiHP's broader impact across the AI/systems community and its novelty give it higher scientific impact potential.
Paper 1 addresses a critical bottleneck in large-scale LLM training by introducing a novel system to accurately detect and dynamically mitigate hardware failures and performance skew. Its methodological rigor is demonstrated through substantial throughput improvements (up to 4.39x) on a large 256-GPU cluster. In contrast, Paper 2 is primarily an empirical benchmarking study of existing software (NVIDIA Sionna) on new hardware for 5G Open RAN. Due to the massive scale, high cost, and broad relevance of foundational AI training, Paper 1 offers significantly higher novelty and broader potential scientific impact.
Paper 2 likely has higher impact: it addresses a timely, high-stakes bottleneck in large-scale LLM training—resilience under hybrid parallelism at hundreds/thousands of GPUs—directly relevant to industry and large research labs. The Detector+Scheduler framework is broadly applicable across models, datasets with variable sequence lengths, and heterogeneous failure modes, with substantial throughput gains (up to 4.39×). Paper 1 is technically strong with impressive GPU speedups, but it targets a more specialized hypergraph partitioning variant and may have narrower cross-domain adoption compared to resilient LLM training infrastructure.
ResiHP addresses a critical and timely problem in large-scale LLM training—hardware failure resilience across massive GPU clusters. Given the explosive growth of LLM training at scale, this work has broad practical relevance to industry and research. Its novel workload-aware failure detection and dynamic hybrid parallelism adaptation are methodologically rigorous, demonstrated on a 256-GPU cluster with significant throughput gains (up to 4.39×). Paper 2 addresses GPU-accelerated hypergraph partitioning with specific constraints, which is more niche. While it achieves impressive speedups, its narrower application scope and audience limit its broader scientific impact compared to Paper 1's relevance to the rapidly growing LLM training ecosystem.
ResiHP addresses a critical and timely problem in large-scale LLM training resilience, which is highly relevant given the explosive growth of LLM development across industry and academia. Its practical impact on training efficiency (1.04-4.39× throughput improvement) at scale (256+ GPUs) addresses a widespread pain point. While Proteus makes a solid contribution to distributed ledger consensus by combining CFT and BFT protocols for TEE environments, its scope is narrower. ResiHP's broader applicability to the rapidly expanding LLM training ecosystem and its novel workload-aware failure detection approach give it higher potential impact.
Paper 2 likely has higher scientific impact: it introduces a new consensus protocol that bridges CFT and embedded BFT “with no additional messages,” addressing a timely and broadly relevant trust gap in TEE-assisted ledgers. The contribution is more foundational, with applicability across distributed systems, security, and blockchain/ledger deployments, and directly tackles real-world risks of TEE compromise. Paper 1 is valuable and practical for large-scale LLM training resilience, but is more domain-specific (training infrastructure) and its gains may be bounded by evolving training stacks and hardware reliability improvements.
ResiHP addresses a critical and increasingly important problem in large-scale LLM training: fault tolerance across massive GPU clusters. As training scales to tens of thousands of GPUs, hardware failures become inevitable, making resilient training systems essential. The paper tackles both detection and adaptation with a comprehensive system design, achieving significant throughput improvements (up to 4.39×). Paper 2's TimelyFreeze addresses pipeline bubble optimization through parameter freezing—a narrower problem with more incremental impact (up to 40% improvement). ResiHP's broader applicability to real-world large-scale training infrastructure and its systems-level contributions give it higher potential impact.
Paper 2 (ResiHP) likely has higher impact due to addressing a core bottleneck in real-world LLM training at scale: reliability and performance under frequent failures and skew in hybrid-parallel, multi-thousand-GPU settings. Its detector+scheduler framework is broadly applicable across models, datasets (sequence-length variability), and parallelism strategies, with large reported gains (up to 4.39×) and clear operational relevance for production clusters. Paper 1 is innovative and useful but targets a narrower optimization (pipeline bubbles via freezing) with smaller, more context-dependent gains and potential accuracy/validation complexity.
Paper 2 likely has higher impact: it tackles a timely, widely relevant problem (resilience and efficiency in large-scale LLM training) with clear real-world applicability in production GPU clusters. The proposed Detector+Scheduler addresses concrete gaps in prior systems (sequence-length-induced variance and hybrid-parallel skew) and reports substantial throughput gains on a sizable 256-GPU setup, suggesting methodological rigor and practical validation. Its potential breadth spans ML systems, distributed computing, and HPC operations. Paper 1 is novel within distributed robotics/anonymous graphs but is more specialized and likely narrower in immediate adoption.
ResiHP addresses a critical and timely challenge in large-scale LLM training—hardware failure resilience across massive GPU clusters. Its novel workload-aware failure detection and dynamic hybrid parallelism adaptation are directly relevant to the rapidly growing field of LLM training infrastructure. The 1.04-4.39× throughput improvements on a 256-GPU cluster demonstrate significant practical impact. Paper 1 proposes a digital twin framework for data center energy optimization but is evaluated only in a small-scale constrained environment, limiting its demonstrated impact. Paper 2's methodology is more rigorous, its problem more timely, and its potential industry impact broader.