ResiHP: Taming LLM Training Failures with Dynamic Hybrid

Tenghui Ma, Jihu Guo, Wei Gao, Sitian Lu, Zhisheng Ye, Hanjing Wang, Dahua Lin

May 7, 2026 · arXiv:2605.06374v1

cs.DC

#332 of 1008 · Distributed Computing

Tournament Score: 1436 ± 39 (scale 1050–1750)

Win Rate: 77%

Rating: 7/10

Significance 7.5 · Rigor 6.8 · Novelty 6.5 · Clarity 7.5

Abstract

Hybrid parallelism underpins large-scale LLM training across tens of thousands of GPUs. At such scale, hardware failures on individual devices lead to performance skew across devices, diminishing overall training efficiency. Existing resilient systems overlook sequence length variability in datasets and device performance skew under hybrid parallelism. As a result, (1) iteration time fluctuations induced by sequence length variability can trigger spurious fail-slow detections, and (2) failures are mitigated through individual adaptations in hybrid parallelism, leading to unnecessary detection overhead and inefficient resilient training. To respond, this paper presents ResiHP, a resilient system that enables robust failure detection and fine-grained adaptation for hybrid parallel training. First, we develop a Detector to accurately identify failures. In particular, it employs a workload-aware execution time predictor that disentangles failures from iteration time fluctuations while remaining lightweight for online detection. Second, we design a Scheduler that dynamically adapts parallelism group sizes, model partitioning, and workload scheduling policies to improve training efficiency under failures. Experiments show that ResiHP improves training throughput by 1.04–4.39× compared with state-of-the-art resilient training systems under diverse failure scenarios in a 256-GPU cluster.

AI Impact Assessments

(1 model)

Scientific Impact Assessment: ResiHP

1. Core Contribution

ResiHP addresses a critical operational challenge in large-scale LLM training: maintaining high throughput when hardware failures (both fail-stop and fail-slow) occur across thousands of GPUs. The paper makes two principal contributions:

Workload-aware failure detection: A lightweight detector that disentangles genuine fail-slow failures from iteration-time fluctuations caused by variable sequence lengths. The key insight is that sequence packing with self-attention's quadratic complexity creates legitimate execution time variability that existing detectors (notably Greyhound) misinterpret as failures. ResiHP models micro-batch execution time as T_MB ≈ αN + β∑l_i², then uses a DAG-based analytical simulator to predict healthy iteration times, filtering false positives before triggering expensive validation.

Progressive hybrid-parallel adaptation: A scheduler that jointly adapts across TP, PP, and DP dimensions rather than optimizing each individually. This includes selective device exclusion within TP groups (preserving healthy devices rather than discarding entire groups), layer repartitioning across PP stages, and progress-aware workload migration across DP groups.
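The time model quoted above, T_MB ≈ αN + β∑l_i², can be illustrated with a short Python sketch. It assumes that N is the total number of tokens packed into a micro-batch and that l_i is the length of the i-th packed sequence; this is a reading of the formula, not something the assessment spells out. The function name and the coefficient values are hypothetical and chosen only to show why two micro-batches with the same token count can legitimately differ in execution time.

```python
from typing import Sequence


def microbatch_time(seq_lens: Sequence[int], alpha: float, beta: float) -> float:
    """Evaluate T_MB ~= alpha * N + beta * sum(l_i^2).

    N is taken to be the total number of packed tokens (an assumption),
    and l_i the individual sequence lengths in the micro-batch.
    """
    n_tokens = sum(seq_lens)
    attention_term = sum(length * length for length in seq_lens)
    return alpha * n_tokens + beta * attention_term


# Hypothetical coefficients, for illustration only (not taken from the paper).
ALPHA = 1.0e-5  # seconds per token (linear work)
BETA = 2.0e-9   # seconds per squared token (self-attention work)

# Two packings with the same token budget of 8192 tokens.
one_long = [8192]        # a single long sequence
many_short = [1024] * 8  # eight short sequences

t_long = microbatch_time(one_long, ALPHA, BETA)
t_short = microbatch_time(many_short, ALPHA, BETA)

print(f"one long sequence:   {t_long:.4f} s")
print(f"eight short sequences: {t_short:.4f} s")
```

With these made-up coefficients the single long sequence is predicted to take a bit more than twice as long as the eight short ones, even though both micro-batches hold the same number of tokens. A detector that looks only at iteration time would see that gap as a slowdown.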

2. Methodological Rigor

Strengths in experimental design:

Evaluation spans 6 model configurations (LLaMA 2 7B–70B, Qwen 2.5 7B–72B) across multiple parallelism settings on 256 A100 GPUs — a non-trivial testbed.

Systematic comparison against four baselines (Greyhound, Adaptra, ReCycle, Oobleck) covering different failure-handling capabilities.

Ablation study (Figure 11) quantifying the contribution of each component (device exclusion, layer repartition, workload migration).

Convergence validation (Figure 12) showing loss curves overlap between fault-free and ResiHP-recovered training.

The failure amplification analysis (Figure 2) effectively motivates the progressive adaptation approach, showing 25.43× additional GPU idle time at the DP level from a single fail-slow GPU.

Concerns:

The micro-batch time predictor (Eq. 1) assumes a simple two-parameter linear model. While MAPE of 1.19–1.58% is reported, this is evaluated on only three configurations. The model's robustness to diverse architectures (e.g., mixture-of-experts, multi-modal models) is unclear.

The 25% threshold for triggering validation after change-point detection appears somewhat ad hoc, with no sensitivity analysis provided.

Fail-slow injection via `nvidia-smi` frequency locking and side-channel bandwidth contention, while standard practice, may not capture the full spectrum of real-world fail-slow behaviors (e.g., intermittent, oscillating degradation).

The progress-aware heuristic solver (Algorithm 1) is greedy and may not approximate the optimal solution well in all scenarios. No optimality gap analysis is provided.

Statistical reporting mentions 95% confidence intervals but only for aggregate results; large-scale experiments appear to be single runs.
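To make the concerns about the two-parameter predictor and the 25% threshold more concrete, the sketch below fits α and β by ordinary least squares, computes MAPE, and applies a relative-deviation check. Everything here is an assumption made for illustration: the fitting procedure, the synthetic noise-free samples, the coefficient values, and the reading of the 25% threshold as "observed time exceeds predicted time by more than 25%". The paper may define each of these differently.

```python
from typing import List, Sequence, Tuple

Sample = Tuple[Sequence[int], float]  # (packed sequence lengths, measured seconds)


def features(seq_lens: Sequence[int]) -> Tuple[float, float]:
    """Return (N, S): total tokens and the sum of squared sequence lengths."""
    n = float(sum(seq_lens))
    s = float(sum(length * length for length in seq_lens))
    return n, s


def predict(seq_lens: Sequence[int], alpha: float, beta: float) -> float:
    n, s = features(seq_lens)
    return alpha * n + beta * s


def fit_alpha_beta(samples: Sequence[Sample]) -> Tuple[float, float]:
    """Least-squares fit of t ~= alpha * N + beta * S via the 2x2 normal equations."""
    snn = sns = sss = snt = sst = 0.0
    for seq_lens, t in samples:
        n, s = features(seq_lens)
        snn += n * n
        sns += n * s
        sss += s * s
        snt += n * t
        sst += s * t
    det = snn * sss - sns * sns
    if det == 0.0:
        raise ValueError("samples cannot separate the linear and quadratic terms")
    alpha = (snt * sss - sst * sns) / det
    beta = (sst * snn - snt * sns) / det
    return alpha, beta


def mape(predicted: List[float], observed: List[float]) -> float:
    """Mean absolute percentage error, in percent."""
    errors = [abs(p - o) / o for p, o in zip(predicted, observed)]
    return 100.0 * sum(errors) / len(errors)


def needs_validation(observed: float, predicted: float, threshold: float = 0.25) -> bool:
    """Assumed rule: flag when observed time exceeds the prediction by more than threshold."""
    return (observed - predicted) / predicted > threshold


# Synthetic, noise-free profile generated from hypothetical coefficients.
TRUE_ALPHA, TRUE_BETA = 1.0e-5, 2.0e-9
packings = [[8192], [1024] * 8, [4096, 4096], [2048] * 4, [512] * 4]
profile = [(p, predict(p, TRUE_ALPHA, TRUE_BETA)) for p in packings]

alpha_hat, beta_hat = fit_alpha_beta(profile)
fit_error = mape(
    [predict(p, alpha_hat, beta_hat) for p, _ in profile],
    [t for _, t in profile],
)

healthy = predict([8192], alpha_hat, beta_hat)
slow_flag = needs_validation(observed=1.5 * healthy, predicted=healthy)
normal_flag = needs_validation(observed=1.1 * healthy, predicted=healthy)
```

On real traces the fit would not be exact, and the choice of threshold trades missed slowdowns against spurious validations. That trade-off is what a sensitivity analysis would need to quantify.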

3. Potential Impact

Practical relevance is high. Large-scale LLM training is among the most expensive computational workloads today. Meta reported 178,000 wasted GPU hours during OPT-175B training, and recent measurements show 59.2% of large training jobs encounter fail-slow failures. Any system that meaningfully reduces failure-induced waste has substantial economic impact.

The 1.04–4.39× throughput improvement is significant, particularly the ability to sustain training under the aggressive 30-minute failure frequency where all baselines abort. This resilience to cascading failures is practically important for multi-week training runs.

Broader influence: The workload-aware detection approach could influence monitoring systems beyond LLM training — any distributed workload with input-dependent execution variability faces similar false-positive challenges. The progressive adaptation framework across parallelism dimensions provides a template for future resilient distributed systems.

Limitations on generalizability: The system is implemented on a specific internal training framework (similar to Megatron-LM). While the authors claim portability, the actual implementation effort for other frameworks (e.g., DeepSpeed, FSDP-based systems) is unstated. The restriction to power-of-two TP degrees limits flexibility, though this is a practical constraint of current GPU communication libraries.

4. Timeliness & Relevance

This work is highly timely. As LLM training scales to tens of thousands of GPUs (MegaScale reports 10,000+ GPU training), failure management is a recognized bottleneck. The paper directly addresses limitations of very recent systems (Greyhound from USENIX ATC 2025, Adaptra from 2025), positioning it at the frontier of this rapidly evolving space.

The observation that sequence length variability confounds fail-slow detection is novel and practically important, especially as training moves toward longer contexts (32K–128K tokens) where attention cost variability becomes more pronounced.

5. Strengths & Limitations

Key strengths:

Clean problem decomposition: detection → TP adaptation → PP adaptation → DP adaptation follows the natural failure propagation path

The selective device exclusion within TP groups (rather than discarding entire groups) is an important practical contribution that prior work overlooked

P2P communication optimization for heterogeneous TP degrees is a necessary engineering contribution that makes the theoretical framework practical

Comprehensive evaluation across model families, sizes, failure types, and severities

Notable weaknesses:

No evaluation on truly large-scale clusters (256 GPUs is substantial but small relative to production deployments of 10,000+ GPUs). Scalability claims are not fully validated.

Silent data corruption (SDC) is acknowledged but not addressed, a significant gap given recent attention to this failure mode.

The system assumes persistent fail-slow failures; transient or oscillating degradation patterns are not evaluated.

Reconfiguration overhead (particularly communication group reconstruction and layer transfer) could be problematic at larger scales, though bounded to ~2s in current experiments.

The paper does not discuss interaction with checkpointing strategies, which is essential for practical deployment.

Overall Assessment

ResiHP represents a solid systems contribution that advances the state of resilient LLM training through a well-motivated combination of workload-aware detection and progressive hybrid-parallel adaptation. The experimental evaluation is thorough within its scope, though larger-scale validation would strengthen claims. The work addresses a real and growing need in the LLM training ecosystem.

Rating: 7/10

Significance 7.5 · Rigor 6.8 · Novelty 6.5 · Clarity 7.5

Generated May 8, 2026

Comparison History (30)

Won vs. Six Times to Spare: LDPC Acceleration on DGX Spark for AI-Native Open RAN

ResiHP addresses a fundamental and growing challenge in large-scale LLM training—hardware failure resilience across thousands of GPUs—which is highly relevant given the explosive growth of LLM training infrastructure. It introduces novel techniques (workload-aware failure detection, dynamic hybrid parallelism adaptation) with broad applicability. Paper 1, while practically useful, is primarily a benchmarking study of LDPC offloading on a specific NVIDIA platform using existing tools (Sionna), offering incremental engineering insights rather than fundamental methodological contributions. ResiHP's broader impact across the AI/systems community and its novelty give it higher scientific impact potential.

claude-opus-4-6 · May 8, 2026

Won vs. Six Times to Spare: LDPC Acceleration on DGX Spark for AI-Native Open RAN

Paper 1 addresses a critical bottleneck in large-scale LLM training by introducing a novel system to accurately detect and dynamically mitigate hardware failures and performance skew. Its methodological rigor is demonstrated through substantial throughput improvements (up to 4.39x) on a large 256-GPU cluster. In contrast, Paper 2 is primarily an empirical benchmarking study of existing software (NVIDIA Sionna) on new hardware for 5G Open RAN. Due to the massive scale, high cost, and broad relevance of foundational AI training, Paper 1 offers significantly higher novelty and broader potential scientific impact.

gemini-3.1-pro-preview · May 8, 2026

Won vs. Incidence Constraints in Hypergraph Partitioning on GPU

Paper 2 likely has higher impact: it addresses a timely, high-stakes bottleneck in large-scale LLM training—resilience under hybrid parallelism at hundreds/thousands of GPUs—directly relevant to industry and large research labs. The Detector+Scheduler framework is broadly applicable across models, datasets with variable sequence lengths, and heterogeneous failure modes, with substantial throughput gains (up to 4.39×). Paper 1 is technically strong with impressive GPU speedups, but it targets a more specialized hypergraph partitioning variant and may have narrower cross-domain adoption compared to resilient LLM training infrastructure.

gpt-5.2 · May 8, 2026

Won vs. Incidence Constraints in Hypergraph Partitioning on GPU

ResiHP addresses a critical and timely problem in large-scale LLM training—hardware failure resilience across massive GPU clusters. Given the explosive growth of LLM training at scale, this work has broad practical relevance to industry and research. Its novel workload-aware failure detection and dynamic hybrid parallelism adaptation are methodologically rigorous, demonstrated on a 256-GPU cluster with significant throughput gains (up to 4.39×). Paper 2 addresses GPU-accelerated hypergraph partitioning with specific constraints, which is more niche. While it achieves impressive speedups, its narrower application scope and audience limit its broader scientific impact compared to Paper 1's relevance to the rapidly growing LLM training ecosystem.

claude-opus-4-6 · May 8, 2026

Won vs. Proteus: Append-Only Ledgers for (Mostly) Trusted Execution Environments

ResiHP addresses a critical and timely problem in large-scale LLM training resilience, which is highly relevant given the explosive growth of LLM development across industry and academia. Its practical impact on training efficiency (1.04-4.39× throughput improvement) at scale (256+ GPUs) addresses a widespread pain point. While Proteus makes a solid contribution to distributed ledger consensus by combining CFT and BFT protocols for TEE environments, its scope is narrower. ResiHP's broader applicability to the rapidly expanding LLM training ecosystem and its novel workload-aware failure detection approach give it higher potential impact.

claude-opus-4-6 · May 8, 2026

Lost vs. Proteus: Append-Only Ledgers for (Mostly) Trusted Execution Environments

Paper 2 likely has higher scientific impact: it introduces a new consensus protocol that bridges CFT and embedded BFT “with no additional messages,” addressing a timely and broadly relevant trust gap in TEE-assisted ledgers. The contribution is more foundational, with applicability across distributed systems, security, and blockchain/ledger deployments, and directly tackles real-world risks of TEE compromise. Paper 1 is valuable and practical for large-scale LLM training resilience, but is more domain-specific (training infrastructure) and its gains may be bounded by evolving training stacks and hardware reliability improvements.

gpt-5.2 · May 8, 2026

Won vs. TimelyFreeze: Adaptive Parameter Freezing Mechanism for Pipeline Parallelism

ResiHP addresses a critical and increasingly important problem in large-scale LLM training: fault tolerance across massive GPU clusters. As training scales to tens of thousands of GPUs, hardware failures become inevitable, making resilient training systems essential. The paper tackles both detection and adaptation with a comprehensive system design, achieving significant throughput improvements (up to 4.39×). Paper 2's TimelyFreeze addresses pipeline bubble optimization through parameter freezing—a narrower problem with more incremental impact (up to 40% improvement). ResiHP's broader applicability to real-world large-scale training infrastructure and its systems-level contributions give it higher potential impact.

claude-opus-4-6 · May 8, 2026

Won vs. TimelyFreeze: Adaptive Parameter Freezing Mechanism for Pipeline Parallelism

Paper 2 (ResiHP) likely has higher impact due to addressing a core bottleneck in real-world LLM training at scale: reliability and performance under frequent failures and skew in hybrid-parallel, multi-thousand-GPU settings. Its detector+scheduler framework is broadly applicable across models, datasets (sequence-length variability), and parallelism strategies, with large reported gains (up to 4.39×) and clear operational relevance for production clusters. Paper 1 is innovative and useful but targets a narrower optimization (pipeline bubbles via freezing) with smaller, more context-dependent gains and potential accuracy/validation complexity.

gpt-5.2 · May 8, 2026

Won vs. Location-Aware Dispersion on Anonymous Graphs

Paper 2 likely has higher impact: it tackles a timely, widely relevant problem (resilience and efficiency in large-scale LLM training) with clear real-world applicability in production GPU clusters. The proposed Detector+Scheduler addresses concrete gaps in prior systems (sequence-length-induced variance and hybrid-parallel skew) and reports substantial throughput gains on a sizable 256-GPU setup, suggesting methodological rigor and practical validation. Its potential breadth spans ML systems, distributed computing, and HPC operations. Paper 1 is novel within distributed robotics/anonymous graphs but is more specialized and likely narrower in immediate adoption.

gpt-5.2 · May 8, 2026

Won vs. A Scalable Digital Twin Framework for Energy Optimization in Data Centers

ResiHP addresses a critical and timely challenge in large-scale LLM training—hardware failure resilience across massive GPU clusters. Its novel workload-aware failure detection and dynamic hybrid parallelism adaptation are directly relevant to the rapidly growing field of LLM training infrastructure. The 1.04-4.39× throughput improvements on a 256-GPU cluster demonstrate significant practical impact. Paper 1 proposes a digital twin framework for data center energy optimization but is evaluated only in a small-scale constrained environment, limiting its demonstrated impact. Paper 2's methodology is more rigorous, its problem more timely, and its potential industry impact broader.

claude-opus-4-6 · May 8, 2026
