Chenhan Jin, Shengze Xu, Qingsong Wang, Fan Jia, Dingshuo Chen, Tieyong Zeng
Data pruning (DP), as an oft-stated strategy to alleviate heavy training burdens, reduces the volume of training samples according to a well-defined pruning method while striving for near-lossless performance. However, existing approaches, which commonly select highly informative samples, can lead to biased gradient estimation compared to full-dataset training. Furthermore, the analysis of this bias and its impact on final performance remains ambiguous. To address these challenges, we propose OrderDP, a plug-and-play framework that aims to obtain stable, unbiased, and near-lossless training acceleration with theoretical guarantees. Specifically, OrderDP first randomly selects a subset and then chooses the top-ranked samples from it, where unbiasedness is established with respect to a surrogate loss. This ensures that OrderDP conducts unbiased training in terms of the surrogate objective. We further establish convergence and generalization analyses, elucidating how OrderDP affects optimal performance and enables well-controlled acceleration while ensuring guaranteed final performance. Empirically, we evaluate OrderDP against comprehensive baselines on CIFAR-10, CIFAR-100, and ImageNet-1K, demonstrating competitive accuracy, stable convergence, and exact control, all with a simpler design and faster runtime, while reducing training cost by over 40%. Delivering both strong performance and computational efficiency, our method serves as a robust and easily adaptable tool for data-efficient learning. The code is publicly available at https://github.com/shengze-xu/OrderDP.
OrderDP addresses a well-known problem in data pruning: existing dynamic methods that select informative samples introduce biased gradient estimation relative to full-dataset training, leading to instability and performance degradation, especially under aggressive pruning. The key insight is to reformulate the biased pruning process as unbiased optimization of a *surrogate loss*. The method is simple: at each iteration, uniformly sample a set of candidates from the dataset, then retain the top-ranked samples by loss. The surrogate loss is defined via combinatorial weights that depend only on , and the paper proves that the gradient estimator is unbiased with respect to this surrogate. Convergence and generalization bounds are then established, showing the gap between surrogate and true loss is controlled and vanishes as .
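The two-step selection described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `select_batch`, its parameters, and the choice of sampling without replacement are assumptions made here, and per-sample losses are taken as a precomputed list.

```python
import random


def select_batch(losses, num_candidates, num_kept, rng=random):
    """Return indices of the highest-loss samples within a random candidate set.

    losses         -- per-sample loss values (hypothetical, precomputed input)
    num_candidates -- size of the uniformly sampled candidate set
    num_kept       -- number of candidates retained for the update
    rng            -- the random module or a random.Random instance
    """
    if not 0 < num_kept <= num_candidates <= len(losses):
        raise ValueError("need 0 < num_kept <= num_candidates <= len(losses)")

    # Step 1: draw candidates uniformly, without replacement (an assumption).
    candidates = rng.sample(range(len(losses)), num_candidates)

    # Step 2: rank the candidates by loss, largest first, and keep the top ones.
    candidates.sort(key=lambda i: losses[i], reverse=True)
    return candidates[:num_kept]


# Example call with made-up loss values.
example_losses = [0.1, 2.3, 0.7, 1.5, 0.05, 0.9]
kept = select_batch(example_losses, num_candidates=4, num_kept=2,
                    rng=random.Random(0))
```

Only the retained indices would feed the gradient step, so the backward-pass cost scales with `num_kept` rather than with the candidate count.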
The contribution is both conceptual and practical. Conceptually, it provides a clean theoretical lens through which to understand the bias-performance tradeoff in data pruning. Practically, it delivers a plug-and-play method that achieves near-lossless accuracy with 40%+ training cost reduction.
The theoretical analysis is well-structured and builds on established tools from stochastic optimization and spectral risk measures.
A notable limitation is that Theorem 3 requires convexity, which does not hold for deep networks. The authors acknowledge this implicitly by providing empirical validation (Appendix D.3), but the gap between theory and practice is non-trivial. The convergence analysis is essentially inherited from standard SGD theory and does not provide novel insights specific to the non-convex setting where deep learning operates.
The experimental evaluation is thorough: CIFAR-10/100 and ImageNet-1K, multiple architectures (ResNet-18/50, Swin-T, ViT-B), multiple optimizers (SGD, AdamW, LARS, LAMB), and comprehensive ablations. The comparison against 15+ static and 4+ dynamic baselines is commendable. The stability analysis (Table 9, Jaccard similarity in Table 8) and gradient direction analysis (Table 11) provide convincing empirical support.
Practical impact: OrderDP's simplicity (no architectural changes, no auxiliary networks, no annealing) and plug-and-play nature make it immediately deployable. The 40%+ compute savings on ImageNet with no accuracy loss are practically significant. Compatibility with multiple optimizers and architectures (including ViTs) enhances its applicability.
Theoretical impact: The surrogate loss framework provides a principled way to analyze and control pruning bias. The connection to spectral risk measures and ordered statistics opens pathways for designing new pruning strategies with different weight structures, as the authors note in their limitations section.
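To make the idea of a weight structure concrete, the sketch below computes one plausible set of rank weights. It rests on assumptions of this example that are not confirmed by the text above: m candidates are drawn uniformly without replacement from n samples, the k highest-loss candidates are kept, and their gradients are averaged with equal weight 1/k. Under those assumptions, the sample with loss rank i (rank 1 is the largest loss) is kept with a hypergeometric probability, and dividing that probability by k gives its weight in the expected update. The names `inclusion_weights`, `n`, `m`, and `k` are hypothetical, and the paper's own weights may be defined differently.

```python
from math import comb


def inclusion_weights(n, m, k):
    """Rank weights for 'sample m uniformly, keep the top k by loss'.

    Returns a list w where w[i - 1] is the probability that the sample with
    loss rank i (1 = largest loss) is kept, divided by k.
    """
    if not 0 < k <= m <= n:
        raise ValueError("need 0 < k <= m <= n")

    # Number of ways to pick the other m - 1 candidates once a given
    # sample is already in the candidate set.
    total = comb(n - 1, m - 1)
    weights = []
    for rank in range(1, n + 1):
        higher = rank - 1  # samples with a larger loss
        lower = n - rank   # samples with a smaller loss
        # The sample survives if at most k - 1 higher-loss samples
        # are also among the candidates.
        ways = sum(comb(higher, j) * comb(lower, m - 1 - j) for j in range(k))
        p_kept = (m / n) * (ways / total)
        weights.append(p_kept / k)
    return weights


w = inclusion_weights(n=10, m=5, k=2)
```

In this sketch the weights sum to 1 and are non-increasing in rank, so higher-loss samples count more. When k equals m they collapse to the uniform weight 1/n, which corresponds to the ordinary average loss.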
Broader applicability: While demonstrated only on image classification, the framework is general enough to extend to NLP, speech, and other modalities. The theoretical guarantees are architecture-agnostic.
Data efficiency is a pressing concern given the scaling of modern deep learning. This work addresses a genuine bottleneck: how to prune training data without introducing optimization instability. The paper is timely in the context of growing training costs and the push toward efficient training methods. The fact that it was accepted at ICLR 2026 confirms its relevance to the community.
The empirical finding that gradient norm strongly correlates with test accuracy (measured by Pearson's correlation coefficient) is interesting but not deeply explored theoretically. The connection between OrderDP's stability and this correlation could be developed further. The ablation on the exploration vs. exploitation decomposition (Figure 4) is a nice practical contribution, showing that the method is robust to the specific decomposition as long as the effective prune ratio is fixed.
Generated Jun 9, 2026
OrderDP addresses the broadly applicable problem of data pruning with theoretical guarantees (convergence, generalization, unbiasedness), making it relevant across many training scenarios and datasets. Its plug-and-play nature and strong theoretical foundations give it wider applicability beyond a single domain. FlowBP, while technically solid, is narrowly focused on reward backpropagation for text-to-image flow matching models, a more niche area. OrderDP's 40%+ training cost reduction with near-lossless performance, theoretical rigor, and cross-domain applicability suggest broader scientific impact.
Paper 1 investigates the underlying mechanisms of task vectors, LoRA, and activation steering in LLMs, which are critical areas in modern AI research. By challenging the fixed-task-plane hypothesis and providing new theoretical frameworks for local linear geometries, it offers profound insights that bridge mechanistic interpretability and efficient fine-tuning. While Paper 2 presents a rigorous and useful data pruning method, Paper 1's focus on the internal workings of large language models gives it higher timeliness, novelty, and potential to shape future research in model control and alignment.
Paper 2 introduces a highly timely and novel benchmark for personal autonomous agents, addressing a critical gap in cross-app reasoning and personalization. Benchmarks in rapidly growing areas like device agents typically drive significant future research and have broad real-world applicability. While Paper 1 offers strong theoretical contributions to training efficiency, Paper 2's potential to shape the development of personal AI assistants gives it a broader and higher potential scientific impact.
Paper 2 has higher potential scientific impact due to its broader cross-field relevance (mechanistic interpretability, geometry, computational neuroscience) and timeliness amid intense interest in understanding LLM internals. Its probe-free, model-agnostic trajectory-geometry framework and multiple controlled findings across several transformer families could become a widely used analysis paradigm, enabling downstream applications in interpretability, debugging, safety, and model design. Paper 1 is rigorous and practically valuable for training efficiency, but data pruning is a more mature area and the impact is likely narrower to training optimization compared with a general interpretability lens.
Paper 2 offers a foundational machine learning framework with theoretical guarantees for data pruning. Its ability to reduce training costs by over 40% while maintaining performance provides broad, cross-disciplinary impact applicable to any field utilizing deep learning. While Paper 1 is highly innovative and valuable for meteorology, Paper 2's methodological rigor and universal applicability give it higher potential for widespread scientific adoption and impact.
OrderDP addresses a fundamental problem in machine learning (data pruning) with theoretical guarantees including convergence and generalization analyses, applicable broadly across domains. It demonstrates strong empirical results on standard benchmarks (CIFAR-10/100, ImageNet-1K) with 40%+ training cost reduction while maintaining near-lossless performance. Its plug-and-play design and public code enhance adoptability. Paper 1, while practically useful, addresses a narrower niche (Text-to-Cypher benchmarking for enterprise graph databases) with more limited cross-field impact and less theoretical contribution.
Paper 1 addresses the universally critical challenge of high training costs in machine learning through data pruning. By offering a plug-and-play framework with strong theoretical guarantees (unbiasedness, convergence) and demonstrating over 40% cost reduction on major datasets without performance loss, it presents a highly practical and broadly applicable solution. Paper 2's focus on conformal risk control for selective predictors is mathematically rigorous and important for AI safety, but its applications are more specialized, making Paper 1 likely to have a broader and more immediate impact across the field.
Paper 1 addresses a critical bottleneck in modern LLM development (long-context RL and chain-of-thought training) by introducing a dynamic sparse attention schedule. Given the massive computational costs of LLM training, this approach offers highly relevant and immediately applicable real-world benefits. While Paper 2 provides rigorous theoretical guarantees for data pruning, it primarily evaluates on traditional vision benchmarks, whereas Paper 1's focus on scalable LLM RL has greater timeliness and broader impact across the current AI landscape.
Paper 2 likely has higher scientific impact: it proposes a practical, plug-and-play training-acceleration method with explicit theoretical guarantees (unbiasedness w.r.t. a surrogate loss, convergence and generalization bounds) and demonstrates sizable compute savings (>40%) on widely used benchmarks up to ImageNet-1K, making it immediately actionable across many training pipelines. Paper 1 is novel and timely in framing LLM decoding as critical phenomena, but its impact may be limited by methodological ambiguity (non-equilibrium generation, interpretation of “phase transition”) and less direct real-world utility compared to a deployable efficiency framework.
Paper 2 has higher likely impact: it addresses continual learning for LLMs—a timely, high-demand problem with broad relevance to NLP, personalization, and on-device/enterprise model updating. Its sparse expert/subspace sharing with routing-aware regularization offers a modular, scalable mechanism that can generalize across tasks and domains, potentially influencing MoE training, parameter-efficient tuning, and lifelong learning. Paper 1 is rigorous and useful for training efficiency, but it is more incremental within data pruning and likely narrower in cross-field adoption compared to a task-agnostic continual learning framework for large foundation models.