
Unifying Data, Memory, and Compute Efficiency in LLM training: A Survey

Vanessa Schmidt, Huy Hoang Nguyen, Cédric Jung, Shirin Salehi, Anke Schmeink

cs.LG, cs.AI

Rank: #2633 of 5669 · cs.LG
Tournament Score: 1410 ± 43
Win Rate: 57% (13 wins, 10 losses, 23 matches)
Rating: 5.5 / 10
Significance 5.5 · Rigor 5 · Novelty 5 · Clarity 7

Abstract

Resource constraints increasingly determine what can be trained, fine-tuned, and deployed in large language models (LLMs), yet efficiency is often studied through isolated techniques rather than as an interacting system of limits. This survey adopts a constraint-centric perspective and organizes recent progress around three coupled bottlenecks: data efficiency (what to train on), memory efficiency (how to fit training), and compute budget awareness (when and where to spend FLOPs). On the data axis, we review selection and pruning methods that maximize learning per token, ranging from scalable proxy signals based on learning dynamics to gradient- and influence-based scoring, as well as difficulty-aware and curriculum-style strategies. We highlight emerging evidence that different notions of good data dominate in different regimes, implying that optimal subsets depend on the task objective and resource budget rather than being universal. On the systems side, we show that GPU memory, not raw compute, is often the dominant bottleneck in fine-tuning, and that effective scaling requires jointly reducing weight storage, optimizer states, and activation memory rather than optimizing any single component in isolation. Beyond memory, we frame training and inference as compute-governed processes in which optimization, data selection, and decoding must explicitly account for finite FLOP budgets. We review evidence for compute-optimal allocation and stopping rules, where computation should be halted or reallocated once marginal performance gains fall below a budget-dependent threshold. Together, these results unify compute-aware data selection, scaling laws, and adaptive inference under a common principle of resource-conditioned decision-making.

AI Impact Assessments

(1 model)

Scientific Impact Assessment

1. Core Contribution

This survey proposes a constraint-centric lifecycle framework that organizes LLM efficiency literature around three coupled bottlenecks: data efficiency ("what to train on"), memory efficiency ("how to fit training"), and compute budget awareness ("when and where to spend FLOPs"). The central thesis is that these three dimensions form an interacting system rather than independent optimization targets, and that optimizing one dimension in isolation merely shifts the bottleneck elsewhere.

The paper's most distinctive conceptual contributions are: (a) the "compute governor" formalism—a control policy π(S_t, B_t) → a_t that maps system state and remaining budget to continue/reallocate/stop decisions based on marginal gain per FLOP; (b) identification of the "Static-to-Dynamic Gap" in data selection, arguing that leading methods like LESS remain predominantly static and that truly adaptive influence estimation during training is a critical open problem; and (c) the marginal utility unification, connecting data filtering, parameter updates, and compute allocation under a shared principle of maximizing performance gain per unit of constrained resource.
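To make the governor idea in (a) concrete, here is a minimal Python sketch of a policy that maps a marginal-gain signal and the remaining budget to one of the three actions. It is an illustration only, not the paper's Eq. 29-30: the function name `governor_step`, the `Action` enum, the fixed `threshold`, and the optional `alt_gain_per_flop` argument (gain available from some other use of the budget) are all assumptions made for this example.

```python
from enum import Enum
from typing import Optional


class Action(Enum):
    CONTINUE = "continue"
    REALLOCATE = "reallocate"
    STOP = "stop"


def governor_step(
    gain_per_flop: float,
    remaining_flops: float,
    step_flops: float,
    threshold: float,
    alt_gain_per_flop: Optional[float] = None,
) -> Action:
    """Hypothetical governor policy: (gain signal, remaining budget) -> action."""
    # Budget exhausted: the next step cannot be paid for.
    if remaining_flops < step_flops:
        return Action.STOP
    # The current activity still clears the marginal-gain bar.
    if gain_per_flop >= threshold:
        return Action.CONTINUE
    # The current activity has saturated, but an alternative still clears the bar.
    if alt_gain_per_flop is not None and alt_gain_per_flop >= threshold:
        return Action.REALLOCATE
    # Nothing is worth the remaining FLOPs.
    return Action.STOP
```

In a budget-dependent variant, `threshold` would be recomputed from `remaining_flops` before each call. How to estimate `gain_per_flop` reliably is left open here.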

2. Methodological Rigor

As a survey, rigor is assessed by coverage, taxonomy quality, and analytical depth rather than experimental validation.

Coverage is solid for post-2022 work across data selection (LIMA, GraNd/EL2N, S2L, STAFF, LESS, GREATS, BIDS, DART, IFD), memory efficiency (CoLM, Addax, HiFT, BAdam, QLoRA, DQT, PEQA, SubZero, LOZO), and compute governance (Chinchilla scaling laws, CADS, speculative decoding, MoE, Mixture-of-Depths). The taxonomy in Figures 3-6 is well-structured.

Analytical depth is a strength. The paper goes beyond mere cataloging: it decomposes memory into M_θ + M_O + M_A (Eq. 15) and maps each method to specific terms; it provides mathematical formulations for key methods (GraNd, EL2N, Adam Influence, GREATS Taylor expansion, BAdam memory equation); and Table I offers a useful engineering comparison across methods. The discussion of noise dynamics when stacking DQT with ZO estimators (Section IV-E) demonstrates genuine cross-method analysis.
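The decomposition M_θ + M_O + M_A can be illustrated with a back-of-the-envelope estimator. The sketch below is not taken from the paper. Its defaults are assumptions: 2 bytes per weight (16-bit storage) and 8 bytes per parameter of optimizer state (two fp32 Adam moments). Gradient buffers are omitted for simplicity, and `estimate_memory` and `MemoryEstimate` are hypothetical names.

```python
from dataclasses import dataclass

GB = 1e9  # decimal gigabytes, chosen to keep the arithmetic simple


@dataclass(frozen=True)
class MemoryEstimate:
    weights_gb: float      # M_theta: model weights
    optimizer_gb: float    # M_O: optimizer states
    activations_gb: float  # M_A: stored activations

    @property
    def total_gb(self) -> float:
        return self.weights_gb + self.optimizer_gb + self.activations_gb


def estimate_memory(
    n_params: float,
    bytes_per_weight: float = 2.0,           # assumption: 16-bit weights
    optimizer_bytes_per_param: float = 8.0,  # assumption: two fp32 Adam moments
    activation_bytes: float = 0.0,           # supplied by the caller; workload dependent
) -> MemoryEstimate:
    """Rough per-term memory estimate; ignores gradients and framework overhead."""
    return MemoryEstimate(
        weights_gb=n_params * bytes_per_weight / GB,
        optimizer_gb=n_params * optimizer_bytes_per_param / GB,
        activations_gb=activation_bytes / GB,
    )
```

Under these assumptions, changing one argument at a time shows which term a method targets: weight quantization lowers `bytes_per_weight`, while methods that shrink or remove optimizer state lower `optimizer_bytes_per_param`.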

Weaknesses in rigor: The compute governor formalization (Eq. 29-30) remains largely conceptual. The case study in Section V-A illustrates the framework but provides no empirical validation, ablation, or even simulation. The marginal gain signal G_t is defined but its estimation in practice is hand-waved. The proposed "research roadmap" (drift-aware refresh schedules, damped governor updates, etc.) lacks specificity about feasibility. Additionally, some claimed cross-pillar interactions lack quantitative evidence—statements like "optimizing one dimension merely shifts the bottleneck" would benefit from concrete measurements.

3. Potential Impact

The paper could serve as a useful reference and conceptual guide for practitioners navigating efficiency trade-offs when fine-tuning LLMs under resource constraints. Table I and the decision framework (Figures 1, 4, 6) have practical value. The identification of the Static-to-Dynamic Gap may stimulate research on dynamic influence estimation with memory-efficient approximations.

However, the impact is limited by the lack of empirical grounding for the proposed unified framework. Without demonstrating that the compute governor actually improves resource allocation in practice, the framework risks remaining an organizational metaphor rather than an actionable system. The field already has several efficiency surveys (Bai et al., 2024, cited as [1]), and this paper's differentiation depends heavily on the lifecycle/governor framing proving useful beyond conceptual elegance.

The edge deployment angle is repeatedly invoked but never substantiated with edge-specific experiments or case studies, weakening this claimed application domain.

4. Timeliness & Relevance

The paper addresses a genuinely pressing need. As LLM training costs escalate and democratization of fine-tuning becomes increasingly important, a unified view of efficiency trade-offs is valuable. The timing is appropriate—enough individual efficiency methods now exist (2022-2025) to warrant synthesis. The data-constrained scaling perspective is particularly timely given emerging concerns about data availability.

The coverage of very recent work (ICLR 2025, ICML 2025, NeurIPS 2025 papers) demonstrates currency, though some methods cited are still preprints without peer review.

5. Strengths & Limitations

Key Strengths:

  • Well-organized taxonomy with clear visual aids (Figures 1-6)
  • Mathematical precision in presenting individual methods, enabling side-by-side comparison
  • Table I provides actionable engineering guidance
  • The cross-bottleneck analysis (Table II) and noise-dynamics discussion when combining methods are genuinely insightful
  • The Static-to-Dynamic Gap is a well-articulated research direction
  • Good treatment of the "pay-back threshold" concept from Yin et al., connecting data selection cost to training benefit

Notable Limitations:

  • The compute governor is conceptual only—no implementation, simulation, or empirical validation
  • The "unifying principle" of marginal utility per resource is sensible but not formalized rigorously enough to generate testable predictions
  • Limited discussion of distributed training (ZeRO, FSDP are mentioned briefly in one paragraph)
  • Missing coverage of important efficiency areas: mixture of experts during training, knowledge distillation as an efficiency mechanism, architecture search/pruning
  • The edge deployment narrative is aspirational without substantiation
  • No systematic comparison methodology—methods in Table I are acknowledged as non-comparable across setups
  • The paper would benefit from a concrete worked example showing how the three pillars interact quantitatively (e.g., a Pareto analysis)

6. Additional Observations

    The paper is well-written but lengthy (21 pages). Some mathematical detail for individual methods (e.g., full derivation of GREATS Taylor expansion) may be excessive for a survey, while the novel contributions (governor, roadmap) receive comparatively less formal development. The distinction between "feasibility" and "optimality" is useful but could be developed more systematically. The paper would be substantially strengthened by even a small-scale empirical demonstration of the governor concept.

    Rating: 5.5 / 10
    Significance 5.5 · Rigor 5 · Novelty 5 · Clarity 7

    Generated Jun 10, 2026

    Comparison History (23)

    Won vs. Navigating the Safety-Fidelity Trade-off: Massive-Variate Time Series Forecasting for Power Systems via Probabilistic Scenarios

    Paper 1 provides a comprehensive survey unifying three critical bottlenecks in LLM training efficiency—data, memory, and compute—under a constraint-centric framework. Given the massive interest in LLM efficiency across academia and industry, this survey has broad applicability and timeliness. Paper 2, while technically solid in introducing a new benchmark and method for power system forecasting, addresses a more niche domain. The survey's potential to shape research directions across the entire LLM training ecosystem gives it substantially broader impact across multiple fields and larger audience reach.

    claude-opus-4-6 · Jun 12, 2026

    Lost vs. Bootstrapped Monitoring: Leveraging Transparent Reasoning to Oversee Stronger AI Agents

    Paper 2 is more likely to have higher scientific impact: it proposes a novel oversight protocol (bootstrapped monitoring) addressing a timely, high-stakes problem in AI safety/control, with direct real-world applicability to deploying stronger agents. It includes an evaluative methodology on a concrete benchmark and considers adversarial collusion, increasing rigor and relevance. Paper 1 is a valuable unifying survey, but surveys typically have less transformative impact than new, empirically tested mechanisms, and its contributions are primarily organizational rather than introducing a new technique.

    gpt-5.2 · Jun 11, 2026

    Lost vs. Implicit Neural Representations of Individual Behavior

    Paper 1 is an original methodological contribution: it adapts implicit neural representations to learn latent policy identities from unlabeled multi-policy behavioral data, introduces a principled generative prior over policies, and proposes policy-level OOD shift axes. It is evaluated across diverse synthetic, simulated, and real-world domains, suggesting strong rigor and cross-domain applicability in robotics, games, and sequential decision-making. Paper 2, while timely and broadly useful, is a survey (synthesizing rather than creating new techniques), so its novelty and direct scientific advance are typically lower than a solid new model + problem formulation.

    gpt-5.2 · Jun 11, 2026

    Lost vs. Harness In-Context Operator Learning with Chain of Operators

    Paper 2 introduces a highly novel, cross-disciplinary approach by adapting LLM prompting concepts (Chain of Thought) to neural operators (Chain of Operators) for solving PDEs. This methodological innovation offers significant improvements in out-of-distribution generalization without retraining, presenting immense potential for real-world applications in physics and engineering. Paper 1, while highly relevant and timely, is a survey that systematizes existing knowledge rather than introducing a breakthrough novel methodology.

    gemini-3.1-pro-preview · Jun 11, 2026

    Won vs. Using Explainability as a Training-Time Reliability Signal for Efficient ECG Classification

    Paper 1 is a comprehensive survey that unifies three major bottlenecks in LLM training—data, memory, and compute efficiency—under a novel constraint-centric framework. Its breadth of impact is far greater, as it addresses the entire LLM training ecosystem and provides a conceptual unification (resource-conditioned decision-making) relevant across many research communities. Paper 2 presents a narrower contribution—using explainability for data selection in ECG classification—which, while novel and useful, impacts a more limited audience. The survey's timeliness given the explosive growth of LLM research amplifies its potential citation impact and influence on future work.

    claude-opus-4-6 · Jun 11, 2026

    Won vs. Holding the FP8 Quality Ceiling at 8-Bit Weights and Activations: INT8 and GGUF Post-Training Quantization of Ideogram 4.0 for Consumer GPUs

    Paper 2 likely has higher scientific impact: it proposes a unifying framework across data, memory, and compute efficiency in LLM training, synthesizing diverse methods and connecting scaling laws, budget-aware training, and adaptive inference—broadly applicable across academia and industry. Its breadth and timeliness (efficiency as a central constraint) increase cross-field reach and citation potential. Paper 1 is methodologically detailed and practically useful for diffusion model quantization on consumer GPUs, but it is narrower (model/hardware-specific) and more incremental relative to existing quantization literature.

    gpt-5.2 · Jun 11, 2026

    Won vs. Overcoming Rank Collapse in Feedback Alignment

    The survey on unifying data, memory, and compute efficiency in LLM training addresses a broadly impactful topic at the center of current AI research. Its constraint-centric framework synthesizing data efficiency, memory optimization, and compute budgeting for LLMs has wide applicability across industry and academia. While Paper 2 presents interesting mechanistic insights about feedback alignment's rank collapse and proposes remedies, it addresses a more niche problem (biologically plausible learning) with limited practical adoption compared to backpropagation. The LLM efficiency survey's timeliness, breadth, and practical relevance give it higher potential impact.

    claude-opus-4-6 · Jun 10, 2026

    Lost vs. Sharp First-Order Lower Bounds for Higher-Order Smooth Nonconvex Optimization

    Paper 1 resolves a longstanding open problem in optimization theory by proving matching lower bounds for higher-order smooth nonconvex optimization, completing the complexity picture for an important class of problems. This is a definitive theoretical contribution with lasting impact. Paper 2 is a survey that organizes existing work on LLM training efficiency under a unified framework, which is useful but inherently synthesizes rather than creates new knowledge. The sharp, novel theoretical result in Paper 1 is more likely to be cited as a foundational reference and influence future algorithmic work across optimization and machine learning.

    claude-opus-4-6 · Jun 10, 2026

    Won vs. How Much Capacity Does EEG Denoising Need? Ultra-Compact Networks reveal Benchmark Saturation and Metric-Utility Gap

    While Paper 1 presents rigorous, paradigm-shifting empirical findings for EEG denoising, Paper 2 addresses a universally urgent bottleneck in AI: LLM training efficiency. As a unifying survey in a rapidly expanding and resource-intensive field, Paper 2 has a broader target audience, higher potential for widespread cross-disciplinary citations, and immediate relevance to both academic researchers and industry practitioners optimizing large-scale AI systems.

    gemini-3.1-pro-preview · Jun 10, 2026

    Won vs. A Unified Framework for Locality in Scalable MARL

    Paper 1 provides a comprehensive survey unifying three critical bottlenecks in LLM training—data, memory, and compute efficiency—under a resource-constrained framework. Given the enormous and growing interest in LLM efficiency across academia and industry, this survey addresses an extremely timely topic with broad practical impact. Paper 2 makes a solid theoretical contribution to scalable multi-agent RL with tighter locality bounds, but its impact is narrower, targeting a more specialized community. The breadth of applicability, timeliness, and practical relevance of Paper 1 give it significantly higher potential impact.

    claude-opus-4-6 · Jun 10, 2026