
Toward Compiler World Models: Learning Latent Dynamics for Efficient Tensor Program Search

Haolin Pan, Lianghong Huang, Xvlin Zhou, Mingjie Xing, Yanjun Wu

cs.LG · cs.PL
#1625 of 5669 · cs.LG
Tournament Score: 1448 ± 41
Win Rate: 63% (20 wins, 12 losses, 32 matches)
Rating: 6.8 / 10
Significance 7 · Rigor 6.5 · Novelty 7.5 · Clarity 7.5

Abstract

Tensor program optimization is essential for modern machine learning systems, but its search space is enormous. Existing auto-schedulers reduce measurement cost with learned cost models, yet they usually evaluate each candidate as a static code snapshot, ignoring the schedule trajectory that produced it. This makes them insensitive to action dependencies and vulnerable to superficial code variations. We propose a world-model-inspired evaluator that models schedule evaluation as action-conditioned latent dynamics over program states. Starting from the initial program, it rolls out scheduling actions in a continuous latent space with a lightweight transition model, avoiding expensive AST mutation and repeated code encoding. The final dynamic representation is combined with action and hardware features to rank candidates. Implemented in TVM AutoScheduler, our method improves representative-subgraph latency over Ansor by 1.37× on GPU and 1.54× on CPU under the same 64-trial budget. It also matches Ansor-10K within 2.2% geometric mean using 10× fewer measurements, and accelerates full-model inference over PyTorch/PyTorch-opt(cuDNN) by 4.61×/3.67× geometric mean.

AI Impact Assessments

(1 model)

Scientific Impact Assessment

1. Core Contribution

The paper introduces a "compiler world model" framework that reformulates tensor program candidate evaluation as action-conditioned latent dynamics over program states, rather than treating each candidate schedule as a static code snapshot. The key insight is that scheduling transformations form a sequential process where later decisions depend on earlier ones, and this trajectory information is lost when only the final program text is evaluated. The framework consists of three components: (1) a contrastive-learning-based TensorIR encoder (CodeBERT-based), (2) a TransH-based multi-step latent state transition model that rolls out scheduling actions in representation space, and (3) an XGBoost-based ranking model that combines predicted terminal-state representations with action and hardware features.
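To make the second component concrete, here is a minimal sketch of a TransH-style multi-step rollout in latent space. It follows the standard TransH formulation from knowledge graph embedding (project the state onto an action-specific hyperplane, then translate). The paper's exact parameterization is not reproduced here, so the function names, the dimensions, and the random parameters are illustrative assumptions rather than the authors' implementation.

```python
import numpy as np

def transh_step(state, normal, translation):
    """One TransH-style transition (assumed form): project, then translate."""
    w = normal / np.linalg.norm(normal)        # unit normal of the action's hyperplane
    projected = state - np.dot(w, state) * w   # drop the component along the normal
    return projected + translation             # move by the action's translation vector

def rollout(initial_state, actions, normals, translations):
    """Apply a sequence of action ids to a latent state, one step per action."""
    state = initial_state
    for a in actions:
        state = transh_step(state, normals[a], translations[a])
    return state

# Hypothetical setup: 3 scheduling action types, 8-dimensional latent states.
rng = np.random.default_rng(0)
dim, num_actions = 8, 3
normals = rng.normal(size=(num_actions, dim))
translations = rng.normal(size=(num_actions, dim))
z0 = rng.normal(size=dim)                      # stand-in for the encoded initial program

z_final = rollout(z0, [0, 2, 1], normals, translations)
z_other = rollout(z0, [1, 2, 0], normals, translations)
```

Because each step depends on the state produced by the previous one, the two action orders above generally end in different latent states. This is the trajectory sensitivity that a static snapshot evaluator does not model.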

The conceptual contribution—framing compiler optimization as a latent dynamics problem analogous to world models in RL—is the paper's most distinctive element. This bridges two previously disconnected research areas and provides a principled justification for why sequential state modeling should outperform static evaluation.

2. Methodological Rigor

Strengths: The experimental evaluation is thorough and multi-layered: end-to-end model latency, model-level weighted latency, per-subgraph analysis across 22 representative subgraphs, sample efficiency comparisons against large-budget Ansor, and a comprehensive ablation study. The evaluation spans both CPU (Intel Xeon Gold 6430) and GPU (RTX 4090), seven neural network models, and multiple budget regimes. The ablation study (Figure 9) convincingly demonstrates that the transition model is the most important component, with its removal causing median latency increases of 56.5-69%.

Concerns: Several methodological aspects warrant scrutiny. First, the TransH-based transition model is a relatively simple geometric translation mechanism borrowed from knowledge graph embedding. While it works empirically, there's limited analysis of why this particular architecture is appropriate or whether more sophisticated transition models would improve results. Second, the contrastive learning objective for the state encoder uses a somewhat unconventional setup where the positive is a stochastic augmentation of the same text—the paper doesn't thoroughly justify why this particular contrastive design captures performance-relevant program semantics. Third, the state encoder is frozen during transition model training, creating a potential information bottleneck. Fourth, the paper trains separate transition models for CPU and GPU, raising questions about generalization to new hardware targets.
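The contrastive setup raised in the second concern can be illustrated with a generic InfoNCE loss, in which each anchor's positive is an augmented view of the same input and the other items in the batch act as negatives. This is a standard formulation used as a stand-in; the paper's actual loss, temperature, and augmentation scheme are not specified here, and the embeddings below are random placeholders.

```python
import numpy as np

def info_nce(anchors, positives, temperature=0.1):
    """Generic InfoNCE: row i of `positives` is the positive for row i of `anchors`."""
    a = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    p = positives / np.linalg.norm(positives, axis=1, keepdims=True)
    logits = a @ p.T / temperature                       # cosine similarities, scaled
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))                  # matched pairs sit on the diagonal

rng = np.random.default_rng(0)
emb = rng.normal(size=(4, 16))                           # placeholder encoder outputs

# Positives that are small perturbations of the anchors give a low loss.
aligned = info_nce(emb, emb + 0.01 * rng.normal(size=emb.shape))
# Mismatched positives (rows reversed) give a much higher loss.
shuffled = info_nce(emb, emb[::-1])
```

The loss only rewards agreement between two views of the same text, which is why the review asks whether such an objective captures performance-relevant semantics.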

The training data construction from TenSet logs, while practical, means the model is bootstrapped from a specific distribution of schedules. The paper doesn't analyze how sensitive the approach is to distribution shift when the online search explores regions not well-covered by training data.

3. Potential Impact

The practical impact is significant for the tensor compilation community. The headline result—matching Ansor-10K quality within 2.2% geometric mean using only 1K trials—represents a 10× reduction in expensive hardware measurements. The end-to-end speedups over PyTorch (4.61× geomean) and PyTorch-opt with cuDNN (3.67× geomean) are substantial and practically relevant.
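For readers unfamiliar with how such summary figures are formed, the sketch below computes a geometric-mean speedup over per-workload latencies and the measurement reduction between two trial budgets. The latency values are made-up placeholders, not numbers from the paper; only the 10K and 1K trial budgets come from the text.

```python
import math

def geometric_mean(values):
    """Geometric mean via the mean of logs (all values must be positive)."""
    return math.exp(sum(math.log(v) for v in values) / len(values))

# Hypothetical per-workload latencies in milliseconds.
baseline_ms = [2.0, 8.0, 4.0]
method_ms = [1.0, 2.0, 4.0]

# Per-workload speedups are 2x, 4x, and 1x; their geometric mean is 2x.
ratios = [b / m for b, m in zip(baseline_ms, method_ms)]
geomean_speedup = geometric_mean(ratios)

# Measurement reduction between the two budgets named in the text.
reduction = 10_000 / 1_000
```

The geometric mean is the usual choice for averaging ratios because a 2× gain and a 2× loss cancel exactly, which an arithmetic mean of ratios would not preserve.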

The broader conceptual impact of treating compiler optimization as a world-model problem could influence how the community thinks about cost models in general. This perspective could extend to other compiler optimization domains (e.g., LLVM pass ordering, polyhedral optimization) where transformations are sequential and context-dependent. However, the instantiation is currently specific to TVM's AutoScheduler, limiting immediate transferability.

4. Timeliness & Relevance

This work addresses a genuine bottleneck in tensor compilation: the cost of hardware measurements during autotuning. As models grow larger and more diverse, and as hardware accelerators proliferate, reducing tuning cost becomes increasingly critical. The paper is timely in connecting world models (a trending concept in RL/planning) with compiler optimization, and it appears concurrently with other work on LLM-assisted and agent-driven kernel optimization, positioning it within an active research front.

5. Strengths & Limitations

Key Strengths:

  • Novel and well-motivated conceptual framing that connects world models to compiler optimization
  • Comprehensive evaluation across multiple dimensions (models, subgraphs, budgets, hardware targets)
  • Strong sample efficiency results (10× fewer measurements for comparable quality)
  • Practical integration into an existing production-quality system (TVM AutoScheduler)
  • Honest and detailed ablation study that isolates component contributions
  • The paper acknowledges that the evaluator cannot recover schedules not proposed by the search space

Notable Limitations:

  • The TransH transition model is relatively simple and may not scale well to longer or more complex scheduling trajectories; the paper acknowledges error accumulation but doesn't quantify it
  • No analysis of computational overhead of the learned evaluator relative to Ansor's default cost model
  • The approach is evaluated only on TVM/Ansor; generalization to other frameworks (e.g., Triton, MLIR-based systems) is unexplored
  • The comparison against TenSet in Table 3 shows mixed results on some subgraphs (e.g., r50-t0, r50-t26 where TenSet ratios are <0.4), suggesting the method may have blind spots
  • The dataset construction methodology, while described, doesn't discuss how trajectory coverage affects model quality
  • GPT-Neo is consistently an outlier where Ansor outperforms the proposed method at higher budgets, and this isn't deeply investigated
  • TensorRT comparison is somewhat superficial—included as reference but not deeply analyzed

Additional Observations:

    The paper's dataset contribution (TVM/TenSet-based state-prediction dataset with action-state trajectories) could be independently valuable for the community, though availability and release plans are not discussed. The choice of XGBoost for the final ranking stage, while practical, creates an interesting hybrid where deep learned representations feed into a gradient-boosted tree model—this design choice deserves more discussion regarding potential end-to-end alternatives.

    Rating: 6.8 / 10
    Significance 7 · Rigor 6.5 · Novelty 7.5 · Clarity 7.5

    Generated Jun 9, 2026

    Comparison History (32)

    Won vs. SupraBench: A Benchmark for Supramolecular Chemistry

    Paper 2 introduces a novel world-model-inspired approach to tensor program optimization that achieves significant practical speedups (4.61× geometric mean over PyTorch) with dramatically fewer measurements. This has immediate, broad impact across all ML systems requiring efficient compilation. Paper 1, while valuable as a benchmark for supramolecular chemistry LLM evaluation, serves a narrower community and primarily documents that LLMs underperform on these tasks rather than proposing a transformative solution. Paper 2's methodological innovation, modeling schedule evaluation as latent dynamics, is more broadly applicable and offers concrete performance gains.

    claude-opus-4-6·Jun 12, 2026
    Won vs. PolyFlow: Safe and Efficient Polytope-Constrained Flow Matching with Constraint Embedding and Projection-free Update

    Paper 1 introduces a highly novel application of world models to compiler optimization, addressing a critical bottleneck in ML systems. By drastically improving tensor program search efficiency and hardware execution speed, it offers broad, immediate impact across the entire machine learning ecosystem, potentially inspiring a new paradigm in auto-scheduling.

    gemini-3.1-pro-preview·Jun 12, 2026
    Won vs. Once-for-All: Scalable Simultaneous Forecasting via Equilibrium State Estimation

    Paper 2 is likely higher impact: it introduces a novel world-model/latent-dynamics formulation for compiler auto-scheduling that directly addresses a major bottleneck in ML systems (tensor program search) with clear, strong empirical gains and practical integration (TVM). The applications are immediate and broad across hardware, compilers, and ML deployment, with compelling end-to-end speedups and reduced measurement budgets, suggesting strong adoption potential. Paper 1 is innovative and scalable for multi-system forecasting, but its impact may be narrower and dependent on validating equilibrium assumptions across diverse real-world domains.

    gpt-5.2·Jun 12, 2026
    Won vs. Flash-GMM: A Memory-Efficient Kernel for Scalable Soft Clustering

    Paper 1 introduces a novel world-model-inspired approach to tensor program optimization that fundamentally changes how candidates are evaluated by modeling schedule trajectories as latent dynamics rather than static snapshots. This represents a conceptual innovation bridging reinforcement learning world models with compiler optimization, with broad applicability across ML compilation. The significant speedups (4.61× geometric mean over PyTorch) with fewer measurements demonstrate strong practical impact. Paper 2, while achieving impressive engineering results (20× speedup for GMMs), is more narrowly focused on optimizing an existing algorithm and its application to ANN search, representing incremental rather than conceptual advancement.

    claude-opus-4-6·Jun 10, 2026
    Won vs. Data-Driven Dynamic Assortment in Online Platforms: Learning about Two Sides

    Paper 2 demonstrates higher potential scientific impact due to its extreme timeliness and broad applicability in machine learning systems. While Paper 1 provides strong rate-optimal bounds for a novel two-sided market problem, Paper 2 introduces an innovative cross-disciplinary approach by applying world models to compiler optimization. By significantly accelerating tensor program search and full-model inference, Paper 2 directly addresses a critical bottleneck in modern AI deployment, offering immediate, widespread real-world utility that will impact both hardware and AI research communities.

    gemini-3.1-pro-preview·Jun 10, 2026
    Lost vs. A Unifying Lens on Supervised Fine-Tuning Through Target Distribution Design

    Paper 2 addresses a more fundamental and broadly applicable problem—supervised fine-tuning of LLMs—which impacts a vast community. Its unifying framework (Q-target) reinterprets existing SFT methods under a common lens and opens a new design space for training objectives, with broad implications for all LLM fine-tuning. Paper 1, while technically strong with impressive speedups for tensor program optimization, addresses a narrower compiler optimization niche. Paper 2's conceptual contribution and breadth of applicability across models and datasets give it higher potential for widespread scientific impact.

    claude-opus-4-6·Jun 10, 2026
    Won vs. Algorithmic and Minimax Complexities in Kernel Bandits

    Paper 1 introduces a highly novel application of world models to compiler optimization, addressing a critical bottleneck in modern ML systems. Its empirical results demonstrating significant speedups and reduced measurement costs suggest immediate, broad real-world impact across deep learning deployment. While Paper 2 offers valuable theoretical insights unifying bandit algorithms, Paper 1's combination of conceptual innovation and direct, practical utility in accelerating ML inference gives it a higher potential for broad scientific and technological impact.

    gemini-3.1-pro-preview·Jun 10, 2026
    Won vs. GRAFT: Gain-Recalibrated Adapters for Transformer-Based Neural Population Activity Modeling

    Paper 2 introduces a novel world-model-inspired approach to compiler optimization that bridges reinforcement learning concepts with tensor program search, offering broad impact across ML systems and compiler design. Its 10x measurement reduction and significant speedups over established baselines (Ansor, PyTorch) demonstrate strong practical utility for the entire ML community. Paper 1, while achieving state-of-the-art on NLB'21 and addressing an important BCI recalibration problem, operates in a narrower neuroscience/BCI domain with incremental advances on existing Transformer-based neural modeling approaches.

    claude-opus-4-6·Jun 10, 2026
    Lost vs. Unifying Local Communications and Local Updates for LLM Pretraining

    Paper 2 likely has higher scientific impact: it targets a central, timely bottleneck in LLM pretraining—communication efficiency and heterogeneity—where improvements can affect large-scale training across industry and academia. The proposed GASLoC unifies local updates and decentralized gossip communication while remaining compatible with adaptive optimizers, broadening applicability and practical adoption potential. Its relevance spans distributed systems, optimization, and ML scaling, with clear real-world implications for multi-cluster training. Paper 1 is novel and strong for compiler/autoscheduling, but its impact is narrower and more domain-specific.

    gpt-5.2·Jun 10, 2026
    Won vs. AuRA: Internalizing Audio Understanding into LLMs as LoRA

    Paper 2 introduces a highly innovative approach by applying world-model-inspired latent dynamics to compiler optimization. This cross-disciplinary methodology not only advances reinforcement learning and systems research but also provides fundamental efficiency gains for all machine learning workloads. Paper 1 offers a valuable but more incremental architectural improvement for audio-language models.

    gemini-3.1-pro-preview·Jun 10, 2026