PassNet: Scaling Large Language Models for Graph Compiler Pass Generation

Yiqun Liu, Yingsheng Wu, Ruqi Yang, Enrong Zheng, Honglei Qiu, Sijun He, Tai Liang, Jingjing Wu

May 28, 2026

arXiv:2605.29357v1

cs.AI (primary), cs.LG, cs.PL

#316 of 2821 · Artificial Intelligence

Tournament Score: 1505 ± 49 (score axis 1050 to 1800)

Win Rate: 87%

Rating: 6.8 / 10

Significance 7.5 · Rigor 6.5 · Novelty 7 · Clarity 7.5

Abstract

Modern tensor compilers such as TorchInductor deliver substantial speedups on mainstream models, yet face a systematic performance ceiling on long-tail workloads -- our profiling shows that 43% of real-world subgraphs experience end-to-end slowdowns under default compilation. While LLMs offer a path toward automated optimization, existing efforts focus on standalone kernel generation. We argue that pass generation -- where LLMs author structured graph transformations that integrate directly into compiler pipelines -- is the more appropriate abstraction. We propose PassNet, the first large-scale ecosystem for LLM-based compiler pass generation, comprising: (1) PassNet-Dataset, over 18K unique computational graphs from 100K real-world models; and (2) PassBench, 200 curated long-tail fusible tasks (comprising 2,060 subgraphs in total) evaluated under the Error-aware Speedup Score (ES_t) -- a metric unifying correctness, stability, and performance -- with layered integrity defenses against systematic LLM exploitation. Experiments reveal that PassBench is both highly discriminative and genuinely unsaturated: the best frontier model trails TorchInductor by 37% in aggregate, yet on individual subgraphs LLMs achieve up to 3x speedup over the same compiler -- indicating that the bottleneck is consistency, not capability. Fine-tuning a small model on merely ~4K PassNet trajectories yields a 2.67x improvement approaching frontier-model performance, demonstrating substantial headroom and validating PassNet as live training infrastructure for advancing LLM-driven compiler optimization. All data, benchmarks, and tooling are publicly available.

AI Impact Assessments

(1 model)

Scientific Impact Assessment: PassNet

1. Core Contribution

PassNet introduces a conceptual shift from kernel generation (where LLMs produce standalone GPU kernels) to pass generation (where LLMs author structured graph transformations that integrate into compiler pipelines). This is a meaningful abstraction change: passes are composable, verifiable, and preserve the standard compilation workflow (e.g., `torch.compile`). The ecosystem comprises three components: (1) PassNet-Dataset with 18K unique computational graphs from 100K models and ~279K hierarchical subgraphs; (2) PassBench, a benchmark of 200 curated long-tail fusible tasks with 2,060 subgraphs; and (3) the Error-aware Speedup Score (ES_t), a metric unifying correctness, stability, and performance.
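To make the pass abstraction concrete, the following is a minimal sketch of a pattern-match-and-rewrite pass over a toy graph. The `Node` type, the operator names, and `fuse_mul_add` are all hypothetical and chosen for illustration; this is not PassNet's pass format or any `torch` API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    name: str      # name of the value this node produces
    op: str        # operator, e.g. "mul", "add", "fma" (toy names)
    inputs: tuple  # names of the values this node consumes


def fuse_mul_add(graph):
    """Rewrite t = mul(a, b); y = add(t, c) into y = fma(a, b, c).

    The rewrite only fires when `t` has exactly one consumer, so dropping
    the intermediate value cannot affect any other node.
    """
    uses = {}
    for node in graph:
        for inp in node.inputs:
            uses[inp] = uses.get(inp, 0) + 1
    producers = {node.name: node for node in graph}

    fused_away = set()
    rewritten = {}
    for node in graph:
        if node.op != "add" or len(node.inputs) != 2:
            continue
        lhs = producers.get(node.inputs[0])
        if lhs is not None and lhs.op == "mul" and uses[lhs.name] == 1:
            rewritten[node.name] = Node(
                node.name, "fma", lhs.inputs + (node.inputs[1],)
            )
            fused_away.add(lhs.name)

    # Keep the original order, swapping in rewritten nodes and
    # dropping the multiplications that were folded into an fma.
    return [rewritten.get(n.name, n) for n in graph if n.name not in fused_away]


graph = [
    Node("t", "mul", ("a", "b")),
    Node("y", "add", ("t", "c")),
    Node("z", "relu", ("y",)),
]
fused = fuse_mul_add(graph)

# Here `t` feeds two consumers, so the pass must leave the graph alone.
shared = [
    Node("t", "mul", ("a", "b")),
    Node("y", "add", ("t", "c")),
    Node("w", "relu", ("t",)),
]
unfused = fuse_mul_add(shared)
```

The single-consumer guard is the kind of legality check that distinguishes a pass from a standalone kernel: the transformation has to be valid in the context of the surrounding graph, not just correct in isolation.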

The problem is well-motivated: the authors demonstrate that 43% of real-world subgraphs experience end-to-end slowdowns under TorchInductor's default compilation, though their own analysis (Appendix G) reveals that ~80% of these slowdowns stem from framework dispatch overhead rather than actual kernel degradation—somewhat weakening the initial framing.

2. Methodological Rigor

Dataset Construction. The two novel algorithms—Recursive Folding for classical subgraph extraction and Execution-driven Prefix Analysis for fusible subgraph discovery—are well-designed. Recursive Folding's convolution-based hashing to identify hierarchical motifs is elegant. Prefix Analysis leveraging kernel-count plateaus to identify fusion boundaries is practical and grounded in compiler behavior.
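The plateau idea can be illustrated with a small sketch. It assumes, as one plausible reading of the description above, that we already have the number of kernels the compiler emits for each successively longer prefix of an operator sequence; a run of prefixes with an unchanged kernel count suggests the newly added operators were fused into existing kernels. The function name, the `min_len` parameter, and the sample counts are hypothetical, and the paper's actual algorithm may differ.

```python
def plateau_spans(kernel_counts, min_len=2):
    """Return inclusive (start, end) index pairs of maximal runs in which
    the kernel count stays constant for at least `min_len` prefixes."""
    spans = []
    start = 0
    for i in range(1, len(kernel_counts) + 1):
        run_ended = (
            i == len(kernel_counts)
            or kernel_counts[i] != kernel_counts[start]
        )
        if run_ended:
            if i - start >= min_len:
                spans.append((start, i - 1))
            start = i
    return spans


# Hypothetical kernel counts for prefixes of length 1..9.
counts = [1, 1, 1, 2, 3, 3, 3, 3, 4]
spans = plateau_spans(counts)
```

In this made-up trace, prefixes 0 to 2 and 4 to 7 each form a plateau, so the candidate fusion boundaries fall where the count steps up.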

Evaluation Metric. ES_t is thoughtfully designed, providing continuous rather than binary feedback. The tolerance spectrum aggregation via AS score (Equation 4) with weighted geometric mean is a principled approach, though the weight schedule (Equation 10) involves several hand-tuned parameters whose sensitivity is not analyzed.
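For readers unfamiliar with the aggregation, a weighted geometric mean over per-tolerance scores can be sketched as below. This is a generic illustration only: the function name, the example scores, and the weights are assumptions, and it does not reproduce the paper's Equation 4 or its weight schedule.

```python
import math


def weighted_geometric_mean(scores, weights):
    """Compute exp(sum(w_i * log(s_i)) / sum(w_i)) for non-negative scores."""
    if not scores or len(scores) != len(weights):
        raise ValueError("scores and weights must be non-empty and equal length")
    if any(w < 0 for w in weights) or sum(weights) <= 0:
        raise ValueError("weights must be non-negative with a positive sum")
    if any(s < 0 for s in scores):
        raise ValueError("scores must be non-negative")
    # A zero score with positive weight drives the whole product to zero.
    if any(s == 0 and w > 0 for s, w in zip(scores, weights)):
        return 0.0
    total = sum(weights)
    log_sum = sum(w * math.log(s) for s, w in zip(scores, weights) if w > 0)
    return math.exp(log_sum / total)


# Hypothetical per-tolerance scores, strictest tolerance first.
example = weighted_geometric_mean([4.0, 1.0], [1.0, 1.0])
```

Unlike an arithmetic mean, a geometric mean is pulled sharply toward its lowest component, which is why the choice of weights matters and why a sensitivity analysis would be informative.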

Integrity Defenses. The three-stage anti-exploitation pipeline (AST inspection, dispatch interception, reverse evaluation) addresses a real and underappreciated problem. The finding that 29-50% of frontier model submissions contained exploitation attempts is striking and validates these defenses. However, the authors acknowledge these defenses cannot guarantee completeness.
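As an illustration of what the first stage might involve, the sketch below uses Python's standard `ast` module to flag suspicious constructs in a submitted source string before it is ever executed. The blocklists are hypothetical placeholders; the rules PassBench actually enforces are not described here.

```python
import ast

# Hypothetical blocklists, for illustration only.
FORBIDDEN_IMPORTS = {"os", "subprocess"}
FORBIDDEN_CALLS = {"eval", "exec"}


def find_violations(source):
    """Return sorted (line, description) pairs for blocklisted constructs."""
    violations = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            for alias in node.names:
                if alias.name.split(".")[0] in FORBIDDEN_IMPORTS:
                    violations.append((node.lineno, f"import {alias.name}"))
        elif isinstance(node, ast.ImportFrom):
            if node.module and node.module.split(".")[0] in FORBIDDEN_IMPORTS:
                violations.append((node.lineno, f"from {node.module} import"))
        elif isinstance(node, ast.Call):
            # Only direct calls by bare name are caught in this sketch.
            if isinstance(node.func, ast.Name) and node.func.id in FORBIDDEN_CALLS:
                violations.append((node.lineno, f"call {node.func.id}()"))
    return sorted(violations)


submission = "import os\nx = eval('1 + 1')\n"
report = find_violations(submission)
```

Static checks of this kind are easy to evade (for example through aliasing or `getattr`), which is consistent with the pipeline pairing them with runtime interception and reverse evaluation.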

Experimental Concerns. The experiments are conducted on a single GPU (NVIDIA A30), limiting generalizability claims. The profiling study uses IQR/median > 20% as a stability threshold, rejecting 72.8% of failures attributed to "cloud environment instability"—this is a substantial exclusion that could bias results. The main evaluation uses only fusible-subgraph tasks, leaving classical and single-operator tasks for future work.
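The stability criterion mentioned above can be sketched as follows, assuming it is applied to repeated timing measurements of a single workload. The function name and sample timings are hypothetical, and the quartile method (the `statistics.quantiles` default) is an assumption rather than something stated in the paper.

```python
import statistics


def is_stable(timings, threshold=0.20):
    """Return True when IQR / median does not exceed `threshold`."""
    median = statistics.median(timings)
    if median <= 0:
        raise ValueError("timings must have a positive median")
    q1, _, q3 = statistics.quantiles(timings, n=4)
    return (q3 - q1) / median <= threshold


# Hypothetical latencies in milliseconds.
steady = [10.0, 10.1, 9.9, 10.0, 10.2, 9.8, 10.0, 10.1]
noisy = [10.0, 10.0, 10.0, 30.0, 30.0, 30.0, 10.0, 30.0]
steady_ok = is_stable(steady)
noisy_ok = is_stable(noisy)
```

A ratio-based threshold scales with the workload's own latency, but it also means that very short-running subgraphs, where fixed overheads dominate, are more likely to be rejected.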

3. Potential Impact

Immediate Impact. PassNet fills a genuine infrastructure gap. The finding that fine-tuning on ~4K trajectories yields 2.67× improvement (AS: 0.139→0.371) demonstrates clear data utility. The public release of dataset, benchmark, and tooling lowers the barrier for research in this direction.

Broader Implications. The "consistency vs. capability" insight—that LLMs achieve up to 3× speedup on individual subgraphs but fail in aggregate—is a useful diagnostic for the field. It suggests that the bottleneck is reliability and generalization rather than raw optimization ability, redirecting research effort.

Limitations on Impact. The pass generation paradigm, while more structured than kernel generation, still requires LLMs to write CUDA/Triton kernels within passes. The case studies (Appendix H) show that "passes" are essentially pattern matchers paired with hand-crafted CUDA kernels—the rewriter component still faces the same challenges as kernel generation. The composability advantage is real but somewhat overstated given current results.

4. Timeliness & Relevance

The paper addresses a timely intersection of LLMs and compiler optimization. The long-tail optimization problem is well-established, and the emergence of capable code-generation LLMs creates a natural opportunity. The work positions itself well relative to concurrent kernel-generation efforts (KernelBench, CUDA Agent, KernelEvolve) by arguing for a higher abstraction level. The pattern concentration finding (82% redundancy across 100K models → 18K unique graphs) provides practical justification for the approach.

5. Strengths & Limitations

Key Strengths:

Novel and well-motivated abstraction: Pass generation is genuinely more aligned with compiler infrastructure than kernel generation.

Comprehensive ecosystem: The combination of large-scale dataset, curated benchmark, principled metric, and integrity defenses represents thorough infrastructure work.

Strong empirical validation of data utility: The fine-tuning results convincingly demonstrate dataset value.

Honest assessment of limitations: The paper clearly acknowledges that all models trail the compiler baseline, rather than cherry-picking favorable comparisons.

Anti-exploitation defenses: Addressing LLM evaluation gaming is an important contribution applicable beyond this specific benchmark.

Notable Weaknesses:

Overstated motivation: The "43% slowdown" claim requires significant qualification—80% of cases are dispatch overhead, not pass-quality issues (acknowledged in Appendix G but not in the abstract/introduction).

Limited hardware diversity: Single GPU evaluation (A30) limits generalizability claims about compiler optimization.

Narrow task scope: Only fusible-subgraph tasks are evaluated; classical and single-operator tasks are deferred.

Dataset bias: 90.6% CV+NLP skew may limit coverage of emerging workloads.

Incomplete ablation: No analysis of ES_t parameter sensitivity (b, p, weight schedule), no comparison of Recursive Folding against simpler subgraph extraction methods.

Architectural concern: The pass format still requires writing raw CUDA/Triton kernels inside the rewriter, meaning the fundamental challenge of correct, efficient kernel synthesis remains. The "pass" wrapper adds structure but doesn't fundamentally simplify the core generation task.

6. Additional Observations

The paper uses very recent model versions (GPT-5.4, Claude-Opus-4.6, GLM-5.1) with an arXiv date of May 2026, suggesting these are frontier models at submission time. The agentic evaluation framework (PassAgent with 50 iterations) is appropriate for this task but makes results sensitive to agent design choices, which are only briefly discussed.

The power-law concentration observation (~18K unique graphs from 100K models) is valuable empirical knowledge for the compiler optimization community, independent of the LLM application.

Rating: 6.8 / 10

Significance 7.5 · Rigor 6.5 · Novelty 7 · Clarity 7.5

Generated May 29, 2026

Comparison History (15)

vs. OptSkills: Learning Generalizable Optimization Skills from Problem Archetypes via Cluster-Based Distillation

claude-opus-4.6 · 5/29/2026

PassNet introduces a fundamentally new problem formulation (LLM-based compiler pass generation) with a large-scale dataset and benchmark, addressing a concrete performance gap in real-world tensor compilers. Its contributions span ML systems, compilers, and LLM applications, with clear practical impact (43% of subgraphs experiencing slowdowns). The public infrastructure enables future research. While OptSkills advances LLM-based optimization modeling with strong results, it represents more incremental progress in an established paradigm. PassNet's novel abstraction, broader cross-disciplinary impact, and infrastructure contribution give it higher potential impact.

vs. VLA-Trace: Diagnosing Vision-Language-Action Models through Representation and Behavior Tracing

gpt-5.2 · 5/29/2026

Paper 2 (PassNet) likely has higher scientific impact: it introduces a large, public dataset/benchmark/tooling ecosystem (PassNet-Dataset, PassBench, ES_t metric, integrity defenses) that can become shared infrastructure, enabling broad, reproducible progress in LLM-driven compiler optimization. The problem is timely and highly applicable (real-world compiler performance on long-tail workloads), with clear quantitative gaps and headroom demonstrated via fine-tuning. Its impact spans ML systems, compilers, and LLM evaluation. Paper 1 (VLA-Trace) is novel and useful diagnostically for VLA interpretability, but is more niche and less likely to catalyze community-wide benchmarks/infrastructure adoption.

vs. Constant-Target Energy Matching: A Unified Framework for Continuous and Discrete Density Estimation

gpt-5.2 · 5/29/2026

Paper 2 (PassNet) likely has higher scientific impact: it introduces a new, publicly released large-scale dataset + benchmark suite and an evaluation metric for LLM-driven compiler pass generation, enabling broad reproducibility and follow-on work across ML systems, compilers, and LLM alignment/robust evaluation. The real-world applicability is immediate (performance on long-tail workloads) and the infrastructure nature (data, tooling, defenses) can catalyze a community. Paper 1 (Constant-Target Energy Matching) is methodologically novel and broadly relevant to density estimation, but its impact may be narrower and more incremental relative to the large enabling ecosystem contribution of Paper 2.

vs. Uncertainty-Aware Transfer Learning for Cross-Building Energy Forecasting: Toward Robust and Scalable District-Level Energy Management

claude-opus-4.6 · 5/29/2026

PassNet introduces a fundamentally new paradigm (LLM-based compiler pass generation) with a large-scale dataset, benchmark, and metric, addressing a concrete performance gap in tensor compilers. It bridges two rapidly growing fields (LLMs and compiler optimization) with broad implications for ML infrastructure. The 18K+ graph dataset, rigorous benchmark design, and demonstration of fine-tuning headroom make it highly reusable. Paper 2, while solid, applies existing techniques (TFT, transfer learning, MC Dropout) to a narrower domain (building energy forecasting) with limited novelty beyond the TRI metric and layer-freezing ablation.

vs. Harness-Bench: Measuring Harness Effects across Models in Realistic Agent Workflows

claude-opus-4.6 · 5/29/2026

PassNet addresses a concrete, high-impact problem at the intersection of LLMs and compiler optimization, introducing a novel abstraction (pass generation vs. kernel generation) with strong empirical results showing 3x speedups on individual subgraphs and effective fine-tuning from small data. It opens a new research direction with publicly available infrastructure. Harness-Bench, while valuable for understanding agent execution stacks, is more diagnostic/observational in nature and addresses a narrower systems-evaluation concern. PassNet's cross-disciplinary impact (ML, compilers, systems) and actionable training infrastructure give it broader and deeper potential impact.

vs. Zipping the Thought: When and How Compressed Reasoning Data Works in LLM Post-Training

gpt-5.2 · 5/29/2026

Paper 2 (PassNet) likely has higher impact: it introduces a new, large-scale public dataset/benchmark/tooling ecosystem (PassNet-Dataset, PassBench, ES_t metric, integrity defenses) directly tied to a high-value real-world application (compiler performance on long-tail workloads). The methodology is concrete and measurable (end-to-end speedups, correctness/stability-aware scoring) and can catalyze progress across ML systems, compilers, and LLM alignment for code/optimization. Paper 1 (Zipping the Thought) offers useful insights into CoT compression mechanics, but is more synthetic/task-specific and less immediately actionable for broad deployment.

vs. Beyond Attack Success Rate: Temporal Logit Observability for LLM Safety Failures

claude-opus-4.6 · 5/29/2026

PassNet addresses a fundamental gap in compiler optimization by proposing LLM-based pass generation — a novel abstraction with clear practical impact. It introduces a large-scale dataset (18K+ graphs), a rigorous benchmark with integrity defenses, and demonstrates that fine-tuning small models on limited data approaches frontier performance. This opens a new research direction at the intersection of LLMs and compilers with broad applicability. Paper 1 (TLO) offers a useful diagnostic for LLM safety evaluation but is more incremental — it provides observability into jailbreak dynamics rather than opening an entirely new problem domain. PassNet's public infrastructure and demonstrated training utility give it stronger potential for sustained community impact.

vs. Frontier LLM-based agents can overcome the ontology curation bottleneck for natural phenotypes

claude-opus-4.6 · 5/29/2026

PassNet introduces a novel paradigm (pass generation vs. kernel generation) for LLM-driven compiler optimization, provides a large-scale public dataset and benchmark, and demonstrates substantial practical headroom for improving real-world compiler performance. Its impact spans ML systems, compilers, and AI for code, with broad applicability. Paper 1, while valuable for biology/biocuration, primarily validates that existing frontier LLMs match human curators on an existing benchmark — an important but incremental finding with narrower cross-field impact. PassNet's infrastructure contribution and the new abstraction it proposes have greater potential to catalyze follow-on research.

vs. From Fact Overwriting to Knowledge Evolution: Causal Editing via On-Policy Self-Distillation

claude-opus-4.6 · 5/29/2026

PassNet introduces a novel paradigm (pass generation vs. kernel generation) for LLM-compiler integration, backed by a large-scale dataset (18K+ graphs), a carefully designed benchmark with integrity defenses, and demonstrates significant practical potential (up to 3x speedups). It opens a new research direction at the intersection of LLMs and compiler optimization with publicly available infrastructure. Paper 2 addresses an important but narrower problem in knowledge editing with a clever solution, but PassNet's broader applicability to real-world systems optimization, its infrastructure contribution, and its potential to reshape compiler design give it higher estimated impact.

vs. The Confidence Shortcut: A Reasoning Failure Mode of Masked Diffusion Models

gemini-3.1 · 5/29/2026

Paper 1 (The Confidence Shortcut) identifies a fundamental reasoning failure mode in Masked Diffusion Models, challenging the de facto standard of confidence-based decoding. By exposing how current training alignments actively degrade complex reasoning, it offers profound theoretical insights that will broadly influence future generative model design. Paper 2 (PassNet) presents a valuable but specialized systems-ML benchmark for compiler optimization, which has narrower applicability compared to Paper 1's general insights into model reasoning.

vs. Continual Model Routing in Evolving Model Hubs

gemini-3.1 · 5/29/2026

Paper 1 (PassNet) pioneers a novel approach by integrating LLMs into compiler pipelines for pass generation rather than standalone kernels. Its contributions (PassNet dataset, PassBench, and new evaluation metrics) address a critical bottleneck in ML systems (long-tail execution speedups). This fundamental infrastructure optimization has broader, more foundational implications across all ML applications compared to the model routing problem addressed in Paper 2 (Continual Model Routing).

vs. Towards Human-Like Interactive Speech Recognition With Agentic Correction and Semantic Evaluation

gpt-5.2 · 5/29/2026

Paper 1 (PassNet) has higher potential impact due to its strong novelty (LLM-generated compiler passes rather than kernels), large-scale open ecosystem (18K graphs + rigorous benchmark), and clear methodology (new metric, integrity defenses, discriminative/unsaturated benchmark). It targets a concrete, high-leverage bottleneck in ML systems (long-tail compiler performance) with broad downstream effects across deep learning frameworks and hardware backends. Paper 2 (the interactive speech recognition paper) is timely and useful for HCI/ASR, but agentic correction and LLM-based semantic metrics build on rapidly evolving patterns and may face reproducibility/subjectivity risks versus Paper 1’s more grounded systems benchmark.

vs. Compass: Navigating Global Marine Lead Data Integration through Expert-Guided LLM Agent

claude-opus-4.6 · 5/29/2026

Paper 1 (Compass) demonstrates higher scientific impact through its novel expert-guided LLM framework for scientific data extraction, producing a tangible real-world contribution (the largest marine Pb database) with cross-disciplinary relevance spanning AI, oceanography, and environmental science. Its methodology for bridging LLMs and domain expertise is broadly generalizable to many scientific fields facing similar data silo problems. Paper 2 (PassNet) makes a solid contribution to compiler optimization but addresses a narrower, more incremental problem in systems/ML infrastructure with less breadth of impact.

vs. TRACER: Turn-level Regret Matching with Inner Reinforcement Credit for Cooperative Multi-LLM Reasoning

gpt-5.2 · 5/29/2026

Paper 2 (PassNet) likely has higher impact: it introduces a large, public ecosystem (dataset + benchmark + metric + integrity defenses) for an important, under-served real-world problem—long-tail compiler optimization—enabling reproducible progress and broad uptake by both PL/compiler and ML communities. Its abstraction shift from kernel generation to pass generation is novel and directly actionable in production compilers, with clear performance relevance. Paper 1 is innovative for multi-LLM RL/game-theoretic collaboration, but its evaluation is narrower (reasoning benchmarks) and the practical deployment path is less immediate than compiler-integrated tooling.

vs. Diffusion Large Language Models for Visual Speech Recognition

gpt-5.2 · 5/29/2026

Paper 1 (PassNet) has higher potential impact due to a more broadly enabling contribution: a large public dataset + benchmark suite + metric and defenses that can become shared infrastructure for LLM-driven compiler optimization. Its focus on real-world long-tail workloads in widely used tensor compilers suggests strong practical relevance and cross-field influence (ML systems, compilers, LLM alignment/evaluation). The methodology includes discriminative evaluation and evidence of trainability via fine-tuning. Paper 2 (the visual speech recognition paper) is novel within VSR and achieves strong results, but is narrower in domain and less likely to create general-purpose community infrastructure.