AutoMegaKernel: A Statically-Checked Agent Harness for Self-Retargeting Megakernel Synthesis

Jaber Jaber, Osama Jaber

Jun 8, 2026 · arXiv:2606.09682v1

cs.LG · cs.DC · cs.PF

#1260 of 5669 · cs.LG

Tournament Score: 1462 ± 43 (scale 1050 to 1750)

Win Rate: 68%

Rating: 4.5 / 10

Significance 4.5 · Rigor 6 · Novelty 5.5 · Clarity 4.5

Abstract

AutoMegaKernel (AMK) compiles a HuggingFace Llama-family model into a single persistent cooperative CUDA kernel that runs the whole forward pass in one launch, with no per-model hand-written CUDA. The contribution is the system, not raw speed. A frozen schedule-IR validator statically certifies deadlock-freedom and race-freedom via static graph checks (not a mechanized proof), so an unsafe agent-proposed schedule is rejected before launch: across 7,160 adversarial schedules (6,091 unsafe) it had zero false-accepts and accepted all 360 real lowerings. The same source retargets sm_80/sm_90/sm_120 from one codebase, auto-generates correct megakernels for 10 of 10 supported models, and on a real SmolLM2-135M checkpoint reproduces HuggingFace greedy decode token-for-token (perplexity match 2.5e-7). An unattended, agent-drivable autoresearch loop self-improves the megakernel over its own baseline (1.25-1.72x). A search-found int8 (W8A16) megakernel beats CUDA-graphed cuBLAS bf16 at batch-1 decode across NVIDIA's datacenter inference fleet: L4 up to 1.33x, the current-gen L40S 1.25-1.27x, A10G up to 1.08x at scale, and the consumer RTX 5090 1.19-1.23x. The ordering is not a clean function of bandwidth (the 864 GB/s L40S beats the 600 GB/s A10G); the divide is inference-class vs training-class. AMK trails cuBLAS on the high-bandwidth training-class A100/H100, where the harness localizes the cross-SM-sync bottleneck; we report the gap plainly. This is a precision-asymmetric (W8A16 vs bf16) comparison at decode position 0; the largest real checkpoint is TinyLlama-1.1B. Code and the harness: https://github.com/RightNow-AI/AutoMegaKernel

AI Impact Assessments

(1 model)

Scientific Impact Assessment: AutoMegaKernel

1. Core Contribution

AutoMegaKernel (AMK) addresses the problem of compiling an entire LLM forward pass into a single persistent cooperative CUDA kernel, eliminating inter-operator launch overhead and HBM round-trips. The core novelty is not raw performance but the *system design*: a four-layer architecture where a frozen schedule-IR validator statically certifies deadlock-freedom and race-freedom before any GPU launch, enabling an automated agent to propose schedule modifications safely. The key insight is that forward passes are DAGs with monotonic counter synchronization, reducing safety verification to a small set of static graph checks.
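The reduction described above, from schedule safety to a few static graph checks, can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the task layout (`waits`, `signals`, `reads`, `writes`), the function name `validate_schedule`, and the simplification that each waiter waits for all signals on a counter. It is not AMK's schedule IR or its validator, and it ignores hardware effects such as SM occupancy.

```python
from collections import defaultdict, deque

def validate_schedule(tasks):
    """Return a list of violations; an empty list means the schedule is accepted.

    tasks maps a task name to a dict with optional keys:
      "waits":   {counter: threshold}  (block until counter >= threshold)
      "signals": [counter, ...]        (increment each counter once on completion)
      "reads" / "writes": [buffer, ...]
    This layout is hypothetical, not AMK's schedule IR.
    """
    errors = []
    signalers = defaultdict(list)
    for name, task in tasks.items():
        for counter in task.get("signals", []):
            signalers[counter].append(name)

    # Happens-before edges: a waiter runs after every task that signals its counter.
    preds = {name: set() for name in tasks}
    for name, task in tasks.items():
        for counter, threshold in task.get("waits", {}).items():
            producers = signalers.get(counter, [])
            if threshold != len(producers):
                # Too high can never be reached; too low leaves some producers unordered.
                errors.append(f"bad wait: {name} needs {counter}>={threshold}, "
                              f"{len(producers)} signal(s) exist")
            preds[name].update(producers)

    # Deadlock check: the wait-for graph must be acyclic (Kahn's algorithm).
    indegree = {name: len(p) for name, p in preds.items()}
    succs = defaultdict(list)
    for name, p in preds.items():
        for producer in p:
            succs[producer].append(name)
    ready = deque(name for name, d in indegree.items() if d == 0)
    order = []
    while ready:
        name = ready.popleft()
        order.append(name)
        for nxt in succs[name]:
            indegree[nxt] -= 1
            if indegree[nxt] == 0:
                ready.append(nxt)
    if len(order) != len(tasks):
        errors.append("deadlock: cyclic wait between tasks")
        return errors

    # Race check: conflicting accesses must be ordered by happens-before.
    ancestors = {}
    for name in order:
        ancestors[name] = set(preds[name])
        for producer in preds[name]:
            ancestors[name] |= ancestors[producer]
    names = list(tasks)
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if a in ancestors[b] or b in ancestors[a]:
                continue
            wa, wb = set(tasks[a].get("writes", [])), set(tasks[b].get("writes", []))
            ra, rb = set(tasks[a].get("reads", [])), set(tasks[b].get("reads", []))
            for buf in sorted((wa & (wb | rb)) | (wb & ra)):
                errors.append(f"race: {a} and {b} touch {buf} without ordering")
    return errors

# Toy schedules with hypothetical task and buffer names.
safe = {
    "rmsnorm": {"writes": ["x_norm"], "signals": ["norm_done"]},
    "qkv_proj": {"waits": {"norm_done": 1}, "reads": ["x_norm"], "writes": ["qkv"]},
}
racy = {
    "rmsnorm": {"writes": ["x_norm"], "signals": ["norm_done"]},
    "qkv_proj": {"reads": ["x_norm"], "writes": ["qkv"]},  # missing wait
}
print(validate_schedule(safe))
print(validate_schedule(racy))
```

Because counters only increase, a satisfied wait stays satisfied, which is what lets the check reason about a fixed dependency graph rather than about interleavings.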

The system auto-generates megakernels for Llama-family models with zero per-model hand-written CUDA, retargets across three NVIDIA architectures (sm_80/sm_90/sm_120) from one codebase, and includes an autonomous self-improvement loop.

2. Methodological Rigor

Strengths in methodology:

The validator soundness evaluation is thorough: 7,160 adversarial schedules (6,091 unsafe across 8 classes) with zero false-accepts and all 360 real lowerings accepted. The use of an independent oracle for labeling is a good practice.

Correctness verification is multi-layered: logit equivalence to fp32 tolerance, token-for-token greedy decode agreement, and perplexity matching to 2.5×10⁻⁷.

The paper is remarkably transparent about limitations: it explicitly states where AMK loses (A100/H100, vLLM comparison), acknowledges the precision-asymmetric nature of the int8 vs bf16 comparison, reports that clocks were unpinned, and notes the absence of hardware counters.
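The three correctness layers listed above (logit tolerance, greedy-token agreement, perplexity match) can be sketched as a small comparison routine. This is an illustrative sketch only: the function names, the toy logits, and the choice of a relative perplexity difference are assumptions, since the review does not say whether the paper's 2.5×10⁻⁷ figure is absolute or relative.

```python
import math

def greedy_tokens(step_logits):
    """Argmax token id at each decode step."""
    return [max(range(len(row)), key=row.__getitem__) for row in step_logits]

def perplexity(step_logits, targets):
    """exp(mean negative log-likelihood) of the target ids under softmax(logits)."""
    nll = 0.0
    for row, target in zip(step_logits, targets):
        peak = max(row)  # subtract the max for a numerically stable log-sum-exp
        log_z = peak + math.log(sum(math.exp(v - peak) for v in row))
        nll += log_z - row[target]
    return math.exp(nll / len(targets))

def compare(ref_logits, test_logits, targets, logit_tol):
    """Run the three checks and return them as a report dict."""
    max_diff = max(abs(r - t)
                   for ref_row, test_row in zip(ref_logits, test_logits)
                   for r, t in zip(ref_row, test_row))
    ppl_ref = perplexity(ref_logits, targets)
    ppl_test = perplexity(test_logits, targets)
    return {
        "logits_close": max_diff <= logit_tol,
        "tokens_equal": greedy_tokens(ref_logits) == greedy_tokens(test_logits),
        "ppl_rel_diff": abs(ppl_ref - ppl_test) / ppl_ref,
    }

# Toy data: two decode steps over a three-token vocabulary (made-up values).
reference = [[2.0, 0.5, -1.0], [0.1, 3.0, 0.2]]
candidate = [[2.000001, 0.499999, -1.0], [0.1, 3.000001, 0.2]]
report = compare(reference, candidate, targets=[0, 1], logit_tol=1e-4)
print(report)
```

The checks are complementary: greedy tokens can agree while logits drift, and a perplexity match averages over positions, so each catches failures the others can miss.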

Methodological concerns:

The validator is empirically tested, not formally verified — the paper acknowledges this but the distinction is important. Empirical soundness over 7,160 schedules, while impressive, does not guarantee absence of false-accepts on unseen schedule patterns.

All decode latencies are measured at position 0 with empty KV cache — the most favorable scenario for the bandwidth-bound thesis. Long-context behavior is entirely unmeasured.

The cuBLAS comparison is precision-asymmetric (W8A16 vs bf16). While disclosed honestly, the headline "beats cuBLAS" claim requires careful qualification.

No hardware counter data (ncu unavailable), so all utilization figures are derived analytically — a notable gap for a systems paper.

The largest real checkpoint is TinyLlama-1.1B; scaling behavior to production-scale models (7B+) on real weights is undemonstrated.
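To make the first concern above concrete, the following sketch computes what zero false-accepts in 6,091 unsafe schedules would imply statistically. It assumes the unsafe schedules were independent draws from the distribution of schedules an agent would actually propose, which adversarially constructed tests generally are not, so the number is an optimistic illustration rather than a guarantee. The function name is hypothetical.

```python
def zero_failure_upper_bound(n_trials, confidence=0.95):
    """Exact one-sided upper bound on a failure rate after n_trials with zero failures.

    Solves (1 - p) ** n_trials = 1 - confidence for p.
    """
    return 1.0 - (1.0 - confidence) ** (1.0 / n_trials)

unsafe_schedules = 6091  # figure quoted in the abstract
bound = zero_failure_upper_bound(unsafe_schedules)
rule_of_three = 3.0 / unsafe_schedules  # common approximation to the same bound

print(f"95% upper bound on false-accept rate: {bound:.2e}")
print(f"rule-of-three approximation:          {rule_of_three:.2e}")
```

Under that independence assumption, the data would be consistent with a false-accept rate of up to roughly 5 in 10,000, which is why testing alone cannot stand in for a proof of soundness.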

3. Potential Impact

Positive directions:

The static safety gate for agent-generated kernel schedules is a genuinely useful contribution for the emerging paradigm of AI-assisted systems programming. As coding agents become more capable, having compile-time guarantees that prevent GPU hangs is valuable infrastructure.

The self-retargeting capability across architectures from a single codebase addresses a real portability pain point in GPU programming.

The autonomous self-improvement loop (propose→validate→measure→keep/revert) with honesty guards (roofline floor, interleaved measurement) is a well-designed pattern for agent-driven optimization.
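The propose→validate→measure→keep/revert pattern mentioned above can be sketched in a few lines. All names below (`autoresearch`, `propose`, `validate`, `measure`, `roofline_floor_ms`) are hypothetical, and the guard semantics are an assumption: here a candidate that measures faster than the roofline floor is treated as a measurement error and discarded. The paper's loop may differ.

```python
import statistics

def autoresearch(baseline, propose, validate, measure, roofline_floor_ms,
                 rounds=10, repeats=5):
    """Greedy keep/revert search over schedules; returns the best schedule found."""
    best = baseline
    for _round in range(rounds):
        candidate = propose(best)
        if not validate(candidate):
            continue  # rejected statically, never launched
        # Interleave measurements so clock or thermal drift affects both equally.
        best_times, cand_times = [], []
        for _rep in range(repeats):
            best_times.append(measure(best))
            cand_times.append(measure(candidate))
        best_ms = statistics.median(best_times)
        cand_ms = statistics.median(cand_times)
        if cand_ms < roofline_floor_ms:
            continue  # faster than the bandwidth floor allows: distrust and revert
        if cand_ms < best_ms:
            best = candidate  # keep; otherwise revert by leaving best unchanged
    return best

# Toy stand-ins: a "schedule" is just a tile size, and latency is a made-up formula.
result = autoresearch(
    baseline=8,
    propose=lambda s: s * 2,
    validate=lambda s: s <= 64,
    measure=lambda s: 100.0 / s + 1.0,
    roofline_floor_ms=1.0,
)
print(result)
```

Keeping the validator and the floor outside the agent's edit surface is what makes the loop safe to run unattended: the agent can only change what `propose` returns.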

Limitations on impact:

The system is restricted to Llama-family architectures — a significant scope limitation for practical adoption.

On the highest-performance hardware (A100/H100) where production inference actually runs, AMK loses to cuBLAS and vLLM substantially. The wins are on inference-class GPUs (L4, L40S), which is relevant but less compelling.

Batch-1 decode at position 0 is a narrow operating point; production systems care about batched inference and long contexts.

AMK reaches only ~63% of measured HBM peak bandwidth, versus ~90% for cuBLAS. This is a substantial kernel-quality gap that limits the competitiveness of the approach itself.
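A weights-only roofline estimate shows how the bandwidth-utilization gap and the precision asymmetry interact. This is a back-of-envelope sketch under stated assumptions: batch-1 decode reads every weight once per token, KV-cache and activation traffic are ignored (the cache is empty at position 0), the parameter count is taken from the TinyLlama-1.1B model name, the 864 GB/s figure is the L40S bandwidth quoted in the abstract, and the ~63% and ~90% fractions are applied to that nominal figure even though the review does not tie them to a specific GPU. It does not reproduce the paper's measurements.

```python
def decode_floor_ms(n_params, bytes_per_weight, bandwidth_gb_s):
    """Lower bound on per-token latency if every weight is read once at full bandwidth."""
    return n_params * bytes_per_weight / (bandwidth_gb_s * 1e9) * 1e3

def achieved_ms(floor_ms, fraction_of_peak):
    """Latency when the kernel sustains only a fraction of peak bandwidth."""
    return floor_ms / fraction_of_peak

N_PARAMS = 1.1e9        # assumption: parameter count implied by the model name
BANDWIDTH_GB_S = 864.0  # L40S figure quoted in the abstract

w8_ms = achieved_ms(decode_floor_ms(N_PARAMS, 1, BANDWIDTH_GB_S), 0.63)    # int8 weights
bf16_ms = achieved_ms(decode_floor_ms(N_PARAMS, 2, BANDWIDTH_GB_S), 0.90)  # bf16 weights
speedup = bf16_ms / w8_ms

print(f"W8 at 63% of peak:   {w8_ms:.2f} ms/token")
print(f"bf16 at 90% of peak: {bf16_ms:.2f} ms/token")
print(f"idealized speedup:   {speedup:.2f}x")
```

In this idealized model the speedup is 2 × 0.63 / 0.90, so any advantage comes entirely from reading half as many bytes; at equal precision the factor of 2 disappears and the lower bandwidth fraction leaves AMK behind, consistent with the equal-precision losses noted in the review.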

4. Timeliness & Relevance

The paper addresses a timely intersection of two trends: (1) the push toward lower-latency LLM inference, particularly for interactive applications, and (2) the emerging use of AI agents for systems-level code optimization. The megakernel approach for single-stream decode is well-motivated by the bandwidth-bound nature of batch-1 inference.

However, the field is moving rapidly toward batched inference, speculative decoding, and mixture-of-experts — regimes where the single-stream, single-kernel paradigm may be less relevant. The paper acknowledges these as orthogonal but doesn't address how the approach composes with them.

5. Strengths & Limitations

Key strengths:

Exceptional transparency and honesty — perhaps the most forthright performance reporting I've seen in a systems paper. Every limitation, loss, and methodology caveat is explicitly stated.

The validator soundness testing is well-designed with meaningful adversarial evaluation.

The system design separating trusted frozen base (Layer 0) from agent-editable surface (Layer 2) is architecturally clean.

Open-source release with reproducible artifacts.

Key weaknesses:

The scope is narrow: Llama-family only, batch-1 only, position-0 only, models up to 1.1B only.

The headline cuBLAS win is precision-asymmetric; on equal precision, AMK consistently loses.

On training-class GPUs (where most production inference runs), AMK substantially underperforms.

The validator provides empirical soundness, not formal guarantees — for a paper whose central claim is "correctness by construction," this is a meaningful gap.

The self-improvement loop improves over AMK's own baseline (1.25-1.72×), which is less impressive than it sounds since the starting point is substantially below cuBLAS.

The paper is extremely long and repetitive, restating the same results and caveats multiple times, which obscures rather than clarifies the contribution.

Summary

AMK makes a genuine contribution to the intersection of compiler design and agent-driven optimization, with a well-designed static safety gate and honest evaluation. However, the narrow scope (Llama-only, small models, batch-1, position-0), the precision-asymmetric nature of its best results, and substantial losses on high-end hardware limit its practical impact. The system design ideas — particularly the validator and agent edit surface — are more impactful than the current performance results, but need demonstration at broader scale and on more diverse architectures to realize that potential.

Rating: 4.5 / 10

Significance 4.5 · Rigor 6 · Novelty 5.5 · Clarity 4.5

Generated Jun 9, 2026

Comparison History (22)

Won vs. Flow Matching with In-Context Priors for Out-of-Distribution Brain Dynamics

Paper 2 has higher likely scientific impact due to its broad applicability and timeliness: a statically-checked, agent-driven framework for synthesizing correct CUDA megakernels for LLM inference across GPU generations. It offers concrete, reproducible systems contributions (validator with extensive adversarial testing, cross-arch retargeting, correctness vs HuggingFace outputs, open-source release) and immediate real-world deployment relevance for inference efficiency. Paper 1 is novel within computational neuroscience, but its impact is narrower (fMRI/task generalization) and depends heavily on dataset/validation constraints and downstream adoption.

gpt-5.2 · Jun 11, 2026

Lost vs. Flexible Kernels for Protein Property Prediction

Paper 1 addresses a fundamental challenge in computational biology—predicting protein properties from sparse data—with a novel kernel approach that outperforms foundation model embeddings. Its contributions (evolutionary substitution matrix kernels, structure-aware kernels, multi-task learning) have broad applicability across protein engineering and drug design. Paper 2, while technically impressive in GPU kernel compilation, is narrowly focused on inference optimization for small Llama-family models, shows mixed results (trails on high-bandwidth GPUs), and its precision-asymmetric comparisons limit generalizability. Paper 1's methodological novelty and breadth of biological applications give it higher scientific impact potential.

claude-opus-4-6 · Jun 10, 2026

Lost vs. When to Align, When to Predict: A Phase Diagram for Multimodal Learning

Paper 1 provides a fundamental theoretical framework (phase diagram) for understanding when different multimodal learning paradigms succeed or fail, with broad applicability across scientific domains. It offers principled, data-driven diagnostic tools that can guide practitioners before expensive training. Paper 2, while technically impressive in GPU kernel engineering, addresses a narrower systems optimization problem for specific model architectures and hardware, with limited generalizability. Paper 1's theoretical contributions and cross-disciplinary relevance (biomedicine, astrophysics, vision-language) give it substantially broader and longer-lasting scientific impact.

claude-opus-4-6 · Jun 10, 2026

Won vs. K-Forcing: Joint Next-K-Token Decoding via Push-Forward Language Modeling

Paper 2 introduces a highly innovative system combining AI agents, static safety validation, and low-level kernel synthesis to autonomously optimize LLM inference. This cross-disciplinary approach (AI self-improvement applied to systems engineering) offers a novel paradigm for hardware retargeting without manual CUDA tuning. While Paper 1 presents a useful algorithmic speedup, its quality degradation may limit deployment, whereas Paper 2 achieves exact matches and demonstrates real-world acceleration on modern GPUs, promising broader impact in automated systems optimization.

gemini-3.1-pro-preview · Jun 10, 2026

Lost vs. Tight Sample Complexity of Transformers

Paper 1 provides foundational theoretical results by tightly characterizing the VC dimension and sample complexity of Transformers, including the mechanics of Chain-of-Thought learning. These fundamental mathematical bounds have long-lasting, broad scientific implications for understanding why LLMs work. While Paper 2 presents an innovative and highly practical systems engineering feat for automating kernel synthesis, Paper 1's theoretical contributions are more likely to have a profound and enduring impact on the core science of machine learning.

gemini-3.1-pro-preview · Jun 9, 2026

Won vs. Investigating Calibration Challenges in Probabilistic Electricity Price Forecasting

AutoMegaKernel presents a concrete, novel system with demonstrated results—a single persistent CUDA megakernel for entire LLM forward passes with static safety verification, cross-architecture retargeting, and an agent-driven self-improvement loop. It offers measurable speedups on real hardware, open-source code, and addresses the practically important problem of efficient LLM inference. Paper 1 is a position/review paper highlighting calibration issues in electricity price forecasting but proposes no new methods or concrete solutions, limiting its direct impact. Paper 2's novelty in compiler/systems design for AI has broader cross-field relevance.

claude-opus-4-6 · Jun 9, 2026

Won vs. Cross-Epoch Adaptive Rollout Optimization for RL Post-Training

Paper 2 introduces a highly innovative paradigm for hardware-specific optimization, utilizing AI agents to synthesize whole-model CUDA megakernels with static safety guarantees. Its automated, self-improving approach to systems engineering presents a significant methodological leap with broad real-world applications in deploying efficient AI inference across diverse hardware architectures, giving it a higher potential impact than the algorithmic efficiency improvements in Paper 1.

gemini-3.1-pro-preview · Jun 9, 2026

Won vs. Learning Dynamics Reveal a Hierarchy of Weight-Induced Layerwise Gram Metrics

Paper 1 presents a complete, practical system (AutoMegaKernel) with extensive empirical validation across multiple GPU architectures, demonstrating real speedups for LLM inference. It addresses a timely problem (efficient LLM deployment), includes novel contributions in static verification of GPU kernel safety, agent-driven code synthesis, and cross-architecture retargeting. Paper 2 provides interesting theoretical insights into neural network training dynamics via kernel decompositions, but its impact is more incremental within a well-studied theoretical area (NTK-style analyses) and lacks immediate practical applications. Paper 1's breadth of impact across systems, ML, and compiler communities, combined with its open-source release, gives it higher potential impact.

claude-opus-4-6 · Jun 9, 2026

Won vs. An Information-Theoretic Definition for Open-Ended Learning

Paper 2 has higher likely impact due to strong real-world applicability (practical, portable LLM inference megakernel generation), demonstrated end-to-end correctness, extensive empirical validation across architectures, and a reusable statically-checked harness enabling safe agent-driven optimization. Its contributions are timely for LLM deployment and span systems, compilers, GPU programming, and AI tooling. Paper 1 is conceptually novel and potentially foundational, but its impact is less immediately verifiable and appears scoped to a constructed bandit setting; broader adoption depends on subsequent empirical and theoretical development.

gpt-5.2 · Jun 9, 2026

Lost vs. Reactive Flux Matching: Mechanism Discovery and Adaptive Sampling of Rare Events

Reactive Flux Matching introduces a fundamentally novel theoretical framework for studying rare events in molecular systems, combining ideas from flow matching in generative modeling with path sampling methods. It addresses a longstanding challenge in computational chemistry/physics—extracting mechanistic insight from reactive trajectories without requiring knowledge of dynamics or stationary distributions. Its breadth of impact spans molecular simulation, chemical physics, drug design, and materials science. Paper 1, while technically impressive as a systems engineering contribution for GPU kernel compilation, is narrower in scope (specific to Llama-family LLM inference) and acknowledges limited gains on high-end hardware.

claude-opus-4-6 · Jun 9, 2026
