Jaber Jaber, Osama Jaber
AutoMegaKernel (AMK) compiles a HuggingFace Llama-family model into a single persistent cooperative CUDA kernel that runs the whole forward pass in one launch, with no per-model hand-written CUDA. The contribution is the system, not raw speed. A frozen schedule-IR validator statically certifies deadlock-freedom and race-freedom via static graph checks (not a mechanized proof), so an unsafe agent-proposed schedule is rejected before launch: across 7,160 adversarial schedules (6,091 unsafe) it had zero false-accepts and accepted all 360 real lowerings. The same source retargets sm_80/sm_90/sm_120 from one codebase, auto-generates correct megakernels for 10 of 10 supported models, and on a real SmolLM2-135M checkpoint reproduces HuggingFace greedy decode token-for-token (perplexity match 2.5e-7). An unattended, agent-drivable autoresearch loop self-improves the megakernel over its own baseline (1.25-1.72x). A search-found int8 (W8A16) megakernel beats CUDA-graphed cuBLAS bf16 at batch-1 decode across NVIDIA's datacenter inference fleet: L4 up to 1.33x, the current-gen L40S 1.25-1.27x, A10G up to 1.08x at scale, and the consumer RTX 5090 1.19-1.23x. The ordering is not a clean function of bandwidth (the 864 GB/s L40S beats the 600 GB/s A10G); the divide is inference-class vs training-class. AMK trails cuBLAS on the high-bandwidth training-class A100/H100, where the harness localizes the cross-SM-sync bottleneck; we report the gap plainly. This is a precision-asymmetric (W8A16 vs bf16) comparison at decode position 0; the largest real checkpoint is TinyLlama-1.1B. Code and the harness: https://github.com/RightNow-AI/AutoMegaKernel
AutoMegaKernel (AMK) addresses the problem of compiling an entire LLM forward pass into a single persistent cooperative CUDA kernel, eliminating inter-operator launch overhead and HBM round-trips. The core novelty is not raw performance but the *system design*: a four-layer architecture where a frozen schedule-IR validator statically certifies deadlock-freedom and race-freedom before any GPU launch, enabling an automated agent to propose schedule modifications safely. The key insight is that forward passes are DAGs with monotonic counter synchronization, reducing safety verification to a small set of static graph checks.
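To make the "small set of static graph checks" idea concrete, here is a minimal sketch of what such a validator could look like. It is not AMK's schedule IR or its actual validator: the `Task` record, its `reads`/`writes`/`waits` fields, and the `validate` function are all hypothetical names chosen for illustration. It assumes each task bumps one monotonic completion counter and may wait on other tasks' counters. Under that assumption, deadlock-freedom reduces to "every wait names a real task and the wait graph is acyclic", and race-freedom reduces to "any two tasks that conflict on a buffer are ordered by a chain of waits".

```python
from dataclasses import dataclass
from graphlib import CycleError, TopologicalSorter


@dataclass(frozen=True)
class Task:
    """Hypothetical schedule entry; not AMK's real IR."""
    name: str
    reads: frozenset = frozenset()   # buffers this task reads
    writes: frozenset = frozenset()  # buffers this task writes
    waits: frozenset = frozenset()   # tasks whose counters it waits on


def validate(tasks):
    """Return a list of violations; an empty list means the schedule is accepted."""
    by_name = {t.name: t for t in tasks}

    # Check 1: a wait on a task that does not exist would never be satisfied.
    errors = [
        f"{t.name} waits on unknown task {dep}"
        for t in tasks
        for dep in sorted(t.waits)
        if dep not in by_name
    ]
    if errors:
        return errors

    # Check 2: the wait graph must be a DAG, otherwise some task waits forever.
    graph = {t.name: set(t.waits) for t in tasks}
    try:
        order = list(TopologicalSorter(graph).static_order())
    except CycleError as exc:
        return [f"wait cycle: {exc.args[1]}"]

    # Transitive "happens-before" sets, built in topological order.
    before = {}
    for name in order:
        deps = set(by_name[name].waits)
        for dep in by_name[name].waits:
            deps |= before[dep]
        before[name] = deps

    # Check 3: conflicting accesses to a buffer must be ordered by waits.
    for i, a in enumerate(tasks):
        for b in tasks[i + 1:]:
            conflict = (a.writes & (b.reads | b.writes)) | (b.writes & a.reads)
            ordered = a.name in before[b.name] or b.name in before[a.name]
            if conflict and not ordered:
                errors.append(
                    f"race on {sorted(conflict)} between {a.name} and {b.name}"
                )
    return errors


if __name__ == "__main__":
    norm = Task("rmsnorm", reads=frozenset({"x"}), writes=frozenset({"h"}))
    qkv = Task("qkv", reads=frozenset({"h"}), writes=frozenset({"q"}),
               waits=frozenset({"rmsnorm"}))
    print(validate([norm, qkv]))  # ordered by a wait, so no violations
    unsynced = Task("qkv", reads=frozenset({"h"}), writes=frozenset({"q"}))
    print(validate([norm, unsynced]))  # missing wait is reported as a race
```

The point of the sketch is that every check is a pure function of the schedule, so a proposed edit can be rejected without touching a GPU. A real validator for a persistent kernel would also need to model things this sketch ignores, such as per-SM queue order and buffer reuse across layers.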
The system auto-generates megakernels for Llama-family models with zero per-model hand-written CUDA, retargets across three NVIDIA architectures (sm_80/sm_90/sm_120) from one codebase, and includes an autonomous self-improvement loop.
The paper addresses a timely intersection of two trends: (1) the push toward lower-latency LLM inference, particularly for interactive applications, and (2) the emerging use of AI agents for systems-level code optimization. The megakernel approach for single-stream decode is well-motivated by the bandwidth-bound nature of batch-1 inference.
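The bandwidth-bound argument can be illustrated with a back-of-envelope lower bound: if batch-1 decode must stream every weight from HBM once per token, per-token latency cannot be lower than weight bytes divided by memory bandwidth. The sketch below is an assumption-laden estimate, not a measurement from the paper. The function name `min_decode_ms` is hypothetical, the parameter count is the nominal 1.1B from the TinyLlama-1.1B name, and the model ignores KV-cache traffic, activations, quantization scales, and synchronization cost. The two bandwidth figures are the ones quoted in the abstract.

```python
def min_decode_ms(n_params, bytes_per_weight, bandwidth_gb_s):
    """Lower bound on per-token latency in ms, assuming each weight is read once."""
    weight_bytes = n_params * bytes_per_weight
    return weight_bytes / (bandwidth_gb_s * 1e9) * 1e3


# Assumptions: nominal parameter count; bandwidths as quoted in the abstract.
N_PARAMS = 1.1e9
BANDWIDTH_GB_S = {"A10G": 600, "L40S": 864}

if __name__ == "__main__":
    for gpu, bw in BANDWIDTH_GB_S.items():
        bf16 = min_decode_ms(N_PARAMS, 2, bw)  # bf16: 2 bytes per weight
        int8 = min_decode_ms(N_PARAMS, 1, bw)  # W8A16: 1 byte per weight
        print(f"{gpu}: bf16 >= {bf16:.2f} ms, int8 >= {int8:.2f} ms, "
              f"ceiling {bf16 / int8:.1f}x")
```

Under these assumptions, halving weight bytes permits at most a 2x speedup over bf16. The reported int8 gains of 1.08x to 1.33x sit below that ceiling, which is one reason the precision-asymmetric framing of the comparison matters when reading the headline numbers.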
However, the field is moving rapidly toward batched inference, speculative decoding, and mixture-of-experts — regimes where the single-stream, single-kernel paradigm may be less relevant. The paper acknowledges these as orthogonal but doesn't address how the approach composes with them.
AMK makes a genuine contribution to the intersection of compiler design and agent-driven optimization, with a well-designed static safety gate and honest evaluation. However, the narrow scope (Llama-only, small models, batch-1, position-0), the precision-asymmetric nature of its best results, and substantial losses on high-end hardware limit its practical impact. The system design ideas — particularly the validator and agent edit surface — are more impactful than the current performance results, but need demonstration at broader scale and on more diverse architectures to realize that potential.
Generated Jun 9, 2026
Paper 2 has higher likely scientific impact due to its broad applicability and timeliness: a statically-checked, agent-driven framework for synthesizing correct CUDA megakernels for LLM inference across GPU generations. It offers concrete, reproducible systems contributions (validator with extensive adversarial testing, cross-arch retargeting, correctness vs HuggingFace outputs, open-source release) and immediate real-world deployment relevance for inference efficiency. Paper 1 is novel within computational neuroscience, but its impact is narrower (fMRI/task generalization) and depends heavily on dataset/validation constraints and downstream adoption.
Paper 1 addresses a fundamental challenge in computational biology—predicting protein properties from sparse data—with a novel kernel approach that outperforms foundation model embeddings. Its contributions (evolutionary substitution matrix kernels, structure-aware kernels, multi-task learning) have broad applicability across protein engineering and drug design. Paper 2, while technically impressive in GPU kernel compilation, is narrowly focused on inference optimization for small Llama-family models, shows mixed results (trails on high-bandwidth GPUs), and its precision-asymmetric comparisons limit generalizability. Paper 1's methodological novelty and breadth of biological applications give it higher scientific impact potential.
Paper 1 provides a fundamental theoretical framework (phase diagram) for understanding when different multimodal learning paradigms succeed or fail, with broad applicability across scientific domains. It offers principled, data-driven diagnostic tools that can guide practitioners before expensive training. Paper 2, while technically impressive in GPU kernel engineering, addresses a narrower systems optimization problem for specific model architectures and hardware, with limited generalizability. Paper 1's theoretical contributions and cross-disciplinary relevance (biomedicine, astrophysics, vision-language) give it substantially broader and longer-lasting scientific impact.
Paper 2 introduces a highly innovative system combining AI agents, static safety validation, and low-level kernel synthesis to autonomously optimize LLM inference. This cross-disciplinary approach (AI self-improvement applied to systems engineering) offers a novel paradigm for hardware retargeting without manual CUDA tuning. Paper 1 presents a useful algorithmic speedup, but its quality degradation may limit deployment, whereas Paper 2 reproduces HuggingFace greedy decode token-for-token and demonstrates real-world acceleration on modern GPUs, promising broader impact in automated systems optimization.
Paper 1 provides foundational theoretical results by tightly characterizing the VC dimension and sample complexity of Transformers, including the mechanics of Chain-of-Thought learning. These fundamental mathematical bounds have long-lasting, broad scientific implications for understanding why LLMs work. While Paper 2 presents an innovative and highly practical systems engineering feat for automating kernel synthesis, Paper 1's theoretical contributions are more likely to have a profound and enduring impact on the core science of machine learning.
AutoMegaKernel (Paper 2) presents a concrete, novel system with demonstrated results: a single persistent CUDA megakernel for entire LLM forward passes with static safety verification, cross-architecture retargeting, and an agent-driven self-improvement loop. It offers measurable speedups on real hardware and open-source code, and it addresses the practically important problem of efficient LLM inference. Paper 1 is a position/review paper highlighting calibration issues in electricity price forecasting but proposes no new methods or concrete solutions, limiting its direct impact. Paper 2's novelty in compiler/systems design for AI has broader cross-field relevance.
Paper 2 introduces a highly innovative paradigm for hardware-specific optimization, in which AI agents propose schedules for compiler-generated whole-model CUDA megakernels and a static validator rejects unsafe schedules before launch. Its automated, self-improving approach to systems engineering presents a significant methodological leap with broad real-world applications in deploying efficient AI inference across diverse hardware architectures, giving it a higher potential impact than the algorithmic efficiency improvements in Paper 1.
Paper 1 presents a complete, practical system (AutoMegaKernel) with extensive empirical validation across multiple GPU architectures, demonstrating real speedups for LLM inference. It addresses a timely problem (efficient LLM deployment), includes novel contributions in static verification of GPU kernel safety, agent-driven code synthesis, and cross-architecture retargeting. Paper 2 provides interesting theoretical insights into neural network training dynamics via kernel decompositions, but its impact is more incremental within a well-studied theoretical area (NTK-style analyses) and lacks immediate practical applications. Paper 1's breadth of impact across systems, ML, and compiler communities, combined with its open-source release, gives it higher potential impact.
Paper 2 has higher likely impact due to strong real-world applicability (practical, portable LLM inference megakernel generation), demonstrated end-to-end correctness, extensive empirical validation across architectures, and a reusable statically-checked harness enabling safe agent-driven optimization. Its contributions are timely for LLM deployment and span systems, compilers, GPU programming, and AI tooling. Paper 1 is conceptually novel and potentially foundational, but its impact is less immediately verifiable and appears scoped to a constructed bandit setting; broader adoption depends on subsequent empirical and theoretical development.
Reactive Flux Matching (Paper 2) introduces a fundamentally novel theoretical framework for studying rare events in molecular systems, combining ideas from flow matching in generative modeling with path sampling methods. It addresses a longstanding challenge in computational chemistry/physics: extracting mechanistic insight from reactive trajectories without requiring knowledge of dynamics or stationary distributions. Its breadth of impact spans molecular simulation, chemical physics, drug design, and materials science. Paper 1, while technically impressive as a systems engineering contribution for GPU kernel compilation, is narrower in scope (specific to Llama-family LLM inference) and acknowledges limited gains on high-end hardware.