Prior Knowledge or Search? A Study of LLM Agents in Hardware-Aware Code Optimization
Dmitry Redko, Albert Fazlyev, Konstantin Sozykin, Maria Ivanova, Evgeny Burnaev, Egor Shvetsov
Abstract
LLM discovery and optimization systems are increasingly applied across domains, implementing a common propose-evaluate-revise loop. Such optimization or discovery progresses via context conditioning on feedback received from an environment. However, as modern LLM agents grow more complex in structure, it is difficult to evaluate which components contribute the most, and when and how this exploration may fail. We answer these questions through three controlled experiments. Our findings: (1) In pure black-box optimization, LLMs act as greedy optimizers. (2) In zero-shot kernel generation, providing explicit input-size information has no measurable effect: models converge to the same kernel parameters regardless of size or temperature, as though the size instruction were invisible. Moreover, when tasked to perform kernel optimization for uncommon kernel sizes, performance sharply degrades regardless of the language used. (3) In feedback-loop kernel optimization, CUDA improves monotonically under iterative feedback, while TVM IR actively degrades, which demonstrates that kernel optimization degrades when models operate with a low-density language. We conclude that LLMs in code optimization tasks depend heavily on pretrained priors rather than on provided feedback or agentic structure.
AI Impact Assessments
Scientific Impact Assessment
Core Contribution
This paper addresses a timely and important question: in the increasingly popular propose-evaluate-revise loop used by LLM-based discovery systems, how much does the LLM actually learn from feedback versus simply regurgitating pretrained priors? The authors answer this through three controlled experiments spanning pure black-box optimization (BBO) and hardware-aware kernel optimization, arriving at a unified thesis: LLMs are "strong prior exploiters and weak open-ended searchers."
The three key findings are: (1) LLMs behave as greedy optimizers in BBO, anchoring to the current best point rather than exploring; (2) zero-shot kernel generation produces size-agnostic solutions regardless of explicit dimensional instructions in the prompt—the model collapses to fixed parameter modes; (3) iterative feedback helps only when the LLM has a dense prior over the target representation (CUDA improves, TVM IR degrades).
The paper also provides a formal framework (Section 2) casting LLM agents and BBO within the same optimization loop, distinguishing them by what adapts: BBO updates optimizer state (with an entropy reduction guarantee), while LLMs update context against frozen weights (subject to an irreducible entropy floor).
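To make the distinction in the preceding paragraph concrete, here is a toy sketch of one propose-evaluate-revise loop run in two ways. It is an analogy only, not the paper's formalism: the objective, the Gaussian proposer, the shrink factor of 0.9, and the function names are all assumptions introduced for illustration. In the first loop the optimizer's own state adapts, so its proposal spread shrinks; in the second, the proposer is frozen and only the history (the "context") grows, so the spread never drops below its fixed value.

```python
import random


def objective(x):
    # Toy 1-D objective to minimize; stands in for environment feedback.
    return (x - 3.0) ** 2


def run_state_updating(steps=50, seed=0):
    """Optimizer state (mean, width) adapts, so the proposal spread narrows."""
    rng = random.Random(seed)
    mean, width = 0.0, 5.0
    best_x, best_y = mean, objective(mean)
    for _ in range(steps):
        x = rng.gauss(mean, width)        # propose
        y = objective(x)                  # evaluate
        if y < best_y:                    # revise: move the search distribution
            best_x, best_y = x, y
            mean = x
        width = max(width * 0.9, 1e-3)    # revise: shrink the spread (assumed schedule)
    return best_x, best_y, width


def run_context_updating(steps=50, seed=0):
    """Frozen proposer: only the history grows; the spread stays fixed."""
    rng = random.Random(seed)
    history = [(0.0, objective(0.0))]
    fixed_width = 5.0                     # stands in for an irreducible spread
    for _ in range(steps):
        anchor = min(history, key=lambda p: p[1])[0]  # greedy anchoring on best so far
        x = rng.gauss(anchor, fixed_width)            # propose around the anchor
        history.append((x, objective(x)))             # feedback only extends the context
    best_x, best_y = min(history, key=lambda p: p[1])
    return best_x, best_y, fixed_width


if __name__ == "__main__":
    print("state-updating  :", run_state_updating())
    print("context-updating:", run_context_updating())
```

Both loops can only improve on the starting value, but only the first one ends with a narrower proposal distribution than it started with, which is the contrast the framework draws.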
Methodological Rigor
The experimental design is generally well-controlled. The BBO experiments use four task families (Functions, Physical, BBOB-2D, BBOB-5D) with 50-trial budgets and meaningful metrics (best step, coverage, trajectory length). The comparison against CMA-ES, Centaur, and MCTS provides useful baselines spanning different exploration-exploitation tradeoffs.
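The review names three trajectory metrics without defining them. The sketch below shows one plausible reading of each, assuming a minimization task and a box-shaped search domain; the definitions, function names, and the `bins` parameter are assumptions and may differ from what the paper actually computes.

```python
import math


def best_step(values):
    """1-based index of the trial where the minimum value first appears."""
    return values.index(min(values)) + 1


def trajectory_length(points):
    """Sum of Euclidean distances between consecutive proposals."""
    return sum(math.dist(a, b) for a, b in zip(points, points[1:]))


def coverage(points, lower, upper, bins=10):
    """Fraction of grid cells visited inside the box [lower, upper]^d."""
    dim = len(points[0])
    visited = set()
    for p in points:
        cell = tuple(
            # Clamp so that points on the upper boundary fall in the last cell.
            min(bins - 1, max(0, int((c - lower) / (upper - lower) * bins)))
            for c in p
        )
        visited.add(cell)
    return len(visited) / bins ** dim
```

Under these readings, a greedy optimizer would tend to show an early best step, a short trajectory, and low coverage, since it keeps proposing near its current best point.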
The kernel optimization experiments are particularly well-designed: using the same computation expressed in CUDA vs. TVM IR cleanly isolates the effect of prior density, and the base vs. small-input regimes test distribution shift. The shape bias analysis (20 samples per grid point × 3 temperatures × 2 prompt conditions) provides reasonable statistical power.
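As a rough illustration of the shape bias analysis described above, the following sketch counts the samples per grid point implied by the stated design and checks whether the dominant configuration is the same for every input size. The temperature values, prompt-condition labels, and the tuple used to represent a tiling configuration are assumptions; none of the data here comes from the paper.

```python
from collections import Counter

SAMPLES_PER_CELL = 20                               # from the stated design
TEMPERATURES = (0.2, 0.7, 1.0)                      # hypothetical values
PROMPT_CONDITIONS = ("with_size", "without_size")   # hypothetical labels


def samples_per_grid_point():
    """Total samples drawn for one grid point across all conditions."""
    return SAMPLES_PER_CELL * len(TEMPERATURES) * len(PROMPT_CONDITIONS)


def dominant_config(configs):
    """Most frequent configuration and the share of samples that chose it."""
    config, count = Counter(configs).most_common(1)[0]
    return config, count / len(configs)


def is_size_agnostic(samples_by_size):
    """True when every input size has the same dominant configuration."""
    modes = {dominant_config(cfgs)[0] for cfgs in samples_by_size.values()}
    return len(modes) == 1


if __name__ == "__main__":
    # Made-up placeholder samples: (tile_m, tile_n) choices per input size.
    placeholder = {
        64: [(16, 16)] * 18 + [(8, 8)] * 2,
        4096: [(16, 16)] * 17 + [(32, 32)] * 3,
    }
    print(samples_per_grid_point())       # 20 * 3 * 2 = 120
    print(is_size_agnostic(placeholder))
```

A check of this kind is what makes the "identical dominant tiling parameters" claim falsifiable: it fails as soon as any one input size shows a different mode.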
However, there are methodological concerns. The formal framework in Section 2, while intuitive, makes somewhat loose claims. The entropy floor argument (Appendix B) is presented as a theorem but relies on assumptions that may not hold in practice (finite vocabulary and context length don't necessarily imply the entropy floor is meaningfully large). The BBO entropy reduction "proof" in Equation 5 assumes a Bayesian posterior, which doesn't apply to CMA-ES despite the paper's earlier claims. The paper acknowledges this only in a footnote.
The study uses only two LLM backbones (gpt-oss-120b and DeepSeek-V3.2/Qwen3-Coder-Next), and the finding that LLMs are "greedy" could be sensitive to prompting strategies not explored. The MCTS hyperparameter choices are acknowledged as potentially suboptimal, weakening that comparison.
Potential Impact
The findings have significant practical implications for the rapidly growing field of LLM-based optimization systems. The specific takeaways—that feedback is representation-conditional, that sequential feedback outperforms parallel sampling, and that hybrid approaches should route between LLM refinement and classical search based on prior density—provide actionable design principles.
The "prior collapse" finding in kernel generation (Section 4.1) is particularly striking and practically relevant: if LLMs genuinely ignore dimensional specifications in prompts when generating kernels, this has immediate implications for systems like KernelBench, AlphaEvolve, and production systems like KernelEvolve. The observation that dominant tiling parameters are identical across all input sizes and temperatures is a concrete, falsifiable finding that others can verify.
The work could influence adjacent fields wherever LLM-based discovery loops are deployed (scientific equation discovery, materials science, combinatorics), by encouraging practitioners to carefully assess whether their domain falls in the "dense prior" or "sparse prior" regime before building complex agentic pipelines.
Timeliness & Relevance
This paper is highly timely. The explosion of LLM-based agentic systems for optimization (AlphaEvolve, KernelBench, CUDA Agent, FunSearch) has created an urgent need for principled understanding of when and why these systems work. Many recent systems combine multiple components (tree search, evolutionary algorithms, retrieval, feedback loops) without ablation, making it difficult to attribute performance. This paper's controlled experimental approach directly addresses this gap.
The kernel optimization domain is particularly relevant given the industry push toward AI-generated GPU kernels, with both academic (KernelBench) and industrial (KernelEvolve at Meta) systems under active development.
Strengths
1. Clean experimental design: The CUDA vs. TVM IR comparison is an elegant way to control for prior density while keeping the underlying computation identical.
2. Unified perspective: Framing LLM agents and BBO within the same optimization loop with different adaptation mechanisms is conceptually valuable.
3. Actionable findings: The three-way routing recommendation (dense prior → LLM refinement; distribution shift → classical search; between → hybrid) is directly useful for system designers.
4. The "invisible instruction" finding: Demonstrating that explicit dimensional instructions have no measurable effect on kernel parameters is a striking and important result about the limits of prompt conditioning.
5. Comprehensive evaluation: Multiple task families, models, temperatures, prompt conditions, and agent architectures provide thorough coverage.
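The three-way routing recommendation in item 3 could be sketched as a simple dispatcher. This is a minimal illustration, assuming that prior density can be summarized as a score in [0, 1]; the score itself, both thresholds, and all names are hypothetical, and the paper does not specify how such a score would be obtained.

```python
from enum import Enum


class Strategy(Enum):
    LLM_REFINEMENT = "llm_refinement"
    HYBRID = "hybrid"
    CLASSICAL_SEARCH = "classical_search"


def route(prior_density, dense_threshold=0.7, sparse_threshold=0.3):
    """Pick an optimization strategy from a hypothetical prior-density score."""
    if not 0.0 <= prior_density <= 1.0:
        raise ValueError("prior_density must lie in [0, 1]")
    if prior_density >= dense_threshold:
        # Dense prior: iterative LLM refinement is expected to help.
        return Strategy.LLM_REFINEMENT
    if prior_density <= sparse_threshold:
        # Sparse prior or distribution shift: fall back to classical search.
        return Strategy.CLASSICAL_SEARCH
    # In between: combine LLM proposals with a classical search procedure.
    return Strategy.HYBRID
```

For example, `route(0.9)` selects LLM refinement and `route(0.1)` selects classical search; where the thresholds should sit in practice is an open question that the review does not address.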
Limitations
1. Limited model diversity: Only 2-3 models tested; results may not generalize to models with different training data distributions or architectures (e.g., models fine-tuned on TVM code).
2. Formal framework is somewhat superficial: The entropy-based analysis provides intuition but the mathematical claims (particularly the entropy floor) are not rigorously connected to the empirical observations.
3. Missing baselines: No comparison with RL-finetuned models (e.g., CUDA Agent) to test whether the entropy floor can be lowered through weight updates, which the paper hypothesizes but doesn't verify.
4. Greedy behavior attribution: The paper speculates this arises from the next-token prediction objective but acknowledges it could equally be from training data bias—this ambiguity limits the mechanistic insight.
5. Kernel benchmark scope: Only KernelBench levels 1-2 tested; the findings may not extend to more complex fusion patterns or different hardware targets.
Overall Assessment
This is a solid empirical contribution that addresses a genuine gap in understanding LLM-based optimization systems. The central finding—that pretrained priors dominate over feedback and agentic structure—is well-supported and practically important, even if the formal framework is somewhat underdeveloped. The work provides a useful cautionary perspective for the rapidly growing LLM-for-optimization community and offers concrete design recommendations. The main limitations are the restricted model family and the gap between the formal claims and empirical evidence.
Generated May 20, 2026
Comparison History (18)
Paper 2 provides fundamental insights into how LLM agents actually work in optimization tasks, revealing that they rely heavily on pretrained priors rather than feedback or agentic structure. These findings have broad implications across the rapidly growing field of LLM-based agents and optimization systems, challenging common assumptions about agentic AI. Paper 1, while technically impressive in automating visualization pipelines, represents more of an engineering contribution with narrower applicability. Paper 2's controlled experimental methodology and generalizable conclusions about LLM behavior are likely to influence agent design across many domains.
Paper 1 provides novel empirical insights into a fundamental question about LLM-based code optimization: whether LLMs truly leverage feedback and search or primarily rely on pretrained priors. Its controlled experiments reveal surprising findings (greedy optimization behavior, insensitivity to input specifications, degradation with low-density languages) that have broad implications for the growing field of LLM-driven optimization and discovery systems. Paper 2 proposes useful evaluation taxonomies for LLM agents but is more incremental in nature—extending existing diagnostic frameworks—and explicitly positions itself as a methodology demonstration rather than a benchmark contribution.
Paper 2 is more likely to have higher scientific impact: it delivers controlled experiments that isolate failure modes of LLM agents in hardware-aware optimization, producing actionable, generalizable insights (dependence on pretrained priors; limits of feedback; degradation with low-density languages like TVM IR). These findings are timely for agentic LLM evaluation and practical for compiler/ML systems research. Paper 1 is an important systems/platform contribution with real-world deployment, but its impact is more infrastructural and may depend on adoption rather than yielding broadly reusable scientific conclusions.
Paper 1 offers more novel, generalizable scientific insight into how LLM agents behave under controlled optimization settings, identifying failure modes (greedy black-box behavior, weak conditioning on size, degradation under low-density IR) with clear implications for LLM-based code generation, compilers, and agent design. It is methodologically stronger (controlled experiments, comparative conditions) and timely for reliable LLM tooling. Paper 2 is a useful systems/platform contribution with real-world deployment, but its impact is more infrastructural and may hinge on adoption rather than broadly transferable scientific findings.
Paper 2 (PRISM) likely has higher impact because it introduces a large, multilingual benchmark and a multi-metric evaluation framework that can become shared infrastructure for the community, enabling standardized comparison and driving progress across program synthesis, vision-language, and temporal/spatial reasoning. Its dataset scale and the identified “Execution–Spatial Gap” provide actionable, broadly relevant insights. Paper 1 offers useful controlled experiments and negative findings about LLM agent behavior in code optimization, but its scope is narrower and more diagnostic than enabling, which may limit cross-field uptake.
Paper 1 offers a novel, formal framework for trust calibration in agentic tool use, connecting it to preferential Bayesian optimization and providing a principled, sample-efficient querying strategy with clear methodological grounding (GP classification, uncertainty-based escalation). Its applications span safety, human-in-the-loop autonomy, and policy gating across many agentic systems, giving broad cross-field impact and strong timeliness as tool-using agents proliferate. Paper 2 is valuable and timely but is primarily an empirical diagnostic of current LLM agent behavior in a narrower domain (hardware-aware optimization) with less generalizable methodological innovation.
Paper 1 addresses a universal and critical bottleneck in generative AI: scalable, human-aligned evaluation. Its framework for bridging human judgment with automated assessment is broadly applicable across diverse modalities (text, image) and domains, promising widespread adoption. In contrast, Paper 2 provides valuable but highly domain-specific insights into LLM behavior within hardware-aware code optimization. Because robust evaluation methodologies are foundational to advancing all GenAI research, Paper 1 has a significantly larger potential audience and broader cross-disciplinary impact.
Paper 2 has higher likely impact because it delivers controlled, diagnostic experiments that clarify when LLM agent optimization works or fails in a high-stakes, real-world domain (hardware-aware code optimization). Its negative/limitation findings (greedy black-box behavior, ineffectiveness of size conditioning, degradation in low-density languages/IR) are broadly actionable for ML systems, compilers, and agent design, and are timely for current agentic coding efforts. Paper 1 is a solid incremental framework contribution, but is narrower and depends on a bespoke benchmark, making generalization less certain.
Paper 2 likely has higher scientific impact: it proposes a new self-supervised objective (dual alignment for mask-invariance) plus a practical adaptation method (conv-linear-probing) and reports broad, state-of-the-art gains across diverse EEG benchmarks—suggesting strong methodological contribution and real-world BCI/health applications. Its ideas may generalize to other masked-view representation learning settings. Paper 1 is valuable and timely as a diagnostic/negative-result study of LLM agents in hardware-aware optimization, but its impact is more specialized and primarily characterizes limitations rather than delivering a broadly reusable method.
Paper 2 provides fundamental insights into LLM behavior in optimization tasks, challenging widespread assumptions about their ability to utilize feedback over pretrained priors. This discovery has broad implications for agent design across multiple disciplines. Paper 1 is a valuable but domain-specific audit of trading agents, making its impact narrower and more focused on correcting methodological flaws within a specific niche.
Paper 1 identifies a highly specific, generalizable structural failure mode (the 4x4 threshold) that fundamentally informs our understanding of LLM working memory and reasoning limits. Its rigorous forensic pipeline and concrete findings offer broader implications for cognitive modeling of LLMs and architecture design compared to Paper 2's domain-specific findings on prior knowledge in code optimization.
Paper 2 has higher likely impact: it studies LLM agents in hardware-aware code optimization, a timely, high-stakes real-world domain (compilers, CUDA/TVM, performance engineering) with broad applicability to agent design, RL/black-box optimization, and systems research. Its controlled experiments isolate failure modes (greedy behavior, instruction insensitivity, degradation under low-density IR) that can generalize beyond one benchmark. Paper 1 is rigorous and valuable for debunking chess-LM claims and promoting verifier-in-the-loop, but the domain is narrower and closer to prior critiques of memorization in constrained games.
Paper 2 is more novel in framing “proactive document-guided actions” as a distinct capability and contributes a benchmark (DocOS) that can standardize evaluation and drive follow-on work. Its applications span web automation, enterprise tooling, accessibility, and general agentic RAG, giving broader cross-field impact and timeliness as GUI agents rapidly evolve. Paper 1 offers valuable, rigorous negative/diagnostic findings about LLM optimization limits in hardware-aware code, but its impact is narrower (compiler/kernel optimization) and mainly characterizes failure modes rather than enabling a new scalable research direction or widely reusable artifact.
Paper 2 introduces a novel framework (ChemVA) and a new benchmark dataset (OCRD-Bench) that directly solves a major multimodal bottleneck in chemistry. By enabling open-weight models to rival proprietary ones in chemical reaction understanding, it offers high utility for downstream applications like drug discovery. While Paper 1 provides valuable empirical insights into LLM limitations in coding, Paper 2 delivers foundational tools and datasets that typically drive broader, more immediate real-world scientific adoption and higher citation counts.
Paper 1 introduces a novel framework (LAR) addressing a fundamental bottleneck in LLM agent efficiency—action space representation—with broad applicability across agent benchmarks. It offers a concrete, generalizable method with demonstrated improvements in inference efficiency and task success. Paper 2 provides valuable empirical insights about LLM behavior in code optimization but is more diagnostic/analytical in scope, focused on a narrower domain (hardware-aware optimization). Paper 1's contribution is more actionable, broadly applicable, and opens a new research direction (action representation learning for agents), giving it higher potential impact.
Paper 2 has higher estimated impact due to broader relevance and clearer scientific contribution: it provides controlled experiments that isolate failure modes and behavioral properties of LLM agents (greedy black-box optimization, weak use of explicit constraints, sensitivity to representation density). These findings generalize across agent design, program synthesis, and hardware-aware optimization, informing both research and deployment. Paper 1 is practically valuable but more narrowly scoped to a specific training-control mechanism and evaluation setup; without deeper theoretical grounding and wider validation, its impact is likely more incremental and systems-specific.
Paper 2 provides fundamental insights into how LLM agents actually work in optimization tasks, revealing they rely on pretrained priors rather than feedback—a finding with broad implications across the rapidly growing field of LLM-based agents and automated code optimization. This challenges core assumptions about agentic LLM systems and has wider cross-domain relevance. Paper 1, while methodologically sound, addresses a narrower problem (HAR with KAN-MLP hybrids) with incremental architectural improvements, limiting its broader scientific impact.
Paper 2 has higher impact potential due to broader applicability (LLM agents, code optimization, hardware-aware performance), timely relevance to AI systems evaluation, and clearer real-world implications for compiler/toolchain and agent design. It offers controlled experiments that isolate failure modes (greedy behavior, ignoring size instructions, degradation in low-language-density IR), which can generalize across optimization settings and inform future methods. Paper 1 is a valuable, reproducible case study in AI-assisted formalization, but its scope is narrower and centers on an incomplete main proof, limiting immediate cross-field impact.