Compiling Agentic Workflows into LLM Weights: Near-Frontier Quality at Two Orders of Magnitude Less Cost

Simon Dennis, Rivaan Patil, Kevin Shabahang, Hao Guo

May 21, 2026

arXiv:2605.22502v1

cs.AI (primary), cs.LG

#689 of 2292 · Artificial Intelligence

Tournament Score: 1453 ± 47 (scale 1050 to 1800)

Win Rate: 70%

Rating: 6.5 / 10

Significance 7 · Rigor 6 · Novelty 5 · Clarity 8.5

Abstract

Agent orchestration frameworks have proliferated, collectively exceeding 290,000 GitHub stars across LangGraph, CrewAI, Google ADK, OpenAI Agents SDK, Semantic Kernel, Strands, and LlamaIndex. All follow the same pattern: an external orchestrator above the LLM, injecting instructions and routing decisions every turn. Recent work has shown this architecture is dominated for procedural tasks by simply providing the procedure in a frontier model's system prompt [Dennis et al., 2026a], at the cost of consuming the context window, requiring a frontier model for every conversation, and exposing proprietary procedures to third-party providers. Compiling the procedure into the weights of a small fine-tuned model -- creating a subterranean agent -- should resolve all of these concerns, and prior work (SimpleTOD, FireAct, SynTOD, WorkflowLLM, Agent Lumos) has shown the technique works. Yet developer adoption has overwhelmingly favored orchestration. We identify three perceived barriers and address each empirically across travel booking (14 nodes), Zoom support (14 nodes, product-specific knowledge), and insurance claims (55 nodes, 6 decision hubs).

AI Impact Assessments

(1 model)

Scientific Impact Assessment

Core Contribution

This paper addresses the gap between academic work on compiling procedural knowledge into LLM weights and industry practice, which overwhelmingly favors external orchestration frameworks. The authors identify three perceived barriers to adoption—quality, cost, and flexibility—and systematically address each across three domains of increasing complexity (travel booking: 14 nodes, Zoom support: 14 nodes with domain knowledge, insurance claims: 55 nodes). The key claim is that fine-tuning small (3B–8B) models on synthetic conversations generated from procedure flowcharts yields agents that achieve 87–98% of frontier in-context quality at 128–462× lower cost, with a recompile cycle of 30–50 minutes.

The conceptual contribution—"persistent structure belongs in the weights, transient state belongs in the prompt"—is a clean architectural principle that could reshape how practitioners think about deploying procedural agents.
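To make "synthetic conversations generated from procedure flowcharts" concrete, the following is a minimal sketch of one plausible first step: sampling paths through a flowchart so that each path can later seed a synthetic conversation. The toy graph, the node names, and the `sample_path` helper are all hypothetical illustrations, not the authors' pipeline, which the assessment does not describe at this level of detail.

```python
import random

# Hypothetical toy procedure: each node maps to its possible next nodes.
# A node with no successors is terminal. Not taken from the paper.
FLOWCHART = {
    "greet": ["collect_destination"],
    "collect_destination": ["collect_dates"],
    "collect_dates": ["offer_options"],
    "offer_options": ["confirm_booking", "collect_dates"],  # user may change dates
    "confirm_booking": [],
}


def sample_path(flowchart, start, rng, max_steps=50):
    """Random walk from `start` until a terminal node or `max_steps` nodes."""
    path = [start]
    node = start
    while flowchart[node] and len(path) < max_steps:
        node = rng.choice(flowchart[node])
        path.append(node)
    return path


if __name__ == "__main__":
    rng = random.Random(0)
    for _ in range(3):
        # Each sampled path would be handed to a generator model to be
        # rendered as a full conversation (that step is not shown here).
        print(" -> ".join(sample_path(FLOWCHART, "greet", rng)))
```

Under this reading, the procedure itself never appears in the deployed model's prompt; it is only present implicitly in the training conversations.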

Methodological Rigor

Strengths. The experimental design has several commendable features:

A controlled same-model comparison (3B compiled vs. 3B orchestrated) that isolates the effect of compilation from model capacity, which prior work lacked.

Three domains of increasing complexity, demonstrating scaling behavior.

n=200 scenarios per condition per domain with bootstrap CIs, effect sizes, and Holm–Bonferroni correction.

Cross-judge validation using both Claude Sonnet 4.5 and GPT-4.1 to address self-preference bias concerns.

Detailed failure rate analysis showing compiled models have lower failure rates than orchestrators in 2/3 domains.
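For readers unfamiliar with the statistical tools named above, here is a minimal sketch of a percentile bootstrap confidence interval for a difference in mean scores and of the Holm–Bonferroni step-down correction. It uses only the Python standard library; the function names, the resample count, and the example p-values are assumptions for illustration, and the paper's exact procedure may differ.

```python
import random
from statistics import mean


def bootstrap_ci(a, b, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for mean(a) - mean(b).

    Each group is resampled independently with replacement.
    """
    rng = random.Random(seed)
    diffs = []
    for _ in range(n_boot):
        resampled_a = rng.choices(a, k=len(a))
        resampled_b = rng.choices(b, k=len(b))
        diffs.append(mean(resampled_a) - mean(resampled_b))
    diffs.sort()
    lo = diffs[int((alpha / 2) * n_boot)]
    hi = diffs[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi


def holm_bonferroni(p_values, alpha=0.05):
    """Return one reject flag per hypothesis, in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank, idx in enumerate(order):
        # The k-th smallest p-value is compared against alpha / (m - k).
        if p_values[idx] <= alpha / (m - rank):
            reject[idx] = True
        else:
            break  # step-down: all larger p-values are retained as well
    return reject


if __name__ == "__main__":
    # Illustrative p-values only; not results from the paper.
    print(holm_bonferroni([0.01, 0.04, 0.03, 0.005]))
```

Holm–Bonferroni is uniformly at least as powerful as the plain Bonferroni correction while still controlling the family-wise error rate, which is why it is a common choice when many metrics are compared at once.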

Weaknesses. Several methodological concerns temper enthusiasm:

1. LLM-as-judge evaluation. All quality assessments rely on LLM judges. The authors use Claude as both data generator and primary judge, which creates a circularity concern despite the GPT-4.1 replication. No human evaluation is conducted, which is a notable gap for claims about "naturalness" and "graceful handling."

2. Synthetic-only evaluation. All scenarios are generated synthetically. There's no evaluation on real user interactions, which could reveal failure modes not captured by a simulated user that, by design, produces "contextually appropriate" responses.

3. User simulator limitations. The user simulator (Claude Sonnet 4.5) likely produces more cooperative, coherent users than real customers. The claim that compiled models handle edge cases well (87–98% of frontier) may not generalize to adversarial or genuinely confused users.

4. Cost analysis assumptions. The cost comparison assumes self-hosted inference on reserved A100s at $2.50/hr with optimal batching. Real deployment involves engineering overhead, monitoring, failover, and utilization inefficiencies not captured here.

5. The GPT-4.1 judge tells a somewhat different story. Under GPT-4.1, the LangGraph orchestrator leads the compiled model on more metrics, and the quality advantage is less clear-cut. The authors acknowledge this but frame both judges as supporting their conclusions.
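The cost concern in point 4 can be illustrated with simple arithmetic. The sketch below shows how a self-hosted cost advantage shrinks in direct proportion to GPU utilization. Only the $2.50/hr A100 rate comes from the assessment; the throughput and API cost figures are hypothetical placeholders, not values from the paper.

```python
def cost_per_conversation(gpu_hourly_usd, conversations_per_gpu_hour, utilization=1.0):
    """Self-hosted cost per conversation at a given GPU utilization (0 < u <= 1)."""
    if not 0 < utilization <= 1:
        raise ValueError("utilization must be in (0, 1]")
    return gpu_hourly_usd / (conversations_per_gpu_hour * utilization)


def cost_ratio(api_cost_per_conversation, self_hosted_cost_per_conversation):
    """How many times cheaper self-hosting is than the API baseline."""
    return api_cost_per_conversation / self_hosted_cost_per_conversation


GPU_HOURLY_USD = 2.50              # reserved A100 rate cited in the assessment
CONVERSATIONS_PER_GPU_HOUR = 1000  # hypothetical throughput at full batching
API_COST_PER_CONVERSATION = 0.50   # hypothetical frontier-API cost

if __name__ == "__main__":
    for utilization in (1.0, 0.5, 0.25):
        self_hosted = cost_per_conversation(
            GPU_HOURLY_USD, CONVERSATIONS_PER_GPU_HOUR, utilization
        )
        ratio = cost_ratio(API_COST_PER_CONVERSATION, self_hosted)
        print(f"utilization={utilization:.2f}  "
              f"self-hosted=${self_hosted:.4f}  ratio={ratio:.0f}x")
```

Because the ratio is linear in utilization, a deployment that keeps its GPUs busy a quarter of the time retains a quarter of the headline advantage, before any engineering or monitoring overhead is counted.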

Potential Impact

Practical impact could be substantial. The paper directly addresses practitioners deploying customer-facing conversational agents, a large and growing market. The 128–462× cost reduction is compelling for high-volume deployments (call centers, support chatbots). The 30–50 minute recompile cycle makes the approach viable for organizations with evolving procedures.

Conceptual impact is moderate. The idea of distilling agent capabilities into weights is not new (SimpleTOD, FireAct, etc.), and the authors acknowledge this. The contribution is primarily empirical—demonstrating the technique works at practical scale with comprehensive baselines—rather than introducing a novel method.

Influence on adjacent fields may include: (1) enterprise AI deployment patterns, shifting from API-dependent to self-hosted architectures; (2) data privacy applications where exposing procedures to third-party APIs is unacceptable; (3) edge deployment scenarios where latency matters.

Timeliness & Relevance

The timing is excellent. Agent frameworks are proliferating rapidly (290K+ GitHub stars cited), yet reliability remains poor (the paper cites 60% pass@1 agents showing only 25% consistency). There is clear industry demand for more reliable, cheaper agent deployment. The paper positions compilation as a practical alternative at precisely the moment when orchestration fatigue is setting in.

The paper also arrives as fine-tuning infrastructure has matured (vLLM, DeepSpeed ZeRO-3), making the proposed pipeline technically accessible. The framing as a "CI/CD cycle" rather than "retraining" is strategically effective for practitioner audiences.

Strengths & Limitations

Key strengths:

Clean experimental design with controlled comparisons that prior work lacked

Three domains spanning a meaningful complexity range (14 to 55 nodes)

Practical focus: cost quantification, recompile timing, and deployment considerations

The failure rate analysis (Table 5) is particularly informative—24% orchestrator failure in travel vs. 5.5% for compiled

Conversation examples (Appendix A) effectively illustrate qualitative differences

Notable limitations:

No human evaluation whatsoever

No real-user deployment or A/B testing

The quality ceiling (87–98%) may matter significantly for high-stakes domains like insurance

Limited to English, task-oriented dialogue

Full parameter fine-tuning requirement (LoRA reportedly fails) increases the hardware barrier

The claim about LoRA failure cites the authors' own concurrent work, which is not yet peer-reviewed

No analysis of what happens when users deviate significantly from expected flowchart paths

The 55-node insurance procedure, while more complex, still represents a structured, well-defined workflow; truly open-ended or multi-system procedures remain untested

Overall Assessment

This is a well-executed empirical study that packages known techniques (synthetic data generation from flowcharts, full fine-tuning) into a compelling practical narrative. The main contribution is not methodological novelty but rather the systematic, barrier-by-barrier comparison that makes the case for adoption. The cost analysis is the strongest element—two orders of magnitude is hard to ignore. The quality analysis, while thorough statistically, is undermined by reliance on LLM judges and synthetic users. The paper would be significantly strengthened by even a small-scale human evaluation or real-world deployment study.

The work is most impactful as an engineering-oriented position paper backed by solid experiments, rather than as a fundamental scientific advance.

Rating: 6.5 / 10

Significance 7 · Rigor 6 · Novelty 5 · Clarity 8.5

Generated May 22, 2026

Comparison History (20)

vs. MOSS: Self-Evolution through Source-Level Rewriting in Autonomous Agent Systems

gpt-5.2 · 5/22/2026

Paper 1 is more novel by enabling autonomous agents to self-evolve via verified source-code rewriting, expanding adaptation beyond prompt/workflow layers to the actual harness (routing, invariants, hooks). It proposes a rigorous, safety-aware pipeline (evidence batching, deterministic stages, replay-based verification, gated rollout/rollback) with concrete performance gains, and has broad implications for reliable long-lived agent systems and software maintenance. Paper 2 is timely and practical, but “compiling workflows into weights” is a more established direction (several cited predecessors), making its incremental novelty and cross-field impact comparatively lower.

vs. Look Before You Leap: Autonomous Exploration for LLM Agents

gpt-5.2 · 5/22/2026

Paper 1 likely has higher scientific impact due to a more foundational contribution: it formalizes autonomous exploration for LLM agents with a verifiable metric (Exploration Checkpoint Coverage) and proposes a general training paradigm (Explore-then-Act) applicable across environments and agent embodiments. This advances methodology for robustness and generalization, with broad relevance to RL, embodied AI, and agent evaluation. Paper 2 is timely and practically important for cost/privacy in procedural workflows, but appears more applied/engineering-focused and task-domain bounded, building on an existing “compile workflows into weights” line rather than introducing a new general capability metric/paradigm.

vs. SciCore-Mol: Augmenting Large Language Models with Pluggable Molecular Cognition Modules

gpt-5.2 · 5/22/2026

Paper 1 is more scientifically novel and cross-disciplinary: it proposes a modular LLM augmentation framework tightly integrating topology-aware molecular perception, diffusion-based molecular generation, and reaction-aware reasoning via learned interfaces, directly addressing representation gaps between text and chemical structures. This enables broad, high-impact applications in drug discovery and synthesis, with stronger methodological breadth across multiple chemical tasks and an open-source 8B system. Paper 2 is timely and practically valuable for LLM deployment cost/efficiency, but it is more incremental relative to prior “compile workflows into weights” work and is narrower in domain generality despite strong engineering relevance.

vs. Meta-Soft: Leveraging Composable Meta-Tokens for Context-Preserving KV Cache Compression

claude-opus-4.6 · 5/22/2026

Paper 1 addresses a highly practical and timely problem—replacing expensive agentic orchestration frameworks with fine-tuned small models—backed by empirical evidence across multiple real-world domains, offering two orders of magnitude cost reduction near frontier quality. Its direct relevance to the massive developer community using agent frameworks (290K+ GitHub stars) gives it enormous potential for real-world adoption. Paper 2 presents a solid incremental improvement to KV cache compression, but operates in a more crowded research space with narrower immediate impact. Paper 1's novelty in bridging the gap between agent architectures and model fine-tuning, combined with its practical implications, gives it higher estimated impact.

vs. ST-SimDiff: Balancing Spatiotemporal Similarity and Difference for Efficient Video Understanding with MLLMs

gpt-5.2 · 5/22/2026

Paper 1 targets a broad, timely shift in agentic AI: replacing external orchestration with “compiled” procedures in model weights, promising major cost, privacy/IP, and deployment benefits. If validated, it could materially change how production agents are built across many domains, impacting both research and industry practice. It also directly addresses adoption barriers with multi-workflow evaluations (including a larger 55-node case). Paper 2 is a solid, practical efficiency method for long-video MLLMs, but is more incremental within an active line of token pruning/selection and may have narrower cross-field impact.

vs. Learning to Explore: Scaling Agentic Reasoning via Exploration-Aware Policy Optimization

gpt-5.2 · 5/22/2026

Paper 2 likely has higher scientific impact due to a more novel, generalizable algorithmic contribution (exploration-aware RL with variationally derived rewards and action grouping) that can transfer across many agent settings and benchmarks, with immediate relevance to test-time scaling and adaptive tool use. Its methodology appears more rigorous and broadly applicable (text + GUI agents, public code/models), enabling follow-on work. Paper 1 targets an important engineering/economics problem (compiling workflows into weights) with clear applications, but is more incremental relative to prior “workflow distillation” lines and may be narrower/less theoretically foundational.

vs. The Log is the Agent: Event-Sourced Reactive Graphs for Auditable, Forkable Agentic Systems

gemini-3.1 · 5/22/2026

Paper 2 proposes a foundational architectural shift for agentic systems by introducing event-sourced reactive graphs. This paradigm offers novel theoretical properties like deterministic replay, cheap forking, and causal lineage, which are critical for debugging, auditing, and self-improving agents. While Paper 1 provides a highly practical and cost-effective optimization for deploying existing workflows, Paper 2's fundamental rethinking of state management and coordination is more likely to inspire broader methodological changes and future research across the field of autonomous agents.

vs. Skill Weaving: Efficient LLM Improvement via Modular Skillpacks

claude-opus-4.6 · 5/22/2026

Paper 2 addresses a more fundamental and timely problem—compiling agentic workflows into model weights versus relying on external orchestration—with clear practical implications for the rapidly growing AI agent ecosystem. It tackles specific adoption barriers empirically across multiple real-world domains, bridges the gap between academic fine-tuning research and industry practice, and demonstrates two orders of magnitude cost reduction. Paper 1 offers useful modular specialization but is more incremental in its contribution to parameter-efficient fine-tuning. Paper 2's broader relevance to the agent framework ecosystem gives it higher potential impact.

vs. A Camera-Cooperative ISAC Framework for Multimodal Non-Cooperative UAVs Sensing

gpt-5.2 · 5/22/2026

Paper 1 targets a broad, fast-moving area (LLM agents) with a potentially paradigm-shifting idea: compiling agentic workflows into small-model weights to cut inference/context costs and reduce data exposure. If validated, this impacts many applications beyond the three evaluated domains and could change how agent systems are deployed (edge, privacy, cost). Paper 2 is solid and practical for ISAC/UAV sensing, but is more domain-specific with narrower cross-field spillover. Overall, Paper 1 appears more timely with wider applicability and larger potential ecosystem impact.

vs. SMDD-Bench: Can LLMs Solve Real-World Small Molecule Drug Design Tasks?

claude-opus-4.6 · 5/22/2026

Paper 2 addresses a fundamental architectural question in AI agent design—whether agentic workflows can be compiled into model weights instead of relying on external orchestration—with broad applicability across all LLM agent applications. It directly challenges the dominant paradigm used by frameworks with 290K+ GitHub stars, offering practical cost reductions (two orders of magnitude) while maintaining quality. Paper 1, while valuable as a benchmark for drug design, serves a narrower community. Paper 2's findings about subterranean agents could reshape how the entire industry builds and deploys AI agents, giving it broader cross-field impact.

vs. Beyond the Org Chart: AI and the Transformation of Invisible Work

gemini-3.1 · 5/22/2026

Paper 1 presents a highly actionable technical advancement with immense practical and economic implications, offering a two-orders-of-magnitude cost reduction for LLM agents. Its rigorous empirical approach addresses a critical bottleneck in scalable AI deployment. While Paper 2 provides timely sociological insights into AI workplace impact, its qualitative nature and small sample size limit its broader scientific footprint compared to the transformative engineering paradigm introduced in Paper 1.

vs. TO-Agents: A Multi-Agent AI Pipeline for Preference-Guided Topology Optimization

gemini-3.1 · 5/22/2026

Paper 1 offers a foundational architectural shift for LLM applications, proposing a method to embed agentic workflows directly into model weights. This addresses critical bottlenecks in current AI development—cost, privacy, and latency—potentially reducing costs by two orders of magnitude while maintaining near-frontier quality. Because it applies universally to the booming field of AI agents (disrupting popular frameworks like LangGraph), its potential for widespread, cross-disciplinary impact significantly outweighs Paper 2, which presents a valuable but domain-specific application of AI agents for topology optimization in mechanical design.

vs. Evaluating Large Language Models as Live Strategic Agents: Provider Performance, Hybrid Decomposition, and Operational Gaps in Timed Risk Play

gemini-3.1 · 5/22/2026

Paper 1 proposes a fundamental architectural shift in AI agent deployment, moving from external orchestration to weight-compiled workflows. Demonstrating near-frontier quality at a 100x cost reduction has massive implications for enterprise adoption, scalability, and privacy. While Paper 2 offers valuable insights into dynamic LLM evaluation, Paper 1 addresses a structural bottleneck in real-world AI systems, giving it broader cross-field impact and significantly higher potential for immediate, widespread industrial and academic application.

vs. Evaluating Large Language Models as Live Strategic Agents: Provider Performance, Hybrid Decomposition, and Operational Gaps in Timed Risk Play

claude-opus-4.6 · 5/22/2026

Paper 2 addresses a fundamental architectural question in LLM deployment—whether agentic workflows can be compiled into model weights rather than orchestrated externally—with clear practical implications (100x cost reduction) and broad applicability across industries. It tackles a widely-adopted paradigm (290K+ GitHub stars worth of frameworks), identifies adoption barriers, and provides empirical evidence across diverse domains. Paper 1, while methodologically interesting in evaluating LLMs as live agents in a game setting, has narrower scope (Risk gameplay) and more incremental findings about provider differences. Paper 2's potential to shift how agentic systems are built gives it substantially higher impact.

vs. Mind the Sim-to-Real Gap & Think Like a Scientist

gemini-3.1 · 5/22/2026

Paper 1 addresses a critical bottleneck in the booming field of LLM agents: the high cost and latency of orchestration frameworks. By proving workflows can be compiled into smaller model weights at a 100x cost reduction, it offers immense real-world applicability, timeliness, and potential to reshape how industry builds AI agents. While Paper 2 presents rigorous theoretical contributions to the sim-to-real gap, Paper 1's immediate economic and practical implications give it a higher potential for widespread, disruptive impact across both academia and industry.

vs. MPDocBench-Parse: Benchmarking Practical Multi-page Document Parsing

claude-opus-4.6 · 5/22/2026

Paper 2 addresses a fundamental architectural question in LLM-based agent systems—whether agentic workflows can be compiled into model weights rather than orchestrated externally. This has broad implications across the rapidly growing AI agents ecosystem (290K+ GitHub stars mentioned), offering 100x cost reduction while maintaining quality. The practical impact on deployment economics, privacy, and efficiency is substantial. Paper 1, while valuable as a benchmark for document parsing, addresses a narrower problem domain. Paper 2's findings could reshape how the entire industry builds and deploys AI agents, giving it significantly broader potential impact.

vs. A Causal Argumentation Method for Explainability of Machine Learning Models

gemini-3.1 · 5/22/2026

Paper 1 addresses a highly relevant, rapidly growing field (LLM agent orchestration) and proposes a paradigm shift that drastically reduces costs. Its direct applicability to real-world tasks and potential to disrupt current frameworks give it significantly higher breadth of impact, real-world utility, and timeliness compared to Paper 2's more niche integration of argumentation frameworks into XAI.

vs. AttuneBench: A Conversation-Based Benchmark for LLM Emotional Intelligence

claude-opus-4.6 · 5/22/2026

Paper 1 addresses a highly practical and timely problem—compiling agentic workflows into LLM weights to replace costly orchestration frameworks—with clear cost-efficiency gains (two orders of magnitude) and near-frontier quality. It tackles real adoption barriers empirically across multiple domains, directly challenging the dominant paradigm used by hundreds of thousands of developers. Paper 2 introduces a valuable benchmark for emotional intelligence but is more incremental in scope, focusing on evaluation methodology rather than enabling a fundamental architectural shift. Paper 1's broader applicability to the booming AI agent ecosystem gives it higher potential impact.

vs. Forecasting Scientific Progress with Artificial Intelligence

gemini-3.1 · 5/22/2026

Paper 2 offers broader and more profound scientific impact. While Paper 1 provides a highly practical, cost-saving engineering solution for LLM agents, Paper 2 tackles a fundamental question: whether AI can forecast scientific breakthroughs. By introducing a massive multi-disciplinary benchmark across biology, chemistry, physics, and AI, Paper 2 reveals critical systemic flaws and overconfidence in frontier models. This benchmark will likely drive significant future research in AI for Science, R&D evaluation, and temporal reasoning, making its scientific footprint broader and longer-lasting than the workflow optimization presented in Paper 1.

vs. HarnessAPI: A Skill-First Framework for Unified Streaming APIs and MCP Tools

gemini-3.1 · 5/22/2026

Paper 1 proposes a fundamental architectural shift in AI agent design, moving from external orchestration to weight-compiled workflows. This offers substantial scientific and practical breakthroughs in cost reduction, privacy, and context efficiency. In contrast, Paper 2 presents a valuable software engineering framework for reducing API boilerplate, which, while highly practical for developers, represents an incremental tooling improvement rather than a broad scientific advancement.