A Methodology for Selecting and Composing Runtime Architecture Patterns for Production LLM Agents

Vasundra Srinivasan

May 19, 2026

arXiv:2605.20173v1

cs.AI (primary), cs.SE

#1381 of 2292 · Artificial Intelligence

Tournament Score: 1388±40

Win Rate: 61%

Rating: 5.2/10

Significance 5.5 · Rigor 4 · Novelty 4.5 · Clarity 7

Abstract

Production LLM agents combine stochastic model outputs with deterministic software systems, yet the boundary between the two is rarely treated as a first-class architectural object. This paper names that boundary the stochastic-deterministic boundary (SDB): a four-part contract among a proposer, verifier, commit step, and reject signal that specifies how an LLM output becomes a system action. We argue that the SDB is the load-bearing primitive of production agent runtimes. Around this primitive, we organize agent runtime design into three concerns: Coordination, State, and Control. We present a catalog of six runtime patterns that compose the SDB differently across conversational, autonomous, and long-horizon agents: hierarchical delegation, scatter-gather plus saga, event-driven sequencing, shared state machine, supervisor plus gate, and human in the loop. For each pattern, we trace its lineage to distributed-systems concepts and identify what changes when the worker is stochastic. The paper contributes a five-step methodology for selecting runtime patterns, a diagnostic procedure that maps production failures to pattern weaknesses, and a failure mode called replay divergence, in which LLM-based consumers of a deterministic event log produce different downstream outputs under model-version or prompt changes. A stylized reliability decomposition separates per-call model variance from architectural momentum, motivating the claim that as model variance decreases, pattern choice and SDB strength become increasingly important levers for long-run reliability. We apply the methodology to five workloads and provide one runnable reference implementation for a 90-day contract-renewal agent.

AI Impact Assessments

(1 model)

Scientific Impact Assessment

1. Core Contribution

This paper introduces the stochastic-deterministic boundary (SDB) as a named architectural primitive for production LLM agent systems — a four-part contract (proposer, verifier, commit step, reject signal) that governs how an LLM output becomes a system action. Around this primitive, the paper organizes agent runtime design into three orthogonal concerns (Coordination, State, Control), presents a catalog of six runtime patterns mapped to distributed-systems antecedents, and provides a five-step selection methodology for choosing patterns given a workload.
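The four-part contract can be illustrated with a minimal Python sketch. This is not the paper's reference implementation: the names `Boundary`, `Rejection`, and `demo`, the retry limit of three, and the refund scenario are all hypothetical, chosen only to show a proposer, verifier, commit step, and reject signal working together.

```python
from dataclasses import dataclass, field
from typing import Callable, Generic, List, Optional, TypeVar

T = TypeVar("T")


@dataclass
class Rejection:
    """Reject signal: carries the reason back to the proposer (hypothetical shape)."""
    reason: str


@dataclass
class Boundary(Generic[T]):
    # Proposer: the stochastic side. In production this would wrap an LLM call;
    # it receives the previous rejection (or None) so it can revise its output.
    propose: Callable[[Optional[Rejection]], T]
    # Verifier: deterministic check. Returns None to accept, a Rejection otherwise.
    verify: Callable[[T], Optional[Rejection]]
    # Commit step: the only place a side effect is allowed to happen.
    commit: Callable[[T], None]
    # Assumed retry budget; the value 3 is a convention, not a derived threshold.
    max_attempts: int = 3
    rejections: List[Rejection] = field(default_factory=list)

    def run(self) -> bool:
        """Return True if some proposal was verified and committed."""
        feedback: Optional[Rejection] = None
        for _ in range(self.max_attempts):
            candidate = self.propose(feedback)
            feedback = self.verify(candidate)
            if feedback is None:
                self.commit(candidate)
                return True
            self.rejections.append(feedback)
        return False


def demo():
    # Scripted outputs stand in for a model so the example is reproducible.
    scripted = iter(["refund $9000", "refund $40"])
    ledger: List[str] = []

    def propose(feedback: Optional[Rejection]) -> str:
        return next(scripted)

    def verify(action: str) -> Optional[Rejection]:
        amount = int(action.split("$")[1])
        if amount > 100:
            return Rejection(f"amount {amount} exceeds limit 100")
        return None

    boundary = Boundary(propose, verify, ledger.append)
    committed = boundary.run()
    return committed, ledger, [r.reason for r in boundary.rejections]
```

In this sketch the first proposal is rejected and never reaches the ledger, while the second passes verification and is committed. Keeping the side effect inside `commit` is the design point: nothing the proposer emits can act on the system without passing the verifier first.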

The paper also names replay divergence — a failure mode where LLM-based consumers of a deterministic event log produce different outputs under model-version or prompt changes — and proposes a stylized reliability decomposition y(t) = μt + σξ(t) separating per-call model variance from "architectural momentum."

The fundamental insight is straightforward but useful: as LLM per-call variance decreases with model improvements, the architectural scaffolding surrounding the model becomes the dominant lever for system reliability. This is an engineering observation rather than a scientific discovery, but it is well-articulated and timely.

2. Methodological Rigor

The paper's empirical grounding is thin. The primary evidence consists of:

An audit of 5 open-source agent frameworks finding verifier-and-commit logic at 19/21 LLM-to-action call sites

A classification of 21 published failure post-mortems (71% localizing to SDB weaknesses)

Five worked application examples, of which one has a runnable reference implementation

Several concerns arise:

The reliability model is a metaphor, not a derivation. The authors acknowledge this, but y(t) = μt + σξ(t) is presented with enough formalism to suggest analytical weight it doesn't carry. The claim that μ "dominates" as σ shrinks is trivially true of any linear-plus-noise model, so the decomposition adds framing rather than insight. No empirical measurement of μ or σ is provided.
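The "trivially true" point can be checked with a short worked calculation. It rests on an assumption the review does not state: that ξ(t) has zero mean and unit variance at each t.

```latex
\begin{align*}
\mathbb{E}[y(t)] &= \mu t, \qquad \operatorname{Var}[y(t)] = \sigma^{2}, \\
\frac{\lvert \mathbb{E}[y(t)] \rvert}{\sqrt{\operatorname{Var}[y(t)]}}
  &= \frac{\lvert \mu \rvert \, t}{\sigma}
  \;\longrightarrow\; \infty
  \quad \text{as } \sigma \to 0 \text{ for fixed } \mu \neq 0,\ t > 0 .
\end{align*}
```

If ξ(t) were instead a standard Wiener process, the variance would be σ²t and the ratio would be |μ|√t/σ, which diverges in the same way. Under either reading the drift term dominates for any nonzero μ, which is why the decomposition by itself does not discriminate between architectures.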

The post-mortem analysis is the author's own classification with no inter-rater reliability or independent validation. Twenty-one cases is a small sample, and the classification methodology is not described in detail.

The worked applications are all constructed by the author. The paper acknowledges this as a threat to validity, but the absence of independent practitioner validation is a significant gap for a methodology paper. The methodology's decision predicates use threshold values (one hour, three retries) that are acknowledged as conventions rather than empirically derived.

The framework audit (19/21 call sites) is more compelling as evidence that the SDB is independently rediscovered, though the range from "one-line JSON parse" to "multi-stage review loop" suggests the concept may be too broad to be actionable without further refinement.

3. Potential Impact

The paper's strongest potential impact is practical rather than scientific. It provides vocabulary and a structured decision procedure for engineering teams building production LLM agents. The SDB naming, the three-concern taxonomy, the pattern catalog, and especially the failure-signature catalog in Section 5.2 are directly actionable artifacts.

The connection to distributed-systems primitives (sagas, actors, CAS, supervision trees) is well-drawn and will help practitioners recognize that many "novel" agent architecture problems have known solutions with known trade-offs. This knowledge transfer function is valuable.

The replay divergence concept is genuinely useful. It names a real failure mode that is specific to LLM-based event-sourced systems and not addressed in prior distributed-systems literature. This is arguably the paper's most novel technical contribution.
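A minimal Python sketch of how replay divergence could be detected follows. It is an illustration under stated assumptions, not the paper's procedure: `consumer_v1` and `consumer_v2` are hypothetical deterministic stand-ins for an LLM consumer before and after a model or prompt change, and the event log contents are invented for the example.

```python
import hashlib
import json
from typing import Callable, Dict, List

Event = Dict[str, str]
Consumer = Callable[[Event], str]


def replay(log: List[Event], consumer: Consumer) -> List[str]:
    """Feed every event in the log, in order, to the consumer."""
    return [consumer(event) for event in log]


def fingerprint(outputs: List[str]) -> str:
    """Stable hash of a replay's outputs, suitable for storing next to the log."""
    return hashlib.sha256(json.dumps(outputs).encode("utf-8")).hexdigest()


def diverging_events(log: List[Event], old: Consumer, new: Consumer) -> List[int]:
    """Indices of events where the two consumer versions disagree."""
    return [i for i, event in enumerate(log) if old(event) != new(event)]


# Hypothetical stand-ins for the same consumer under two prompt/model versions.
def consumer_v1(event: Event) -> str:
    return "escalate" if "cancel" in event["text"] else "ignore"


def consumer_v2(event: Event) -> str:
    text = event["text"]
    return "escalate" if "cancel" in text or "refund" in text else "ignore"


# Invented, unchanging event log: the deterministic side of the system.
LOG: List[Event] = [
    {"id": "e1", "text": "customer asks to cancel"},
    {"id": "e2", "text": "customer asks about refund"},
    {"id": "e3", "text": "customer says thanks"},
]
```

The log is identical across both replays, yet the derived outputs differ at the second event. Comparing stored fingerprints flags that a divergence exists, and `diverging_events` localizes it. A real LLM consumer may also vary between calls on the same version, which this sketch deliberately leaves out.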

The scope of influence is primarily software engineering and MLOps for LLM-based systems, not core AI/ML research. Adjacent fields like enterprise architecture, reliability engineering, and DevOps for AI systems would benefit.

4. Timeliness & Relevance

The paper addresses a genuine current bottleneck. As of 2025-2026, the industry is rapidly deploying multi-agent LLM systems with minimal architectural guidance. The gap between ML research (which focuses on model capabilities) and production deployment (which requires reliability, auditability, and failure recovery) is real and growing. Frameworks like AutoGen, CrewAI, and LangChain provide composition mechanisms but not selection guidance.

The timing is strong. The observation that architectural concerns will dominate model concerns as models improve is prescient and aligns with industry trajectory. However, the field is moving fast enough that some patterns may be obsoleted by framework evolution before adoption.

5. Strengths & Limitations

Key Strengths:

Naming power: The SDB, replay divergence, and architectural momentum are useful conceptual handles that fill genuine vocabulary gaps.

Actionable methodology: The five-step procedure with written gate artifacts and the failure-signature catalog are immediately usable by engineering teams.

Intellectual honesty: The paper is careful about its claims, acknowledges threats to validity explicitly, and distinguishes metaphor from derivation.

Open catalog design: The pattern-discovery procedure (Section 3.1) and the expectation of catalog evolution show good epistemological design.

Companion repository: Runnable code against a public dataset is valuable for a methodology paper.

Key Limitations:

No independent validation: All applications are author-constructed. A methodology paper's strongest evidence would be adoption by independent teams with measured outcomes.

Weak formalism: The reliability decomposition is acknowledged as metaphor but positioned as a contribution. No formal properties of the SDB contract are proved.

Limited novelty in the patterns themselves: The six patterns are explicitly acknowledged as pre-existing distributed-systems patterns applied to a stochastic setting. The adaptation is useful but not deep.

Scalability of evidence: 21 post-mortems and 5 frameworks is a modest evidence base for the universality claims made.

The paper is long and repetitive: Key claims are restated many times, and the worked examples, while useful for demonstration, share substantial structural similarity.

Overall Assessment

This is a well-organized practitioner-oriented methodology paper that names useful concepts and provides actionable guidance for a timely problem. Its scientific contributions are modest — the SDB is more a crystallization of existing practice than a new discovery, and the reliability model is a framing device rather than a technical result. Its practical contributions are stronger: the failure-signature catalog, the selection methodology, and the vocabulary are likely to see adoption. The paper would benefit significantly from independent validation and from tighter formalization of the SDB contract's properties.

Rating: 5.2/10

Significance 5.5 · Rigor 4 · Novelty 4.5 · Clarity 7

Generated May 20, 2026

Comparison History (23)

vs. Think Thrice Before You Speak: Dual knowledge-enhanced Theory-of-Mind Reasoning for Persuasive Agents

gemini-3.1 · 5/22/2026

Paper 1 offers foundational architectural primitives for production LLM agents, bridging software engineering and AI. By formalizing the 'stochastic-deterministic boundary' and identifying novel failure modes like 'replay divergence', it addresses a critical, widespread bottleneck in deploying reliable AI systems. Paper 2, while presenting strong results for a specific NLP task (persuasive dialogue via Theory of Mind), represents a standard dataset-and-method contribution. Paper 1 has significantly broader potential impact, as its methodologies will be relevant to virtually any researcher or engineer building production-level agentic systems across diverse domains.

vs. Scaling Observation-aware Planning in Uncertain Domains

gemini-3.1 · 5/22/2026

Paper 1 addresses a highly timely and critical bottleneck in modern AI: the reliable deployment of LLM agents in production. By formalizing the 'stochastic-deterministic boundary' and mapping established distributed-systems patterns to LLM architectures, it bridges software engineering and AI. This provides broad, immediate real-world utility across industries. While Paper 2 offers impressive algorithmic scaling for POMDPs, its impact is largely confined to classical planning and robotics, whereas Paper 1 shapes the rapidly expanding, cross-disciplinary field of agentic AI systems.

vs. The Shape of Testimony: A Scalable Framework for Oral History Archive Comparison

claude-opus-4.6 · 5/22/2026

Paper 2 introduces novel architectural concepts (SDB, runtime patterns) for production LLM agents—a rapidly growing field with broad industry and research impact. It provides a reusable methodology, failure diagnostics, and a new failure mode (replay divergence) relevant across all LLM-agent deployments. Paper 1, while methodologically sound and valuable for digital humanities, addresses a narrower domain (Holocaust oral history archives) and primarily validates existing qualitative claims computationally. Paper 2's timeliness, breadth of applicability across AI engineering, and foundational conceptual contributions give it higher potential impact.

vs. Format-Constraint Coupling in Knowledge Graph Construction from Statistical Tables

gpt-5.2 · 5/22/2026

Paper 2 has higher likely impact: it introduces a broadly applicable architectural abstraction (the stochastic-deterministic boundary) plus a pattern catalog and selection/diagnostic methodology that can influence how production LLM agents are built across domains. Its concepts transfer to software engineering, distributed systems, reliability, and HCI, and it targets a timely, rapidly expanding area (agentic systems in production). Paper 1 is novel and rigorous with a useful benchmark, but its scope is narrower (KG construction from statistical tables) and impact is more specialized.

vs. IdleSpec: Exploiting Idle Time via Speculative Planning for LLM Agents

gemini-3.1 · 5/22/2026

Paper 1 proposes a novel algorithmic approach to optimize LLM agent efficiency by utilizing idle time, backed by strong empirical gains on standard benchmarks. This concrete, measurable improvement in inference methodology is highly relevant to current AI research and likely to drive immediate citations and follow-up work, whereas Paper 2 offers a conceptual architectural framework that, while valuable for engineering, may have less direct scientific impact.

vs. ScenePilot: Controllable Boundary-Driven Critical Scenario Generation for Autonomous Driving

gemini-3.1 · 5/21/2026

Paper 2 addresses a critical and rapidly growing domain—software architecture for production LLM agents. By bridging classical distributed systems concepts with stochastic AI outputs, it offers a foundational framework (the stochastic-deterministic boundary) that can influence software engineering practices across countless industries. While Paper 1 provides a strong, rigorous technical contribution to autonomous driving safety, Paper 2's broader applicability and timeliness in standardizing LLM agent design give it a higher potential for widespread scientific and practical impact.

vs. ScenePilot: Controllable Boundary-Driven Critical Scenario Generation for Autonomous Driving

gemini-3.1 · 5/21/2026

While Paper 1 offers a rigorous, targeted advancement in autonomous driving safety simulation, Paper 2 has a wider breadth of impact and exceptional timeliness. By formalizing the 'stochastic-deterministic boundary' for LLM agents, it addresses a critical, universal bottleneck in deploying generative AI. Its translation of distributed systems concepts into a novel architectural framework establishes foundational software engineering principles that will likely heavily influence both academic research and broad industry adoption across multiple domains.

vs. When Tabular Foundation Models Meet Strategic Tabular Data: A Prior Alignment Approach

claude-opus-4.6 · 5/20/2026

Paper 2 addresses a novel and well-defined problem at the intersection of tabular foundation models and strategic classification, contributing a concrete method (SPN) with theoretical grounding and empirical validation. It opens a new research direction by bridging foundation model generalization with game-theoretic data shifts. Paper 1, while practically useful, is primarily a methodology/pattern catalog for LLM agent architectures—more of an engineering framework than a scientific contribution. Paper 2's rigorous problem formulation, inference-time solution, and experimental evaluation give it broader scientific impact across ML, fairness, and decision-making communities.

vs. Streamlined Constraint Reasoning via CNN Pattern Recognition on Enumerated Solutions

gemini-3.1 · 5/20/2026

Paper 1 addresses a critical and highly timely challenge: defining robust software architectures for LLM agents. By formalizing the stochastic-deterministic boundary and offering a concrete methodology for runtime patterns, it bridges AI and distributed systems. This foundational work has immense potential for widespread adoption across academia and industry as LLM agents scale. Paper 2 is highly innovative and shows impressive results in constraint programming, but its scope and potential audience are significantly narrower compared to the booming field of agentic AI architecture.

vs. POLAR-Bench: A Diagnostic Benchmark for Privacy-Utility Trade-offs in LLM Agents

gpt-5.2 · 5/20/2026

Paper 2 is likely higher impact due to a concrete, scalable benchmark addressing a timely, high-stakes problem (privacy/utility in agentic LLMs). It offers clearer methodological rigor (large multi-domain dataset, deterministic scoring, controlled axes producing diagnostic surfaces) and enables broad adoption across academia/industry for model evaluation and privacy alignment. Paper 1 provides valuable architectural framing and patterns for production systems, but is more methodology/design-oriented with narrower scientific generalizability and fewer standardized, reusable artifacts for comparative research.

vs. Can Large Language Models Revolutionize Survey Research? Experiments with Disaster Preparedness Responses

claude-opus-4.6 · 5/20/2026

Paper 2 presents a rigorous empirical evaluation with quantitative results on a real-world dataset, introduces a novel imputation method (A-TLM) that outperforms established baselines, and addresses concrete methodological challenges in survey research with broad applicability across social sciences. It also contributes actionable findings (subgroup-stratified bias auditing) and demonstrates practical LLM integration. Paper 1, while intellectually interesting in formalizing architectural patterns for LLM agents, is primarily a conceptual/methodological framework without strong empirical validation, limiting its immediate measurable impact.

vs. DARE-EEG: A Foundation Model for Mining Dual-Aligned Representation of EEG

claude-opus-4.6 · 5/20/2026

Paper 1 presents a novel foundation model (DARE-EEG) with concrete technical innovations (dual-aligned representation learning, conv-linear-probing) validated through extensive experiments across diverse EEG benchmarks, demonstrating state-of-the-art results. It addresses a fundamental challenge in EEG representation learning with broad applicability across brain-computer interface applications. Paper 2, while practically useful, is primarily a methodological framework/catalog for LLM agent architectures without empirical validation at scale—it organizes existing concepts rather than introducing fundamentally new scientific contributions. Paper 1's rigorous experimental methodology and novel technical contributions give it higher scientific impact potential.

vs. Evidence-Grounded Frontier Mapping and Agentic Hypothesis Generation in Nanomedicine

gemini-3.1 · 5/20/2026

Paper 1 has higher scientific impact as it directly accelerates fundamental scientific discovery. By creating an AI-driven, evidence-grounded hypothesis generation system for nanomedicine, it contributes to the transformative 'AI for Science' paradigm. While Paper 2 offers highly valuable software engineering and architectural patterns for LLM production, Paper 1's methodology for autonomously mapping research frontiers and generating valid hypotheses has profound implications for how future scientific research is conducted.

vs. STRIDE: A Self-Reflective Agent Framework for Reliable Automatic Equation Discovery

gemini-3.1 · 5/20/2026

Paper 1 introduces foundational architectural concepts and design patterns for integrating LLMs into production software, addressing a widespread engineering challenge across multiple domains. Its broad applicability to general agent design gives it a higher potential for widespread scientific and practical impact compared to Paper 2, which focuses on a narrower, albeit valuable, application of LLMs to symbolic regression.

vs. Recall Isn't Enough: Bounding Commitments in Personalized Language Systems

claude-opus-4.6 · 5/20/2026

Paper 1 addresses a broader, more fundamental architectural challenge in production LLM systems—how to structure the boundary between stochastic and deterministic components—offering a reusable methodology, pattern catalog, and diagnostic framework applicable across diverse agent workloads. Its breadth of impact across the rapidly growing field of LLM agent engineering, combined with novel conceptual contributions (SDB, replay divergence) grounded in established distributed-systems theory, gives it wider applicability. Paper 2 tackles an important but narrower problem (commitment validation in personalized systems) with modest availability results (0.49-0.60), limiting its practical adoption.

vs. Not all uncertainty is alike: volatility, stochasticity, and exploration

gemini-3.1 · 5/20/2026

Paper 2 offers foundational contributions to reinforcement learning and cognitive science by mathematically distinguishing volatility and stochasticity in exploration. Its rigorous theoretical derivations and broad implications for both artificial intelligence and computational psychiatry suggest a lasting and cross-disciplinary scientific impact. In contrast, Paper 1 presents a highly applied, domain-specific software engineering framework for LLM agents, which, while timely, is more likely to have a transient impact as engineering practices evolve.

vs. Interference-Aware Multi-Task Unlearning

gpt-5.2 · 5/20/2026

Paper 2 is more scientifically novel and broadly impactful: it introduces a formal multi-task unlearning problem (full- vs partial-task) and a concrete, general optimization framework (task-aware gradient projection + instance-level orthogonalization) with quantitative improvements on established benchmarks, suggesting methodological rigor and reproducibility. Its applications align with timely regulatory and safety needs (data deletion rights, privacy, model governance) and extend beyond CV to multi-task foundation models. Paper 1 is highly practical for LLM agent engineering but is more of an architectural taxonomy/methodology with narrower academic generality.

vs. Probabilistic Tiny Recursive Model

claude-opus-4.6 · 5/20/2026

Paper 1 demonstrates a concrete, novel contribution with strong empirical results: a task-agnostic stochastic exploration framework that dramatically improves reasoning accuracy (e.g., 87.4%→98.75% on Sudoku-Extreme) while using only 7M parameters—outperforming frontier LLMs at <0.0001x cost. This has immediate practical impact and broad applicability to efficient AI reasoning. Paper 2 offers a valuable architectural taxonomy for LLM agent systems but is primarily a conceptual/methodological framework without rigorous empirical validation, limiting its measurable scientific impact despite its practical relevance to engineering.

vs. Formal Skill: Programmable Runtime Skills for Efficient and Accurate LLM Agents

gpt-5.2 · 5/20/2026

Paper 2 likely has higher scientific impact due to a clearer conceptual contribution (the stochastic-deterministic boundary as a first-class contract), a broadly applicable catalog of reusable architecture patterns with ties to distributed-systems theory, and a systematic selection/diagnostic methodology grounded in production failure modes (e.g., replay divergence). Its scope generalizes across many agent types and organizations, making it timely and widely transferable. Paper 1 is strong and implementable, but its contribution is more runtime/engineering-specific and potentially narrower in cross-field influence.

vs. EMO-BOOST: Emotion-Augmented Audio-Visual Features for Improved Generalization in Deepfake Detection

gemini-3.1 · 5/20/2026

Paper 1 offers higher potential impact by addressing a critical, foundational gap in the rapidly expanding field of LLM agents: software architecture. By formally defining the 'stochastic-deterministic boundary' and mapping LLM runtimes to distributed systems concepts, it provides a much-needed theoretical framework for productionizing AI. In contrast, while Paper 2 tackles an important security issue with an innovative cross-modal emotion approach, its contribution is more narrowly focused and incremental (a 2.1% AUC improvement in deepfake detection). Paper 1's broad applicability across AI and software engineering ensures wider scientific and industrial relevance.