M. Danish Lim, I. Danial Bin Sharudin, Wen Han Chen, Cedric Lim, Laura Wynter
We study orchestration mechanisms for tool-using AI agents in realistic customer-service workflows over an unstructured knowledge base. We argue that declarative agents -- AI agents equipped with natural-language skill files appended to the system prompt -- are an effective orchestration paradigm. Concretely, we compare (i) a DeclarativeAgent that reads three domain-specific skill files at inference time and decides its own control flow, (ii) an ImperativeAgent based on a programmatic state machine with explicit phases, and (iii) an unscaffolded baseline agent modeled after the τ-Knowledge benchmark agent. Our ImperativeAgent is motivated by externalised-control inference as in Recursive Language Models and graph-based orchestration frameworks. We formalise the three agents as policy classes within a decentralised partially-observable Markov decision process and analyse their information-theoretic and structural properties; we then test the predicted differences empirically on five language models and two retrieval regimes. Our results show that retrieval quality is a dominant bottleneck for AI agents: when evidence is incomplete or skewed, all agents degrade substantially, and skill files cannot recover lost performance. Under high-quality retrieval, however, declarative skills consistently improve accuracy on procedural tasks and reduce orchestration errors, while the imperative state machine is brittle and does not reliably improve task success or compliance.
This paper compares three orchestration paradigms for tool-using AI agents in customer-service workflows: (1) an unscaffolded baseline LLM agent, (2) a DeclarativeAgent that appends natural-language "skill files" (markdown documents describing procedures, conversational structure, and knowledge-discovery strategies) to the system prompt, and (3) an ImperativeAgent built on a finite-state machine with deterministic phase transitions, verification gates, and retry policies. The comparison is grounded in the τ-Knowledge benchmark, a challenging 97-task customer-service evaluation requiring retrieval over unstructured knowledge bases, tool discovery, and multi-step state-changing operations.
The key findings are: (a) declarative skill files consistently improve accuracy under high-quality ("golden") retrieval, especially for weaker models; (b) the imperative state machine reliably *hurts* performance due to policy-class shrinking and phase misclassification brittleness; and (c) retrieval quality dominates all orchestration choices—under noisy embedding retrieval, benefits from skill files largely vanish.
Formal framing. The authors formalize the three agents as policy classes within a Dec-POMDP, which provides a clean theoretical lens. Propositions 1–4 offer information-theoretic predictions (skill files reduce action entropy, imperative restriction shrinks the policy class, retrieval noise degrades the observation channel). However, these propositions are largely intuitive restatements rather than deep theoretical results: Proposition 1 is essentially "more relevant information reduces uncertainty," and Proposition 2 states that restricting the action space restricts the policy class. The data-processing inequality argument in Proposition 4 is sensible but is not proven rigorously.
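For reference, the two textbook inequalities that Propositions 1 and 4 appear to lean on can be written compactly. The symbols are illustrative assumptions, not the paper's notation: A is the agent's next action, O its observation history, S the skill-file content, T the latent task state, D the evidence the task needs, and D̂ the evidence the retriever actually returns.

```latex
% Conditioning never increases entropy (the intuition behind Proposition 1):
H(A \mid O, S) \;\le\; H(A \mid O)

% Data-processing inequality, assuming the Markov chain T \to D \to \hat{D}
% (the intuition behind Proposition 4):
I(T; \hat{D}) \;\le\; I(T; D)
```

Both hold in expectation over the conditioning variables, not for every realisation, so neither by itself guarantees that a particular skill file helps on a particular task. That gap is one reason the propositions read as intuitions rather than as sharp predictions.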
Experimental design. The study covers 5 models × 3 agents × 2 retrieval regimes = 30 conditions on 97 tasks. This breadth is commendable, though each condition appears to use only a single trial (pass₁), which limits statistical confidence. No confidence intervals, significance tests, or variance estimates are reported. Given the stochastic nature of LLM inference, the observed differences (e.g., +0.008 for DeclarativeAgent on DeepSeek-Flash golden) may not be statistically significant. The exclusion of infrastructure errors from averages is reasonable but warrants transparency about how many tasks were affected.
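To make the concern about single-trial noise concrete, the sketch below computes a Wilson score interval for a pass rate measured once on 97 tasks. The count of 47 passes is a hypothetical value chosen to land near the best pass rate quoted in this review; it is not taken from the paper's tables.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 gives about 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p_hat = successes / n
    denom = 1 + z**2 / n
    centre = (p_hat + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2)) / denom
    return centre - half, centre + half


# Hypothetical counts: 47 passes out of 97 tasks, one trial per task.
low, high = wilson_interval(47, 97)
print(f"pass rate {47 / 97:.3f}, 95% interval [{low:.3f}, {high:.3f}]")
```

Under these assumed counts the interval spans roughly ten percentage points on either side of the point estimate, which is far wider than differences of the order of +0.008.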
Ablation quality. The compliance ablation (Table 6) is the most surprising and valuable result: the ImperativeAgent's verification gate fails in practice because phase misclassification routes actions past the deterministic gate. The unauthorized-write rate is *not* lower for the imperative agent, and its over-retry rate is 4–7× higher. This is a genuinely useful empirical finding.
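The failure mode described here can be illustrated with a toy state machine. This is a minimal sketch under stated assumptions, not the paper's ImperativeAgent: the class `ToyImperativeAgent`, the keyword-based `classify_phase`, and the `write:` action prefix are all hypothetical. The point is only that a gate attached to one phase offers no protection when the phase classifier assigns the turn to another phase.

```python
from dataclasses import dataclass, field
from enum import Enum, auto


class Phase(Enum):
    DISCOVER = auto()
    EXECUTE = auto()


@dataclass
class ToyImperativeAgent:
    verified: bool = False  # whether the user has passed identity verification
    log: list = field(default_factory=list)

    def classify_phase(self, user_turn: str) -> Phase:
        # Hypothetical, deliberately brittle classifier: one trigger phrase.
        if "please transfer" in user_turn.lower():
            return Phase.EXECUTE
        return Phase.DISCOVER

    def step(self, user_turn: str, action: str) -> bool:
        """Return True if the action is executed, False if the gate blocks it."""
        phase = self.classify_phase(user_turn)
        # The deterministic gate is only wired into the EXECUTE phase.
        if phase is Phase.EXECUTE and action.startswith("write:") and not self.verified:
            self.log.append(("blocked", action))
            return False
        self.log.append(("executed", action))
        return True


agent = ToyImperativeAgent()
print(agent.step("Please transfer $50 to savings", "write:transfer"))  # gate applies
print(agent.step("Move $50 to savings", "write:transfer"))  # misclassified turn
```

In this toy, the second call executes an unverified write because the turn was labelled DISCOVER, so the gate was never consulted. Moving the check to the tool layer, so that it runs on every write regardless of phase, would remove the dependence on the classifier.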
The paper addresses a practical question facing every team deploying LLM agents: should procedural knowledge be encoded as code (state machines, graphs) or as natural-language instructions in the prompt? The finding that declarative skill files outperform deterministic orchestration is directly actionable for practitioners. The result that imperative state machines can be *counterproductive* challenges a common engineering assumption and could influence framework design decisions in LangGraph, AutoGen, and similar ecosystems.
The retrieval-as-bottleneck finding, while not novel (τ-Knowledge itself made this point), is reinforced here across orchestration paradigms and adds weight to the argument that retrieval improvements should be prioritized over agent scaffolding.
However, the impact is somewhat bounded: the absolute performance levels remain low (best pass₁ ≈ 48.4%), the benchmark is specific to banking customer service, and the declarative gains are modest (typically +2–5 percentage points under golden retrieval). The skill files themselves are hand-crafted for this specific domain, raising questions about generalizability.
The paper is highly timely. The debate between declarative (Anthropic's SKILL.md, LlamaIndex's file-centric agents) and imperative (LangGraph, RLMs) orchestration paradigms is actively occurring in the agent-building community. Providing empirical evidence on this question addresses a genuine gap. The use of the recent τ-Knowledge benchmark (2026) demonstrates engagement with cutting-edge evaluation infrastructure.
1. Well-motivated comparison: The three-way comparison (baseline, declarative, imperative) isolates the orchestration variable cleanly. The agents share the same model, tools, and user simulator.
2. Practical relevance: The finding that skill files help weaker models more (scaling with the procedural-competence gap) provides guidance for cost-effective deployment: cheaper models combined with skill files may substitute for more expensive models.
3. Negative result on imperative orchestration: The demonstration that deterministic gates fail due to phase misclassification is a valuable cautionary finding. The over-retry rate data is particularly convincing.
4. Multi-model breadth: Testing across 5 models spanning different capability tiers strengthens generalizability claims.
5. Cost analysis: Including per-task cost data alongside accuracy enables practical cost-benefit reasoning.
1. No statistical testing: With single-trial pass₁ on 93–97 tasks, many reported differences (especially +0.008 or +0.021) are likely within noise. The paper would benefit greatly from multiple trials and confidence intervals.
2. Skill file design is ad hoc: The three markdown files are hand-crafted for τ-Banking. There is no analysis of what makes a skill file effective, no ablation over individual skill files, and no discussion of how to write good skill files for new domains.
3. Limited retrieval regimes: Only golden and one embedding model (all-MiniLM-L6-v2, a relatively weak retriever) are tested. The gap between these extremes is large; intermediate-quality retrieval (e.g., text-embedding-3-large as used in the original benchmark) would strengthen the analysis.
4. Theoretical contributions are thin: The propositions are informal and largely restate intuitions. The Dec-POMDP formalization, while clean, doesn't yield novel predictions beyond what common sense suggests.
5. Missing baselines from τ-Knowledge: The paper's models (DeepSeek, Qwen, Gemini-Flash-Lite) are mostly cheaper/weaker than the frontier models in Table 1. Direct comparison with GPT-5.2 or Claude-4.5-Opus under the same orchestration paradigms would be informative.
6. No code released yet: "Code will be provided upon publication" limits reproducibility assessment.
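Because all agents are run on the same tasks, the statistical testing requested in weakness 1 could use a paired test rather than comparing two independent pass rates. The sketch below is one possible approach, an exact McNemar test on per-task pass/fail outcomes. The function names and the example outcome lists are hypothetical and are not drawn from the paper's data.

```python
from math import comb


def discordant_counts(a_pass: list[bool], b_pass: list[bool]) -> tuple[int, int]:
    """Count tasks that only agent A passed and tasks that only agent B passed."""
    if len(a_pass) != len(b_pass):
        raise ValueError("both agents must be scored on the same tasks")
    only_a = sum(a and not b for a, b in zip(a_pass, b_pass))
    only_b = sum(b and not a for a, b in zip(a_pass, b_pass))
    return only_a, only_b


def mcnemar_exact(only_a: int, only_b: int) -> float:
    """Two-sided exact McNemar p-value based on the discordant pairs."""
    n = only_a + only_b
    if n == 0:
        return 1.0  # the agents never disagree, so there is no evidence of a difference
    k = min(only_a, only_b)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2**n
    return min(1.0, 2 * tail)


# Hypothetical outcomes on 20 tasks: 8 passed only by A, 5 only by B, 7 by both.
a_outcomes = [True] * 8 + [False] * 5 + [True] * 7
b_outcomes = [False] * 8 + [True] * 5 + [True] * 7
only_a, only_b = discordant_counts(a_outcomes, b_outcomes)
print(only_a, only_b, round(mcnemar_exact(only_a, only_b), 3))
```

Only the tasks on which the two agents disagree carry information in this test, so reporting the discordant counts alongside aggregate pass₁ would let readers judge small differences directly.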
This is a solid applied contribution that provides useful empirical evidence on a timely question in agent orchestration. The main value lies in the practical finding that declarative skill files are a low-cost, effective intervention under good retrieval, and the surprising negative result on imperative state machines. The theoretical framing is adequate but not deep, and the experimental methodology would benefit from statistical rigor. The work is most impactful as an engineering guide rather than a fundamental scientific advance.
Generated Jun 8, 2026
Paper 1 presents a rigorous empirical and theoretical comparison of agent orchestration paradigms with formal POMDP analysis, experiments across multiple models and retrieval regimes, and actionable findings about declarative vs. imperative agent design. It offers concrete, reproducible methodology and practical insights for building tool-using AI agents. Paper 2, while addressing an important topic (responsible non-compliance), is a position/sketch paper that outlines issues without providing concrete methods, experiments, or evaluations, limiting its immediate scientific impact despite its conceptual relevance.
Paper 1 presents a systematic literature review establishing a unified definition, taxonomy, and levels framework for Self-Explainability in complex systems—a foundational contribution with broad cross-disciplinary impact spanning AI, self-adaptive systems, and trustworthy computing. It identifies major research gaps and provides a roadmap for future work. Paper 2, while methodologically rigorous with its POMDP formalization and empirical evaluation of agent orchestration paradigms, addresses a narrower problem (tool-use in customer service workflows) with more incremental findings. Paper 1's broader scope and framework-setting nature give it higher potential for lasting scientific impact.
Paper 1 offers higher scientific impact due to its broader generalizability. It formalizes agent orchestration using a POMDP framework and tackles the fundamental architectural debate of declarative versus imperative design. While Paper 2 provides rigorous empirical evidence on the importance of proprietary data, its focus is highly domain-specific (drug-asset valuation) and confirms a relatively intuitive premise. Paper 1's insights into declarative scaffolding and retrieval bottlenecks will influence foundational AI agent development across numerous fields and applications.
Paper 1 presents a completed empirical study with formal mathematical modeling, testing across multiple language models, and concrete results. In contrast, Paper 2 is presented as a proposal or work-in-progress ('will be developed', 'anticipated results') lacking empirical validation. Consequently, Paper 1 offers greater methodological rigor and immediate scientific utility.
Paper 2 addresses the fundamental and timely question of AI alignment, introducing the novel concept of 'emergent alignment' as a counterpart to emergent misalignment, and proposes 'projectability' as a new desideratum for alignment strategies. It directly engages with the persona selection hypothesis using rigorous experimental methodology across multiple ethical frameworks. Its findings have broad implications for AI safety research and alignment practices. Paper 1, while methodologically sound, addresses a narrower engineering problem (agent orchestration for customer service) with more limited generalizability and theoretical contribution to the broader field.
Paper 2 has higher potential impact: it tackles timely, high-demand problems in tool-using LLM agents for real customer-service workflows, with broad applicability across AI, HCI, and software systems. It contributes both conceptual framing (declarative vs imperative orchestration as policy classes in a Dec-POMDP) and empirical evidence across multiple models and retrieval regimes, highlighting retrieval as the dominant bottleneck—an actionable insight for system design. Paper 1 is a solid incremental method in MCDM/TOPSIS with narrower cross-field reach and more limited validation (toy examples).
Paper 1 provides a highly innovative, GPU-accelerated approach to SAT solving, a fundamental and universally applicable problem in computer science. By successfully mapping pseudo-Boolean SAT to continuous local search via Fourier transforms and utilizing JAX for massive parallelism, it offers significant methodological rigor and performance breakthroughs. While Paper 2 is timely and relevant to the booming field of LLM agents, its prompt-based declarative skills approach is more transient and lacks the foundational mathematical and algorithmic depth that gives Paper 1 a higher potential for long-term, cross-disciplinary impact.
Paper 1 introduces a novel methodology (synthetic contrastive reasoning traces with CPO) that addresses a clear gap in multi-table QA—the lack of reasoning supervision. It demonstrates strong empirical improvements (9.7-16.3% absolute gains) across multiple LLMs with rigorous ablations. The approach of generating heterogeneous positive/negative reasoning traces for preference optimization is broadly applicable beyond multi-table QA. Paper 2, while insightful about declarative vs. imperative agent orchestration, primarily offers an empirical comparison finding that retrieval quality dominates agent design, which is a somewhat expected conclusion with narrower methodological contribution.
Paper 1 is likely to have higher impact due to stronger timeliness and broader applicability: orchestration of tool-using LLM agents over knowledge bases is a central, rapidly evolving problem with immediate deployment relevance. It combines theoretical formalization (policy classes in a Dec-POMDP with information-theoretic/structural analysis) with multi-model, multi-retrieval empirical evaluation, yielding actionable insights (retrieval as dominant bottleneck; when declarative skills help). Paper 2 is valuable but narrower, focused on ASP compliance IR with a domain instantiation, likely impacting a smaller community.
Paper 2 addresses a critical and highly timely challenge in edge AI: privacy-preserving personalization for local agents. Its novel architecture, which decouples local statistical preference learning from remote semantic intent parsing, offers a scalable and lightweight solution. While Paper 1 provides rigorous analysis of agent orchestration, its findings regarding retrieval bottlenecks and prompt-based control are less innovative. Paper 2's approach has broader potential real-world applications in consumer AI, granting it higher potential scientific impact.