DecisionBench: A Benchmark for Emergent Delegation in Long-Horizon Agentic Workflows
Yuxuan Gao, Megan Wang, Yi Ling Yu, Zijian Carl Ma, Ao Qu
Abstract
We introduce DecisionBench, a benchmark substrate for emergent delegation in long-horizon agentic workflows. The substrate fixes a task suite (GAIA, tau-bench, BFCL multi-turn), a peer-model pool (11 models, 7 vendor families), a delegation interface (call_model plus an optional read_profile channel), a deterministic skill-annotation layer, and a multi-axis metric suite covering quality, cost, latency, delegation rate, routing fidelity-at-k, vendor self-preference, and a counterfactual-delegation ceiling. The substrate is agnostic to how peer information is generated or delivered, so learned routers, richer peer memories, adaptive profile construction, and multi-step delegation can all be evaluated against it. We characterize the substrate with a five-condition reference sweep on the full pool (n=23,375 task instances). Three benchmark-level findings emerge: (i) mean end-task quality is statistically indistinguishable across the four awareness conditions (|beta| <= 0.010, p >= 0.21), so quality-only evaluation would miss the orchestration signal; (ii) routing fidelity-at-1 ranges from 7.5% to 29.5% across conditions at near-equal mean quality, with delivery channel (on-demand tool vs. preloaded description) dominating description content; (iii) a counterfactual ceiling places perfect delegation 15-31 percentage points above measured performance on every suite, locating large unrealized headroom for future orchestration methods. We release the substrate, annotation layer, reference intervention suite, analysis pipeline, and 220 per-condition run archives.
AI Impact Assessments
Scientific Impact Assessment: DecisionBench
1. Core Contribution
DecisionBench introduces a benchmark *substrate* — not just a task set — for evaluating how LLM agents delegate subtasks to peer models during long-horizon workflows. The core novelty lies in shifting evaluation from pure end-task accuracy to process-level orchestration metrics: routing fidelity (did the agent pick the right peer?), vendor self-preference (does the agent favor same-vendor models?), and a counterfactual ceiling (how much headroom exists for perfect delegation?). The benchmark fixes five components: a task suite drawn from existing benchmarks (GAIA, τ-bench, BFCL), an 11-model peer pool across 7 vendors, a delegation interface (`call_model` + `read_profile`), a deterministic skill-annotation layer, and a multi-axis metric suite.
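To make the routing-fidelity metric concrete, the following is a minimal sketch of one plausible reading of fidelity-at-k: the fraction of delegation calls in which the chosen peer is among the top-k peers for the skill the subtask requires. The paper's exact definition is not reproduced in this assessment, so the function name, the input layout, the skill labels, and the model names below are all illustrative assumptions rather than the benchmark's API.

```python
from typing import Dict, List, Tuple


def routing_fidelity_at_k(
    delegations: List[Tuple[str, str]],
    peer_ranking: Dict[str, List[str]],
    k: int = 1,
) -> float:
    """Share of delegations whose chosen peer is in the top-k for the skill.

    delegations:  (required_skill, chosen_peer) pairs, one per delegation call.
    peer_ranking: for each skill, peers ordered from best to worst
                  (an assumed input; how the ranking is built is not shown).
    """
    if not delegations:
        return 0.0
    hits = 0
    for skill, chosen_peer in delegations:
        top_k = peer_ranking.get(skill, [])[:k]
        if chosen_peer in top_k:
            hits += 1
    return hits / len(delegations)


# Hypothetical skills, peers, and delegation log, for illustration only.
ranking = {
    "web_search": ["model_a", "model_b", "model_c"],
    "math": ["model_c", "model_a", "model_b"],
}
calls = [
    ("web_search", "model_a"),
    ("web_search", "model_b"),
    ("math", "model_a"),
    ("math", "model_c"),
]
fidelity_at_1 = routing_fidelity_at_k(calls, ranking, k=1)
fidelity_at_2 = routing_fidelity_at_k(calls, ranking, k=2)
```

Under this reading, two agents can reach the same end-task quality while differing sharply in fidelity-at-1, which is the kind of gap the process-level metrics are meant to expose.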
The key conceptual insight is that emergent delegation, where the orchestrator itself decides whether, when, and to whom to delegate, is a distinct capability that existing benchmarks cannot measure. This fills a genuine gap left by three lines of work: single-agent benchmarks (which ignore delegation), multi-agent systems (which use hard-coded role assignment), and cost-aware routing work (which uses learned external policies).
2. Methodological Rigor
Strengths in design. The benchmark is well-structured with clear separation between the fixed substrate and the reference interventions used to characterize it. The five-condition sweep (blind + 3 profile-card variants + tool-only ablation) is thoughtfully designed, particularly the `aware-tool-only` condition that isolates delivery channel from description content — a clean one-variable ablation. The deterministic skill tagger avoids LLM judgment in the evaluation loop, and the emergent-taxonomy audit (94.5% coverage) validates the 7-skill vocabulary.
Statistical approach. The mixed-effects model (`q ~ cond + (1|agent×benchmark)`) on 23,375 task instances is appropriate, and the authors are transparent about limitations (random slopes did not converge). Paired-bootstrap CIs matched on task ID are standard and correctly implemented. The sensitivity analysis on the counterfactual ceiling (Table 11) is a responsible addition.
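As an illustration of what a paired bootstrap matched on task ID involves, the sketch below resamples per-task score differences between two conditions and reports a percentile interval. It is a generic textbook version written with the standard library only, not the authors' pipeline; the function name, the number of resamples, the percentile method, and the toy scores are assumptions.

```python
import random
from typing import Dict, Tuple


def paired_bootstrap_ci(
    scores_a: Dict[str, float],
    scores_b: Dict[str, float],
    n_resamples: int = 2000,
    alpha: float = 0.05,
    seed: int = 0,
) -> Tuple[float, float, float]:
    """Return (mean difference b - a, CI lower bound, CI upper bound).

    Only task IDs present in both conditions are used, so every resampled
    unit is a matched pair rather than two independent draws.
    """
    task_ids = sorted(set(scores_a) & set(scores_b))
    if not task_ids:
        raise ValueError("the two conditions share no task IDs")
    diffs = [scores_b[t] - scores_a[t] for t in task_ids]
    n = len(diffs)
    rng = random.Random(seed)
    means = []
    for _ in range(n_resamples):
        sample = rng.choices(diffs, k=n)  # resample pairs with replacement
        means.append(sum(sample) / n)
    means.sort()
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return sum(diffs) / n, lower, upper


# Toy per-task quality scores for two hypothetical conditions.
blind = {"t1": 0.0, "t2": 1.0, "t3": 0.0, "t4": 1.0, "t5": 1.0}
aware = {"t1": 1.0, "t2": 1.0, "t3": 0.0, "t4": 0.0, "t5": 1.0}
mean_diff, ci_low, ci_high = paired_bootstrap_ci(blind, aware)
```

Matching on task ID removes between-task variance from the comparison, which matters when the condition effect is small relative to how much task difficulty varies.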
Weaknesses. Single-seed Stage-2 runs limit reliability claims. The τ-bench delegation rate is near-zero across all conditions, meaning one of three benchmarks contributes almost no signal to the delegation analysis. The C1 cards rely on a single curator with no inter-rater validation. The mixed-effects model's random-intercept-only specification is acknowledged as conservative but limits the ability to detect condition effects that vary across agent-benchmark combinations — which the per-agent heterogeneity analysis (§6.2) strongly suggests exist. The n=11 agent pool makes cross-agent statistics fragile (the authors acknowledge this regarding Spearman correlations).
3. Potential Impact
Benchmark infrastructure. DecisionBench addresses a real infrastructure gap. As agentic systems move toward multi-model orchestration, there is no established way to evaluate delegation quality. The released artifacts (220 run archives, 33 profile cards, analysis pipeline, ~23K task traces) lower the barrier to entry for future work.
Key findings with practical implications. Three results stand out:
Cross-vendor self-preference (1.5-3.7× chance) is a novel observation that extends the LLM-as-judge bias literature to the orchestration setting — practically relevant for multi-vendor deployments.
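A figure such as "1.5-3.7× chance" can be read as the observed share of same-vendor delegations divided by the share expected if peers were picked uniformly at random. The sketch below computes that ratio under loudly stated assumptions: the chance baseline is uniform over all peers other than the orchestrator, and the model and vendor names are hypothetical. The paper's actual baseline may differ.

```python
from typing import Dict, List


def self_preference_ratio(
    orchestrator: str,
    chosen_peers: List[str],
    vendor_of: Dict[str, str],
) -> float:
    """Observed same-vendor delegation share divided by the chance share.

    Assumption: chance is uniform selection over every pool member except
    the orchestrator itself.
    """
    own_vendor = vendor_of[orchestrator]
    candidates = [m for m in vendor_of if m != orchestrator]
    same_vendor = [m for m in candidates if vendor_of[m] == own_vendor]
    if not chosen_peers:
        raise ValueError("no delegation calls recorded")
    if not same_vendor:
        raise ValueError("orchestrator has no same-vendor peer in the pool")
    chance = len(same_vendor) / len(candidates)
    observed = sum(
        1 for peer in chosen_peers if vendor_of[peer] == own_vendor
    ) / len(chosen_peers)
    return observed / chance


# Hypothetical five-model pool across four vendors.
pool = {"a1": "A", "a2": "A", "b1": "B", "c1": "C", "d1": "D"}
picks = ["a2", "a2", "b1", "c1"]  # peers chosen by orchestrator "a1"
ratio = self_preference_ratio("a1", picks, pool)
```

A ratio of 1.0 would indicate no vendor bias under this baseline; values above 1.0 indicate that the orchestrator picks same-vendor peers more often than uniform selection would.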
Limitations on impact. The benchmark reuses existing task suites rather than introducing novel tasks, so contamination is a concern. The paper does not demonstrate that any method closes the fidelity-quality gap, so the practical payoff of better delegation remains theoretical. The finding that quality is flat across conditions could be read pessimistically — delegation may simply not matter much at current capability levels.
4. Timeliness & Relevance
This is highly timely. Multi-model orchestration is becoming standard practice (MCP, function-calling agents, router-based systems), yet evaluation infrastructure has not kept pace. The paper arrives at a moment when the community is actively building multi-agent and cost-aware systems but lacks principled ways to evaluate the delegation dimension. The 2026 model pool (GPT-5.5, Claude-Opus-4.7, Gemini-3.1-Pro) reflects cutting-edge capabilities.
5. Strengths & Limitations
Key strengths:
Key limitations:
Overall Assessment
DecisionBench makes a solid infrastructure contribution to an important emerging problem. Its primary value is as a measurement instrument — it reveals that orchestration signals exist in process metrics that quality-only evaluation misses. The findings are interesting but somewhat negative (quality doesn't improve), which may limit immediate adoption. The benchmark's complexity and the lack of a demonstrated positive use case (a method that actually benefits from the infrastructure) are the main limitations. Still, the paper is well-executed, transparent, and addresses a genuine gap.
Generated May 20, 2026
Comparison History (20)
Paper 1 introduces a comprehensive benchmark with extensive empirical evaluation (23,375 task instances), quantitative findings on delegation behavior, and releases reusable artifacts enabling reproducible research on multi-agent orchestration. Its findings about routing fidelity and counterfactual delegation ceilings provide concrete, actionable insights for the community. Paper 2 presents an interesting architectural concept (event-sourced agent runtime) but is primarily a design paper that explicitly states it does not demonstrate its claimed benefits, limiting its immediate empirical impact. Benchmarks tend to have outsized community impact by enabling standardized comparison.
While Paper 1 provides a valuable infrastructure benchmark for agentic workflows, Paper 2 tackles a fundamental scientific question regarding how multimodal models reason about human attributes. By exposing the 'Prejudice Gap' and shifting evaluation from superficial score prediction to grounded behavioral reasoning, it addresses critical issues in AI safety, fairness, and interpretability, promising broader interdisciplinary impact across AI, psychology, and HCI.
Paper 1 addresses a fundamental gap in MLLM evaluation—whether models truly understand personality or rely on superficial cues—introducing a novel task (GPR), a comprehensive dataset (MM-OCEAN), and revealing the striking 'Prejudice Gap' across 27 models. This exposes a deep issue (right answer, wrong reasoning) relevant to AI safety, trustworthiness, and social cognition broadly. Paper 2 contributes a useful benchmark for agentic delegation but addresses a narrower, more engineering-focused problem with less surprising findings. Paper 1's conceptual contribution and implications for responsible AI deployment give it broader and more lasting impact.
Paper 2 introduces a comprehensive benchmark for multi-agent delegation, an emerging and critical area in LLM research. Benchmarks typically drive significant scientific progress by establishing standard evaluation metrics, highlighting performance gaps, and enabling comparisons across future methods. While Paper 1 offers a highly practical and novel approach to GUI automation efficiency, Paper 2's potential to become a foundational testing substrate for orchestration and long-horizon agentic workflows gives it a broader and longer-lasting potential scientific impact.
DecisionBench addresses a timely and broadly impactful problem—emergent delegation in multi-agent LLM workflows—with a rigorous benchmark methodology, large-scale experiments (23K+ instances), and clear findings that expose significant unrealized headroom for future orchestration methods. It provides a reusable evaluation substrate for the rapidly growing agentic AI community. Paper 2 (LBW-Guard), while showing practical training stability improvements, addresses a narrower systems-level concern with limited model/dataset scope (WikiText-103, LoRA fine-tuning) and incremental contribution over existing training stability techniques. DecisionBench's broader community utility and timeliness give it higher impact potential.
Paper 1 makes a fundamental theoretical contribution to reinforcement learning by establishing minimax optimal variance-aware regret bounds for MNL mixture MDPs, with both upper and lower bounds that fully characterize the regret complexity. This resolves an open theoretical question with rigorous proofs and has broad implications across RL theory. Paper 2 introduces a useful benchmark for LLM delegation workflows, but benchmarks tend to have more transient impact, and its key finding—that quality is indistinguishable across conditions—limits its immediate practical influence. Paper 1's theoretical results are more enduring and foundational.
Paper 1 reveals a highly counterintuitive phenomenon in embodied AI—that higher observation fidelity and perfect information can actually degrade LLM problem-solving performance. This challenges prevailing assumptions and evaluation methodologies in the field, likely sparking significant follow-up research into LLM reasoning failures and perceptual interactions. While Paper 2 provides a valuable and rigorous benchmark for agentic workflows, Paper 1 offers a more fundamental conceptual shift with broad implications for robotics and AI safety.
Paper 2 likely has higher scientific impact due to a more novel, generalizable methodological contribution: an automated benchmark-generation pipeline coupled with a formal verification guarantee (cycle-consistency ensuring uniqueness), reducing annotation cost while controlling hallucinations. This can scale to many task families and offers a principled way to create robust reasoning benchmarks, with relevance beyond LLMs (program synthesis, formal methods, cognitive evaluation). Paper 1 is valuable infrastructure for delegation/orchestration evaluation, but is narrower in scope and more incremental as a benchmark substrate rather than a new formalizable paradigm.
AutoRubric-T2I addresses a widely impactful problem—aligning text-to-image generation with human preferences—with a novel, practical framework that dramatically reduces data requirements (0.01% of annotated data) while improving interpretability and performance. It demonstrates strong results on multiple benchmarks and downstream RL tasks. DecisionBench introduces a useful evaluation substrate for multi-agent delegation but reports largely negative findings (quality indistinguishable across conditions) and is more niche in scope. Paper 2's combination of methodological novelty, practical efficiency gains, and broad applicability to the rapidly growing T2I field gives it higher potential impact.
Paper 1 has higher likely impact: it introduces a reusable, large-scale benchmark substrate for emergent delegation/orchestration in long-horizon agent workflows with standardized interfaces, metrics, deterministic annotations, and extensive reference sweeps plus released artifacts—directly enabling rigorous, comparable progress across many LLM-agent methods and vendors. Its applications (agent routing, tool/model selection, cost/latency-quality tradeoffs) are immediate and broadly relevant to ML systems and deployment. Paper 2 is timely and methodologically solid observational science with cross-disciplinary interest, but its contributions are more domain-specific and less likely to catalyze widespread methodological advances.
Paper 1 introduces a comprehensive benchmark for an emerging and critical area (agentic delegation), and benchmarks of this kind often shape the trajectory of future research. By exposing 15-31 percentage points of unrealized performance headroom, it sets a clear target for the community, likely driving high citation volume and broader structural impact compared to the specific algorithmic optimization proposed in Paper 2.
Paper 2 has higher likely impact because it delivers a concrete, reproducible benchmark substrate with released data, metrics, and large-scale empirical characterization (n=23,375), enabling immediate methodological comparisons and progress on delegation/orchestration—an increasingly timely real-world need for agentic systems. Its multi-axis metrics and counterfactual ceiling provide rigorous evaluation signals beyond end-task quality, with applicability across LLM routing, systems, and HCI. Paper 1 is a compelling, novel framing and taxonomy, but as a position paper it is less methodologically grounded and offers fewer directly actionable artifacts for the community.
Paper 1 advances fundamental understanding of Multimodal Large Language Models through mechanistic interpretability, pinpointing the exact internal dynamics causing modality-conflict hallucinations. Its proposed causal intervention directly addresses a critical AI safety and reliability issue without requiring retraining. While Paper 2 offers a valuable benchmark for multi-agent systems, Paper 1 provides deeper methodological innovation and broader immediate impact by uncovering the architectural root causes of hallucinations and offering a scientifically grounded, inference-time solution.
Paper 2 likely has higher scientific impact because it introduces a broadly reusable benchmark substrate (tasks, peer-model pool, interface, annotations, metrics, and large-scale runs) that can standardize evaluation of delegation/orchestration across many future methods and vendors—high breadth, timeliness, and real-world relevance. Its methodological rigor is supported by multi-axis metrics, controlled conditions, and large n. Paper 1 is a solid algorithmic contribution with measurable gains, but its impact is narrower (on-policy RL for reasoning/optimization) and may be superseded by alternative RL objectives, while DecisionBench can become shared infrastructure for the field.
Paper 2 likely has higher impact: it introduces a broadly useful, standardized benchmark and evaluation substrate for emergent delegation/orchestration across many models and vendors, with large-scale empirical characterization and released artifacts. This directly enables reproducible, comparable progress in a timely area (agentic workflows, routing, cost/latency tradeoffs) and can influence multiple fields (LLM systems, HCI, benchmarking, applied ML). Paper 1 is novel for MARL communication using LLMs but is narrower in scope/application and less likely to become a shared community infrastructure.
GRAM introduces a fundamentally new framework for neural reasoning by combining recursive latent-state refinement with probabilistic multi-trajectory computation, addressing core limitations of deterministic recursive reasoning models. This has broad theoretical implications for how neural systems perform extended computation, with applications spanning structured reasoning, constraint satisfaction, and generation. Paper 2, while a solid engineering contribution providing a useful benchmark for agentic delegation, is more narrowly scoped to evaluating LLM orchestration workflows. GRAM's methodological novelty (variational inference for recursive reasoning, inference-time scaling via depth and sampling) is likely to inspire more follow-on research across multiple subfields.
Paper 1 presents a highly novel algorithmic approach combining Bayesian optimization and LLM-driven dynamic feature elicitation for prompt tuning. This addresses a critical, ubiquitous problem in AI deployment—optimizing system prompts with only aggregate black-box feedback. Its innovative methodology offers broad implications for optimizing discrete natural language artifacts, likely yielding higher immediate and cross-disciplinary scientific impact than the benchmark proposed in Paper 2, despite Paper 2's rigorous evaluation of agentic delegation.
Paper 2 addresses a critical and timely AI safety concern—jailbreaking Large Reasoning Models—with a novel mechanistic insight linking attention patterns to attack success, plus a concrete RL-based method. This has broader immediate impact across AI safety, alignment, and model deployment. The attention-guided reward mechanism is innovative and actionable for both offensive and defensive research. Paper 1, while thorough as a benchmarking substrate for agentic delegation, addresses a narrower community and its key finding—that quality is indistinguishable across conditions—limits its immediate transformative impact.
Paper 1 introduces a novel theoretical framework bridging behavioral economics (prospect theory) with strategic classification in ML, addressing a fundamental limitation in existing models. This interdisciplinary contribution has broad implications for deploying ML systems in real-world settings where human behavior deviates from rationality. Paper 2, while useful as a benchmark for LLM-agent delegation, is more incremental—benchmarks have shorter lifespans and narrower conceptual contributions. Paper 1's formalization of behaviorally realistic strategic agents opens a new research direction with stronger long-term theoretical and practical impact.
Paper 1 has higher impact potential due to a broader, more reusable benchmark substrate for emergent delegation/orchestration across long-horizon agentic workflows, spanning multiple task suites and many models/vendors with rigorous multi-axis metrics and large-scale evaluation. It introduces actionable diagnostics (routing fidelity, vendor self-preference, counterfactual ceiling) that directly enable progress and comparisons across methods and labs. Paper 2 targets an important AV planning niche and offers useful negative results plus a benchmark, but its scope is narrower and evidence of measurable gains is limited.