Uno-Orchestra: Parsimonious Agent Routing via Selective Delegation

Zhiqing Cui, Haotong Xie, Jiahao Yuan, Cheng Yang, Hanqing Wang, Yuxin Wu, Yifan Wu, Siru Zhong

May 6, 2026 · arXiv:2605.05007v1

cs.AI

#767 of 3489 · Artificial Intelligence

Tournament Score: 1461 ± 36

Win Rate: 73%

Rating: 7.2 / 10

Significance 7.5 · Rigor 7 · Novelty 6.8 · Clarity 6.5

Abstract

Large language model (LLM) multi-agent systems typically rely on rigid orchestration, committing either to flat per-query routing or to hand-engineered task decomposition, so decomposition depth, worker choice, and inference budget are not jointly optimized under one objective. We introduce Uno-Orchestra, a unified orchestration policy that selectively decomposes a task and dispatches each subtask to an admissible (model, primitive) pair, with both decisions learned together from curated RL trajectories grounded in real worker interactions. Against 22 baselines on a 13-benchmark suite spanning math, code, knowledge, long-context, and agentic tool-use, Uno-Orchestra reaches 77.0% macro pass@1, roughly 16% above the strongest workflow baseline, at roughly an order of magnitude lower per-query cost, advancing the accuracy-efficiency frontier of selective delegation.

AI Impact Assessments

(1 model)

Scientific Impact Assessment: Uno-Orchestra

1. Core Contribution

Uno-Orchestra addresses a genuine gap in LLM multi-agent orchestration: the disconnect between task decomposition and model routing. Prior work either routes queries to single experts (flat routing) or decomposes tasks via hand-engineered planners that don't jointly optimize which worker handles each subtask. The key novelty is collapsing both decisions—whether and how to decompose and which (model, primitive) pair to assign—into a single causal language model's forward pass. The factorization in Equation 4 is elegant: decomposition and routing share the same backbone with no auxiliary heads, separated only by causal masking.
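Equation 4 itself is not reproduced in this assessment. As a rough illustration only, a single causal language model that emits decomposition tokens followed by routing tokens induces a chain-rule factorization of the following shape. The symbols x (query), d (decomposition token sequence), and r (routing token sequence) are assumptions made here, and the paper's actual notation and token ordering may differ.

```latex
% Illustrative chain-rule factorization; NOT the paper's Equation 4.
% x: query, d: decomposition tokens, r: routing tokens (all assumed names).
\pi_\theta(d, r \mid x)
  = \underbrace{\prod_{t=1}^{|d|} \pi_\theta\!\left(d_t \mid x, d_{<t}\right)}_{\text{decomposition}}
    \;\cdot\;
    \underbrace{\prod_{t=1}^{|r|} \pi_\theta\!\left(r_t \mid x, d, r_{<t}\right)}_{\text{routing}}
```

Both products use the same parameters θ, which is what "no auxiliary heads" would mean under this reading: the routing tokens condition on the decomposition only through the causal attention mask.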

The "selective delegation" framing is the paper's most distinctive conceptual contribution. The system can collapse to a zero-cost direct answer for simple queries or expand into multi-step orchestration for complex ones, all under a unified objective. This parsimony is structurally enforced rather than heuristically tuned.

2. Methodological Rigor

Training pipeline: The two-stage approach (SFT on verifier-gated teacher trajectories → Agentic-GRPO refinement) is well-motivated. The verifier-gated curriculum construction (Equation 3) is sound—discarding tasks the cold-start router already solves and splitting remainder by teacher success creates clean learning signal partitions.
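The gating rule described above can be sketched in a few lines. This is a minimal illustration, assuming each task carries two boolean verifier verdicts; the names TaskOutcome and partition_curriculum are hypothetical, and how the paper uses each partition downstream is not stated here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskOutcome:
    task_id: str
    cold_start_solved: bool  # verifier verdict for the cold-start router (assumed field)
    teacher_solved: bool     # verifier verdict for the teacher orchestrator (assumed field)


def partition_curriculum(outcomes):
    """Drop tasks the cold-start router already solves; split the rest by teacher success."""
    dropped, teacher_success, teacher_failure = [], [], []
    for o in outcomes:
        if o.cold_start_solved:
            dropped.append(o.task_id)          # no learning signal: already solved
        elif o.teacher_solved:
            teacher_success.append(o.task_id)  # teacher provides a verified trajectory
        else:
            teacher_failure.append(o.task_id)  # neither router nor teacher succeeds
    return dropped, teacher_success, teacher_failure


# Made-up outcomes, purely for illustration.
outcomes = [
    TaskOutcome("t1", True, True),
    TaskOutcome("t2", False, True),
    TaskOutcome("t3", False, False),
]
dropped, teacher_success, teacher_failure = partition_curriculum(outcomes)
```

The sketch also makes the scalability concern concrete: both verdicts must be computed for every task in the pool before any partition can be formed.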

Agentic-GRPO: The turn-level credit assignment (Equation 6) addresses a real limitation of vanilla GRPO for multi-turn agents. The bounded process shaping (S(τ) ≤ 0.10) preventing the policy from harvesting "cheap-and-wrong" cost bonuses is a thoughtful design choice.
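To see why a cap on the shaping term matters, consider the sketch below. It assumes a binary outcome reward (1 for a verified answer, 0 otherwise) plus a shaping bonus clipped at the 0.10 bound quoted above, followed by the standard group-relative normalization used in vanilla GRPO. The function names are hypothetical, and the turn-level credit assignment of Equation 6 is not modeled.

```python
import statistics

SHAPING_CAP = 0.10  # upper bound on S(tau) quoted in the review


def trajectory_reward(correct, shaping_bonus):
    """Outcome reward (assumed binary) plus a shaping bonus capped at SHAPING_CAP."""
    outcome = 1.0 if correct else 0.0
    return outcome + min(shaping_bonus, SHAPING_CAP)


def group_relative_advantages(rewards):
    """Vanilla GRPO-style normalization: (r - group mean) / group std."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    if std == 0.0:
        return [0.0 for _ in rewards]  # identical rewards carry no preference
    return [(r - mean) / std for r in rewards]


# A cheap but wrong rollout cannot outscore an expensive correct one,
# however large its raw cost bonus would have been.
cheap_wrong = trajectory_reward(correct=False, shaping_bonus=0.5)
costly_right = trajectory_reward(correct=True, shaping_bonus=0.0)
advantages = group_relative_advantages([cheap_wrong, costly_right])
```

Under these assumptions the wrong rollout tops out at 0.10 while any correct rollout scores at least 1.0, so correctness always decides the sign of the advantage within a mixed group.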

Blind worker protocol: Anonymizing worker identities during RL (WORKER 1...K) and demonstrating that removing this increases cost from $0.16 to $0.82 with <1 point accuracy gain is a compelling ablation showing the router learns capability profiles through interaction rather than brand shortcuts.
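A minimal sketch of what such anonymization could look like follows. The helper anonymize_workers and the placeholder names model-a, model-b, and model-c are hypothetical; whether the paper fixes or reshuffles the alias assignment across episodes is not stated here, so the mapping below is simply positional.

```python
def anonymize_workers(worker_names):
    """Map real worker names to neutral aliases WORKER 1..K (names assumed unique)."""
    alias_of = {name: f"WORKER {i}" for i, name in enumerate(worker_names, start=1)}
    name_of = {alias: name for name, alias in alias_of.items()}
    return alias_of, name_of


# Placeholder worker names, not the paper's actual pool.
alias_of, name_of = anonymize_workers(["model-a", "model-b", "model-c"])

router_view = sorted(alias_of.values())  # the only identifiers the router sees
dispatched_to = name_of["WORKER 2"]      # real name resolved only at dispatch time
```

With brand names hidden, the only way for the policy to prefer one alias over another is through the outcomes and costs it observes when dispatching to it.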

Concerns:

The 61,201 SFT trajectories are teacher-distilled, meaning quality depends heavily on the teacher orchestrator (GPT-5.4, Claude Opus, etc.). The paper doesn't deeply analyze failure modes of teacher distillation.

The worker pool includes very recent commercial models (GPT-5.3-Codex, GPT-5.4, Claude-Opus-4-6), making reproducibility challenging as these APIs evolve.

The cost comparison depends on specific API pricing at evaluation time—a moving target.

3. Evaluation Breadth and Results

The evaluation is comprehensive: 13 benchmarks across 5 capability domains, 22 baselines spanning 5 families, and multiple ablation dimensions (training stages, worker pools, router backbone sizes, domain shift). The 77.0% macro pass@1 with ~16% improvement over AgentOrchestra at ~10× lower cost is a strong result if reproducible.
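For readers unfamiliar with the headline metric, the sketch below assumes the usual meaning of a macro average: the unweighted mean of per-benchmark pass@1, so that each benchmark counts equally regardless of its size. The function name and the example scores are made up for illustration and are not the paper's numbers.

```python
def macro_pass_at_1(per_benchmark):
    """Unweighted mean of per-benchmark pass@1 values, each in [0, 1]."""
    if not per_benchmark:
        raise ValueError("need at least one benchmark")
    return sum(per_benchmark.values()) / len(per_benchmark)


# Made-up scores for three hypothetical benchmarks.
scores = {"bench_a": 0.5, "bench_b": 1.0, "bench_c": 0.0}
macro = macro_pass_at_1(scores)
```

One consequence of this averaging is that a small benchmark moves the headline number as much as a large one, which is worth keeping in mind when reading the per-benchmark losses noted later.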

The stage ablation (Table 3) showing monotonic improvement across all benchmarks from Uno-base through Uno-Orchestra is convincing evidence that each component contributes. The weak-worker-pool ablation (Table 4) demonstrating gains even without frontier commercial models strengthens the claim that the routing policy itself, not just access to better workers, drives improvements.

The generalization analysis across in-domain, near-domain, and out-of-domain regimes (Figure 4) showing increasing margins in harder transfer settings is notable and suggests genuine policy learning rather than benchmark memorization.

4. Timeliness & Relevance

This paper arrives at a critical moment. The proliferation of LLMs at different capability-cost points creates an urgent need for intelligent orchestration. The cost awareness is particularly timely—deploying frontier models for every query is economically unsustainable at scale. The formalization of the "selective delegation" problem and the demonstration that a 7B router can effectively coordinate heterogeneous workers addresses a practical deployment bottleneck.

The work also responds to the emerging consensus that capability emerges from coordination rather than individual model scale, positioning it well within current research trajectories on compound AI systems.

5. Strengths

Unified formulation: Collapsing decomposition and routing into one policy is architecturally clean and eliminates the planner-dispatcher coordination overhead that plagues hierarchical systems.

Four emergent behavior modes (lazy/oneshot/continuation/decomp_repair) arising naturally from training demonstrate the policy's flexibility without mode-specific engineering.

Pareto dominance: Simultaneously improving accuracy AND reducing cost (rather than trading one for the other) is rare and practically significant.

Extensive ablations: Worker pool diversity, router backbone size, blind protocol, and training stage ablations systematically isolate contributions.

Reproducibility effort: Released code and dataset, though commercial API dependence limits full reproducibility.
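The Pareto dominance claim among the strengths above can be checked mechanically once accuracy and cost are tabulated. The following is a minimal sketch, assuming higher accuracy and lower cost are both preferred; the System type and the example values are hypothetical and do not come from the paper.

```python
from typing import NamedTuple


class System(NamedTuple):
    name: str
    accuracy: float  # higher is better
    cost: float      # lower is better


def dominates(a, b):
    """True if a is no worse than b on both axes and strictly better on at least one."""
    no_worse = a.accuracy >= b.accuracy and a.cost <= b.cost
    strictly_better = a.accuracy > b.accuracy or a.cost < b.cost
    return no_worse and strictly_better


def pareto_front(systems):
    """Systems that no other system dominates."""
    return [s for s in systems if not any(dominates(o, s) for o in systems)]


# Made-up numbers, purely illustrative.
systems = [
    System("router", 0.80, 0.2),
    System("workflow", 0.70, 2.0),
    System("cheap", 0.50, 0.1),
]
front = [s.name for s in pareto_front(systems)]
```

Note that dominance on a macro average does not imply dominance on every individual benchmark, so a per-benchmark version of this check would be a stricter test.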

6. Limitations & Weaknesses

Commercial API dependence: Results are snapshots tied to specific API versions and pricing. The paper acknowledges this but the core results cannot be independently verified as these endpoints evolve.

Scalability of SFT curriculum: The verifier-gated curriculum requires running both cold-start and teacher orchestrators on the full pool, which is expensive. The paper doesn't discuss how this scales to new domains.

Limited theoretical grounding: The selective delegation formulation is operationally defined but lacks formal analysis of when/why joint optimization should outperform sequential decomposition-then-routing.

Evaluation fairness: Some baselines (e.g., Router-R1, ToolLLM) perform suspiciously poorly, suggesting potential implementation or configuration disadvantages. The claim of 22 baselines is impressive but some appear to be straw-man configurations.

The paper is very long (32+ pages) with extensive appendices, which may indicate that the key insights have not been fully separated from supporting material.

Missing analysis: No systematic study of failure cases where Uno-Orchestra underperforms (GAIA, SWE-bench show marginal losses to AgentOrchestra), and limited discussion of when selective delegation might be inappropriate.

7. Overall Assessment

Uno-Orchestra makes a solid engineering and systems contribution to LLM orchestration with a clean formulation and thorough evaluation. The simultaneous accuracy improvement and cost reduction is the strongest selling point. The main risks to impact are commercial API dependence limiting reproducibility and the possibility that rapid model improvements may reduce the need for sophisticated routing. The Agentic-GRPO contribution, while incremental over prior credit assignment work, is well-adapted to the specific orchestration setting.

Rating: 7.2 / 10

Significance 7.5 · Rigor 7 · Novelty 6.8 · Clarity 6.5

Generated May 7, 2026

Comparison History (33)

Won vs. Cooperate to Compete: Strategic Coordination in Multi-Agent Conquest

Paper 2 likely has higher impact due to a broadly applicable, methodologically unified orchestration policy that jointly optimizes decomposition, routing, and budget via learned policy (RL from real worker interactions), with strong evidence across 13 benchmarks and 22 baselines plus major cost reductions—highly timely for scalable LLM agent deployment. Paper 1 offers a valuable mixed-motive negotiation testbed and human-vs-AI behavioral findings, but its primary contribution is an environment/dataset with more domain-specific applicability and less clear generalization to diverse tasks compared to an accuracy–efficiency advance in agent routing.

gpt-5.2 · May 7, 2026

Won vs. Cooperate to Compete: Strategic Coordination in Multi-Agent Conquest

Paper 2 has higher potential impact due to a more broadly applicable contribution: a learned orchestration policy that jointly optimizes decomposition, routing, and cost across diverse benchmarks. Its demonstrated gains on 13 benchmarks versus 22 baselines, plus a large accuracy improvement and ~10× lower cost, suggest strong methodological rigor and immediate practical relevance for deploying multi-agent LLM systems. Paper 1 is novel and valuable as a mixed-motive negotiation testbed with human studies, but its impact is narrower (environment-specific) and more diagnostic than generally enabling for real-world systems.

gpt-5.2 · May 7, 2026

Won vs. TrigReason: Trigger-Based Collaboration between Small and Large Reasoning Models

Uno-Orchestra addresses a broader problem—unified orchestration of multi-agent LLM systems—with joint optimization of decomposition, routing, and cost under a single objective. Its evaluation across 13 benchmarks and 22 baselines demonstrates stronger methodological rigor and wider applicability. While TrigReason offers a useful SRM/LRM collaboration framework with practical latency/cost savings, it targets a narrower problem (accelerating chain-of-thought reasoning) evaluated on only 3 benchmarks. Uno-Orchestra's ~16% accuracy gain plus order-of-magnitude cost reduction across diverse task types suggests broader impact on the rapidly growing multi-agent systems field.

claude-opus-4-6 · May 7, 2026

Lost vs. Deep Researcher Agent: An Autonomous Framework for 24/7 Deep Learning Experimentation with Zero-Cost Monitoring

Paper 2 presents a framework for fully autonomous, 24/7 deep learning experimentation, representing a significant step toward AI scientists. Its innovations in zero-cost monitoring and constant-size memory directly solve major bottlenecks in long-horizon autonomous agents. While Paper 1 offers a strong algorithmic advancement in agent routing and task decomposition, Paper 2 has a broader transformational potential across the scientific pipeline and real-world autonomous system deployment.

gemini-3-pro-preview · May 7, 2026

Lost vs. Claw-Eval: Toward Trustworthy Evaluation of Autonomous Agents

Claw-Eval addresses fundamental limitations in how autonomous agents are evaluated—trajectory-opaque grading, safety/robustness gaps, and narrow coverage—which are critical bottlenecks for the entire field. Its findings (e.g., 44% of safety violations missed by traditional evaluation) have broad implications for trustworthy AI deployment. While Uno-Orchestra offers strong engineering contributions to multi-agent orchestration with impressive performance gains, Claw-Eval's impact is broader: it establishes evaluation infrastructure and methodology that will influence how all future agent systems are assessed, making it more foundational for the field.

claude-opus-4-6 · May 7, 2026

Lost vs. Mitigating Misalignment Contagion by Steering with Implicit Traits

Paper 2 identifies a novel and important phenomenon—misalignment contagion in multi-agent LM systems—that has significant safety implications as AI systems are increasingly deployed in multi-agent settings. It addresses a fundamental gap in alignment research (single-agent focus vs. multi-agent dynamics) and proposes a practical, parameter-free mitigation technique. Its breadth of impact spans AI safety, multi-agent systems, and policy/governance. While Paper 1 offers strong engineering contributions to agent routing with impressive benchmarks, Paper 2 opens a new research direction with broader implications for safe AI deployment.

claude-opus-4-6 · May 7, 2026

Won vs. Position: Embodied AI Requires a Privacy-Utility Trade-off

Paper 2 presents a concrete, empirically validated system (Uno-Orchestra) with strong quantitative results across 13 benchmarks against 22 baselines, demonstrating significant improvements in both accuracy (+16%) and cost efficiency (~10x reduction). This addresses a core, widely-relevant problem in LLM multi-agent orchestration with immediately applicable methods. Paper 1, while addressing an important topic (privacy in embodied AI), is a position paper with only preliminary/conceptual validation, proposing a framework (SPINE) without rigorous empirical demonstration. Paper 2's methodological rigor, breadth of evaluation, and practical applicability give it higher near-term scientific impact.

claude-opus-4-6 · May 7, 2026

Won vs. Agentic Forecasting using Sequential Bayesian Updating of Linguistic Beliefs

Paper 1 tackles a fundamental bottleneck in multi-agent LLM systems by dynamically optimizing task decomposition and resource allocation. Its broad evaluation across 13 benchmarks, demonstrating a 16% performance gain alongside a 10x reduction in inference costs, offers immense practical value and broader applicability than the specialized forecasting focus of Paper 2.

gemini-3-pro-preview · May 7, 2026

Lost vs. AgentTrust: Runtime Safety Evaluation and Interception for AI Agent Tool Use

Paper 2 likely has higher scientific impact due to its strong real-world applicability and timeliness: runtime interception for agent tool safety addresses immediate, high-stakes deployment risks (data loss/exfiltration) across many domains. It contributes a concrete system with low latency, multiple complementary mechanisms (normalization, multi-step chain detection, safer-action suggestions), and sizable benchmarks, enabling broader adoption and follow-on research. Paper 1 is novel and rigorous for efficiency/accuracy in LLM orchestration, but its impact is more incremental within agent routing, whereas safety layers can influence standards, infrastructure, and cross-field practices.

gpt-5.2 · May 7, 2026

Won vs. Curated AI beats frontier LLMs at pharma asset discovery

Paper 2 introduces a novel, generalizable framework (Uno-Orchestra) for multi-agent LLM orchestration that jointly optimizes decomposition, routing, and cost—a fundamental challenge in AI systems. It demonstrates broad impact across 13 benchmarks and 22 baselines spanning diverse tasks, with significant improvements in both accuracy (+16%) and cost (~10x reduction). Paper 1, while practically useful for pharma, is narrower in scope—essentially a domain-specific benchmark showing curated databases outperform web search for niche queries, which is somewhat expected. Paper 2's methodological contribution to agent orchestration has broader applicability across fields.

claude-opus-4-6 · May 7, 2026
