Zhiqing Cui, Haotong Xie, Jiahao Yuan, Cheng Yang, Hanqing Wang, Yuxin Wu, Yifan Wu, Siru Zhong
Large language model (LLM) multi-agent systems typically rely on rigid orchestration, committing either to flat per-query routing or to hand-engineered task decomposition, so decomposition depth, worker choice, and inference budget are not jointly optimized under one objective. We introduce Uno-Orchestra, a unified orchestration policy that selectively decomposes a task and dispatches each subtask to an admissible (model, primitive) pair, with both decisions learned together from curated RL trajectories grounded in real worker interactions. Against 22 baselines on a 13-benchmark suite spanning math, code, knowledge, long-context, and agentic tool-use, Uno-Orchestra reaches 77.0% macro pass@1, roughly 16% above the strongest workflow baseline, at roughly an order of magnitude lower per-query cost, advancing the accuracy-efficiency frontier of selective delegation.
Uno-Orchestra addresses a genuine gap in LLM multi-agent orchestration: the disconnect between task decomposition and model routing. Prior work either routes queries to single experts (flat routing) or decomposes tasks via hand-engineered planners that don't jointly optimize which worker handles each subtask. The key novelty is collapsing both decisions—whether and how to decompose and which (model, primitive) pair to assign—into a single causal language model's forward pass. The factorization in Equation 4 is elegant: decomposition and routing share the same backbone with no auxiliary heads, separated only by causal masking.
The "selective delegation" framing is the paper's most distinctive conceptual contribution. The system can collapse to a zero-cost direct answer for simple queries or expand into multi-step orchestration for complex ones, all under a unified objective. This parsimony is structurally enforced rather than heuristically tuned.
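As a rough illustration of the selective-delegation idea described above, the sketch below models a plan that is either a direct answer with no worker calls or a list of subtasks, each bound to an admissible (model, primitive) pair. The `Plan` and `Subtask` classes, the `ADMISSIBLE` set, and the worker and primitive names are all hypothetical placeholders, not the paper's data structures.

```python
from dataclasses import dataclass, field
from typing import List, Optional

# Hypothetical admissible (model, primitive) pairs; the names are placeholders.
ADMISSIBLE = {("worker_a", "code_exec"), ("worker_a", "chat"), ("worker_b", "chat")}

@dataclass
class Subtask:
    description: str
    model: str
    primitive: str

@dataclass
class Plan:
    # Exactly one of the two fields should be populated.
    direct_answer: Optional[str] = None
    subtasks: List[Subtask] = field(default_factory=list)

    def is_valid(self) -> bool:
        """A plan is a bare direct answer, or a non-empty list of admissible dispatches."""
        if self.direct_answer is not None:
            return not self.subtasks
        return bool(self.subtasks) and all(
            (s.model, s.primitive) in ADMISSIBLE for s in self.subtasks
        )

    def worker_calls(self) -> int:
        """Number of worker dispatches; zero when the plan collapses to a direct answer."""
        return len(self.subtasks)

easy = Plan(direct_answer="4")
hard = Plan(subtasks=[Subtask("write unit tests", "worker_a", "code_exec")])
bad = Plan(subtasks=[Subtask("run script", "worker_b", "code_exec")])
```

Here `easy` makes no worker calls, `hard` dispatches one admissible subtask, and `bad` is rejected because its pair is not in the assumed admissible set.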
Training pipeline: The two-stage approach (SFT on verifier-gated teacher trajectories → Agentic-GRPO refinement) is well-motivated. The verifier-gated curriculum construction (Equation 3) is sound: discarding tasks the cold-start router already solves and splitting the remainder by teacher success yields cleanly partitioned learning signals.
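The curriculum filter described above can be sketched as follows. This is a minimal illustration, not a reproduction of Equation 3: the `TaskOutcome` record, its field names, and the toy task IDs are assumptions, and how each partition is used downstream is not specified here.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class TaskOutcome:
    task_id: str
    router_solved: bool   # assumed flag: cold-start router passed the verifier
    teacher_solved: bool  # assumed flag: teacher trajectory passed the verifier

def partition_curriculum(outcomes):
    """Drop tasks the cold-start router already solves; split the rest by teacher success."""
    teacher_pass, teacher_fail = [], []
    for o in outcomes:
        if o.router_solved:
            continue  # already solved, so it carries little learning signal
        (teacher_pass if o.teacher_solved else teacher_fail).append(o.task_id)
    return teacher_pass, teacher_fail

# Toy outcomes, invented purely for illustration.
outcomes = [
    TaskOutcome("t1", router_solved=True, teacher_solved=True),
    TaskOutcome("t2", router_solved=False, teacher_solved=True),
    TaskOutcome("t3", router_solved=False, teacher_solved=False),
    TaskOutcome("t4", router_solved=True, teacher_solved=False),
]
teacher_pass, teacher_fail = partition_curriculum(outcomes)
```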
Agentic-GRPO: The turn-level credit assignment (Equation 6) addresses a real limitation of vanilla GRPO for multi-turn agents. Bounding the process-shaping term (S(τ) ≤ 0.10) keeps the policy from harvesting "cheap-and-wrong" cost bonuses, which is a thoughtful design choice.
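To see why a cap on the shaping term matters, consider the sketch below. It assumes a binary outcome reward (1 for a correct trajectory, 0 otherwise) and uses the standard group-relative normalization from GRPO; both are assumptions for illustration. Equation 6 is not reproduced here, so the turn-level credit assignment itself is not sketched.

```python
import statistics

SHAPING_CAP = 0.10  # upper bound on the shaping term, S(tau) <= 0.10

def shaped_reward(correct, raw_shaping, cap=SHAPING_CAP):
    """Outcome reward (assumed binary) plus a shaping bonus clipped at the cap."""
    outcome = 1.0 if correct else 0.0
    return outcome + min(raw_shaping, cap)

def group_relative_advantages(rewards, eps=1e-8):
    """Standard GRPO-style normalization within one sampled group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

# Toy group of (correct, raw shaping bonus) pairs; values are invented.
group = [(True, 0.02), (False, 0.50), (True, 0.08), (False, 0.00)]
rewards = [shaped_reward(c, s) for c, s in group]
advantages = group_relative_advantages(rewards)
```

With the cap in place, the second trajectory's large raw bonus is clipped to 0.10, so no wrong trajectory can outrank a correct one under the assumed binary outcome reward.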
Blind worker protocol: Worker identities are anonymized during RL (WORKER 1...K). The ablation showing that removing anonymization raises cost (from 0.82) while gaining less than one point of accuracy is compelling: it suggests the router learns capability profiles through interaction rather than through brand shortcuts.
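One way such anonymization could work is sketched below. The `anonymize_workers` helper, the placeholder model names, and the choice to shuffle the alias assignment are assumptions; the review only states that workers appear as WORKER 1...K during RL.

```python
import random

def anonymize_workers(worker_names, seed=None):
    """Map opaque aliases 'WORKER 1'..'WORKER K' to real worker names.

    Shuffling the assignment is an assumption made for this sketch, so that an
    alias index carries no stable information about the underlying model.
    """
    rng = random.Random(seed)
    shuffled = list(worker_names)
    rng.shuffle(shuffled)
    return {f"WORKER {i}": name for i, name in enumerate(shuffled, start=1)}

# Placeholder model names, not the paper's worker pool.
alias_to_worker = anonymize_workers(["model-x", "model-y", "model-z"], seed=0)

# The router would only ever see the aliases, never the real names.
prompt_view = sorted(alias_to_worker, key=lambda alias: int(alias.split()[1]))
```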
Concerns:
The evaluation is comprehensive: 13 benchmarks across 5 capability domains, 22 baselines spanning 5 families, and multiple ablation dimensions (training stages, worker pools, router backbone sizes, domain shift). The 77.0% macro pass@1 with ~16% improvement over AgentOrchestra at ~10× lower cost is a strong result if reproducible.
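For readers unfamiliar with the headline metric, macro pass@1 is conventionally the unweighted mean of per-benchmark pass@1 rates, so small benchmarks count as much as large ones. The sketch below contrasts it with a query-weighted (micro) average; the benchmark names and outcomes are toy values, not results from the paper.

```python
def pass_at_1(results):
    """Fraction of queries solved on the first attempt."""
    return sum(results) / len(results)

def macro_pass_at_1(per_benchmark):
    """Unweighted mean of per-benchmark pass@1 rates."""
    rates = [pass_at_1(r) for r in per_benchmark.values()]
    return sum(rates) / len(rates)

def micro_pass_at_1(per_benchmark):
    """Query-weighted pass@1 over the pooled queries, shown for contrast."""
    pooled = [x for r in per_benchmark.values() for x in r]
    return pass_at_1(pooled)

# Toy per-query outcomes, invented for illustration only.
toy = {
    "toy_math": [True, True, False, False],  # pass@1 = 0.5
    "toy_code": [True],                      # pass@1 = 1.0
}
macro = macro_pass_at_1(toy)
micro = micro_pass_at_1(toy)
```

In this toy case the macro average is 0.75 while the micro average is 0.6, which shows how the two can diverge when benchmark sizes differ.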
The stage ablation (Table 3) showing monotonic improvement across all benchmarks from Uno-base through Uno-Orchestra is convincing evidence that each component contributes. The weak-worker-pool ablation (Table 4) demonstrating gains even without frontier commercial models strengthens the claim that the routing policy itself, not just access to better workers, drives improvements.
The generalization analysis across in-domain, near-domain, and out-of-domain regimes (Figure 4) showing increasing margins in harder transfer settings is notable and suggests genuine policy learning rather than benchmark memorization.
This paper arrives at a critical moment. The proliferation of LLMs at different capability-cost points creates an urgent need for intelligent orchestration. The cost awareness is particularly timely—deploying frontier models for every query is economically unsustainable at scale. The formalization of the "selective delegation" problem and the demonstration that a 7B router can effectively coordinate heterogeneous workers addresses a practical deployment bottleneck.
The work also responds to the emerging consensus that capability emerges from coordination rather than individual model scale, positioning it well within current research trajectories on compound AI systems.
Uno-Orchestra makes a solid engineering and systems contribution to LLM orchestration with a clean formulation and thorough evaluation. The simultaneous accuracy improvement and cost reduction is the strongest selling point. The main risks to impact are commercial API dependence limiting reproducibility and the possibility that rapid model improvements may reduce the need for sophisticated routing. The Agentic-GRPO contribution, while incremental over prior credit assignment work, is well-adapted to the specific orchestration setting.
Generated May 7, 2026
Paper 2 likely has higher impact due to a broadly applicable, methodologically unified orchestration policy that jointly optimizes decomposition, routing, and budget via learned policy (RL from real worker interactions), with strong evidence across 13 benchmarks and 22 baselines plus major cost reductions—highly timely for scalable LLM agent deployment. Paper 1 offers a valuable mixed-motive negotiation testbed and human-vs-AI behavioral findings, but its primary contribution is an environment/dataset with more domain-specific applicability and less clear generalization to diverse tasks compared to an accuracy–efficiency advance in agent routing.
Paper 2 has higher potential impact due to a more broadly applicable contribution: a learned orchestration policy that jointly optimizes decomposition, routing, and cost across diverse benchmarks. Its demonstrated gains on 13 benchmarks versus 22 baselines, plus a large accuracy improvement and ~10× lower cost, suggest strong methodological rigor and immediate practical relevance for deploying multi-agent LLM systems. Paper 1 is novel and valuable as a mixed-motive negotiation testbed with human studies, but its impact is narrower (environment-specific) and more diagnostic than generally enabling for real-world systems.
Uno-Orchestra addresses a broader problem—unified orchestration of multi-agent LLM systems—with joint optimization of decomposition, routing, and cost under a single objective. Its evaluation across 13 benchmarks and 22 baselines demonstrates stronger methodological rigor and wider applicability. While TrigReason offers a useful SRM/LRM collaboration framework with practical latency/cost savings, it targets a narrower problem (accelerating chain-of-thought reasoning) evaluated on only 3 benchmarks. Uno-Orchestra's ~16% accuracy gain plus order-of-magnitude cost reduction across diverse task types suggests broader impact on the rapidly growing multi-agent systems field.
Paper 2 presents a framework for fully autonomous, 24/7 deep learning experimentation, representing a significant step toward AI scientists. Its innovations in zero-cost monitoring and constant-size memory directly solve major bottlenecks in long-horizon autonomous agents. While Paper 1 offers a strong algorithmic advancement in agent routing and task decomposition, Paper 2 has a broader transformational potential across the scientific pipeline and real-world autonomous system deployment.
Claw-Eval addresses fundamental limitations in how autonomous agents are evaluated—trajectory-opaque grading, safety/robustness gaps, and narrow coverage—which are critical bottlenecks for the entire field. Its findings (e.g., 44% of safety violations missed by traditional evaluation) have broad implications for trustworthy AI deployment. While Uno-Orchestra offers strong engineering contributions to multi-agent orchestration with impressive performance gains, Claw-Eval's impact is broader: it establishes evaluation infrastructure and methodology that will influence how all future agent systems are assessed, making it more foundational for the field.
Paper 2 identifies a novel and important phenomenon—misalignment contagion in multi-agent LM systems—that has significant safety implications as AI systems are increasingly deployed in multi-agent settings. It addresses a fundamental gap in alignment research (single-agent focus vs. multi-agent dynamics) and proposes a practical, parameter-free mitigation technique. Its breadth of impact spans AI safety, multi-agent systems, and policy/governance. While Paper 1 offers strong engineering contributions to agent routing with impressive benchmarks, Paper 2 opens a new research direction with broader implications for safe AI deployment.
Paper 2 presents a concrete, empirically validated system (Uno-Orchestra) with strong quantitative results across 13 benchmarks against 22 baselines, demonstrating significant improvements in both accuracy (+16%) and cost efficiency (~10x reduction). This addresses a core, widely-relevant problem in LLM multi-agent orchestration with immediately applicable methods. Paper 1, while addressing an important topic (privacy in embodied AI), is a position paper with only preliminary/conceptual validation, proposing a framework (SPINE) without rigorous empirical demonstration. Paper 2's methodological rigor, breadth of evaluation, and practical applicability give it higher near-term scientific impact.
Paper 1 tackles a fundamental bottleneck in multi-agent LLM systems by dynamically optimizing task decomposition and resource allocation. Its broad evaluation across 13 benchmarks, demonstrating a 16% performance gain alongside a 10x reduction in inference costs, offers immense practical value and broader applicability than the specialized forecasting focus of Paper 2.
Paper 2 likely has higher scientific impact due to its strong real-world applicability and timeliness: runtime interception for agent tool safety addresses immediate, high-stakes deployment risks (data loss/exfiltration) across many domains. It contributes a concrete system with low latency, multiple complementary mechanisms (normalization, multi-step chain detection, safer-action suggestions), and sizable benchmarks, enabling broader adoption and follow-on research. Paper 1 is novel and rigorous for efficiency/accuracy in LLM orchestration, but its impact is more incremental within agent routing, whereas safety layers can influence standards, infrastructure, and cross-field practices.
Paper 2 introduces a novel, generalizable framework (Uno-Orchestra) for multi-agent LLM orchestration that jointly optimizes decomposition, routing, and cost—a fundamental challenge in AI systems. It demonstrates broad impact across 13 benchmarks and 22 baselines spanning diverse tasks, with significant improvements in both accuracy (+16%) and cost (~10x reduction). Paper 1, while practically useful for pharma, is narrower in scope—essentially a domain-specific benchmark showing curated databases outperform web search for niche queries, which is somewhat expected. Paper 2's methodological contribution to agent orchestration has broader applicability across fields.