Learning to Hand Off: Provably Convergent Workflow Learning under Interface Constraints
Jiayu Li, Enpei Zhang, Dawei Zhou, Elynn Chen, Yujun Yan
Abstract
We study workflow learning in a setting where specialized agents hand off control through a shared artifact, each agent observes only a local function of that artifact and its own private state, and no centralized learner accesses joint trajectories -- the operating regime of multi-agent LLM pipelines that span organizational, vendor, or trust boundaries. We formalize this regime as an interface-constrained semi-Markov decision process (IC-SMDP), whose decision epochs occur at handoff times, and design IC-Q, an asynchronous decentralized Q-learning algorithm in which cross-agent coordination at every handoff consists of exactly one scalar. Our main result is a finite-sample bound for neural IC-Q that decomposes into three independently controllable error sources: neural function-approximation error, interface representation gap, and a mixing-time residual, under the random option-duration discount. Establishing this bound requires lifting the approximate information state (AIS) framework from single-agent primitive-step MDPs to multi-agent SMDPs and controlling Markovian noise under random duration, neither of which has been done in prior work. To our knowledge this is the first finite-sample guarantee for neural Q-learning under decentralized partial observability. Four experiments (a controlled synthetic IC-SMDP that validates the bound term-by-term, multi-LLM mathematical reasoning, multi-agent routing, and multi-agent CPU programming) show that IC-Q matches a centralized oracle without any agent observing joint trajectories, with each of the three error sources scaling along its corresponding axis as the bound predicts.
AI Impact Assessments
(1 model)
Scientific Impact Assessment
1. Core Contribution
This paper addresses a genuinely important gap at the intersection of multi-agent reinforcement learning and LLM pipeline orchestration. The core contribution is threefold: (a) the IC-SMDP formalism, which captures the operating regime of multi-agent systems where agents hand off control through a shared artifact without centralized trajectory access; (b) IC-Q, an asynchronous decentralized Q-learning algorithm requiring only a single scalar to be communicated at each handoff; and (c) a finite-sample convergence bound that decomposes into three interpretable, independently controllable error terms.
The problem formulation is well-motivated by real-world deployment scenarios: multi-vendor LLM pipelines where privacy, API contracts, or organizational boundaries prevent sharing of internal states or joint trajectories. The paper clearly distinguishes its sequential handoff regime from concurrent multi-agent settings (Dec-POMDPs) and centralized training paradigms (CTDE), positioning itself in a genuinely underserved niche.
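To make the single-scalar handoff concrete, here is a minimal tabular sketch in Python. It is an illustration under stated assumptions, not the paper's IC-Q implementation: the paper's method is neural, and this summary does not say what the communicated scalar is. The sketch assumes that scalar is the receiving agent's greedy value estimate at the handoff. The class name `HandoffAgent` and its methods are hypothetical.

```python
import random
from collections import defaultdict


class HandoffAgent:
    """Tabular stand-in for one agent's local Q-function (hypothetical)."""

    def __init__(self, n_actions, alpha=0.1, gamma=0.95):
        # Q-values are keyed by the agent's *local* observation only.
        self.q = defaultdict(lambda: [0.0] * n_actions)
        self.alpha = alpha
        self.gamma = gamma

    def act(self, obs, epsilon=0.1, rng=random):
        """Epsilon-greedy choice over this agent's local actions."""
        values = self.q[obs]
        if rng.random() < epsilon:
            return rng.randrange(len(values))
        return max(range(len(values)), key=values.__getitem__)

    def handoff_value(self, obs):
        """Assumed protocol: the one scalar returned to the previous agent."""
        return max(self.q[obs])

    def update(self, obs, action, reward, duration, next_value):
        """SMDP-style update; the discount is gamma ** duration."""
        target = reward + (self.gamma ** duration) * next_value
        td_error = target - self.q[obs][action]
        self.q[obs][action] += self.alpha * td_error
        return td_error


# One handoff: the sender never reads the receiver's table, only one scalar.
sender = HandoffAgent(n_actions=2, alpha=0.5, gamma=0.5)
receiver = HandoffAgent(n_actions=2)
receiver.q["receiver_view"][1] = 4.0
scalar = receiver.handoff_value("receiver_view")
td = sender.update("sender_view", 0, reward=1.0, duration=2, next_value=scalar)
```

In this sketch each agent's table is indexed by its own view of the artifact, so no joint trajectory is ever assembled. Only `scalar` crosses the agent boundary.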
2. Methodological Rigor
The theoretical framework is carefully constructed. The authors systematically position the IC-SMDP relative to existing formalisms (single-agent MDPs, Dec-POMDPs, options framework, AIS framework), clearly articulating why each is insufficient.
The finite-sample bound (Theorem 1) represents genuine theoretical novelty. Three technical challenges are convincingly identified: (i) establishing Bellman contraction under random discounting γ^{τ_{k+1}} rather than fixed γ, (ii) propagating AIS gap bounds through the SMDP Bellman operator at the option scale, and (iii) controlling Markovian noise under random option durations. The claim that this is "the first finite-sample guarantee for neural Q-learning under decentralized partial observability" appears substantiated by the careful comparison with prior work.
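For intuition on challenge (i), the following is the textbook argument for why a random discount can still give a contraction in the fully observed tabular case. It is a sketch, not the paper's proof. The notation ($\mathcal{T}$, $R$, $s'$, $o'$) is assumed here rather than taken from the paper, and the argument assumes $\gamma \in (0,1)$ and option durations $\tau \ge 1$ almost surely.

```latex
\begin{align*}
(\mathcal{T}Q)(s,o)
  &= \mathbb{E}\!\left[ R + \gamma^{\tau} \max_{o'} Q(s',o') \,\middle|\, s, o \right], \\
\lVert \mathcal{T}Q_1 - \mathcal{T}Q_2 \rVert_\infty
  &\le \sup_{s,o} \mathbb{E}\!\left[ \gamma^{\tau} \,\middle|\, s, o \right]
       \lVert Q_1 - Q_2 \rVert_\infty
   \le \gamma \, \lVert Q_1 - Q_2 \rVert_\infty .
\end{align*}
```

The first inequality uses $\lvert \max_{o'} Q_1(s',o') - \max_{o'} Q_2(s',o') \rvert \le \lVert Q_1 - Q_2 \rVert_\infty$, and the second uses $\gamma^{\tau} \le \gamma$ when $\tau \ge 1$. This argument does not carry over automatically to partial observability and function approximation, which is presumably why the paper needs a separate contraction condition at the SMDP level.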
The assumptions (A1-A6) are individually standard, but their combination is non-trivial. Assumption A6 (SMDP-level Bellman contraction) is the genuinely new condition, and the authors appropriately discuss why it is necessary. The AIS conditions (Assumptions 1-2) are the substantive modeling restrictions, and the paper is transparent about their strength.
However, the theoretical analysis covers only the pre-configured regime where local-action policies are fixed. The adaptable regime (where agents also learn their local policies) is treated only empirically, creating a gap between theory and the most interesting practical scenarios.
3. Potential Impact
Practical relevance: The framework directly addresses the emerging architecture of multi-agent LLM systems spanning trust boundaries. As agent-to-agent protocols (MCP, A2A) standardize, the IC-SMDP provides a principled foundation for what is currently done through heuristics. The minimal communication requirement (one scalar per handoff) is practically appealing for bandwidth-constrained or privacy-sensitive deployments.
Theoretical impact: The lifting of the AIS framework from single-agent primitive-step MDPs to multi-agent SMDPs is a meaningful theoretical contribution that could enable further work in decentralized temporal abstraction. The three-term error decomposition provides actionable design guidance: system architects can independently control representation fidelity (ε_φ, δ_φ), network capacity, and sample budget.
Broader influence: The work connects several previously separate research threads — options/SMDPs, approximate information states, decentralized RL, and LLM orchestration — into a unified framework. This synthesis could catalyze work at these intersections.
4. Timeliness & Relevance
The paper is exceptionally timely. Multi-agent LLM pipelines are proliferating in production (MetaGPT, AutoGen, AgentCoder), and the emergence of standardized agent-to-agent protocols (Anthropic's MCP, Google's A2A) creates immediate demand for principled workflow optimization. The paper fills a clear theoretical vacuum: existing orchestration systems either use hand-designed workflows or require centralized optimization that violates real-world deployment constraints. The gap between theoretical foundations and practical deployment in this area is wide, and this paper makes a substantive contribution toward closing it.
5. Strengths & Limitations
Key Strengths:
Notable Limitations:
Overall Assessment
This is a strong paper that provides a needed theoretical foundation for an increasingly important practical problem. The formalism is clean, the algorithm is minimal and elegant, and the theory is technically substantial. The main limitation — the gap between the pre-configured theory and adaptable practice — is honestly acknowledged. The work should influence both the theoretical study of decentralized temporal-abstraction RL and the practical design of multi-agent LLM pipelines.
Generated May 20, 2026
Comparison History (22)
Paper 1 offers a profound theoretical breakthrough by providing the first finite-sample guarantee for neural Q-learning under decentralized partial observability. By formalizing multi-agent LLM pipelines across trust/vendor boundaries as IC-SMDPs, it addresses a highly relevant, real-world bottleneck in compound AI systems. While Paper 2 tackles the important trend of inference-time scaling, Paper 1's combination of rigorous mathematical bounds, novel extension of the approximate information state framework, and diverse empirical validation (from synthetic to multi-LLM reasoning) suggests a deeper and more foundational scientific impact on multi-agent reinforcement learning.
Paper 1 (Seirênes) has higher estimated impact due to its broader applicability and timeliness. It addresses a critical and widely recognized problem—LLM reasoning fragility—with a novel self-play framework that scales across model sizes and demonstrates practical impact even against top-tier commercial models. The approach is intuitive, immediately actionable, and relevant to the massive community working on LLM reasoning. Paper 2 is theoretically rigorous with novel convergence guarantees for decentralized multi-agent settings, but its impact is narrower, targeting a more specialized audience. Seirênes' combination of strong empirical gains and broad relevance gives it higher potential impact.
Paper 2 addresses a highly timely and impactful problem: decentralized learning in multi-agent LLM pipelines. It introduces a novel formalization (IC-SMDP) and provides the first finite-sample guarantee for neural Q-learning under decentralized partial observability. Its strong methodological rigor, combined with immediate real-world applications in distributed AI systems, gives it a broader and more significant potential scientific impact compared to the narrower focus on voting theory in Paper 1.
Paper 2 provides a foundational theoretical breakthrough with the first finite-sample guarantee for neural Q-learning under decentralized partial observability. Its framework for multi-agent LLM pipelines addresses a highly relevant problem with broad applicability across domains. In contrast, Paper 1 offers an innovative but narrower application specifically targeting CAD generation. The rigorous methodological contribution and broader potential impact across reinforcement learning and multi-agent systems make Paper 2 more scientifically impactful.
Paper 2 has higher impact potential due to a clear methodological contribution (new IC-SMDP formalism and an asynchronous decentralized neural Q-learning algorithm) paired with a novel finite-sample convergence bound under decentralized partial observability—likely broadly reusable across multi-agent RL, distributed systems, and LLM pipeline orchestration. It is timely for cross-org/vendor agent workflows and offers principled design guidance via decomposed error sources, supported by experiments that validate the theory. Paper 1 is timely and useful as an evaluation study of auto-research, but is more diagnostic/benchmarking-oriented with narrower methodological novelty and less generalizable theoretical output.
Paper 1 presents a novel theoretical framework (IC-SMDP) with the first finite-sample guarantee for neural Q-learning under decentralized partial observability, directly addressing a critical challenge in multi-agent LLM pipelines. Its contributions span reinforcement learning theory, multi-agent systems, and practical LLM orchestration—a highly timely topic. The rigorous mathematical foundations combined with practical demonstrations across diverse domains suggest broad and lasting impact. Paper 2, while a valuable multimodal dataset contribution, serves a narrower community (affective computing) and represents incremental progress in data collection rather than fundamental methodological innovation.
Paper 1 has higher potential impact due to a more fundamental, broadly applicable theoretical contribution: a new IC-SMDP formalization of decentralized handoff-based workflows and the first finite-sample guarantee for neural Q-learning under decentralized partial observability, with a decomposable error bound and methodological novelty (AIS lifted to multi-agent SMDPs). This could influence multi-agent RL, distributed learning, and multi-LLM pipeline design across trust boundaries. Paper 2 is timely and useful (mechanistic interpretability + intervention for MLLM hallucinations), but is narrower in scope and more benchmark/architecture-dependent.
Paper 1 presents a novel theoretical framework (IC-SMDP) with rigorous mathematical guarantees (finite-sample bounds for neural Q-learning under decentralized partial observability), which is a first-of-its-kind result. It combines theoretical novelty with empirical validation across multiple domains including LLM pipelines. Paper 2 is a vision/position paper proposing a conceptual framework for trustworthy agent networks without formal guarantees or empirical results. While both address multi-agent LLM coordination, Paper 1's methodological rigor, provable convergence guarantees, and experimental validation give it substantially higher scientific impact potential.
Paper 1 offers a fundamental theoretical breakthrough by providing the first finite-sample guarantees for neural Q-learning under decentralized partial observability. Its mathematically rigorous approach addresses the highly relevant problem of multi-agent LLM coordination across trust boundaries. In contrast, Paper 2 presents a practical but narrower engineering optimization for token reduction in GUI agents. Paper 1's combination of strong methodological rigor, broad applicability to multi-agent workflows, and significant algorithmic innovation gives it higher potential for widespread scientific impact.
Paper 1 offers higher impact: it introduces a new formalism (IC-SMDP) for decentralized, interface-constrained multi-agent workflows and provides a first-of-its-kind finite-sample convergence bound for neural Q-learning under decentralized partial observability, with a clean error decomposition and supporting experiments. This is methodologically rigorous, broadly relevant to multi-agent RL and multi-LLM pipelines across trust boundaries, and timely for real-world orchestrated AI systems. Paper 2 is valuable and timely but is primarily a reanalysis/negative result with narrower scope and less methodological novelty.
Paper 2 offers a significant theoretical breakthrough by providing the first finite-sample guarantee for neural Q-learning under decentralized partial observability. By formalizing a novel framework (IC-SMDP) and bridging deep reinforcement learning theory with modern multi-agent LLM pipelines, it demonstrates exceptional methodological rigor. While Paper 1 presents a practical and effective algorithm for prompt optimization, Paper 2's foundational theoretical contributions are likely to have a broader, more enduring impact across reinforcement learning, multi-agent systems, and AI safety/coordination.
Paper 1 introduces NRI, a foundation model for zero-shot logical rule induction, which represents a paradigm shift in ILP by enabling transfer across tasks without retraining. The concept of foundation models for symbolic reasoning is highly novel and timely, bridging neural and symbolic AI. While Paper 2 makes strong theoretical contributions (first finite-sample guarantee for neural Q-learning under decentralized partial observability), its impact is narrower, targeting a specific multi-agent workflow setting. Paper 1's broader applicability, connection to the foundation model paradigm, and potential to transform symbolic reasoning give it higher impact potential.
Paper 2 makes a stronger theoretical contribution by formalizing interface-constrained semi-MDPs and providing the first finite-sample guarantee for neural Q-learning under decentralized partial observability. This foundational result—lifting AIS to multi-agent SMDPs with provable convergence bounds—has broader impact across multi-agent systems, decentralized learning, and LLM pipelines spanning trust boundaries. While Paper 1 presents a practical and effective caching system (PEEK) with solid empirical gains, its contribution is more engineering-focused. Paper 2's novel theoretical framework with clean decomposable error bounds opens new research directions in decentralized multi-agent learning.
Paper 2 offers a novel theoretical framework and provably convergent algorithm for multi-agent LLM workflows, a highly active and practically relevant area. It provides the first finite-sample guarantee for neural Q-learning under decentralized partial observability, backed by rigorous math and diverse empirical results. Paper 1, while providing valuable conceptual clarifications regarding Transformer Turing-completeness, is a position paper whose impact is primarily theoretical, lacking the broad algorithmic applicability and empirical validation of Paper 2.
Paper 1 has higher potential impact due to a stronger theoretical contribution: a first finite-sample guarantee for neural Q-learning under decentralized partial observability in an interface-constrained multi-agent SMDP, plus new AIS extensions and Markovian-noise control under random durations. This is methodologically rigorous and broadly relevant to multi-agent RL, decentralized control, and cross-boundary LLM-agent pipelines. Paper 2 is timely and practical for prompt optimization, but is more application/engineering-focused with weaker general theoretical novelty and potentially narrower scientific spillover.
Paper 2 presents a fundamental theoretical breakthrough in multi-agent reinforcement learning, providing the first finite-sample guarantee for neural Q-learning under decentralized partial observability. Its formalization of interface-constrained multi-agent LLM pipelines addresses a highly timely problem with rigorous mathematical foundations and broad empirical validation. While Paper 1 offers valuable practical applications of conformal prediction, Paper 2's deep methodological innovation and potential to shape the foundational design of decentralized multi-agent systems give it a higher potential for lasting scientific impact.
Paper 2 likely has higher impact: it introduces a new formalism (IC-SMDP) matching an emerging, high-demand regime (decentralized multi-agent/LLM workflows across trust boundaries) and provides rare finite-sample convergence guarantees for neural Q-learning under decentralized partial observability, extending AIS to multi-agent SMDPs. The methodological contribution is substantial (new proof techniques, error decomposition) and broadly relevant to RL theory, multi-agent systems, and practical LLM orchestration. Paper 1 is timely and useful for PoS governance analysis, but is narrower in scope and more incremental relative to existing social-choice power-index work.
Paper 2 addresses the critical and highly timely issue of safety in Multimodal LLMs. By providing a geometric understanding of the multimodal safety gap and introducing a training-free, inference-time correction method (ReGap), it offers immediate, practical real-world applications for deploying safe AI systems. While Paper 1 provides strong methodological rigor and novel theoretical bounds for multi-agent RL, Paper 2's focus on AI safety aligns with one of the most pressing challenges in the broader AI community today, likely leading to wider and faster scientific impact.
Paper 1 makes fundamental theoretical contributions—formalizing interface-constrained SMDPs, providing the first finite-sample guarantee for neural Q-learning under decentralized partial observability, and lifting the AIS framework to multi-agent SMDPs. These are novel, rigorous results with broad applicability beyond LLM pipelines to any multi-agent sequential decision-making setting. Paper 2 is a strong engineering/systems contribution with real-world deployment at Uber, but it is more narrowly focused on enterprise AI security and introduces less foundational methodology. Paper 1's theoretical novelty and breadth of impact across reinforcement learning, multi-agent systems, and workflow optimization give it higher long-term scientific impact.
Paper 2 offers foundational theoretical contributions, providing the first finite-sample guarantee for neural Q-learning under decentralized partial observability. Its formalization of multi-agent LLM pipelines as IC-SMDPs addresses a highly timely problem with rigorous methodology and broad applicability across reinforcement learning and AI systems. Paper 1 is valuable for applied fact-checking, but Paper 2's combination of theoretical rigor and diverse empirical validation suggests a broader and deeper scientific impact.