Learning to Hand Off: Provably Convergent Workflow Learning under Interface Constraints

Jiayu Li, Enpei Zhang, Dawei Zhou, Elynn Chen, Yujun Yan

May 18, 2026

arXiv:2605.19140v1

cs.AI (primary)

#160 of 2292 · Artificial Intelligence

Tournament Score: 1528 ± 47 (scale 1050 to 1800)

Win Rate: 86%

Rating: 7.8/10 (Significance 8, Rigor 7.5, Novelty 8, Clarity 8.5)

Abstract

We study workflow learning in a setting where specialized agents hand off control through a shared artifact, each agent observes only a local function of that artifact and its own private state, and no centralized learner accesses joint trajectories. This is the operating regime of multi-agent LLM pipelines that span organizational, vendor, or trust boundaries. We formalize this regime as an interface-constrained semi-Markov decision process (IC-SMDP), whose decision epochs occur at handoff times, and design IC-$Q$, an asynchronous decentralized $Q$-learning algorithm in which cross-agent coordination at every handoff is exactly one scalar. Our main result is a finite-sample bound for neural IC-$Q$ that decomposes into three independently controllable error sources: neural function-approximation error, interface representation gap, and a mixing-time residual, under the random option-duration discount. Establishing this bound requires lifting the approximate information state (AIS) framework from single-agent primitive-step MDPs to multi-agent SMDPs and controlling Markovian noise under random duration, neither of which has been done in prior work. To our knowledge, this is the first finite-sample guarantee for neural $Q$-learning under decentralized partial observability. Four experiments (a controlled synthetic IC-SMDP that validates the bound term-by-term, multi-LLM mathematical reasoning, multi-agent routing, and multi-agent CPU programming) show that IC-$Q$ matches a centralized oracle without any agent observing joint trajectories, with each of the three error sources scaling along its corresponding axis as the bound predicts.

AI Impact Assessments

(1 model)

Scientific Impact Assessment

1. Core Contribution

This paper addresses a genuinely important gap at the intersection of multi-agent reinforcement learning and LLM pipeline orchestration. The core contribution is threefold: (a) the IC-SMDP formalism, which captures the operating regime of multi-agent systems where agents hand off control through a shared artifact without centralized trajectory access; (b) IC-Q, an asynchronous decentralized Q-learning algorithm requiring only a single scalar to be communicated at each handoff; and (c) a finite-sample convergence bound that decomposes into three interpretable, independently controllable error terms.

The problem formulation is well-motivated by real-world deployment scenarios: multi-vendor LLM pipelines where privacy, API contracts, or organizational boundaries prevent sharing of internal states or joint trajectories. The paper clearly distinguishes its sequential handoff regime from concurrent multi-agent settings (Dec-POMDPs) and centralized training paradigms (CTDE), positioning itself in a genuinely underserved niche.
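The single-scalar handoff described above can be illustrated with a minimal tabular sketch. This is not the paper's neural IC-Q: the class `HandoffAgent`, its methods, and the choice of the receiving agent's greedy value as the communicated scalar are all assumptions made for illustration. Each agent keeps a private Q-table over its own local observations, and the update discounts the bootstrap term by `gamma ** duration` to reflect a variable-length option.

```python
import random
from collections import defaultdict


class HandoffAgent:
    """Hypothetical tabular learner; it only ever sees its own local observation."""

    def __init__(self, n_actions, gamma=0.9, lr=0.1, eps=0.1, seed=0):
        # Private Q-table: local observation -> list of action values.
        self.q = defaultdict(lambda: [0.0] * n_actions)
        self.n_actions = n_actions
        self.gamma, self.lr, self.eps = gamma, lr, eps
        self.rng = random.Random(seed)

    def act(self, obs):
        """Epsilon-greedy choice over this agent's own action values."""
        if self.rng.random() < self.eps:
            return self.rng.randrange(self.n_actions)
        row = self.q[obs]
        return max(range(self.n_actions), key=row.__getitem__)

    def value(self, obs):
        """The one scalar passed back to the previous agent (assumed: greedy value)."""
        return max(self.q[obs])

    def update(self, obs, action, reward, duration, next_value):
        """SMDP-style update: the bootstrap is discounted by gamma ** duration."""
        target = reward + (self.gamma ** duration) * next_value
        self.q[obs][action] += self.lr * (target - self.q[obs][action])
        return target


# Usage: agent `a` hands off to agent `b` after an option lasting 2 steps.
a, b = HandoffAgent(2), HandoffAgent(2)
b.q["draft"][1] = 1.0              # pretend b has already learned something
scalar = b.value("draft")          # the only number that crosses the boundary
a.update(obs="task", action=0, reward=0.5, duration=2, next_value=scalar)
print(round(a.q["task"][0], 6))    # 0.1 * (0.5 + 0.9**2 * 1.0)
```

Neither agent reads the other's Q-table or observation in this sketch; `a` receives only `scalar`, which is the property the review highlights.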

2. Methodological Rigor

The theoretical framework is carefully constructed. The authors systematically position the IC-SMDP relative to existing formalisms (single-agent MDPs, Dec-POMDPs, options framework, AIS framework), clearly articulating why each is insufficient.

The finite-sample bound (Theorem 1) represents genuine theoretical novelty. Three technical challenges are convincingly identified: (i) establishing Bellman contraction under random discounting γ^{τ_{k+1}} rather than fixed γ, (ii) propagating AIS gap bounds through the SMDP Bellman operator at the option scale, and (iii) controlling Markovian noise under random option durations. The claim that this is "the first finite-sample guarantee for neural Q-learning under decentralized partial observability" appears substantiated by the careful comparison with prior work.
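Challenge (i) can be made concrete with a standard argument, sketched here for the simplest case only. It assumes a fully observed tabular setting, $\gamma \in (0,1)$, and option durations $\tau \ge 1$; the operator $\mathcal{T}$ and the symbols $r$, $s$, $s'$ are illustrative notation rather than the paper's, and the sketch does not cover partial observability or function approximation.

```latex
\begin{align*}
(\mathcal{T}Q)(s,a)
  &= \mathbb{E}\!\left[\, r + \gamma^{\tau}\max_{a'} Q(s',a') \,\middle|\, s,a \right],\\
\bigl|(\mathcal{T}Q_1)(s,a)-(\mathcal{T}Q_2)(s,a)\bigr|
  &\le \mathbb{E}\!\left[\gamma^{\tau}\,
       \Bigl|\max_{a'}Q_1(s',a')-\max_{a'}Q_2(s',a')\Bigr| \,\middle|\, s,a\right]\\
  &\le \mathbb{E}\!\left[\gamma^{\tau}\,\middle|\, s,a\right]\,\|Q_1-Q_2\|_\infty
   \;\le\; \gamma\,\|Q_1-Q_2\|_\infty .
\end{align*}
```

The last step uses $\gamma^{\tau} \le \gamma$ whenever $\tau \ge 1$. In this idealized case the contraction modulus is $\sup_{s,a}\mathbb{E}[\gamma^{\tau}\mid s,a]$, which can be well below $\gamma$ when options run long.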

The assumptions (A1-A6) are standard individually but their combination is non-trivial. Assumption A6 (SMDP-level Bellman contraction) is the genuinely new condition, and the authors appropriately discuss why it is necessary. The AIS conditions (Assumptions 1-2) are the substantive modeling restrictions, and the paper is transparent about their strength.

However, the theoretical analysis covers only the pre-configured regime where local-action policies are fixed. The adaptable regime (where agents also learn their local policies) is treated only empirically, creating a gap between theory and the most interesting practical scenarios.

3. Potential Impact

Practical relevance: The framework directly addresses the emerging architecture of multi-agent LLM systems spanning trust boundaries. As agent-to-agent protocols (MCP, A2A) standardize, the IC-SMDP provides a principled foundation for what is currently done through heuristics. The minimal communication requirement (one scalar per handoff) is practically appealing for bandwidth-constrained or privacy-sensitive deployments.

Theoretical impact: The lifting of the AIS framework from single-agent primitive-step MDPs to multi-agent SMDPs is a meaningful theoretical contribution that could enable further work in decentralized temporal abstraction. The three-term error decomposition provides actionable design guidance: system architects can independently control representation fidelity (ε_φ, δ_φ), network capacity, and sample budget.

Broader influence: The work connects several previously separate research threads — options/SMDPs, approximate information states, decentralized RL, and LLM orchestration — into a unified framework. This synthesis could catalyze work at these intersections.

4. Timeliness & Relevance

The paper is exceptionally timely. Multi-agent LLM pipelines are proliferating in production (MetaGPT, AutoGen, AgentCoder), and the emergence of standardized agent-to-agent protocols (Anthropic's MCP, Google's A2A) creates immediate demand for principled workflow optimization. The paper fills a clear theoretical vacuum: existing orchestration systems either use hand-designed workflows or require centralized optimization that violates real-world deployment constraints. The gap between theoretical foundations and practical deployment in this area is wide, and this paper makes a substantive contribution toward closing it.

5. Strengths & Limitations

Key Strengths:

Clean problem formulation: The four operating conditions (sequential handoff, decentralized training/execution, interface-limited observation, finite-sample guarantees) precisely delineate the contribution space.

Minimal communication: The single-scalar handoff design is elegant and practically meaningful.

Interpretable bound: The three-term decomposition with independently controllable sources provides genuine design guidance rather than an opaque convergence rate.

Comprehensive experiments: The synthetic validation, which isolates each error term in turn, is methodologically convincing and complements the applied experiments.

Strong empirical results: IC-Q matching centralized oracles on multi-LLM math reasoning is a compelling demonstration.

Notable Limitations:

Theory-practice gap: The formal guarantee covers only the pre-configured regime. The adaptable case (where both routing and local actions are learned) is arguably the more interesting setting and lacks theoretical coverage.

AIS conditions may be hard to verify: Assumptions (B1)-(B2) require bounding reward and transition sufficiency at the option scale, which may be difficult to check a priori in practice, particularly for LLM agents with complex internal dynamics.

Scalability questions: The experiments use relatively small agent populations (4-10 agents for the main results, 100 for routing). Scalability to truly large agent ecosystems remains untested.

LLM experiment scope: The multi-LLM experiments use a single model (GPT-4o-mini) for all agents with fixed prompts. Cross-vendor scenarios with heterogeneous models would more convincingly demonstrate the framework's raison d'être.

The 4-bit AIS observation: While effective, the hand-designed nature of φ_i raises questions about how to design good AIS maps in general; automated AIS discovery is not addressed.

Self-citation density: A notable number of citations are to the authors' own or closely related work, particularly in the positioning of the contribution relative to transfer RL.

Overall Assessment

This is a strong paper that provides a needed theoretical foundation for an increasingly important practical problem. The formalism is clean, the algorithm is minimal and elegant, and the theory is technically substantial. The main limitation — the gap between the pre-configured theory and adaptable practice — is honestly acknowledged. The work should influence both the theoretical study of decentralized temporal-abstraction RL and the practical design of multi-agent LLM pipelines.

Rating: 7.8/10

Significance 8 · Rigor 7.5 · Novelty 8 · Clarity 8.5

Generated May 20, 2026

Comparison History (22)

vs. Generative Recursive Reasoning

gemini-3.1 · 5/20/2026

Paper 1 offers a profound theoretical breakthrough by providing the first finite-sample guarantee for neural Q-learning under decentralized partial observability. By formalizing multi-agent LLM pipelines across trust/vendor boundaries as IC-SMDPs, it addresses a highly relevant, real-world bottleneck in compound AI systems. While Paper 2 tackles the important trend of inference-time scaling, Paper 1's combination of rigorous mathematical bounds, novel extension of the approximate information state framework, and diverse empirical validation (from synthetic to multi-LLM reasoning) suggests a deeper and more foundational scientific impact on multi-agent reinforcement learning.

vs. Seirênes: Adversarial Self-Play with Evolving Distractions for LLM Reasoning

claude-opus-4.6 · 5/20/2026

Paper 1 (Seirênes) has higher estimated impact due to its broader applicability and timeliness. It addresses a critical and widely recognized problem—LLM reasoning fragility—with a novel self-play framework that scales across model sizes and demonstrates practical impact even against top-tier commercial models. The approach is intuitive, immediately actionable, and relevant to the massive community working on LLM reasoning. Paper 2 is theoretically rigorous with novel convergence guarantees for decentralized multi-agent settings, but its impact is narrower, targeting a more specialized audience. Seirênes' combination of strong empirical gains and broad relevance gives it higher potential impact.

vs. Efficient Elicitation of Collective Disagreements

gemini-3.1 · 5/20/2026

Paper 2 addresses a highly timely and impactful problem: decentralized learning in multi-agent LLM pipelines. It introduces a novel formalization (IC-SMDP) and provides the first finite-sample guarantee for neural Q-learning under decentralized partial observability. Its strong methodological rigor, combined with immediate real-world applications in distributed AI systems, gives it a broader and more significant potential scientific impact compared to the narrower focus on voting theory in Paper 1.

vs. Memory-Augmented Reinforcement Learning Agent for CAD Generation

gemini-3.1 · 5/20/2026

Paper 2 provides a foundational theoretical breakthrough with the first finite-sample guarantee for neural Q-learning under decentralized partial observability. Its framework for multi-agent LLM pipelines addresses a highly relevant problem with broad applicability across domains. In contrast, Paper 1 offers an innovative but narrower application specifically targeting CAD generation. The rigorous methodological contribution and broader potential impact across reinforcement learning and multi-agent systems make Paper 2 more scientifically impactful.

vs. How Far Are We From True Auto-Research?

gpt-5.2 · 5/20/2026

Paper 2 has higher impact potential due to a clear methodological contribution (new IC-SMDP formalism and an asynchronous decentralized neural Q-learning algorithm) paired with a novel finite-sample convergence bound under decentralized partial observability—likely broadly reusable across multi-agent RL, distributed systems, and LLM pipeline orchestration. It is timely for cross-org/vendor agent workflows and offers principled design guidance via decomposed error sources, supported by experiments that validate the theory. Paper 1 is timely and useful as an evaluation study of auto-research, but is more diagnostic/benchmarking-oriented with narrower methodological novelty and less generalizable theoretical output.

vs. GroupAffect-4: A Multimodal Dataset of Four-Person Collaborative Interaction

claude-opus-4.6 · 5/20/2026

Paper 1 presents a novel theoretical framework (IC-SMDP) with the first finite-sample guarantee for neural Q-learning under decentralized partial observability, directly addressing a critical challenge in multi-agent LLM pipelines. Its contributions span reinforcement learning theory, multi-agent systems, and practical LLM orchestration—a highly timely topic. The rigorous mathematical foundations combined with practical demonstrations across diverse domains suggest broad and lasting impact. Paper 2, while a valuable multimodal dataset contribution, serves a narrower community (affective computing) and represents incremental progress in data collection rather than fundamental methodological innovation.

vs. Causal Evidence for Attention Head Imbalance in Modality Conflict Hallucination

gpt-5.2 · 5/20/2026

Paper 1 has higher potential impact due to a more fundamental, broadly applicable theoretical contribution: a new IC-SMDP formalization of decentralized handoff-based workflows and the first finite-sample guarantee for neural Q-learning under decentralized partial observability, with a decomposable error bound and methodological novelty (AIS lifted to multi-agent SMDPs). This could influence multi-agent RL, distributed learning, and multi-LLM pipeline design across trust boundaries. Paper 2 is timely and useful (mechanistic interpretability + intervention for MLLM hallucinations), but is narrower in scope and more benchmark/architecture-dependent.

vs. Trustworthy Agent Network: Trust in Agent Networks Must Be Baked In, Not Bolted On

claude-opus-4.6 · 5/20/2026

Paper 1 presents a novel theoretical framework (IC-SMDP) with rigorous mathematical guarantees (finite-sample bounds for neural Q-learning under decentralized partial observability), which is a first-of-its-kind result. It combines theoretical novelty with empirical validation across multiple domains including LLM pipelines. Paper 2 is a vision/position paper proposing a conceptual framework for trustworthy agent networks without formal guarantees or empirical results. While both address multi-agent LLM coordination, Paper 1's methodological rigor, provable convergence guarantees, and experimental validation give it substantially higher scientific impact potential.

vs. AQuaUI: Visual Token Reduction for GUI Agents with Adaptive Quadtrees

gemini-3.1 · 5/20/2026

Paper 1 offers a fundamental theoretical breakthrough by providing the first finite-sample guarantees for neural Q-learning under decentralized partial observability. Its mathematically rigorous approach addresses the highly relevant problem of multi-agent LLM coordination across trust boundaries. In contrast, Paper 2 presents a practical but narrower engineering optimization for token reduction in GUI agents. Paper 1's combination of strong methodological rigor, broad applicability to multi-agent workflows, and significant algorithmic innovation gives it higher potential for widespread scientific impact.

vs. When Skills Don't Help: A Negative Result on Procedural Knowledge for Tool-Grounded Agents in Offensive Cybersecurity

gpt-5.2 · 5/20/2026

Paper 1 offers higher impact: it introduces a new formalism (IC-SMDP) for decentralized, interface-constrained multi-agent workflows and provides a first-of-its-kind finite-sample convergence bound for neural Q-learning under decentralized partial observability, with a clean error decomposition and supporting experiments. This is methodologically rigorous, broadly relevant to multi-agent RL and multi-LLM pipelines across trust boundaries, and timely for real-world orchestrated AI systems. Paper 2 is valuable and timely but is primarily a reanalysis/negative result with narrower scope and less methodological novelty.

vs. MOCHA: Multi-Objective Chebyshev Annealing for Agent Skill Optimization

gemini-3.1 · 5/20/2026

Paper 2 offers a significant theoretical breakthrough by providing the first finite-sample guarantee for neural Q-learning under decentralized partial observability. By formalizing a novel framework (IC-SMDP) and bridging deep reinforcement learning theory with modern multi-agent LLM pipelines, it demonstrates exceptional methodological rigor. While Paper 1 presents a practical and effective algorithm for prompt optimization, Paper 2's foundational theoretical contributions are likely to have a broader, more enduring impact across reinforcement learning, multi-agent systems, and AI safety/coordination.

vs. A Foundation Model for Zero-Shot Logical Rule Induction

claude-opus-4.6 · 5/20/2026

Paper 1 introduces NRI, a foundation model for zero-shot logical rule induction, which represents a paradigm shift in ILP by enabling transfer across tasks without retraining. The concept of foundation models for symbolic reasoning is highly novel and timely, bridging neural and symbolic AI. While Paper 2 makes strong theoretical contributions (first finite-sample guarantee for neural Q-learning under decentralized partial observability), its impact is narrower, targeting a specific multi-agent workflow setting. Paper 1's broader applicability, connection to the foundation model paradigm, and potential to transform symbolic reasoning give it higher impact potential.

vs. PEEK: Context Map as an Orientation Cache for Long-Context LLM Agents

claude-opus-4.6 · 5/20/2026

Paper 2 makes a stronger theoretical contribution by formalizing interface-constrained semi-MDPs and providing the first finite-sample guarantee for neural Q-learning under decentralized partial observability. This foundational result—lifting AIS to multi-agent SMDPs with provable convergence bounds—has broader impact across multi-agent systems, decentralized learning, and LLM pipelines spanning trust boundaries. While Paper 1 presents a practical and effective caching system (PEEK) with solid empirical gains, its contribution is more engineering-focused. Paper 2's novel theoretical framework with clean decomposable error bounds opens new research directions in decentralized multi-agent learning.

vs. Position: The Turing-Completeness of Real-World Autoregressive Transformers Relies Heavily on Context Management

gemini-3.1 · 5/20/2026

Paper 2 offers a novel theoretical framework and provably convergent algorithm for multi-agent LLM workflows, a highly active and practically relevant area. It provides the first finite-sample guarantee for neural Q-learning under decentralized partial observability, backed by rigorous math and diverse empirical results. Paper 1, while providing valuable conceptual clarifications regarding Transformer Turing-completeness, is a position paper whose impact is primarily theoretical, lacking the broad algorithmic applicability and empirical validation of Paper 2.

vs. Embedding by Elicitation: Dynamic Representations for Bayesian Optimization of System Prompts

gpt-5.2 · 5/20/2026

Paper 1 has higher potential impact due to a stronger theoretical contribution: a first finite-sample guarantee for neural Q-learning under decentralized partial observability in an interface-constrained multi-agent SMDP, plus new AIS extensions and Markovian-noise control under random durations. This is methodologically rigorous and broadly relevant to multi-agent RL, decentralized control, and cross-boundary LLM-agent pipelines. Paper 2 is timely and practical for prompt optimization, but is more application/engineering-focused with weaker general theoretical novelty and potentially narrower scientific spillover.

vs. Distribution-Free Uncertainty Quantification for Continuous AI Agent Evaluation

gemini-3.1 · 5/20/2026

Paper 2 presents a fundamental theoretical breakthrough in multi-agent reinforcement learning, providing the first finite-sample guarantee for neural Q-learning under decentralized partial observability. Its formalization of interface-constrained multi-agent LLM pipelines addresses a highly timely problem with rigorous mathematical foundations and broad empirical validation. While Paper 1 offers valuable practical applications of conformal prediction, Paper 2's deep methodological innovation and potential to shape the foundational design of decentralized multi-agent systems give it a higher potential for lasting scientific impact.

vs. Swimming with Whales: Analysis of Power Imbalances in Stake-Weighted Governance

gpt-5.2 · 5/20/2026

Paper 2 likely has higher impact: it introduces a new formalism (IC-SMDP) matching an emerging, high-demand regime (decentralized multi-agent/LLM workflows across trust boundaries) and provides rare finite-sample convergence guarantees for neural Q-learning under decentralized partial observability, extending AIS to multi-agent SMDPs. The methodological contribution is substantial (new proof techniques, error decomposition) and broadly relevant to RL theory, multi-agent systems, and practical LLM orchestration. Paper 1 is timely and useful for PoS governance analysis, but is narrower in scope and more incremental relative to existing social-choice power-index work.

vs. Safety Geometry Collapse in Multimodal LLMs and Adaptive Drift Correction

gemini-3.1 · 5/20/2026

Paper 2 addresses the critical and highly timely issue of safety in Multimodal LLMs. By providing a geometric understanding of the multimodal safety gap and introducing a training-free, inference-time correction method (ReGap), it offers immediate, practical real-world applications for deploying safe AI systems. While Paper 1 provides strong methodological rigor and novel theoretical bounds for multi-agent RL, Paper 2's focus on AI safety aligns with one of the most pressing challenges in the broader AI community today, likely leading to wider and faster scientific impact.

vs. ADR: An Agentic Detection System for Enterprise Agentic AI Security

claude-opus-4.6 · 5/20/2026

Paper 1 makes fundamental theoretical contributions—formalizing interface-constrained SMDPs, providing the first finite-sample guarantee for neural Q-learning under decentralized partial observability, and lifting the AIS framework to multi-agent SMDPs. These are novel, rigorous results with broad applicability beyond LLM pipelines to any multi-agent sequential decision-making setting. Paper 2 is a strong engineering/systems contribution with real-world deployment at Uber, but it is more narrowly focused on enterprise AI security and introduces less foundational methodology. Paper 1's theoretical novelty and breadth of impact across reinforcement learning, multi-agent systems, and workflow optimization give it higher long-term scientific impact.

vs. Neurosymbolic Learning for Inference-Time Argumentation

gemini-3.1 · 5/20/2026

Paper 2 offers foundational theoretical contributions, providing the first finite-sample guarantee for neural Q-learning under decentralized partial observability. Its formalization of multi-agent LLM pipelines as IC-SMDPs addresses a highly timely problem with rigorous methodology and broad applicability across reinforcement learning and AI systems. Paper 1 is valuable for applied fact-checking, but Paper 2's combination of theoretical rigor and diverse empirical validation suggests a broader and deeper scientific impact.