Hera: Learning Long-Horizon Coordination for Device-Cloud Collaborative LLM Agents
Yuxin Zhang, Mengxue Hu, Zheng Lin, Xiaoyi Fan, Fan Xie, Zihan Fang, Jing Yang, Wenjun Zhu
Abstract
Large language model (LLM) agents excel at solving complex long-horizon tasks through autonomous interaction with environments. However, their real-world deployment faces a fundamental device-cloud dilemma: on-device models are efficient but often brittle, while cloud models are stronger but costly in computation. State-of-the-art LLM device-cloud routers usually make coarse task-level decisions, which cannot adapt to the changing difficulty of multi-step agent interactions. To address this issue, we present Hera, a step-level device-cloud LLM agent coordinator for long-horizon tasks achieving a strong performance-cost Pareto frontier. Hera adopts a novel two-stage training paradigm: (1) imitation learning for cold-start, followed by (2) reinforcement learning that jointly optimizes task success and cloud usage efficiency. The first stage casts step-level routing as a supervised classification problem: the device agent is replayed on cloud trajectories, with each state labeled by the agreement between device and cloud actions. In the second stage, we perform cost-aware reinforcement learning by grouping identical states across trajectories and updating Hera with labels favoring higher expected return and fewer future cloud calls. We evaluate Hera on ALFWorld, WebShop, and AppWorld, where it consistently outperforms prior methods, achieving 92.5% of the cloud-only success rate with cloud use in only 46.3% of steps.
AI Impact Assessments
Scientific Impact Assessment: Hera – Learning Long-Horizon Coordination for Device-Cloud Collaborative LLM Agents
1. Core Contribution
Hera addresses the device-cloud routing problem for LLM agents operating in long-horizon, multi-step environments. The key insight is that existing device-cloud routers make coarse task-level or request-level decisions, which are suboptimal for sequential decision-making settings where step difficulty varies dynamically. Hera performs step-level routing — deciding at each environment interaction step whether to invoke the cheaper on-device model or the expensive cloud model.
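To make the difference from task-level routing concrete, the following is a minimal sketch of a step-level routing loop. It is an illustration only, not the authors' implementation: `run_episode`, `router`, `device_agent`, `cloud_agent`, and `env_step` are hypothetical names, and states and actions are simplified to strings.

```python
from typing import Callable, List, Tuple

# Hypothetical simplification: states and actions are plain strings.
State = str
Action = str


def run_episode(
    initial_state: State,
    router: Callable[[State], bool],          # True -> route this step to the cloud
    device_agent: Callable[[State], Action],  # cheap on-device model
    cloud_agent: Callable[[State], Action],   # stronger, more expensive cloud model
    env_step: Callable[[State, Action], Tuple[State, bool]],  # -> (next_state, done)
    max_steps: int = 50,
) -> Tuple[List[Action], int]:
    """Roll out one episode, making a fresh routing decision at every step."""
    state = initial_state
    actions: List[Action] = []
    cloud_calls = 0
    for _ in range(max_steps):
        # The router is consulted per step, not once per task.
        if router(state):
            cloud_calls += 1
            action = cloud_agent(state)
        else:
            action = device_agent(state)
        actions.append(action)
        state, done = env_step(state, action)
        if done:
            break
    return actions, cloud_calls
```

A task-level router would correspond to a `router` that returns the same value for every state in an episode; the step-level setting lets that value change as the episode unfolds.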
The system uses a two-stage training paradigm: (1) imitation learning (IL) that bootstraps routing by replaying the device model on cloud-generated trajectories and labeling steps by action agreement, and (2) cost-aware reinforcement learning (RL) that groups identical states across multiple rollouts and derives preference labels favoring higher return with fewer future cloud calls. This is a clean formulation that converts the RL problem into iterated supervised classification with carefully constructed labels, avoiding the instabilities of direct policy gradient methods.
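The two labeling steps described above can be sketched roughly as follows. This is a hedged illustration under stated assumptions, not the paper's algorithm: the function names, the record layout, the linear score `return - cost_weight * future_cloud_calls`, the `cost_weight` value, and the tie-break toward the device are all hypothetical choices made here to show how labels might be derived.

```python
from collections import defaultdict
from typing import Callable, Dict, Hashable, List, Tuple

DEVICE, CLOUD = 0, 1  # hypothetical class ids for the routing classifier


def agreement_labels(
    cloud_trajectory: List[Tuple[Hashable, str]],  # (state, cloud_action) pairs
    device_agent: Callable[[Hashable], str],
) -> List[Tuple[Hashable, int]]:
    """Stage 1 sketch: replay the device agent on cloud-visited states.

    A state is labeled DEVICE when the device reproduces the cloud action,
    and CLOUD when the two actions differ.
    """
    return [
        (state, DEVICE if device_agent(state) == cloud_action else CLOUD)
        for state, cloud_action in cloud_trajectory
    ]


def grouped_preference_labels(
    records: List[Tuple[Hashable, int, float, int]],
    cost_weight: float = 0.05,  # assumed trade-off weight, not from the paper
) -> Dict[Hashable, int]:
    """Stage 2 sketch: group identical states across rollouts.

    Each record is (state, routing_choice, episode_return, future_cloud_calls).
    For every state observed under both choices, prefer the choice with the
    higher mean cost-adjusted score; ties go to the device.
    """
    groups = defaultdict(lambda: {DEVICE: [], CLOUD: []})
    for state, choice, ret, future_calls in records:
        groups[state][choice].append(ret - cost_weight * future_calls)

    labels: Dict[Hashable, int] = {}
    for state, by_choice in groups.items():
        if not by_choice[DEVICE] or not by_choice[CLOUD]:
            continue  # no comparison possible without both choices
        mean_device = sum(by_choice[DEVICE]) / len(by_choice[DEVICE])
        mean_cloud = sum(by_choice[CLOUD]) / len(by_choice[CLOUD])
        labels[state] = CLOUD if mean_cloud > mean_device else DEVICE
    return labels
```

In both functions the output is an ordinary (state, label) dataset, which is what allows the router to be updated with supervised classification rather than a policy-gradient step.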
The motivating analysis in §4 is compelling: on tasks where the device fails but the cloud succeeds, ~35-39% of steps produce identical outputs, suggesting that an ideal router needs the cloud for fewer than 25% of steps. This establishes a clear opportunity for step-level coordination.
2. Methodological Rigor
Strengths of the approach:
Concerns:
3. Potential Impact
Practical relevance is high. As LLM agents move toward deployment in latency-sensitive applications (robotics, mobile assistants, GUI navigation), the device-cloud trade-off is a genuine engineering bottleneck. Achieving 92.5% of cloud-only performance while using the cloud in only 46.3% of steps represents a meaningful cost reduction.
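How much cost the 46.3% cloud-step fraction actually saves depends on how expensive a device step is relative to a cloud step, which this assessment does not state. The sketch below makes that dependence explicit under simplifying assumptions introduced here: uniform per-step costs, the same number of steps as a cloud-only run, and no cost for the router itself. The function name and the example cost ratio are hypothetical.

```python
def relative_cost(cloud_step_fraction: float, device_to_cloud_cost_ratio: float) -> float:
    """Expected per-step cost as a fraction of the cloud-only per-step cost.

    cloud_step_fraction: share of steps routed to the cloud (0.463 in the abstract).
    device_to_cloud_cost_ratio: assumed cost of one device step divided by the
        cost of one cloud step (not reported in the source).
    """
    if not 0.0 <= cloud_step_fraction <= 1.0:
        raise ValueError("cloud_step_fraction must lie in [0, 1]")
    if device_to_cloud_cost_ratio < 0.0:
        raise ValueError("device_to_cloud_cost_ratio must be non-negative")
    device_step_fraction = 1.0 - cloud_step_fraction
    return cloud_step_fraction + device_step_fraction * device_to_cloud_cost_ratio


# If device steps were free, cost would equal the cloud-step fraction.
free_device = relative_cost(0.463, 0.0)
# With an assumed 10:1 cloud-to-device cost ratio, cost is somewhat higher.
cheap_device = relative_cost(0.463, 0.1)
```

Under these assumptions the relative cost is bounded below by the cloud-step fraction itself, so 46.3% is a best case for the saving rather than a guaranteed figure.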
Broader applicability: The framework could extend to any multi-step decision-making system with heterogeneous compute resources — not just LLM agents but potentially hierarchical planning in robotics, adaptive computation in streaming systems, or mixed-precision inference.
Limitations on impact: The benchmarks (ALFWorld, WebShop, AppWorld), while standard, are relatively constrained environments. The approach's effectiveness in truly open-ended real-world settings with continuous state spaces, partial observability, or non-deterministic environments remains undemonstrated. The synchronous, always-connected assumption is also a significant practical limitation.
4. Timeliness & Relevance
This work is highly timely. The explosion of LLM agent research (ReAct, tool-use, web agents) combined with growing concerns about inference cost and latency makes device-cloud coordination an emerging need. The paper correctly identifies that existing routing methods (RouteLLM, FrugalGPT, Hybrid LLM) are designed for single-turn queries and don't account for trajectory-level dependencies. This gap is real and becoming more pressing as agents tackle longer-horizon tasks.
The paper also arrives at an interesting moment in the "agentic RL" trend, where RL is increasingly used to train LLM agents — Hera applies RL not to the agents themselves but to the coordination layer, which is a complementary and potentially orthogonal contribution.
5. Strengths & Limitations
Key Strengths:
Notable Weaknesses:
Reproducibility: The paper provides detailed hyperparameters, pseudocode (Algorithm 1), and prompt templates. However, reliance on the Qwen-Max API (a closed-source model) limits full reproducibility.
Summary
Hera makes a solid contribution by identifying and addressing a genuine gap: step-level device-cloud coordination for multi-step LLM agents. The two-stage training pipeline is well-designed, the experiments are comprehensive, and the results demonstrate clear improvements over existing routing methods. The work is timely and practically relevant. However, the novelty is primarily in the application and system design rather than fundamental algorithmic advances, and scalability to more complex environments remains an open question.
Generated May 26, 2026
Comparison History (21)
Paper 1 introduces a foundational algorithmic innovation in LLM test-time scaling, a highly critical and active area of research. By modifying positional encodings (RoPE) to allow inter-sequence collaboration during parallel generation, it addresses fundamental inefficiencies in current reasoning pipelines. While Paper 2 offers a strong, practical systems-level solution for device-cloud routing, Paper 1's approach has the potential to broadly influence core LLM inference architectures, scaling laws, and reasoning capabilities across the entire field.
Paper 1 (Hera) addresses a highly practical and timely problem—efficient device-cloud collaboration for LLM agents—with a rigorous two-stage training paradigm combining imitation and reinforcement learning. It demonstrates strong empirical results across three diverse benchmarks, achieving 92.5% of cloud performance at 46.3% cloud usage. The approach has broad real-world applicability as LLM deployment scales. Paper 2 (CSMR) proposes an interesting cognitive scheduling mechanism for multimodal reasoning, but the problem scope is narrower and the zero-shot evaluation setting, while notable, limits demonstrated impact. Hera's cost-efficiency contributions are more immediately impactful for the field.
Hera addresses the highly timely and practical device-cloud dilemma for LLM agents. By optimizing the performance-cost Pareto frontier, it presents immediate real-world applications for deploying efficient, autonomous agents on edge devices. While Paper 2 offers strong theoretical contributions to multi-agent reasoning, Paper 1's approach significantly lowers the barrier for practical, wide-scale LLM agent deployment, promising broader immediate impact across industry and applied research.
Paper 1 tackles a fundamental and highly ambitious goal in AI research: autonomous self-improvement. By successfully unifying previously disjoint approaches (harness updates and weight updates) and demonstrating significant performance gains across highly diverse and complex domains (law, systems optimization, and biology), it offers a broader conceptual leap. Paper 2, while offering a practical and rigorous solution for deployment efficiency (device-cloud routing), represents a more incremental optimization rather than a fundamental paradigm shift in AI capabilities.
Paper 2 (Hera) likely has higher impact due to strong real-world applicability and timeliness: step-level device–cloud routing directly targets deployment cost/latency constraints for long-horizon LLM agents and shows sizable practical gains on established embodied/web agent benchmarks. Its two-stage IL→cost-aware RL framework is methodologically substantial and broadly relevant to systems + RL + agent communities. Paper 1 is theoretically novel and rigorous for agentic uncertainty evaluation, but its impact may be narrower (metrics/assessment) and more indirect on deployments than a coordination method that materially changes performance–cost tradeoffs.
Paper 2 addresses a fundamental bottleneck in LLMs: the compute and memory costs of dense attention in long contexts. By demonstrating that extreme context sparsity is theoretically sound, empirically robust across 20 models without retraining, and capable of up to 10x hardware speedups, it has massive implications for LLM inference, training, and architecture. While Paper 1 offers an excellent, practical system for device-cloud agent routing, Paper 2 challenges core assumptions about attention mechanisms. Its findings have a broader, potentially transformative impact across the entire field of generative AI and serving infrastructure.
Paper 2 identifies a fundamental and broadly applicable phenomenon (premature confidence) in LLM reasoning, proposes a label-free RL solution that improves accuracy across diverse tasks and scales, and addresses faithfulness/safety—all highly relevant concerns. Its insights generalize across model sizes and task types, suggesting broad impact on the reasoning and alignment communities. Paper 1, while practically useful for device-cloud coordination, addresses a narrower systems-level optimization problem with more limited cross-field applicability.
Paper 1 likely has higher impact: it introduces a step-level device–cloud coordination framework for long-horizon LLM agents with a two-stage IL→cost-aware RL training scheme, addressing a major deployment bottleneck (latency/cost vs capability) with broad applicability to embodied/web/app agents. The methodological contribution (state grouping, joint success+cost optimization) and multi-environment evaluation suggest stronger novelty and generality. Paper 2 is practical and timely but is a narrower, parameter-free prompting heuristic focused on MCQA abstention, with more limited cross-domain reach and algorithmic depth.
Paper 2 has higher likely scientific impact due to its strong real-world applicability (telemedicine), timeliness, and broad cross-field relevance (multimodal ML, HCI, clinical decision support, evaluation science). It introduces a novel continuous audio-visual co-clinician setting with a dual-agent design and a comparatively rigorous human-subjects, randomized crossover evaluation plus new TelePACES criteria—raising the bar for medical conversational AI benchmarks. Paper 1 is technically solid and useful for cost-performance routing, but its impact is more incremental and narrower to deployment optimization of LLM agents.
Paper 1 (Hera) addresses a highly practical and timely problem in LLM deployment—efficient device-cloud coordination for agentic tasks—with a rigorous two-stage training methodology (imitation learning + reinforcement learning) and strong empirical results across multiple benchmarks. The device-cloud efficiency tradeoff is a critical bottleneck for real-world LLM agent deployment, giving it broad applicability. Paper 2 (ATWL) introduces a formal workflow language for visual analytics, which is a useful but narrower contribution targeting a smaller community. While well-executed, its impact is more incremental and domain-specific compared to Paper 1's broader relevance to the rapidly growing LLM agents field.
Paper 2 addresses a critical and highly timely challenge in deploying LLM agents: balancing edge efficiency with cloud performance. Its novel step-level reinforcement learning routing mechanism has broad applicability across edge AI and agent research, supported by strong empirical results on standard benchmarks. Conversely, Paper 1 presents an interactive tool for ontology engineering, which, while useful, targets a more niche audience and lacks the broad, transformative potential and immediate real-world applicability of optimizing collaborative LLM agents.
GlobalDentBench has higher potential impact due to its broader scope (multinational, 88 countries, 14 specialties), direct clinical safety implications (31% unsafe rate finding), and its role as the first comprehensive dental LLM benchmark. It addresses critical patient safety concerns and provides a scalable evaluation framework for healthcare AI deployment. Paper 2, while technically sound in optimizing device-cloud coordination, addresses a more incremental engineering optimization problem with narrower applicability. The safety findings in Paper 1 have urgent real-world consequences and policy implications for AI in healthcare.
Paper 2 (Hera) likely has higher impact due to strong real-world applicability (deployable device–cloud coordination with explicit cost/performance tradeoffs), broad relevance across agent systems, edge AI, and systems/ML, and timeliness as step-level routing is a practical bottleneck. Its two-stage IL+RL paradigm and state-grouped cost-aware updates are methodologically substantial and evaluated on multiple standard long-horizon benchmarks with clear Pareto gains. Paper 1 is novel and theoretically grounded for RL credit assignment in LLM reasoning, but its applications are narrower to post-training and may translate less directly to deployment constraints.
Paper 1 offers higher potential scientific impact by resolving a critical bottleneck in real-world LLM agent deployment: the device-cloud compute and cost trade-off. By introducing a rigorous step-level routing methodology via imitation and reinforcement learning, Hera advances AI deployment scalability. While Paper 2 presents a thoughtful multi-agent framework for qualitative data analysis, its impact is largely confined to research methodologies in social sciences and HCI. Paper 1's generalizable architecture for cost-efficient, edge-cloud collaborative agentic systems is exceptionally timely and has immense implications for the broader AI, systems engineering, and consumer tech communities.
Paper 1 (Hera) addresses a broadly relevant and practical problem—efficient device-cloud coordination for LLM agents—with a rigorous two-stage training paradigm (imitation learning + reinforcement learning) evaluated across three diverse benchmarks. It achieves strong results (92.5% of cloud performance at 46.3% cloud usage), offering immediate practical value for deploying LLM agents at scale. Paper 2 identifies an interesting failure mode (library drift) but addresses a narrower problem within self-evolving skill libraries, evaluated on a single coding benchmark. Hera's broader applicability, methodological depth, and multi-domain evaluation give it higher potential impact.
Paper 2 addresses a highly timely and practically critical bottleneck in modern AI: the cost-performance trade-off of deploying LLM agents. By introducing a step-level device-cloud coordinator, it offers immediate, real-world utility for scaling autonomous agents across devices. While Paper 1 introduces a rigorous, novel theoretical framework for runtime fairness, Paper 2's direct applicability to the rapidly expanding ecosystem of LLMs and edge computing gives it a broader and more immediate potential impact across both academia and industry.
Paper 2 addresses a highly timely and critical bottleneck in deploying LLM agents: balancing computational cost and performance. Its step-level device-cloud coordination framework has massive real-world applicability across mobile computing and AI systems. While Paper 1 offers a rigorous and novel automated approach to combinatorial counting, Paper 2's direct impact on scalable LLM deployment gives it a significantly broader and more immediate scientific and industrial impact.
Paper 2 (Hera) likely has higher scientific impact due to a more novel algorithmic contribution (step-level device–cloud coordination with a two-stage IL→cost-aware RL training scheme), clearer methodological rigor and generalizable evaluation across multiple established long-horizon benchmarks, and strong real-world applicability to practical deployment constraints (latency/cost/privacy). Paper 1 provides an important, timely benchmark and data pipeline for always-on assistants, but benchmark papers can have narrower methodological novelty and their impact depends heavily on adoption; Hera’s approach is more broadly transferable across agent systems and edge/cloud settings.
Paper 1 (Hera) addresses a highly practical and timely problem—efficient device-cloud coordination for LLM agents—with a novel two-stage training paradigm combining imitation and reinforcement learning. It demonstrates strong empirical results across multiple benchmarks, achieving near cloud-level performance at roughly half the cost. This has broad real-world applicability as LLM deployment scales. Paper 2 addresses multimodal knowledge editing generalization, which is more niche. While technically sound, its impact is narrower. Hera's contribution to efficient LLM deployment is more broadly relevant and timely given the rapid growth of agentic AI systems.
Paper 2 likely has higher impact due to broader relevance and timeliness: extracting transferable safety principles from crowd preferences addresses a central, cross-domain problem in RL/LLM alignment and safe autonomy. The hierarchical skill framework targets real-world deployment constraints where explicit safety rewards are unavailable, enabling applications in robotics, decision-making, and aligned LLM agents. Paper 1 is strong and practical but more specialized to device–cloud routing for LLM agents, with narrower conceptual breadth. Both are rigorous, but safety alignment advances typically propagate more widely across fields and stakeholders.