Ahasan Kabir, Jiaqi Xue, Mengxin Zheng, Qian Lou
Existing multi-agent LLM orchestration methods, ranging from brute-force ensembles to learned routers, select models and topologies based on task and model features. However, these methods do not consider the runtime state of the serving infrastructure. On shared GPU clusters under concurrent load, this infrastructure blindness causes systematic resource underutilization: preferred models accumulate deep request queues while equally capable alternatives sit idle. In multi-agent pipelines, where each query triggers multiple sequential model calls, these delays then compound across every downstream step. Closing this gap is challenging because the relevant infrastructure signals (queue depths, KV-cache pressure, latencies) are dynamic and noisy, and they must drive three different decisions: planning, per-step routing, and scheduling. We introduce INFRAMIND, a framework that makes the entire multi-agent stack infrastructure-aware. An infra-aware planner conditions topology and role selection on real-time system load and remaining budget, biasing toward simpler graphs under congestion and richer ones at low load. An infra-aware executor then observes per-model queue depths, cache utilization, and response latencies at each agent step to decide which model to call and how deeply to reason; a budget-aware scheduler further reorders each model's queue so that urgent requests are served first. Cast as a hierarchical constrained MDP and solved end-to-end via reinforcement learning, the system learns to balance quality against latency automatically. Across five benchmarks, INFRAMIND delivers up to +7.6 pp accuracy over the prior baseline at low load with up to 7x lower latency, and sustains up to 99.9% SLO compliance under high load where every baseline drops below 50%.
INFRAMIND addresses a genuine and increasingly important problem: multi-agent LLM orchestration systems make routing decisions based solely on task features, ignoring the runtime state of the serving infrastructure. The paper terms this "infrastructure blindness" and demonstrates convincingly through profiling experiments that it causes load imbalance, avoidable latency, and resource underutilization on shared GPU clusters.
The system introduces three coordinated components: (1) an infrastructure-aware planner that adjusts topology complexity based on system congestion via FiLM modulation, (2) an infrastructure-aware executor that jointly selects models and reasoning depth (Flash/Concise/DeepThink) based on real-time queue depths, KV-cache utilization, and latencies, and (3) an EDF scheduler that reorders within-model queues by deadline urgency. These are unified under a hierarchical constrained MDP trained end-to-end with reinforcement learning, with a single Lagrange multiplier automatically mediating the quality-latency tradeoff.
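As a rough illustration of the third component, earliest-deadline-first (EDF) ordering within one model's queue can be sketched with a binary heap. This is a minimal sketch, not the paper's implementation: the names `PendingRequest` and `EDFQueue`, and the use of an absolute deadline in seconds, are assumptions made for the example.

```python
import heapq
import itertools
from dataclasses import dataclass, field


@dataclass(order=True)
class PendingRequest:
    deadline: float                         # absolute deadline in seconds (assumed unit)
    seq: int                                # tie-breaker: earlier arrivals are served first
    request_id: str = field(compare=False)  # excluded from ordering


class EDFQueue:
    """Per-model queue that always serves the request with the earliest deadline."""

    def __init__(self) -> None:
        self._heap = []
        self._counter = itertools.count()

    def push(self, request_id: str, deadline: float) -> None:
        heapq.heappush(
            self._heap, PendingRequest(deadline, next(self._counter), request_id)
        )

    def pop(self) -> str:
        return heapq.heappop(self._heap).request_id

    def __len__(self) -> int:
        return len(self._heap)


queue = EDFQueue()
queue.push("relaxed", deadline=30.0)
queue.push("urgent", deadline=5.0)
queue.push("medium", deadline=12.0)

# Requests leave the queue in deadline order, regardless of arrival order.
served = [queue.pop() for _ in range(len(queue))]
```

Here `served` lists the urgent request first even though it arrived second, which is the reordering behavior the review attributes to the scheduler.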
The key insight, that infrastructure signals should inform orchestration decisions and not just serving, is intuitive yet surprisingly unexplored. The paper correctly identifies that prior work on LLM routing (RouteLLM, TREACLE, R2-Router) handles single-turn calls and that serving systems (vLLM, Sarathi-Serve) optimize within a single model without cross-model routing.
Formalization. The hierarchical CMDP formulation is clean and appropriate. The separation between planner (per-query, coarse infrastructure summary) and executor (per-step, full state vector) is well-motivated by the different decision timescales. The Lagrangian dual approach to constraint enforcement is standard but applied sensibly.
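For readers unfamiliar with Lagrangian constraint enforcement, the generic mechanism can be sketched as follows. This is a textbook-style projected dual-ascent update under assumed names (`lagrangian_reward`, `update_multiplier`), an assumed latency budget, and an assumed learning rate; the paper's exact reward and update rule are not reproduced here.

```python
def lagrangian_reward(quality: float, latency: float, budget: float, lam: float) -> float:
    """Quality reward penalized by the budget overshoot, scaled by the multiplier."""
    return quality - lam * (latency - budget)


def update_multiplier(lam: float, mean_latency: float, budget: float, lr: float = 0.05) -> float:
    """Projected dual ascent: raise lam on violation, lower it otherwise, never below 0."""
    return max(0.0, lam + lr * (mean_latency - budget))


lam = 0.0
budget = 10.0  # hypothetical latency budget in seconds

# Constraint violated (latency above budget): the multiplier grows.
for observed_latency in [14.0, 13.0, 12.0]:
    lam = update_multiplier(lam, observed_latency, budget)
lam_after_violation = lam

# Constraint satisfied (latency below budget): the multiplier shrinks again.
for observed_latency in [6.0, 6.0]:
    lam = update_multiplier(lam, observed_latency, budget)
```

A single scalar of this kind is what lets the quality-latency tradeoff adjust itself: sustained violations make latency more expensive in the reward, while slack makes it cheaper.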
Training procedure. The joint training of planner (REINFORCE) and executor (PPO) with a shared constraint multiplier is sound. Training across budget tiers and arrival rates with inter-sweep queue draining ensures exposure to diverse congestion regimes. The ~471K parameter count for the policy network is modest, suggesting practical trainability.
Experimental concerns. The evaluation is conducted on a real GPU cluster (two NVIDIA B200s) under realistic Poisson arrival patterns at three load levels, which is considerably more realistic than single-query evaluations. However, several concerns arise:
Ablations. Table 3 provides useful ablations showing each component contributes distinct value, though the ablation could be more comprehensive (e.g., varying the number of infrastructure signals, different RL algorithms, sensitivity to polling frequency).
Immediate practical value. The paper targets a real deployment scenario that is becoming increasingly common: shared GPU clusters serving multiple LLMs for multi-agent workflows. The cited examples (JPMorgan, Bloomberg, Uber) and Gartner projections contextualize the growing need. The extension to blackbox/hybrid pools using client-side proxies (EMA latency, RPM-based congestion signals) makes the approach applicable beyond self-hosted deployments.
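The client-side EMA latency proxy mentioned above can be pictured as a simple exponential moving average over observed response times. The sketch below is illustrative only: the class name `EmaLatencyTracker` and the smoothing factor `alpha` are assumptions, and the paper may compute its proxy differently.

```python
class EmaLatencyTracker:
    """Client-side latency estimate for an endpoint whose internals are not visible."""

    def __init__(self, alpha: float = 0.2) -> None:
        if not 0.0 < alpha <= 1.0:
            raise ValueError("alpha must be in (0, 1]")
        self.alpha = alpha
        self.value = None  # no estimate until the first observation

    def observe(self, latency_s: float) -> float:
        """Fold one measured response latency (seconds) into the running estimate."""
        if self.value is None:
            self.value = latency_s
        else:
            self.value = self.alpha * latency_s + (1.0 - self.alpha) * self.value
        return self.value


tracker = EmaLatencyTracker(alpha=0.5)
for latency in [2.0, 4.0, 8.0]:
    estimate = tracker.observe(latency)
```

A larger `alpha` reacts faster to congestion spikes but is noisier, which is the usual tradeoff when only client-observed signals are available.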
Systems-ML intersection. INFRAMIND sits at the intersection of ML systems and learned orchestration, a space with growing interest. The principle of making ML decisions infrastructure-aware could extend to other compound AI systems beyond multi-agent LLMs.
Limitations on generality. The framework assumes a fixed model pool and commits to topology at planning time. The authors acknowledge these limitations, which nonetheless constrain near-term applicability in elastic cloud environments. The reliance on vLLM's specific metrics endpoint also ties the approach to the details of one serving stack.
This paper is highly timely. Multi-agent LLM systems are proliferating, and the transition from single-model serving to multi-agent orchestration on shared infrastructure is an active industry trend. The gap between task-level routing intelligence and infrastructure-level serving optimization is real and growing. The paper addresses a bottleneck that will become more severe as multi-agent deployments scale.
INFRAMIND makes a solid contribution by identifying and addressing a genuine gap in multi-agent LLM orchestration. The infrastructure blindness problem is well-motivated, the hierarchical RL solution is technically sound, and the empirical results, particularly the SLO compliance improvements under load, are impressive and practically relevant. However, the experimental scope is somewhat narrow (five models, structured benchmarks), and the absence of simpler baseline comparisons (infrastructure-aware heuristics without RL) leaves open the question of how much value the learned policy adds over straightforward engineering solutions.
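One form such a non-learned baseline could take is a least-loaded routing rule over the same infrastructure signals. The sketch below is hypothetical and does not come from the paper: `ModelState`, `route_least_loaded`, the `capable` flag, and the 0.9 cache threshold are all assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class ModelState:
    name: str
    queue_depth: int      # requests currently waiting
    kv_cache_util: float  # fraction of KV cache in use, in [0, 1]
    capable: bool         # whether the model is acceptable for this task


def route_least_loaded(models, cache_limit=0.9):
    """Pick the capable model with the shortest queue, avoiding near-full caches."""
    candidates = [m for m in models if m.capable]
    if not candidates:
        raise ValueError("no capable model in the pool")
    # Prefer models below the cache limit; fall back to all capable ones.
    uncongested = [m for m in candidates if m.kv_cache_util < cache_limit]
    pool = uncongested or candidates
    return min(pool, key=lambda m: (m.queue_depth, m.kv_cache_util)).name


snapshot = [
    ModelState("model-a", queue_depth=12, kv_cache_util=0.95, capable=True),
    ModelState("model-b", queue_depth=1, kv_cache_util=0.30, capable=True),
    ModelState("model-c", queue_depth=0, kv_cache_util=0.10, capable=False),
]
choice = route_least_loaded(snapshot)
```

Comparing the learned policy against a rule of this kind under the same load levels would indicate how much of the reported gain comes from RL and how much from simply observing the infrastructure.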
Generated Jun 11, 2026
Paper 2 demonstrates higher potential scientific impact by addressing a fundamental systems-level bottleneck: infrastructure-aware multi-agent orchestration. While Paper 1 offers a strong domain-specific application in conflict resolution, Paper 2 solves urgent scalability, latency, and resource utilization challenges applicable to all multi-agent LLM pipelines. By integrating dynamic hardware metrics into model routing via reinforcement learning, INFRAMIND promises broad, foundational impact across AI systems, cloud computing, and large-scale deployment.
Paper 1 bridges the gap between machine learning and systems engineering, addressing a critical bottleneck in deploying multi-agent LLM systems. By integrating infrastructure state directly into the agent orchestration process via reinforcement learning, it offers massive real-world applicability and efficiency gains. While Paper 2 tackles the important issue of AI explainability, Paper 1's interdisciplinary approach and immediate potential to scale complex AI pipelines give it a broader and more quantifiable systemic impact.
Paper 2 addresses a critical and timely issue: the reliability of AI agents in synthesizing high-stakes scientific information. By introducing a robust benchmark and clean-room evaluation methodology that exposes significant flaws in frontier models, it is likely to drive substantial future research in AI safety, reasoning, and scientific discovery across multiple disciplines. While Paper 1 offers a strong technical systems-ML contribution, Paper 2 has broader societal and cross-disciplinary impact.
Paper 2 (INFRAMIND) likely has higher scientific impact due to broader applicability across ML systems and agentic workflows, strong timeliness (LLM serving under shared GPU constraints), and methodological rigor (hierarchical constrained MDP with end-to-end RL, multi-benchmark evaluation, SLO/latency metrics). Its infrastructure-aware orchestration can affect many deployed multi-agent pipelines, improving both performance and efficiency. Paper 1 (HELM) is novel and useful for automating FE modeling in civil engineering, but its domain specificity narrows breadth of impact compared to a general systems+AI framework.
Paper 2 likely has higher scientific impact: it introduces a new, open-source evaluation paradigm (MAC) targeting a timely and broadly relevant capability—autonomous agent development and recursive improvement—spanning agent design, benchmarking, safety/alignment, and secure evaluation. Its multi-layer anti-reward-hacking design and documented emergent adversarial behaviors increase rigor and relevance, and the benchmark can become a shared community standard. Paper 1 is a strong systems contribution with clear practical benefits for LLM serving, but its impact is narrower (infrastructure-aware orchestration) and more incremental relative to existing routing/scheduling work.
Paper 2 addresses a fundamental and critical bottleneck in modern AI: LLM deployment and multi-agent orchestration under dynamic infrastructure constraints. By bridging system-level signals with multi-agent planning and routing, it offers broad, scalable applications across cloud computing and AI services. Its large reported performance improvements (7x latency reduction, near-perfect SLO compliance) indicate substantial real-world and industry impact. In contrast, Paper 1 is innovative, but its scope is narrower, focusing primarily on educational technology and human-AI creativity assessment.
Paper 2 (INFRAMIND) likely has higher scientific impact because it tackles an under-addressed, timely bottleneck—real-world deployment of multi-agent LLM systems under shared, congested infrastructure—yielding large latency/SLO gains with competitive or improved accuracy. Its infrastructure-aware, end-to-end RL formulation spans planning, routing, and scheduling, making it broadly applicable across serving stacks and agentic pipelines, with immediate industry relevance and cross-field impact (systems, RL, LLM orchestration). Paper 1 is novel and rigorous for ToM prompting, but its applications are narrower and more benchmark-centric.
Paper 1 offers higher fundamental scientific impact by addressing a core cognitive gap in Large Reasoning Models (spatial reasoning) without relying on ground-truth annotations. By formalizing consistency verifiers and introducing a novel RL strategy (OT-GRPO), it advances the critical frontier of self-improving models and unsupervised alignment. While Paper 2 provides exceptional systems-level and practical deployment contributions for multi-agent orchestration, Paper 1's methodological innovations in algorithmic self-improvement have broader implications for foundational model training and reasoning.
Paper 1 (INFRAMIND) addresses a critical and timely gap in multi-agent LLM orchestration by incorporating infrastructure awareness, demonstrating strong empirical results (7.6pp accuracy gain, 7x lower latency, 99.9% SLO compliance). It introduces a novel hierarchical constrained MDP framework with broad applicability to the rapidly growing LLM deployment ecosystem. Paper 2, while valuable for AI regulation discourse, is more narrowly scoped to EU AI Act interpretation and credit scoring, with impact limited primarily to the legal/policy domain rather than driving broad scientific or technical advances.
Paper 2 (INFRAMIND) has higher likely impact due to broader relevance and timeliness: infrastructure-aware orchestration for multi-agent LLM systems applies across many domains deploying AI on shared clusters. It introduces a novel, system-level integration of planning/routing/scheduling driven by real-time signals, formalized as a hierarchical constrained MDP and optimized end-to-end with RL, suggesting strong methodological rigor. The reported gains span both quality and latency with strong SLO results, making it highly actionable for real-world serving. Paper 1 is valuable but more domain-specific to AEC/BIM compliance.