INFRAMIND: Infrastructure-Aware Multi-Agent Orchestration

Ahasan Kabir, Jiaqi Xue, Mengxin Zheng, Qian Lou

Jun 9, 2026 · arXiv:2606.11440v1

cs.AI

#357 of 3489 · Artificial Intelligence

Tournament Score: 1500±49 (scale: 1050 to 1800)

Win Rate: 83%

Rating: 7/10

Significance 7.5 · Rigor 6.5 · Novelty 7 · Clarity 8

Abstract

Existing multi-agent LLM orchestration methods, ranging from brute-force ensembles to learned routers, select models and topologies based on task and model features. However, these methods do not consider the runtime state of the serving infrastructure. On shared GPU clusters under concurrent load, this infrastructure blindness causes systematic resource underutilization: preferred models accumulate deep request queues while equally capable alternatives sit idle. In multi-agent pipelines, where each query triggers multiple sequential model calls, these delays then compound across every downstream step. Closing this gap is challenging because the relevant infrastructure signals (queue depths, KV-cache pressure, latencies) are dynamic and noisy, and they must drive three different decisions: planning, per-step routing, and scheduling. We introduce INFRAMIND, a framework that makes the entire multi-agent stack infrastructure-aware. An infra-aware planner conditions topology and role selection on real-time system load and remaining budget, biasing toward simpler graphs under congestion and richer ones at low load. An infra-aware executor then observes per-model queue depths, cache utilization, and response latencies at each agent step to decide which model to call and how deeply to reason; a budget-aware scheduler further reorders each model's queue so that urgent requests are served first. Cast as a hierarchical constrained MDP and solved end-to-end via reinforcement learning, the system learns to balance quality against latency automatically. Across five benchmarks, INFRAMIND delivers up to +7.6 pp accuracy over the prior baseline at low load with up to 7x lower latency, and sustains up to 99.9% SLO compliance under high load where every baseline drops below 50%.

AI Impact Assessments

(1 model)

Scientific Impact Assessment: INFRAMIND

1. Core Contribution

INFRAMIND addresses a genuine and increasingly important problem: multi-agent LLM orchestration systems make routing decisions based solely on task features, ignoring the runtime state of the serving infrastructure. The paper terms this "infrastructure blindness" and demonstrates convincingly through profiling experiments that it causes load imbalance, avoidable latency, and resource underutilization on shared GPU clusters.

The system introduces three coordinated components: (1) an infrastructure-aware planner that adjusts topology complexity based on system congestion via FiLM modulation, (2) an infrastructure-aware executor that jointly selects models and reasoning depth (Flash/Concise/DeepThink) based on real-time queue depths, KV-cache utilization, and latencies, and (3) an EDF scheduler that reorders within-model queues by deadline urgency. These are unified under a hierarchical constrained MDP trained end-to-end with reinforcement learning, with a single Lagrange multiplier automatically mediating the quality-latency tradeoff.
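To make the third component concrete, the following is a minimal sketch of earliest-deadline-first (EDF) ordering for a single model's queue. It is not the paper's implementation: the class name `EDFQueue`, the request fields, and the rule that a deadline equals arrival time plus remaining budget are all assumptions chosen for illustration.

```python
import heapq
import itertools
from dataclasses import dataclass, field


@dataclass(order=True)
class _Entry:
    deadline: float                      # primary sort key
    seq: int                             # tie-breaker: FIFO among equal deadlines
    request_id: str = field(compare=False)


class EDFQueue:
    """Earliest-deadline-first queue for one model's pending requests."""

    def __init__(self):
        self._heap = []
        self._counter = itertools.count()

    def push(self, request_id, arrival_time, remaining_budget):
        # Hypothetical deadline rule: arrival time plus remaining latency budget.
        deadline = arrival_time + remaining_budget
        heapq.heappush(self._heap, _Entry(deadline, next(self._counter), request_id))

    def pop(self):
        # Serve the request whose deadline is closest.
        return heapq.heappop(self._heap).request_id

    def __len__(self):
        return len(self._heap)


queue = EDFQueue()
queue.push("a", arrival_time=0.0, remaining_budget=10.0)  # deadline 10.0
queue.push("b", arrival_time=1.0, remaining_budget=2.0)   # deadline 3.0
queue.push("c", arrival_time=2.0, remaining_budget=5.0)   # deadline 7.0
order = [queue.pop() for _ in range(len(queue))]
print(order)
```

Request "b" arrives after "a" but is served first because its deadline is nearer, which is the reordering behavior the review attributes to the scheduler.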

The key insight—that infrastructure signals should inform not just serving but also orchestration decisions—is intuitive yet surprisingly unexplored. The paper correctly identifies that prior work on LLM routing (RouteLLM, TREACLE, R2-Router) handles single-turn calls and that serving optimizations (vLLM, Sarathi-Serve) optimize within a single model without cross-model routing.

2. Methodological Rigor

Formalization. The hierarchical CMDP formulation is clean and appropriate. The separation between planner (per-query, coarse infrastructure summary) and executor (per-step, full state vector) is well motivated by the different decision timescales. The Lagrangian dual approach to constraint enforcement is standard but applied sensibly.
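For readers unfamiliar with the technique, a generic Lagrangian relaxation of a constrained MDP can be written as below. This is the textbook form, not the paper's exact objective; the symbols $Q$ (quality reward), $C$ (latency cost), $B$ (latency budget), $\lambda$ (multiplier), $\eta$ (dual step size), and $\hat{C}$ (empirical cost estimate) are assumptions introduced here.

```latex
% Primal-dual form: maximize quality subject to an expected-latency budget.
\max_{\pi} \; \min_{\lambda \ge 0} \;
  \mathbb{E}_{\pi}\!\left[ Q \right]
  - \lambda \left( \mathbb{E}_{\pi}\!\left[ C \right] - B \right)

% Dual ascent on the multiplier, projected onto \lambda \ge 0.
\lambda \leftarrow \max\!\left( 0,\; \lambda + \eta \left( \hat{C} - B \right) \right)
```

When measured latency exceeds the budget, $\lambda$ grows and the policy is penalized more heavily for slow choices; when latency is under budget, $\lambda$ shrinks toward zero and quality dominates.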

Training procedure. The joint training of planner (REINFORCE) and executor (PPO) with a shared constraint multiplier is sound. Training across budget tiers and arrival rates with inter-sweep queue draining ensures exposure to diverse congestion regimes. The ~471K parameter count for the policy network is modest, suggesting practical trainability.

Experimental concerns. The evaluation is conducted on a real GPU cluster (two NVIDIA B200s) under realistic Poisson arrival patterns at three load levels, which is considerably more realistic than single-query evaluations. However, several concerns arise:

The model pool is limited to 5 models, all open-weight. While the blackbox extension (§5.4) partially addresses this, the main experiments are narrow.

The benchmarks, while covering code/math/QA, are all relatively structured tasks with clear correctness criteria. Performance on open-ended generation tasks is unknown.

The paper reports "up to" figures prominently (e.g., +7.6pp, 7× lower latency, 99.9% SLO), which are cherry-picked maxima across conditions. The average improvements across all settings are more modest.

At mid and high load on MMLU-Pro, INFRAMIND's SLO compliance drops to 59.8% and 56.0%, respectively, showing that the system also struggles under extreme conditions.

The comparison against GPTSwarm is somewhat unfair since GPTSwarm freezes its graph at test time by design and was not designed for dynamic infrastructure adaptation.

Ablations. Table 3 provides useful ablations showing each component contributes distinct value, though the ablation could be more comprehensive (e.g., varying the number of infrastructure signals, different RL algorithms, sensitivity to polling frequency).

3. Potential Impact

Immediate practical value. The paper targets a real deployment scenario that is becoming increasingly common: shared GPU clusters serving multiple LLMs for multi-agent workflows. The cited examples (JPMorgan, Bloomberg, Uber) and Gartner projections contextualize the growing need. The extension to blackbox/hybrid pools using client-side proxies (EMA latency, RPM-based congestion signals) makes the approach applicable beyond self-hosted deployments.
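As an illustration of what client-side proxy signals might look like, here is a minimal sketch of an exponential moving average (EMA) of observed latency and a requests-per-minute (RPM) utilization ratio. The names `EmaLatencyTracker` and `rpm_utilization`, the smoothing factor, and the 60-second window are assumptions for this sketch, not details taken from the paper.

```python
class EmaLatencyTracker:
    """Per-model exponential moving average of observed response latency."""

    def __init__(self, alpha=0.2):
        if not 0.0 < alpha <= 1.0:
            raise ValueError("alpha must be in (0, 1]")
        self.alpha = alpha          # weight given to the newest observation
        self._ema = {}

    def observe(self, model, latency_s):
        prev = self._ema.get(model)
        if prev is None:
            self._ema[model] = latency_s   # first sample initializes the average
        else:
            self._ema[model] = self.alpha * latency_s + (1 - self.alpha) * prev
        return self._ema[model]

    def estimate(self, model, default=None):
        return self._ema.get(model, default)


def rpm_utilization(request_times, now, rpm_limit, window_s=60.0):
    """Fraction of a (hypothetical) per-minute rate limit used in the last window."""
    recent = sum(1 for t in request_times if now - t < window_s)
    return recent / rpm_limit


tracker = EmaLatencyTracker(alpha=0.5)
tracker.observe("model-x", 1.0)
smoothed = tracker.observe("model-x", 3.0)          # 0.5 * 3.0 + 0.5 * 1.0
usage = rpm_utilization([0.0, 10.0, 50.0, 65.0], now=70.0, rpm_limit=10)
print(smoothed, usage)
```

Both signals can be computed without access to the provider's internals, which is why they are plausible stand-ins for queue depth and cache pressure when the model is a blackbox API.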

Systems-ML intersection. INFRAMIND sits at the intersection of ML systems and learned orchestration, a space with growing interest. The principle of making ML decisions infrastructure-aware could extend to other compound AI systems beyond multi-agent LLMs.

Limitations on generality. The framework assumes a fixed model pool and commits to topology at planning time. The authors acknowledge these limitations, but they still constrain near-term applicability in elastic cloud environments. The reliance on vLLM's specific metrics endpoint creates a dependency on serving stack details.

4. Timeliness & Relevance

This paper is highly timely. Multi-agent LLM systems are proliferating, and the transition from single-model serving to multi-agent orchestration on shared infrastructure is an active industry trend. The gap between task-level routing intelligence and infrastructure-level serving optimization is real and growing. The paper addresses a bottleneck that will become more severe as multi-agent deployments scale.

5. Strengths & Limitations

Strengths:

Clear problem identification with compelling motivating experiments (Figure 1)

Principled hierarchical CMDP formulation with automatic quality-latency balancing

Strong empirical results, especially SLO compliance under high load (the most practically relevant metric)

The reasoning depth adaptation (Flash/Concise/DeepThink) as a quality lever is a creative addition

Extension to blackbox APIs demonstrates generalizability

Figure 5 showing emergent budget-awareness is a nice result

Limitations:

Small model pool (5 models) limits conclusions about scalability

No analysis of training cost or convergence behavior

The paper does not discuss how the system handles model failures or restarts

No comparison against simpler infrastructure-aware heuristics (e.g., weighted round-robin with queue-based load balancing) that don't require RL training

The system monitor polling frequency and its impact on decision quality are not analyzed

Reproducibility concerns: while hyperparameters are provided, the RL training on a live serving cluster is inherently difficult to reproduce

The paper does not discuss how performance scales with the number of concurrent users or models beyond the tested configurations
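To indicate the kind of training-free baseline the limitations above have in mind, here is a minimal sketch of a weighted least-queue rule, a close relative of weighted round-robin with queue-based load balancing. The function name `pick_model`, the capability weights, and the model names are hypothetical; nothing here comes from the paper.

```python
def pick_model(queue_depths, weights):
    """Return the model with the smallest weighted load.

    Load is (queue_depth + 1) / weight, so a model with a higher weight
    (assumed here to mean higher capability or throughput) tolerates a
    deeper queue before it stops being preferred. Ties break by name.
    """
    return min(queue_depths, key=lambda m: ((queue_depths[m] + 1) / weights[m], m))


weights = {"large": 4.0, "medium": 2.0, "small": 1.0}

# Congested: the preferred large model has a deep queue, so traffic shifts away.
congested = pick_model({"large": 8, "medium": 2, "small": 0}, weights)

# Lightly loaded: the large model wins despite not having the emptiest queue.
light = pick_model({"large": 1, "medium": 2, "small": 3}, weights)

print(congested, light)
```

A comparison against a rule of this kind would show how much of the reported gain comes from simply reading queue depths and how much from the learned policy.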

Overall Assessment

INFRAMIND makes a solid contribution by identifying and addressing a genuine gap in multi-agent LLM orchestration. The infrastructure blindness problem is well-motivated, the hierarchical RL solution is technically sound, and the empirical results—particularly the SLO compliance improvements under load—are impressive and practically relevant. The work is timely and addresses a real need. However, the experimental scope is somewhat narrow (5 models, structured benchmarks), and the absence of simpler baseline comparisons (infrastructure-aware heuristics without RL) leaves open the question of how much value the learned policy adds over straightforward engineering solutions.

Rating: 7/10

Significance 7.5 · Rigor 6.5 · Novelty 7 · Clarity 8

Generated Jun 11, 2026

Comparison History (24)

Won vs. Automated Mediator for Human Negotiation: Pre-Mediation via a Structured LLM Pipeline

Paper 2 demonstrates higher potential scientific impact by addressing a fundamental systems-level bottleneck: infrastructure-aware multi-agent orchestration. While Paper 1 offers a strong domain-specific application in conflict resolution, Paper 2 solves urgent scalability, latency, and resource utilization challenges applicable to all multi-agent LLM pipelines. By integrating dynamic hardware metrics into model routing via reinforcement learning, INFRAMIND promises broad, foundational impact across AI systems, cloud computing, and large-scale deployment.

gemini-3.1-pro-preview · Jun 11, 2026

Won vs. Forecasting Future Behavior as a Learning Task

Paper 1 bridges the gap between machine learning and systems engineering, addressing a critical bottleneck in deploying multi-agent LLM systems. By integrating infrastructure state directly into the agent orchestration process via reinforcement learning, it offers massive real-world applicability and efficiency gains. While Paper 2 tackles the important issue of AI explainability, Paper 1's interdisciplinary approach and immediate potential to scale complex AI pipelines give it a broader and more quantifiable systemic impact.

gemini-3.1-pro-preview · Jun 11, 2026

Lost vs. Can AI Agents Synthesize Scientific Conclusions?

Paper 2 addresses a critical and timely issue: the reliability of AI agents in synthesizing high-stakes scientific information. By introducing a robust benchmark and clean-room evaluation methodology that exposes significant flaws in frontier models, it is likely to drive substantial future research in AI safety, reasoning, and scientific discovery across multiple disciplines. While Paper 1 offers a strong technical systems-ML contribution, Paper 2 has broader societal and cross-disciplinary impact.

gemini-3.1-pro-preview · Jun 11, 2026

Won vs. Human-Enhanced Loop Modeling (HELM): Agent-Based Finite Element Modeling of Concrete Bridge Barriers

Paper 2 (INFRAMIND) likely has higher scientific impact due to broader applicability across ML systems and agentic workflows, strong timeliness (LLM serving under shared GPU constraints), and methodological rigor (hierarchical constrained MDP with end-to-end RL, multi-benchmark evaluation, SLO/latency metrics). Its infrastructure-aware orchestration can affect many deployed multi-agent pipelines, improving both performance and efficiency. Paper 1 (HELM) is novel and useful for automating FE modeling in civil engineering, but its domain specificity narrows breadth of impact compared to a general systems+AI framework.

gpt-5.2 · Jun 11, 2026

Lost vs. The Meta-Agent Challenge: Are Current Agents Capable of Autonomous Agent Development?

Paper 2 likely has higher scientific impact: it introduces a new, open-source evaluation paradigm (MAC) targeting a timely and broadly relevant capability—autonomous agent development and recursive improvement—spanning agent design, benchmarking, safety/alignment, and secure evaluation. Its multi-layer anti-reward-hacking design and documented emergent adversarial behaviors increase rigor and relevance, and the benchmark can become a shared community standard. Paper 1 is a strong systems contribution with clear practical benefits for LLM serving, but its impact is narrower (infrastructure-aware orchestration) and more incremental relative to existing routing/scheduling work.

gpt-5.2 · Jun 11, 2026

Won vs. IntElicit: Eliciting and Assessing Contextualized Creativity via Dialogue Policy Optimization

Paper 2 addresses a fundamental and critical bottleneck in modern AI: LLM deployment and multi-agent orchestration under dynamic infrastructure constraints. By bridging system-level signals with multi-agent planning and routing, it offers broad, scalable applications across cloud computing and AI services. Its massive performance improvements (7x latency reduction, near-perfect SLO compliance) indicate substantial real-world and industry impact. In contrast, Paper 1 is innovative but its scope is relatively narrower, focusing primarily on educational technology and human-AI creativity assessment.

gemini-3.1-pro-preview · Jun 11, 2026

Won vs. Mind the Perspective: Let's Reason Recursively for Theory of Mind

Paper 2 (INFRAMIND) likely has higher scientific impact because it tackles an under-addressed, timely bottleneck—real-world deployment of multi-agent LLM systems under shared, congested infrastructure—yielding large latency/SLO gains with competitive or improved accuracy. Its infrastructure-aware, end-to-end RL formulation spans planning, routing, and scheduling, making it broadly applicable across serving stacks and agentic pipelines, with immediate industry relevance and cross-field impact (systems, RL, LLM orchestration). Paper 1 is novel and rigorous for ToM prompting, but its applications are narrower and more benchmark-centric.

gpt-5.2 · Jun 11, 2026

Lost vs. The Art of Interrogation: Consistency Amplifies Factuality in Spatial Reasoning

Paper 1 offers higher fundamental scientific impact by addressing a core cognitive gap in Large Reasoning Models (spatial reasoning) without relying on ground-truth annotations. By formalizing consistency verifiers and introducing a novel RL strategy (OT-GRPO), it advances the critical frontier of self-improving models and unsupervised alignment. While Paper 2 provides exceptional systems-level and practical deployment contributions for multi-agent orchestration, Paper 1's methodological innovations in algorithmic self-improvement have broader implications for foundational model training and reasoning.

gemini-3.1-pro-preview · Jun 11, 2026

Won vs. When Do Data-Driven Systems Exhibit the Capability to Infer?

Paper 1 (INFRAMIND) addresses a critical and timely gap in multi-agent LLM orchestration by incorporating infrastructure awareness, demonstrating strong empirical results (7.6pp accuracy gain, 7x lower latency, 99.9% SLO compliance). It introduces a novel hierarchical constrained MDP framework with broad applicability to the rapidly growing LLM deployment ecosystem. Paper 2, while valuable for AI regulation discourse, is more narrowly scoped to EU AI Act interpretation and credit scoring, with impact limited primarily to the legal/policy domain rather than driving broad scientific or technical advances.

claude-opus-4-6 · Jun 11, 2026

Won vs. Automating Geometry-Intensive Compliance Checking in BIM: Graph-Based Semantic Reasoning Framework

Paper 2 (INFRAMIND) has higher likely impact due to broader relevance and timeliness: infrastructure-aware orchestration for multi-agent LLM systems applies across many domains deploying AI on shared clusters. It introduces a novel, system-level integration of planning/routing/scheduling driven by real-time signals, formalized as a hierarchical constrained MDP and optimized end-to-end with RL, suggesting strong methodological rigor. The reported gains span both quality and latency with strong SLO results, making it highly actionable for real-world serving. Paper 1 is valuable but more domain-specific to AEC/BIM compliance.

gpt-5.2 · Jun 11, 2026
