Multi-Agent Orchestration for High-Throughput Materials Screening on a Leadership-Class System

Thang Duc Pham, Harikrishna Tummalapalli, Fakhrul Hasan Bhuiyan, Álvaro Vázquez Mayagoitia, Christine Simpson, Riccardo Balin, Venkatram Vishwanath, Murat Keçeli

Apr 9, 2026

arXiv:2604.07681v1

cs.AI (primary)

#164 of 2292 · Artificial Intelligence

Tournament Score: 1526 ± 26 (scale 1050 to 1800)

Win Rate: 67%

Rating: 5.5 / 10

Significance 5.5 · Rigor 4.5 · Novelty 5 · Clarity 7

Abstract

The integration of Artificial Intelligence (AI) with High-Performance Computing (HPC) is transforming scientific workflows from human-directed pipelines into adaptive systems capable of autonomous decision-making. Large language models (LLMs) play a critical role in autonomous workflows; however, deploying LLM-based agents at scale remains a significant challenge. Single-agent architectures and sequential tool calls often become serialization bottlenecks when executing large-scale simulation campaigns, failing to utilize the massive parallelism of exascale resources. To address this, we present a scalable, hierarchical multi-agent framework for orchestrating high-throughput screening campaigns. Our planner-executor architecture employs a central planning agent to dynamically partition workloads and assign subtasks to a swarm of parallel executor agents. All executor agents interface with a shared Model Context Protocol (MCP) server that orchestrates tasks via the Parsl workflow engine. To demonstrate this framework, we employed the open-weight gpt-oss-120b model to orchestrate a high-throughput screening of the Computation-Ready Experimental (CoRE) Metal-Organic Framework (MOF) database for atmospheric water harvesting. The results demonstrate that the proposed agentic framework enables efficient and scalable execution on the Aurora supercomputer, with low orchestration overhead and high task completion rates. This work establishes a flexible paradigm for LLM-driven scientific automation on HPC systems, with broad applicability to materials discovery and beyond.

AI Impact Assessments

(3 models)

Scientific Impact Assessment

1. Core Contribution

This paper presents a hierarchical multi-agent framework ("planner–executor" architecture) for orchestrating high-throughput computational screening campaigns on leadership-class HPC systems. The key architectural innovation is the separation of concerns: a central LLM-based planner agent decomposes scientific objectives into subtasks, which are dispatched to a swarm of parallel executor agents. These executors interface with a shared Model Context Protocol (MCP) server that generates Parsl workflow applications rather than directly executing simulations, enabling asynchronous dispatch at scale. The framework is demonstrated by screening ~5,600 Metal-Organic Frameworks (MOFs) from the CoRE database for atmospheric water harvesting on the Aurora exascale supercomputer using the open-weight gpt-oss-120b model.
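The planner–executor pattern described above can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes Python standard-library thread pools as a stand-in for Parsl, contains no LLM or MCP server, and the names `plan`, `executor`, `run_simulation`, and `screen` are hypothetical. It only shows the shape of the idea, in which a planner partitions the task list and each executor submits its chunk without blocking on individual simulations.

```python
from concurrent.futures import ThreadPoolExecutor


def plan(task_ids, n_executors):
    """Planner step: split the task list into near-equal contiguous chunks."""
    base, extra = divmod(len(task_ids), n_executors)
    chunks, start = [], 0
    for i in range(n_executors):
        size = base + (1 if i < extra else 0)
        chunks.append(task_ids[start:start + size])
        start += size
    return chunks


def run_simulation(mof_id):
    """Hypothetical placeholder for one simulation task; returns a dummy value."""
    return mof_id, len(mof_id)


def executor(chunk, sim_pool):
    """Executor step: submit every task in the chunk first, then gather results."""
    futures = [sim_pool.submit(run_simulation, mof_id) for mof_id in chunk]
    return [f.result() for f in futures]


def screen(task_ids, n_executors=4, n_workers=8):
    """Run the executors concurrently over a shared pool of simulation workers."""
    chunks = plan(task_ids, n_executors)
    with ThreadPoolExecutor(max_workers=n_workers) as sim_pool, \
         ThreadPoolExecutor(max_workers=n_executors) as agent_pool:
        per_executor = list(
            agent_pool.map(lambda chunk: executor(chunk, sim_pool), chunks)
        )
    # Flatten the per-executor result lists into one {task_id: result} mapping.
    return dict(r for results in per_executor for r in results)
```

Calling `screen([f"MOF-{i}" for i in range(10)])` returns one entry per identifier. In the paper's setting the shared pool would correspond to the Parsl-managed compute nodes, and the partitioning would be decided by the planning agent rather than by a fixed rule.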

The problem being addressed—bridging natural language scientific intent and large-scale HPC execution—is genuine and practically important. The serialization bottleneck of single-agent sequential tool-calling is a real limitation that this hierarchical design attempts to overcome.

2. Methodological Rigor

The methodology is competently executed but has notable gaps:

Strengths: The weak and strong scaling experiments are well-designed, with both random and nested sampling strategies for weak scaling. The use of realistic GCMC simulations with established force fields (UFF, TIP4P, TraPPE) and a real materials database (CoRE MOF 2025) gives practical credibility. The multi-objective screening experiment (water, CO₂, N₂ adsorption) is a useful demonstration of flexibility.

Weaknesses: The evaluation is primarily a systems demonstration rather than a rigorous comparison. There is no baseline comparison against a conventional (non-agentic) workflow using the same Parsl infrastructure, making it impossible to quantify the actual overhead or benefit of the LLM orchestration layer. The 84% success rate (21/25 runs succeeding) is concerning for production use and is only briefly discussed. The strong scaling efficiency dropping to 64.9% at 256 nodes is attributed to inherent workload imbalance but not analyzed in depth. The paper acknowledges that simulation parameters (50,000 cycles for strong scaling) were chosen for benchmarking convenience rather than scientific accuracy, which is reasonable but limits the scientific value of the screening results themselves.
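The two figures cited above can be put in context with a short sketch. It assumes the conventional definition of strong-scaling efficiency (ideal time divided by measured time, relative to a baseline run) and uses a Wilson score interval to show how uncertain a success rate estimated from 25 runs is. The function names are hypothetical, and the paper's baseline node count and timings are not reproduced here.

```python
import math


def success_rate(successes, runs):
    """Fraction of runs that completed successfully."""
    return successes / runs


def wilson_interval(successes, runs, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 is about 95%)."""
    p = successes / runs
    denom = 1 + z**2 / runs
    center = (p + z**2 / (2 * runs)) / denom
    half = z * math.sqrt(p * (1 - p) / runs + z**2 / (4 * runs**2)) / denom
    return center - half, center + half


def strong_scaling_efficiency(t_base, n_base, t_n, n):
    """Ideal time on n nodes (t_base * n_base / n) divided by measured time t_n."""
    return (t_base * n_base) / (t_n * n)
```

With `wilson_interval(21, 25)` the interval spans roughly the mid-60s to the low-90s in percent, so 25 runs cannot distinguish a fairly unreliable system from a fairly reliable one. This supports the reviewer's point that the reliability figure deserves more than a brief discussion.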

The orchestration overhead of 60–90 seconds is reported but without detailed profiling of where time is spent (tokenization, inference, network latency, MCP communication), limiting the ability to identify optimization opportunities.
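The kind of per-stage profiling that the review finds missing could be collected with something as simple as the sketch below. It is an illustration only: `StageTimer` is a hypothetical helper, and stage labels such as "inference" or "mcp_call" would be chosen by whoever instruments the framework.

```python
import time
from collections import defaultdict
from contextlib import contextmanager


class StageTimer:
    """Accumulates wall-clock time per named stage."""

    def __init__(self):
        self.totals = defaultdict(float)

    @contextmanager
    def stage(self, name):
        start = time.perf_counter()
        try:
            yield
        finally:
            # Record elapsed time even if the wrapped block raises.
            self.totals[name] += time.perf_counter() - start

    def breakdown(self):
        """Return each stage's share of the total recorded time."""
        total = sum(self.totals.values())
        if not total:
            return {}
        return {name: t / total for name, t in self.totals.items()}
```

Wrapping each step in `with timer.stage("inference"): ...` and printing `timer.breakdown()` at the end of a run would show which component dominates the reported 60–90 seconds.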

3. Potential Impact

The framework addresses a real gap in the HPC landscape: making supercomputing resources accessible through natural language interfaces. If reliable, this could significantly lower the barrier to entry for domain scientists who lack HPC expertise. The materials screening application (MOFs for water harvesting) is timely and practically relevant given climate change and water scarcity concerns.

However, the impact is somewhat constrained by several factors:

The framework is demonstrated for an "embarrassingly parallel" screening problem where task decomposition is straightforward. More complex workflows with data dependencies, iterative refinement, or active learning would be far more challenging.

The 84% reliability rate limits practical adoption for production campaigns.

The actual scientific insights from the MOF screening are modest—the distribution of working capacities is presented but not validated against existing literature or experimental data.

The broader applicability claimed (materials discovery "and beyond") is plausible but undemonstrated. The open-source release via ChemGraph enhances potential impact.

4. Timeliness & Relevance

The paper sits at a very timely intersection: LLM agents, MCP (released by Anthropic in late 2024), and exascale computing (Aurora recently deployed). The use of an open-weight model for scientific HPC workflows is particularly relevant given growing concerns about cost, privacy, and reproducibility with proprietary APIs. The MCP integration is novel in the HPC context and positions the work well for the rapidly evolving agent ecosystem.

The paper addresses the current bottleneck of translating agentic AI capabilities into practical HPC workflows, which is an active area of research (as evidenced by the cited Colmena, LangChain-Parsl, and federated agents work).

5. Strengths & Limitations

Key Strengths:

Clean architectural design with well-motivated separation of concerns (planning vs. execution vs. analysis)

Practical demonstration on a real exascale system (Aurora) with a scientifically meaningful application

Use of open-weight model removes dependency on proprietary APIs, improving reproducibility

MCP-Parsl integration is a novel and potentially widely reusable contribution

Open-source availability enhances reproducibility and community adoption

Multi-objective workflow capability demonstrated through natural language alone

Notable Limitations:

No comparison against non-agentic baselines—the paper cannot quantify what the LLM layer actually adds versus a simple Python script that submits the same Parsl workflows

The 84% success rate is problematic: restarting failed runs from scratch is wasteful, and the paper provides no checkpointing or partial recovery mechanism

The screening application is embarrassingly parallel with no inter-task dependencies, which is the easiest case for parallelization; it doesn't test the agent's ability to handle complex, adaptive workflows

Limited analysis of LLM reasoning quality—were the agent's decomposition decisions optimal? Did it make any scientifically meaningful decisions, or did it simply partition a list?

The paper does not address what happens when the LLM makes scientifically incorrect decisions (e.g., wrong simulation parameters)

Token consumption and cost analyses are missing, despite cost being mentioned as a motivation for using open-weight models

The scientific results (MOF rankings) are not validated against known experimental or computational benchmarks
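The partial-recovery mechanism that the limitations above call for can be sketched in a few lines. This is an assumption about what such a mechanism might look like, not something the paper describes: completed task results are written to a JSON file after every task, so a restarted campaign skips work that has already finished. The names `load_checkpoint`, `save_checkpoint`, and `run_campaign` are hypothetical.

```python
import json
import os
import tempfile


def load_checkpoint(path):
    """Return previously saved results, or an empty dict on a fresh start."""
    if not os.path.exists(path):
        return {}
    with open(path) as fh:
        return json.load(fh)


def save_checkpoint(path, results):
    """Write atomically: dump to a temp file, then rename over the old file."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as fh:
        json.dump(results, fh)
    os.replace(tmp, path)


def run_campaign(task_ids, run_task, path):
    """Run only tasks that are absent from the checkpoint, saving after each one."""
    results = load_checkpoint(path)
    for task_id in task_ids:
        if task_id in results:
            continue  # finished in an earlier attempt
        results[task_id] = run_task(task_id)
        save_checkpoint(path, results)
    return results
```

If `run_task` raises partway through, calling `run_campaign` again with the same `path` resumes from the first unfinished task. Saving after every task is the simplest policy; a real campaign with thousands of short tasks would probably batch the writes.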

Additional Observations

The paper is well-written and clearly structured. The framework's model-agnostic design is a practical advantage. However, the fundamental question of whether LLM orchestration provides meaningful value over a well-written Python script for this specific type of embarrassingly parallel screening remains unanswered. The most compelling use case for LLM agents—adaptive, scientifically-reasoned decision-making during workflow execution—is not demonstrated here.

The work represents a competent engineering contribution that brings together several emerging technologies (MCP, LangGraph, Parsl, open-weight LLMs) in a novel configuration, but the scientific and algorithmic novelty is incremental.

Rating: 5.5 / 10

Significance 5.5 · Rigor 4.5 · Novelty 5 · Clarity 7

Generated Apr 10, 2026

Comparison History (49)

vs. Does Machine Unlearning Preserve Clinical Safety? A Risk Analysis for Medical Image Classification

claude-opus-4.6 · 4/28/2026

Paper 2 presents a novel framework combining LLM-based multi-agent systems with exascale HPC for autonomous scientific workflows—a timely convergence of AI and HPC with broad applicability across materials science and other domains. Its scalable hierarchical architecture addresses a fundamental bottleneck in AI-driven scientific automation. Paper 1 addresses an important but narrower concern (clinical risk in machine unlearning), proposing an incremental modification to an existing method (SalUn). While clinically relevant, its scope is more limited. Paper 2's paradigm-level contribution to scientific automation has wider cross-disciplinary impact potential.

vs. Does Machine Unlearning Preserve Clinical Safety? A Risk Analysis for Medical Image Classification

claude-opus-4.6 · 4/28/2026

Paper 1 presents a novel, scalable multi-agent framework combining LLMs with exascale HPC for autonomous scientific workflows—a timely convergence of AI and HPC with broad applicability across materials discovery and other scientific domains. Its demonstration on a leadership-class supercomputer (Aurora) with a concrete materials screening application shows immediate real-world utility and methodological innovation. Paper 2 addresses an important but narrower problem (clinical risk in machine unlearning) with an incremental modification to an existing method (SalUn). While relevant, its scope and potential cross-disciplinary impact are more limited.

vs. The CriticalSet problem: Identifying Critical Contributors in Bipartite Dependency Networks

gpt-5.2 · 4/26/2026

Paper 2 has higher likely impact due to timeliness and broad applicability: scalable multi-agent LLM orchestration on leadership-class HPC addresses a widely felt bottleneck in autonomous scientific workflows and can generalize across many domains beyond materials (chemistry, climate, biology, etc.). It demonstrates real-world deployment on an exascale system and integrates with established workflow infrastructure (Parsl/MCP), improving adoption potential. Paper 1 is methodologically strong and novel for bipartite dependency robustness, but its impact is more specialized to network mining settings.

vs. Separable Expert Architecture: Toward Privacy-Preserving LLM Personalization via Composable Adapters and Deletable User Proxies

claude-opus-4.6 · 4/26/2026

Paper 2 demonstrates a novel integration of LLM-based multi-agent systems with exascale HPC for autonomous scientific discovery, addressing a timely challenge in AI-driven materials screening. Its hierarchical planner-executor framework with demonstrated results on the Aurora supercomputer has broad applicability across scientific domains. While Paper 1 addresses important privacy/unlearning concerns with a well-designed architecture, it is more narrowly focused on LLM personalization. Paper 2's cross-disciplinary impact (AI, HPC, materials science), practical demonstration at scale, and establishment of a new paradigm for scientific automation give it higher potential impact.

vs. Who Defines Fairness? Target-Based Prompting for Demographic Representation in Generative Models

gpt-5.2 · 4/26/2026

Paper 2 has higher potential impact due to stronger methodological rigor and broader, cross-domain applicability: it demonstrates a scalable multi-agent orchestration architecture deployed on a leadership-class exascale system, addressing a timely bottleneck (LLM agent serialization) in AI+HPC scientific automation. The real-world implications span many simulation-driven fields (materials, chemistry, climate, engineering) and provide a reusable systems paradigm. Paper 1 is novel and practical for bias mitigation at inference time, but its scope is narrower (T2I fairness auditing/prompting) and outcomes depend on subjective fairness targets and evaluation limits, likely reducing scientific breadth.

vs. Discover and Prove: An Open-source Agentic Framework for Hard Mode Automated Theorem Proving in Lean 4

claude-opus-4.6 · 4/20/2026

Paper 1 introduces a fundamental methodological critique of ATP benchmarks ('Easy Mode' vs 'Hard Mode'), provides new benchmark datasets, and reveals a striking gap between LLM reasoning and formal proving capabilities. This reframes how the community evaluates theorem-proving systems and has broad implications for AI reasoning research. Paper 2 demonstrates solid engineering of multi-agent HPC orchestration but is more incremental—applying known architectural patterns (planner-executor, workflow engines) to materials screening. Paper 1's conceptual contribution and benchmark artifacts are likely to have more lasting influence on the AI/formal methods community.

vs. HWE-Bench: Benchmarking LLM Agents on Real-World Hardware Bug Repair Tasks

gpt-5.2 · 4/17/2026

Paper 2 likely has higher impact: it introduces a novel, scalable, repository-level benchmark (HWE-Bench) with realistic, containerized evaluation for hardware bug repair—an underbenchmarked, high-stakes domain. Benchmarks often become community standards, enabling reproducible comparisons, driving methodological progress, and influencing both academia and industry EDA workflows. Its breadth spans LLM agents, verification, hardware design, and software engineering, and it is timely given interest in agentic coding. Paper 1 is valuable systems engineering for HPC orchestration, but appears more incremental and narrower in downstream adoption.

vs. Predicting Power-System Dynamic Trajectories with Foundation Models

gemini-3 · 4/17/2026

Paper 1 presents a highly scalable, domain-agnostic framework integrating LLM agents with exascale high-performance computing, fundamentally advancing how autonomous scientific workflows are executed. While Paper 2 is impactful for power systems, Paper 1's approach has broader applicability across computational sciences, offering transformative potential for high-throughput discovery in materials science, chemistry, and beyond.

vs. WebXSkill: Skill Learning for Autonomous Web Agents

gpt-5.2 · 4/16/2026

Paper 2 has higher potential impact due to its direct enablement of scalable, LLM-driven autonomous scientific workflows on exascale HPC—an urgent, broadly relevant bottleneck across computational science. The hierarchical multi-agent orchestration tied to real leadership-class deployment (Aurora), shared MCP server, and workflow engine integration suggests stronger methodological/engineering rigor and clearer real-world applicability (high-throughput screening, materials discovery, broader simulations). Paper 1 is novel for web agents and practically useful, but its impact is narrower (web automation benchmarks) and less cross-disciplinary than exascale scientific automation.

vs. Artifacts as Memory Beyond the Agent Boundary

claude-opus-4.6 · 4/13/2026

Paper 2 addresses the highly timely intersection of LLMs, HPC, and autonomous scientific discovery, demonstrating a practical scalable framework for materials screening on exascale systems. Its broad applicability to materials discovery and scientific automation, combined with a concrete demonstration on a leadership-class supercomputer, gives it higher near-term impact. Paper 1 offers elegant theoretical contributions formalizing external memory in RL, but its scope is narrower and more foundational, with less immediate practical applicability. Paper 2's relevance to the rapidly growing AI-for-science and agentic AI communities amplifies its potential impact.

vs. Do We Really Need to Approach the Entire Pareto Front in Many-Objective Bayesian Optimisation?

gpt-5.2 · 4/13/2026

Paper 1 targets a timely, high-impact bottleneck—scaling LLM-driven scientific automation to exascale HPC—by introducing a hierarchical multi-agent orchestration architecture demonstrated on a leadership-class system (Aurora) and a real materials discovery task (MOF screening). This combination of systems innovation, demonstrated large-scale feasibility, and broad applicability across HPC-enabled domains suggests strong real-world adoption potential. Paper 2 offers a solid methodological contribution to many-objective Bayesian optimization with theory and benchmarks, but its impact is likely narrower to the optimization community compared with the cross-field infrastructure implications of Paper 1.

vs. Ads in AI Chatbots? An Analysis of How Large Language Models Navigate Conflicts of Interest

gemini-3 · 4/10/2026

Paper 2 introduces a scalable multi-agent framework on HPC systems, directly accelerating scientific discovery in fields like materials science. Its ability to automate high-throughput screening on exascale systems has profound implications for accelerating research across multiple scientific domains. While Paper 1 offers an important socio-technical analysis of AI ethics and commercial conflicts of interest, Paper 2's methodological contribution to automating and scaling foundational scientific research gives it a broader and more transformative scientific impact.

vs. ImplicitMemBench: Measuring Unconscious Behavioral Adaptation in Large Language Models

gpt-5.2 · 4/10/2026

Paper 2 likely has higher impact due to direct real-world applicability and timely relevance: it demonstrates a scalable LLM-driven orchestration framework on an exascale-class system (Aurora) for a concrete scientific use case (MOF screening for water harvesting). This combines AI+HPC workflow engineering with immediate utility across computational science domains and could be adopted broadly. Paper 1 is novel and valuable as an evaluation benchmark, but its impact is more diagnostic/measurement-focused and mainly confined to LLM research; it does not itself enable new scientific capabilities at scale.

vs. Lightweight LLM Agent Memory with Small Language Models

gemini-3 · 4/10/2026

Paper 2 demonstrates a high-impact application of AI for Science by integrating multi-agent LLM orchestration with exascale HPC systems. While Paper 1 offers valuable efficiency improvements for LLM agents, Paper 2 directly accelerates physical scientific discovery (materials screening for water harvesting) and establishes a scalable paradigm for autonomous scientific workflows, promising broader interdisciplinary impact across chemistry, physics, and materials science.

vs. Hidden Biases in Conditioning Autoregressive Models

claude-opus-4.6 · 4/10/2026

Paper 2 addresses a fundamental theoretical question about autoregressive models (LLMs, music models) that affects a vast range of applications. By proving NP-hardness and #P-hardness results for exact conditioning and MAP decoding under constraints, it formalizes widely-assumed but unproven claims, providing rigorous foundations that will influence how the entire field understands and develops constrained generation methods. Paper 1, while technically impressive as an engineering contribution combining LLM agents with HPC for materials screening, is more application-specific and incremental in nature. Paper 2's theoretical insights have broader, longer-lasting impact across ML, NLP, and computational complexity.

vs. HybridKV: Hybrid KV Cache Compression for Efficient Multimodal Large Language Model Inference

gpt-5.2 · 4/10/2026

Paper 2 has higher likely impact due to a clear, broadly applicable systems/method contribution: KV-cache compression directly targets a major, widespread bottleneck in multimodal LLM inference, improving memory and latency with strong benchmark evidence (11 tasks, sizable speed/memory gains). Its applicability spans many models and deployment settings, making cross-field adoption likely. Paper 1 is timely and useful for AI-for-science on HPC, but the contribution is more integration/engineering of agent orchestration tied to specific infrastructure, potentially limiting generalizability compared with a model-inference optimization that can propagate widely.

vs. EigentSearch-Q+: Enhancing Deep Research Agents with Structured Reasoning Tools

gemini-3 · 4/10/2026

Paper 2 bridges AI, high-performance computing, and materials science to enable autonomous scientific discovery at an exascale level. While Paper 1 offers useful incremental improvements to AI search agents, Paper 2's scalable multi-agent framework directly accelerates high-throughput scientific workflows (like discovering materials for water harvesting) on world-class supercomputers, promising broader and more transformative real-world scientific impact.

vs. Towards Knowledgeable Deep Research: Framework and Benchmark

claude-opus-4.6 · 4/10/2026

Paper 1 presents a novel, scalable multi-agent framework for orchestrating LLM-driven scientific workflows on exascale HPC systems, addressing a critical bottleneck in AI-driven materials discovery. It demonstrates real-world application (MOF screening for water harvesting) on a leadership-class supercomputer, combining HPC, AI agents, and materials science in a broadly applicable paradigm. Paper 2 contributes a useful benchmark and framework for deep research with structured knowledge, but is more incremental in the LLM agent evaluation space. Paper 1's cross-disciplinary impact (HPC + AI + materials science) and practical scalability give it higher potential impact.

vs. OmniDiagram: Advancing Unified Diagram Code Generation via Visual Interrogation Reward

gemini-3 · 4/10/2026

Paper 2 presents a scalable framework for integrating LLM agents with exascale HPC to drive autonomous scientific workflows. Its application to high-throughput materials discovery tackles critical bottlenecks in 'AI for Science', offering broader cross-disciplinary impact and potential for fundamental scientific breakthroughs compared to the specialized diagram generation task in Paper 1.

vs. Reasoning Graphs: Deterministic Agent Accuracy through Evidence-Centric Chain-of-Thought Feedback

gemini-3 · 4/10/2026

Paper 1 introduces a fundamental structural innovation (Reasoning Graphs) that addresses a critical limitation in current LLM agents: stateless reasoning and high variance. By providing a generalizable method for evidence-centric feedback without retraining, this approach has the potential to broadly impact AI reliability and accuracy across virtually all downstream domains. Paper 2, while highly valuable for high-performance computing and materials science, represents an applied systems-engineering effort rather than a foundational algorithmic breakthrough, making Paper 1's potential scientific impact substantially broader.