Reliability and Effectiveness of Autonomous AI Agents in Supply Chain Management

Carol Xuan Long, David Simchi-Levi, Feng Zhu, Huangyuan Su, Andre P. Calmon, Flavio P. Calmon

May 16, 2026

arXiv:2605.17036v1

cs.AI (primary), cs.LG, cs.MA, eess.SY

#434 of 2292 · Artificial Intelligence

Tournament Score: 1481 ± 46 (scale shown from 1050 to 1800)

Win Rate: 74%

Rating: 7.5 / 10

Significance 8 · Rigor 7 · Novelty 8 · Clarity 8.5

Abstract

This paper studies autonomous generative AI agents in multi-echelon supply chains using the MIT Beer Game. We identify four inference-time levers that shape performance: model selection, policies and guardrails, centralized data sharing, and prompt engineering. Model capability is the dominant factor: an out-of-the-box reasoning model exceeds human-level performance, and optimized reasoning models reduce costs by up to 67% relative to human teams. However, strong average performance masks substantial reliability risks. We introduce the agent bullwhip effect, the amplification of decision unreliability across echelons, manifesting along two dimensions: decision variance increases both across facilities at the same point in time and within the same facility across time. We develop a mathematical framework showing that this phenomenon is inherent to multi-agent systems that involve coordination and information delays, and we demonstrate that repeated sampling fails to meaningfully reduce it. To address this limitation, we propose a Group Relative Policy Optimization (GRPO)-based reinforcement-learning post-training framework that trains a shared base LLM using system-level supply-chain rewards. GRPO post-training substantially reduces tail events, curtails agent bullwhip, and improves the reliability of autonomous supply-chain agents.

AI Impact Assessments

(1 model)

Scientific Impact Assessment

Core Contribution

This paper makes three interlinked contributions at the intersection of LLM-based autonomous agents and supply chain management. First, it provides a systematic empirical evaluation of LLM agents as autonomous decision-makers in the MIT Beer Game, benchmarking against human teams and identifying four inference-time levers (model selection, guardrails, information orchestration, prompt engineering). Second, it introduces the concept of "agent bullwhip"—the amplification of decision *unreliability* (run-to-run variance) across echelons and over time in multi-agent systems—distinct from the classical bullwhip effect which concerns order *magnitude* amplification. Third, it proposes a GRPO-based reinforcement learning post-training framework that trains a shared LLM backbone using system-level supply chain rewards, substantially improving reliability.

The agent bullwhip concept is the paper's most novel intellectual contribution. While the classical bullwhip effect is well-studied, the observation that stochastic LLM policies introduce a *second* layer of amplification—where decision variance (not just order levels) compounds upstream and over time—is a genuinely new insight with implications beyond supply chains to any multi-agent coordination problem with delays.
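The distinction between decision variance and order magnitude can be illustrated with a toy simulation. This is a minimal sketch under stated assumptions, not the paper's model: the names `simulate_orders`, `run_to_run_variance`, `gain`, and `noise_sd` are hypothetical, lead times and inventory are ignored, and each echelon simply scales its incoming order and adds its own independent Gaussian decision noise. Demand is held fixed across runs, so any run-to-run variance comes from the agents' decisions alone.

```python
import random
import statistics

def simulate_orders(demand, n_echelons=4, gain=1.2, noise_sd=1.0, rng=None):
    """One run of a toy serial chain (hypothetical model, no delays or inventory)."""
    rng = rng or random.Random()
    incoming = demand
    orders = []
    for _ in range(n_echelons):
        # Each echelon scales what it received and adds its own decision noise.
        order = gain * incoming + rng.gauss(0.0, noise_sd)
        orders.append(order)
        incoming = order  # this order becomes the next (upstream) echelon's input
    return orders

def run_to_run_variance(n_runs=2000, demand=8.0, seed=0, **kwargs):
    """Variance of each echelon's order across repeated runs with identical demand."""
    rng = random.Random(seed)
    runs = [simulate_orders(demand, rng=rng, **kwargs) for _ in range(n_runs)]
    # zip(*runs) groups the orders by echelon: one column per echelon.
    return [statistics.pvariance(column) for column in zip(*runs)]

variances = run_to_run_variance()
```

In this toy model the variance obeys the recursion Var_k = gain² · Var_{k-1} + noise_sd², so it grows at every step upstream even though customer demand never changes. That is the qualitative pattern the review describes, shown here without the delays that the paper's analysis includes.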

Methodological Rigor

The paper combines empirical, theoretical, and algorithmic approaches. The empirical evaluation spans multiple LLM families (GPT-5 mini, GPT-4o mini, Llama 4 Maverick 17B, Llama 3.3 70B, Qwen-3 4B, DeepSeek-R1), with 30 replications per configuration benchmarked against 12 Georgia Tech cohorts (100+ students). This provides reasonable statistical grounding, though the human benchmark of student teams rather than industry professionals is a limitation the authors do not adequately address.

The theoretical framework is well-constructed. The law of total variance decomposition into demand-driven (V^D) and decision-driven (V^ε) components is clean and insightful. Theorems 1 and 2, showing exponential growth of both components through the same transfer function H_k(L), elegantly formalize the intuition. Proposition 3 addresses intertemporal accumulation. However, the analysis relies on a linear benchmark model (Assumption 1: no order truncation), and while simulations under operational constraints are mentioned, the gap between the linear theory and the nonlinear reality is acknowledged but not fully bridged in the main text.
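The decomposition referred to above rests on the law of total variance. As a sketch, assume O_k denotes the order placed at echelon k and D the realized demand history. This notation, and the assignment of each term to V^D and V^ε, are assumptions made here for illustration and may differ from the paper's definitions.

```latex
\operatorname{Var}(O_k)
  = \underbrace{\operatorname{Var}\!\big(\mathbb{E}[\,O_k \mid D\,]\big)}_{V^{D}_k \ \text{(demand-driven)}}
  + \underbrace{\mathbb{E}\!\big[\operatorname{Var}(O_k \mid D)\big]}_{V^{\varepsilon}_k \ \text{(decision-driven)}}
```

Under this reading, the first term captures how orders vary because demand varies, and the second captures how orders vary across runs even when demand is held fixed, which is the component associated with agent bullwhip.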

The GRPO post-training framework is clearly described, with system-level vs. agent-level rewards, episode-level vs. rollout attribution, and a demand curriculum. However, the evaluation is limited: post-training is demonstrated only on Qwen-3 4B, and comparisons with other RL methods (PPO, DPO) or classical inventory policies (base-stock) are absent. The drop in the coefficient of variation (CV) from 26% to 13% and in average cost from 1,585 to 952 (roughly 40% lower) is meaningful, but the paper would benefit from ablation studies on reward design choices and training curricula.
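GRPO's defining step is to score each rollout relative to the other rollouts in its group rather than against a learned value function. The sketch below shows only that standardization step. The function name `group_relative_advantages`, the example costs, and the choice of negative total cost as the system-level reward are assumptions for illustration, not details confirmed by the paper.

```python
import statistics

def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within one group of rollouts: (r - mean) / std."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    # eps guards against division by zero when all rewards in the group are equal.
    return [(r - mean) / (std + eps) for r in rewards]

# Hypothetical total supply-chain costs for four rollouts of the same scenario.
episode_costs = [900.0, 1500.0, 1100.0, 2500.0]

# Assumed system-level reward: lower total cost means higher reward.
advantages = group_relative_advantages([-c for c in episode_costs])
```

The cheapest rollout receives the largest positive advantage and the costliest the most negative one. Because the advantages are centered within the group, a rollout is rewarded only for beating its peers on the same scenario.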

Potential Impact

Practical relevance: The paper addresses a real and timely concern for firms considering autonomous supply chain operations. The finding that average performance masks reliability risk is operationally critical—procurement and logistics decisions carry financial commitments, and high-variance agents are undeployable regardless of mean performance.

Conceptual contribution: The agent bullwhip framework extends beyond supply chains to any multi-agent system with coordination delays—manufacturing networks, financial markets with intermediaries, distributed computing systems. The insight that stochastic AI policies create a fundamentally new amplification channel is broadly applicable.

Methodological contribution: The GRPO-based centralized-training/decentralized-execution paradigm for LLM agents in operational settings could influence how firms deploy multi-agent LLM systems more generally. The demonstration that inference-time fixes (repeated sampling/majority voting) are insufficient, while post-training succeeds, provides actionable guidance.

Limitations of impact: The Beer Game, while canonical, is a highly simplified supply chain (serial, four echelons, single product, deterministic lead times). Real supply chains involve network structures, multiple products, stochastic lead times, capacity constraints, and strategic interactions. The paper does not discuss how findings generalize beyond this setting.

Timeliness & Relevance

The paper is exceptionally well-timed. Firms are actively exploring LLM-based autonomous operations, and the gap between "impressive average performance" and "production-ready reliability" is exactly the issue practitioners face. The paper appeared alongside a Harvard Business Review companion piece, positioning it for both academic and practitioner audiences. The use of frontier models (GPT-5 mini, Llama 4 Maverick) ensures relevance to the current generation of capabilities.

Strengths

1. Novel conceptual framework: Agent bullwhip is a well-defined, theoretically grounded concept that fills a genuine gap in understanding multi-agent LLM systems.

2. Theory-practice integration: The mathematical framework directly explains empirical observations (Figures 2-3) and motivates the algorithmic solution (GRPO).

3. Practical actionability: The four inference-time levers provide a clear deployment playbook, while the post-training framework addresses the fundamental reliability limitation.

4. Negative result on repeated sampling: Demonstrating that majority voting over 10 and 100 samples fails to reduce agent bullwhip is valuable, as it redirects effort toward post-training rather than inference-time patches.

5. Comprehensive benchmarking: Multiple models, configurations, and human baselines provide a rich comparative picture.
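The repeated-sampling baseline mentioned in item 4 can be sketched as a majority vote over several draws from a stochastic policy. This shows only the aggregation mechanism. The names `majority_vote` and `noisy_policy` are hypothetical, the stub policy stands in for an LLM call, and the paper's exact voting protocol may differ.

```python
import random
from collections import Counter

def majority_vote(sample_order, n_samples=10):
    """Query a stochastic policy n_samples times and return the most common order."""
    votes = Counter(sample_order() for _ in range(n_samples))
    # most_common(1) returns [(value, count)]; ties go to the value seen first.
    return votes.most_common(1)[0][0]

# Hypothetical stand-in for an LLM agent: orders 8 units 60% of the time.
rng = random.Random(0)

def noisy_policy():
    return rng.choice([8, 8, 8, 12, 4])

order = majority_vote(noisy_policy, n_samples=100)
```

Voting stabilizes a single decision when one answer clearly dominates, as in this stub. The review's point is that this per-decision stabilization did not meaningfully reduce agent bullwhip across echelons and over time.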

Limitations

1. Simplified environment: The Beer Game's serial structure, single product, and deterministic lead times limit external validity.

2. Human benchmark: Student teams are an imperfect proxy for industry professionals; experienced supply chain managers likely perform significantly better.

3. Limited post-training evaluation: Only one base model (Qwen-3 4B) is post-trained; no comparison with classical control policies or other RL methods.

4. No cost-of-compute analysis: The GRPO training cost relative to inference-time alternatives is not discussed.

5. Linear theory, nonlinear practice: The exponential growth results hold for the linear benchmark; the paper acknowledges but does not fully characterize behavior under operational constraints.

6. Reproducibility: While the setup is described in detail, the reliance on proprietary models (GPT-5 mini) limits full reproducibility.

Overall Assessment

This is a strong, timely paper that introduces a genuinely novel concept (agent bullwhip), provides both theoretical and empirical support, and offers a practical solution via GRPO post-training. The combination of rigorous theory, systematic experimentation, and practical relevance is compelling. The main weaknesses are the simplified testbed and incomplete post-training evaluation. Despite these, the paper makes a significant contribution to understanding the reliability challenges of autonomous multi-agent LLM systems and will likely influence both the academic literature on LLM agents and practitioner approaches to autonomous supply chain deployment.

Rating: 7.5 / 10

Significance 8 · Rigor 7 · Novelty 8 · Clarity 8.5

Generated May 19, 2026

Comparison History (23)

vs. Learn-by-Wire Training Control Governance: Bounded Autonomous Training Under Stress for Stability and Efficiency

gpt-5.2 · 5/20/2026

Paper 2 has higher impact potential due to stronger cross-domain relevance (LLM agents + operations research), clearer real-world applicability (supply-chain decision-making), and a more generalizable conceptual contribution (agent bullwhip effect with a mathematical framework). It also proposes and validates a mitigation approach (GRPO post-training) aimed at reliability/tail-risk, a timely concern for deploying autonomous agents. Paper 1 is practically valuable for LLM training robustness, but appears more incremental/system-specific and less broadly generalizable than Paper 2’s theory + application framing.

vs. PRISMat: Policy-Driven, Permutation-Invariant Autoregressive Material Generation

claude-opus-4.6 · 5/19/2026

Paper 2 introduces a novel concept (agent bullwhip effect) with broad implications for multi-agent AI systems beyond supply chains, provides both theoretical framework and practical solution (GRPO post-training), and addresses the timely, high-impact question of autonomous AI agent reliability. While Paper 1 makes solid contributions to materials science with impressive performance gains, Paper 2's insights about fundamental limitations of multi-agent LLM systems and its mathematical framework for understanding coordination failures have broader cross-disciplinary relevance as autonomous AI agents proliferate across industries.

vs. A2RBench: An Automatic Paradigm for Formally Verifiable Abstract Reasoning Benchmark Generation

gpt-5.2 · 5/19/2026

Paper 2 likely has higher scientific impact due to stronger real-world applicability (supply-chain decision-making), broader cross-field relevance (LLMs, multi-agent systems, operations research/control), and a clearer methodological contribution: formalizing the “agent bullwhip effect” with a mathematical framework plus an actionable mitigation via GRPO post-training demonstrated to reduce tail risks. Paper 1 is novel for scalable, formally verifiable benchmark generation and valuable for evaluation science, but its primary impact is narrower (benchmarking/measurement) and less directly tied to deployment-critical reliability outcomes.

vs. Budget-Efficient Automatic Algorithm Design via Code Graph

gpt-5.2 · 5/19/2026

Paper 1 likely has higher impact due to a more novel, system-level insight (the “agent bullwhip effect”) with a mathematical framework explaining an inherent reliability pathology in coordinated multi-agent settings, plus a concrete mitigation via GRPO post-training tied to end-to-end supply-chain rewards. The work is timely given rapid deployment of autonomous agents in operations, has strong real-world applicability (multi-echelon supply chains), and the core phenomenon/general framework could transfer to other multi-agent domains with delays (networks, robotics, markets). Paper 2 is innovative and useful, but is narrower and more incremental on LLM-driven search efficiency.

vs. The Capability Paradox: How Smarter Auditors Make Multi-Agent Systems Less Secure

gemini-3.1 · 5/19/2026

Paper 1 offers higher scientific impact due to its broad applicability in AI safety and multi-agent systems. It uncovers a counterintuitive vulnerability—the 'capability paradox'—where upgrading to smarter models actively degrades system security. Supported by rigorous large-scale testing (42,000+ trials) and multi-level mediation analysis, it identifies 'semantic hijacking' driven by linguistic certainty as a fundamental flaw. While Paper 2 provides valuable, domain-specific insights for supply chains, Paper 1's findings and proposed defense (heterogeneous ensemble verification) will fundamentally influence the general design and security of LLM architectures across all fields.

vs. Democratizing Large-Scale Re-Optimization with LLM-Guided Model Patches

gpt-5.2 · 5/19/2026

Paper 2 has higher likely impact due to broader applicability and stronger real-world grounding: it targets a common industrial pain point (continuous re-optimization of deployed OR models), provides an end-to-end, human-in-the-loop system with interpretable “patch” updates, and is validated on two large-scale real-world case studies (supply chain and exam scheduling), suggesting methodological rigor and cross-domain relevance. Paper 1 is novel (agent bullwhip + GRPO post-training) but is more specialized to multi-agent LLM control in supply-chain settings and relies on a stylized Beer Game environment, limiting immediate generalizability.

vs. ECG-WM: A Physiology-Informed ECG World Model for Clinical Intervention Simulation

gemini-3.1 · 5/19/2026

Paper 2 integrates physiological ODE priors into generative world models to simulate clinical interventions, addressing critical safety and hallucination issues in medical AI. Its potential to directly influence life-saving clinical decision support and its contribution to physics-informed machine learning grant it higher scientific significance and profound societal impact compared to Paper 1's economic application in supply chain optimization.

vs. MM-OptBench: A Solver-Grounded Benchmark for Multimodal Optimization Modeling

gemini-3.1 · 5/19/2026

Paper 2 identifies a novel phenomenon (the 'agent bullwhip effect') and provides both a mathematical framework and an RL-based post-training solution to address it. This offers profound theoretical insights and practical solutions for multi-agent coordination in complex real-world systems like supply chains. While Paper 1 introduces a valuable benchmark, Paper 2's combination of novel behavioral discovery, theoretical grounding, and algorithmic mitigation promises a broader and more transformative impact on multi-agent systems and operations research.

vs. Body-Grounded Perspective Formation and Conative Attunement in Artificial Agents

gemini-3.1 · 5/19/2026

Paper 2 addresses a highly relevant, high-impact real-world application (supply chain management) using advanced LLMs. It introduces a novel, quantifiable phenomenon ('agent bullwhip effect'), provides a mathematical framework, and offers a concrete algorithmic solution (GRPO post-training). Its immediate economic applicability and rigorous methodology give it a significantly broader and more measurable potential scientific impact compared to the theoretical, niche phenomenological approach of Paper 1.

vs. TRACE: Trajectory Correction from Cross-layer Evidence for Hallucination Reduction

gpt-5.2 · 5/19/2026

Paper 1 is more broadly impactful scientifically: it introduces a novel, training-free, inference-time algorithm for hallucination reduction that leverages cross-layer dynamics, challenging common monotonic “truthfulness-by-depth” assumptions. Its methodological contribution is general-purpose (applies across 15 models/8 families) and shows consistent gains without external data, which strengthens rigor and reproducibility. The real-world applicability spans essentially all LLM deployments where factuality matters, making its cross-field reach (NLP, interpretability, safety, systems) larger and more timely than Paper 2’s more domain-specific (supply chain) agent reliability advances.

vs. DuIVRS-2: An LLM-based Interactive Voice Response System for Large-scale POI Attribute Acquisition

gemini-3.1 · 5/19/2026

Paper 2 introduces a novel theoretical concept (the 'agent bullwhip effect') supported by a mathematical framework, addressing a fundamental reliability issue in multi-agent systems. Its proposed RL post-training solution offers broad applicability across supply chain management and multi-agent AI research. In contrast, Paper 1 is an impressive but highly specific industrial application of existing LLM techniques, lacking the theoretical depth and cross-disciplinary scientific novelty of Paper 2.

vs. SVFSearch: A Multimodal Knowledge-Intensive Benchmark for Short-Video Frame Search in the Gaming Vertical Domain

gpt-5.2 · 5/19/2026

Paper 1 is likely higher impact due to stronger conceptual novelty (agent bullwhip effect) plus a general mathematical framework for unreliability propagation in multi-agent, delayed-information systems. It goes beyond evaluation by proposing and validating a corrective training method (GRPO post-training) with large practical gains and reliability improvements, directly relevant to real-world autonomous decision-making in supply chains and other coordinated agent settings. Paper 2 is a valuable benchmark with solid methodology, but its scope is narrower (Chinese gaming short-video search) and primarily advances evaluation rather than introducing broadly transferable theory or mitigation techniques.

vs. A Global-Local Graph Attention Network for Traffic Forecasting

gemini-3.1 · 5/19/2026

Paper 1 offers higher scientific impact due to its high novelty and extreme timeliness. While Paper 2 proposes an incremental architectural improvement to graph networks for a saturated domain (traffic forecasting), Paper 1 pioneers the study of generative AI agents in multi-echelon supply chains. By identifying the novel 'agent bullwhip effect' and applying cutting-edge RL post-training (GRPO) to mitigate it, Paper 1 opens a new, highly relevant research avenue with massive real-world economic implications across autonomous multi-agent systems and operations management.

vs. Going Headless? On the Boundaries of Vertical AI Firms

gpt-5.2 · 5/19/2026

Paper 1 has higher estimated scientific impact due to a more novel technical contribution and stronger methodological rigor: it defines and formalizes the “agent bullwhip effect,” provides a mathematical framework for inherent multi-agent unreliability under delays, and proposes an empirically evaluated GRPO post-training approach that reduces tail risks and improves system-level performance. Its results have clear real-world applicability to supply chains and broader relevance to multi-agent LLM reliability and control. Paper 2 is timely and conceptually valuable but is primarily a strategy/theory piece with less empirical or formal technical validation.

vs. Learning Lifted Action Models from Traces with Minimal Information About Actions and States

claude-opus-4.6 · 5/19/2026

Paper 2 introduces a novel concept (agent bullwhip effect) with a mathematical framework, addresses a timely topic (autonomous AI agents in supply chains), and proposes a practical solution (GRPO-based post-training). It bridges LLM/AI agent research with operations management, offering broad interdisciplinary impact. Paper 1 makes solid incremental contributions to action model learning but addresses a narrower AI planning community. Paper 2's relevance to the rapidly growing field of LLM-based autonomous agents and its practical supply chain applications give it significantly higher potential impact.

vs. Discovering Ordinary Differential Equations with LLM-Based Qualitative and Quantitative Evaluation

gpt-5.2 · 5/19/2026

Paper 2 has higher potential impact due to clearer real-world applicability (supply chain decision-making), a broadly relevant new failure mode (agent bullwhip effect) with a mathematical framework, and a concrete mitigation via GRPO post-training that improves reliability—an urgent, timely issue for multi-agent LLM deployment. Its contributions generalize beyond supply chains to coordinated multi-agent systems with delays. Paper 1 is innovative in using LLM-based qualitative evaluation for ODE discovery, but its impact is more specialized to scientific ML/model discovery and may face methodological concerns about LLM subjectivity and reproducibility.

vs. Multi-Party Multi-Objective Optimization as Consensus Search: Runtime Analysis of Cross-Party Recombination

claude-opus-4.6 · 5/19/2026

Paper 1 addresses a timely, high-impact topic—autonomous AI agents in supply chain management—with broad real-world applicability. It introduces the novel concept of 'agent bullwhip effect,' provides both empirical and theoretical analysis, and proposes a practical GRPO-based solution. The intersection of LLMs and supply chain operations is highly relevant given industry adoption trends. Paper 2, while rigorous in runtime analysis for multi-party multi-objective optimization, addresses a more niche theoretical area with narrower audience and fewer immediate practical applications. Paper 1's novelty, timeliness, and cross-disciplinary relevance give it higher impact potential.

vs. From Feasible to Practical: Pareto-Optimal Synthesis Planning

gpt-5.2 · 5/19/2026

Paper 2 likely has higher impact: it reframes CASP around multi-objective, practice-aligned decision-making and introduces MORetro* with clear theoretical optimality guarantees (recovering the Pareto front given a fixed one-step model) plus strong benchmark evidence. The Pareto-front formulation is broadly applicable (chemistry, optimization, decision science) and directly targets industrial constraints (cost, sustainability, toxicity, yield), enhancing real-world adoption. Paper 1 is novel in diagnosing reliability (“agent bullwhip”) and proposing GRPO post-training, but its results are tied to a stylized Beer Game setting and may generalize less broadly than the multi-objective CASP framework.

vs. Computational Challenges in Token Economics: Bridging Economic Theory and AI System Design

gpt-5.2 · 5/19/2026

Paper 1 has higher likely scientific impact due to a concrete, novel contribution (agent bullwhip effect) with a supporting mathematical framework, empirical demonstration in a canonical testbed (Beer Game), and a tangible mitigation method (GRPO-based post-training) that improves reliability and tail risk. It combines methodological rigor with clear real-world applicability to supply-chain automation and multi-agent LLM deployment. Paper 2 is timely and potentially broad, but is primarily a conceptual agenda (trilemma framing) without validated methods or results, making near-term impact less certain.

vs. Surface-Form Neural Sparse Retrieval: Robust Fuzzy Matching for Industrial Music Search

gpt-5.2 · 5/19/2026

Paper 2 has higher potential scientific impact due to broader cross-field relevance (LLM agents, multi-agent coordination, reliability/safety, operations research), clear novelty in defining and formalizing the “agent bullwhip effect,” and proposing an RL post-training method (GRPO) targeting system-level reliability and tail-risk reduction. Its applications extend beyond supply chains to many multi-agent, delayed-information decision systems. Paper 1 is strong and rigorous for industrial-scale retrieval with clear production value, but its impact is more domain-specific and incremental relative to existing sparse/neural retrieval advances.