Context, Reasoning, and Hierarchy: A Cost-Performance Study of Compound LLM Agent Design in an Adversarial POMDP

Igor Bogdanov, Chung-Horng Lung, Thomas Kunz, Jie Gao, Adrian Taylor, Marzia Zaman

May 15, 2026

arXiv:2605.16205v1

cs.AI (primary) · cs.CL · cs.LG · cs.MA · eess.SY

#1180 of 2292 · Artificial Intelligence

Tournament Score: 1409 ± 38 (chart axis: 1050 to 1800)

Win Rate: 46%

Rating: 6.5/10 (Significance 6.5, Rigor 7.5, Novelty 6, Clarity 8)

Abstract

Deploying compound LLM agents in adversarial, partially observable sequential environments requires navigating several design dimensions: (1) what the agent sees, (2) how it reasons, and (3) how tasks are decomposed across components. Yet practitioners lack guidance on which design choices improve performance versus merely increase inference costs. We present a controlled study of compound LLM agent design in CybORG CAGE-2, a cyber defense environment modeled as a Partially Observable Markov Decision Process (POMDP). Reward is non-positive, so all configurations operate in a failure-mitigation mode. Our evaluation spans five model families, six models, and twelve configurations (3,475 episodes) with token-level cost accounting. We vary context representation (raw observations vs. a deterministic state-tracking layer with compressed history), deliberation (self-questioning, self-critique, and self-improvement tools, with optional chain-of-thought prompting), and hierarchical decomposition (monolithic ReAct vs. delegation to specialized sub-agents). We find that: (1) Programmatic state abstraction delivers the largest returns per token spent (RPTS), improving mean return by up to 76% over raw observations. (2) Distributing deliberation tools across a hierarchy degrades performance relative to hierarchy alone for all five model families, reaching up to 3.4× worse mean return while using 1.8–2.7× more tokens. We call this destructive pattern a deliberation cascade. (3) Hierarchical decomposition without deliberation achieves the best absolute performance for most models, and context engineering is generally more cost-effective than deliberation. These findings suggest a design principle for structured adversarial POMDPs: invest in programmatic infrastructure and clean task decomposition rather than deeper per-agent reasoning, as these strategies can interfere when combined.

AI Impact Assessments

(1 model)

Scientific Impact Assessment

Core Contribution

This paper presents a systematic empirical study decomposing compound LLM agent design along three axes—context engineering, deliberation depth, and hierarchical decomposition—in CybORG CAGE-2, an adversarial cyber defense POMDP. The central novelty is the identification of the deliberation cascade: distributing self-critique and self-questioning tools across a multi-agent hierarchy degrades performance for all tested models (up to 3.4× worse returns) while consuming 1.8–2.7× more tokens. This is a genuinely useful finding because it identifies a failure mode that is invisible when studying deliberation and multi-agent hierarchy independently. The paper also demonstrates that deterministic programmatic state abstraction (context engineering) is the most cost-effective lever, improving returns by up to 76% at near-zero marginal token cost.

The design principle distilled—invest in programmatic infrastructure and clean task decomposition rather than deeper per-agent reasoning—is practical and actionable for practitioners building LLM-based agents in structured sequential environments.

Methodological Rigor

The experimental design is commendably thorough. The study spans 5 model families, 6 models, 12 configurations, and 3,475 episodes (283.9M tokens), with token-level cost accounting. Key methodological strengths include:

Multi-model replication as the primary validity guard, avoiding single-model artifacts. The paper explicitly demonstrates that single-model conclusions would be contradictory (e.g., Llama suggests hierarchy hurts; Grok suggests it's essential).

Knowledge-free initialization: agents receive only a one-sentence role and action reference table, isolating architectural effects from domain engineering.

Deterministic decoding (temperature 0) and identical prompts across models reduce confounds.

Shared anchor configuration across axes enables clean cross-axis comparisons.

95% confidence intervals and paired difference tests (Appendix E) provide statistical rigor.
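To make the last point concrete, here is a minimal Python sketch of a paired difference with a 95% confidence interval. It is not the procedure from the paper's Appendix E, which is not reproduced here. It assumes episodes can be paired one-to-one across two configurations (for example by seed) and uses a normal approximation; the function names and the episode returns are made up for illustration.

```python
import math
import statistics


def mean_ci95(values):
    """Return (mean, half_width) of a normal-approximation 95% CI."""
    n = len(values)
    mean = statistics.fmean(values)
    # Sample standard deviation; needs at least two observations.
    sd = statistics.stdev(values)
    half_width = 1.96 * sd / math.sqrt(n)
    return mean, half_width


def paired_difference(returns_a, returns_b):
    """Mean per-episode difference (a - b) with a 95% CI half-width."""
    if len(returns_a) != len(returns_b):
        raise ValueError("paired samples must have equal length")
    diffs = [a - b for a, b in zip(returns_a, returns_b)]
    return mean_ci95(diffs)


# Made-up episode returns (non-positive, as in the environment described).
raw_obs = [-40.0, -35.0, -50.0, -45.0, -30.0]
state_tracked = [-22.0, -15.0, -28.0, -25.0, -10.0]

mean_diff, half_width = paired_difference(state_tracked, raw_obs)
# The interval excludes zero when the mean difference exceeds the half-width.
excludes_zero = abs(mean_diff) > half_width
```

With sample sizes as small as the 25 episodes mentioned below, a Student t critical value would be somewhat larger than 1.96, so the normal approximation here slightly understates the interval width.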

However, there are limitations. The cumulative activation order for deliberation tools (question→critique→improve→CoT) means individual tool effects are confounded; the paper acknowledges this. Episode counts vary across models (25–100), and while qualitative conclusions are robust, some comparisons rest on modest sample sizes. The study uses a single environment with a scripted (non-adaptive) adversary and a fixed 3-agent hierarchy topology. The paper is transparent about these scope limitations, but they constrain generalizability.

One subtle concern: the "knowledge-free" claim is partially undermined by the state-tracking layer itself, which embeds domain-informed observation processing (signature-based comparison, volatile field filtering). The paper acknowledges this inductive bias, but it somewhat complicates the clean separation between "architectural" and "domain" contributions.
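As a rough illustration of what "signature-based comparison" and "volatile field filtering" could look like, the Python sketch below hashes each host's observation after dropping fields that change every step, then reports which hosts changed. This is a guess at the general technique, not the paper's implementation. The field names (`timestamp`, `pid`, `session_id`, `open_ports`), the host names, and the `StateTracker` class are all hypothetical.

```python
import hashlib
import json

# Hypothetical field names; the real observation schema is not given here.
VOLATILE_FIELDS = {"timestamp", "pid", "session_id"}


def strip_volatile(host_obs):
    """Drop fields assumed to change every step without carrying state."""
    return {k: v for k, v in host_obs.items() if k not in VOLATILE_FIELDS}


def signature(host_obs):
    """Stable hash of the non-volatile part of one host's observation."""
    canonical = json.dumps(strip_volatile(host_obs), sort_keys=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


class StateTracker:
    """Keeps one signature per host and reports which hosts changed."""

    def __init__(self):
        self._signatures = {}

    def update(self, observation):
        changed = []
        for host, host_obs in observation.items():
            sig = signature(host_obs)
            if self._signatures.get(host) != sig:
                changed.append(host)
                self._signatures[host] = sig
        return sorted(changed)


tracker = StateTracker()
step1 = {
    "host_a": {"timestamp": 1, "open_ports": [22]},
    "host_b": {"timestamp": 1, "open_ports": [80]},
}
step2 = {
    "host_a": {"timestamp": 2, "open_ports": [22]},
    "host_b": {"timestamp": 2, "open_ports": [80, 4444]},
}
first = tracker.update(step1)   # both hosts are new to the tracker
second = tracker.update(step2)  # only host_b differs once timestamps are ignored
```

The choice of which fields count as volatile is exactly the kind of domain knowledge the concern above refers to: the filter list encodes a judgment about what matters in the environment.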

Potential Impact

Practical impact is the paper's strongest suit. The findings directly inform practitioners building LLM agents for structured sequential decision-making:

1. *Context engineering first*: Build deterministic state-tracking infrastructure before investing in reasoning depth. This is low-cost, high-return advice applicable beyond cyber defense to any POMDP with structured observations (robotics, network management, logistics).

2. *Avoid distributing deliberation naively*: The deliberation cascade finding is a cautionary result with broad applicability to any compound AI system using multi-agent hierarchies with self-critique capabilities—a common design pattern in current agent frameworks (LangChain, AutoGen, CrewAI).

3. *Token-cost awareness*: The RPTS metric and Pareto frontier analysis provide a template for cost-performance evaluation that the compound AI community lacks.
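The Pareto frontier idea in the third point can be sketched in a few lines of Python. A configuration is on the frontier if no other configuration achieves a return at least as good with no more tokens, and is strictly better on one of the two. The paper's exact RPTS formula is not given in this assessment, so it is not reproduced here. The configuration names and all numbers below are illustrative only and are not taken from the paper.

```python
from typing import NamedTuple


class Config(NamedTuple):
    name: str
    mean_return: float  # higher (closer to zero) is better
    tokens: float       # mean tokens per episode, lower is better


def dominates(a, b):
    """True if a is no worse than b on both axes and strictly better on one."""
    no_worse = a.mean_return >= b.mean_return and a.tokens <= b.tokens
    strictly_better = a.mean_return > b.mean_return or a.tokens < b.tokens
    return no_worse and strictly_better


def pareto_frontier(configs):
    """Configurations not dominated by any other, sorted by token cost."""
    frontier = [
        c for c in configs
        if not any(dominates(other, c) for other in configs)
    ]
    return sorted(frontier, key=lambda c: c.tokens)


# Illustrative numbers only; not taken from the paper.
configs = [
    Config("raw_observations", -60.0, 40_000),
    Config("state_tracking", -25.0, 42_000),
    Config("hierarchy", -18.0, 90_000),
    Config("hierarchy_plus_deliberation", -55.0, 200_000),
]
frontier_names = [c.name for c in pareto_frontier(configs)]
```

In this made-up data the deliberation-heavy configuration falls off the frontier because a cheaper configuration also achieves a better return, which is the shape of result the deliberation cascade finding describes.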

Research impact is more moderate. The paper opens the question of uncertainty propagation in hierarchical LLM systems and suggests the need for inter-agent arbitration protocols (confidence gating, aggregation rules). The mechanistic analysis of "passivity amplification" (Section 5.2.3, Appendix G) provides concrete evidence for how distributed caution bias manifests.

The work connects to adjacent fields: autonomous systems design, multi-agent coordination, and inference-time compute scaling. However, it is empirical rather than theoretical—it identifies patterns but does not formalize them (e.g., no formal model of when deliberation cascades occur or how to prevent them).

Timeliness & Relevance

This paper is highly timely. The compound AI systems paradigm is rapidly maturing (2024–2025), with practitioners deploying multi-agent LLM systems in production without principled design guidance. The literature on multi-agent LLM systems has focused on topology and scaling laws while largely ignoring the interaction between internal agent configuration and hierarchy—exactly the gap this paper fills. The "deliberation cascade" concept arrives at a moment when self-critique and reflection are being enthusiastically adopted (Self-Refine, Reflexion) without sufficient understanding of their failure modes in composed systems.

The cyber defense domain adds applied relevance given growing interest in autonomous security operations, though the specific CAGE-2 environment is relatively narrow.

Strengths

1. Well-controlled ablation design with a shared anchor across axes, enabling clean attribution.

2. Multi-model validation that demonstrates both robustness of qualitative findings and the necessity of multi-model evaluation.

3. Cost-performance analysis that goes beyond accuracy to jointly optimize return and token expenditure—rare in LLM agent literature.

4. The deliberation cascade is a novel, well-documented failure mode with practical implications.

5. Exceptional reproducibility: full YAML configurations, code artifacts, and detailed appendices.

Limitations

1. Single environment: All findings are from CAGE-2. Transfer to unstructured environments, adaptive adversaries, or longer horizons is speculative.

2. Fixed hierarchy topology: Only one 3-agent hierarchy is tested; alternative decompositions might interact differently with deliberation.

3. Cumulative deliberation confound: Individual tool contributions cannot be isolated.

4. No frontier-scale models: Mid-tier models may behave differently from GPT-4o, Claude 3.5, or Gemini Ultra.

5. Descriptive rather than prescriptive: The paper identifies the deliberation cascade but offers no formal characterization or mitigation beyond "don't do this."

6. Environment simplicity: 30-step horizon, 5 actions, scripted adversary—results may not hold in more complex settings.

Overall Assessment

This is a well-executed empirical study that fills a genuine gap in the compound AI systems literature. Its primary contribution—the deliberation cascade—is novel, practically relevant, and robustly validated. The cost-performance analysis template is valuable. Impact is strongest in the practitioner community; theoretical contributions are limited. The single-environment scope and descriptive nature cap its broader scientific impact, but the findings are timely and actionable.

Rating: 6.5/10

Significance 6.5 · Rigor 7.5 · Novelty 6 · Clarity 8

Generated May 18, 2026

Comparison History (26)

vs. MetaCogAgent: A Metacognitive Multi-Agent LLM Framework with Self-Aware Task Delegation

gemini-3.1 · 5/19/2026

Paper 2 introduces a novel, metacognition-inspired framework for multi-agent LLMs with broad applicability across domains. Its architectural innovations in self-assessment and adaptive delegation are likely to inspire significant follow-up research. While Paper 1 provides a rigorous and highly practical empirical study on cost-performance trade-offs, Paper 2's fundamental methodological advancement offers higher potential for widespread scientific impact and architectural adoption in the rapidly growing field of autonomous AI agents.

vs. Visualizing the Invisible: Generative Visual Grounding Empowers Universal EEG Understanding in MLLMs

gemini-3.1 · 5/19/2026

Paper 2 proposes a highly novel cross-modal framework for brain-computer interfaces, translating EEG signals into visual proxies to leverage MLLM priors. This introduces a paradigm shift in processing physiological signals, offering broader scientific implications across neuroscience, clinical diagnostics, and multimodal AI. In contrast, Paper 1 primarily offers an empirical, though rigorous, benchmarking study of existing LLM agent architectures in a niche cyber-defense setting.

vs. Beyond the Cartesian Illusion: Testing Two-Stage Multi-Modal Theory of Mind under Perceptual Bottlenecks

claude-opus-4.6 · 5/19/2026

Paper 1 provides actionable, empirically grounded design principles for compound LLM agents in adversarial sequential decision-making—a rapidly growing area. Its large-scale controlled study (3,475 episodes, 5 model families, 12 configurations) with cost accounting offers immediately useful guidance for practitioners. The identification of 'deliberation cascades' as a failure mode is novel and broadly applicable. Paper 2 addresses an interesting niche (spatial ToM in MLLMs) but targets a narrower problem with less immediate practical breadth. Paper 1's findings on context engineering vs. deliberation tradeoffs will likely influence a wider range of LLM agent deployments.

vs. CBT-Audio: Evaluating Audio Language Models for Patient-Side Distress Intensity Estimation in CBT Session Recordings

gpt-5.2 · 5/19/2026

Paper 1 likely has higher scientific impact due to broader relevance and stronger methodological contribution. It provides a large, controlled, token-costed ablation study across multiple model families and agent design axes, yielding actionable design principles (e.g., state abstraction boosts return-per-token; “deliberation cascade” harms performance). These insights generalize to many LLM-agent deployments in partially observable/adversarial settings (cyber, robotics, operations), making it timely for agentic AI. Paper 2 is valuable and novel as a dataset/evaluation benchmark for CBT audio distress, but its scope is narrower and constrained by dataset size and domain-specific applicability.

vs. LinAlg-Bench: A Forensic Benchmark Revealing Structural Failure Modes in LLM Mathematical Reasoning

claude-opus-4.6 · 5/19/2026

LinAlg-Bench offers higher scientific impact due to its novel forensic methodology for understanding LLM failure modes in mathematical reasoning, revealing a universal structural threshold (the fabrication-to-abandonment transition at 4x4 scale) that generalizes across architectures. This finding about working memory limits has broad implications for understanding fundamental LLM capabilities. The benchmark, error taxonomy, and publicly released pipeline provide lasting community infrastructure. Paper 2, while practically useful, addresses a narrower domain (compound LLM agents in adversarial POMDPs) with findings that, though valuable for practitioners, are more incremental and context-specific.

vs. Interactive Evaluation Requires a Design Science

gpt-5.2 · 5/19/2026

Paper 1 has higher potential impact because it proposes a general evaluation paradigm for interactive/agentic AI, offering a taxonomy, design principles, and reporting standards that could reshape how many interactive benchmarks and deployments are evaluated across domains. Its breadth (applicable to tool use, multi-agent, robustness, coordination, recoverability) and timeliness (shift from static to trajectory-based systems) make it widely influential. Paper 2 is methodologically rigorous and practically useful, but its findings are more domain- and setting-specific (CybORG cyber POMDP) and thus likely narrower in cross-field impact.

vs. Divergence-Suppressing Couplings for Rectified Flow

gemini-3.1 · 5/19/2026

Paper 1 introduces a fundamental algorithmic improvement to Rectified Flow, a highly influential generative modeling framework. By addressing trajectory entanglement mathematically, its contributions can broadly impact various domains using generative models. In contrast, Paper 2 offers a valuable but more narrowly focused empirical study on LLM agent design in a specific POMDP environment. Theoretical and algorithmic advancements in foundational generative models typically yield broader, more long-lasting scientific impact across multiple disciplines.

vs. AI for Auto-Research: Roadmap & User Guide

gpt-5.2 · 5/19/2026

Paper 2 is likely higher impact because it contributes concrete, testable findings and a named failure mode (“deliberation cascade”) from a controlled, cost-accounted experimental study in an adversarial POMDP benchmark. Its results translate directly into actionable agent-design principles (state abstraction, hierarchy vs deliberation tradeoffs) relevant to real deployments in cybersecurity and sequential decision-making, and are methodologically stronger (multiple model families, many episodes, token-level cost). Paper 1 is broader and timely but is primarily a roadmap/taxonomy with less new empirical evidence.

vs. AutoRubric-T2I: Robust Rule-Based Reward Model for Text-to-Image Alignment

gpt-5.2 · 5/19/2026

Paper 2 has higher likely impact: it introduces an automated, interpretable rubric-learning framework for T2I reward modeling that dramatically reduces reliance on large human preference datasets while improving benchmark and downstream RL performance. This targets a central bottleneck in current generative image alignment and has broad applicability (reward modeling, VLM judging, RLHF/RLAIF, diffusion model training). The method is concrete (rule synthesis + l1 selection), adaptable, and timely given rapid T2I development. Paper 1 is a valuable cost-performance study but is more domain-specific and primarily provides empirical design guidance rather than a broadly reusable new alignment technique.

vs. ADR: An Agentic Detection System for Enterprise Agentic AI Security

claude-opus-4.6 · 5/19/2026

Paper 1 presents a novel, production-deployed system (ADR) addressing the emerging and critical problem of securing AI agents in enterprise environments. Its 10-month deployment at Uber scale (7,200+ hosts, 10,000+ daily sessions) demonstrates real-world impact. It introduces a new benchmark (ADR-Bench) and significantly outperforms baselines. The problem of agentic AI security is highly timely given rapid AI agent adoption. Paper 2, while methodologically sound, is a controlled empirical study in a simulated environment offering design guidelines but with narrower scope and less immediate real-world applicability.

vs. Heterogeneous Information-Bottleneck Coordination Graphs for Multi-Agent Reinforcement Learning

gemini-3.1 · 5/19/2026

Paper 2 demonstrates higher potential scientific impact due to its rigorous theoretical contributions to multi-agent reinforcement learning (MARL). While Paper 1 provides a highly relevant empirical evaluation of LLM agents, its findings are tied to current model behaviors and costs, which may age quickly. In contrast, Paper 2 introduces a fundamentally novel, mathematically grounded framework (HIBCG) that solves a recognized theoretical gap in MARL topology learning. Its proven variational bounds and principled communication capacity allocation offer lasting, generalizable advancements that will likely influence foundational MARL research across multiple domains.

vs. Multi-Party Multi-Objective Optimization as Consensus Search: Runtime Analysis of Cross-Party Recombination

gpt-5.2 · 5/19/2026

Paper 2 has higher likely impact due to timeliness and broad applicability: it provides actionable, cost-accounted design guidance for compound LLM agents in adversarial POMDPs, a rapidly growing real-world deployment setting (cyber defense). The controlled ablation across models/configurations and identification of the “deliberation cascade” is a clear, generalizable insight that can influence agent design across domains. Paper 1 is methodologically rigorous and novel in runtime theory for multi-party multi-objective optimization, but its audience and immediate applications are narrower, making near-term cross-field impact lower.

vs. Learning Bilevel Policies over Symbolic World Models for Long-Horizon Planning

claude-opus-4.6 · 5/18/2026

Paper 1 addresses a timely and practically important question about compound LLM agent design in adversarial settings, providing actionable design principles backed by rigorous empirical evaluation across multiple models and configurations. The identification of 'deliberation cascades' as a destructive pattern and the finding that programmatic infrastructure outperforms deeper reasoning offers immediately useful guidance for the rapidly growing LLM agent community. Paper 2 presents solid work on bilevel planning but combines relatively established ideas (symbolic planning + imitation learning) in a more incremental fashion. Paper 1's broader relevance to the booming LLM agent ecosystem gives it higher potential impact.

vs. Beyond Individual Intelligence: Surveying Collaboration, Failure Attribution, and Self-Evolution in LLM-based Multi-Agent Systems

claude-opus-4.6 · 5/18/2026

Paper 1 presents novel empirical findings with actionable design principles (e.g., 'deliberation cascade' phenomenon, programmatic state abstraction over deeper reasoning) from a rigorous controlled study with cost accounting across multiple models and configurations. These concrete, counterintuitive findings directly guide practitioners building compound LLM agents. Paper 2 is a survey that synthesizes existing work into a useful framework (LIFE progression), but surveys generally have less direct scientific impact than papers introducing new empirical discoveries. Paper 1's findings on how deliberation tools can degrade hierarchical agent performance are novel and practically significant.

vs. MEMOREPAIR: Barrier-First Cascade Repair in Agentic Memory

gemini-3.1 · 5/18/2026

Paper 1 offers a rigorously formalized, algorithmic solution (reducing to s-t min-cut) to a fundamental infrastructure problem in agentic systems (memory staleness and cascade updates). Its mathematical guarantees and broad applicability across agent architectures give it higher potential for long-term scientific impact compared to Paper 2, which is an empirical ablation study providing design heuristics tied to current model behaviors in a specific POMDP environment.

vs. SimPersona: Learning Discrete Buyer Personas from Raw Clickstreams for Grounded E-Commerce Agents

gpt-5.2 · 5/18/2026

Paper 1 is more methodologically innovative and likely higher impact: it introduces a new learnable, discrete persona-token mechanism grounded in large-scale real clickstream data, enabling scalable personalization and population-level simulation without prompt engineering. The evaluation is extensive (8.37M buyers, 42 live storefronts) with clear real-world deployment relevance (e-commerce agents, merchant-specific distributions) and an open-source pipeline that can catalyze follow-on work. Paper 2 is timely and rigorous as a controlled cost-performance study, but is primarily diagnostic/ablation-focused with narrower domain scope and less algorithmic novelty.

vs. EmbodiSkill: Skill-Aware Reflection for Self-Evolving Embodied Agents

claude-opus-4.6 · 5/18/2026

Paper 1 provides broadly applicable design principles for compound LLM agents in adversarial POMDPs, backed by a rigorous controlled study across multiple models and configurations with cost accounting. Its findings—that programmatic state abstraction outperforms deliberation, and that combining deliberation with hierarchy causes 'deliberation cascades'—offer actionable, generalizable guidance for the growing field of LLM agent design. Paper 2 presents a solid contribution with EmbodiSkill's skill-aware reflection, but is more narrowly scoped to embodied environments and specific benchmarks. Paper 1's systematic methodology and counterintuitive findings have broader implications for agent architecture design.

vs. Abductive Reasoning with Probabilistic Commonsense

claude-opus-4.6 · 5/18/2026

Paper 1 presents a comprehensive, controlled empirical study addressing a timely and practical problem—designing compound LLM agents for adversarial sequential decision-making. Its findings (e.g., deliberation cascades, cost-performance tradeoffs, programmatic state abstraction dominance) provide actionable design principles for the rapidly growing field of LLM-based agents. The scale of evaluation (5 model families, 12 configurations, 3,475 episodes with cost accounting) adds rigor. Paper 2 contributes a novel probabilistic abductive reasoning framework, but its scope is narrower (commonsense reasoning benchmarks). Paper 1's practical relevance to agent system design gives it broader and more immediate impact.

vs. Coding Agent Is Good As World Simulator

claude-opus-4.6 · 5/18/2026

Paper 2 introduces a novel paradigm shift by using coding agents as world simulators, addressing fundamental limitations of video-based world models (lack of physical plausibility). It has broader impact across robotics, autonomous driving, and embodied AI, offering a generalizable framework. Paper 1, while methodologically rigorous and practically useful, is a controlled empirical study within a specific cyber defense environment (CAGE-2), yielding design guidelines that are valuable but narrower in scope. Paper 2's approach of generating executable simulation code for physically grounded world models is more innovative and has wider applicability across multiple fields.

vs. PAGER: Bridging the Semantic-Execution Gap in Point-Precise Geometric GUI Control

claude-opus-4.6 · 5/18/2026

Paper 2 introduces a novel problem formulation (precision-sensitive GUI tasks), a new benchmark (PAGE Bench with 224K+ actions), and a new method (PAGER) achieving 4.1x improvement over baselines. It identifies and addresses a fundamental 'Semantic-Execution Gap' with broad implications for GUI agents, geometric reasoning, and VLM capabilities. Paper 1 provides useful empirical design guidance for compound LLM agents in adversarial POMDPs but is more of a controlled ablation study within a specific environment (CybORG CAGE-2) with narrower applicability. Paper 2's contributions—benchmark, method, and conceptual framework—have broader potential impact across multiple research communities.