POIROT: Interrogating Agents for Failure Detection in Multi-Agent Systems

Iñaki Dellibarda Varela, R. Sendra-Arranz, Pablo Romero-Sorozabal, J. M. Valverde-García, Annemarie F. Laudanski, Álvaro Gutiérrez, Eduardo Rocon, Manuel Cebrian

Jun 1, 2026

arXiv:2606.02282v1

cs.AI (primary)

#906 of 3355 · Artificial Intelligence

Tournament Score

1452±45

Win Rate: 71%

Rating

6.8 / 10

Significance 7 · Rigor 6 · Novelty 7.5 · Clarity 7.5

Abstract

Orchestrating Large Language Models into Multi-Agent Systems (LLM-MAS) has unlocked remarkable reasoning capabilities, yet emergent failures and hallucinations that resist characterisation block their deployment in safety-critical domains -- a gap made legally untenable by emerging AI regulation. Existing evaluation paradigms share a common flaw: centralised judgment creates single points of failure and demands domain-specific expertise. Here we present POIROT, a protocol that repurposes a system's own agents as its diagnostic layer, leveraging the epistemic diversity already present in the architecture. Across evaluated settings, POIROT outperforms single-LLM evaluator baselines, with gains that scale with problem complexity (OR = 1.60, $p = 0.008$), agent count, and fault dimensionality, persisting under compound fault conditions. These results demonstrate that safety oversight need not be externalised: the agents executing a role carry sufficient collective intelligence to audit it. We release POIROT as an open-source library alongside BLAME, a benchmark for fault attribution in safety-critical multi-agent systems.

AI Impact Assessments

(1 model)

Scientific Impact Assessment: POIROT

1. Core Contribution

POIROT introduces a decentralized peer-interrogation protocol for failure detection and fault attribution in LLM-based Multi-Agent Systems. Rather than relying on an external judge (human or LLM), POIROT repurposes the system's own agents as diagnosticians through a five-phase protocol: hazard space construction, self-assessment, peer interrogation, private voting, and hazard-aware weighted aggregation. The key conceptual insight is that epistemic diversity already present in a multi-agent architecture can be leveraged for self-auditing, eliminating the need for domain-specific external evaluators.

The paper also introduces BLAME, a benchmark for fault attribution in safety-critical multi-agent systems spanning medical rehabilitation (CORTEX) and algorithmic trading (TradingAgents), with controlled fault injection across up to 15 hazard dimensions and 12 agents.

2. Methodological Rigor

Strengths in experimental design:

Evaluation spans three benchmarks of increasing complexity (Who&When, CORTEX, TradingAgents), providing a reasonable progression from simple to complex scenarios.

Four backbone LLMs are tested (two proprietary, two open-weight), reducing model-specific confounds.

Statistical analysis includes logistic regression with interaction terms (OR = 1.60, p = 0.008 for the complexity interaction), lending credibility to the scaling claims.

The hazard-aware aggregation mechanism using Hamming distance-based weighting is formally specified and principled.
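To make the idea of hazard-aware, Hamming-weighted aggregation concrete, here is a minimal sketch. It is not the authors' implementation: the paper's exact weighting formula is not reproduced in this assessment. The sketch assumes each agent casts a binary vote vector over hazard dimensions and that a ballot's weight decays exponentially with its mean Hamming distance to the other ballots. The names `aggregate` and `sharpness` are hypothetical.

```python
import math
from typing import List


def hamming(a: List[int], b: List[int]) -> int:
    """Count the hazard dimensions on which two binary ballots disagree."""
    return sum(x != y for x, y in zip(a, b))


def aggregate(votes: List[List[int]], sharpness: float = 1.0) -> List[int]:
    """Weighted majority over hazard dimensions.

    Assumption (illustrative only): a ballot's weight is
    exp(-sharpness * mean Hamming distance to the other ballots),
    so outlier ballots count for less.
    """
    weights = []
    for i, ballot in enumerate(votes):
        dists = [hamming(ballot, other) for j, other in enumerate(votes) if j != i]
        mean_dist = sum(dists) / len(dists) if dists else 0.0
        weights.append(math.exp(-sharpness * mean_dist))

    total = sum(weights)
    consensus = []
    for d in range(len(votes[0])):
        support = sum(w for w, ballot in zip(weights, votes) if ballot[d] == 1)
        # A hazard is flagged only if it holds more than half the total weight.
        consensus.append(1 if support > total / 2 else 0)
    return consensus


# Four hypothetical agents voting over three hazard dimensions.
votes = [
    [1, 0, 0],
    [1, 0, 0],
    [1, 1, 0],
    [0, 1, 1],
]
consensus = aggregate(votes)
```

In this toy case the fourth ballot disagrees with the others on most dimensions, so it is down-weighted and the consensus flags only the first hazard.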

Methodological concerns:

Sample sizes are relatively small, particularly for Gemini 2.5 Pro (6 trials per scenario in CORTEX, 4 in TradingAgents). This limits statistical power and makes per-scenario conclusions fragile.

The complexity metric (average reasoning tokens across models) is an indirect proxy. While defensible, it conflates model-specific verbosity with genuine problem difficulty.

The baseline comparison is limited to a single-LLM evaluator in "all-at-once" mode. More competitive baselines—such as multi-judge panels, chain-of-verification approaches, or debate-style protocols—would strengthen the comparative claims.

Compound fault accuracy is notably low across all conditions (best: 28.8%), raising questions about practical utility in the most realistic failure scenarios where multiple faults co-occur.

GPT-oss 120B shows POIROT underperforming the baseline overall on CORTEX (Δ = −1.3pp), and several per-scenario results show large inversions (Δ = −100pp for Parent+Sensor), which the authors acknowledge but do not fully explain. This inconsistency undermines the generality claim.
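The small-sample concern can be quantified with a confidence interval. The sketch below computes a 95% Wilson score interval for a binomial proportion. The counts used (4 correct out of 6 trials, and 40 out of 60) are hypothetical illustrations, not results from the paper.

```python
import math
from typing import Tuple


def wilson_interval(successes: int, trials: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion (z = 1.96 gives ~95%)."""
    p = successes / trials
    denom = 1 + z ** 2 / trials
    centre = (p + z ** 2 / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z ** 2 / (4 * trials ** 2)) / denom
    return centre - half, centre + half


# Hypothetical counts: the same observed accuracy at two sample sizes.
small = wilson_interval(4, 6)    # six trials, as in the smallest per-scenario cells
large = wilson_interval(40, 60)  # ten times as many trials
```

With six trials the interval runs from roughly 0.30 to 0.90, so a per-scenario accuracy estimate is compatible with almost any conclusion. Ten times as many trials narrows it to roughly 0.54 to 0.77.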

3. Potential Impact

Practical applications:

The work addresses a genuine deployment bottleneck: how to evaluate and audit LLM-MAS without expensive human oversight or brittle domain-specific judges. If the approach generalizes, it could enable regulatory compliance (EU AI Act) for high-risk AI systems by providing built-in traceability and fault attribution.

Broader influence:

The idea of "agents auditing themselves" connects to broader themes in AI safety, collective intelligence, and distributed systems. It could inspire analogous approaches in robotics, autonomous vehicles, or critical infrastructure monitoring.

The BLAME benchmark fills a gap—most MAS benchmarks focus on task performance rather than failure attribution. If adopted by the community, it could catalyze a subfield of fault-oriented MAS evaluation.

The open-source release (pip-installable library) lowers adoption barriers significantly.

Limitations on impact:

The absolute accuracy numbers, especially for compound faults, suggest the approach is not yet reliable enough for true safety-critical deployment where near-perfect attribution is needed.

The protocol adds computational overhead (multiple LLM calls per agent for interrogation), which may limit scalability in cost-sensitive or latency-sensitive applications.
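A rough call-count model illustrates the overhead. The sketch rests on assumptions the paper may not share: one self-assessment call per agent, one interrogation call per ordered pair of agents per round, and one voting call per agent. The function name and the call pattern are hypothetical.

```python
def poirot_calls(n_agents: int, rounds: int = 1) -> int:
    """Estimated LLM calls for one audit under the assumed call pattern."""
    self_assessment = n_agents
    # Assumption: every agent questions every other agent once per round.
    interrogation = rounds * n_agents * (n_agents - 1)
    voting = n_agents
    return self_assessment + interrogation + voting


BASELINE_CALLS = 1  # a single-LLM evaluator in "all-at-once" mode

calls_by_size = {n: poirot_calls(n) for n in (3, 6, 12)}
```

Under these assumptions the cost grows quadratically with agent count: 12 calls for 3 agents, 42 for 6, and 156 for the 12-agent setting, against a single call for the baseline.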

4. Timeliness & Relevance

The paper is highly timely. The EU AI Act's requirements for traceability and human oversight in high-risk AI systems create immediate demand for auditing tools. Simultaneously, LLM-MAS adoption is accelerating while evaluation methodology lags behind. The paper explicitly bridges this gap and positions POIROT within the regulatory landscape, which strengthens its relevance.

The observation that no single model dominates across fault types is a practically important finding that resonates with the growing recognition that LLM capabilities are highly task-dependent.

5. Strengths & Limitations

Key strengths:

1. Novel framing: Treating fault attribution as a distributed consensus problem over a structured hazard space is conceptually elegant and operationally grounded.

2. Scaling with complexity: The statistically significant interaction showing POIROT's advantage grows with problem difficulty is the paper's strongest empirical claim—it suggests the protocol becomes more valuable precisely where it is most needed.

3. Per-agent analysis: Demonstrating that the aggregate consistently exceeds the best individual agent (Fig. 4) directly quantifies the value of the collective mechanism over self-evaluation.

4. Reproducibility: Open-source code, benchmark, and detailed prompt documentation in supplementary materials set a high standard.

5. Two distinct safety-critical domains (healthcare + finance) demonstrate breadth.
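The complexity interaction (OR = 1.60) can be read as a multiplier on the odds ratio of POIROT versus the baseline for each unit increase in complexity. The sketch below illustrates that reading. It assumes a logistic model with a method-by-complexity interaction term. The unit of complexity, the odds ratio at zero complexity (`base_or`), and the 50% baseline accuracy are hypothetical, since the fitted coefficients are not given in this assessment.

```python
INTERACTION_OR = 1.60  # interaction odds ratio reported in the paper


def method_odds_ratio(complexity: float, base_or: float = 1.0) -> float:
    """Odds ratio of POIROT vs. baseline at a given complexity level.

    base_or is a hypothetical odds ratio at complexity 0.
    """
    return base_or * INTERACTION_OR ** complexity


def accuracy_from_odds_ratio(baseline_acc: float, odds_ratio: float) -> float:
    """Convert a baseline accuracy and an odds ratio into the implied accuracy."""
    odds = baseline_acc / (1 - baseline_acc) * odds_ratio
    return odds / (1 + odds)


or_at_two_units = method_odds_ratio(2.0)                 # 1.60 squared
implied_acc = accuracy_from_odds_ratio(0.5, INTERACTION_OR)
```

For example, if the baseline were correct half the time, an odds ratio of 1.60 would correspond to about 61.5% accuracy, and two units of complexity would compound the ratio to 2.56.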

Notable weaknesses:

1. Weak baselines: Only a single-LLM all-at-once baseline is compared. Multi-judge ensembles, debate protocols, or even simple majority voting across independent LLM calls would be more informative comparisons.

2. Low compound fault performance: The best compound fault accuracy of 28.8% is concerning for safety-critical claims. The paper's framing as a "step towards self-auditing architectures" is appropriate, but the gap to practical reliability is large.

3. Model-dependent effectiveness: The protocol appears ineffective with lower-capability models (GPT-oss 20B/120B show minimal or negative gains in several settings), limiting generality.

4. Sequential agent execution: Agents are evaluated in a fixed order, introducing potential ordering effects that are not analyzed.

5. Injected faults only: All faults are researcher-designed and injected. Real-world emergent failures may be qualitatively different and harder to attribute.

6. No cost analysis: The computational cost of running POIROT (multiple interrogation rounds across all agents) versus a single evaluator call is not reported, which is critical for practical deployment decisions.
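The simple majority-voting baseline suggested under weakness 1 could look like the sketch below. `judge` is a hypothetical callable standing in for one independent LLM evaluator call. Here it is replaced by a deterministic stub so the example runs without model access.

```python
from collections import Counter
from typing import Callable


def majority_vote(judge: Callable[[str, int], str], trace: str, n_calls: int = 5) -> str:
    """Query the judge n_calls times and return the most common attribution."""
    labels = [judge(trace, seed) for seed in range(n_calls)]
    return Counter(labels).most_common(1)[0][0]


def stub_judge(trace: str, seed: int) -> str:
    """Stand-in for an LLM call: blames 'planner' on every third seed, else 'sensor'."""
    return "planner" if seed % 3 == 0 else "sensor"


verdict = majority_vote(stub_judge, "example execution trace")
```

A baseline of this kind would separate the benefit of peer interrogation from the benefit of merely sampling several judgments.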

Additional Observations

The hazard-aware aggregation mechanism, while principled, uses a relatively simple Hamming distance weighting. More sophisticated aggregation (e.g., learned weights, attention-based mechanisms) could potentially improve performance. The authors' finding that reputation-based β updates yield negligible improvement is interesting but may reflect the limited evaluation horizon rather than a fundamental property.

The paper's dedication to the late Manuel Cebrian adds a human dimension, and the work clearly represents a substantial collaborative effort bridging robotics, AI safety, and computational social science.

Rating: 6.8 / 10

Significance 7 · Rigor 6 · Novelty 7.5 · Clarity 7.5

Generated Jun 2, 2026

Comparison History (24)

vs. BehaviorBench: Modeling Real-World User Decisions from Behavioral Traces

claude-opus-4.6 · 6/3/2026

POIROT addresses a more fundamental and broadly applicable problem—safety oversight in multi-agent LLM systems—which is critical for deployment across many domains. Its novel approach of using agents as their own diagnostic layer is conceptually innovative and has immediate implications for AI safety regulation. The release of both a library and benchmark (BLAME) increases practical impact. BehaviorBench, while solid, targets a narrower niche (personalized decision modeling from blockchain/prediction-market data) with more incremental contributions. POIROT's relevance to AI safety and regulation gives it stronger timeliness and broader cross-field impact.

vs. What Benchmarks Don't Measure: The Case for Evaluating Abstention Competence in Autonomous Agents

gpt-5.2 · 6/3/2026

Paper 2 targets a foundational blind spot in agent evaluation—whether an agent should act at all—introducing a general taxonomy (specification/verification/authority gaps) and concrete, reusable metrics/protocols for abstention competence. This reframes benchmark design in a way likely to influence multiple subfields (alignment, RLHF, agent safety, evaluation, enterprise deployment) and is timely given real-world autonomy and governance needs. Paper 1 is practical and open-sourced, but is more specific to multi-agent failure detection and may have narrower cross-domain influence than abstention-aware evaluation frameworks.

vs. Imaginative Perception Tokens Enhance Spatial Reasoning in Multimodal Language Models

gpt-5.2 · 6/3/2026

Paper 2 likely has higher impact: it targets a broad, timely problem (reliability/safety of LLM multi-agent systems) with clear real-world relevance under emerging regulation, and provides an actionable protocol plus open-source tooling and a benchmark (POIROT + BLAME), enabling rapid adoption and follow-on work across domains. Its empirical claims include statistical significance and scaling analyses, suggesting stronger methodological rigor. Paper 1 is novel for spatial reasoning supervision in VLMs and contributes datasets, but its impact is narrower to multimodal spatial tasks and dependent on a specific training paradigm/backbone.

vs. Advanced Mathematics Learning Behavior Prediction and Academic Early Warning Model Based on Multimodal Data Analysis

claude-opus-4.6 · 6/2/2026

POIROT addresses a critical and timely problem—failure detection in multi-agent LLM systems—which is highly relevant given the rapid deployment of such systems and emerging AI regulation. Its novel approach of using agents to audit themselves is innovative and broadly applicable across safety-critical domains. The paper offers open-source tools (POIROT library and BLAME benchmark) that can catalyze further research. Paper 1, while useful, addresses a narrower educational domain with incremental methodological contributions combining known techniques (knowledge graphs, attention mechanisms, temporal modeling). Paper 2's broader impact across AI safety, regulation, and multi-agent systems gives it significantly higher potential.

vs. Emergent Ordinal Geometry in Transformers Trained on Local Comparisons

gpt-5.2 · 6/2/2026

Paper 2 has higher estimated impact due to strong real-world applicability and timeliness: it targets safety, reliability, and regulatory needs for LLM-based multi-agent systems. It offers a concrete, deployable protocol plus open-source tooling and a new benchmark, enabling broad uptake and follow-on work. Its claims are framed with measurable improvements and scaling trends across conditions, suggesting methodological rigor and practical relevance. Paper 1 is novel and scientifically interesting (mechanistic link between transitive inference and embedding geometry) but is more specialized and likely to have narrower immediate downstream adoption.

vs. LLM-Evolved Pattern Generators for Optimal Classical Planning

gpt-5.2 · 6/2/2026

Paper 2 has higher potential impact: it introduces a first-of-its-kind approach to learn domain-dependent heuristics that are admissible by design, directly advancing optimal classical planning with preserved A* guarantees—an enduring, broadly used formal setting. The method is technically rigorous (program synthesis + admissible combination via saturated cost partitioning), yields interpretable artifacts, and promises practical speedups with negligible test-time overhead. Its contribution generalizes across planning domains and connects LLMs to principled algorithm design. Paper 1 is timely for LLM safety, but its self-auditing premise may face robustness and external-validity limits in safety-critical oversight.

vs. SMH-Bench: Benchmarking LLM Agents for Environment-Grounded Reasoning and Action in Smart Homes

claude-opus-4.6 · 6/2/2026

POIROT addresses a more fundamental and broadly applicable challenge—failure detection and safety oversight in multi-agent LLM systems—which is relevant across all domains deploying LLM-MAS, especially safety-critical ones. Its novel insight that agents can self-audit without external evaluators is innovative and paradigm-shifting. The connection to AI regulation adds timeliness. While SMH-Bench is a solid benchmark contribution, it targets a narrower application domain (smart homes) and is more incremental. POIROT's broader applicability, stronger novelty, and alignment with urgent AI safety needs give it higher impact potential.

vs. Tracking the Behavioral Trajectories of Adapting Agents

claude-opus-4.6 · 6/2/2026

POIROT addresses a more fundamental and broadly applicable problem—failure detection in multi-agent LLM systems—with a novel protocol that repurposes agents as their own diagnostic layer. It demonstrates scalability across multiple dimensions (complexity, agent count, fault types) with rigorous statistical evidence. The release of both an open-source library and a benchmark (BLAME) increases adoption potential. Paper 2, while presenting a creative trait-tracking methodology, addresses a narrower problem (monitoring skill file edits) with a smaller evaluation scope (68 labeled pairs) and more limited generalizability. POIROT's relevance to AI safety regulation gives it broader and more timely impact.

vs. Science Earth: Towards A Planet-Scale Operating System for AI-Native Scientific Discovery

gpt-5.2 · 6/2/2026

Paper 2 is more ambitious and potentially transformative: a planet-scale runtime/protocol for connecting heterogeneous scientific capabilities (simulations, wet labs, proof engines) could broadly reshape AI-native discovery across many fields. Its cross-domain applicability and timeliness (AI-for-science infrastructure) suggest high upside impact if adopted. Paper 1 is novel and rigorous for LLM multi-agent safety oversight with benchmarking and statistics, but its scope is narrower (evaluation/failure detection in LLM-MAS). Paper 2’s methodological rigor is less benchmarked, yet its breadth and real-world integration potential are larger.

vs. Evaluation of Baseline Methods for IDD-based SSD External Memory Search

gpt-5.2 · 6/2/2026

Paper 1 is more novel and timely: it targets emergent failure detection in LLM multi-agent systems, a rapidly growing area with direct relevance to safety-critical deployment and AI regulation. The approach (using agents as a distributed diagnostic layer) is innovative, broadly applicable across many MAS settings, and supported by quantitative results plus open-source release and a new benchmark, which can catalyze follow-on work. Paper 2 is methodologically useful but primarily an incremental systems evaluation of baseline external-memory search methods, with narrower cross-field impact and less urgency.

vs. OctoT2I: A Self-Evolving Agentic Text-to-Image Router

gpt-5.2 · 6/2/2026

Paper 2 (POIROT) has higher estimated impact due to broader cross-domain relevance and timeliness: failure detection and auditing for multi-agent LLM systems directly addresses deployment blockers in safety-critical settings and aligns with emerging AI regulation. Methodologically, it provides a general protocol, quantitative evidence (effect size, significance), robustness analyses (scaling with complexity, agent count, compound faults), and releases both an open-source library and a benchmark, enabling adoption and follow-on research. Paper 1 is innovative but more niche (T2I routing) and primarily optimizes efficiency within a specific application area.

vs. VESTA: Visual Exploration with Statistical Tool Agents

gpt-5.2 · 6/2/2026

Paper 2 (VESTA) likely has higher impact: it targets a core, cross-domain bottleneck (automating statistical modeling) with a broadly applicable framework combining VLMs, dynamic tool creation, and rigorous diagnostics. Its benchmark (DAWN) spans synthetic-to-real tasks including astronomy, increasing relevance and adoption potential across sciences. The approach extends beyond evaluation/oversight to directly accelerate scientific workflows, offering clearer real-world utility. Paper 1 is timely and important for LLM-MAS safety, but its impact is narrower to agent reliability/oversight compared to VESTA’s broader scientific workflow automation.

vs. TriAlign: Towards Universal Truth Consistency in Personalized LLM Alignment

gemini-3.1 · 6/2/2026

Paper 2 addresses a highly urgent bottleneck in AI deployment: safety and failure detection in complex Multi-Agent Systems. Its novel decentralized approach to diagnostics, coupled with the release of an open-source library and a new benchmark, ensures immediate utility and reproducibility. While Paper 1 tackles an important fairness issue in personalized LLMs, Paper 2's focus on safety-critical domains and AI regulation gives it broader, more critical real-world applicability and foundational impact.

vs. Learning When Not to Act: Mitigating Tool Abuse in Agentic Reinforcement Learning

gemini-3.1 · 6/2/2026

Paper 1 addresses a critical bottleneck in deploying Multi-Agent Systems in safety-critical domains by proposing a highly novel, decentralized approach to failure detection. Its alignment with emerging AI regulations and focus on safety oversight give it broader, more urgent real-world implications compared to Paper 2, which focuses primarily on efficiency optimizations in tool use.

vs. TIGER: Traceable Inference with Graph-Based Evidence Routing for Mitigating Hallucinations in Multimodal Generation

gpt-5.2 · 6/2/2026

Paper 2 (TIGER) likely has higher impact due to broader applicability and stronger methodological grounding: it targets multimodal hallucinations across multiple generation paths and backbones, introduces a principled graph-based risk-scoring/routing framework, and provides a convergence analysis with explicit guarantees—boosting rigor and adoption potential. Its inference-time, frozen-backbone design is timely and practical for deployment. Paper 1 (POIROT) is novel for multi-agent self-auditing and releases tools/benchmarks, but its scope is narrower (LLM-MAS oversight) and may generalize less across modalities and tasks than TIGER.

vs. Hidden Thoughts Are Not Secret: Reasoning Trace Exposure in LLMs

gemini-3.1 · 6/2/2026

Paper 1 addresses an extremely timely and critical issue: the security and intellectual property of cutting-edge reasoning models (like OpenAI's o1) that attempt to hide their reasoning traces. By demonstrating a fundamental vulnerability in interface-level trace hiding, it has immediate and profound implications for model distillation, security, and deployment strategies across the AI industry. While Paper 2 offers a valuable methodological contribution to multi-agent safety, Paper 1's findings challenge current industry practices for protecting proprietary model capabilities, giving it higher immediate scientific and practical impact.

vs. Explainable Data-driven Deep Reinforcement Learning Methods for Optimal Energy Management in Buildings

claude-opus-4.6 · 6/2/2026

POIROT introduces a novel paradigm for safety oversight in multi-agent LLM systems by using the agents themselves as diagnostic tools, addressing a critical gap in AI safety and regulation. Its contributions—a new protocol, benchmark (BLAME), and open-source library—have broad applicability across safety-critical AI domains. Paper 2, while practical, applies existing DRL and explainability techniques to building energy management, representing incremental progress in a well-explored area. POIROT's novelty, timeliness given AI regulation trends, and cross-domain relevance give it substantially higher impact potential.

vs. PassNet: Scaling Large Language Models for Graph Compiler Pass Generation

claude-opus-4.6 · 6/2/2026

PassNet addresses a fundamental gap in compiler optimization by proposing the first large-scale ecosystem for LLM-based compiler pass generation, with concrete datasets (18K+ graphs), benchmarks, and demonstrated results showing fine-tuned small models approaching frontier performance. It opens a new research direction (pass generation vs. kernel generation) with immediate practical applications for compiler optimization. While POIROT addresses important MAS safety concerns with a clever self-diagnostic approach, PassNet's contribution—new abstraction, large-scale infrastructure, and strong empirical validation—has broader potential to reshape how compilers are designed and optimized, impacting both ML systems and programming languages communities.

vs. Bayesian Spectral Emotion Transition Discovery from Multi-Annotator Disagreement

claude-opus-4.6 · 6/2/2026

POIROT addresses a critical and timely problem—failure detection in multi-agent LLM systems—with direct implications for AI safety and regulatory compliance. Its practical framework (open-source library + benchmark) enables immediate adoption across safety-critical domains. The breadth of impact is wider: it applies to any multi-agent LLM deployment, whereas BSETD targets a narrower niche (emotion transition analysis from multi-annotator labels). POIROT's novelty of using agents to audit themselves is conceptually compelling, and the scaling results with complexity suggest broad generalizability. Both papers are methodologically rigorous, but POIROT's timeliness amid AI regulation discussions gives it an edge.

vs. Revisiting Ripple Effects in Knowledge Editing through Pressure-Aware Joint Neighborhood Optimization

gpt-5.2 · 6/2/2026

Paper 1 likely has higher impact: it introduces a broadly applicable, regulation-relevant protocol for safety oversight in LLM multi-agent systems, plus an open-source library and a new benchmark (BLAME), which can catalyze follow-on research and standardization. The idea of using agents as their own distributed evaluators is novel and potentially influential across evaluation, alignment, and safety engineering. Paper 2 is a solid methodological advance in knowledge editing with measurable gains, but its scope is narrower and more incremental within an already active sub-area.