Beyond Final Answers: Auditing Trajectory-Level Hallucinations in Multi-Agent Industrial Workflows

Harshada Badave, Santosh Borse, Andrea Gomez, Harshitha Narahari, Sara Carter, Vishwa Bhatt, Aishani Rachakonda, Shuxin Lin

May 22, 2026

arXiv:2605.24219v1

cs.AI (primary)

#1266 of 2682 · Artificial Intelligence

Tournament Score: 1415 ± 42 (scale 1050 to 1800)

Win Rate: 53%

Rating: 5.8 / 10

Significance 6.5 · Rigor 5 · Novelty 6.5 · Clarity 7

Abstract

Large Language Models (LLMs) are increasingly deployed as autonomous agents that reason, use tools, and act over multiple steps. Yet most hallucination benchmarks still evaluate only the final output, missing failures that originate in intermediate Thought-Action-Observation steps. We present Trajel, a dataset and evaluation framework for auditing trajectory-level hallucinations in multi-agent industrial workflows. Trajel introduces a five-type hallucination taxonomy (factual, referential, logical, procedural, and scope-based) over expert-annotated agent traces from AssetOpsBench. We benchmark supervised detection models at the subtask, trajectory, and long-context levels. Our results show that the most common failure modes are missed by existing benchmarks, that nearly half of hallucinated trajectories involve multiple types at once, and that automated detectors with high binary accuracy still misclassify the subtlest types. Trajectory-aware detection significantly outperforms standard post-hoc verification, making taxonomy-grounded evaluation necessary for safer agentic deployment.

AI Impact Assessments

(1 model)

Scientific Impact Assessment: "Beyond Final Answers: Auditing Trajectory-Level Hallucinations in Multi-Agent Industrial Workflows"

1. Core Contribution

Trajel addresses a genuine and underexplored gap: hallucination evaluation in LLM-based agentic systems has remained fixated on final outputs, ignoring failures that originate and propagate through intermediate reasoning steps. The paper contributes three interlinked artifacts: (1) a five-type hallucination taxonomy (factual, referential, logical, procedural, scope-based) defined as structural predicates over Thought-Action-Observation traces; (2) a dataset of 225 expert-annotated agent trajectories from the AssetOpsBench industrial framework; and (3) a benchmarking framework comparing subtask-level (BERT), trajectory-level (NLI), and long-context (Longformer) detection paradigms.

The taxonomy is the most intellectually distinctive contribution. Defining hallucination types by the *scope of context required for detection* — factual needing only local evidence, referential/logical requiring trajectory history, procedural requiring workflow specifications, and scope requiring agent role definitions — provides a principled organizational scheme that maps directly to detection architecture choices. The finding that 48.7% of hallucinated trajectories exhibit multiple types simultaneously is a strong empirical justification for multi-label formulation.
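The context-scope organization and the multi-type co-occurrence statistic described above can be made concrete with a small sketch. The `ContextScope` names, the `REQUIRED_CONTEXT` mapping layout, and the `multi_type_rate` helper are illustrative assumptions, not the paper's code; only the type-to-context pairing comes from the description above, and the toy annotations are invented.

```python
from enum import Enum


class ContextScope(Enum):
    """Hypothetical labels for the context a detector must see."""
    LOCAL_EVIDENCE = "local evidence"
    TRAJECTORY_HISTORY = "trajectory history"
    WORKFLOW_SPEC = "workflow specification"
    AGENT_ROLE = "agent role definition"


# Pairing of hallucination type and required context, as described above.
REQUIRED_CONTEXT = {
    "factual": ContextScope.LOCAL_EVIDENCE,
    "referential": ContextScope.TRAJECTORY_HISTORY,
    "logical": ContextScope.TRAJECTORY_HISTORY,
    "procedural": ContextScope.WORKFLOW_SPEC,
    "scope": ContextScope.AGENT_ROLE,
}


def multi_type_rate(labels_per_trajectory):
    """Fraction of hallucinated trajectories that carry more than one type.

    Each element is a set of type names. An empty set means the trajectory
    was judged clean, so it is excluded from the denominator.
    """
    hallucinated = [labels for labels in labels_per_trajectory if labels]
    if not hallucinated:
        return 0.0
    multi = sum(1 for labels in hallucinated if len(labels) > 1)
    return multi / len(hallucinated)


# Toy annotations, made up for illustration (not Trajel data).
toy = [
    {"procedural"},
    {"procedural", "scope"},
    set(),
    {"factual", "logical"},
    {"referential"},
]
print(multi_type_rate(toy))  # 2 of the 4 hallucinated trajectories -> 0.5
```

Representing each trajectory's labels as a set is what makes the multi-label formulation natural: a single-label scheme would force an annotator to pick one type and the co-occurrence statistic could not be computed at all.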

2. Methodological Rigor

The paper demonstrates reasonable methodological care but has notable weaknesses:

Strengths in methodology:

The two-phase annotation protocol (LLM-as-a-Judge followed by blind human review) is well-motivated to mitigate anchoring bias.

The analysis is multi-faceted, covering prevalence, localization, cross-model comparison, detection modeling, and signal analysis.

The context-ordering hypothesis — that detection difficulty increases with the context scope required — is empirically validated across both human agreement (κ drops from 0.656 for scope to 0.176 for referential) and automated detection (F1 drops correspondingly).
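For readers who want to reproduce agreement figures like the κ values cited above, here is a minimal sketch. It assumes κ denotes Cohen's kappa between two raters (the assessment does not say which variant was used), and the function name and toy labels are hypothetical.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters labelling the same items.

    kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement
    and p_e is the agreement expected from each rater's label frequencies.
    """
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("raters must label the same non-empty item list")
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    count_a = Counter(rater_a)
    count_b = Counter(rater_b)
    expected = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    if expected == 1.0:
        # Both raters used one identical label throughout; kappa is
        # undefined here, and this sketch treats it as full agreement.
        return 1.0
    return (observed - expected) / (1 - expected)


# Toy per-trajectory flags for one hallucination type (invented data).
llm_judge = [1, 1, 0, 0, 1, 0, 1, 0, 0, 0]
human = [1, 0, 0, 0, 1, 0, 1, 1, 0, 0]
print(round(cohens_kappa(llm_judge, human), 3))  # 0.583
```

In the toy data the raters agree on 8 of 10 items, yet κ is only about 0.58 because roughly half of that agreement is expected by chance. This is why raw percent agreement would overstate annotation reliability for rare types.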

Weaknesses:

The dataset is small: 225 trajectories is modest for training supervised classifiers, and the authors acknowledge this. The supervised models underperform the zero-shot LLM judge (best F1 = 0.590 vs. 0.855), which limits the conclusions one can draw about detection modeling.

Inter-annotator agreement is moderate overall (κ = 0.456) and weak for the most interesting categories (referential κ = 0.176, logical κ = 0.211). This raises questions about whether these categories are well-defined enough for reliable annotation, or whether the annotation protocol needs fundamental redesign for these types.

Only two annotator "parties" are involved (LLM judge and one human reviewer per trajectory), and inter-human agreement is not separately reported, making it impossible to distinguish taxonomy ambiguity from LLM-human disagreement.

The supervised detection experiments lack proper cross-validation reporting, confidence intervals, or statistical significance tests. With 225 trajectories and imbalanced classes, variance could be substantial.

The signal analysis (Section 6.6), while yielding striking results (CJ achieving AUC = 0.908), uses binary flags produced by the same AssetOpsBench framework, raising circularity concerns — the system evaluating execution quality may share information pathways with the system generating trajectories.
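The missing uncertainty reporting noted among the weaknesses above could be addressed with something as simple as a percentile bootstrap over trajectories. The sketch below is one possible approach, not the paper's procedure; the function names, resample count, and toy predictions are assumptions for illustration.

```python
import random


def f1_score(y_true, y_pred):
    """Binary F1 for 0/1 labels; returns 0.0 when there are no positives."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def bootstrap_f1_interval(y_true, y_pred, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for F1, resampling items with replacement."""
    rng = random.Random(seed)
    n = len(y_true)
    scores = []
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]
        scores.append(f1_score([y_true[i] for i in idx],
                               [y_pred[i] for i in idx]))
    scores.sort()
    lower = scores[int((alpha / 2) * n_resamples)]
    upper = scores[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return lower, upper


# Toy detector output on 20 trajectories (invented, not Trajel results).
y_true = [1] * 10 + [0] * 10
y_pred = [1, 1, 1, 1, 1, 1, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0]
point = f1_score(y_true, y_pred)
low, high = bootstrap_f1_interval(y_true, y_pred)
print(f"F1 = {point:.3f}, 95% interval = [{low:.3f}, {high:.3f}]")
```

With only a few hundred trajectories and imbalanced classes, intervals of this kind tend to be wide, which is the practical content of the variance concern raised above.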

3. Potential Impact

The work has moderate-to-high potential impact along several dimensions:

For the agentic AI safety community: The taxonomy and the empirical finding that procedural hallucinations dominate (38.5%) but are invisible to output-only evaluation is a concrete, actionable insight. The localization analysis showing hallucinations concentrate in Actions and Responses rather than Thoughts could directly inform guardrail architecture.

For industrial AI deployment: The execution-quality signal analysis suggests that lightweight runtime monitors (particularly clarity-and-justification flags) could serve as effective "kill switches" — the 97.1% hallucination rate when both CJ and RV are absent is operationally useful.

For benchmark design: The demonstration that high binary accuracy masks systematic failure on subtle types (referential and logical) is an important cautionary finding for the evaluation community.
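The runtime-monitor idea mentioned above for industrial deployment can be sketched as a small gating function. This is a hypothetical policy: the assessment only reports that trajectories with both CJ and RV absent are hallucinated 97.1% of the time; the `ExecutionFlags` type, the three-tier outcome, and the "review" tier in particular are assumptions. "RV" is kept as an opaque flag name because the assessment does not expand it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ExecutionFlags:
    """Execution-quality flags for one trajectory (hypothetical container)."""
    cj: bool  # clarity-and-justification flag is present
    rv: bool  # "RV" flag is present (abbreviation not expanded in the source)


def gate(flags: ExecutionFlags) -> str:
    """Map execution flags to an action under an assumed three-tier policy."""
    if not flags.cj and not flags.rv:
        # Both signals absent: the high-risk case highlighted above.
        return "halt"
    if not flags.cj or not flags.rv:
        # Exactly one signal absent: assumed to warrant human review.
        return "review"
    return "proceed"


for flags in (ExecutionFlags(True, True),
              ExecutionFlags(True, False),
              ExecutionFlags(False, False)):
    print(flags, "->", gate(flags))
```

A rule like this costs nothing at inference time, which is what makes it attractive as a guardrail. Any real thresholds or tiers would need to be validated on held-out trajectories from the target deployment.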

However, the impact is limited by domain specificity (single industrial domain, single orchestrator framework) and dataset scale. The 225-trajectory dataset, while carefully annotated, is too small to serve as a definitive training resource.

4. Timeliness & Relevance

The paper is well-timed. The rapid deployment of agentic LLM systems in production settings has created urgent demand for evaluation frameworks that go beyond static benchmarks. The paper explicitly targets this gap and provides a concrete framework. The connection to AssetOpsBench grounds the work in realistic industrial scenarios rather than toy environments. The submission to NeurIPS Datasets and Benchmarks is appropriate.

5. Strengths & Limitations

Key Strengths:

The taxonomy is well-motivated and structurally grounded, with the context-ordering principle providing theoretical coherence.

The multi-label formulation is empirically justified by the 48.7% multi-type co-occurrence rate.

The signal analysis reveals that simple execution flags outperform trained classifiers, a counterintuitive and practically valuable finding.

The paper is clearly written with thorough appendices covering annotation protocols, prompts, and detailed breakdowns.
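The comparison between execution flags and trained classifiers noted among the strengths above is stated in terms of AUC. For a single binary flag the ROC curve has one operating point, so AUC reduces to the mean of the true-positive and true-negative rates. The sketch below checks that identity against the pairwise definition on invented data; the function names are hypothetical, and the flag is assumed to be encoded so that 1 means "predicts hallucination" (for example, a quality flag being absent).

```python
def binary_flag_auc(flag, label):
    """AUC of a 0/1 score against 0/1 labels, computed as (TPR + TNR) / 2."""
    pos = [f for f, y in zip(flag, label) if y == 1]
    neg = [f for f, y in zip(flag, label) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    tpr = sum(pos) / len(pos)
    tnr = sum(1 - f for f in neg) / len(neg)
    return (tpr + tnr) / 2


def pairwise_auc(score, label):
    """AUC as the probability that a positive outranks a negative (ties = 0.5)."""
    pos = [s for s, y in zip(score, label) if y == 1]
    neg = [s for s, y in zip(score, label) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy data (invented): flag = 1 means the monitor predicts a hallucination.
flag = [1, 1, 1, 0, 0, 0, 1, 0]
label = [1, 1, 0, 0, 0, 1, 1, 0]
print(binary_flag_auc(flag, label))  # 0.75
print(pairwise_auc(flag, label))     # 0.75
```

Seen this way, a high AUC for a binary flag means it has both high sensitivity and high specificity at its only operating point, which is a stronger statement than the same AUC for a continuous score.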

Notable Limitations:

Dataset scale (225 trajectories) severely constrains the supervised modeling experiments and generalizability claims.

Single-domain evaluation (industrial asset operations) limits external validity.

The gap between human agreement on subtle types (κ ≤ 0.211) and the paper's claims about taxonomy utility creates tension — if annotators cannot reliably agree, the taxonomy may need refinement for referential and logical types.

No inter-human agreement analysis is reported (only human-vs-LLM), which is a significant omission for a dataset paper.

The supervised classifiers perform poorly enough (AUC < 0.70) that the "detection modeling framework" contribution feels preliminary rather than substantive.

The paper does not explore whether the taxonomy generalizes beyond the AssetOpsBench domain.

Overall Assessment

Trajel makes a meaningful conceptual contribution by formalizing trajectory-level hallucination as a multi-type, multi-label detection problem and providing initial empirical evidence for its importance. The taxonomy and the empirical findings about type prevalence, co-occurrence, and detection difficulty are the paper's strongest contributions. However, the dataset is small, the supervised modeling results are preliminary, and the low inter-annotator agreement on key categories tempers enthusiasm. The work opens a useful research direction but represents an early-stage contribution rather than a definitive benchmark.

Rating: 5.8 / 10

Significance 6.5 · Rigor 5 · Novelty 6.5 · Clarity 7

Generated May 26, 2026

Comparison History (17)

vs. JobBench: Aligning Agent Work With Human Will

gemini-3.1 · 5/27/2026

Paper 1 offers a profound paradigm shift by evaluating AI agents based on human empowerment rather than economic replacement. Its extensive scope (130 tasks across 35 occupations) and detailed evaluation framework position it to become a foundational benchmark for human-AI collaboration. While Paper 2 tackles the crucial technical issue of trajectory hallucinations, Paper 1's broader socio-technical relevance, interdisciplinary appeal, and potential to steer the future development of agentic AI towards human-centric workflows give it a higher overall scientific and societal impact.

vs. MemFail: Stress-Testing Failure Modes of LLM Memory Systems

gemini-3.1 · 5/27/2026

Paper 2 addresses a fundamental bottleneck in LLM agents—long-horizon memory—by formalizing operations and isolating specific failure modes. This architectural focus offers broader applicability across any LLM system relying on memory. While Paper 1 is highly valuable for multi-agent workflows, Paper 2's foundational approach to stress-testing memory components likely provides more generalized insights that will influence future agent design and evaluation.

vs. Back to Parsimonious Latents: Learning Task-Centric World Models from Visual Foundations

gemini-3.1 · 5/26/2026

LLM agents are rapidly being deployed, making the evaluation of intermediate reasoning steps a critical bottleneck. Paper 2 addresses this urgent issue by introducing a novel dataset, taxonomy, and evaluation framework for trajectory-level hallucinations. This has immediate, widespread applicability for AI safety and reliability across diverse industries, likely resulting in broader adoption and higher cross-disciplinary impact compared to the more domain-specific, albeit rigorous, robotics and control focus of Paper 1.

vs. When Mean CE Fails: Median CE Can Better Track Language Model Quality

claude-opus-4.6 · 5/26/2026

Paper 1 addresses a critical and timely gap in LLM agent safety—trajectory-level hallucination detection in multi-agent systems—introducing a novel taxonomy, dataset, and evaluation framework. As agentic AI deployment accelerates in industry, this work has broad implications for reliability and safety. Paper 2 offers a useful but narrower practical insight (median vs. mean CE as a training metric), which is more of an incremental diagnostic recommendation than a paradigm shift. Paper 1's contribution is more novel, has broader cross-field impact, and addresses a more pressing problem in the AI safety landscape.

vs. PALS: Power-Aware LLM Serving for Mixture-of-Experts Models

gpt-5.2 · 5/26/2026

Paper 2 has higher likely impact due to strong real-world applicability (data-center energy and QoS), clear methodological rigor (implemented in vLLM, evaluated across dense+MoE and multi-GPU), and timely relevance as power becomes a limiting factor for LLM deployment. Its approach (treating power caps as a controllable runtime knob with feedback control) can generalize to many serving stacks and aligns with industry needs (energy proportionality, grid interaction). Paper 1 is novel and valuable for safety evaluation, but its impact depends on adoption of specific datasets/taxonomies and is narrower operationally.

vs. Insuring Every Action: An Authority Frontier Framework for Runtime Actuarial Control of Autonomous AI Agents

claude-opus-4.6 · 5/26/2026

Paper 2 addresses the broadly relevant problem of hallucination detection in LLM agents with a clear, practical taxonomy and dataset (Trajel) that fills a well-recognized gap—most benchmarks only evaluate final outputs. Its contribution is more accessible, empirically grounded, and applicable across the rapidly growing multi-agent AI ecosystem. Paper 1, while intellectually ambitious in applying actuarial concepts to AI agent control, introduces highly specialized formalism (authority frontiers, reserve capital budgets) that may have narrower adoption. Paper 2's dataset and taxonomy are more likely to be widely cited and built upon by the safety and evaluation communities.

vs. FrontierOR: Benchmarking LLMs' Capacity for Efficient Algorithm Design in Large-Scale Optimization

gemini-3.1 · 5/26/2026

Paper 1 addresses a critical and universally relevant bottleneck in modern AI: trajectory-level hallucinations in autonomous LLM agents. Its framework for auditing intermediate reasoning steps has broad applicability across AI safety, multi-agent systems, and industrial deployments. Paper 2, while highly rigorous and valuable for Operations Research, focuses on a narrower domain of optimization algorithm design, giving Paper 1 a higher potential for widespread cross-disciplinary impact.

vs. Causal Intervention-Based Memory Selection for Long-Horizon LLM Agents

claude-opus-4.6 · 5/26/2026

Paper 2 introduces a novel causal intervention framework (CMI) for memory selection in LLM agents, addressing a fundamental limitation in how agents utilize persistent memory. The causal approach is methodologically innovative, bridging causal inference with LLM agent design—a connection with broad implications beyond the specific application. It includes a reusable benchmark and open-source code. Paper 1, while valuable in introducing a hallucination taxonomy for agent trajectories, is more narrowly focused on auditing industrial workflows. Paper 2's causal framework has wider applicability across agent architectures and tasks, and its principled approach to memory selection addresses a more general challenge in the rapidly growing LLM agent field.

vs. A Signal-Language Foundation Model for Broad-Spectrum Cardiovascular Assessment from Routine Electrocardiography

gpt-5.2 · 5/26/2026

Paper 2 likely has higher impact due to strong novelty (signal-language contrastive foundation model), very large-scale training (2.8M ECGs) and extensive external validation (~1.5M ECGs across nine cohorts) over 89 tasks, supporting methodological rigor and generalizability. Its real-world clinical applications are substantial (broad cardiovascular assessment, opportunistic screening, rare disease detection) with clear timeliness in foundation models for healthcare. Paper 1 is timely and useful for safer agentic LLM deployment, but its domain-specific dataset/evaluation framework and narrower application scope suggest comparatively less immediate cross-field and societal impact.

vs. A governance horizon for ethical-use constraints in open-weight AI models

gpt-5.2 · 5/26/2026

Paper 2 likely has higher impact due to its scale (auditing 2.1M repos), strong quantitative finding (a measurable “governance horizon” with predictive fit), and immediate policy relevance for open-weight model governance and supply-chain accountability. Its conclusions generalize across platforms (Hugging Face vs PyPI) and inform actionable platform/policy designs, affecting research, industry compliance, and regulation. Paper 1 is timely and useful for agent safety evaluation, but its domain specificity (industrial multi-agent workflows) and narrower stakeholder reach suggest comparatively smaller cross-field and real-world governance impact.

vs. SCICONVBENCH: Benchmarking LLMs on Multi-Turn Clarification for Task Formulation in Computational Science

gemini-3.1 · 5/26/2026

Paper 2 addresses a fundamental challenge in agentic AI by auditing intermediate trajectory-level hallucinations rather than just final outputs. This has broad applicability across all multi-step reasoning systems, offering significant real-world impact for safe agent deployment. Paper 1, while novel in addressing multi-turn clarification, is largely confined to the niche domain of computational science, limiting its cross-disciplinary impact compared to Paper 2.

vs. Beyond the Cartesian Illusion: Testing Two-Stage Multi-Modal Theory of Mind under Perceptual Bottlenecks

gpt-5.2 · 5/26/2026

Paper 1 likely has higher impact due to its timely, broadly applicable evaluation framework for agentic LLM safety: trajectory-level hallucination auditing with a clear taxonomy, expert-annotated traces, and benchmarking across granularities. This directly targets a major deployment bottleneck in industry (multi-step tool-using agents), making real-world adoption plausible and cross-cutting across LLM agents, evaluation, safety, and reliability. Paper 2 is novel in embodied ToM under perceptual bottlenecks, but appears narrower in applicability and more dependent on task design and CoT prompting, with less obvious standardization path than Paper 1’s dataset/taxonomy-driven evaluation.

vs. Scalable Environments Drive Generalizable Agents

claude-opus-4.6 · 5/26/2026

Paper 1 addresses a fundamental and broadly impactful question about how to build generalizable agents through environment scaling, proposing a unifying taxonomy and research agenda that could shape the entire field of agent learning. Its scope spans reinforcement learning, foundation models, and open-ended learning. Paper 2, while practically valuable for LLM agent safety, addresses a narrower problem (trajectory-level hallucination detection) with a specific dataset/benchmark contribution. Paper 1's conceptual framework has greater potential to influence multiple research communities and drive long-term paradigm shifts in agent training methodology.

vs. Human-in-the-Loop Multi-Agent Ventilator Decision Support with Contextual Bandit Preference Learning

gpt-5.2 · 5/26/2026

Paper 2 likely has higher scientific impact due to broader applicability and timeliness: trajectory-level hallucination auditing is a central, cross-domain problem for agentic LLM systems, and Trajel offers a reusable dataset, taxonomy, and evaluation protocol that can become a community benchmark. This enables follow-on work in safety, evaluation, tool-use reliability, and industrial deployment. Paper 1 is innovative and high-value clinically, but its impact may be narrower and harder to translate without prospective trials, regulatory validation, and real-world deployment evidence.

vs. Agent Manufacturing: Foundation-Model Agents as First-Class Industrial Entities

gemini-3.1 · 5/26/2026

Paper 2 addresses a critical, timely bottleneck in LLM agent deployment (trajectory-level hallucinations) by introducing a concrete dataset, taxonomy, and framework. Methodological contributions like benchmarks typically garner high citations and drive immediate follow-up research. Paper 1 offers a valuable conceptual paradigm but lacks the immediate empirical utility and falsifiable technical contribution of Paper 2.

vs. TaBIIC2: Interactive Building of Ontological Taxonomies using Weighted Self-Organizing Maps

gpt-5.2 · 5/26/2026

Paper 2 is more timely and broadly impactful: it targets safety and reliability of agentic LLM systems, a rapidly growing deployment setting across industries. It contributes a dataset and evaluation framework (Trajel) plus a hallucination taxonomy at the trajectory level, enabling standardized benchmarking and follow-on research. The methodological framing (multi-level evaluation, expert annotations, comparison to post-hoc verification) suggests stronger rigor and clearer community reuse. Paper 1 is useful for ontology engineering, but appears more niche and tool-centric, with narrower cross-field reach.

vs. SAM: State-Adaptive Memory for Long-Horizon Reasoning Agent

gpt-5.2 · 5/26/2026

Paper 2 likely has higher impact: it introduces a generally applicable, model-agnostic framework (state-adaptive memory) that directly improves long-horizon agent performance across multiple established benchmarks and backbones, with optimization via supervision and RL. This is timely for practical agent deployments and can influence systems, retrieval/memory research, and tool-using agents broadly. Paper 1 is valuable for safety/evaluation (trajectory-level hallucination taxonomy and dataset), but its impact may be narrower to auditing/benchmarking and dependent on adoption of Trajel/AssetOpsBench, whereas SAM offers an immediately deployable capability improvement.