Beyond Final Answers: Auditing Trajectory-Level Hallucinations in Multi-Agent Industrial Workflows
Harshada Badave, Santosh Borse, Andrea Gomez, Harshitha Narahari, Sara Carter, Vishwa Bhatt, Aishani Rachakonda, Shuxin Lin
Abstract
Large Language Models (LLMs) are increasingly deployed as autonomous agents that reason, use tools, and act over multiple steps. Yet most hallucination benchmarks still evaluate only the final output, missing failures that originate in intermediate Thought-Action-Observation steps. We present Trajel, a dataset and evaluation framework for auditing trajectory-level hallucinations in multi-agent industrial workflows. Trajel introduces a five-type hallucination taxonomy (factual, referential, logical, procedural, and scope-based) over expert-annotated agent traces from AssetOpsBench. We benchmark supervised detection models at the subtask, trajectory, and long-context levels. Our results show that the most common failure modes are missed by existing benchmarks, that nearly half of hallucinated trajectories involve multiple types at once, and that automated detectors with high binary accuracy still misclassify the subtlest types. Trajectory-aware detection significantly outperforms standard post-hoc verification, making taxonomy-grounded evaluation necessary for safer agentic deployment.
AI Impact Assessments
Scientific Impact Assessment: "Beyond Final Answers: Auditing Trajectory-Level Hallucinations in Multi-Agent Industrial Workflows"
1. Core Contribution
Trajel addresses a genuine and underexplored gap: hallucination evaluation in LLM-based agentic systems has remained fixated on final outputs, ignoring failures that originate and propagate through intermediate reasoning steps. The paper contributes three interlinked artifacts: (1) a five-type hallucination taxonomy (factual, referential, logical, procedural, scope-based) defined as structural predicates over Thought-Action-Observation traces; (2) a dataset of 225 expert-annotated agent trajectories from the AssetOpsBench industrial framework; and (3) a benchmarking framework comparing subtask-level (BERT), trajectory-level (NLI), and long-context (Longformer) detection paradigms.
The taxonomy is the most intellectually distinctive contribution. Defining hallucination types by the *scope of context required for detection* — factual needing only local evidence, referential/logical requiring trajectory history, procedural requiring workflow specifications, and scope requiring agent role definitions — provides a principled organizational scheme that maps directly to detection architecture choices. The finding that 48.7% of hallucinated trajectories exhibit multiple types simultaneously is a strong empirical justification for multi-label formulation.
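One way to picture the context-scope idea is as a lookup from hallucination type to the evidence a detector must see. The sketch below is illustrative only: the names `HallucinationType`, `ContextScope`, `REQUIRED_CONTEXT`, `required_contexts`, and `is_multi_type` are hypothetical and do not come from the Trajel release. Only the five type names and their context requirements are taken from the description above.

```python
from enum import Enum
from typing import Iterable, Set


class HallucinationType(Enum):
    FACTUAL = "factual"
    REFERENTIAL = "referential"
    LOGICAL = "logical"
    PROCEDURAL = "procedural"
    SCOPE = "scope"


class ContextScope(Enum):
    LOCAL_EVIDENCE = "local evidence"
    TRAJECTORY_HISTORY = "trajectory history"
    WORKFLOW_SPEC = "workflow specification"
    AGENT_ROLE = "agent role definition"


# Hypothetical mapping that mirrors the review's summary of the taxonomy.
REQUIRED_CONTEXT = {
    HallucinationType.FACTUAL: ContextScope.LOCAL_EVIDENCE,
    HallucinationType.REFERENTIAL: ContextScope.TRAJECTORY_HISTORY,
    HallucinationType.LOGICAL: ContextScope.TRAJECTORY_HISTORY,
    HallucinationType.PROCEDURAL: ContextScope.WORKFLOW_SPEC,
    HallucinationType.SCOPE: ContextScope.AGENT_ROLE,
}


def required_contexts(labels: Iterable[HallucinationType]) -> Set[ContextScope]:
    """Return every context scope a detector needs for a multi-label annotation."""
    return {REQUIRED_CONTEXT[label] for label in labels}


def is_multi_type(labels: Iterable[HallucinationType]) -> bool:
    """True when a trajectory carries more than one distinct hallucination type."""
    return len(set(labels)) > 1
```

Representing each annotation as a set of types rather than a single class keeps co-occurring types intact, which is the property the multi-label formulation relies on.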
2. Methodological Rigor
The paper demonstrates reasonable methodological care but has notable weaknesses:
Strengths in methodology:
Weaknesses:
3. Potential Impact
The work has moderate-to-high potential impact along several dimensions:
For the agentic AI safety community: The taxonomy, together with the empirical finding that procedural hallucinations dominate (38.5%) but are invisible to output-only evaluation, provides a concrete, actionable insight. The localization analysis, which shows that hallucinations concentrate in Actions and Responses rather than Thoughts, could directly inform guardrail architecture.
For industrial AI deployment: The execution-quality signal analysis suggests that lightweight runtime monitors (particularly clarity-and-justification flags) could serve as effective "kill switches" — the 97.1% hallucination rate when both CJ and RV are absent is operationally useful.
For benchmark design: The demonstration that high binary accuracy masks systematic failure on subtle types (referential and logical) is an important cautionary finding for the evaluation community.
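A toy calculation illustrates how this kind of masking can arise. The labels and predictions below are invented for illustration and are not Trajel results; the helper names `binary_accuracy` and `per_type_recall` are hypothetical. The detector in this example always recognizes that something is wrong but tends to call it procedural.

```python
from typing import Dict, List, Set


def binary_accuracy(gold: List[Set[str]], pred: List[Set[str]]) -> float:
    """Fraction of trajectories where hallucinated vs. clean is predicted correctly."""
    correct = sum(bool(g) == bool(p) for g, p in zip(gold, pred))
    return correct / len(gold)


def per_type_recall(gold: List[Set[str]], pred: List[Set[str]]) -> Dict[str, float]:
    """Recall for each hallucination type that appears in the gold labels."""
    recall = {}
    for label in sorted(set().union(*gold)):
        support = [i for i, g in enumerate(gold) if label in g]
        hits = sum(1 for i in support if label in pred[i])
        recall[label] = hits / len(support)
    return recall


# Invented toy annotations: an empty set means "no hallucination".
gold = [
    {"procedural"},
    {"referential"},
    {"logical", "procedural"},
    set(),
    {"factual"},
    {"referential", "procedural"},
]
pred = [
    {"procedural"},
    {"procedural"},
    {"procedural"},
    set(),
    {"factual"},
    {"procedural"},
]

print(binary_accuracy(gold, pred))   # 1.0
print(per_type_recall(gold, pred))
# {'factual': 1.0, 'logical': 0.0, 'procedural': 1.0, 'referential': 0.0}
```

In this toy setting the binary score is perfect while referential and logical recall are zero, which is why per-type reporting is needed alongside a single accuracy figure.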
However, the impact is limited by domain specificity (single industrial domain, single orchestrator framework) and dataset scale. The 225-trajectory dataset, while carefully annotated, is too small to serve as a definitive training resource.
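The runtime-monitor idea raised above for industrial deployment can be sketched as a simple rule. This is a minimal sketch under stated assumptions, not the authors' method: it assumes the CJ and RV execution-quality signals are available as per-step booleans, which the assessment does not specify, and it leaves "RV" unexpanded because the text does not define it. The names `StepSignals`, `should_halt`, and `first_halt_index` are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class StepSignals:
    cj: bool  # clarity-and-justification signal is present for this step
    rv: bool  # "RV" signal is present (abbreviation not expanded in the source)


def should_halt(signals: StepSignals) -> bool:
    """Flag a step only when both signals are absent, the high-risk case cited above."""
    return not signals.cj and not signals.rv


def first_halt_index(steps: Sequence[StepSignals]) -> Optional[int]:
    """Return the index of the first flagged step, or None if no step is flagged."""
    for index, signals in enumerate(steps):
        if should_halt(signals):
            return index
    return None


# Hypothetical three-step trajectory: only the last step lacks both signals.
trajectory = [
    StepSignals(cj=True, rv=True),
    StepSignals(cj=False, rv=True),
    StepSignals(cj=False, rv=False),
]
print(first_halt_index(trajectory))  # 2
```

Because the 97.1% figure is an observed rate on a small dataset rather than a guarantee, a rule like this would more plausibly route a trajectory to human review than stop it outright.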
4. Timeliness & Relevance
The paper is well-timed. The rapid deployment of agentic LLM systems in production settings has created urgent demand for evaluation frameworks that go beyond static benchmarks. The paper explicitly targets this gap and provides a concrete framework. The connection to AssetOpsBench grounds the work in realistic industrial scenarios rather than toy environments. The submission to NeurIPS Datasets and Benchmarks is appropriate.
5. Strengths & Limitations
Key Strengths:
Notable Limitations:
Overall Assessment
Trajel makes a meaningful conceptual contribution by formalizing trajectory-level hallucination as a multi-type, multi-label detection problem and providing initial empirical evidence for its importance. The taxonomy and the empirical findings about type prevalence, co-occurrence, and detection difficulty are the paper's strongest contributions. However, the dataset is small, the supervised modeling results are preliminary, and the low inter-annotator agreement on key categories tempers enthusiasm. The work opens a useful research direction but represents an early-stage contribution rather than a definitive benchmark.
Generated May 26, 2026
Comparison History (17)
Paper 1 offers a profound paradigm shift by evaluating AI agents based on human empowerment rather than economic replacement. Its extensive scope (130 tasks across 35 occupations) and detailed evaluation framework position it to become a foundational benchmark for human-AI collaboration. While Paper 2 tackles the crucial technical issue of trajectory hallucinations, Paper 1's broader socio-technical relevance, interdisciplinary appeal, and potential to steer the future development of agentic AI towards human-centric workflows give it a higher overall scientific and societal impact.
Paper 2 addresses a fundamental bottleneck in LLM agents—long-horizon memory—by formalizing operations and isolating specific failure modes. This architectural focus offers broader applicability across any LLM system relying on memory. While Paper 1 is highly valuable for multi-agent workflows, Paper 2's foundational approach to stress-testing memory components likely provides more generalized insights that will influence future agent design and evaluation.
LLM agents are rapidly being deployed, making the evaluation of intermediate reasoning steps a critical bottleneck. Paper 2 addresses this urgent issue by introducing a novel dataset, taxonomy, and evaluation framework for trajectory-level hallucinations. This has immediate, widespread applicability for AI safety and reliability across diverse industries, likely resulting in broader adoption and higher cross-disciplinary impact compared to the more domain-specific, albeit rigorous, robotics and control focus of Paper 1.
Paper 1 addresses a critical and timely gap in LLM agent safety—trajectory-level hallucination detection in multi-agent systems—introducing a novel taxonomy, dataset, and evaluation framework. As agentic AI deployment accelerates in industry, this work has broad implications for reliability and safety. Paper 2 offers a useful but narrower practical insight (median vs. mean CE as a training metric), which is more of an incremental diagnostic recommendation than a paradigm shift. Paper 1's contribution is more novel, has broader cross-field impact, and addresses a more pressing problem in the AI safety landscape.
Paper 2 has higher likely impact due to strong real-world applicability (data-center energy and QoS), clear methodological rigor (implemented in vLLM, evaluated across dense+MoE and multi-GPU), and timely relevance as power becomes a limiting factor for LLM deployment. Its approach (treating power caps as a controllable runtime knob with feedback control) can generalize to many serving stacks and aligns with industry needs (energy proportionality, grid interaction). Paper 1 is novel and valuable for safety evaluation, but its impact depends on adoption of specific datasets/taxonomies and is narrower operationally.
Paper 2 addresses the broadly relevant problem of hallucination detection in LLM agents with a clear, practical taxonomy and dataset (Trajel) that fills a well-recognized gap—most benchmarks only evaluate final outputs. Its contribution is more accessible, empirically grounded, and applicable across the rapidly growing multi-agent AI ecosystem. Paper 1, while intellectually ambitious in applying actuarial concepts to AI agent control, introduces highly specialized formalism (authority frontiers, reserve capital budgets) that may have narrower adoption. Paper 2's dataset and taxonomy are more likely to be widely cited and built upon by the safety and evaluation communities.
Paper 1 addresses a critical and universally relevant bottleneck in modern AI: trajectory-level hallucinations in autonomous LLM agents. Its framework for auditing intermediate reasoning steps has broad applicability across AI safety, multi-agent systems, and industrial deployments. Paper 2, while highly rigorous and valuable for Operations Research, focuses on a narrower domain of optimization algorithm design, giving Paper 1 a higher potential for widespread cross-disciplinary impact.
Paper 2 introduces a novel causal intervention framework (CMI) for memory selection in LLM agents, addressing a fundamental limitation in how agents utilize persistent memory. The causal approach is methodologically innovative, bridging causal inference with LLM agent design—a connection with broad implications beyond the specific application. It includes a reusable benchmark and open-source code. Paper 1, while valuable in introducing a hallucination taxonomy for agent trajectories, is more narrowly focused on auditing industrial workflows. Paper 2's causal framework has wider applicability across agent architectures and tasks, and its principled approach to memory selection addresses a more general challenge in the rapidly growing LLM agent field.
Paper 2 likely has higher impact due to strong novelty (signal-language contrastive foundation model), very large-scale training (2.8M ECGs) and extensive external validation (~1.5M ECGs across nine cohorts) over 89 tasks, supporting methodological rigor and generalizability. Its real-world clinical applications are substantial (broad cardiovascular assessment, opportunistic screening, rare disease detection) with clear timeliness in foundation models for healthcare. Paper 1 is timely and useful for safer agentic LLM deployment, but its domain-specific dataset/evaluation framework and narrower application scope suggest comparatively less immediate cross-field and societal impact.
Paper 2 likely has higher impact due to its scale (auditing 2.1M repos), strong quantitative finding (a measurable “governance horizon” with predictive fit), and immediate policy relevance for open-weight model governance and supply-chain accountability. Its conclusions generalize across platforms (Hugging Face vs PyPI) and inform actionable platform/policy designs, affecting research, industry compliance, and regulation. Paper 1 is timely and useful for agent safety evaluation, but its domain specificity (industrial multi-agent workflows) and narrower stakeholder reach suggest comparatively smaller cross-field and real-world governance impact.
Paper 2 addresses a fundamental challenge in agentic AI by auditing intermediate trajectory-level hallucinations rather than just final outputs. This has broad applicability across all multi-step reasoning systems, offering significant real-world impact for safe agent deployment. Paper 1, while novel in addressing multi-turn clarification, is largely confined to the niche domain of computational science, limiting its cross-disciplinary impact compared to Paper 2.
Paper 1 likely has higher impact due to its timely, broadly applicable evaluation framework for agentic LLM safety: trajectory-level hallucination auditing with a clear taxonomy, expert-annotated traces, and benchmarking across granularities. This directly targets a major deployment bottleneck in industry (multi-step tool-using agents), making real-world adoption plausible and cross-cutting across LLM agents, evaluation, safety, and reliability. Paper 2 is novel in embodied ToM under perceptual bottlenecks, but appears narrower in applicability and more dependent on task design and CoT prompting, with a less obvious standardization path than Paper 1’s dataset/taxonomy-driven evaluation.
Paper 1 addresses a fundamental and broadly impactful question about how to build generalizable agents through environment scaling, proposing a unifying taxonomy and research agenda that could shape the entire field of agent learning. Its scope spans reinforcement learning, foundation models, and open-ended learning. Paper 2, while practically valuable for LLM agent safety, addresses a narrower problem (trajectory-level hallucination detection) with a specific dataset/benchmark contribution. Paper 1's conceptual framework has greater potential to influence multiple research communities and drive long-term paradigm shifts in agent training methodology.
Paper 2 likely has higher scientific impact due to broader applicability and timeliness: trajectory-level hallucination auditing is a central, cross-domain problem for agentic LLM systems, and Trajel offers a reusable dataset, taxonomy, and evaluation protocol that can become a community benchmark. This enables follow-on work in safety, evaluation, tool-use reliability, and industrial deployment. Paper 1 is innovative and high-value clinically, but its impact may be narrower and harder to translate without prospective trials, regulatory validation, and real-world deployment evidence.
Paper 2 addresses a critical, timely bottleneck in LLM agent deployment (trajectory-level hallucinations) by introducing a concrete dataset, taxonomy, and framework. Methodological contributions like benchmarks typically garner high citations and drive immediate follow-up research. Paper 1 offers a valuable conceptual paradigm but lacks the immediate empirical utility and falsifiable technical contribution of Paper 2.
Paper 2 is more timely and broadly impactful: it targets safety and reliability of agentic LLM systems, a rapidly growing deployment setting across industries. It contributes a dataset and evaluation framework (Trajel) plus a hallucination taxonomy at the trajectory level, enabling standardized benchmarking and follow-on research. The methodological framing (multi-level evaluation, expert annotations, comparison to post-hoc verification) suggests stronger rigor and clearer community reuse. Paper 1 is useful for ontology engineering, but appears more niche and tool-centric, with narrower cross-field reach.
Paper 2 likely has higher impact: it introduces a generally applicable, model-agnostic framework (state-adaptive memory) that directly improves long-horizon agent performance across multiple established benchmarks and backbones, with optimization via supervision and RL. This is timely for practical agent deployments and can influence systems, retrieval/memory research, and tool-using agents broadly. Paper 1 is valuable for safety/evaluation (trajectory-level hallucination taxonomy and dataset), but its impact may be narrower to auditing/benchmarking and dependent on adoption of Trajel/AssetOpsBench, whereas SAM offers an immediately deployable capability improvement.