StepFinder: A Temporal Semantic Framework for Failure Attribution in Multi-Agent Systems
Taiyu Zhu, Yifan Wu, Weilin Jin, Ying Li, Gang Huang
Abstract
LLM-based multi-agent systems exhibit remarkable collaborative capabilities in complex multi-step tasks. However, these systems are highly sensitive to single-step execution errors that can propagate through agent interactions and lead to cascading failures. To understand the causes of failure and improve system reliability, failure attribution has been introduced as a task that aims to automatically identify the root cause step responsible for a failure. Existing failure attribution methods mainly rely on LLMs to reason over original execution trajectories, which not only incur high inference costs and latency, but also suffer from interference caused by redundant and noisy execution logs, causing LLMs to struggle in accurately identifying the true root cause step. To address this, we propose StepFinder, a lightweight failure attribution framework. We use LLMs solely during the feature construction phase to encode execution logs into temporal semantic sequences. Subsequently, a parameter-efficient combination of temporal modeling and attention modules is applied to capture the sequential evolution and cross-step dependencies of the trajectories. Finally, the step-level error score is refined through multi-scale differences and position bias, enabling precise root cause identification. Experimental results on the Who&When benchmark demonstrate that StepFinder outperforms LLM-based methods in step-level failure attribution while achieving substantially higher inference efficiency, reducing inference time by 79% compared with the fastest LLM-based method, with no text generation overhead. Our code is available at https://github.com/taiyu-zhu/StepFinder.
AI Impact Assessments
Scientific Impact Assessment: StepFinder
1. Core Contribution
StepFinder addresses the problem of automated step-level failure attribution in LLM-based multi-agent systems (MAS). The key insight is that rather than using LLMs end-to-end for reasoning over execution trajectories (which is costly and noise-sensitive), one can decouple the problem: use LLMs only for encoding execution logs into dense semantic embeddings, then apply lightweight deep learning modules for temporal modeling and root cause identification.
The framework consists of three stages: (1) trajectory encoding via a pre-trained embedding model (Qwen3 Embedding), (2) a hybrid architecture combining BiLSTM-based temporal feature extraction with agent-aware attention mechanisms, and (3) a step-level error scoring module enhanced by multi-scale temporal differencing and position bias. The model is trained with a joint loss combining classification and a self-supervised temporal consistency objective.
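To make the third stage more concrete, the following is a minimal sketch of how multi-scale temporal differencing and a position bias could turn a sequence of step embeddings into step-level scores. It is not the authors' implementation: in StepFinder these signals refine a learned error score, whereas here they are used directly as a heuristic. The function names, the scale set `(1, 2, 4)`, the Euclidean distance, and the linear bias toward earlier steps are all assumptions made for illustration.

```python
import math


def l2_distance(a, b):
    """Euclidean distance between two equal-length embedding vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def multi_scale_differences(embeddings, scales=(1, 2, 4)):
    """For each step t, average the distance to step t-k over every scale k
    that stays inside the trajectory. Steps with no valid scale get 0.0."""
    feats = []
    for t in range(len(embeddings)):
        dists = [
            l2_distance(embeddings[t], embeddings[t - k])
            for k in scales
            if t - k >= 0
        ]
        feats.append(sum(dists) / len(dists) if dists else 0.0)
    return feats


def step_scores(embeddings, scales=(1, 2, 4), position_weight=0.1):
    """Add a hypothetical position bias that mildly favors earlier steps
    to the multi-scale difference of each step."""
    diffs = multi_scale_differences(embeddings, scales)
    n = len(embeddings)
    scores = []
    for t, d in enumerate(diffs):
        bias = position_weight * (1.0 - t / max(n - 1, 1))
        scores.append(d + bias)
    return scores


def predict_root_cause(embeddings, scales=(1, 2, 4), position_weight=0.1):
    """Return the index of the highest-scoring step."""
    scores = step_scores(embeddings, scales, position_weight)
    return max(range(len(scores)), key=scores.__getitem__)


# Toy trajectory: the semantic content shifts abruptly at step 2.
toy = [[0.0, 0.0], [0.0, 0.0], [5.0, 5.0], [5.0, 5.0]]
print(predict_root_cause(toy, scales=(1, 2)))
```

In the toy trajectory the embedding jumps at step 2, so that step receives the largest averaged difference across both scales and is selected. A real pipeline would replace the toy vectors with embeddings from the encoder and feed the differences into the trained scoring module rather than taking the argmax directly.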
The problem formulation is sensible—casting failure attribution as a structured temporal modeling task rather than a free-form reasoning task is a meaningful conceptual shift. The "decisive error" definition based on counterfactual intervention and Occam's Razor (earliest correctable step) provides a clean formal grounding.
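The "earliest correctable step" reading of the decisive error can be sketched as a simple search. The sketch assumes access to a counterfactual oracle, here a hypothetical callable `succeeds_if_corrected(trajectory, step)` that reports whether replacing the action at `step` with a correct one would make the task succeed. Such an oracle is an assumption for illustration; in practice it corresponds to human annotation or costly re-execution, which is exactly what an attribution model tries to avoid.

```python
from typing import Callable, Optional, Sequence


def find_decisive_step(
    trajectory: Sequence[str],
    succeeds_if_corrected: Callable[[Sequence[str], int], bool],
) -> Optional[int]:
    """Return the earliest step whose correction would flip the outcome
    from failure to success, or None if no single-step fix exists."""
    for step in range(len(trajectory)):
        if succeeds_if_corrected(trajectory, step):
            return step
    return None


# Toy oracle (hypothetical): fixing step 2 or step 4 would rescue the run.
def toy_oracle(trajectory: Sequence[str], step: int) -> bool:
    return step in {2, 4}


steps = ["plan", "search", "wrong_tool_call", "summarize", "bad_answer"]
print(find_decisive_step(steps, toy_oracle))
```

Both step 2 and step 4 are correctable in the toy example, but the search returns step 2 because the definition prefers the earliest such step.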
2. Methodological Rigor
Strengths in methodology:
Weaknesses:
3. Potential Impact
The paper addresses a genuine and growing need in the MAS ecosystem. As LLM-based multi-agent systems become more prevalent in production (coding assistants, scientific discovery, software development), automated failure attribution becomes critical for reliability engineering.
Practical implications:
Broader influence:
However, the impact is somewhat limited by the narrow evaluation scope (single benchmark, specific MAS configurations) and the relatively modest absolute performance levels.
4. Timeliness & Relevance
This work is highly timely. MAS failure rates of 41% to 86.7% reported in the literature represent a critical barrier to adoption. The Who&When benchmark (2025) established this as a formal research problem only very recently, and StepFinder represents an early and meaningful contribution to this nascent subfield. The shift from expensive LLM-based reasoning to efficient neural approaches aligns with broader trends toward making AI systems more practical and cost-effective.
5. Strengths & Limitations
Key Strengths:
Notable Limitations:
Additional Observations:
The paper's framing as a KDD contribution is appropriate given its focus on execution trace mining, though the connection to knowledge discovery could be strengthened. The trajectory regeneration strategy for training data augmentation is practical but raises questions about whether the model learns genuine failure patterns or artifacts of the generation process. The fact that different hyperparameter configurations are optimal for the two subsets suggests limited generalization without subset-specific tuning.
Generated Jun 3, 2026
Comparison History (21)
Paper 2 is likely to have higher scientific impact due to broader, more immediate applicability: efficient failure attribution is relevant across many LLM-based multi-agent workflows (software agents, automation, evaluation, safety), not tied to a single robotics domain. Its approach (LLM only for offline feature construction + lightweight temporal modeling) addresses a pressing timeliness issue—cost/latency and reliability of agentic systems—and shows strong efficiency gains on a public benchmark with released code, supporting rigor and adoption. Paper 1 is novel and valuable for UAV navigation, but its impact is narrower and benchmark-specific.
Paper 1 addresses a critical bottleneck in autonomous web agents by enabling dynamic, state-grounded skill retrieval, moving beyond static task-level planning. Given the massive interest and broad real-world applicability of web automation agents, improving their adaptability to changing environments offers higher potential for direct impact on core capabilities compared to the debugging and failure attribution framework presented in Paper 2.
Paper 1 has higher likely scientific impact due to clearer novelty and methodological rigor: it reframes failure attribution by converting trajectories into temporal semantic sequences once, then applying efficient temporal/attention modeling with explicit refinement (multi-scale differences, position bias), yielding strong benchmarked gains plus large inference-time reductions. This targets a broadly relevant, timely problem (reliability/debugging of LLM multi-agent systems) with reusable ideas across agent evaluation, monitoring, and ML systems. Paper 2 is compelling for applications and scale, but reads more as a system-engineering pipeline with less generalizable methodological contribution and weaker evidentiary grounding in the abstract.
Paper 1 has higher likely scientific impact: it introduces a concrete, technically novel framework for failure attribution in multi-agent LLM systems, demonstrates measurable performance and large efficiency gains on a benchmark, and provides code—supporting methodological rigor, reproducibility, and near-term adoption in real deployments. Its contributions can generalize to debugging, monitoring, and reliability engineering across LLM-agent platforms. Paper 2 is timely and broad but is a scoping review; while valuable for governance and agenda-setting, it is less likely to drive immediate, measurable downstream technical advances compared with a deployable method.
Paper 2 provides a fundamental theoretical contribution by proving that success conditioning—a technique used across multiple major fields (RLHF, goal-conditioned RL, Decision Transformers)—exactly solves a trust-region optimization problem. This unifying theoretical insight has broad impact across reinforcement learning, LLM alignment, and decision-making, connecting disparate methods under one framework. Paper 1, while practically useful, addresses a narrower engineering problem (failure attribution in multi-agent systems) with incremental improvements on a specific benchmark. The theoretical breadth and cross-field relevance of Paper 2 give it substantially higher potential impact.
Paper 1 addresses fundamental limitations in LLM reasoning (hallucinations and poor numerical computation) by introducing a novel framework combining symbolic anchoring and dynamic memory. This approach broadly enhances mathematical and multi-step reasoning, offering significant implications for foundational AI capabilities. In contrast, Paper 2 focuses on a narrower, specialized problem of failure attribution and debugging in multi-agent systems. Thus, Paper 1 has a broader potential impact across various domains.
Paper 1 (ChemCoTBench-V2) addresses a critical and timely problem—evaluating the reasoning process of LLMs in chemistry rather than just final answers—with a novel, scalable, rule-verifiable benchmark spanning 5,620 samples across 18 tasks. It introduces a methodologically rigorous framework (deterministic verifiers, three separate evaluation signals) that has broad implications for AI-assisted scientific discovery and trustworthy LLM deployment in chemistry. Paper 2, while useful, addresses a narrower problem (failure attribution in multi-agent systems) with incremental improvements over existing methods on a single benchmark. Paper 1's domain impact and novel evaluation paradigm give it higher potential.
TSQAgent addresses a more fundamental and broadly applicable problem—time series data quality assessment—which impacts numerous scientific and industrial domains. It introduces both a benchmark (TSQBench) and a novel agentic framework with demonstrated downstream utility improvements. Paper 1 (StepFinder) solves a narrower problem (failure attribution in multi-agent systems) with strong engineering contributions but more limited scope. Paper 2's combination of benchmark creation, novel methodology, and demonstrated real-world applicability across eleven datasets suggests broader scientific impact and greater potential for adoption across fields.
StepFinder addresses a practical and timely problem in LLM-based multi-agent systems—failure attribution—with a novel lightweight framework that significantly outperforms existing methods while reducing inference time by 79%. Given the rapid growth of multi-agent LLM systems in both research and industry, this work has broad applicability and immediate real-world relevance. Paper 2, while theoretically rigorous in extending non-monotonic reasoning to defeasible standpoint logic, addresses a niche area in formal logic with a narrower audience and fewer direct practical applications, limiting its broader impact.
Paper 2 has higher likely impact due to stronger novelty and timeliness in a rapidly growing area (LLM multi-agent reliability). It proposes a new, efficient framework that reduces dependence on expensive LLM inference while improving attribution performance, with clear methodological components (temporal semantic encoding, temporal/attention modeling, refinement) and strong efficiency gains on a known benchmark, supporting real-world deployment. Paper 1 is solid but largely applies established CNN ensembles and augmentation to a limited 3-class WiFi HAR setting; the incremental gains and narrower scope reduce expected cross-field impact.
EvoTrainer introduces a fundamentally novel paradigm shift—co-evolving both LLM policies and training harnesses—addressing a core limitation in autonomous RL training. Its breadth across mathematical reasoning, code generation, and software engineering demonstrates wide applicability. The concept of moving beyond static recipe search toward joint evolution is a more transformative contribution with broader implications for the entire LLM training ecosystem. StepFinder, while useful for failure attribution in multi-agent systems, addresses a narrower diagnostic problem with incremental improvements over existing methods.
Paper 2 has higher potential impact due to greater novelty and broader relevance: it targets a key bottleneck in LLM-for-formalization—turning generated proofs into library-quality, reusable artifacts—via a process-guided, multi-phase agentic workflow that aligns with human refactoring practices. This can directly affect formal methods, theorem proving, software verification, and AI-assisted mathematics, with strong real-world applications in maintaining large proof/codebases. Paper 1 is useful and efficient but is more incremental (feature+temporal model for attribution) and narrower in cross-field reach.
Paper 2 has higher potential impact because it introduces a more general, field-spanning framework: converting LLM reasoning traces into verifiable dependency graphs and defining topology-based metrics (including reasoning efficiency). This is broadly applicable to evaluation, interpretability, scaling analysis, and failure diagnosis across many reasoning tasks and model classes, beyond multi-agent settings. Its benchmark+measurement approach is timely and likely to become a reusable evaluation primitive. Paper 1 is rigorous and practically useful, but is more domain-specific (failure attribution in multi-agent trajectories) and thus narrower in cross-field influence.
DeltaMem addresses the fundamental and broadly relevant problem of experience memory organization for LLM agents, introducing novel concepts (residual experience trees, autonomous consolidation) with wide applicability across diverse interactive environments. Its hierarchical memory structure with delta nodes tackles redundancy and retrieval conflicts in a principled way that could influence memory architectures broadly. StepFinder, while valuable for failure attribution in multi-agent systems, addresses a narrower diagnostic task with a more incremental contribution (combining temporal modeling with attention for root cause identification). DeltaMem's conceptual novelty and broader applicability give it higher impact potential.
ForeSci introduces a novel benchmark paradigm for evaluating LLM agents' forward-looking research judgment with temporal controls, addressing a fundamental gap in how we assess AI systems for scientific decision-making. Its broader scope (500 tasks, multiple domains, multiple agent architectures) and the novel concept of evidence-decision decoupling provide foundational insights for the growing field of AI-for-science. StepFinder, while technically solid, addresses a narrower problem (failure attribution in multi-agent systems) with incremental improvements. ForeSci's timeliness and potential to shape how research agents are evaluated gives it higher impact potential.
Paper 2 (StepFinder) likely has higher scientific impact due to broader and timelier relevance: reliability and debugging of LLM-based multi-agent systems is a fast-moving, widely applicable problem across AI, software engineering, and deployment. Its lightweight framework (LLM only for feature construction, efficient temporal/attention modeling at inference) offers clear real-world utility via large latency/cost reductions and improved attribution accuracy on a public benchmark. Paper 1 is novel within EEG/BCI and important clinically, but its impact is narrower and depends more on data heterogeneity, regulatory pathways, and domain-specific adoption.
MedCUA-Bench addresses a critical gap at the intersection of AI agents and healthcare—a high-stakes domain with enormous real-world impact. It introduces a novel benchmark covering 18 clinical scenarios with safety evaluation dimensions, revealing a significant performance gap (best model at 54.2%, open-source at 2.5%) that will drive substantial future research. Its breadth across 10 medical domains, evaluation of 23 agents, and focus on clinical safety make it highly relevant and timely. StepFinder, while technically solid, addresses a narrower problem (failure attribution in multi-agent systems) with more incremental contributions.
Paper 1 is likely to have higher scientific impact due to broader relevance and novelty: it proposes a general LLM–knowledge graph integration paradigm (schema-to-code, executable reasoning) that addresses scalability and compositionality limits of prompt-injection retrieval, with strong gains across multiple standard KGQA benchmarks. This could influence LLM tool-use, neuro-symbolic reasoning, and retrieval-augmented systems beyond QA. Paper 2 is timely and practically valuable for multi-agent reliability, but is narrower (failure attribution on a specific benchmark) and more incremental in methodology (feature encoding + temporal/attention modeling), likely yielding more limited cross-field impact.
Paper 1 likely has higher scientific impact due to stronger methodological novelty and broader relevance: it reframes failure attribution by moving LLM use to offline feature construction and applying efficient temporal/attention modeling for root-cause step identification, addressing a core reliability bottleneck in multi-agent systems. This is timely as agentic workflows proliferate and has cross-domain applicability to debugging, monitoring, and trustworthy AI. Paper 2 is practically valuable for cost reduction in coding agents, but it is closer to systems/prompt-engineering middleware (translation + rewriting) with narrower scientific generality and less fundamental contribution.
Paper 2 addresses failure attribution in multi-agent systems, a critical bottleneck for deploying reliable, complex, real-world AI applications. By introducing a lightweight framework that significantly improves both accuracy and inference efficiency over standard LLM-based debugging, it provides a foundational tool for system reliability. While Paper 1 offers valuable cost optimizations for tool use, Paper 2 tackles a broader and more pressing challenge—understanding and fixing cascading failures in autonomous systems—giving it higher potential for widespread methodological impact.