StepFinder: A Temporal Semantic Framework for Failure Attribution in Multi-Agent Systems

Taiyu Zhu, Yifan Wu, Weilin Jin, Ying Li, Gang Huang

Jun 2, 2026

arXiv:2606.03467v1

cs.AI (primary)

#2380 of 3404 · Artificial Intelligence

Tournament Score: 1350 ± 44

Win Rate: 43%

Rating: 5.5 / 10 (Significance 5.5, Rigor 5, Novelty 6, Clarity 7)

Abstract

LLM-based multi-agent systems exhibit remarkable collaborative capabilities in complex multi-step tasks. However, these systems are highly sensitive to single-step execution errors that can propagate through agent interactions and lead to cascading failures. To understand the causes of failure and improve system reliability, failure attribution has been introduced as a task that aims to automatically identify the root cause step responsible for a failure. Existing failure attribution methods mainly rely on LLMs to reason over original execution trajectories, which not only incur high inference costs and latency, but also suffer from interference caused by redundant and noisy execution logs, causing LLMs to struggle in accurately identifying the true root cause step. To address this, we propose StepFinder, a lightweight failure attribution framework. We use LLMs solely during the feature construction phase to encode execution logs into temporal semantic sequences. Subsequently, a parameter-efficient combination of temporal modeling and attention modules is applied to capture the sequential evolution and cross-step dependencies of the trajectories. Finally, the step-level error score is refined through multi-scale differences and position bias, enabling precise root cause identification. Experimental results on the Who&When benchmark demonstrate that StepFinder outperforms LLM-based methods in step-level failure attribution while achieving substantially higher inference efficiency, reducing inference time by 79% compared with the fastest LLM-based method, with no text generation overhead. Our code is available at https://github.com/taiyu-zhu/StepFinder.

AI Impact Assessments

(1 model)

Scientific Impact Assessment: StepFinder

1. Core Contribution

StepFinder addresses the problem of automated step-level failure attribution in LLM-based multi-agent systems (MAS). The key insight is that rather than using LLMs end-to-end for reasoning over execution trajectories (which is costly and noise-sensitive), one can decouple the problem: use LLMs only for encoding execution logs into dense semantic embeddings, then apply lightweight deep learning modules for temporal modeling and root cause identification.

The framework consists of three stages: (1) trajectory encoding via a pre-trained embedding model (Qwen3 Embedding), (2) a hybrid architecture combining BiLSTM-based temporal feature extraction with agent-aware attention mechanisms, and (3) a step-level error scoring module enhanced by multi-scale temporal differencing and position bias. The model is trained with a joint loss combining classification and a self-supervised temporal consistency objective.
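The third stage can be illustrated with a small sketch. This review does not reproduce the paper's equations, so the code below is not the authors' formula: the function names, the rectified difference, the averaging over scales, and the `diff_weight` and `strength` parameters are all assumptions made for illustration. Only the scale set {1, 2} and the idea of a linearly decaying prior over step positions come from the text.

```python
from typing import List, Sequence


def multi_scale_diff(scores: Sequence[float], scales: Sequence[int] = (1, 2)) -> List[float]:
    """For each step t, average the positive jump scores[t] - scores[t - s] over scales s.

    Hypothetical formulation: a step whose raw score rises sharply relative
    to the steps just before it receives a larger value.
    """
    refined = []
    for t, s_t in enumerate(scores):
        jumps = [max(0.0, s_t - scores[t - s]) for s in scales if t - s >= 0]
        refined.append(sum(jumps) / len(jumps) if jumps else 0.0)
    return refined


def position_bias(num_steps: int, strength: float = 0.1) -> List[float]:
    """Linearly decaying prior (assumed form): largest at step 0, zero at the last step."""
    if num_steps == 1:
        return [strength]
    return [strength * (1.0 - t / (num_steps - 1)) for t in range(num_steps)]


def refine(scores: Sequence[float], diff_weight: float = 0.5, strength: float = 0.1) -> List[float]:
    """Combine raw scores, multi-scale differences, and the position prior."""
    diffs = multi_scale_diff(scores)
    bias = position_bias(len(scores), strength)
    return [s + diff_weight * d + b for s, d, b in zip(scores, diffs, bias)]


# Toy raw error scores for a five-step trajectory (made-up values).
raw = [0.1, 0.2, 0.7, 0.6, 0.65]
refined = refine(raw)
predicted_step = max(range(len(refined)), key=refined.__getitem__)
print(predicted_step)
```

In this toy trajectory the sharp rise at index 2 dominates, so the refined ranking favors the step where the score first jumps rather than later steps that merely stay high.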

The problem formulation is sensible—casting failure attribution as a structured temporal modeling task rather than a free-form reasoning task is a meaningful conceptual shift. The "decisive error" definition based on counterfactual intervention and Occam's Razor (earliest correctable step) provides a clean formal grounding.
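The "earliest correctable step" idea can be made concrete with a short sketch. It assumes access to a counterfactual oracle, here the hypothetical callable `succeeds_if_corrected`, which reports whether fixing a given step would turn the failed run into a success. Such an oracle is generally unavailable in practice, which is why attribution has to be estimated; the code only illustrates the definition as summarized in this review.

```python
from typing import Callable, Optional


def decisive_step(num_steps: int, succeeds_if_corrected: Callable[[int], bool]) -> Optional[int]:
    """Return the earliest step whose correction would rescue the run, or None."""
    for t in range(num_steps):
        if succeeds_if_corrected(t):
            return t
    return None


# Toy oracle (made up): correcting step 3 or step 5 would each rescue the run.
rescuing_steps = {3, 5}
result = decisive_step(8, lambda t: t in rescuing_steps)
print(result)
```

Scanning from the first step and stopping at the first hit is what encodes the preference for the earliest explanation when several corrections would work.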

2. Methodological Rigor

Strengths in methodology:

The formal problem definition (Eq. 1-2) is clean and well-motivated.

The ablation study is thorough, systematically removing each component (TFE, ASI, agent identity, multi-scale differencing, position bias, temporal consistency loss) and evaluating impact.

Sensitivity analysis of four key hyperparameters across both subsets demonstrates reasonable robustness.

Efficiency analysis with concrete metrics (token counts, inference time) provides practical evidence.

Weaknesses:

The evaluation is conducted exclusively on the Who&When benchmark, which comprises only 126 (Alg) and 58 (HC) test trajectories. This is a very small evaluation set, raising concerns about statistical reliability. While standard deviations are reported, the small sample sizes limit confidence in the reported improvements.

The training data is synthetically generated via LLM-prompted trajectory regeneration (17 trajectories per task for Alg, 14 for HC). The quality and diversity of this synthetic data are not rigorously validated, and potential distribution shift between synthetic training data and real/benchmark test data is not discussed.

The position bias (Eq. 9) introduces a linearly decaying prior favoring earlier steps. While motivated by cascading failure theory, this is a strong structural assumption. The paper acknowledges it as "mild," but the ablation shows removing it drops Alg accuracy by ~2.65% while barely affecting HC, suggesting it may be overfitting to dataset characteristics rather than capturing a universal principle.

Absolute accuracy numbers remain quite low (29.63% on Alg, 22.99% on HC), which, while representing improvements over baselines, still indicate the task is far from solved.

The comparison with concurrent methods (AgenTracer, CDC-MAS) in Appendix C reveals that StepFinder actually underperforms on the Alg subset (29.63% vs. 42.86% for AgenTracer with ground truth, 36.20% for CDC-MAS). The paper somewhat downplays this, attributing the gap to shorter/more structured trajectories favoring LLM reasoning.
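To put the sample-size concern in numbers, one can compute a Wilson score interval for an accuracy measured on a test set of that size. This is a rough sketch under a simplifying assumption: it treats each reported accuracy as a single binomial proportion over independent trajectories, which ignores that the paper reports means over repeated runs. The function name is illustrative.

```python
import math
from typing import Tuple


def wilson_interval(p_hat: float, n: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a proportion p_hat observed on n samples."""
    denom = 1.0 + z * z / n
    center = (p_hat + z * z / (2 * n)) / denom
    half = z * math.sqrt(p_hat * (1.0 - p_hat) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


# Accuracies and test-set sizes quoted in this assessment.
alg_lo, alg_hi = wilson_interval(0.2963, 126)
hc_lo, hc_hi = wilson_interval(0.2299, 58)
print(f"Alg: [{alg_lo:.3f}, {alg_hi:.3f}]")
print(f"HC:  [{hc_lo:.3f}, {hc_hi:.3f}]")
```

Under this approximation the HC interval is wider than the Alg interval because its test set is less than half the size, so a gap of a few points between methods on either subset is hard to distinguish from sampling noise.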

3. Potential Impact

The paper addresses a genuine and growing need in the MAS ecosystem. As LLM-based multi-agent systems become more prevalent in production (coding assistants, scientific discovery, software development), automated failure attribution becomes critical for reliability engineering.

Practical implications:

The 79% inference time reduction over the fastest LLM-based method is significant for real-time or high-throughput monitoring.

Zero text generation overhead eliminates a major cost driver for LLM-based diagnosis.

The ranking-based approach (Acc@K) provides actionable outputs for human-in-the-loop debugging.
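For readers unfamiliar with the metric, Acc@K can be computed as the fraction of trajectories whose true root-cause step appears among the K highest-scored steps. The sketch below uses made-up scores and labels; the function name and data layout are assumptions, not the benchmark's evaluation code.

```python
from typing import Sequence


def acc_at_k(score_lists: Sequence[Sequence[float]], true_steps: Sequence[int], k: int) -> float:
    """Fraction of trajectories whose true step is among the k top-scored steps."""
    hits = 0
    for scores, truth in zip(score_lists, true_steps):
        # Rank step indices by descending error score.
        ranked = sorted(range(len(scores)), key=lambda t: scores[t], reverse=True)
        if truth in ranked[:k]:
            hits += 1
    return hits / len(true_steps)


# Three toy trajectories with three steps each (made-up values).
scores = [[0.1, 0.9, 0.3], [0.5, 0.2, 0.4], [0.2, 0.3, 0.8]]
truth = [1, 2, 0]
acc1 = acc_at_k(scores, truth, 1)
acc2 = acc_at_k(scores, truth, 2)
print(acc1, acc2)
```

A human debugger who inspects the top K candidates benefits from any hit within that window, which is why Acc@K for K > 1 is a natural complement to exact-match accuracy.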

Broader influence:

The decoupled architecture (LLM for encoding, lightweight model for reasoning) could inspire similar approaches in other LLM-heavy diagnostic pipelines.

The temporal semantic modeling perspective may transfer to other sequence debugging tasks (e.g., workflow debugging, process mining).

However, the impact is somewhat limited by the narrow evaluation scope (single benchmark, specific MAS configurations) and the relatively modest absolute performance levels.

4. Timeliness & Relevance

This work is highly timely. MAS failure rates of 41-86.7% reported in the literature represent a critical barrier to adoption. The Who&When benchmark (2025) established this as a formal research problem only very recently, and StepFinder represents an early and meaningful contribution to this nascent subfield. The shift from expensive LLM-based reasoning to efficient neural approaches aligns with broader trends toward making AI systems more practical and cost-effective.

5. Strengths & Limitations

Key Strengths:

Clear problem formulation with formal definitions

Principled architectural design with well-motivated components

Significant efficiency gains (79% lower inference time than the fastest LLM-based method, roughly a 4.8x speedup) with competitive or superior accuracy

Comprehensive ablation and sensitivity analysis

Code availability enhances reproducibility
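The speedup quoted in the strengths above follows from the 79% inference-time reduction reported in the abstract. A quick check of the conversion, with an illustrative helper name:

```python
def speedup_from_reduction(reduction: float) -> float:
    """Convert a fractional time reduction (e.g. 0.79) into a speedup factor."""
    if not 0.0 <= reduction < 1.0:
        raise ValueError("reduction must be in [0, 1)")
    return 1.0 / (1.0 - reduction)


speedup = speedup_from_reduction(0.79)
print(f"{speedup:.2f}x")
```

A 79% reduction leaves 21% of the original time, so the factor is 1 / 0.21, slightly under 5x.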

Notable Limitations:

Very small test sets (126 and 58 trajectories) undermine statistical confidence

Underperforms concurrent methods on the Alg subset

Synthetic training data generation process may introduce biases

Position bias is a dataset-specific heuristic rather than a principled solution

Single-benchmark evaluation limits generalizability claims

The multi-scale differencing uses only scales {1, 2}, which is quite limited for "multi-scale"

Hyperparameter sensitivity varies substantially between subsets (e.g., optimal λ is 0.9 for Alg vs. 0.02 for HC), suggesting the framework requires careful per-domain tuning

Additional Observations:

The paper's framing as a KDD contribution is appropriate given its focus on execution trace mining, though the connection to knowledge discovery could be strengthened. The trajectory regeneration strategy for training data augmentation is practical but raises questions about whether the model learns genuine failure patterns or artifacts of the generation process. The fact that different hyperparameter configurations are optimal for the two subsets suggests limited generalization without subset-specific tuning.

Rating: 5.5 / 10

Significance 5.5 · Rigor 5 · Novelty 6 · Clarity 7

Generated Jun 3, 2026

Comparison History (21)

vs. WorldFly: A World-Model-Based Vision-Language-Action Model for UAV Navigation

gpt-5.2 · 6/6/2026

Paper 2 is likely to have higher scientific impact due to broader, more immediate applicability: efficient failure attribution is relevant across many LLM-based multi-agent workflows (software agents, automation, evaluation, safety), not tied to a single robotics domain. Its approach (LLM only for offline feature construction + lightweight temporal modeling) addresses a pressing timeliness issue—cost/latency and reliability of agentic systems—and shows strong efficiency gains on a public benchmark with released code, supporting rigor and adoption. Paper 1 is novel and valuable for UAV navigation, but its impact is narrower and benchmark-specific.

vs. Online Skill Learning for Web Agents via State-Grounded Dynamic Retrieval

gemini-3.1 · 6/5/2026

Paper 1 addresses a critical bottleneck in autonomous web agents by enabling dynamic, state-grounded skill retrieval, moving beyond static task-level planning. Given the massive interest and broad real-world applicability of web automation agents, improving their adaptability to changing environments offers higher potential for direct impact on core capabilities compared to the debugging and failure attribution framework presented in Paper 2.

vs. Beyond End-to-End Video Models: An LLM-Based Multi-Agent System for Educational Video Generation

gpt-5.2 · 6/5/2026

Paper 1 has higher likely scientific impact due to clearer novelty and methodological rigor: it reframes failure attribution by converting trajectories into temporal semantic sequences once, then applying efficient temporal/attention modeling with explicit refinement (multi-scale differences, position bias), yielding strong benchmarked gains plus large inference-time reductions. This targets a broadly relevant, timely problem (reliability/debugging of LLM multi-agent systems) with reusable ideas across agent evaluation, monitoring, and ML systems. Paper 2 is compelling for applications and scale, but reads more as a system-engineering pipeline with less generalizable methodological contribution and weaker evidentiary grounding in the abstract.

vs. A Scoping Review of the Ethical Perspectives on Anthropomorphising Large Language Model-Based Conversational Agents

gpt-5.2 · 6/5/2026

Paper 1 has higher likely scientific impact: it introduces a concrete, technically novel framework for failure attribution in multi-agent LLM systems, demonstrates measurable performance and large efficiency gains on a benchmark, and provides code—supporting methodological rigor, reproducibility, and near-term adoption in real deployments. Its contributions can generalize to debugging, monitoring, and reliability engineering across LLM-agent platforms. Paper 2 is timely and broad but is a scoping review; while valuable for governance and agenda-setting, it is less likely to drive immediate, measurable downstream technical advances compared with a deployable method.

vs. Success Conditioning as Policy Improvement: The Optimization Problem Solved by Imitating Success

claude-opus-4.6 · 6/5/2026

Paper 2 provides a fundamental theoretical contribution by proving that success conditioning—a technique used across multiple major fields (RLHF, goal-conditioned RL, Decision Transformers)—exactly solves a trust-region optimization problem. This unifying theoretical insight has broad impact across reinforcement learning, LLM alignment, and decision-making, connecting disparate methods under one framework. Paper 1, while practically useful, addresses a narrower engineering problem (failure attribution in multi-agent systems) with incremental improvements on a specific benchmark. The theoretical breadth and cross-field relevance of Paper 2 give it substantially higher potential impact.

vs. eMoT: evolving Memory-of-Thought via Symbolic Anchoring and Memory Corrosion

gemini-3.1 · 6/3/2026

Paper 1 addresses fundamental limitations in LLM reasoning (hallucinations and poor numerical computation) by introducing a novel framework combining symbolic anchoring and dynamic memory. This approach broadly enhances mathematical and multi-step reasoning, offering significant implications for foundational AI capabilities. In contrast, Paper 2 focuses on a narrower, specialized problem of failure attribution and debugging in multi-agent systems. Thus, Paper 1 has a broader potential impact across various domains.

vs. From Answers to States: Verifiable Process-Level Evaluation of Chemical Reasoning in Large Language Models

claude-opus-4.6 · 6/3/2026

Paper 1 (ChemCoTBench-V2) addresses a critical and timely problem—evaluating the reasoning process of LLMs in chemistry rather than just final answers—with a novel, scalable, rule-verifiable benchmark spanning 5,620 samples across 18 tasks. It introduces a methodologically rigorous framework (deterministic verifiers, three separate evaluation signals) that has broad implications for AI-assisted scientific discovery and trustworthy LLM deployment in chemistry. Paper 2, while useful, addresses a narrower problem (failure attribution in multi-agent systems) with incremental improvements over existing methods on a single benchmark. Paper 1's domain impact and novel evaluation paradigm give it higher potential.

vs. TSQAgent: Rating Time Series Data Quality via Dedicated Agentic Reasoning

claude-opus-4.6 · 6/3/2026

TSQAgent addresses a more fundamental and broadly applicable problem—time series data quality assessment—which impacts numerous scientific and industrial domains. It introduces both a benchmark (TSQBench) and a novel agentic framework with demonstrated downstream utility improvements. Paper 1 (StepFinder) solves a narrower problem (failure attribution in multi-agent systems) with strong engineering contributions but more limited scope. Paper 2's combination of benchmark creation, novel methodology, and demonstrated real-world applicability across eleven datasets suggests broader scientific impact and greater potential for adoption across fields.

vs. Towards Non-Monotonic Entailment in Propositional Defeasible Standpoint Logic

claude-opus-4.6 · 6/3/2026

StepFinder addresses a practical and timely problem in LLM-based multi-agent systems—failure attribution—with a novel lightweight framework that significantly outperforms existing methods while reducing inference time by 79%. Given the rapid growth of multi-agent LLM systems in both research and industry, this work has broad applicability and immediate real-world relevance. Paper 2, while theoretically rigorous in extending non-monotonic reasoning to defeasible standpoint logic, addresses a niche area in formal logic with a narrower audience and fewer direct practical applications, limiting its broader impact.

vs. WISE-HAR: A Generalizable Ensemble Deep Learning Framework for WiFi-Based Human Activity Recognition

gpt-5.2 · 6/3/2026

Paper 2 has higher likely impact due to stronger novelty and timeliness in a rapidly growing area (LLM multi-agent reliability). It proposes a new, efficient framework that reduces dependence on expensive LLM inference while improving attribution performance, with clear methodological components (temporal semantic encoding, temporal/attention modeling, refinement) and strong efficiency gains on a known benchmark, supporting real-world deployment. Paper 1 is solid but largely applies established CNN ensembles and augmentation to a limited 3-class WiFi HAR setting; the incremental gains and narrower scope reduce expected cross-field impact.

vs. EvoTrainer: Co-Evolving LLM Policies and Training Harnesses for Autonomous Agentic Reinforcement Learning

claude-opus-4.6 · 6/3/2026

EvoTrainer introduces a fundamentally novel paradigm shift—co-evolving both LLM policies and training harnesses—addressing a core limitation in autonomous RL training. Its breadth across mathematical reasoning, code generation, and software engineering demonstrates wide applicability. The concept of moving beyond static recipe search toward joint evolution is a more transformative contribution with broader implications for the entire LLM training ecosystem. StepFinder, while useful for failure attribution in multi-agent systems, addresses a narrower diagnostic problem with incremental improvements over existing methods.

vs. Proof-Refactor: Refactoring Generated Formal Proofs into Modular Artifacts

gpt-5.2 · 6/3/2026

Paper 2 has higher potential impact due to greater novelty and broader relevance: it targets a key bottleneck in LLM-for-formalization—turning generated proofs into library-quality, reusable artifacts—via a process-guided, multi-phase agentic workflow that aligns with human refactoring practices. This can directly affect formal methods, theorem proving, software verification, and AI-assisted mathematics, with strong real-world applications in maintaining large proof/codebases. Paper 1 is useful and efficient but is more incremental (feature+temporal model for attribution) and narrower in cross-field reach.

vs. Reasoning Structure of Large Language Models

gpt-5.2 · 6/3/2026

Paper 2 has higher potential impact because it introduces a more general, field-spanning framework: converting LLM reasoning traces into verifiable dependency graphs and defining topology-based metrics (including reasoning efficiency). This is broadly applicable to evaluation, interpretability, scaling analysis, and failure diagnosis across many reasoning tasks and model classes, beyond multi-agent settings. Its benchmark+measurement approach is timely and likely to become a reusable evaluation primitive. Paper 1 is rigorous and practically useful, but is more domain-specific (failure attribution in multi-agent trajectories) and thus narrower in cross-field influence.

vs. DELTAMEM: Incremental Experience Memory for LLM Agents via Residual Trees

claude-opus-4.6 · 6/3/2026

DeltaMem addresses the fundamental and broadly relevant problem of experience memory organization for LLM agents, introducing novel concepts (residual experience trees, autonomous consolidation) with wide applicability across diverse interactive environments. Its hierarchical memory structure with delta nodes tackles redundancy and retrieval conflicts in a principled way that could influence memory architectures broadly. StepFinder, while valuable for failure attribution in multi-agent systems, addresses a narrower diagnostic task with a more incremental contribution (combining temporal modeling with attention for root cause identification). DeltaMem's conceptual novelty and broader applicability give it higher impact potential.

vs. ForeSci: Evaluating LLM Agents for Forward-Looking AI Research Judgment

claude-opus-4.6 · 6/3/2026

ForeSci introduces a novel benchmark paradigm for evaluating LLM agents' forward-looking research judgment with temporal controls, addressing a fundamental gap in how we assess AI systems for scientific decision-making. Its broader scope (500 tasks, multiple domains, multiple agent architectures) and the novel concept of evidence-decision decoupling provide foundational insights for the growing field of AI-for-science. StepFinder, while technically solid, addresses a narrower problem (failure attribution in multi-agent systems) with incremental improvements. ForeSci's timeliness and potential to shape how research agents are evaluated gives it higher impact potential.

vs. EvoBrain: Continual Learning of EEG Foundation Models Across Heterogeneous BCI Tasks

gpt-5.2 · 6/3/2026

Paper 2 (StepFinder) likely has higher scientific impact due to broader and timelier relevance: reliability and debugging of LLM-based multi-agent systems is a fast-moving, widely applicable problem across AI, software engineering, and deployment. Its lightweight framework (LLM only for feature construction, efficient temporal/attention modeling at inference) offers clear real-world utility via large latency/cost reductions and improved attribution accuracy on a public benchmark. Paper 1 is novel within EEG/BCI and important clinically, but its impact is narrower and depends more on data heterogeneity, regulatory pathways, and domain-specific adoption.

vs. MedCUA-Bench: A Screenshot-Only Benchmark for Clinical Computer-Use Agents

claude-opus-4.6 · 6/3/2026

MedCUA-Bench addresses a critical gap at the intersection of AI agents and healthcare—a high-stakes domain with enormous real-world impact. It introduces a novel benchmark covering 18 clinical scenarios with safety evaluation dimensions, revealing a significant performance gap (best model at 54.2%, open-source at 2.5%) that will drive substantial future research. Its breadth across 10 medical domains, evaluation of 23 agents, and focus on clinical safety make it highly relevant and timely. StepFinder, while technically solid, addresses a narrower problem (failure attribution in multi-agent systems) with more incremental contributions.

vs. Code-on-Graph: Iterative Programmatic Reasoning via Large Language Models on Knowledge Graphs

gpt-5.2 · 6/3/2026

Paper 1 is likely to have higher scientific impact due to broader relevance and novelty: it proposes a general LLM–knowledge graph integration paradigm (schema-to-code, executable reasoning) that addresses scalability and compositionality limits of prompt-injection retrieval, with strong gains across multiple standard KGQA benchmarks. This could influence LLM tool-use, neuro-symbolic reasoning, and retrieval-augmented systems beyond QA. Paper 2 is timely and practically valuable for multi-agent reliability, but is narrower (failure attribution on a specific benchmark) and more incremental in methodology (feature encoding + temporal/attention modeling), likely yielding more limited cross-field impact.

vs. Cross-Lingual Token Arbitrage: Optimizing Code Agent Context Windows via Local LLM Preprocessing

gpt-5.2 · 6/3/2026

Paper 1 likely has higher scientific impact due to stronger methodological novelty and broader relevance: it reframes failure attribution by moving LLM use to offline feature construction and applying efficient temporal/attention modeling for root-cause step identification, addressing a core reliability bottleneck in multi-agent systems. This is timely as agentic workflows proliferate and has cross-domain applicability to debugging, monitoring, and trustworthy AI. Paper 2 is practically valuable for cost reduction in coding agents, but it is closer to systems/prompt-engineering middleware (translation + rewriting) with narrower scientific generality and less fundamental contribution.

vs. ToolGate: Token-Efficient Pre-Call Control for Tool-Augmented Vision-Language Agents

gemini-3.1 · 6/3/2026

Paper 2 addresses failure attribution in multi-agent systems, a critical bottleneck for deploying reliable, complex, real-world AI applications. By introducing a lightweight framework that significantly improves both accuracy and inference efficiency over standard LLM-based debugging, it provides a foundational tool for system reliability. While Paper 1 offers valuable cost optimizations for tool use, Paper 2 tackles a broader and more pressing challenge—understanding and fixing cascading failures in autonomous systems—giving it higher potential for widespread methodological impact.