Insights Generator: Systematic Corpus-Level Trace Diagnostics for LLM Agents

Akshay Manglik, Apaar Shanker, Kaustubh Deshpande, Jason Qin, Yash Maurya, Veronica Chatrath, Vijay S. Kalmath, Levi Lentz

Frozen v1: this version was superseded on arXiv (latest: v2). Stats below reflect the state at freeze time and will not change.
#994 of 2292 · Artificial Intelligence
Tournament Score: 1428 ± 44 (scale 1050 to 1800)
Win Rate: 71% (17 wins, 7 losses, 24 matches)
Rating: 6.5 / 10 (dimensions: Significance, Rigor, Novelty, Clarity)

Abstract

Diagnosing failures in LLM agents remains largely manual. Practitioners inspect a small subset of execution traces, form ad-hoc hypotheses, and iterate. This process misses patterns that only emerge across trace populations and does not scale to production corpora where individual traces span tens of thousands of tokens. We formalize the problem of corpus-level trace diagnostics. Given a corpus of execution traces, the goal is to produce grounded natural-language insights that characterize systematic behavioral patterns across trace groups, each linked to supporting evidence. We present the Insights Generator (IG), a multi-agent system that answers diagnostic questions by proposing and testing hypotheses across the trace corpus to produce an evidence-backed insights report. We evaluate IG across qualitative and objective dimensions, spanning rubric-based report assessment and downstream performance improvements achieved by implementing IG insights. Human experts using IG reports improve scaffold performance by 30.4pp over the unmodified baseline scaffold, and coding agents leveraging IG-derived insights show consistent and stable gains. Across benchmarks, IG's scout-investigator architecture produces findings comparable in detection coverage to competing approaches, while domain experts rated IG reports as leading depth and evidence quality.

AI Impact Assessments

(1 model)

Scientific Impact Assessment: Insights Generator

1. Core Contribution

The paper formalizes "corpus-level trace diagnostics" as a distinct problem: given a corpus of LLM agent execution traces, produce grounded natural-language insights characterizing systematic behavioral patterns, each linked to supporting evidence. The key architectural contribution is the scout-investigator decomposition — scouts propose hypotheses from sampled traces (breadth), while investigators validate them at corpus scale with statistical evidence (depth). All trace access is mediated through a stateful Python data processing layer with pre-injected analysis primitives, avoiding raw trace content flooding the LLM context.
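The breadth/depth split described above can be illustrated with a minimal sketch. This is not the authors' implementation: in IG the scouts and investigators are LLM agents, whereas here the hypotheses are hand-written predicates, and the `Trace` layout, `Hypothesis`, `Finding`, `scout`, `investigate`, and `run` are all hypothetical names introduced for illustration.

```python
import random
from dataclasses import dataclass
from typing import Callable

# Hypothetical trace record: {"id": ..., "passed": bool, "events": [...]}
Trace = dict

@dataclass
class Hypothesis:
    name: str
    predicate: Callable[[Trace], bool]  # True when the trace shows the pattern

@dataclass
class Finding:
    name: str
    support: int          # number of traces matching the pattern
    failure_rate: float   # failure rate among matching traces
    baseline_rate: float  # failure rate over the whole corpus
    evidence_ids: list    # a few matching trace ids as supporting evidence

def scout(sample, candidates, min_hits=1):
    """Breadth: keep candidate patterns that appear in a small sample."""
    return [h for h in candidates
            if sum(h.predicate(t) for t in sample) >= min_hits]

def investigate(corpus, hyp):
    """Depth: measure one hypothesis over the full corpus."""
    matched = [t for t in corpus if hyp.predicate(t)]
    fails = sum(not t["passed"] for t in matched)
    baseline = sum(not t["passed"] for t in corpus) / len(corpus)
    rate = fails / len(matched) if matched else 0.0
    return Finding(hyp.name, len(matched), rate, baseline,
                   [t["id"] for t in matched[:3]])

def run(corpus, candidates, sample_size=5, seed=0):
    """Scouts look at a sample; investigators check survivors corpus-wide."""
    sample = random.Random(seed).sample(corpus, min(sample_size, len(corpus)))
    return [investigate(corpus, h) for h in scout(sample, candidates)]
```

Only aggregate counts and a handful of trace ids leave `investigate`, which mirrors the idea of keeping raw trace content out of the LLM context.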

This addresses a genuine gap: prior work focuses on single-trajectory debugging (AgentDebug, DoVer, AgentRX), fixed taxonomy classification (MAST, AgentFail), or treats traces as secondary to scaffold optimization (Meta-Harness). The IG system is the first to explicitly frame and address iterative, open-ended pattern discovery across trace populations as a first-class problem.

2. Methodological Rigor

The evaluation framework is thoughtfully designed along two axes (who evaluates × what is measured), yielding four complementary settings. This is a genuine methodological contribution.

LLM-as-a-Judge (Section 4.1): The three-layer evaluation (coverage against union-gold clusters, pairwise win rates with position-swap debiasing, per-dimension rubric scoring) is thorough. The union-gold construction — pooling findings across all systems to create a shared reference — is a reasonable approach for evaluating open-ended outputs. IG achieves 77.9% average pairwise win rate versus 62.4% for the next-best system, with quality advantages concentrated in mechanism explanation and specificity.
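As a rough illustration of pairwise win rates with position-swap debiasing, the sketch below judges every pair of reports in both presentation orders. The review does not state the paper's exact rule, so the half-credit treatment of order-dependent verdicts is an assumption, and `judge` is a hypothetical callable standing in for the LLM judge.

```python
from itertools import combinations
from typing import Callable

def debiased_win_rates(reports: dict, judge: Callable[[str, str], str]) -> dict:
    """Average pairwise win rate per system, judging each pair in both orders.

    `judge(first, second)` is a hypothetical callable returning "first" or
    "second". A verdict that flips when the order is swapped is treated as a
    split decision (half a win each); this rule is an assumption.
    """
    if len(reports) < 2:
        raise ValueError("need at least two systems to compare")
    wins = {name: 0.0 for name in reports}
    games = {name: 0 for name in reports}
    for a, b in combinations(reports, 2):
        v1 = judge(reports[a], reports[b])  # a shown first
        v2 = judge(reports[b], reports[a])  # b shown first
        a_score = (v1 == "first") + (v2 == "second")  # 0, 1, or 2
        wins[a] += a_score / 2
        wins[b] += 1 - a_score / 2
        games[a] += 1
        games[b] += 1
    return {name: wins[name] / games[name] for name in reports}
```

A judge that always prefers whichever report it sees first ends up assigning every system a 0.5 win rate under this scheme, which is the point of swapping positions.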

Human Expert Intervention (Section 4.2): This is the most compelling evaluation. The 30.4pp improvement over baseline (vs. 16.2pp for CC Subagents) with p≈0.016 is notable, and the robustness checks (leave-one-out, permutation test, mixed-effects model) are appropriate for the small sample. However, n=6 per arm is genuinely small, and the authors acknowledge this. The high ICC (0.86) indicates participant skill dominates variance, raising questions about generalizability across practitioner populations.
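For readers unfamiliar with permutation tests on small samples, here is a minimal sketch of an exact one-sided test on the difference in arm means. The paper's actual test statistic and sidedness are not given in this review, so both are assumptions, and `exact_permutation_pvalue` is a hypothetical helper name.

```python
from itertools import combinations

def exact_permutation_pvalue(arm_a, arm_b):
    """One-sided exact permutation test for mean(arm_a) > mean(arm_b).

    Enumerates every way to relabel the pooled scores into two arms of the
    original sizes and returns the share of relabelings whose mean difference
    is at least as large as the observed one.
    """
    pooled = list(arm_a) + list(arm_b)
    n_a, n_b = len(arm_a), len(arm_b)
    total = sum(pooled)
    observed = sum(arm_a) / n_a - sum(arm_b) / n_b
    extreme = 0
    count = 0
    for idx in combinations(range(len(pooled)), n_a):
        sum_a = sum(pooled[i] for i in idx)
        diff = sum_a / n_a - (total - sum_a) / n_b
        if diff >= observed - 1e-12:  # small tolerance for float rounding
            extreme += 1
        count += 1
    return extreme / count
```

With six participants per arm the loop enumerates C(12, 6) = 924 relabelings, so the smallest attainable one-sided p-value is 1/924, roughly 0.001.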

Human Expert-as-a-Judge (Section 4.3): This result is less decisive — aggregate ratings are nearly identical (4.25 vs 4.22), with IG leading only on depth (+0.17). The authors' explanation that humans struggle to evaluate claims across large trace sets is plausible but somewhat convenient.

Iterative Patcher Loop (Section 4.4): All report-equipped systems converge to similar terminal performance (0.81-0.84), but the pure-patcher regression to 0.58 demonstrates the value of grounded analysis input. The detailed regression analysis (12/13 regressions traced to a specific prompt edit) is convincing.

Weaknesses in rigor: The comparison systems, while spanning useful architectural variations, all use the same backbone model (Claude Opus 4.6). The study is conducted on only two benchmarks. The LLM-as-a-judge evaluation introduces circularity concerns when using Claude to evaluate reports generated by Claude-based systems. Cost is substantially higher ($76 vs. $23-38 for alternatives), which is acknowledged but not deeply analyzed in terms of cost-effectiveness tradeoffs.
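Taking the cost figures quoted above as 76 for IG and 23 to 38 for the alternatives, in whatever unit the paper reports, the size of the premium can be checked with a line of arithmetic. `premium_range` is a hypothetical helper, not something from the paper.

```python
def premium_range(cost: float, alt_low: float, alt_high: float) -> tuple:
    """Return the (smallest, largest) multiple of `cost` over an alternative
    whose own cost lies between `alt_low` and `alt_high`."""
    return cost / alt_high, cost / alt_low

low, high = premium_range(76, 23, 38)
print(f"{low:.1f}x to {high:.1f}x")  # prints "2.0x to 3.3x"
```

So the premium ranges from about 2x against the most expensive alternative to about 3.3x against the cheapest.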

3. Potential Impact

Practical impact: For organizations running LLM agents at scale, IG addresses a real workflow pain point. The insight that report depth and evidentiary grounding (not just coverage) drive practitioner effectiveness is actionable. The 14.2pp gap between report conditions suggests that investment in diagnostic quality has measurable ROI.

Methodological impact: The formalization of corpus-level trace diagnostics as a problem, and the four-setting evaluation framework, provide useful scaffolding for future work. The union-gold clustering approach for evaluating open-ended analytical outputs could transfer to other domains.

Integration potential: IG is explicitly designed to be complementary to agent optimization frameworks (VeRO, AFlow, ADAS), potentially serving as the diagnostic front-end to automated improvement loops. The patcher loop experiments demonstrate this integration pathway.

4. Timeliness & Relevance

This is highly timely. As LLM agents proliferate in production environments (coding agents, research assistants, tool-using systems), the debugging bottleneck is increasingly acute. The paper correctly identifies that evaluation dashboards showing aggregate metrics miss the cohort-level patterns that drive actionable improvements. The specific example of "silent computation failures" (code runs without errors but implements the wrong mathematical model) illustrates a failure class that is genuinely difficult to surface without corpus-level analysis.

5. Strengths & Limitations

Key Strengths:

  • Clean problem formulation that distinguishes corpus-level diagnostics from trajectory-level debugging
  • The scout-investigator decomposition is well-motivated and ablation-validated (Table 1 ablations show 35.5pp from parallel dispatch + 10.4pp from role specialization)
  • The stateful Python data processing layer is a practical design choice that avoids context window limitations
  • The human intervention study, despite small n, provides the most direct evidence of practical value
  • Thorough appendices with full prompts, protocols, and statistical analyses

Notable Limitations:

  • Small sample sizes in human studies (n=6 per arm) limit generalizability claims
  • Single benchmark (SpreadsheetBench) for human evaluation; spreadsheet manipulation may not represent harder agent domains
  • 3x cost premium over alternatives ($76 vs. $23-38) without clear cost-effectiveness analysis
  • Human expert-as-a-judge results show no significant difference, undermining the quality narrative
  • All systems use Claude Opus 4.6; model-dependent effects are unexamined
  • The patcher loop shows all report-equipped systems converge similarly, suggesting IG's advantage may be primarily in human-consumed settings

Overall Assessment: This is a solid systems paper that formalizes an important problem, proposes a reasonable architecture, and provides multi-faceted evaluation. The human intervention result is the standout finding. The work would benefit from larger-scale validation across diverse benchmarks and practitioner populations, and from deeper analysis of when the additional cost of IG is justified over simpler alternatives.

    Rating: 6.5 / 10
    Significance 7 · Rigor 6.5 · Novelty 6 · Clarity 7.5

    Generated May 21, 2026

    Comparison History (24)

    vs. Epicure: Navigating the Emergent Geometry of Food Ingredient Embeddings
    gemini-3.1 · 5/22/2026

    Paper 1 addresses a critical and highly timely bottleneck in AI development: diagnosing and debugging LLM agents at scale. Its automated, corpus-level approach offers broad applicability and significant real-world utility for AI researchers and practitioners, demonstrated by substantial performance gains. In contrast, Paper 2 focuses on computational gastronomy and ingredient embeddings, which, while methodologically rigorous, has a much narrower domain of application and lower potential for widespread cross-disciplinary impact.

    vs. Spreadsheet-RL: Advancing Large Language Model Agents on Realistic Spreadsheet Tasks via Reinforcement Learning
    claude-opus-4.6 · 5/22/2026

    Paper 2 addresses a more fundamental and broadly applicable problem—systematic diagnostics for LLM agents across any domain—while Paper 1 focuses on a specific application (spreadsheet automation). Paper 2's formalization of corpus-level trace diagnostics introduces a novel framework applicable to all LLM agent development, with strong empirical validation (30.4pp improvements). Its multi-agent diagnostic architecture has broader methodological impact across the rapidly growing LLM agent ecosystem, whereas Paper 1's contributions, while practical, are more incremental and domain-specific.

    vs. Search-E1: Self-Distillation Drives Self-Evolution in Search-Augmented Reasoning
    gemini-3.1 · 5/22/2026

    Paper 1 addresses a critical bottleneck in LLM agent development—diagnosing systematic failures at scale. By formalizing corpus-level trace diagnostics and providing a framework that yields actionable insights, it significantly enhances the interpretability, reliability, and real-world deployability of agentic systems. Paper 2 offers a valuable, simplified training methodology, but Paper 1's focus on diagnostics and evaluation provides broader, field-wide utility for understanding and improving complex LLM interactions.

    vs. KAPPS: A knowledge-based CPPS Architecture for the Circular Factory
    gpt-5.2 · 5/22/2026

    Paper 2 is likely to have higher impact due to timeliness and broad applicability: systematic diagnostics for LLM agents addresses a rapidly expanding, cross-domain need (software engineering, ML ops, HCI, safety). The proposed formalization of corpus-level trace diagnostics and an evaluated multi-agent hypothesis-testing system with quantified downstream gains (e.g., +30.4pp) suggests strong real-world utility and measurable effect. Paper 1 is solid and relevant to circular manufacturing, but its impact is more domain-specific and architecture-centric, with narrower breadth beyond industrial informatics.

    vs. LACO: Adaptive Latent Communication for Collaborative Driving
    claude-opus-4.6 · 5/22/2026

    Paper 1 addresses a fundamental and broadly applicable problem—systematic diagnosis of LLM agent failures at scale—which is highly relevant given the rapid deployment of LLM agents across industries. It formalizes a new problem (corpus-level trace diagnostics), introduces a principled multi-agent architecture, and demonstrates strong empirical results (30.4pp improvement). The breadth of impact is larger since it applies to any LLM agent system, not just autonomous driving. Paper 2, while technically sound, addresses a narrower domain (collaborative driving) with incremental advances over existing communication paradigms.

    vs. A Subjective Logic-based method for runtime confidence updates in safety arguments
    gpt-5.2 · 5/22/2026

    Paper 2 has higher estimated impact due to strong timeliness and broad applicability: corpus-level diagnostics for LLM agents addresses a rapidly growing, cross-domain deployment need. Its proposed formalization plus a scalable multi-agent methodology can influence research in agent evaluation, observability, debugging, and human-AI tooling. The evaluation claims concrete downstream gains (e.g., 30.4pp improvements) and comparative coverage, suggesting practical utility and methodological rigor. Paper 1 is valuable for safety assurance, but its niche focus and less standard updating rule (not Bayesian) likely limits breadth and adoption relative to LLM agent diagnostics.

    vs. What and When to Distill: Selective Hindsight Distillation for Multi-Turn Agents
    gpt-5.2 · 5/21/2026

    Paper 1 likely has higher scientific impact: it introduces a novel, general training framework (SERL) that directly improves long-horizon credit assignment for multi-turn LLM agents by leveraging per-step environment feedback, and demonstrates strong, quantitative gains on widely used benchmarks (ALFWorld, WebShop). This contributes a broadly applicable learning method that can influence RL, agent training, and distillation research. Paper 2 is valuable and timely for tooling/diagnostics and shows practical gains, but its core contribution is more systems-oriented and may be less foundational than a new learning objective/framework.

    vs. Self-supervised Hierarchical Visual Reasoning with World Model
    gemini-3.1 · 5/21/2026

    Paper 2 addresses a critical, timely bottleneck in AI: diagnosing failures in LLM agents at scale. By formalizing corpus-level trace diagnostics and demonstrating substantial downstream performance improvements (30.4pp), it offers immediate, widespread utility for researchers and practitioners deploying agentic systems. While Paper 1 provides a strong architectural contribution to RL world models, Paper 2's potential to standardize and automate LLM agent evaluation promises broader and more immediate real-world impact.

    vs. Mahjax: A GPU-Accelerated Mahjong Simulator for Reinforcement Learning in JAX
    gpt-5.2 · 5/21/2026

    Paper 1 has higher likely impact: it tackles a broadly shared, timely bottleneck—debugging and improving LLM agent behavior at scale—introducing a formalized corpus-level diagnostics problem and an evidence-backed multi-agent methodology with demonstrated downstream gains. Its applications span many domains deploying LLM agents and can influence evaluation, monitoring, and reliability practices. Paper 2 is a strong systems contribution enabling faster RL research in a specific game, but its scope and cross-field applicability are narrower, and impact depends on adoption within a smaller subcommunity.

    vs. AutoRPA: Efficient GUI Automation through LLM-Driven Code Synthesis from Interactions
    claude-opus-4.6 · 5/21/2026

    Paper 1 addresses a fundamental and broadly applicable challenge in LLM agent development—systematic diagnosis of agent failures at scale. Its formalization of corpus-level trace diagnostics creates a new problem framework, and the multi-agent architecture with rigorous evaluation (30.4pp improvement, expert ratings) demonstrates strong methodological rigor. Paper 2, while practically useful in bridging LLM agents and RPA, addresses a narrower optimization problem (reducing token usage for repetitive GUI tasks). Paper 1's insights are more transferable across the entire LLM agent ecosystem, giving it broader potential impact.

    vs. PRISM: A Benchmark for Programmatic Spatial-Temporal Reasoning
    claude-opus-4.6 · 5/21/2026

    Paper 2 addresses a more fundamental and broadly applicable problem—systematic diagnosis of LLM agent failures at scale—which impacts the entire LLM agent ecosystem. It introduces a novel formalization (corpus-level trace diagnostics) and demonstrates concrete downstream improvements (30.4pp gains). Paper 1, while valuable as a benchmark for programmatic video generation, serves a narrower community. Paper 2's methodology (multi-agent diagnostic system) has wider applicability across any LLM agent deployment, making it more likely to influence research and practice broadly.

    vs. High Quality Embeddings for Horn Logic Reasoning
    gpt-5.2 · 5/21/2026

    Paper 2 has higher impact potential: it addresses a timely, widely felt bottleneck (scalable diagnosis of LLM-agent failures) with broad applicability across AI engineering, agent evaluation, safety, and production reliability. The proposed corpus-level trace diagnostics framing and multi-agent hypothesis-testing system is relatively novel and directly actionable, with evidence of substantial downstream gains (e.g., 30.4pp improvements) and comparative evaluation. Paper 1 is valuable but more incremental within a narrower niche (Horn logic embedding training heuristics) and likely affects fewer fields and deployments.

    vs. Evaluating Temporal Semantic Caching and Workflow Optimization in Agentic Plan-Execute Pipelines
    gpt-5.2 · 5/21/2026

    Paper 1 is likely higher impact scientifically due to its broader, more novel contribution: formalizing corpus-level trace diagnostics for LLM agents and introducing a general multi-agent hypothesis-testing system that produces evidence-backed insights. It targets a widely shared bottleneck (debugging/diagnosing agent failures) across many domains and demonstrates downstream performance gains (30.4pp) plus qualitative evaluation. Paper 2 is timely and practically valuable for industrial agent latency, but its contributions are more engineering/benchmark-specific (caching/workflow optimizations) and narrower in cross-field reach.

    vs. Evaluating the Utility of Personal Health Records in Personalized Health AI
    gemini-3.1 · 5/21/2026

    Paper 2 introduces a foundational, domain-agnostic framework for LLM agent diagnostics, solving a critical scalability bottleneck in AI development. Its broad applicability across all fields of AI engineering gives it higher potential impact compared to Paper 1, which, while highly relevant to healthcare, is primarily an empirical evaluation of an existing model in a specific application domain.

    vs. Teaching AI Through Benchmark Construction: QuestBench as a Course-Based Practice for Accountable Knowledge Work
    gpt-5.2 · 5/21/2026

    Paper 1 likely has higher scientific impact due to a more technically novel and broadly applicable approach: formalizing corpus-level trace diagnostics and introducing a hypothesis-driven multi-agent system with evidence-backed reporting. It targets a pressing real-world problem in deploying LLM agents (debugging at scale) and reports measurable downstream performance gains, suggesting methodological rigor and practical utility. Paper 2 is timely and valuable for AI education and accountability, but its impact is more domain-specific (pedagogy/benchmarking) and may diffuse more slowly compared to a tooling contribution that can generalize across agent systems and industries.

    vs. Teaching AI Through Benchmark Construction: QuestBench as a Course-Based Practice for Accountable Knowledge Work
    gpt-5.2 · 5/21/2026

    Paper 2 has higher estimated scientific impact due to a more broadly applicable and timely methodological contribution: a formalized problem setting (corpus-level trace diagnostics) and a systematic, evidence-backed multi-agent approach that demonstrably improves agent performance. Its applications span many LLM-agent deployments (debugging, reliability, production monitoring), enabling impact across NLP, software engineering, and ML ops. Paper 1 is novel and valuable for AI education and accountable knowledge work, but its primary impact is narrower (pedagogy/benchmarking) and less directly generalizable than a scalable diagnostics framework with measurable downstream gains.

    vs. AgentAtlas: Beyond Outcome Leaderboards for LLM Agents
    gpt-5.2 · 5/21/2026

    Paper 2 has higher likely impact because it introduces a scalable, actionable method for corpus-level agent trace diagnostics with demonstrated downstream improvements (e.g., 30.4pp scaffold gains) and an end-to-end system practitioners can apply in production. Its methodology connects diagnostics to measurable performance changes and supports long, real-world traces, increasing applicability and timeliness. Paper 1 offers valuable evaluation taxonomies and an important critique of leaderboards, but it is positioned mainly as a measurement-protocol demonstration on a small fixed model set and may translate less directly into deployed gains.

    vs. AgentAtlas: Beyond Outcome Leaderboards for LLM Agents
    gpt-5.2 · 5/21/2026

    Paper 2 is likely higher impact due to broader, more generalizable contributions: unified taxonomies for agent control decisions and failures, a methodology to disentangle real capability from prompt-provided supervision, and a cross-benchmark coverage audit. These elements can standardize evaluation across many agent domains and influence how the community reports results beyond single-number leaderboards—highly timely for 2024–2026. Paper 1 is valuable and applied, but its impact is more tool-/workflow-specific to trace diagnostics and may be narrower in cross-field standardization and evaluation protocol influence.

    vs. AgentCo-op: Retrieval-Based Synthesis of Interoperable Multi-Agent Workflows
    gemini-3.1 · 5/21/2026

    Paper 1 demonstrates higher potential scientific impact because it directly accelerates open-ended scientific discovery. While Paper 2 offers a valuable debugging tool for LLM engineers, Paper 1 presents a framework to autonomously synthesize interoperable multi-agent workflows for complex domains like genomics. By successfully integrating disparate scientific tools and specialized agents without requiring global redesign, AgentCo-op provides a highly scalable, practical solution for automating scientific research pipelines. Its strong performance across both real-world scientific case studies and general benchmarks indicates a broad, transformative impact across multiple scientific disciplines.

    vs. \ECUAS{n}: A family of metrics for principled evaluation of uncertainty-augmented systems
    gpt-5.2 · 5/21/2026

    Paper 2 has higher estimated impact: it targets a rapidly growing, high-visibility problem (debugging/diagnosing LLM agents) with clear production relevance and broad applicability across domains using agentic systems. Its corpus-level diagnostic framing plus an implemented multi-agent “hypothesis propose/test” workflow is timely and likely to be adopted as tooling, amplified by demonstrated downstream gains (e.g., +30.4pp). Paper 1 is methodologically principled and useful for uncertainty evaluation, but its impact is narrower (evaluation metrics) and less immediately transformative for current large-scale deployments.