Insights Generator: Systematic Corpus-Level Trace Diagnostics for LLM Agents

Akshay Manglik, Apaar Shanker, Kaustubh Deshpande, Jason Qin, Yash Maurya, Veronica Chatrath, Vijay S. Kalmath, Levi Lentz

May 20, 2026

arXiv:2605.21347v2

cs.AI (primary), cs.LG, cs.SE

#729 of 2292 · Artificial Intelligence

Tournament Score: 1450±47 (scale 1050 to 1800)

Win Rate: 67%

Rating: 6.5 / 10

Significance 7 · Rigor 6.5 · Novelty 5.5 · Clarity 7.5

Abstract

Diagnosing failures in LLM agents remains largely manual. Practitioners inspect a small subset of execution traces, form ad-hoc hypotheses, and iterate. This process misses patterns that only emerge across trace populations and does not scale to production corpora where individual traces span tens of thousands of tokens. We formalize the problem of corpus-level trace diagnostics. Given a corpus of execution traces, the goal is to produce grounded natural-language insights that characterize systematic behavioral patterns across trace groups, each linked to supporting evidence. We present the Insights Generator (IG), a multi-agent system that answers diagnostic questions by proposing and testing hypotheses across the trace corpus to produce an evidence-backed insights report. We evaluate IG across qualitative and objective dimensions, spanning rubric-based report assessment and downstream performance improvements achieved by implementing IG insights. Human experts using IG reports improve scaffold performance by 30.4pp over the unmodified baseline scaffold, and coding agents leveraging IG-derived insights show consistent and stable gains. Across benchmarks, IG's scout-investigator architecture produces findings comparable in detection coverage to competing approaches, while domain experts rated IG reports as leading in depth and evidence quality.

AI Impact Assessments

(1 model)

Scientific Impact Assessment: Insights Generator

1. Core Contribution

The paper formalizes a genuinely underexplored problem: corpus-level trace diagnostics for LLM agents. Rather than debugging individual trajectories (the dominant paradigm), IG seeks to discover population-level behavioral patterns across entire trace corpora. The key architectural innovation is a scout-investigator decomposition where scout agents propose hypotheses from sampled traces (breadth) and investigator agents validate them at corpus scale with statistical evidence (depth), coordinated by an orchestrator in an iterative loop. All trace access is mediated through a structured Python data processing layer, preventing raw trace content from flooding model context windows.
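The scout-investigator loop described above can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the trace fields, the `scout`, `investigate`, and `orchestrate` names, and the acceptance threshold are hypothetical, and the scout is a hard-coded stand-in for what would be an LLM call in the actual system. It is not the paper's implementation; it only shows the division of labor, where hypotheses come from a small sample and are then checked against the whole corpus.

```python
import random
from dataclasses import dataclass
from typing import Callable

Trace = dict  # hypothetical record: id, used_python, correct, silent_failure


@dataclass
class Hypothesis:
    name: str
    cohort: Callable[[Trace], bool]     # which traces the claim is about
    predicate: Callable[[Trace], bool]  # behavior claimed to be common there


@dataclass
class Finding:
    name: str
    cohort_size: int
    matches: int
    evidence_ids: list  # a few supporting trace ids

    @property
    def rate(self) -> float:
        return self.matches / self.cohort_size if self.cohort_size else 0.0


def scout(sample):
    """Stand-in for an LLM scout: propose hypotheses from a small sample."""
    proposals = []
    if any(t["silent_failure"] for t in sample):
        proposals.append(Hypothesis(
            name="silent computation failure in incorrect Python-using traces",
            cohort=lambda t: t["used_python"] and not t["correct"],
            predicate=lambda t: t["silent_failure"],
        ))
    return proposals


def investigate(hypothesis, corpus):
    """Validate one hypothesis against the full corpus, not just the sample."""
    cohort = [t for t in corpus if hypothesis.cohort(t)]
    hits = [t for t in cohort if hypothesis.predicate(t)]
    return Finding(hypothesis.name, len(cohort), len(hits),
                   [t["id"] for t in hits[:3]])


def orchestrate(corpus, rounds=3, sample_size=5, min_rate=0.5, seed=0):
    """Alternate scouting (breadth) and investigation (depth)."""
    rng = random.Random(seed)
    findings = {}
    for _ in range(rounds):
        sample = rng.sample(corpus, min(sample_size, len(corpus)))
        for hypothesis in scout(sample):
            if hypothesis.name in findings:
                continue  # already validated in an earlier round
            finding = investigate(hypothesis, corpus)
            if finding.cohort_size and finding.rate >= min_rate:
                findings[hypothesis.name] = finding
    return list(findings.values())


def make_toy_corpus():
    """Synthetic corpus, invented for this sketch only."""
    return [{"id": i,
             "used_python": i % 2 == 0,
             "correct": i % 4 == 0,
             "silent_failure": i % 4 == 2 and i % 20 != 2}
            for i in range(100)]
```

On the synthetic corpus, calling `orchestrate(make_toy_corpus(), sample_size=100)` yields one finding whose `rate` is a cohort-level statistic with linked trace ids, which is the kind of output a per-trace debugger would not produce.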

This is a meaningful reframing. The distinction between hypothesis generation and corpus-scale validation addresses a real bottleneck: manual trace inspection doesn't scale, and single-trajectory debugging tools miss cohort-level patterns. The motivating example — "84% of incorrect Python-using traces exhibit silent computation failures" — illustrates the kind of insight that per-trace methods fundamentally cannot surface.

2. Methodological Rigor

The paper's four-setting evaluation framework (automated vs. human evaluator × report quality vs. downstream impact) is well-designed and addresses the inherent difficulty of evaluating free-form diagnostic outputs. The evaluation includes:

LLM-as-a-Judge: Three-layer assessment (coverage, pairwise win rate, per-dimension rubric) against a union-gold reference set. The position-swapped pairwise protocol is sound.

Human Expert Intervention: The primary downstream evaluation showing 30.4pp improvement over baseline, with p≈0.016 (Welch's t-test), confirmed by exact permutation test (p=0.013) and leave-one-out sensitivity analysis.

Human Expert-as-a-Judge: Rubric-based insight quality assessment.

Iterative Patcher Loop: Tests sustained improvement across multiple rounds.
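The position-swapped pairwise protocol mentioned in the LLM-as-a-Judge item can be sketched as follows. This is a generic illustration of the technique, not the paper's judging code: the `judge` callable, the `"first"`/`"second"` return convention, and the rule that ties count as half a win are all assumptions. The idea is that each pair is judged twice with the presentation order reversed, and a verdict that flips with the order is treated as position bias rather than as a win.

```python
from typing import Callable, Sequence

# Hypothetical judge interface: receives the two reports in presentation
# order and returns "first" or "second" for the one it prefers.
Judge = Callable[[str, str], str]


def position_swapped_verdict(judge: Judge, report_a: str, report_b: str) -> str:
    """Return 'A', 'B', or 'tie' after judging both presentation orders."""
    a_wins_when_first = judge(report_a, report_b) == "first"
    a_wins_when_second = judge(report_b, report_a) == "second"
    if a_wins_when_first and a_wins_when_second:
        return "A"
    if not a_wins_when_first and not a_wins_when_second:
        return "B"
    return "tie"  # the verdict depended on position, so no winner is declared


def win_rate(verdicts: Sequence[str], system: str = "A") -> float:
    """Share of wins for `system`; ties count as half a win (an assumption)."""
    if not verdicts:
        return 0.0
    score = sum(1.0 if v == system else 0.5 if v == "tie" else 0.0
                for v in verdicts)
    return score / len(verdicts)
```

A judge that always prefers whichever report it sees first produces a tie under this scheme, so pure position bias cannot inflate either system's win rate.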

Strengths in rigor: The statistical analysis (Appendix A.9) is thorough — mixed-effects models, permutation tests, and leave-one-out sensitivity analyses all support the central claim. The ablation study isolating typed role decomposition from generic subagent dispatch is informative.

Weaknesses in rigor: The human intervention study has n=6 per arm — adequate for detecting the large observed effect (Cohen's d=1.89) but insufficient for generalizable conclusions. The study is conducted on a single benchmark (SpreadsheetBench). The coverage metric shows all multi-agent systems cluster between 87-98%, suggesting it's not very discriminating. The LLM-as-a-judge results and human expert ratings diverge (IG and CC reports rated similarly by humans despite clear LLM-judge separation), which the authors attribute to human cognitive limitations rather than examining whether the LLM judge might be miscalibrated — a somewhat one-sided interpretation.
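To make the small-sample concern concrete, the sketch below implements a two-sided exact permutation test on the difference in means, plus Cohen's d with a pooled standard deviation. It is a generic textbook version of both statistics, not the paper's analysis code, and any scores passed to it here are hypothetical. With 6 observations per arm there are only C(12, 6) = 924 possible relabelings, so the smallest two-sided p-value such a test can produce is 2/924 ≈ 0.0022, which bounds how much evidence a study of this size can provide.

```python
from itertools import combinations
from math import comb, sqrt
from statistics import mean, variance


def cohens_d(x, y):
    """Standardized mean difference using the pooled sample variance."""
    nx, ny = len(x), len(y)
    pooled_var = ((nx - 1) * variance(x) + (ny - 1) * variance(y)) / (nx + ny - 2)
    return (mean(x) - mean(y)) / sqrt(pooled_var)


def exact_permutation_p(x, y):
    """Two-sided exact p-value for the difference in group means.

    Enumerates every way to assign len(x) of the pooled values to the first
    group and counts assignments at least as extreme as the observed one.
    """
    pooled = list(x) + list(y)
    n, k = len(pooled), len(x)
    total = sum(pooled)
    observed = abs(mean(x) - mean(y))
    extreme = 0
    for idx in combinations(range(n), k):
        first_sum = sum(pooled[i] for i in idx)
        diff = abs(first_sum / k - (total - first_sum) / (n - k))
        if diff >= observed - 1e-12:  # tolerance for floating-point noise
            extreme += 1
    return extreme / comb(n, k)
```

When the two arms are completely separated, only the observed labeling and its mirror image are as extreme as the data, so the function returns exactly 2 / C(n, k).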

3. Potential Impact

Practical impact is potentially high. Agent debugging is genuinely a pain point in production LLM systems. A system that can surface "silent computation failures" or "cross-checking that verifies internal consistency within the wrong framework" from hundreds of traces addresses a real need. The 30.4pp improvement demonstrates concrete practitioner value.

Research impact: The formalization of corpus-level trace diagnostics as a distinct problem (separate from single-trajectory debugging and taxonomy classification) could spawn a subfield. The four-setting evaluation framework provides a template for future work. The union-gold clustering approach for evaluating open-ended diagnostic outputs is a useful methodological contribution.

Limitations to impact: The system costs ~$76 per analysis run (vs. $23-38 for alternatives) and takes ~48 minutes. This is manageable for production debugging but limits iteration speed. The approach is fundamentally tied to the quality of the underlying LLM (Claude Opus 4.6 throughout), making it hard to separate system design contributions from model capability.

4. Timeliness & Relevance

This paper arrives at a critical moment. LLM agents are rapidly moving into production (coding assistants, research agents, workflow automation), and the gap between building agents and debugging them at scale is widening. The paper directly addresses this bottleneck. The comparison against recent systems (Trace2Skill, HALO/RLM, AgentRx) demonstrates awareness of the fast-moving landscape. The connection to agent optimization frameworks (VeRO, AFlow, Meta-Harness) positions IG as a complementary component in the emerging agent development stack.

5. Strengths & Limitations

Key strengths:

Clear problem formalization with a compelling motivating example

Principled architectural decomposition (scout for breadth, investigator for depth) with ablations demonstrating each component's contribution

Comprehensive evaluation across four complementary settings

Strong downstream utility signal (30.4pp improvement)

Thorough statistical analysis with multiple robustness checks

The iterative patcher loop demonstrates that analysis-grounded patching avoids the regressions seen in analysis-free patching (the pure-patcher reversal from 0.80 to 0.58 is a striking demonstration)

Notable weaknesses:

Single-benchmark human study: The headline 30.4pp result is from SpreadsheetBench only; generalization is acknowledged but untested

Small sample sizes: n=6 per arm limits confidence in effect size estimates

Cost premium: ~2-3x more expensive than alternatives for comparable coverage

Confounded comparison: All systems use Claude Opus 4.6, but IG's specialized tools give it structural advantages beyond the architectural contribution

Evaluation circularity concern: The LLM judge (Claude Opus 4.6) is the same model family used by all systems, raising questions about systematic biases in quality assessment

Limited benchmark diversity: SpreadsheetBench and HLE are quite different tasks but only two benchmarks total

The human-expert-as-judge results (4.25 vs 4.22) show no meaningful difference, undermining the strong claims from the LLM judge evaluation

Additional Observations

The paper is well-written with extensive appendices providing reproducibility-critical details (prompts, protocols, statistical analyses). The honest reporting of the human-expert-as-judge near-parity and the edit-size non-significance is commendable. However, the framing sometimes oversells: "nearly doubling the 16.2pp gain" is technically correct but the CC Subagents baseline is itself strong, and the human judges couldn't distinguish the reports.

The contribution sits at the intersection of systems engineering and evaluation methodology rather than fundamental algorithmic innovation. The scout-investigator pattern is sensible but not deeply novel — it mirrors standard exploratory data analysis workflows operationalized through multi-agent dispatch.

Rating: 6.5 / 10

Significance 7 · Rigor 6.5 · Novelty 5.5 · Clarity 7.5

Generated May 22, 2026

Comparison History (21)

vs. MOSS: Self-Evolution through Source-Level Rewriting in Autonomous Agent Systems

gpt-5.2 · 5/22/2026

Paper 1 is more novel and potentially higher-impact: it moves agent adaptation from prompt/text artifacts to source-level self-rewriting with verification, promotion, and rollback—addressing structural failures unreachable by existing self-improvement methods. This could materially change how production agents are maintained, enabling continuous, deterministic evolution with safety gates. While Paper 2 is timely and useful for scalable diagnostics, corpus-level trace analysis is a more incremental extension of existing observability/analysis paradigms and depends on humans/agents to apply insights, whereas Paper 1 directly closes the loop to autonomous, deployable fixes.

vs. Probing Embodied LLMs: When Higher Observation Fidelity Hurts Problem Solving

gemini-3.1 · 5/22/2026

While Paper 1 offers a valuable practical tool for LLM debugging, Paper 2 presents a highly counterintuitive and paradigm-challenging scientific finding: that higher observation fidelity hurts embodied LLM problem-solving. By demonstrating that perceptual noise disrupts LLM reasoning failures like repetitive loops, Paper 2 fundamentally challenges current evaluation assumptions in embodied AI. This conceptual disruption is likely to spark significant theoretical debate, broader follow-up research across robotics and LLM reasoning, and a reevaluation of how cognitive architectures are tested, giving it a higher potential scientific impact.

vs. AtelierEval: Agentic Evaluation of Humans & LLMs as Text-to-Image Prompters

claude-opus-4.6 · 5/22/2026

Paper 1 addresses a fundamental and broadly applicable problem—systematic diagnosis of LLM agent failures at scale—which is relevant across virtually all LLM agent deployments. It formalizes a new problem (corpus-level trace diagnostics), proposes a multi-agent architecture, and demonstrates concrete downstream improvements (30.4pp). This has high practical impact for the rapidly growing LLM agent ecosystem. Paper 2, while novel in benchmarking prompter proficiency for T2I systems, addresses a narrower problem space with more limited cross-field applicability. Paper 1's infrastructure-level contribution has broader and more lasting impact potential.

vs. TerminalWorld: Benchmarking Agents on Real-World Terminal Tasks

gemini-3.1 · 5/22/2026

Paper 2 addresses a universal and critical bottleneck in LLM agent development: corpus-level trace diagnostics and debugging. While Paper 1 provides a valuable domain-specific benchmark for terminal tasks, Paper 2 introduces a novel methodology applicable across any LLM agent domain, offering scalable, evidence-backed diagnostics that lead to significant downstream performance improvements. This broader applicability and methodological innovation give Paper 2 a higher potential scientific impact.

vs. Skill Weaving: Efficient LLM Improvement via Modular Skillpacks

gpt-5.2 · 5/22/2026

Paper 2 has higher estimated impact: it addresses a timely, widely felt bottleneck (scalable diagnosis of LLM agent failures) with a more general methodology applicable across domains and agent frameworks. Its corpus-level formalization plus evidence-grounded insight generation can influence evaluation, debugging, and MLOps practices broadly, and the reported downstream gains (e.g., +30.4pp scaffold improvement) suggest strong real-world value. Paper 1 is innovative for modular specialization/efficiency, but its impact is more concentrated in deployment/parameter-efficient adaptation, with somewhat narrower cross-field methodological implications.

vs. Beyond the Org Chart: AI and the Transformation of Invisible Work

gemini-3.1 · 5/22/2026

Paper 2 addresses a critical bottleneck in the rapidly growing field of LLM agent development by introducing a scalable, automated diagnostic system. Its methodological rigor, demonstrated by significant objective performance improvements (30.4pp gain), gives it a stronger potential for broad and immediate technical impact compared to Paper 1's qualitative, interview-based exploration of organizational culture, which, while valuable, has a narrower methodological scope and less quantifiable impact.

vs. ExComm: Exploration-Stage Communication for Error-Resilient Agentic Test-Time Scaling

claude-opus-4.6 · 5/22/2026

ExComm addresses a fundamental problem in test-time scaling—error propagation in multi-agent reasoning—with a principled, well-evaluated solution. Its communication protocol for detecting cross-agent conflicts, soft belief updates, and trajectory diversification are novel contributions with broad applicability across agentic AI systems. The method shows consistent gains across multiple benchmarks and models, demonstrating generalizability. While Paper 1 (Insights Generator) tackles an important practical problem of diagnosing LLM agent failures, it is more of a tooling/workflow contribution. Paper 2's methodological innovations in multi-agent coordination and error correction have broader theoretical and practical impact potential.

vs. Towards a compositional semantics for quantitative confidence assessment in assurance arguments

gemini-3.1 · 5/22/2026

Paper 1 addresses a highly timely and critical bottleneck in a booming field (LLM agent debugging). It offers a scalable, automated solution with strong empirical validation, demonstrating significant downstream performance gains (30.4pp). Paper 2, while methodologically sound, focuses on a niche theoretical framework for assurance arguments, which has a narrower scope and less potential for immediate, widespread impact across multiple disciplines.

vs. SMDD-Bench: Can LLMs Solve Real-World Small Molecule Drug Design Tasks?

gemini-3.1 · 5/22/2026

Paper 1 bridges AI and computational biology by introducing a comprehensive benchmark for a high-stakes, real-world scientific problem (drug design). Its interdisciplinary nature, rigorous standardization of complex tasks, and potential to catalyze breakthroughs in life sciences give it a higher potential for broad and profound scientific impact compared to the methodological software-engineering focus of Paper 2.

vs. A Subjective Logic-based method for runtime confidence updates in safety arguments

claude-opus-4.6 · 5/22/2026

Paper 1 addresses a timely and broadly impactful problem—systematic diagnostics for LLM agents at scale—which is relevant to the rapidly growing field of LLM-based systems. It formalizes a new problem (corpus-level trace diagnostics), presents a novel multi-agent architecture, and demonstrates strong empirical results (30.4pp improvement). The breadth of applicability across LLM agent development and production monitoring gives it wide impact. Paper 2 contributes a useful method for runtime safety assurance using Subjective Logic, but addresses a narrower domain (safety cases for ML components) with a demonstration limited to a single simulation scenario.

vs. Von Neumann Networks

gemini-3.1 · 5/22/2026

Paper 1 proposes a fundamentally novel neural network architecture backed by theoretical proofs of computational universality and connections to fundamental physics/math (Green's functions, diffusion). Its potential to shift deep learning paradigms and influence hardware design gives it higher long-term scientific impact than Paper 2, which, while highly practical and timely, focuses primarily on diagnostic tooling and debugging for existing LLM agent systems.

vs. HarnessAPI: A Skill-First Framework for Unified Streaming APIs and MCP Tools

gemini-3.1 · 5/22/2026

Paper 2 addresses a fundamental research problem in AI (systematic evaluation and diagnostics of LLM agents at scale), formalizes a novel methodology, and demonstrates significant performance improvements. Paper 1, while highly useful, is primarily a software engineering framework that reduces boilerplate code for API deployment, offering more practical utility than foundational scientific innovation.

vs. Spreadsheet-RL: Advancing Large Language Model Agents on Realistic Spreadsheet Tasks via Reinforcement Learning

gpt-5.2 · 5/22/2026

Paper 2 has broader and more durable impact: corpus-level trace diagnostics is a cross-domain problem affecting most LLM agent deployments, and a systematic, evidence-backed diagnostic framework can improve reliability across many tasks, models, and toolchains. Its methodology (hypothesis propose/test over large trace corpora) targets a key bottleneck—debugging and iteration—and shows measurable downstream gains, suggesting strong real-world applicability. Paper 1 is timely and useful but more domain-specific (Excel/spreadsheets) and its RL+benchmark contribution, while solid, is narrower in breadth.

vs. HarnessAPI: A Skill-First Framework for Unified Streaming APIs and MCP Tools

claude-opus-4.6 · 5/22/2026

Paper 1 addresses a fundamental research problem—corpus-level diagnostics for LLM agents—with a novel multi-agent architecture, rigorous evaluation (human experts, benchmarks, rubric-based assessment), and demonstrated 30.4pp performance improvements. It formalizes a new problem space with broad applicability across LLM agent development. Paper 2 is a useful engineering contribution (a Python framework reducing boilerplate) but is incremental, narrowly scoped to Python tooling infrastructure, and lacks scientific novelty or evaluation beyond lines-of-code metrics. Paper 1 has far greater potential to influence research methodology and practice in the rapidly growing LLM agents field.

vs. Spreadsheet-RL: Advancing Large Language Model Agents on Realistic Spreadsheet Tasks via Reinforcement Learning

gemini-3.1 · 5/22/2026

While Paper 1 presents a highly practical application for spreadsheet automation, Paper 2 tackles a fundamental bottleneck in agentic AI: systematic debugging and diagnostics at scale. By formalizing corpus-level trace diagnostics, Paper 2 provides tooling that can accelerate research and development across all LLM agent domains, giving it a broader and more foundational scientific impact.

vs. Imperfect World Models are Exploitable

claude-opus-4.6 · 5/22/2026

Paper 1 establishes fundamental theoretical results connecting reward hacking and model exploitation in reinforcement learning, proving near-inevitability of exploitation on large policy sets and deriving safe planning horizons. This addresses a core challenge in AI safety with broad implications for any system using learned world models. Its formal framework bridges two important concepts and provides foundational results that will influence future theoretical and practical work in RL safety. Paper 2, while practically useful, addresses a more narrowly scoped engineering problem (LLM agent debugging) with less fundamental theoretical contribution.

vs. Imperfect World Models are Exploitable

gemini-3.1 · 5/22/2026

Paper 1 addresses foundational theoretical issues in AI alignment and reinforcement learning by formally characterizing model exploitation and reward hacking. Its rigorous proofs establishing the limits of safe planning in world models offer profound, long-term implications for AI safety. While Paper 2 presents a highly useful and timely practical tool for debugging LLM agents, Paper 1's fundamental theoretical contributions to understanding agent behavior and vulnerabilities represent a broader and deeper scientific impact.

vs. Trace2Skill: Verifier-Guided Skill Evolution for Long-Context EDA Agents

gemini-3.1 · 5/22/2026

Paper 1 addresses a universal bottleneck in LLM agent development—diagnosing failures at scale across large execution traces. Its corpus-level diagnostic approach is highly generalizable and has broad impact across any field utilizing LLM agents. While Paper 2 presents an innovative test-time scaling method, its primary focus on Electronic Design Automation (EDA) and Verilog makes its immediate impact more niche compared to the foundational debugging framework proposed in Paper 1.

vs. GeoX: Mastering Geospatial Reasoning Through Self-Play and Verifiable Rewards

gemini-3.1 · 5/22/2026

Paper 1 addresses a highly critical and universal bottleneck in the rapidly growing field of LLM agents: systematic debugging and diagnostics at scale. Its framework for corpus-level trace diagnostics offers broad, cross-domain utility for developers and researchers, leading to substantial performance improvements. While Paper 2 presents an innovative self-play approach for geospatial reasoning, its direct impact is largely confined to a specific subfield, whereas Paper 1's methodology will impact the foundational development and deployment of LLM agents across all domains.

vs. Progressive Autonomy as Preference Learning: A Formalization of Trust Calibration for Agentic Tool Use

claude-opus-4.6 · 5/22/2026

Paper 2 addresses a pressing practical problem—diagnosing LLM agent failures at scale—with a concrete system (Insights Generator) that demonstrates measurable downstream improvements (30.4pp gains). It has broader immediate applicability across the rapidly growing LLM agent ecosystem, strong empirical validation, and addresses a bottleneck (manual trace inspection) that affects many practitioners. Paper 1, while theoretically elegant in formalizing trust calibration as preferential Bayesian optimization, is more incremental—reframing an existing framework for a specific use case—and lacks empirical validation beyond the formalization itself.