SWE-chat: Coding Agent Interactions From Real Users in the Wild

Joachim Baumann, Vishakh Padmakumar, Xiang Li, John Yang, Diyi Yang, Sanmi Koyejo

Apr 22, 2026

arXiv:2604.20779v1

cs.AI (primary), cs.CY, cs.SE

#75 of 2292 · Artificial Intelligence

Tournament Score: 1551 ± 35 (scale 1050 to 1800)

Win Rate: 71%

Rating: 7.4 / 10

Significance 8 · Rigor 6.5 · Novelty 7.5 · Clarity 8

Abstract

AI coding agents are being adopted at scale, yet we lack empirical evidence on how people actually use them and how much of their output is useful in practice. We present SWE-chat, the first large-scale dataset of real coding agent sessions collected from open-source developers in the wild. The dataset currently contains 6,000 sessions, comprising more than 63,000 user prompts and 355,000 agent tool calls. SWE-chat is a living dataset; our collection pipeline automatically and continually discovers and processes sessions from public repositories. Leveraging SWE-chat, we provide an initial empirical characterization of real-world coding agent usage and failure modes. We find that coding patterns are bimodal: in 41% of sessions, agents author virtually all committed code ("vibe coding"), while in 23%, humans write all code themselves. Despite rapidly improving capabilities, coding agents remain inefficient in natural settings. Just 44% of all agent-produced code survives into user commits, and agent-written code introduces more security vulnerabilities than code authored by humans. Furthermore, users push back against agent outputs -- through corrections, failure reports, and interruptions -- in 44% of all turns. By capturing complete interaction traces with human vs. agent code authorship attribution, SWE-chat provides an empirical foundation for moving beyond curated benchmarks towards an evidence-based understanding of how AI agents perform in real developer workflows.

AI Impact Assessments

(3 models)

Scientific Impact Assessment: SWE-chat

1. Core Contribution

SWE-chat introduces the first large-scale dataset of real-world coding agent sessions collected from open-source developers, containing ~6,000 sessions with 63K+ user prompts and 355K+ tool calls across 200+ repositories. The key differentiator from prior datasets (Table 1) is the combination of four elements: human prompts, agent tool-use trajectories, code diffs, and line-level human vs. agent code attribution. This enables studying the full lifecycle of human-agent coding collaboration — not just what agents produce, but how users prompt, steer, override, and ultimately commit or discard agent output.

The paper also provides an empirical characterization of real-world usage, revealing several notable findings: bimodal coding patterns (41% "vibe coding" vs. 23% human-only), low code survival rates (44% of agent code survives into commits), higher security vulnerability rates for vibe-coded commits (9× human-only), and frequent user pushback (44% of turns).
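To make the code survival figure concrete, here is a minimal sketch of how such a rate could be computed. It assumes survival means "an agent-written line appears verbatim in the user's commit"; the paper's own pipeline relies on line-level attribution from the collection tool, so the function name `survival_rate` and the exact-match rule are illustrative assumptions, not the authors' method.

```python
from collections import Counter


def survival_rate(agent_lines, committed_lines):
    """Share of agent-written lines that also appear in the commit.

    Assumption: a line "survives" if an identical line is committed.
    Each committed line can account for at most one agent line.
    """
    if not agent_lines:
        return 0.0
    remaining = Counter(committed_lines)
    survived = 0
    for line in agent_lines:
        if remaining[line] > 0:
            remaining[line] -= 1
            survived += 1
    return survived / len(agent_lines)


# Hypothetical session: the agent wrote five lines, the user kept two.
agent = ["import os", "x = 1", "y = 2", "print(x)", "print(y)"]
commit = ["import os", "x = 1", "z = 3"]
rate = survival_rate(agent, commit)
```

Under this simplified definition the hypothetical session has a survival rate of 0.4. Exact matching undercounts lines the user lightly edited, which is one reason a real attribution pipeline would need something more careful.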

2. Methodological Rigor

Strengths: The data collection pipeline is well-designed, leveraging Entire.io's opt-in CLI tool that hooks into git commits with line-level attribution. The analysis methodology is multi-faceted, combining raw session log metrics with LLM-annotated labels. The authors demonstrate commendable transparency in their annotation pipeline: they develop codebooks with inter-annotator agreement evaluation (moderate to high Cohen's κ across tasks), test 9-11 LLMs with multiple prompt paraphrases against 100 human gold labels per task, and select the best-performing model-prompt combination.
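For readers unfamiliar with the agreement statistic mentioned above, the following is a minimal sketch of Cohen's κ for two annotators. The labels are hypothetical and not taken from the paper; the function only illustrates the standard formula κ = (p_o − p_e) / (1 − p_e).

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Observed agreement: share of items given the same label by both.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement from each annotator's marginal label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(freq_a[k] * freq_b[k] for k in freq_a) / (n * n)
    if p_e == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (p_o - p_e) / (1 - p_e)


# Hypothetical labels for six turns (illustrative only).
human = ["pushback", "ok", "ok", "pushback", "ok", "ok"]
model = ["pushback", "ok", "pushback", "pushback", "ok", "ok"]
kappa = cohens_kappa(human, model)
```

In this toy case the annotators agree on five of six items, chance agreement is 0.5, and κ comes out to about 0.67.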

Weaknesses: The LLM annotation approach, while scalable, introduces meaningful noise. The authors acknowledge this but still draw empirical conclusions from these labels. The ICC(2,1) for session success ratings is only 0.50 between humans and 0.60 for the best LLM, suggesting limited reliability for this key metric. The code survival rate metric conflates intentional exploration (where discarded code was useful for understanding) with genuine waste. The security vulnerability analysis relies solely on Semgrep static analysis with default rules, which captures only a subset of true vulnerabilities and may produce false positives. The causal interpretation of vibe coding inefficiency is unclear — are vibe-coded projects inherently harder, or is the mode itself less efficient?

The selection bias is significant: the dataset captures only developers who (a) use specific supported agents (85% Claude Code), (b) opt into Entire.io, (c) work on public repositories, and (d) actually commit session logs. Failed sessions where users abandon entirely are not captured, likely inflating success metrics substantially.

3. Potential Impact

Dataset contribution: The living dataset aspect is valuable — the pipeline continuously discovers and processes new sessions, enabling longitudinal analysis as coding agents evolve. This could become a foundational resource for the HCI-meets-AI-agents research community.

Benchmark design: The finding that "understanding existing code" is the most common user intent (19%) directly challenges the patch-generation focus of existing benchmarks like SWE-bench. This could redirect benchmark construction efforts.

Agent design: The asymmetry between agent autonomy and user oversight (agents ask clarifying questions in <2% of turns while users push back in 44%) provides concrete evidence for designing more interactive agents that know when to pause.

Security implications: The vibe coding vulnerability findings (9× human-only rate) are timely and could influence policy discussions about AI-generated code in production systems.

Training data: The dataset could serve as training data for user simulators, reward models, and more realistic agent evaluation frameworks.

4. Timeliness & Relevance

This paper addresses a critical gap at precisely the right moment. Coding agents (Claude Code, Cursor, Gemini CLI) are seeing explosive adoption, yet empirical understanding of real-world usage has been almost entirely absent. The disconnect between benchmark performance and real-world utility is widely recognized, and this dataset provides the first systematic evidence base. The rapid growth of "vibe coding" (doubling from 20% to 40% over three months) makes this especially urgent for understanding safety and efficiency implications.

5. Strengths & Limitations

Key Strengths:

First-of-its-kind dataset combining all four critical dimensions (prompts, trajectories, diffs, attribution)

Living dataset design ensures continued relevance

Rich empirical findings that challenge assumptions embedded in current benchmarks

Transparent methodology with extensive appendices on annotation validation

Practical implications clearly articulated for benchmark design, agent interaction, and user simulation

Notable Limitations:

Heavy selection bias toward early adopters using Claude Code on public repos

~85% Claude Code concentration limits generalizability across agents

Survivorship bias (abandoned sessions not captured) likely inflates success metrics

LLM-based annotations are noisy; some tasks show low agreement (session success ICC ~0.6)

Security analysis is limited to static analysis; no dynamic testing or manual verification

Causal claims about vibe coding efficiency are confounded by project and task difficulty

The "lines of code" metric is a crude proxy for productivity and code quality

Entire.io's own repository contributing ~20% of data raises concerns about representativeness

Missing elements: No analysis of how agent choice affects outcomes (limited by Claude Code dominance), no comparison of experienced vs. novice developers, and no analysis of code quality beyond security vulnerabilities (e.g., maintainability, correctness).

Overall Assessment

SWE-chat makes a timely and important contribution by providing the first empirical window into real-world coding agent usage. The dataset itself is the primary contribution and has significant potential as a community resource. The empirical findings, while preliminary and subject to selection bias, surface important patterns that challenge prevailing assumptions about coding agent capabilities. The paper is well-written and transparent about limitations. Its impact will ultimately depend on community adoption and whether the dataset's biases are addressed through broader adoption of the collection pipeline.

Rating: 7.4 / 10

Significance 8 · Rigor 6.5 · Novelty 7.5 · Clarity 8

Generated Apr 23, 2026

Comparison History (48)

vs. Fusion-fission forecasts when AI will shift to undesirable behavior

gemini-3.1 · 5/16/2026

Paper 2 addresses a critical, universal challenge in AI safety—predicting unpredictable shifts to undesirable behavior—with a highly novel, mathematically grounded framework. Its architecture-agnostic nature and strong empirical validation across frontier models give it profound implications for AI deployment in high-stakes environments, offering broader and deeper scientific impact than the strictly empirical dataset provided in Paper 1.

vs. CoSearch: Joint Training of Reasoning and Document Ranking via Reinforcement Learning for Agentic Search

claude-opus-4.6 · 5/5/2026

SWE-chat provides the first large-scale empirical dataset of real-world coding agent usage, filling a critical gap in understanding how AI coding tools perform in practice. Its key findings (44% of agent code discarded, bimodal usage patterns, increased security vulnerabilities) have broad implications for AI tool design, software engineering practices, and policy. As a living dataset capturing real user interactions, it enables ongoing research beyond curated benchmarks. CoSearch, while technically solid, represents an incremental improvement in retrieval-augmented reasoning with narrower scope. SWE-chat's timeliness amid rapid AI coding adoption amplifies its impact.

vs. CoSearch: Joint Training of Reasoning and Document Ranking via Reinforcement Learning for Agentic Search

claude-opus-4.6 · 5/5/2026

SWE-chat introduces the first large-scale empirical dataset of real-world AI coding agent interactions, filling a critical gap in understanding how AI coding tools are actually used. Its findings (e.g., only 44% of agent code survives, bimodal usage patterns, security vulnerabilities) have broad implications for AI tool design, software engineering practice, and policy. As a living dataset, it enables ongoing research. CoSearch, while technically sound, represents an incremental advance in agentic RAG by jointly training reasoning and retrieval components—a narrower contribution within an already crowded space.

vs. Grounding vs. Compositionality: On the Non-Complementarity of Reasoning in Neuro-Symbolic Systems

gemini-3 · 4/30/2026

Paper 1 offers high impact due to its extreme timeliness and relevance to the rapid adoption of AI coding agents. By providing the first large-scale, empirical dataset of real-world human-agent interactions, it establishes a crucial foundation for HCI and LLM evaluation. While Paper 2 presents valuable theoretical insights for neuro-symbolic AI, Paper 1 has broader, more immediate implications for software engineering and AI safety, addressing a critical gap between curated benchmarks and practical utility.

vs. Evaluating Risks in Weak-to-Strong Alignment: A Bias-Variance Perspective

gemini-3 · 4/29/2026

Paper 2 introduces a large-scale, living dataset of real-world AI coding agent interactions, addressing a critical gap between curated benchmarks and actual usage. While Paper 1 offers valuable theoretical insights into AI safety, Paper 2's empirical foundation will likely drive widespread, immediate research across AI evaluation, software engineering, and HCI, yielding a broader and more immediate scientific impact.

vs. Evaluating Risks in Weak-to-Strong Alignment: A Bias-Variance Perspective

gemini-3 · 4/29/2026

Paper 1 introduces a large-scale, living dataset of real-world AI coding agent interactions, addressing a critical gap between curated benchmarks and actual developer workflows. Datasets of this scale in rapidly growing areas like AI-assisted software engineering typically have massive scientific impact, spurring widespread empirical research, tooling improvements, and direct industry applications. While Paper 2 offers valuable theoretical insights into AI safety, Paper 1's immediate practical relevance, novelty of 'in the wild' data, and broader audience across software engineering and AI give it a higher potential for widespread impact and citations.

vs. Aligning with Your Own Voice: Self-Corrected Preference Learning for Hallucination Mitigation in LVLMs

gemini-3 · 4/28/2026

Paper 1 introduces a pioneering, large-scale dataset of real-world AI coding agent interactions, addressing a critical gap in empirical evaluation. Its insights into human-agent collaboration and real-world failure modes provide foundational knowledge likely to influence software engineering, HCI, and AI research broadly. In contrast, Paper 2 offers a valuable but more incremental algorithmic improvement for LVLM hallucination mitigation. Paper 1's broader applicability and potential to establish a new evaluation paradigm give it a significantly higher expected scientific impact.

vs. Aligning with Your Own Voice: Self-Corrected Preference Learning for Hallucination Mitigation in LVLMs

gemini-3 · 4/28/2026

Paper 2 introduces a foundational, large-scale dataset of real-world human-AI interactions for coding agents. Living datasets documenting 'in the wild' behavior typically have a broader and longer-lasting scientific impact than specific algorithmic improvements, as they enable extensive future research across HCI, AI evaluation, and software engineering. Paper 1 offers a valuable but narrower methodological improvement for mitigating hallucinations in LVLMs.

vs. Towards Lawful Autonomous Driving: Deriving Scenario-Aware Driving Requirements from Traffic Laws and Regulations

gpt-5.2 · 4/28/2026

Paper 2 likely has higher impact due to broader applicability and timeliness: a large, continuously updated dataset of real-world coding-agent sessions can become a community benchmark underpinning research in agent evaluation, HCI, software engineering, security, and ML reliability. Its methodology (in-the-wild collection, full interaction traces, authorship attribution, and empirical findings on code survival and vulnerabilities) enables many follow-on studies and tool improvements. Paper 1 is novel and rigorous with strong real-world relevance, but it is more domain- and jurisdiction-specific (traffic laws/Chinese context), narrowing cross-field reuse compared to a general-purpose coding-agent dataset.

vs. Towards Lawful Autonomous Driving: Deriving Scenario-Aware Driving Requirements from Traffic Laws and Regulations

gpt-5.2 · 4/28/2026

Paper 2 likely has higher impact due to broader relevance and timeliness: a continuously growing, real-world dataset of coding-agent interactions can influence evaluation standards, agent design, safety/security research, and software engineering practices across many domains. Its empirical findings (code survival rate, bimodal usage, vulnerability increases, user pushback) are immediately actionable and generalizable beyond a single country or regulatory regime. Paper 1 is innovative and application-driven for AV compliance, but its scope is narrower (traffic-law jurisdiction, AV domain) and may face generalization/maintenance challenges across regions and evolving laws.

vs. Efficient Agent Evaluation via Diversity-Guided User Simulation

gpt-5.2 · 4/26/2026

Paper 2 likely has higher impact: it introduces a large, continuously updated real-world dataset of coding-agent usage with rich interaction traces and empirical findings (usefulness, inefficiency, security vulnerabilities). This enables broad downstream research across software engineering, HCI, AI evaluation, and security, and is highly timely as coding agents proliferate. Paper 1 is novel and useful methodologically for efficient, coverage-guided agent evaluation, but its scope is narrower and impact depends on adoption of its specific framework and simulator assumptions, whereas SWE-chat can become a community benchmark and measurement substrate.

vs. Measuring the Machine: Evaluating Generative AI as Pluralist Sociotechical Systems

gemini-3 · 4/23/2026

Paper 2 introduces the first large-scale, living dataset of real-world AI coding agent interactions, a rapidly growing area with massive industry and academic interest. High-quality datasets in active AI subfields consistently drive extensive follow-on research and citations. While Paper 1 provides valuable theoretical frameworks for AI ethics and governance, Paper 2's empirical foundation and immediate applicability to improving software engineering AI agents give it a higher potential for broad, immediate, and measurable scientific impact.

vs. Measuring the Machine: Evaluating Generative AI as Pluralist Sociotechical Systems

gpt-5.2 · 4/23/2026

Paper 2 likely has higher scientific impact due to its concrete, scalable empirical contribution: a large, living dataset of real-world coding-agent interactions with tool-call traces and code authorship attribution. This enables reproducible measurement, benchmarking, and safety/security research, with immediate applications for agent design, evaluation, and deployment in widely used developer workflows. Its findings on efficiency, survivability of agent code, vulnerability introduction, and human pushback are timely and actionable. Paper 1 is conceptually novel and important for governance, but may have narrower methodological uptake than a broadly reusable dataset.

vs. Self-Awareness before Action: Mitigating Logical Inertia via Proactive Cognitive Awareness

gpt-5.2 · 4/23/2026

Paper 1 has higher likely impact: it introduces a large, continuously updated real-world dataset of coding-agent interactions with rich traces (prompts, tool calls, code attribution) that can underpin many follow-on studies and benchmarks. Its findings (survival rate of agent code, security vulnerability comparisons, user pushback) are directly actionable for agent design, evaluation, and policy, with broad relevance across SE, ML, HCI, and security. Paper 2 proposes a reasoning framework with benchmark gains, but appears narrower in scope and more susceptible to benchmark-specific effects.

vs. Self-Awareness before Action: Mitigating Logical Inertia via Proactive Cognitive Awareness

claude-opus-4.6 · 4/23/2026

SWE-chat introduces the first large-scale empirical dataset of real-world AI coding agent interactions, addressing a critical gap in understanding how AI coding tools are actually used in practice. Its findings (e.g., 44% code survival rate, bimodal usage patterns, security vulnerabilities) have broad implications for AI-assisted software engineering, tool design, and policy. As a living dataset, it enables ongoing research. Paper 2 proposes an incremental reasoning framework (SABA) tested primarily on puzzle benchmarks, with narrower applicability and less novelty relative to existing self-reflection and chain-of-thought methods.

vs. AAC: Admissible-by-Architecture Differentiable Landmark Compression for ALT

gpt-5.2 · 4/23/2026

Paper 2 is likely to have higher scientific impact because it introduces a large, continuously updated real-world dataset of coding-agent sessions with rich interaction traces and human/agent attribution, enabling broad empirical research across ML, HCI, software engineering, and security. Its findings directly inform deployment, evaluation, and safety of coding agents—an urgent, timely topic with wide practical relevance. Paper 1 is methodologically rigorous and novel for admissible differentiable heuristics, but its applications are narrower (ALT/shortest-path heuristic search) and thus likely to influence a smaller set of fields.

vs. AAC: Admissible-by-Architecture Differentiable Landmark Compression for ALT

claude-opus-4.6 · 4/23/2026

SWE-chat addresses a timely, high-impact gap: empirical understanding of how AI coding agents are actually used in practice. Its large-scale, living dataset of 6,000+ real-world sessions provides foundational infrastructure for the rapidly growing field of AI-assisted software development. Key findings (e.g., only 44% of agent code survives, increased security vulnerabilities) have broad implications for tool design, policy, and research benchmarks. Paper 2, while technically rigorous, addresses a narrow problem (landmark compression for ALT heuristics) with incremental improvements over strong baselines, limiting its breadth of impact.

vs. Diagnosing CFG Interpretation in LLMs

gemini-3 · 4/23/2026

Paper 1 provides a large-scale, living dataset of real-world AI coding agent interactions, addressing a critical gap between benchmarks and actual developer workflows. Given the explosive adoption of AI coding assistants, empirical datasets like SWE-chat are highly timely and will likely drive extensive follow-up research and tool development. While Paper 2 offers valuable theoretical insights into LLM reasoning capabilities, Paper 1's broad real-world applicability, novel data collection, and immediate relevance to applied software engineering give it a significantly higher potential for widespread scientific impact.

vs. Diagnosing CFG Interpretation in LLMs

claude-opus-4.6 · 4/23/2026

SWE-chat introduces the first large-scale empirical dataset of real-world AI coding agent usage, addressing a critical gap in understanding how AI tools perform in practice versus benchmarks. Its findings (e.g., only 44% of agent code survives, increased security vulnerabilities, bimodal usage patterns) have immediate implications for tool design, software engineering practices, and AI safety. As a living dataset, it enables ongoing research. Paper 2 provides useful diagnostics on LLM grammar interpretation but addresses a narrower technical question with less immediate broad impact.

vs. ACIArena: Toward Unified Evaluation for Agent Cascading Injection

gpt-5.2 · 4/23/2026

Paper 1 has higher impact potential due to its large-scale, real-world “in-the-wild” dataset of coding-agent interactions (6,000 sessions, full traces, authorship attribution) and concrete empirical findings on efficiency, security, and human-agent dynamics. Such a living dataset can become a widely used resource across software engineering, HCI, ML evaluation, and security, enabling many follow-on studies and benchmarking beyond curated tasks. Paper 2 is timely and valuable for MAS security evaluation, but its benchmark is smaller and more niche; transfer to broad developer workflows may be narrower than SWE-chat’s cross-field applicability.