TerminalWorld: Benchmarking Agents on Real-World Terminal Tasks

Zhaoyang Chu, Jiarui Hu, Xingyu Jiang, Pengyu Zou, Han Li, Chao Peng, Peter O'Hearn, Earl T. Barr

May 21, 2026

arXiv:2605.22535v1

cs.AI (primary)

#1342 of 2292 · Artificial Intelligence

Tournament Score: 1392 ± 47 (scale: 1050 to 1800)

Win Rate: 60%

Rating: 6.8 / 10

Significance 7 · Rigor 6 · Novelty 7 · Clarity 7.5

Abstract

We introduce TerminalWorld, a scalable data engine that automatically reverse-engineers high-fidelity evaluation tasks from "in-the-wild" terminal recordings. Processing 80,870 terminal recordings, the engine yields a full benchmark of 1,530 validated tasks, spanning 18 real-world categories, ranging from short everyday operations to workflows exceeding 50 steps, and covering 1,280 unique commands. From these, we curate a Verified subset of 200 representative, manually reviewed tasks. Comprehensive benchmarking on TerminalWorld-Verified across eight frontier models and six agents reveals that current systems still struggle with authentic terminal workflows, achieving a maximum pass rate of only 62.5%. Moreover, TerminalWorld captures real-world terminal capabilities distinct from existing expert-curated benchmarks (e.g., Terminal-Bench), with only a weak correlation to their scores (Pearson r=0.20). The automated engine makes TerminalWorld authentic and scalable by construction, enabling it to evaluate agents in real-world terminal environments as developer practices evolve. Data and code are available at https://github.com/EuniAI/TerminalWorld.

AI Impact Assessments

(1 model)

Scientific Impact Assessment: TerminalWorld

1. Core Contribution

TerminalWorld introduces a scalable data engine that automatically converts publicly shared terminal recordings (from asciinema) into executable, validated evaluation tasks for terminal agents. The core insight is that naturally occurring terminal operations can be reverse-engineered into benchmarks that are "authentic by construction." The pipeline has four stages: (1) collecting and filtering 80,870 recordings, (2) synthesizing task instructions and reference solutions via LLM distillation, (3) reproducing executable Docker environments through an execution-feedback loop, and (4) generating calibrated test suites via a trial-based refinement process (AllPassing, Nop, and Partial trials). This yields 1,530 validated tasks (200 manually verified), spanning 18 categories and 1,280 unique commands.

The key problem addressed is the disconnect between expert-curated terminal benchmarks (which tend toward adversarial puzzles) and authentic developer workflows. The weak correlation (Pearson r=0.20) between Terminal-Bench and TerminalWorld scores is the paper's main empirical support for this claim, subject to the comparison caveats raised in Section 2.
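The counts quoted above imply a steep funnel from raw recordings to usable tasks. The short sketch below only redoes that arithmetic from the three figures given in the abstract; the helper name `pct` is introduced here for illustration and is not part of the TerminalWorld code.

```python
# Counts quoted in the abstract and in this assessment.
recordings = 80_870
validated_tasks = 1_530
verified_tasks = 200


def pct(part: int, whole: int) -> float:
    """Return part/whole as a percentage, rounded to one decimal place."""
    return round(100 * part / whole, 1)


# Share of recordings that survive the whole pipeline.
yield_pct = pct(validated_tasks, recordings)
# Share of validated tasks that were manually reviewed.
verified_pct = pct(verified_tasks, validated_tasks)

print(f"recordings -> validated tasks: {yield_pct}%")      # 1.9%
print(f"validated tasks -> Verified subset: {verified_pct}%")  # 13.1%
```

Roughly one recording in fifty becomes a task, which is the figure behind the selection-bias concern raised in Section 2.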

2. Methodological Rigor

Strengths in methodology:

The trial-based test refinement loop is well-designed: AllPassing prevents false negatives, Nop prevents trivial false positives, and Partial trials (ablated solutions) enhance discriminability. This is a thoughtful approach to the oracle problem.

The execution-based environment reproduction loop provides concrete validation that synthesized tasks are actually runnable, rather than relying on static synthesis alone.

The filtering pipeline is comprehensive, addressing PII, TUI interactions, inaccessible dependencies, and low-quality recordings.
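The three-trial check described above can be pictured with a minimal sketch. It assumes a test suite is a list of predicates over a final environment state, modeled here as a plain dict from file path to content. All names (`TrialReport`, `validate_suite`, the example task) are hypothetical, and the acceptance rule is one plausible reading of the review's description, not the paper's actual implementation.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Hypothetical model: the final environment state is a path -> content map,
# and each test is a predicate over that state.
State = Dict[str, str]
Test = Callable[[State], bool]


@dataclass
class TrialReport:
    all_passing_ok: bool  # reference solution passes every test
    nop_ok: bool          # doing nothing does NOT pass the suite
    partial_ok: bool      # no ablated solution passes the suite

    @property
    def accepted(self) -> bool:
        return self.all_passing_ok and self.nop_ok and self.partial_ok


def passes(tests: List[Test], state: State) -> bool:
    """A state passes the suite only if every test holds."""
    return all(test(state) for test in tests)


def validate_suite(tests: List[Test], reference: State, nop: State,
                   partials: List[State]) -> TrialReport:
    return TrialReport(
        all_passing_ok=passes(tests, reference),
        nop_ok=not passes(tests, nop),
        partial_ok=all(not passes(tests, s) for s in partials),
    )


# Illustrative task: create out/report.txt containing "done".
reference = {"out/report.txt": "done"}
nop: State = {}                        # untouched environment
partials = [{"out/report.txt": ""}]    # file created but never filled

weak_suite = [lambda s: "out/report.txt" in s]
strong_suite = weak_suite + [lambda s: s.get("out/report.txt") == "done"]

print(validate_suite(weak_suite, reference, nop, partials).accepted)    # False
print(validate_suite(strong_suite, reference, nop, partials).accepted)  # True
```

The weak suite is rejected only because of the Partial trial, which illustrates the discriminability argument: existence checks survive the AllPassing and Nop trials but cannot tell a half-finished solution from a complete one.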

Weaknesses and concerns:

The pipeline relies heavily on Claude Sonnet 4.6 and Claude Code for task synthesis, environment reproduction, and test generation. This creates a potential circularity concern: the benchmark is substantially shaped by one model family's understanding of terminal tasks. The paper does not analyze how different synthesis models might produce different benchmark characteristics.

The conversion from 80,870 recordings to 1,530 tasks (1.9% yield) raises questions about selection bias—are the surviving tasks representative of real-world usage, or do they represent the subset that LLMs can successfully process?

The "Verified" subset selection explicitly prioritizes "longer command sequences and non-trivial domain-specific tools," which introduces curation bias contrary to the paper's stated goal of capturing natural usage distributions.

The comparison with Terminal-Bench uses officially reported scores under potentially different configurations (Table 6 shows varying scaffolds and reasoning efforts), weakening the r=0.20 correlation claim. GPT-5.5's Terminal-Bench score uses Codex CLI with "xhigh" reasoning while TerminalWorld uses Terminus-2 with "medium"—this alone could explain much of the ranking reshuffling.
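For readers who want to see what the correlation claim measures, here is a small sketch of Pearson's r in plain Python. The score lists are made-up placeholders chosen for easy arithmetic; they are not scores from Terminal-Bench, TerminalWorld, or any real model.

```python
import math
from typing import Sequence


def pearson_r(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equal-length samples."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two samples of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sample")
    return cov / math.sqrt(var_x * var_y)


# Placeholder scores for four hypothetical models on benchmark A.
bench_a = [10.0, 20.0, 30.0, 40.0]
# Benchmark B preserves the ranking exactly (scores simply doubled).
bench_b_same_order = [20.0, 40.0, 60.0, 80.0]
# Benchmark B reshuffles the ranking.
bench_b_reshuffled = [30.0, 10.0, 40.0, 20.0]

print(pearson_r(bench_a, bench_b_same_order))   # 1.0
print(pearson_r(bench_a, bench_b_reshuffled))   # 0.0
```

With only a handful of models, a single pair measured under a different scaffold or reasoning effort can move r substantially, which is why the configuration mismatch noted above matters for interpreting r=0.20.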

3. Potential Impact

High impact areas:

The automated pipeline concept is genuinely valuable. As terminal tools evolve rapidly, static benchmarks become outdated quickly. A continuously refreshable benchmark addresses a real infrastructure need for the agent evaluation community.

The finding that agents solve tasks through alternative command paths (21.4% median overlap with human solutions) provides useful insight for benchmark design—outcome-based evaluation is essential, and process-based metrics would be misleading.

The "efficiency paradox" finding (failed attempts consuming 3.3× more tokens) has practical implications for agent design, suggesting that exploration strategies need fundamental improvement.
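The command-overlap finding can be made concrete with a rough sketch. This assessment does not state how the paper defines overlap, so the code below assumes Jaccard similarity over the sets of command names, extracted naively as the first token of each pipeline segment. Both the metric and the example command lines are illustrative assumptions.

```python
from typing import Iterable, Set


def command_names(lines: Iterable[str]) -> Set[str]:
    """Collect the first token of every pipe-separated segment.

    Naive on purpose: it ignores quoting, subshells, and operators
    such as && or ;, which a real parser would need to handle.
    """
    names: Set[str] = set()
    for line in lines:
        for segment in line.split("|"):
            tokens = segment.split()
            if tokens:
                names.add(tokens[0])
    return names


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Size of the intersection divided by size of the union."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


# Hypothetical human and agent solutions to the same task.
human = ["grep -r TODO src | wc -l", "tar czf out.tgz src"]
agent = ["find src -type f | xargs grep -c TODO", "tar czf out.tgz src"]

overlap = jaccard(command_names(human), command_names(agent))
print(overlap)  # 0.2
```

Both solutions could satisfy the same outcome tests while sharing only `tar`, which is the situation that makes process-based scoring misleading.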

Broader influence:

The methodology of reverse-engineering evaluation benchmarks from naturally occurring traces could generalize beyond terminal tasks to other domains (e.g., web browsing recordings, IDE interactions).

The 91% command coverage gap with Terminal-Bench demonstrates concrete blind spots in expert curation, which should motivate complementary evaluation approaches.

4. Timeliness & Relevance

This work is highly timely. The rapid deployment of commercial CLI agents (Claude Code, Codex CLI, Gemini CLI) in 2025-2026 creates an urgent need for realistic evaluation. The paper directly addresses the tension between benchmark difficulty and authenticity that has been a growing concern in the AI evaluation community. The focus on evolving developer practices and tool diversity is particularly relevant as the terminal ecosystem continues to expand.

5. Strengths & Limitations

Key strengths:

The "authentic by construction" framing is compelling and well-executed—grounding tasks in real human behavior rather than synthetic generation or expert curation fills a genuine gap.

Comprehensive evaluation across 8 models and 6 agents with detailed cost analysis provides actionable insights.

The ethical design (pointer-based architecture respecting right-to-be-forgotten) is thoughtful.

Open-sourced data and code enhance reproducibility.

Notable limitations:

TUI exclusion significantly narrows scope—many real terminal workflows involve vim, nano, or other TUI applications, and excluding them systematically biases the benchmark toward pipe-oriented workflows.

No formal analysis of test suite quality beyond the three-trial validation. There is no measurement of test suite coverage, mutation testing, or analysis of how many distinct correct solutions each test suite admits.

The benchmark's difficulty may be partially artificial. Some failures may stem from environment reproduction issues rather than genuine task difficulty. The paper acknowledges harness errors (2.5-21.5%) but doesn't deeply analyze whether subtle environment mismatches affect pass rates.

No analysis of inter-annotator agreement for the Verified subset beyond stating that four authors conducted review.

Scalability claims are aspirational—the paper shows one snapshot, not a demonstrated continuous update cycle.

The 62.5% ceiling is somewhat ambiguous without understanding how many failures are due to genuine capability gaps versus test brittleness or environment issues.

Summary

TerminalWorld makes a solid contribution to the agent evaluation ecosystem by providing an automated, scalable pipeline for generating authentic terminal benchmarks. Its core insight, that real-world recordings can be reverse-engineered into evaluation tasks, is valuable and well-executed. However, the reliance on a single LLM family for synthesis, the mismatched configurations in the Terminal-Bench comparison, and the lack of rigorous test quality analysis somewhat temper the strength of its claims. The work is most impactful as infrastructure (the reusable pipeline) rather than as a static benchmark.

Rating: 6.8 / 10

Significance 7 · Rigor 6 · Novelty 7 · Clarity 7.5

Generated May 22, 2026

Comparison History (15)

vs. LACO: Adaptive Latent Communication for Collaborative Driving

claude-opus-4.6 · 5/22/2026

TerminalWorld introduces a novel scalable benchmarking methodology that addresses a significant gap in evaluating AI agents on real-world terminal tasks. Its automated data engine processing 80K+ recordings, coverage of 1,280 unique commands across 18 categories, and demonstration that current frontier models achieve only 62.5% pass rate provide high-impact insights for the rapidly growing AI agent community. The benchmark fills a distinct niche (weak correlation with existing benchmarks) and is designed to scale with evolving practices. While LACO presents solid technical contributions to collaborative driving, its scope is narrower, constrained to a specific simulation environment, and builds incrementally on existing paradigms.

vs. Active Evidence-Seeking and Diagnostic Reasoning in Large Language Models for Clinical Decision Support

gemini-3.1 · 5/22/2026

Paper 1 addresses a critical, high-stakes domain (healthcare) by revealing significant flaws in how medical LLMs are currently evaluated. Demonstrating that static benchmarks overestimate interactive diagnostic performance has profound implications for patient safety and AI deployment in clinical settings. While Paper 2 provides a valuable benchmark for software agents, Paper 1's focus on clinical reasoning under uncertainty directly impacts the safe integration of AI in life-critical applications, giving it a broader and more urgent scientific and societal impact.

vs. Planning in the LLM Era: Building for Reliability and Efficiency

claude-opus-4.6 · 5/22/2026

TerminalWorld introduces a novel, scalable benchmark with concrete data (1,530 tasks from 80,870 recordings), open-source resources, and empirical findings showing current agents achieve only 62.5% on real terminal tasks. It fills a clear gap in agent evaluation with a reproducible methodology. Paper 2 is a position/survey paper arguing for a shift in LLM-based planning approaches, which, while insightful, offers less concrete novelty and empirical contribution. Benchmarks tend to have outsized impact by enabling standardized evaluation across the community.

vs. Parametric Modular Answer Set Programs Made Declarative

claude-opus-4.6 · 5/22/2026

TerminalWorld addresses the timely and high-impact area of benchmarking AI agents on real-world tasks. Its scalable data engine for generating evaluation tasks from real terminal recordings is novel, and the finding that current frontier models struggle (max 62.5% pass rate) with weak correlation to existing benchmarks highlights important gaps. The benchmark serves a broad community working on LLM agents and has immediate practical applications. Paper 1 makes solid theoretical contributions to modular ASP but addresses a narrower community with less immediate broad impact.

vs. Insights Generator: Systematic Corpus-Level Trace Diagnostics for LLM Agents

gemini-3.1 · 5/22/2026

Paper 2 addresses a universal and critical bottleneck in LLM agent development: corpus-level trace diagnostics and debugging. While Paper 1 provides a valuable domain-specific benchmark for terminal tasks, Paper 2 introduces a novel methodology applicable across any LLM agent domain, offering scalable, evidence-backed diagnostics that lead to significant downstream performance improvements. This broader applicability and methodological innovation give Paper 2 a higher potential scientific impact.

vs. SGR-Bench: Benchmarking Search Agents on State-Gated Retrieval

gpt-5.2 · 5/22/2026

Paper 1 likely has higher impact due to its scalable, automated pipeline that converts 80k real terminal recordings into a large, diverse benchmark (1,530 tasks) with a verified subset, enabling continual updating as developer practices change. This methodological innovation and dataset scale increase reuse potential and make it broadly relevant to agent evaluation, RL, systems, and software engineering. Paper 2 is timely and provides insightful failure analysis for web retrieval, but is smaller (100 tasks) and more domain-specific, which may limit breadth and long-term extensibility.

vs. Claw AI Lab: An Autonomous Multi-Agent Research Team

gemini-3.1 · 5/22/2026

While Paper 1 offers a rigorous and highly useful benchmark for terminal agents, Paper 2 tackles the automation of the scientific process itself. By introducing an interactive, multi-agent autonomous research laboratory, Paper 2 has a much broader potential impact, as it aims to accelerate and steer research and discovery across domains. This paradigm shift in how research is conducted holds greater transformative potential than a specialized evaluation benchmark.

vs. AttuneBench: A Conversation-Based Benchmark for LLM Emotional Intelligence

gemini-3.1 · 5/22/2026

Paper 1 presents a highly scalable, automated methodology to reverse-engineer real-world tasks, overcoming the bottleneck of manual curation. Its application to AI agents in terminal environments addresses a critical and rapidly growing area with immense practical and economic implications (software automation). While Paper 2 offers valuable insights into LLM emotional intelligence, Paper 1's scale (80k recordings, 1530 tasks) and innovative data engine provide a more substantial and reproducible contribution to AI benchmarking methodologies.

vs. AttuneBench: A Conversation-Based Benchmark for LLM Emotional Intelligence

claude-opus-4.6 · 5/22/2026

TerminalWorld addresses a critical gap in evaluating AI agents on real-world terminal tasks with a scalable, automated data engine processing 80K+ recordings. Its methodology of reverse-engineering tasks from real recordings is novel and practically important as AI coding agents proliferate. The benchmark reveals significant limitations of frontier models (62.5% max pass rate) and shows weak correlation with existing benchmarks, demonstrating unique coverage. Its scalability and real-world grounding give it broader impact for the rapidly growing AI agent ecosystem. AttuneBench contributes meaningfully to EI evaluation but addresses a narrower, less urgently demanded capability with a smaller dataset.

vs. Declarative Data Services: Structured Agentic Discovery for Composing Data Systems

claude-opus-4.6 · 5/22/2026

TerminalWorld introduces a scalable, automated benchmark with 1,530 validated tasks from real-world terminal recordings, addressing a clear gap in evaluating LLM agents on authentic terminal workflows. Its comprehensive evaluation of 8 frontier models and 6 agents, showing only 62.5% max pass rate, provides actionable insights for the community. The open-source data engine enables continuous benchmark evolution. Paper 2 presents an early-stage prototype (self-described as 'proof of life') for structured agentic discovery of data-system compositions, with limited evaluation on a single workload. Paper 1's broader applicability, methodological rigor, and community utility give it substantially higher impact potential.

vs. The Log is the Agent: Event-Sourced Reactive Graphs for Auditable, Forkable Agentic Systems

gemini-3.1 · 5/22/2026

Paper 1 introduces a paradigm shift in AI agent architecture by applying event-sourcing to agentic systems. This novel approach addresses critical bottlenecks in current frameworks, such as auditability, determinism, and debugging. While Paper 2 provides a valuable benchmark for terminal tasks, benchmarks are often transient in their impact as models rapidly improve. In contrast, Paper 1 offers foundational design principles that could broadly influence the long-term development and deployment of robust, enterprise-ready AI agents across various domains.

vs. Deep Reinforcement Learning for Flexible Job Shop Scheduling with Random Job Arrivals

gpt-5.2 · 5/22/2026

Paper 2 has higher likely impact due to strong novelty (automated reverse-engineering of tasks from real terminal recordings), broad applicability (benchmarking LLM/agent systems, automation, HCI, software engineering, security), and timeliness given rapid growth of agentic coding/terminal tools. Its large-scale dataset (1,530 tasks from 80,870 recordings), verified subset, and systematic evaluation across models/agents provide methodological rigor and a reusable community resource that can shape future research. Paper 1 is useful for manufacturing scheduling but is narrower and uses a constrained DRL-over-dispatching-rules approach with more limited cross-field influence.

vs. ArborKV: Structure-Aware KV Cache Management for Scaling Tree-based LLM Reasoning

gpt-5.2 · 5/22/2026

Paper 2 (ArborKV) likely has higher impact: it introduces a novel, generally applicable systems technique for scaling tree-based LLM inference by addressing a clear bottleneck (KV-cache memory) with structure-aware eviction and rehydration, yielding large memory savings (~4x) while preserving accuracy. This has immediate real-world applicability for deploying search-based reasoning under fixed hardware budgets and can influence LLM serving, inference optimization, and algorithm–systems co-design broadly. Paper 1 is valuable as an authentic benchmark, but benchmarks typically have narrower cross-field impact than a reusable core systems method.

vs. Epicure: Navigating the Emergent Geometry of Food Ingredient Embeddings

gemini-3.1 · 5/22/2026

Paper 2 addresses a highly timely and rapidly growing field (LLM agents in real-world environments) with broad implications for software engineering and automation. Its scalable, automated approach to reverse-engineering real-world tasks offers significant methodological innovation over expert curation. While Paper 1 provides a novel application of graph embeddings to food science, Paper 2's focus on foundational AI capabilities gives it a much broader potential impact across disciplines.

vs. Advancing Mathematics Research with AI-Driven Formal Proof Search

gpt-5.2 · 5/22/2026

Paper 1 is more likely to have higher scientific impact due to stronger novelty and broader cross-field implications: demonstrating AI-assisted formal proof search solving multiple open Erdős problems and many OEIS conjectures is a rare, high-signal advance with direct consequences for mathematics and adjacent scientific domains. Its results suggest a new research paradigm (LLM+formal verification) with immediate real-world application in producing trustworthy proofs. Paper 2 is timely and useful for agent evaluation, but benchmarks typically yield narrower, tooling-focused impact than a demonstrated capability to resolve open scientific problems.