TerminalWorld: Benchmarking Agents on Real-World Terminal Tasks
Zhaoyang Chu, Jiarui Hu, Xingyu Jiang, Pengyu Zou, Han Li, Chao Peng, Peter O'Hearn, Earl T. Barr
Abstract
We introduce TerminalWorld, a scalable data engine that automatically reverse-engineers high-fidelity evaluation tasks from "in-the-wild" terminal recordings. Processing 80,870 terminal recordings, the engine yields a full benchmark of 1,530 validated tasks, spanning 18 real-world categories, ranging from short everyday operations to workflows exceeding 50 steps, and covering 1,280 unique commands. From these, we curate a Verified subset of 200 representative, manually reviewed tasks. Comprehensive benchmarking on TerminalWorld-Verified across eight frontier models and six agents reveals that current systems still struggle with authentic terminal workflows, achieving a maximum pass rate of only 62.5%. Moreover, TerminalWorld captures real-world terminal capabilities distinct from existing expert-curated benchmarks (e.g., Terminal-Bench), with only a weak correlation to their scores (Pearson r=0.20). The automated engine makes TerminalWorld authentic and scalable by construction, enabling it to evaluate agents in real-world terminal environments as developer practices evolve. Data and code are available at https://github.com/EuniAI/TerminalWorld.
AI Impact Assessments
Scientific Impact Assessment: TerminalWorld
1. Core Contribution
TerminalWorld introduces a scalable data engine that automatically converts publicly shared terminal recordings (from asciinema) into executable, validated evaluation tasks for terminal agents. The core insight is that naturally occurring terminal operations can be reverse-engineered into benchmarks that are "authentic by construction." The pipeline has four stages: (1) collecting and filtering 80,870 recordings, (2) synthesizing task instructions and reference solutions via LLM distillation, (3) reproducing executable Docker environments through an execution-feedback loop, and (4) generating calibrated test suites via a trial-based refinement process (AllPassing, Nop, and Partial trials). This yields 1,530 validated tasks (200 manually verified), spanning 18 categories and 1,280 unique commands.
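The trial-based refinement in stage (4) can be illustrated with a small sketch. The source names the three trials (AllPassing, Nop, Partial) but does not define them, so the interpretation here is an assumption: the suite must pass after the reference solution runs, must fail when nothing is done, and must fail when only part of the task is done. The `State` model, `calibrate`, and the example tests are hypothetical names for illustration, not the authors' implementation.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Assumption: the environment is modeled as a mapping from file path to content.
State = Dict[str, str]
Test = Callable[[State], bool]


@dataclass
class TrialReport:
    all_passing_ok: bool  # every test passes after the reference solution
    nop_ok: bool          # at least one test fails when nothing was done
    partial_ok: bool      # at least one test fails on a partial solution

    @property
    def calibrated(self) -> bool:
        return self.all_passing_ok and self.nop_ok and self.partial_ok


def run_suite(tests: List[Test], state: State) -> List[bool]:
    """Evaluate every test against one environment state."""
    return [test(state) for test in tests]


def calibrate(tests: List[Test], reference: State, nop: State,
              partial: State) -> TrialReport:
    """Run the three trials (as interpreted above) and report each outcome."""
    return TrialReport(
        all_passing_ok=all(run_suite(tests, reference)),
        nop_ok=not all(run_suite(tests, nop)),
        partial_ok=not all(run_suite(tests, partial)),
    )


# Hypothetical task: "write the word 'done' into out.txt".
def file_exists(state: State) -> bool:
    return "out.txt" in state


def content_is_done(state: State) -> bool:
    return state.get("out.txt") == "done"


reference_state = {"out.txt": "done"}  # after the reference solution
nop_state: State = {}                  # agent did nothing
partial_state = {"out.txt": ""}        # file created, content missing

strict_report = calibrate([file_exists, content_is_done],
                          reference_state, nop_state, partial_state)
weak_report = calibrate([file_exists],
                        reference_state, nop_state, partial_state)
```

Under these assumptions, the two-test suite is calibrated, while the suite that only checks for the file's existence is rejected because the partial solution slips through. In a refinement loop, such a rejection would be the signal to regenerate or tighten the tests.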
The key problem addressed is the disconnect between expert-curated terminal benchmarks (which tend toward adversarial puzzles) and authentic developer workflows. The weak correlation (Pearson r=0.20) between Terminal-Bench and TerminalWorld scores supports the claim that the two benchmarks measure different capabilities.
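For readers who want to reproduce a correlation figure of this kind, a minimal sketch of the Pearson coefficient over paired benchmark scores follows. The function name `pearson_r` and the input vectors are illustrative assumptions; the numbers are toy values, not scores from the paper, and the paper's own computation may differ in which systems are paired.

```python
import math
from typing import Sequence


def pearson_r(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation between two equally long score vectors."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two vectors of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of cross-deviations and sums of squared deviations.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant vector")
    return cov / math.sqrt(var_x * var_y)


# Toy inputs: one entry per evaluated system on each benchmark.
perfectly_aligned = pearson_r([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
uncorrelated = pearson_r([1.0, 2.0, 3.0, 4.0], [1.0, -1.0, -1.0, 1.0])
```

A value near 1 would mean the two benchmarks rank systems almost identically, and a value near 0 would mean the rankings are largely unrelated. With only a handful of systems the estimate is noisy, so a figure such as r=0.20 is best read alongside the number of paired data points.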
2. Methodological Rigor
Strengths in methodology:
Weaknesses and concerns:
3. Potential Impact
High impact areas:
Broader influence:
4. Timeliness & Relevance
This work is highly timely. The rapid deployment of commercial CLI agents (Claude Code, Codex CLI, Gemini CLI) in 2025-2026 creates an urgent need for realistic evaluation. The paper directly addresses the tension between benchmark difficulty and authenticity that has been a growing concern in the AI evaluation community. The focus on evolving developer practices and tool diversity is particularly relevant as the terminal ecosystem continues to expand.
5. Strengths & Limitations
Key strengths:
Notable limitations:
Summary
TerminalWorld makes a solid contribution to the agent evaluation ecosystem by providing an automated, scalable pipeline for generating authentic terminal benchmarks. Its core insight—that real-world recordings can be reverse-engineered into evaluation tasks—is valuable and well-executed. However, the reliance on a single LLM family for synthesis, the unfair Terminal-Bench comparison conditions, and the lack of rigorous test quality analysis somewhat temper the strength of its claims. The work is most impactful as infrastructure (the reusable pipeline) rather than as a static benchmark.
Generated May 22, 2026
Comparison History (15)
TerminalWorld introduces a novel scalable benchmarking methodology that addresses a significant gap in evaluating AI agents on real-world terminal tasks. Its automated data engine processing 80K+ recordings, coverage of 1,280 unique commands across 18 categories, and demonstration that current frontier models achieve only 62.5% pass rate provide high-impact insights for the rapidly growing AI agent community. The benchmark fills a distinct niche (weak correlation with existing benchmarks) and is designed to scale with evolving practices. While LACO presents solid technical contributions to collaborative driving, its scope is narrower, constrained to a specific simulation environment, and builds incrementally on existing paradigms.
Paper 1 addresses a critical, high-stakes domain (healthcare) by revealing significant flaws in how medical LLMs are currently evaluated. Demonstrating that static benchmarks overestimate interactive diagnostic performance has profound implications for patient safety and AI deployment in clinical settings. While Paper 2 provides a valuable benchmark for software agents, Paper 1's focus on clinical reasoning under uncertainty directly impacts the safe integration of AI in life-critical applications, giving it a broader and more urgent scientific and societal impact.
TerminalWorld introduces a novel, scalable benchmark with concrete data (1,530 tasks from 80,870 recordings), open-source resources, and empirical findings showing current agents achieve only 62.5% on real terminal tasks. It fills a clear gap in agent evaluation with a reproducible methodology. Paper 2 is a position/survey paper arguing for a shift in LLM-based planning approaches, which, while insightful, offers less concrete novelty and empirical contribution. Benchmarks tend to have outsized impact by enabling standardized evaluation across the community.
TerminalWorld addresses the timely and high-impact area of benchmarking AI agents on real-world tasks. Its scalable data engine for generating evaluation tasks from real terminal recordings is novel, and the finding that current frontier models struggle (max 62.5% pass rate) with weak correlation to existing benchmarks highlights important gaps. The benchmark serves a broad community working on LLM agents and has immediate practical applications. Paper 1 makes solid theoretical contributions to modular ASP but addresses a narrower community with less immediate broad impact.
Paper 2 addresses a universal and critical bottleneck in LLM agent development: corpus-level trace diagnostics and debugging. While Paper 1 provides a valuable domain-specific benchmark for terminal tasks, Paper 2 introduces a novel methodology applicable across any LLM agent domain, offering scalable, evidence-backed diagnostics that lead to significant downstream performance improvements. This broader applicability and methodological innovation give Paper 2 a higher potential scientific impact.
Paper 1 likely has higher impact due to its scalable, automated pipeline that converts 80k real terminal recordings into a large, diverse benchmark (1,530 tasks) with a verified subset, enabling continual updating as developer practices change. This methodological innovation and dataset scale increase reuse potential and make it broadly relevant to agent evaluation, RL, systems, and software engineering. Paper 2 is timely and provides insightful failure analysis for web retrieval, but is smaller (100 tasks) and more domain-specific, which may limit breadth and long-term extensibility.
While Paper 1 offers a rigorous and highly useful benchmark for terminal agents, Paper 2 tackles the automation of the scientific process itself. By introducing an interactive, multi-agent autonomous research laboratory, Paper 2 has a much broader potential impact, as it aims to accelerate and steer research and discovery across domains. This paradigm shift in how research is conducted holds greater transformative potential than a specialized evaluation benchmark.
Paper 1 presents a highly scalable, automated methodology to reverse-engineer real-world tasks, overcoming the bottleneck of manual curation. Its application to AI agents in terminal environments addresses a critical and rapidly growing area with immense practical and economic implications (software automation). While Paper 2 offers valuable insights into LLM emotional intelligence, Paper 1's scale (80K recordings, 1,530 tasks) and innovative data engine provide a more substantial and reproducible contribution to AI benchmarking methodologies.
TerminalWorld addresses a critical gap in evaluating AI agents on real-world terminal tasks with a scalable, automated data engine processing 80K+ recordings. Its methodology of reverse-engineering tasks from real recordings is novel and practically important as AI coding agents proliferate. The benchmark reveals significant limitations of frontier models (62.5% max pass rate) and shows weak correlation with existing benchmarks, demonstrating unique coverage. Its scalability and real-world grounding give it broader impact for the rapidly growing AI agent ecosystem. AttuneBench contributes meaningfully to EI evaluation but addresses a narrower, less urgently demanded capability with a smaller dataset.
TerminalWorld introduces a scalable, automated benchmark with 1,530 validated tasks from real-world terminal recordings, addressing a clear gap in evaluating LLM agents on authentic terminal workflows. Its comprehensive evaluation of 8 frontier models and 6 agents, showing only 62.5% max pass rate, provides actionable insights for the community. The open-source data engine enables continuous benchmark evolution. Paper 2 presents an early-stage prototype (self-described as 'proof of life') for structured agentic discovery of data-system compositions, with limited evaluation on a single workload. Paper 1's broader applicability, methodological rigor, and community utility give it substantially higher impact potential.
Paper 1 introduces a paradigm shift in AI agent architecture by applying event-sourcing to agentic systems. This novel approach addresses critical bottlenecks in current frameworks, such as auditability, determinism, and debugging. While Paper 2 provides a valuable benchmark for terminal tasks, benchmarks are often transient in their impact as models rapidly improve. In contrast, Paper 1 offers foundational design principles that could broadly influence the long-term development and deployment of robust, enterprise-ready AI agents across various domains.
Paper 2 has higher likely impact due to strong novelty (automated reverse-engineering of tasks from real terminal recordings), broad applicability (benchmarking LLM/agent systems, automation, HCI, software engineering, security), and timeliness given rapid growth of agentic coding/terminal tools. Its large-scale dataset (1,530 tasks from 80,870 recordings), verified subset, and systematic evaluation across models/agents provide methodological rigor and a reusable community resource that can shape future research. Paper 1 is useful for manufacturing scheduling but is narrower and uses a constrained DRL-over-dispatching-rules approach with more limited cross-field influence.
Paper 2 (ArborKV) likely has higher impact: it introduces a novel, generally applicable systems technique for scaling tree-based LLM inference by addressing a clear bottleneck (KV-cache memory) with structure-aware eviction and rehydration, yielding large memory savings (~4x) while preserving accuracy. This has immediate real-world applicability for deploying search-based reasoning under fixed hardware budgets and can influence LLM serving, inference optimization, and algorithm–systems co-design broadly. Paper 1 is valuable as an authentic benchmark, but benchmarks typically have narrower cross-field impact than a reusable core systems method.
Paper 2 addresses a highly timely and rapidly growing field (LLM agents in real-world environments) with broad implications for software engineering and automation. Its scalable, automated approach to reverse-engineering real-world tasks offers significant methodological innovation over expert curation. While Paper 1 provides a novel application of graph embeddings to food science, Paper 2's focus on foundational AI capabilities gives it a much broader potential impact across disciplines.
Paper 1 is more likely to have higher scientific impact due to stronger novelty and broader cross-field implications: demonstrating AI-assisted formal proof search solving multiple open Erdős problems and many OEIS conjectures is a rare, high-signal advance with direct consequences for mathematics and adjacent scientific domains. Its results suggest a new research paradigm (LLM+formal verification) with immediate real-world application in producing trustworthy proofs. Paper 2 is timely and useful for agent evaluation, but benchmarks typically yield narrower, tooling-focused impact than a demonstrated capability to resolve open scientific problems.