What Makes Interaction Trajectories Effective for Training Terminal Agents?
Sidi Yang, Chaofan Tao, Jierun Chen, Tiezheng Yu, Ruoyu Wang, Yuxin Jiang, Yiming Du, Wendong Xu
Abstract
Stronger code agents are commonly assumed to be superior teachers for post-training, yet this assumption remains poorly disentangled from task difficulty, harness design, and student capacity. We investigate this pedagogical link using Terminal-Lego, a scalable pipeline that transforms multi-domain real-world issues into environment-verified agentic tasks. Surprisingly, standalone performance does not dictate teaching efficacy: while Claude Opus 4.6 achieves higher scores on Terminal-Bench 2.0, students fine-tuned on trajectories from DeepSeek-V3.2, a lower-scoring agent, exhibit significantly stronger generalization. We attribute this "pedagogical paradox" to Environment-Grounded Supervision (EGS): trajectories that explicitly expose inspect-act-verify behaviors through harness-visible interactions allow students to internalize robust problem-solving routines rather than fragile action sequences. Scaling analysis reveals exceptional data efficiency: with only 15.3k Terminal-Lego trajectories, for example, Qwen3-32B achieves a 24.3% score on Terminal-Bench 2.0, rivaling previous SOTA performance established with over 30x the data volume. Our results suggest that the frontier of agent post-training lies beyond mere outcome-matching, shifting the focus toward "Harness Engineering", where the systematic design of environment-grounded interaction structures serves as the primary catalyst for reproducible and generalizable agentic intelligence.
AI Impact Assessments
Scientific Impact Assessment
Core Contribution
This paper challenges the widely held assumption that stronger code agents produce better training data for student models. Through controlled experiments using a new pipeline called Terminal-Lego, the authors demonstrate a "pedagogical paradox": DeepSeek-V3.2, which scores lower as a standalone agent on Terminal-Bench 2.0 (39.3% vs. Claude Opus 4.6's 69.4%), produces substantially stronger students when its trajectories are used for supervised fine-tuning. The paper attributes this to Environment-Grounded Supervision (EGS): trajectories that expose explicit inspect-act-verify loops through harness-visible commands are more teachable than the compact, efficient solutions produced by stronger models.
The key conceptual contribution is the Targeted Observation Ratio (TOR), which measures the fraction of actions preceded by path-aligned observation commands. This metric serves as a proxy for trajectory teachability and is shown to correlate with downstream student performance. The practical outcome is remarkable data efficiency: 15.3k trajectories achieve results competitive with methods using 30× more data.
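To make the TOR definition above more concrete, here is a minimal Python sketch of one way such a ratio could be computed. It is not the paper's implementation: the set of observation commands, the rule for spotting path-like tokens, and the choice to count any earlier observation of the same path (rather than only the immediately preceding one) are all assumptions made for illustration.

```python
import shlex

# Assumed set of read-only "observation" commands; the paper's list is not given here.
OBSERVE_CMDS = {"cat", "ls", "head", "tail", "grep", "find", "stat", "wc"}


def _paths(tokens):
    """Crude heuristic: an argument looks like a path if it contains '/' or '.'."""
    return {t for t in tokens[1:] if not t.startswith("-") and ("/" in t or "." in t)}


def targeted_observation_ratio(commands):
    """Fraction of non-observation commands that touch a previously observed path."""
    seen = set()          # paths inspected so far by observation commands
    actions = 0           # number of non-observation commands
    targeted = 0          # actions that share a path with an earlier observation
    for cmd in commands:
        tokens = shlex.split(cmd)
        if not tokens:
            continue
        paths = _paths(tokens)
        if tokens[0] in OBSERVE_CMDS:
            seen |= paths
        else:
            actions += 1
            if paths & seen:
                targeted += 1
    return targeted / actions if actions else 0.0


trajectory = [
    "ls src/",
    "cat src/app.py",
    "sed -i s/foo/bar/ src/app.py",   # edits a file that was inspected first
    "python run.py",                  # acts on a file that was never inspected
    "cat out.log",
]
print(targeted_observation_ratio(trajectory))  # 0.5
```

In this toy trajectory one of the two actions is preceded by an observation of the same path, so the ratio is 0.5. A real path-alignment heuristic would likely need to handle directories, globs, and relative paths, which is exactly where the precision question raised later in this review comes in.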
Methodological Rigor
The experimental design is commendably careful in several respects:
1. Controlled comparisons: The authors use matched task sets (8.1k instances solved by all four teachers), identical harness (Terminus-2), identical student architectures, and identical training hyperparameters. This isolates trajectory quality from confounds like task difficulty and scaffold design.
2. Systematic ablations: The paper rules out simpler explanations (trajectory length, error recovery) through well-designed experiments. The comparison of longest vs. shortest successful rollouts (Table 2) and error-free trajectory filtering are particularly convincing.
3. Masking experiments: The observation masking study (Tables 4 and 7) provides direct causal evidence that observation supervision contributes to learning, distinguishing between targeted and untargeted observations.
4. Failed trajectory analysis: Showing that failed DeepSeek-V3.2 trajectories can still produce competitive 32B students (Table 6) is a strong test of the EGS hypothesis.
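The masking experiments in item 3 can be pictured as a change to which tokens receive a training loss. The sketch below assumes token-level loss masking during supervised fine-tuning; the `Turn` structure and the `kind` labels are hypothetical, and the paper's actual masking procedure may differ.

```python
from dataclasses import dataclass


@dataclass
class Turn:
    role: str       # "assistant" or "environment"
    kind: str       # hypothetical label: "observe", "act", or "output"
    n_tokens: int   # token count of this turn


def loss_mask(turns, mask_observations):
    """Return a 0/1 mask per token; 1 means the token contributes to the SFT loss."""
    mask = []
    for turn in turns:
        supervised = turn.role == "assistant"
        # Optionally drop supervision on the agent's own observation commands.
        if mask_observations and turn.kind == "observe":
            supervised = False
        mask.extend([1 if supervised else 0] * turn.n_tokens)
    return mask


turns = [
    Turn("assistant", "observe", 3),    # e.g. a `cat` command
    Turn("environment", "output", 4),   # terminal output, never supervised
    Turn("assistant", "act", 2),        # e.g. an edit command
]
full = loss_mask(turns, mask_observations=False)
masked = loss_mask(turns, mask_observations=True)
print(sum(full), sum(masked))  # 5 2
```

Comparing students trained with the full mask against students trained with the observation-masked variant is the kind of contrast that isolates the contribution of observation supervision.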
However, several methodological concerns exist. The TOR metric, while intuitive, is defined via path-alignment heuristics whose precision isn't validated. The authors acknowledge this limitation: they note that TOR doesn't establish whether observations are necessary or sufficient for justifying actions. The study is also limited to four teacher models and two student scales (8B, 32B), making it difficult to assess how general the findings are. Averaging over three trials is reasonable, but it gives limited power to establish the statistical significance of the observed differences.
The Terminal-Lego pipeline itself is well-engineered, with a 41.8% pass rate through Docker round-trip verification from 36.8k candidates, yielding 15.4k verified tasks. The cascaded generation approach with test review loops is a sensible design for quality control.
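As a quick consistency check on the pipeline figures quoted above, the short Python snippet below multiplies the rounded candidate count by the rounded pass rate. Both inputs are the rounded values from this review, so the result is only approximate.

```python
candidates = 36_800    # "36.8k candidates" (rounded)
pass_rate = 0.418      # "41.8% pass rate" (rounded)

verified = candidates * pass_rate
print(f"{verified / 1000:.1f}k verified tasks")  # 15.4k verified tasks
```

The product rounds to 15.4k, which agrees with the reported number of verified tasks.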
Potential Impact
The paper's implications are potentially significant for the agent post-training community:
Practical impact: The finding that trajectory *interaction quality* matters more than teacher *capability* has immediate implications for data curation strategies. Teams can potentially achieve better results by selecting teachers that produce EGS-rich trajectories rather than simply using the strongest available model. The 30× data efficiency claim, if robust, could democratize competitive agent training.
Conceptual impact: The distinction between "solving ability" and "teaching ability" is an important conceptual contribution that could influence how the community thinks about distillation more broadly. The concept of "Harness Engineering" — designing interaction structures rather than just optimizing outcomes — offers a productive reframing of the post-training problem.
Adjacent fields: The EGS framework could extend beyond terminal agents to other agentic settings (web agents, robotics) where environment-grounded reasoning is similarly important.
Timeliness & Relevance
This paper arrives at a critical moment. Code agents are rapidly becoming production tools (Cursor, Claude Code, Codex CLI), and there is intense competition around post-training data pipelines (TermiGen, TerminalTraj, CLI-Gym, Nemotron-Terminal). The question of what makes training data effective is a genuine bottleneck. The "stronger-is-better" assumption is indeed prevalent and largely untested. The paper directly addresses this gap.
The comparison against contemporary systems on Terminal-Bench 2.0 (including GPT-5-Mini, Grok 4, and Qwen 3 Coder 480B references) grounds the work in a current, competitive landscape.
Strengths
1. Clear, counterintuitive finding: The pedagogical paradox is well-documented and likely to generate discussion. The experimental controls are sufficient to make the finding credible.
2. Multiple convergent ablations: The paper doesn't rely on a single experiment but builds its case through complementary analyses (TOR correlation, masking studies, failed trajectories, prompt engineering).
3. Practical data efficiency: The 15.3k trajectory result achieving competitive performance is compelling and immediately actionable.
4. Detailed appendix: The observation-action pattern analysis (Appendix D) provides rich behavioral characterization. The training dynamics analysis (Appendix C) revealing the "imitation difficulty paradox" — hardest-to-imitate trajectories producing best students — adds an interesting optimization perspective.
5. Reproducibility infrastructure: The StackOverflow-based pipeline with Docker verification is well-documented and appears reproducible.
Limitations
1. Limited teacher/student diversity: Only four teachers and two student scales. The generality of the pedagogical paradox across different model families, sizes, and architectures remains uncertain.
2. TOR as a proxy: The metric captures one dimension of EGS but may miss important aspects (reasoning quality, plan coherence, adaptation strategies). Its predictive validity is demonstrated on limited comparisons.
3. Scaffold dependency: All experiments use Terminus-2. Whether the findings transfer to other scaffolds (OpenHands, custom implementations) is untested, despite the paper's claim about general "Harness Engineering" principles.
4. Causal claims: While the masking experiments provide some causal evidence, the overall narrative still relies substantially on correlation between TOR and student performance. The observation-prompting experiment (Section 4.3.5) strengthens causality but uses only one teacher.
5. Statistical robustness: Performance differences (e.g., 20.6% vs. 19.5% in Table 1) are sometimes small, and three-trial averaging provides limited statistical power.
6. Task-generation confound: The pipeline uses Claude Opus 4.6 for task generation, which creates a potential confound. The teacher model that performs worst at teaching may nonetheless be advantaged as a solver by familiarity with task formats it helped generate.
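The statistical-robustness concern in item 5 can be illustrated with a small Python sketch of a 95% confidence interval from three trials. The scores below are placeholder values chosen for illustration, not results from the paper; the only fixed constant is the two-sided Student-t critical value for 2 degrees of freedom.

```python
import math
import statistics

# Two-sided 95% Student-t critical value for n - 1 = 2 degrees of freedom.
T_CRIT_DF2_95 = 4.303


def mean_and_half_width(scores):
    """Mean and 95% CI half-width for exactly three trial scores."""
    if len(scores) != 3:
        raise ValueError("this sketch hard-codes the t value for three trials")
    mean = statistics.mean(scores)
    sem = statistics.stdev(scores) / math.sqrt(len(scores))  # standard error of the mean
    return mean, T_CRIT_DF2_95 * sem


# Placeholder scores (percent), NOT taken from the paper.
placeholder_scores = [20.0, 21.0, 22.0]
mean, half_width = mean_and_half_width(placeholder_scores)
print(f"{mean:.1f} +/- {half_width:.2f}")
```

With a trial-to-trial standard deviation of one point, as in these placeholder scores, the interval spans roughly 2.5 points on either side of the mean. Under that assumption, a gap such as 20.6% vs. 19.5% would fall well inside the noise, which is why more trials or reported variances would help.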
Overall Assessment
This is a well-executed empirical study that identifies an important and counterintuitive phenomenon in agent post-training. The Terminal-Lego pipeline and EGS framework are useful contributions. The data efficiency results are striking. While the generality of findings needs further validation, the paper opens a productive research direction around interaction quality as a first-class concern in agent training data curation.
Generated Jun 3, 2026
Comparison History (18)
Paper 2 likely has higher scientific impact because it resolves a long-standing open algorithmic question: strongly-polynomial time complexity of policy iteration for a fundamental class of robust MDPs, extending classic results for MDPs. This is high-novelty, methodologically rigorous (theoretical guarantees), broadly relevant across optimization, control, RL, and game theory, and provides durable foundational value. Paper 1 is timely and practically relevant for agent training, but its impact may be more contingent on benchmark/harness design choices and could be superseded quickly as agent post-training practices evolve.
Paper 1 likely has higher impact due to its empirically grounded, immediately actionable contribution to improving agent training: a scalable task pipeline (Terminal-Lego), a counterintuitive and testable finding (“pedagogical paradox”), and a practical mechanism (environment-grounded supervision / harness engineering) with strong data-efficiency gains and clear relevance to current agent post-training. Paper 2 is novel and rigorous in formal methods and could influence standards, but its impact depends more on adoption of the proposed MCP+ and is narrower unless widely integrated into real systems.
Paper 1 addresses a fundamental and timely question in agent training—what makes training data effective—revealing a counterintuitive 'pedagogical paradox' with significant implications for the rapidly growing field of LLM-based code agents. Its contributions (Terminal-Lego pipeline, harness engineering concept, exceptional data efficiency findings) have broad impact across agent post-training research. Paper 2, while solid, offers an incremental improvement to LLM-KG integration for question answering, a more established and narrower problem space. Paper 1's insights about training data quality over teacher strength and environment-grounded supervision are more likely to reshape research practices broadly.
Paper 2 addresses a broader and more timely question in AI: what makes training data effective for code/terminal agents. Its 'pedagogical paradox' finding—that weaker agents can be better teachers—is counterintuitive and has wide implications for agent post-training paradigms. The concept of 'Harness Engineering' and Environment-Grounded Supervision could influence how the entire community approaches agent training. Paper 1, while methodologically sound and practically useful, addresses a narrower problem (LLM-guided heuristic synthesis for PDDL planning) with more incremental contributions to program synthesis methodology.
Paper 2 challenges a fundamental assumption in LLM post-training by demonstrating that trajectory quality (environment-grounded interactions) outweighs the teacher model's raw performance. By achieving comparable SOTA results with 30x less data, it introduces a highly impactful paradigm shift toward 'Harness Engineering,' affecting how future agentic models will be trained across the entire field. Paper 1 offers a valuable safety mechanism for a specific protocol, but Paper 2's insights into data efficiency and training methodologies have broader, foundational implications.
Paper 1 addresses a fundamental and broadly applicable gap in LLM deployment: whether compression preserves uncertainty calibration. It introduces a rigorous, distribution-free framework (conformal prediction) applicable across the entire LLM compression community, with clear safety implications. Its findings—that accuracy and uncertainty can decouple under compression—challenge standard evaluation practices and could reshape compression benchmarking standards. Paper 2, while insightful about agent training dynamics, addresses a narrower subfield (code agents/post-training) with findings more specific to current agent architectures and less likely to influence broad methodological standards across ML.
Paper 2 presents a concrete, empirically validated finding—the 'pedagogical paradox' where weaker agents can be better teachers—with actionable methodology (Environment-Grounded Supervision, Terminal-Lego pipeline) and strong quantitative results (matching SOTA with 30x less data). This offers immediate practical impact for agent training. Paper 1 is a position/perspective paper proposing 'Model Science' as a discipline, which, while thoughtful, lacks empirical contributions and proposes organizational frameworks rather than novel scientific findings. Paper 2's counterintuitive results and concrete methodology are more likely to influence near-term research directions.
Paper 2 tackles a critical, immediate bottleneck in AI: the exorbitant inference cost of Large Reasoning Models. By identifying process-level failures in extreme 2-bit quantization and offering lightweight, highly effective solutions (boosting an 8B model's accuracy from 17.2% to 74.2%), it provides massive real-world utility. While Paper 1 offers valuable insights into agent training and data efficiency, Paper 2's methodology directly democratizes and accelerates the deployment of state-of-the-art reasoning models on commodity hardware, guaranteeing broader, more immediate scientific and industrial impact.
While Paper 1 offers valuable insights into data curation and agent training, Paper 2 proposes a fundamental architectural innovation for foundation models. By fusing State Space Models with Attention at the score level (SISA), Paper 2 addresses a core bottleneck in hybrid language modeling. Foundational architectural improvements generally yield broader downstream impact across all AI applications, making Paper 2's contribution more universally significant.
Paper 1 likely has higher impact due to broader applicability and timeliness: it targets general agent post-training for real-world terminal tasks, proposes a scalable pipeline (Terminal-Lego), and identifies a novel “pedagogical paradox” explained by environment-grounded supervision and harness-visible interaction structure. The reported data-efficiency gains and emphasis on “harness engineering” could influence many agent-training setups beyond coding (tool use, robotics-like environments, web agents). Paper 2 is methodologically solid and valuable for formal methods, but its impact is narrower to Lean proof engineering and less cross-domain.
Paper 1 challenges a fundamental assumption in LLM distillation by revealing a 'pedagogical paradox' where lower-scoring agents can be superior teachers if they expose better reasoning behaviors. Its insights into Environment-Grounded Supervision and massive data efficiency improvements (matching SOTA with 30x less data) offer paradigm-shifting methodologies for training agentic AI. While Paper 2 provides valuable infrastructure and datasets for evaluation, Paper 1's discoveries directly address the critical bottleneck of scaling and post-training agentic models, giving it a higher potential for broad, immediate impact across the AI field.
Paper 1 addresses a fundamental question in AI agent training with broad implications across the field. The 'pedagogical paradox' finding—that stronger agents aren't necessarily better teachers—is counterintuitive and practically important. The concept of 'Harness Engineering' and Environment-Grounded Supervision introduces a new paradigm for agent post-training with demonstrated 30x data efficiency gains. Its impact spans all domains using agentic AI. Paper 2, while rigorous and valuable for chemistry AI evaluation, is more domain-specific and primarily a benchmark contribution with narrower cross-field applicability.
Paper 1 addresses a critical and highly timely issue in Large Reasoning Models (over-thinking in RLVR-trained CoTs). By reducing token usage by 56% without sacrificing accuracy, it offers massive implications for inference efficiency and computational cost reduction. While Paper 2 provides valuable insights into agent training dynamics, Paper 1's direct solution to a major bottleneck in state-of-the-art reasoning models promises broader and more immediate real-world and scientific impact across the AI community.
Paper 1 addresses a highly timely and critical problem in foundation models and agentic AI: post-training agents using synthetic interaction trajectories. Its findings on the 'pedagogical paradox' and exceptional data efficiency (30x reduction) challenge existing assumptions and offer broad implications for agent training and scaling. Paper 2, while presenting a solid neurosymbolic approach for VQA, operates in a more specialized niche (Answer-Set Programming) and relies on existing LLM capabilities, likely resulting in a narrower overall scientific impact.
Paper 2 has higher likely impact: it introduces a concrete, scalable methodology (Terminal-Lego) with measurable, reproducible gains and a testable mechanism (Environment-Grounded Supervision) explaining a counterintuitive “pedagogical paradox.” Its results are timely for post-training agents, offer immediate real-world applicability (data-efficient training, harness engineering), and can generalize across domains where interaction traces and verification exist. Paper 1 is conceptually novel and broadly relevant to AI alignment/institutions, but is more programmatic and less empirically grounded, reducing near-term methodological rigor and actionable uptake.
Paper 1 addresses a highly timely and critical bottleneck in AI: the post-training of autonomous agents. Its discovery of the 'pedagogical paradox'—that stronger agents aren't necessarily better teachers—and the introduction of Environment-Grounded Supervision offer immediate, highly practical advancements for generating synthetic data. This highly efficient approach to agentic intelligence is likely to have a broader and faster real-world impact across the AI community compared to the theoretical, bio-inspired sequence modeling improvements presented in Paper 2.
Paper 1 is more novel and broadly impactful: it challenges a common assumption (best teacher = best performer) and introduces Environment-Grounded Supervision plus “harness engineering” as a general post-training paradigm for agentic behavior. The Terminal-Lego pipeline and strong data-efficiency claims suggest wide applicability across code/terminal agents, benchmarking, and RLHF-style training, with implications for reproducible agent learning. Paper 2 is timely with clear real-world relevance (misinformation detection) and provides a dataset, but its core idea (conflict/inconsistency cues) is more domain-specific and likely to have narrower cross-field methodological impact.
Paper 2 addresses a critical and highly timely problem in AI: post-training LLM agents using synthetic trajectories. Its discovery of the 'pedagogical paradox' and introduction of 'Environment-Grounded Supervision' offer fundamental insights into model distillation and agentic training. The demonstration of matching SOTA performance with 1/30th the data volume will likely drive widespread adoption in the rapidly growing generative AI field. Paper 1 offers a valuable, practical application for urban planning, but its methodological innovation (genetic algorithms) and breadth of impact are narrower compared to the frontier AI research in Paper 2.