Orchard: An Open-Source Agentic Modeling Framework

Baolin Peng, Wenlin Yao, Qianhui Wu, Hao Cheng, Xiao Yu, Rui Yang, Tao Ge, Alessandro Sordoni

May 14, 2026

arXiv:2605.15040v1

cs.AI (primary), cs.CL

#108 of 2292 · Artificial Intelligence

Tournament Score: 1540 ± 46 (scale 1050 to 1800)

Win Rate: 86%

Rating: 7.8 / 10

Significance 8 · Rigor 7.5 · Novelty 6.5 · Clarity 6.8

Abstract

Agentic modeling aims to transform LLMs into autonomous agents capable of solving complex tasks through planning, reasoning, tool use, and multi-turn interaction with environments. Despite major investment, open research remains constrained by infrastructure and training gaps. Many high-performing systems rely on proprietary codebases, models, or services, while most open-source frameworks focus on orchestration and evaluation rather than scalable agent training. We present Orchard, an open-source framework for scalable agentic modeling. At its core is Orchard Env, a lightweight environment service providing reusable primitives for sandbox lifecycle management across task domains, agent harnesses, and pipeline stages. On top of Orchard Env, we build three agentic modeling recipes. Orchard-SWE targets coding agents. We distill 107K trajectories from MiniMax-M2.5 and Qwen3.5-397B, introduce credit-assignment SFT to learn from productive segments of unresolved trajectories, and apply Balanced Adaptive Rollout for RL. Starting from Qwen3-30B-A3B-Thinking, Orchard-SWE achieves 64.3% on SWE-bench Verified after SFT and 67.5% after SFT+RL, setting a new state of the art among open-source models of comparable size. Orchard-GUI trains a 4B vision-language computer-use agent using only 0.4K distilled trajectories and 2.2K open-ended tasks. It achieves 74.1%, 67.0%, and 64.0% success rates on WebVoyager, Online-Mind2Web, and DeepShop, respectively, making it the strongest open-source model while remaining competitive with proprietary systems. Orchard-Claw targets personal assistant agents. Trained with only 0.2K synthetic tasks, it achieves 59.6% pass@3 on Claw-Eval and 73.9% when paired with a stronger ZeroClaw harness. Collectively, these results show that a lightweight, open, harness-agnostic environment layer enables reusable agentic data, training recipes, and evaluations across domains.

AI Impact Assessments

(1 model)

Scientific Impact Assessment: Orchard — An Open-Source Agentic Modeling Framework

1. Core Contribution

Orchard addresses a genuine structural bottleneck in agentic AI research: the tight coupling between environment infrastructure, agent harnesses, and training pipelines. The paper's central thesis—that the environment layer should be a thin, standalone, harness-agnostic service—is architecturally clean and well-motivated. The framework comprises:

Orchard Env: A Kubernetes-native sandbox lifecycle service with agent injection, direct Pod-IP routing, and network isolation.

Three domain-specific training recipes: Orchard-SWE (software engineering), Orchard-GUI (browser navigation), and Orchard-Claw (personal assistant), each demonstrating the reusability of the shared environment layer.
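As a rough illustration of what "harness-agnostic" can mean in practice, the sketch below has two toy harnesses share one set of sandbox lifecycle primitives. Everything here is hypothetical: `FakeSandboxService`, `create`, `exec`, `delete`, and both harness functions are invented for illustration and are not Orchard Env's actual API. The service is an in-memory stand-in, not a Kubernetes client.

```python
import itertools
from contextlib import contextmanager
from dataclasses import dataclass, field
from typing import Dict, Iterator, List


@dataclass
class FakeSandboxService:
    """In-memory stand-in for an environment service (hypothetical API)."""
    _ids: Iterator[int] = field(default_factory=itertools.count)
    live: Dict[str, List[str]] = field(default_factory=dict)

    def create(self, image: str) -> str:
        sandbox_id = f"sbx-{next(self._ids)}"
        self.live[sandbox_id] = []  # per-sandbox command log
        return sandbox_id

    def exec(self, sandbox_id: str, command: str) -> str:
        self.live[sandbox_id].append(command)
        return f"ran: {command}"

    def delete(self, sandbox_id: str) -> None:
        self.live.pop(sandbox_id, None)


@contextmanager
def sandbox(service: FakeSandboxService, image: str):
    """Create a sandbox and guarantee it is deleted afterwards."""
    sandbox_id = service.create(image)
    try:
        yield sandbox_id
    finally:
        service.delete(sandbox_id)


def harness_a(service: FakeSandboxService, sandbox_id: str) -> str:
    # A harness that issues a single command per step (assumed behaviour).
    return service.exec(sandbox_id, "pytest -q")


def harness_b(service: FakeSandboxService, sandbox_id: str) -> List[str]:
    # A harness that issues several commands per step (assumed behaviour).
    return [service.exec(sandbox_id, c) for c in ("ls", "git status")]


service = FakeSandboxService()
with sandbox(service, "python:3.11") as sid:
    out_a = harness_a(service, sid)
    out_b = harness_b(service, sid)
```

The point of the design is that neither harness knows how the sandbox is provisioned. Swapping the backend would only touch the service class.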

Two methodological contributions stand out: credit-assignment SFT, which extracts productive segments from failed trajectories via retrospective value estimation, and Balanced Adaptive Rollout (BAR), a progressive group-assembly algorithm for sparse-reward RL that addresses the wasted-compute and group-imbalance problems inherent in fixed-N GRPO rollouts.
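To make the wasted-compute and group-imbalance problems concrete, here is a minimal sketch under stated assumptions. `group_advantages` uses the standard group-normalized form associated with GRPO, where a group of identical rewards yields all-zero advantages and hence no learning signal. `assemble_group` is a hypothetical progressive assembly loop written for this note; the function name, the quotas, and the stopping rule are assumptions and should not be read as the paper's BAR algorithm.

```python
import random
from statistics import mean, pstdev
from typing import Callable, List, Tuple


def group_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Group-normalized advantages: (r - mean) / (std + eps)."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def assemble_group(
    sample_reward: Callable[[], float],
    target_pos: int = 2,
    target_neg: int = 2,
    max_rollouts: int = 16,
) -> Tuple[List[float], int]:
    """Hypothetical progressive assembly (an assumption, not the paper's BAR).

    Samples one rollout at a time and stops as soon as the group holds
    the requested number of successes and failures, or the budget is spent.
    """
    pos: List[float] = []
    neg: List[float] = []
    used = 0
    while used < max_rollouts and (len(pos) < target_pos or len(neg) < target_neg):
        reward = sample_reward()
        used += 1
        if reward > 0 and len(pos) < target_pos:
            pos.append(reward)
        elif reward <= 0 and len(neg) < target_neg:
            neg.append(reward)
        # Rollouts beyond a filled quota are dropped in this toy version.
    return pos + neg, used


# A fixed-N group where every rollout fails carries no gradient signal.
print(group_advantages([0.0, 0.0, 0.0, 0.0]))

# Progressive assembly on a sparse-reward task (20% success, toy sampler).
rng = random.Random(0)
group, used = assemble_group(lambda: 1.0 if rng.random() < 0.2 else 0.0)
print(group, used)
```

A difficulty filter alone would discard the all-fail group after paying for it, whereas a progressive scheme can decide per prompt how many rollouts to spend.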

2. Methodological Rigor

The paper is methodologically thorough across multiple dimensions:

Infrastructure evaluation is convincing: 0.28s execution latency matching native Docker, 100% success at 1,000 concurrent sandboxes, and functional equivalence verified on Terminal-Bench 2.0 across three models. The cost analysis (Table 2) is detailed with clear methodology, though the spot-instance comparison somewhat favors Orchard since managed services cannot offer equivalent pricing structures.

SWE results are strong. The 67.5% on SWE-bench Verified with ~3B active parameters is impressive, and the ablation suite is among the most comprehensive in the SWE-agent literature. The cross-harness generalization study (Table 10) is particularly valuable—revealing that single-harness training produces severe format lock-in—and the controlled comparison against Scale-SWE and OpenSWE-32B (Table 8) on unseen harnesses provides compelling evidence for multi-harness training.

GUI results demonstrate remarkable data efficiency: 68.4% average across three benchmarks with only 2.6K training tasks on a 4B model. The finding that the 4B student surpasses its 235B teacher on Online-Mind2Web and DeepShop is noteworthy and supports the claim that environment-grounded RL can extract capabilities beyond teacher distillation.
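The two aggregate figures in this paragraph can be checked against the abstract. The snippet assumes the 68.4% is an unweighted mean of the three reported success rates, and that the 2.6K figure is the sum of the 0.4K distilled trajectories and the 2.2K open-ended tasks.

```python
# Success rates reported in the abstract (percent).
rates = {"WebVoyager": 74.1, "Online-Mind2Web": 67.0, "DeepShop": 64.0}

# Unweighted mean across the three benchmarks (assumed aggregation).
average = sum(rates.values()) / len(rates)

# Training set size in thousands: distilled trajectories + open-ended tasks.
training_tasks_k = 0.4 + 2.2

print(round(average, 1), round(training_tasks_k, 1))  # prints "68.4 2.6"
```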

Claw results are less mature (only 0.2K synthetic tasks), but the cross-harness transfer finding—that models trained end-to-end with both harnesses better exploit advanced harnesses at inference—is a useful insight.

However, several methodological concerns warrant mention:

The credit-assignment SFT gain (+1.9 points in the controlled ablation) is modest relative to the complexity of the retrospective value estimation pipeline.

BAR's contribution is not independently ablated against simpler baselines (e.g., difficulty filtering alone).

RL initialized from a heavily SFT-trained checkpoint shows a slight OOD regression on SWE-bench Multilingual, which raises questions about the robustness of the full recipe.

Evaluation on SWE-bench Verified, while standard, is a single 500-instance benchmark prone to leaderboard saturation effects.

3. Potential Impact

Infrastructure impact: Orchard Env fills a genuine gap. The open-source, self-hosted alternative to managed sandbox services (E2B, Daytona) at ~10× lower cost with spot instances could meaningfully democratize agentic training for academic labs. The agent-injection mechanism supporting heterogeneous Docker images without per-image modification is a practical engineering contribution.

Methodological impact: The cross-harness generalization findings (Tables 8, 10, 15) could shift community practices toward multi-harness data collection. The credit-assignment SFT approach—learning from failed trajectories—opens a direction for better data utilization, though the current gains are incremental.
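One way to picture learning from failed trajectories is a loss mask that keeps only the steps judged productive. The sketch below assumes per-step value estimates are already available and treats the productive segment as the prefix up to the peak of that value curve. Both choices are assumptions made for illustration; `productive_mask` and `min_gain` are hypothetical names, and the paper's actual segment-selection rule may differ.

```python
from typing import List, Sequence


def productive_mask(values: Sequence[float], min_gain: float = 0.0) -> List[bool]:
    """Mark steps up to the peak of an assumed per-step value curve.

    Returns one boolean per step: True means the step would contribute
    to the SFT loss, False means it is masked out.
    """
    if not values:
        return []
    # Index of the first maximum of the value curve.
    peak = max(range(len(values)), key=lambda i: values[i])
    # If the trajectory never improved on its starting value, keep nothing.
    if values[peak] - values[0] <= min_gain:
        return [False] * len(values)
    return [i <= peak for i in range(len(values))]


# An unresolved trajectory whose estimated value rises, then falls.
values = [0.1, 0.3, 0.6, 0.4, 0.2]
print(productive_mask(values))  # [True, True, True, False, False]
```

If the value estimates come from the teacher model's own assessment, any miscalibration in those estimates moves the peak and therefore changes which steps are trained on.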

Community impact: Releasing 107K SWE trajectories, the GUI trajectory dataset, training recipes, and the environment service collectively lower the barrier to entry for agentic modeling research. The breadth across three domains demonstrates the framework's generality.

4. Timeliness & Relevance

This paper arrives at a critical inflection point. Agentic AI training is scaling rapidly, but most high-performing systems remain proprietary or tightly coupled to specific infrastructure stacks. The explicit identification of the environment layer as the reusability bottleneck is timely and actionable. The paper addresses the emerging need for reproducible, cost-effective agentic training infrastructure that the community has been lacking.

5. Strengths & Limitations

Key Strengths:

Exceptional breadth: three domains, multiple harnesses, comprehensive ablations—rare in a single paper

Strong empirical results, particularly Orchard-GUI's data efficiency and Orchard-SWE's cross-harness generalization

Clean architectural separation enabling genuine reusability

Detailed cost analysis and system benchmarks

Commitment to full open-source release (code, data, models)

Notable Limitations:

The paper is extremely long (~50 pages) and mixes infrastructure contribution with three separate training papers; each recipe individually would benefit from deeper analysis

Credit-assignment SFT's retrospective value estimation relies on the teacher model's self-assessment, introducing potential calibration issues; the 98.9% inverted-U claim needs independent validation

BAR lacks standalone ablation—its contribution over simpler rejection sampling or difficulty filtering is unclear

The Claw domain results are preliminary (0.2K tasks, no comparison to concurrent methods beyond basic baselines)

Generalization claims are partially undermined by the SFT+RL OOD regression finding

Kubernetes dependency may limit adoption for researchers without cloud infrastructure expertise

Summary

Orchard is a significant systems-and-methods contribution that combines a well-designed open infrastructure layer with competitive training recipes across three agentic domains. Its strongest contributions are the architectural insight about environment-layer decoupling, the cross-harness generalization analysis, and the demonstration that small models with targeted SFT+RL can match or exceed much larger systems. The individual methodological innovations (credit-assignment SFT, BAR) show promise but deliver incremental gains that need further validation. The paper's primary impact will likely be as enabling infrastructure and as a comprehensive reference for open-source agentic training practices.

Rating: 7.8 / 10

Significance 8 · Rigor 7.5 · Novelty 6.5 · Clarity 6.8

Generated May 15, 2026

Comparison History (24)

vs. Prospective multi-pathogen disease forecasting using autonomous LLM-guided tree search

gemini-3.1 · 5/18/2026

Paper 1 represents a highly novel, transformative application of AI to scientific discovery and epidemiology. By autonomously generating executable disease forecasting models that outperform human-curated CDC ensembles in a real-time, prospective evaluation, it solves a critical real-world labor bottleneck in public health. While Paper 2 provides an excellent open-source framework for AI agents with strong benchmark results, Paper 1 demonstrates a more profound cross-disciplinary scientific breakthrough with direct, large-scale societal implications.

vs. Imperfect World Models are Exploitable

gemini-3.1 · 5/18/2026

Paper 2 introduces an open-source framework and state-of-the-art models for LLM agents, addressing a critical bottleneck in empirical AI research. Its practical utility, scalable infrastructure, and strong benchmark results across multiple domains will likely drive widespread adoption and accelerate applied research, leading to a broader and faster scientific impact than the theoretical contributions of Paper 1.

vs. Can Large Language Models Reinvent Foundational Algorithms?

gemini-3.1 · 5/16/2026

Paper 2 presents a highly practical, open-source framework for training autonomous agents, achieving state-of-the-art results across multiple domains. Its provision of reusable infrastructure, datasets, and training recipes ensures immediate and widespread adoption by researchers and developers. While Paper 1 asks a profound theoretical question about LLM innovation, Paper 2 provides tools and methodologies that directly advance the rapidly growing field of agentic AI, resulting in higher potential for broad, real-world impact and citations.

vs. Learning to Build the Environment: Self-Evolving Reasoning RL via Verifiable Environment Synthesis

gpt-5.2 · 5/16/2026

Paper 1 is more novel scientifically: it reframes self-improvement in reasoning RL as verifiable environment synthesis with an explicit, testable condition (stable solve–verify asymmetry) and a concrete instantiation (EvoEnv) that targets a core failure mode of synthetic-data/self-play loops. If robust, this could generalize across tasks and influence how future RLVR/self-training systems are built. Paper 2 is highly useful and timely infrastructure with strong applied results, but its main contribution is engineering/integration and distillation+RL recipes, likely yielding more immediate tooling impact than a new learning paradigm.

vs. AIBuildAI: An AI Agent for Automatically Building AI Models

gpt-5.2 · 5/16/2026

Paper 2 (Orchard) likely has higher scientific impact due to broader, reusable infrastructure for agent training and evaluation across multiple domains (coding, GUI use, assistants), strong methodological contributions (environment layer, distillation scale, credit-assignment SFT, RL rollout strategy), and open-source release enabling rapid community adoption. Its results advance open agentic modeling capabilities and tooling, affecting many downstream research areas. Paper 1 is impactful for end-to-end AI model building, but is narrower in scope (AutoML-style workflows) and depends heavily on LLM-agent orchestration rather than generally reusable training infrastructure.

vs. What Do EEG Foundation Models Capture from Human Brain Signals?

gemini-3.1 · 5/16/2026

Paper 2 introduces an open-source framework for training autonomous LLM agents, a rapidly growing and highly influential field. By providing scalable training recipes and achieving state-of-the-art results on popular benchmarks like SWE-bench, it is likely to see widespread adoption and citation across the AI community. While Paper 1 provides valuable interpretability for EEG foundation models, its impact is largely confined to the niche intersection of neuroscience and clinical AI.

vs. Von Neumann Networks

gpt-5.2 · 5/16/2026

Paper 1 likely has higher near- to mid-term scientific impact due to strong methodological rigor (large-scale distillation + RL), clear reproducible infrastructure contribution (open-source, harness-agnostic environment layer), and demonstrated state-of-the-art results on widely used benchmarks (SWE-bench Verified, WebVoyager). Its applications (coding agents, GUI agents, assistants) are immediate and broadly relevant, aligning with current AI agent research trends. Paper 2 is more conceptually novel, but its evidence is preliminary (“basic tasks”), and its impact depends on validation and adoption of a new formalism/architecture.

vs. ClimAgent: LLM as Agents for Autonomous Open-ended Climate Science Analysis

gemini-3.1 · 5/16/2026

Paper 2 offers a general-purpose, open-source infrastructure for scalable agentic modeling, addressing a critical bottleneck in broader AI research. Its impact spans multiple high-interest domains (software engineering, GUI navigation, personal assistants) with state-of-the-art results. While Paper 1 is highly valuable for the vital domain of climate science, Paper 2 provides foundational tools, data, and training recipes that will likely be adopted by a much wider segment of the AI research community, leading to greater overall breadth of impact.

vs. Context-Value-Action Architecture for Value-Driven Large Language Model Agents

gemini-3.1 · 5/16/2026

Paper 1 addresses a critical infrastructure gap in open-source LLM agent training. By providing a comprehensive framework (Orchard) and achieving state-of-the-art results on highly competitive benchmarks like SWE-bench and WebVoyager, it offers immense practical utility. Its broad applicability across coding, GUI, and personal assistant domains guarantees widespread adoption and foundational impact in the AI community, whereas Paper 2 has a narrower, albeit important, focus on behavioral fidelity.

vs. RewardHarness: Self-Evolving Agentic Post-Training

claude-opus-4.6 · 5/16/2026

Orchard provides a comprehensive open-source framework for agentic modeling with state-of-the-art results across multiple domains (coding, GUI, personal assistants), addressing fundamental infrastructure gaps in the field. Its breadth of impact is significantly larger—it enables the entire open-source community to train agents at scale across diverse tasks. While RewardHarness presents an innovative data-efficient reward modeling approach for image editing, it addresses a narrower problem. Orchard's open-source nature, multiple training recipes, and strong empirical results across benchmarks position it for broader adoption and downstream research impact.

vs. Breaking Winner-Takes-All: Cooperative Policy Optimization Improves Diverse LLM Reasoning

gpt-5.2 · 5/16/2026

Paper 2 introduces a broadly applicable algorithmic change to RL-with-verifiers—shifting from competitive to cooperative, coverage-driven credit assignment—to address exploration collapse and improve both accuracy and diversity. This conceptual contribution is novel, method-centric, and likely transferable across many LLM reasoning and RL settings, giving it wider cross-field impact and timeliness. Paper 1 is impactful as open infrastructure plus strong empirical results, but its main advances are engineering/system integration and recipe tuning around specific agent domains; algorithmic novelty and generality appear lower than GCPO’s.

vs. Rethinking Evaluation for LLM Hallucination Detection: A Desiderata, A New RAG-based Benchmark, New Insights

gpt-5.2 · 5/16/2026

Paper 1 likely has higher impact due to a more novel, enabling contribution: an open-source, harness-agnostic environment layer plus scalable training recipes demonstrated across three major agent domains (SWE, GUI, assistant) with strong SOTA open-model results. This combination can directly accelerate reproducible agent training and deployment, with broad applicability across tasks and communities, and is highly timely given demand for open agentic infrastructure. Paper 2 is rigorous and valuable, but primarily advances evaluation/benchmarking within a narrower slice (hallucination detection in RAG), which may yield more incremental downstream change.

vs. Good to Go: The LOOP Skill Engine That Hits 99% Success and Slashes Token Usage by 99% via One-Shot Recording and Deterministic Replay

gpt-5.2 · 5/15/2026

Paper 1 likely has higher scientific impact due to broader novelty and field-wide relevance: it contributes an open, reusable environment layer plus scalable training recipes spanning coding, GUI, and assistant agents, and demonstrates strong, state-of-the-art open-model results with substantial datasets and RL/SFT methodology. Its infrastructure and training approach can generalize across many agent benchmarks and research labs. Paper 2 is practically valuable for production periodic tasks (deterministic replay, big cost cuts) but is narrower in scope and more engineering-oriented, with impact concentrated in workflow automation rather than advancing general agentic learning.

vs. How Sensitive Are Radiomic AI Models to Acquisition Parameters?

gemini-3.1 · 5/15/2026

Paper 2 demonstrates higher potential scientific impact due to its broad applicability and extreme timeliness in the rapidly growing field of LLM agents. While Paper 1 provides a rigorous framework for clinical radiomics, its impact is largely confined to medical imaging. In contrast, Paper 2 introduces an open-source framework (Orchard) that democratizes agentic modeling, a domain heavily dominated by proprietary systems. By providing reusable environments and setting new state-of-the-art benchmarks across coding, GUI, and personal assistant tasks, Paper 2 is poised to become a foundational tool with widespread adoption across multiple AI sub-disciplines.

vs. Cattle Trade: A Multi-Agent Benchmark for LLM Bluffing, Bidding, and Bargaining

claude-opus-4.6 · 5/15/2026

Orchard presents a comprehensive open-source framework addressing a critical infrastructure gap in agentic AI training, achieving state-of-the-art results across multiple domains (coding, GUI, personal assistants). Its practical contributions—reusable environment primitives, scalable training recipes, and open-source models—have broad impact potential for the rapidly growing agent research community. Paper 2, while a well-designed benchmark for multi-agent strategic reasoning, has narrower scope as one of many LLM evaluation benchmarks, with findings that primarily confirm known LLM limitations rather than enabling new capabilities.

vs. TeachAnything: A Multimodal Crowdsourcing Platform for Training Embodied AI Agents in Symmetrical Reality

claude-opus-4.6 · 5/15/2026

Orchard presents a comprehensive open-source framework with concrete state-of-the-art results across multiple benchmarks (SWE-bench, WebVoyager, etc.), novel training techniques (credit-assignment SFT, balanced adaptive rollout RL), and addresses a critical gap in open-source agentic AI infrastructure. Its immediate practical utility, reproducibility, strong empirical results, and breadth across coding, GUI, and personal assistant domains give it substantially higher near-term scientific impact compared to TeachAnything, which presents a platform concept for embodied AI data collection without comparable empirical validation.

vs. π-Bench: Evaluating Proactive Personal Assistant Agents in Long-Horizon Workflows

gemini-3.1 · 5/15/2026

Paper 2 presents a comprehensive, open-source framework for scalable agent training across multiple domains (coding, GUI, personal assistant). By addressing the critical infrastructure gap in agentic modeling and achieving state-of-the-art results on major benchmarks (e.g., SWE-bench Verified), it has tremendous potential for broad adoption. While Paper 1 introduces a valuable benchmark for proactive agents, Paper 2's provision of training recipes, environments, and significant performance improvements across diverse tasks will likely drive much wider scientific and practical impact in the AI community.

vs. From Table to Cell: Attention for Better Reasoning with TABALIGN

gpt-5.2 · 5/15/2026

Paper 1 likely has higher scientific impact due to its broad, reusable infrastructure contribution (open-source environment layer + scalable training recipes) and demonstrated state-of-the-art results across multiple major agent domains (coding, GUI, personal assistants). Its real-world applicability is immediate (agent training, evaluation, deployment) and can influence many downstream projects and fields that rely on agentic systems. Paper 2 is more novel methodologically for table reasoning with DLM-based planning and attention contracts, but its scope is narrower (structured table tasks) and depends on diffusion LM tooling that is less standard in current LLM pipelines.

vs. Agentifying Patient Dynamics within LLMs through Interacting with Clinical World Model

gemini-3.1 · 5/15/2026

Paper 2 introduces an open-source framework for scalable agent training across diverse domains, addressing a critical infrastructure bottleneck in AI research. While Paper 1 presents an innovative approach with high clinical utility, Paper 2's broad applicability to coding, GUI, and personal assistants ensures widespread adoption and foundational impact across the broader AI community.

vs. A Heterogeneous Temporal Memory Governance Framework for Long-Term LLM Persona Consistency

gemini-3.1 · 5/15/2026

Paper 2 presents an open-source framework for scalable agentic modeling, directly addressing a critical infrastructure gap in AI research. By providing reusable primitives and achieving state-of-the-art results across diverse domains (software engineering, GUI navigation, personal assistants), it offers massive utility to the broader AI community. While Paper 1 tackles an important issue (long-term memory), its RAG-based governance approach is narrower in scope. Paper 2's comprehensive methodology, open-source nature, and strong empirical results on major benchmarks like SWE-bench ensure significantly broader adoption, real-world applicability, and overall scientific impact.