From Model Scaling to System Scaling: Scaling the Harness in Agentic AI

Shangding Gu

May 25, 2026

arXiv:2605.26112v1

cs.AI (primary), cs.LG

#994 of 2682 · Artificial Intelligence

Tournament Score

1438 ± 42

Win Rate: 62%

Rating: 3.5 / 10

Significance: 5 · Rigor: 2 · Novelty: 3.5 · Clarity: 6.5

Abstract

This paper argues that the next major bottleneck in agentic AI is system scaling, not only model scaling: the design of auditable, persistent, modular, and verifiable architectures around foundation models. We refer to this shift as scaling the harness: treating the structured execution layer around a foundation model as a first-class object of design, evaluation, and optimization. Although recent large language models enable agents to use tools, retrieve information, maintain memory, and execute long-horizon workflows, evaluation remains largely model-centric, often reducing agents to final-task success while treating memory, retrieval, tool use, orchestration, verification, and governance as secondary implementation details. This framing is increasingly inadequate because agent performance emerges from the interaction among the foundation model, memory substrate, context constructor, skill-routing layer, orchestration loop, and verification-and-governance layer. Together, these components form the agent harness, which translates model capability into long-horizon agent behavior. We study scaling the harness through three core bottlenecks: context governance, trustworthy memory, and dynamic skill routing, together with the orchestration and governance mechanisms that coordinate and constrain them. We further outline a research agenda for harness-level benchmarks that go beyond one-shot task success to measure trajectory quality, memory hygiene, context efficiency, communication fidelity, verification cost, and safe evolution over time. To make the discussion concrete, we develop CheetahClaws (https://github.com/SafeRL-Lab/cheetahclaws), a Python-native reference harness, and compare it with Claude Code and OpenClaw. Our main claim is that future progress in agentic AI will depend as much on system design as on stronger foundation models.

AI Impact Assessments

(1 model)

Scientific Impact Assessment

Core Contribution

This paper argues that the next bottleneck in agentic AI is system scaling rather than model scaling alone. The central concept is "scaling the harness": treating the structured execution layer surrounding a foundation model (memory, context construction, skill routing, orchestration, verification/governance) as a first-class object of design and evaluation. The paper introduces a six-component decomposition of agentic systems, P_H = Φ(R, M, C, S, O, G), identifies three core bottlenecks (context governance, trustworthy memory, dynamic skill routing), proposes a research agenda for harness-level benchmarks, and releases CheetahClaws as a reference implementation.
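To make the six-component decomposition easier to picture, here is a minimal Python sketch of a harness as a composition of pluggable parts. It is an illustration only, not CheetahClaws code: the class `Harness`, its field names, and the helper `toy_harness` are hypothetical, and the mapping of letters to components (R as the foundation model, then memory, context, skill routing, orchestration, governance) is assumed from the order in which the abstract lists them.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Harness:
    """Hypothetical harness: each field stands in for one component."""
    model: Callable[[str], str]                      # R (assumed): foundation model
    build_context: Callable[[str, List[str]], str]   # C: context constructor
    route_skill: Callable[[str], str]                # S: picks a skill by name
    skills: Dict[str, Callable[[str], str]]          # skills available to the router
    verify: Callable[[str], bool]                    # G: verification/governance check
    memory: List[str] = field(default_factory=list)  # M: memory substrate
    max_steps: int = 3                               # O: orchestration loop budget

    def run(self, task: str) -> str:
        # O: the orchestration loop ties the other components together.
        for _ in range(self.max_steps):
            context = self.build_context(task, self.memory)
            draft = self.model(context)
            skill = self.skills[self.route_skill(draft)]
            result = skill(draft)
            self.memory.append(result)
            if self.verify(result):
                return result
        return "unverified"


def toy_harness() -> Harness:
    """Build a harness from trivial stand-ins so the loop can be exercised."""
    return Harness(
        model=lambda ctx: ctx.upper(),
        build_context=lambda task, mem: task + " | " + ";".join(mem[-2:]),
        route_skill=lambda draft: "echo",
        skills={"echo": lambda draft: draft.strip()},
        verify=lambda result: "TASK" in result,
    )
```

Swapping any single callable while holding the others fixed is the kind of controlled change a harness-level evaluation would need, which is why the components are passed in rather than hard-coded.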

The fundamental thesis — that agent performance is a function of the whole system, not just the model — is reasonable and increasingly recognized in practice. However, this insight is not particularly new. The SWE-agent paper (cited here) already demonstrated that agent-computer interface design matters more than model choice for certain benchmarks. MemGPT, Voyager, and numerous agent frameworks have all implicitly or explicitly acknowledged system-level design as important.

Methodological Rigor

This is the paper's most significant weakness. Despite framing itself around a concrete reference harness (CheetahClaws) and promising comparison with Claude Code and OpenClaw, the paper contains no empirical evaluation whatsoever. There are no experiments, no quantitative comparisons, no ablation studies, and no benchmark results. The "comparison" between CheetahClaws, Claude Code, and OpenClaw is limited to a single qualitative table (Table 1) listing high-level design patterns. The decomposition P_H = Φ(R, M, C, S, O, G) is explicitly acknowledged as conceptual rather than quantitative: Φ has "no closed form" and the factors are "not strictly orthogonal."

The paper proposes evaluation dimensions (Table 4) but does not implement or validate any of them. The sub-factorizations of M and C (Equations 2-3) name desirable properties but provide no operationalization, metrics, or measurement methodology. The "framework" is essentially a taxonomy with suggestive notation rather than a testable theoretical contribution.

Potential Impact

The paper identifies real and important problems. The shift toward viewing agent systems holistically is genuinely needed, and several specific insights are valuable:

Context governance as distinct from context capacity is a useful framing

Stale-but-confident as a failure mode for memory systems names a real and underappreciated problem

Confident-but-unchecked for skill routing similarly identifies an important reliability gap

The proposed benchmark dimensions (Table 4) — memory hygiene, communication fidelity, long-session drift, verification-aware recovery — are genuinely underexplored
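As one way to picture how the "stale-but-confident" failure mode could be turned into something measurable, the sketch below flags memory records that are past a freshness window yet still carry high confidence, and reports the unflagged fraction as a hygiene score. Everything here is an assumption made for illustration: the `MemoryRecord` fields, the `ttl` freshness window, the 0.8 threshold, and this particular definition of "memory hygiene" do not come from the paper.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class MemoryRecord:
    content: str
    confidence: float   # confidence in [0, 1] recorded when the entry was written
    written_at: float   # write time, in seconds on any consistent clock
    ttl: float          # hypothetical freshness window, in seconds


def is_stale_but_confident(rec: MemoryRecord, now: float,
                           threshold: float = 0.8) -> bool:
    """True when the record is older than its window but still highly trusted."""
    age = now - rec.written_at
    return age > rec.ttl and rec.confidence >= threshold


def memory_hygiene(records: List[MemoryRecord], now: float) -> float:
    """Fraction of records that are NOT stale-but-confident (1.0 if empty)."""
    if not records:
        return 1.0
    flagged = sum(is_stale_but_confident(r, now) for r in records)
    return 1.0 - flagged / len(records)
```

A real harness would also need a policy for what to do with flagged records (re-verify, down-weight, or evict); the sketch only covers detection.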

However, the impact is limited because this is a position paper without empirical grounding. The taxonomy, while sensible, is unlikely to be adopted without demonstrated utility. The released CheetahClaws harness could have impact if it enables reproducible research, but the paper provides insufficient detail about its architecture or capabilities to assess this.

Timeliness & Relevance

The paper is well-timed. Agentic AI systems are rapidly proliferating in production (Claude Code, Cursor, Devin, Codex CLI), and the gap between model benchmarks and real-world agent reliability is widely felt. The observation that "what is often reported as a model score is in fact a model-plus-harness score" resonates with practitioners. The call for longitudinal evaluation and agent evolution standards addresses a genuine gap as agents become more persistent.

However, similar arguments are being made contemporaneously by multiple groups. The "Code as Agent Harness" paper (ref [26], also 2026) addresses overlapping territory. Anthropic's own engineering blog posts on context engineering and multi-agent systems (cited extensively here) already articulate many of these ideas in more concrete terms.

Strengths

1. Clear problem identification: The three bottlenecks (context governance, trustworthy memory, dynamic skill routing) are well-chosen and well-articulated, with specific failure modes named for each.

2. Honest limitations section: The paper engages seriously with counterarguments (stronger models may solve system problems; end-to-end training may replace modularity).

3. Useful conceptual vocabulary: Terms like "stale-but-confident," "exposure without access," and "confident-but-unchecked" name real failure patterns in compact, communicable ways.

4. Comprehensive evaluation agenda: Table 4 provides a useful checklist for future benchmark designers.

5. Appropriate scope: The paper correctly identifies that system scaling and model scaling are complementary rather than competing.

Limitations

1. No empirical validation: This is the most critical gap. The paper makes claims about system design importance but provides no evidence beyond citing others' findings and qualitative system comparisons.

2. Weak novelty: The individual ideas (modular agent architectures, memory management, tool-use verification, process metrics) are well-established. The synthesis is useful but incremental.

3. Underdeveloped reference harness: CheetahClaws is mentioned but barely described. The comparison with Claude Code and OpenClaw is superficial (one table).

4. Self-citation concentration: A notable fraction of references are the first author's own work or closely affiliated projects, which raises questions about the breadth of engagement with the literature.

5. Notation without substance: The mathematical formalism (Equations 1-3) adds a veneer of precision to what are essentially qualitative arguments. The paper acknowledges this but still presents the equations prominently.

6. Missing related work: No engagement with classical software engineering concepts (separation of concerns, design patterns), operating systems research (scheduling, resource management), or the extensive literature on software architecture that has studied similar decomposition problems for decades.

7. "Under active development": The manuscript's self-described preliminary status limits confidence in its claims and completeness.

Overall Assessment

This is a position paper that identifies an important direction — treating agent harnesses as first-class objects of study — but falls short of making a compelling scientific contribution. The conceptual framework is sensible but not novel enough to stand without empirical support. The promised concrete contributions (CheetahClaws, comparative analysis) are underdeveloped. The paper would benefit substantially from: (1) ablation experiments showing how harness components independently affect performance on existing benchmarks, (2) concrete implementation of at least one proposed metric from Table 4, and (3) a detailed architectural comparison rather than a surface-level table. As a workshop paper or vision statement it would be appropriate; as a full research contribution it is premature.
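The ablation experiments suggested in point (1) could be organized along the following lines. This is a sketch under stated assumptions: the component names in `COMPONENTS` are placeholders chosen for illustration, the scores are whatever an existing benchmark would report for each configuration, and none of these functions corresponds to an interface in CheetahClaws or any other harness.

```python
from itertools import product
from typing import Dict, FrozenSet, Iterator, List, Sequence

# Hypothetical switchable harness components.
COMPONENTS = ("memory", "context", "routing", "verification")


def ablation_configs(components: Sequence[str] = COMPONENTS) -> Iterator[Dict[str, bool]]:
    """Yield every on/off combination (2**n configurations)."""
    for flags in product((True, False), repeat=len(components)):
        yield dict(zip(components, flags))


def leave_one_out(components: Sequence[str] = COMPONENTS) -> List[Dict[str, bool]]:
    """Full harness plus one configuration per disabled component (n + 1 runs)."""
    full = {c: True for c in components}
    return [full] + [{**full, c: False} for c in components]


def marginal_effects(scores: Dict[FrozenSet[str], float],
                     components: Sequence[str] = COMPONENTS) -> Dict[str, float]:
    """Score drop when each component is removed from the full harness.

    `scores` maps the set of enabled components to a benchmark score.
    """
    full = frozenset(components)
    return {c: scores[full] - scores[full - {c}] for c in components}
```

The leave-one-out design needs far fewer runs than the full grid, but because the paper itself notes that the factors are "not strictly orthogonal," only the full grid can expose interactions between components.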

Rating: 3.5 / 10

Significance 5 · Rigor 2 · Novelty 3.5 · Clarity 6.5

Generated May 26, 2026

Comparison History (21)

vs. Do Agents Think Deeper? A Mechanistic Investigation of Layer-Wise Dynamics in Sequential Planning

gemini-3.1 · 5/28/2026

Paper 1 addresses a critical and highly practical bottleneck in current AI research: system-level scaling and architecture for AI agents. By proposing a paradigm shift towards 'scaling the harness' and introducing new benchmarks and a reference framework, it offers broad, immediate real-world applications across multiple domains. While Paper 2 provides valuable mechanistic insights into LLM depth utilization, Paper 1's focus on the entire agentic ecosystem promises a more profound and widespread impact on how future AI systems are designed, evaluated, and deployed.

vs. MUSE-Autoskill: Self-Evolving Agents via Skill Creation, Memory, Management, and Evaluation

gemini-3.1 · 5/27/2026

Paper 2 proposes a broad paradigm shift towards 'system scaling' in agentic AI, defining a comprehensive research agenda around the architecture surrounding foundation models. Its high-level perspective, focus on critical systemic bottlenecks, and introduction of new terminology are likely to shape future research directions and gather broader citations than the specific skill-management framework presented in Paper 1.

vs. Can LLMs Introspect? A Reality Check

gpt-5.2 · 5/27/2026

Paper 2 has higher likely scientific impact due to its methodological rigor and timeliness: it directly challenges prominent claims about LLM introspection with stronger controls, alternative explanations, and negative/clarifying results that can reshape evaluation standards. Its conclusions affect interpretability, safety, benchmarking, and cognitive-science-inspired ML, giving broad cross-field relevance. Paper 1 is a valuable systems agenda and tooling contribution, but it is more conceptual/engineering-oriented and its impact depends on adoption of a specific harness and benchmarks, whereas Paper 2 provides sharper falsification and widely applicable evaluation guidance.

vs. Beyond a Single Direction: Chain-of-Thought Disrupts Simple Steering of Refusal

gemini-3.1 · 5/27/2026

Paper 1 provides concrete, novel empirical insights into the internal mechanics of Large Reasoning Models, specifically how Chain-of-Thought interacts with activation steering and refusal. Its rigorous mechanistic approach directly addresses critical AI safety and alignment challenges in cutting-edge models. Paper 2, while relevant, is primarily a position paper on system architecture, offering conceptual frameworks rather than novel foundational discoveries. Thus, Paper 1's specific methodological breakthroughs offer higher potential for scientific impact.

vs. What Makes Chain-of-Thought Work at Probe Time? Local Co-occurrence Rather Than Global Derivation

claude-opus-4.6 · 5/27/2026

Paper 2 offers a novel mechanistic insight into chain-of-thought prompting—a widely used technique—showing that local token co-occurrence rather than logical derivation drives much of the gain. This is a surprising, empirically rigorous finding that challenges fundamental assumptions about why CoT works, with broad implications for prompt engineering, model interpretability, and reasoning research. Paper 1, while addressing an important systems-level perspective on agentic AI, is more of a position/framework paper with less empirical novelty and relies heavily on architectural proposals rather than falsifiable scientific discoveries.

vs. Advancing Graph Few-Shot Learning via In-Context Learning

claude-opus-4.6 · 5/26/2026

Paper 2 addresses a more fundamental and timely challenge in AI—the system-level architecture around foundation models for agentic AI. Its scope is broader, proposing a new research paradigm ('scaling the harness') with a concrete framework, open-source reference implementation, and a research agenda spanning benchmarks, governance, and verification. While Paper 1 makes solid contributions to graph few-shot learning, it operates in a narrower subfield. Paper 2's framing of system scaling as the next bottleneck has potential to reshape how the community thinks about agentic AI development, yielding broader cross-field impact.

vs. What Gets Cited: Competitive GEO in AI Answer Engines

gpt-5.2 · 5/26/2026

Paper 2 has higher potential impact: it reframes a timely, field-wide bottleneck (agentic AI performance and safety emerging from system architecture) and proposes a broad research agenda plus a reference harness, likely influencing benchmarks, tooling, and evaluation across many labs and applications. Its scope spans memory, governance, orchestration, verification, and routing—relevant to reliability and deployment. Paper 1 is methodologically rigorous and practically useful for GEO/LLM citation behavior, but its contribution is narrower (RAG citation optimization) with less cross-field breadth.

vs. ConceptM$^3$oE: Concept-Guided Multimodal Mixture of Experts for Interpretable Computational Pathology

gpt-5.2 · 5/26/2026

Paper 2 has higher likely scientific impact: it introduces a concrete, technically novel architecture (concept-guided multimodal MoE with residual concept bottlenecks), demonstrates methodological rigor with multi-cohort evaluation and expert (neuropathologist) validation, and targets an urgent real-world domain (clinical pathology) where interpretability and multimodal integration are high-impact and deployable. Paper 1 raises an important systems agenda for agentic AI and provides a reference harness, but it is more conceptual/architectural with less evidence of measurable performance gains or validated benchmarks, making near-term impact less certain.

vs. The Model Is Not the Product: A Dual-Pillar Architecture for Local-First Psychological Coaching

gemini-3.1 · 5/26/2026

Paper 1 proposes a fundamental paradigm shift for the broader field of agentic AI, emphasizing 'scaling the harness' over mere model scaling. By identifying core bottlenecks, defining a comprehensive research agenda for system-level benchmarking, and providing an open-source reference framework, it offers high theoretical novelty and broad applicability across AI domains. In contrast, Paper 2 presents a valuable but narrower applied architecture specific to mobile psychological coaching, making its overall scientific and methodological impact more localized.

vs. TIGER: Text-Informed Generalized Enzyme-Reaction Retrieval

gpt-5.2 · 5/26/2026

Paper 2 targets a broad, timely bottleneck in agentic AI—system-level “harness” design—likely affecting many domains using foundation-model agents (software engineering, HCI, security, governance, evaluation). It proposes a unifying framing, research agenda, new evaluation dimensions, and a reference implementation, which can catalyze community benchmarks and standards. Paper 1 is methodologically concrete and impactful within computational enzymology, but its scope is narrower. Overall, Paper 2 has higher potential cross-field adoption and near-term relevance as agent systems proliferate.

vs. Conflict-Resilient Multi-Agent Reasoning via Signed Graph Modeling

gemini-3.1 · 5/26/2026

Paper 2 presents a visionary paradigm shift from model scaling to system scaling, addressing a critical bottleneck in deploying agentic AI. By defining the 'agent harness' and outlining a comprehensive research agenda, it has a broader potential to shape future research directions and architectures across the field compared to the specific methodological improvements in multi-agent reasoning offered by Paper 1.

vs. QQJ: Quantifying Qualitative Judgment for Scalable and Human-Aligned Evaluation of Generative AI

gemini-3.1 · 5/26/2026

Paper 2 proposes a fundamental paradigm shift from model scaling to system scaling in Agentic AI, a highly timely and rapidly expanding frontier. While Paper 1 offers a valuable evaluation framework, Paper 2's conceptualization of the 'agent harness' and its proposed research agenda for architectural design, governance, and evaluation have the potential to broadly influence how future autonomous systems are built and assessed across the entire AI ecosystem.

vs. Efficient Lookahead Encoding and Abstracted Width for Learning General Policies in Classical Planning

gemini-3.1 · 5/26/2026

Paper 1 addresses a highly timely bottleneck in modern AI: scaling agentic systems rather than just foundation models. Its focus on the 'agent harness' offers a conceptual paradigm shift with extremely broad real-world applications across any field deploying LLM agents. While Paper 2 presents rigorous, state-of-the-art algorithmic improvements for classical planning, its scope is much more niche. Paper 1's proposed research agenda and systemic framing have a significantly higher ceiling for widespread, cross-disciplinary scientific impact in the booming field of agentic AI.

vs. Meta-Agent: From Task Descriptions to Verified Multi-Agent Systems

gemini-3.1 · 5/26/2026

Paper 1 proposes a fundamental paradigm shift from model scaling to system scaling, defining a broad research agenda and new benchmarking criteria for agentic AI. While Paper 2 presents a highly rigorous, concrete multi-agent framework, Paper 1 offers a foundational conceptual framework that addresses structural bottlenecks in AI design. Its broader scope, focus on infrastructure evaluation, and establishment of future research directions give it higher potential for widespread scientific impact and foundational citations across the AI systems community.

vs. SpecAlign: A Semantic Alignment Framework for SystemVerilog Assertion Generation

gpt-5.2 · 5/26/2026

Paper 1 targets a broad, timely bottleneck in agentic AI: system-level “harness” scaling (memory, context governance, routing, verification, benchmarks). This reframing is novel and likely to influence how agents are built and evaluated across many domains, with wide applicability and cross-field impact (AI systems, safety, HCI, software engineering). The provided reference harness and benchmark agenda can catalyze follow-on work. Paper 2 is methodologically solid and useful, but its scope is narrower (SVA generation/verification) and its impact is more domain-specific.

vs. Uncertainty Decomposition via Cyclical SG-MCMC and Soft-label Learning for Subjective NLP

gpt-5.2 · 5/26/2026

Paper 1 targets a broad, timely bottleneck in agentic AI—system/harness scaling—shifting evaluation and optimization beyond model capability to architecture, governance, memory, routing, and verification. This framing could influence benchmarks, tooling, and deployment practices across many agentic applications, with potentially wide cross-field impact (AI systems, safety, software engineering, HCI). While more agenda/position-oriented, it includes a reference harness and comparative analysis that can catalyze follow-on work. Paper 2 is methodologically solid but narrower (subjective NLP uncertainty on GoEmotions), likely yielding more incremental, domain-specific impact.

vs. Trust but Verify: Prover-Verifier Deliberation for Selective LLM Prediction

claude-opus-4.6 · 5/26/2026

Paper 2 introduces a concrete, novel inference-time protocol (PVD) grounded in interactive proof theory with clear empirical results showing ~30pp precision gaps. It addresses the critical problem of knowing when LLMs are reliable, offers a well-scoped contribution with rigorous methodology, and provides actionable comparisons against established baselines. Paper 1, while addressing an important systems-level perspective on agentic AI, is more of a position/framework paper with a broad research agenda but less focused empirical contribution. Paper 2's specificity, methodological rigor, and immediately applicable selective prediction mechanism give it higher near-term scientific impact.

vs. Energy Shields for Fairness

claude-opus-4.6 · 5/26/2026

Paper 2 introduces a novel, well-defined theoretical contribution—energy shields for runtime fairness—with formal safety and liveness guarantees, a synthesis procedure, and experimental evaluation. It bridges physics-inspired control theory with algorithmic fairness, offering clear methodological rigor and broad applicability. Paper 1, while addressing an important systems-level concern in agentic AI, is more of a position/framework paper with a reference implementation rather than a rigorous scientific contribution. Its claims are harder to validate and its novelty is more incremental, largely organizing known engineering concerns rather than introducing fundamentally new techniques.

vs. MobileGym: A Verifiable and Highly Parallel Simulation Platform for Mobile GUI Agent Research

gemini-3.1 · 5/26/2026

Paper 1 presents a highly practical, rigorously tested simulation platform that directly solves a major bottleneck in GUI agent research: the high cost and lack of verifiability in training agents via RL. Its introduction of deterministic state-based judging, parallel rollouts, and demonstrated Sim-to-Real transfer offers immediate, high-impact utility to the AI community. While Paper 2 offers a valuable conceptual framework for agent systems, Paper 1 provides a concrete, empirically validated tool and benchmark that will likely catalyze a wide range of new experiments and advancements in agentic AI.

vs. CausaLab: A Scalable Environment for Interactive Causal Discovery Toward AI Scientists

gemini-3.1 · 5/26/2026

Paper 1 addresses a fundamental limitation in current AI—causal reasoning and discovery—by introducing a rigorous, scalable evaluation environment. Its focus on separating predictive success from true causal understanding provides a critical tool for developing future 'AI Scientists.' While Paper 2 offers a valuable systems-level perspective on agent architectures, Paper 1 tackles a deeper algorithmic and cognitive bottleneck with a concrete methodological framework, giving it broader implications for foundational AI research.