Adapting the Interface, Not the Model: Runtime Harness Adaptation for Deterministic LLM Agents

Tianshi Xu, Huifeng Wen, Meng Li

#470 of 2292 · Artificial Intelligence
Tournament Score: 1475±47
Win Rate: 69% (11 wins, 5 losses, 16 matches)
Rating: 6.8/10

Abstract

LLM agents are shaped not only by their language models, but also by the runtime harness that mediates observation, tool use, action execution, feedback interpretation, and trajectory control. While existing agent adaptation methods mainly update model parameters, many failures in deterministic, rule-governed domains stem from mismatches at the model-environment interface. We propose Life-Harness, a lifecycle-aware runtime harness that improves frozen LLM agents without changing model weights or evaluation environments. Life-Harness evolves from training trajectories by converting recurring interaction failures into reusable interventions across environment contracts, procedural skills, action realization, and trajectory regulation, and remains fixed during held-out evaluation. On seven deterministic environments from τ-bench, τ²-bench, and AgentBench, Life-Harness improves 116 out of 126 model-environment settings across 18 model backbones, with an average relative improvement of 88.5%. Harnesses evolved only from Qwen3-4B-Instruct trajectories transfer to 17 other models, showing that Life-Harness captures reusable environment-side structure rather than model-specific behavior. These results position runtime interface adaptation as a complementary alternative to model-centric agent training. Code is available at GitHub.

AI Impact Assessments

(1 model)

Scientific Impact Assessment: "Adapting the Interface, Not the Model: Runtime Harness Adaptation for Deterministic LLM Agents"

1. Core Contribution

This paper introduces LIFE-HARNESS, a lifecycle-aware runtime harness that improves frozen LLM agents by adapting the model-environment interface rather than model parameters. The key insight is that many agent failures in deterministic, rule-governed domains arise not from insufficient reasoning capability but from mismatches at the interface level — malformed actions, misunderstood tool contracts, trajectory degeneration, and missing procedural knowledge. The system organizes interventions into four lifecycle layers: (1) Environment Contract Layer (pre-interaction calibration), (2) Procedural Skill Layer (task-conditioning retrieval), (3) Action Realization Layer (pre-execution validation/canonicalization), and (4) Trajectory Regulation Layer (post-execution monitoring and recovery). The harness is evolved offline from training trajectories using a coding agent (Codex) and remains fixed during evaluation.
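
To make the four lifecycle layers concrete, the following is a minimal Python sketch of how such a harness might wrap a frozen model. It is an illustration under stated assumptions, not the authors' implementation: the class `Harness`, its methods, the keyword-based skill lookup, and the repeat-count loop check are all hypothetical names and simplifications introduced here.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional, Tuple


@dataclass
class Harness:
    """Hypothetical four-layer wrapper around a frozen model (illustrative only)."""
    contract: str                                         # Environment Contract Layer
    skills: Dict[str, str] = field(default_factory=dict)  # Procedural Skill Layer
    max_repeats: int = 2                                  # Trajectory Regulation Layer

    def build_prompt(self, task: str, observation: str) -> str:
        # Pre-interaction: prepend the contract, then any skill whose keyword
        # appears in the task text (a stand-in for task-conditioned retrieval).
        hints = [text for key, text in self.skills.items() if key in task.lower()]
        return "\n".join([self.contract, *hints, f"Task: {task}", f"Observation: {observation}"])

    def realize_action(self, raw: str, valid_actions: List[str]) -> Optional[str]:
        # Pre-execution: canonicalize the raw model output, then validate it.
        candidate = raw.strip().strip("`").lower()
        return candidate if candidate in valid_actions else None

    def regulate(self, history: List[str]) -> Optional[str]:
        # Post-execution: flag a loop when the same action repeats too often.
        tail = history[-(self.max_repeats + 1):]
        if len(tail) == self.max_repeats + 1 and len(set(tail)) == 1:
            return "You are repeating the same action; try a different one."
        return None


def step(harness: Harness, model: Callable[[str], str], task: str, observation: str,
         valid_actions: List[str], history: List[str]) -> Tuple[Optional[str], Optional[str]]:
    """Run one model turn through all four layers; returns (action, feedback)."""
    prompt = harness.build_prompt(task, observation)
    action = harness.realize_action(model(prompt), valid_actions)
    if action is None:
        return None, "Invalid action; choose one of: " + ", ".join(valid_actions)
    history.append(action)
    return action, harness.regulate(history)
```

In this sketch the model is any callable from prompt to text, so the same harness object can be reused across backbones, which mirrors the cross-model transfer claim only in spirit.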

2. Methodological Rigor

The experimental setup is comprehensive: 7 deterministic environments across 3 benchmark suites (τ-bench, τ²-bench, AgentBench), 18 model backbones spanning instruction-tuned, reasoning, and agent-specialized models. The 88.5% average relative improvement across 116/126 settings is striking. Several design choices strengthen rigor:

  • Clean train/test separation: Harnesses are evolved from training trajectories and frozen during held-out evaluation.
  • Cross-model transfer: Harnesses evolved from Qwen3-4B-Instruct trajectories transfer to 17 other models, supporting the claim that interventions capture environment-side structure.
  • Ablation study: Leave-one-layer-out ablation demonstrates that different layers are critical for different environments, validating the multi-layer design.
  • Comparison with prompt evolution: LIFE-HARNESS substantially outperforms prompt-only optimization (+120% relative improvement), demonstrating that execution-level interventions matter beyond prompt text.

However, there are methodological concerns. The harness evolution uses Codex (a proprietary coding agent) to read trajectories and propose updates, which introduces an opaque optimization step that is difficult to reproduce exactly or control for. The "iterative evolution" process (Figure 6) converges in ~5 iterations, but the paper provides limited analysis of what fraction of the improvement comes from straightforward fixes (e.g., regex-based action parsing) versus genuinely sophisticated interventions. The failure taxonomy (Figure 3) is manually annotated with Codex assistance, introducing potential annotation bias toward categories the harness is designed to address.

The temperature=0.0 evaluation eliminates stochastic variation but also means the results reflect a narrow slice of model behavior. The Pass^3 metric for τ-bench (requiring 3/3 successes) is informative, but the three runs at temperature 0 may not be truly independent: variation comes only from the simulated user LLM (DeepSeek-V4-Flash).
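
The strictness of a 3/3 criterion is easy to see in code. The sketch below assumes Pass^3 is computed per task as "all three runs succeeded" and then averaged over tasks, as the description above suggests; the function name `pass_all_k` and the example outcomes are hypothetical and are not taken from the paper.

```python
from typing import Dict, List


def pass_all_k(results: Dict[str, List[bool]], k: int = 3) -> float:
    """Fraction of tasks whose first k runs all succeed."""
    if not results:
        raise ValueError("no tasks given")
    solved = 0
    for task_id, runs in results.items():
        if len(runs) < k:
            raise ValueError(f"task {task_id} has fewer than {k} runs")
        solved += all(runs[:k])  # True counts as 1, False as 0
    return solved / len(results)


def per_run_success(results: Dict[str, List[bool]]) -> float:
    """Plain success rate over every individual run, for comparison."""
    runs = [outcome for task_runs in results.values() for outcome in task_runs]
    return sum(runs) / len(runs)


# Made-up outcomes for four tasks, three runs each (illustration only).
example = {
    "t1": [True, True, True],
    "t2": [True, False, True],
    "t3": [True, True, True],
    "t4": [False, False, False],
}
print(pass_all_k(example), per_run_success(example))
```

With these made-up outcomes, task `t2` succeeds in two of three runs yet contributes nothing to the 3/3 score, so the metric is sensitive to any run-to-run variation, including variation that originates in the simulated user.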

3. Potential Impact

The paper's central framing, that adapting the runtime interface is complementary to model training, is practically significant. If the results hold broadly, this suggests:

  • Cost-effective improvement: Organizations can improve agent performance without expensive fine-tuning or access to model weights, which is particularly relevant for proprietary/API-based models.
  • Composability with training: The demonstration that LIFE-HARNESS helps both base models and their fine-tuned derivatives (Qwen2.5-32B vs. xLAM-2-32B) suggests the two approaches are genuinely orthogonal.
  • Reusability: A single harness working across 18 models is practically valuable because it amortizes the engineering cost of environment-specific adaptation.

The approach is most applicable to structured, deterministic domains (databases, business workflows, web shopping, OS interaction). Extension to open-ended or stochastic environments is acknowledged as a limitation and would significantly broaden impact.

4. Timeliness & Relevance

This work arrives at an important moment. LLM agents are being deployed in production for tool-use, coding, and workflow automation, yet the gap between static benchmarks and interactive performance remains large. The paper correctly identifies that the "harness" (the glue code mediating model-environment interaction) is often the bottleneck in practice. The concurrent emergence of several harness optimization papers (Meta-Harness, AHE, HARBOR, Workspace Optimization) in 2026 validates the timeliness of this direction. LIFE-HARNESS distinguishes itself through its lifecycle-organized structure and focus on deterministic environments, though it is part of a broader trend rather than a singular breakthrough.

5. Strengths & Limitations

Strengths:

  • Principled decomposition of failures into lifecycle stages with corresponding intervention layers
  • Impressive breadth of evaluation: 7 environments × 18 models = 126 settings, with 92% showing improvement
  • Strong cross-model transfer result validates environment-specificity over model-specificity
  • The complementarity result (harness + training > training alone) is practically important
  • Detailed appendix with complete harness inventories (Tables 3-5) enables reproducibility of the final artifacts

Limitations:

  • The harness evolution process relies on Codex, making the "how" of harness construction somewhat opaque and expensive to reproduce
  • Many interventions in Tables 3-5 appear to be detailed, hand-crafted rules specific to each environment — raising questions about how much of the improvement comes from the framework versus careful domain engineering
  • The 88.5% relative improvement is inflated by very low baselines (e.g., ALFWorld improving from 0.055 to 0.826 for Llama-3.1-8B), where trivial action-parsing fixes might account for most gains
  • No analysis of which specific interventions contribute most to improvement (beyond layer-level ablation)
  • Limited to deterministic environments — the approach's value in stochastic or partially observable settings is unclear
  • The approach fundamentally requires training trajectories from the target environment, limiting zero-shot applicability
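
The low-baseline concern in the limitations above can be checked with simple arithmetic. The sketch below uses the ALFWorld figures quoted in the review (0.055 to 0.826 for Llama-3.1-8B) as its first pair; the other two pairs are made-up values added only to show how one very low baseline can dominate a mean of relative improvements. The helper name `relative_improvement` is hypothetical.

```python
from statistics import mean, median


def relative_improvement(baseline: float, harnessed: float) -> float:
    """Relative gain (harnessed - baseline) / baseline, as a fraction."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (harnessed - baseline) / baseline


# First pair: ALFWorld / Llama-3.1-8B figures quoted in the review.
# Remaining pairs: invented values, for illustration only.
settings = [(0.055, 0.826), (0.50, 0.55), (0.60, 0.63)]
gains = [relative_improvement(b, h) for b, h in settings]

print("per-setting gains:", [round(g, 3) for g in gains])
print("mean gain:", round(mean(gains), 3))
print("median gain:", round(median(gains), 3))
```

The quoted ALFWorld pair alone corresponds to a relative gain of roughly 1400%, so in this toy set the mean sits far above the median. Reporting the median or absolute gains alongside the mean would make the headline figure easier to interpret.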

6. Additional Observations

The paper's framing elegantly separates "what the model knows" from "how the model acts," but many of the harness components (Tables 3-5) encode substantial domain knowledge, essentially building expert systems around the LLM. The line between "adapting the interface" and "building a task-specific wrapper with hard-coded rules" deserves more critical examination. The scalability concern is whether this level of per-environment engineering is sustainable as the number of target environments grows.

Rating: 6.8/10
Significance 7 · Rigor 6.5 · Novelty 7 · Clarity 8

Generated May 22, 2026

Comparison History (16)

    vs. MOSS: Self-Evolution through Source-Level Rewriting in Autonomous Agent Systems
    gpt-5.2 · 5/22/2026

    Paper 1 offers a clearer, more novel scientific contribution: systematic runtime harness adaptation (without weight updates or environment changes) with strong, broad empirical validation across 7 benchmarks, 18 model backbones, and extensive transfer, suggesting generalizable environment-side structure. This methodological rigor and breadth make it likely to influence agent evaluation/training paradigms and interface design across many deterministic tool-use settings. Paper 2 is compelling and timely for production engineering, but evidence is narrower (single benchmark, few tasks) and the contribution leans more toward systems integration/DevOps than broadly validated scientific insight.

    vs. MOSS: Self-Evolution through Source-Level Rewriting in Autonomous Agent Systems
    gemini-3.1 · 5/22/2026

    Paper 1 proposes a paradigm-shifting approach by enabling autonomous agents to rewrite their own source code, achieving self-evolution at a Turing-complete level rather than just modifying text artifacts or prompts. This fundamentally expands the horizon of self-improving systems and addresses structural failures unreachable by previous methods. While Paper 2 offers a rigorous and highly transferable interface adaptation method, Paper 1's bold conceptual leap toward true self-modifying code presents a deeper theoretical and practical impact for the future of AGI.

    vs. Unlocking Proactivity in Task-Oriented Dialogue
    claude-opus-4.6 · 5/22/2026

    Paper 1 introduces a novel paradigm—adapting the runtime harness rather than model weights—demonstrating broad impact across 18 model backbones and 7 environments with strong transferability. Its 88.5% average relative improvement and cross-model generalization suggest a fundamental, reusable insight applicable across many LLM agent domains. Paper 2 addresses proactive dialogue with clever techniques but targets a narrower problem (proactive TOD/sales). Paper 1's breadth of applicability, methodological novelty in reframing agent improvement as an interface problem, and practical utility for frozen models give it higher potential impact.

    vs. Epicure: Navigating the Emergent Geometry of Food Ingredient Embeddings
    claude-opus-4.6 · 5/22/2026

    Paper 1 introduces a novel paradigm—runtime harness adaptation for frozen LLM agents—that addresses a fundamental and timely problem in AI agent design. It demonstrates strong empirical results (88.5% average improvement across 126 settings and 18 models), shows cross-model transferability, and has broad applicability across deterministic agent environments. Paper 2 presents an interesting but niche contribution in food ingredient embeddings with limited breadth of impact. Paper 1's methodological contribution is more likely to influence the rapidly growing field of LLM agents and has significantly wider potential for real-world applications.

    vs. SD-Search: On-Policy Hindsight Self-Distillation for Search-Augmented Reasoning
    gpt-5.2 · 5/22/2026

    Paper 2 has higher estimated impact due to a more broadly applicable training principle: generating step-level credit for search decisions via on-policy hindsight self-distillation without external teachers or annotations. This addresses a central bottleneck in search-augmented RL (credit assignment), is methodologically clean and easy to integrate into standard RL loops, and can extend to many tool-use/retrieval settings beyond specific benchmarks. Paper 1 is novel and practically useful for deterministic agents, but its scope is more tied to harness engineering and deterministic environments, potentially limiting cross-field generalization.

    vs. Von Neumann Networks
    gpt-5.2 · 5/22/2026

    Paper 1 proposes a new neural network paradigm grounded in von Neumann’s cellular diffusion ideas, links it to neural operators/Green’s functions, and claims universality via “Cellular Machines,” suggesting broad, cross-field impact (ML theory, dynamical systems/PDEs, neuromorphic/cellular computation, architecture). If rigor holds, it is a foundational contribution with long-term application potential. Paper 2 is timely and practically useful for LLM agents, but is more of a systems/engineering layer (runtime harness) with impact concentrated in agent evaluation settings and likely faster turnover as models/tooling evolve.

    vs. Behavior Cue Reasoning: Monitorable Reasoning Improves Efficiency and Safety through Oversight
    gpt-5.2 · 5/22/2026

    Paper 2 is likely to have higher impact due to its broadly applicable, model-agnostic idea: improving agent performance by adapting the runtime harness rather than retraining models. This is timely for real-world deployments where models are frozen, expensive to fine-tune, or closed-source, and it transfers across 18 backbones with large gains on multiple established benchmarks, suggesting strong practicality and breadth. Paper 1 is novel and relevant to safety/oversight, but it requires training models to emit special cues and its demonstrated benefits are more domain- and setup-dependent, potentially narrowing adoption.

    vs. Behavior Cue Reasoning: Monitorable Reasoning Improves Efficiency and Safety through Oversight
    gemini-3.1 · 5/22/2026

    Paper 2 tackles critical challenges in AI safety, scalable oversight, and reasoning efficiency. By introducing Behavior Cues, it allows real-time monitoring and intervention during the reasoning process, which is highly relevant for deploying safe and aligned advanced LLMs. This fundamental contribution to AI safety and controllability gives it broader scientific and societal impact compared to Paper 1's more engineering-focused approach of adapting the agent-environment interface.

    vs. FLUID: From Ephemeral IDs to Multimodal Semantic Codes for Industrial-Scale Livestreaming Recommendation
    gemini-3.1 · 5/22/2026

    While Paper 1 demonstrates massive industrial impact and solves a specific cold-start problem in recommender systems, Paper 2 offers higher scientific impact by proposing a novel paradigm for LLM agents. Shifting the focus from model parameter updates to runtime interface adaptation addresses a critical bottleneck in agentic AI. Its demonstrated transferability across 18 models and multiple environments ensures broad applicability, potentially influencing the fundamental methodology of how researchers build, train, and deploy autonomous LLM agents across diverse fields.

    vs. Meta-Learning for Rapid Adaptation in Reference Tracking of Uncertain Nonlinear Systems
    gpt-5.2 · 5/22/2026

    Paper 1 is likely to have higher scientific impact due to its novel reframing of LLM agent improvement from model-centric training to runtime/interface (harness) adaptation, a timely problem with broad relevance across AI agents, tool use, and evaluation. The reported large gains across many environments and 18 model backbones, plus cross-model transfer from a single model’s trajectories, suggests strong generality and practical applicability without retraining or environment changes. Paper 2 is solid and application-relevant for control, but meta-learning for adaptive control is a more established direction with narrower cross-field reach.

    vs. AttuneBench: A Conversation-Based Benchmark for LLM Emotional Intelligence
    gemini-3.1 · 5/22/2026

    Paper 2 proposes a highly novel, model-agnostic approach to improving LLM agents by adapting the runtime harness rather than model weights. Its extensive empirical validation (126 settings, 18 models) demonstrating high transferability and an 88.5% average improvement indicates profound, immediate utility for the rapidly growing field of autonomous agents. While Paper 1 offers a valuable benchmark for emotional intelligence, Paper 2's methodological innovation solves a broader, critical bottleneck in agent-environment interaction across multiple domains.

    vs. Skill Weaving: Efficient LLM Improvement via Modular Skillpacks
    gpt-5.2 · 5/22/2026

    Paper 2 likely has higher impact due to a more novel and broadly applicable shift: adapting the runtime harness (interface) rather than model weights, improving frozen agents across many backbones and deterministic environments. Its methodology emphasizes transfer (trained on one model, generalizes to 17 others) and large, systematic coverage (126 settings), suggesting strong robustness and reproducibility. The approach is timely for agent reliability and governance in rule-based domains, with clear real-world applicability (tool-use, workflow automation) without expensive retraining. Paper 1 is valuable but aligns more with existing modular/adapter and compression trends.

    vs. Open-World Evaluations for Measuring Frontier AI Capabilities
    gemini-3.1 · 5/22/2026

    Paper 1 introduces a highly practical and novel methodological framework for adapting LLM agents without retraining. Its rigorous empirical validation across 18 models and demonstration of high transferability and significant performance gains (88.5% average improvement) offer immediate, broad utility for agent development. Paper 2, while conceptually valuable for AI evaluation and policy, acts more as a position paper with a single qualitative case study, lacking the algorithmic rigor and broad, direct applicability of Paper 1.

    vs. Conflict-Aware Additive Guidance for Flow Models under Compositional Rewards
    gpt-5.2 · 5/22/2026

    Paper 2 likely has higher scientific impact: it addresses a broadly encountered limitation of inference-time guidance (composing multiple rewards) in diffusion/flow models, provides a principled diagnosis (gradient misalignment driving off-manifold drift), and proposes a general, lightweight, learnable fix validated across multiple domains (images and decision-making/control). This combination of theoretical insight + wide applicability can influence controlled generation, alignment, and planning. Paper 1 is novel and practical for LLM agents, but is currently demonstrated mainly on deterministic agent benchmarks and depends on environment-specific harness engineering, making impact potentially narrower.

    vs. Think Thrice Before You Speak: Dual knowledge-enhanced Theory-of-Mind Reasoning for Persuasive Agents
    gemini-3.1 · 5/22/2026

    Paper 1 introduces a novel paradigm shift in LLM agent design by adapting the runtime interface rather than model weights. This approach is highly computationally efficient, broadly applicable across various agent frameworks, and demonstrates impressive cross-model transferability. In contrast, while Paper 2 provides valuable contributions to persuasive dialogue and Theory of Mind, its impact is narrower and more domain-specific. Paper 1's foundational methodological innovation offers wider implications for the rapidly expanding field of autonomous agents.

    vs. AMEL: Accumulated Message Effects on LLM Judgments
    claude-opus-4.6 · 5/22/2026

    Paper 1 introduces a novel paradigm—adapting the runtime harness rather than model weights—demonstrating large improvements (88.5% average relative) across 18 models and 7 environments with strong transferability. This reframes agent improvement methodology and has broad practical applicability. Paper 2 identifies an important but relatively narrower bias (AMEL) in LLM-as-judge settings with a modest effect size (d=-0.17) and a straightforward mitigation (fresh context per item). While rigorous and useful, Paper 1's conceptual contribution and demonstrated breadth of impact position it for higher scientific influence.