Interactive Evaluation Requires a Design Science

Keyang Xuan, Peiyang Song, Pan Lu, Pengrui Han, Wenkai Li, Zhenyu Zhang, Zexue He, Wenyue Hua

May 18, 2026

arXiv:2605.17829v1

cs.AI (primary)

#611 of 2292 · Artificial Intelligence

Tournament Score: 1460 ± 42 (scale 1050 to 1800)

Win Rate: 63%

Rating: 6.5 / 10

Significance 7 · Rigor 5.5 · Novelty 5.5 · Clarity 7.5

Abstract

AI evaluation is undergoing a structural change. Large language models (LLMs) are increasingly deployed as systems that act over time through tools, environments, users, and other agents, while many evaluation practices still inherit assumptions from response-centered benchmarks (e.g., fixed inputs, isolated outputs, and outcome judgments that can be made from a single response). The field has begun to build interactive benchmarks, but the resulting landscape is fragmented: benchmarks differ in what interaction artifacts they admit, how trajectories are scored, and what claims their results support. This position paper argues that interactive evaluation should be treated as a principled evaluation paradigm, not merely a new family of agent benchmarks. Simply adopting previous evaluation paradigms does not suffice. We define evaluation as an autonomous mapping from evidence to judgments, and show that interactive evaluation changes both sides of this mapping: the evidence becomes interaction-generated trajectories, while the evaluation procedure must assess process, recoverability, coordination, robustness, and system-level performance. Building on this definition, we propose a two-axis taxonomy, derive design principles and reporting standards, examine representative scenarios, and analyze how longstanding evaluation challenges reappear at the trajectory level.

AI Impact Assessments

(1 model)

Scientific Impact Assessment

1. Core Contribution

This position paper argues that interactive evaluation of AI systems—where agents act through tools, environments, users, and other agents over time—should be treated as a principled design science rather than a loosely connected family of agent benchmarks. The authors formalize evaluation as an autonomous mapping E: X → Y, where X is admissible evidence and Y is the space of evaluative judgments. They show that interactive evaluation changes both sides: X expands from single responses to interaction-generated trajectories, while E must assess process quality, recoverability, coordination, robustness, and safety beyond final-answer correctness.
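The shift on both sides of the mapping can be made concrete with a small sketch. The code below is an illustration only: the `Step` and `Trajectory` schemas, the metric names, and the definition of "recovered" are assumptions made for this example, not the paper's definitions. It treats an evaluator as any function from trajectory evidence to a dictionary of judgments, and contrasts an outcome-only evaluator with one that also reads the process.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class Step:
    """One action and its result (hypothetical schema, not from the paper)."""
    action: str
    ok: bool  # True if the environment accepted the action


@dataclass(frozen=True)
class Trajectory:
    """Interaction-generated evidence: ordered steps plus the final outcome."""
    steps: List[Step]
    task_solved: bool


Judgment = Dict[str, float]                   # Y: named evaluative judgments
Evaluator = Callable[[Trajectory], Judgment]  # E: X -> Y


def outcome_only(t: Trajectory) -> Judgment:
    """Response-centered style: looks only at the final outcome."""
    return {"task_success": float(t.task_solved)}


def trajectory_level(t: Trajectory) -> Judgment:
    """Also scores the process; the metric definitions are illustrative."""
    failed = [i for i, s in enumerate(t.steps) if not s.ok]
    # A failure counts as recovered if any later step succeeds.
    recovered = sum(1 for i in failed if any(s.ok for s in t.steps[i + 1:]))
    n = len(t.steps)
    return {
        "task_success": float(t.task_solved),
        "step_success_rate": (n - len(failed)) / n if n else 0.0,
        "recovery_rate": recovered / len(failed) if failed else 1.0,
    }


clean = Trajectory([Step("edit", True), Step("test", True)], True)
messy = Trajectory(
    [Step("edit", False), Step("edit", False), Step("test", True)], True
)
print(outcome_only(clean) == outcome_only(messy))  # identical outcome judgment
print(trajectory_level(clean))
print(trajectory_level(messy))
```

Both example trajectories solve the task, so the outcome-only evaluator cannot tell them apart, while the trajectory-level evaluator separates them on `step_success_rate`.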

The paper's main intellectual contribution is a two-axis taxonomy organizing interactive evaluations by (1) what interaction artifacts enter as evidence (tools/environments, users, other agents, hybrid/dynamic systems) and (2) what evaluation programs map trajectories to judgments (task success, process quality, recoverability, safety/alignment). By mapping existing benchmarks into this 2D space, the authors identify systematic gaps—particularly the dominance of outcome-only scoring even when rich trajectory data is available, and the sparsity of hybrid/dynamic system evaluations.
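The gap analysis described above amounts to placing benchmarks in a 4 × 4 grid and listing the empty cells. A minimal sketch follows, using the axis labels as summarized in this assessment. The benchmark names `bench_a` and `bench_b` and their placements are hypothetical placeholders, not the paper's categorization of any real benchmark.

```python
from enum import Enum
from itertools import product
from typing import Dict, List, Set, Tuple


class Evidence(Enum):
    """Axis 1: which interaction artifacts enter as evidence."""
    TOOLS_ENVIRONMENTS = "tools/environments"
    USERS = "users"
    OTHER_AGENTS = "other agents"
    HYBRID_DYNAMIC = "hybrid/dynamic systems"


class Program(Enum):
    """Axis 2: which evaluation program maps trajectories to judgments."""
    TASK_SUCCESS = "task success"
    PROCESS_QUALITY = "process quality"
    RECOVERABILITY = "recoverability"
    SAFETY_ALIGNMENT = "safety/alignment"


Cell = Tuple[Evidence, Program]


def uncovered_cells(placements: Dict[str, Set[Cell]]) -> List[Cell]:
    """Return taxonomy cells that no benchmark in `placements` occupies."""
    covered: Set[Cell] = set().union(*placements.values())
    return [cell for cell in product(Evidence, Program) if cell not in covered]


# Hypothetical placements, for illustration only.
placements = {
    "bench_a": {(Evidence.TOOLS_ENVIRONMENTS, Program.TASK_SUCCESS)},
    "bench_b": {
        (Evidence.USERS, Program.TASK_SUCCESS),
        (Evidence.USERS, Program.PROCESS_QUALITY),
    },
}
gaps = uncovered_cells(placements)
print(f"{len(gaps)} of 16 cells have no benchmark")
```

A benchmark may occupy several cells, which is why each placement is a set rather than a single pair.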

2. Methodological Rigor

As a position paper, methodological rigor manifests differently than in empirical work. The paper's conceptual framework is clean and well-motivated. The E: X → Y formalism is simple but effective—it clarifies exactly where interactive evaluation departs from response-centered evaluation without overcomplicating the argument.

The taxonomy is derived systematically from the definition rather than imposed ad hoc. The boundary cases (multi-turn without action-dependence, tool calls without state change, chain-of-thought without external loops) are well-chosen and sharpen the definition's scope. The paper also includes a semi-automated benchmark collection methodology with quality filters (citation velocity, GitHub stars, venue) and LLM-based classification validated against manual labels (>90% agreement), lending some empirical grounding to the landscape analysis.

However, the empirical analysis remains largely descriptive. The χ² test comparing industry vs. academic evaluation-stage distributions (p=0.029) is suggestive but based on a relatively small industry sample (43 benchmark families from 4 companies). The 2D taxonomy mapping in Figure 3 relies on the authors' categorization of benchmarks, which, while reasonable, involves subjective judgment calls. The paper would benefit from inter-annotator agreement metrics on the taxonomy placement.
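One way to report the agreement metrics requested above is Cohen's kappa, which corrects raw agreement for the agreement expected by chance. The sketch below uses only the standard library. The label sequences are invented for illustration and are not the paper's annotation data.

```python
from collections import Counter
from typing import Sequence


def percent_agreement(a: Sequence[str], b: Sequence[str]) -> float:
    """Fraction of items on which two annotators assign the same label."""
    if len(a) != len(b) or not a:
        raise ValueError("label sequences must be non-empty and equal length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def cohens_kappa(a: Sequence[str], b: Sequence[str]) -> float:
    """Cohen's kappa: (p_o - p_e) / (1 - p_e) for two annotators."""
    p_o = percent_agreement(a, b)
    n = len(a)
    count_a, count_b = Counter(a), Counter(b)
    # Chance agreement from each annotator's marginal label frequencies.
    labels = count_a.keys() | count_b.keys()
    p_e = sum(count_a[k] * count_b[k] for k in labels) / (n * n)
    if p_e == 1.0:  # both annotators used one identical label throughout
        return 1.0
    return (p_o - p_e) / (1.0 - p_e)


# Invented labels for ten benchmarks (illustration only).
manual = ["tool", "tool", "user", "agent", "tool",
          "user", "tool", "tool", "agent", "tool"]
llm = ["tool", "tool", "tool", "agent", "tool",
       "user", "tool", "tool", "agent", "tool"]
print(percent_agreement(manual, llm), cohens_kappa(manual, llm))
```

When one category dominates, as "tool" does in this toy data, kappa comes out lower than raw agreement, which is why a bare percentage can overstate how reliable a taxonomy placement is.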

3. Potential Impact

The paper's impact potential is substantial but depends heavily on community adoption. Its contributions operate at three levels:

Conceptual clarity: The distinction between "recording trajectories" and "evaluating at the trajectory level" is genuinely important. Many current benchmarks collect rich interaction data but collapse it into pass/fail scores, discarding evidence about process quality, safety, and robustness. Making this explicit could shift how benchmark designers think about scoring.

Practical guidance: The design principles (specify the system and trajectory evidence, specify interaction protocols, design for perturbation and repair, separate outcome/process/risk) are actionable. The call for protocol documentation as "the interactive analogue of dataset documentation" is apt and could lead to standardized reporting practices.

Community organization: By providing a shared vocabulary and taxonomy, the paper could help reduce fragmentation across web agents, coding agents, multi-agent systems, and tool-use benchmarks that currently develop evaluation practices independently.

The two illustrative scenarios (coding agents, multi-agent social systems) in Appendix D effectively demonstrate how the framework applies in practice, though they remain at the conceptual level without implementing the proposed evaluation programs.
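The design principles listed under practical guidance could be turned into a simple reporting checklist. The sketch below is one possible shape for such a protocol record. The class name `ProtocolCard`, its field names, and the example values are assumptions for illustration; the paper does not prescribe this schema.

```python
from dataclasses import dataclass, field, fields
from typing import Dict, List


@dataclass
class ProtocolCard:
    """Hypothetical protocol documentation record for an interactive benchmark."""
    system_under_test: str = ""                      # model plus scaffold and tools
    trajectory_evidence: List[str] = field(default_factory=list)
    interaction_protocol: Dict[str, str] = field(default_factory=dict)
    perturbations: List[str] = field(default_factory=list)  # perturbation and repair
    outcome_metrics: List[str] = field(default_factory=list)
    process_metrics: List[str] = field(default_factory=list)
    risk_metrics: List[str] = field(default_factory=list)


def missing_sections(card: ProtocolCard) -> List[str]:
    """Names of sections left empty, in declaration order."""
    return [f.name for f in fields(card) if not getattr(card, f.name)]


# Example values are placeholders, not taken from any real benchmark.
card = ProtocolCard(
    system_under_test="example-agent with a sandboxed shell",
    trajectory_evidence=["actions", "tool outputs"],
    interaction_protocol={"max_steps": "30"},
    outcome_metrics=["task_success"],
)
print(missing_sections(card))
```

Keeping outcome, process, and risk metrics in separate fields mirrors the principle of reporting them separately rather than folding them into one score.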

4. Timeliness & Relevance

The paper is highly timely. The rapid proliferation of agent benchmarks (WebArena, SWE-bench, τ-bench, SOTOPIA, etc.) has created exactly the kind of fragmentation the authors diagnose. The field is at an inflection point where agent evaluation is growing quickly but without shared conceptual infrastructure. The observation that many benchmarks admit trajectory evidence but score only outcomes captures a real and consequential gap.

The industry-vs-academic divergence highlighted in Figure 1 is also relevant: frontier labs are increasingly evaluating interactive capabilities while academic benchmarks remain predominantly response-centered. This creates a practical need for the kind of principled framework proposed here.

5. Strengths & Limitations

Strengths:

Clean conceptual framework: The E: X → Y formalism is minimal but productive—it generates the taxonomy, the design principles, and the risk analysis naturally.

Comprehensive gap analysis: The observations that trajectory evidence is underused even when collected, that evaluation programs remain substrate-bound, and that hybrid/dynamic systems lack coverage are well supported.

Balanced positioning: The paper carefully avoids overclaiming—it doesn't argue against response-centered evaluation, acknowledges cost concerns, and addresses multiple alternative views substantively in Appendix A.

Extensive benchmark survey: The representative benchmark list with metadata provides a useful resource.

Risk analysis: The discussion of trajectory-level analogues of classic evaluation problems (gaming, leakage, brittleness) is forward-looking and practically important.

Limitations:

No empirical validation of the framework's utility: The paper proposes principles but doesn't demonstrate that following them produces better evaluations. A case study implementing the full framework on one benchmark would strengthen the argument considerably.

Taxonomy granularity: The four evaluation-program categories (task success, process quality, recoverability, safety) may be too coarse. Within "process quality," for instance, there are very different measurement challenges for code locality vs. communication clarity vs. action economy.

Limited treatment of cost-benefit tradeoffs: While the paper acknowledges that interactive evaluation is more expensive, it provides little guidance on when the additional cost is justified beyond the general principle that "interaction should be constitutive of the capability claim."

Incremental conceptual contribution: Many of the individual observations (trajectories matter, process matters, robustness matters) are recognized in existing work. The contribution is primarily in organizing and systematizing these ideas, which is valuable but less novel than introducing a fundamentally new concept.

Appendix-heavy structure: Key illustrative content (scenarios, alternative views, benchmark details) is relegated to appendices, making the main text somewhat abstract.

6. Additional Observations

The paper's framing as "design science" (citing Simon, Hevner, Wieringa) is interesting but underexplored. Design science methodology has specific prescriptions (design artifacts, evaluation cycles, communication protocols) that could have been more directly applied. The invocation feels more metaphorical than methodological.

The GitHub repository is referenced but its contents and utility for the community are unclear from the paper. Shared infrastructure—logging schemas, trajectory viewers, evaluation harnesses—is called for but not provided.
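As an indication of what a shared logging schema might look like, the sketch below stores one trajectory event per line as JSON. The `Event` fields and the example values are assumptions for illustration. Nothing here reflects the contents of the referenced repository.

```python
import json
from dataclasses import asdict, dataclass
from typing import List


@dataclass
class Event:
    """One logged trajectory event (hypothetical schema)."""
    episode_id: str
    step: int
    actor: str    # e.g. "agent", "user", "tool"
    kind: str     # e.g. "action", "observation"
    content: str


def dump_jsonl(events: List[Event]) -> str:
    """Serialize events as JSON Lines, one event per line."""
    return "\n".join(json.dumps(asdict(e), sort_keys=True) for e in events)


def load_jsonl(text: str) -> List[Event]:
    """Parse JSON Lines back into events, skipping blank lines."""
    return [Event(**json.loads(line)) for line in text.splitlines() if line.strip()]


log = [
    Event("ep-1", 0, "agent", "action", "run tests"),
    Event("ep-1", 1, "tool", "observation", "2 failed\n1 passed"),
]
text = dump_jsonl(log)
print(text)
```

A line-oriented format lets a log be appended to during an episode and read back incrementally, and `json.dumps` escapes embedded newlines so each event stays on one line.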

Rating: 6.5 / 10

Significance 7 · Rigor 5.5 · Novelty 5.5 · Clarity 7.5

Generated May 19, 2026

Comparison History (16)

vs. What and When to Distill: Selective Hindsight Distillation for Multi-Turn Agents

gemini-3.1 · 5/20/2026

Paper 1 addresses a foundational, field-wide challenge by proposing a new paradigm and taxonomy for evaluating interactive AI systems. While Paper 2 offers a strong methodological improvement for training agents, Paper 1 has broader implications for how the entire community benchmarks, designs, and assesses future interactive LLMs, giving it higher potential for widespread scientific impact.

vs. POLAR-Bench: A Diagnostic Benchmark for Privacy-Utility Trade-offs in LLM Agents

gemini-3.1 · 5/20/2026

Paper 2 proposes a fundamental paradigm shift and theoretical framework for the entire field of interactive AI evaluation. While Paper 1 introduces a valuable and timely benchmark for privacy in LLM agents, Paper 2's methodological contributions, design principles, and taxonomy have the potential to influence the creation of all future interactive benchmarks, giving it a significantly broader impact across the rapidly growing field of AI agent research.

vs. Probabilistic Tiny Recursive Model

gemini-3.1 · 5/20/2026

Paper 2 addresses a critical and rapidly growing challenge in AI: evaluating interactive LLM systems and agents. By proposing a foundational framework, taxonomy, and design principles, it has the potential to shape evaluation standards across the entire field. Paper 1, while demonstrating impressive methodological innovation and efficiency gains for small models on specific reasoning tasks, has a narrower scope and applicability compared to the field-wide relevance of redefining AI evaluation.

vs. Not all uncertainty is alike: volatility, stochasticity, and exploration

gemini-3.1 · 5/20/2026

Paper 2 provides a rigorous mathematical framework extending fundamental theories, introduces a novel algorithm, and offers broad interdisciplinary applications spanning artificial intelligence, cognitive science, and computational psychiatry. While Paper 1 is timely for LLM evaluation, Paper 2's methodological rigor, formal proofs, and cross-field theoretical impact give it a higher potential for deep, lasting scientific impact.

vs. SceneCode: Executable World Programs for Editable Indoor Scenes with Articulated Objects

gpt-5.2 · 5/20/2026

Paper 1 introduces a concrete, executable pipeline for generating editable indoor scenes with articulated, simulation-ready assets from language, addressing a clear technical bottleneck (static meshes, limited controllability) with an implemented system, validation/repair loop, and downstream robotics evaluation—high novelty, rigor, and immediate applicability to embodied AI, robotics, and simulation. Paper 2 is timely and potentially broad in influence, but as a position paper its impact is more indirect and depends on community adoption of proposed taxonomies/standards, with less methodological/empirical grounding.

vs. Latent Heuristic Search: Continuous Optimization for Automated Algorithm Design

gemini-3.1 · 5/19/2026

While Paper 1 offers a strong technical innovation in automated heuristic design, Paper 2 addresses a foundational, field-wide crisis in AI: the evaluation of interactive LLM agents. By proposing a unified taxonomy and design principles for interactive evaluation, Paper 2 has a much broader potential impact, as it will likely shape benchmarking standards and influence virtually all future research involving AI agents.

vs. From Prompts to Protocols: An AI Agent for Laboratory Automation

gemini-3.1 · 5/19/2026

Paper 1 proposes a foundational framework for evaluating interactive AI agents, addressing a critical, field-wide bottleneck. As the field shifts from static LLM benchmarks to autonomous agents, establishing standardized evaluation paradigms will impact nearly every sub-discipline of AI. While Paper 2 offers highly valuable real-world applications in laboratory automation, Paper 1's methodological contributions possess a broader scope. It has the potential to shape how all future interactive AI systems are tested, scored, and validated, giving it a higher potential for widespread scientific impact and foundational citations across the broader AI community.

vs. GraphMind: From Operational Traces to Self-Evolving Workflow Automation

gpt-5.2 · 5/19/2026

Paper 2 likely has higher scientific impact: it introduces a concrete, novel end-to-end system (trace-to-graph extraction + online multi-agent execution + reinforcement-based self-evolution) and demonstrates real-world deployment across multiple production services with strong quantitative and expert-evaluation gains over a baseline. This combination of methodological contribution, empirical validation, and immediate applicability to enterprise operations suggests broader adoption potential. Paper 1 is timely and valuable conceptually (taxonomy/standards for interactive evaluation) but, as a position paper, offers less direct evidence and fewer deployable artifacts, which may limit near-term measurable impact.

vs. Solvita: Enhancing Large Language Models for Competitive Programming via Agentic Evolution

gemini-3.1 · 5/19/2026

While Paper 1 presents a strong technical achievement in competitive programming, Paper 2 addresses a critical, field-wide challenge: the evaluation of interactive AI agents. By proposing foundational design principles, a taxonomy, and reporting standards for a new evaluation paradigm, Paper 2 has the potential to shape the methodology of future research across multiple domains, leading to broader and more enduring scientific impact.

vs. GuardAD: Safeguarding Autonomous Driving MLLMs via Markovian Safety Logic

claude-opus-4.6 · 5/19/2026

GuardAD presents a concrete, novel technical contribution—a Markovian neuro-symbolic safety framework for autonomous driving MLLMs—with strong empirical results (32% accident reduction, real-world vehicle validation). It addresses a critical safety problem with immediate practical applications. Paper 1, while intellectually valuable as a position paper proposing a taxonomy for interactive AI evaluation, lacks empirical validation and offers conceptual rather than technical contributions. Paper 2's combination of methodological novelty, rigorous experimentation across simulated and physical settings, and direct safety implications gives it broader and more immediate scientific impact.

vs. Unsteady Metrics and Benchmarking Cultures of AI Model Builders

gemini-3.1 · 5/19/2026

Paper 2 proposes a foundational theoretical framework and design principles for the emerging paradigm of interactive AI evaluation. By addressing the structural shift from static responses to agentic, trajectory-based systems, it provides a roadmap that will likely shape future methodological standards across the field. While Paper 1 offers a valuable empirical critique of current corporate benchmarking practices, Paper 2 has broader potential to drive methodological innovation and establish new scientific norms in AI evaluation.

vs. Discovering Ordinary Differential Equations with LLM-Based Qualitative and Quantitative Evaluation

claude-opus-4.6 · 5/19/2026

Paper 1 presents a concrete, novel method (DoLQ) combining LLMs with symbolic regression for ODE discovery, with experimental validation and code availability. It addresses a fundamental challenge in scientific machine learning with a practical, reproducible contribution. Paper 2 is a position paper proposing a conceptual framework for interactive AI evaluation—valuable but less immediately impactful as it lacks empirical validation. Paper 1's methodological innovation bridging LLM reasoning with equation discovery has broader cross-disciplinary applications in physics, biology, and engineering.

vs. Context, Reasoning, and Hierarchy: A Cost-Performance Study of Compound LLM Agent Design in an Adversarial POMDP

gpt-5.2 · 5/19/2026

Paper 1 has higher potential impact because it proposes a general evaluation paradigm for interactive/agentic AI, offering a taxonomy, design principles, and reporting standards that could reshape how many interactive benchmarks and deployments are evaluated across domains. Its breadth (applicable to tool use, multi-agent, robustness, coordination, recoverability) and timeliness (shift from static to trajectory-based systems) make it widely influential. Paper 2 is methodologically rigorous and practically useful, but its findings are more domain- and setting-specific (CybORG cyber POMDP) and thus likely narrower in cross-field impact.

vs. Reasoners or Translators? Contamination-aware Evaluation and Neuro-Symbolic Robustness in Tax Law

gpt-5.2 · 5/19/2026

Paper 1 targets a broad, timely shift in how LLMs are deployed (interactive, tool-using agents) and proposes a general evaluation paradigm with taxonomy, design principles, and reporting standards. This can reshape evaluation methodology across many domains (agents, HCI, RL, safety, reliability), giving it wide cross-field impact. Paper 2 is methodologically rigorous and practically important for legal AI, but its scope is narrower (tax law) and its main contributions (contamination-aware testing, neuro-symbolic robustness) are more domain-specific, limiting breadth despite strong real-world relevance.

vs. Ensemble Monitoring for AI Control: Diverse Signals Outweigh More Compute

gpt-5.2 · 5/19/2026

Paper 2 is likely to have higher near-term scientific impact: it presents concrete empirical results, a scalable method (diverse ensemble monitoring) with clear safety applications, and actionable findings (diversity/correlation, fine-tuning benefits, OOD performance) that others can reproduce and extend. Its methodological rigor and direct relevance to deployed agent safety make it timely and broadly useful across AI security, alignment, and software engineering. Paper 1 is conceptually novel and could shape evaluation thinking long-term, but as a position/design framework it is less immediately testable and may diffuse impact more slowly.

vs. When Outcome Looks Right But Discipline Fails: Trace-Based Evaluation Under Hidden Competitor State

gpt-5.2 · 5/19/2026

Paper 2 has higher estimated impact because it targets a broad, timely shift in AI/LLM evaluation toward interactive, tool-using agents and proposes a general paradigm, taxonomy, and reporting standards that can influence many benchmarks and communities. Its breadth spans evaluation methodology, agent systems, HCI, and safety/robustness, with high relevance to current deployment trends. Paper 1 is novel and rigorous with concrete benchmarks and a valuable trace-based diagnostic concept, but its scope is narrower (pricing/hidden-state competitor settings) and its main contribution is a specific evaluation paradigm likely to affect a smaller set of applied RL domains.