Interactive Evaluation Requires a Design Science
Keyang Xuan, Peiyang Song, Pan Lu, Pengrui Han, Wenkai Li, Zhenyu Zhang, Zexue He, Wenyue Hua
Abstract
AI evaluation is undergoing a structural change. Large language models (LLMs) are increasingly deployed as systems that act over time through tools, environments, users, and other agents, while many evaluation practices still inherit assumptions from response-centered benchmarks (e.g., fixed inputs, isolated outputs, and outcome judgments that can be made from a single response). The field has begun to build interactive benchmarks, but the resulting landscape is fragmented: benchmarks differ in what interaction artifacts they admit, how trajectories are scored, and what claims their results support. This position paper argues that interactive evaluation should be treated as a principled evaluation paradigm, not merely a new family of agent benchmarks. Simply adopting previous evaluation paradigms does not suffice. We define evaluation as an autonomous mapping from evidence to judgments, and show that interactive evaluation changes both sides of this mapping: the evidence becomes interaction-generated trajectories, while the evaluation procedure must assess process, recoverability, coordination, robustness, and system-level performance. Building on this definition, we propose a two-axis taxonomy, derive design principles and reporting standards, examine representative scenarios, and analyze how longstanding evaluation challenges reappear at the trajectory level.
AI Impact Assessments
Scientific Impact Assessment
1. Core Contribution
This position paper argues that interactive evaluation of AI systems—where agents act through tools, environments, users, and other agents over time—should be treated as a principled design science rather than a loosely connected family of agent benchmarks. The authors formalize evaluation as an autonomous mapping E: X → Y, where X is admissible evidence and Y is the space of evaluative judgments. They show that interactive evaluation changes both sides: X expands from single responses to interaction-generated trajectories, while E must assess process quality, recoverability, coordination, robustness, and safety beyond final-answer correctness.
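The shift on both sides of E: X → Y can be illustrated with a minimal Python sketch. It is not the authors' formalization: the types `Step`, `Trajectory`, and `Judgment`, the action names, and the `UNSAFE` set are all hypothetical, chosen only to show two runs that a response-centered evaluator cannot tell apart but a trajectory-level evaluator can.

```python
from dataclasses import dataclass
from typing import List, Set


@dataclass(frozen=True)
class Step:
    action: str           # hypothetical label for what the system did
    failed: bool = False  # whether the environment reported an error


@dataclass(frozen=True)
class Trajectory:
    steps: List[Step]
    final_answer: str


@dataclass(frozen=True)
class Judgment:
    task_success: bool
    failed_steps: int
    unsafe_actions: int


def evaluate_response(final_answer: str, reference: str) -> bool:
    """Response-centered E: X is a single output, Y is pass/fail."""
    return final_answer.strip() == reference.strip()


def evaluate_trajectory(traj: Trajectory, reference: str, unsafe: Set[str]) -> Judgment:
    """Interactive E: X is the whole trajectory, Y has several dimensions."""
    return Judgment(
        task_success=evaluate_response(traj.final_answer, reference),
        failed_steps=sum(step.failed for step in traj.steps),
        unsafe_actions=sum(step.action in unsafe for step in traj.steps),
    )


# Two illustrative runs that end in the same final answer.
careful = Trajectory([Step("read_config"), Step("run_tests")], "42")
reckless = Trajectory(
    [Step("delete_backup"), Step("run_tests", failed=True), Step("run_tests")],
    "42",
)
UNSAFE = {"delete_backup"}  # assumed policy, for illustration only

same_outcome = (
    evaluate_response(careful.final_answer, "42")
    == evaluate_response(reckless.final_answer, "42")
)
careful_judgment = evaluate_trajectory(careful, "42", UNSAFE)
reckless_judgment = evaluate_trajectory(reckless, "42", UNSAFE)
```

Both runs pass the response-centered check, while the trajectory-level judgments differ in failed steps and unsafe actions. An outcome-only score discards exactly that difference.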
The paper's main intellectual contribution is a two-axis taxonomy organizing interactive evaluations by (1) what interaction artifacts enter as evidence (tools/environments, users, other agents, hybrid/dynamic systems) and (2) what evaluation programs map trajectories to judgments (task success, process quality, recoverability, safety/alignment). By mapping existing benchmarks into this 2D space, the authors identify systematic gaps—particularly the dominance of outcome-only scoring even when rich trajectory data is available, and the sparsity of hybrid/dynamic system evaluations.
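One way to picture the gap analysis is as a coverage count over the 4 × 4 grid formed by the two axes. The sketch below uses the axis labels from the description above, but the benchmark names and their placements are invented placeholders, not the paper's actual mapping. Assigning each benchmark to a single cell is also a simplifying assumption.

```python
from collections import Counter
from itertools import product
from typing import Dict, List, Tuple

EVIDENCE = ["tools/environments", "users", "other agents", "hybrid/dynamic systems"]
PROGRAMS = ["task success", "process quality", "recoverability", "safety/alignment"]

Cell = Tuple[str, str]

# Hypothetical placements; the names do not refer to real benchmarks.
placements: Dict[str, Cell] = {
    "bench_a": ("tools/environments", "task success"),
    "bench_b": ("tools/environments", "task success"),
    "bench_c": ("users", "task success"),
    "bench_d": ("other agents", "process quality"),
}


def coverage(placed: Dict[str, Cell]) -> Dict[Cell, int]:
    """Count benchmarks per (evidence, program) cell, including empty cells."""
    for name, (evidence, program) in placed.items():
        if evidence not in EVIDENCE or program not in PROGRAMS:
            raise ValueError(f"{name} is placed outside the taxonomy")
    counts = Counter(placed.values())
    return {cell: counts.get(cell, 0) for cell in product(EVIDENCE, PROGRAMS)}


def gaps(placed: Dict[str, Cell]) -> List[Cell]:
    """Cells of the grid that no benchmark occupies."""
    return [cell for cell, n in coverage(placed).items() if n == 0]


grid = coverage(placements)
empty_cells = gaps(placements)
outcome_only_share = sum(
    1 for _, program in placements.values() if program == "task success"
) / len(placements)
```

With real placements, `empty_cells` would list the unexplored regions of the grid, and `outcome_only_share` would quantify how far outcome-only scoring dominates.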
2. Methodological Rigor
As a position paper, methodological rigor manifests differently than in empirical work. The paper's conceptual framework is clean and well-motivated. The E: X → Y formalism is simple but effective—it clarifies exactly where interactive evaluation departs from response-centered evaluation without overcomplicating the argument.
The taxonomy is derived systematically from the definition rather than imposed ad hoc. The boundary cases (multi-turn without action-dependence, tool calls without state change, chain-of-thought without external loops) are well-chosen and sharpen the definition's scope. The paper also includes a semi-automated benchmark collection methodology with quality filters (citation velocity, GitHub stars, venue) and LLM-based classification validated against manual labels (>90% agreement), lending some empirical grounding to the landscape analysis.
However, the empirical analysis remains largely descriptive. The χ² test comparing industry vs. academic evaluation-stage distributions (p=0.029) is suggestive but based on a relatively small industry sample (43 benchmark families from 4 companies). The 2D taxonomy mapping in Figure 3 relies on the authors' categorization of benchmarks, which, while reasonable, involves subjective judgment calls. The paper would benefit from inter-annotator agreement metrics on the taxonomy placement.
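The missing inter-annotator metric could take the form of Cohen's kappa, which corrects raw agreement for the agreement expected by chance. The sketch below is a plain-Python implementation of the standard formula. The label lists are invented for illustration and are not data from the paper.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohen_kappa(a: Sequence[Hashable], b: Sequence[Hashable]) -> float:
    """Cohen's kappa for two annotators labelling the same items."""
    if len(a) != len(b) or len(a) == 0:
        raise ValueError("both annotators must label the same non-empty item set")
    n = len(a)
    # Observed agreement: share of items given identical labels.
    observed = sum(x == y for x, y in zip(a, b)) / n
    # Chance agreement: product of each annotator's marginal label frequencies.
    freq_a, freq_b = Counter(a), Counter(b)
    expected = sum(freq_a[label] * freq_b[label] for label in set(a) | set(b)) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1.0 - expected)


# Invented example: 90% raw agreement on a heavily skewed label set.
annotator_1 = ["outcome"] * 9 + ["process"]
annotator_2 = ["outcome"] * 10

raw_agreement = sum(x == y for x, y in zip(annotator_1, annotator_2)) / len(annotator_1)
kappa_skewed = cohen_kappa(annotator_1, annotator_2)
kappa_perfect = cohen_kappa(["outcome", "process"], ["outcome", "process"])
```

In the skewed example, raw agreement is 0.9 while kappa is 0, because an annotator who always answers "outcome" agrees that often by chance alone. A reported agreement above 90% is therefore easier to interpret when accompanied by a chance-corrected statistic, especially if most benchmarks fall into the outcome-only category.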
3. Potential Impact
The paper's impact potential is substantial but depends heavily on community adoption. Its contributions operate at three levels:
Conceptual clarity: The distinction between "recording trajectories" and "evaluating at the trajectory level" is genuinely important. Many current benchmarks collect rich interaction data but collapse it into pass/fail scores, discarding evidence about process quality, safety, and robustness. Making this explicit could shift how benchmark designers think about scoring.
Practical guidance: The design principles (specify the system and trajectory evidence, specify interaction protocols, design for perturbation and repair, separate outcome/process/risk) are actionable. The call for protocol documentation as "the interactive analogue of dataset documentation" is apt and could lead to standardized reporting practices.
Community organization: By providing a shared vocabulary and taxonomy, the paper could help reduce fragmentation across web agents, coding agents, multi-agent systems, and tool-use benchmarks that currently develop evaluation practices independently.
The two illustrative scenarios (coding agents, multi-agent social systems) in Appendix D effectively demonstrate how the framework applies in practice, though they remain at the conceptual level without implementing the proposed evaluation programs.
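To make "design for perturbation and repair" and the separation of outcome from process concrete, the following sketch computes a recovery rate from a run with injected faults and reports it alongside the final outcome instead of merging the two. The status vocabulary ("ok", "fault", "stuck"), the recovery window, and the `RunReport` type are assumptions made for illustration. The paper's scenarios do not specify an implementation.

```python
from dataclasses import dataclass
from typing import List, Optional


def recovery_rate(statuses: List[str], window: int = 2) -> Optional[float]:
    """Share of faults followed by an "ok" step within `window` steps.

    Returns None when the run contains no fault, so that "nothing went
    wrong" is not confused with "recovered from everything".
    """
    fault_positions = [i for i, status in enumerate(statuses) if status == "fault"]
    if not fault_positions:
        return None
    recovered = sum(
        any(status == "ok" for status in statuses[i + 1 : i + 1 + window])
        for i in fault_positions
    )
    return recovered / len(fault_positions)


@dataclass(frozen=True)
class RunReport:
    outcome: bool               # did the run end in task success
    recovery: Optional[float]   # kept separate, never folded into outcome


def report(statuses: List[str], succeeded: bool, window: int = 2) -> RunReport:
    return RunReport(outcome=succeeded, recovery=recovery_rate(statuses, window))


# Illustrative run: two injected faults, only the first is repaired in time.
perturbed = ["ok", "fault", "stuck", "ok", "fault", "stuck", "stuck", "stuck"]
clean = ["ok", "ok", "ok"]

perturbed_report = report(perturbed, succeeded=False)
clean_report = report(clean, succeeded=True)
```

Keeping `recovery` as its own field means a failed run that repaired half of its faults remains distinguishable from one that repaired none, which a single pass/fail score would hide.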
4. Timeliness & Relevance
The paper is highly timely. The rapid proliferation of agent benchmarks (WebArena, SWE-bench, τ-bench, SOTOPIA, etc.) has created exactly the kind of fragmentation the authors diagnose. The field is at an inflection point where agent evaluation is growing quickly but without shared conceptual infrastructure. The observation that many benchmarks admit trajectory evidence but score only outcomes captures a real and consequential gap.
The industry-versus-academic divergence highlighted in Figure 1 is also relevant: frontier labs increasingly evaluate interactive capabilities, while academic benchmarks remain concentrated on response-centered evaluation. This divergence creates a practical need for the kind of principled framework proposed here.
5. Strengths & Limitations
Strengths:
Limitations:
6. Additional Observations
The paper's framing as "design science" (citing Simon, Hevner, Wieringa) is interesting but underexplored. Design science methodology has specific prescriptions (design artifacts, evaluation cycles, communication protocols) that could have been more directly applied. The invocation feels more metaphorical than methodological.
The GitHub repository is referenced but its contents and utility for the community are unclear from the paper. Shared infrastructure—logging schemas, trajectory viewers, evaluation harnesses—is called for but not provided.
Generated May 19, 2026
Comparison History (16)
Paper 1 addresses a foundational, field-wide challenge by proposing a new paradigm and taxonomy for evaluating interactive AI systems. While Paper 2 offers a strong methodological improvement for training agents, Paper 1 has broader implications for how the entire community benchmarks, designs, and assesses future interactive LLMs, giving it higher potential for widespread scientific impact.
Paper 2 proposes a fundamental paradigm shift and theoretical framework for the entire field of interactive AI evaluation. While Paper 1 introduces a valuable and timely benchmark for privacy in LLM agents, Paper 2's methodological contributions, design principles, and taxonomy have the potential to influence the creation of all future interactive benchmarks, giving it a significantly broader impact across the rapidly growing field of AI agent research.
Paper 2 addresses a critical and rapidly growing challenge in AI: evaluating interactive LLM systems and agents. By proposing a foundational framework, taxonomy, and design principles, it has the potential to shape evaluation standards across the entire field. Paper 1, while demonstrating impressive methodological innovation and efficiency gains for small models on specific reasoning tasks, has a narrower scope and applicability compared to the field-wide relevance of redefining AI evaluation.
Paper 2 provides a rigorous mathematical framework extending fundamental theories, introduces a novel algorithm, and offers broad interdisciplinary applications spanning artificial intelligence, cognitive science, and computational psychiatry. While Paper 1 is timely for LLM evaluation, Paper 2's methodological rigor, formal proofs, and cross-field theoretical impact give it a higher potential for deep, lasting scientific impact.
Paper 1 introduces a concrete, executable pipeline for generating editable indoor scenes with articulated, simulation-ready assets from language, addressing a clear technical bottleneck (static meshes, limited controllability) with an implemented system, validation/repair loop, and downstream robotics evaluation—high novelty, rigor, and immediate applicability to embodied AI, robotics, and simulation. Paper 2 is timely and potentially broad in influence, but as a position paper its impact is more indirect and depends on community adoption of proposed taxonomies/standards, with less methodological/empirical grounding.
While Paper 1 offers a strong technical innovation in automated heuristic design, Paper 2 addresses a foundational, field-wide crisis in AI: the evaluation of interactive LLM agents. By proposing a unified taxonomy and design principles for interactive evaluation, Paper 2 has a much broader potential impact, as it will likely shape benchmarking standards and influence virtually all future research involving AI agents.
Paper 1 proposes a foundational framework for evaluating interactive AI agents, addressing a critical, field-wide bottleneck. As the field shifts from static LLM benchmarks to autonomous agents, establishing standardized evaluation paradigms will impact nearly every sub-discipline of AI. While Paper 2 offers highly valuable real-world applications in laboratory automation, Paper 1's methodological contributions possess a broader scope. It has the potential to shape how all future interactive AI systems are tested, scored, and validated, giving it a higher potential for widespread scientific impact and foundational citations across the broader AI community.
Paper 2 likely has higher scientific impact: it introduces a concrete, novel end-to-end system (trace-to-graph extraction + online multi-agent execution + reinforcement-based self-evolution) and demonstrates real-world deployment across multiple production services with strong quantitative and expert-evaluation gains over a baseline. This combination of methodological contribution, empirical validation, and immediate applicability to enterprise operations suggests broader adoption potential. Paper 1 is timely and valuable conceptually (taxonomy/standards for interactive evaluation) but, as a position paper, offers less direct evidence and fewer deployable artifacts, which may limit near-term measurable impact.
While Paper 1 presents a strong technical achievement in competitive programming, Paper 2 addresses a critical, field-wide challenge: the evaluation of interactive AI agents. By proposing foundational design principles, a taxonomy, and reporting standards for a new evaluation paradigm, Paper 2 has the potential to shape the methodology of future research across multiple domains, leading to broader and more enduring scientific impact.
GuardAD presents a concrete, novel technical contribution—a Markovian neuro-symbolic safety framework for autonomous driving MLLMs—with strong empirical results (32% accident reduction, real-world vehicle validation). It addresses a critical safety problem with immediate practical applications. Paper 1, while intellectually valuable as a position paper proposing a taxonomy for interactive AI evaluation, lacks empirical validation and offers conceptual rather than technical contributions. Paper 2's combination of methodological novelty, rigorous experimentation across simulated and physical settings, and direct safety implications gives it broader and more immediate scientific impact.
Paper 2 proposes a foundational theoretical framework and design principles for the emerging paradigm of interactive AI evaluation. By addressing the structural shift from static responses to agentic, trajectory-based systems, it provides a roadmap that will likely shape future methodological standards across the field. While Paper 1 offers a valuable empirical critique of current corporate benchmarking practices, Paper 2 has broader potential to drive methodological innovation and establish new scientific norms in AI evaluation.
Paper 1 presents a concrete, novel method (DoLQ) combining LLMs with symbolic regression for ODE discovery, with experimental validation and code availability. It addresses a fundamental challenge in scientific machine learning with a practical, reproducible contribution. Paper 2 is a position paper proposing a conceptual framework for interactive AI evaluation—valuable but less immediately impactful as it lacks empirical validation. Paper 1's methodological innovation bridging LLM reasoning with equation discovery has broader cross-disciplinary applications in physics, biology, and engineering.
Paper 1 has higher potential impact because it proposes a general evaluation paradigm for interactive/agentic AI, offering a taxonomy, design principles, and reporting standards that could reshape how many interactive benchmarks and deployments are evaluated across domains. Its breadth (applicable to tool use, multi-agent, robustness, coordination, recoverability) and timeliness (shift from static to trajectory-based systems) make it widely influential. Paper 2 is methodologically rigorous and practically useful, but its findings are more domain- and setting-specific (CybORG cyber POMDP) and thus likely narrower in cross-field impact.
Paper 1 targets a broad, timely shift in how LLMs are deployed (interactive, tool-using agents) and proposes a general evaluation paradigm with taxonomy, design principles, and reporting standards. This can reshape evaluation methodology across many domains (agents, HCI, RL, safety, reliability), giving it wide cross-field impact. Paper 2 is methodologically rigorous and practically important for legal AI, but its scope is narrower (tax law) and its main contributions (contamination-aware testing, neuro-symbolic robustness) are more domain-specific, limiting breadth despite strong real-world relevance.
Paper 2 is likely to have higher near-term scientific impact: it presents concrete empirical results, a scalable method (diverse ensemble monitoring) with clear safety applications, and actionable findings (diversity/correlation, fine-tuning benefits, OOD performance) that others can reproduce and extend. Its methodological rigor and direct relevance to deployed agent safety make it timely and broadly useful across AI security, alignment, and software engineering. Paper 1 is conceptually novel and could shape evaluation thinking long-term, but as a position/design framework it is less immediately testable and may diffuse impact more slowly.
Paper 2 has higher estimated impact because it targets a broad, timely shift in AI/LLM evaluation toward interactive, tool-using agents and proposes a general paradigm, taxonomy, and reporting standards that can influence many benchmarks and communities. Its breadth spans evaluation methodology, agent systems, HCI, and safety/robustness, with high relevance to current deployment trends. Paper 1 is novel and rigorous with concrete benchmarks and a valuable trace-based diagnostic concept, but its scope is narrower (pricing/hidden-state competitor settings) and its main contribution is a specific evaluation paradigm likely to affect a smaller set of applied RL domains.