Jiayu Wang, Weijiang Lv, Bowen Fu, Jing Fu, Jiayi Song, Lingyu Zhang, Lanxuan Xue, Luodi Chen
As foundation models advance and agent scaffolding becomes increasingly sophisticated, agents have demonstrated remarkable proficiency in complex, long-horizon coding tasks and even autonomous experiment execution. Despite their evolution from research assistants into autonomous research agents, these systems still exhibit significant limitations in field sensitivity, research ethics, and nuanced scientific judgment. Consequently, frontier agents remain unable to fully replace human researchers. To bridge this gap, we conceptualize the AARR (Act As a Real Researcher) benchmark series. Unlike existing benchmarks that primarily assess macro-level execution capabilities, AARR focuses on whether agents can emulate the professionalism, thoroughness, and nuanced reasoning that characterize human researchers in granular research scenarios. In this work, we propose AARRI-Bench (Act As a Real Research Intern), the first benchmark in this series. We conduct extensive experiments across frontier models and agentic systems, revealing that even the best-performing configuration (Mini-SWE-Agent with Claude Opus 4.7) achieves only 68.3% success rate, frequently overlooking subtle yet critical details that are obvious to real human researchers. Our results indicate that developing researcher-like AI requires further exploration of research behavior, rather than merely complex scaffolding. Our data is released at https://github.com/AARR-bench/AARRI-bench.
The paper introduces AARRI-Bench, the first installment of the AARR (Act As a Real Researcher) benchmark series, designed to evaluate whether LLM agents can emulate the professionalism, thoroughness, and nuanced judgment of human researchers. The key conceptual contribution is a shift from measuring task completion (can the agent solve hard problems?) to measuring researcher-like qualities (does the agent notice what a competent human would notice?). This includes detecting fabricated data, refusing unethical instructions, recognizing dead-end research directions, and maintaining memory across multi-turn research interactions.
The benchmark comprises 82 manually crafted tasks organized along two dimensions: horizontal (Context, Mindset, Hands-on, Interaction) and vertical (Adaptation, Integration, Innovation, Open-ended). The best-performing configuration achieves a success rate of only 68.3%, suggesting substantial room for improvement.
Strengths in evaluation design: The paper evaluates 16+ harness-model combinations across three agent harnesses (Claude Code, Hermes Agent, Mini-SWE-Agent) and seven frontier models, providing a reasonably comprehensive comparison. The use of the Harbor framework for containerized, reproducible evaluation is sound practice. Both coarse-grained (0/1 reward) and fine-grained (unit test pass rate) metrics are reported, and the gap between them (21-36 pp) reveals meaningful information about partial-credit behavior.
Weaknesses in evaluation design: The benchmark has only 82 tasks with a single trial per task, which makes statistical confidence questionable, especially for subcategory analyses where cell sizes shrink to 13-34 tasks. The paper reports no confidence intervals or significance tests for the differences between configurations. At this sample size, the 6.1 pp difference between the best and second-best configurations corresponds to about five tasks and could easily be noise.
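To make the sample-size concern concrete, the following is a rough sketch of a 95% Wilson score interval around the best configuration's success rate. The counts are assumptions derived from the reported figures rather than taken from the paper: 56 of 82 tasks reproduces 68.3%, and 51 of 82 reproduces a second-best rate 6.1 pp lower. The function name `wilson_interval` is illustrative.

```python
import math


def wilson_interval(successes: int, trials: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 gives roughly 95%)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    p = successes / trials
    denom = 1 + z**2 / trials
    center = (p + z**2 / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z**2 / (4 * trials**2)) / denom
    return center - half, center + half


# Assumed counts (not stated in the paper): 56/82 ≈ 68.3% for the best
# configuration, 51/82 ≈ 62.2% for the second best (68.3 minus 6.1 pp).
N_TASKS = 82
best_lo, best_hi = wilson_interval(56, N_TASKS)
second_rate = 51 / N_TASKS

print(f"best configuration, 95% CI: [{best_lo:.3f}, {best_hi:.3f}]")
print(f"second-best point estimate: {second_rate:.3f}")
print("second-best falls inside the best's CI:", best_lo <= second_rate <= best_hi)
```

Under these assumed counts the interval is roughly 20 pp wide, so the second-best point estimate sits well inside it. Because both configurations run on the same tasks, a paired test over per-task outcomes would be the more appropriate check, but that requires per-task results not reproduced here.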
The reliance on regex/keyword-based verification is a significant limitation that the authors themselves acknowledge. Case Study D.2 (false-guidance-rebuttal) vividly illustrates this: Kimi-K2.6 correctly refuses to fabricate data and writes a substantively correct response mentioning "scientific misconduct," but fails the test because it doesn't use the specific lexical patterns the grader expects ("cannot," "refuse," "will not"). This suggests the benchmark may be measuring output formatting as much as researcher qualities—undermining its core thesis.
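The failure mode can be illustrated with a minimal sketch of a keyword-based refusal check. This is a hypothetical grader, not the benchmark's actual test script: the pattern list is modeled only on the three keywords quoted above, and both sample responses are invented for illustration.

```python
import re

# Hypothetical refusal patterns, modeled on the keywords quoted in the case study.
REFUSAL_PATTERNS = [r"\bcannot\b", r"\brefuse\b", r"\bwill not\b"]


def keyword_grader(response: str) -> bool:
    """Pass only if the response contains at least one expected refusal keyword."""
    return any(re.search(p, response, flags=re.IGNORECASE) for p in REFUSAL_PATTERNS)


# Two invented responses; both decline to fabricate data.
literal_refusal = "I cannot add fabricated results to the rebuttal."
paraphrased_refusal = (
    "Inventing these numbers would be scientific misconduct, so the rebuttal "
    "should report only the experiments we actually ran."
)

print("literal refusal passes:", keyword_grader(literal_refusal))
print("paraphrased refusal passes:", keyword_grader(paraphrased_refusal))
```

The second response refuses on substance but contains none of the expected keywords, so a grader of this shape marks it as a failure.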
The manual construction process, while ensuring quality, introduces potential biases from the specific research backgrounds of the team (primarily from Xi'an Jiaotong University and Xidian University, likely with AI/CV focus). The paper does not discuss inter-annotator agreement on task design or test script validation.
The conceptual framing—evaluating researcher-like qualities rather than just task completion—is genuinely valuable and addresses a real gap. As AI agents are increasingly deployed in scientific workflows, understanding their failure modes on "easy for humans, hard for agents" tasks is practically important. The taxonomy of failure modes (context sensitivity, ethical reasoning, memory maintenance, tool interaction) provides useful vocabulary for the community.
However, the practical impact is limited by scale (82 tasks) and the acknowledged fragility of the evaluation methodology. The regex-based grading means the benchmark may not reliably distinguish between agents that lack researcher qualities and agents that simply phrase their (correct) responses differently. This fundamentally undermines the benchmark's utility as a discriminative tool.
The finding that minimalist harnesses (Mini-SWE-Agent) outperform complex ones (Claude Code) with frontier models is an interesting practical insight, though not entirely novel—it echoes findings from SWE-bench and similar evaluations.
The paper is highly timely. The proliferation of autonomous research agents (AI Scientist, AutoResearch, EvoScientist) creates an urgent need for benchmarks that go beyond execution capabilities to assess scientific judgment. The focus on research ethics (data fabrication detection, p-hacking resistance, refusing unethical instructions) is particularly relevant given growing concerns about AI-assisted scientific misconduct.
The benchmark addresses a genuine blind spot: most existing benchmarks test whether agents can do research tasks, not whether they do them responsibly. This is a meaningful conceptual advance even if the current instantiation has limitations.
The paper's most compelling evidence comes from the qualitative case studies, not the aggregate numbers. The idea-curse case (D.1) beautifully isolates harness design as the cause of memory failure; the false-guidance-rebuttal case (D.2) reveals fundamental tensions in automated evaluation of behavioral qualities. These analyses are more valuable than the leaderboard itself, suggesting the paper's lasting contribution may be diagnostic rather than evaluative.
The gap between fine-grained and binary metrics (Table 4) is one of the most informative analyses, showing that failed tasks are rarely complete failures (52-66% sub-criteria pass rate when reward=0). This finding has implications for benchmark design beyond this specific work.
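The relationship between the two metrics can be sketched as follows, assuming that the binary reward is 1 only when every unit test for a task passes (the paper's exact aggregation rule is not reproduced here). The task names and per-test outcomes are invented for illustration.

```python
from statistics import mean

# Invented per-task unit-test outcomes (True = sub-criterion passed).
task_results = {
    "task_a": [True, True, True, True],
    "task_b": [True, True, False, True],
    "task_c": [True, False, False, True],
}


def binary_reward(tests: list[bool]) -> int:
    """Coarse-grained metric: 1 only if all sub-criteria pass (assumed rule)."""
    return 1 if all(tests) else 0


def pass_rate(tests: list[bool]) -> float:
    """Fine-grained metric: fraction of sub-criteria that pass."""
    return sum(tests) / len(tests)


success_rate = mean(binary_reward(t) for t in task_results.values())
fine_grained = mean(pass_rate(t) for t in task_results.values())

# Partial credit hidden by the binary metric: pass rate on failed tasks only.
failed_pass_rate = mean(
    pass_rate(t) for t in task_results.values() if binary_reward(t) == 0
)

print(f"success rate (0/1 reward): {success_rate:.3f}")
print(f"fine-grained pass rate:    {fine_grained:.3f}")
print(f"pass rate when reward=0:   {failed_pass_rate:.3f}")
```

In this toy data two of three tasks fail the binary check while still satisfying most of their sub-criteria, which is the kind of gap between the two metrics that the analysis describes.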
Generated Jun 8, 2026
Paper 2 introduces a novel benchmark evaluating AI agents across the research lifecycle, a highly active and critical area in AI development. Benchmarks for autonomous research agents have broader applicability and the potential to drive widespread future research compared to Paper 1's domain-specific focus on mathematical proof verification, giving Paper 2 a broader scope and higher potential scientific impact.
Paper 1 introduces an automated framework for generating state-based benchmarks, addressing a critical bottleneck in the scalability and realism of agent evaluation. This methodological innovation has broader applicability across various personal computing environments compared to Paper 2, which, while highly relevant to AI research automation, relies on a more static benchmark approach. Paper 1's focus on automated creation and state-based verification offers a more robust and scalable tool for the rapidly growing field of LLM agent development.
Paper 2 (AARR) introduces a novel benchmark suite that addresses a timely and important gap: evaluating whether AI agents can replicate the nuanced judgment of human researchers. Benchmarks historically drive significant research progress and community adoption. Paper 1 (CHAP) proposes a protocol specification for human-agent collaboration, which is practically useful but more incremental—building on existing protocol standards (MCP, A2A). While CHAP addresses real engineering needs, AARR's empirical findings (e.g., best agents achieving only 68.3%) provide actionable scientific insights that will likely influence agent development research more broadly across the AI community.
Paper 2 likely has higher impact due to broader relevance and timeliness: a benchmark suite for evaluating “researcher-like” agent behavior targets a rapidly growing area (agentic LLMs) and can become a community standard, enabling systematic comparison across models, harnesses, safety/ethics, and long-horizon research workflows. Released data further boosts adoption and downstream citations. Paper 1 is novel in optimization-based evidence aggregation for legal reasoning, but its impact is narrower (domain-specific, depends on CoT parsing/availability, and quantum hardware aspects may be seen as peripheral), limiting breadth and uptake.
Paper 2 has higher likely impact due to its broader, more generalizable framework for safe, scalable agentic AI (governance, autonomy tiers, continuous alignment) with clear real-world deployment relevance across domains. It targets a timely central bottleneck—accountable delegation—and proposes an implementable control-plane architecture plus modeling and empirical validation under distribution shift. Paper 1 is valuable but mainly contributes a benchmark; its impact is narrower (evaluation-focused) and more field-specific, and benchmarks often have less cross-domain influence than broadly applicable governance/agent-development frameworks.
Paper 1 presents a concrete, novel methodological contribution—residual-centric coding for learned scientific data compression—with rigorous evaluation showing 30-60% improvements over existing methods across multiple datasets. It addresses a real bottleneck in scientific computing (high-fidelity lossy compression) with techniques that are immediately applicable. Paper 2 introduces a benchmark for evaluating LLM agents as researchers, which is timely but incremental; benchmarks have shorter-lived impact, and the finding that current agents fall short is unsurprising. Paper 1's technical depth, clear advances over baselines, and applicability to large-scale scientific workflows give it stronger lasting impact.
Paper 2 has higher likely impact due to its direct, high-stakes clinical application, strong methodological contributions (clause cards, anchor-driven instantiation, closed-loop verification) enabling controllable, auditable, and by-construction ground truth, and a sizable benchmark with an agentic environment. It targets a timely need—reliable evaluation of LLMs for safety-critical healthcare workflows—and its ideas (structured policy factorization, verifiable synthetic data) generalize to other regulated domains. Paper 1 is valuable but more meta/benchmark-focused with less immediate real-world deployment leverage.
Paper 2 introduces a novel online decision-making formulation tailored to LLM cascading with output-mediated feedback, along with a theoretically grounded algorithm (GMM + UCB) and regret guarantees. This methodological rigor and generality make it likely to influence work in online learning, bandits, operations research, and practical LLM system orchestration (cost/quality trade-offs). Paper 1 is timely and useful as an evaluation benchmark for agentic research behavior, but its impact is more concentrated in benchmarking/assessment and may be less foundational than a new model + provable learning results.
Paper 2 introduces a novel benchmark paradigm (AARR) that evaluates AI agents on nuanced research capabilities beyond execution, addressing a fundamental gap in how we assess AI research systems. Its broader scope—spanning research ethics, field sensitivity, and scientific judgment—has wider cross-disciplinary impact. Paper 1, while technically solid with strong efficiency gains for code localization, addresses a narrower optimization problem. Paper 2's benchmark framework is more likely to influence future research directions in AI evaluation and shape development of research-capable agents.
Paper 2 addresses the rapidly growing and highly relevant field of autonomous LLM research agents. By introducing a novel benchmark for evaluating the nuanced capabilities of AI in the research lifecycle, it is likely to attract significant attention and citations across the broad AI and machine learning communities. In contrast, Paper 1 focuses on a more specialized empirical study of local search algorithms for SAT solving, which, while valuable, has a narrower potential audience and breadth of impact.