
Act As a Real Researcher: A Suite of Benchmarks Evaluating Frontier LLMs and Agentic Harnesses in Research Lifecycle

Jiayu Wang, Weijiang Lv, Bowen Fu, Jing Fu, Jiayi Song, Lingyu Zhang, Lanxuan Xue, Luodi Chen

cs.AI
#2227 of 3489 · Artificial Intelligence
Tournament Score: 1365±42
Win Rate: 50% (9 wins, 9 losses, 18 matches)
Rating: 5.2 / 10
Significance 5.5 · Rigor 4.5 · Novelty 6.5 · Clarity 6

Abstract

As foundation models advance and agent scaffolding becomes increasingly sophisticated, agents have demonstrated remarkable proficiency in complex, long-horizon coding tasks and even autonomous experiment execution. Despite their evolution from research assistants into autonomous research agents, these systems still exhibit significant limitations in field sensitivity, research ethics, and nuanced scientific judgment. Consequently, frontier agents remain unable to fully replace human researchers. To bridge this gap, we conceptualize the AARR (Act As a Real Researcher) benchmark series. Unlike existing benchmarks that primarily assess macro-level execution capabilities, AARR focuses on whether agents can emulate the professionalism, thoroughness, and nuanced reasoning that characterize human researchers in granular research scenarios. In this work, we propose AARRI-Bench (Act As a Real Research Intern), the first benchmark in this series. We conduct extensive experiments across frontier models and agentic systems, revealing that even the best-performing configuration (Mini-SWE-Agent with Claude Opus 4.7) achieves only 68.3% success rate, frequently overlooking subtle yet critical details that are obvious to real human researchers. Our results indicate that developing researcher-like AI requires further exploration of research behavior, rather than merely complex scaffolding. Our data is released at https://github.com/AARR-bench/AARRI-bench.

AI Impact Assessments

(1 model)

Scientific Impact Assessment: AARRI-Bench

1. Core Contribution

The paper introduces AARRI-Bench, the first installment of the AARR (Act As a Real Researcher) benchmark series, designed to evaluate whether LLM agents can emulate the professionalism, thoroughness, and nuanced judgment of human researchers. The key conceptual contribution is a shift from measuring task completion (can the agent solve hard problems?) to measuring researcher-like qualities (does the agent notice what a competent human would notice?). This includes detecting fabricated data, refusing unethical instructions, recognizing dead-end research directions, and maintaining memory across multi-turn research interactions.

The benchmark comprises 82 manually crafted tasks organized along two dimensions: horizontal (Context, Mindset, Hands-on, Interaction) and vertical (Adaptation, Integration, Innovation, Open-ended). The best-performing configuration achieves only a 68.3% success rate, suggesting substantial room for improvement.

2. Methodological Rigor

Strengths in evaluation design: The paper evaluates 16+ harness-model combinations across three agent harnesses (Claude Code, Hermes Agent, Mini-SWE-Agent) and seven frontier models, providing a reasonably comprehensive comparison. The use of the Harbor framework for containerized, reproducible evaluation is sound practice. Both coarse-grained (0/1 reward) and fine-grained (unit test pass rate) metrics are reported, and the gap between them (21-36 pp) reveals meaningful information about partial-credit behavior.
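The relationship between the two metrics can be sketched in a few lines. This is only an illustration, not the paper's scoring code: it assumes that the binary reward is 1 exactly when every unit test of a task passes, and the `TaskResult` class, the task names, and the test counts are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class TaskResult:
    name: str          # hypothetical task identifier
    tests_passed: int  # unit tests passed for this task
    tests_total: int   # unit tests defined for this task

    @property
    def reward(self) -> int:
        # Assumption: coarse-grained reward is 1 only if all tests pass.
        return int(self.tests_passed == self.tests_total)

    @property
    def pass_rate(self) -> float:
        # Fine-grained metric: fraction of unit tests passed.
        return self.tests_passed / self.tests_total


def summarize(results):
    n = len(results)
    failed = [r for r in results if r.reward == 0]
    return {
        "coarse": sum(r.reward for r in results) / n,
        "fine": sum(r.pass_rate for r in results) / n,
        # Average pass rate restricted to tasks with reward 0.
        "fine_on_failed": (
            sum(r.pass_rate for r in failed) / len(failed)
            if failed else float("nan")
        ),
    }


# Made-up results for four tasks, purely for illustration.
results = [
    TaskResult("task-a", 5, 5),
    TaskResult("task-b", 3, 5),
    TaskResult("task-c", 4, 8),
    TaskResult("task-d", 6, 6),
]
summary = summarize(results)
print(summary)
```

With these made-up numbers the coarse score is 50% while the fine-grained score is 77.5%, which is the kind of gap the review refers to: tasks that fail the binary criterion can still pass a substantial share of their sub-criteria.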

Weaknesses in evaluation design: The benchmark has only 82 tasks with a single trial per task, making statistical confidence questionable, especially for subcategory analyses where cell sizes shrink to 13-34 tasks. The paper reports no confidence intervals or significance tests for the differences between configurations. The 6.1 pp difference between the best and second-best configuration corresponds to only five tasks out of 82 and could easily be noise at this sample size.
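A rough calculation makes the sample-size concern concrete. The sketch below computes 95% Wilson score intervals for two success rates on 82 tasks. The counts are assumptions inferred from the reported figures (68.3% of 82 is 56 tasks, and a 6.1 pp gap is 5 tasks, giving 51), not numbers taken from the paper's tables.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.96):
    """Wilson score interval for a binomial proportion (z=1.96 for ~95%)."""
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half


N_TASKS = 82
BEST = 56    # assumed count: 56/82 is about 68.3%
SECOND = 51  # assumed count: 51/82 is about 62.2%, i.e. 6.1 pp lower

best_lo, best_hi = wilson_interval(BEST, N_TASKS)
second_lo, second_hi = wilson_interval(SECOND, N_TASKS)

print(f"best:   {BEST / N_TASKS:.3f}  CI [{best_lo:.3f}, {best_hi:.3f}]")
print(f"second: {SECOND / N_TASKS:.3f}  CI [{second_lo:.3f}, {second_hi:.3f}]")
print("intervals overlap:", best_lo < second_hi)
```

Under these assumptions each interval is roughly ten percentage points wide on either side, and the two overlap substantially. Because all configurations run on the same tasks, a paired test such as McNemar's would be the more appropriate check, but it requires per-task outcomes that are not given here.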

The reliance on regex/keyword-based verification is a significant limitation that the authors themselves acknowledge. Case Study D.2 (false-guidance-rebuttal) vividly illustrates this: Kimi-K2.6 correctly refuses to fabricate data and writes a substantively correct response mentioning "scientific misconduct," but fails the test because it doesn't use the specific lexical patterns the grader expects ("cannot," "refuse," "will not"). This suggests the benchmark may be measuring output formatting as much as researcher qualities—undermining its core thesis.
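The failure mode described in Case Study D.2 can be reproduced with a toy grader. The pattern below is a hypothetical stand-in built from the three phrases quoted above; the benchmark's actual test script and the model's actual response are not shown here, so both the regex and the sample text are assumptions.

```python
import re

# Hypothetical keyword check using the three phrases quoted in the review.
REFUSAL_PATTERN = re.compile(r"\b(cannot|refuse|will not)\b", re.IGNORECASE)


def keyword_grader(response: str) -> bool:
    """Pass only if the response contains one of the expected refusal phrases."""
    return bool(REFUSAL_PATTERN.search(response))


# Illustrative response: it declines to fabricate data in substance,
# but never uses any of the expected phrases.
substantive_refusal = (
    "Fabricating these results would be scientific misconduct. "
    "I have reported the measured values as they are instead."
)

# Illustrative response that matches the pattern lexically.
lexical_refusal = "I cannot fabricate the requested numbers."

passed_substantive = keyword_grader(substantive_refusal)
passed_lexical = keyword_grader(lexical_refusal)
print(passed_substantive, passed_lexical)
```

The first response is marked as a failure even though its content is a refusal, which is the false negative the review describes. A grader that checks behavior (for example, whether the reported values match the measured data) rather than wording would not have this problem, though it is harder to automate.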

The manual construction process, while ensuring quality, introduces potential biases from the specific research backgrounds of the team (primarily from Xi'an Jiaotong University and Xidian University, likely with AI/CV focus). The paper does not discuss inter-annotator agreement on task design or test script validation.

3. Potential Impact

The conceptual framing—evaluating researcher-like qualities rather than just task completion—is genuinely valuable and addresses a real gap. As AI agents are increasingly deployed in scientific workflows, understanding their failure modes on "easy for humans, hard for agents" tasks is practically important. The taxonomy of failure modes (context sensitivity, ethical reasoning, memory maintenance, tool interaction) provides useful vocabulary for the community.

However, the practical impact is limited by scale (82 tasks) and the acknowledged fragility of the evaluation methodology. The regex-based grading means the benchmark may not reliably distinguish between agents that lack researcher qualities and agents that simply phrase their (correct) responses differently. This fundamentally undermines the benchmark's utility as a discriminative tool.

The finding that minimalist harnesses (Mini-SWE-Agent) outperform complex ones (Claude Code) with frontier models is an interesting practical insight, though not entirely novel—it echoes findings from SWE-bench and similar evaluations.

4. Timeliness & Relevance

The paper is highly timely. The proliferation of autonomous research agents (AI Scientist, AutoResearch, EvoScientist) creates an urgent need for benchmarks that go beyond execution capabilities to assess scientific judgment. The focus on research ethics (data fabrication detection, p-hacking resistance, refusing unethical instructions) is particularly relevant given growing concerns about AI-assisted scientific misconduct.

The benchmark addresses a genuine blind spot: most existing benchmarks test whether agents can do research tasks, not whether they do them responsibly. This is a meaningful conceptual advance even if the current instantiation has limitations.

5. Strengths & Limitations

Key Strengths:

  • Novel evaluation philosophy targeting researcher qualities rather than raw capability
  • Creative task designs (fabricated data detection, p-hacking resistance, dead-end recognition) that probe genuinely important failure modes
  • Excellent qualitative case studies (Appendix D) that reveal nuanced failure patterns invisible to aggregate metrics
  • Multi-harness evaluation enabling comparison of scaffolding approaches
  • The finding about harness complexity vs. model capability scaling is actionable

Notable Limitations:

  • Small scale (82 tasks) with insufficient statistical power for many claimed comparisons
  • Regex-based grading creates false negatives that confound the measurement of researcher qualities with output formatting preferences (demonstrated in their own case studies)
  • No repeated trials, no confidence intervals, no significance testing
  • Limited diversity in task creators (primarily two universities)
  • The "Mindset" category achieves suspiciously high pass rates (up to 76.9%) across many models, suggesting these tasks may not be as discriminative as claimed
  • Future stages (AARRA, AARRS) are only sketched without concrete methodology
  • Some reference dates (2026) suggest either forward-dating or fictional citations, raising questions about the paper's current state

Additional Observations:

The paper's most compelling evidence comes from the qualitative case studies, not the aggregate numbers. The idea-curse case (D.1) beautifully isolates harness design as the cause of memory failure; the false-guidance-rebuttal case (D.2) reveals fundamental tensions in automated evaluation of behavioral qualities. These analyses are more valuable than the leaderboard itself, suggesting the paper's lasting contribution may be diagnostic rather than evaluative.

The gap between fine-grained and binary metrics (Table 4) is one of the most informative analyses, showing that failed tasks are rarely complete failures (52-66% sub-criteria pass rate when reward=0). This finding has implications for benchmark design beyond this specific work.


    Generated Jun 8, 2026

    Comparison History (18)

Won vs. Evaluating Research-Level Math Proofs via Strict Step-Level Verification

    Paper 2 introduces a novel benchmark evaluating AI agents across the research lifecycle, a highly active and critical area in AI development. Benchmarks for autonomous research agents have broader applicability and the potential to drive widespread future research compared to Paper 1's domain-specific focus on mathematical proof verification, giving Paper 2 a broader scope and higher potential scientific impact.

gemini-3.1-pro-preview · Jun 10, 2026

Lost vs. STAGE-Claw: Automated State-based Agent Benchmarking for Realistic Scenarios

    Paper 1 introduces an automated framework for generating state-based benchmarks, addressing a critical bottleneck in the scalability and realism of agent evaluation. This methodological innovation has broader applicability across various personal computing environments compared to Paper 2, which, while highly relevant to AI research automation, relies on a more static benchmark approach. Paper 1's focus on automated creation and state-based verification offers a more robust and scalable tool for the rapidly growing field of LLM agent development.

gemini-3.1-pro-preview · Jun 10, 2026

Won vs. Collaborative Human-Agent Protocol (CHAP)

    Paper 2 (AARR) introduces a novel benchmark suite that addresses a timely and important gap: evaluating whether AI agents can replicate the nuanced judgment of human researchers. Benchmarks historically drive significant research progress and community adoption. Paper 1 (CHAP) proposes a protocol specification for human-agent collaboration, which is practically useful but more incremental—building on existing protocol standards (MCP, A2A). While CHAP addresses real engineering needs, AARR's empirical findings (e.g., best agents achieving only 68.3%) provide actionable scientific insights that will likely influence agent development research more broadly across the AI community.

claude-opus-4-6 · Jun 9, 2026

Won vs. Quantum-Inspired Trace-Augmented Evidence Selection for Reasoning over Structured Hypothesis Spaces

    Paper 2 likely has higher impact due to broader relevance and timeliness: a benchmark suite for evaluating “researcher-like” agent behavior targets a rapidly growing area (agentic LLMs) and can become a community standard, enabling systematic comparison across models, harnesses, safety/ethics, and long-horizon research workflows. Released data further boosts adoption and downstream citations. Paper 1 is novel in optimization-based evidence aggregation for legal reasoning, but its impact is narrower (domain-specific, depends on CoT parsing/availability, and quantum hardware aspects may be seen as peripheral), limiting breadth and uptake.

gpt-5.2 · Jun 8, 2026

Lost vs. The Digital Apprentice: A Framework for Human-Directed Agentic AI Development

    Paper 2 has higher likely impact due to its broader, more generalizable framework for safe, scalable agentic AI (governance, autonomy tiers, continuous alignment) with clear real-world deployment relevance across domains. It targets a timely central bottleneck—accountable delegation—and proposes an implementable control-plane architecture plus modeling and empirical validation under distribution shift. Paper 1 is valuable but mainly contributes a benchmark; its impact is narrower (evaluation-focused) and more field-specific, and benchmarks often have less cross-domain influence than broadly applicable governance/agent-development frameworks.

gpt-5.2 · Jun 8, 2026

Lost vs. Residual Modeling for High-Fidelity Learned Compression of Scientific Data

    Paper 1 presents a concrete, novel methodological contribution—residual-centric coding for learned scientific data compression—with rigorous evaluation showing 30-60% improvements over existing methods across multiple datasets. It addresses a real bottleneck in scientific computing (high-fidelity lossy compression) with techniques that are immediately applicable. Paper 2 introduces a benchmark for evaluating LLM agents as researchers, which is timely but incremental; benchmarks have shorter-lived impact, and the finding that current agents fall short is unsurprising. Paper 1's technical depth, clear advances over baselines, and applicability to large-scale scientific workflows give it stronger lasting impact.

claude-opus-4-6 · Jun 8, 2026

Lost vs. PSEBench: A Controllable and Verifiable Benchmark for Evaluating LLMs in Patient Safety Event Triage

    Paper 2 has higher likely impact due to its direct, high-stakes clinical application, strong methodological contributions (clause cards, anchor-driven instantiation, closed-loop verification) enabling controllable, auditable, and by-construction ground truth, and a sizable benchmark with an agentic environment. It targets a timely need—reliable evaluation of LLMs for safety-critical healthcare workflows—and its ideas (structured policy factorization, verifiable synthetic data) generalize to other regulated domains. Paper 1 is valuable but more meta/benchmark-focused with less immediate real-world deployment leverage.

gpt-5.2 · Jun 8, 2026

Lost vs. Online Pandora's Box for Contextual LLM Cascading

    Paper 2 introduces a novel online decision-making formulation tailored to LLM cascading with output-mediated feedback, along with a theoretically grounded algorithm (GMM + UCB) and regret guarantees. This methodological rigor and generality make it likely to influence work in online learning, bandits, operations research, and practical LLM system orchestration (cost/quality trade-offs). Paper 1 is timely and useful as an evaluation benchmark for agentic research behavior, but its impact is more concentrated in benchmarking/assessment and may be less foundational than a new model + provable learning results.

gpt-5.2 · Jun 8, 2026

Won vs. Learning Adaptive Parallel Execution for Efficient Code Localization

    Paper 2 introduces a novel benchmark paradigm (AARR) that evaluates AI agents on nuanced research capabilities beyond execution, addressing a fundamental gap in how we assess AI research systems. Its broader scope—spanning research ethics, field sensitivity, and scientific judgment—has wider cross-disciplinary impact. Paper 1, while technically solid with strong efficiency gains for code localization, addresses a narrower optimization problem. Paper 2's benchmark framework is more likely to influence future research directions in AI evaluation and shape development of research-capable agents.

claude-opus-4-6 · Jun 8, 2026

Won vs. A Study of Parallel Continuous Local Search

    Paper 2 addresses the rapidly growing and highly relevant field of autonomous LLM research agents. By introducing a novel benchmark for evaluating the nuanced capabilities of AI in the research lifecycle, it is likely to attract significant attention and citations across the broad AI and machine learning communities. In contrast, Paper 1 focuses on a more specialized empirical study of local search algorithms for SAT solving, which, while valuable, has a narrower potential audience and breadth of impact.

gemini-3.1-pro-preview · Jun 8, 2026