Skill-Augmented AI Agents for Medical Research Analysis: An Exploratory Multi-Model Human Evaluation in an NSCLC Transcriptomic Biomarker Task

Qianyu Yao, Fei Sun, Bocheng Huang, Wei Chen, Jiarui Jiang, Shu Quan, Yifei Chen, Wenjie Xu

Jun 10, 2026 · arXiv:2606.11830v1

cs.AI

#3276 of 3489 · Artificial Intelligence

Tournament Score: 1231 ± 50 (scale 1050 to 1800)

Win Rate: 18%

Rating: 3/10

Significance 3.5 · Rigor 3 · Novelty 3.5 · Clarity 6.5

Abstract

Background. Large language models and AI agents are increasingly used to support biomedical research, but native model outputs may omit key analytical steps, misuse methods, or overstate conclusions. We evaluated whether autonomous access to a medical research skill package was associated with higher-quality AI-generated transcriptomic research-analysis outputs compared with native AI without skills.

Methods. We conducted an exploratory multi-model human evaluation using a non-small cell lung cancer immunotherapy biomarker task. Six model backbones were tested. The evaluation included 21 anonymized outputs: 9 native-AI outputs and 12 skill-augmented outputs generated through an AI agent implementation represented by OpenClaw. Four non-expert biomedical reviewers and two blinded experts evaluated each output, with two ratings from each reviewer type. The primary outcome was expert-rated overall quality.

Results. Skill-augmented outputs showed directionally higher expert overall quality than native-AI outputs (mean 5.50 vs 5.11; difference=0.39; bootstrap 95% CI, -0.04 to 0.90; Welch p=0.156). Non-expert reviewer quality showed the same direction (mean 4.72 vs 4.47; difference=0.26; bootstrap 95% CI, -0.25 to 0.80; Welch p=0.373). Expert agreement was limited (single-rating ICC=-0.15), and model-specific effects were descriptive and heterogeneous.

Conclusions. Autonomous skill access showed a directional quality signal in this exploratory sample, but the signal was smaller than expert-rating noise and should not be interpreted as confirmatory evidence. The findings primarily motivate larger evaluations of skill-augmented AI agents with stronger reliability controls, platform replication, and biological-validity assessment.

AI Impact Assessments

(1 model)

Scientific Impact Assessment

1. Core Contribution

This paper evaluates whether providing AI agents with a curated medical research "skill package" (a collection of bioinformatics and research workflow modules) improves the quality of AI-generated transcriptomic research analysis outputs, compared to native LLM outputs without such skills. The evaluation uses an NSCLC immunotherapy biomarker discovery task as a test case, with human expert and non-expert reviewers rating 21 anonymized outputs from six frontier LLMs.

The core idea—augmenting LLM agents with domain-specific procedural skills for biomedical research workflows—is conceptually reasonable and addresses a real limitation of current LLMs: their tendency to omit analytical steps, misuse statistical methods, or produce overconfident conclusions. However, the paper's actual contribution is quite thin: it reports a non-significant directional signal (mean difference of 0.39 on a 7-point scale, p=0.156) that the authors themselves acknowledge is smaller than expert rater disagreement.
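For readers who want to see how statistics of the kind quoted above (a difference in means, a bootstrap 95% CI, a Welch test) are typically obtained, here is a minimal sketch. The paper's exact procedure is not described in this review, so the percentile bootstrap, the function names `welch_t` and `bootstrap_diff_ci`, and the rating lists are all assumptions for illustration; the ratings are hypothetical placeholders chosen only to have group sizes and means similar to the reported ones, not the study's data.

```python
import random
import statistics
from math import sqrt


def welch_t(a, b):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom."""
    va = statistics.variance(a) / len(a)  # squared standard error, group a
    vb = statistics.variance(b) / len(b)  # squared standard error, group b
    t = (statistics.mean(a) - statistics.mean(b)) / sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df


def bootstrap_diff_ci(a, b, n_boot=5000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for mean(a) - mean(b), resampling each group."""
    rng = random.Random(seed)
    diffs = []
    for _ in range(n_boot):
        resampled_a = rng.choices(a, k=len(a))  # sample with replacement
        resampled_b = rng.choices(b, k=len(b))
        diffs.append(statistics.mean(resampled_a) - statistics.mean(resampled_b))
    diffs.sort()
    lower = diffs[int((alpha / 2) * n_boot)]
    upper = diffs[int((1 - alpha / 2) * n_boot) - 1]
    return lower, upper


# HYPOTHETICAL placeholder ratings (12 skill-augmented, 9 native); not real data.
skill = [5.5, 6.0, 5.0, 6.5, 5.0, 5.5, 6.0, 4.5, 5.5, 6.0, 5.0, 5.5]
native = [5.0, 5.5, 4.5, 5.0, 6.0, 4.5, 5.5, 5.0, 5.0]

t_stat, dof = welch_t(skill, native)
ci_low, ci_high = bootstrap_diff_ci(skill, native)
print(f"Welch t = {t_stat:.2f} on {dof:.1f} df")
print(f"bootstrap 95% CI for the difference: [{ci_low:.2f}, {ci_high:.2f}]")
```

A confidence interval that includes zero, as the reported one does, is what makes the result "directional" rather than confirmatory.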

2. Methodological Rigor

The study has serious methodological limitations that substantially undermine its findings:

Sample size and statistical power: With only 21 outputs (9 native, 12 skill-augmented) across six models, the study is severely underpowered. Some model-strategy cells contain only one observation, making model-level comparisons essentially anecdotal. The unbalanced design (9 vs. 12) is not well justified.
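To give a rough sense of scale for the power concern, the sketch below applies the standard normal-approximation formula for a two-sample comparison of means. The rating standard deviation is not reported in this review, so the `sd` values are purely hypothetical, as is the helper name `n_per_group`; only the 0.39-point difference comes from the paper.

```python
from math import ceil
from statistics import NormalDist


def n_per_group(delta, sd, alpha=0.05, power=0.80):
    """Approximate outputs needed per arm to detect a mean difference `delta`.

    Uses the two-sided normal approximation
        n = 2 * ((z_{1-alpha/2} + z_{power}) * sd / delta) ** 2
    and assumes equal group sizes and a common standard deviation.
    """
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)
    z_power = z.inv_cdf(power)
    return ceil(2 * ((z_alpha + z_power) * sd / delta) ** 2)


# `sd` is an ASSUMED value, varied here only to show sensitivity.
for assumed_sd in (0.5, 0.75, 1.0):
    print(assumed_sd, n_per_group(delta=0.39, sd=assumed_sd))
```

Because the required sample grows with the square of `sd / delta`, rating noise on the order of the effect itself quickly pushes the requirement well beyond 9 to 12 outputs per arm.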

Inter-rater reliability: The expert ICC of -0.15 is alarming. A negative ICC arises when ratings of the same output vary more than ratings across different outputs, so the two expert ratings carried essentially no shared signal, which fundamentally undermines the primary outcome measure. The mean absolute expert disagreement (0.67 points) exceeds the reported effect size (0.39 points). This means the measurement instrument is noisier than the signal being measured.
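A small sketch can make the negative ICC less mysterious. It computes the one-way, single-rating ICC from the between-output and within-output mean squares. Which ICC form the paper used is not stated in this review, so the one-way form here is an assumption, and the function name `icc_oneway` and both rating tables are hypothetical.

```python
from statistics import mean


def icc_oneway(ratings):
    """One-way random-effects, single-rating ICC.

    `ratings` is a list of rows, one per rated output, each holding the
    same number k of ratings. ICC = (MSB - MSW) / (MSB + (k - 1) * MSW).
    """
    n = len(ratings)
    k = len(ratings[0])
    grand_mean = mean(x for row in ratings for x in row)
    row_means = [mean(row) for row in ratings]
    # Between-output mean square: how much the outputs differ from each other.
    msb = k * sum((m - grand_mean) ** 2 for m in row_means) / (n - 1)
    # Within-output mean square: how much raters differ on the same output.
    msw = sum(
        (x - m) ** 2 for row, m in zip(ratings, row_means) for x in row
    ) / (n * (k - 1))
    return (msb - msw) / (msb + (k - 1) * msw)


# HYPOTHETICAL ratings: two raters per output.
agreeing = [[4, 4], [5, 5], [6, 6], [7, 7]]
disagreeing = [[4, 6], [6, 4], [5, 7], [7, 5]]

print(icc_oneway(agreeing))
print(icc_oneway(disagreeing))
```

The value turns negative whenever the within-output mean square exceeds the between-output mean square, that is, when two raters looking at the same output differ more than the outputs differ from one another.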

Confounding factors: The study cannot disentangle the effect of the skill package from OpenClaw-specific platform behaviors (routing, context management, execution). The authors acknowledge this but proceed with claims about "skill augmentation" nonetheless.

No biological validation: The study evaluates perceived quality ratings rather than actual correctness. A well-structured but biologically wrong analysis would score highly. This is a critical gap for a biomedical application paper.

Rating instrument: The Likert-scale instrument was developed ad hoc for this study and not validated. The mixing of quality and risk constructs on different directional scales adds interpretive complexity.

To the authors' credit, they are remarkably transparent about these limitations, qualifying nearly every finding as "descriptive," "exploratory," and "hypothesis-generating." This intellectual honesty is commendable but also raises the question of whether the paper provides actionable scientific knowledge.

3. Potential Impact

The paper's practical impact is limited by its preliminary nature. The skill package concept itself is not novel—tool-augmented LLMs and agentic workflows are well-established in the AI community. The biomedical application context is relevant but the evidence provided is too weak to guide practice.

The paper might serve as a methodological template for future, better-powered evaluations of skill-augmented agents in biomedical research. However, the evaluation framework itself needs substantial refinement (better rater calibration, validated instruments, biological ground truth) before it can serve this role effectively.

The OpenClaw platform and skill package are described as publicly available, which could enable follow-up work, though the generated outputs themselves are not released.

4. Timeliness & Relevance

The topic is timely. AI agents for biomedical research are an active area, and understanding whether structured skill packages improve output quality is practically important. The choice of transcriptomic biomarker analysis as a test domain is well-motivated—it genuinely requires multi-step analytical reasoning. The selection of very recent model backbones (GPT-5.4, Claude Sonnet 4.6, etc.) makes this current as of 2026.

However, the rapid pace of LLM development means that findings about specific model backbones have a very short shelf life. The model-specific heterogeneity findings (e.g., GPT-5.4 benefiting most from skill augmentation) are based on 2-4 outputs per model and will likely not replicate with newer model versions.

5. Strengths & Limitations

Strengths:

Exceptional transparency about limitations; the paper does not overclaim

Meaningful research question about skill augmentation in biomedical AI

Multi-model evaluation provides breadth

Appropriate use of blinded review and anonymization

Detailed supplementary materials and reproducibility documentation

The distinction between expert and non-expert ratings reveals interesting divergences (e.g., Kimi K2.6's non-expert favorability without expert quality improvement)

Limitations:

The primary finding is null by conventional standards (p=0.156, CI crossing zero)

Expert rater agreement is essentially absent (ICC = -0.15)

Sample size is insufficient for any of the stated analyses

No biological ground truth or correctness assessment

Single task/domain limits generalizability

Cannot separate skill package effects from platform-specific effects

The paper reads partly as a product evaluation for OpenClaw/AIPOCH rather than a generalizable scientific contribution

The "skill package" is vaguely described—it's unclear which components are procedural guidance vs. executable code

Conflict of Interest Concern: All authors are affiliated with AIPOCH PTE. LTD., which develops the OpenClaw platform and skill package being evaluated. While the results are not strongly favorable, this creates a perception issue.

Overall Assessment

This paper addresses a relevant question but provides insufficient evidence to answer it. The study is essentially a pilot with acknowledged measurement failure (negative ICC) that produces a non-significant result. While the authors' transparency is laudable, the paper's scientific contribution is primarily to demonstrate that this type of evaluation is difficult and needs better methodology—a finding that, while useful, is not particularly surprising. The paper would benefit from being positioned as a methods/protocol paper rather than a results paper, given the preliminary nature of the findings.

Rating: 3/10

Significance 3.5 · Rigor 3 · Novelty 3.5 · Clarity 6.5

Generated Jun 11, 2026

Comparison History (17)

Lost vs. The Impossibility of Eliciting Latent Knowledge

Paper 1 addresses a fundamental theoretical problem in AI alignment—the impossibility of reliably eliciting honest beliefs from advanced AI systems—proving a formal impossibility theorem with broad implications for AI safety research. This has deep relevance as AI capabilities advance. Paper 2 is an exploratory empirical study with non-significant results, limited sample size, poor inter-rater reliability, and narrow application scope (skill-augmented LLMs for a specific biomedical task). Paper 1's theoretical contributions are more novel, rigorous, and broadly impactful across AI safety, alignment, and machine learning fields.

claude-opus-4-6 · Jun 11, 2026

Lost vs. AutoMine Solution for AV2 2026 Scenario Mining Challenge

Paper 2 presents a concrete, methodologically successful solution to a critical problem in autonomous driving (safety-critical scenario mining), demonstrating strong empirical results in a recognized competition. In contrast, Paper 1, while targeting the important domain of medical research, reports inconclusive, exploratory findings with limited statistical significance and poor expert agreement, reducing its immediate scientific and practical impact.

gemini-3.1-pro-preview · Jun 11, 2026

Lost vs. GuardNet: Ensemble Strategies of Shallow Neural Networks for Robust Prompt Injection and Jailbreak Detection

Paper 1 addresses a critical bottleneck in generative AI deployment (security vs. latency/cost trade-offs) with a highly practical, quantified solution achieving 50ms latency on CPUs. In contrast, Paper 2 is an exploratory pilot study with a small sample size (21 outputs) and statistically insignificant results due to high expert-rating noise. Paper 1 offers immediate utility and clear baseline metrics for LLM guardrails, granting it higher potential impact.

gemini-3.1-pro-preview · Jun 11, 2026

Lost vs. Lung-R1: A Knowledge Graph-Guided LLM for Pulmonary Diagnostic Reasoning

Paper 1 introduces a novel, reusable resource (LungKG) and achieves state-of-the-art results in a crucial clinical task (EMR-based diagnostic reasoning). In contrast, Paper 2 is a small exploratory study with statistically insignificant results and low inter-rater reliability, serving primarily as a pilot. Paper 1's concrete methodological advancements and clear empirical success give it a significantly higher potential for real-world application and broad scientific impact.

gemini-3.1-pro-preview · Jun 11, 2026

Lost vs. Automated Mediator for Human Negotiation: Pre-Mediation via a Structured LLM Pipeline

Paper 1 demonstrates higher scientific impact potential for several reasons: (1) It addresses a well-defined, practical problem (pre-mediation in negotiations) with a novel, structured LLM pipeline approach and validates it against human mediators in controlled experiments with statistically meaningful results. (2) It has clear real-world applications in dispute resolution, scaling access to mediation. (3) Its methodology is rigorous with two human-subject experiments showing concrete improvements. Paper 2, while addressing an important domain, is explicitly exploratory with non-significant results, limited expert agreement (negative ICC), and the authors themselves caution against confirmatory interpretation, substantially limiting its immediate impact.

claude-opus-4-6 · Jun 11, 2026

Lost vs. Search Discipline for Long-Horizon Research Agents

Paper 2 identifies a fundamental and broadly applicable failure mode—aggregate metric inversions—affecting all autonomous research agents across domains, and proposes a principled external audit protocol. This addresses a critical infrastructure-level problem for the growing field of AI-driven scientific discovery. Paper 1, while addressing an important question about skill-augmented AI agents, reports inconclusive exploratory results with limited statistical power and restricted scope (single biomedical task), primarily motivating future work rather than establishing actionable findings.

claude-opus-4-6 · Jun 11, 2026

Won vs. Severity-Aware Curriculum Learning with Multi-Model Response Selection for Medical Text Generation

Paper 2 addresses the more novel and impactful question of whether skill-augmented AI agents can improve biomedical research analysis quality, a timely topic with broad implications for AI-assisted scientific discovery. Despite its exploratory nature and non-significant results, it introduces a rigorous evaluation framework (multi-model, blinded expert review, bootstrap CIs) for a cutting-edge paradigm (autonomous AI agents with tool access). Paper 1, while competent, applies relatively standard techniques (curriculum learning, multi-model selection) to medical QA with incremental improvements, representing a more narrow and incremental contribution.

claude-opus-4-6 · Jun 11, 2026

Lost vs. IntElicit: Eliciting and Assessing Contextualized Creativity via Dialogue Policy Optimization

Paper 2 demonstrates higher potential scientific impact due to its conclusive results and methodological innovation. While Paper 1 addresses an important medical AI application, its findings are exploratory and statistically non-significant due to high expert-rating noise. In contrast, Paper 2 introduces a novel dialogue policy optimization framework (IntElicit) with a decomposed process reward mechanism that successfully prevents reward hacking in educational AI. Supported by a human study showing clear improvements over expert baselines, Paper 2 offers immediate, broadly applicable contributions to AI-mediated learning, human-computer interaction, and cognitive assessment.

gemini-3.1-pro-preview · Jun 11, 2026

Won vs. PROJECTMEM: A Local-First, Event-Sourced Memory and Judgment Layer for AI Coding Agents

Paper 1 has higher potential scientific impact due to its focus on biomedical research quality and evaluation of AI agents in a clinically relevant NSCLC biomarker analysis task, with a multi-model, blinded expert/non-expert human evaluation and explicit treatment of uncertainty and inter-rater reliability. Even if exploratory and underpowered, it targets high-stakes real-world application and aligns with timely needs for trustworthy AI in medicine. Paper 2 is a useful engineering contribution, but its evaluation is limited (self-study, small scale) and its impact is likely narrower and more incremental.

gpt-5.2 · Jun 11, 2026

Won vs. Instruction Finetuning DeepSeek-R1-8B Model Using LoRA and NEFTune

Paper 1 has higher potential scientific impact due to its timeliness and cross-field relevance: evaluating skill-augmented autonomous agents for biomedical research is broadly applicable to scientific discovery workflows and impacts reliability/safety of AI-assisted science. Its focus on medical research analysis and human/expert evaluation targets a high-stakes domain with real-world translational implications. While methodological rigor is limited (small sample, low inter-rater reliability, non-significant effects), the work is more novel than applying standard LoRA+NEFTune finetuning for domain NER (Paper 2), which is incremental and narrower in scope despite clearer quantitative gains.

gpt-5.2 · Jun 11, 2026
