Verifiable Benchmarking of Long-Horizon Spatial Biology

Ian Diks, Harihara Muralidharan, Tim Proctor, Kenny Workman

May 27, 2026

arXiv:2605.28065v1

cs.AI (primary)

#927 of 2682 · Artificial Intelligence

Tournament Score: 1442 ± 43

Win Rate: 63%

Rating: 6.2 / 10

Significance 6.5 · Rigor 6.5 · Novelty 6.5 · Clarity 7

Abstract

AI agents are increasingly useful for biological data analysis, but existing benchmarks mostly test broad biological knowledge, executable workflows, or localized analysis steps rather than end-to-end scientific reasoning over spatial measurements. We introduce SpatialBench-Long, a benchmark for long-horizon spatial biology in which agents must recover biological claims from raw or near-raw data and calibrated experimental context without prescribed methods. SpatialBench-Long contains 24 evaluations across primary pancreatic ductal adenocarcinoma (PDAC), engineered glioblastoma organoids and in vivo tumors, Cas9 lineage-traced lung adenocarcinoma, and mouse optic nerve aging/intervention systems, spanning CosMx, Visium, Xenium, multiplexed error-robust fluorescence in situ hybridization (MERFISH), single-cell RNA sequencing (scRNA-seq), Slide-seq, Slide-tags, histology, and lineage-recording data. Candidate claims are hardened through reproduction, independent scientist review, and trajectory inspection. Final answers are graded deterministically over controlled vocabularies and symbols with companion rubrics capturing progress through key analysis chokepoints. Across the SpatialBench-Long benchmark, three model-harness pairs tie at 8/72 runs (11.1%): Gemini 3.5 Flash / Pi terminal coding harness, GPT-5.5 / Pi, and GPT-5.5 / OpenAI Codex. SpatialBench-Long tests whether agents can move beyond executing procedural analysis to deriving accurate scientific conclusions from complex spatial measurements.

AI Impact Assessments

(1 model)

Scientific Impact Assessment: SpatialBench-Long

1. Core Contribution

SpatialBench-Long introduces a benchmark for evaluating whether AI agents can perform end-to-end scientific reasoning over spatial biology data, moving from raw measurements to biological conclusions without prescribed intermediate steps. The benchmark comprises 24 evaluations across four study systems (primary PDAC, engineered glioblastoma (GBM) organoids and in vivo tumors, Cas9 lineage-traced lung adenocarcinoma, and mouse optic nerve aging/intervention) spanning nine data modalities (CosMx, Visium, Xenium, MERFISH, scRNA-seq, Slide-seq, Slide-tags, histology, lineage recording). The key differentiator from prior benchmarks is that agents must recover specific scientific claims from raw data and calibrated experimental context, rather than executing predefined workflows or answering knowledge-recall questions.

The benchmark introduces a dual evaluation approach: deterministic pass/fail grading on structured final answers (the official score) paired with rubric-based trajectory diagnostics that capture partial progress through analysis chokepoints. This addresses the sparse reward problem inherent in long-horizon evaluation while maintaining verifiability.
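The dual evaluation described above can be illustrated with a minimal sketch. The paper's actual graders and rubrics are not reproduced here, so every name below (`Evaluation`, `grade_final_answer`, `rubric_progress`, and the example vocabulary and checkpoints) is a hypothetical stand-in. The sketch assumes a final answer is a set of labels drawn from a controlled vocabulary, and that the rubric is a list of yes/no chokepoints.

```python
from dataclasses import dataclass


def normalize(token: str) -> str:
    """Case-fold and collapse whitespace so formatting differences do not affect grading."""
    return " ".join(token.strip().lower().split())


@dataclass(frozen=True)
class Evaluation:
    # Hypothetical structure: allowed answer labels and the hardened ground-truth subset.
    vocabulary: frozenset
    expected: frozenset


def grade_final_answer(evaluation: Evaluation, submitted: list) -> bool:
    """Pass only if every label is in the vocabulary and the set matches the expected set exactly."""
    answer = {normalize(s) for s in submitted}
    if not answer <= evaluation.vocabulary:
        return False  # out-of-vocabulary labels fail deterministically
    return answer == evaluation.expected


def rubric_progress(checkpoints: dict) -> float:
    """Fraction of analysis chokepoints reached; a diagnostic, not the official score."""
    return sum(checkpoints.values()) / len(checkpoints) if checkpoints else 0.0


# Illustrative evaluation; the vocabulary and checkpoint names are invented for this sketch.
example = Evaluation(
    vocabulary=frozenset({"increase", "decrease", "no change"}),
    expected=frozenset({"decrease"}),
)
passed = grade_final_answer(example, ["  Decrease "])
progress = rubric_progress(
    {"loaded_data": True, "chose_grouping": True, "spatial_test": False}
)
```

Keeping the pass/fail decision as an exact set comparison is what makes the official score reproducible without a judge model, while the rubric fraction gives a denser signal on runs that fail.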

2. Methodological Rigor

Strengths in design methodology:

The claim hardening process (independent reproduction, randomized expert review, trajectory inspection) addresses the serious problem that published biological claims don't always reproduce. Many candidate claims were reportedly excluded for failing to reproduce—this is intellectually honest and strengthens the benchmark's reliability.

Deterministic grading over controlled vocabularies avoids the pitfalls of LLM-as-judge for the primary metric.

Anonymization of datasets and study context reduces memorization shortcuts.

The paper reports Wilson confidence intervals and acknowledges evaluation-level clustering issues.

Concerns:

With only 24 evaluations and 3 replicates each (72 runs per model-harness pair), statistical power is limited. The top three systems tie at 8/72 (11.1%), with Wilson 95% CIs spanning 5.74–20.42%; these intervals overlap with those of nearly all other systems. The paper acknowledges this, but at its current scale the benchmark has little practical power to discriminate between systems.

The rubric-endpoint correlation (Pearson r=0.24, AUC=0.79) is modest, and the paper's own analysis shows rubric scores are confounded by model family/style. The careful framing of rubrics as "diagnostics not scores" is appropriate but raises the question of how useful the companion metric truly is.

The manual trajectory review covers only 75 trajectories out of 1,080—a small sample for drawing behavioral conclusions.

Cost and turn metadata are missing for non-Pi harnesses (OpenAI Codex, Claude Code), limiting cross-harness comparisons.
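The Wilson interval quoted among the concerns above for the 8/72 tie can be checked directly. The sketch below uses the standard Wilson score formula for a binomial proportion and, like that formula, assumes the 72 runs are independent. This overstates precision here, because runs are clustered within evaluations. The function name `wilson_interval` is illustrative.

```python
import math


def wilson_interval(successes: int, trials: int, z: float = 1.959964) -> tuple:
    """Wilson score interval for a binomial proportion (z defaults to the 95% level)."""
    p_hat = successes / trials
    z2 = z * z
    denom = 1 + z2 / trials
    center = (p_hat + z2 / (2 * trials)) / denom
    half_width = (
        z * math.sqrt(p_hat * (1 - p_hat) / trials + z2 / (4 * trials**2)) / denom
    )
    return center - half_width, center + half_width


# Top-scoring systems: 8 passes out of 72 runs.
low, high = wilson_interval(8, 72)
print(f"{low:.2%} to {high:.2%}")
```

For 8 of 72 this gives roughly 5.74% to 20.42%, matching the figures in the review. The interval is about 15 percentage points wide, so systems separated by only a few passes cannot be told apart.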

3. Potential Impact

For AI benchmarking: SpatialBench-Long fills a genuine gap between knowledge-recall biology benchmarks (LAB-Bench) and procedural coding benchmarks (BixBench, scBench). The "long-horizon" framing—where agents must chain many analysis decisions without intermediate supervision—is increasingly important as agents move toward autonomous research. The benchmark design philosophy (claim hardening, verifiable grading + diagnostic rubrics) could serve as a template for other domains.

For spatial biology: The benchmark implicitly defines what "competence" looks like for AI-assisted spatial analysis. The identified failure modes (wrong grouping variables, inappropriate spatial methods, prior/vocabulary shortcuts, missed metadata) provide actionable feedback for agent developers working on biological applications.

For agent development: The finding that frontier models achieve only ~11% pass rates, with failures distributed across many different chokepoints rather than concentrated at one stage, suggests that current agents lack the integrated procedural competence needed for real scientific analysis. This motivates investment in assay-specific training and multi-step reasoning capabilities.

4. Timeliness & Relevance

The benchmark arrives at a moment when AI coding agents are being actively marketed for scientific analysis, and the spatial biology field is generating increasingly complex multi-modal datasets. The gap between agent capabilities and the requirements of real spatial analysis workflows is poorly characterized by existing benchmarks. SpatialBench-Long provides a concrete, if early-stage, measurement of this gap.

The paper also arrives amid intense competition in benchmark creation for biology agents (BixBench, BioMysteryBench, GeneBench, BiomniBench), positioning itself as uniquely focused on spatial modalities and long-horizon claim recovery rather than well-scoped computational tasks.

5. Strengths & Limitations

Key Strengths:

*Ecological validity:* Tasks approximate what a real scientist does—choosing analyses, integrating modalities, and reaching conclusions—rather than testing isolated steps.

*Intellectual honesty about ground truth:* Acknowledging that published claims can fail to reproduce and building verification into the construction process is rare and valuable.

*Comprehensive model coverage:* Testing 15 model-harness pairs across five model families provides broad coverage of the current frontier.

*Failure mode taxonomy:* The behavioral analysis (Figure 8) provides genuinely useful characterization of how agents fail at biological reasoning.

Notable Limitations:

*Scale:* 24 evaluations is small. Many evaluations have zero passes across all models, making it hard to distinguish task impossibility from agent limitations.

*Benchmark diversity:* Results from four study systems may not generalize well. All but the optic nerve system are cancer-focused, and the PDAC system contributes 11 of 24 evaluations, nearly half.

*Reproducibility concerns:* Only a "subset" of evaluations and trajectories is released publicly. Full reproducibility requires access to all 24 evaluations, raw data, graders, and rubrics.

*Limited statistical discrimination:* Most model-harness pairs are not statistically distinguishable from each other given overlapping confidence intervals.

*Harness confounding:* Differences between Pi, OpenAI Codex, and Claude Code harnesses are not isolated from model differences, making attribution difficult.

*Conflict of interest:* The paper comes from an industry lab (LatchBio) whose platform is used as the harness (Pi), creating potential conflicts of interest in benchmark construction and harness selection.

Summary

SpatialBench-Long makes a meaningful contribution by defining and operationalizing the challenge of long-horizon spatial biology reasoning for AI agents. Its construction methodology—particularly claim hardening and the principled separation of verifiable grading from diagnostic rubrics—represents thoughtful benchmark design. However, the small evaluation count, limited statistical discrimination between models, partial data release, and potential platform conflicts temper the immediate impact. The benchmark's greatest value may be as a conceptual template and as early evidence that frontier agents remain far from reliable end-to-end biological reasoning.

Rating: 6.2 / 10

Significance 6.5 · Rigor 6.5 · Novelty 6.5 · Clarity 7

Generated May 28, 2026

Comparison History (19)

vs. A Policy-Driven Runtime Layer for Agentic LLM Serving

gpt-5.2 · 5/28/2026

Paper 2 likely has higher impact due to broad applicability and immediate real-world relevance: it proposes an architectural runtime layer that can generalize across many agentic LLM systems and serving engines, enabling multiple cross-cutting policies (caching, batching, safety, fairness). It includes concrete primitives, maps several policies, and demonstrates measurable performance/cost gains on real workloads, suggesting strong translational potential. Paper 1 is novel and valuable for evaluating scientific reasoning in spatial biology, but its impact is narrower (benchmarking for a specific domain) and depends on downstream adoption by toolmakers and biologists.

vs. Learning When to Optimize: Verified Optimization Skills from Expert GPU-Kernel Lineages

claude-opus-4.6 · 5/28/2026

SpatialBench-Long addresses a critical gap in evaluating AI agents for end-to-end scientific reasoning over complex spatial biology data, spanning multiple modalities and biological systems. It introduces a rigorous, deterministic evaluation framework for a rapidly growing field (AI for science), with broad implications for how AI agents are assessed in biological discovery. Paper 2, while technically interesting in GPU kernel optimization, addresses a narrower domain with more incremental contributions. SpatialBench-Long's cross-disciplinary impact (AI + spatial biology) and timeliness in the AI-for-science movement give it higher potential impact.

vs. Do Clinical Models Change Treatment Decisions?

gemini-3.1 · 5/28/2026

While both propose valuable benchmarks, Paper 1 addresses a critical safety and efficacy gap in clinical AI—ensuring models adapt treatment decisions to shifting patient contexts rather than just reciting medical knowledge. This has immediate, high-stakes implications for real-world healthcare deployment, patient safety, and medical model evaluation, giving it a broader and more critical societal impact than the specialized spatial biology focus of Paper 2.

vs. Confidence-Orchestrated Self-Evolution against Uncertain LLM Feedback

gpt-5.2 · 5/28/2026

Paper 2 likely has higher impact: it introduces a rigorously validated, multi-technology benchmark for end-to-end scientific reasoning over complex spatial biology data, addressing an urgent evaluation gap for real-world scientific agents. Its deterministic grading, hardened claims, and broad modality coverage increase methodological rigor and reproducibility, and the benchmark can catalyze progress across ML, computational biology, and bioinformatics. Paper 1 is a solid algorithmic improvement for self-training LLMs, but confidence-weighted updates are a narrower methodological contribution with more incremental novelty and less domain-shaping infrastructure effect.

vs. The Shape of Overthinking: Backtracking Bursts in Long Reasoning Traces

claude-opus-4.6 · 5/28/2026

SpatialBench-Long introduces a novel benchmark for evaluating AI agents on end-to-end scientific reasoning over complex spatial biology data, addressing a clear gap in existing benchmarks. It spans multiple real biological systems, data modalities, and state-of-the-art models, with rigorous evaluation methodology. This has broader impact across AI and biology communities, enabling systematic progress measurement for scientific AI agents. Paper 1, while offering useful insights into reasoning trace dynamics and backtracking patterns, addresses a narrower problem (diagnosing overthinking in LLM reasoning) with more incremental practical implications.

vs. SkillEvolBench: Benchmarking the Evolution from Episodic Experience to Procedural Skills

claude-opus-4.6 · 5/28/2026

Paper 2 (SpatialBench-Long) has higher potential impact because it addresses a critical gap at the intersection of AI and spatial biology—testing whether agents can perform end-to-end scientific reasoning over complex real-world experimental data. It spans multiple cutting-edge spatial transcriptomics technologies and disease systems, making it broadly relevant to both AI and biomedical communities. Its deterministic grading and rigorous claim validation methodology set a high standard. Paper 1, while valuable for understanding LLM skill formation, addresses a more incremental question within the AI agent community and finds largely negative results about current capabilities without clear pathways forward.

vs. Laguna M.1/XS.2 Technical Report

claude-opus-4.6 · 5/28/2026

SpatialBench-Long addresses a significant gap in AI evaluation for scientific reasoning over complex spatial biology data, introducing a novel benchmark methodology with deterministic grading and multi-step verification. It opens a new evaluation paradigm at the intersection of AI agents and spatial biology—two rapidly growing fields. Paper 1, while technically solid, is primarily a model release/technical report for coding-focused MoE models in an already crowded space of foundation model releases, offering incremental rather than conceptually novel contributions.

vs. A Sober Look at Agentic Misalignment in Automated Workflows

gemini-3.1 · 5/28/2026

Paper 1 addresses a fundamental challenge in AI safety and multi-agent systems (agentic misalignment), offering a novel theoretical framework and alignment paradigm (AEA). Its generalizable methodology has a broader potential impact across numerous fields utilizing automated AI workflows. While Paper 2 provides a valuable, rigorous benchmark for a critical scientific domain (spatial biology), its impact is inherently more domain-specific compared to the foundational AI advancements proposed in Paper 1.

vs. PANDO: Efficient Multimodal AI Agents via Online Skill Distillation

gemini-3.1 · 5/28/2026

Paper 1 presents a novel benchmark for long-horizon scientific reasoning in spatial biology, directly advancing the application of AI in scientific discovery (AI4Science). By bridging complex multi-omics data with AI agents, it has profound potential to accelerate real-world biological research. While Paper 2 offers valuable efficiency improvements for multimodal web agents, Paper 1's direct contribution to enabling verifiable, automated scientific conclusions addresses a more transformative and cross-disciplinary scientific challenge.

vs. JobBench: Aligning Agent Work With Human Will

claude-opus-4.6 · 5/28/2026

JobBench has broader impact across fields: it covers 35 occupations with 130 tasks, evaluates 36 models, and introduces a paradigm shift from replacement to enhancement framing in AI labor economics. Its methodological contribution (fact-anchored rubrics, heterogeneous workspaces) is generalizable. SpatialBench-Long is rigorous and valuable but narrower in scope (24 evaluations in spatial biology). JobBench's reframing of how we evaluate occupational AI agents has potential to influence AI policy, workforce development, and benchmark design across many domains.

vs. Constrained Auto-Bidding via Generative Response Modeling

gpt-5.2 · 5/28/2026

Paper 1 likely has higher scientific impact: it introduces a verifiable, end-to-end benchmark for long-horizon reasoning over diverse spatial biology modalities and disease/experimental systems, addressing a timely bottleneck in AI-for-science evaluation and reproducibility. Its applications span multiple biomedical domains and could influence model development, scientific agent tooling, and standards for claim-level validation. Paper 2 is methodologically rigorous with theory and practical gains for ad auto-bidding, but its impact is narrower (primarily online advertising) and less cross-disciplinary.

vs. Look on Demand: A Cognitive Scheduling Framework for Visual Evidence Acquisition in Multimodal Reasoning

gemini-3.1 · 5/28/2026

Paper 1 addresses a critical frontier in AI: autonomous scientific discovery and long-horizon reasoning. By providing a rigorous, multimodal benchmark for spatial biology, it enables the evaluation of AI agents on complex, real-world scientific tasks rather than isolated steps. This has immense potential to accelerate biological research and drug discovery. While Paper 2 presents a solid methodological improvement for multimodal reasoning, Paper 1's focus on end-to-end scientific reasoning in a high-impact applied domain offers greater potential for transformative, real-world scientific advancements.

vs. Human-like in-group bias in instruction-tuned language model agents

claude-opus-4.6 · 5/28/2026

Paper 2 addresses a broadly impactful and timely problem — emergent discrimination in AI agent networks — with rigorous methodology (6 model families, 20 seeds, 500 turns, multiple conditions). Its finding that in-group bias is invisible to standard audits yet compounds into structural inequality has immediate implications for AI safety, policy, fairness, and multi-agent system design across many fields. Paper 1, while technically rigorous and valuable for spatial biology benchmarking, serves a narrower community and primarily reports low agent performance on a new benchmark rather than uncovering a fundamental behavioral phenomenon with broad societal relevance.

vs. SwarmHarness: Skill-Based Task Routing via Decentralized Incentive-Aligned AI Agent Networks

claude-opus-4.6 · 5/28/2026

SpatialBench-Long addresses a critical gap in AI agent evaluation for scientific reasoning over complex spatial biology data, introducing a rigorous benchmark with 24 evaluations across multiple modalities and biological systems. Its methodological rigor (deterministic grading, reproduction, independent review) and direct relevance to accelerating biological discovery give it high impact potential. SwarmHarness proposes an interesting decentralized compute protocol but is more incremental over existing distributed computing approaches, lacks empirical validation at scale, and its real-world adoption faces significant practical barriers. SpatialBench-Long is more timely given the explosion of AI agents in science.

vs. From paper to benchmark: agentic, framework-based reproduction of under-specified methods in machine health intelligence

gpt-5.2 · 5/28/2026

Paper 1 is likely higher impact due to stronger novelty and broader relevance: it introduces a verifiable, long-horizon benchmark requiring end-to-end scientific reasoning from complex multimodal spatial biology data, with deterministic grading and expert-hardened claims—useful for evaluating and driving general scientific-agent capabilities. Its applications span multiple high-value biomedical domains and measurement platforms, increasing cross-field impact and timeliness. Paper 2 addresses an important reproducibility/benchmarking problem in PHM, but its scope is narrower (industrial ML) and the framework/slot-binding approach is less broadly transformative than a major new benchmark for scientific reasoning.

vs. LACUNA: Safe Agents as Recursive Program Holes

gpt-5.2 · 5/28/2026

Paper 1 introduces a novel programming model (typed, runtime-integrated “program holes” for agent actions) that directly addresses a core limitation in agent architectures—control-flow/runtime separation—while adding safety via compile-time type checks and transactional rejection/retry. This is broadly applicable across domains where LLM agents write/execute code, potentially influencing agent frameworks, programming languages, and safety tooling. Paper 2 is a rigorous and timely benchmark with clear value for spatial biology, but its impact is narrower (primarily evaluation within a specific scientific area) and less likely to reshape general agent design.

vs. From Knowing to Doing: A Memory-Controlled Benchmark for LLM Trading Agents on Stock Markets

claude-opus-4.6 · 5/28/2026

Paper 2 addresses a more broadly relevant problem—evaluating LLM agents in financial markets—with a methodologically rigorous approach that combines data masking to prevent memorization leakage and Barra-style performance attribution to decompose returns. Its contributions (leakage control and alpha attribution) are generalizable beyond finance to any domain where LLM memorization confounds evaluation. Paper 1, while valuable for spatial biology, targets a narrower scientific community with a domain-specific benchmark. Paper 2's insights about memorization contamination and the distinction between genuine reasoning and data leakage have wider implications for the entire LLM evaluation ecosystem.

vs. C-MIG: Multi-view Information Gain-based Retrieval-Augmented Generation for Clinical Diagnosis Reasoning

claude-opus-4.6 · 5/28/2026

SpatialBench-Long introduces a novel benchmark paradigm for evaluating AI agents on end-to-end scientific reasoning over complex spatial biology data, addressing a significant gap in how we assess AI capabilities for real scientific discovery. It spans multiple cutting-edge spatial transcriptomics technologies and disease systems, with rigorous verification. Its breadth of impact is larger—it defines evaluation standards for AI-driven science across biology, benchmarking, and agent development. While C-MIG offers solid methodological advances in clinical RAG-RL, it represents an incremental improvement in a well-explored space. SpatialBench-Long's infrastructure contribution has broader, longer-lasting impact potential.

vs. Satisfiability Solving with LLMs: A Matched-Pair Evaluation of Reasoning Capability

gemini-3.1 · 5/28/2026

Paper 1 addresses a major frontier in AI—autonomous scientific discovery and reasoning in a highly complex, real-world domain (spatial biology). Its end-to-end evaluation paradigm has massive potential to accelerate biomedical research. While Paper 2 offers valuable methodological insights into LLM reasoning on SAT problems, Paper 1's benchmark directly bridges AI and translational medicine, offering broader interdisciplinary impact and real-world applicability.