Verifiable Benchmarking of Long-Horizon Spatial Biology
Ian Diks, Harihara Muralidharan, Tim Proctor, Kenny Workman
Abstract
AI agents are increasingly useful for biological data analysis, but existing benchmarks mostly test broad biological knowledge, executable workflows, or localized analysis steps rather than end-to-end scientific reasoning over spatial measurements. We introduce SpatialBench-Long, a benchmark for long-horizon spatial biology in which agents must recover biological claims from raw or near-raw data and calibrated experimental context without prescribed methods. SpatialBench-Long contains 24 evaluations across primary pancreatic ductal adenocarcinoma (PDAC), engineered glioblastoma organoids and in vivo tumors, Cas9 lineage-traced lung adenocarcinoma, and mouse optic nerve aging/intervention systems, spanning CosMx, Visium, Xenium, multiplexed error-robust fluorescence in situ hybridization (MERFISH), single-cell RNA sequencing (scRNA-seq), Slide-seq, Slide-tags, histology, and lineage-recording data. Candidate claims are hardened through reproduction, independent scientist review, and trajectory inspection. Final answers are graded deterministically over controlled vocabularies and symbols with companion rubrics capturing progress through key analysis chokepoints. Across the SpatialBench-Long benchmark, three model-harness pairs tie at 8/72 runs (11.1%): Gemini 3.5 Flash / Pi terminal coding harness, GPT-5.5 / Pi, and GPT-5.5 / OpenAI Codex. SpatialBench-Long tests whether agents can move beyond executing procedural analysis to deriving accurate scientific conclusions from complex spatial measurements.
AI Impact Assessments
Scientific Impact Assessment: SpatialBench-Long
1. Core Contribution
SpatialBench-Long introduces a benchmark for evaluating whether AI agents can perform end-to-end scientific reasoning over spatial biology data, moving from raw measurements to biological conclusions without prescribed intermediate steps. The benchmark comprises 24 evaluations across four study systems (PDAC, engineered glioblastoma organoids and in vivo tumors, lineage-traced lung adenocarcinoma, mouse optic nerve aging/intervention) spanning nine data modalities (CosMx, Visium, Xenium, MERFISH, scRNA-seq, Slide-seq, Slide-tags, histology, lineage recording). The key differentiator from prior benchmarks is that agents must recover specific scientific claims from raw or near-raw data and calibrated experimental context, rather than executing predefined workflows or answering knowledge-recall questions.
The benchmark introduces a dual evaluation approach: deterministic pass/fail grading on structured final answers (the official score) paired with rubric-based trajectory diagnostics that capture partial progress through analysis chokepoints. This addresses the sparse reward problem inherent in long-horizon evaluation while maintaining verifiability.
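The split between the official score and the diagnostic rubric can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the function names, the exact-set-match rule, the normalization, and the example labels are hypothetical and are not taken from the benchmark's actual grader.

```python
def normalize(term: str) -> str:
    """Trim and lower-case a label so formatting differences do not affect grading."""
    return term.strip().lower()


def grade_final_answer(answer: set[str], key: set[str], vocabulary: set[str]) -> bool:
    """Deterministic pass/fail (hypothetical rule): every submitted term must
    belong to the controlled vocabulary, and the submitted set must equal the key."""
    submitted = {normalize(t) for t in answer}
    allowed = {normalize(t) for t in vocabulary}
    if not submitted <= allowed:
        return False  # out-of-vocabulary terms fail outright
    return submitted == {normalize(t) for t in key}


def rubric_progress(chokepoints: dict[str, bool]) -> float:
    """Diagnostic only: fraction of analysis chokepoints the trajectory cleared.
    It does not change the pass/fail result."""
    if not chokepoints:
        return 0.0
    return sum(chokepoints.values()) / len(chokepoints)


# Hypothetical example labels, not drawn from the benchmark.
vocabulary = {"ductal", "acinar", "fibroblast", "macrophage"}
key = {"ductal", "fibroblast"}

passed = grade_final_answer({"Ductal", " fibroblast"}, key, vocabulary)
progress = rubric_progress(
    {"loaded_data": True, "correct_grouping": True, "spatial_test": False, "final_call": False}
)
print(passed, progress)
```

In this sketch a run can fail the final answer while still showing partial rubric progress, which is the kind of signal the rubric is described as providing for otherwise sparse pass/fail outcomes.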
2. Methodological Rigor
Strengths in design methodology:
Concerns:
3. Potential Impact
For AI benchmarking: SpatialBench-Long fills a genuine gap between knowledge-recall biology benchmarks (LAB-Bench) and procedural coding benchmarks (BixBench, scBench). The "long-horizon" framing—where agents must chain many analysis decisions without intermediate supervision—is increasingly important as agents move toward autonomous research. The benchmark design philosophy (claim hardening, verifiable grading + diagnostic rubrics) could serve as a template for other domains.
For spatial biology: The benchmark implicitly defines what "competence" looks like for AI-assisted spatial analysis. The identified failure modes (wrong grouping variables, inappropriate spatial methods, prior/vocabulary shortcuts, missed metadata) provide actionable feedback for agent developers working on biological applications.
For agent development: The finding that frontier models achieve only ~11% pass rates, with failures distributed across many different chokepoints rather than concentrated at one stage, suggests that current agents lack the integrated procedural competence needed for real scientific analysis. This motivates investment in assay-specific training and multi-step reasoning capabilities.
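To put the reported 8/72 pass rate in context, a confidence interval gives a sense of how much uncertainty a score of that size carries. The sketch below uses a Wilson score interval, which is a choice made here for illustration and not a method attributed to the paper. It also treats the 72 runs as independent trials, an assumption that likely overstates precision if runs are repeated attempts at the same 24 evaluations.

```python
import math


def wilson_interval(successes: int, trials: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z = 1.96 for about 95%)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    p = successes / trials
    denom = 1 + z**2 / trials
    center = (p + z**2 / (2 * trials)) / denom
    half_width = z * math.sqrt(p * (1 - p) / trials + z**2 / (4 * trials**2)) / denom
    return center - half_width, center + half_width


# 8 passing runs out of 72, as reported in the abstract.
low, high = wilson_interval(8, 72)
print(f"pass rate {8 / 72:.1%}, approx. 95% interval {low:.1%} to {high:.1%}")
```

Under these assumptions the interval spans roughly 6% to 20%, so differences of a few passing runs between model-harness pairs would be hard to distinguish at this sample size.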
4. Timeliness & Relevance
The benchmark arrives at a moment when AI coding agents are being actively marketed for scientific analysis, and the spatial biology field is generating increasingly complex multi-modal datasets. The gap between agent capabilities and the requirements of real spatial analysis workflows is poorly characterized by existing benchmarks. SpatialBench-Long provides a concrete, if early-stage, measurement of this gap.
The paper also arrives amid intense competition in benchmark creation for biology agents (BixBench, BioMysteryBench, GeneBench, BiomniBench), positioning itself as uniquely focused on spatial modalities and long-horizon claim recovery rather than well-scoped computational tasks.
5. Strengths & Limitations
Key Strengths:
Notable Limitations:
Summary
SpatialBench-Long makes a meaningful contribution by defining and operationalizing the challenge of long-horizon spatial biology reasoning for AI agents. Its construction methodology—particularly claim hardening and the principled separation of verifiable grading from diagnostic rubrics—represents thoughtful benchmark design. However, the small evaluation count, limited statistical discrimination between models, partial data release, and potential platform conflicts temper the immediate impact. The benchmark's greatest value may be as a conceptual template and as early evidence that frontier agents remain far from reliable end-to-end biological reasoning.
Generated May 28, 2026
Comparison History (19)
Paper 2 likely has higher impact due to broad applicability and immediate real-world relevance: it proposes an architectural runtime layer that can generalize across many agentic LLM systems and serving engines, enabling multiple cross-cutting policies (caching, batching, safety, fairness). It includes concrete primitives, maps several policies, and demonstrates measurable performance/cost gains on real workloads, suggesting strong translational potential. Paper 1 is novel and valuable for evaluating scientific reasoning in spatial biology, but its impact is narrower (benchmarking for a specific domain) and depends on downstream adoption by toolmakers and biologists.
SpatialBench-Long addresses a critical gap in evaluating AI agents for end-to-end scientific reasoning over complex spatial biology data, spanning multiple modalities and biological systems. It introduces a rigorous, deterministic evaluation framework for a rapidly growing field (AI for science), with broad implications for how AI agents are assessed in biological discovery. Paper 2, while technically interesting in GPU kernel optimization, addresses a narrower domain with more incremental contributions. SpatialBench-Long's cross-disciplinary impact (AI + spatial biology) and timeliness in the AI-for-science movement give it higher potential impact.
While both propose valuable benchmarks, Paper 1 addresses a critical safety and efficacy gap in clinical AI—ensuring models adapt treatment decisions to shifting patient contexts rather than just reciting medical knowledge. This has immediate, high-stakes implications for real-world healthcare deployment, patient safety, and medical model evaluation, giving it a broader and more critical societal impact than the specialized spatial biology focus of Paper 2.
Paper 2 likely has higher impact: it introduces a rigorously validated, multi-technology benchmark for end-to-end scientific reasoning over complex spatial biology data, addressing an urgent evaluation gap for real-world scientific agents. Its deterministic grading, hardened claims, and broad modality coverage increase methodological rigor and reproducibility, and the benchmark can catalyze progress across ML, computational biology, and bioinformatics. Paper 1 is a solid algorithmic improvement for self-training LLMs, but confidence-weighted updates are a narrower methodological contribution with more incremental novelty and less domain-shaping infrastructure effect.
SpatialBench-Long introduces a novel benchmark for evaluating AI agents on end-to-end scientific reasoning over complex spatial biology data, addressing a clear gap in existing benchmarks. It spans multiple real biological systems, data modalities, and state-of-the-art models, with rigorous evaluation methodology. This has broader impact across AI and biology communities, enabling systematic progress measurement for scientific AI agents. Paper 1, while offering useful insights into reasoning trace dynamics and backtracking patterns, addresses a narrower problem (diagnosing overthinking in LLM reasoning) with more incremental practical implications.
Paper 2 (SpatialBench-Long) has higher potential impact because it addresses a critical gap at the intersection of AI and spatial biology—testing whether agents can perform end-to-end scientific reasoning over complex real-world experimental data. It spans multiple cutting-edge spatial transcriptomics technologies and disease systems, making it broadly relevant to both AI and biomedical communities. Its deterministic grading and rigorous claim validation methodology set a high standard. Paper 1, while valuable for understanding LLM skill formation, addresses a more incremental question within the AI agent community and finds largely negative results about current capabilities without clear pathways forward.
SpatialBench-Long addresses a significant gap in AI evaluation for scientific reasoning over complex spatial biology data, introducing a novel benchmark methodology with deterministic grading and multi-step verification. It opens a new evaluation paradigm at the intersection of AI agents and spatial biology—two rapidly growing fields. Paper 1, while technically solid, is primarily a model release/technical report for coding-focused MoE models in an already crowded space of foundation model releases, offering incremental rather than conceptually novel contributions.
Paper 1 addresses a fundamental challenge in AI safety and multi-agent systems (agentic misalignment), offering a novel theoretical framework and alignment paradigm (AEA). Its generalizable methodology has a broader potential impact across numerous fields utilizing automated AI workflows. While Paper 2 provides a valuable, rigorous benchmark for a critical scientific domain (spatial biology), its impact is inherently more domain-specific compared to the foundational AI advancements proposed in Paper 1.
Paper 1 presents a novel benchmark for long-horizon scientific reasoning in spatial biology, directly advancing the application of AI in scientific discovery (AI4Science). By bridging complex multi-omics data with AI agents, it has profound potential to accelerate real-world biological research. While Paper 2 offers valuable efficiency improvements for multimodal web agents, Paper 1's direct contribution to enabling verifiable, automated scientific conclusions addresses a more transformative and cross-disciplinary scientific challenge.
JobBench has broader impact across fields: it covers 35 occupations with 130 tasks, evaluates 36 models, and introduces a paradigm shift from replacement to enhancement framing in AI labor economics. Its methodological contribution (fact-anchored rubrics, heterogeneous workspaces) is generalizable. SpatialBench-Long is rigorous and valuable but narrower in scope (24 evaluations in spatial biology). JobBench's reframing of how we evaluate occupational AI agents has potential to influence AI policy, workforce development, and benchmark design across many domains.
Paper 1 likely has higher scientific impact: it introduces a verifiable, end-to-end benchmark for long-horizon reasoning over diverse spatial biology modalities and disease/experimental systems, addressing a timely bottleneck in AI-for-science evaluation and reproducibility. Its applications span multiple biomedical domains and could influence model development, scientific agent tooling, and standards for claim-level validation. Paper 2 is methodologically rigorous with theory and practical gains for ad auto-bidding, but its impact is narrower (primarily online advertising) and less cross-disciplinary.
Paper 1 addresses a critical frontier in AI: autonomous scientific discovery and long-horizon reasoning. By providing a rigorous, multimodal benchmark for spatial biology, it enables the evaluation of AI agents on complex, real-world scientific tasks rather than isolated steps. This has immense potential to accelerate biological research and drug discovery. While Paper 2 presents a solid methodological improvement for multimodal reasoning, Paper 1's focus on end-to-end scientific reasoning in a high-impact applied domain offers greater potential for transformative, real-world scientific advancements.
Paper 2 addresses a broadly impactful and timely problem — emergent discrimination in AI agent networks — with rigorous methodology (6 model families, 20 seeds, 500 turns, multiple conditions). Its finding that in-group bias is invisible to standard audits yet compounds into structural inequality has immediate implications for AI safety, policy, fairness, and multi-agent system design across many fields. Paper 1, while technically rigorous and valuable for spatial biology benchmarking, serves a narrower community and primarily reports low agent performance on a new benchmark rather than uncovering a fundamental behavioral phenomenon with broad societal relevance.
SpatialBench-Long addresses a critical gap in AI agent evaluation for scientific reasoning over complex spatial biology data, introducing a rigorous benchmark with 24 evaluations across multiple modalities and biological systems. Its methodological rigor (deterministic grading, reproduction, independent review) and direct relevance to accelerating biological discovery give it high impact potential. SwarmHarness proposes an interesting decentralized compute protocol but is more incremental over existing distributed computing approaches, lacks empirical validation at scale, and its real-world adoption faces significant practical barriers. SpatialBench-Long is more timely given the explosion of AI agents in science.
Paper 1 is likely higher impact due to stronger novelty and broader relevance: it introduces a verifiable, long-horizon benchmark requiring end-to-end scientific reasoning from complex multimodal spatial biology data, with deterministic grading and expert-hardened claims—useful for evaluating and driving general scientific-agent capabilities. Its applications span multiple high-value biomedical domains and measurement platforms, increasing cross-field impact and timeliness. Paper 2 addresses an important reproducibility/benchmarking problem in PHM, but its scope is narrower (industrial ML) and the framework/slot-binding approach is less broadly transformative than a major new benchmark for scientific reasoning.
Paper 1 introduces a novel programming model (typed, runtime-integrated “program holes” for agent actions) that directly addresses a core limitation in agent architectures—control-flow/runtime separation—while adding safety via compile-time type checks and transactional rejection/retry. This is broadly applicable across domains where LLM agents write/execute code, potentially influencing agent frameworks, programming languages, and safety tooling. Paper 2 is a rigorous and timely benchmark with clear value for spatial biology, but its impact is narrower (primarily evaluation within a specific scientific area) and less likely to reshape general agent design.
Paper 2 addresses a more broadly relevant problem—evaluating LLM agents in financial markets—with a methodologically rigorous approach that combines data masking to prevent memorization leakage and Barra-style performance attribution to decompose returns. Its contributions (leakage control and alpha attribution) are generalizable beyond finance to any domain where LLM memorization confounds evaluation. Paper 1, while valuable for spatial biology, targets a narrower scientific community with a domain-specific benchmark. Paper 2's insights about memorization contamination and the distinction between genuine reasoning and data leakage have wider implications for the entire LLM evaluation ecosystem.
SpatialBench-Long introduces a novel benchmark paradigm for evaluating AI agents on end-to-end scientific reasoning over complex spatial biology data, addressing a significant gap in how we assess AI capabilities for real scientific discovery. It spans multiple cutting-edge spatial transcriptomics technologies and disease systems, with rigorous verification. Its breadth of impact is larger—it defines evaluation standards for AI-driven science across biology, benchmarking, and agent development. While C-MIG offers solid methodological advances in clinical RAG-RL, it represents an incremental improvement in a well-explored space. SpatialBench-Long's infrastructure contribution has broader, longer-lasting impact potential.
Paper 1 addresses a major frontier in AI—autonomous scientific discovery and reasoning in a highly complex, real-world domain (spatial biology). Its end-to-end evaluation paradigm has massive potential to accelerate biomedical research. While Paper 2 offers valuable methodological insights into LLM reasoning on SAT problems, Paper 1's benchmark directly bridges AI and translational medicine, offering broader interdisciplinary impact and real-world applicability.