CoDaS: AI Co-Data-Scientist for Biomarker Discovery via Wearable Sensors

Yubin Kim, Salman Rahman, Samuel Schmidgall, Chunjong Park, A. Ali Heydari, Ahmed A. Metwally, Hong Yu, Xin Liu

Apr 16, 2026

arXiv:2604.14615v1

cs.AI (primary)

#172 of 2292 · Artificial Intelligence

Tournament Score: 1525 ± 38

Win Rate: 77%

Rating: 6.5 / 10

Significance 6 · Rigor 7 · Novelty 6.5 · Clarity 5.5

Abstract

Scientific discovery in digital health requires converting continuous physiological signals from wearable devices into clinically actionable biomarkers. We introduce CoDaS (AI Co-Data-Scientist), a multi-agent system that structures biomarker discovery as an iterative process combining hypothesis generation, statistical analysis, adversarial validation, and literature-grounded reasoning with human oversight using large-scale wearable datasets. Across three cohorts totaling 9,279 participant-observations, CoDaS identified 41 candidate digital biomarkers for mental health and 25 for metabolic outcomes, each subjected to an internal validation battery spanning replication, stability, robustness, and discriminative power. Across two independent depression cohorts, CoDaS surfaced circadian instability-related features in both datasets, reflected in sleep duration variability (DWB, ρ = 0.252, p < 0.001) and sleep onset variability (GLOBEM, ρ = 0.126, p < 0.001). In a metabolic cohort, CoDaS derived a cardiovascular fitness index (steps/resting heart rate; ρ = -0.374, p < 0.001), and recovered established clinical associations, including the hepatic function ratio (AST/ALT; ρ = -0.375, p < 0.001), a known correlate of insulin resistance. Incorporating CoDaS-derived features alongside demographic variables led to modest but consistent improvements in predictive performance, with cross-validated ΔR² increases of 0.040 for depression and 0.021 for insulin resistance. These findings suggest that CoDaS enables systematic and traceable hypothesis generation and prioritization for biomarker discovery from large-scale wearable data.

AI Impact Assessments

(1 model)

Scientific Impact Assessment: CoDaS — AI Co-Data-Scientist for Biomarker Discovery via Wearable Sensors

1. Core Contribution

CoDaS is a multi-agent LLM-based system that structures digital biomarker discovery from wearable sensor data as an iterative six-phase pipeline: data profiling, hypothesis generation, parallel statistical/ML exploration, adversarial validation, mechanistic reasoning, and report synthesis. The key architectural innovation is the separation of concerns across specialized agents (Scout, Critic, Defender, Mechanism, Novelty agents) operating on shared state, combined with a hybrid deterministic-generative approach where statistical computations are executed deterministically while LLMs handle interpretation and hypothesis generation.

The system was evaluated across three cohorts (N=9,279 total) spanning mental health (depression via PHQ-8/PHQ-4) and metabolic disease (insulin resistance via HOMA-IR), producing 41 mental health and 25 metabolic biomarker candidates. The system also includes a structured 11-test validation battery and extensive leakage prevention mechanisms.

2. Methodological Rigor

Strengths in validation design: The 11-check validation battery (replication, permutation testing, bootstrap stability, leave-one-out influence, subgroup consistency, method triangulation, construct validity gates, causal robustness, construct independence, CI consistency, and discriminative power) is thoughtfully designed and represents a serious attempt to prevent the spurious discovery problem that plagues automated feature mining. The construct independence gate correctly rejected TG/HDL ratio as near-tautological — a good positive control.
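
As a rough illustration of one item in that battery, the sketch below runs a permutation test on a Spearman correlation using only the Python standard library. It is not the paper's implementation: the function names (`average_ranks`, `spearman`, `permutation_p_value`), the two-sided test, and the default of 2,000 permutations are all assumptions made for the example.

```python
import random


def average_ranks(values):
    """Return 1-based ranks, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of positions i..j, shifted to 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; assumes neither input is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5


def spearman(x, y):
    """Spearman rho computed as the Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def permutation_p_value(x, y, n_perm=2000, seed=0):
    """Two-sided permutation p-value for |rho| (hypothetical defaults)."""
    rng = random.Random(seed)
    observed = abs(spearman(x, y))
    shuffled = list(y)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # break any real pairing between x and y
        if abs(spearman(x, shuffled)) >= observed:
            hits += 1
    # Add-one correction so the estimate is never exactly zero.
    return (hits + 1) / (n_perm + 1)
```

Shuffling the outcome while holding the feature fixed gives a null distribution for ρ that makes no distributional assumption, which is presumably why a check of this kind sits alongside bootstrap stability in the battery.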

Leakage prevention: The separation of label data from LLM agents, the |ρ|>0.85 construct overlap threshold, and participant-level cross-validation splits are appropriate safeguards.
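
Two of these safeguards are simple enough to sketch. The code below is an assumption-laden illustration, not the authors' code: `participant_level_folds` and `flags_construct_overlap` are hypothetical names, and the review does not say exactly which pair of variables the |ρ| > 0.85 threshold is applied to, so the gate here just takes a precomputed correlation.

```python
import random
from collections import defaultdict


def participant_level_folds(participant_ids, n_folds=5, seed=0):
    """Assign row indices to folds so each participant lands in exactly one fold.

    Splitting by participant rather than by row keeps repeated observations
    of the same person from appearing in both training and test data.
    """
    unique_ids = sorted(set(participant_ids))
    random.Random(seed).shuffle(unique_ids)
    fold_of = {pid: i % n_folds for i, pid in enumerate(unique_ids)}
    folds = defaultdict(list)
    for row, pid in enumerate(participant_ids):
        folds[fold_of[pid]].append(row)
    return [folds[f] for f in range(n_folds)]


def flags_construct_overlap(rho, threshold=0.85):
    """Return True when |rho| exceeds the overlap threshold.

    The 0.85 default comes from the review; what rho is measured between
    (e.g. candidate feature vs. outcome) is an assumption of this sketch.
    """
    return abs(rho) > threshold
```

A candidate flagged by the gate would be treated as a restatement of the outcome rather than a discovery, which matches the review's description of the TG/HDL ratio being rejected as near-tautological.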

Significant caveats the authors acknowledge: The entire study is exploratory and non-preregistered. No external validation was performed. The GLOBEM endpoint was data-driven (selected by the system itself). The effect sizes are modest (ρ=0.126–0.252 for depression; ΔR²=0.021–0.040 for predictive models). The GLOBEM CV AUC of 0.535 is essentially chance-level, which the authors acknowledge but somewhat bury among other results. The 11 validation checks are not independent, share underlying data, and were not preregistered — they are pipeline design choices, not a prespecified analysis plan.
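
For readers unfamiliar with the ΔR² figures, they are presumably the usual incremental-validity quantity: the difference in cross-validated R² between a model with demographics plus CoDaS-derived features and a demographics-only model. The notation below is an assumption of this note, not taken from the paper.

```latex
\Delta R^2 \;=\; R^2_{\mathrm{CV}}(\text{demographics} + \text{CoDaS features}) \;-\; R^2_{\mathrm{CV}}(\text{demographics only})
```

Under that reading, the reported values of 0.040 (depression) and 0.021 (insulin resistance) mean the added features explain roughly 4 and 2 additional percentage points of outcome variance.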

Concerns: The claim that CoDaS "discovered" biomarkers is somewhat overstated given that the top findings (sleep variability–depression, resting heart rate–metabolic risk) are well-established in the literature. The autonomously constructed composite features (e.g., steps/resting HR) are interesting but straightforward ratios that domain experts would likely construct. The benchmark evaluations, while impressive, compare a full multi-agent system against individual models — an inherently asymmetric comparison the authors acknowledge.

3. Potential Impact

Practical value: The framework demonstrates that multi-agent LLM systems can automate significant portions of the hypothesis-generation-to-report pipeline for wearable biomarker research. The human expert evaluation showing 57% effort preservation and estimated 37 person-days of equivalent manual work suggests genuine practical utility as a research acceleration tool.

Translational limitations: The modest effect sizes (ΔR²=0.021 for insulin resistance from wearables) and lack of external validation substantially limit near-term clinical translatability. The authors correctly position CoDaS as a "hypothesis generation and prioritization platform" rather than a diagnostic system, but this framing somewhat reduces the immediate clinical impact.

Broader influence: The architecture — particularly the adversarial critic-defender debate, the Fact Sheet for preventing numerical hallucination, and the quality gates for output suppression — offers reusable design patterns for AI-assisted scientific discovery systems in other domains. The held-out validation experiment (Appendix C) demonstrating that the confounder and subgroup checks improve out-of-sample robustness is a valuable methodological contribution.
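
The review does not describe how the Fact Sheet works internally. One plausible reading, sketched below purely as an assumption, is that every number in generated text must match a value that was computed deterministically earlier in the pipeline. The function name `unsupported_numbers` and the regular expression are hypothetical.

```python
import re

# Matches plain integers and decimals with an optional leading minus sign.
# Limitation of this sketch: "9,279" would be read as 9 and 279, and a
# hyphenated label such as "PHQ-8" would yield -8.
NUMBER = re.compile(r"-?\d+(?:\.\d+)?")


def unsupported_numbers(text, fact_sheet, tolerance=1e-9):
    """Return numeric tokens in `text` that match no value in `fact_sheet`.

    `fact_sheet` maps a label to a number produced by deterministic code;
    any number in the text without a match is a candidate hallucination.
    """
    allowed = list(fact_sheet.values())
    missing = []
    for token in NUMBER.findall(text):
        value = float(token)
        if not any(abs(value - a) <= tolerance for a in allowed):
            missing.append(token)
    return missing


if __name__ == "__main__":
    # Values taken from the abstract; the draft sentences are made up.
    sheet = {"rho_sleep_duration_variability": 0.252, "n_observations": 9279}
    grounded = "Sleep duration variability tracked depression (rho = 0.252)."
    invented = "The association was strong (rho = 0.31)."
    print(unsupported_numbers(grounded, sheet))
    print(unsupported_numbers(invented, sheet))
```

A gate like this would let a report be suppressed or regenerated whenever the returned list is non-empty, in line with the quality gates for output suppression mentioned above.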

4. Timeliness & Relevance

The paper addresses a genuine bottleneck: the gap between massive wearable data generation and clinically actionable biomarker discovery. It sits at the intersection of two hot areas — LLM-based scientific agents and digital health — and arrives as consumer wearable adoption accelerates. The comparison with AI co-scientist, AlphaEvolve, and Biomni positions it well within the current landscape.

5. Strengths & Limitations

Key strengths:

Comprehensive validation battery with well-designed leakage prevention

Thoughtful human evaluation (n=15, blinded, 7-axis assessment with inter-rater reliability analysis)

Honest reporting of modest effect sizes and near-chance GLOBEM performance

Strong benchmark performance across six diverse evaluations

Extensive limitations section that acknowledges most weaknesses

Notable limitations:

No external validation — the cross-cohort "construct convergence" between DWB and GLOBEM uses different instruments and populations, making it suggestive at best

The "novel" discoveries are largely confirmatory of known associations or straightforward composite features

The GLOBEM cohort contributes little beyond demonstrating the pipeline doesn't fabricate signal from noise

The user study compares against baselines (generic DS agents, Biomni) that are not designed for this specific task, making the comparison somewhat unfair

Fixed agent topology limits extensibility claims

The paper is extremely long (~42 pages) with substantial redundancy, which paradoxically makes it harder to assess the core contributions

Reproducibility concern: The system relies on proprietary models (Gemini-3.1 Pro, Gemini-3 Flash) and proprietary datasets (DWB, WEAR-ME), limiting independent reproduction.

Overall Assessment

CoDaS represents solid engineering work that demonstrates the feasibility of multi-agent LLM systems for structured biomarker hypothesis generation from wearable data. The validation framework and safety mechanisms are well-designed. However, the scientific novelty of the discovered biomarkers is limited (mostly confirming known associations), effect sizes are modest, and no external validation is provided. The primary contribution is the system architecture and methodology rather than the biomarker findings themselves. The paper would benefit from tighter focus and clearer separation of genuine discovery from confirmatory analysis.

Rating: 6.5 / 10

Significance 6 · Rigor 7 · Novelty 6.5 · Clarity 5.5

Generated May 5, 2026

Comparison History (30)

vs. Neural Decision-Propagation for Answer Set Programming

gpt-5.2 · 5/5/2026

Paper 1 has higher likely scientific impact due to strong timeliness and real-world applicability in digital health, using large-scale wearable cohorts and yielding numerous candidate biomarkers with explicit validation (replication, stability, robustness). Its multi-agent, traceable workflow could generalize broadly across biomedical sensing and clinical research, affecting multiple fields (health AI, epidemiology, psychiatry, metabolism). Paper 2 is novel within neuro-symbolic ASP and may advance solver scalability, but its immediate impact is narrower to logic programming/benchmarks and hinges on adoption and demonstrated superiority over mature solvers in diverse settings.

vs. Neural Decision-Propagation for Answer Set Programming

claude-opus-4.6 · 5/5/2026

CoDaS addresses a high-impact application domain (digital health biomarker discovery from wearables) with broad interdisciplinary relevance spanning AI, clinical medicine, and public health. It demonstrates concrete results across large cohorts (9,279 participants) with validated biomarkers for depression and metabolic outcomes. The multi-agent LLM-based scientific discovery framework is timely and has clear real-world applications. Paper 1, while technically rigorous in advancing neuro-symbolic AI, targets a more niche community (ASP/logic programming) with less immediate broad impact and fewer downstream applications.

vs. FitText: Evolving Agent Tool Ecologies via Memetic Retrieval

gemini-3 · 5/5/2026

Paper 2 demonstrates higher potential scientific impact by directly addressing the critical bottleneck of scientific discovery in digital health. While Paper 1 offers strong foundational improvements for AI agent tool retrieval, Paper 2 applies multi-agent systems to massive real-world datasets (9,279 participants) to discover actionable clinical biomarkers for depression and metabolic disorders. Its interdisciplinary approach perfectly embodies 'AI for Science', offering a rigorously validated, traceable methodology that has immediate, tangible implications for healthcare, wearable technology, and computational biology.

vs. FitText: Evolving Agent Tool Ecologies via Memetic Retrieval

claude-opus-4.6 · 5/5/2026

CoDaS addresses a critical gap in digital health by automating biomarker discovery from wearable sensor data using a multi-agent AI system. It demonstrates real-world clinical relevance across 9,279 participants, recovers known clinical associations, and discovers novel biomarkers for mental health and metabolic outcomes. Its breadth of impact spans AI, clinical medicine, and digital health—fields with enormous societal importance. Paper 2, while technically sound in improving tool retrieval for agents, addresses a narrower software engineering problem with less direct real-world health/scientific impact.

vs. The Long-Horizon Task Mirage? Diagnosing Where and Why Agentic Systems Break

gemini-3 · 5/5/2026

Paper 1 presents a highly innovative application of AI multi-agent systems to actual scientific discovery in medicine, demonstrating tangible results (new digital biomarkers) on large patient cohorts. This 'AI for Science' approach bridges computer science and healthcare, promising profound real-world clinical impact and cross-disciplinary citations. While Paper 2 offers a valuable benchmark for AI development, benchmarks in rapidly evolving fields like LLM agents often have a shorter lifespan of relevance compared to novel AI-driven scientific discovery paradigms.

vs. LLM-Guided Strategy Synthesis for Scalable Equality Saturation

claude-opus-4.6 · 5/5/2026

CoDaS addresses a broadly impactful problem—automated biomarker discovery from wearable health data—combining AI agents with clinical validation across large cohorts. Its interdisciplinary reach spans digital health, clinical medicine, and AI, with direct real-world healthcare applications. While EggMind is technically strong and novel in compiler optimization, its impact is narrower, primarily benefiting the programming languages and compiler community. CoDaS's timeliness (intersection of LLMs, wearables, and precision medicine) and potential to accelerate clinical research give it higher estimated scientific impact across a broader audience.

vs. Hive: A Multi-Agent Infrastructure for Algorithm- and Task-Level Scaling

gpt-5.2 · 5/5/2026

Paper 2 likely has higher scientific impact because it targets a high-value, timely real-world problem (digital biomarker discovery from wearables) and demonstrates results across multiple sizable cohorts with replication-style validation and clinically interpretable findings. Its applications span digital health, psychiatry, metabolism, and clinical decision support, increasing breadth and translational relevance. Paper 1 is novel infrastructure for multi-agent LLM scaling, but the reported gains (speedups/scheduling metrics) appear more incremental and primarily impactful within LLM systems research rather than across domains.

vs. Fun-TSG: A Function-Driven Multivariate Time Series Generator with Variable-Level Anomaly Labeling

gpt-5.2 · 5/5/2026

Paper 1 likely has higher impact due to its direct clinical relevance and demonstration on large real-world wearable cohorts with replicated findings and measurable predictive gains. Its multi-agent, traceable biomarker discovery workflow is timely for digital health and could influence both methodology (human-in-the-loop AI discovery) and applied outcomes (candidate biomarkers for mental health/metabolic risk). Paper 2 is useful infrastructure for benchmarking anomaly detection, but generators often see narrower adoption unless they become standard benchmarks; its primary impact is methodological within ML evaluation rather than immediate translational value.

vs. Poly-EPO: Training Exploratory Reasoning Models

gpt-5.2 · 5/5/2026

Paper 2 has higher likely scientific impact due to broader cross-domain applicability: a general post-training framework (set RL) and Poly-EPO can influence many areas of LM training, reasoning, and test-time compute scaling. Its methodological contribution (adapting RL advantage for set objectives) is more foundational and reusable than a domain-specific multi-agent pipeline. Paper 1 is timely and impactful for digital health, but its advances are more applied and likely constrained by dataset/clinical validation requirements, limiting breadth and near-term generalization beyond wearables.

vs. RePAIR: Interactive Machine Unlearning through Prompt-Aware Model Repair

gpt-5.2 · 5/5/2026

Paper 2 is likely higher impact due to strong novelty (interactive, user-driven unlearning at inference time; training-free single-sample STAMP with closed-form updates), high timeliness (privacy, safety, regulation for LLMs), and broad applicability across ML/security/HCI/policy. It proposes a general framework (watchdog/surgeon/patient) and reports strong quantitative results against multiple baselines, suggesting methodological rigor and portability (on-device efficiency). Paper 1 is valuable for digital health biomarker workflows, but its impact is more domain-specific and shows modest predictive gains, with more limited cross-field reach.

vs. RePAIR: Interactive Machine Unlearning through Prompt-Aware Model Repair

gpt-5.2 · 5/5/2026

Paper 2 has higher likely impact due to strong novelty (user-driven, inference-time unlearning) and broad, timely relevance to safety, privacy, and regulation of widely deployed LLMs. The STAMP closed-form, training-free update with a low-rank acceleration is methodologically distinctive and practical (on-device), enabling real-world adoption across many applications and model providers. Its scope generalizes beyond one domain and could influence model editing, security, and compliance research. Paper 1 is valuable for digital health, but its impact is narrower, improvements are modest, and results are more application-specific.

vs. AblateCell: A Reproduce-then-Ablate Agent for Virtual Cell Repositories

claude-opus-4.6 · 5/5/2026

CoDaS addresses the high-impact problem of biomarker discovery from wearable sensor data, validated across 9,279 participants in three cohorts with clinically meaningful findings (circadian instability in depression, metabolic biomarkers). It has broader real-world applications in digital health, a larger potential user base (clinicians, epidemiologists, wearable companies), and demonstrates novel multi-agent AI methodology for scientific discovery. AblateCell, while technically solid for ML reproducibility/ablation in virtual cell repositories, addresses a narrower software engineering problem with a smaller target audience and less direct clinical or scientific translation.

vs. A Systematic Approach for Large Language Models Debugging

gemini-3 · 5/5/2026

Paper 1 presents a novel, concrete application of AI agents to a high-impact domain (digital health) and backs its claims with rigorous empirical validation across multiple large-scale datasets, discovering specific clinically relevant biomarkers. In contrast, Paper 2 proposes a conceptual framework for LLM debugging without providing concrete empirical results or metrics in the abstract, making its immediate scientific impact less certain.

vs. Escaping the Context Bottleneck: Active Context Curation for LLM Agents via Reinforcement Learning

gemini-3 · 5/5/2026

Paper 2 demonstrates a profound interdisciplinary impact by applying multi-agent AI to accelerate tangible scientific discovery in healthcare. While Paper 1 offers valuable efficiency improvements for LLMs, Paper 2's ability to discover and validate clinically relevant biomarkers from large-scale wearable data represents a paradigm shift in how digital health research is conducted, offering significant real-world applications and immediate societal benefits.

vs. Zero-Shot Signal Temporal Logic Planning with Disjunctive Branch Selection in Dynamic Semantic Maps

claude-opus-4.6 · 5/5/2026

CoDaS addresses a high-impact problem at the intersection of AI, digital health, and clinical biomarker discovery. It demonstrates practical utility across 9,279 participants, validates findings across independent cohorts, and recovers known clinical associations while discovering new ones. Its multi-agent framework for systematic biomarker discovery from wearable data has broad applicability in precision medicine and digital health. Paper 1, while technically sound, addresses a narrower robotics/control problem with more incremental contributions (combining transformers with STL planning). Paper 2's real-world clinical relevance and scalability give it greater cross-disciplinary impact potential.

vs. Step-level Optimization for Efficient Computer-use Agents

gemini-3 · 5/5/2026

Paper 1 applies an innovative multi-agent AI system directly to a critical scientific domain—biomarker discovery for digital health. By demonstrating concrete clinical findings and advancing 'AI for Science', its cross-disciplinary impact on medicine, bioinformatics, and health sciences gives it broader scientific significance compared to Paper 2, which primarily addresses computational efficiency and engineering optimizations for software automation agents.

vs. Contextual Agentic Memory is a Memo, Not True Memory

gemini-3 · 5/5/2026

Paper 1 challenges a fundamental architectural paradigm in current AI systems (treating retrieval as memory) and proposes a neuroscience-inspired theoretical shift. Such foundational critiques often drive the next generation of architectures, leading to broader and deeper cross-disciplinary impact than application-specific tools, despite Paper 2's strong empirical results and immediate clinical utility.

vs. Sheaf-Theoretic Planning: A Categorical Foundation for Resilient Multi-Agent Autonomous Systems

gemini-3 · 5/5/2026

Paper 1 demonstrates higher potential scientific impact due to its strong empirical validation and immediate real-world applicability in digital health. It leverages a large dataset (9,279 participants) across multiple cohorts to discover clinically actionable biomarkers, showing measurable improvements in predicting depression and insulin resistance. In contrast, Paper 2 proposes a highly theoretical mathematical framework for multi-agent systems without presenting empirical validation in the abstract. Paper 1's combination of cutting-edge AI agents with rigorous statistical validation in a high-impact domain (healthcare) makes its contributions more timely, actionable, and broadly influential.

vs. Agentic Frameworks for Reasoning Tasks: An Empirical Study

claude-opus-4.6 · 5/5/2026

CoDaS presents a novel multi-agent system applied to a high-impact domain (digital health biomarker discovery), demonstrating real-world clinical utility across 9,279 participants with validated biomarkers for mental health and metabolic outcomes. It bridges AI and healthcare with concrete, reproducible findings. Paper 1, while valuable as a benchmarking study of 22 agentic frameworks, is primarily a comparative evaluation without introducing new methods. Its findings—that orchestration quality matters and math reasoning drops—are useful but incremental. CoDaS has broader cross-disciplinary impact spanning AI, wearable computing, and clinical medicine.

vs. Causal Discovery as Dialectical Aggregation: A Quantitative Argumentation Framework

gemini-3 · 5/5/2026

Paper 2 presents a highly applied, interdisciplinary approach with direct real-world implications for healthcare and precision medicine. Its use of multi-agent AI for automated biomarker discovery aligns with cutting-edge trends in AI for science. The extensive empirical validation across large clinical cohorts (9,279 participants) demonstrating tangible predictive improvements offers broader and more immediate societal and scientific impact compared to the theoretical, benchmark-driven methodological contributions of Paper 1.