Evaluating the Utility of Personal Health Records in Personalized Health AI

Rory Sayres, Kejia Chen, Ayush Jain, Matthew Thompson, Jonathan Richina, Xiang Yin, Jimmy Hu, Fan Zhang

May 18, 2026

arXiv:2605.18937v1

cs.AI (primary)

#1561 of 2292 · Artificial Intelligence

Tournament Score: 1365 ± 41 (scale 1050 to 1800)

Win Rate: 42%

Rating: 6.5 / 10

Significance 7 · Rigor 5.5 · Novelty 6 · Clarity 7

Abstract

Patient-managed Personal Health Records (PHRs) promise to empower patients to better understand their health, but information in the record is complex, potentially hindering insights. In this study, we assess the potential of large language models (LLMs, Gemini 3.0 Flash) to provide helpful answers to user health queries, when provided clinical data from PHRs as context. A total of 2,257 user queries were drawn from 3 different distributions to represent patient questions: shorter web search queries, longer questions derived from templates of chatbot conversations, and questions patients asked their healthcare team (patient calls). Queries were matched with de-identified PHRs (from a pool of 1,945). Gemini responses were generated (1) without PHR context; (2) with a basic summary of demographics, conditions, and medications; (3) with full, extensive clinical notes. For evaluation, we leveraged an existing rating framework (SHARP), and developed a new framework for specific error modes when interpreting PHRs. Evaluation was performed using autoraters for the full set, and with clinician ratings for a subset (n=95), with both sets of raters knowing the full PHR context. We see significant improvements in the helpfulness of answers to all question types with PHR data (p < 0.001, paired t-test). We also observe potential gains in safety, accuracy, relevance and personalization of answers. Our PHR evaluation framework further identifies gaps in LLM understanding of particular aspects of complex PHRs, such as temporal disorientation, and rare but meaningful confabulations. These results suggest potential for PHR data to help people with a wide range of user needs, and provide a framework for monitoring for gaps in LLM answers based on PHR context. This study motivates further work to assess and realize potential benefits to users from understanding their health records.

AI Impact Assessments

(1 model)

Scientific Impact Assessment

Core Contribution

This paper addresses a practical and timely question: does providing Personal Health Records (PHRs) as context to LLMs improve the quality of health-related responses? The study pairs 2,257 user queries (from three distinct distributions—web search, chatbot templates, and patient calls) with 1,945 de-identified PHRs and evaluates Gemini 3.0 Flash responses under three conditions: no PHR, basic PHR (demographics, conditions, medications), and full PHR (complete clinical notes). The study makes two primary contributions: (1) empirical evidence that PHR context significantly improves helpfulness, actionability, and motivation of LLM responses, and (2) a novel 16-axis PHR-specific evaluation framework covering fidelity, robustness, utility, safety, and equity dimensions.

The core finding—that PHR context improves helpfulness from 0.13 to ~0.5 on clinician ratings ([-1,1] scale)—is intuitive but important to quantify rigorously. The identification of specific failure modes (temporal disorientation at 3-6% of responses, groundedness errors including confabulation, and social determinant blindspots) provides actionable insights for the field.

Methodological Rigor

Strengths: The experimental design is well-structured with three clearly defined context conditions enabling controlled comparisons. The use of three distinct query distributions provides ecological validity across different user interaction modalities. The paired design (same query-PHR pair across conditions) strengthens statistical power for within-pair comparisons. The PHR dataset is substantial (1,945 records, 77,644 encounters, 9 health systems) with detailed characterization.
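The power advantage of the paired design can be illustrated with a minimal sketch. The ratings below are hypothetical (they are not the study's data), and the helper names `paired_t` and `welch_t` are introduced here for illustration only. The sketch compares the t statistic obtained by pairing each query-PHR pair across conditions with the one obtained by treating the two conditions as independent samples.

```python
import math
import statistics


def paired_t(before, after):
    """t statistic for paired samples: mean difference over its standard error."""
    diffs = [y - x for x, y in zip(before, after)]
    n = len(diffs)
    return statistics.mean(diffs) / (statistics.stdev(diffs) / math.sqrt(n))


def welch_t(a, b):
    """t statistic that ignores the pairing and treats the samples as independent."""
    se = math.sqrt(statistics.variance(a) / len(a) + statistics.variance(b) / len(b))
    return (statistics.mean(b) - statistics.mean(a)) / se


# Hypothetical helpfulness ratings on a [-1, 1] scale for the same five
# query-PHR pairs, answered without PHR context and with the full PHR.
no_phr = [-0.6, -0.2, 0.1, 0.4, 0.7]
full_phr = [-0.3, 0.0, 0.4, 0.6, 1.0]

t_paired = paired_t(no_phr, full_phr)
t_unpaired = welch_t(no_phr, full_phr)
print(f"paired t = {t_paired:.2f}, unpaired t = {t_unpaired:.2f}")
```

Because queries vary widely in baseline difficulty, the between-query variance dominates the unpaired standard error. Pairing removes that variance, so the same mean improvement yields a much larger test statistic.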

Weaknesses: The clinician evaluation subset is notably small (n=95 in the abstract, n=65 in the methods; the inconsistency itself raises concerns). This limits statistical power for detecting differences on the safety and accuracy axes, where clinician ratings showed no significant effects even though autoraters suggested improvements. The heavy reliance on LLM-based autoraters for the primary dataset creates circularity concerns, since Gemini-family models are used to evaluate Gemini-generated responses. While clinician-autorater alignment is discussed, the agreement statistics are not formally presented.

The query-PHR pairing methodology introduces selection bias. For search queries, an LLM-based plausibility scorer selects pairs, meaning the evaluation is conditioned on the model's own judgment of relevance. The chatbot templates are synthetically generated, not real user queries. Only the patient calls represent authentic user questions paired with their actual records.

The population skews older, more female, more White, and more medically complex than the general population, which the authors acknowledge. The US-only PHR sourcing limits generalizability.
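One common way to report the clinician-autorater agreement that this section finds missing is Cohen's kappa, which corrects raw agreement for chance. The sketch below is illustrative only: the labels are hypothetical, `cohens_kappa` is a helper written for this example, and the paper may use a different agreement measure.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Chance-corrected agreement between two raters over the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must label the same non-empty item set")
    n = len(rater_a)
    # Observed agreement: fraction of items given the same label.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Expected agreement if each rater labeled independently at their own base rates.
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    labels = counts_a.keys() | counts_b.keys()
    expected = sum(counts_a[k] * counts_b[k] for k in labels) / (n * n)
    if expected == 1.0:
        raise ValueError("kappa is undefined when both raters use a single shared label")
    return (observed - expected) / (1 - expected)


# Hypothetical safety labels for six responses.
clinician = ["safe", "safe", "unsafe", "safe", "unsafe", "safe"]
autorater = ["safe", "safe", "safe", "safe", "unsafe", "safe"]

kappa = cohens_kappa(clinician, autorater)
print(f"kappa = {kappa:.3f}")
```

In this toy case the raters agree on five of six items, yet kappa is well below that raw rate because both raters label most responses "safe". This is why raw agreement alone can overstate alignment.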

Potential Impact

The practical implications are significant given the rapid adoption of LLM-PHR integrations (OpenAI, Anthropic, Google all pursuing this). The PHR-specific evaluation framework (Table 5) is perhaps the most durable contribution—it provides a structured rubric that could become a standard for benchmarking PHR-aware health AI systems. The identification of temporal reasoning as the dominant failure mode (up to 5.74% of full PHR responses) gives model developers a concrete target for improvement.

The self-critique loop demonstration (Table 10), showing 98% reduction in temporal errors and 60% reduction in groundedness errors, suggests tractable mitigation strategies, though the authors correctly note this is a proof-of-concept rather than a deployable solution.

The work has relevance for health policy and patient engagement. If PHR-contextualized AI can genuinely improve health literacy and self-management, this could address documented barriers to PHR adoption (complexity, perceived low utility).

Timeliness & Relevance

This study is highly timely. Multiple commercial LLM providers launched PHR integration features in 2025-2026, yet rigorous evaluation of these capabilities was largely absent. The paper fills an important evidence gap at a moment when deployment is already occurring. The cited statistics—that 41% of LLM health users have uploaded personal medical information—underscore the urgency.

Strengths & Limitations

Key Strengths:

First systematic evaluation of PHR context utility for consumer health queries at meaningful scale

Well-designed three-condition experimental framework with within-pair comparisons

Novel, comprehensive PHR-specific evaluation rubric with 16 axes across 5 categories

Detailed qualitative analysis of failure modes with concrete examples

Demonstration of self-critique remediation potential

Rich supplementary materials with full prompts enabling reproducibility

Notable Limitations:

Single model evaluated (Gemini 3.0 Flash)—no comparative benchmarking against other LLMs

Small clinician evaluation set limits conclusions on safety/accuracy improvements

Autorater-primary evaluation with potential circularity

Synthetic query generation for two of three query types

No user-facing evaluation—all assessments are by clinicians or autoraters, not by patients who would actually receive these responses

The "Basic PHR" condition represents an idealized extraction, not what users would realistically provide in conversation

No assessment of downstream behavioral outcomes or potential for harm from over-reliance

Abstract-methods inconsistencies (2,257 vs 2,255 queries; n=95 vs n=65 for clinician evaluation)

Additional Observations:

The gap between autorater and clinician assessments is concerning and under-discussed. Autoraters consistently show larger effect sizes and detect significant differences where clinicians do not (particularly for safety and accuracy). This suggests autoraters may be systematically overestimating quality improvements, which has implications for the field's growing reliance on LLM-as-judge evaluation paradigms.

The study design evaluates responses in isolation rather than in conversational context, missing important dynamics of how users would actually interact with PHR-aware systems (e.g., follow-up questions, misunderstandings, over-trust).

Rating: 6.5 / 10

Significance 7 · Rigor 5.5 · Novelty 6 · Clarity 7

Generated May 20, 2026

Comparison History (19)

vs. Investigating Concept Alignment Using Implausible Category Members

gpt-5.2 · 5/22/2026

Paper 2 is more novel and broadly impactful: it introduces a probing paradigm (implausible category members) to diagnose concept-boundary alignment between AI and humans, linking findings to downstream safety-relevant behavior. This is timely for AI alignment/evaluation and generalizes across models, domains, and applications. Paper 1 is valuable and application-forward in healthcare, but it is more incremental (LLM-with-context evaluation) and narrower in scope, with methodological dependence on autoraters and a limited clinician-rated subset, reducing general scientific breadth compared to Paper 2.

vs. MPDocBench-Parse: Benchmarking Practical Multi-page Document Parsing

claude-opus-4.6 · 5/22/2026

Paper 1 addresses the high-impact intersection of LLMs, personal health records, and personalized health AI—a timely topic with enormous real-world implications for patient empowerment and healthcare delivery. It introduces novel evaluation frameworks (SHARP adaptation, PHR-specific error modes), uses rigorous methodology with both automated and clinician evaluations, and demonstrates statistically significant improvements. Paper 2, while solid, addresses a more incremental benchmarking contribution in document parsing, a narrower technical domain with less transformative societal potential. The health AI application has broader cross-disciplinary relevance and greater timeliness given the rapid adoption of LLMs in healthcare.

vs. Playing Devil's Advocate: Off-the-Shelf Persona Vectors Rival Targeted Steering for Sycophancy

gemini-3.1 · 5/21/2026

Paper 2 addresses a high-stakes domain (healthcare) by rigorously evaluating LLM utility with Personal Health Records. Its extensive methodology, including clinician ratings and a novel evaluation framework, promises significant real-world applications in patient empowerment and personalized medicine. While Paper 1 offers valuable technical insights into LLM alignment, Paper 2's broader societal relevance, direct application to healthcare, and comprehensive evaluation give it higher potential for widespread scientific and practical impact.

vs. Insights Generator: Systematic Corpus-Level Trace Diagnostics for LLM Agents

gemini-3.1 · 5/21/2026

Paper 2 introduces a foundational, domain-agnostic framework for LLM agent diagnostics, solving a critical scalability bottleneck in AI development. Its broad applicability across all fields of AI engineering gives it higher potential impact compared to Paper 1, which, while highly relevant to healthcare, is primarily an empirical evaluation of an existing model in a specific application domain.

vs. Formal Skill: Programmable Runtime Skills for Efficient and Accurate LLM Agents

gemini-3.1 · 5/20/2026

Paper 1 introduces a fundamental architectural abstraction ('Formal Skill') for LLM agents, addressing critical challenges in token efficiency, reliability, and state management. Its open-source runtime framework has broad applicability across all domains utilizing AI agents. In contrast, Paper 2 is a domain-specific empirical evaluation of existing LLMs in healthcare. While Paper 2 has strong real-world relevance, Paper 1's foundational methodological innovations provide a higher potential for broad, cross-disciplinary scientific impact in the rapidly evolving field of autonomous AI.

vs. What Really Improves Mathematical Reasoning: Structured Reasoning Signals Beyond Pure Code

claude-opus-4.6 · 5/20/2026

Paper 2 addresses a fundamental question about how code and structured reasoning signals affect LLM capabilities during pretraining, providing mechanistic insights via expert-activation patterns. Its findings—that structured reasoning traces rather than executable code drive reasoning improvements—have broad implications for data-centric optimization of foundation models across the entire AI field. Paper 1, while valuable for health AI applications, is more applied and incremental, evaluating an existing LLM on PHR-augmented queries. Paper 2's insights will likely influence training data strategies for many future models, giving it broader and deeper scientific impact.

vs. When Skills Don't Help: A Negative Result on Procedural Knowledge for Tool-Grounded Agents in Offensive Cybersecurity

claude-opus-4.6 · 5/20/2026

Paper 2 addresses a broader, more impactful problem—leveraging PHRs with LLMs to improve personalized healthcare—with clear real-world applications affecting millions of patients. It introduces a novel evaluation framework (SHARP + PHR-specific error modes), uses a large-scale study (2,257 queries, 1,945 PHRs), and demonstrates statistically significant results (p<0.001). Paper 1, while methodologically interesting, reports a negative result in a niche domain (offensive cybersecurity CTF agents) with non-significant findings (p=0.71) from a relatively small dataset (180 runs), limiting its broader impact despite its conceptual contribution about environment-feedback bandwidth.

vs. Beyond the Cartesian Illusion: Testing Two-Stage Multi-Modal Theory of Mind under Perceptual Bottlenecks

gemini-3.1 · 5/20/2026

Paper 1 addresses a fundamental cognitive bottleneck in AI—embodied spatial intelligence and second-order Theory of Mind. By proposing a novel mechanism to overcome the 'Cartesian Illusion' in MLLMs, it offers significant methodological innovation for Embodied AI. While Paper 2 provides highly valuable real-world evaluation of LLMs in healthcare, Paper 1 introduces a foundational algorithmic paradigm that solves a complex, theoretical reasoning gap. This fundamental contribution gives Paper 1 a broader potential scientific impact across robotics, multi-agent systems, and multimodal reasoning.

vs. MetaCogAgent: A Metacognitive Multi-Agent LLM Framework with Self-Aware Task Delegation

gemini-3.1 · 5/20/2026

Paper 1 introduces a foundational architectural advancement in AI by integrating metacognition into multi-agent systems, solving a critical bottleneck in agent coordination and task delegation. Its theoretical innovation and broad applicability across any LLM-driven domain give it a higher potential for widespread scientific impact compared to Paper 2, which, while highly valuable for digital health, is primarily an empirical evaluation of existing LLMs in a specific application area.

vs. EMO-BOOST: Emotion-Augmented Audio-Visual Features for Improved Generalization in Deepfake Detection

claude-opus-4.6 · 5/20/2026

Paper 1 addresses the high-impact intersection of LLMs and personalized healthcare, evaluating how PHR data improves AI-generated health answers using a large-scale study with clinician validation. It introduces novel evaluation frameworks (SHARP adaptation and PHR-specific error modes) applicable broadly to health AI. The domain—patient empowerment through AI—has enormous real-world implications and timeliness given LLM adoption in healthcare. Paper 2 makes a solid but incremental contribution to deepfake detection generalization with a 2.1% AUC improvement on one dataset, representing narrower impact and more limited novelty.

vs. SCICONVBENCH: Benchmarking LLMs on Multi-Turn Clarification for Task Formulation in Computational Science

gpt-5.2 · 5/20/2026

Paper 1 likely has higher scientific impact due to greater novelty and broader, field-general relevance: it introduces a new benchmark and rubric to evaluate an under-measured capability (multi-turn clarification for ill-posed scientific tasks), with clear methodological structure (ontology + multi-dimensional scoring) and actionable failure analyses. Benchmarks often become community standards, influencing model design and evaluation across domains beyond computational science. Paper 2 is timely and high-application, but is more incremental (single-model study, limited clinician subset) and more domain-specific, with impact constrained by deployment/regulatory and data-access barriers.

vs. LinAlg-Bench: A Forensic Benchmark Revealing Structural Failure Modes in LLM Mathematical Reasoning

claude-opus-4.6 · 5/20/2026

Paper 1 introduces a novel diagnostic benchmark with a rigorous forensic error taxonomy that reveals fundamental structural failure modes in LLM mathematical reasoning, including the discovery of a near-universal behavioral threshold and novel hallucination patterns. These findings have broad implications across AI safety, model architecture design, and understanding of LLM cognitive limitations. Paper 2, while practically useful, is more incremental—applying an existing LLM to PHR-based question answering with predictable improvements from added context. Paper 1's methodological innovation (forensic error pipeline) and theoretical contributions (working memory limits, fabrication transitions) offer deeper scientific insights with wider cross-field relevance.

vs. AgentNLQ: A General-Purpose Agent for Natural Language to SQL

claude-opus-4.6 · 5/20/2026

Paper 2 addresses a more novel and impactful intersection of LLMs with personal health records, a domain with enormous real-world implications for patient empowerment and healthcare delivery. It introduces a new evaluation framework (PHR-specific error modes), uses substantial scale (2,257 queries, 1,945 PHRs), and identifies clinically meaningful failure modes like temporal disorientation. Paper 1, while achieving strong NL2SQL benchmark results, operates in a more crowded research space with incremental multi-agent improvements. Paper 2's broader societal impact potential and cross-disciplinary relevance (AI, clinical informatics, patient safety) give it the edge.

vs. PEEK: Context Map as an Orientation Cache for Long-Context LLM Agents

gemini-3.1 · 5/20/2026

Paper 2 proposes a fundamental architectural innovation (PEEK) for LLM agents dealing with long-context environments. Its introduction of a context map as an orientation cache solves a core efficiency and reasoning bottleneck in modern AI systems, offering broad utility across multiple domains like software engineering and information retrieval. While Paper 1 is a valuable empirical evaluation of LLMs in healthcare, Paper 2's methodological advancements in core AI agent design are likely to have a wider, more foundational scientific impact and spur more downstream research.

vs. Memory-Augmented Reinforcement Learning Agent for CAD Generation

gpt-5.2 · 5/20/2026

Paper 2 is more novel methodologically, introducing a memory-augmented RL framework with dual-track memory and utility-based retrieval to address geometric feasibility and long-horizon error correction in CAD generation—an important, under-solved technical bottleneck. Its approach is broadly transferable to other tool-using agents with hard constraints (robotics, planning, program synthesis), increasing cross-field impact. Paper 1 is timely and application-relevant, but largely evaluates an existing LLM with added PHR context and proposes evaluation taxonomies; the technical innovation and generalizability are comparatively lower, and clinical impact depends on substantial downstream validation and deployment constraints.

vs. AutoRubric-T2I: Robust Rule-Based Reward Model for Text-to-Image Alignment

gemini-3.1 · 5/20/2026

Paper 2 addresses the integration of LLMs with Personal Health Records, a high-stakes domain with massive potential for real-world clinical and societal impact. By focusing on safety, personalization, and evaluating specific error modes in a medical context, it provides critical foundational work for the safe deployment of personalized health AI. This gives it a broader interdisciplinary impact and higher real-world application potential compared to the specific algorithmic improvements for text-to-image model alignment presented in Paper 1.

vs. Generative Auto-Bidding with Unified Modeling and Exploration

gemini-3.1 · 5/20/2026

While Paper 2 presents a highly effective and deployed industrial application for digital advertising, Paper 1 addresses a critical challenge in medical AI: safely leveraging Personal Health Records using LLMs. Given the profound societal implications of personalized healthcare AI, along with the introduction of a novel evaluation framework identifying specific error modes (like temporal disorientation), Paper 1 offers broader scientific, ethical, and practical impact across multiple disciplines.

vs. Beyond Mode Collapse: Distribution Matching for Diverse Reasoning

claude-opus-4.6 · 5/20/2026

Paper 2 addresses a fundamental problem (mode collapse) in reinforcement learning for LLMs, proposing a principled and generalizable method (DMPO) with broad applicability across reasoning tasks, combinatorial optimization, and multiple modalities. Its theoretical contribution (forward KL approximation for on-policy RL) and demonstrated improvements across diverse benchmarks suggest wider impact across the ML/AI community. Paper 1, while practical and well-executed, is more application-focused (PHR+LLM evaluation) with narrower scope and incremental contribution to health AI evaluation methodology.

vs. OCCAM: Open-set Causal Concept explAnation and Ontology induction for black-box vision Models

claude-opus-4.6 · 5/20/2026

Paper 1 addresses the highly timely intersection of LLMs and personalized healthcare, with a large-scale evaluation (2,257 queries, 1,945 PHRs) and both automated and clinician ratings. It introduces a novel evaluation framework for PHR-specific error modes, has direct real-world healthcare applications affecting millions of patients, and tackles the critical challenge of making complex health records actionable. Paper 2 contributes meaningfully to XAI with causal concept explanations, but explainability methods face a more crowded field. Paper 1's broader societal impact potential and timeliness in the LLM-health space give it the edge.