Evaluating the Utility of Personal Health Records in Personalized Health AI
Rory Sayres, Kejia Chen, Ayush Jain, Matthew Thompson, Jonathan Richina, Xiang Yin, Jimmy Hu, Fan Zhang
Abstract
Patient-managed Personal Health Records (PHRs) promise to empower patients to better understand their health, but information in the record is complex, potentially hindering insights. In this study, we assess the potential of large language models (LLMs, Gemini 3.0 Flash) to provide helpful answers to user health queries when provided clinical data from PHRs as context. A total of 2,257 user queries were drawn from 3 different distributions to represent patient questions: shorter web search queries, longer questions derived from templates of chatbot conversations, and questions patients asked their healthcare team (patient calls). Queries were matched with de-identified PHRs (from a pool of 1,945). Gemini responses were generated (1) without PHR context; (2) with a basic summary of demographics, conditions, and medications; and (3) with full, extensive clinical notes. For evaluation, we leveraged an existing rating framework (SHARP) and developed a new framework for specific error modes when interpreting PHRs. Evaluation was performed using autoraters for the full set and clinician ratings for a subset (n=95), with both sets of raters knowing the full PHR context. We see significant improvements in the helpfulness of answers to all question types with PHR data (p < 0.001, paired t-test). We also observe potential gains in safety, accuracy, relevance, and personalization of answers. Our PHR evaluation framework further identifies gaps in LLM understanding of particular aspects of complex PHRs, such as temporal disorientation, and rare but meaningful confabulations. These results suggest potential for PHR data to help people with a wide range of user needs, and provide a framework for monitoring for gaps in LLM answers based on PHR context. This study motivates further work to assess and realize potential benefits to users from understanding their health records.
AI Impact Assessments
Scientific Impact Assessment
Core Contribution
This paper addresses a practical and timely question: does providing Personal Health Records (PHRs) as context to LLMs improve the quality of health-related responses? The study pairs 2,257 user queries (from three distinct distributions—web search, chatbot templates, and patient calls) with 1,945 de-identified PHRs and evaluates Gemini 3.0 Flash responses under three conditions: no PHR, basic PHR (demographics, conditions, medications), and full PHR (complete clinical notes). The study makes two primary contributions: (1) empirical evidence that PHR context significantly improves helpfulness, actionability, and motivation of LLM responses, and (2) a novel 16-axis PHR-specific evaluation framework covering fidelity, robustness, utility, safety, and equity dimensions.
The core finding, that PHR context improves helpfulness from 0.13 to ~0.5 on clinician ratings ([-1,1] scale), is intuitive but important to quantify rigorously. The identification of specific failure modes (temporal disorientation in 3-6% of responses, groundedness errors including confabulation, and social determinant blind spots) provides actionable insights for the field.
Methodological Rigor
Strengths: The experimental design is well-structured with three clearly defined context conditions enabling controlled comparisons. The use of three distinct query distributions provides ecological validity across different user interaction modalities. The paired design (same query-PHR pair across conditions) strengthens statistical power for within-pair comparisons. The PHR dataset is substantial (1,945 records, 77,644 encounters, 9 health systems) with detailed characterization.
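The paired design mentioned above can be illustrated with a minimal sketch of a paired t-statistic, which tests the per-pair differences rather than the two groups separately. The scores below are made-up values on a [-1, 1] scale, not data from the study, and the function name `paired_t_statistic` is hypothetical.

```python
import math
import statistics


def paired_t_statistic(baseline, treated):
    """Return (t, degrees_of_freedom) for a paired t-test on matched scores."""
    if len(baseline) != len(treated):
        raise ValueError("paired samples must have equal length")
    # Each difference compares the same query-PHR pair under two conditions.
    diffs = [t - b for b, t in zip(baseline, treated)]
    n = len(diffs)
    if n < 2:
        raise ValueError("need at least two pairs")
    mean_diff = statistics.mean(diffs)
    sd_diff = statistics.stdev(diffs)  # sample standard deviation (n - 1)
    if sd_diff == 0:
        raise ValueError("all differences are identical; t is undefined")
    t = mean_diff / (sd_diff / math.sqrt(n))
    return t, n - 1


# Illustrative, invented helpfulness scores (NOT results from the paper).
no_phr = [0.0, 0.5, -0.5, 0.0, 0.5, 0.0]
full_phr = [0.5, 0.5, 0.0, 0.5, 1.0, 0.5]

t_stat, dof = paired_t_statistic(no_phr, full_phr)
print(f"t = {t_stat:.2f} with {dof} degrees of freedom")
```

Because between-pair variability cancels out in the differences, the standard error depends only on how consistently the second condition shifts each pair, which is why pairing tends to increase power when the same items are scored under every condition.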
Weaknesses: The clinician evaluation subset is notably small (n=65 in the abstract, n=95 in the methods—an inconsistency that itself raises concerns). This limits statistical power for detecting differences on safety and accuracy axes, where clinician ratings showed no significant effects despite autorater suggestions of improvements. The heavy reliance on autoraters (which are themselves LLM-based) for the primary dataset creates circularity concerns—using Gemini-family models to evaluate Gemini-generated responses. While clinician-autorater alignment is discussed, the agreement statistics are not formally presented.
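As one example of an agreement statistic that could be reported for clinician and autorater labels, the sketch below computes Cohen's kappa, which corrects raw agreement for agreement expected by chance. This is an illustration only: the paper's own agreement analysis is not described here, the labels are invented, and `cohens_kappa` is a hypothetical helper name.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters assigning categorical labels to the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("raters must label the same non-empty set of items")
    n = len(rater_a)
    # Observed agreement: fraction of items where both raters chose the same label.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: product of each rater's marginal label frequencies.
    counts_a = Counter(rater_a)
    counts_b = Counter(rater_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1:
        raise ValueError("kappa is undefined when chance agreement is 1")
    return (observed - expected) / (1 - expected)


# Invented labels for eight responses (NOT data from the paper).
clinician = ["safe", "safe", "unsafe", "safe", "unsafe", "safe", "safe", "unsafe"]
autorater = ["safe", "safe", "unsafe", "safe", "safe", "safe", "unsafe", "unsafe"]

kappa = cohens_kappa(clinician, autorater)
print(f"kappa = {kappa:.3f}")
```

In this toy case the raters agree on 6 of 8 items, but kappa is noticeably lower than 0.75 because much of that agreement would be expected by chance given how often both raters choose "safe".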
The query-PHR pairing methodology introduces selection bias. For search queries, an LLM-based plausibility scorer selects pairs, meaning the evaluation is conditioned on the model's own judgment of relevance. The chatbot templates are synthetically generated, not real user queries. Only the patient calls represent authentic user questions paired with their actual records.
The population skews older, more female, more White, and more medically complex than the general population, which the authors acknowledge. The US-only PHR sourcing limits generalizability.
Potential Impact
The practical implications are significant given the rapid adoption of LLM-PHR integrations (OpenAI, Anthropic, and Google are all pursuing this). The PHR-specific evaluation framework (Table 5) is perhaps the most durable contribution: it provides a structured rubric that could become a standard for benchmarking PHR-aware health AI systems. The identification of temporal reasoning as the dominant failure mode (up to 5.74% of full PHR responses) gives model developers a concrete target for improvement.
The self-critique loop demonstration (Table 10), showing 98% reduction in temporal errors and 60% reduction in groundedness errors, suggests tractable mitigation strategies, though the authors correctly note this is a proof-of-concept rather than a deployable solution.
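To make the relative-reduction figures concrete, the sketch below converts a baseline error rate and a relative reduction into a residual rate. Using the 5.74% temporal error rate as the baseline for the 98% reduction is an assumption made for illustration; the review does not state which baseline Table 10 uses. The function name `residual_rate` is hypothetical.

```python
def residual_rate(baseline_rate, relative_reduction):
    """Error rate remaining after applying a relative reduction to a baseline rate."""
    if not 0 <= baseline_rate <= 1:
        raise ValueError("baseline_rate must be a proportion in [0, 1]")
    if not 0 <= relative_reduction <= 1:
        raise ValueError("relative_reduction must be a proportion in [0, 1]")
    return baseline_rate * (1 - relative_reduction)


# Assumed pairing of figures, for illustration only.
remaining = residual_rate(0.0574, 0.98)
print(f"residual temporal error rate: {remaining:.4%}")
```

A relative reduction says nothing about absolute frequency on its own, so the same 98% figure would imply very different residual rates under a different baseline.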
The work has relevance for health policy and patient engagement. If PHR-contextualized AI can genuinely improve health literacy and self-management, this could address documented barriers to PHR adoption (complexity, perceived low utility).
Timeliness & Relevance
This study is highly timely. Multiple commercial LLM providers launched PHR integration features in 2025-2026, yet rigorous evaluation of these capabilities was largely absent. The paper fills an important evidence gap at a moment when deployment is already occurring. The cited statistic, that 41% of LLM health users have uploaded personal medical information, underscores the urgency.
Strengths & Limitations
Key Strengths:
Notable Limitations:
Additional Observations:
The gap between autorater and clinician assessments is concerning and under-discussed. Autoraters consistently show larger effect sizes and detect significant differences where clinicians do not (particularly for safety and accuracy). This suggests autoraters may be systematically overestimating quality improvements, which has implications for the field's growing reliance on LLM-as-judge evaluation paradigms.
The study design evaluates responses in isolation rather than in conversational context, missing important dynamics of how users would actually interact with PHR-aware systems (e.g., follow-up questions, misunderstandings, over-trust).
Generated May 20, 2026
Comparison History (19)
Paper 2 is more novel and broadly impactful: it introduces a probing paradigm (implausible category members) to diagnose concept-boundary alignment between AI and humans, linking findings to downstream safety-relevant behavior. This is timely for AI alignment/evaluation and generalizes across models, domains, and applications. Paper 1 is valuable and application-forward in healthcare, but it is more incremental (LLM-with-context evaluation) and narrower in scope, with methodological dependence on autoraters and a limited clinician-rated subset, reducing general scientific breadth compared to Paper 2.
Paper 1 addresses the high-impact intersection of LLMs, personal health records, and personalized health AI—a timely topic with enormous real-world implications for patient empowerment and healthcare delivery. It introduces novel evaluation frameworks (SHARP adaptation, PHR-specific error modes), uses rigorous methodology with both automated and clinician evaluations, and demonstrates statistically significant improvements. Paper 2, while solid, addresses a more incremental benchmarking contribution in document parsing, a narrower technical domain with less transformative societal potential. The health AI application has broader cross-disciplinary relevance and greater timeliness given the rapid adoption of LLMs in healthcare.
Paper 2 addresses a high-stakes domain (healthcare) by rigorously evaluating LLM utility with Personal Health Records. Its extensive methodology, including clinician ratings and a novel evaluation framework, promises significant real-world applications in patient empowerment and personalized medicine. While Paper 1 offers valuable technical insights into LLM alignment, Paper 2's broader societal relevance, direct application to healthcare, and comprehensive evaluation give it higher potential for widespread scientific and practical impact.
Paper 2 introduces a foundational, domain-agnostic framework for LLM agent diagnostics, solving a critical scalability bottleneck in AI development. Its broad applicability across all fields of AI engineering gives it higher potential impact compared to Paper 1, which, while highly relevant to healthcare, is primarily an empirical evaluation of an existing model in a specific application domain.
Paper 1 introduces a fundamental architectural abstraction ('Formal Skill') for LLM agents, addressing critical challenges in token efficiency, reliability, and state management. Its open-source runtime framework has broad applicability across all domains utilizing AI agents. In contrast, Paper 2 is a domain-specific empirical evaluation of existing LLMs in healthcare. While Paper 2 has strong real-world relevance, Paper 1's foundational methodological innovations provide a higher potential for broad, cross-disciplinary scientific impact in the rapidly evolving field of autonomous AI.
Paper 2 addresses a fundamental question about how code and structured reasoning signals affect LLM capabilities during pretraining, providing mechanistic insights via expert-activation patterns. Its findings—that structured reasoning traces rather than executable code drive reasoning improvements—have broad implications for data-centric optimization of foundation models across the entire AI field. Paper 1, while valuable for health AI applications, is more applied and incremental, evaluating an existing LLM on PHR-augmented queries. Paper 2's insights will likely influence training data strategies for many future models, giving it broader and deeper scientific impact.
Paper 2 addresses a broader, more impactful problem—leveraging PHRs with LLMs to improve personalized healthcare—with clear real-world applications affecting millions of patients. It introduces a novel evaluation framework (SHARP + PHR-specific error modes), uses a large-scale study (2,257 queries, 1,945 PHRs), and demonstrates statistically significant results (p<0.001). Paper 1, while methodologically interesting, reports a negative result in a niche domain (offensive cybersecurity CTF agents) with non-significant findings (p=0.71) from a relatively small dataset (180 runs), limiting its broader impact despite its conceptual contribution about environment-feedback bandwidth.
Paper 1 addresses a fundamental cognitive bottleneck in AI—embodied spatial intelligence and second-order Theory of Mind. By proposing a novel mechanism to overcome the 'Cartesian Illusion' in MLLMs, it offers significant methodological innovation for Embodied AI. While Paper 2 provides highly valuable real-world evaluation of LLMs in healthcare, Paper 1 introduces a foundational algorithmic paradigm that solves a complex, theoretical reasoning gap. This fundamental contribution gives Paper 1 a broader potential scientific impact across robotics, multi-agent systems, and multimodal reasoning.
Paper 1 introduces a foundational architectural advancement in AI by integrating metacognition into multi-agent systems, solving a critical bottleneck in agent coordination and task delegation. Its theoretical innovation and broad applicability across any LLM-driven domain give it a higher potential for widespread scientific impact compared to Paper 2, which, while highly valuable for digital health, is primarily an empirical evaluation of existing LLMs in a specific application area.
Paper 1 addresses the high-impact intersection of LLMs and personalized healthcare, evaluating how PHR data improves AI-generated health answers using a large-scale study with clinician validation. It introduces novel evaluation frameworks (SHARP adaptation and PHR-specific error modes) applicable broadly to health AI. The domain—patient empowerment through AI—has enormous real-world implications and timeliness given LLM adoption in healthcare. Paper 2 makes a solid but incremental contribution to deepfake detection generalization with a 2.1% AUC improvement on one dataset, representing narrower impact and more limited novelty.
Paper 1 likely has higher scientific impact due to greater novelty and broader, field-general relevance: it introduces a new benchmark and rubric to evaluate an under-measured capability (multi-turn clarification for ill-posed scientific tasks), with clear methodological structure (ontology + multi-dimensional scoring) and actionable failure analyses. Benchmarks often become community standards, influencing model design and evaluation across domains beyond computational science. Paper 2 is timely and high-application, but is more incremental (single-model study, limited clinician subset) and more domain-specific, with impact constrained by deployment/regulatory and data-access barriers.
Paper 1 introduces a novel diagnostic benchmark with a rigorous forensic error taxonomy that reveals fundamental structural failure modes in LLM mathematical reasoning, including the discovery of a near-universal behavioral threshold and novel hallucination patterns. These findings have broad implications across AI safety, model architecture design, and understanding of LLM cognitive limitations. Paper 2, while practically useful, is more incremental—applying an existing LLM to PHR-based question answering with predictable improvements from added context. Paper 1's methodological innovation (forensic error pipeline) and theoretical contributions (working memory limits, fabrication transitions) offer deeper scientific insights with wider cross-field relevance.
Paper 2 addresses a more novel and impactful intersection of LLMs with personal health records, a domain with enormous real-world implications for patient empowerment and healthcare delivery. It introduces a new evaluation framework (PHR-specific error modes), uses substantial scale (2,257 queries, 1,945 PHRs), and identifies clinically meaningful failure modes like temporal disorientation. Paper 1, while achieving strong NL2SQL benchmark results, operates in a more crowded research space with incremental multi-agent improvements. Paper 2's broader societal impact potential and cross-disciplinary relevance (AI, clinical informatics, patient safety) give it the edge.
Paper 2 proposes a fundamental architectural innovation (PEEK) for LLM agents dealing with long-context environments. Its introduction of a context map as an orientation cache solves a core efficiency and reasoning bottleneck in modern AI systems, offering broad utility across multiple domains like software engineering and information retrieval. While Paper 1 is a valuable empirical evaluation of LLMs in healthcare, Paper 2's methodological advancements in core AI agent design are likely to have a wider, more foundational scientific impact and spur more downstream research.
Paper 2 is more novel methodologically, introducing a memory-augmented RL framework with dual-track memory and utility-based retrieval to address geometric feasibility and long-horizon error correction in CAD generation—an important, under-solved technical bottleneck. Its approach is broadly transferable to other tool-using agents with hard constraints (robotics, planning, program synthesis), increasing cross-field impact. Paper 1 is timely and application-relevant, but largely evaluates an existing LLM with added PHR context and proposes evaluation taxonomies; the technical innovation and generalizability are comparatively lower, and clinical impact depends on substantial downstream validation and deployment constraints.
Paper 2 addresses the integration of LLMs with Personal Health Records, a high-stakes domain with massive potential for real-world clinical and societal impact. By focusing on safety, personalization, and evaluating specific error modes in a medical context, it provides critical foundational work for the safe deployment of personalized health AI. This gives it a broader interdisciplinary impact and higher real-world application potential compared to the specific algorithmic improvements for text-to-image model alignment presented in Paper 1.
While Paper 2 presents a highly effective and deployed industrial application for digital advertising, Paper 1 addresses a critical challenge in medical AI: safely leveraging Personal Health Records using LLMs. Given the profound societal implications of personalized healthcare AI, along with the introduction of a novel evaluation framework identifying specific error modes (like temporal disorientation), Paper 1 offers broader scientific, ethical, and practical impact across multiple disciplines.
Paper 2 addresses a fundamental problem (mode collapse) in reinforcement learning for LLMs, proposing a principled and generalizable method (DMPO) with broad applicability across reasoning tasks, combinatorial optimization, and multiple modalities. Its theoretical contribution (forward KL approximation for on-policy RL) and demonstrated improvements across diverse benchmarks suggest wider impact across the ML/AI community. Paper 1, while practical and well-executed, is more application-focused (PHR+LLM evaluation) with narrower scope and incremental contribution to health AI evaluation methodology.
Paper 1 addresses the highly timely intersection of LLMs and personalized healthcare, with a large-scale evaluation (2,257 queries, 1,945 PHRs) and both automated and clinician ratings. It introduces a novel evaluation framework for PHR-specific error modes, has direct real-world healthcare applications affecting millions of patients, and tackles the critical challenge of making complex health records actionable. Paper 2 contributes meaningfully to XAI with causal concept explanations, but explainability methods face a more crowded field. Paper 1's broader societal impact potential and timeliness in the LLM-health space give it the edge.