What Does the AI Doctor Value? Auditing Pluralism in the Clinical Ethics of Language Models

Payal Chandak, Victoria Alkin, David Wu, Maya Dagan, Taposh Dutta Roy, Maria Clara Saad Menezes, Ayush Noori, Nirali Somia

May 18, 2026

arXiv:2605.18738v1

cs.AI (primary)

#309 of 2292 · Artificial Intelligence

Tournament Score: 1499 ± 46 (scale shown from 1050 to 1800)

Win Rate: 83%

Rating: 7.5 / 10

Significance 8 · Rigor 7.5 · Novelty 7.5 · Clarity 8.5

Abstract

Medicine is inherently pluralistic. Principles such as autonomy, beneficence, nonmaleficence, and justice routinely conflict, and such ethical dilemmas often sharply divide reasonable physicians. Good clinical practice navigates these tensions in concert with each patient's values rather than imposing a single ethical stance. The ethical values that large language models bring to medical advice, however, have not been systematically examined. We present a framework for auditing value pluralism in medical AI, comprising a benchmark of clinician-verified dilemmas and an attribution method that recovers value priorities directly from decisions. The ecosystem of frontier models spans physician-level value heterogeneity, and models discuss competing values in their reasoning (Overton pluralism) before committing to a decision. However, individual model decisions are near-deterministic across repeated sampling and semantic variations, failing to reproduce the distributional pluralism of the physician panel. Across benchmark cases, these consistent decisions reflect committed, systematic value preferences. While most model priorities fall within the natural range of inter-physician variation, some significantly underweight patient autonomy. A single LLM deployed without regard for its value priorities could amplify those priorities at scale to every patient it serves. Without explicit efforts to balance ethical perspectives with one or multiple models, these tools risk replacing clinical pluralism with a deployment monoculture.

AI Impact Assessments

(1 model)

Scientific Impact Assessment

Core Contribution

This paper introduces a framework for auditing value pluralism in medical AI, consisting of two main components: (1) a benchmark of 50 clinician-verified clinical ethical dilemmas with structured value annotations grounded in principlism (autonomy, beneficence, nonmaleficence, justice), and (2) a revealed-preference attribution method that infers value priority distributions from binary clinical decisions rather than stated preferences. The central insight is that while individual LLMs discuss competing ethical values in their reasoning (Overton pluralism), they are near-deterministic decision-makers whose consistent choices reflect systematic value commitments — and some models significantly underweight patient autonomy relative to physician consensus. The concept of "deployment monoculture" — where a single LLM's fixed ethical stance replaces the natural diversity patients encounter across different physicians — is both novel and practically important.

Methodological Rigor

The methodology is generally sound and thoughtfully designed. Several elements stand out:

Benchmark construction follows a rigorous multi-stage pipeline with structural constraints (C1–C4) that guarantee genuine ethical trade-offs. The blinded two-stage physician review process and the formalization of value-difference vectors (Δᵢ) are well-motivated. The benchmark satisfies all three pluralistic benchmark desiderata of Sorensen et al.

Value attribution via logistic regression on value-difference vectors is elegant in its simplicity. The softmax temperature calibration using 500 synthetic Dirichlet agents (mean reconstruction JSD = 0.0086) provides convincing validation. The likelihood ratio test for non-uniform priorities is appropriate.
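To make the attribution step concrete, here is a minimal sketch of how such a revealed-preference fit might look. It is not the authors' implementation. It assumes each case is encoded as a value-difference vector with entries in {-1, 0, +1} (option A promotes, is neutral on, or sacrifices a value relative to option B), that decisions follow a no-intercept logistic model, and that a softmax with a fixed temperature of 1.0 turns coefficients into a priority distribution; the paper calibrates that temperature, which this sketch does not. The names `fit_weights`, `softmax`, `jsd`, the L2 penalty, and the synthetic agent are all hypothetical.

```python
import math
import random

VALUES = ["autonomy", "beneficence", "nonmaleficence", "justice"]

def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def fit_weights(deltas, choices, lr=0.5, steps=500, l2=0.01):
    # Gradient ascent on the L2-penalized mean log-likelihood of
    # P(choose option A) = sigmoid(w . delta); no intercept (assumption).
    w = [0.0] * len(deltas[0])
    n = len(deltas)
    for _ in range(steps):
        grad = [-l2 * wj for wj in w]
        for d, y in zip(deltas, choices):
            err = y - sigmoid(sum(wj * dj for wj, dj in zip(w, d)))
            for j, dj in enumerate(d):
                grad[j] += err * dj / n
        w = [wj + lr * gj for wj, gj in zip(w, grad)]
    return w

def softmax(w, temperature=1.0):
    # Map coefficients to a priority distribution that sums to 1.
    m = max(w)
    exps = [math.exp((wj - m) / temperature) for wj in w]
    total = sum(exps)
    return [e / total for e in exps]

def jsd(p, q):
    # Jensen-Shannon divergence in nats (the log base is an assumption).
    mid = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    def kl(a, b):
        return sum(ai * math.log(ai / bi) for ai, bi in zip(a, b) if ai > 0)
    return 0.5 * kl(p, mid) + 0.5 * kl(q, mid)

# Synthetic agent that mostly follows autonomy (hypothetical weights).
rng = random.Random(0)
true_w = [4.0, 1.0, 1.0, 1.0]
deltas, choices = [], []
while len(deltas) < 200:
    d = [rng.choice([-1, 0, 1]) for _ in VALUES]
    score = sum(wj * dj for wj, dj in zip(true_w, d))
    if score != 0:  # skip ties so every synthetic case has a clear choice
        deltas.append(d)
        choices.append(1 if score > 0 else 0)

priorities = softmax(fit_weights(deltas, choices))
for name, p in zip(VALUES, priorities):
    print(f"{name}: {p:.2f}")
```

In a calibration like the one described, `jsd` would compare the recovered distribution with each synthetic agent's known priorities, and the temperature would be chosen to minimize the mean divergence.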

Decision consistency analysis is thorough, including 10 repeated queries per case at temperature 1.0, Spearman correlations against physician disagreement, and graded paraphrase experiments showing a dose-response pattern (negligible flip rate for surface rewording, sharp increase for value reversal). This effectively rules out the hypothesis that consistency is a phrasing artifact.
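The consistency measures named here are simple to compute. The sketch below shows one plausible way to do it, assuming decisions are recorded as "A"/"B" labels. The function names and the toy data are hypothetical, and the paper's exact definitions may differ.

```python
from collections import Counter

def majority_agreement(samples):
    # Fraction of repeated samples that match the most common decision.
    counts = Counter(samples)
    return counts.most_common(1)[0][1] / len(samples)

def flip_rate(original, paraphrased):
    # Fraction of cases whose decision changes after the prompt is paraphrased.
    flips = sum(1 for o, p in zip(original, paraphrased) if o != p)
    return flips / len(original)

def rank(xs):
    # 1-based ranks; tied values share the mean of their positions.
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    ranks = [0.0] * len(xs)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and xs[order[j + 1]] == xs[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman(xs, ys):
    # Spearman correlation: Pearson correlation of the two rank vectors.
    rx, ry = rank(xs), rank(ys)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    return cov / (vx * vy) ** 0.5

# Toy data for illustration only: 10 repeated decisions for each of 3 cases.
repeats = {
    "case_1": ["A"] * 10,
    "case_2": ["A"] * 9 + ["B"],
    "case_3": ["B"] * 10,
}
agreement = {case: majority_agreement(s) for case, s in repeats.items()}

# Hypothetical share of physicians choosing the minority option per case.
physician_split = [0.1, 0.5, 0.2]
model_inconsistency = [1 - a for a in agreement.values()]
print(agreement)
print(spearman(model_inconsistency, physician_split))
```

Computing `flip_rate` separately for each paraphrase level (surface rewording through value reversal) would give the dose-response curve the review describes.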

Calibration analysis uses bootstrap leave-one-out physician consensus with 10,000 iterations, producing a well-grounded reference distribution for interpreting model divergence.
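One plausible reading of this procedure is sketched below; the review does not spell out the details, so treat every step as an assumption. Each iteration holds out one physician, bootstrap-resamples the rest, averages them into a consensus, and records the held-out physician's Jensen-Shannon divergence from it. The function names, the synthetic Dirichlet panel, and the example model priorities are hypothetical.

```python
import math
import random

def jsd(p, q):
    # Jensen-Shannon divergence in nats (the log base is an assumption).
    mid = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    def kl(a, b):
        return sum(ai * math.log(ai / bi) for ai, bi in zip(a, b) if ai > 0)
    return 0.5 * kl(p, mid) + 0.5 * kl(q, mid)

def mean_vector(vectors):
    # Component-wise mean of a list of priority distributions.
    n = len(vectors)
    return [sum(v[j] for v in vectors) / n for j in range(len(vectors[0]))]

def dirichlet(rng, alphas):
    # Sample a Dirichlet vector by normalizing independent Gamma draws.
    draws = [rng.gammavariate(a, 1.0) for a in alphas]
    total = sum(draws)
    return [g / total for g in draws]

def loo_bootstrap_reference(panel, iterations=10_000, seed=0):
    # Sorted divergences of held-out physicians from a resampled consensus.
    rng = random.Random(seed)
    divergences = []
    for _ in range(iterations):
        held_out = rng.randrange(len(panel))
        others = [p for i, p in enumerate(panel) if i != held_out]
        resample = [rng.choice(others) for _ in others]
        divergences.append(jsd(panel[held_out], mean_vector(resample)))
    return sorted(divergences)

def fraction_at_or_below(value, reference):
    # Share of reference divergences that do not exceed the given value.
    return sum(1 for r in reference if r <= value) / len(reference)

# Synthetic 20-physician panel (illustrative, not the paper's data).
rng = random.Random(1)
panel = [dirichlet(rng, [2.0, 2.0, 2.0, 2.0]) for _ in range(20)]
reference = loo_bootstrap_reference(panel, iterations=2000)

model = [0.10, 0.40, 0.30, 0.20]  # hypothetical model priorities
model_div = jsd(model, mean_vector(panel))
print(fraction_at_or_below(model_div, reference))
```

A model whose divergence sits in the upper tail of `reference` would count as outside the range of inter-physician variation under this reading.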

However, several limitations warrant attention. The physician panel (N=20) is small and drawn from academic North American centers — a convenience sample that limits generalizability. The forced binary choice design, while enabling clean measurement, compresses nuanced clinical reasoning. The linear separability assumption in the logistic regression may miss non-linear value interactions. The benchmark's 50 cases, while diverse across clinical domains, provide limited statistical power for individual-level attribution (10 of 20 physicians fail to reject uniform weighting, which may reflect sample size rather than genuinely uniform priorities).

Potential Impact

Clinical AI deployment: The finding that some frontier models (GPT 5.2, Grok 4, Sonar Pro) significantly underweight autonomy — allocating 6–13% versus the physician consensus's 44% — has direct implications for patient-facing AI systems. This provides concrete, actionable evidence for regulators and healthcare systems considering LLM deployment.

AI alignment research: The paper operationalizes pluralistic alignment in a domain where ground truth is genuinely contested, contributing methodology that could extend to legal, policy, and other value-laden domains. The distinction between Overton, distributional, and steerable pluralism applied to a real-world domain is valuable.

Regulatory and economic implications: The connection drawn between physician values driving regional spending variation and the potential for algorithmic value encoding to similarly shift healthcare economics is provocative and policy-relevant.

Benchmark contribution: The benchmark itself, if released as planned, fills a gap in medical AI evaluation by moving beyond accuracy and safety to ethical alignment.

Timeliness & Relevance

This paper is exceptionally timely. LLMs are already being consulted by patients for medical advice, often without physician mediation. Existing medical AI benchmarks focus overwhelmingly on diagnostic accuracy and safety, leaving ethical alignment as a significant blind spot. The paper arrives at a moment when regulatory frameworks for medical AI are being actively debated, and the deployment monoculture concern resonates with broader discussions about foundation model homogenization.

Strengths

1. Conceptual clarity: The paper precisely defines multiple forms of pluralism and systematically tests each, yielding a nuanced picture (models achieve Overton but not distributional pluralism).

2. Revealed-preference approach: Inferring values from decisions rather than self-reports addresses the well-documented value-action gap in both humans and LLMs.

3. Calibration against human baseline: Rather than treating physician consensus as ground truth, the paper treats inter-physician variation as the reference distribution, appropriately reflecting genuine ethical pluralism.

4. Comprehensive model coverage: Testing 12 frontier models from diverse providers (OpenAI, Google, Anthropic, DeepSeek, Meta, etc.) gives ecological validity to ecosystem-level claims.

5. The deployment monoculture framing: This reframes the problem from "are models ethical?" to "does deployment preserve the value diversity patients need?" — a more sophisticated and actionable question.

Limitations

1. Western-centric bioethics: The benchmark is grounded in English-language principlism, which is not universally adopted across medical traditions. Cross-cultural validity is acknowledged but unexplored.

2. Temporal snapshot: Model versions change rapidly; value profiles may shift with updates, limiting the shelf life of specific findings.

3. No steerability experiments: The paper identifies steerable pluralism as "the most important" mode but defers testing it entirely to future work.

4. Ecological validity: Real clinical decisions involve iterative dialogue, patient-specific context, and multi-stakeholder input — the forced binary paradigm captures only a slice of ethical reasoning.

5. Attribution method limitations: The method recovers aggregate priorities across the benchmark but cannot assess whether models appropriately modulate values case-by-case (e.g., weighting autonomy more when patient capacity is high).

6. No causal analysis: The paper acknowledges that whether models have internal representations of principlist values that causally drive decisions remains open. The attribution is correlational.

Overall Assessment

This is a well-executed study that makes a meaningful contribution at the intersection of AI alignment, medical ethics, and responsible AI deployment. The framework is reproducible, the findings are specific and actionable, and the deployment monoculture concept offers a useful lens for policy discussions. The work would benefit from larger physician panels, cross-cultural replication, and steerability experiments, but as a foundational contribution it establishes important methodology and raises the right questions at the right time.

Rating: 7.5 / 10

Significance 8 · Rigor 7.5 · Novelty 7.5 · Clarity 8.5

Generated May 19, 2026

Comparison History (24)

vs. Library Drift: Diagnosing and Fixing a Silent Failure Mode in Self-Evolving LLM Skill Libraries

claude-opus-4.6 · 5/20/2026

Paper 2 addresses a critical and timely concern, the ethical value alignment of LLMs in healthcare, with broad interdisciplinary impact spanning AI safety, medical ethics, and policy. Its framework for auditing value pluralism is novel and applicable across domains. The finding that models exhibit near-deterministic ethical stances, and so risk a 'deployment monoculture', has immediate policy implications as medical AI scales. Paper 1, while technically solid, addresses a narrower engineering problem (skill library management) relevant primarily to the LLM agent research community, limiting its breadth of impact.

vs. Entropy-Gradient Inversion: Moving Toward Internal Mechanism of Large Reasoning Models

gemini-3.1 · 5/19/2026

Paper 2 offers a foundational technical advancement in understanding and optimizing the internal reasoning mechanisms of large models. By addressing the instability of reinforcement learning without costly external verifiers, its methodology has broad, cross-disciplinary implications for the future development of all AI systems. While Paper 1 provides a crucial audit of AI in medical ethics, Paper 2's fundamental breakthrough in AI optimization is likely to drive widespread structural improvements across the entire field of artificial intelligence.

vs. New Insight of Variance reduce in Zero-Order Hard-Thresholding: Mitigating Gradient Error and Expansivity Contradictions

claude-opus-4.6 · 5/19/2026

Paper 1 addresses the critical and timely issue of ethical value alignment in medical AI, a topic with broad societal implications as LLMs are increasingly deployed in healthcare. It introduces a novel auditing framework for value pluralism, bridging AI safety, medical ethics, and clinical practice. Its findings—that individual LLMs exhibit near-deterministic ethical stances potentially underweighting patient autonomy—have direct policy and deployment implications at scale. Paper 2, while technically sound, offers an incremental optimization improvement in a narrower subfield with limited broader impact beyond sparse optimization and adversarial attacks.

vs. Multi-Paradigm Agent Interaction in Practice: A Systematic Analysis of Generator-Evaluator, ReAct Loop, and Adversarial Evaluation in the buddyMe Framework

gpt-5.2 · 5/19/2026

Paper 2 has higher potential impact: it introduces a novel, generalizable auditing framework for value pluralism in clinical LLMs with clinician-verified dilemmas and an attribution method, addressing a timely, high-stakes deployment risk (ethical monoculture) with clear real-world implications for medical AI governance and regulation. Its findings are broadly relevant across AI alignment, healthcare, ethics, and policy. Paper 1 is useful engineering work and comparative analysis of agent paradigms, but appears more incremental, tied to a specific framework and limited case-study evidence, with narrower cross-field impact.

vs. ChemVA: Advancing Large Language Models on Chemical Reaction Diagrams Understanding

gpt-5.2 · 5/19/2026

Paper 1 likely has higher scientific impact: it introduces a clinician-verified benchmark plus an attribution method to audit value pluralism in medical AI, addressing a timely, high-stakes deployment risk (ethical monoculture) with broad relevance across medicine, AI safety, and policy. Its findings (near-deterministic model choices, underweighting autonomy) have direct implications for real-world clinical use and governance. Paper 2 is technically strong and useful for cheminformatics, but its impact is more domain-specific and may be superseded as multimodal LLM vision/chemistry tooling rapidly evolves.

vs. Back to the Beginning of Heuristic Design: Bridging Code and Knowledge with LLMs

gpt-5.2 · 5/19/2026

Paper 2 likely has higher impact due to strong real-world stakes and timeliness: it audits ethical value pluralism in medical LLM advice, introduces a clinician-verified benchmark plus an attribution method, and yields actionable findings (near-deterministic model values, potential autonomy underweighting, deployment monoculture risk). This directly affects clinical governance, regulation, and deployment across healthcare systems, with broad relevance to AI ethics, evaluation, and human-AI decision-making. Paper 1 is novel and useful for optimization/AutoML, but its applications are more specialized and its societal leverage is typically lower.

vs. Visualizing the Invisible: Generative Visual Grounding Empowers Universal EEG Understanding in MLLMs

gemini-3.1 · 5/19/2026

Paper 2 addresses a critical and immediate challenge in AI deployment: clinical ethics and value pluralism. Its impact extends beyond technical AI communities into healthcare policy, medical ethics, and AI safety. While Paper 1 presents a highly innovative technical approach for EEG understanding, Paper 2's focus on auditing and preventing ethical monoculture in AI doctors has broader, more urgent real-world implications for global healthcare systems.

vs. RAG-based EEG-to-Text Translation Using Deep Learning and LLMs

claude-opus-4.6 · 5/19/2026

Paper 2 addresses a critical and timely issue—ethical value alignment in medical AI—with broader implications for AI safety, healthcare policy, and responsible deployment of LLMs. Its novel framework for auditing value pluralism is methodologically rigorous and applicable across many domains. Paper 1, while technically sound, shows modest improvements (cosine similarity of 0.181 vs 0.139) on a narrow EEG-to-text task and represents incremental progress in a niche area. Paper 2's findings about deployment monoculture risk have immediate policy relevance as medical AI scales rapidly.

vs. Agentic Chunking and Bayesian De-chunking of AI Generated Fuzzy Cognitive Maps: A Model of the Thucydides Trap

gpt-5.2 · 5/19/2026

Paper 1 has higher likely scientific impact: it introduces a clear, clinician-verified benchmark and an attribution method to audit value pluralism in medical LLM advice, addressing a timely, high-stakes deployment risk (ethical monoculture) with direct real-world implications for healthcare governance and regulation. Its framing is broadly relevant across AI safety, medical ethics, and health policy, and the methodology appears more rigorously grounded (expert-validated dilemmas, systematic measurement). Paper 2 is interesting but more niche, with less established validation and narrower immediate applicability.

vs. DocOS: Towards Proactive Document-Guided Actions in GUI Agents

gemini-3.1 · 5/19/2026

Paper 2 offers higher scientific impact due to its critical interdisciplinary focus on medical AI ethics. While Paper 1 presents a useful benchmark for GUI agents, Paper 2 addresses a high-stakes, real-world problem: the ethical alignment and potential value monoculture of LLMs in healthcare. Its novel framework for auditing clinical value pluralism bridges AI safety, bioethics, and clinical medicine. This breadth of impact, combined with the urgent timeliness of safely deploying medical AI, gives Paper 2 a significantly higher potential to influence policy, clinical guidelines, and future AI alignment research.

vs. Self-supervised Hierarchical Visual Reasoning with World Model

claude-opus-4.6 · 5/19/2026

Paper 1 addresses a critical and timely gap at the intersection of AI safety, medical ethics, and LLM deployment—areas of enormous societal concern. It introduces a novel auditing framework for value pluralism in medical AI, with broad implications for AI governance, healthcare policy, and responsible deployment. The finding that LLMs may systematically underweight patient autonomy has immediate regulatory and clinical relevance. Paper 2, while technically solid in advancing hierarchical world models for RL, addresses a narrower technical problem with less cross-disciplinary impact and societal significance.

vs. Body-Grounded Perspective Formation and Conative Attunement in Artificial Agents

claude-opus-4.6 · 5/19/2026

Paper 1 addresses a timely, high-impact problem—ethical value alignment in medical LLMs—with a rigorous empirical framework (clinician-verified benchmarks, attribution methods, systematic auditing of frontier models). It has broad real-world implications for AI deployment in healthcare, a field affecting millions. Its findings about deterministic value commitments and autonomy underweighting are actionable for policy and AI safety. Paper 2 presents an interesting theoretical architecture for artificial subjectivity but operates in a narrow, speculative domain (reward-free gridworld, phenomenological framing) with limited immediate applicability and a smaller audience.

vs. A Global-Local Graph Attention Network for Traffic Forecasting

gemini-3.1 · 5/19/2026

Paper 1 addresses a highly critical, timely, and broadly impactful issue—the ethical alignment and value pluralism of AI in healthcare. It introduces a novel framework for auditing medical AI, which has profound implications for AI safety, medical ethics, and real-world clinical deployment. In contrast, Paper 2 presents an incremental architectural improvement in traffic forecasting, a heavily saturated subfield of machine learning, limiting its comparative novelty and cross-disciplinary impact.

vs. Brain Vascular Age Prediction Using Cerebral Blood Flow Velocity and Machine Learning Algorithms

claude-opus-4.6 · 5/19/2026

Paper 1 addresses the critical and timely issue of ethical value pluralism in medical AI, proposing a novel auditing framework for LLMs in clinical settings. Its findings about deployment monoculture and underweighting of patient autonomy have broad implications for AI safety, regulation, and healthcare policy as LLMs are rapidly deployed. Paper 2, while useful, applies standard ML regression to a relatively niche clinical application (TCD-based vascular age prediction) with a small dataset and incremental methodological contribution. Paper 1's novelty, timeliness given widespread LLM adoption, and cross-disciplinary relevance (AI ethics, medicine, policy) give it substantially higher impact potential.

vs. TRACE: Trajectory Correction from Cross-layer Evidence for Hallucination Reduction

claude-opus-4.6 · 5/19/2026

TRACE presents a novel, training-free algorithm addressing the fundamental problem of LLM hallucinations with strong empirical results across 15 models and 8 families. Its universal applicability, requiring no labels, retrieval, or fine-tuning, gives it broad practical impact across all LLM applications. The insight about non-uniform cross-layer truthfulness and the adaptive correction approach represents significant methodological innovation. While Paper 1 raises important ethical questions about value pluralism in medical AI, it is primarily an auditing/evaluation contribution with narrower scope, whereas Paper 2 offers a concrete, widely applicable technical solution to a pervasive problem.

vs. HyperPersona: A Multi-Level Hypergraph Framework for Text-Based Automatic Personality Prediction

claude-opus-4.6 · 5/19/2026

Paper 2 addresses a critical and timely issue, the ethical value alignment of LLMs in healthcare, with broad implications for AI safety, medical ethics, and policy. Its novel auditing framework for value pluralism in medical AI fills a significant gap as LLMs are rapidly deployed in clinical settings. The finding that models exhibit near-deterministic ethical stances points to a risk of 'deployment monoculture,' which has immediate real-world consequences at scale. Paper 1, while methodologically sound, represents an incremental advance in personality prediction using hypergraphs, with narrower application scope and less urgent societal relevance.

vs. Allegory of the Cave: Measurement-Grounded Vision-Language Learning

gemini-3.1 · 5/19/2026

Paper 1 addresses a critical societal risk: the imposition of monolithic ethical values by AI in healthcare. Its interdisciplinary approach, spanning AI safety, clinical ethics, and policy, gives it immense real-world importance and broader scientific impact compared to Paper 2, which offers a valuable but more narrowly focused technical improvement for vision-language model architectures.

vs. What Do EEG Foundation Models Capture from Human Brain Signals?

gpt-5.2 · 5/19/2026

Paper 2 has higher likely scientific impact due to its timeliness and broad cross-field relevance (AI, medicine, ethics, policy). It introduces an actionable framework (clinician-verified dilemmas + attribution of value priorities) with direct implications for real-world deployment, regulation, and patient safety, and highlights a scalable risk (“deployment monoculture”). Paper 1 is methodologically strong and valuable for interpretability in EEG foundation models, but its impact is more domain-specific (clinical EEG/neuroAI) and less immediately tied to governance and societal-scale outcomes.

vs. Optimal Experiments for Partial Causal Effect Identification

claude-opus-4.6 · 5/19/2026

Paper 1 addresses a fundamental problem in causal inference—optimal experiment selection for partial identification—with strong theoretical contributions (NP-hardness proof, graphical pruning criteria, complexity reductions) and practical methodology. It advances core causal inference machinery applicable across many scientific domains. Paper 2 presents an important but more niche audit of LLM ethical pluralism in medicine. While timely, it is primarily descriptive and domain-specific. Paper 1's methodological contributions have broader applicability, stronger technical depth, and are likely to influence ongoing research in causal inference and experimental design more substantially.

vs. From Feasible to Practical: Pareto-Optimal Synthesis Planning

claude-opus-4.6 · 5/19/2026

Paper 2 addresses a critically timely and broadly impactful issue—ethical value pluralism in medical AI—relevant across AI safety, healthcare policy, bioethics, and responsible AI deployment. Its framework for auditing LLM value priorities has immediate policy implications as medical AI is rapidly being deployed. Paper 1 makes a solid methodological contribution to synthesis planning with multi-objective optimization, but targets a narrower computational chemistry audience. Paper 2's interdisciplinary relevance, societal implications, and timeliness given widespread LLM adoption give it higher potential impact.