Gender-Dependent Diagnostic Substitution in LLM Medical Triage: Same Symptoms, Unequal Urgency
Qi Han Wong
Abstract
We investigate whether large language models produce different medical triage recommendations for identical neurological symptoms when only the patient's stated gender and age vary. Using three model families (Gemini 3.5 Flash, Claude Sonnet 4.6, and GPT-5.4-mini), we present a standardized symptom profile (persistent headache, blurred vision, morning nausea, visual disturbances) across seven demographic conditions: three age groups (25, 38, 65) × two genders (male, female), plus a gender-unspecified baseline (n = 30 per condition per model, 630 total trials). We find a stark, systemic gender-dependent triage disparity: young women receive significantly lower emergency room (ER) referral rates than age-matched men (Gemini: 0% vs. 23.3%; Claude: 6.7% vs. 96.7%; GPT: 6.7% vs. 66.7%, all p < 0.001). The disparity disappears at age 65 for all models. The primary mechanism is diagnostic substitution: the models anchor on a gender-associated diagnosis, preferentially classifying young women with Idiopathic Intracranial Hypertension (IIH), a condition epidemiologically linked to women of childbearing age, while diagnosing men with generic increased intracranial pressure with space-occupying lesions in the differential. This diagnostic closure routes female patients to lower-urgency care (outpatient doctor appointments) despite comparable severity ratings (7-9/10). Our findings demonstrate that clinical LLMs replicate documented human clinical biases by using epidemiological priors to suppress triage urgency, suggesting that AI triage engines must decouple urgency assessment from probabilistic diagnostic priors. We release all code, prompts, and raw results.
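As a rough illustration of the design arithmetic in the abstract, the sketch below enumerates the trial grid: six age-by-gender cells plus one baseline, for three models and 30 replicates each. The dictionary keys and the `None` encoding of the baseline are assumptions made here for illustration; they are not taken from the released code, and the abstract does not say whether the baseline states an age.

```python
from itertools import product

MODELS = ["Gemini 3.5 Flash", "Claude Sonnet 4.6", "GPT-5.4-mini"]
AGES = [25, 38, 65]
GENDERS = ["male", "female"]
N_PER_CELL = 30  # trials per condition per model, as stated in the abstract

# Six age x gender cells plus one gender-unspecified baseline.
# Encoding the baseline as (None, None) is a hypothetical choice.
CONDITIONS = list(product(AGES, GENDERS)) + [(None, None)]


def build_trials():
    """Enumerate every (model, condition, replicate) combination."""
    return [
        {"model": model, "age": age, "gender": gender, "replicate": rep}
        for model in MODELS
        for age, gender in CONDITIONS
        for rep in range(N_PER_CELL)
    ]


trials = build_trials()
print(len(CONDITIONS), "conditions,", len(trials), "trials")
```

The product 7 × 3 × 30 matches the 630 total trials reported in the abstract.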
AI Impact Assessments
Scientific Impact Assessment (1 model)
1. Core Contribution
This paper identifies a specific mechanism—diagnostic substitution—by which LLMs produce gender-disparate triage recommendations for identical neurological symptoms. Rather than simply documenting that disparities exist (which prior work has done), the authors show the causal chain: gender/age → differential diagnosis anchoring (IIH for young women vs. space-occupying lesion for men) → different urgency assignments → different care pathways. This mechanistic insight is the paper's most valuable contribution, as it moves beyond "bias exists" to "here is precisely how bias operates," which is actionable for model developers.
The finding that young women receive dramatically lower ER referral rates (e.g., 6.7% vs. 96.7% in Claude at age 25) despite severity scores of 7-9/10 is striking and clinically concerning. The age-65 convergence elegantly confirms that the mechanism is epidemiological prior anchoring on IIH rather than a crude gender heuristic.
2. Methodological Rigor
Strengths: The experimental design follows established counterfactual patient variation methodology. Testing across three model families (Gemini, Claude, GPT) strengthens generalizability claims. The use of Fisher's exact tests with Bonferroni correction is appropriate for the sample sizes and multiple comparisons. The structured JSON output format ensures consistent, machine-parseable results. The diagnosis classification scheme (IIH vs. Generic ICP, mass mentioned) provides transparent operationalization.
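For readers who want to see what the statistical step involves, here is a minimal standard-library sketch of a two-sided Fisher's exact test followed by a Bonferroni adjustment. It is not the authors' analysis code. The counts in the usage lines (2 of 30 and 29 of 30) are inferred here from the reported Claude percentages under the assumption of n = 30 per cell, and the number of comparisons passed to `bonferroni` is likewise hypothetical.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2 = a + b, c + d
    col1 = a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of x events in row 1, margins fixed.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Sum all tables whose probability does not exceed the observed one.
    p = sum(prob(x) for x in range(lo, hi + 1) if prob(x) <= p_obs * (1 + 1e-9))
    return min(1.0, p)


def bonferroni(p_values):
    """Multiply each p-value by the number of tests, capped at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]


# Assumed counts: ER referrals / non-referrals for women vs. men at age 25.
p_claude = fisher_exact_two_sided(2, 28, 29, 1)
print(p_claude, bonferroni([p_claude, 0.5, 0.5]))
```

In practice a library routine such as `scipy.stats.fisher_exact` would normally be used; the hand-rolled version is shown only to make the calculation visible.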
Weaknesses: The study has several methodological limitations that constrain its impact:
3. Potential Impact
The practical implications are significant for the growing medical AI industry:
1. Design principle: The recommendation to decouple urgency from probabilistic diagnostic priors is immediately actionable. Triage systems could be architected to assess symptom-level urgency before diagnostic anchoring occurs.
2. Audit methodology: The paper demonstrates that action-level auditing alone is insufficient—diagnosis-level auditing is necessary to detect substitution biases that produce equivalent severity scores but divergent care pathways.
3. Regulatory relevance: As AI triage tools face increasing regulatory scrutiny (FDA, EU AI Act), this type of bias characterization work directly informs safety evaluation frameworks.
The finding could influence adjacent fields including clinical decision support, health equity research, and AI safety/fairness more broadly.
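The audit point in item 2 can be sketched as follows: summarizing severity alone would hide a disparity that becomes visible once diagnosis category and recommended action are tabulated per condition. The record fields (`condition`, `severity`, `diagnosis_category`, `action`) and the four toy records are invented for illustration and do not reproduce the paper's JSON schema or its data.

```python
from collections import defaultdict


def audit(records):
    """Per-condition mean severity, ER-referral rate, and diagnosis shares."""
    stats = defaultdict(lambda: {"n": 0, "sev": 0, "er": 0, "dx": defaultdict(int)})
    for r in records:
        s = stats[r["condition"]]
        s["n"] += 1
        s["sev"] += r["severity"]
        s["er"] += r["action"] == "ER"
        s["dx"][r["diagnosis_category"]] += 1
    return {
        cond: {
            "mean_severity": s["sev"] / s["n"],
            "er_rate": s["er"] / s["n"],
            "dx_share": {k: v / s["n"] for k, v in s["dx"].items()},
        }
        for cond, s in stats.items()
    }


# Toy records (made up): equal severity, different diagnosis and action.
records = [
    {"condition": "F25", "severity": 8, "diagnosis_category": "IIH", "action": "doctor_appointment"},
    {"condition": "F25", "severity": 8, "diagnosis_category": "IIH", "action": "doctor_appointment"},
    {"condition": "M25", "severity": 8, "diagnosis_category": "generic_ICP", "action": "ER"},
    {"condition": "M25", "severity": 8, "diagnosis_category": "generic_ICP", "action": "ER"},
]
report = audit(records)
for cond, row in report.items():
    print(cond, row)
```

On these toy records the severity column is identical across conditions, while the diagnosis shares and ER rates diverge, which is the pattern a severity-only audit would miss.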
4. Timeliness & Relevance
This paper is highly timely. LLM-powered triage tools are actively being deployed or piloted by major health systems and consumer health platforms. The specific concern—that epidemiologically valid priors can produce clinically inappropriate urgency reductions—is a nuanced safety issue that simplistic fairness metrics would miss. The paper addresses a genuine bottleneck in clinical AI deployment: how to audit for bias that is "clinically grounded" yet harmful.
The connection to documented human clinical biases (women's symptoms being psychologized, delayed cardiac workups) adds depth and suggests these biases are structurally encoded in training corpora rather than being artifacts of model architecture.
5. Strengths & Limitations
Key Strengths:
Notable Limitations:
Additional Observations
The paper is well-written, clearly structured, and appropriately cautious in its claims. The framing that models have learned "a clinically grounded epidemiological prior that happens to produce worse outcomes" is more precise and constructive than characterizing models as "biased" in a generic sense. However, the paper's scope is relatively narrow for the strength of its claims—the title implies a general phenomenon, but only one clinical scenario is tested. The work reads more as a compelling demonstration/proof-of-concept than a comprehensive characterization study.
The paper would benefit from discussing whether system prompts explicitly instructing urgency-first reasoning (before diagnosis) could mitigate the effect, as this would be the most immediately deployable intervention.
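One way to picture urgency-first ordering at the pipeline level, rather than in prompt wording, is sketched below: a demographics-blind pass sets an urgency floor, and the later diagnostic pass may raise but never lower it. Everything here is hypothetical, including the urgency scale, the field names, and the two callables standing in for model calls. Whether such an arrangement would actually reduce the reported disparity is untested.

```python
# Hypothetical ordered urgency scale, least to most urgent.
URGENCY_ORDER = ["self_care", "doctor_appointment", "ER"]
DEMOGRAPHIC_FIELDS = {"age", "gender"}


def triage(case, assess_urgency, assess_diagnosis):
    """Return the more urgent of a demographics-blind floor and a
    diagnosis-informed recommendation.

    assess_urgency and assess_diagnosis are caller-supplied callables
    (for example, wrappers around model calls) that return a label
    from URGENCY_ORDER.
    """
    blind_case = {k: v for k, v in case.items() if k not in DEMOGRAPHIC_FIELDS}
    floor = assess_urgency(blind_case)      # stage 1: symptoms only
    dx_urgency = assess_diagnosis(case)     # stage 2: full case
    return max(floor, dx_urgency, key=URGENCY_ORDER.index)


# Usage with stand-in stubs instead of real model calls.
case = {"age": 25, "gender": "female", "symptoms": ["persistent headache", "blurred vision"]}
result = triage(
    case,
    assess_urgency=lambda blind: "ER",
    assess_diagnosis=lambda full: "doctor_appointment",
)
print(result)
```

Taking the maximum means the diagnostic stage can only escalate, so a reassuring diagnosis cannot lower the urgency set by the symptom-only pass.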
Generated Jun 3, 2026
Comparison History (24)
Paper 2 proposes a fundamental architectural innovation (ProtoT) that addresses two major bottlenecks in modern LLMs: computational efficiency (linear cost) and interpretability. If successful, this architecture could be adopted across all fields utilizing AI, offering a massive breadth of impact. While Paper 1 provides a crucial empirical finding on algorithmic bias in healthcare, Paper 2's foundational contribution to model design gives it a higher potential for widespread, transformative scientific impact.
Paper 1 demonstrates a novel, well-quantified finding—systematic gender-dependent diagnostic substitution in LLM medical triage—with immediate real-world patient safety implications. It reveals a concrete, actionable bias mechanism (epidemiological priors suppressing urgency) across multiple frontier models, directly relevant to the rapidly growing deployment of AI in healthcare. While Paper 2 makes a solid methodological contribution regarding contamination detection reliability, it primarily reveals limitations of existing tools without proposing solutions. Paper 1's findings are more likely to drive policy changes, model design improvements, and cross-disciplinary attention from both AI and medical communities.
Paper 1 likely has higher scientific impact due to a more novel methodological contribution (conditioning molecular generation on low-resolution electron density, including experimental cryo-EM/X-ray ED) with broad applicability in structure-based drug design and direct downstream real-world implications (hit generation/optimization across many targets). It proposes a concrete generative framework (EDMolGPT) and reports evaluation on 101 targets, suggesting scalability and rigor. Paper 2 is timely and important for AI safety/fairness in medical triage, but its scope is narrower, depends on prompt-based auditing of existing models, and may be more sensitive to model/version drift, reducing durable methodological impact.
Paper 1 demonstrates a specific, well-characterized safety-critical bias in LLM medical triage with a clear mechanism (diagnostic substitution), rigorous experimental design, and directly actionable implications for AI deployment in healthcare. It reveals a novel, quantified disparity with strong statistical evidence across multiple models. Paper 2 offers an engineering contribution (a lightweight evaluation pipeline) but is incremental relative to existing evaluation frameworks and lacks the depth of scientific insight, novelty, and real-world urgency that Paper 1 provides. Paper 1 is more likely to influence both AI safety policy and clinical AI design.
Paper 2 likely has higher scientific impact due to broader, longer-term relevance and cross-field applicability: it introduces a new benchmark (Causal-Plan-Bench), a large-scale dataset (Causal-Plan-1M), an end-to-end training recipe, and an empirical scaling law for causal training—artifacts that can become community standards for embodied AI, robotics, and vision-language research. Paper 1 is timely and important for AI safety/health equity, but its scope is narrower (specific symptom template and triage setting) and may generalize less, limiting breadth despite strong real-world implications.
Paper 1 addresses a critical and timely issue—gender bias in LLM-based medical triage—with broad societal implications for AI safety and healthcare equity. It demonstrates a systematic, reproducible bias across three major model families with a clear mechanistic explanation (diagnostic substitution). This has immediate policy relevance for AI regulation in healthcare, affects a massive user base, and bridges AI fairness, medicine, and public health. Paper 2 is technically solid but addresses a narrower robotics engineering problem (memory efficiency on edge hardware) with more incremental results and a smaller affected community.
Paper 1 addresses a critical and timely issue—gender bias in LLM-based medical triage—with clear real-world implications for patient safety as AI systems are increasingly deployed in healthcare. It demonstrates a systematic, reproducible bias across multiple leading model families with a well-designed methodology, and its findings have immediate policy relevance for AI regulation and clinical deployment. Paper 2 proposes a technically interesting architectural innovation (score-level SSM-attention fusion), but the results are mixed (Mamba-3 leads on LAMBADA at 369M), tested only at small scale, and represents an incremental contribution in a crowded hybrid architecture space with uncertain long-term adoption.
Paper 2 has higher likely scientific impact: it introduces a broadly applicable methodological advance (graph-coupled causal BO) with clear novelty, theoretical guarantees (low-rank kernel, information-gain and regret bounds), and extensibility beyond a single domain. Its applications span any expensive experimentation setting (engineering, biotech, policy, robotics), giving wide cross-field reach and strong timeliness given interest in causal ML and sample-efficient optimization. Paper 1 is important and timely for AI safety/health equity, but its impact may be narrower (audit of specific LLM behaviors) and more sensitive to model/version drift, limiting generalizability despite strong real-world relevance.
Paper 1 is more novel and timely, probing clinically consequential bias mechanisms in widely used LLMs. It identifies a specific causal pathway (diagnostic substitution via epidemiological priors) with direct safety and policy implications for AI triage design, and releases reproducible artifacts. Its impact could span medicine, AI safety/fairness, health policy, and regulation. Paper 2 is methodologically solid and practically relevant to hydrology, but its core finding (LSTM outperforming an encoder-only Transformer; value of downstream context) is less novel and narrower in cross-field influence.
Paper 1 addresses a critical and timely issue—systematic gender bias in LLM-based medical triage—with rigorous methodology across multiple model families and clear, statistically significant findings. It has broad implications for AI safety, healthcare equity, and policy, and contributes actionable insights (decoupling urgency from diagnostic priors). Paper 2 makes a niche contribution to procedural content generation for game enemies, a narrower domain with limited cross-field impact. Paper 1's relevance to AI fairness, clinical decision-making, and public health gives it substantially higher potential scientific and societal impact.
Paper 1 likely has higher scientific impact due to major technical novelty (agentic formal-proof generation with compiler-in-the-loop), strong methodological contribution (new IMO-style Lean benchmark), and broad downstream applicability to formal verification, mathematics, and trustworthy software/hardware. Its reported performance gains (sub-10% to 70%) and demonstrated utility on research-level formalization suggest a step-change in automated reasoning. Paper 2 is timely and important for AI safety/health equity, but is narrower in scope (single symptom template, limited models) and primarily diagnostic of bias rather than offering a generalizable technical solution.
Paper 1 provides rigorous empirical evidence of a novel, high-stakes failure mode in clinical LLMs (diagnostic substitution). Its findings directly impact AI safety, medical informatics, and clinical practice. The focus on life-critical medical triage biases offers a broader, more urgent, and highly measurable scientific impact compared to Paper 2's conceptual and industry-specific framework for AI insurance claims.
Paper 2 has higher likely scientific impact due to strong real-world implications for patient safety, healthcare policy, and AI governance. It addresses a timely, high-salience issue (LLM bias in medical decision support) with clear, actionable recommendations (decouple urgency from diagnostic priors) and broad cross-field relevance (ML, medicine, ethics, regulation). The experimental design is straightforward and reproducible across multiple frontier model families with statistically significant effects and released artifacts. Paper 1 is technically novel for agent memory, but its impact is more niche to LLM-agent architectures and less immediately consequential.
Paper 2 addresses a highly timely, critical issue of AI bias in healthcare triage with broad implications for patient safety, clinical informatics, and AI fairness. Its findings on diagnostic substitution provide actionable insights for mitigating algorithmic harm in widely used LLMs. In contrast, Paper 1 offers a niche, albeit useful, structural pattern for knowledge graphs in compliance, which has a much narrower scope and limited cross-disciplinary impact compared to the life-or-death implications of medical AI bias.
Paper 1 provides rigorous empirical evidence of a critical, real-world vulnerability in AI medical triage. Its identification of diagnostic substitution as the mechanism for bias offers actionable insights for immediate improvements in clinical LLMs. While Paper 2 presents an important theoretical framework for AI alignment, Paper 1's concrete methodology, timeliness, and direct implications for patient safety give it higher immediate scientific and societal impact.
Paper 2 likely has higher scientific impact due to clear novelty in demonstrating a concrete, quantifiable safety-critical bias mechanism (gender-dependent diagnostic substitution) across multiple frontier model families, with strong real-world applicability to clinical AI deployment and regulation. The methodology is comparatively rigorous (controlled symptom profile, demographic manipulations, multiple models, repeated trials, significance testing, and full artifact release). Its findings are timely and broadly relevant across medicine, AI safety/fairness, and policy. Paper 1 is conceptually interesting but more domain-specific and harder to validate against ground truth.
Paper 1 exposes a critical, life-threatening bias in LLM medical triage, demonstrating how AI replicates human clinical biases. Its findings have profound real-world implications for patient safety, AI ethics, and healthcare regulation, ensuring broad impact across both the medical and machine learning communities. Paper 2 presents a valuable methodological advancement for visual RL, but its impact is more confined to specialized ML research and lacks the immediate societal urgency of Paper 1.
Paper 1 addresses a critical safety issue—gender bias in LLM medical triage—with broad implications for AI deployment in healthcare, policy, and fairness research. It reveals a systematic, reproducible bias across multiple major LLM families with a clear mechanistic explanation (diagnostic substitution). This has immediate real-world relevance as LLMs are increasingly used in clinical settings. Paper 2 makes solid but incremental technical contributions to relational deep learning benchmarks, with narrower impact limited to the ML/database community. Paper 1's interdisciplinary relevance (AI ethics, medicine, policy) gives it substantially higher impact potential.
Paper 2 addresses a critical safety issue—gender bias in LLM-based medical triage—with clear, striking findings that have immediate real-world implications for AI deployment in healthcare. It demonstrates a concrete, reproducible bias mechanism (diagnostic substitution) across multiple major LLM families, making it highly relevant to AI safety, medical informatics, and health equity. Its breadth of impact spans medicine, AI ethics, policy, and fairness research. Paper 1, while technically sound, offers an incremental improvement to memory-agent RL training with narrower applicability to the NLP subfield of long-context processing.
Paper 2 addresses a critical safety and equity issue in medical AI, demonstrating a dangerous mechanism (diagnostic substitution) where epidemiological priors cause LLMs to recommend lower triage urgency for young women compared to men with identical symptoms. This has profound real-world implications for healthcare equity, AI alignment, and clinical deployments, giving it broader societal and scientific urgency. While Paper 1 provides a valuable benchmark for behavioral modeling, its focus on prediction markets and on-chain records is more niche, whereas Paper 2's findings on algorithmic bias have immediate, life-critical consequences.