Gender-Dependent Diagnostic Substitution in LLM Medical Triage: Same Symptoms, Unequal Urgency

Qi Han Wong

Jun 2, 2026

arXiv:2606.03641v1

cs.AI (primary), cs.CY

#310 of 3355 · Artificial Intelligence

Tournament Score: 1507±46 (scale 1050 to 1800)

Win Rate: 79%

Rating: 6.5 / 10

Significance 7 · Rigor 6 · Novelty 6.5 · Clarity 8.5

Abstract

We investigate whether large language models produce different medical triage recommendations for identical neurological symptoms when only the patient's stated gender and age vary. Using three model families--Gemini 3.5 Flash, Claude Sonnet 4.6, and GPT-5.4-mini--we present a standardized symptom profile (persistent headache, blurred vision, morning nausea, visual disturbances) across seven demographic conditions: three age groups (25, 38, 65) x two genders (male, female), plus a gender-unspecified baseline (n = 30 per condition per model, 630 total trials). We find a stark, systemic gender-dependent triage disparity: young women receive significantly lower emergency room (ER) referral rates than age-matched men (Gemini: 0% vs. 23.3%; Claude: 6.7% vs. 96.7%; GPT: 6.7% vs. 66.7%, all p < 0.001). The disparity disappears at age 65 for all models. The primary mechanism is diagnostic substitution: the models anchor on a gender-associated diagnosis, preferentially classifying young women with Idiopathic Intracranial Hypertension (IIH)--a condition epidemiologically linked to women of childbearing age--while diagnosing men with generic increased intracranial pressure with space-occupying lesions in the differential. This diagnostic closure routes female patients to lower-urgency care (outpatient doctor appointments) despite comparable severity ratings (7-9/10). Our findings demonstrate that clinical LLMs replicate documented human clinical biases by using epidemiological priors to suppress triage urgency, suggesting that AI triage engines must decouple urgency assessment from probabilistic diagnostic priors. We release all code, prompts, and raw results.
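The trial count in the abstract follows from the design grid. A minimal sketch of that grid is given here for orientation; the variable names are hypothetical, and representing the gender-unspecified baseline with `None` for both fields is an assumption, since the abstract does not say how age is handled in the baseline.

```python
from itertools import product

# Values taken from the abstract; names are illustrative only.
AGES = (25, 38, 65)
GENDERS = ("male", "female")
MODELS = ("Gemini 3.5 Flash", "Claude Sonnet 4.6", "GPT-5.4-mini")
TRIALS_PER_CONDITION = 30


def build_conditions():
    """Return the seven demographic conditions described in the abstract."""
    conditions = [{"age": a, "gender": g} for a, g in product(AGES, GENDERS)]
    # Gender-unspecified baseline (assumption: age is also left unset).
    conditions.append({"age": None, "gender": None})
    return conditions


def total_trials():
    """Conditions x trials per condition x models."""
    return len(build_conditions()) * TRIALS_PER_CONDITION * len(MODELS)


if __name__ == "__main__":
    print(len(build_conditions()), "conditions,", total_trials(), "trials")
```

Six age-by-gender cells plus one baseline give seven conditions, and 7 × 30 × 3 gives the 630 trials reported.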

AI Impact Assessments

(1 model)

Scientific Impact Assessment

1. Core Contribution

This paper identifies a specific mechanism—diagnostic substitution—by which LLMs produce gender-disparate triage recommendations for identical neurological symptoms. Rather than simply documenting that disparities exist (which prior work has done), the authors show the causal chain: gender/age → differential diagnosis anchoring (IIH for young women vs. space-occupying lesion for men) → different urgency assignments → different care pathways. This mechanistic insight is the paper's most valuable contribution, as it moves beyond "bias exists" to "here is precisely how bias operates," which is actionable for model developers.

The finding that young women receive dramatically lower ER referral rates (e.g., 6.7% vs. 96.7% in Claude at age 25) despite severity scores of 7-9/10 is striking and clinically concerning. The age-65 convergence elegantly confirms that the mechanism is epidemiological prior anchoring on IIH rather than a crude gender heuristic.

2. Methodological Rigor

Strengths: The experimental design follows established counterfactual patient variation methodology. Testing across three model families (Gemini, Claude, GPT) strengthens generalizability claims. The use of Fisher's exact tests with Bonferroni correction is appropriate for the sample sizes and multiple comparisons. The structured JSON output format ensures consistent, machine-parseable results. The diagnosis classification scheme (IIH vs. Generic ICP, mass mentioned) provides transparent operationalization.
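The Fisher's exact test with Bonferroni correction mentioned above can be illustrated on one comparison. This is a rough sketch, not the authors' released code: the counts (2 of 30 versus 29 of 30 ER referrals) are reconstructed from the 6.7% and 96.7% rates reported for Claude and assume 30 valid trials per cell, and the number of comparisons `m = 9` (three models times three age groups) is an assumed placeholder.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher exact p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2 = a + b, c + d
    col1 = a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of x successes in row 1, margins fixed.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Sum over all tables that are no more likely than the observed one.
    return sum(prob(x) for x in range(lo, hi + 1)
               if prob(x) <= p_obs * (1 + 1e-9))


def bonferroni(p, m):
    """Bonferroni-adjusted p-value for m comparisons, capped at 1."""
    return min(1.0, p * m)


if __name__ == "__main__":
    # Rows: women, men. Columns: ER referral, no ER referral.
    # Counts are reconstructed from reported percentages (assumption).
    p = fisher_exact_two_sided(2, 28, 29, 1)
    print(f"raw p = {p:.3g}, adjusted p (m=9, assumed) = {bonferroni(p, 9):.3g}")
```

For an effect this large the adjusted p-value stays far below 0.001 for any plausible number of comparisons, which is consistent with the review's remark that n=30 is adequate for the main effect.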

Weaknesses: The study has several methodological limitations that constrain its impact:

Single symptom profile: Testing only one neurological presentation limits generalizability considerably. The authors acknowledge this, but it remains a fundamental constraint.

n=30 per condition: While adequate for the massive effects observed, this sample size cannot detect subtler disparities and provides limited power for secondary analyses.

Temperature 0.3 only: A single temperature setting leaves uncertainty about robustness. Temperature sensitivity analysis would have been straightforward to include.

No human clinician benchmark: Without knowing what the "correct" triage recommendation is, the paper can only demonstrate disparity, not definitively establish which group is being harmed (though the clinical argument that these symptoms warrant ER evaluation regardless of diagnosis is well-made).

Forced single-turn format: This is a significant ecological validity concern. Real triage interactions are often multi-turn, and the constrained format may artificially amplify demographic priors by making gender the most salient differentiating signal.

Model versions: The paper references model versions (Gemini 3.5 Flash, Claude Sonnet 4.6, GPT-5.4-mini) dated June 2026. These specific versions will rapidly become outdated, though the structural finding about diagnostic substitution as a mechanism likely generalizes.
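The sample-size point in the weaknesses above can be made concrete with a confidence interval on a single reported rate. The sketch uses the Wilson score interval, which is a choice made here for illustration and not something the paper is stated to use; the count of 2 out of 30 is reconstructed from the reported 6.7% and is an assumption.

```python
from math import sqrt


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 for ~95%)."""
    p_hat = successes / n
    denom = 1 + z * z / n
    center = (p_hat + z * z / (2 * n)) / denom
    half = z * sqrt(p_hat * (1 - p_hat) / n + z * z / (4 * n * n)) / denom
    return max(0.0, center - half), min(1.0, center + half)


if __name__ == "__main__":
    # 2 of 30 corresponds to the reported 6.7% ER referral rate (assumed count).
    low, high = wilson_interval(2, 30)
    print(f"95% interval for 2/30: {low:.3f} to {high:.3f}")
```

With 30 trials the interval around a 6.7% rate runs from roughly 2% to 21%, so the headline gaps remain clear while smaller differences between conditions would be hard to separate from noise.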

3. Potential Impact

The practical implications are significant for the growing medical AI industry:

1. Design principle: The recommendation to decouple urgency from probabilistic diagnostic priors is immediately actionable. Triage systems could be architected to assess symptom-level urgency before diagnostic anchoring occurs.

2. Audit methodology: The paper demonstrates that action-level auditing alone is insufficient—diagnosis-level auditing is necessary to detect substitution biases that produce equivalent severity scores but divergent care pathways.

3. Regulatory relevance: As AI triage tools face increasing regulatory scrutiny (FDA, EU AI Act), this type of bias characterization work directly informs safety evaluation frameworks.

The finding could influence adjacent fields including clinical decision support, health equity research, and AI safety/fairness more broadly.
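The audit point in the list above, that action-level checks need a diagnosis-level companion, can be sketched as a per-group report over parsed model outputs. The record fields (`group`, `severity`, `diagnosis`, `action`) and the toy records are hypothetical and are not the paper's data or schema.

```python
from collections import defaultdict
from statistics import mean


def audit(records):
    """Summarize severity, ER rate, and IIH rate per demographic group."""
    by_group = defaultdict(list)
    for r in records:
        by_group[r["group"]].append(r)
    report = {}
    for group, rows in by_group.items():
        report[group] = {
            "mean_severity": mean(r["severity"] for r in rows),
            "er_rate": sum(r["action"] == "ER" for r in rows) / len(rows),
            "iih_rate": sum(r["diagnosis"] == "IIH" for r in rows) / len(rows),
        }
    return report


# Toy records for illustration only (not results from the paper).
TOY = [
    {"group": "F25", "severity": 8, "diagnosis": "IIH", "action": "outpatient"},
    {"group": "F25", "severity": 8, "diagnosis": "IIH", "action": "outpatient"},
    {"group": "M25", "severity": 8, "diagnosis": "raised ICP", "action": "ER"},
    {"group": "M25", "severity": 8, "diagnosis": "raised ICP", "action": "ER"},
]

if __name__ == "__main__":
    for group, row in audit(TOY).items():
        print(group, row)
```

In the toy data the severity column is identical across groups, so an audit limited to severity would pass, while the diagnosis and action columns show the divergence the review describes.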

4. Timeliness & Relevance

This paper is highly timely. LLM-powered triage tools are actively being deployed or piloted by major health systems and consumer health platforms. The specific concern—that epidemiologically valid priors can produce clinically inappropriate urgency reductions—is a nuanced safety issue that simplistic fairness metrics would miss. The paper addresses a genuine bottleneck in clinical AI deployment: how to audit for bias that is "clinically grounded" yet harmful.

The connection to documented human clinical biases (women's symptoms being psychologized, delayed cardiac workups) adds depth and suggests these biases are structurally encoded in training corpora rather than being artifacts of model architecture.

5. Strengths & Limitations

Key Strengths:

Mechanistic clarity: Identifying diagnostic substitution as the specific mechanism is the paper's standout contribution. This is more actionable than generic bias documentation.

Cross-model replication: Consistent findings across three major model families from different companies strengthen the claim that this is a systemic issue.

Age-dependent pattern: The convergence at age 65 is a compelling natural experiment that validates the epidemiological prior hypothesis.

Clinical grounding: The authors correctly note that IIH is a valid differential but argue convincingly that it should not reduce urgency given the symptom presentation. This nuanced framing avoids the trap of dismissing all demographic-specific reasoning.

Open science: Release of code, prompts, and raw data supports reproducibility.

Notable Limitations:

Narrow scope: A single symptom profile fundamentally limits the paper's contribution. Even one additional profile (e.g., chest pain with cardiac gender bias implications) would have substantially strengthened the work.

No mitigation evaluation: The paper identifies the problem and suggests solutions (decoupling urgency from diagnosis, multi-turn interaction) but tests none of them. Even a simple prompt engineering intervention demonstrating reduced bias would significantly increase practical impact.

Missing baseline: No comparison to human clinician triage decisions. Without this, we cannot assess whether the LLMs are better, worse, or equivalent to human bias levels.

Ecological validity: The forced single-turn structured output format, while mirroring some production deployments, is a strong experimental constraint that may amplify the observed effects.

Additional Observations

The paper is well-written, clearly structured, and appropriately cautious in its claims. The framing that models have learned "a clinically grounded epidemiological prior that happens to produce worse outcomes" is more precise and constructive than characterizing models as "biased" in a generic sense. However, the paper's scope is relatively narrow for the strength of its claims—the title implies a general phenomenon, but only one clinical scenario is tested. The work reads more as a compelling demonstration/proof-of-concept than a comprehensive characterization study.

The paper would benefit from discussing whether system prompts explicitly instructing urgency-first reasoning (before diagnosis) could mitigate the effect, as this would be the most immediately deployable intervention.
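One way to picture the urgency-first idea is a two-stage pipeline in which urgency is assessed from symptoms alone and the later, demographics-aware diagnosis step cannot lower it. This is an untested sketch under stated assumptions: the two stub functions stand in for model calls, the red-flag rule and the age cutoff are invented for illustration and are not clinical guidance, and nothing here comes from the paper's code.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

# Ordered from least to most urgent (hypothetical scale).
URGENCY_ORDER = ("self_care", "outpatient", "urgent_care", "er")
RED_FLAGS = {"persistent headache", "blurred vision", "morning nausea"}


@dataclass
class Patient:
    symptoms: Tuple[str, ...]
    age: Optional[int] = None
    gender: Optional[str] = None


def stub_assess_urgency(symptoms):
    """Stage 1 stand-in: sees symptoms only, never demographics."""
    return "er" if len(RED_FLAGS & set(symptoms)) >= 2 else "outpatient"


def stub_diagnose(patient):
    """Stage 2 stand-in that mimics the substitution pattern in the review."""
    if patient.gender == "female" and patient.age is not None and patient.age < 45:
        return "IIH", "outpatient"
    return "raised intracranial pressure", "er"


def triage(patient, assess_urgency=stub_assess_urgency, diagnose=stub_diagnose):
    """Return the differential plus an urgency floored by the stage-1 result."""
    floor = assess_urgency(patient.symptoms)
    differential, proposed = diagnose(patient)
    # The diagnosis step may raise urgency but never lower it.
    final = max(floor, proposed, key=URGENCY_ORDER.index)
    return {"urgency": final, "differential": differential}


if __name__ == "__main__":
    symptoms = ("persistent headache", "blurred vision", "morning nausea")
    print(triage(Patient(symptoms, age=25, gender="female")))
    print(triage(Patient(symptoms, age=25, gender="male")))
```

In this toy setup both patients end at the same urgency level while keeping different differentials, which is the behavior the suggested intervention aims for; whether a real model behaves this way under such prompting would still need to be measured.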

Rating: 6.5 / 10

Significance 7 · Rigor 6 · Novelty 6.5 · Clarity 8.5

Generated Jun 3, 2026

Comparison History (24)

vs. Prototype Transformer: Towards Language Model Architectures Interpretable by Design

gemini-3.1 · 6/6/2026

Paper 2 proposes a fundamental architectural innovation (ProtoT) that addresses two major bottlenecks in modern LLMs: computational efficiency (linear cost) and interpretability. If successful, this architecture could be adopted across all fields utilizing AI, offering a massive breadth of impact. While Paper 1 provides a crucial empirical finding on algorithmic bias in healthcare, Paper 2's foundational contribution to model design gives it a higher potential for widespread, transformative scientific impact.

vs. The Reliability Gap in Benchmark Auditing: Distribution Shift and Scale as Failure Modes of Contamination Detection

claude-opus-4.6 · 6/3/2026

Paper 1 demonstrates a novel, well-quantified finding—systematic gender-dependent diagnostic substitution in LLM medical triage—with immediate real-world patient safety implications. It reveals a concrete, actionable bias mechanism (epidemiological priors suppressing urgency) across multiple frontier models, directly relevant to the rapidly growing deployment of AI in healthcare. While Paper 2 makes a solid methodological contribution regarding contamination detection reliability, it primarily reveals limitations of existing tools without proposing solutions. Paper 1's findings are more likely to drive policy changes, model design improvements, and cross-disciplinary attention from both AI and medical communities.

vs. From Holo Pockets to Electron Density: GPT-style Drug Design with Density

gpt-5.2 · 6/3/2026

Paper 1 likely has higher scientific impact due to a more novel methodological contribution (conditioning molecular generation on low-resolution electron density, including experimental cryo-EM/X-ray ED) with broad applicability in structure-based drug design and direct downstream real-world implications (hit generation/optimization across many targets). It proposes a concrete generative framework (EDMolGPT) and reports evaluation on 101 targets, suggesting scalability and rigor. Paper 2 is timely and important for AI safety/fairness in medical triage, but its scope is narrower, depends on prompt-based auditing of existing models, and may be more sensitive to model/version drift, reducing durable methodological impact.

vs. TriEval: A Resource-Efficient Pipeline for LLM Bias, Toxicity, and Truthfulness Assessment

claude-opus-4.6 · 6/3/2026

Paper 1 demonstrates a specific, well-characterized safety-critical bias in LLM medical triage with a clear mechanism (diagnostic substitution), rigorous experimental design, and directly actionable implications for AI deployment in healthcare. It reveals a novel, quantified disparity with strong statistical evidence across multiple models. Paper 2 offers an engineering contribution (a lightweight evaluation pipeline) but is incremental relative to existing evaluation frameworks and lacks the depth of scientific insight, novelty, and real-world urgency that Paper 1 provides. Paper 1 is more likely to influence both AI safety policy and clinical AI design.

vs. Token Predictors Are Not Planners: Building Physically Grounded Causal Reasoners

gpt-5.2 · 6/3/2026

Paper 2 likely has higher scientific impact due to broader, longer-term relevance and cross-field applicability: it introduces a new benchmark (Causal-Plan-Bench), a large-scale dataset (Causal-Plan-1M), an end-to-end training recipe, and an empirical scaling law for causal training—artifacts that can become community standards for embodied AI, robotics, and vision-language research. Paper 1 is timely and important for AI safety/health equity, but its scope is narrower (specific symptom template and triage setting) and may generalize less, limiting breadth despite strong real-world implications.

vs. AURA: Action-Gated Memory for Robot Policies at Constant VRAM

claude-opus-4.6 · 6/3/2026

Paper 1 addresses a critical and timely issue—gender bias in LLM-based medical triage—with broad societal implications for AI safety and healthcare equity. It demonstrates a systematic, reproducible bias across three major model families with a clear mechanistic explanation (diagnostic substitution). This has immediate policy relevance for AI regulation in healthcare, affects a massive user base, and bridges AI fairness, medicine, and public health. Paper 2 is technically solid but addresses a narrower robotics engineering problem (memory efficiency on edge hardware) with more incremental results and a smaller affected community.

vs. Forget Attention: Importance-Aware Attention Is All You Need

claude-opus-4.6 · 6/3/2026

Paper 1 addresses a critical and timely issue, gender bias in LLM-based medical triage, with clear real-world implications for patient safety as AI systems are increasingly deployed in healthcare. It demonstrates a systematic, reproducible bias across multiple leading model families with a well-designed methodology, and its findings have immediate policy relevance for AI regulation and clinical deployment. Paper 2 proposes a technically interesting architectural innovation (score-level SSM-attention fusion), but its results are mixed (Mamba-3 leads on LAMBADA at 369M), it is tested only at small scale, and it represents an incremental contribution in a crowded hybrid architecture space with uncertain long-term adoption.

vs. Transferring Information Across Interventions in Causal Bayesian Optimization

gpt-5.2 · 6/3/2026

Paper 2 has higher likely scientific impact: it introduces a broadly applicable methodological advance (graph-coupled causal BO) with clear novelty, theoretical guarantees (low-rank kernel, information-gain and regret bounds), and extensibility beyond a single domain. Its applications span any expensive experimentation setting (engineering, biotech, policy, robotics), giving wide cross-field reach and strong timeliness given interest in causal ML and sample-efficient optimization. Paper 1 is important and timely for AI safety/health equity, but its impact may be narrower (audit of specific LLM behaviors) and more sensitive to model/version drift, limiting generalizability despite strong real-world relevance.

vs. Evaluating Transformer and LSTM Frameworks for Prediction in Ungauged Basins

gpt-5.2 · 6/3/2026

Paper 1 is more novel and timely, probing clinically consequential bias mechanisms in widely used LLMs. It identifies a specific causal pathway (diagnostic substitution via epidemiological priors) with direct safety and policy implications for AI triage design, and releases reproducible artifacts. Its impact could span medicine, AI safety/fairness, health policy, and regulation. Paper 2 is methodologically solid and practically relevant to hydrology, but its core finding (LSTM outperforming an encoder-only Transformer; value of downstream context) is less novel and narrower in cross-field influence.

vs. An Exploration of Collision-based Enemy Morphology Generation

claude-opus-4.6 · 6/3/2026

Paper 1 addresses a critical and timely issue—systematic gender bias in LLM-based medical triage—with rigorous methodology across multiple model families and clear, statistically significant findings. It has broad implications for AI safety, healthcare equity, and policy, and contributes actionable insights (decoupling urgency from diagnostic priors). Paper 2 makes a niche contribution to procedural content generation for game enemies, a narrower domain with limited cross-field impact. Paper 1's relevance to AI fairness, clinical decision-making, and public health gives it substantially higher potential scientific and societal impact.

vs. LEAP: Supercharging LLMs for Formal Mathematics with Agentic Frameworks

gpt-5.2 · 6/3/2026

Paper 1 likely has higher scientific impact due to major technical novelty (agentic formal-proof generation with compiler-in-the-loop), strong methodological contribution (new IMO-style Lean benchmark), and broad downstream applicability to formal verification, mathematics, and trustworthy software/hardware. Its reported performance gains (sub-10% to 70%) and demonstrated utility on research-level formalization suggest a step-change in automated reasoning. Paper 2 is timely and important for AI safety/health equity, but is narrower in scope (single symptom template, limited models) and primarily diagnostic of bias rather than offering a generalizable technical solution.

vs. From Control Boundary to Insurance Claim: Reconstructing AI-Mediated Losses Through the CER Framework

gemini-3.1 · 6/3/2026

Paper 1 provides rigorous empirical evidence of a novel, high-stakes failure mode in clinical LLMs (diagnostic substitution). Its findings directly impact AI safety, medical informatics, and clinical practice. The focus on life-critical medical triage biases offers a broader, more urgent, and highly measurable scientific impact compared to Paper 2's conceptual and industry-specific framework for AI insurance claims.

vs. DELTAMEM: Incremental Experience Memory for LLM Agents via Residual Trees

gpt-5.2 · 6/3/2026

Paper 2 has higher likely scientific impact due to strong real-world implications for patient safety, healthcare policy, and AI governance. It addresses a timely, high-salience issue (LLM bias in medical decision support) with clear, actionable recommendations (decouple urgency from diagnostic priors) and broad cross-field relevance (ML, medicine, ethics, regulation). The experimental design is straightforward and reproducible across multiple frontier model families with statistically significant effects and released artifacts. Paper 1 is technically novel for agent memory, but its impact is more niche to LLM-agent architectures and less immediately consequential.

vs. The Violation Situation Pattern: A Knowledge-Graph Pattern for Compliance Violations

gemini-3.1 · 6/3/2026

Paper 2 addresses a highly timely, critical issue of AI bias in healthcare triage with broad implications for patient safety, clinical informatics, and AI fairness. Its findings on diagnostic substitution provide actionable insights for mitigating algorithmic harm in widely used LLMs. In contrast, Paper 1 offers a niche, albeit useful, structural pattern for knowledge graphs in compliance, which has a much narrower scope and limited cross-disciplinary impact compared to the life-or-death implications of medical AI bias.

vs. Solipsistic Superintelligence is Unlikely to be Cooperative

gemini-3.1 · 6/3/2026

Paper 1 provides rigorous empirical evidence of a critical, real-world vulnerability in AI medical triage. Its identification of diagnostic substitution as the mechanism for bias offers actionable insights for immediate improvements in clinical LLMs. While Paper 2 presents an important theoretical framework for AI alignment, Paper 1's concrete methodology, timeliness, and direct implications for patient safety give it higher immediate scientific and societal impact.

vs. Think-Before-Speak: From Internal Evaluation to Public Expression in Multi-Agent Social Simulation

gpt-5.2 · 6/3/2026

Paper 2 likely has higher scientific impact due to clear novelty in demonstrating a concrete, quantifiable safety-critical bias mechanism (gender-dependent diagnostic substitution) across multiple frontier model families, with strong real-world applicability to clinical AI deployment and regulation. The methodology is comparatively rigorous (controlled symptom profile, demographic manipulations, multiple models, repeated trials, significance testing, and full artifact release). Its findings are timely and broadly relevant across medicine, AI safety/fairness, and policy. Paper 1 is conceptually interesting but more domain-specific and harder to validate against ground truth.

vs. TRON: Targeted Rule-Verifiable Online Environments for Visual Reasoning RL

gemini-3.1 · 6/3/2026

Paper 1 exposes a critical, life-threatening bias in LLM medical triage, demonstrating how AI replicates human clinical biases. Its findings have profound real-world implications for patient safety, AI ethics, and healthcare regulation, ensuring broad impact across both the medical and machine learning communities. Paper 2 presents a valuable methodological advancement for visual RL, but its impact is more confined to specialized ML research and lacks the immediate societal urgency of Paper 1.

vs. RelGT-AC: A Relational Graph Transformer for Autocomplete Tasks in Relational Databases

claude-opus-4.6 · 6/3/2026

Paper 1 addresses a critical safety issue—gender bias in LLM medical triage—with broad implications for AI deployment in healthcare, policy, and fairness research. It reveals a systematic, reproducible bias across multiple major LLM families with a clear mechanistic explanation (diagnostic substitution). This has immediate real-world relevance as LLMs are increasingly used in clinical settings. Paper 2 makes solid but incremental technical contributions to relational deep learning benchmarks, with narrower impact limited to the ML/database community. Paper 1's interdisciplinary relevance (AI ethics, medicine, policy) gives it substantially higher impact potential.

vs. InfoMem: Training Long-Context Memory Agents with Answer-Conditioned Information Gain

claude-opus-4.6 · 6/3/2026

Paper 2 addresses a critical safety issue—gender bias in LLM-based medical triage—with clear, striking findings that have immediate real-world implications for AI deployment in healthcare. It demonstrates a concrete, reproducible bias mechanism (diagnostic substitution) across multiple major LLM families, making it highly relevant to AI safety, medical informatics, and health equity. Its breadth of impact spans medicine, AI ethics, policy, and fairness research. Paper 1, while technically sound, offers an incremental improvement to memory-agent RL training with narrower applicability to the NLP subfield of long-context processing.

vs. BehaviorBench: Modeling Real-World User Decisions from Behavioral Traces

gemini-3.1 · 6/3/2026

Paper 2 addresses a critical safety and equity issue in medical AI, demonstrating a dangerous mechanism (diagnostic substitution) where epidemiological priors cause LLMs to recommend lower triage urgency for young women compared to men with identical symptoms. This has profound real-world implications for healthcare equity, AI alignment, and clinical deployments, giving it broader societal and scientific urgency. While Paper 1 provides a valuable benchmark for behavioral modeling, its focus on prediction markets and on-chain records is more niche, whereas Paper 2's findings on algorithmic bias have immediate, life-critical consequences.