Seeing Through Experts' Eyes: A Foundational Vision-Language Model Trained on Radiologists' Gaze and Reasoning

Kinhei Lee, Peiyuan Jing, Zhenxuan Zhang, Yue Yang, Tao Wang, Dominic C Marshall, Yingying Fang, Guang Yang

#149 of 2292 · Artificial Intelligence
Tournament Score: 1530±40 · Win Rate: 76% · 22 wins, 7 losses, 29 matches
Rating: 6.8/10

Abstract

Large-scale vision-language models have shown promise in automating chest X-ray interpretation, yet their clinical utility remains limited by a gap between model outputs and radiologist reasoning. Most systems optimize for semantic information without emulating how experts visually examine medical images, often overlooking critical findings or diverging from established diagnostic workflows. Radiologists follow structured protocols (e.g., the ABCDEF approach) that ensure all clinically relevant regions are systematically examined, reducing missed findings and supporting reliable diagnostic reasoning. We introduce GazeX, a vision-language model that leverages radiologists' eye-tracking data as a behavioral prior to model expert diagnostic reasoning. By incorporating gaze trajectories and fixation patterns into pretraining, GazeX learns to follow the spatial and temporal structure of radiologist attention and integrates observations in a clinically meaningful sequence. Using a curated dataset of over 30,000 gaze key frames from five radiologists, we demonstrate that GazeX produces more accurate, interpretable, and expert-consistent outputs across radiology report generation, disease grounding, and visual question answering, utilizing 231,835 radiographic studies, 780,014 question-answer pairs, and 1,162 image-sentence pairs with bounding boxes. Unlike autonomous reporting systems, GazeX produces verifiable evidence artifacts, including inspection trajectories and finding-linked localized regions, enabling efficient human verification and safe human-AI collaboration. Learning through expert eyes provides a practical route toward more trustworthy, explainable, and diagnostically robust AI systems for radiology and beyond.

AI Impact Assessments

(1 model)

Scientific Impact Assessment: GazeX – A Vision-Language Model Trained on Radiologists' Gaze and Reasoning

1. Core Contribution

GazeX introduces a novel paradigm for medical vision-language models by incorporating radiologists' eye-tracking data as a behavioral prior during pretraining. The central insight is that radiologists follow structured, reproducible inspection patterns (e.g., the ABCDEF protocol) that encode diagnostic reasoning beyond what image-text pairs alone capture. The model integrates three modules: Fine-Grained Visual Perception (FVP) for spatial grounding of findings to gaze clusters, Gaze Trajectory Mimicking (GTM) for learning temporal inspection sequences, and Sequential Dependency Awareness (SDA) for recovering correct viewing order from shuffled sequences. Built on the Qwen2-VL backbone (2B parameters), GazeX is evaluated across report generation, visual question answering, and visual grounding tasks.
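The order-recovery idea behind SDA can be illustrated with a small sketch. This is not the authors' implementation: the function names, the frame labels, and the choice of "original position per shuffled frame" as the training target are all assumptions made for illustration.

```python
import random


def make_order_recovery_sample(frames, seed=None):
    """Shuffle gaze key frames and return (shuffled_frames, target).

    target[i] is the original position of shuffled_frames[i], i.e. the
    label a model would have to predict to recover the viewing order.
    (Hypothetical helper, not part of GazeX.)
    """
    rng = random.Random(seed)
    order = list(range(len(frames)))
    rng.shuffle(order)
    shuffled = [frames[i] for i in order]
    return shuffled, order


def restore_order(shuffled, predicted_positions):
    """Rebuild the original sequence from predicted original positions."""
    restored = [None] * len(shuffled)
    for frame, pos in zip(shuffled, predicted_positions):
        restored[pos] = frame
    return restored


# Hypothetical key-frame identifiers in their true viewing order.
frames = ["kf0", "kf1", "kf2", "kf3"]
shuffled, target = make_order_recovery_sample(frames, seed=0)

# With perfect predictions the original viewing order is recovered.
assert restore_order(shuffled, target) == frames
```

In a setup like this, the model would see `shuffled` and be scored on how closely its predicted positions match `target`.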

The key conceptual advance is treating eye movement patterns as first-class supervision signals during pretraining rather than post-hoc attention alignment. This shifts the training objective from purely semantic matching to behavioral emulation of expert workflows.

2. Methodological Rigor

Strengths in methodology:

  • The REFLACX dataset processing pipeline is well-documented, with clear descriptions of gaze clustering (DBSCAN), Gaussian attention map generation, and attention video construction.
  • The evaluation is comprehensive, spanning three downstream tasks across multiple established benchmarks (MIMIC-CXR, IU X-ray, Medical-CXR-VQA, MS-CXR), totaling 231,835 studies.
  • Statistical testing (Wilcoxon signed-rank tests, ICC analysis, Pearson correlations) is applied appropriately. The ICC analysis between GazeX and radiologists (ICC_X=0.622, ICC_Y=0.651) provides meaningful quantification of spatial agreement.
  • The ablation study systematically removes FVP, GTM, and SDA modules, demonstrating each component's contribution.
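As a rough illustration of the attention-map step mentioned in the first bullet, the sketch below turns gaze cluster centers into a normalized Gaussian attention map. It assumes the clustering (DBSCAN in the paper) has already produced the centers; the duration weighting, the `sigma` value, and the function name are assumptions, not details taken from the paper.

```python
import math


def gaussian_attention_map(centers, width, height, sigma=2.0):
    """Sum one isotropic Gaussian per (x, y, weight) center, then scale to [0, 1].

    Hypothetical helper: sigma and the weighting scheme are illustrative.
    """
    grid = [[0.0] * width for _ in range(height)]
    for cx, cy, weight in centers:
        for y in range(height):
            for x in range(width):
                d2 = (x - cx) ** 2 + (y - cy) ** 2
                grid[y][x] += weight * math.exp(-d2 / (2.0 * sigma ** 2))
    peak = max(max(row) for row in grid)
    if peak > 0:
        grid = [[v / peak for v in row] for row in grid]
    return grid


# Hypothetical cluster centers: (x, y, total fixation duration in seconds).
centers = [(3, 4, 1.2), (10, 6, 0.4)]
amap = gaussian_attention_map(centers, width=16, height=12)

# The longer-fixated cluster dominates the normalized map.
assert amap[4][3] > amap[6][10]
```

The map is indexed as `amap[y][x]`, so the strongest response sits at the center with the largest weight.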
Methodological concerns:

  • The gaze data comes from only 5 radiologists across 3,032 cases, with the curated dataset comprising ~30,000 key frames. This is relatively small for pretraining a foundation model, raising questions about whether the learned inspection patterns generalize beyond these specific radiologists' habits.
  • The 41-case multi-reader subset used for inter-radiologist consistency analysis is quite small for drawing strong conclusions about universal inspection patterns.
  • The paper claims "foundation model" status, but with 2B parameters trained on a limited gaze dataset, this characterization may be overstated. The model is more accurately a domain-adapted model with gaze-augmented pretraining.
  • Zero-shot evaluation on IU X-ray uses the complete dataset rather than the test split due to acknowledged limitations in positive sample distribution, which complicates fair comparison.
  • The comparison with baseline Qwen2-VL involves fine-tuning it with bounding boxes derived from radiologist gaze clusters, which may not be the fairest baseline since Qwen2-VL wasn't designed for this task.
3. Potential Impact

Clinical relevance: The clinician-in-the-loop design philosophy is well-aligned with current regulatory and practical constraints on medical AI. The production of "verifiable evidence artifacts" (inspection trajectories, finding-linked regions) addresses a genuine barrier to clinical adoption: the inability of clinicians to audit AI reasoning.

Broader implications: The approach of encoding expert behavioral priors into model training could extend beyond radiology to pathology, dermatology, or any domain where expert visual inspection patterns are systematic and recordable. The framework essentially creates a bridge between cognitive science (eye-tracking research) and applied AI.

Practical limitations: Eye-tracking data collection is expensive and logistically challenging. The scalability of this approach depends on whether the behavioral prior learned from limited gaze data transfers effectively, or whether much larger gaze datasets would be needed for robust deployment.

4. Timeliness & Relevance

This work addresses a timely concern: despite rapid advances in medical VLMs (GPT-4V, Med-Gemini, etc.), clinical adoption remains limited due to trustworthiness gaps. The 2025 Nature Medicine paper by Tanno et al. on clinician-VLM collaboration in radiology (cited as [8]) confirms this is an active area of investigation. The emphasis on explainability and structured inspection aligns with emerging regulatory requirements for transparent AI in healthcare.

The use of eye-tracking as a training signal also connects to growing interest in human-AI alignment, though applied here in a domain-specific rather than general context.

5. Strengths & Limitations

Key Strengths:

  • Novel and well-motivated integration of behavioral data (eye-tracking) into VLM pretraining
  • Comprehensive multi-task evaluation demonstrating consistent improvements
  • Strong qualitative analysis (Fig. 5, 7) showing clinically meaningful behavioral alignment
  • Ablation study confirms each module's contribution with statistical significance
  • The disease-specific gaze analysis (Fig. 4c) across 14 CheXbert categories provides granular evidence
  • Practical design for human-AI collaboration rather than autonomous reporting
Notable Weaknesses:

  • Limited gaze data scale (5 radiologists, ~3K cases) for a "foundational" model claim
  • No prospective clinical validation or reader studies measuring actual clinical impact (acknowledged by authors)
  • Restricted to chest X-rays; generalizability to other modalities undemonstrated
  • The improvements in some metrics (e.g., BLEU-3, BLEU-4) are modest: comparable to, rather than clearly surpassing, the state of the art
  • CheXbert F1 of 0.460 on MIMIC-CXR, while competitive, is not dramatically superior to existing methods
  • No comparison with recent powerful models like GPT-4V or Med-PaLM in the report generation task
  • The paper does not address potential biases in the 5 radiologists' reading patterns or how institution-specific practices might limit generalizability
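For readers less familiar with label-based F1 scores such as the CheXbert F1 cited above, here is a minimal sketch of a micro-averaged F1 over binary finding labels. The review does not say whether the reported figure is micro- or macro-averaged, so micro-averaging is an assumption here, and the toy labels are invented for illustration only.

```python
def micro_f1(y_true, y_pred):
    """Micro-averaged F1 over multi-label binary predictions.

    y_true and y_pred are lists of equal-length 0/1 label rows,
    one row per report. Counts are pooled across all labels.
    """
    tp = fp = fn = 0
    for true_row, pred_row in zip(y_true, y_pred):
        for t, p in zip(true_row, pred_row):
            if t and p:
                tp += 1
            elif p:
                fp += 1
            elif t:
                fn += 1
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# Toy example with three hypothetical finding labels per report.
y_true = [[1, 0, 1], [0, 1, 0]]
y_pred = [[1, 1, 0], [0, 1, 0]]

# tp = 2, fp = 1, fn = 1, so F1 = 4 / 6.
score = micro_f1(y_true, y_pred)
assert abs(score - 2 / 3) < 1e-9
```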
Additional Observations:

  • The availability of code and data enhances reproducibility
  • The paper is dense but well-structured, with clear figures that communicate complex spatial-temporal relationships
  • The framing around ABCDEF protocols is compelling but the actual connection between the learned gaze patterns and specific ABCDEF steps is not rigorously validated
Summary

GazeX presents a creative and clinically motivated approach to incorporating expert behavioral priors into medical VLMs. The concept is strong and the evaluation is reasonably thorough, though the limited scale of gaze data and absence of prospective clinical validation temper the impact claims. The work opens an interesting research direction at the intersection of cognitive science and medical AI, with practical implications for trustworthy clinical deployment.

Rating: 6.8/10
Significance 7.5 · Rigor 6.5 · Novelty 7.5 · Clarity 7

Generated May 5, 2026

    Comparison History (29)

    vs. Measurement Risk in Supervised Financial NLP: Rubric and Metric Sensitivity on JF-ICR
    claude-opus-4.6 · 5/5/2026

    GazeX introduces a genuinely novel approach—incorporating radiologist gaze data as a behavioral prior into vision-language model pretraining—with broad implications for medical AI, human-AI collaboration, and explainability. It addresses a significant clinical need (reliable chest X-ray interpretation) with a large-scale dataset and demonstrates improvements across multiple tasks. Paper 1, while methodologically careful, addresses a narrower concern (measurement sensitivity in a specific Japanese financial NLP benchmark) with limited generalizability. Paper 2's novelty, real-world clinical applications, and cross-disciplinary relevance (computer vision, NLP, radiology, cognitive science) give it substantially higher impact potential.

    vs. Measurement Risk in Supervised Financial NLP: Rubric and Metric Sensitivity on JF-ICR
    gemini-3 · 5/5/2026

    Paper 2 addresses a critical bottleneck in medical AI (interpretability and expert alignment) by innovatively using radiologist eye-tracking data to train Vision Language Models. This approach significantly enhances diagnostic accuracy and explainability in a high-stakes clinical setting, offering broader real-world applications and higher potential impact on human health compared to the narrower, domain-specific evaluation methodology presented in Paper 1.

    vs. Algebraic Semantics of Governed Execution: Monoidal Categories, Effect Algebras, and Coterminous Boundaries
    claude-opus-4.6 · 5/5/2026

    Paper 2 (GazeX) addresses a significant gap in medical AI by incorporating radiologists' gaze patterns into vision-language models, offering a novel and practical approach to improving clinical AI trustworthiness and interpretability. It has broad real-world applicability in radiology, a large-scale empirical evaluation, and tackles the timely problem of human-AI collaboration in healthcare. Paper 1, while technically rigorous with its mechanized formal verification of governed execution, addresses a narrower audience in programming language theory and formal methods, limiting its breadth of impact. GazeX's interdisciplinary nature (computer vision, NLP, clinical medicine, eye-tracking) gives it wider potential influence.

    vs. Algebraic Semantics of Governed Execution: Monoidal Categories, Effect Algebras, and Coterminous Boundaries
    gemini-3 · 5/5/2026

    Paper 1 addresses a critical real-world problem in medical AI—clinical utility and explainability—by incorporating expert gaze patterns into a vision-language model. This novel approach has immediate, high-impact applications in healthcare and can be generalized to other domains requiring visual expertise. Paper 2, while highly rigorous in its formal verification, is highly theoretical and specialized within programming language theory, giving Paper 1 a much broader potential scientific and societal impact.

    vs. A Knowledge-Driven LLM-Based Decision-Support System for Explainable Defect Analysis and Mitigation Guidance in Laser Powder Bed Fusion
    gpt-5.2 · 5/5/2026

    Paper 2 is likely higher impact due to a more novel and generalizable training signal (expert gaze as a behavioral prior) applied at large scale, directly addressing a central clinical bottleneck: aligning model reasoning with radiologist workflow. Its potential real-world application in medical imaging is immediate and broad, with rigorous evaluation across multiple tasks and large datasets plus verifiable evidence artifacts that support safe deployment. Paper 1 is solid and rigorous but is more domain-specific (LPBF defects with a 27-defect ontology) and primarily integrates existing LLM+knowledge-base patterns, limiting breadth of impact.

    vs. Graph-of-Agents: A Graph-based Framework for Multi-Agent LLM Collaboration
    gpt-5.2 · 5/5/2026

    Paper 2 is likely higher impact: it introduces a distinctive training signal (radiologists’ gaze/attention trajectories) that tightly links model behavior to expert workflow, improving interpretability and verifiability—key bottlenecks for clinical deployment. It is methodologically grounded in large-scale multimodal datasets and yields concrete artifacts (trajectories, localized evidence) that enable human-AI collaboration. Real-world application potential in radiology is immediate and high-value, and the approach could generalize to other expert-visual domains. Paper 1 is useful for LLM orchestration but is closer to incremental framework engineering with less transformative downstream impact.

    vs. LiteResearcher: A Scalable Agentic RL Training Framework for Deep Research Agent
    gpt-5.2 · 5/5/2026

    Paper 1 is more scientifically impactful: it introduces a novel, clinically grounded behavioral prior (radiologist gaze and inspection sequences) for vision-language modeling, directly targeting a key barrier to real-world deployment—trustworthy, verifiable reasoning in medical imaging. The work leverages substantial multimodal datasets and yields interpretable artifacts (trajectories, localized evidence) aligned with clinical workflows, increasing translational potential and patient-safety relevance. While Paper 2 is timely and potentially broad for agent training, its “virtual world” RL scalability claims may be more incremental and benchmark-dependent, with less clear downstream societal impact than improved radiology reliability.

    vs. Bounding the Black Box: A Statistical Certification Framework for AI Risk Regulation
    gemini-3 · 5/5/2026

    Paper 2 addresses an urgent, cross-disciplinary challenge: quantifying AI safety for regulatory compliance. By providing a black-box statistical verification framework applicable to any high-risk AI system, it has much broader implications across finance, autonomous driving, and law than Paper 1, which, while highly innovative and methodologically rigorous, is restricted to the specific domain of medical imaging. Paper 2's potential to standardize global AI risk management and deployment gives it higher overall scientific and societal impact.

    vs. Bounding the Black Box: A Statistical Certification Framework for AI Risk Regulation
    gemini-3 · 5/5/2026

    Paper 2 addresses an urgent, universal challenge in AI: quantitative safety certification for regulatory compliance (e.g., EU AI Act). Its model-agnostic statistical framework scales across arbitrary architectures and sectors, offering broader societal, legal, and cross-disciplinary impact than Paper 1's domain-specific, though highly innovative, medical imaging application.

    vs. If Only My CGM Could Speak: A Privacy-Preserving Agent for Question Answering over Continuous Glucose Data
    claude-opus-4.6 · 5/5/2026

    GazeX introduces a fundamentally novel approach by incorporating radiologist gaze data as a behavioral prior into vision-language model training, bridging the gap between AI outputs and expert reasoning. Its large-scale multi-task evaluation (report generation, disease grounding, VQA) across 231K+ studies demonstrates broad methodological rigor. The concept of learning from expert visual attention is highly transferable beyond radiology, offering wide cross-disciplinary impact. Paper 2, while practical, presents an engineering framework (LLM-as-reasoning-engine for CGM data) with narrower scope and more incremental innovation in privacy-preserving QA pipelines.

    vs. SafeAgent: A Runtime Protection Architecture for Agentic Systems
    claude-opus-4.6 · 5/5/2026

    GazeX introduces a genuinely novel paradigm—integrating radiologist gaze data as a behavioral prior into vision-language model training—which bridges cognitive science, medical AI, and computer vision. Its large-scale dataset (30,000+ gaze keyframes, 231,835 studies), multi-task evaluation, and focus on interpretable, verifiable AI outputs address critical clinical deployment challenges. The approach is transferable beyond radiology. SafeAgent, while addressing an important security problem, offers an incremental architectural contribution (runtime guardrails) within a more narrowly scoped domain, with less cross-disciplinary novelty and narrower potential for broad scientific influence.

    vs. Mini-BEHAVIOR-Gran: Revealing U-Shaped Effects of Instruction Granularity on Language-Guided Embodied Agents
    gpt-5.2 · 5/5/2026

    Paper 1 introduces a novel, data-intensive approach (integrating radiologists’ eye-tracking as a behavioral prior) that directly targets a major clinical gap: aligning model outputs with expert workflow and providing verifiable evidence artifacts. Its potential real-world impact in radiology is high, with clear safety/interpretability benefits and large-scale datasets spanning multiple tasks. Paper 2 is timely and useful for embodied AI evaluation, but is primarily a benchmark/analysis contribution with narrower immediate application and less direct societal impact than clinically deployable, trust-enhancing medical AI.

    vs. LiteResearcher: A Scalable Agentic RL Training Framework for Deep Research Agent
    gemini-3 · 5/5/2026

    Paper 2 presents a highly novel and interdisciplinary approach by incorporating human eye-tracking data as a behavioral prior into Vision Language Models. This directly addresses critical bottlenecks in medical AI, specifically interpretability, trustworthiness, and alignment with expert workflows. Its potential to improve clinical diagnostics and human-AI collaboration in healthcare gives it a more profound and immediate real-world impact compared to Paper 1, which, while impressive in optimizing RL for AI agents, offers a more incremental methodological advancement within a narrower subfield.

    vs. Escaping the Context Bottleneck: Active Context Curation for LLM Agents via Reinforcement Learning
    gpt-5.2 · 5/5/2026

    Paper 2 is likely to have higher scientific impact due to broader applicability across LLM agents and long-horizon tasks, a timely problem (context bottleneck) affecting many domains, and a general architectural idea (separating context management from task execution) that can transfer across models and environments. It reports measurable gains on established agent benchmarks with efficiency improvements, suggesting practical adoption potential. Paper 1 is innovative and clinically relevant, but its impact is narrower (radiology, eye-tracking data requirements) and may face higher deployment and data-collection barriers.

    vs. OLLM: Options-based Large Language Models
    gpt-5.2 · 5/5/2026

    Paper 2 has higher potential scientific impact due to greater novelty and generality: it introduces a broadly applicable modification to next-token modeling via discrete latent “options,” enabling controllable diversity and efficient latent-space policy learning with minimal added parameters. Its methodological claim (structure-induced alignment and sample-efficient RL) could influence core LLM training/alignment across tasks and domains beyond math. Paper 1 is impactful for medical imaging and interpretability, but is more domain-specific and depends on specialized gaze datasets and clinical validation for real-world adoption, limiting breadth compared to a general LLM framework.

    vs. Strategy-Aware Optimization Modeling with Reasoning LLMs
    gemini-3 · 5/5/2026

    Paper 2 addresses a high-stakes, real-world clinical problem by introducing a highly novel multimodal approach (incorporating radiologists' gaze trajectories into VLM pretraining). This enhances interpretability, trust, and human-AI collaboration in healthcare, offering broader and more critical societal impact compared to the domain-specific algorithmic improvements in optimization modeling presented in Paper 1.

    vs. NeuroState-Bench: A Human-Calibrated Benchmark for Commitment Integrity in LLM Agent Profiles
    gpt-5.2 · 5/5/2026

    Paper 1 likely has higher scientific impact due to greater applied novelty and direct real-world clinical relevance: integrating radiologists’ gaze as a behavioral prior into a vision-language model targets a key barrier (trustworthy, workflow-aligned reasoning) and can materially affect diagnostic accuracy, interpretability, and human-AI verification in high-stakes medicine. It also leverages large-scale multimodal datasets and produces verifiable artifacts (trajectories/grounding), potentially influencing broader medical imaging and explainable AI. Paper 2 is methodologically solid and timely for LLM evaluation, but its scope is narrower (benchmark/metric) and may yield more incremental downstream impact.

    vs. Auditable Agents
    claude-opus-4.6 · 5/5/2026

    GazeX introduces a genuinely novel approach—integrating radiologist eye-tracking data as a behavioral prior into vision-language model training—with concrete empirical results across multiple clinical tasks using large-scale datasets. This bridges cognitive science, computer vision, and clinical medicine, offering broad interdisciplinary impact and a practical path toward trustworthy medical AI. Paper 2 makes important conceptual contributions to LLM agent accountability but is more of a position/framework paper with limited empirical novelty. GazeX's methodological innovation and direct clinical applicability give it higher potential impact.

    vs. Auditable Agents
    gemini-3 · 5/5/2026

    While Paper 2 offers significant advancements in medical AI and explainability, Paper 1 addresses a critical, universal bottleneck in AI deployment: agent accountability. By providing a foundational framework, practical mechanisms, and empirical feasibility for auditing LLM agents, Paper 1 has a vastly broader potential impact across multiple fields, including AI safety, governance, cybersecurity, and enterprise adoption.

    vs. Brief chatbot interactions produce lasting changes in human moral values
    gemini-3 · 5/5/2026

    Paper 2 presents a highly rigorous, large-scale approach to a critical bottleneck in medical AI (interpretability and expert alignment), utilizing a massive dataset to provide immediate clinical utility. While Paper 1 explores a timely issue regarding AI persuasion and human morals, its relatively small sample size (n=53) limits its robustness and broad methodological impact compared to the extensive empirical validation and clear real-world applications demonstrated in Paper 2.