Quantifying the human visual exposome with vision language models

Christian Rominger, Andreas R. Schwerdtfeger, Malay Gaherwar Singh, Dimitri Khudyakow, Elizabeth A. M. Michels, Fabian Wolf, Jakob Nikolas Kather, Magdalena Katharina Wekenborg

May 5, 2026

arXiv:2605.03863v1

cs.AI (primary), cs.CV

#207 of 2292 · Artificial Intelligence

Tournament Score: 1517 ± 40 (scale 1050 to 1800)

Win Rate: 65%

Rating: 5.5 / 10

Significance 6 · Rigor 4.5 · Novelty 7 · Clarity 6.5

Abstract

The visual environment is a fundamental yet unquantified determinant of mental health. While the concept of the environmental exposome is well established, current methods rely on coarse geospatial proxies or biased self-reports, failing to capture the first-person visual context of daily life. We addressed this gap by coupling ecological momentary assessment with vision language models (VLMs) to quantify the semantic richness of human visual experience. Across 2674 participant-generated photographs, VLM-derived estimates of greenness robustly predicted momentary affect and chronic stress, consistent with established benchmarks. We then developed a semi-autonomous large language model (LLM)-based pipeline that mined over seven million scientific publications to extract nearly 1000 environmental features empirically linked to mental health. When applied to real-world imagery, up to 33 percent of VLM-extracted context ratings significantly correlated with affect and stress. These findings establish a scalable objective paradigm for visual exposomics, enabling high-throughput decoding of how the visible world is associated with mental health.

AI Impact Assessments

(1 model)

Scientific Impact Assessment

Core Contribution

This paper introduces a framework for quantifying the "visual exposome" — the totality of visual environmental features individuals encounter daily — by coupling ecological momentary assessment (EMA) with vision-language models (VLMs). The core novelty lies in three integrated components: (1) using VLMs to automatically extract semantic features from participant-generated photographs of their environments, (2) validating this approach against the well-established greenness-mental health relationship, and (3) developing a semi-autonomous LLM-based literature mining pipeline that identified ~1,000 environmental features linked to mental health from >7 million publications, which were then used as VLM rating targets.

The fundamental problem addressed is the gap between macro-level environmental proxies (GPS, satellite imagery) and the actual first-person visual experience of individuals. This is a genuine measurement gap in environmental health research — the "uncertain geographic context problem" is well-recognized, and this paper offers a technologically grounded solution.

Methodological Rigor

Strengths in design: The EMA protocol (7 days, 7 alarms/day, 106 participants, 2,674 photographs) provides reasonable ecological validity. The multilevel modeling approach appropriately handles the nested data structure. Reliability analyses across five VLM runs showed high consistency (RCn = 0.97-0.99), and cross-model validation using both LLaMA 4 and Qwen3 VL strengthens confidence in the approach.
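The assessment does not give the model specification, so the following is only a minimal sketch of one common preparatory step for nested EMA data: person-mean centering, which splits each photo-level rating into a between-person component (the participant's average) and a within-person deviation from that average. The function name, the record layout, and the example ratings are hypothetical and are not taken from the paper.

```python
from collections import defaultdict
from statistics import mean


def person_mean_center(records):
    """Split ratings into between-person means and within-person deviations.

    records: list of (participant_id, rating) tuples, one per photograph.
    Returns a list of (participant_id, person_mean, deviation) tuples
    in the same order as the input.
    """
    # Group the ratings by participant (hypothetical data layout).
    by_person = defaultdict(list)
    for pid, rating in records:
        by_person[pid].append(rating)

    # Between-person component: each participant's average rating.
    person_means = {pid: mean(vals) for pid, vals in by_person.items()}

    # Within-person component: deviation of each rating from that average.
    return [
        (pid, person_means[pid], rating - person_means[pid])
        for pid, rating in records
    ]


# Illustrative greenness ratings for two made-up participants.
example = [("p1", 2.0), ("p1", 4.0), ("p2", 5.0), ("p2", 7.0), ("p2", 9.0)]
centered = person_mean_center(example)
```

In a multilevel model, the two components would typically enter as separate predictors, so that momentary (within-person) and habitual (between-person) associations are not conflated.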

Concerns: The sample is small (N=106), exclusively European, young (mean age 24.9), and heavily female (78 of 106, about 74%). This severely limits generalizability. More critically, several methodological issues deserve scrutiny:

1. The 33% significant correlation claim requires careful interpretation. With ~1,000 features tested at a nominal 5% significance level and no correction for multiple comparisons, roughly 5% of them (about 50 features) would be expected to reach significance by chance alone. The paper reports "up to 33%"; the "up to" qualifier and the presentation in Figure 3 suggest this is the best-case scenario across different outcome variables and directions. The actual detection rates vary considerably, and the paper does not provide formal multiple comparison corrections for the large-scale feature analysis, instead arguing that rates above 5% exceed chance. This is a weak statistical argument: many extracted features are not independent, so the share of chance findings in any one sample can depart substantially from the 5% expectation.

2. Causal ambiguity is acknowledged but insufficiently addressed. People with higher positive affect may photograph greener environments by choice (selection bias), not because greenery causes positive affect. The photographs are participant-selected, introducing systematic confounding between mood state and photographic subject choice.

3. The literature mining pipeline, while impressive in scale, lacks rigorous validation of extraction accuracy. The paper mentions human-in-the-loop steps but provides limited detail on error rates or false positive/negative rates in the automated extraction of environmental features from publications.

4. Effect sizes are modest. The marginal R² values for greenness models predicting affect are very small (0.006-0.033), meaning VLM-derived features explain minimal variance in mental health outcomes.
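To make the multiple-comparison concern in point 1 concrete, the sketch below computes the number of false positives expected by chance and applies the Benjamini-Hochberg step-up procedure to a list of p-values. It illustrates one standard correction and does not describe any analysis in the paper; the function names and the example p-values are hypothetical.

```python
def expected_false_positives(n_tests: int, alpha: float = 0.05) -> float:
    """Expected count of nominally significant tests if every null is true."""
    return n_tests * alpha


def benjamini_hochberg(p_values: list[float], q: float = 0.05) -> list[bool]:
    """Return one reject flag per p-value using the Benjamini-Hochberg rule."""
    m = len(p_values)
    # Indices of the p-values, ordered from smallest to largest p-value.
    order = sorted(range(m), key=lambda i: p_values[i])

    # Find the largest 1-based rank k with p_(k) <= (k / m) * q.
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * q:
            cutoff_rank = rank

    # Reject every hypothesis ranked at or below that cutoff.
    rejected = [False] * m
    for rank, idx in enumerate(order, start=1):
        if rank <= cutoff_rank:
            rejected[idx] = True
    return rejected


# About 50 chance findings are expected among 1,000 tests at alpha = 0.05.
chance_hits = expected_false_positives(1000)

# Made-up p-values: only the two smallest survive correction at q = 0.05.
flags = benjamini_hochberg([0.039, 0.001, 0.205, 0.008])
```

Benjamini-Hochberg controls the false discovery rate under independence or positive dependence among tests. For arbitrary dependence, which may be relevant for correlated image features, the more conservative Benjamini-Yekutieli variant is the usual alternative.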

Potential Impact

The framework has several promising applications:

Scalable environmental phenotyping for large cohorts, potentially transforming how environmental health studies capture exposure

Urban planning and public health applications, where understanding visual environments' mental health impacts could inform design decisions

Precision psychiatry, enabling personalized environmental recommendations

Digital health integration, combining with wearable sensors for comprehensive environmental monitoring

The literature mining pipeline itself could be repurposed for other systematic evidence synthesis tasks. The conceptual framework of "visual exposomics" is a useful organizing principle.

However, the practical impact may be limited by the requirement for active participant photography, which introduces selection bias and compliance challenges at scale. Passive image capture (e.g., from wearable cameras) would strengthen the approach but raises significant privacy and ethical concerns.

Timeliness & Relevance

The paper is highly timely, sitting at the intersection of several converging trends: the maturation of VLMs, growing interest in exposome research, and the mental health crisis driving demand for environmental interventions. The application of VLMs to health-relevant image analysis is a natural extension of their demonstrated capabilities in other domains. The work addresses a genuine bottleneck — the inability to quantify first-person visual environments at scale — that has limited environmental psychology and public health research.

Strengths & Limitations

Key Strengths:

Novel conceptual framework bridging AI and environmental health psychology

Multi-scale validation (state and trait level outcomes)

Cross-model replication (two different VLM architectures)

Open code and data (anonymized) availability

Practical demonstration that VLMs can extract meaningful environmental features from real-world photographs

The literature mining pipeline is itself a useful methodological contribution

Notable Limitations:

Small, homogeneous sample severely limits generalizability

No multiple comparison correction for the large-scale feature analysis

Participant-selected photographs introduce selection bias confounded with mood

Very small effect sizes (marginal R² < 0.035 for greenness-affect associations)

Single VLM run for the 997-feature analysis (vs. five runs for greenness validation)

The paper claims to establish how the "visible world shapes mental health" but can only demonstrate correlational associations

No comparison with existing geospatial methods on the same data to demonstrate added value

The literature mining validation lacks ground-truth assessment of extraction accuracy

Privacy concerns around participant photography are mentioned but not deeply explored

Additional Observations

The paper's framing occasionally overstates the findings. Phrases like "decoding how the visible world shapes mental health" imply causality that the cross-sectional/correlational design cannot support. The claim of a "paradigm shift" is premature given the proof-of-concept nature and small sample. The approach would benefit from comparison with Google Street View-based analyses, which have been used in related environmental health research, and from integration with objective geolocation data to validate the VLM ratings against ground-truth environmental measures.

The work is best understood as an early proof-of-concept demonstrating feasibility, with the framework's true impact dependent on larger, more diverse implementations with longitudinal and experimental designs.

Rating: 5.5 / 10

Significance 6 · Rigor 4.5 · Novelty 7 · Clarity 6.5

Generated May 6, 2026

Comparison History (37)

vs. CauSim: Scaling Causal Reasoning with Increasingly Complex Causal Simulators

claude-opus-4.6 · 5/16/2026

CauSim addresses a fundamental limitation of LLMs—causal reasoning—with a scalable, generalizable framework that has broad implications across AI, science, and decision-making. Its methodological contribution (turning scarce-label causal problems into supervised ones via executable SCMs) is technically novel and applicable across many domains. Paper 1 is innovative in applying VLMs to visual exposomics for mental health, but its impact is narrower, primarily relevant to environmental health research. CauSim's potential to improve causal reasoning in LLMs represents a more transformative advance with wider cross-disciplinary impact.

vs. Where Paths Split: Localized, Calibrated Control of Moral Reasoning in Large Language Models

claude-opus-4.6 · 5/6/2026

Paper 2 introduces a novel interdisciplinary paradigm—'visual exposomics'—that bridges computer vision, environmental health, and mental health research using VLMs to quantify first-person visual experience at scale. It addresses a fundamental gap in exposome research with a scalable, validated methodology applicable across public health, urban planning, and clinical psychology. Paper 1, while technically sophisticated in its approach to moral reasoning control in LLMs, addresses a narrower problem within AI alignment/safety. Paper 2's broader real-world applicability, interdisciplinary reach, and potential to transform environmental health assessment give it higher impact potential.

vs. Where Paths Split: Localized, Calibrated Control of Moral Reasoning in Large Language Models

claude-opus-4.6 · 5/6/2026

Paper 2 has higher potential scientific impact due to its broader interdisciplinary reach spanning computer vision, environmental health, psychology, and public health. It introduces a novel paradigm ('visual exposomics') that creates an entirely new measurement framework for studying environment-mental health relationships, with clear real-world applications in urban planning, clinical psychology, and public health policy. The scalable pipeline analyzing millions of publications and thousands of photographs demonstrates immediate practical utility. Paper 1, while technically sophisticated, addresses a narrower AI alignment problem with more limited real-world applicability.

vs. Terminus-4B: Can a Smaller Model Replace Frontier LLMs at Agentic Execution Tasks?

claude-opus-4.6 · 5/6/2026

Paper 1 introduces a novel paradigm—'visual exposomics'—that bridges computer vision, environmental health, and mental health research using VLMs/LLMs to systematically quantify first-person visual environments at scale. It addresses a fundamental gap in exposome research with broad interdisciplinary impact across public health, urban planning, psychology, and AI. Paper 2, while practically useful, represents an incremental engineering optimization (fine-tuning a smaller model for a narrow agentic subtask) with limited scientific novelty and narrower impact confined to software engineering tooling.

vs. ADAPTS: Agentic Decomposition for Automated Protocol-agnostic Tracking of Symptoms

gemini-3 · 5/6/2026

Paper 2 introduces a highly novel paradigm ('visual exposomics') that bridges AI, environmental science, and mental health. Its scalable approach combining ecological momentary assessment, VLMs, and massive literature mining offers broader multi-disciplinary applications compared to Paper 1's clinical interview analysis, enabling high-throughput objective quantification of environmental impacts on human well-being.

vs. IG-Search: Step-Level Information Gain Rewards for Search-Augmented Reasoning

claude-opus-4.6 · 5/6/2026

Paper 2 introduces a novel interdisciplinary paradigm—'visual exposomics'—that bridges computer vision, environmental health, and mental health research. It creates an entirely new measurement framework for quantifying visual environments' impact on wellbeing, with broad applications across public health, urban planning, and psychology. While Paper 1 offers a solid incremental improvement in search-augmented reasoning (1.6 EM points over baseline), Paper 2 opens a new research direction with potentially transformative real-world applications, leveraging VLMs in a creative way that could influence multiple fields simultaneously.

vs. Human-Guided Harm Recovery for Computer Use Agents

gemini-3 · 5/6/2026

Paper 2 has higher potential scientific impact due to its broad interdisciplinary reach, bridging AI, psychology, public health, and environmental science. By introducing 'visual exposomics' and leveraging VLMs to quantify the impact of visual environments on mental health at scale, it offers a novel, objective paradigm that replaces coarse proxies. While Paper 1 addresses a timely issue in AI safety, Paper 2's application of AI to solve a long-standing measurement problem in human health offers wider real-world applications and cross-field utility.

vs. Redefining AI Red Teaming in the Agentic Era: From Weeks to Hours

gpt-5.2 · 5/6/2026

Paper 1 is more scientifically impactful: it introduces a novel, generalizable measurement paradigm (“visual exposomics”) by combining EMA with VLM-derived semantic features and a large-scale literature-mining pipeline to link many environmental factors to mental health. It addresses a major measurement bottleneck, has clear real-world applications in psychiatry/public health, and could influence multiple fields (computational social science, environmental health, psychology, CV/NLP). Paper 2 is timely and practically useful, but is more engineering/tooling focused with impact concentrated in AI security workflows and less methodological novelty beyond integration/automation.

vs. Redefining AI Red Teaming in the Agentic Era: From Weeks to Hours

gpt-5.2 · 5/6/2026

Paper 2 introduces a novel, scalable paradigm (“visual exposomics”) by combining EMA with VLM-based semantic quantification and an LLM literature-mining pipeline to derive ~1000 mental-health–linked features, then validates associations in real-world imagery. This is methodologically broader and more likely to generalize across public health, psychiatry, epidemiology, urban planning, and ML. Paper 1 is timely and practically useful for AI security workflows, but appears primarily as an engineering integration over existing attack/transform/scorer libraries with limited new scientific insight beyond tooling and a single case study.

vs. Visual Perceptual to Conceptual First-Order Rule Learning Networks

gemini-3 · 5/6/2026

Paper 1 pioneers a highly interdisciplinary paradigm ('visual exposomics') that bridges AI, public health, and psychology. By using VLMs to objectively quantify environmental impacts on mental health, it offers immense real-world applicability and broad impact across multiple scientific domains. Paper 2 presents a strong technical advancement in neuro-symbolic AI, but its immediate impact is more narrowly focused within the machine learning community.

vs. Visual Perceptual to Conceptual First-Order Rule Learning Networks

gemini-3 · 5/6/2026

Paper 2 pioneers 'visual exposomics,' introducing a highly interdisciplinary paradigm that leverages modern AI (VLMs and LLMs) to directly measure environmental impacts on mental health. While Paper 1 makes strong fundamental methodological contributions to neuro-symbolic AI, Paper 2's broad applicability across public health, psychology, and AI, combined with its direct potential for real-world impact on human well-being, suggests a broader scientific and societal impact.

vs. Agentic-imodels: Evolving agentic interpretability tools via autoresearch

claude-opus-4.6 · 5/6/2026

Paper 2 introduces a genuinely novel paradigm—'visual exposomics'—that bridges environmental health science, mental health research, and AI in a broadly impactful way. It addresses a fundamental gap in quantifying first-person visual environments and their mental health effects, with clear real-world applications in public health, urban planning, and clinical psychology. Paper 1, while technically interesting, is more narrowly focused on optimizing ML tools for AI agents—a niche contribution within the AI/ML ecosystem. Paper 2's interdisciplinary reach, large-scale validation, and potential to transform environmental epidemiology give it higher scientific impact.

vs. End-to-end autonomous scientific discovery on a real optical platform

gemini-3 · 5/6/2026

Paper 2 demonstrates the first end-to-end autonomous AI system to discover and experimentally validate a novel physical mechanism. This represents a massive paradigm shift in AI-driven scientific discovery with profound, cross-disciplinary implications for the future of automated research, whereas Paper 1, while innovative, has a more domain-specific impact in exposomics and psychology.

vs. Mechanical Conscience: A Mathematical Framework for Dependability of Machine Intelligenc

claude-opus-4.6 · 5/6/2026

Paper 1 introduces a novel, scalable paradigm combining VLMs with ecological momentary assessment to quantify visual environments' impact on mental health—a previously unquantified determinant. It demonstrates empirical validation across 2674 photographs and leverages LLM-based literature mining of 7M+ publications, showing real-world applicability. Paper 2 presents a theoretical framework (mechanical conscience) for AI safety in distributed systems with interesting constructs but remains largely theoretical with only illustrative results. Paper 1's interdisciplinary impact (public health, environmental science, computer vision, psychology) and immediate practical applicability give it broader potential impact.

vs. AI scientists produce results without reasoning scientifically

gpt-5.2 · 5/6/2026

Paper 2 has higher likely impact: it addresses a timely, field-wide concern about autonomous AI research reliability, offers broad applicability across domains using a large-scale evaluation (25k+ runs), and introduces actionable epistemic diagnostics beyond outcome metrics. Its conclusions can influence AI agent design, evaluation standards, and policy for AI-assisted science. Paper 1 is novel and application-rich for mental health/environmental exposure measurement, but its impact is more domain-specific and depends on further validation and adoption, whereas Paper 2’s findings generalize to many scientific and industrial settings currently deploying LLM agents.

vs. AI scientists produce results without reasoning scientifically

gemini-3 · 5/6/2026

Paper 2 has broader and more fundamental scientific impact because it critically evaluates the limitations of AI agents across all scientific domains. While Paper 1 offers a highly novel application of AI in public health and psychology, Paper 2 addresses an urgent, cross-disciplinary issue: the epistemic reliability of autonomous 'AI scientists'. Its rigorous evaluation of over 25,000 agent runs provides a crucial course-correction for the rapidly growing field of AI-driven research, affecting how future scientific methodology is developed and validated.

vs. EvoLM: Self-Evolving Language Models through Co-Evolved Discriminative Rubrics

gpt-5.2 · 5/6/2026

Paper 2 likely has higher impact due to a more broadly applicable methodological advance: a self-improving post-training framework that removes dependence on human labels, proprietary models, or fixed reward models—key bottlenecks in current LLM alignment. The co-evolved rubric/policy setup is novel and timely, with strong benchmark gains suggesting reproducibility and adoption across many LLM tasks and domains. Paper 1 is innovative and valuable for mental health/environment research, but its impact is more domain-specific and constrained by data collection/measurement and causal limitations typical of correlational exposome studies.

vs. SymptomAI: Towards a Conversational AI Agent for Everyday Symptom Assessment

claude-opus-4.6 · 5/6/2026

Paper 1 demonstrates higher potential impact due to its large-scale real-world deployment (N=13,917), rigorous clinical validation against expert clinicians, and direct practical implications for healthcare delivery. The finding that agentic AI symptom interviews outperform both user-guided conversations and independent clinicians (OR=2.47) has immediate translational value. The integration with wearable physiological data across 400 conditions opens new research directions. Paper 2 is innovative in visual exposomics but has a smaller sample (2,674 photos), relies on correlational analyses, and has narrower immediate clinical applicability.

vs. OpenSeeker-v2: Pushing the Limits of Search Agents with Informative and High-Difficulty Trajectories

gemini-3 · 5/6/2026

Paper 1 introduces a highly novel, interdisciplinary paradigm ('visual exposomics') that bridges AI, environmental psychology, and public health. By providing a scalable, objective method to quantify the visual environment's impact on mental health, it has immense potential to influence diverse fields including urban planning, epidemiology, and medicine. In contrast, Paper 2 offers a valuable but more incremental methodological improvement specific to the subfield of AI search agents, giving Paper 1 broader scientific impact.

vs. Agentic-imodels: Evolving agentic interpretability tools via autoresearch

gpt-5.2 · 5/6/2026

Paper 2 likely has higher impact due to clear real-world applicability (scalable, objective measurement of environmental visual factors linked to mental health), strong timeliness (VLMs + exposome/psychiatry), and broad cross-field relevance (public health, psychiatry, epidemiology, computer vision, HCI). It combines empirical participant data with a large-scale literature-mining pipeline to define ~1000 features, enabling many downstream studies and interventions. Paper 1 is novel for agent-facing interpretability and could influence AutoML/agentic tooling, but its impact is narrower and depends on adoption and validity of LLM-based interpretability metrics.