ConceptM³oE: Concept-Guided Multimodal Mixture of Experts for Interpretable Computational Pathology

Xuan Wang, Zhongling Xu, Gopi Kannedhara, Joakim Nguyen, Jian Yu, Jinrui Fang, Abdurrahmaan Baghdadi, Tianlong Chen

May 23, 2026

arXiv:2605.24399v1

cs.AI (primary)

#926 of 2682 · Artificial Intelligence

Tournament Score: 1442 ± 43 (scale 1050 to 1800)

Win Rate: 67%

Rating: 6.2 / 10

Significance 6.5 · Rigor 5.8 · Novelty 7 · Clarity 6.5


Abstract

Healthcare models are transitioning from unimodal prediction toward multimodal reasoning over heterogeneous diagnostic inputs. In computational pathology, for complex tumor subtypes that are hard to distinguish on morphology alone, pathology reports and molecular measurements may provide additional diagnostic evidence alongside whole-slide images, yet existing models often fail to clarify how diverse signals assemble into recognizable diagnostic concepts. We propose ConceptM³oE (Concept Multimodal MoE), which embeds concept formation directly within interaction-aware mixture-of-experts (MoE) pathways. The architecture decomposes evidence into modality-specific, redundant, and synergistic experts, which are then projected into structured concept bottlenecks mapping latent features to a hierarchy of morphology and biomarker concepts. To prevent the information loss typical of interpretable bottlenecks, we use residual pathways within each expert so that task-relevant signals flow both through the concepts and directly to the final task prediction, maintaining high performance alongside interpretability. Across an institutional pediatric brain tumor cohort and a public glioma cohort, the framework delivers performance competitive with unconstrained models while producing reasoning traces validated by an independent neuropathologist. In data-limited regimes, ConceptM³oE increases macro-F1 from 56.41% to 66.70% at small training sizes compared to non-concept-informed baselines, while also showing faster training convergence, consistent with the regularizing effect of concept learning. This work offers a scalable path toward high-performance medical AI that is inherently verifiable and better aligned with the complex decision-making of clinical practice.

AI Impact Assessments

(1 model)

Scientific Impact Assessment: ConceptM³oE

1. Core Contribution

ConceptM³oE introduces a framework that embeds concept bottleneck learning directly within an interaction-aware mixture-of-experts (MoE) architecture for multimodal computational pathology. The key innovation is the decomposition of multimodal evidence into modality-specific, redundant, and synergistic expert pathways, each internally structured as a concept embedding model (CEM) that maps latent features to clinically interpretable morphology and biomarker concepts organized in a two-level hierarchy.

The paper addresses a genuine gap: existing multimodal fusion models for pathology achieve strong performance but provide opaque predictions. Concept bottleneck models provide interpretability but typically sacrifice accuracy. ConceptM³oE attempts to bridge this by using residual pathways within each expert to preserve task-relevant information that may not be captured by predefined concepts, while still exposing structured reasoning traces.

The architectural design is thoughtful — the hierarchy where morphology concepts (cellularity, pleomorphism, etc.) inform biomarker concepts (GFAP, H3K27M, etc.) mirrors actual clinical reasoning in neuropathology, where histological patterns precede and inform molecular subtyping decisions.
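To make the design described in this section concrete, here is a minimal sketch in plain Python. It is not the authors' implementation: the class and function names, layer sizes, random weights, and the choice to sum the concept-path and residual-path logits are all assumptions made for illustration. It also simplifies the concept embedding model to one scalar probability per concept, and takes the gate scores as given rather than computing them from the input.

```python
import math
import random

def linear(x, w, b):
    """Dense layer y = W x + b, with W stored as a list of rows."""
    return [sum(wi * xi for wi, xi in zip(row, x)) + bi for row, bi in zip(w, b)]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def softmax(zs):
    m = max(zs)
    exps = [math.exp(z - m) for z in zs]
    total = sum(exps)
    return [e / total for e in exps]

def init(rows, cols, rng):
    """Random weights and zero biases; stand-ins for trained parameters."""
    weights = [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]
    return weights, [0.0] * rows

class ConceptExpert:
    """Hypothetical expert: latent -> morphology -> biomarker -> logits,
    plus a residual path from the latent straight to the logits."""

    def __init__(self, d_latent, n_morph, n_bio, n_classes, rng):
        self.morph = init(n_morph, d_latent, rng)
        # Biomarker concepts see the latent AND the morphology concepts,
        # mirroring the two-level hierarchy described above.
        self.bio = init(n_bio, d_latent + n_morph, rng)
        self.head = init(n_classes, n_morph + n_bio, rng)   # concept path
        self.residual = init(n_classes, d_latent, rng)      # bypass path

    def forward(self, h):
        morph = [sigmoid(z) for z in linear(h, *self.morph)]
        bio = [sigmoid(z) for z in linear(h + morph, *self.bio)]
        concept_logits = linear(morph + bio, *self.head)
        residual_logits = linear(h, *self.residual)
        logits = [c + r for c, r in zip(concept_logits, residual_logits)]
        return logits, {"morphology": morph, "biomarker": bio}

def mixture_forward(latents, experts, gate_scores):
    """Gate-weighted sum of expert logits; returns class probabilities,
    gate weights, and each expert's concept activations."""
    gates = softmax(gate_scores)
    outputs = [e.forward(h) for e, h in zip(experts, latents)]
    n_classes = len(outputs[0][0])
    mixed = [sum(g * out[0][k] for g, out in zip(gates, outputs))
             for k in range(n_classes)]
    return softmax(mixed), gates, [out[1] for out in outputs]

# Two modalities give four experts: two modality-specific, one redundant,
# one synergistic. Latents here are random placeholders.
rng = random.Random(0)
names = ["patch_specific", "graph_specific", "redundant", "synergistic"]
experts = [ConceptExpert(4, 3, 2, 4, rng) for _ in names]
latents = [[rng.gauss(0, 1) for _ in range(4)] for _ in names]
probs, gates, concepts = mixture_forward(latents, experts, [0.2, -0.1, 0.4, 0.9])
```

The point of the sketch is the data flow: every expert exposes its concept activations, while the residual path lets information that the concept vocabulary does not cover still reach the prediction.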

2. Methodological Rigor

Strengths in design: The theoretical framework is well-developed. Proposition 1 formally establishes that the soft concept augmentation preserves the information capacity of the raw expert predictor, and Theorem 1 provides generalization bounds showing concept supervision localizes the hypothesis class. The information-theoretic analysis (mutual information plane) provides compelling evidence that concept embeddings reorganize rather than restrict information.
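For readers unfamiliar with information-plane analysis, the quantity being tracked is mutual information. The assessment does not say which estimator the paper uses, so the following is only the textbook plug-in estimator for discrete variables, written as a sketch; continuous embeddings would first need binning or a different estimator.

```python
import math
from collections import Counter

def mutual_information(xs, ys):
    """Plug-in estimate of I(X;Y) in bits from paired discrete samples."""
    n = len(xs)
    joint = Counter(zip(xs, ys))
    px = Counter(xs)
    py = Counter(ys)
    mi = 0.0
    for (x, y), count in joint.items():
        p_xy = count / n
        # p(x,y) / (p(x) p(y)) written with raw counts
        ratio = count * n / (px[x] * py[y])
        mi += p_xy * math.log2(ratio)
    return mi

# Toy checks: a variable shares 1 bit with itself (two equiprobable values)
# and 0 bits with an independent variable.
mi_identical = mutual_information([0, 0, 1, 1], [0, 0, 1, 1])
mi_independent = mutual_information([0, 0, 1, 1], [0, 1, 0, 1])
```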

Experimental concerns: The evaluation datasets are quite small — 199 WSIs (PBT) and 208 WSIs (TCGA) — which limits statistical power. The 10-fold cross-validation helps, but the high variance in results (e.g., standard deviations of 11-15% on macro-F1) reflects this limitation. The TCGA dataset only has 2 classes (vs. 4 for PBT), limiting the complexity of evaluation.
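A short sketch shows why small folds produce high variance in macro-F1. With roughly 20 slides per test fold (199 slides over 10 folds), a single misclassification moves the metric by several points. The function below is a generic macro-F1 implementation, and the fold of 20 balanced labels is a toy example, not data from the paper.

```python
from statistics import mean

def macro_f1(y_true, y_pred, labels):
    """Unweighted mean of per-class F1; a class with no support and no
    predictions contributes 0."""
    scores = []
    for c in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return mean(scores)

labels = [0, 1, 2, 3]
y_true = [c for c in labels for _ in range(5)]   # toy fold: 5 slides per class
perfect = macro_f1(y_true, y_true, labels)

y_pred = list(y_true)
y_pred[0] = 1                                    # one class-0 slide called class 1
one_error = macro_f1(y_true, y_pred, labels)
```

In this toy fold a single error lowers macro-F1 from 1.0 to about 0.95, so fold-to-fold swings of ten points or more need only a few slides to change hands.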

The comparison framework has notable gaps. Table 1 shows ConceptM³oE underperforms PathMoE on PBT (macro-F1: 77.3% vs. 79.9%) while claiming "competitive performance." The paper frames this carefully but the claim of competitive performance deserves scrutiny — the gap is non-trivial on already small datasets. On TCGA, ConceptM³oE (82.1%) also trails PathMoE (87.1%) substantially.
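Spelled out in percentage points of macro-F1, using only the figures quoted above, the gaps to PathMoE are:

```latex
\Delta_{\text{PBT}} = 79.9 - 77.3 = 2.6,
\qquad
\Delta_{\text{TCGA}} = 87.1 - 82.1 = 5.0
```

Whether gaps of this size are distinguishable from fold-level noise would need a paired comparison across folds, which this assessment does not report.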

The concept extraction pipeline relies on GPT-4 for morphology concept annotation from pathology reports, with reported inter-rater agreement of only 70-75% for pleomorphism and mitotic activity. This introduces noise into the concept supervision that could propagate through the model.
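The assessment does not state which agreement statistic the 70-75% figure refers to. If it is raw percent agreement, the chance-corrected value can be noticeably lower. The sketch below computes both percent agreement and Cohen's kappa; the two label lists are hypothetical and chosen only so that raw agreement comes out at 75%.

```python
from collections import Counter

def percent_agreement(a, b):
    """Fraction of items on which two annotators give the same label."""
    return sum(x == y for x, y in zip(a, b)) / len(a)

def cohens_kappa(a, b):
    """Cohen's kappa: agreement corrected for chance, from marginal label rates."""
    n = len(a)
    p_o = percent_agreement(a, b)
    ca, cb = Counter(a), Counter(b)
    p_e = sum(ca[k] * cb[k] for k in set(a) | set(b)) / (n * n)
    return (p_o - p_e) / (1 - p_e) if p_e < 1 else 1.0

# Hypothetical pleomorphism labels for 20 reports from two annotators.
rater_a = ["high"] * 10 + ["low"] * 10
rater_b = ["high"] * 8 + ["low"] * 2 + ["low"] * 7 + ["high"] * 3

agreement = percent_agreement(rater_a, rater_b)
kappa = cohens_kappa(rater_a, rater_b)
```

On these toy labels, 75% raw agreement corresponds to a kappa of 0.5, which illustrates why the concept supervision for these two concepts should be treated as noisy.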

3. Potential Impact

Clinical interpretability: The most compelling contribution is the structured reasoning trace — gate weights indicating modality reliance, per-expert concept activations, and cross-expert comparison — validated by an independent neuropathologist. This represents genuine progress toward clinically deployable interpretable AI.
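As an illustration of what such a trace could contain, the sketch below formats gate weights and per-expert concept activations and computes a simple cross-expert comparison. The data layout, function names, and all numeric values are hypothetical; only the concept names (cellularity, pleomorphism, GFAP) come from the assessment.

```python
def reasoning_trace(gates, concepts, top_k=2):
    """gates: {expert: weight}; concepts: {expert: {concept: activation}}.
    Returns one line per expert, ordered by gate weight, with its top concepts."""
    lines = []
    for expert, weight in sorted(gates.items(), key=lambda kv: kv[1], reverse=True):
        top = sorted(concepts[expert].items(), key=lambda kv: kv[1], reverse=True)[:top_k]
        shown = ", ".join(f"{name}={act:.2f}" for name, act in top)
        lines.append(f"{expert} (gate {weight:.2f}): {shown}")
    return lines

def cross_expert_spread(concepts):
    """Max minus min activation per concept, over concepts shared by all experts."""
    shared = set.intersection(*(set(c) for c in concepts.values()))
    return {
        name: max(c[name] for c in concepts.values())
              - min(c[name] for c in concepts.values())
        for name in shared
    }

# Hypothetical values for one slide.
gates = {"synergistic": 0.5, "redundant": 0.3, "patch_specific": 0.2}
concepts = {
    "synergistic": {"cellularity": 0.9, "pleomorphism": 0.4, "GFAP": 0.7},
    "redundant": {"cellularity": 0.8, "pleomorphism": 0.6, "GFAP": 0.2},
    "patch_specific": {"cellularity": 0.5, "pleomorphism": 0.3, "GFAP": 0.6},
}
trace = reasoning_trace(gates, concepts)
spread = cross_expert_spread(concepts)
most_contested = max(spread, key=spread.get)
```

A large spread on a concept flags a case where the experts disagree, which is the kind of signal a pathologist might want surfaced before trusting the prediction.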

Sample efficiency: The demonstration that concept supervision improves performance in data-limited regimes (macro-F1 improvement from 56.41% to 66.70% at N=50) is practically important, as many clinical settings involve rare tumor types with limited training data.
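The size of this improvement can be expressed both in absolute points and relative to the baseline. The helper below is a trivial sketch using only the two figures quoted above; the function name is illustrative.

```python
def gain(before, after):
    """Return (absolute gain in points, gain relative to the starting value)."""
    points = after - before
    return points, points / before

points, relative = gain(56.41, 66.70)
summary = f"+{points:.2f} points, {relative:.1%} relative"
```

That is a gain of 10.29 points, or roughly 18% relative to the non-concept-informed baseline at N=50.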

Broader applicability: The framework is architecturally general — the interaction-decomposed experts with concept bottlenecks could extend to other multimodal medical domains (radiology-pathology integration, genomics-imaging fusion). However, the concept vocabulary must be manually designed per domain, limiting scalability.

Limitations in scope: The paper evaluates only on brain tumors (pediatric CNS tumors and gliomas), and the modalities used (WSI patches + cell graphs) are both derived from histopathology slides rather than truly heterogeneous sources. The claim of "multimodal" integration is somewhat overstated — integrating genomics, radiology, or EHR data would be more clinically transformative.

4. Timeliness & Relevance

The paper addresses a timely intersection of three active research areas: multimodal medical AI, mixture-of-experts architectures, and concept-based interpretability. The push toward interpretable clinical AI is increasingly demanded by regulatory frameworks (EU AI Act, FDA guidance on clinical decision support). The MoE approach for decomposing multimodal interactions builds directly on recent work (I2MoE, 2025), and the concept bottleneck literature has been rapidly evolving.

The application to pediatric brain tumors, where molecular subtyping is essential and case volumes are inherently limited, is well-motivated. The sample-efficiency results are particularly relevant for rare disease settings.

5. Strengths & Limitations

Key Strengths:

Principled integration of concept learning within interaction-aware MoE, with formal information-theoretic guarantees

Clinically meaningful two-level concept hierarchy reflecting actual diagnostic reasoning

Independent neuropathologist validation of reasoning traces

Strong sample-efficiency results with theoretical justification

Comprehensive ablation study and information plane analysis

The residual pathway solution to the accuracy-interpretability tradeoff is elegant

Notable Weaknesses:

Small dataset sizes with high variance — results may not be robust

Performance actually trails the unconstrained baseline (PathMoE) meaningfully on both datasets, undermining the "competitive performance" claim

Both input modalities derive from the same slide (WSI patches and cell graphs), limiting the multimodal diversity claim

Concept vocabulary is manually defined and domain-specific; scalability to new domains requires expert curation

The neuropathologist validation is qualitative and limited to individual case studies rather than systematic evaluation

No external validation on non-brain-tumor datasets

GPT-4 concept extraction introduces unquantified noise

The paper is dense and could benefit from clearer separation of core contributions from theoretical apparatus

Reproducibility: The institutional PBT dataset is not publicly available, limiting reproducibility to the TCGA portion. Code availability is not explicitly mentioned.

Overall Assessment

ConceptM³oE presents an architecturally novel and theoretically grounded approach to interpretable multimodal pathology. The concept-within-expert design is original, and the clinical validation adds credibility. However, the empirical evidence is limited by small datasets, the performance gap with unconstrained baselines, and narrow evaluation scope. The paper makes a meaningful conceptual contribution to the intersection of interpretable AI and computational pathology, but the practical impact remains to be demonstrated at scale.

Rating: 6.2 / 10

Significance 6.5 · Rigor 5.8 · Novelty 7 · Clarity 6.5

Generated May 26, 2026

Comparison History (18)

vs. Retrying vs Resampling in AI Control

gpt-5.2 · 5/26/2026

Paper 2 likely has higher impact: it addresses AI control for agentic coding systems, a timely, broadly relevant safety problem with implications across many deployed tools. It offers clear conceptual contributions (retrying vs resampling, info-leak mechanism), disentangles prior confounds, and reports quantitative gains plus contradictions to influential prior work—suggesting field-shaping potential. Paper 1 is strong and clinically valuable, but its impact is narrower (computational pathology) and more domain-specific, whereas Paper 2’s insights can generalize across monitoring, auditing, and agent design in AI systems.

vs. When Does Multi-Agent RL Improve LLM Workflows? Workflow, Scale, and Policy-Sharing Tradeoffs

gpt-5.2 · 5/26/2026

Paper 2 has higher likely scientific impact due to strong real-world clinical relevance (multimodal computational pathology), a clear innovation combining interaction-aware MoE with hierarchical concept bottlenecks plus residual paths to mitigate interpretability–performance tradeoffs, and validation including expert neuropathologist assessment. Its applicability spans medical AI, interpretability, multimodal learning, and data-limited modeling—broadening impact beyond a single subfield. Paper 1 offers valuable mechanistic insights into multi-agent RL for LLM workflows, but is more diagnostic/characterization-focused and narrower in immediate application and cross-domain uptake.

vs. From Model Scaling to System Scaling: Scaling the Harness in Agentic AI

gpt-5.2 · 5/26/2026

Paper 2 has higher likely scientific impact: it introduces a concrete, technically novel architecture (concept-guided multimodal MoE with residual concept bottlenecks), demonstrates methodological rigor with multi-cohort evaluation and expert (neuropathologist) validation, and targets an urgent real-world domain (clinical pathology) where interpretability and multimodal integration are high-impact and deployable. Paper 1 raises an important systems agenda for agentic AI and provides a reference harness, but it is more conceptual/architectural with less evidence of measurable performance gains or validated benchmarks, making near-term impact less certain.

vs. Hylos: Operability Contracts for Model-Native Spatial Intelligence

claude-opus-4.6 · 5/26/2026

ConceptM³oE addresses a concrete, high-impact problem in computational pathology with rigorous methodology: multimodal fusion with interpretability via concept bottlenecks and mixture-of-experts, validated on real clinical cohorts with neuropathologist evaluation. It demonstrates clear quantitative improvements in data-limited regimes and has immediate clinical applicability. Paper 1 (Hylos) presents an interesting systems architecture for spatial AI operability, but is framed as a position/preprint with only a focused artifact study rather than comprehensive evaluation, limiting its demonstrated impact. Paper 2's combination of interpretability, clinical validation, and methodological novelty gives it broader and more immediate scientific impact.

vs. Test-Time Deep Thinking to Explore Implicit Rules

gemini-3.1 · 5/26/2026

Paper 2 addresses a critical challenge in medical AI—balancing high predictive performance with interpretability—using a novel multimodal Mixture of Experts approach. Its direct application to complex tumor diagnosis, validation by clinical experts, and performance improvements in data-limited regimes demonstrate significant real-world utility and methodological rigor. While Paper 1 presents an interesting advancement for LLM agents, Paper 2's potential to directly impact clinical decision-making and advance interpretable multimodal models in healthcare gives it a broader and more profound scientific impact.

vs. SceneCode: Executable World Programs for Editable Indoor Scenes with Articulated Objects

gemini-3.1 · 5/26/2026

Paper 2 introduces a highly innovative approach to scene synthesis by using executable programs to generate articulated, interactable 3D environments. This overcomes a major bottleneck in embodied AI and robotics, which currently rely on static meshes or limited curated libraries. Its broad applicability across robotics, simulation, and spatial computing gives it widespread potential impact, whereas Paper 1 has a narrower, though important, focus on computational pathology.

vs. MDIA: A Multi-Agent Diagnostic Intelligence Pipeline on HealthBench Professional

gemini-3.1 · 5/26/2026

Paper 2 presents a fundamental architectural innovation by integrating concept bottlenecks with multimodal Mixture-of-Experts, addressing the critical need for interpretability in medical AI. Unlike Paper 1, which primarily focuses on LLM orchestration and benchmark engineering, Paper 2 tackles complex multimodal reasoning and demonstrates significant improvements in data-limited regimes with clinical validation, promising broader methodological and practical impacts across healthcare AI.

vs. AMEL: Accumulated Message Effects on LLM Judgments

claude-opus-4.6 · 5/26/2026

Paper 2 (AMEL) has higher potential scientific impact due to its broader relevance across virtually all fields using LLM-as-judge paradigms, which is rapidly becoming ubiquitous. It identifies a systematic, quantified bias (with massive sample size across 11 models) that affects any pipeline using LLMs for sequential evaluation—code review, content moderation, benchmarking. The practical implications are immediate and actionable. Paper 1, while rigorous and clinically meaningful, addresses a narrower niche (multimodal pathology with interpretable MoE) with incremental architectural contributions over existing concept bottleneck and MoE methods.

vs. When Context Hurts: The Crossover Effect of Knowledge Transfer on Multi-Agent Design Exploration

gemini-3.1 · 5/26/2026

Paper 2 challenges a foundational assumption in the rapidly expanding field of LLM agents—that more context is always better. By demonstrating a counterintuitive crossover effect and providing a simple, actionable diagnostic metric, it offers broad, immediate impact across all domains utilizing AI agents. While Paper 1 is highly rigorous and valuable for healthcare, its contributions are more domain-specific and represent architectural refinements, whereas Paper 2 offers a paradigm-shifting insight with wider applicability.

vs. Latent Action Reparameterization for Efficient Agent Inference

gemini-3.1 · 5/26/2026

While Paper 1 offers valuable contributions to interpretable computational pathology, Paper 2 addresses a fundamental bottleneck in LLM agents—inference efficiency and decision horizon. By learning a compact latent action space, Paper 2 provides a highly scalable and broadly applicable framework that can impact a wide range of autonomous AI systems. Its potential for cross-domain application in the rapidly growing field of LLM agents gives it a higher estimated scientific impact.

vs. PHGNet: Prototype-Guided Hypergraph Construction for Heterogeneous Spatiotemporal Forecasting

claude-opus-4.6 · 5/26/2026

ConceptM³oE addresses the critical challenge of interpretability in medical AI, combining multimodal reasoning with concept-based explanations in computational pathology. Its clinical validation by a neuropathologist, improved performance in data-limited regimes, and alignment with the growing demand for trustworthy/explainable AI in healthcare give it broader impact. The framework bridges interpretability and performance—a key bottleneck in clinical AI adoption. PHGNet, while technically sound, offers incremental improvements in traffic forecasting, a more saturated research area with narrower real-world impact compared to healthcare AI.

vs. Whose Alignment? Comparing LLM Process Alignment Across Diverse Organizational Decision Contexts

gpt-5.2 · 5/26/2026

Paper 2 likely has higher scientific impact: it introduces a novel, technically concrete architecture (concept-guided multimodal MoE with residual concept bottlenecks) and validates it on real clinical datasets with expert (neuropathologist) assessment, supporting methodological rigor and translational relevance. Its applications in computational pathology are immediate and high-stakes, and the approach generalizes to broader multimodal medical AI and interpretable ML. Paper 1 offers important conceptual framing and an evaluation method for process alignment, but its impact may be narrower and more context-dependent, with mixed empirical results across domains.

vs. From Accounting to Coordination: A Virtual Water-Aware Electricity-Computation-Water Nexus Framework for Data Center Dispatch

gpt-5.2 · 5/26/2026

Paper 2 has higher likely impact due to strong timeliness (interpretable multimodal medical AI), direct clinical applicability in computational pathology, and broader methodological relevance (concept bottlenecks + multimodal MoE with synergy/redundancy experts) that can transfer to other healthcare and multimodal domains. It reports meaningful gains in data-limited settings and includes expert (neuropathologist) validation of reasoning traces, supporting rigor and real-world credibility. Paper 1 is innovative for ECW nexus dispatch, but its impact is narrower to power-system/data-center coordination and shows modest (3–5%) improvements on benchmark grids.

vs. One Policy, Infinite NPCs: Persona-Traceable Shared RL Policies for Scalable Game Agents

claude-opus-4.6 · 5/26/2026

ConceptM³oE addresses a critical need in medical AI—interpretability in clinical decision-making—combining multimodal reasoning with concept bottlenecks in computational pathology. Its impact spans healthcare AI, interpretable ML, and clinical practice, with validated reasoning traces by domain experts. The improved performance in data-limited regimes is highly relevant for real-world medical settings. While Paper 1 is innovative for game AI with strong engineering contributions, its impact is narrower (game NPCs). Paper 2's methodological contributions to trustworthy medical AI have broader societal implications and cross-disciplinary relevance.

vs. Co-ReAct: Rubrics as Step-Level Collaborators for ReAct Agents

gpt-5.2 · 5/26/2026

Paper 1 likely has higher impact due to stronger real-world clinical applicability and breadth: it advances interpretable multimodal learning in computational pathology with concept bottlenecks plus residual pathways, and includes validation by an independent neuropathologist. If robust, this can influence medical AI deployment, regulation, and trust across healthcare domains. Paper 2 is timely and useful for LLM agent performance, but is more incremental (test-time rubric guidance + learned rubric generator) and its impact may be narrower or faster-moving given rapid agent-method turnover.

vs. FrontierOR: Benchmarking LLMs' Capacity for Efficient Algorithm Design in Large-Scale Optimization

gemini-3.1 · 5/26/2026

Paper 2 presents a novel architectural solution to the performance-interpretability trade-off in medical AI. By combining multimodal MoEs with concept bottlenecks and residual pathways, it directly addresses a critical barrier to clinical AI adoption. While Paper 1 offers a valuable LLM benchmark, Paper 2 provides a tangible methodological advancement with immediate, high-stakes real-world applications in computational pathology and diagnostic healthcare.

vs. Reasoning as an Attack Surface: Adaptive Evolutionary CoT Jailbreaks for LLMs

gemini-3.1 · 5/26/2026

Paper 2 offers profound real-world impact by addressing the critical trade-off between interpretability and performance in medical AI. Its novel combination of Multimodal Mixture of Experts with concept bottlenecks, validated by human experts and demonstrating strong data-efficient performance, advances both algorithmic design and clinical applicability. While Paper 1 is highly timely for LLM security, Paper 2's direct life-saving potential and verifiable decision-making in complex healthcare diagnostics give it a broader, more deeply impactful scientific contribution.

vs. AgentFugue: Agent Scaling for Long-Horizon Tasks through Collective Reasoning

gemini-3.1 · 5/26/2026

While Paper 1 offers a highly rigorous and clinically valuable application of interpretable AI in pathology, Paper 2 demonstrates higher potential scientific impact due to its fundamental contributions to AI agent scaling. 'Scaling out' via collective reasoning addresses a critical bottleneck in foundational AI research regarding long-horizon tasks and test-time compute. Because AgentFugue's domain-agnostic methodology can be applied across virtually all fields utilizing autonomous agents, it possesses significantly greater breadth of impact and timeliness for the broader AI community compared to the specialized architectural focus of Paper 1.
