Mind-Omni: A Unified Multi-Task Framework for Brain-Vision-Language Modeling via Discrete Diffusion

Yizhuo Lu, Changde Du, Qingyu Shi, Hang Chen, Jie Peng, Liuyun Jiang, Shuangchen Zhao, Huiguang He

#286 of 2821 · Artificial Intelligence
Tournament Score: 1511±49
Win Rate: 79% (11 wins, 3 losses, 14 matches)
Rating: 6.8/10

Abstract

Modeling the interplay between external stimuli and internal neural representations is a pivotal research area for Brain-Computer Interfaces (BCIs). A major limitation of prior work is the prevailing paradigm of specialized, single-task models, which curtails versatility and neglects inter-task synergies. To address this, we propose Mind-Omni, the first versatile framework that unifies seven distinct encoding and decoding tasks through a discrete diffusion paradigm. At its core is a novel Brain Tokenizer that transforms heterogeneous, continuous brain signals into standardized, discrete tokens. This enables direct, token-level interactions for mutual understanding and generation between any two or more modalities within a shared semantic space. To unlock advanced reasoning capabilities, we further curate a specialized Brain Question Answering (BQA) instruction-tuning dataset. Our model not only establishes a new state-of-the-art among multi-task unified frameworks but also provides strong evidence for multi-task synergy. By demonstrating performance competitive with, and at times superior to, larger specialized models, our work offers a powerful new paradigm for neural modeling and paves the way for foundation models of neural activity. The code is publicly available at https://github.com/ReedOnePeck/Mind-Omni.

AI Impact Assessments

(1 model)

Scientific Impact Assessment: Mind-Omni

1. Core Contribution

Mind-Omni proposes the first unified framework that handles seven distinct brain encoding and decoding tasks (I→B, T→B, I&T→B, B→I, B→T, B→I&T, and BQA) within a single discrete diffusion architecture. The key innovations are: (1) a Brain Tokenizer that converts continuous fMRI signals into discrete tokens aligned with vision-language semantic spaces via multi-level alignment losses; (2) extension of the Muddit discrete diffusion backbone to accommodate a third brain modality branch; and (3) a curated Brain Question Answering (BQA) instruction-tuning dataset. The paper's central thesis—that multi-task unification can unlock synergistic gains between encoding and decoding tasks—is supported by evidence showing joint I&T→B encoding outperforms single-modality encoding, and joint B→I&T decoding improves both image and text outputs.
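
To make the discrete diffusion paradigm concrete, the following is a minimal sketch of mask-based (absorbing-state) discrete diffusion on a toy token sequence. It assumes a masked formulation; the review does not describe Muddit's actual noise schedule or sampler, and `MASK`, `mask_tokens`, `unmask_step`, and `toy_predict` are hypothetical names introduced here. The toy predictor is an oracle standing in for a trained network.

```python
import random

MASK = -1  # hypothetical id for the absorbing [MASK] token


def mask_tokens(tokens, ratio, rng):
    """Forward (noising) step: replace a random `ratio` of tokens with MASK."""
    n_mask = round(ratio * len(tokens))
    positions = set(rng.sample(range(len(tokens)), n_mask))
    return [MASK if i in positions else t for i, t in enumerate(tokens)]


def unmask_step(tokens, predict, n_reveal):
    """Reverse step: fill the n_reveal masked positions with the highest confidence."""
    masked = [i for i, t in enumerate(tokens) if t == MASK]
    guesses = {i: predict(tokens, i) for i in masked}  # i -> (token, confidence)
    best = sorted(masked, key=lambda i: guesses[i][1], reverse=True)[:n_reveal]
    out = list(tokens)
    for i in best:
        out[i] = guesses[i][0]
    return out


# Toy data: in a brain-vision-language setting these would be brain, image,
# or text token ids; here they are arbitrary integers.
clean = [5, 3, 8, 1, 9, 2]


def toy_predict(tokens, i):
    """Oracle stand-in for a trained denoiser: returns (token, confidence)."""
    return clean[i], 1.0 / (1 + i)


rng = random.Random(0)
noisy = mask_tokens(clean, 0.5, rng)

x = noisy
while MASK in x:
    x = unmask_step(x, toy_predict, n_reveal=2)
```

Positions are revealed in order of predictor confidence rather than left to right, which is the order-free property that separates this family of models from autoregressive decoding.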

2. Methodological Rigor

The methodology is generally well-structured with a clear two-component pipeline (Brain Tokenizer + discrete diffusion backbone). Several aspects demonstrate rigor:

Strengths in methodology:

  • The Brain Tokenizer design is thoughtful, combining reconstruction, commitment, coarse-grained contrastive, fine-grained cross-attention, and perceptual alignment losses. The ablation in Table 5 systematically justifies each component.
  • The choice of discrete diffusion over autoregressive models is well-motivated: permutation invariance avoids confounding sequential biases when studying cross-task synergies.
  • Progressive training (frozen backbone → multi-task → DoRA fine-tuning with BQA) is validated against naive joint training, showing meaningful improvements.
  • MNI152 registration for cross-subject standardization is validated via RSA analysis (Figure 10).
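
For readers unfamiliar with RSA, here is a minimal sketch of the general technique on toy data. It is not the paper's procedure: the review does not say which distance or correlation measure was used, so this sketch assumes correlation distance for the dissimilarity matrices and Pearson correlation to compare them (Spearman is also common). The names `pearson`, `rdm_upper`, and `rsa_score` are hypothetical.

```python
import math


def pearson(a, b):
    """Pearson correlation between two equal-length sequences."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / math.sqrt(va * vb)


def rdm_upper(patterns):
    """Upper triangle of the dissimilarity matrix: 1 - r for each stimulus pair."""
    n = len(patterns)
    return [
        1.0 - pearson(patterns[i], patterns[j])
        for i in range(n)
        for j in range(i + 1, n)
    ]


def rsa_score(patterns_a, patterns_b):
    """Second-order similarity: correlate the two dissimilarity matrices."""
    return pearson(rdm_upper(patterns_a), rdm_upper(patterns_b))


# Toy responses: 4 stimuli x 3 voxels for one subject.
subject_a = [
    [1.0, 2.0, 3.0],
    [1.1, 2.1, 2.9],
    [3.0, 1.0, 2.0],
    [2.0, 3.0, 1.0],
]
# A second subject whose responses are a rescaled, shifted copy of the first,
# so the representational geometry is identical.
subject_b = [[2 * v + 0.5 for v in row] for row in subject_a]

score = rsa_score(subject_a, subject_b)
```

Because only the stimulus-by-stimulus dissimilarity matrices are compared, the two subjects do not need the same number of voxels, which is why RSA is a natural check on cross-subject alignment.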
Concerns:

  • The model is initialized from Muddit's pretrained weights, making it difficult to disentangle the contribution of the pretraining from the novel brain-specific components.
  • The evaluation is limited to 4 subjects from NSD—a single dataset. Cross-dataset generalization is untested.
  • The "multi-task synergy" claim, while supported by performance improvements in joint tasks, lacks mechanistic explanation. The improvement could partly stem from regularization effects rather than genuine synergy.
  • The BQA evaluation using LLM-as-Judge (Qwen3-VL) with only ~24% accuracy raises questions about task difficulty calibration and evaluation reliability.
  • The Brain Tokenizer's VQ step incurs non-trivial information loss, which the authors acknowledge as contributing to the voxel-level encoding gap versus specialists.
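
The information loss in the last concern can be illustrated with a minimal sketch of generic vector quantization: each continuous latent is snapped to its nearest codebook entry, and the residual is what the discrete token cannot represent. This is not the paper's Brain Tokenizer; the codebook, the latents, and the weight `beta` are toy values chosen for illustration, and `quantize` and `squared_error` are hypothetical names.

```python
def squared_error(a, b):
    """Squared L2 distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def quantize(z, codebook):
    """Return (index, code) of the codebook vector nearest to z."""
    idx = min(range(len(codebook)), key=lambda k: squared_error(z, codebook[k]))
    return idx, codebook[idx]


# Toy codebook with 3 entries and toy 2-D latents (illustrative values only).
codebook = [[0.0, 0.0], [1.0, 1.0], [-1.0, 1.0]]
latents = [[0.9, 1.2], [0.1, -0.2], [-0.6, 0.7]]

beta = 0.25  # illustrative commitment weight, an assumption

tokens, errors, commitment = [], [], []
for z in latents:
    idx, code = quantize(z, codebook)
    err = squared_error(z, code)    # information discarded by discretization
    tokens.append(idx)              # the discrete token passed downstream
    errors.append(err)
    commitment.append(beta * err)   # value of a commitment-style penalty
```

Here the three latents map to tokens 1, 0, and 2, and every residual is non-zero. A finer codebook shrinks the residual but never removes it, which is consistent with the authors attributing part of the voxel-level encoding gap to this step.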
3. Potential Impact

    BCI and Neuroscience: The framework's ability to serve as a computational testbed for neuroscientific exploration (replicating category-selective regions, probing novel concept representations) is genuinely valuable. If the approach scales to other neural modalities (EEG, MEG, ECoG), it could become a foundational tool.

    Unified Multimodal Modeling: The extension of discrete diffusion to brain signals opens a new modality axis for the broader MLLM community. The Brain Tokenizer concept could inspire similar approaches for other biological signals.

    Practical limitations on impact: The reliance on 7T fMRI data (expensive, non-portable) limits near-term practical BCI applications. The performance gap with specialists on core tasks like image reconstruction may slow adoption. The framework handles only visual cortex signals, not the full brain.

4. Timeliness & Relevance

    The paper is well-timed, riding the convergence of three trends: (1) the push toward unified/foundation models across modalities, (2) growing interest in brain-AI alignment (evidenced by ICLR/NeurIPS brain decoding papers), and (3) the maturation of discrete diffusion models (Muddit, MMaDA). The question of whether unified models can match specialists in neuroscience is pressing, and this paper provides an early data point. The BQA task, which requires reasoning about brain signals, is a novel formulation that could catalyze a new subfield.

5. Strengths & Limitations

Key Strengths:

  • Scope and ambition: Seven tasks in one model is genuinely unprecedented for brain-vision-language modeling. The comparison table (Table 1) clearly establishes this novelty.
  • Multi-task synergy evidence: The I&T→B encoding results (Figure 7, Table 4) and B→I&T decoding improvements (Tables 2-3) provide compelling evidence for cross-modal complementarity.
  • Comprehensive ablations: Tables 5-6 and Appendix G systematically validate design choices across tokenizer architecture, loss functions, training strategies, and data curation.
  • Neuroscientific validation: Replicating category-selective brain regions (EBA, FFA, PPA) demonstrates the model captures functionally meaningful representations beyond numerical fitting.
  • Code availability and plans to release processed data enhance reproducibility.
Notable Limitations:

  • Performance gap with specialists: On B→I reconstruction, the gap to MindEye2 is substantial (e.g., PixCorr, SSIM). The unified model's competitive advantage emerges primarily on semantic-level metrics.
  • Single dataset: All experiments use NSD. The claimed "multi-subject" capability is tested only across subjects within the same dataset/scanner.
  • Brain Score analysis (Figure 15): The authors acknowledge that VQ-VAE and CLIP tokenizers used in Muddit have lower neural plausibility than CLIP-H features used by specialists, creating an inherent disadvantage for encoding tasks.
  • Limited analysis of failure modes: When does the unified model fail catastrophically? What types of stimuli or brain responses are most challenging?
  • Scalability claims are aspirational: The paper positions itself toward "foundation models for neural activity" but trains on ~72K unique stimuli from 8 subjects with 442M trainable parameters—far from foundation-model scale.
Additional Observations

    The paper's framing as a "paradigm shift" is somewhat overstated given the performance gaps, but the demonstration that a single model can handle all seven tasks with reasonable quality is meaningful. The insight that joint encoding captures the brain's integration of visual and semantic information (Section H) is the paper's most neuroscientifically interesting finding. The Brain Tokenizer's multi-level alignment strategy is a clean engineering contribution that could generalize beyond this specific application.

    Rating: 6.8/10
    Significance 7 · Rigor 6.5 · Novelty 7.5 · Clarity 7.5

    Generated May 29, 2026

    Comparison History (14)

    vs. Voluntary Collusion with Secret Tools in Competing LLM Agents
    gemini-3.1 · 5/29/2026

    Paper 1 proposes a foundational framework unifying brain, vision, and language modeling, which has transformative potential for Brain-Computer Interfaces and neuroscience. The introduction of a novel Brain Tokenizer and a discrete diffusion paradigm offers significant methodological innovation and broad applications in medical and cognitive technologies. While Paper 2 presents an important empirical finding in AI safety, Paper 1's architectural breakthrough in multimodal neural modeling is likely to have a more profound and expansive technological impact across multiple scientific disciplines.

    vs. CircuitFormer: A Circuit Language Model for Analog Topology Design from Natural Language Prompt
    claude-opus-4.6 · 5/29/2026

    Mind-Omni presents a more broadly impactful contribution by unifying seven brain-vision-language tasks through a novel discrete diffusion framework with a Brain Tokenizer, addressing a fundamental challenge in BCI research. Its breadth of impact spans neuroscience, AI, and clinical applications, with potential to serve as a foundation model for neural activity. While CircuitFormer makes a solid contribution to analog EDA with a clever circuit tokenizer and dataset, its scope is narrower, targeting a specialized engineering domain. Mind-Omni's multi-modal unification paradigm and demonstration of cross-task synergy represent a more transformative advance.

    vs. Thinking as Compression: Your Reasoning Model is Secretly a Context Compressor
    gpt-5.2 · 5/29/2026

    Paper 1 has higher estimated scientific impact due to greater novelty (a unified discrete-diffusion, multi-task brain–vision–language framework with a new brain tokenizer), broader cross-field reach (BCI, neuroscience, multimodal foundation models, generative modeling), and strong real-world application potential in neurotechnology. It also contributes datasets (BQA) and open code, aiding adoption. Paper 2 is timely and practical for LLM efficiency, but is more incremental—primarily a prompting/optimization paradigm for context compression with narrower scientific breadth and less fundamental new representation modeling.

    vs. GPS-Enhanced Tourist Mobility Modeling with Seasonal Spatial Priors and LLM-Based Activity Chain Generation
    gpt-5.2 · 5/29/2026

    Paper 1 likely has higher scientific impact due to greater methodological novelty (a unified multi-task brain–vision–language framework using discrete diffusion and a brain-signal tokenizer), broader cross-field relevance (neuroscience, ML, multimodal foundation models, BCIs), and strong timeliness given rapid growth in foundation-model paradigms. If rigorous and reproducible (code provided), it could become a reusable platform and benchmark direction for neural foundation models. Paper 2 is applied and useful for transport planning, but the core innovation (GPS priors + LLM schedule generation) is more domain-specific and less likely to generalize broadly beyond mobility modeling.

    vs. MIRA: Mid-training Rubric Anchoring for Source-Aware Data Selection
    gpt-5.2 · 5/29/2026

    Paper 2 has higher potential scientific impact due to stronger novelty (discrete diffusion + Brain Tokenizer enabling a unified brain-vision-language token space) and broader cross-field reach (BCI, neuroscience, multimodal ML, generative modeling). Its unified multi-task framing across seven tasks and new BQA dataset could catalyze follow-on benchmarks and foundation-model directions, with clear real-world applications in BCIs. Paper 1 is timely and practically valuable for LLM training efficiency, but its impact is more incremental and narrower to data selection in mid-training pipelines.

    vs. When and How Human Curation Backfires: Preference Alignment under Multi-Model Self-Consuming Loop
    gpt-5.2 · 5/29/2026

    Paper 2 has higher estimated impact due to broad, timely relevance to foundation-model training practices (synthetic/self-consuming data) across many domains. It offers a formal multi-model dynamical framework with clear conceptual novelty beyond single-model analyses, and its conclusions about when human curation can backfire are widely applicable to AI safety, alignment, and deployment ecosystems. Paper 1 is innovative and application-rich for BCI, but its impact is likely narrower, constrained by data availability, experimental variability, and domain specificity compared to the cross-field implications of Paper 2.

    vs. DenoiseRL: Bootstrapping Reasoning Models to Recover from Noisy Prefixes
    gpt-5.2 · 5/29/2026

    Paper 2 (DenoiseRL) likely has higher scientific impact due to broader applicability and timeliness: it targets scalable RL-based reasoning improvement for LLMs without stronger teachers or curated datasets, a central bottleneck in current AI. The approach could generalize across many domains and model families, affecting both methodology and practice widely. Paper 1 is innovative for BCI/brain-vision-language unification and could be impactful within neuro-AI, but its real-world impact is constrained by data scarcity, subject variability, and hardware limits, narrowing immediate breadth and reproducibility.

    vs. Harmonizing Real-Time Constraints and Long-Horizon Reasoning: An Asynchronous Agentic Framework for Dynamic Scheduling
    claude-opus-4.6 · 5/29/2026

    Mind-Omni introduces a fundamentally novel unified framework for brain-vision-language modeling that addresses a critical gap in BCI research by unifying seven tasks through discrete diffusion. Its Brain Tokenizer enabling cross-modal token-level interactions is highly innovative, with broad implications for neuroscience, AI, and clinical BCIs. The demonstration of multi-task synergy and competitive performance against specialized models suggests a paradigm shift toward foundation models for neural activity. Paper 2, while solid engineering combining LLMs with scheduling, represents more incremental progress in a narrower application domain with less potential for cross-field impact.

    vs. ZipRL: Adaptive Multi-Turn Context Compression with Hindsight Response Replay
    gemini-3.1 · 5/29/2026

    Mind-Omni pioneers a unified framework bridging neuroscience, vision, and language, which has profound implications for Brain-Computer Interfaces. By creating a shared semantic space for brain signals via a novel Brain Tokenizer, it offers transformative potential across multiple disciplines. While ZipRL provides a valuable methodological optimization for LLM token efficiency, Mind-Omni's interdisciplinary breadth, foundational nature, and potential to unlock advanced human-computer interactions give it a higher overall scientific impact.

    vs. RAISE: RAG Design as an Architecture Search Problem
    gpt-5.2 · 5/29/2026

    Paper 2 is more novel and higher-risk/high-reward: it proposes a unified brain-vision-language multi-task framework with a new discrete “Brain Tokenizer” and diffusion-based modeling, plus a new instruction-tuning BQA dataset. If robust, this could materially advance BCIs and multimodal foundation modeling, impacting neuroscience, ML, and medical/assistive tech. Paper 1 is timely and methodologically rigorous as a benchmarking/architecture-search framework for RAG, likely impactful for reproducibility and engineering practice, but its conceptual novelty and cross-field breadth are narrower than Paper 2’s potential paradigm shift.

    vs. Tiny but Trusted: Efficient Vision-Language Reasoning for Time-Series Anomaly Detection
    gpt-5.2 · 5/29/2026

    Paper 2 likely has higher scientific impact due to greater novelty (unifying 7 brain-vision-language tasks with a discrete diffusion framework and a new Brain Tokenizer), broader cross-field relevance (neuroscience, BCI, multimodal foundation models, diffusion modeling), and stronger real-world application potential in BCIs and neural decoding/encoding. It also suggests a scalable paradigm (tokenization + shared semantic space + instruction tuning) that could generalize beyond specific datasets. Paper 1 is valuable and timely, but its contribution is narrower (time-series anomaly detection benchmark + PEFT VLM fine-tuning) and more incremental relative to existing VLM adaptation work.

    vs. OpenURMA: A Clean-Room Open Implementation of the Unified Bus Protocol
    gpt-5.2 · 5/29/2026

    Paper 1 is likely to have higher scientific impact due to stronger novelty and rigor: it delivers the first open, clean-room implementation of a newly standardized datacenter interconnect protocol (UB) previously locked in closed silicon, enabling reproducible research and broad follow-on work. Its multi-tier artifacts (RTL, SystemC, gem5) and matched RoCE baseline support credible, controlled comparisons, and the results directly address a major systems bottleneck with clear real-world applicability across datacenter networking, OS, and hardware design. Paper 2 is timely and potentially impactful, but diffusion-based multi-task unification is less clearly unique and may face harder reproducibility/generalization constraints.

    vs. Reasoning Matters: Mitigate Hallucination in Multimodal Large Reasoning Models via Reasoning-Conditioned Preference Optimization
    gpt-5.2 · 5/29/2026

    Paper 2 likely has higher impact due to greater novelty and breadth: it proposes a unified brain-vision-language foundation-style framework, introduces a brain-signal tokenizer to discretize heterogeneous neural data, and uses discrete diffusion to support seven tasks—advancing multi-task neural modeling and BCI research with broad cross-field relevance (neuroscience, ML, HCI, clinical/assistive tech). Paper 1 is timely and rigorous for reducing hallucinations in multimodal reasoning models, but it is a more incremental optimization-method contribution within a crowded alignment literature, with narrower application scope.

    vs. VLA-Trace: Diagnosing Vision-Language-Action Models through Representation and Behavior Tracing
    claude-opus-4.6 · 5/29/2026

    Mind-Omni introduces a novel unified framework for brain-vision-language modeling using discrete diffusion, addressing a fundamental limitation (single-task paradigm) in BCI research. It unifies seven tasks, introduces a Brain Tokenizer, and creates a new BQA dataset—all representing significant methodological innovations with broad implications for neuroscience, BCIs, and foundation models. While VLA-Trace provides valuable diagnostic insights for VLA models, it is primarily an analytical/diagnostic tool rather than a new capability. Mind-Omni's potential to enable foundation models for neural activity represents a more transformative contribution with broader cross-disciplinary impact.