Mind-Omni: A Unified Multi-Task Framework for Brain-Vision-Language Modeling via Discrete Diffusion
Yizhuo Lu, Changde Du, Qingyu Shi, Hang Chen, Jie Peng, Liuyun Jiang, Shuangchen Zhao, Huiguang He
Abstract
Modeling the interplay between external stimuli and internal neural representations is a pivotal research area for Brain-Computer Interfaces (BCIs). A major limitation of prior work is the prevailing paradigm of specialized, single-task models, which curtails versatility and neglects inter-task synergies. To address this, we propose Mind-Omni, the first versatile framework that unifies seven distinct encoding and decoding tasks through a discrete diffusion paradigm. At its core is a novel Brain Tokenizer that transforms heterogeneous, continuous brain signals into standardized, discrete tokens. This enables direct, token-level interactions for mutual understanding and generation between any two or more modalities within a shared semantic space. To unlock advanced reasoning capabilities, we further curate a specialized Brain Question Answering (BQA) instruction-tuning dataset. Our model not only establishes a new state-of-the-art among multi-task unified frameworks but also provides strong evidence for multi-task synergy. By demonstrating performance competitive with, and at times superior to, larger specialized models, our work offers a powerful new paradigm for neural modeling and paves the way for foundation models of neural activity. The code is publicly available at https://github.com/ReedOnePeck/Mind-Omni.
AI Impact Assessments
Scientific Impact Assessment: Mind-Omni
1. Core Contribution
Mind-Omni proposes the first unified framework that handles seven distinct brain encoding and decoding tasks (I→B, T→B, I&T→B, B→I, B→T, B→I&T, and BQA) within a single discrete diffusion architecture. The key innovations are: (1) a Brain Tokenizer that converts continuous fMRI signals into discrete tokens aligned with vision-language semantic spaces via multi-level alignment losses; (2) extension of the Muddit discrete diffusion backbone to accommodate a third brain modality branch; and (3) a curated Brain Question Answering (BQA) instruction-tuning dataset. The paper's central thesis—that multi-task unification can unlock synergistic gains between encoding and decoding tasks—is supported by evidence showing joint I&T→B encoding outperforms single-modality encoding, and joint B→I&T decoding improves both image and text outputs.
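To make the idea of turning continuous signals into discrete tokens concrete, here is a minimal sketch of nearest-codebook quantization. This is an illustration only: the assessment does not say how the Brain Tokenizer discretizes fMRI features, so the nearest-neighbour lookup, the function name `quantize`, the toy codebook, and the feature values are all assumptions, and the multi-level alignment losses are not modeled.

```python
def quantize(features, codebook):
    """Map each continuous feature vector to the index of its nearest codebook entry.

    Hypothetical stand-in for a tokenizer's discretization step; not the
    paper's actual implementation.
    """
    tokens = []
    for vec in features:
        # Squared Euclidean distance from this vector to every codebook entry.
        dists = [sum((v - c) ** 2 for v, c in zip(vec, code)) for code in codebook]
        # The token id is the position of the closest entry.
        tokens.append(dists.index(min(dists)))
    return tokens


# Toy codebook with 4 entries of dimension 3 (sizes chosen for illustration).
codebook = [
    [0.0, 0.0, 0.0],
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]

# Three made-up continuous feature vectors standing in for encoded brain signals.
features = [
    [0.9, 0.1, 0.0],
    [0.1, 0.2, 0.8],
    [0.05, -0.1, 0.0],
]

tokens = quantize(features, codebook)
print(tokens)  # [1, 3, 0]
```

Once signals are expressed as integer token ids like these, they can in principle share a vocabulary-style interface with image and text tokens, which is the property the assessment credits for enabling token-level interaction across modalities.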
2. Methodological Rigor
The methodology is generally well-structured with a clear two-component pipeline (Brain Tokenizer + discrete diffusion backbone). Several aspects demonstrate rigor:
Strengths in methodology:
Concerns:
3. Potential Impact
BCI and Neuroscience: The framework's ability to serve as a computational testbed for neuroscientific exploration (replicating category-selective regions, probing novel concept representations) is genuinely valuable. If the approach scales to other neural modalities (EEG, MEG, ECoG), it could become a foundational tool.
Unified Multimodal Modeling: The extension of discrete diffusion to brain signals opens a new modality axis for the broader MLLM community. The Brain Tokenizer concept could inspire similar approaches for other biological signals.
Practical limitations on impact: The reliance on 7T fMRI data (expensive, non-portable) limits near-term practical BCI applications. The performance gap with specialists on core tasks like image reconstruction may slow adoption. The framework handles only visual cortex signals, not the full brain.
4. Timeliness & Relevance
The paper is well-timed, arriving at the convergence of three trends: (1) the push toward unified/foundation models across modalities, (2) growing interest in brain-AI alignment (evidenced by ICLR/NeurIPS brain decoding papers), and (3) the maturation of discrete diffusion models (Muddit, MMaDA). The question of whether unified models can match specialists in neuroscience is pressing, and this paper provides an early data point. The BQA task, which involves reasoning about brain signals, is a novel formulation that could catalyze a new subfield.
5. Strengths & Limitations
Key Strengths:
Notable Limitations:
Additional Observations
The paper's framing as a "paradigm shift" is somewhat overstated given the performance gaps, but the demonstration that a single model can handle all seven tasks with reasonable quality is meaningful. The insight that joint encoding captures the brain's integration of visual and semantic information (Section H) is the paper's most neuroscientifically interesting finding. The Brain Tokenizer's multi-level alignment strategy is a clean engineering contribution that could generalize beyond this specific application.
Generated May 29, 2026
Comparison History (14)
Paper 1 proposes a foundational framework unifying brain, vision, and language modeling, which has transformative potential for Brain-Computer Interfaces and neuroscience. The introduction of a novel Brain Tokenizer and a discrete diffusion paradigm offers significant methodological innovation and broad applications in medical and cognitive technologies. While Paper 2 presents an important empirical finding in AI safety, Paper 1's architectural breakthrough in multimodal neural modeling is likely to have a more profound and expansive technological impact across multiple scientific disciplines.
Mind-Omni presents a more broadly impactful contribution by unifying seven brain-vision-language tasks through a novel discrete diffusion framework with a Brain Tokenizer, addressing a fundamental challenge in BCI research. Its breadth of impact spans neuroscience, AI, and clinical applications, with potential to serve as a foundation model for neural activity. While CircuitFormer makes a solid contribution to analog EDA with a clever circuit tokenizer and dataset, its scope is narrower, targeting a specialized engineering domain. Mind-Omni's multi-modal unification paradigm and demonstration of cross-task synergy represent a more transformative advance.
Paper 1 has higher estimated scientific impact due to greater novelty (a unified discrete-diffusion, multi-task brain–vision–language framework with a new brain tokenizer), broader cross-field reach (BCI, neuroscience, multimodal foundation models, generative modeling), and strong real-world application potential in neurotechnology. It also contributes datasets (BQA) and open code, aiding adoption. Paper 2 is timely and practical for LLM efficiency, but is more incremental—primarily a prompting/optimization paradigm for context compression with narrower scientific breadth and less fundamental new representation modeling.
Paper 1 likely has higher scientific impact due to greater methodological novelty (a unified multi-task brain–vision–language framework using discrete diffusion and a brain-signal tokenizer), broader cross-field relevance (neuroscience, ML, multimodal foundation models, BCIs), and strong timeliness given rapid growth in foundation-model paradigms. If rigorous and reproducible (code provided), it could become a reusable platform and benchmark direction for neural foundation models. Paper 2 is applied and useful for transport planning, but the core innovation (GPS priors + LLM schedule generation) is more domain-specific and less likely to generalize broadly beyond mobility modeling.
Paper 2 has higher potential scientific impact due to stronger novelty (discrete diffusion + Brain Tokenizer enabling a unified brain-vision-language token space) and broader cross-field reach (BCI, neuroscience, multimodal ML, generative modeling). Its unified multi-task framing across seven tasks and new BQA dataset could catalyze follow-on benchmarks and foundation-model directions, with clear real-world applications in BCIs. Paper 1 is timely and practically valuable for LLM training efficiency, but its impact is more incremental and narrower to data selection in mid-training pipelines.
Paper 2 has higher estimated impact due to broad, timely relevance to foundation-model training practices (synthetic/self-consuming data) across many domains. It offers a formal multi-model dynamical framework with clear conceptual novelty beyond single-model analyses, and its conclusions about when human curation can backfire are widely applicable to AI safety, alignment, and deployment ecosystems. Paper 1 is innovative and application-rich for BCI, but its impact is likely narrower, constrained by data availability, experimental variability, and domain specificity compared to the cross-field implications of Paper 2.
Paper 2 (DenoiseRL) likely has higher scientific impact due to broader applicability and timeliness: it targets scalable RL-based reasoning improvement for LLMs without stronger teachers or curated datasets, a central bottleneck in current AI. The approach could generalize across many domains and model families, affecting both methodology and practice widely. Paper 1 is innovative for BCI/brain-vision-language unification and could be impactful within neuro-AI, but its real-world impact is constrained by data scarcity, subject variability, and hardware limits, narrowing immediate breadth and reproducibility.
Mind-Omni introduces a fundamentally novel unified framework for brain-vision-language modeling that addresses a critical gap in BCI research by unifying seven tasks through discrete diffusion. Its Brain Tokenizer enabling cross-modal token-level interactions is highly innovative, with broad implications for neuroscience, AI, and clinical BCIs. The demonstration of multi-task synergy and competitive performance against specialized models suggests a paradigm shift toward foundation models for neural activity. Paper 2, while solid engineering combining LLMs with scheduling, represents more incremental progress in a narrower application domain with less potential for cross-field impact.
Mind-Omni pioneers a unified framework bridging neuroscience, vision, and language, which has profound implications for Brain-Computer Interfaces. By creating a shared semantic space for brain signals via a novel Brain Tokenizer, it offers transformative potential across multiple disciplines. While ZipRL provides a valuable methodological optimization for LLM token efficiency, Mind-Omni's interdisciplinary breadth, foundational nature, and potential to unlock advanced human-computer interactions give it a higher overall scientific impact.
Paper 2 is more novel and higher-risk/high-reward: it proposes a unified brain-vision-language multi-task framework with a new discrete “Brain Tokenizer” and diffusion-based modeling, plus a new instruction-tuning BQA dataset. If robust, this could materially advance BCIs and multimodal foundation modeling, impacting neuroscience, ML, and medical/assistive tech. Paper 1 is timely and methodologically rigorous as a benchmarking/architecture-search framework for RAG, likely impactful for reproducibility and engineering practice, but its conceptual novelty and cross-field breadth are narrower than Paper 2’s potential paradigm shift.
Paper 2 likely has higher scientific impact due to greater novelty (unifying 7 brain-vision-language tasks with a discrete diffusion framework and a new Brain Tokenizer), broader cross-field relevance (neuroscience, BCI, multimodal foundation models, diffusion modeling), and stronger real-world application potential in BCIs and neural decoding/encoding. It also suggests a scalable paradigm (tokenization + shared semantic space + instruction tuning) that could generalize beyond specific datasets. Paper 1 is valuable and timely, but its contribution is narrower (time-series anomaly detection benchmark + PEFT VLM fine-tuning) and more incremental relative to existing VLM adaptation work.
Paper 1 is likely to have higher scientific impact due to stronger novelty and rigor: it delivers the first open, clean-room implementation of a newly standardized datacenter interconnect protocol (UB) previously locked in closed silicon, enabling reproducible research and broad follow-on work. Its multi-tier artifacts (RTL, SystemC, gem5) and matched RoCE baseline support credible, controlled comparisons, and the results directly address a major systems bottleneck with clear real-world applicability across datacenter networking, OS, and hardware design. Paper 2 is timely and potentially impactful, but diffusion-based multi-task unification is less clearly unique and may face harder reproducibility/generalization constraints.
Paper 2 likely has higher impact due to greater novelty and breadth: it proposes a unified brain-vision-language foundation-style framework, introduces a brain-signal tokenizer to discretize heterogeneous neural data, and uses discrete diffusion to support seven tasks—advancing multi-task neural modeling and BCI research with broad cross-field relevance (neuroscience, ML, HCI, clinical/assistive tech). Paper 1 is timely and rigorous for reducing hallucinations in multimodal reasoning models, but it is a more incremental optimization-method contribution within a crowded alignment literature, with narrower application scope.
Mind-Omni introduces a novel unified framework for brain-vision-language modeling using discrete diffusion, addressing a fundamental limitation (single-task paradigm) in BCI research. It unifies seven tasks, introduces a Brain Tokenizer, and creates a new BQA dataset—all representing significant methodological innovations with broad implications for neuroscience, BCIs, and foundation models. While VLA-Trace provides valuable diagnostic insights for VLA models, it is primarily an analytical/diagnostic tool rather than a new capability. Mind-Omni's potential to enable foundation models for neural activity represents a more transformative contribution with broader cross-disciplinary impact.