Trong Khiem Tran, Anh Duc Chu, Quang Hung Pham, Phi Le Nguyen, Trong Nghia Hoang
Cross-modal knowledge distillation (CMKD) studies how a (large) teacher model trained on one type of data (e.g., images) can guide a (smaller) student model building on another type of data (e.g., text/audio). Existing CMKD methods often require paired multi-modal data with aligned semantics, but obtaining such paired data is often costly and impractical. To mitigate this limitation, we develop a new CMKD framework for the more challenging setting where paired data are unavailable. In particular, we establish a cross-modal distributional relationship between teacher and student models, which reveals two fundamental quantities governing effective distillation: feature alignment and label alignment. These quantities characterize semantic discrepancy between modalities at the levels of representation and prediction distributions, respectively. Motivated by this insight, we propose a principled framework, with theoretical guarantees, that enables effective cross-modal knowledge distillation by aligning distributions rather than individual samples. Extensive experiments across a wide range of multimodal benchmarks show that our framework is highly effective in both unpaired and paired data settings, improving significantly over prior work.
This paper addresses a genuine and practically important gap in cross-modal knowledge distillation (CMKD): the reliance on paired multimodal data with sample-level correspondence. The key contribution is a theoretical framework decomposing the student's generalization error into three components—teacher error (fixed overhead), feature alignment (distributional discrepancy in representation space), and label alignment (predictive distributional discrepancy). This decomposition motivates UCMKD, a practical algorithm that performs distribution-level alignment rather than sample-level matching, implemented via bi-level optimization with Wasserstein-based feature alignment and a label transport kernel for selective knowledge transfer.
The problem formulation is well-motivated: paired multimodal data is indeed expensive and often unavailable when modalities are collected independently. Moving from sample-level to distribution-level alignment is a conceptually clean and principled shift that opens CMKD to more realistic deployment scenarios.
Theoretical Analysis: The paper provides both asymptotic (Theorem 2.6) and finite-sample (Theorem 2.7) generalization bounds. The proof strategy is relatively standard—using Kantorovich-Rubinstein duality for the feature alignment term and introducing a label transport kernel for decomposing the prediction gap—but the application to the CMKD setting is novel and the resulting bound is interpretable. The finite-sample bound appropriately incorporates Wasserstein convergence rates and VC dimension complexity terms, revealing meaningful trade-offs between alignment quality and model capacity.
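For readers unfamiliar with the duality mentioned above, the standard Kantorovich-Rubinstein statement is reproduced here as background. It is the textbook identity, not the paper's Theorem 2.6 or 2.7; the symbols $P_T$ and $P_S$ (teacher-side and student-side feature distributions on the shared space) and the $L$-Lipschitz function $h$ are notation assumed for this illustration.

```latex
% Kantorovich-Rubinstein duality for the 1-Wasserstein distance
W_1(P_T, P_S)
  \;=\; \sup_{\|f\|_{\mathrm{Lip}} \le 1}
        \Big( \mathbb{E}_{z \sim P_T}[f(z)] - \mathbb{E}_{z \sim P_S}[f(z)] \Big)

% Consequence: for any L-Lipschitz h (apply the duality to h / L and to -h / L),
\Big| \mathbb{E}_{z \sim P_T}[h(z)] - \mathbb{E}_{z \sim P_S}[h(z)] \Big|
  \;\le\; L \, W_1(P_T, P_S)
```

The second line is the usual way such a duality enters a generalization bound: a gap between expectations of a Lipschitz quantity under two feature distributions is controlled by their Wasserstein distance, which needs no sample-level pairing.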
However, several concerns arise:
Algorithm Design: The bi-level optimization approach (inspired by MAML) is well-justified by the ablation showing that naive joint optimization of feature alignment (FA) and label alignment (LA) degrades performance (Table 6). The use of Sinkhorn-regularized optimal transport for FA is computationally practical. The selective distillation mechanism via κ is elegant: when teacher and student disagree, distillation is naturally downweighted, reducing negative transfer.
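To make the Sinkhorn-regularized optimal transport step concrete, the following is a minimal NumPy sketch of a log-domain Sinkhorn iteration between two unpaired feature batches. It is an illustration under stated assumptions, not the authors' implementation: the function name `sinkhorn_cost`, the squared Euclidean ground cost, the uniform sample weights, and the values of `eps` and `n_iters` are all choices made here for the example.

```python
import numpy as np


def _logsumexp(m, axis):
    """Numerically stable log-sum-exp along one axis."""
    mx = m.max(axis=axis, keepdims=True)
    return (mx + np.log(np.exp(m - mx).sum(axis=axis, keepdims=True))).squeeze(axis)


def sinkhorn_cost(x, y, eps=0.5, n_iters=200):
    """Entropy-regularized OT between two unpaired batches (hypothetical helper).

    x: (n, d) array of features from one modality.
    y: (m, d) array of features from the other modality.
    Returns the transport cost <plan, cost> and the (n, m) transport plan.
    """
    n, m = len(x), len(y)
    # Uniform weights on both batches (an assumption of this sketch).
    log_a = np.full(n, -np.log(n))
    log_b = np.full(m, -np.log(m))
    # Pairwise squared Euclidean ground cost (also an assumption).
    cost = ((x[:, None, :] - y[None, :, :]) ** 2).sum(axis=-1)

    # Dual potentials, updated alternately in the log domain for stability.
    f = np.zeros(n)
    g = np.zeros(m)
    for _ in range(n_iters):
        f = -eps * _logsumexp((g[None, :] - cost) / eps + log_b[None, :], axis=1)
        g = -eps * _logsumexp((f[:, None] - cost) / eps + log_a[:, None], axis=0)

    # Recover the plan: plan_ij = a_i * b_j * exp((f_i + g_j - cost_ij) / eps).
    log_plan = (f[:, None] + g[None, :] - cost) / eps + log_a[:, None] + log_b[None, :]
    plan = np.exp(log_plan)
    return float((plan * cost).sum()), plan
```

Because the two batches are only compared through the cost matrix, no row of `x` needs to correspond to any row of `y`, and the batch sizes may differ. In a training loop this quantity would be computed with an autodiff framework so that gradients flow back to the student encoder; the NumPy version only shows the iteration itself.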
The practical implications are significant. Many real-world multimodal scenarios involve independently collected data streams (e.g., medical imaging from one institution, clinical text from another). Removing the paired-data requirement substantially broadens CMKD's applicability. The framework's universality—working well in both paired and unpaired settings—adds practical value.
The theoretical decomposition (FA + LA) provides a useful conceptual framework that could influence how researchers think about cross-modal transfer more broadly, potentially extending to domain adaptation, federated learning across heterogeneous modalities, and cross-modal generative modeling (as the authors note).
This work is timely given the increasing interest in multimodal learning and the practical reality that perfectly aligned multimodal datasets are the exception rather than the rule. With foundation models increasingly operating across modalities, principled methods for cross-modal knowledge transfer without strict pairing requirements address a genuine bottleneck.
The paper is well-written with clear exposition of the theoretical framework and its algorithmic implications. The connection between theory and algorithm is tighter than in many KD papers. The code is publicly available, supporting reproducibility. The complexity analysis showing at most 3× overhead is reassuring for scalability.
One theoretical subtlety: the bound treats the teacher error as fixed overhead, but in cross-modal settings, the teacher's representation quality on the shared embedding space Z matters significantly and is not explicitly addressed.
Generated Jun 10, 2026
Paper 2 addresses a timely and broadly impactful question about AI agents' ability to synthesize scientific conclusions, introducing a large-scale benchmark (SciConBench) with a novel clean-room evaluation methodology. Its findings that frontier models achieve only 0.337 F1 and that data leakage inflates performance estimates have immediate implications for AI safety, healthcare, and policy. The audit of consumer-facing tools adds real-world relevance. While Paper 1 makes solid methodological contributions to cross-modal knowledge distillation, Paper 2's broader societal implications, timeliness given rapid AI agent deployment, and cross-disciplinary relevance give it higher potential impact.
INFRAMIND addresses a novel and practically critical gap—infrastructure-aware multi-agent orchestration—that no prior work has systematically tackled. Its combination of hierarchical constrained MDP with RL for jointly optimizing planning, routing, and scheduling under real-time infrastructure signals is highly innovative. The dramatic empirical gains (7x latency reduction, 99.9% SLO compliance vs <50% for baselines) suggest transformative practical impact for LLM deployment at scale. While Paper 2 makes solid theoretical contributions to cross-modal KD without paired data, infrastructure-aware orchestration is more timely given the explosive growth of multi-agent LLM systems and has broader cross-field applicability.
While Paper 1 presents strong theoretical advancements in fundamental AI methodology, Paper 2 addresses a highly critical, timely, and real-world problem at the intersection of AI and biosecurity. The introduction of a benchmark for dual-use biological capabilities of LLMs, combined with actual wet-lab validation, has immense implications for scientific policy, safety, and the future of automated biological research, giving it broader societal and interdisciplinary scientific impact.
Paper 1 tackles a fundamental bottleneck in multi-modal AI by removing the need for paired data during knowledge distillation. Its strong theoretical foundations, combined with practical algorithms, offer broad applicability across diverse modalities. While Paper 2 addresses a highly timely issue in AI safety, Paper 1's foundational contributions to representation learning and model efficiency give it a higher potential for widespread, lasting scientific and practical impact across multiple domains.
Paper 2 addresses a fundamental bottleneck in multimodal AI (the need for paired data) and provides a rigorous theoretical foundation for cross-modal distillation. Its framework is broadly applicable across various data modalities (vision, text, audio), promising wide-reaching methodological impact and high citation potential across multiple AI domains. While Paper 1 offers a highly valuable, specific application in oncology, Paper 2's foundational AI advancements will likely influence a broader range of scientific and engineering fields.
Paper 1 addresses a fundamental challenge in cross-modal knowledge distillation without paired data, providing both theoretical foundations and practical algorithms. Its theoretical contributions (distributional alignment framework with guarantees) offer broadly applicable insights across multimodal learning. Paper 2 introduces an interesting parametric memory framework for LLM agents, but is more incremental (combining LoRA with agent memory). Paper 1's theoretical rigor, broader applicability across modality pairs, and addressing a more fundamental limitation (removing paired data requirements) give it higher potential for lasting scientific impact across multiple research communities.
Paper 1 addresses a fundamental and practical challenge in cross-modal knowledge distillation without paired data, providing both theoretical foundations and a principled algorithmic framework with guarantees. It has broader real-world applicability across multimodal AI systems where paired data is scarce. Paper 2, while valuable as a benchmark for evaluating LLM combinatorial reasoning, has narrower impact primarily within the LLM evaluation community and will likely become outdated as models improve. Paper 1's theoretical contributions on distributional alignment are more enduring and broadly applicable across machine learning.
Paper 2 addresses a fundamental challenge in multimodal AI by eliminating the need for paired data, significantly broadening the applicability of cross-modal distillation. Its strong theoretical foundation and applicability across various modalities suggest a wider impact compared to Paper 1, which focuses on a specific, albeit timely, technical optimization for LLM unlearning. The ability to leverage unpaired data for multimodal training has profound implications for resource-efficient AI across diverse domains.
Paper 2 addresses a fundamental and widely applicable problem in machine learning—cross-modal knowledge distillation without paired data—with both theoretical foundations and extensive experimental validation. Its contributions (theoretical guarantees, principled framework, strong empirical results across benchmarks) are broadly applicable across many multimodal AI applications. Paper 1, while innovative in bridging robotics and foundation model safety, is more niche in scope, focusing on specific social deployment scenarios with a primarily conceptual/framework contribution rather than rigorous theoretical or large-scale empirical validation.
Paper 2 addresses a fundamental bottleneck in multimodal learning—the reliance on costly paired data. By providing a theoretical foundation and a principled framework for cross-modal distillation using unpaired data, it offers broader applicability across various domains and modalities. While Paper 1 provides valuable insights into LLM behavior, Paper 2's methodological innovation and theoretical guarantees for distribution alignment present a more foundational advancement with higher potential to influence future architecture designs and reduce data collection costs.