Amir Mann, Gal Michael Harari, Merav Keidar, Or Litany
We introduce VideoMDM, a diffusion-based framework that trains 3D human motion priors directly from accurate 2D poses extracted from monocular videos, without any 3D ground truth. A pretrained 2D-to-3D lifter provides approximate 3D pose sequences that serve as a noisy teacher: these are diffused, denoised by the model in 3D, and supervised in 2D by reprojecting the prediction and comparing against accurate keypoints. We show that, under mild assumptions, a depth-weighted 2D reprojection loss is equivalent in expectation to direct 3D supervision, and we adapt standard 3D motion regularizers (velocity consistency and over-parameterized representation alignment) to this 2D setting. Unlike methods that lift 2D to 3D only at inference, VideoMDM learns a coherent 3D motion manifold during training. On HumanML3D it nearly closes the gap to fully 3D-supervised MDM (FID 0.88 vs 0.54). On the real video datasets Fit3D and NBA, the method learns to generate motions consistently preferred by humans, with strong quantitative results.
VideoMDM addresses a fundamental bottleneck in 3D human motion generation: the dependency on expensive, studio-captured motion capture (MoCap) data. The key insight is training a 3D-native diffusion model (building on MDM) using only 2D pose supervision extracted from monocular videos. Unlike prior work (MAS, Motion-2-to-3) that lifts 2D to 3D at inference time or requires 3D data for fine-tuning, VideoMDM learns a coherent 3D motion manifold during training itself.
The framework uses a "noisy teacher" paradigm borrowed from LIS (Lesson in Splats): a pretrained 2D-to-3D lifter provides approximate 3D poses that are diffused, denoised in 3D, and supervised via 2D reprojection against accurate 2D keypoints. The theoretical contribution is a proof that a depth-weighted 2D reprojection loss is equivalent in expectation to 3D MSE supervision under mild assumptions (uniform azimuth distribution, matched depth between prediction and ground truth). Additionally, the authors adapt standard 3D motion regularizers (velocity consistency, representation alignment) to operate under 2D-only supervision via ray-projection pseudo-targets.
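To make the depth-weighting idea concrete, here is a minimal NumPy sketch. It assumes a simple pinhole camera with a single focal length and weights each 2D residual by depth divided by focal length. The function names, the weighting form, and the toy data are illustrative assumptions, not the authors' implementation, and the diffusion and denoising steps are omitted. It only shows that, when predicted and true depths match, such a weighted 2D loss coincides with the squared 3D error in the image-plane directions.

```python
import numpy as np


def project(points_3d, focal):
    """Pinhole projection of camera-frame points (..., 3) to image coordinates (..., 2)."""
    return focal * points_3d[..., :2] / points_3d[..., 2:3]


def depth_weighted_reprojection_loss(pred_3d, keypoints_2d, depth, focal):
    """Mean squared 2D residual, with each joint scaled by depth / focal.

    The weighting is an assumed form chosen for illustration; `depth` would
    come from the noisy teacher (the 2D-to-3D lifter) in a real pipeline.
    """
    residual = project(pred_3d, focal) - keypoints_2d      # (..., J, 2)
    weight = depth / focal                                  # (..., J)
    return np.mean(np.sum((weight[..., None] * residual) ** 2, axis=-1))


# Toy data: 8 frames, 17 joints, depths kept safely in front of the camera.
rng = np.random.default_rng(0)
focal = 1000.0
gt_xy = rng.normal(size=(8, 17, 2))
gt_z = rng.uniform(3.0, 6.0, size=(8, 17, 1))
gt_3d = np.concatenate([gt_xy, gt_z], axis=-1)
keypoints_2d = project(gt_3d, focal)        # stands in for accurate 2D keypoints

# A prediction that is wrong only in the image-plane directions (matched depth).
pred_3d = gt_3d.copy()
pred_3d[..., :2] += 0.05 * rng.normal(size=(8, 17, 2))

loss_2d = depth_weighted_reprojection_loss(pred_3d, keypoints_2d, gt_3d[..., 2], focal)
lateral_3d = np.mean(np.sum((pred_3d[..., :2] - gt_3d[..., :2]) ** 2, axis=-1))
print(loss_2d, lateral_3d)
```

Under the matched-depth condition the two printed values agree up to floating-point error. Without the depth weight, joints far from the camera would contribute much smaller pixel residuals for the same 3D error.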
Theoretical grounding. The depth-weighted loss equivalence proof (Appendix A) is clean and mathematically sound. The assumptions—uniform azimuth sampling and matched depths—are reasonable for training scenarios but somewhat idealized. The depth assumption (predicted depth equals true depth) is particularly strong, though the empirical results suggest the approximation works well in practice.
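The role of the matched-depth assumption can be seen in a generic pinhole identity. This is a sketch under assumed notation (focal length f, true point x = (X, Y, Z), prediction x̂ = (X̂, Ŷ, Ẑ), weight w = Z/f); it is not the proof from Appendix A and does not cover the azimuth-averaging step.

```latex
\begin{align*}
\pi(X, Y, Z) &= \left( \frac{fX}{Z},\; \frac{fY}{Z} \right), \qquad w = \frac{Z}{f}, \\
w^2 \,\lVert \pi(\hat{\mathbf{x}}) - \pi(\mathbf{x}) \rVert^2
  &= \left( \hat{X}\,\frac{Z}{\hat{Z}} - X \right)^2
   + \left( \hat{Y}\,\frac{Z}{\hat{Z}} - Y \right)^2, \\
\hat{Z} = Z \;\Longrightarrow\;
w^2 \,\lVert \pi(\hat{\mathbf{x}}) - \pi(\mathbf{x}) \rVert^2
  &= (\hat{X} - X)^2 + (\hat{Y} - Y)^2 .
\end{align*}
```

When the predicted depth differs from the true depth, the factor Z/Ẑ rescales the prediction before comparison, so the weighted 2D loss no longer equals the image-plane 3D error. This is why the matched-depth assumption carries most of the weight in the argument.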
Experimental design. The three evaluation settings (HumanML3D, Fit3D, and NBA) are well chosen and complementary.
The ablation study (table in Figure 6b) systematically validates each component, with multistep denoising and representation alignment emerging as the most impactful.
Limitations in evaluation. The Fit3D dataset is acknowledged as too small for FID, so the authors resort to human preference studies (well designed, run via Prolific with proper blinding). However, the NBA evaluation raises some concerns: the VAE-based metrics appear unreliable, and VideoMDM underperforms MAS on FID and Diversity while excelling in Precision and Recall†, which leaves the picture somewhat ambiguous without the human study.
The PnP camera estimation gap is notable: with GT cameras, FID reaches 0.88 vs. 0.54 (3D-supervised MDM), but PnP cameras yield 1.46, suggesting the method's practical applicability hinges on camera estimation quality. This is honestly acknowledged.
Immediate impact. This work opens a practical pathway to training 3D motion generation models from the vast corpus of monocular internet video. This is significant because MoCap datasets (AMASS, HumanML3D) contain only ~14K sequences in controlled environments. The potential to scale training data by orders of magnitude could transform motion generation diversity and realism.
Broader applications. Animation, gaming, robotics simulation, and embodied AI could all benefit. The ability to learn domain-specific motion priors from video (e.g., fitness exercises in Fit3D, basketball in NBA) without MoCap infrastructure is practically valuable.
Downstream influence. The cross-modality diffusion training paradigm, adapted from 3D scene generation to articulated motion, could inspire similar approaches in other domains where 3D supervision is scarce but 2D observations are abundant (e.g., animal motion, hand manipulation).
The paper addresses a timely bottleneck. Motion generation has advanced rapidly with diffusion models, but data scarcity remains the primary limiting factor. The concurrent emergence of highly accurate 2D pose estimators (RTMPose, etc.) and strong 2D-to-3D lifters creates the right conditions for this approach. The connection to LIS demonstrates productive cross-pollination from 3D vision to motion generation. The work also arrives as the field pushes toward in-the-wild motion understanding (Ego-Exo4D, etc.).
Missing comparisons: No comparison with very recent joint generation-estimation methods (GenMo, LMM) that also attempt to bridge 2D and 3D.
VideoMDM makes a meaningful contribution by demonstrating that 3D motion diffusion models can be effectively trained with only 2D supervision, supported by sound theoretical analysis and thorough experiments across multiple settings. The gap to 3D-supervised performance is surprisingly small on controlled benchmarks. While practical deployment at scale faces camera estimation and occlusion challenges, this work establishes an important proof-of-concept that could catalyze a shift toward video-supervised motion generation.
Generated Jun 12, 2026
VideoMDM addresses a fundamental challenge in 3D human motion generation—eliminating the need for 3D ground truth supervision—which could broadly impact motion synthesis, animation, and embodied AI. The theoretical contribution (showing depth-weighted 2D reprojection loss equivalence to 3D supervision) is novel and generalizable. Paper 2, while practically useful, addresses a narrower optimization problem (NVFP4 quantization for reasoning models) that is more incremental and tied to specific hardware. VideoMDM's ability to learn from abundant monocular video data opens significantly wider research directions and real-world applications.
VideoMDM addresses a fundamental bottleneck in 3D human motion generation—the reliance on expensive 3D motion capture data—by introducing a principled framework for learning 3D motion priors from 2D video supervision. The theoretical contribution (showing depth-weighted 2D reprojection loss is equivalent to 3D supervision in expectation) is novel and rigorous, with strong quantitative results nearly matching fully-supervised methods. This opens practical applications in animation, robotics, and AR/VR by leveraging abundant monocular video. Paper 2, while addressing an important continual learning problem, offers more incremental insights with a bio-inspired replay approach that partially restores performance, representing less methodological novelty.
Paper 1 details an IEEE standard for ML arithmetic, offering foundational impact across all ML hardware accelerators and software frameworks. While Paper 2 presents a highly innovative and timely application of diffusion models for 3D motion generation, Paper 1's standardization of low-precision, exception-free arithmetic directly dictates the efficiency and design of future global ML infrastructure. This broad, foundational relevance provides it with a higher potential for widespread, long-lasting scientific and technological impact.
Paper 1 tackles a critical, foundational issue in mechanistic interpretability: whether circuit-finding methodologies generalize across LLM architectures. By demonstrating that different models implement identical tasks using completely different attention patterns, it challenges current paradigms and sets new rigorous standards for future research. This has profound implications for AI safety and deep learning theory. Paper 2, while offering an innovative practical solution for 3D motion generation using 2D supervision, is highly domain-specific to computer vision and graphics, giving Paper 1 a broader and more fundamental scientific impact.
Paper 1 (V-HMN) proposes a fundamentally new vision backbone architecture grounded in brain-inspired Hopfield memory networks, addressing core limitations of current foundation models (interpretability, data efficiency, biological plausibility). Its breadth of potential impact spans multiple modalities and fields, offering a generalizable blueprint. Paper 2 (VideoMDM) makes a solid contribution to 3D motion generation from 2D supervision, but addresses a narrower problem. V-HMN's novelty in bridging neuroscience-inspired computation with large-scale ML and its potential to influence future foundation model design gives it higher estimated impact.
Paper 2 addresses a fundamental question about the reliability and interpretability of sparse autoencoders, a widely-used tool in mechanistic interpretability research. Its findings—that stable features carry most functional signal while unstable features reflect basis ambiguity in reproducible subspaces—have broad implications for the rapidly growing field of AI interpretability. The methodological rigor (large-scale study across seeds, models, layers, plus synthetic verification) and practical contribution (constructing more stable SAEs) give it wide applicability. Paper 1, while solid, offers incremental improvements to motion generation with a narrower application domain.
Paper 2 (VideoMDM) likely has higher impact: it tackles a broadly important, timely problem (3D human motion generation) and removes the need for costly 3D ground truth by leveraging ubiquitous 2D video supervision. The diffusion-based formulation plus an expectation-equivalence result for depth-weighted reprojection provides a clear conceptual contribution with strong real-world applicability in vision/graphics/AR. It is evaluated on multiple large benchmarks and real-video datasets with competitive metrics and human preference studies. Paper 1 is technically solid but more niche to PDE surrogate modeling and incremental over existing neural-operator hybrids.
VideoMDM addresses a fundamental bottleneck in 3D human motion generation—the dependency on expensive 3D motion capture data—by learning from ubiquitous 2D video supervision. The theoretical contribution (proving depth-weighted 2D loss equivalence to 3D supervision) and the practical implications (unlocking vast video datasets for 3D motion learning) give it broader impact across computer vision, graphics, robotics, and animation. Paper 1 offers a solid contribution to multimodal learning with missing modalities but addresses a more incremental, niche problem in bioscience. VideoMDM's paradigm shift in supervision has wider cross-field applicability.
Paper 1 introduces a fundamentally novel paradigm—generative flow matching for whole-cortex fMRI dynamics conditioned on compositional language, enabling zero-shot generation for unseen cognitive tasks. This opens a new direction ('counterfactual neuroscience') with broad implications for experimental design in neuroscience. Its cross-disciplinary impact (generative AI + neuroscience), methodological novelty (in-context priors for neural time series), and potential to transform how cognitive experiments are designed give it higher impact potential. Paper 2, while technically solid, is an incremental improvement in 3D motion generation with a narrower scope focused on reducing supervision requirements.
Paper 1 addresses a critical data bottleneck in 3D generation by enabling training on abundant 2D video data without requiring 3D ground truth. Its novel approach of using a depth-weighted 2D reprojection loss with diffusion models has massive implications for AR/VR, gaming, and biomechanics. While Paper 2 offers a valuable architectural improvement for PDE solvers, Paper 1 represents a more fundamental methodological leap with broader, more immediate commercial and interdisciplinary applications in the highly active field of generative AI.