VideoMDM: Towards 3D Human Motion Generation From 2D Supervision

Amir Mann, Gal Michael Harari, Merav Keidar, Or Litany

Jun 11, 2026 · arXiv:2606.13364v1

cs.LG · cs.CV

#2529 of 5669 · cs.LG

Tournament Score: 1415±48 (scale 1050 to 1750)

Win Rate: 54%

Rating: 7/10

Significance 7.5 · Rigor 7 · Novelty 7.5 · Clarity 7.5

Abstract

We introduce VideoMDM, a diffusion-based framework that trains 3D human motion priors directly from accurate 2D poses extracted from monocular videos, without any 3D ground truth. A pretrained 2D-to-3D lifter provides approximate 3D pose sequences that serve as a noisy teacher: these are diffused, denoised by the model in 3D, and supervised in 2D by reprojecting the prediction and comparing against accurate keypoints. We show that, under mild assumptions, a depth-weighted 2D reprojection loss is equivalent in expectation to direct 3D supervision, and we adapt standard 3D motion regularizers (velocity consistency and over-parameterized representation alignment) to this 2D setting. Unlike methods that lift 2D to 3D only at inference, VideoMDM learns a coherent 3D motion manifold during training. On HumanML3D it nearly closes the gap to fully 3D-supervised MDM (FID 0.88 vs. 0.54); on the real video datasets Fit3D and NBA, the method learns to generate motions consistently preferred by humans, with strong quantitative results.

AI Impact Assessments

(1 model)

Scientific Impact Assessment: VideoMDM

1. Core Contribution

VideoMDM addresses a fundamental bottleneck in 3D human motion generation: the dependency on expensive, studio-captured motion capture (MoCap) data. The key insight is training a 3D-native diffusion model (building on MDM) using only 2D pose supervision extracted from monocular videos. Unlike prior work (MAS, Motion-2-to-3) that lifts 2D to 3D at inference time or requires 3D data for fine-tuning, VideoMDM learns a coherent 3D motion manifold during training itself.

The framework uses a "noisy teacher" paradigm borrowed from LIS (Lesson in Splats): a pretrained 2D-to-3D lifter provides approximate 3D poses that are diffused, denoised in 3D, and supervised via 2D reprojection against accurate 2D keypoints. The theoretical contribution is a proof that a depth-weighted 2D reprojection loss is equivalent in expectation to 3D MSE supervision under mild assumptions (uniform azimuth distribution, matched depth between prediction and ground truth). Additionally, the authors adapt standard 3D motion regularizers (velocity consistency, representation alignment) to operate under 2D-only supervision via ray-projection pseudo-targets.
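The noisy-teacher step described above can be sketched as follows. This is a minimal single-frame illustration under stated assumptions, not the authors' implementation: the function names (`project`, `forward_diffuse`, `depth_weighted_reprojection_loss`, `training_step_loss`), the pinhole camera with a single focal length, the depth/focal weighting, and the placeholder `denoiser` callable are all hypothetical, since this assessment does not reproduce the paper's exact formulas.

```python
import numpy as np


def project(points_cam, focal):
    """Pinhole projection of (J, 3) camera-frame joints to (J, 2) image coordinates."""
    return focal * points_cam[:, :2] / points_cam[:, 2:3]


def forward_diffuse(x0, alpha_bar, rng):
    """Standard DDPM forward step: x_t = sqrt(alpha_bar) * x0 + sqrt(1 - alpha_bar) * eps."""
    eps = rng.standard_normal(x0.shape)
    return np.sqrt(alpha_bar) * x0 + np.sqrt(1.0 - alpha_bar) * eps


def depth_weighted_reprojection_loss(pred_cam, kp_2d, focal):
    """Assumed weighting: scale pixel residuals by depth / focal to get metric units."""
    residual = project(pred_cam, focal) - kp_2d   # 2D error in pixels
    weight = pred_cam[:, 2:3] / focal             # per-joint depth weight
    return float(np.mean(np.sum((weight * residual) ** 2, axis=1)))


def training_step_loss(teacher_3d, kp_2d, denoiser, focal, alpha_bar, rng):
    """Diffuse the lifter's 3D pose, denoise in 3D, supervise in 2D only."""
    x_t = forward_diffuse(teacher_3d, alpha_bar, rng)   # noisy teacher -> noised sample
    pred_3d = denoiser(x_t)                             # model predicts clean 3D pose
    return depth_weighted_reprojection_loss(pred_3d, kp_2d, focal)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    focal = 1000.0
    true_3d = np.array([[0.1, 0.2, 4.0], [-0.3, 0.5, 4.2]])    # toy ground truth
    kp_2d = project(true_3d, focal)                            # "accurate" 2D keypoints
    teacher_3d = true_3d + 0.05 * rng.standard_normal((2, 3))  # imperfect lifter output

    # Hypothetical denoisers: an oracle, and an identity map that does no denoising.
    oracle_loss = training_step_loss(teacher_3d, kp_2d, lambda x: true_3d, focal, 0.9, rng)
    identity_loss = training_step_loss(teacher_3d, kp_2d, lambda x: x, focal, 0.9, rng)
    print(oracle_loss, identity_loss)
```

Note that the teacher's 3D pose never enters the loss directly; it only determines the noised input. That is why, in this sketch, errors in the lifter do not become supervision targets.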

2. Methodological Rigor

Theoretical grounding. The depth-weighted loss equivalence proof (Appendix A) is clean and mathematically sound. The assumptions—uniform azimuth sampling and matched depths—are reasonable for training scenarios but somewhat idealized. The depth assumption (predicted depth equals true depth) is particularly strong, though the empirical results suggest the approximation works well in practice.
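To make the two assumptions concrete, here is a toy numerical check. It is a sketch under my own simplifications and does not reproduce the proof in Appendix A: a single joint, a camera orbiting the vertical axis at a fixed distance, azimuths sampled on an even grid, and the matched-depth assumption implemented by projecting both points with the ground-truth depth. All names and constants are hypothetical.

```python
import numpy as np


def rot_y(theta):
    """Rotation about the vertical (y) axis by azimuth theta."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]])


def weighted_loss_one_view(p_pred, p_gt, theta, focal=1000.0, cam_dist=5.0):
    """Depth-weighted squared 2D error for one azimuth, with matched depth."""
    rotation = rot_y(theta)
    offset = np.array([0.0, 0.0, cam_dist])
    c_pred = rotation @ p_pred + offset
    c_gt = rotation @ p_gt + offset
    z = c_gt[2]                          # matched-depth assumption: shared depth
    u_pred = focal * c_pred[:2] / z
    u_gt = focal * c_gt[:2] / z
    r = (z / focal) * (u_pred - u_gt)    # pixel residual rescaled to metric units
    return float(r @ r)


def expected_weighted_loss(p_pred, p_gt, n_views=360):
    """Average over an even grid of azimuths, standing in for a uniform distribution."""
    thetas = np.linspace(0.0, 2.0 * np.pi, n_views, endpoint=False)
    return float(np.mean([weighted_loss_one_view(p_pred, p_gt, t) for t in thetas]))


p_gt = np.array([0.3, 1.2, -0.4])
p_pred = p_gt + np.array([0.10, -0.05, 0.20])
e = p_pred - p_gt

# Closed form for this toy setup: vertical error counted fully,
# the two horizontal components at one half each.
closed_form = e[1] ** 2 + 0.5 * (e[0] ** 2 + e[2] ** 2)
print(expected_weighted_loss(p_pred, p_gt), closed_form)
```

In this toy setup the expected depth-weighted 2D loss equals a fixed per-axis weighting of the squared 3D error, so it is minimized exactly when the 3D error is zero. The paper's precise statement and constants may differ. The sketch also shows why the matched-depth assumption is the stronger of the two: without it, the depth component of the error would change the projection itself instead of dropping out.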

Experimental design. The three evaluation settings are well-chosen and complementary:

HumanML3D (synthetic 2D) isolates supervision regime from pose-estimation noise

Fit3D (real video) tests out-of-distribution generalization with real-world data

NBA (centered, unconditional) provides head-to-head comparison with MAS under favorable conditions for the baseline

The ablation study (table in Figure 6b) systematically validates each component, with multistep denoising and representation alignment emerging as the most impactful.

Limitations in evaluation. The Fit3D dataset is acknowledged as too small for FID, so the authors resort to human preference studies (well-designed via Prolific with proper blinding). However, the NBA evaluation raises some concerns: the VAE-based metrics appear unreliable, and VideoMDM underperforms MAS on FID and Diversity while excelling in Precision and Recall†, which leaves the picture somewhat ambiguous without the human study.

The PnP camera estimation gap is notable: with GT cameras, FID reaches 0.88 vs. 0.54 (3D-supervised MDM), but PnP cameras yield 1.46, suggesting the method's practical applicability hinges on camera estimation quality. This is honestly acknowledged.

3. Potential Impact

Immediate impact. This work opens a practical pathway to training 3D motion generation models from the vast corpus of monocular internet video. This is significant because MoCap datasets (AMASS, HumanML3D) contain only ~14K sequences in controlled environments. The potential to scale training data by orders of magnitude could transform motion generation diversity and realism.

Broader applications. Animation, gaming, robotics simulation, and embodied AI could all benefit. The ability to learn domain-specific motion priors from video (e.g., fitness exercises in Fit3D, basketball in NBA) without MoCap infrastructure is practically valuable.

Downstream influence. The cross-modality diffusion training paradigm, adapted from 3D scene generation to articulated motion, could inspire similar approaches in other domains where 3D supervision is scarce but 2D observations are abundant (e.g., animal motion, hand manipulation).

4. Timeliness & Relevance

The paper addresses a timely bottleneck. Motion generation has advanced rapidly with diffusion models, but data scarcity remains the primary limiting factor. The concurrent emergence of highly accurate 2D pose estimators (RTMPose, etc.) and strong 2D-to-3D lifters creates the right conditions for this approach. The connection to LIS demonstrates productive cross-pollination from 3D vision to motion generation. The work also arrives as the field pushes toward in-the-wild motion understanding (Ego-Exo4D, etc.).

5. Strengths & Limitations

Key Strengths:

Principled formulation: The depth-weighted loss equivalence provides theoretical backing rather than just heuristic design

Practical impact: Genuinely enables a new training paradigm that could scale with internet video

Strong empirical validation: FID 0.88 vs. 0.54 (3D-supervised) on HumanML3D is impressive; factor-of-2 MPJPE improvement over WHAM on Fit3D demonstrates the prior generalizes beyond the teacher

Honest evaluation: The authors transparently report PnP degradation and conduct human studies where automated metrics are unreliable

Complete ablation: Each component is justified

Notable Weaknesses:

Camera dependency: The reliance on known or estimated camera parameters is the most significant practical limitation. Most internet videos lack calibration, and PnP estimation from noisy lifter outputs compounds errors

Lifter dependency: While the model generalizes beyond its teacher, a reasonable lifter must exist. The "noisy teacher" paradigm means initial 3D quality still matters

Scale not yet demonstrated: Despite the motivation of leveraging internet-scale video, experiments use relatively small datasets (611 Fit3D sequences, NBA dataset). A truly large-scale demonstration would be more convincing

Occlusion handling: All evaluated settings have minimal occlusion, which is unrealistic for in-the-wild deployment

Assumption limitations: The depth-matching assumption in the loss equivalence proof may not hold well when lifter quality is poor

Missing comparisons: No comparison with very recent joint generation-estimation methods (GenMo, LMM) that also attempt to bridge 2D and 3D.

Overall Assessment

VideoMDM makes a meaningful contribution by demonstrating that 3D motion diffusion models can be effectively trained with only 2D supervision, supported by sound theoretical analysis and thorough experiments across multiple settings. The gap to 3D-supervised performance is surprisingly small on controlled benchmarks. While practical deployment at scale faces camera estimation and occlusion challenges, this work establishes an important proof-of-concept that could catalyze a shift toward video-supervised motion generation.

Rating: 7/10

Significance 7.5 · Rigor 7 · Novelty 7.5 · Clarity 7.5

Generated Jun 12, 2026

Comparison History (13)

Won vs. ReSET: Accurate Latency-Critical NVFP4 Reasoning via Step-Aware Temperature Scaling

VideoMDM addresses a fundamental challenge in 3D human motion generation—eliminating the need for 3D ground truth supervision—which could broadly impact motion synthesis, animation, and embodied AI. The theoretical contribution (showing depth-weighted 2D reprojection loss equivalence to 3D supervision) is novel and generalizable. Paper 2, while practically useful, addresses a narrower optimization problem (NVFP4 quantization for reasoning models) that is more incremental and tied to specific hardware. VideoMDM's ability to learn from abundant monocular video data opens significantly wider research directions and real-world applications.

claude-opus-4-6·Jun 12, 2026

Won vs. Not Just After One: Sleep-Inspired Replay Prevents Catastrophic Forgetting After Sequential Tasks

VideoMDM addresses a fundamental bottleneck in 3D human motion generation—the reliance on expensive 3D motion capture data—by introducing a principled framework for learning 3D motion priors from 2D video supervision. The theoretical contribution (showing depth-weighted 2D reprojection loss is equivalent to 3D supervision in expectation) is novel and rigorous, with strong quantitative results nearly matching fully-supervised methods. This opens practical applications in animation, robotics, and AR/VR by leveraging abundant monocular video. Paper 2, while addressing an important continual learning problem, offers more incremental insights with a bio-inspired replay approach that partially restores performance, representing less methodological novelty.

claude-opus-4-6·Jun 12, 2026

Lost vs. Novel Aspects of IEEE SA P3109 Arithmetic Formats for Machine Learning

Paper 1 details an IEEE standard for ML arithmetic, offering foundational impact across all ML hardware accelerators and software frameworks. While Paper 2 presents a highly innovative and timely application of diffusion models for 3D motion generation, Paper 1's standardization of low-precision, exception-free arithmetic directly dictates the efficiency and design of future global ML infrastructure. This broad, foundational relevance provides it with a higher potential for widespread, long-lasting scientific and technological impact.

gemini-3.1-pro-preview·Jun 12, 2026

Lost vs. Pattern Selectivity is Not Task-Causal Structure: A Cross-Architecture Mechanistic Study of Composed-Task Circuits in 1B-Class Language Models

Paper 1 tackles a critical, foundational issue in mechanistic interpretability: whether circuit-finding methodologies generalize across LLM architectures. By demonstrating that different models implement identical tasks using completely different attention patterns, it challenges current paradigms and sets new rigorous standards for future research. This has profound implications for AI safety and deep learning theory. Paper 2, while offering an innovative practical solution for 3D motion generation using 2D supervision, is highly domain-specific to computer vision and graphics, giving Paper 1 a broader and more fundamental scientific impact.

gemini-3.1-pro-preview·Jun 12, 2026

Lost vs. Vision Hopfield Memory Networks

Paper 1 (V-HMN) proposes a fundamentally new vision backbone architecture grounded in brain-inspired Hopfield memory networks, addressing core limitations of current foundation models (interpretability, data efficiency, biological plausibility). Its breadth of potential impact spans multiple modalities and fields, offering a generalizable blueprint. Paper 2 (VideoMDM) makes a solid contribution to 3D motion generation from 2D supervision, but addresses a narrower problem. V-HMN's novelty in bridging neuroscience-inspired computation with large-scale ML and its potential to influence future foundation model design gives it higher estimated impact.

claude-opus-4-6·Jun 12, 2026

Lost vs. Unstable Features, Reproducible Subspaces: Understanding Seed Dependence in Sparse Autoencoders

Paper 2 addresses a fundamental question about the reliability and interpretability of sparse autoencoders, a widely-used tool in mechanistic interpretability research. Its findings—that stable features carry most functional signal while unstable features reflect basis ambiguity in reproducible subspaces—have broad implications for the rapidly growing field of AI interpretability. The methodological rigor (large-scale study across seeds, models, layers, plus synthetic verification) and practical contribution (constructing more stable SAEs) give it wide applicability. Paper 1, while solid, offers incremental improvements to motion generation with a narrower application domain.

claude-opus-4-6·Jun 12, 2026

Won vs. HAMNO: A Hierarchical Adaptive Multi-scale Neural Operator with Physics-Informed Learning for Dynamical Systems

Paper 2 (VideoMDM) likely has higher impact: it tackles a broadly important, timely problem (3D human motion generation) and removes the need for costly 3D ground truth by leveraging ubiquitous 2D video supervision. The diffusion-based formulation plus an expectation-equivalence result for depth-weighted reprojection provides a clear conceptual contribution with strong real-world applicability in vision/graphics/AR. It is evaluated on multiple large benchmarks and real-video datasets with competitive metrics and human preference studies. Paper 1 is technically solid but more niche to PDE surrogate modeling and incremental over existing neural-operator hybrids.

gpt-5.2·Jun 12, 2026

Won vs. Latent World Recovery for Multimodal Learning with Missing Modalities

VideoMDM addresses a fundamental bottleneck in 3D human motion generation—the dependency on expensive 3D motion capture data—by learning from ubiquitous 2D video supervision. The theoretical contribution (proving depth-weighted 2D loss equivalence to 3D supervision) and the practical implications (unlocking vast video datasets for 3D motion learning) give it broader impact across computer vision, graphics, robotics, and animation. Paper 1 offers a solid contribution to multimodal learning with missing modalities but addresses a more incremental, niche problem in bioscience. VideoMDM's paradigm shift in supervision has wider cross-field applicability.

claude-opus-4-6·Jun 12, 2026

Lost vs. Flow Matching with In-Context Priors for Out-of-Distribution Brain Dynamics

Paper 1 introduces a fundamentally novel paradigm—generative flow matching for whole-cortex fMRI dynamics conditioned on compositional language, enabling zero-shot generation for unseen cognitive tasks. This opens a new direction ('counterfactual neuroscience') with broad implications for experimental design in neuroscience. Its cross-disciplinary impact (generative AI + neuroscience), methodological novelty (in-context priors for neural time series), and potential to transform how cognitive experiments are designed give it higher impact potential. Paper 2, while technically solid, is an incremental improvement in 3D motion generation with a narrower scope focused on reducing supervision requirements.

claude-opus-4-6·Jun 12, 2026

Won vs. How Much Memory Do We Need? Adaptive Memory Gate for Neural Operators

Paper 1 addresses a critical data bottleneck in 3D generation by enabling training on abundant 2D video data without requiring 3D ground truth. Its novel approach of using a depth-weighted 2D reprojection loss with diffusion models has massive implications for AR/VR, gaming, and biomechanics. While Paper 2 offers a valuable architectural improvement for PDE solvers, Paper 1 represents a more fundamental methodological leap with broader, more immediate commercial and interdisciplinary applications in the highly active field of generative AI.

gemini-3.1-pro-preview·Jun 12, 2026
