DuoMo: Dual Motion Diffusion for World-Space Human Reconstruction

1Meta Reality Lab, 2University of Pennsylvania, 3Carnegie Mellon University

Abstract

We present DuoMo, a generative method that recovers human motion in world-space coordinates from unconstrained videos with noisy or incomplete observations. Reconstructing such motion requires solving a fundamental trade-off: generalizing from diverse and noisy video inputs while maintaining global motion consistency. Our approach addresses this problem by factorizing motion learning into two diffusion models. The camera-space model first estimates motion from videos in camera coordinates. The world-space model then lifts this initial estimate into world coordinates and refines it to be globally consistent. Together, the two models can reconstruct motion across diverse scenes and trajectories, even from highly noisy or incomplete observations. Moreover, our formulation is general, generating the motion of mesh vertices directly (bypassing parametric models). DuoMo achieves state-of-the-art performance. On EMDB, our method obtains a 16% reduction in world-space MPJPE error while maintaining low foot skating. On RICH, it obtains a 30% reduction in world-space MPJPE error.


Camera-Space Model

Predictions from the camera-space motion model can be lifted into world coordinates using camera motion. But the result is jittery due to noisy input, and when the human is out of frame, the predictions become inconsistent in world space.

Camera+World-Space Models

The world-space model performs global refinement. It takes the noisy predictions from the first stage as input, and produces world-consistent motion. Even when the subject is occluded, the model generates realistic in-between motion.

Approach


Method overview. (A) In the first stage, our camera-space model encodes video features and generates camera-space human motion. This motion is lifted to world coordinates using estimated camera poses, becoming the initial proposal for world-space human motion. Some predictions are missing because the subject is out of frame. In the second stage, the world-space model encodes the noisy world-space motion and generates globally consistent world-space motion. Plots at the bottom visualize the pelvis depth in world coordinates. (B) Camera-space model architecture. (C) World-space model architecture.


Additional Details

Top left figure

Sparse mesh representation

Our method generates sparse mesh vertices for the entire motion sequence. The sparse mesh adopts the 595-vertex LOD6 representation from Meta's Momentum Human Rig. SMPL and SMPL-X meshes can be downsampled to the same representation, allowing us to use common datasets such as BEDLAM for training.
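The downsampling step above can be sketched as a linear map from dense to sparse vertices. This is a minimal illustration, not the paper's implementation: the regressor matrix would normally be precomputed from correspondences between the SMPL topology and the 595-vertex sparse mesh, and here it is replaced by a random row-stochastic stand-in.

```python
import numpy as np

def downsample_vertices(dense_verts, regressor):
    """Map dense mesh vertices (V_dense, 3) to sparse vertices (V_sparse, 3)."""
    return regressor @ dense_verts

rng = np.random.default_rng(0)
# Hypothetical regressor: 595 sparse vertices from 6890 SMPL vertices.
# In practice this would come from barycentric correspondences; here each
# sparse vertex is an arbitrary convex combination of dense vertices.
R = rng.random((595, 6890))
R /= R.sum(axis=1, keepdims=True)

dense = rng.standard_normal((6890, 3))   # stand-in for posed SMPL vertices
sparse = downsample_vertices(dense, R)   # shape (595, 3)
```

Because the map is linear and shared across frames, the same regressor can be applied to every frame of a motion sequence.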

Top left figure

Dense keypoint detection

We trained a dense body keypoint detection model to infer 595 body surface keypoints, semantically corresponding to the 595 vertices of the sparse mesh. We then convert them to ray directions using camera intrinsics and feed them as input to the camera-space motion diffusion model.
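The pixel-to-ray conversion is standard back-projection with the intrinsic matrix. A minimal sketch, assuming a pinhole camera with intrinsics K (the function name and example values are illustrative):

```python
import numpy as np

def keypoints_to_rays(kps_px, K):
    """Convert (N, 2) pixel keypoints to (N, 3) unit ray directions.

    Each pixel (u, v) is lifted to homogeneous coordinates and back-projected
    through the inverse intrinsics, then normalized to unit length.
    """
    ones = np.ones((kps_px.shape[0], 1))
    homog = np.concatenate([kps_px, ones], axis=1)   # (N, 3): [u, v, 1]
    rays = homog @ np.linalg.inv(K).T                # back-project to camera frame
    return rays / np.linalg.norm(rays, axis=1, keepdims=True)

# Example intrinsics: focal length 1000 px, principal point (640, 360).
K = np.array([[1000.0, 0.0, 640.0],
              [0.0, 1000.0, 360.0],
              [0.0, 0.0, 1.0]])
kps = np.array([[640.0, 360.0],   # at the principal point -> ray (0, 0, 1)
                [900.0, 500.0]])
rays = keypoints_to_rays(kps, K)
```

Representing keypoints as rays rather than pixels makes the conditioning independent of image resolution and intrinsics.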

Top right figure

Camera-space height conditioning

Height remains a major source of ambiguity in camera-space reconstruction. We trained our camera-space diffusion model to accept human body height as an optional condition. At inference, if this information is available, as in many XR applications, the model can generate more metrically accurate motion.
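One common way to make a condition optional in a diffusion model is to randomly replace it with a learned null token during training, in the spirit of classifier-free guidance. The sketch below is an assumption about how such conditioning could be wired, not the paper's implementation; all names and dimensions are illustrative.

```python
import numpy as np

DIM = 16
null_token = np.zeros(DIM)   # learned embedding in practice; fixed zeros here
proj = np.random.default_rng(1).standard_normal((1, DIM))  # height -> embedding

def height_condition(height_m, drop_prob=0.0, rng=None):
    """Return a conditioning vector for the diffusion model.

    If height is unavailable (None) or randomly dropped during training,
    fall back to the null token so the model also learns the unconditioned case.
    """
    rng = rng or np.random.default_rng()
    if height_m is None or rng.random() < drop_prob:
        return null_token
    return np.array([height_m]) @ proj   # shape (DIM,)

cond = height_condition(1.75)       # height known: conditioned embedding
uncond = height_condition(None)     # height unknown: null token
```

At inference, the same model then handles both cases: pass the height embedding when available, the null token otherwise.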

Bottom left figure

World-space guided generation

Our world-space diffusion model generates root velocity instead of absolute root position. Integrating velocity accumulates drift when the human is out of frame. We employ test-time guidance to steer the velocity generation to match the observations and constraints.
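The guidance idea can be illustrated with a single correction step on the velocities: integrate them into positions, measure the error against sparse observations (frames where the subject is visible), and backpropagate that error through the cumulative sum. This is a hedged sketch of the general technique, not the paper's guidance terms; the step size and masking scheme are assumptions.

```python
import numpy as np

def integrate(p0, vels):
    """Integrate per-frame velocities (T, 3) into positions from start p0."""
    return p0 + np.cumsum(vels, axis=0)

def guide_velocities(vels, p0, obs_pos, obs_mask, step=0.5):
    """One guidance step pulling the integrated trajectory toward observations.

    The position at frame t depends on all velocities up to t, so the
    gradient of the position error w.r.t. velocities is a reverse cumsum.
    """
    pos = integrate(p0, vels)
    residual = np.where(obs_mask[:, None], obs_pos - pos, 0.0)
    grad = np.cumsum(residual[::-1], axis=0)[::-1]
    return vels + step * grad / len(vels)

T = 5
vels = np.zeros((T, 3))                              # initial: no motion
obs = np.tile(np.array([1.0, 0.0, 0.0]), (T, 1))     # observed root at x = 1
mask = np.array([True, False, False, False, True])   # visible only at the ends
guided = guide_velocities(vels, np.zeros(3), obs, mask)
```

Frames with no observation contribute no residual, so the correction interpolates motion through occluded spans instead of snapping each frame independently.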

BibTeX


      @article{wang2026duomo,
        title   = {DuoMo: Dual Motion Diffusion for World-Space Human Reconstruction},
        author  = {Wang, Yufu and Ng, Evonne and Shin, Soyong and Khirodkar, Rawal and Dong, Yuan and Su, Zhaoen and Park, Jinhyung and Kitani, Kris and Richard, Alexander and Prada, Fabian and Zollhofer, Michael},
        journal = {arXiv preprint arXiv:2603.03265},
        year    = {2026},
      }