TRAM: Global Trajectory and Motion of 3D Humans from in-the-wild Videos


Abstract

We propose TRAM, a two-stage method to reconstruct a human’s global trajectory and motion from in-the-wild videos. TRAM robustifies SLAM to recover the camera motion in the presence of dynamic humans and uses the scene background to derive the motion scale. Using the recovered camera as a metric-scale reference frame, we introduce a video transformer model (VIMO) to regress the kinematic body motion of a human. By composing the two motions, we achieve accurate recovery of 3D humans in the world space, reducing global motion errors by 60% from prior work.


Approach

Overview of TRAM. Top-left: given a video, we first recover the relative camera motion and scene depth with DROID-SLAM, which we robustify with dual masking (Sec 3.2). Top-right: we align the recovered depth to metric depth predictions with an optimization procedure to estimate the metric scale (Sec 3.3). Bottom: we introduce VIMO to reconstruct the 3D human in camera coordinates (Sec 3.4), and use the metric-scale camera to convert the human trajectory and body motion to global coordinates.
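The last two steps of the overview can be illustrated with a minimal sketch. It is not the TRAM implementation: the function names are hypothetical, a median of per-pixel depth ratios stands in for the optimization procedure of Sec 3.3, and the SLAM pose is assumed to be a camera-to-world rotation with an up-to-scale translation. Plain Python lists are used so the example has no dependencies.

```python
from statistics import median


def estimate_metric_scale(slam_depth, metric_depth, background_mask):
    """Return a scalar s such that s * slam_depth roughly matches metric_depth.

    Assumption: the median ratio over background pixels is a simple stand-in
    for the optimization described in Sec 3.3, not the method itself.
    All three inputs are flat sequences of equal length.
    """
    ratios = [
        m / d
        for d, m, keep in zip(slam_depth, metric_depth, background_mask)
        if keep and d > 0 and m > 0  # skip masked-out (human) and invalid pixels
    ]
    if not ratios:
        raise ValueError("no valid background pixels to estimate the scale")
    return median(ratios)


def mat_mul(a, b):
    """Multiply two 3x3 matrices given as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]


def mat_vec(a, v):
    """Multiply a 3x3 matrix by a 3-vector."""
    return [sum(a[i][j] * v[j] for j in range(3)) for i in range(3)]


def human_to_world(r_wc, t_wc, scale, r_ch, t_ch):
    """Compose the camera pose with the human pose expressed in the camera frame.

    r_wc, t_wc: assumed camera-to-world rotation and up-to-scale translation.
    scale:      metric scale applied to the camera translation.
    r_ch, t_ch: human root orientation and metric translation in the camera frame.
    Returns the human root orientation and position in the world frame.
    """
    cam_pos = [scale * x for x in t_wc]        # camera position in metres
    r_wh = mat_mul(r_wc, r_ch)                 # world-frame root orientation
    t_wh = [r + c for r, c in zip(mat_vec(r_wc, t_ch), cam_pos)]
    return r_wh, t_wh
```

As a usage example, with SLAM depths `[1.0, 2.0, 4.0, 1.0]`, metric depths `[2.0, 4.0, 8.0, 50.0]`, and the last pixel masked out as belonging to the human, the estimated scale is 2.0. Applying that scale to a camera translated by one SLAM unit along x places the camera 2 m from the origin before the human's camera-frame offset is rotated into the world frame and added.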


Results



Human world trajectory.


Camera world trajectory.


Acknowledgements

We appreciate the helpful suggestions from Jiahui Lei, Anthony Bisulco and Soyong Shin. This project is supported by the NSF grants: NSF NCS-FO 2124355, NSF FRR 2220868, and NSF IIS-RI 2212433.