PromptHMR: Promptable Human Mesh Recovery

CVPR 2025

Paper
Code (soon)

Overview of PromptHMR. PromptHMR is a promptable human pose and shape (HPS) estimation method that processes images together with spatial or semantic prompts. It exploits "side information" that is readily available from vision-language models or user input to improve the accuracy and robustness of 3D HPS estimation. PromptHMR recovers human pose and shape from spatial prompts such as (a) face bounding boxes, (b) partial or complete person detection boxes, or (c) segmentation masks. It refines its predictions using semantic prompts such as (d) person-person interaction labels for close-contact scenarios, or (e) natural language descriptions of body shape. Both the image and video versions of PromptHMR achieve state-of-the-art accuracy.
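To make the prompt interface above concrete, the following is a minimal sketch of how the per-person spatial and semantic prompts could be represented. All class and field names here are illustrative assumptions, not the released PromptHMR API.

```python
# Illustrative sketch of the promptable interface: each person query carries
# one spatial prompt (face box, person box, or mask) and optional semantic
# prompts (interaction label, body-shape text). Names are hypothetical.
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class SpatialPrompt:
    # Exactly one of these would typically be set:
    # a face or person bounding box in (x1, y1, x2, y2) pixel coordinates,
    # or a segmentation mask (stored here as a run-length-encoded string).
    box_xyxy: Optional[Tuple[float, float, float, float]] = None
    mask_rle: Optional[str] = None


@dataclass
class SemanticPrompt:
    # Optional semantic cues: a person-person interaction label for
    # close-contact scenes, or a free-form body-shape description.
    interaction: Optional[str] = None
    shape_text: Optional[str] = None


@dataclass
class PersonQuery:
    spatial: SpatialPrompt
    semantic: Optional[SemanticPrompt] = None


# Example: a person detection box plus a body-shape description.
query = PersonQuery(
    spatial=SpatialPrompt(box_xyxy=(10.0, 20.0, 120.0, 300.0)),
    semantic=SemanticPrompt(shape_text="a tall, slim person"),
)
```

In this framing, the spatial prompt localizes *which* person to reconstruct, while the semantic prompt refines *how* the body is reconstructed.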


Approach

Method. PromptHMR estimates SMPL-X parameters for each person in an image from various types of prompts, such as boxes, language descriptions, and person-person interaction cues. Given an image and prompts, a vision transformer produces image embeddings, while mask and prompt encoders map the different prompt types to tokens. Optionally, camera intrinsics can be embedded along with the image embeddings. The image embeddings and prompt tokens are then fed to the SMPL-X decoder, a transformer-based module that attends to both the image and prompt tokens to estimate the SMPL-X parameters. The language and interaction prompts are optional, but providing them improves the accuracy of the estimated SMPL-X parameters.
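The decoder described above can be sketched as a standard transformer decoder whose memory is the concatenation of image-patch embeddings and prompt tokens. This is a minimal PyTorch sketch under stated assumptions: the dimensions, depth, query scheme, and output heads (6D rotation pose, shape betas, weak-perspective camera) are illustrative choices, not the authors' implementation.

```python
# Hedged sketch of a promptable SMPL-X decoder in the spirit described above.
# A learned query attends to image tokens plus optional prompt tokens, then
# linear heads regress SMPL-X pose, shape, and camera parameters.
import torch
import torch.nn as nn


class PromptableSMPLXDecoder(nn.Module):
    def __init__(self, dim=256, depth=2, heads=8, n_joints=55):
        super().__init__()
        layer = nn.TransformerDecoderLayer(
            d_model=dim, nhead=heads, dim_feedforward=4 * dim, batch_first=True
        )
        self.decoder = nn.TransformerDecoder(layer, num_layers=depth)
        # One learned query per person (assumption; other query schemes exist).
        self.query = nn.Parameter(torch.zeros(1, 1, dim))
        # Output heads: 6D rotation per joint, 10 shape betas, 3 camera params.
        self.pose_head = nn.Linear(dim, n_joints * 6)
        self.shape_head = nn.Linear(dim, 10)
        self.cam_head = nn.Linear(dim, 3)

    def forward(self, image_tokens, prompt_tokens=None):
        # Memory = image embeddings, optionally concatenated with prompt tokens,
        # so cross-attention sees both the image and the prompts.
        memory = image_tokens
        if prompt_tokens is not None:
            memory = torch.cat([image_tokens, prompt_tokens], dim=1)
        queries = self.query.expand(image_tokens.shape[0], -1, -1)
        feat = self.decoder(queries, memory)[:, 0]  # (batch, dim)
        return {
            "pose6d": self.pose_head(feat),
            "betas": self.shape_head(feat),
            "cam": self.cam_head(feat),
        }


# Toy run: batch of 2 people, 196 image tokens, 3 prompt tokens each.
decoder = PromptableSMPLXDecoder()
image_tokens = torch.randn(2, 196, 256)
prompt_tokens = torch.randn(2, 3, 256)
out = decoder(image_tokens, prompt_tokens)
```

Because the prompt tokens are simply appended to the memory, prompts remain optional: calling the decoder without them (`decoder(image_tokens)`) still yields valid predictions, matching the caption's note that semantic prompts refine rather than gate the estimate.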


Results

Multi-person video (rendered with Gloss)

Single-view multi-person

Two-person interaction


Acknowledgements. The authors would like to thank Yan Zhang, Yao Feng, and Nitin Saini for their suggestions. The majority of the work was done when Yufu was an intern at Meshcapade. Yufu and Kostas thank the support of NSF NCS-FO 2124355, NSF FRR 2220868, and NSF IIS-RI 2212433.

Disclosure. While MJB is a co-founder and Chief Scientist at Meshcapade, his research in this project was performed solely at, and funded solely by, the Max Planck Society.