Masked Modeling for Human Motion Recovery Under Occlusions

1ETH Zürich, 2Meta Reality Labs

TL;DR: Given a monocular video captured from a static camera, MoRo robustly reconstructs accurate and physically plausible human motion, even under challenging occlusion scenarios.

Left: Our predicted 3D human motion overlaid on the input video.
Right: Reconstructed 3D human motion in global space, compared with ground truth and baselines.

Abstract

Human motion reconstruction from monocular videos is a fundamental problem in computer vision, with broad applications in AR/VR, robotics, and digital content creation, yet it remains difficult under the frequent occlusions of real-world settings. Existing regression-based methods are efficient but fragile to missing observations, while optimization- and diffusion-based approaches improve robustness at the cost of slow inference and heavy preprocessing.

To address these limitations, we leverage recent advances in generative masked modeling and present MoRo: Masked modeling for human motion Recovery under Occlusions. MoRo is an occlusion-robust, end-to-end generative framework that formulates motion reconstruction as a video-conditioned task and efficiently recovers human motion in a consistent global coordinate system from RGB videos. Through masked modeling, MoRo naturally handles occlusions while enabling efficient, end-to-end inference.

To overcome the scarcity of paired video–motion data, we design a cross-modality learning scheme that learns multi-modal priors from a set of heterogeneous datasets: (i) a trajectory-aware motion prior trained on MoCap datasets, (ii) an image-conditioned pose prior trained on image-pose datasets, capturing diverse frame-level poses, and (iii) a video-conditioned masked transformer that fuses motion and pose priors, finetuned on video–motion datasets to integrate visual cues with motion dynamics for robust inference.

Extensive experiments on EgoBody and RICH demonstrate that MoRo substantially outperforms state-of-the-art methods in accuracy and motion realism under occlusions, while performing on par in non-occluded scenarios. MoRo achieves real-time inference at 70 FPS on a single H200 GPU.

Method

MoRo pipeline illustration.

MoRo comprises three main components: an image encoder, a motion encoder, and a cross-modality decoder. Given a monocular video sequence, the image encoder extracts per-frame image features and estimates a coarse global trajectory, which is canonicalized and fed to the motion encoder. Together with masked local pose tokens, the motion encoder learns a trajectory-aware motion prior by recovering the complete local pose tokens and denoising the global trajectory. The cross-modality decoder fuses intermediate features from both encoders via a spatial-temporal transformer to refine the camera-space global trajectory and to predict a conditional categorical distribution from which the local pose tokens are sampled; the sampled motion is then smoothed for enhanced realism.
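The page does not spell out how the decoder's categorical distribution is turned into a complete token sequence. As an illustration only, a generic MaskGIT-style iterative decode loop (the function names, the greedy token choice, and the cosine unmasking schedule are assumptions for this sketch, not MoRo's actual implementation) might look like:

```python
import math

def masked_decode(num_tokens, vocab_size, predict_logits, steps=4):
    """Iteratively fill a fully masked token sequence.

    predict_logits(tokens) returns, for every position, a list of
    vocab_size scores; masked positions in `tokens` are None (e.g.
    occluded frames). Each step commits the highest-confidence
    predictions and keeps the rest masked, following a cosine
    schedule in the style of MaskGIT.
    """
    tokens = [None] * num_tokens  # None = masked / unobserved
    for step in range(steps):
        logits = predict_logits(tokens)
        # Greedy choice + confidence for every still-masked position.
        cands = []
        for i, tok in enumerate(tokens):
            if tok is None:
                scores = logits[i]
                best = max(range(vocab_size), key=lambda v: scores[v])
                cands.append((scores[best], i, best))
        # Cosine schedule: fraction of tokens left masked after this step.
        frac = math.cos(math.pi / 2 * (step + 1) / steps)
        keep_masked = int(frac * num_tokens)
        cands.sort(reverse=True)
        n_commit = max(1, len(cands) - keep_masked)
        for _, i, best in cands[:n_commit]:
            tokens[i] = best
    return tokens

def toy_predictor(tokens):
    # Dummy stand-in for the video-conditioned transformer: at each
    # position i it deterministically favors token id (i mod 4).
    V = 4
    return [[1.0 if v == i % V else 0.0 for v in range(V)]
            for i in range(len(tokens))]

print(masked_decode(8, 4, toy_predictor))  # → [0, 1, 2, 3, 0, 1, 2, 3]
```

In MoRo the predictor would be the video-conditioned decoder, so occluded frames are filled from motion context rather than left to a per-frame regressor.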

Global Motion Reconstruction

Results on EgoBody

Results on PROX

Comparison with Baselines

Results on EgoBody

*Note that RoHM was run on the raw input video, whereas the other methods used undistorted videos as input.

Results on RICH

BibTeX

@inproceedings{qian2026moro,
  author    = {Qian, Zhiyin and Zhang, Siwei and Bhatnagar, Bharat Lal and Bogo, Federica and Tang, Siyu},
  title     = {Masked Modeling for Human Motion Recovery Under Occlusions},
  booktitle = {3DV},
  year      = {2026},
}