IEEE ICCV 2021

DeepMultiCap: Performance Capture of Multiple Characters Using Sparse Multiview Cameras


Yang Zheng*, Ruizhi Shao*, Yuxiang Zhang, Tao Yu, Zerong Zheng, Qionghai Dai, Yebin Liu (* equal contribution)

Department of Automation and BNRist, Tsinghua University


Fig 1. Given only sparse multi-view RGB videos (6 views for the left and middle, 8 views for the right), our method reconstructs various kinds of 3D shapes with temporally varying surface details, even under the challenging occlusions of multi-person interaction scenarios.

Abstract

We propose DeepMultiCap, a novel method for multi-person performance capture using sparse multi-view cameras. Our method captures time-varying surface details without relying on pre-scanned template models. To tackle the severe occlusions in closely interacting scenes, we combine a recently proposed pixel-aligned implicit function with a parametric body model for robust reconstruction of invisible surface areas. An attention-aware module is designed to fuse fine-grained geometric details from multi-view images, enabling high-fidelity results. Beyond this spatial attention, for video inputs we further propose a novel temporal fusion method that alleviates noise and temporal inconsistency when reconstructing moving characters. For quantitative evaluation, we contribute MultiHuman, a high-quality multi-person dataset consisting of 150 static scenes with different levels of occlusion and ground-truth 3D human models. Experimental results demonstrate the state-of-the-art performance of our method and its strong generalization to real multi-view video data, outperforming prior works by a large margin.
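To make the core idea concrete, the PyTorch sketch below illustrates how a pixel-aligned implicit function can be conditioned on a parametric-model prior: a 3D query point is projected into a view, image features are sampled at the projection, concatenated with the point's depth and a coarse occupancy prior (e.g., derived from a fitted SMPL mesh), and decoded to an occupancy value by an MLP. All names, layer sizes, and the assumption that calib projects into normalized image coordinates are illustrative, not the paper's actual implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F

class PixelAlignedQuery(nn.Module):
    # Minimal sketch of a PIFu-style query conditioned on a SMPL-derived prior.
    def __init__(self, feat_dim=256, prior_dim=1):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(feat_dim + 1 + prior_dim, 256), nn.ReLU(),
            nn.Linear(256, 128), nn.ReLU(),
            nn.Linear(128, 1),  # occupancy logit
        )

    def forward(self, feat_map, points, calib, prior):
        # feat_map: (B, C, H, W) features from a 2D image encoder
        # points:   (B, N, 3)   3D query points in world space
        # calib:    (B, 3, 4)   projection to normalized image coords (assumed)
        # prior:    (B, N, 1)   coarse occupancy prior from the SMPL fit
        ones = torch.ones_like(points[..., :1])
        proj = torch.bmm(torch.cat([points, ones], -1), calib.transpose(1, 2))
        xy = proj[..., :2] / proj[..., 2:].clamp(min=1e-6)  # pixel coords in [-1, 1]
        z = proj[..., 2:]                                    # view depth
        feat = F.grid_sample(feat_map, xy.unsqueeze(2), align_corners=True)
        feat = feat.squeeze(-1).transpose(1, 2)              # (B, N, C)
        return self.mlp(torch.cat([feat, z, prior], dim=-1)) # (B, N, 1)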


[arXiv] [Code]

Overview


Fig 2. Pipeline of our framework. Given estimated SMPL models and segmented multi-view images, our spatial attention-aware network and temporal fusion method reconstruct each character separately. Robust results with fine-grained details are generated even in closely interacting scenes.
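The paper's temporal fusion is its own attention-based design; as a loose stand-in for the underlying idea of borrowing information across frames, the sketch below blends per-frame features with an exponential moving average, which already suppresses frame-to-frame jitter. The momentum value and the feature granularity are assumptions for illustration only.

import torch

def temporal_fuse(feat_seq, momentum=0.9):
    # feat_seq: iterable of per-frame feature tensors of identical shape.
    # Running average as a crude stand-in for learned temporal fusion.
    fused, state = [], None
    for feat in feat_seq:
        state = feat.clone() if state is None else momentum * state + (1.0 - momentum) * feat
        fused.append(state)
    return fused

# Example: smooth ten frames of 256-d per-point features.
frames = [torch.randn(1024, 256) for _ in range(10)]
smoothed = temporal_fuse(frames)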


Fig 3. Architecture of our attention-aware network. We leverage a two-level coarse-to-fine framework (left) with a multi-view feature fusion module based on self-attention (right). The SMPL human body prior is used at the coarse level to ensure robust reconstruction, and a specially designed SMPL global normal map helps the fine-level network better capture details. To fuse multi-view features efficiently, we leverage the self-attention mechanism to extract meta information from different observations, which significantly improves reconstruction quality.
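As a sketch of how self-attention can fuse per-view features for a single query point, the snippet below treats the V views as a token sequence, lets them attend to one another, and averages the result into a view-independent feature. The dimensions, head count, and residual-plus-norm layout are illustrative choices, not the paper's exact module.

import torch
import torch.nn as nn

class MultiViewAttentionFusion(nn.Module):
    def __init__(self, dim=256, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, view_feats):
        # view_feats: (B, V, C), one feature vector per view for each query point
        attended, _ = self.attn(view_feats, view_feats, view_feats)
        fused = self.norm(view_feats + attended)  # residual connection + norm
        return fused.mean(dim=1)                  # (B, C) fused feature

# Example: fuse 6 views of 256-d features for 1024 query points.
fusion = MultiViewAttentionFusion()
print(fusion(torch.randn(1024, 6, 256)).shape)  # torch.Size([1024, 256])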



Results


Fig 4. Reconstruction results on the MultiHuman dataset for single-person, occluded single-person, two-person natural-interaction, two-person close-interaction, and three-person scenes (top to bottom). Our method (e) generates robust and highly detailed humans, significantly narrowing the gap to the ground truth (f) relative to current state-of-the-art methods.


Fig 5. Qualitative results of the ablation study on the MultiHuman dataset. We evaluate (e) our full method against alternative approaches: (b) ours without SMPL (equivalent to PIFuHD combined with the attention module), (c) ours without the attention module, and (d) ours without the designed SMPL global normal maps.



MultiHuman Dataset (Download)


Fig 6. Examples from the MultiHuman dataset. The dataset consists of high-quality 3D human models with photo-realistic textures. According to occlusion level and the number of persons in the scene, we divide the dataset into 5 categories: single-person scenes, occluded single-person scenes (by various objects), two-person natural-interaction scenes, two-person close-interaction scenes, and three-person scenes (top to bottom).


Technical Paper



Demo Video



Citation

Yang Zheng, Ruizhi Shao, Yuxiang Zhang, Tao Yu, Zerong Zheng, Qionghai Dai, and Yebin Liu. "DeepMultiCap: Performance Capture of Multiple Characters Using Sparse Multiview Cameras". In IEEE/CVF International Conference on Computer Vision (ICCV), 2021.


@inproceedings{zheng2021deepmulticap,
title={DeepMultiCap: Performance Capture of Multiple Characters Using Sparse Multiview Cameras},
author={Zheng, Yang and Shao, Ruizhi and Zhang, Yuxiang and Yu, Tao and Zheng, Zerong and Dai, Qionghai and Liu, Yebin},
booktitle={Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)},
year={2021},
}