¹Tsinghua University  ²NNKosmos  ³OPPO Research
With NeRF widely used for facial reenactment, recent methods can recover a photo-realistic 3D head avatar from just a monocular video. Unfortunately, training NeRF-based methods is quite time-consuming, as the MLP they rely on is inefficient and requires too many iterations to converge. To overcome this problem, we propose AvatarMAV, a fast 3D head avatar reconstruction method using Motion-Aware Neural Voxels. AvatarMAV is the first to model both the canonical appearance and the decoupled expression motion of a head avatar with neural voxels. In particular, the motion-aware neural voxels are generated from the weighted concatenation of multiple 4D tensors. The 4D tensors correspond one-to-one to the 3DMM expression bases and share the same weights as the 3DMM expression coefficients. Benefiting from this novel representation, AvatarMAV can recover photo-realistic head avatars in just 5 minutes (implemented in pure PyTorch), which is significantly faster than state-of-the-art facial reenactment methods.
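Below is a minimal PyTorch sketch (not the released implementation) of the motion-aware neural voxel construction described above: one learnable 4D feature grid per 3DMM expression basis, blended using the tracked expression coefficients as weights (Fig 2 describes this blend as a coefficient-weighted sum). The grid resolution, channel count, and all names such as MotionAwareVoxels and expr_coeffs are illustrative assumptions.

import torch
import torch.nn as nn

class MotionAwareVoxels(nn.Module):
    """One learnable 4D feature grid (C, D, H, W) per 3DMM expression basis."""
    def __init__(self, num_bases: int = 32, channels: int = 4, res: int = 32):
        super().__init__()
        # Zero-init so the motion field starts as the identity deformation.
        self.bases = nn.Parameter(torch.zeros(num_bases, channels, res, res, res))

    def forward(self, expr_coeffs: torch.Tensor) -> torch.Tensor:
        # expr_coeffs: (num_bases,) 3DMM expression coefficients from tracking.
        # Blend the basis grids into one motion voxel grid of shape (C, D, H, W).
        return torch.einsum('k,kcdhw->cdhw', expr_coeffs, self.bases)

Because conditioning on a new frame costs only this weighted blend of shared basis grids, per-frame expression input stays cheap while the grids themselves remain fast to optimize.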
Fig 1. We propose AvatarMAV, a fast 3D head avatar reconstruction method. Given a monocular video, our method can recover a photo-realistic head avatar in 5 minutes.
Fig 2. Overview. Given a portrait video, we first track the expression and head pose using a 3DMM template. After this pre-processing, given the expression coefficients, we use motion voxel grid bases to represent the motion caused by each expression basis and sum them, weighted by the coefficients, into an entire motion voxel grid. The entire motion voxel grid and the following 2-layer MLP then map an input point \( x \) to \( x + \delta x \), accounting for all expression-related deformations. Finally, we query the point \( x + \delta x \) in the appearance voxel grid and generate the final portrait image via volumetric rendering.
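As a companion to Fig 2, the following hedged sketch shows the per-point deformation step under the same assumptions as above: trilinearly sample the blended motion voxel grid at \( x \), decode the sampled feature to an offset \( \delta x \) with a 2-layer MLP, and query the canonical appearance grid at \( x + \delta x \). The hidden width and the helper sample_grid are hypothetical.

import torch
import torch.nn as nn
import torch.nn.functional as F

def sample_grid(grid: torch.Tensor, pts: torch.Tensor) -> torch.Tensor:
    # grid: (C, D, H, W) feature volume; pts: (N, 3) coordinates in [-1, 1]^3.
    # grid_sample wants (B, C, D, H, W) input and (B, D, H, W, 3) coordinates;
    # mode='bilinear' performs trilinear interpolation on 5D inputs.
    g = grid.unsqueeze(0)                              # (1, C, D, H, W)
    p = pts.view(1, -1, 1, 1, 3)                       # (1, N, 1, 1, 3)
    out = F.grid_sample(g, p, mode='bilinear', align_corners=True)
    return out.view(grid.shape[0], -1).t()             # (N, C)

class DeformationField(nn.Module):
    def __init__(self, feat_ch: int = 4, hidden: int = 64):
        super().__init__()
        # The 2-layer MLP of Fig 2: motion features -> 3D offsets.
        self.mlp = nn.Sequential(
            nn.Linear(feat_ch, hidden), nn.ReLU(inplace=True),
            nn.Linear(hidden, 3))

    def forward(self, motion_grid, appearance_grid, pts):
        delta_x = self.mlp(sample_grid(motion_grid, pts))          # (N, 3)
        # Query the canonical appearance grid at the deformed points; the
        # returned features feed standard volumetric rendering (not shown).
        return sample_grid(appearance_grid, pts + delta_x), delta_x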
Fig 3. Training Process. We illustrate the training process of our method, AvatarMAV.
Fig 4. Training Speed Comparisons. We qualitatively compare training speed among NeRFace, NeRFBlendShape, and our AvatarMAV. Our model converges rapidly within the first 2 minutes.
Fig 5. Qualitative comparisons between AvatarMAV and three other state-of-the-art methods on the self-reenactment task. From left to right: IMAvatar, Deep Video Portraits (DVP), NeRFace, AvatarMAV, and ground truth. Our approach converges and learns fine details within a short time.
Fig 6. Qualitative results of AvatarMAV and three other state-of-the-art methods on the cross-identity reenactment task. From left to right: Deep Video Portraits (DVP), IMAvatar, NeRFace, and AvatarMAV.
@InProceedings{xu2023avatarmav,
title={AvatarMAV: Fast 3D Head Avatar Reconstruction Using Motion-Aware Neural Voxels},
author={Xu, Yuelang and Wang, Lizhen and Zhao, Xiaochen and Zhang, Hongwen and Liu, Yebin},
booktitle={ACM SIGGRAPH 2023 Conference Proceedings},
pages={},
year={2023}
}