3D Interacting Hand Pose Estimation by Hand De-occlusion and Removal

ECCV 2022


Hao Meng1,3*, Sheng Jin2,3*, Wentao Liu3,4, Chen Qian3, Mengxiang Lin1, Wanli Ouyang4,5, Ping Luo2

1Beihang University    2The University of Hong Kong    3SenseTime Research and Tetras.AI    4Shanghai AI Lab    5The University of Sydney

Abstract



Estimating 3D interacting hand pose from a single RGB image is essential for understanding human actions. Unlike most previous works, which directly predict the 3D poses of two interacting hands simultaneously, we propose to decompose the challenging interacting hand pose estimation task and estimate the pose of each hand separately. In this way, it is straightforward to take advantage of the latest research progress in single-hand pose estimation. However, hand pose estimation in interacting scenarios is very challenging, due to (1) severe hand-hand occlusion and (2) ambiguity caused by the homogeneous appearance of hands. To tackle these two challenges, we propose a novel Hand De-occlusion and Removal (HDR) framework to perform hand de-occlusion and distractor removal. We also propose the first large-scale synthetic amodal hand dataset, termed the Amodal InterHand Dataset (AIH), to facilitate model training and promote the development of related research. Experiments show that the proposed method significantly outperforms previous state-of-the-art interacting hand pose estimation approaches.


Overview



Figure 2. Illustration of our Hand De-occlusion and Removal (HDR) framework for the task of 3D interacting hand pose estimation. We first employ HASM (Hand Amodal Segmentation Module) to segment the amodal and modal masks of the left and the right hand in the image. Given the predicted masks, we locate and crop the image patch centered at each hand. Then, for every cropped image, the HDRM (Hand De-occlusion and Removal Module) recovers the appearance content of the occluded part of one hand and removes the other, distracting hand simultaneously. In this way, the interacting two-hand image is transformed into a single-hand image, which can easily be handled by SHPE (Single Hand Pose Estimation) to obtain the final 3D hand poses.
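The pipeline in Figure 2 can be summarized in code. The sketch below is only a structural outline of the HDR framework under stated assumptions: the internals of HASM, HDRM, and SHPE are placeholder stubs (not the authors' networks), and the function names, mask layout, and 21-joint output are illustrative choices, not part of the released implementation.

```python
import numpy as np

def hasm_segment(image):
    """HASM stub: predict amodal (full-extent) and modal (visible) masks
    for the left and right hand. Real module is a segmentation network."""
    h, w = image.shape[:2]
    return {hand: {"amodal": np.zeros((h, w), bool),
                   "modal": np.zeros((h, w), bool)}
            for hand in ("left", "right")}

def crop_around(image, mask, size=256):
    """Crop a square patch centered at the hand; this stub simply uses
    the image center instead of the mask centroid."""
    h, w = image.shape[:2]
    cy, cx = h // 2, w // 2
    half = size // 2
    return image[max(cy - half, 0):cy + half, max(cx - half, 0):cx + half]

def hdrm(patch, target_hand, masks):
    """HDRM stub: in the real module, the occluded region of the target
    hand is inpainted and the distracting hand is removed, producing a
    single-hand image. Here the patch is returned unchanged."""
    return patch

def shpe(single_hand_image):
    """SHPE stub: an off-the-shelf single-hand 3D pose estimator.
    Returns a dummy pose with 21 joints (a common hand-joint count)."""
    return np.zeros((21, 3))

def estimate_interacting_hands(image):
    """Run the decomposed pipeline: segment, crop per hand,
    de-occlude/remove, then estimate each hand's pose separately."""
    masks = hasm_segment(image)
    poses = {}
    for hand in ("left", "right"):
        patch = crop_around(image, masks[hand]["amodal"])
        clean = hdrm(patch, hand, masks)
        poses[hand] = shpe(clean)
    return poses

poses = estimate_interacting_hands(np.zeros((512, 512, 3), np.uint8))
print(poses["left"].shape)  # (21, 3)
```

The design point the sketch illustrates is the decomposition itself: once HDRM has turned the interacting two-hand crop into a clean single-hand image, any existing single-hand pose estimator can be plugged in for the final step.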


HDRNet




Amodal InterHand (AIH) Dataset




Qualitative Results





Demo video



Citation


@inproceedings{meng2022hdr,
  title={3D Interacting Hand Pose Estimation by Hand De-occlusion and Removal},
  author={Meng, Hao and Jin, Sheng and Liu, Wentao and Qian, Chen and Lin, Mengxiang and Ouyang, Wanli and Luo, Ping},
  booktitle={European Conference on Computer Vision (ECCV)},
  year={2022},
  month={October},
}


Acknowledgement


We would like to thank Wentao Jiang, Wang Zeng, Neng Qian, Yumeng Hu, Lixin Yang, Yu Rong, Qiang Zhou and Jiayi Wang for their helpful discussions and feedback. Mengxiang Lin is supported by State Key Laboratory of Software Development Environment under Grant No SKLSDE 2022ZX-06. Ping Luo is supported by the General Research Fund of HK No.27208720, No.17212120, and No.17200622. Wanli Ouyang is supported by the Australian Research Council Grant DP200103223, Australian Medical Research Future Fund MRFAI000085, CRC-P Smart Material Recovery Facility (SMRF) – Curby Soft Plastics, and CRC-P ARIA - Bionic Visual-Spatial Prosthesis for the Blind.