Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

  • Presents Im2Hands, a neural implicit representation of two interacting hands
  • Im2Hands can produce fine-grained geometry of two hands with high hand-to-hand and hand-to-image coherency
  • Im2Hands models the occupancy volume of two hands using two novel attention-based modules
  • Optional keypoint refinement module enables robust two-hand shape estimation from predicted hand keypoints
  • Achieves state-of-the-art results in two-hand reconstruction

Paper Content

Introduction

  • Modeling 3D shapes of two interacting hands is important for various applications
  • Existing studies have focused on single-hand reconstruction
  • Challenges include inter-hand collisions and mutual occlusions
  • Few learning-based methods on two-hand shape reconstruction have been proposed
  • Im2Hands is the first neural implicit representation of two interacting hands
  • Im2Hands produces two-hand meshes with an arbitrary resolution
  • Im2Hands learns output shapes with precise hand-to-hand and hand-to-image alignment
  • Im2Hands consists of two novel attention-based modules
  • Im2Hands is compared to existing two-hand mesh-based and single-hand implicit function-based reconstruction methods
  • Single-hand reconstruction methods use deep learning to reconstruct 3D keypoints, MANO parameters, or mesh vertex coordinates
  • Few recent works use neural implicit functions for single-hand reconstruction
  • Most existing methods for hand-object reconstruction use MANO topology-based mesh representations
  • Few recent works consider neural implicit representations to model hand-objects
  • Two-hand reconstruction is more challenging due to complex occlusions and deformations
  • Few methods can directly reconstruct the dense surface of closely interacting two-hands
  • Neural articulated implicit representation is used to model articulated objects
  • Recent works use attention mechanisms and context-aware shape refinement steps for two-hand reconstruction
  • Our occupancy-based method can learn resolution-independent hand surface with better image-shape alignment

Im2hands: implicit two-hand function

  • Im2Hands is a neural occupancy representation of two interacting hands.
  • It combines existing articulated occupancy function for single hand with a novel query-image attention module to capture shape-dependent deformations.
  • It also proposes a two-hand occupancy refinement network to perform context-aware shape refinement of interacting two hands.

Initial hand occupancy estimation

  • Our initial hand occupancy network predicts occupancy probabilities for each hand based on an RGB image and two-hand keypoints.
  • HALO [19] models an implicit single-hand occupancy field driven by 3D keypoints.
  • Our network design is partly based on HALO.
  • We introduce an additional shape feature conditioned on an RGB image to model shape-dependent deformations.
  • Our initial per-hand occupancy is modeled by feeding our per-query shape feature along with other inputs to an MLP-based part occupancy network.

Two-hand occupancy refinement

  • Estimate per-hand occupancy volumes
  • Model co-herency between two hands
  • Propose two-hand occupancy refinement network
  • Estimate refined two-hand occupancies
  • Encode two hand shapes as point clouds
  • Extract global and local latent vectors
  • Estimate refined occupancy conditioned on local latent descriptors, global latent descriptor, and initial occupancy estimation

Input keypoint refinement

  • Im2Hands is an image-based two-hand reconstruction technique
  • No ground truth hand keypoints are available
  • Keypoint refinement module (K) is introduced to reduce noise in input two-hand keypoints
  • K is designed as a combination of GCN, KptEnc, ImgEnc and MSA
  • K improves the quality of two-hand reconstruction from single images

Loss functions

  • I is trained by MSE loss to measure the difference between the ground truth and predicted occupancy probabilities
  • R is trained by MSE loss and penetration loss to avoid inter-penetration between two hands
  • K is trained by MSE loss using the ground truth and predicted two-hand keypoints
  • Mainly use InterHand2.6M dataset for quantitative and qualitative evaluation

Experiments

  • Evaluate quality of reconstructed two-hand shapes using IoU and CD
  • Evaluate accuracy of 3D hand keypoints using MPJPE
  • Compare Im2Hands to single-hand reconstruction method using implicit representation
  • Compare Im2Hands to two-hand reconstruction methods using mesh representation
  • Evaluate Im2Hands on single-image two-hand reconstruction using DIGIT and Intag-Hand

Reconstruction from images and keypoints

  • Im2Hands outperforms other methods in Table 1
  • Im2Hands produces two-hand shapes with better alignment
  • Im2Hands captures finger contacts and avoids shape penetration

Reconstruction from single images

  • Evaluating Im2Hands on single-image two-hand reconstruction
  • Estimating two-hand shapes using 3D hand keypoints predicted from an input image
  • Disabling rescaling of reconstructed hand joints and shapes
  • Keypoint refinement module successful in alleviating input keypoint errors
  • Im2Hands achieves state-of-the-art results on InterHand2.6M
  • Im2Hands outperforms existing methods without requiring dense vertex correspondences or statistical model parameter annotations for training
  • Im2Hands produces more plausible two-hand shapes that align well with the input image

Generalizability test

  • Im2Hands was trained on InterHand2.6M dataset
  • Im2Hands was tested on RGB2Hands and EgoHands datasets using DIGIT and In-tagHand respectively

Ablation study

  • Performed an ablation study to investigate effectiveness of major modules of Im2Hands
  • Full model is most effective compared to other model variants
  • Refer to supplementary section for more detailed model variations

Conclusion and future work

  • Present Im2Hands, a neural implicit representation of two interacting hands
  • Two modules for initial occupancy estimation and context-aware occupancy refinement
  • Optional input keypoint refinement module
  • State-of-the-art results on two-hand reconstruction
  • Limitations and future work: end-to-end learning, temporal information
  • Ablation study: input keypoint refinement, context latent extraction, feature cloud conversion
  • Qualitative comparison with HALO and IntagHand
  • Quantitative results of ablation study using ground truth keypoints
  • Implementation details: query positional embedding, image encoder-decoder, multi-headed self-attention, point cloud encoder, context encoder, point cloud decoder, graph convolutional network, output keypoint coordinate regressor
  • Training details: epochs, batch size, optimizer, learning rate, weight decay, loss function
  • Evaluation metrics: mean Intersection over Union and Chamfer L1-Distance
  • Generalizability test: pre-processing, RGB2Hands and EgoHands datasets