Link to paper
The full paper is available here.
You can also find the paper on PapersWithCode here.
Abstract
- Presents Im2Hands, a neural implicit representation of two interacting hands
- Im2Hands can produce fine-grained geometry of two hands with high hand-to-hand and hand-to-image coherency
- Im2Hands models the occupancy volume of two hands using two novel attention-based modules
- Optional keypoint refinement module enables robust two-hand shape estimation from predicted hand keypoints
- Achieves state-of-the-art results in two-hand reconstruction
Paper Content
Introduction
- Modeling 3D shapes of two interacting hands is important for various applications
- Existing studies have focused on single-hand reconstruction
- Challenges include inter-hand collisions and mutual occlusions
- Few learning-based methods on two-hand shape reconstruction have been proposed
- Im2Hands is the first neural implicit representation of two interacting hands
- Im2Hands produces two-hand meshes with an arbitrary resolution
- Im2Hands learns output shapes with precise hand-to-hand and hand-to-image alignment
- Im2Hands consists of two novel attention-based modules
- Im2Hands is compared to existing two-hand mesh-based and single-hand implicit function-based reconstruction methods
Related work
- Single-hand reconstruction methods use deep learning to reconstruct 3D keypoints, MANO parameters, or mesh vertex coordinates
- Few recent works use neural implicit functions for single-hand reconstruction
- Most existing methods for hand-object reconstruction use MANO topology-based mesh representations
- Few recent works consider neural implicit representations to model hand-objects
- Two-hand reconstruction is more challenging due to complex occlusions and deformations
- Few methods can directly reconstruct the dense surface of closely interacting two-hands
- Neural articulated implicit representation is used to model articulated objects
- Recent works use attention mechanisms and context-aware shape refinement steps for two-hand reconstruction
- Our occupancy-based method can learn resolution-independent hand surface with better image-shape alignment
Im2hands: implicit two-hand function
- Im2Hands is a neural occupancy representation of two interacting hands.
- It combines existing articulated occupancy function for single hand with a novel query-image attention module to capture shape-dependent deformations.
- It also proposes a two-hand occupancy refinement network to perform context-aware shape refinement of interacting two hands.
Initial hand occupancy estimation
- Our initial hand occupancy network predicts occupancy probabilities for each hand based on an RGB image and two-hand keypoints.
- HALO [19] models an implicit single-hand occupancy field driven by 3D keypoints.
- Our network design is partly based on HALO.
- We introduce an additional shape feature conditioned on an RGB image to model shape-dependent deformations.
- Our initial per-hand occupancy is modeled by feeding our per-query shape feature along with other inputs to an MLP-based part occupancy network.
Two-hand occupancy refinement
- Estimate per-hand occupancy volumes
- Model co-herency between two hands
- Propose two-hand occupancy refinement network
- Estimate refined two-hand occupancies
- Encode two hand shapes as point clouds
- Extract global and local latent vectors
- Estimate refined occupancy conditioned on local latent descriptors, global latent descriptor, and initial occupancy estimation
Input keypoint refinement
- Im2Hands is an image-based two-hand reconstruction technique
- No ground truth hand keypoints are available
- Keypoint refinement module (K) is introduced to reduce noise in input two-hand keypoints
- K is designed as a combination of GCN, KptEnc, ImgEnc and MSA
- K improves the quality of two-hand reconstruction from single images
Loss functions
- I is trained by MSE loss to measure the difference between the ground truth and predicted occupancy probabilities
- R is trained by MSE loss and penetration loss to avoid inter-penetration between two hands
- K is trained by MSE loss using the ground truth and predicted two-hand keypoints
- Mainly use InterHand2.6M dataset for quantitative and qualitative evaluation
Experiments
- Evaluate quality of reconstructed two-hand shapes using IoU and CD
- Evaluate accuracy of 3D hand keypoints using MPJPE
- Compare Im2Hands to single-hand reconstruction method using implicit representation
- Compare Im2Hands to two-hand reconstruction methods using mesh representation
- Evaluate Im2Hands on single-image two-hand reconstruction using DIGIT and Intag-Hand
Reconstruction from images and keypoints
- Im2Hands outperforms other methods in Table 1
- Im2Hands produces two-hand shapes with better alignment
- Im2Hands captures finger contacts and avoids shape penetration
Reconstruction from single images
- Evaluating Im2Hands on single-image two-hand reconstruction
- Estimating two-hand shapes using 3D hand keypoints predicted from an input image
- Disabling rescaling of reconstructed hand joints and shapes
- Keypoint refinement module successful in alleviating input keypoint errors
- Im2Hands achieves state-of-the-art results on InterHand2.6M
- Im2Hands outperforms existing methods without requiring dense vertex correspondences or statistical model parameter annotations for training
- Im2Hands produces more plausible two-hand shapes that align well with the input image
Generalizability test
- Im2Hands was trained on InterHand2.6M dataset
- Im2Hands was tested on RGB2Hands and EgoHands datasets using DIGIT and In-tagHand respectively
Ablation study
- Performed an ablation study to investigate effectiveness of major modules of Im2Hands
- Full model is most effective compared to other model variants
- Refer to supplementary section for more detailed model variations
Conclusion and future work
- Present Im2Hands, a neural implicit representation of two interacting hands
- Two modules for initial occupancy estimation and context-aware occupancy refinement
- Optional input keypoint refinement module
- State-of-the-art results on two-hand reconstruction
- Limitations and future work: end-to-end learning, temporal information
- Ablation study: input keypoint refinement, context latent extraction, feature cloud conversion
- Qualitative comparison with HALO and IntagHand
- Quantitative results of ablation study using ground truth keypoints
- Implementation details: query positional embedding, image encoder-decoder, multi-headed self-attention, point cloud encoder, context encoder, point cloud decoder, graph convolutional network, output keypoint coordinate regressor
- Training details: epochs, batch size, optimizer, learning rate, weight decay, loss function
- Evaluation metrics: mean Intersection over Union and Chamfer L1-Distance
- Generalizability test: pre-processing, RGB2Hands and EgoHands datasets