Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

Presents Im2Hands, a neural implicit representation of two interacting hands
Im2Hands can produce fine-grained geometry of two hands with high hand-to-hand and hand-to-image coherency
Im2Hands models the occupancy volume of two hands using two novel attention-based modules
Optional keypoint refinement module enables robust two-hand shape estimation from predicted hand keypoints
Achieves state-of-the-art results in two-hand reconstruction

Paper Content

Introduction

Modeling 3D shapes of two interacting hands is important for various applications
Existing studies have focused on single-hand reconstruction
Challenges include inter-hand collisions and mutual occlusions
Few learning-based methods on two-hand shape reconstruction have been proposed
Im2Hands is the first neural implicit representation of two interacting hands
Im2Hands produces two-hand meshes with an arbitrary resolution
Im2Hands learns output shapes with precise hand-to-hand and hand-to-image alignment
Im2Hands consists of two novel attention-based modules
Im2Hands is compared to existing two-hand mesh-based and single-hand implicit function-based reconstruction methods

Related work

Single-hand reconstruction methods use deep learning to reconstruct 3D keypoints, MANO parameters, or mesh vertex coordinates
Few recent works use neural implicit functions for single-hand reconstruction
Most existing methods for hand-object reconstruction use MANO topology-based mesh representations
Few recent works consider neural implicit representations to model hand-objects
Two-hand reconstruction is more challenging due to complex occlusions and deformations
Few methods can directly reconstruct the dense surface of two closely interacting hands
Neural articulated implicit representation is used to model articulated objects
Recent works use attention mechanisms and context-aware shape refinement steps for two-hand reconstruction
Our occupancy-based method can learn a resolution-independent hand surface with better image-shape alignment

Im2Hands: implicit two-hand function

Im2Hands is a neural occupancy representation of two interacting hands.
It combines an existing articulated occupancy function for a single hand with a novel query-image attention module to capture shape-dependent deformations.
It also proposes a two-hand occupancy refinement network that performs context-aware shape refinement of the two interacting hands.
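Because an occupancy representation is a function of a 3D query point, the same function can be sampled on a grid of any density before a surface is extracted. The sketch below illustrates only that property. The names `toy_occupancy`, `sample_grid`, and `occupied_fraction` are hypothetical, and a fixed sphere stands in for the learned network, so this is not the Im2Hands model itself.

```python
from typing import Callable, List, Tuple

Point = Tuple[float, float, float]


def toy_occupancy(p: Point) -> float:
    """Hypothetical stand-in for a learned occupancy network: a sphere of radius 0.5."""
    x, y, z = p
    return 1.0 if x * x + y * y + z * z <= 0.25 else 0.0


def sample_grid(occ: Callable[[Point], float], resolution: int,
                extent: float = 1.0) -> List[List[List[float]]]:
    """Evaluate `occ` at the cell centres of a resolution^3 grid over [-extent, extent]^3."""
    step = 2.0 * extent / resolution
    coords = [-extent + (i + 0.5) * step for i in range(resolution)]
    return [[[occ((x, y, z)) for z in coords] for y in coords] for x in coords]


def occupied_fraction(grid: List[List[List[float]]]) -> float:
    """Fraction of grid cells whose occupancy exceeds 0.5."""
    cells = [v for plane in grid for row in plane for v in row]
    return sum(1 for v in cells if v > 0.5) / len(cells)


# The same function is queried at two resolutions; nothing is retrained or resampled.
coarse = sample_grid(toy_occupancy, 8)
fine = sample_grid(toy_occupancy, 32)
```

In practice the dense grid would be passed to a surface extractor such as marching cubes; that step is omitted here to keep the sketch dependency-free.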

Initial hand occupancy estimation

Our initial hand occupancy network predicts occupancy probabilities for each hand based on an RGB image and two-hand keypoints.
HALO [19] models an implicit single-hand occupancy field driven by 3D keypoints.
Our network design is partly based on HALO.
We introduce an additional shape feature conditioned on an RGB image to model shape-dependent deformations.
Our initial per-hand occupancy is modeled by feeding our per-query shape feature along with other inputs to an MLP-based part occupancy network.
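To make the idea of a per-query shape feature concrete, the following sketch applies generic scaled dot-product attention between one query embedding and a set of image patch features. It is an assumption-laden illustration: the function names, the feature dimensions, and the use of plain single-head attention are all hypothetical, and the paper's actual query-image attention module may differ.

```python
import math
from typing import List, Sequence


def softmax(scores: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query: Sequence[float],
           keys: Sequence[Sequence[float]],
           values: Sequence[Sequence[float]]) -> List[float]:
    """Single-query scaled dot-product attention.

    query  : embedding of one 3D query point (hypothetical encoding)
    keys   : one key vector per image patch
    values : one feature vector per image patch
    Returns the attention-weighted mix of the patch features.
    """
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]


# Two image patches with identical keys contribute equally to the shape feature.
shape_feature = attend([1.0, 0.0], [[1.0, 0.0], [1.0, 0.0]], [[0.0, 2.0], [4.0, 6.0]])
```

The resulting vector plays the role of the image-conditioned feature that is fed, together with the other inputs, to the part occupancy network.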

Two-hand occupancy refinement

Estimate per-hand occupancy volumes
Model coherency between two hands
Propose two-hand occupancy refinement network
Estimate refined two-hand occupancies
Encode two hand shapes as point clouds
Extract global and local latent vectors
Estimate refined occupancy conditioned on local latent descriptors, global latent descriptor, and initial occupancy estimation

Input keypoint refinement

Im2Hands is an image-based two-hand reconstruction technique
Ground-truth hand keypoints are not available in this setting
Keypoint refinement module (K) is introduced to reduce noise in input two-hand keypoints
K combines a graph convolutional network (GCN), a keypoint encoder (KptEnc), an image encoder (ImgEnc), and multi-headed self-attention (MSA)
K improves the quality of two-hand reconstruction from single images

Loss functions

The initial occupancy network (I) is trained with an MSE loss between the ground-truth and predicted occupancy probabilities
The refinement network (R) is trained with an MSE loss and a penetration loss that discourages inter-penetration between the two hands
K is trained with an MSE loss between the ground-truth and predicted two-hand keypoints
The InterHand2.6M dataset is mainly used for quantitative and qualitative evaluation
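The training objectives listed above can be sketched as follows. The MSE term is standard. The penetration term is only one plausible form, assumed here for illustration: it penalizes query points that both hands claim with occupancy above 0.5. The function names, the threshold, and `pen_weight` are hypothetical, and the paper's exact definition is not reproduced in this summary.

```python
from typing import Sequence


def mse(pred: Sequence[float], target: Sequence[float]) -> float:
    """Mean squared error between predicted and ground-truth values."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def penetration_penalty(occ_left: Sequence[float], occ_right: Sequence[float],
                        threshold: float = 0.5) -> float:
    """Assumed penalty: mean of the summed occupancies over queries both hands occupy."""
    total = 0.0
    for left, right in zip(occ_left, occ_right):
        if left > threshold and right > threshold:
            total += left + right
    return total / len(occ_left)


def refinement_loss(pred_left: Sequence[float], pred_right: Sequence[float],
                    gt_left: Sequence[float], gt_right: Sequence[float],
                    pen_weight: float = 1.0) -> float:
    """MSE on both hands plus a weighted penetration penalty (weight is an assumption)."""
    data_term = mse(pred_left, gt_left) + mse(pred_right, gt_right)
    return data_term + pen_weight * penetration_penalty(pred_left, pred_right)
```

With predictions that never overlap, the penalty vanishes and the loss reduces to the two MSE terms.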

Experiments

Evaluate quality of reconstructed two-hand shapes using mean Intersection over Union (IoU) and Chamfer L1-Distance (CD)
Evaluate accuracy of 3D hand keypoints using mean per-joint position error (MPJPE)
Compare Im2Hands to a single-hand reconstruction method using an implicit representation
Compare Im2Hands to two-hand reconstruction methods using a mesh representation
Evaluate Im2Hands on single-image two-hand reconstruction using DIGIT and IntagHand
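The three metrics named above can be sketched in a few lines. This is a minimal illustration under stated assumptions: IoU is computed over occupancy samples thresholded at 0.5, the Chamfer distance averages unsquared nearest-neighbour Euclidean distances in both directions, and MPJPE averages per-joint Euclidean errors. The paper's exact sampling and normalization may differ, and the function names are hypothetical.

```python
import math
from typing import Sequence, Tuple

Point = Tuple[float, float, float]


def iou(pred_occ: Sequence[float], gt_occ: Sequence[float], threshold: float = 0.5) -> float:
    """Intersection over Union of two occupancy samplings at the same query points."""
    pred = [p > threshold for p in pred_occ]
    gt = [g > threshold for g in gt_occ]
    inter = sum(1 for p, g in zip(pred, gt) if p and g)
    union = sum(1 for p, g in zip(pred, gt) if p or g)
    return inter / union if union else 1.0


def _mean_nearest(src: Sequence[Point], dst: Sequence[Point]) -> float:
    """Mean distance from each point in `src` to its nearest neighbour in `dst`."""
    return sum(min(math.dist(s, d) for d in dst) for s in src) / len(src)


def chamfer(a: Sequence[Point], b: Sequence[Point]) -> float:
    """Symmetric Chamfer distance: average of the two directed nearest-neighbour means."""
    return 0.5 * (_mean_nearest(a, b) + _mean_nearest(b, a))


def mpjpe(pred_joints: Sequence[Point], gt_joints: Sequence[Point]) -> float:
    """Mean per-joint position error: average Euclidean distance between matched joints."""
    return sum(math.dist(p, g) for p, g in zip(pred_joints, gt_joints)) / len(gt_joints)
```

The brute-force nearest-neighbour search is quadratic in the number of points; a spatial index would be the usual choice for dense surface samples.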

Reconstruction from images and keypoints

Im2Hands outperforms other methods in Table 1
Im2Hands produces two-hand shapes with better alignment
Im2Hands captures finger contacts and avoids shape penetration

Reconstruction from single images

Evaluating Im2Hands on single-image two-hand reconstruction
Estimating two-hand shapes using 3D hand keypoints predicted from an input image
Disabling rescaling of reconstructed hand joints and shapes
Keypoint refinement module successful in alleviating input keypoint errors
Im2Hands achieves state-of-the-art results on InterHand2.6M
Im2Hands outperforms existing methods without requiring dense vertex correspondences or statistical model parameter annotations for training
Im2Hands produces more plausible two-hand shapes that align well with the input image

Generalizability test

Im2Hands was trained on InterHand2.6M dataset
Im2Hands was tested on the RGB2Hands and EgoHands datasets using DIGIT and IntagHand, respectively

Ablation study

Performed an ablation study to investigate effectiveness of major modules of Im2Hands
Full model is most effective compared to other model variants
Refer to supplementary section for more detailed model variations

Conclusion and future work

Present Im2Hands, a neural implicit representation of two interacting hands
Two modules for initial occupancy estimation and context-aware occupancy refinement
Optional input keypoint refinement module
State-of-the-art results on two-hand reconstruction
Limitations and future work: end-to-end learning, temporal information
Ablation study: input keypoint refinement, context latent extraction, feature cloud conversion
Qualitative comparison with HALO and IntagHand
Quantitative results of ablation study using ground truth keypoints
Implementation details: query positional embedding, image encoder-decoder, multi-headed self-attention, point cloud encoder, context encoder, point cloud decoder, graph convolutional network, output keypoint coordinate regressor
Training details: epochs, batch size, optimizer, learning rate, weight decay, loss function
Evaluation metrics: mean Intersection over Union and Chamfer L1-Distance
Generalizability test: pre-processing, RGB2Hands and EgoHands datasets

Im2Hands: Learning Attentive Implicit Representation of Interacting Two-Hand Shapes
