Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

  • X-Avatar is a novel avatar model that captures the full expressiveness of digital humans.
  • X-Avatar can be learned from 3D scans or RGB-D data.
  • X-Avatar uses a part-aware learned forward skinning module.
  • X-Avatar uses part-aware sampling and initialization strategies.
  • X-Avatar uses a texture network to capture the appearance of the avatar.
  • X-Avatar outperforms strong baselines in both data domains.
  • X-Humans is a new dataset of high-quality textured scans from 20 participants.

Paper Content

Introduction

  • Human communication involves body pose, appearance, facial expressions, and hand gestures.
  • Capturing this richness of human expressiveness is difficult.
  • Existing models are limited in resolution and do not model garments or hair.
  • Neural implicit representations have potential to overcome these limitations.
  • X-Avatar captures shape, appearance, deformations, hand poses, facial expressions, and clothing.
  • Challenges include increased number of articulated body parts and different scales.
  • Part-aware initialization and sampling strategies improve quality and keep training efficient.
  • X-Humans dataset released with data, models, and SMPL[-X] registrations.
  • Explicit models use a 3D mesh and low-dimensional parameters to represent the human body
  • Explicit models have been applied to tasks such as RGB-based pose estimation, RGB-D fitting, fitting to body-worn sensor data, and 3D hand pose estimation
  • Explicit models have been extended to model human expressiveness
  • Implicit models use neural fields to represent 3D geometry
  • Implicit models have been applied to bodies, faces, hands, and clothing
  • Publicly available datasets with clothed and textured ground-truth are rare
  • X-Humans dataset contains 35,500 frames of high-quality, texturized scans of real clothed humans

Method

  • X-Avatar is a method for modeling implicit human avatars with full body control
  • X-Avatar can be learned from 3D posed scans and RGB-D images
  • X-Avatar uses SMPL-X full body model
  • X-Avatar has a training scheme and part-aware initialization and sampling strategies

Recap: smpl-x unified human body model

  • Goal: Create fully controllable human avatars
  • Use SMPL-X parameter space, which includes articulated hands and expressive face

Implicit neural avatar representation

  • Human model defined by articulated neural implicit surfaces
  • Three neural fields: geometry, deformation, and appearance
  • Geometry modeled by MLP predicting occupancy value for 3D point in canonical space
  • Deformation modeled by MLP predicting skinning weights conditioned on body pose and facial expression
  • Part-aware initialization and sampling to account for small body parts
  • Texture modeled by MLP predicting RGB values conditioned on deformed geometry and local high-frequency details

Training process

  • Objective function is used to minimize 3D scans
  • Objective function consists of two losses: binary cross entropy loss and L2 loss
  • Objective function also consists of RGB supervision, bone occupancy loss, joint LBS weights loss, and surface LBS weights loss
  • Supervision from registered SMPL-X meshes is used

Experiments

  • Introduced datasets and metrics in Sec. 4.2
  • Described state-of-the-art methods in Sec. 4.3
  • Showed and discussed results in Sec.
  • Focused on challenging animation task
  • Reported reconstruction results in Supp. Mat.

Datasets

  • GRAB subset of AMASS used for training
  • X-Humans dataset contains 20 subjects with various clothing and hair styles
  • X-Humans contains 29,036 poses for training and 6,439 test poses
  • Evaluation metrics include volumetric IoU, Chamfer distance and normal consistency

Ablation study

  • Part-aware initialization is critical to accelerate training and find good correspondences in small body parts.
  • Initializing with all bones (body, hands, face) is 3 times slower than initializing with just the body’s bone transformations.
  • Part-aware sampling improves hand shape and texture details in the eye and mouth region.

Baselines

  • Scan-based methods are compared to SM-PLX+D, SCANimate and SNARF baselines
  • SM-PLX+D is adapted from SMPL+D
  • Clothing is modeled with additive vertex offsets
  • RGB-D method variant is compared to PINA
  • Ground truth pose and shape are assumed to be known

Results on grab dataset

  • Our method outperforms all baselines on the GRAB dataset
  • SCANimate learns a mean hand and SNARF generalizes badly to unseen poses
  • Our method outperforms PINA in all metrics on X-Humans (RGB-D)
  • Our model generates more realistic faces and better hand poses

Conclusion

  • X-Avatar captures body pose, hand pose, facial expressions and appearance
  • X-Avatar has limited capability to model loose clothing
  • X-Avatar is trained for each subject
  • X-Avatar is parameterized by shape, body pose, facial expressions, and occupancy network