Link to paper
The full paper is available here.
You can also find the paper on PapersWithCode here.
Abstract
- X-Avatar is a novel avatar model that captures the full expressiveness of digital humans.
- X-Avatar can be learned from 3D scans or RGB-D data.
- X-Avatar uses a part-aware learned forward skinning module.
- X-Avatar uses part-aware sampling and initialization strategies.
- X-Avatar uses a texture network to capture the appearance of the avatar.
- X-Avatar outperforms strong baselines in both data domains.
- X-Humans is a new dataset of high-quality textured scans from 20 participants.
Paper Content
Introduction
- Human communication involves body pose, appearance, facial expressions, and hand gestures.
- Capturing this richness of human expressiveness is difficult.
- Existing models are limited in resolution and do not model garments or hair.
- Neural implicit representations have potential to overcome these limitations.
- X-Avatar captures shape, appearance, deformations, hand poses, facial expressions, and clothing.
- Challenges include increased number of articulated body parts and different scales.
- Part-aware initialization and sampling strategies improve quality and keep training efficient.
- X-Humans dataset released with data, models, and SMPL[-X] registrations.
Related work
- Explicit models use a 3D mesh and low-dimensional parameters to represent the human body
- Explicit models have been applied to tasks such as RGB-based pose estimation, RGB-D fitting, fitting to body-worn sensor data, and 3D hand pose estimation
- Explicit models have been extended to model human expressiveness
- Implicit models use neural fields to represent 3D geometry
- Implicit models have been applied to bodies, faces, hands, and clothing
- Publicly available datasets with clothed and textured ground-truth are rare
- X-Humans dataset contains 35,500 frames of high-quality, texturized scans of real clothed humans
Method
- X-Avatar is a method for modeling implicit human avatars with full body control
- X-Avatar can be learned from 3D posed scans and RGB-D images
- X-Avatar uses SMPL-X full body model
- X-Avatar has a training scheme and part-aware initialization and sampling strategies
Recap: smpl-x unified human body model
- Goal: Create fully controllable human avatars
- Use SMPL-X parameter space, which includes articulated hands and expressive face
Implicit neural avatar representation
- Human model defined by articulated neural implicit surfaces
- Three neural fields: geometry, deformation, and appearance
- Geometry modeled by MLP predicting occupancy value for 3D point in canonical space
- Deformation modeled by MLP predicting skinning weights conditioned on body pose and facial expression
- Part-aware initialization and sampling to account for small body parts
- Texture modeled by MLP predicting RGB values conditioned on deformed geometry and local high-frequency details
Training process
- Objective function is used to minimize 3D scans
- Objective function consists of two losses: binary cross entropy loss and L2 loss
- Objective function also consists of RGB supervision, bone occupancy loss, joint LBS weights loss, and surface LBS weights loss
- Supervision from registered SMPL-X meshes is used
Experiments
- Introduced datasets and metrics in Sec. 4.2
- Described state-of-the-art methods in Sec. 4.3
- Showed and discussed results in Sec.
- Focused on challenging animation task
- Reported reconstruction results in Supp. Mat.
Datasets
- GRAB subset of AMASS used for training
- X-Humans dataset contains 20 subjects with various clothing and hair styles
- X-Humans contains 29,036 poses for training and 6,439 test poses
- Evaluation metrics include volumetric IoU, Chamfer distance and normal consistency
Ablation study
- Part-aware initialization is critical to accelerate training and find good correspondences in small body parts.
- Initializing with all bones (body, hands, face) is 3 times slower than initializing with just the body’s bone transformations.
- Part-aware sampling improves hand shape and texture details in the eye and mouth region.
Baselines
- Scan-based methods are compared to SM-PLX+D, SCANimate and SNARF baselines
- SM-PLX+D is adapted from SMPL+D
- Clothing is modeled with additive vertex offsets
- RGB-D method variant is compared to PINA
- Ground truth pose and shape are assumed to be known
Results on grab dataset
- Our method outperforms all baselines on the GRAB dataset
- SCANimate learns a mean hand and SNARF generalizes badly to unseen poses
- Our method outperforms PINA in all metrics on X-Humans (RGB-D)
- Our model generates more realistic faces and better hand poses
Conclusion
- X-Avatar captures body pose, hand pose, facial expressions and appearance
- X-Avatar has limited capability to model loose clothing
- X-Avatar is trained for each subject
- X-Avatar is parameterized by shape, body pose, facial expressions, and occupancy network