Link to paper
The full paper is available here.
You can also find the paper on PapersWithCode here.
Abstract
- Proposed novel approach for unsupervised 3D animation of non-rigid deformable objects
- Learns 3D structure and dynamics from single-view RGB videos
- Decomposes objects into semantically meaningful parts
- Uses 3D autodecoder framework and keypoint estimator
- Evaluated on two video datasets and one image dataset
- Can obtain animatable 3D objects from single or few images
Paper Content
Introduction
- Ability to animate dynamic object from single image enables creative tasks
- Applications range from visual effects to consumer applications
- Two approaches: outsourcing understanding to existing models or learning from raw data
- Outsourcing requires knowledge of object, learning is unsupervised
- Recent progress in unsupervised image animation
- Methods typically learn motion model based on object parts and transformations
- Prior works offer means to perform 2D animation only
- Our work explores unsupervised image animation in 3D
- Challenges include identifying and controlling object parts from 2D videos, modeling the camera in 3D, and the lack of 3D inductive bias in 2D CNNs
- Our framework maps object to canonical volumetric representation, parameterized with voxel grid
- Rigid parts are softly assigned to points in canonical volume
- Linear blend skinning produces deformed volume according to pose of each part
- Differentiable Perspective-n-Point (PnP) algorithm estimates pose, linking 2D observations to 3D representation
- Parts are learned in unsupervised manner, allowing for 3D reconstruction and novel view synthesis
- Evaluated on three diverse datasets
Related work
- 3D-aware image and video synthesis has seen significant progress in the last two years
- Early works used Neural Radiance Fields (NeRFs) to synthesize simple objects
- Later works scaled the generator and increased its efficiency to attain high-resolution 3D synthesis
- Different types of volumetric representations have been used
- Implicit video synthesis techniques have been combined with volumetric rendering to generate 3D-aware videos
- Unsupervised 3D reconstruction has been attempted
- Supervised image animation requires an off-the-shelf keypoint predictor or a 3D morphable model estimator
- Unsupervised image animation does not require supervision beyond photometric reconstruction loss
- Improved motion representations have been proposed for animation
- Latent Image Animator learned a latent space for possible motions
Method
Canonical voxel generator
- Uses voxel grid to parametrize volume
- Generates volume cube with density and RGB fields
- Models object as set of rigid moving parts
- Optimizes identity embeddings directly during training
- Learns 3D keypoints and uses 2D keypoint predictor to predict 2D keypoints
- Computes deformed density and radiance via volumetric skinning
- Volumetrically renders deformed radiance to produce rendered image
- Supervised using reconstruction loss
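A volume stored on a voxel grid has to be read at continuous 3D locations when it is deformed and rendered. The sketch below shows one common way to do that, trilinear interpolation. The text above does not say which interpolation is used, so this choice, the function name `trilinear_sample`, and the nested-list grid layout are all assumptions made for illustration.

```python
def trilinear_sample(grid, x, y, z):
    """Sample a scalar voxel grid (nested lists, grid[i][j][k]) at a
    continuous point given in voxel-index coordinates.
    Assumes every grid dimension has at least 2 voxels."""
    nx, ny, nz = len(grid), len(grid[0]), len(grid[0][0])
    # Clamp the query so the 2x2x2 neighbourhood stays inside the grid.
    x = min(max(x, 0.0), nx - 1.0)
    y = min(max(y, 0.0), ny - 1.0)
    z = min(max(z, 0.0), nz - 1.0)
    i0 = min(int(x), nx - 2)
    j0 = min(int(y), ny - 2)
    k0 = min(int(z), nz - 2)
    fx, fy, fz = x - i0, y - j0, z - k0
    value = 0.0
    for di in (0, 1):
        for dj in (0, 1):
            for dk in (0, 1):
                # Weight of each corner is the product of per-axis weights.
                w = ((fx if di else 1.0 - fx)
                     * (fy if dj else 1.0 - fy)
                     * (fz if dk else 1.0 - fz))
                value += w * grid[i0 + di][j0 + dj][k0 + dk]
    return value


# Toy 2x2x2 density grid whose value equals the x index.
density = [[[float(i) for _ in range(2)] for _ in range(2)] for i in range(2)]
sample = trilinear_sample(density, 0.25, 0.5, 0.5)
```

The same routine would be applied per channel for the RGB field.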
Unsupervised pose estimation
- An object's movement can be factorized into a set of rigid movements, one for each of the object's parts.
- Estimating 2D parts and their poses in an unsupervised fashion is a difficult task.
- Pose prediction is framed as a 2D landmark detection problem which CNNs can solve.
- 3D poses are estimated by learning a set of 3D keypoints in the canonical space and detecting their 2D projections in the current frame.
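The link between canonical 3D keypoints and their detected 2D projections can be illustrated with the reprojection error that a PnP solver minimizes. This is only a sketch of that objective, not a PnP solver and not the differentiable variant mentioned in the summary. The pinhole camera with unit focal length, the rotation restricted to the y axis, and all function names are assumptions made for brevity.

```python
import math


def rotate_y(point, angle):
    """Rotate a 3D point about the y axis (simplified one-parameter rotation)."""
    x, y, z = point
    c, s = math.cos(angle), math.sin(angle)
    return (c * x + s * z, y, -s * x + c * z)


def project(point_cam, focal=1.0):
    """Pinhole projection of a point in camera coordinates (z > 0)."""
    x, y, z = point_cam
    return (focal * x / z, focal * y / z)


def reprojection_error(keypoints_3d, keypoints_2d, angle, translation, focal=1.0):
    """Mean squared distance between projected 3D keypoints and 2D detections."""
    total = 0.0
    for kp3, kp2 in zip(keypoints_3d, keypoints_2d):
        rx, ry, rz = rotate_y(kp3, angle)
        cam = (rx + translation[0], ry + translation[1], rz + translation[2])
        u, v = project(cam, focal)
        total += (u - kp2[0]) ** 2 + (v - kp2[1]) ** 2
    return total / len(keypoints_3d)


# Hypothetical canonical keypoints of one part and a ground-truth pose.
canonical = [(-0.5, -0.5, 0.0), (0.5, -0.5, 0.2), (0.5, 0.5, -0.1), (-0.5, 0.5, 0.3)]
true_angle, true_t = 0.3, (0.0, 0.0, 4.0)
detections = [
    project(tuple(a + b for a, b in zip(rotate_y(k, true_angle), true_t)))
    for k in canonical
]

error_at_truth = reprojection_error(canonical, detections, true_angle, true_t)
error_at_guess = reprojection_error(canonical, detections, 0.0, true_t)
```

The error vanishes at the true pose and grows for a wrong one. A PnP solver searches for the rotation and translation that minimize it, given the 3D keypoints and their 2D detections.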
Volumetric skinning
- Establishing correspondences between points in deformed and canonical space using Linear Blend Skinning (LBS)
- Weight assigned to each part of the object
- Approximate solution introduced in HumanNeRF to solve for x_c
- Inverse LBS weights w^d_p defined to approximate solution
- The approximation is exact when each point has a strict assignment to a single part and there is no self-penetration in deformed space
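The skinning step can be sketched with a two-part toy example. Forward LBS moves a canonical point with a weighted blend of per-part rigid transforms. The inverse direction maps a deformed point back through each part's inverse transform and reuses the canonical weights at the resulting candidates. The exact formulas are not given in the summary, so the form below, the hard weight function `canonical_weights`, and the chosen poses are assumptions for illustration.

```python
import math


def rot_z(angle):
    """3x3 rotation matrix about the z axis."""
    c, s = math.cos(angle), math.sin(angle)
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]


def apply(R, t, x):
    """Rigid transform R x + t."""
    return [sum(R[i][j] * x[j] for j in range(3)) + t[i] for i in range(3)]


def apply_inverse(R, t, x):
    """Inverse rigid transform R^T (x - t)."""
    d = [x[i] - t[i] for i in range(3)]
    return [sum(R[j][i] * d[j] for j in range(3)) for i in range(3)]


def canonical_weights(x_c):
    """Hypothetical hard assignment: part 0 owns x < 0, part 1 owns x >= 0."""
    return [1.0, 0.0] if x_c[0] < 0 else [0.0, 1.0]


def forward_lbs(x_c, poses):
    """Deformed point as the weight-blended result of each part's transform."""
    w = canonical_weights(x_c)
    moved = [apply(R, t, x_c) for R, t in poses]
    return [sum(w[p] * moved[p][i] for p in range(len(poses))) for i in range(3)]


def inverse_lbs(x_d, poses):
    """Approximate canonical point using inverse weights evaluated per part."""
    candidates = [apply_inverse(R, t, x_d) for R, t in poses]
    # Part p claims the point if its candidate lands in part p's canonical region.
    raw = [canonical_weights(c)[p] for p, c in enumerate(candidates)]
    total = sum(raw)
    if total == 0.0:
        raise ValueError("no part claims this deformed point")
    w_d = [r / total for r in raw]
    return [sum(w_d[p] * candidates[p][i] for p in range(len(poses))) for i in range(3)]


poses = [(rot_z(0.0), [0.0, 0.0, 0.0]), (rot_z(math.pi / 2), [2.0, 0.0, 0.0])]
x_c = [1.0, 0.5, 0.0]
x_d = forward_lbs(x_c, poses)
x_back = inverse_lbs(x_d, poses)
```

With hard weights and no overlap between parts in deformed space, `x_back` matches `x_c` up to floating-point error. Soft weights or overlapping parts would make the round trip only approximate.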
Volumetric rendering
- Rendering of deformed object is done using differentiable volumetric rendering
- Ray is cast through each pixel in the image plane and color is computed by integration
- Volume density and radiance are mapped to 3D points along each ray
- Model is trained using fixed extrinsics and intrinsics
- No additional MLP or upsampling technique is used
- Background is modeled as a plate of fixed, high density
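The per-ray integration can be sketched with the standard quadrature used in NeRF-style volume rendering: each sample contributes its color weighted by its opacity and by the transmittance accumulated in front of it. Treating the background plate as a final sample of very high density is an assumption about how such a plate could be composited, and the function name and sample values are illustrative.

```python
import math


def render_ray(densities, colors, deltas):
    """Composite samples along one ray, ordered front to back.

    densities: volume density at each sample
    colors:    RGB triple at each sample
    deltas:    distance covered by each sample
    Returns the pixel color and the remaining transmittance.
    """
    transmittance = 1.0
    pixel = [0.0, 0.0, 0.0]
    for sigma, color, delta in zip(densities, colors, deltas):
        alpha = 1.0 - math.exp(-sigma * delta)  # opacity of this segment
        weight = transmittance * alpha          # contribution to the pixel
        pixel = [p + weight * c for p, c in zip(pixel, color)]
        transmittance *= 1.0 - alpha
    return pixel, transmittance


background = (0.2, 0.4, 0.6)
red = (1.0, 0.0, 0.0)

# Ray through empty space that ends on the high-density background plate.
empty_pixel, _ = render_ray([0.0, 0.0, 1e3], [red, red, background], [0.1] * 3)

# Ray that hits an opaque object before reaching the background.
object_pixel, _ = render_ray([1e3, 0.0, 1e3], [red, red, background], [0.1] * 3)
```

The first ray returns the background color and the second returns the object color, since the opaque front sample leaves almost no transmittance for anything behind it.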
Training
- Learning a 3D representation of an articulated object from 2D observations is difficult and can lead to corrupted renderings.
- A two-stage training strategy is used to promote learning of correct 3D representations.
- The first stage is a Geometry phase (G-phase) which trains the model with only a single part.
- The second stage introduces 10 parts and copies all weights from the G-phase.
- The model is trained using a range of losses, including reconstruction loss, unsupervised background loss, and pose losses.
Inference
- Model can be used to model previously unseen identities
- Optimize reconstruction loss with respect to embedding
- Finetune generator to address imperfect textures
- Regularize V_Density and V_LBS during finetuning
- Do not modify 2D keypoint predictor to ensure motion can be transferred
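The first inference step, fitting an embedding for an unseen identity while the generator stays fixed, can be illustrated with a toy problem. Here a frozen linear map stands in for the generator and a squared error stands in for the reconstruction loss. Both are simplifying assumptions, as are the names `decode` and `fit_embedding`, the step count, and the learning rate.

```python
def decode(W, e):
    """Frozen toy 'generator': a linear map from embedding to output vector."""
    return [sum(w * x for w, x in zip(row, e)) for row in W]


def fit_embedding(W, target, steps=500, lr=0.1):
    """Gradient descent on 0.5 * ||W e - target||^2 with respect to e only."""
    e = [0.0] * len(W[0])
    for _ in range(steps):
        residual = [o - t for o, t in zip(decode(W, e), target)]
        # Gradient with respect to the embedding is W^T residual; W is not updated.
        grad = [
            sum(W[i][j] * residual[i] for i in range(len(W)))
            for j in range(len(e))
        ]
        e = [ej - lr * g for ej, g in zip(e, grad)]
    return e


W = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # frozen decoder weights
true_embedding = [0.5, -1.0]               # unknown identity to recover
observation = decode(W, true_embedding)
recovered = fit_embedding(W, observation)
```

In the toy case the embedding is recovered almost exactly. With a real generator the fit is imperfect, which is why the summary lists a subsequent finetuning step.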
Experiments
- Evaluating animation is difficult because there is no ground truth
- Established an evaluation protocol for unsupervised volumetric animation
- Used 3 publicly available datasets for evaluation: Cats, VoxCeleb, and TEDXPeople
- Used 40896 videos for training and 100 for testing
Geometry from image data
- Our method can learn high-fidelity geometry from images or videos without camera or geometry supervision.
- We compare the quality of inferred geometry to a state-of-the-art 3D-GAN, EpiGRAF.
- UVA provides higher-quality depth than EpiGRAF, reaching a correlation value of 0.63.
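A correlation score between predicted and reference depth can be computed as follows. The summary does not say which correlation is used, so Pearson's coefficient is an assumption here, and the depth values are made up for illustration.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical flattened depth maps: the prediction differs from the
# reference by an unknown scale and offset.
reference_depth = [1.0, 2.0, 3.0, 4.0]
predicted_depth = [2.5, 4.5, 6.5, 8.5]
score = pearson(predicted_depth, reference_depth)
```

Pearson correlation is invariant to a positive scale and an offset, which suits depth predicted without metric supervision.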
Animation evaluation
- Unsupervised animation in 3D is a new task
- Commonly used animation datasets do not typically offer multi-view data
- Three new metrics introduced to evaluate viewpoint consistency without access to multi-view data
- Average Yaw Deviation (AYD), Average Shape Consistency (ASC), and Average Pose Consistency (APC)
- UVA models objects in canonical 3D space, better preserving an object’s shapes when animated
- Depth-based method introduced to generate novel views for [52,55]
- Exploiting a latent space component to generate head rotation for LIA [64]
- Standard 2D reconstruction metrics used: L1, AKD/MKR [53], AED [53]
- PnP-based part pose predictor compared to direct part pose prediction (Direct)
- Unsupervised background loss L_bkg evaluated
- Two-phase training evaluated
- High-quality depth estimates obtained for synthetic renderings using models trained only on real, in-the-wild images
- Limitations of the model noted