Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

Proposed novel approach for unsupervised 3D animation of non-rigid deformable objects
Learns 3D structure and dynamics from single-view RGB videos
Decomposes objects into semantically meaningful parts
Uses 3D autodecoder framework and keypoint estimator
Evaluated on two video datasets and one image dataset
Can obtain animatable 3D objects from single or few images

Paper Content

Introduction

Ability to animate dynamic object from single image enables creative tasks
Applications range from visual effects to consumer applications
Two approaches: outsourcing understanding to existing models or learning from raw data
Outsourcing requires knowledge of object, learning is unsupervised
Recent progress in unsupervised image animation
Methods typically learn motion model based on object parts and transformations
Prior works offer means to perform 2D animation only
Our work explores unsupervised image animation in 3D
Challenges include identifying and controlling object parts from 2D videos, modeling camera in 3D, and lack of bias of 2D CNNs
Our framework maps object to canonical volumetric representation, parameterized with voxel grid
Rigid parts are softly assigned to points in canonical volume
Linear blend skinning produces deformed volume according to pose of each part
Differentiable Perspectiven-Point algorithm estimates pose, linking 2D observations to 3D representation
Parts are learned in unsupervised manner, allowing for 3D reconstruction and novel view synthesis
Evaluated on three diverse datasets

3D-aware image and video synthesis has seen significant progress in the last two years
Early works used Neural Radiance Fields (NeRFs) to synthesize simple objects
Later works scaled the generator and increased its efficiency to attain high-resolution 3D synthesis
Different types of volumetric representations have been used
Implicit video synthesis techniques have been combined with volumetric rendering to generate 3D-aware videos
Unsupervised 3D reconstruction has been attempted
Supervised image animation requires an off-the-shelf keypoint predictor or a 3D morphable model estimator
Unsupervised image animation does not require supervision beyond photometric reconstruction loss
Improved motion representations have been proposed for animation
Latent Image Animator learned a latent space for possible motions

Method

Canonical voxel generator

Uses voxel grid to parametrize volume
Generates volume cube with density and RGB fields
Models object as set of rigid moving parts
Optimizes identity embeddings directly during training
Learns 3D keypoints and uses 2D keypoint predictor to predict 2D keypoints
Computes deformed density and radiance via volumetric skinning
Volumetrically renders deformed radiance to produce rendered image
Supervised using reconstruction loss

Unsupervised pose estimation

An object movement can be factorized into a set of rigid movements of each individual object’s part.
Estimating 2D parts and their poses in an unsupervised fashion is a difficult task.
Pose prediction is framed as a 2D landmark detection problem which CNNs can solve.
3D poses are estimated by learning a set of 3D keypoints in the canonical space and detecting their 2D projections in the current frame.

Volumetric skinning

Establishing correspondences between points in deformed and canonical space using Linear Blend Skinning (LBS)
Weight assigned to each part of the object
Approximate solution introduced in HumanNeRF to solve for x c
Inverse LBS weights w d p defined to approximate solution
Each point has strict assignment to single part, no self-penetration in deformed space

Volumetric rendering

Rendering of deformed object is done using differentiable volumetric rendering
Ray is cast through each pixel in the image plane and color is computed by integration
Volume density and radiance are mapped to 3D points along each ray
Model is trained using fixed extrinsics and intrinsics
No additional MLP or upsampling technique is used
Background is modeled as a plate of fixed, high density

Training

Learning a 3D representation of an articulated object from 2D observations is difficult and can lead to corrupted renderings.
A two-stage training strategy is used to promote learning of correct 3D representations.
The first stage is a Geometry phase (G-phase) which trains the model with only a single part.
The second stage introduces 10 parts and copies all weights from the G-phase.
The model is trained using a range of losses, including reconstruction loss, unsupervised background loss, and pose losses.

Inference

Model can be used to model previously unseen identities
Optimize reconstruction loss with respect to embedding
Finetune generator to address imperfect textures
Regularize V Density and V LBS during finetuning
Do not modify 2D keypoint predictor to ensure motion can be transferred

Experiments

Evaluating animation is difficult because there is no ground truth
Established an evaluation protocol for unsupervised volumetric animation
Used 3 publicly available datasets for evaluation: Cats, VoxCeleb, and TEDXPeople
Used 40896 videos for training and 100 for testing

Geometry from image data

Our method can learn high-fidelity geometry from images or videos without camera or geometry supervision.
We compare the quality of inferred geometry to a state-of-the-art 3D-GAN, EpiGRAF.
UVA provides higher-quality depth than EpiGRAF, reaching a correlation value of 0.63.

Animation evaluation

Unsupervised animation in 3D is a new task
Commonly used animation datasets do not typically offer multi-view data
Three new metrics introduced to evaluate viewpoint consistency without access to multi-view data
Average Yaw Deviation (AYD), Average Shape Consistency (ASC), and Average Pose Consistency (APC)
UVA models objects in canonical 3D space, better preserving an object’s shapes when animated
Depth-based method introduced to generate novel views for [52,55]
Exploiting a latent space component to generate head rotation for LIA [64]
Standard 2D reconstruction metrics used: L1, AKD/MKR [53], AED [53]
PnP-based part pose predictor compared to direct part pose prediction (Direct)
Unsupervised background loss L bkg evaluated
Two-phase training evaluated
High-quality depth estimates obtained for synthetic renderings using models trained only on real, in-the-wild images
Limitations of the model noted

Link to paper#

Abstract#

Paper Content#

Introduction#

Related work#

Method#

Canonical voxel generator#

Unsupervised pose estimation#

Volumetric skinning#

Volumetric rendering#

Training#

Inference#

Experiments#

Geometry from image data#

Animation evaluation#

Link to paper

Abstract

Paper Content

Introduction

Related work

Method

Canonical voxel generator

Unsupervised pose estimation

Volumetric skinning

Volumetric rendering

Training

Inference

Experiments

Geometry from image data

Animation evaluation