Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

  • Proposed novel approach for unsupervised 3D animation of non-rigid deformable objects
  • Learns 3D structure and dynamics from single-view RGB videos
  • Decomposes objects into semantically meaningful parts
  • Uses 3D autodecoder framework and keypoint estimator
  • Evaluated on two video datasets and one image dataset
  • Can obtain animatable 3D objects from single or few images

Paper Content

Introduction

  • Ability to animate dynamic object from single image enables creative tasks
  • Applications range from visual effects to consumer applications
  • Two approaches: outsourcing understanding to existing models or learning from raw data
  • Outsourcing requires knowledge of object, learning is unsupervised
  • Recent progress in unsupervised image animation
  • Methods typically learn motion model based on object parts and transformations
  • Prior works offer means to perform 2D animation only
  • Our work explores unsupervised image animation in 3D
  • Challenges include identifying and controlling object parts from 2D videos, modeling camera in 3D, and lack of bias of 2D CNNs
  • Our framework maps object to canonical volumetric representation, parameterized with voxel grid
  • Rigid parts are softly assigned to points in canonical volume
  • Linear blend skinning produces deformed volume according to pose of each part
  • Differentiable Perspectiven-Point algorithm estimates pose, linking 2D observations to 3D representation
  • Parts are learned in unsupervised manner, allowing for 3D reconstruction and novel view synthesis
  • Evaluated on three diverse datasets
  • 3D-aware image and video synthesis has seen significant progress in the last two years
  • Early works used Neural Radiance Fields (NeRFs) to synthesize simple objects
  • Later works scaled the generator and increased its efficiency to attain high-resolution 3D synthesis
  • Different types of volumetric representations have been used
  • Implicit video synthesis techniques have been combined with volumetric rendering to generate 3D-aware videos
  • Unsupervised 3D reconstruction has been attempted
  • Supervised image animation requires an off-the-shelf keypoint predictor or a 3D morphable model estimator
  • Unsupervised image animation does not require supervision beyond photometric reconstruction loss
  • Improved motion representations have been proposed for animation
  • Latent Image Animator learned a latent space for possible motions

Method

Canonical voxel generator

  • Uses voxel grid to parametrize volume
  • Generates volume cube with density and RGB fields
  • Models object as set of rigid moving parts
  • Optimizes identity embeddings directly during training
  • Learns 3D keypoints and uses 2D keypoint predictor to predict 2D keypoints
  • Computes deformed density and radiance via volumetric skinning
  • Volumetrically renders deformed radiance to produce rendered image
  • Supervised using reconstruction loss

Unsupervised pose estimation

  • An object movement can be factorized into a set of rigid movements of each individual object’s part.
  • Estimating 2D parts and their poses in an unsupervised fashion is a difficult task.
  • Pose prediction is framed as a 2D landmark detection problem which CNNs can solve.
  • 3D poses are estimated by learning a set of 3D keypoints in the canonical space and detecting their 2D projections in the current frame.

Volumetric skinning

  • Establishing correspondences between points in deformed and canonical space using Linear Blend Skinning (LBS)
  • Weight assigned to each part of the object
  • Approximate solution introduced in HumanNeRF to solve for x c
  • Inverse LBS weights w d p defined to approximate solution
  • Each point has strict assignment to single part, no self-penetration in deformed space

Volumetric rendering

  • Rendering of deformed object is done using differentiable volumetric rendering
  • Ray is cast through each pixel in the image plane and color is computed by integration
  • Volume density and radiance are mapped to 3D points along each ray
  • Model is trained using fixed extrinsics and intrinsics
  • No additional MLP or upsampling technique is used
  • Background is modeled as a plate of fixed, high density

Training

  • Learning a 3D representation of an articulated object from 2D observations is difficult and can lead to corrupted renderings.
  • A two-stage training strategy is used to promote learning of correct 3D representations.
  • The first stage is a Geometry phase (G-phase) which trains the model with only a single part.
  • The second stage introduces 10 parts and copies all weights from the G-phase.
  • The model is trained using a range of losses, including reconstruction loss, unsupervised background loss, and pose losses.

Inference

  • Model can be used to model previously unseen identities
  • Optimize reconstruction loss with respect to embedding
  • Finetune generator to address imperfect textures
  • Regularize V Density and V LBS during finetuning
  • Do not modify 2D keypoint predictor to ensure motion can be transferred

Experiments

  • Evaluating animation is difficult because there is no ground truth
  • Established an evaluation protocol for unsupervised volumetric animation
  • Used 3 publicly available datasets for evaluation: Cats, VoxCeleb, and TEDXPeople
  • Used 40896 videos for training and 100 for testing

Geometry from image data

  • Our method can learn high-fidelity geometry from images or videos without camera or geometry supervision.
  • We compare the quality of inferred geometry to a state-of-the-art 3D-GAN, EpiGRAF.
  • UVA provides higher-quality depth than EpiGRAF, reaching a correlation value of 0.63.

Animation evaluation

  • Unsupervised animation in 3D is a new task
  • Commonly used animation datasets do not typically offer multi-view data
  • Three new metrics introduced to evaluate viewpoint consistency without access to multi-view data
  • Average Yaw Deviation (AYD), Average Shape Consistency (ASC), and Average Pose Consistency (APC)
  • UVA models objects in canonical 3D space, better preserving an object’s shapes when animated
  • Depth-based method introduced to generate novel views for [52,55]
  • Exploiting a latent space component to generate head rotation for LIA [64]
  • Standard 2D reconstruction metrics used: L1, AKD/MKR [53], AED [53]
  • PnP-based part pose predictor compared to direct part pose prediction (Direct)
  • Unsupervised background loss L bkg evaluated
  • Two-phase training evaluated
  • High-quality depth estimates obtained for synthetic renderings using models trained only on real, in-the-wild images
  • Limitations of the model noted