Abstract

  • MAV3D is a method for generating 3D dynamic scenes from text descriptions
  • MAV3D uses a 4D dynamic Neural Radiance Field (NeRF)
  • MAV3D does not require 3D or 4D data
  • MAV3D is trained on Text-Image pairs and unlabeled videos
  • MAV3D is the first to generate 3D dynamic scenes given a text description

Paper Content

Introduction

  • Generative models can now generate realistic images from natural language prompts
  • Generative models have been extended to videos and 3D shapes
  • MAV3D combines the benefits of video and 3D generative models
  • MAV3D takes natural language description and outputs dynamic 3D scene
  • No readily available collection of 4D models with textual annotations
  • MAV3D uses video generator as ‘statistical’ multi-camera setup
  • MAV3D uses Neural Radiance Field (NeRF) to represent dynamic 3D scenes
  • MAV3D uses multi-stage training pipeline for dynamic scene rendering
  • MAV3D uses temporal-aware SDS loss and motion regularizers
  • MAV3D uses temporal-aware super-resolution fine-tuning for higher resolution outputs
  • Neural rendering uses neural networks to represent 3D scenes
  • Recent work has improved efficiency by incorporating 3D data structures
  • Aim to generate dynamic scenes which can be viewed from any angle
  • Generating 3D scenes from text dates back decades
  • Recent improvements in diffusion models have led to advanced image synthesis
  • Video generator is based on Make-A-Video (MAV)

Method

  • Goal is to develop a method to produce a dynamic 3D scene from a natural-language description
  • Use a pretrained text-to-video (T2V) diffusion model as a scene prior
  • Given a text prompt, fit a 4D scene representation
  • Render a sequence of images from the 4D scene representation
  • Pass the text prompt and the video to a pretrained T2V diffusion model
  • Use Score Distillation Sampling (SDS) to compute an update direction for the scene parameters
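
The loop described above (render, add noise, query the frozen diffusion model, step the scene parameters) can be sketched on a toy problem. This is a minimal illustration under stated assumptions, not the paper's implementation: the "renderer" is the identity map, `toy_noise_predictor` is a hypothetical stand-in for the pretrained T2V model, and the weighting `w = 1 - alpha_bar` is just one common choice.

```python
import math
import random

def toy_noise_predictor(x_t, alpha_bar, target):
    """Hypothetical stand-in for the frozen T2V denoiser.

    Returns the exact noise estimate for a data distribution that is a
    point mass at `target` (which plays the role of the text prompt).
    """
    a, s = math.sqrt(alpha_bar), math.sqrt(1.0 - alpha_bar)
    return [(xt - a * y) / s for xt, y in zip(x_t, target)]

def sds_direction(render, alpha_bar, target, rng):
    """SDS update direction w * (eps_hat - eps) for one noise level."""
    eps = [rng.gauss(0.0, 1.0) for _ in render]
    a, s = math.sqrt(alpha_bar), math.sqrt(1.0 - alpha_bar)
    x_t = [a * x + s * e for x, e in zip(render, eps)]  # noised render
    eps_hat = toy_noise_predictor(x_t, alpha_bar, target)
    w = 1.0 - alpha_bar  # assumed weighting, not taken from the paper
    return [w * (eh - e) for eh, e in zip(eps_hat, eps)]

def fit(theta, target, steps=500, lr=0.1, seed=0):
    """Move scene parameters along the SDS direction.

    The renderer here is the identity, so the direction is applied to
    theta directly; a real pipeline would back-propagate it through the
    differentiable renderer instead.
    """
    rng = random.Random(seed)
    theta = list(theta)
    for _ in range(steps):
        alpha_bar = rng.uniform(0.02, 0.98)  # random noise level per step
        g = sds_direction(theta, alpha_bar, target, rng)
        theta = [t - lr * gi for t, gi in zip(theta, g)]
    return theta
```

Note that the denoiser is never differentiated: only its output enters the update, which is what lets the diffusion model stay frozen.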

4D scene representation

  • Neural rendering is used to represent a dynamic 3D scene implicitly
  • Rays are cast through the camera plane into the scene and points are sampled along the ray
  • Volume density and color are computed for each point
  • MLP is used to output the color
  • HexPlane is used to represent the 4D scene
  • MLP is used to predict volume density and color
  • Background model simulates a large static sphere surrounding the dynamic foreground
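
To make the representation above more concrete, here is a small sketch of two pieces: looking up features for a space-time point from six 2D planes, and compositing sampled densities and colors along a ray. The plane names, the bilinear interpolation, and the choice to concatenate the six feature vectors are assumptions for illustration; the MLP that decodes features into density and color is omitted.

```python
import math

PLANE_AXES = ["xy", "xz", "yz", "xt", "yt", "zt"]  # assumed plane naming

def bilerp(grid, u, v):
    """Bilinearly interpolate a 2D grid of feature lists at (u, v) in [0, 1]^2."""
    h, w = len(grid), len(grid[0])
    fu, fv = u * (h - 1), v * (w - 1)
    i0, j0 = min(int(fu), h - 2), min(int(fv), w - 2)
    du, dv = fu - i0, fv - j0
    out = []
    for c in range(len(grid[0][0])):
        top = grid[i0][j0][c] * (1 - dv) + grid[i0][j0 + 1][c] * dv
        bot = grid[i0 + 1][j0][c] * (1 - dv) + grid[i0 + 1][j0 + 1][c] * dv
        out.append(top * (1 - du) + bot * du)
    return out

def hexplane_features(planes, x, y, z, t):
    """Project (x, y, z, t) onto six planes and concatenate the features."""
    coords = {"x": x, "y": y, "z": z, "t": t}
    feats = []
    for name in PLANE_AXES:
        feats.extend(bilerp(planes[name], coords[name[0]], coords[name[1]]))
    return feats

def composite(densities, colors, deltas):
    """Standard volume-rendering quadrature, front to back along one ray.

    Returns the pixel color and the transmittance left after the last
    sample (what a background model would be weighted by).
    """
    transmittance, pixel = 1.0, [0.0, 0.0, 0.0]
    for sigma, rgb, delta in zip(densities, colors, deltas):
        alpha = 1.0 - math.exp(-sigma * delta)  # opacity of this segment
        weight = transmittance * alpha
        pixel = [p + weight * c for p, c in zip(pixel, rgb)]
        transmittance *= 1.0 - alpha
    return pixel, transmittance
```

Storing six 2D planes instead of a dense 4D grid keeps memory growth quadratic rather than quartic in resolution, which is the main appeal of this kind of factorization.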

Dynamic scene optimization

  • HexPlane model is optimized to match the textual prompt
  • Temporal Score Distillation Sampling (SDS-T) introduced as an extension of SDS
  • Loss computed and applied to MAV3D
  • Pretrained conditional video generator based on diffusion
  • Update direction for scene parameters θ computed using SDS
  • Multi-stage static-to-dynamic optimization scheme used
  • Gaussian Annealing and Total Variation Loss used as regularizers
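
The two regularizers named above can be illustrated roughly as follows. This is a sketch under assumptions, not the paper's formulation: the density bias is modeled as a Gaussian bump at the origin whose width follows a hypothetical linear schedule (`annealed_sigma`, with made-up start and end values), and total variation is computed as the mean squared difference between neighbouring cells of a single scalar plane.

```python
import math

def density_bias(point, sigma, scale=5.0):
    """Gaussian bump added to the density, centred at the scene origin.

    `scale` is an illustrative value, not one taken from the paper.
    """
    r2 = sum(c * c for c in point)
    return scale * math.exp(-r2 / (2.0 * sigma * sigma))

def annealed_sigma(step, total_steps, sigma_start=0.2, sigma_end=2.0):
    """Hypothetical linear schedule that widens the bump during training."""
    frac = min(max(step / total_steps, 0.0), 1.0)
    return sigma_start + frac * (sigma_end - sigma_start)

def total_variation(plane):
    """Mean squared difference between adjacent cells of a 2D scalar grid."""
    h, w = len(plane), len(plane[0])
    total, count = 0.0, 0
    for i in range(h):
        for j in range(w):
            if i + 1 < h:  # vertical neighbour
                total += (plane[i + 1][j] - plane[i][j]) ** 2
                count += 1
            if j + 1 < w:  # horizontal neighbour
                total += (plane[i][j + 1] - plane[i][j]) ** 2
                count += 1
    return total / count if count else 0.0
```

A narrow bump early on concentrates density near the center; widening it later leaves more room for content away from the origin, which is consistent with the larger motion reported in the ablation.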

Super-resolution fine-tuning

  • 4D scene representation is supervised via low-resolution 64x64 renderings
  • Rendering higher-resolution videos from the learned model can lack detail and exhibit artifacts
  • SRFT uses a pretrained and frozen video super-resolution module SR^t_l
  • SR^t_l takes as input a noisy high-resolution 256x256 video and a clean low-resolution 64x64 video
  • SR^t_l is used to improve high-resolution renderings from the 4D scene model
  • SRFT trains jointly using SDS from SR^t_l and SDS-T
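
As a rough illustration of the two inputs the super-resolution module receives, the sketch below builds a noisy high-resolution video and a clean low-resolution one from the same render. How the low-resolution video is obtained is an assumption here (average pooling of the high-resolution render), and `make_sr_inputs` is a hypothetical helper, not a function from the paper or any library. Frames are single-channel lists of rows to keep the example short.

```python
import math
import random

def average_pool(frame, factor):
    """Downsample a 2D frame by averaging non-overlapping factor x factor blocks."""
    h, w = len(frame), len(frame[0])
    return [
        [
            sum(frame[i * factor + di][j * factor + dj]
                for di in range(factor) for dj in range(factor)) / factor ** 2
            for j in range(w // factor)
        ]
        for i in range(h // factor)
    ]

def add_noise(frame, alpha_bar, rng):
    """Diffusion-style noising: sqrt(alpha_bar) * x + sqrt(1 - alpha_bar) * eps."""
    a, s = math.sqrt(alpha_bar), math.sqrt(1.0 - alpha_bar)
    return [[a * v + s * rng.gauss(0.0, 1.0) for v in row] for row in frame]

def make_sr_inputs(video_hi, alpha_bar, factor=4, seed=0):
    """Return (noisy high-res video, clean low-res video) for one render."""
    rng = random.Random(seed)
    noisy_hi = [add_noise(frame, alpha_bar, rng) for frame in video_hi]
    clean_lo = [average_pool(frame, factor) for frame in video_hi]
    return noisy_hi, clean_lo
```

With `factor=4`, a 256x256 render yields a 64x64 conditioning video, matching the resolutions listed above.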

Experiments

  • MAV3D is evaluated on generating dynamic scenes from text descriptions
  • Three alternative methods developed as baselines
  • Evaluates simplified versions of model on sub-tasks of T2V and Text-To-3D
  • Comprehensive ablation study to justify method’s design
  • Conversion of dynamic NeRFs into dynamic meshes

Results

  • Text-to-4D comparison:
  • Text-to-3D comparison:
  • Text-to-Video comparison:

Ablation study

  • Human raters prefer model trained with SR for quality, text alignment and motion
  • SR fine-tuning enhances quality of rendered videos
  • Model trained without static scene pre-training has lower scene quality and poor convergence
  • Dynamic camera variant has less motion and suffers from multi-face artifacts on objects
  • Gaussian annealing leads to renderings with larger and more realistic motion
  • HexPlane is slightly preferred in terms of overall quality and realistic motion
  • Instant-NGP is significantly less preferred

Real-time rendering

  • HexPlane model can be converted to animated meshes
  • Marching cubes algorithm is used to extract a simplicial mesh
  • Mesh decimation and removal of small noisy connected components
  • XATLAS algorithm is used to map mesh vertices to a texture atlas
  • Texture is initialized using HexPlane colors
  • Texture is further optimized to better match example frames
  • Collection of textured meshes can be played back in 3D engine
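
One step of the pipeline above, removing small noisy connected components from an extracted mesh, can be sketched with a union-find over vertex indices. This is an illustrative version only: the face format (triples of vertex indices) and the `min_faces` threshold are assumptions, and the paper's actual cleanup criteria are not given in this summary.

```python
def remove_small_components(faces, min_faces):
    """Drop faces belonging to connected components with fewer than min_faces faces.

    `faces` is a list of (a, b, c) vertex-index triples; two faces are in the
    same component when they are linked through shared vertices.
    """
    parent = {}

    def find(v):
        parent.setdefault(v, v)
        while parent[v] != v:
            parent[v] = parent[parent[v]]  # path halving
            v = parent[v]
        return v

    def union(a, b):
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[rb] = ra

    # Vertices of the same face always belong to the same component.
    for a, b, c in faces:
        union(a, b)
        union(a, c)

    # Count faces per component root.
    counts = {}
    for face in faces:
        root = find(face[0])
        counts[root] = counts.get(root, 0) + 1

    return [face for face in faces if counts[find(face[0])] >= min_faces]
```

For example, a stray triangle floating away from the main surface forms its own one-face component and is filtered out with `min_faces=2`.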

Image to 4D

  • Input image can be used to generate 4D asset
  • 4D asset shares same semantics as input image
  • Images provided by Nichol et al. (2022b) used for Image-to-3D task
  • Method can generate depth and motion from input image

Discussion

  • Creating 3D content is difficult with current tools
  • Difficult to generate dynamic scenes compared to images and videos
  • MAV3D uses diffusion models and dynamic NeRFs to integrate world knowledge into 3D temporal representations
  • MAV3D expands the functionality of previously established diffusion-based models
  • MAV3D has limitations, such as inefficient conversion of dynamic NeRFs to a sequence of disjoint meshes
  • MAV3D uses Gaussian annealing to encourage density in the center of the scene
  • Adam optimizer with cosine decay scheduler used for training
  • Soft binary cross entropy regularization added to encourage model to make harder predictions
  • MAV3D uses dynamic camera trajectory to simulate real camera motion
  • Comparison with baselines and ablation study conducted
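
Two of the training details listed above, the cosine decay schedule and a regularizer that pushes predictions toward hard 0/1 values, can be sketched as follows. The learning-rate bounds are placeholder values, and the regularizer is written as the binary entropy of a predicted opacity, which is one common way to encourage hard predictions; the exact form used in the paper is not stated in this summary.

```python
import math

def cosine_lr(step, total_steps, lr_max=1e-3, lr_min=1e-5):
    """Cosine decay from lr_max at step 0 to lr_min at total_steps."""
    frac = min(max(step / total_steps, 0.0), 1.0)
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + math.cos(math.pi * frac))

def binary_entropy(p, eps=1e-6):
    """Entropy of a Bernoulli(p) variable; largest at p = 0.5, near zero at 0 or 1.

    Adding this as a penalty on rendered opacity discourages semi-transparent
    values (assumed form of the regularizer).
    """
    p = min(max(p, eps), 1.0 - eps)
    return -(p * math.log(p) + (1.0 - p) * math.log(1.0 - p))
```

In a training loop, `cosine_lr(step, total_steps)` would be passed to the optimizer each step, and the mean of `binary_entropy` over rendered opacities would be added to the loss with a small weight.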