Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

MAV3D is a method for generating 3D dynamic scenes from text descriptions
MAV3D uses a 4D dynamic Neural Radiance Field (NeRF)
MAV3D does not require 3D or 4D data
MAV3D is trained on Text-Image pairs and unlabeled videos
MAV3D is the first to generate 3D dynamic scenes given a text description

Paper Content

Introduction

Generative models can now generate realistic images from natural language prompts
Generative models have been extended to videos and 3D shapes
MAV3D combines the benefits of video and 3D generative models
MAV3D takes natural language description and outputs dynamic 3D scene
No readily available collection of 4D models with textual annotations
MAV3D uses video generator as ‘statistical’ multi-camera setup
MAV3D uses Neural Radiance Field (NeRF) to represent dynamic 3D scenes
MAV3D uses multi-stage training pipeline for dynamic scene rendering
MAV3D uses temporal-aware SDS loss and motion regularizers
MAV3D uses temporal-aware super-resolution fine-tuning for higher resolution outputs

Neural rendering uses neural networks to represent 3D scenes
Recent work has improved efficiency by incorporating 3D data structures
Aim to generate dynamic scenes which can be viewed from any angle
Generating 3D scenes from text dates back decades
Recent improvements in diffusion models have led to advanced image synthesis
Video generator is based on Make-A-Video (MAV)

Method

Goal is to develop a method to produce a dynamic 3D scene from a natural-language description
Use a pretrained text-to-video (T2V) diffusion model as a scene prior
Given a text prompt, fit a 4D scene representation
Render a sequence of images from the 4D scene representation
Pass the text prompt and the video to a pretrained T2V diffusion model
Use Score Distillation Sampling (SDS) to compute an update direction for the scene parameters

4d scene representation

Neural rendering is used to represent a dynamic 3D scene implicitly
Rays are cast through the camera plane into the scene and points are sampled along the ray
Volume density and color are computed for each point
MLP is used to output the color
HexPlane is used to represent the 4D scene
MLP is used to predict volume density and color
Background model simulates a large static sphere surrounding the dynamic foreground

Dynamic scene optimization

HexPlane model used to match textual prompt
Temporal Score Distillation Sampling (SDS-T) introduced as an extension of SDS
Loss computed and applied to MAV3D
Pretrained conditional video generator based on diffusion
Update direction for scene parameters θ computed using SDS
Multi-stage static-to-dynamic optimization scheme used
Gaussian Annealing and Total Variation Loss used as regularizers

Super-resolution fine-tuning

4D scene representation is supervised via low-resolution 64x64 renderings
Rendering higher-resolution videos from the learned model can lack detail and exhibit artifacts
SRFT uses pretrained and frozen video super-resolution module SR t l
SR t l inputs a high-resolution noisy 256x256 video and a clean 64x64 low-res video
SR t l is used to improve high resolution renderings from 4D scene model
SRFT trains jointly using SDS from SR t l and SDS-T

Experiments

MAV3D evaluates dynamic scenes from text descriptions
Three alternative methods developed as baselines
Evaluates simplified versions of model on sub-tasks of T2V and Text-To-3D
Comprehensive ablation study to justify method’s design
Conversion of dynamic NeRFs into dynamic meshes

Results

Text-to-4D comparison:
Text-to-3D comparison:
Text-to-Video comparison:

Ablation study

Human raters prefer model trained with SR for quality, text alignment and motion
SR fine-tuning enhances quality of rendered videos
Model trained without static scene pre-training has lower scene quality and poor convergence
Dynamic camera variant has less motion and suffers from multi-face object
Gaussian annealing leads to renderings with larger and more realistic motion
HexPlane is slightly preferred in terms of overall quality and realistic motion
Instant-NGP is significantly less preferred

Real-time rendering

HexPlane model can be converted to animated meshes
Marching cube algorithm is used to extract a simplicial mesh
Mesh decimation and removal of small noisy connected components
XATLAS algorithm is used to map mesh vertices to a texture atlas
Texture is initialized using HexPlane colors
Texture is further optimized to better match example frames
Collection of texture meshes can be played back in 3D engine

Image to 4d

Input image can be used to generate 4D asset
4D asset shares same semantics as input image
Images provided by Nichol et al. (2022b) used for Image-to-3D task
Method can generate depth and motion from input image

Discussion

Creating 3D content is difficult with current tools
Difficult to generate dynamic scenes compared to images and videos
MAV3D uses diffusion models and dynamic NeRFs to integrate world knowledge into 3D temporal representations
MAV3D expands the functionality of previously established diffusion-based models
MAV3D has limitations, such as inefficient conversion of dynamic NeRFs to a sequence of disjoint meshes
MAV3D uses Gaussian annealing to encourage density in the center of the scene
Adam optimizer with cosine decay scheduler used for training
Soft binary cross entropy regularization added to encourage model to make harder predictions
MAV3D uses dynamic camera trajectory to simulate real camera motion
Comparison with baselines and ablation study conducted

Link to paper#

Abstract#

Paper Content#

Introduction#

Related work#

Method#

4d scene representation#

Dynamic scene optimization#

Super-resolution fine-tuning#

Experiments#

Results#

Ablation study#

Real-time rendering#

Image to 4d#

Discussion#

Link to paper

Abstract

Paper Content

Introduction

Related work

Method

4d scene representation

Dynamic scene optimization

Super-resolution fine-tuning

Experiments

Results

Ablation study

Real-time rendering

Image to 4d

Discussion