Link to paper
The full paper is available here.
You can also find the paper on PapersWithCode here.
Abstract
- MAV3D is a method for generating 3D dynamic scenes from text descriptions
- MAV3D uses a 4D dynamic Neural Radiance Field (NeRF)
- MAV3D does not require 3D or 4D data
- MAV3D is trained on Text-Image pairs and unlabeled videos
- MAV3D is the first to generate 3D dynamic scenes given a text description
Paper Content
Introduction
- Generative models can now generate realistic images from natural language prompts
- Generative models have been extended to videos and 3D shapes
- MAV3D combines the benefits of video and 3D generative models
- MAV3D takes a natural-language description and outputs a dynamic 3D scene
- No readily available collection of 4D models with textual annotations
- MAV3D uses video generator as ‘statistical’ multi-camera setup
- MAV3D uses Neural Radiance Field (NeRF) to represent dynamic 3D scenes
- MAV3D uses multi-stage training pipeline for dynamic scene rendering
- MAV3D uses temporal-aware SDS loss and motion regularizers
- MAV3D uses temporal-aware super-resolution fine-tuning for higher resolution outputs
Related work
- Neural rendering uses neural networks to represent 3D scenes
- Recent work has improved efficiency by incorporating 3D data structures
- Aim to generate dynamic scenes which can be viewed from any angle
- Generating 3D scenes from text dates back decades
- Recent improvements in diffusion models have led to advanced image synthesis
- Video generator is based on Make-A-Video (MAV)
Method
- Goal is to develop a method to produce a dynamic 3D scene from a natural-language description
- Use a pretrained text-to-video (T2V) diffusion model as a scene prior
- Given a text prompt, fit a 4D scene representation
- Render a sequence of images from the 4D scene representation
- Pass the text prompt and the video to a pretrained T2V diffusion model
- Use Score Distillation Sampling (SDS) to compute an update direction for the scene parameters
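The SDS step in the list above can be sketched in a few lines. This is a toy illustration, not the paper's implementation: the "video" is a flat list of pixel values that doubles as the scene parameters (so the render Jacobian is the identity), and `toy_denoiser` is a hypothetical stand-in for the pretrained T2V model that simply assumes the clean video is a constant gray value. The names `sds_update_direction`, `toy_denoiser`, `alpha_bar`, and `weight` are all assumptions made for this example.

```python
import math
import random


def sds_update_direction(video, denoiser, alpha_bar, weight=1.0, rng=None):
    """Return the SDS direction weight * (eps_hat - eps) for a flat list of pixels."""
    rng = rng or random.Random(0)
    # Sample the Gaussian noise that is added to the rendered video.
    eps = [rng.gauss(0.0, 1.0) for _ in video]
    a, s = math.sqrt(alpha_bar), math.sqrt(1.0 - alpha_bar)
    noisy = [a * x + s * e for x, e in zip(video, eps)]
    # Ask the (frozen) denoiser which noise it thinks was added.
    eps_hat = denoiser(noisy, alpha_bar)
    # The difference between predicted and true noise is the update direction.
    return [weight * (eh - e) for eh, e in zip(eps_hat, eps)]


def toy_denoiser(noisy, alpha_bar, target=0.5):
    """Hypothetical stand-in: estimates noise as if the clean video were all `target`."""
    a, s = math.sqrt(alpha_bar), math.sqrt(1.0 - alpha_bar)
    return [(x_t - a * target) / s for x_t in noisy]


# Usage: repeated small steps pull every pixel toward the denoiser's preferred value.
rng = random.Random(0)
video = [0.0, 1.0, 0.25]
for _ in range(100):
    direction = sds_update_direction(video, toy_denoiser, alpha_bar=0.5, rng=rng)
    video = [x - 0.1 * d for x, d in zip(video, direction)]
```

In the actual method the direction would be backpropagated through the renderer to the 4D scene parameters rather than applied to pixels directly.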
4D scene representation
- Neural rendering is used to represent a dynamic 3D scene implicitly
- Rays are cast through the camera plane into the scene and points are sampled along the ray
- Volume density and color are computed for each point
- MLP is used to output the color
- HexPlane is used to represent the 4D scene
- MLP is used to predict volume density and color
- Background model simulates a large static sphere surrounding the dynamic foreground
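The ray-casting bullets above follow the usual NeRF compositing rule, which can be sketched as below. This is a minimal sketch under stated assumptions: the per-sample densities and colors are passed in directly, whereas in MAV3D they would come from the HexPlane features and the MLP. The function name `composite_ray` and its arguments are hypothetical.

```python
import math


def composite_ray(densities, colors, deltas):
    """Alpha-composite samples along one ray, ordered from near to far.

    densities: volume density at each sample
    colors:    (r, g, b) tuple at each sample
    deltas:    distance between consecutive samples
    Returns the pixel color and the transmittance left after the last sample.
    """
    transmittance = 1.0  # fraction of light that has not been absorbed yet
    pixel = [0.0, 0.0, 0.0]
    for sigma, rgb, delta in zip(densities, colors, deltas):
        alpha = 1.0 - math.exp(-sigma * delta)  # opacity of this segment
        weight = transmittance * alpha          # contribution of this sample
        pixel = [p + weight * c for p, c in zip(pixel, rgb)]
        transmittance *= 1.0 - alpha
    return pixel, transmittance


# Usage: two samples along a ray, a faint red one in front of a denser blue one.
pixel, leftover = composite_ray(
    densities=[0.5, 4.0],
    colors=[(1.0, 0.0, 0.0), (0.0, 0.0, 1.0)],
    deltas=[0.1, 0.1],
)
```

The leftover transmittance is the share of the pixel that a background model, such as the static sphere mentioned above, would fill in.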
Dynamic scene optimization
- HexPlane model used to match textual prompt
- Temporal Score Distillation Sampling (SDS-T) introduced as an extension of SDS
- SDS-T loss is computed on videos rendered from the scene, and its gradient updates the MAV3D scene parameters
- Pretrained conditional video generator based on diffusion
- Update direction for scene parameters θ computed using SDS
- Multi-stage static-to-dynamic optimization scheme used
- Gaussian Annealing and Total Variation Loss used as regularizers
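To make the total variation regularizer above concrete, here is a minimal sketch on a single 2D feature plane. It assumes an L1 (absolute-difference) variant and a plane stored as a list of rows; which variant the paper uses and which planes it is applied to are not stated in this summary. The name `total_variation` is hypothetical.

```python
def total_variation(plane):
    """Sum of absolute differences between vertically and horizontally adjacent cells."""
    rows, cols = len(plane), len(plane[0])
    tv = 0.0
    for i in range(rows):
        for j in range(cols):
            if i + 1 < rows:  # neighbor below
                tv += abs(plane[i + 1][j] - plane[i][j])
            if j + 1 < cols:  # neighbor to the right
                tv += abs(plane[i][j + 1] - plane[i][j])
    return tv


# Usage: a smooth plane is penalized less than a noisy one.
smooth = [[0.0, 0.1], [0.1, 0.2]]
noisy = [[0.0, 1.0], [1.0, 0.0]]
penalty_gap = total_variation(noisy) - total_variation(smooth)
```

Adding such a term to the loss discourages abrupt changes between neighboring grid cells, which is the usual motivation for using it as a smoothness regularizer.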
Super-resolution fine-tuning
- 4D scene representation is supervised via low-resolution 64x64 renderings
- Rendering higher-resolution videos from the learned model can lack detail and exhibit artifacts
- SRFT uses a pretrained and frozen video super-resolution module, SR_l^t
- SR_l^t takes a noisy high-resolution 256x256 video and a clean low-resolution 64x64 video as input
- SR_l^t is used to improve high-resolution renderings from the 4D scene model
- SRFT trains jointly using SDS from SR_l^t and SDS-T
Experiments
- MAV3D is evaluated on generating dynamic scenes from text descriptions
- Three alternative methods developed as baselines
- Evaluates simplified versions of model on sub-tasks of T2V and Text-To-3D
- Comprehensive ablation study to justify method’s design
- Conversion of dynamic NeRFs into dynamic meshes
Results
- Text-to-4D comparison:
- Text-to-3D comparison:
- Text-to-Video comparison:
Ablation study
- Human raters prefer model trained with SR for quality, text alignment and motion
- SR fine-tuning enhances quality of rendered videos
- Model trained without static scene pre-training has lower scene quality and poor convergence
- Dynamic camera variant has less motion and suffers from multi-face object artifacts
- Gaussian annealing leads to renderings with larger and more realistic motion
- HexPlane is slightly preferred in terms of overall quality and realistic motion
- Instant-NGP is significantly less preferred
Real-time rendering
- HexPlane model can be converted to animated meshes
- Marching cubes algorithm is used to extract a simplicial mesh
- Mesh decimation and removal of small noisy connected components
- XATLAS algorithm is used to map mesh vertices to a texture atlas
- Texture is initialized using HexPlane colors
- Texture is further optimized to better match example frames
- Collection of textured meshes can be played back in a 3D engine
Image to 4D
- Input image can be used to generate 4D asset
- 4D asset shares same semantics as input image
- Images provided by Nichol et al. (2022b) used for Image-to-3D task
- Method can generate depth and motion from input image
Discussion
- Creating 3D content is difficult with current tools
- Difficult to generate dynamic scenes compared to images and videos
- MAV3D uses diffusion models and dynamic NeRFs to integrate world knowledge into 3D temporal representations
- MAV3D expands the functionality of previously established diffusion-based models
- MAV3D has limitations, such as inefficient conversion of dynamic NeRFs to a sequence of disjoint meshes
- MAV3D uses Gaussian annealing to encourage density in the center of the scene
- Adam optimizer with cosine decay scheduler used for training
- Soft binary cross entropy regularization added to encourage model to make harder predictions
- MAV3D uses dynamic camera trajectory to simulate real camera motion
- Comparison with baselines and ablation study conducted
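The cosine decay scheduler mentioned in the list above can be sketched as follows. This is a generic formulation, not the paper's configuration: the learning rates and step count are placeholder values, and the name `cosine_decay_lr` is hypothetical.

```python
import math


def cosine_decay_lr(step, total_steps, base_lr, min_lr=0.0):
    """Learning rate that falls from base_lr to min_lr along half a cosine wave."""
    progress = min(max(step / total_steps, 0.0), 1.0)  # clamp to [0, 1]
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


# Usage: placeholder schedule over 1000 steps.
schedule = [cosine_decay_lr(s, 1000, base_lr=1e-2, min_lr=1e-4) for s in range(1001)]
```

The rate changes slowly at the start and end of training and fastest in the middle, which is why this shape is a common pairing with Adam.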