Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

Proposed method for editing NeRF scenes with text-instructions
Uses an image-conditioned diffusion model (InstructPix2Pix)
Iteratively edits input images while optimizing underlying scene
Results in optimized 3D scene that respects edit instruction
Able to edit large-scale, real-world scenes
More realistic, targeted edits than prior work

Paper Content

Introduction

Capturing a realistic digital representation of a real-world 3D scene is easy
Captured 3D content is replacing traditional processes of manually-generated assets
Tools for editing 3D assets are underdeveloped
Text instructions can be used to edit 3D scenes
2D diffusion model is used to extract shape and appearance priors

NeRFs are a popular approach for generating photorealistic novel views of a scene
Editing NeRFs is a challenge
Physics-based inductive biases can be used to enable changes in materials or scene lighting
Bounding boxes can be used to allow easy compositing of different objects and spatial manipulations
Cli-mateNeRF extracts rough geometry from a NeRF and uses physical simulation to apply weather changes
Most physically-based edits revolve around changing physical properties of the reconstructed scene
Recent works have explored artistic 3D stylization of NeRFs
EditNeRF explores editing NeRFs by manipulating latent codes learned from object categories
ClipNeRF and NeRF-Art extend this line of work by encouraging similarity between CLIP embeddings of the scene and a short text prompt
Recent progress in pre-trained large-scale models has enabled rapid progress in the domain of generating 3D content from scratch
Instruction-based 2D image-conditioned diffusion model enables purely language-based interface for 3D editing

Method

Takes as input a reconstructed NeRF scene, source data, and a natural-language editing instruction
Outputs an edited version of the NeRF and input images using a diffusion model and NeRF training

Background

Neural radiance fields (NeRFs) are a way to represent and render a 3D scene.
NeRFs are optimized using camera parameters and pixel colors from captured images.
InstructPix2Pix is a diffusion-based method that gradually transforms a noisy sample towards a modeled data distribution.
InstructPix2Pix is based on a latent diffusion model, meaning the variables are all latent images created by encoding an RGB image.
To produce an RGB image from the diffusion model, one must decode the predicted latents.

Instruct-nerf2nerf

Given a reconstructed NeRF scene and a text instruction, an edit instruction is produced
An alternating update scheme is used to update the training dataset images
A diffusion model (InstructPix2Pix) is used to edit each dataset image
A noised version of the current render is used as input
An iterative process is used to gradually update dataset images and refine the reconstructed NeRF
The diffusion model is conditioned on the un-edited images
Images are updated sequentially in a random ordering
NeRF updates sample a set of random rays from the entire training dataset
Iterative DU is a variant of the score distillation sampling (SDS) loss

Implementation details

Used ’nerfacto’ model from NeRFStudio
Parameters determine strength and consistency of updates
Values for [t min , t max ] = [0.02, 0.98] define amount of noise
Sampled denoised image with 20 denoising steps
Default values of s I = 1.5 and s T = 7.5, or user can hand-tune guidance weight
Figure 5 shows varying degrees of scene edits
Update one image at a time, d = 1 and n = 10
Used L1 and LPIPS losses for NeRF training

Results

Experiments conducted on real scenes optimized using Nerfstudio
Variety of scenes with varying complexity
Scenes captured using smartphone and mirrorless camera
Camera poses extracted using COLMAP or PolyCam
Dataset size ranges from 50-300 images
Evaluated through qualitative and quantitative evaluations
Compared against ablative baselines and NeRF-Art

Qualitative evaluation

Editing 3D scenes is possible with our approach
We can achieve a range of edits from global to locally specific
We can add contextual elements and dress the person
We can turn portraits into notable figures and fictional characters
We can also apply edits to large-scale scenes
We validate our design choices by comparing to different variants
We provide quantitative metrics to evaluate alignment of edits to text
We compare to concurrent work NeRF-Art
Limitations include not always being able to perform desired edit and producing inconsistent edits in 2D

Quantitative evaluation

Editing is a subjective task, so qualitative evaluation is used.
Auxiliary quantitative metrics are reported over 10 total edits across two scenes.
Metrics measure alignment of 3D edit with text instruction and temporal consistency of edit across views.

Limitations

Inability to perform large spatial manipulations
Uses a diffusion model on a single view at a time
May suffer from artifacts such as double faces on added objects
Examples of two types of failure cases shown in Figure 9

Conclusion

We introduced Instruct-NeRF2NeRF, a method for 3D scene editing using natural text instructions
We operate on pre-captured NeRF scenes, ensuring 3D-consistency
We showed results on a variety of captured NeRF scenes and demonstrated its ability to accomplish a wide range of edits
We use the ’nerfacto’ model from NeRFStudio to obtain a NeRF reconstruction
We use InstructPix2Pix to specify edits, inheriting its parameter values
We use a diffusion model as guidance, which produces a collection of temporally varying images
We select an iteration at which to terminate optimization and visualize the edited scene
We use a CLIP Directional Score and a CLIP Direction Consistency Score to measure the edit
We note that the edited NeRF scenes often contain slightly blurrier textures
We attribute this to the effects of the autoencoder
We compare with CLIP-based method NeRF-Art

Link to paper#

Abstract#

Paper Content#

Introduction#

Related work#

Method#

Background#

Instruct-nerf2nerf#

Implementation details#

Results#

Qualitative evaluation#

Quantitative evaluation#

Limitations#

Conclusion#

Link to paper

Abstract

Paper Content

Introduction

Related work

Method

Background

Instruct-nerf2nerf

Implementation details

Results

Qualitative evaluation

Quantitative evaluation

Limitations

Conclusion