Link to paper
The full paper is available here.
You can also find the paper on PapersWithCode here.
- Proposes a method for editing NeRF scenes with text instructions
- Uses an image-conditioned diffusion model (InstructPix2Pix)
- Iteratively edits input images while optimizing underlying scene
- Results in optimized 3D scene that respects edit instruction
- Able to edit large-scale, real-world scenes
- More realistic, targeted edits than prior work
- Efficient neural 3D reconstruction has made it easy to capture a realistic digital representation of a real-world 3D scene
- Captured 3D content is replacing traditional, manually created assets
- Tools for editing 3D assets are underdeveloped
- Text instructions can be used to edit 3D scenes
- 2D diffusion model is used to extract shape and appearance priors
- NeRFs are a popular approach for generating photorealistic novel views of a scene
- Editing NeRFs is a challenge
- Physics-based inductive biases can be used to enable changes in materials or scene lighting
- Bounding boxes can be used to allow easy compositing of different objects and spatial manipulations
- ClimateNeRF extracts rough geometry from a NeRF and uses physical simulation to apply weather changes
- Most physically-based edits revolve around changing physical properties of the reconstructed scene
- Recent works have explored artistic 3D stylization of NeRFs
- EditNeRF explores editing NeRFs by manipulating latent codes learned from object categories
- CLIP-NeRF and NeRF-Art extend this line of work by encouraging similarity between CLIP embeddings of the scene and a short text prompt
- Recent progress in pre-trained large-scale models has enabled rapid progress in the domain of generating 3D content from scratch
- Instruction-based 2D image-conditioned diffusion model enables purely language-based interface for 3D editing
- Takes as input a reconstructed NeRF scene, its source data (the captured images and their camera parameters), and a natural-language editing instruction
- Outputs an edited version of the NeRF, along with edited versions of the input images, by combining a diffusion model with NeRF training
- Neural radiance fields (NeRFs) are a way to represent and render a 3D scene.
- NeRFs are optimized using camera parameters and pixel colors from captured images.
- InstructPix2Pix is a diffusion-based image-editing method; diffusion models learn to gradually transform a noisy sample towards a modeled data distribution.
- InstructPix2Pix is based on a latent diffusion model, meaning the diffusion process runs on latent images created by encoding an RGB image.
- To produce an RGB image from the diffusion model, one must decode the predicted latents.
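As a rough illustration of how a NeRF turns its scene representation into a pixel color, the sketch below alpha-composites density and color samples along a single camera ray using the standard volume-rendering quadrature. The function name `composite_ray` and its argument layout are illustrative assumptions and are not taken from the paper or from Nerfstudio.

```python
import math

def composite_ray(densities, colors, deltas):
    """Alpha-composite samples along one ray (hypothetical helper).

    densities: non-negative volume density at each sample
    colors:    (r, g, b) tuple at each sample
    deltas:    distance covered by each sample along the ray
    """
    transmittance = 1.0  # fraction of light not yet absorbed along the ray
    pixel = [0.0, 0.0, 0.0]
    weights = []
    for sigma, rgb, delta in zip(densities, colors, deltas):
        alpha = 1.0 - math.exp(-sigma * delta)  # opacity of this segment
        weight = transmittance * alpha          # contribution of this sample
        weights.append(weight)
        for c in range(3):
            pixel[c] += weight * rgb[c]
        transmittance *= 1.0 - alpha            # light left for later samples
    return tuple(pixel), weights

# Empty space followed by a fully opaque green sample.
pixel, weights = composite_ray(
    densities=[0.0, 1e9],
    colors=[(1.0, 0.0, 0.0), (0.0, 1.0, 0.0)],
    deltas=[1.0, 1.0],
)
```

During optimization, a pixel rendered this way is compared with the pixel color of the captured image seen from the same camera, which is how camera parameters and pixel colors supervise the scene.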
- Given a reconstructed NeRF scene and a text instruction, the method produces an edited NeRF that follows the instruction
- An alternating update scheme is used to update the training dataset images
- A diffusion model (InstructPix2Pix) is used to edit each dataset image
- A noised version of the current render is used as input
- An iterative process is used to gradually update dataset images and refine the reconstructed NeRF
- The diffusion model is conditioned on the original, unedited captured images
- Images are updated sequentially in a random ordering
- NeRF updates sample a set of random rays from the entire training dataset
- The Iterative Dataset Update (Iterative DU) can be interpreted as a variant of the score distillation sampling (SDS) loss
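The alternating scheme described above can be sketched as a short loop. This is a minimal sketch under stated assumptions, not the authors' implementation: `render_fn`, `edit_fn`, and `train_step_fn` are hypothetical callables standing in for NeRF rendering, an InstructPix2Pix edit, and one NeRF optimization step, and the toy "scene" at the bottom is a single brightness value chosen only so the loop runs end to end.

```python
import random

def iterative_dataset_update(original_images, render_fn, edit_fn, train_step_fn,
                             num_rounds, d=1, n=10, t_min=0.02, t_max=0.98, seed=0):
    """Alternate d image edits with n NeRF training steps (illustrative sketch)."""
    rng = random.Random(seed)
    dataset = list(original_images)   # training set starts as the captured images
    order = list(range(len(dataset)))
    rng.shuffle(order)                # random view ordering, fixed up front
    cursor = 0
    for _ in range(num_rounds):
        for _ in range(d):            # image updates
            view = order[cursor % len(order)]
            cursor += 1
            t = rng.uniform(t_min, t_max)   # noise level for this edit
            current_render = render_fn(view)
            # The edit starts from the (noised) current render and is
            # conditioned on the original, unedited capture of this view.
            dataset[view] = edit_fn(current_render, original_images[view], t)
        for _ in range(n):            # NeRF updates on the mixed dataset
            train_step_fn(dataset)
    return dataset

# Toy stand-ins (assumptions for illustration only).
scene = {"value": 0.0}

def render(view):
    return scene["value"]

def edit(render_value, original, t):
    # Pretend the instruction is "make it brighter": pull towards 1.0,
    # more strongly at higher noise levels. The toy ignores `original`.
    return render_value + t * (1.0 - render_value)

def train_step(dataset):
    target = sum(dataset) / len(dataset)
    scene["value"] += 0.1 * (target - scene["value"])

edited = iterative_dataset_update([0.0] * 5, render, edit, train_step, num_rounds=200)
```

Because each training step draws on the whole dataset, the supervision mixes edited and not-yet-edited images early on, and the scene drifts towards the edit gradually rather than in one step.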
- Used the "nerfacto" model from Nerfstudio
- Parameters determine strength and consistency of updates
- The range [t_min, t_max] = [0.02, 0.98] defines the amount of noise added before each edit
- Sampled denoised image with 20 denoising steps
- Default guidance weights are s_I = 1.5 (image) and s_T = 7.5 (text), or the user can hand-tune the guidance weights
- Figure 5 shows varying degrees of scene edits
- Updates one image at a time: d = 1 image edit per dataset update, followed by n = 10 NeRF training iterations
- Used L1 and LPIPS losses for NeRF training
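To make the roles of s_I and s_T concrete, the sketch below combines three noise predictions using the two-scale classifier-free guidance rule from InstructPix2Pix: one unconditional, one conditioned on the image only, and one conditioned on both image and text. The function name `guided_noise` and the use of plain Python lists in place of latent tensors are assumptions made for illustration.

```python
def guided_noise(eps_uncond, eps_image, eps_full, s_image=1.5, s_text=7.5):
    """Two-scale classifier-free guidance (illustrative sketch).

    eps_uncond: prediction with neither image nor text conditioning
    eps_image:  prediction conditioned on the input image only
    eps_full:   prediction conditioned on both image and instruction
    """
    return [
        u + s_image * (i - u) + s_text * (f - i)
        for u, i, f in zip(eps_uncond, eps_image, eps_full)
    ]

# With both scales at 1 the rule reduces to the fully conditioned prediction.
plain = guided_noise([0.0], [1.0], [2.0], s_image=1.0, s_text=1.0)
# With the default weights the text direction is amplified the most.
default = guided_noise([0.0], [1.0], [2.0])
```

Raising s_image keeps the result closer to the conditioning image, while raising s_text pushes harder towards the instruction, which is why hand-tuning the two weights changes how strong an edit looks.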
- Experiments conducted on real scenes optimized using Nerfstudio
- Variety of scenes with varying complexity
- Scenes captured using smartphone and mirrorless camera
- Camera poses extracted using COLMAP or PolyCam
- Dataset size ranges from 50 to 300 images per scene
- Evaluated through qualitative and quantitative evaluations
- Compared against ablative baselines and NeRF-Art
- Editing 3D scenes is possible with our approach
- We can achieve a range of edits from global to locally specific
- We can add contextual elements and dress the person
- We can turn portraits into notable figures and fictional characters
- We can also apply edits to large-scale scenes
- We validate our design choices by comparing to different variants
- We provide quantitative metrics to evaluate alignment of edits to text
- We compare to concurrent work NeRF-Art
- Limitations include not always being able to perform desired edit and producing inconsistent edits in 2D
- Editing is a subjective task, so qualitative evaluation is used.
- Auxiliary quantitative metrics are reported over 10 total edits across two scenes.
- Metrics measure alignment of 3D edit with text instruction and temporal consistency of edit across views.
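The two kinds of metric can be sketched on precomputed embedding vectors. This is one plausible reading rather than the paper's exact definition: the text-alignment score compares the image-space change with the text-space change, and the consistency score compares the original-to-edited change across adjacent frames. The function names are hypothetical, and producing the CLIP embeddings themselves is outside the sketch.

```python
import math

def _sub(a, b):
    return [x - y for x, y in zip(a, b)]

def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return sum(x * y for x, y in zip(a, b)) / (norm_a * norm_b)

def direction_similarity(img_orig, img_edit, txt_orig, txt_edit):
    """Does the image change point the same way as the caption change?"""
    return cosine(_sub(img_edit, img_orig), _sub(txt_edit, txt_orig))

def direction_consistency(orig_frames, edit_frames):
    """Mean agreement of the edit direction between adjacent frames
    (assumed formulation, see the lead-in)."""
    directions = [_sub(e, o) for o, e in zip(orig_frames, edit_frames)]
    pairs = list(zip(directions[:-1], directions[1:]))
    if not pairs:
        return 0.0
    return sum(cosine(a, b) for a, b in pairs) / len(pairs)

# Tiny 2-D stand-ins for embeddings.
aligned = direction_similarity([0.0, 0.0], [1.0, 0.0], [0.0, 0.0], [2.0, 0.0])
flicker = direction_consistency([[0.0, 0.0], [0.0, 0.0]], [[1.0, 0.0], [0.0, 1.0]])
```

A score near 1 means the change agrees with the instruction (or with the neighbouring frame), while a score near 0 indicates an unrelated or view-inconsistent change.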
- Inability to perform large spatial manipulations
- Uses a diffusion model on a single view at a time
- May suffer from artifacts such as double faces on added objects
- Examples of two types of failure cases shown in Figure 9
- We introduced Instruct-NeRF2NeRF, a method for 3D scene editing using natural text instructions
- We operate on pre-captured NeRF scenes, ensuring 3D-consistency
- We showed results on a variety of captured NeRF scenes and demonstrated its ability to accomplish a wide range of edits
- We use the "nerfacto" model from Nerfstudio to obtain a NeRF reconstruction
- We use InstructPix2Pix to specify edits, inheriting its parameter values
- We use a diffusion model as guidance, which produces a collection of temporally varying images
- We select an iteration at which to terminate optimization and visualize the edited scene
- We use a CLIP Directional Score and a CLIP Direction Consistency Score to measure the edit
- We note that the edited NeRF scenes often contain slightly blurrier textures
- We attribute this to the effects of the autoencoder used by the latent diffusion model
- We compare with CLIP-based method NeRF-Art