Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.


  • Proposed method for editing NeRF scenes with text-instructions
  • Uses an image-conditioned diffusion model (InstructPix2Pix)
  • Iteratively edits input images while optimizing underlying scene
  • Results in optimized 3D scene that respects edit instruction
  • Able to edit large-scale, real-world scenes
  • More realistic, targeted edits than prior work

Paper Content


  • Capturing a realistic digital representation of a real-world 3D scene is easy
  • Captured 3D content is replacing traditional processes of manually-generated assets
  • Tools for editing 3D assets are underdeveloped
  • Text instructions can be used to edit 3D scenes
  • 2D diffusion model is used to extract shape and appearance priors
  • NeRFs are a popular approach for generating photorealistic novel views of a scene
  • Editing NeRFs is a challenge
  • Physics-based inductive biases can be used to enable changes in materials or scene lighting
  • Bounding boxes can be used to allow easy compositing of different objects and spatial manipulations
  • Cli-mateNeRF extracts rough geometry from a NeRF and uses physical simulation to apply weather changes
  • Most physically-based edits revolve around changing physical properties of the reconstructed scene
  • Recent works have explored artistic 3D stylization of NeRFs
  • EditNeRF explores editing NeRFs by manipulating latent codes learned from object categories
  • ClipNeRF and NeRF-Art extend this line of work by encouraging similarity between CLIP embeddings of the scene and a short text prompt
  • Recent progress in pre-trained large-scale models has enabled rapid progress in the domain of generating 3D content from scratch
  • Instruction-based 2D image-conditioned diffusion model enables purely language-based interface for 3D editing


  • Takes as input a reconstructed NeRF scene, source data, and a natural-language editing instruction
  • Outputs an edited version of the NeRF and input images using a diffusion model and NeRF training


  • Neural radiance fields (NeRFs) are a way to represent and render a 3D scene.
  • NeRFs are optimized using camera parameters and pixel colors from captured images.
  • InstructPix2Pix is a diffusion-based method that gradually transforms a noisy sample towards a modeled data distribution.
  • InstructPix2Pix is based on a latent diffusion model, meaning the variables are all latent images created by encoding an RGB image.
  • To produce an RGB image from the diffusion model, one must decode the predicted latents.


  • Given a reconstructed NeRF scene and a text instruction, an edit instruction is produced
  • An alternating update scheme is used to update the training dataset images
  • A diffusion model (InstructPix2Pix) is used to edit each dataset image
  • A noised version of the current render is used as input
  • An iterative process is used to gradually update dataset images and refine the reconstructed NeRF
  • The diffusion model is conditioned on the un-edited images
  • Images are updated sequentially in a random ordering
  • NeRF updates sample a set of random rays from the entire training dataset
  • Iterative DU is a variant of the score distillation sampling (SDS) loss

Implementation details

  • Used ’nerfacto’ model from NeRFStudio
  • Parameters determine strength and consistency of updates
  • Values for [t min , t max ] = [0.02, 0.98] define amount of noise
  • Sampled denoised image with 20 denoising steps
  • Default values of s I = 1.5 and s T = 7.5, or user can hand-tune guidance weight
  • Figure 5 shows varying degrees of scene edits
  • Update one image at a time, d = 1 and n = 10
  • Used L1 and LPIPS losses for NeRF training


  • Experiments conducted on real scenes optimized using Nerfstudio
  • Variety of scenes with varying complexity
  • Scenes captured using smartphone and mirrorless camera
  • Camera poses extracted using COLMAP or PolyCam
  • Dataset size ranges from 50-300 images
  • Evaluated through qualitative and quantitative evaluations
  • Compared against ablative baselines and NeRF-Art

Qualitative evaluation

  • Editing 3D scenes is possible with our approach
  • We can achieve a range of edits from global to locally specific
  • We can add contextual elements and dress the person
  • We can turn portraits into notable figures and fictional characters
  • We can also apply edits to large-scale scenes
  • We validate our design choices by comparing to different variants
  • We provide quantitative metrics to evaluate alignment of edits to text
  • We compare to concurrent work NeRF-Art
  • Limitations include not always being able to perform desired edit and producing inconsistent edits in 2D

Quantitative evaluation

  • Editing is a subjective task, so qualitative evaluation is used.
  • Auxiliary quantitative metrics are reported over 10 total edits across two scenes.
  • Metrics measure alignment of 3D edit with text instruction and temporal consistency of edit across views.


  • Inability to perform large spatial manipulations
  • Uses a diffusion model on a single view at a time
  • May suffer from artifacts such as double faces on added objects
  • Examples of two types of failure cases shown in Figure 9


  • We introduced Instruct-NeRF2NeRF, a method for 3D scene editing using natural text instructions
  • We operate on pre-captured NeRF scenes, ensuring 3D-consistency
  • We showed results on a variety of captured NeRF scenes and demonstrated its ability to accomplish a wide range of edits
  • We use the ’nerfacto’ model from NeRFStudio to obtain a NeRF reconstruction
  • We use InstructPix2Pix to specify edits, inheriting its parameter values
  • We use a diffusion model as guidance, which produces a collection of temporally varying images
  • We select an iteration at which to terminate optimization and visualize the edited scene
  • We use a CLIP Directional Score and a CLIP Direction Consistency Score to measure the edit
  • We note that the edited NeRF scenes often contain slightly blurrier textures
  • We attribute this to the effects of the autoencoder
  • We compare with CLIP-based method NeRF-Art