Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

  • Recently, there has been much attention on stylized view generation of scenes captured casually using a camera.
  • Previous work captures the geometry and appearance of the scene as neural point sets or neural radiance fields.
  • StyleTRF is a compact, quick-to-optimize strategy for stylized view generation using TensoRF.
  • StyleTRF decouples style-adaption from view capture and is much faster than previous methods.
  • StyleTRF shows state-of-the-art results on several scenes.

Paper Content

Abstract

  • Recently, there has been much attention on stylized view generation of scenes captured casually using a camera.
  • Previous work typically captures the geometry and appearance of the scene as neural point sets or neural radiance fields.

Introduction

  • Stylizing a content image based on a reference style image has been of interest to the computer science community.
  • 3D content generation has grown with the development of 3D visual devices.
  • Style transfer on 3D scenes has applications in augmented reality (AR) and virtual reality (VR).
  • A pipeline diagram presents an overview of the strategy employed by the work.
  • Several efforts have been made to stylize image content.
  • Stylizing videos is a harder problem as it needs temporal consistency.
  • Combining stylization with new view generation takes the game one step further.
  • TensoRF is used to represent the geometry and appearance of the scene.
  • Johnson et al. [20] is used for stylization.
  • TensoRF’s appearance is fine-tuned for a few iterations to obtain a stylized scene representation.
  • Stylizing images has been studied in the vision community for a long time
  • Gatys et al. proposed an optimization method to match content and transfer style
  • Johnson et al. proposed a feed-forward architecture to produce stylized content in real-time
  • Several works have addressed the issue of style-dependent training
  • Svoboda et al. decomposed an image into style and content codes
  • Temporal stylization of videos requires temporal-consistent stylization
  • Various methods have tried incorporating temporal-consistent losses
  • Huang et al. introduced consistent 3D point cloud stylization
  • Wang et al. relaxed the objective function to stylize video content
  • Cao et al. proposed stylizing geometry represented as 3D-point-cloud
  • NPBG used U-Net-based architecture to fill gaps in geometry
  • Huang et al. built on NPBG to change the appearance
  • NeRF learns radiance fields using MLPs and positionally encoded input features
  • PlenOxels proposed voxel grid with density and appearance associated
  • TensoRF proposed VM-decomposition to reduce memory footprint
  • Chiang et al. extended NeRF to generate stylized novel views
  • StylizedNeRF optimized 2D-3D consistency to stylize radiance fields
  • SNeRF incrementally stylized novel views using Gatys et al.
  • SNeRF training cost is 4-5 days

Method

  • Aim is to render stylized novel views of a scene
  • Use TensoRF to fine-tune appearance of scene

Preprocessing stylization module

  • Uses Johnson et al. [20] stylization method to produce content in real time
  • Training per-style takes 20 minutes
  • Inference produces 30 images/sec
  • Johnson et al. [20] chosen over Gatys et al. [10] for stable stylization
  • Stylization Module training depicted in Fig. 2

Scene representation using tensorf

  • Radiance fields are used to generate a novel view
  • TensoRF is used as a representation of Radiance Fields
  • TensoRF uses a Tensor decomposition known as VM-decomposition
  • 1 norm loss and total variation (TV) loss are used for regularization

Stylizing tensorf representation

  • Optimize radiance fields to encode geometry and radiance of scene
  • Render sparse set of 20-30 novel views in a simple trajectory
  • Use pre-trained Stylization Module to stylize novel renders

Stylizing

  • Utilize sparse set of stylized novel views generated using Johnson et al. module
  • Optimize appearance vectors of TensoRF, freezing density
  • Fine-tuning takes 40 secs
  • Obtain geometric scene represented as Tensorial Radiance fields
  • Render stylized novel views with consistent appearance
  • Rendering of each image takes 4-5 seconds

Implementation details 4.1 optimizing tensorf

  • Training/optimization of radiance fields requires camera pose information
  • COLMAP used for real scenes, Blender for synthetic scenes
  • TensoRF optimized for 15 iterations, 4096 rays shot into voxel grid
  • Adam optimizer used, learning rate of 0.02
  • Voxel grid resolution of 128 3, upsampled every 1000 iterations until 640 3 for real scenes, 300 3 for synthetic scenes
  • Bi-linear + linear interpolation used during upsampling and evaluation of ray query
  • Pre-optimization phase takes 10-15 minutes

Stylization

  • Trained stylization module Johnson et al. [20] on COCO-14 Dataset [24]
  • Trained multiple models for each desired style
  • Used stylized priors from previous phase
  • Optimized only appearance parameters for 1 iteration
  • Style adaption took 40-50 seconds
  • Generated novel stylized views using voxel rendering techniques, each view took 4-5 seconds

Experiments

  • Experiments conducted on workstation PC with AMD Ryzen-5800x and NVidia RTX-3090 GPU
  • Discussion of choice of stylization module used to stylize sparse prior
  • Comparison of temporal-stylization of smooth trajectories of ground truth 3D content vs Actual 3D-Stylization
  • Showing effect of different optimization strategies used to stylize 3D content

Stylization module

  • Johnson et al. [20] is used to stylize the sparse prior required by the method.
  • Gatys et al. [10] is used by SNeRF [29] to iteratively stylize the radiance fields.
  • Johnson et al. [20] is chosen over Gatys et al. [10] because it is more reliable and leads to stable stylization across close-by views.
  • Johnson et al. [20] uses a fixed CNN-based architecture to infer the stylized image.

Video stylization v/s 3d stylization

  • Generating novel views on a camera trajectory can be used instead of stylizing 3D content.
  • ReReVST fails to fully capture the style information.
  • StylizedNeRF and the method presented in the paper capture better style than ReReVST.

Optimization strategies for style adaptation

  • Three strategies for stylizing Radiance fields are presented: S1, S2, and S3
  • S1 produces geometric artifacts
  • S2 produces better results than S1 but has fuzzy geometry
  • S3 produces crispier results than S1 and S2
  • Good stylization is obtained when 30-40 randomly sampled camera positions are used

Results

  • Compared results against 3D-stylization techniques
  • Compared results against temporally-consistent stylization techniques
  • Compared results both quantitatively and qualitatively

Qualitative results

  • We compare qualitative results using smooth trajectories
  • We compare with 3D-Stylization and temporally-consistent stylization techniques
  • Our method captures different colors and preserves geometry
  • StylizedNeRF produces blurry geometry and noisy results
  • ReReVST lacks proper capture of style information
  • We show better short & long-term consistency compared to StylizedNeRF
  • Our method produces multi-view consistent stylized content

Quantitative results

  • Generated a smooth trajectory to evaluate consistency
  • Compared method with temporal-consistent and 3D-stylization methods
  • Calculated optical flow and pixel-averaged 2 loss
  • Reported metrics in Table 1
  • Chose short-term consistency of 1 and long-term of 5

Conclusions

  • Presented StyleTRF, a compact and quick-to-optimize stylization technique
  • Can generate stylized novel views of a scene
  • Efficiently and faithfully incorporate style into a radiance field representation
  • Qualitatively and quantitatively compared with previous stylization methods
  • Better short and long-term consistency metrics compared to 3D stylization methods
  • StyleTRF concentrates on geometric preserved fast-style adaptation
  • StyleTRF produces consistent stylization across views
  • Training time of โ‰ˆ 20 mins compared to at least 12 hrs for other methods