Link to paper
The full paper is available here.
You can also find the paper on PapersWithCode here.
Abstract
- Recently, there has been much attention on stylized view generation of scenes captured casually using a camera.
- Previous work captures the geometry and appearance of the scene as neural point sets or neural radiance fields.
- StyleTRF is a compact, quick-to-optimize strategy for stylized view generation using TensoRF.
- StyleTRF decouples style-adaption from view capture and is much faster than previous methods.
- StyleTRF shows state-of-the-art results on several scenes.
Paper Content
Introduction
- Stylizing a content image based on a reference style image has been of interest to the computer science community.
- 3D content generation has grown with the development of 3D visual devices.
- Style transfer on 3D scenes has applications in augmented reality (AR) and virtual reality (VR).
- A pipeline diagram presents an overview of the strategy employed by the work.
- Several efforts have been made to stylize image content.
- Stylizing videos is a harder problem as it needs temporal consistency.
- Combining stylization with novel view generation makes the problem harder still.
- TensoRF is used to represent the geometry and appearance of the scene.
- Johnson et al. [20] is used for stylization.
- TensoRF’s appearance is fine-tuned for a few iterations to obtain a stylized scene representation.
Related work
- Stylizing images has been studied in the vision community for a long time
- Gatys et al. proposed an optimization method to match content and transfer style
- Johnson et al. proposed a feed-forward architecture to produce stylized content in real-time
- Several works have addressed the issue of style-dependent training
- Svoboda et al. decomposed an image into style and content codes
- Stylizing videos requires the stylization to be temporally consistent across frames
- Various methods have tried incorporating temporal-consistency losses
- Huang et al. introduced consistent 3D point cloud stylization
- Wang et al. relaxed the objective function to stylize video content
- Cao et al. proposed stylizing geometry represented as a 3D point cloud
- NPBG used a U-Net-based architecture to fill gaps in geometry
- Huang et al. built on NPBG to change the appearance
- NeRF learns radiance fields using MLPs and positionally encoded input features
- Plenoxels proposed a voxel grid with density and appearance stored at each voxel
- TensoRF proposed VM-decomposition to reduce memory footprint
- Chiang et al. extended NeRF to generate stylized novel views
- StylizedNeRF optimized 2D-3D consistency to stylize radiance fields
- SNeRF incrementally stylized novel views using Gatys et al.
- SNeRF training cost is 4-5 days
Method
- Aim is to render stylized novel views of a scene
- Use TensoRF to fine-tune appearance of scene
Preprocessing stylization module
- Uses Johnson et al. [20] stylization method to produce content in real time
- Training per-style takes 20 minutes
- Inference produces 30 images/sec
- Johnson et al. [20] chosen over Gatys et al. [10] for stable stylization
- Stylization Module training depicted in Fig. 2
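As background for the stylization module: Johnson et al. train their feed-forward network with perceptual losses, one of which compares Gram matrices of feature maps from the output and the style image. The summary above does not give the loss, so the following is only a minimal sketch of that style term on plain Python lists. The feature extractor is omitted, and the names `gram_matrix` and `style_loss` are illustrative, not taken from the paper.

```python
def gram_matrix(features):
    """Gram matrix of a feature map.

    features: list of C channels, each a flat list of H*W activations.
    Returns a C x C matrix of channel inner products, divided by C*H*W.
    """
    c = len(features)
    n = len(features[0])
    norm = float(c * n)
    return [
        [sum(a * b for a, b in zip(features[i], features[j])) / norm
         for j in range(c)]
        for i in range(c)
    ]


def style_loss(feat_out, feat_style):
    """Squared Frobenius distance between the two Gram matrices."""
    g_out = gram_matrix(feat_out)
    g_sty = gram_matrix(feat_style)
    return sum(
        (x - y) ** 2
        for row_out, row_sty in zip(g_out, g_sty)
        for x, y in zip(row_out, row_sty)
    )
```

In practice the features would come from several layers of a pre-trained CNN and the per-layer losses would be summed; that part is left out here.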
Scene representation using TensoRF
- Radiance fields are used to generate a novel view
- TensoRF is used as a representation of Radiance Fields
- TensoRF uses a Tensor decomposition known as VM-decomposition
- L1-norm loss and total variation (TV) loss are used for regularization
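To make the VM-decomposition and the two regularizers concrete, here is a small sketch in plain Python. It evaluates a VM-decomposed 3D tensor at an integer grid index as a sum of vector-times-matrix terms along each of the three axes, and computes an L1 penalty and a 1D total variation penalty on a list of factor values. The data layout (`vecs`, `mats`), the function names, and the exact form of the penalties are assumptions made for illustration; the summary does not state which factors are regularized or with what weights.

```python
def vm_value(vecs, mats, i, j, k):
    """Evaluate a VM-decomposed 3D tensor at integer index (i, j, k).

    vecs[m] is a list of vector factors along axis m (0=X, 1=Y, 2=Z).
    mats[m] is the list of matching matrix factors over the two other axes,
    indexed in increasing axis order (for m=0 the matrix is indexed [y][z]).
    """
    idx = (i, j, k)
    total = 0.0
    for m in range(3):
        a, b = [axis for axis in range(3) if axis != m]
        for vec, mat in zip(vecs[m], mats[m]):
            total += vec[idx[m]] * mat[idx[a]][idx[b]]
    return total


def l1_penalty(values):
    """Mean absolute value of the factor entries (encourages sparsity)."""
    return sum(abs(v) for v in values) / len(values)


def tv_penalty(values):
    """Sum of squared differences between neighbouring entries (encourages smoothness)."""
    return sum((b - a) ** 2 for a, b in zip(values, values[1:]))
```

The point of the factorization is storage: for a grid of side N with R components per axis, the factors hold on the order of 3R(N + N^2) numbers instead of N^3.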
Stylizing TensoRF representation
- Optimize radiance fields to encode geometry and radiance of scene
- Render sparse set of 20-30 novel views in a simple trajectory
- Use pre-trained Stylization Module to stylize novel renders
Stylizing
- Utilize sparse set of stylized novel views generated using Johnson et al. module
- Optimize appearance vectors of TensoRF, freezing density
- Fine-tuning takes 40 secs
- Obtain geometric scene represented as Tensorial Radiance fields
- Render stylized novel views with consistent appearance
- Rendering of each image takes 4-5 seconds
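The core idea of this step, updating appearance while density stays frozen, can be illustrated with a toy example. The sketch below is not TensoRF's renderer: it assumes a hypothetical per-point renderer where the output is opacity (derived from density) times a stored color, and it fits the colors to stylized targets by plain gradient descent on a squared error. All names, the learning rate, and the step count are illustrative assumptions.

```python
import math


def render(density, appearance):
    """Toy renderer: opacity 1 - exp(-density) times the stored color."""
    return [(1.0 - math.exp(-d)) * c for d, c in zip(density, appearance)]


def finetune_appearance(density, appearance, target, lr=0.5, steps=200):
    """Fit appearance to stylized targets; density is read but never updated."""
    appearance = list(appearance)  # work on a copy
    for _ in range(steps):
        pred = render(density, appearance)
        for n, (d, p, t) in enumerate(zip(density, pred, target)):
            alpha = 1.0 - math.exp(-d)
            grad = 2.0 * (p - t) * alpha  # derivative of (alpha*c - t)^2 w.r.t. c
            appearance[n] -= lr * grad
    return appearance
```

Because only the colors change, the geometry encoded by the density is untouched, which is what lets the stylized scene keep the structure learned in the first phase.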
Implementation details
Optimizing TensoRF
- Training/optimization of radiance fields requires camera pose information
- COLMAP used for real scenes, Blender for synthetic scenes
- TensoRF optimized for 15 iterations, 4096 rays shot into voxel grid
- Adam optimizer used, learning rate of 0.02
- Voxel grid resolution starts at 128^3 and is upsampled every 1000 iterations until it reaches 640^3 for real scenes and 300^3 for synthetic scenes
- Bi-linear + linear interpolation used during upsampling and evaluation of ray query
- Pre-optimization phase takes 10-15 minutes
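A coarse-to-fine resolution schedule like the one described above could be sketched as follows. The summary gives the start resolution, the final resolutions, and the 1000-iteration interval, but not the number of upsampling steps or the intermediate sizes. Both are assumptions here: `num_upsamples` is a hypothetical parameter, and the intermediate resolutions are interpolated geometrically between start and final.

```python
import math


def resolution_schedule(start=128, final=640, num_upsamples=5, every=1000):
    """Map iteration number -> grid side length after upsampling at that iteration.

    Sizes are interpolated in log space between `start` and `final`.
    """
    schedule = {}
    for step in range(1, num_upsamples + 1):
        frac = step / num_upsamples
        log_res = (1.0 - frac) * math.log(start) + frac * math.log(final)
        schedule[step * every] = round(math.exp(log_res))
    return schedule
```

Calling `resolution_schedule(final=300)` gives the corresponding sketch for synthetic scenes under the same assumptions.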
Stylization
- Trained stylization module Johnson et al. [20] on COCO-14 Dataset [24]
- Trained multiple models for each desired style
- Used stylized priors from previous phase
- Optimized only appearance parameters for 1 iteration
- Style adaption took 40-50 seconds
- Generated novel stylized views using voxel rendering techniques, each view took 4-5 seconds
Experiments
- Experiments conducted on workstation PC with AMD Ryzen-5800x and NVidia RTX-3090 GPU
- Discussion of choice of stylization module used to stylize sparse prior
- Comparison of temporal-stylization of smooth trajectories of ground truth 3D content vs Actual 3D-Stylization
- Showing effect of different optimization strategies used to stylize 3D content
Stylization module
- Johnson et al. [20] is used to stylize the sparse prior required by the method.
- Gatys et al. [10] is used by SNeRF [29] to iteratively stylize the radiance fields.
- Johnson et al. [20] is chosen over Gatys et al. [10] because it is more reliable and leads to stable stylization across close-by views.
- Johnson et al. [20] uses a fixed CNN-based architecture to infer the stylized image.
Video stylization vs. 3D stylization
- Generating novel views on a camera trajectory can be used instead of stylizing 3D content.
- ReReVST fails to fully capture the style information.
- StylizedNeRF and the method presented in the paper capture better style than ReReVST.
Optimization strategies for style adaptation
- Three strategies for stylizing Radiance fields are presented: S1, S2, and S3
- S1 produces geometric artifacts
- S2 produces better results than S1 but has fuzzy geometry
- S3 produces crisper results than S1 and S2
- Good stylization is obtained when 30-40 randomly sampled camera positions are used
Results
- Compared results against 3D-stylization techniques
- Compared results against temporally-consistent stylization techniques
- Compared results both quantitatively and qualitatively
Qualitative results
- We compare qualitative results using smooth trajectories
- We compare with 3D-Stylization and temporally-consistent stylization techniques
- Our method captures different colors and preserves geometry
- StylizedNeRF produces blurry geometry and noisy results
- ReReVST lacks proper capture of style information
- We show better short & long-term consistency compared to StylizedNeRF
- Our method produces multi-view consistent stylized content
Quantitative results
- Generated a smooth trajectory to evaluate consistency
- Compared method with temporal-consistent and 3D-stylization methods
- Calculated optical flow and pixel-averaged L2 loss
- Reported metrics in Table 1
- Chose short-term consistency of 1 and long-term of 5
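A consistency metric of this kind can be sketched as: warp one frame onto another using the optical flow between them, then average the squared difference over valid pixels. The version below is a simplified illustration on grayscale images stored as nested lists, with integer flow vectors and no occlusion handling; a real evaluation would use sub-pixel flow with interpolation and an occlusion mask. The function names and the flow convention (each pixel of the second frame stores the offset to its source pixel in the first) are assumptions.

```python
def warp(frame, flow):
    """Backward warp: out[y][x] = frame[y + dy][x + dx], or None if out of bounds."""
    h, w = len(frame), len(frame[0])
    out = []
    for y in range(h):
        row = []
        for x in range(w):
            dx, dy = flow[y][x]
            sx, sy = x + dx, y + dy
            inside = 0 <= sx < w and 0 <= sy < h
            row.append(frame[sy][sx] if inside else None)
        out.append(row)
    return out


def consistency_error(frame_a, frame_b, flow_b_to_a):
    """Pixel-averaged squared error between frame_b and frame_a warped onto it."""
    warped = warp(frame_a, flow_b_to_a)
    sq_sum, count = 0.0, 0
    for row_w, row_b in zip(warped, frame_b):
        for pw, pb in zip(row_w, row_b):
            if pw is not None:  # skip pixels whose source fell outside the frame
                sq_sum += (pw - pb) ** 2
                count += 1
    return sq_sum / count if count else 0.0
```

Short-term and long-term scores would then come from applying this to frame pairs at the two offsets listed above, presumably 1 and 5 frames apart along the trajectory.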
Conclusions
- Presented StyleTRF, a compact and quick-to-optimize stylization technique
- Can generate stylized novel views of a scene
- Efficiently and faithfully incorporate style into a radiance field representation
- Qualitatively and quantitatively compared with previous stylization methods
- Better short and long-term consistency metrics compared to 3D stylization methods
- StyleTRF focuses on fast, geometry-preserving style adaptation
- StyleTRF produces consistent stylization across views
- Training time of ≈ 20 mins compared to at least 12 hrs for other methods