Link to paper
The full paper is available here.
You can also find the paper on PapersWithCode here.
Abstract
- Diffusion models have been successful on single-image super-resolution and other image-to-image translation tasks.
- Diffusion models have not outperformed GAN models on blind super-resolution tasks.
- This paper introduces SR3+, a diffusion-based model for blind super-resolution, establishing a new state-of-the-art.
- SR3+ uses self-supervised training, noise-conditioning augmentation, a large-scale convolutional architecture, and large-scale datasets.
- SR3+ outperforms SR3 and Real-ESRGAN, with a DRealSR FID score of 32.37.
Paper Content
Introduction
- Diffusion models are a powerful class of generative models used for text-to-image synthesis and image-to-image translation tasks.
- Self-supervised diffusion models have been used for single-image super-resolution tasks.
- SR3 falls short on out-of-distribution data.
- An ablation study demonstrates the benefits of parametric degradations and noise conditioning augmentation.
- Larger models and larger datasets improve SR3+ performance.
Background on diffusion models
- Generative diffusion models are trained to learn a data distribution and generate samples from the model.
- A Gaussian forward process gradually adds noise to the data, with a signal scale that decreases monotonically as the time variable t goes from 0 to 1.
- A reweighted evidence lower bound is used as a loss function.
- The neural network learns to infer the additive noise.
- The denoising neural network is repurposed into a generative model.
- For single-image super-resolution, conditional diffusion models are used.
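In symbols, the bullets above can be written roughly as follows. The notation is assumed here for illustration: x is the HR image, c the LR conditioning image, α_t and σ_t the signal and noise scales, and ε_θ the denoising network. The per-timestep weighting of the loss is omitted.

```latex
q(z_t \mid x) = \mathcal{N}\!\left(z_t;\ \alpha_t x,\ \sigma_t^2 I\right),
\qquad t \in [0, 1], \quad \alpha_t \text{ monotonically decreasing in } t

z_t = \alpha_t x + \sigma_t \epsilon, \qquad \epsilon \sim \mathcal{N}(0, I)

L(\theta) = \mathbb{E}_{x,\, c,\, t,\, \epsilon}
  \left\| \epsilon_\theta(z_t, c, t) - \epsilon \right\|_2^2
```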
Related work
- Two approaches to blind super-resolution: explicit and implicit
- Implicit requires large datasets to generalize well
- Best results use explicit degradation modeling
- Degradation scheme used by Real-ESRGAN model
- Other methods for super-resolution include diffusion models and non-generative models
- SRCNN showed superiority of deep convolutional neural networks
- Later innovations made it possible to build deeper neural networks
- Contrastive learning and attention-based networks have been proposed
- Image-conditional diffusion models have been shown to be superior to regression-based models
Methodology
- SR3+ is a self-supervised model for single-image super-resolution.
- SR3+ is a convolutional variant of SR3, allowing for flexibility in image resolution and aspect ratio.
- LR-HR image pairs are generated by downsampling high-resolution images.
- Robustness is achieved through two augmentations: composite parametric degradations and noise conditioning augmentation.
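A minimal sketch of how a self-supervised LR-HR pair could be built from a single HR image. Average pooling and nearest-neighbour up-sampling are simplified stand-ins for the real down-sampling and bicubic kernels, the function names are hypothetical, and the two robustness augmentations are left out.

```python
def downsample_mean(img, factor):
    """Average-pool by `factor` (stand-in for a real down-sampling kernel)."""
    h, w = len(img) // factor, len(img[0]) // factor
    return [[sum(img[y * factor + dy][x * factor + dx]
                 for dy in range(factor) for dx in range(factor)) / factor ** 2
             for x in range(w)]
            for y in range(h)]


def upsample_nearest(img, factor):
    """Nearest-neighbour up-sampling (stand-in for bicubic interpolation)."""
    return [[v for v in row for _ in range(factor)]
            for row in img for _ in range(factor)]


def center_crop(img, size):
    """Cut a centered size x size window out of a 2-D image."""
    top = (len(img) - size) // 2
    left = (len(img[0]) - size) // 2
    return [row[left:left + size] for row in img[top:top + size]]


def make_training_pair(hr, factor=4, crop=None):
    """Return (up-sampled LR input, HR target) derived from one HR image."""
    lr = downsample_mean(hr, factor)
    lr_up = upsample_nearest(lr, factor)
    if crop is not None:
        # The same centered window is taken from both images so they stay aligned.
        hr, lr_up = center_crop(hr, crop), center_crop(lr_up, crop)
    return lr_up, hr
```

With the sizes listed under Experiments, this would correspond to `make_training_pair(hr, factor=4, crop=256)` on a 400x400 HR crop.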
Architecture
- SR3+ uses a UNet architecture without self-attention layers
- Self-attention has a positive impact on image quality, but makes generalization difficult
- Modifications used by Saharia et al. (2022b) are adopted to improve training speed
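The flexibility of a purely convolutional network can be illustrated with a single layer: the same kernel applies to inputs of any height and width. This is a plain-Python sketch of one zero-padded convolution, not the SR3+ UNet, and `conv2d_same` is a hypothetical helper name.

```python
def conv2d_same(img, kernel):
    """Zero-padded 2-D cross-correlation that preserves height and width."""
    kh, kw = len(kernel), len(kernel[0])
    h, w = len(img), len(img[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            acc = 0.0
            for i in range(kh):
                for j in range(kw):
                    yy, xx = y + i - kh // 2, x + j - kw // 2
                    if 0 <= yy < h and 0 <= xx < w:  # pixels outside count as zero
                        acc += kernel[i][j] * img[yy][xx]
            out[y][x] = acc
    return out


# The same 3x3 weights work on a 3x5 input and on a 6x4 input.
sharpen = [[0.0, -1.0, 0.0], [-1.0, 5.0, -1.0], [0.0, -1.0, 0.0]]
wide = conv2d_same([[1.0] * 5 for _ in range(3)], sharpen)
tall = conv2d_same([[1.0] * 4 for _ in range(6)], sharpen)
```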
Higher-order degradations
- Self-supervision for super-resolution involves down-sampling HR images to create LR inputs.
- Combining down-sampling kernels with other degradations is ideal to avoid domain shift.
- SR3+ uses a data-augmentation pipeline with multiple types of degradation.
- Higher-order degradations have a substantial impact on OOD generalization.
- SR3+ uses the same degradation pipeline as Real-ESRGAN, but without additive noise.
- Noise conditioning augmentation is better than including noise in the degradation pipeline.
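The idea of a higher-order degradation pipeline can be sketched as a short sequence of randomly parameterized degradations applied more than once before the final down-sampling. The operations below (box blur, nearest-neighbour resize, gray-level quantization) and their parameter ranges are simplified stand-ins chosen for illustration, not the operations or settings used in the paper. In line with the bullets above, no additive noise is included.

```python
import random


def box_blur(img, radius):
    """Blur with a (2*radius+1)^2 box kernel; edge pixels are clamped."""
    h, w = len(img), len(img[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            total, count = 0.0, 0
            for dy in range(-radius, radius + 1):
                for dx in range(-radius, radius + 1):
                    yy = min(max(y + dy, 0), h - 1)
                    xx = min(max(x + dx, 0), w - 1)
                    total += img[yy][xx]
                    count += 1
            out[y][x] = total / count
    return out


def resize_nearest(img, new_h, new_w):
    """Nearest-neighbour resize to new_h x new_w."""
    h, w = len(img), len(img[0])
    return [[img[y * h // new_h][x * w // new_w] for x in range(new_w)]
            for y in range(new_h)]


def quantize(img, levels):
    """Crude stand-in for compression artifacts: round to `levels` gray levels."""
    return [[round(v * (levels - 1)) / (levels - 1) for v in row] for row in img]


def degrade(img, scale=4, order=2, rng=None):
    """Apply blur -> resize -> quantize `order` times, then down-sample by `scale`."""
    rng = rng or random.Random()
    h, w = len(img), len(img[0])
    out = img
    for _ in range(order):
        out = box_blur(out, rng.choice([0, 1, 2]))
        factor = rng.uniform(0.5, 1.0)  # assumed range, for illustration only
        out = resize_nearest(out,
                             max(1, int(len(out) * factor)),
                             max(1, int(len(out[0]) * factor)))
        out = quantize(out, rng.choice([8, 16, 32, 64]))
    return resize_nearest(out, h // scale, w // scale)
```

Setting `order=1` gives a first-order pipeline, which makes it easy to compare against the higher-order case in an ablation.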
Noise conditioning augmentation
- Noise conditioning augmentation was introduced for cascaded diffusion models, so that each super-resolution stage can be trained self-supervised with down-sampling.
- Noise conditioning augmentation provides robustness to the distribution of inputs from the previous stage.
- Noise-conditioning augmentation entails adding noise to the up-sampled LR input and providing the noise level to the neural denoiser.
- At test time, the noise level hyper-parameter provides a trade-off between alignment with the LR input and hallucination by the generative model.
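The mechanism described above can be sketched in a few lines: noise is added to the up-sampled LR input at a randomly drawn level during training, and that level is returned so it can be passed to the denoiser. The cosine schedule, the function names, and the upper bound `t_max` are assumptions made for this example, not values confirmed by the summary.

```python
import math
import random


def noise_scales(t):
    """Assumed cosine schedule with alpha^2 + sigma^2 = 1."""
    return math.cos(0.5 * math.pi * t), math.sin(0.5 * math.pi * t)


def noise_condition(lr_up, t, rng):
    """Corrupt the (flattened) up-sampled LR input at noise level t."""
    alpha, sigma = noise_scales(t)
    return [alpha * v + sigma * rng.gauss(0.0, 1.0) for v in lr_up]


def training_conditioning(lr_up, t_max, rng):
    """Draw a random noise level in [0, t_max] and return (noisy input, level)."""
    t = rng.uniform(0.0, t_max)
    return noise_condition(lr_up, t, rng), t


rng = random.Random(0)
noisy_lr, level = training_conditioning([0.1, 0.5, 0.9], t_max=0.5, rng=rng)
```

At test time the level is fixed rather than sampled, for example `noise_condition(lr_up, 0.1, rng)`. A level of 0 leaves the input unchanged, and larger levels hide more of the LR input from the model.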
Experiments
- SR3+ is trained with a combination of degradations and noise-conditioning augmentation
- SR3+ is evaluated zero-shot on the test data
- SR3+ is used for blind super-resolution with a 4x magnification factor
- Baselines used are SR3 and Real-ESRGAN
- LR input is up-sampled by 4x using bicubic interpolation
- Output samples for SR3 and SR3+ are obtained using DDPM ancestral sampling
- Training data includes DF2K+OST and a collection of in-house images
- 400x400 crop is extracted from each image and then degraded
- The degraded LR images are up-sampled back to 400x400, and center crops then yield 256x256 images
- Models are trained for 1.5M steps with batch sizes of 256 or 512
- Models are tested on RealSR and DRealSR datasets
- Performance is assessed with PSNR, SSIM and FID
- Generative models are hard to train and require larger datasets and larger models
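For readers unfamiliar with DDPM ancestral sampling, the loop below is a minimal sketch on flat lists of floats. It uses the standard Gaussian posterior step for a variance-preserving process. The cosine schedule, the step count, and the `denoiser(z, cond, t)` callable (which returns a noise estimate) are assumptions for illustration. Sampling starts slightly below t = 1 so that the signal scale never reaches zero.

```python
import math
import random


def noise_scales(t):
    """Assumed cosine schedule with alpha^2 + sigma^2 = 1."""
    return math.cos(0.5 * math.pi * t), math.sin(0.5 * math.pi * t)


def ancestral_sample(denoiser, cond, dim, num_steps=100, t_start=0.999, rng=None):
    """Start from Gaussian noise and step from t_start down to t = 0."""
    rng = rng or random.Random()
    times = [t_start * (1 - i / num_steps) for i in range(num_steps + 1)]
    z = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    for t, s in zip(times[:-1], times[1:]):
        alpha_t, sigma_t = noise_scales(t)
        alpha_s, sigma_s = noise_scales(s)
        eps_hat = denoiser(z, cond, t)
        # Estimate of the clean image implied by the predicted noise.
        x_hat = [(zi - sigma_t * ei) / alpha_t for zi, ei in zip(z, eps_hat)]
        # Parameters of the Gaussian posterior q(z_s | z_t, x = x_hat).
        alpha_ts = alpha_t / alpha_s
        var_ts = sigma_t ** 2 - alpha_ts ** 2 * sigma_s ** 2
        coef_z = alpha_ts * sigma_s ** 2 / sigma_t ** 2
        coef_x = alpha_s * var_ts / sigma_t ** 2
        std = math.sqrt(max(var_ts * sigma_s ** 2 / sigma_t ** 2, 0.0))
        z = [coef_z * zi + coef_x * xi + std * rng.gauss(0.0, 1.0)
             for zi, xi in zip(z, x_hat)]
    return z
```

In a super-resolution setting, `cond` would be the bicubically up-sampled LR input and `dim` the number of output pixels. Because the last step lands exactly on t = 0, its posterior variance is zero and the function returns the final clean-image estimate.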
Comparison with Real-ESRGAN and SR3
- Compared SR3+ models of different sizes with Real-ESRGAN
- Trained models on same data and on 61M-image dataset
- Performed grid sweep over t_eval from 0 to 0.4
- Reported results with t_eval = 0.1
- 40M-parameter SR3+ achieved competitive FID scores with Real-ESRGAN
- 400M-parameter SR3+ outperformed Real-ESRGAN on FID scores
- SR3+ does not outperform on reference-based metrics (PSNR, SSIM)
- Main contributions are higher-order degradation scheme and noise conditioning augmentation
- Ablation study showed that FID scores increased (worsened) significantly when either of the two main contributions was removed
- SR3+ excels at natural images
- Test-time noise conditioning augmentation can force the model to rely on its own knowledge to infer high-frequency details
- FID scores can drop (improve) when noise conditioning augmentation is applied at test time; the best value is often t_eval = 0.1
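The grid sweep over t_eval mentioned above could be organized as in this sketch. `score_fn` is a hypothetical callable that would run sampling at a given t_eval and return a lower-is-better metric such as FID. The evenly spaced grid and the number of points are assumptions, since the summary gives only the range 0 to 0.4.

```python
def sweep_t_eval(score_fn, t_min=0.0, t_max=0.4, num_points=9):
    """Score an evenly spaced grid of t_eval values and return (best_t, scores)."""
    grid = [t_min + (t_max - t_min) * i / (num_points - 1)
            for i in range(num_points)]
    scores = {t: score_fn(t) for t in grid}
    best_t = min(scores, key=scores.get)  # lower score is better
    return best_t, scores
```

The returned dictionary keeps every grid point, so the trade-off between fidelity to the LR input and hallucinated detail can be inspected across the whole range rather than only at the best value.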