Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

Diffusion models have been successful on single-image super-resolution and other image-to-image translation tasks.
Diffusion models have not outperformed GAN models on blind super-resolution tasks.
This paper introduces SR3+, a diffusion-based model for blind super-resolution, establishing a new state-of-the-art.
SR3+ uses self-supervised training, noise-conditioning augmentation, a large-scale convolutional architecture, and large-scale datasets.
SR3+ outperforms SR3 and Real-ESRGAN, with a DRealSR FID score of 32.37.

Diffusion models are a powerful class of generative models used for text-to-image synthesis and image-to-image translation tasks.
Self-supervised diffusion models have been used for single image super-resolution tasks.
SR3 falls short on out-of-distribution data.
An ablation study demonstrates the benefits of parametric degradations and noise conditioning augmentation.
Larger models and larger datasets lead to improved SR3+ performance.

Generative diffusion models are trained to learn a data distribution and generate samples from the model.
A Gaussian forward process is used, governed by a noise schedule whose signal coefficient is a monotonically decreasing function of time t over the range 0 to 1.
A reweighted evidence lower bound is used as a loss function.
The neural network learns to infer the additive noise.
The denoising neural network is repurposed into a generative model.
For single-image super-resolution, conditional diffusion models are used.
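
The forward process and the noise-prediction objective described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the cosine schedule, the function names, and the placeholder denoiser are all assumptions made for the example, and the loss shown is the common unweighted noise-prediction form rather than the exact reweighting the paper uses.

```python
import numpy as np

def cosine_schedule(t):
    """Assumed schedule: alpha_t = cos(pi*t/2), sigma_t = sin(pi*t/2).

    alpha_t decreases monotonically from 1 to 0 as t goes from 0 to 1.
    """
    return np.cos(0.5 * np.pi * t), np.sin(0.5 * np.pi * t)

def forward_diffuse(x, t, rng):
    """Sample z_t = alpha_t * x + sigma_t * eps from the Gaussian forward process."""
    alpha, sigma = cosine_schedule(t)
    eps = rng.standard_normal(x.shape)
    return alpha * x + sigma * eps, eps

def noise_prediction_loss(denoiser, x, c, rng):
    """One Monte Carlo sample of the noise-prediction loss.

    `denoiser(z_t, c, t)` is a hypothetical callable that predicts eps
    from the noisy image z_t, the conditioning image c, and the time t.
    """
    t = rng.uniform(0.0, 1.0)
    z_t, eps = forward_diffuse(x, t, rng)
    eps_hat = denoiser(z_t, c, t)
    return float(np.mean((eps_hat - eps) ** 2))

rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, size=(8, 8, 3))  # toy HR image
c = np.zeros_like(x)                         # toy conditioning (up-sampled LR image)
zero_denoiser = lambda z_t, c, t: np.zeros_like(z_t)  # placeholder, not a trained network
loss = noise_prediction_loss(zero_denoiser, x, c, rng)
```

In the conditional setting, the up-sampled LR image plays the role of `c` and is passed to the denoiser at every step.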

Two approaches to blind super-resolution: explicit and implicit
Implicit requires large datasets to generalize well
Best results use explicit degradation modeling
Degradation scheme used by Real-ESRGAN model
Other methods for super-resolution include diffusion models and non-generative models
SRCNN showed superiority of deep convolutional neural networks
Subsequent innovations made it possible to train deeper neural networks
Contrastive learning and attention-based networks have been proposed
Image-conditional diffusion models have been shown to be superior to regression-based models

SR3+ is a self-supervised model for single-image super-resolution.
SR3+ is a convolutional variant of SR3, allowing for flexibility in image resolution and aspect ratio.
LR-HR image pairs are generated by downsampling high-resolution images.
Robustness is achieved through two augmentations: composite parametric degradations and noise conditioning augmentation.

SR3+ uses a UNet architecture without self-attention layers
Self-attention has a positive impact on image quality, but makes generalization to other image resolutions and aspect ratios difficult
Modifications used by Saharia et al. (2022b) are adopted to improve training speed
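
A small sketch of why a purely convolutional network is flexible in resolution and aspect ratio: the same filter weights apply to an input of any height and width. The function name and the single 3x3 averaging filter are illustrative assumptions, not part of the SR3+ architecture.

```python
import numpy as np

def conv3x3_same(image, kernel):
    """Apply one 3x3 filter with zero padding; the output keeps the input's height and width."""
    h, w = image.shape
    padded = np.pad(image, 1)  # one pixel of zero padding on every side
    out = np.zeros((h, w))
    for dy in range(3):
        for dx in range(3):
            out += kernel[dy, dx] * padded[dy:dy + h, dx:dx + w]
    return out

kernel = np.full((3, 3), 1.0 / 9.0)                    # the same weights for every input
square = conv3x3_same(np.ones((64, 64)), kernel)       # square input
wide = conv3x3_same(np.ones((48, 96)), kernel)         # different size and aspect ratio
```

Nothing in the filter depends on the input size, which is the property that lets a convolutional model be trained on crops and applied to full images.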

Self-supervision for super-resolution involves down-sampling HR images to create LR inputs.
Combining down-sampling kernels with other degradations is ideal to avoid domain shift.
SR3+ uses a data-augmentation pipeline with multiple types of degradation.
Higher-order degradations have a substantial impact on OOD generalization.
SR3+ uses the same degradation pipeline as Real-ESRGAN, but without additive noise.
Noise conditioning augmentation is better than including noise in the degradation pipeline.
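
The idea of a higher-order degradation can be sketched as repeated rounds of random blur followed by down-sampling. This is a simplified illustration under stated assumptions: it handles a single-channel image, uses Gaussian blur with an arbitrarily chosen sigma range and a fixed 2x area down-sampling per round, and omits the other degradations of the real pipeline (such as JPEG compression). All function names are hypothetical.

```python
import numpy as np

def gaussian_kernel(size, sigma):
    """Normalized 2D Gaussian kernel of odd side length `size`."""
    ax = np.arange(size) - (size - 1) / 2.0
    g = np.exp(-0.5 * (ax / sigma) ** 2)
    k = np.outer(g, g)
    return k / k.sum()

def blur(image, kernel):
    """Same-size 2D filtering with edge (replicate) padding."""
    r = kernel.shape[0] // 2
    h, w = image.shape
    padded = np.pad(image, r, mode="edge")
    out = np.zeros((h, w))
    for dy in range(kernel.shape[0]):
        for dx in range(kernel.shape[1]):
            out += kernel[dy, dx] * padded[dy:dy + h, dx:dx + w]
    return out

def downsample(image, factor):
    """Area (average-pool) down-sampling by an integer factor."""
    h2, w2 = image.shape[0] // factor, image.shape[1] // factor
    trimmed = image[:h2 * factor, :w2 * factor]
    return trimmed.reshape(h2, factor, w2, factor).mean(axis=(1, 3))

def degrade(hr, rng, order=2):
    """Apply `order` rounds of random blur followed by 2x down-sampling."""
    lr = hr
    for _ in range(order):
        sigma = rng.uniform(0.2, 2.0)  # assumed range, chosen for illustration
        lr = downsample(blur(lr, gaussian_kernel(5, sigma)), 2)
    return lr

rng = np.random.default_rng(0)
hr = rng.uniform(0.0, 1.0, size=(64, 64))
lr = degrade(hr, rng)  # two 2x rounds give a 4x smaller image
```

Note that no additive noise appears in `degrade`; in SR3+ that role is taken over by noise conditioning augmentation.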

Noise conditioning was introduced for cascaded diffusion models, so that each super-resolution stage can be trained self-supervised with down-sampling.
Noise conditioning augmentation provides robustness to the distribution of inputs from the previous stage.
Noise-conditioning augmentation entails adding noise to the up-sampled LR input and providing the noise level to the neural denoiser.
At test time, the noise level hyper-parameter provides a trade-off between alignment with the LR input and hallucination by the generative model.
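
A minimal sketch of noise conditioning augmentation follows. It assumes the conditioning image is corrupted with the same kind of Gaussian process used for diffusion (here a cosine schedule) and that the training noise level is drawn uniformly up to a maximum; the schedule, the `t_max` value, and the function names are assumptions for illustration only.

```python
import numpy as np

def noise_condition(c, t, rng):
    """Corrupt the up-sampled LR image c at noise level t in [0, 1]."""
    alpha, sigma = np.cos(0.5 * np.pi * t), np.sin(0.5 * np.pi * t)
    return alpha * c + sigma * rng.standard_normal(c.shape)

def training_conditioning(c, rng, t_max=0.5):
    """Training: draw a random noise level; the denoiser receives both outputs."""
    t_aug = rng.uniform(0.0, t_max)  # t_max is an assumed hyper-parameter
    return noise_condition(c, t_aug, rng), t_aug

def eval_conditioning(c, rng, t_eval=0.1):
    """Test time: t_eval is fixed by the user.

    Small t_eval keeps the output aligned with the LR input;
    larger t_eval lets the model hallucinate more detail.
    """
    return noise_condition(c, t_eval, rng), t_eval

rng = np.random.default_rng(0)
c = rng.uniform(-1.0, 1.0, size=(16, 16, 3))  # toy up-sampled LR input
c_train, t_train = training_conditioning(c, rng)
c_test, t_test = eval_conditioning(c, rng)
```

Because the noise level is passed to the denoiser alongside the corrupted input, the model can learn how much to trust the conditioning signal at each level.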

SR3+ is trained with a combination of degradations and noise-conditioning augmentation
SR3+ is evaluated zero-shot on the test data
SR3+ is used for blind super-resolution with a 4x magnification factor
Baselines used are SR3 and Real-ESRGAN
LR input is up-sampled by 4x using bicubic interpolation
Output samples for SR3 and SR3+ are obtained using DDPM ancestral sampling
Training data includes DF2K+OST and a collection of in-house images
400x400 crop is extracted from each image and then degraded
LR images are up-sampled to 400x400 and center crops yield 256x256 images
Models are trained for 1.5M steps with batch sizes of 256 or 512
Models are tested on RealSR and DRealSR datasets
Performance is assessed with PSNR, SSIM and FID
Generative models are hard to learn and require larger data and models
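
The crop geometry of the training data and the PSNR metric can be sketched as follows. This is an illustration under assumptions: average pooling stands in for the degradation pipeline, nearest-neighbour up-sampling stands in for bicubic interpolation, the image is single-channel, and all function names are hypothetical.

```python
import numpy as np

def random_crop(image, size, rng):
    """Extract a random size x size crop."""
    h, w = image.shape[:2]
    top = rng.integers(0, h - size + 1)
    left = rng.integers(0, w - size + 1)
    return image[top:top + size, left:left + size]

def center_crop(image, size):
    """Extract the central size x size crop."""
    h, w = image.shape[:2]
    top, left = (h - size) // 2, (w - size) // 2
    return image[top:top + size, left:left + size]

def downsample4(image):
    """4x average pooling (stand-in for the real degradation pipeline)."""
    h, w = image.shape
    return image.reshape(h // 4, 4, w // 4, 4).mean(axis=(1, 3))

def upsample4_nearest(image):
    """4x nearest-neighbour up-sampling (stand-in for bicubic interpolation)."""
    return np.repeat(np.repeat(image, 4, axis=0), 4, axis=1)

def psnr(a, b, max_val=1.0):
    """Peak signal-to-noise ratio in dB for images scaled to [0, max_val]."""
    mse = np.mean((a - b) ** 2)
    return float("inf") if mse == 0 else float(10.0 * np.log10(max_val ** 2 / mse))

rng = np.random.default_rng(0)
photo = rng.uniform(0.0, 1.0, size=(512, 640))  # toy source image
hr_400 = random_crop(photo, 400, rng)           # 400x400 HR crop
lr_100 = downsample4(hr_400)                    # 100x100 degraded image
cond_400 = upsample4_nearest(lr_100)            # back to 400x400
hr_256 = center_crop(hr_400, 256)               # aligned 256x256 training pair
cond_256 = center_crop(cond_400, 256)
baseline_psnr = psnr(cond_256, hr_256)
```

Taking the center crop after up-sampling keeps the pair aligned and avoids border effects introduced by the degradations.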

Compared SR3+ models of different sizes with Real-ESRGAN
Trained models on same data and on 61M-image dataset
Performed grid sweep over t_eval from 0 to 0.4
Reported results with t_eval = 0.1
40M-parameter SR3+ achieved competitive FID scores with Real-ESRGAN
400M-parameter SR3+ outperformed Real-ESRGAN on FID scores
SR3+ does not outperform on reference-based metrics (PSNR, SSIM)
Main contributions are higher-order degradation scheme and noise conditioning augmentation
Ablation study showed FID scores increased (worsened) significantly upon removal of either of the main contributions
SR3+ excels at natural images
Test-time noise conditioning augmentation can force model to rely on own knowledge to infer high-frequency details
FID scores can drop (improve) when using noise conditioning augmentation at test time; the best value is often t_eval = 0.1
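
The grid sweep over t_eval can be sketched as a simple search for the value with the lowest FID. The grid spacing, the `sweep_t_eval` helper, and especially the `toy_fid` function are assumptions: `toy_fid` is a made-up stand-in for a real evaluation run and does not reproduce any number from the paper.

```python
import numpy as np

def sweep_t_eval(evaluate_fid, t_values):
    """Evaluate FID at each candidate t_eval and return the best one (lower is better)."""
    scores = {float(t): float(evaluate_fid(t)) for t in t_values}
    best_t = min(scores, key=scores.get)
    return best_t, scores

# Assumed grid: five evenly spaced values from 0 to 0.4.
t_grid = np.round(np.linspace(0.0, 0.4, 5), 2)

# Made-up placeholder for "sample with this t_eval, then compute FID".
toy_fid = lambda t: 40.0 + 100.0 * (t - 0.1) ** 2

best_t, scores = sweep_t_eval(toy_fid, t_grid)
```

In practice `evaluate_fid` would generate samples on the test set at the given noise level and compare them against the reference images.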