Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

  • Diffusion models have been successful on single-image super-resolution and other image-to-image translation tasks.
  • Diffusion models have not outperformed GAN models on blind super-resolution tasks.
  • This paper introduces SR3+, a diffusion-based model for blind super-resolution, establishing a new state-of-the-art.
  • SR3+ uses self-supervised training, noise-conditioning augmentation, a large-scale convolutional architecture, and large-scale datasets.
  • SR3+ outperforms SR3 and Real-ESRGAN, with a DRealSR FID score of 32.37.

Paper Content

Introduction

  • Diffusion models are a powerful class of generative models used for text-to-image synthesis and image-to-image translation tasks.
  • Self-supervised diffusion models have been used for single image super-resolution tasks.
  • SR3 falls short on out-of-distribution data.
  • An ablation study demonstrates the benefits of parametric degradations and noise conditioning augmentation.
  • Larger models and larger datasets lead to improved SR3+ performance.

Background on diffusion models

  • Generative diffusion models are trained to learn a data distribution and generate samples from the model.
  • A Gaussian forward process gradually corrupts the data, governed by a monotonically decreasing function of the time step over the range 0 to 1.
  • A reweighted evidence lower bound is used as a loss function.
  • The neural network learns to infer the additive noise.
  • The denoising neural network is repurposed into a generative model.
  • For single-image super-resolution, conditional diffusion models are used.
  • There are two approaches to blind super-resolution: explicit and implicit degradation modeling.
  • Implicit methods require large datasets to generalize well.
  • The best results to date use explicit degradation modeling.
  • A notable example is the degradation scheme used by the Real-ESRGAN model.
  • Other methods for super-resolution include diffusion models and non-generative models.
  • SRCNN showed the superiority of deep convolutional neural networks.
  • Subsequent architectural innovations made these networks deeper.
  • Contrastive learning and attention-based networks have also been proposed.
  • Image-conditional diffusion models have been shown to be superior to regression-based models.
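
The forward process and the noise-prediction objective described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the cosine schedule in alpha_sigma is an assumed variance-preserving choice, and the function names are hypothetical.

```python
import numpy as np

def alpha_sigma(t):
    """Illustrative variance-preserving schedule (an assumption, not the paper's)."""
    alpha = np.cos(0.5 * np.pi * t)  # decreases monotonically from 1 to 0 on [0, 1]
    sigma = np.sin(0.5 * np.pi * t)  # chosen so that alpha**2 + sigma**2 == 1
    return alpha, sigma

def forward_diffuse(x, t, rng):
    """Sample z_t = alpha_t * x + sigma_t * eps with eps ~ N(0, I)."""
    alpha, sigma = alpha_sigma(t)
    eps = rng.standard_normal(x.shape)
    return alpha * x + sigma * eps, eps

def noise_prediction_loss(eps_pred, eps):
    """Mean squared error between the predicted and the true additive noise."""
    return float(np.mean((eps_pred - eps) ** 2))

rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, size=(8, 8, 3))  # stand-in for an image in [-1, 1]
z_t, eps = forward_diffuse(x, t=0.5, rng=rng)
```

In training, a denoising network would receive z_t (plus the conditioning image for super-resolution) and its output would take the place of eps_pred in the loss.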

Methodology

  • SR3+ is a self-supervised model for single-image super-resolution.
  • SR3+ is a convolutional variant of SR3, allowing for flexibility in image resolution and aspect ratio.
  • LR-HR image pairs are generated by downsampling high-resolution images.
  • Robustness is achieved through two augmentations: composite parametric degradations and noise conditioning augmentation.

Architecture

  • SR3+ uses a UNet architecture without self-attention layers
  • Self-attention has a positive impact on image quality, but makes generalization difficult
  • Modifications used by Saharia et al. (2022b) are adopted to improve training speed

Higher-order degradations

  • Self-supervision for super-resolution involves down-sampling HR images to create LR inputs.
  • Combining down-sampling kernels with other degradations is ideal to avoid domain shift.
  • SR3+ uses a data-augmentation pipeline with multiple types of degradation.
  • Higher-order degradations have a substantial impact on OOD generalization.
  • SR3+ uses the same degradation pipeline as Real-ESRGAN, but without additive noise.
  • Noise conditioning augmentation is better than including noise in the degradation pipeline.
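
A second-order degradation pipeline of the kind summarized above might look like the sketch below. It is a simplified assumption: only Gaussian blur and block-average down-sampling are included, JPEG compression and other degradations are omitted, and the blur range and function names are hypothetical. No additive noise is applied, in line with the list above.

```python
import numpy as np

def gaussian_kernel(sigma, radius=3):
    """Normalized 1-D Gaussian kernel with 2 * radius + 1 taps."""
    ax = np.arange(-radius, radius + 1, dtype=np.float64)
    k = np.exp(-(ax ** 2) / (2.0 * sigma ** 2))
    return k / k.sum()

def blur(img, sigma):
    """Separable Gaussian blur on an (H, W) image, using edge padding."""
    k = gaussian_kernel(sigma)
    r = len(k) // 2
    padded = np.pad(img, r, mode="edge")
    # Filter along rows first, then along columns.
    rows = np.apply_along_axis(lambda v: np.convolve(v, k, mode="valid"), 1, padded)
    return np.apply_along_axis(lambda v: np.convolve(v, k, mode="valid"), 0, rows)

def downsample(img, factor):
    """Down-sample by an integer factor using block averaging."""
    h, w = img.shape
    h2, w2 = h // factor, w // factor
    img = img[: h2 * factor, : w2 * factor]
    return img.reshape(h2, factor, w2, factor).mean(axis=(1, 3))

def degrade(hr, rng, order=2, total_factor=4):
    """Apply blur followed by down-sampling `order` times."""
    per_stage = round(total_factor ** (1.0 / order))  # 2 per stage for 4x, order 2
    lr = hr
    for _ in range(order):
        sigma = rng.uniform(0.5, 2.0)  # hypothetical blur-width range
        lr = downsample(blur(lr, sigma), per_stage)
    return lr

rng = np.random.default_rng(0)
hr = rng.uniform(0.0, 1.0, size=(64, 64))  # stand-in for a grayscale HR crop
lr = degrade(hr, rng)
```

Randomizing the blur width at every stage is what makes the degradation "parametric": each training pair sees a different composite corruption.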

Noise conditioning augmentation

  • Noise conditioning was introduced in cascaded diffusion models so that super-resolution stages can be self-supervised with down-sampling.
  • In a cascade, noise conditioning augmentation provides robustness to the distribution of inputs produced by the previous stage.
  • Noise-conditioning augmentation entails adding noise to the up-sampled LR input and providing the noise level to the neural denoiser.
  • At test time, the noise level hyper-parameter provides a trade-off between alignment with the LR input and hallucination by the generative model.
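
The augmentation described above can be sketched as follows. This is an illustration under stated assumptions: the cosine schedule, the uniform sampling of the noise level, the upper bound t_max = 0.5, and the function names are not taken from the summary.

```python
import numpy as np

def noise_condition(lr_up, t_aug, rng):
    """Corrupt the up-sampled LR input with Gaussian noise at level t_aug."""
    alpha = np.cos(0.5 * np.pi * t_aug)  # assumed schedule, as in a forward process
    sigma = np.sin(0.5 * np.pi * t_aug)
    noise = rng.standard_normal(lr_up.shape)
    return alpha * lr_up + sigma * noise

def training_inputs(lr_up, rng, t_max=0.5):
    """Draw a random noise level and return the noisy input together with it."""
    t_aug = rng.uniform(0.0, t_max)  # t_max is a hypothetical upper bound
    return noise_condition(lr_up, t_aug, rng), t_aug

rng = np.random.default_rng(0)
lr_up = rng.uniform(-1.0, 1.0, size=(16, 16, 3))  # stand-in for an up-sampled LR image
noisy_lr, t_aug = training_inputs(lr_up, rng)
```

The denoiser would receive both noisy_lr and t_aug during training. At test time a fixed level would be passed instead of a random one, which is where the trade-off between fidelity to the input and hallucinated detail comes from.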

Experiments

  • SR3+ is trained with a combination of higher-order degradations and noise-conditioning augmentation.
  • SR3+ is evaluated zero-shot on the test data.
  • SR3+ is used for blind super-resolution with a 4x magnification factor
  • Baselines used are SR3 and Real-ESRGAN
  • LR input is up-sampled by 4x using bicubic interpolation
  • Output samples for SR3 and SR3+ are obtained using DDPM ancestral sampling
  • Training data includes DF2K+OST and a collection of in-house images
  • A 400x400 crop is extracted from each image and then degraded.
  • The LR images are up-sampled to 400x400, and center crops yield 256x256 images.
  • Models are trained for 1.5M steps with batch sizes of 256 or 512
  • Models are tested on RealSR and DRealSR datasets
  • Performance is assessed with PSNR, SSIM and FID
  • Generative models are hard to learn and require larger data and models
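
Two of the steps above are simple enough to illustrate directly: the center crop from 400x400 to 256x256 and the PSNR metric. The helper names are hypothetical, and the sketch assumes images stored as floating-point arrays in [0, 1].

```python
import numpy as np

def center_crop(img, size):
    """Return the central size x size region of an (H, W, ...) array."""
    h, w = img.shape[:2]
    top, left = (h - size) // 2, (w - size) // 2
    return img[top:top + size, left:left + size]

def psnr(reference, estimate, max_val=1.0):
    """Peak signal-to-noise ratio in dB for images with values in [0, max_val]."""
    diff = reference.astype(np.float64) - estimate.astype(np.float64)
    mse = np.mean(diff ** 2)
    if mse == 0.0:
        return float("inf")  # identical images
    return float(10.0 * np.log10(max_val ** 2 / mse))

crop = center_crop(np.zeros((400, 400, 3)), 256)
score = psnr(np.zeros((4, 4)), np.full((4, 4), 0.1))  # MSE of 0.01, about 20 dB
```

PSNR and SSIM compare the output against a reference image pixel by pixel, whereas FID compares distributions of features, which is why the two kinds of metric can disagree for generative models.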

Comparison with Real-ESRGAN and SR3

  • SR3+ models of different sizes were compared with Real-ESRGAN.
  • Models were trained on the same data as Real-ESRGAN and on a 61M-image dataset.
  • A grid sweep was performed over t_eval from 0 to 0.4.
  • Results are reported with t_eval = 0.1.
  • The 40M-parameter SR3+ achieved FID scores competitive with Real-ESRGAN.
  • The 400M-parameter SR3+ outperformed Real-ESRGAN on FID scores.
  • SR3+ does not outperform Real-ESRGAN on reference-based metrics (PSNR, SSIM).
  • The main contributions are the higher-order degradation scheme and noise conditioning augmentation.
  • An ablation study showed that FID scores increased (worsened) significantly when either of the main contributions was removed.
  • SR3+ excels at natural images.
  • Test-time noise conditioning augmentation can force the model to rely on its own knowledge to infer high-frequency details.
  • FID scores can drop (improve) when noise conditioning augmentation is used at test time; the best value is often t_eval = 0.1.
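
The grid sweep over the test-time noise level can be sketched as below. The step size of 0.1 is an assumption, and toy_fid is a made-up placeholder so the example runs; it does not reproduce any measured score. In practice evaluate_fid would sample outputs at the given noise level and compute FID against the test set.

```python
def sweep_t_eval(evaluate_fid, t_values):
    """Evaluate FID at each candidate noise level and return the best one."""
    scores = {t: evaluate_fid(t) for t in t_values}
    best_t = min(scores, key=scores.get)  # lower FID is better
    return best_t, scores

def toy_fid(t):
    """Placeholder scoring function (hypothetical, not real results)."""
    return 40.0 + 100.0 * (t - 0.1) ** 2

t_values = [0.0, 0.1, 0.2, 0.3, 0.4]  # assumed grid from 0 to 0.4
best_t, scores = sweep_t_eval(toy_fid, t_values)
```

Keeping the sweep separate from the scoring function makes it easy to swap FID for another metric or to refine the grid around the best value.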