Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

  • Blind face restoration is a difficult problem that requires additional help to improve the mapping from degraded inputs to desired outputs.
  • CodeFormer is a Transformer-based prediction network that models the global composition and context of low-quality faces for code prediction.
  • CodeFormer has a controllable feature transformation module that allows a flexible trade-off between fidelity and quality.

Paper Content

Introduction

  • Face images captured in the wild often suffer from various degradations
  • Restoring such images is highly ill-posed
  • Learning a mapping from low-quality (LQ) to high-quality (HQ) images without additional guidance is intractable
  • Auxiliary information is needed to reduce uncertainty and complement details
  • Various priors have been used to mitigate the ill-posedness
  • Existing approaches often suffer from high sensitivity to degradation
  • This work casts blind face restoration as a code prediction task in a small finite proxy space
  • Codebook is learned by self-reconstruction of HQ faces
  • Combinations of codebook items form a discrete prior space with finite cardinality
  • Transformer-based code prediction network is proposed to exploit global compositions and long-range dependencies of LQ faces
  • Results demonstrate superior performance on existing datasets and the newly introduced WIDER-Test dataset
  • Method also demonstrates effectiveness on other challenging tasks such as face inpainting
  • Face restoration methods use facial landmarks, face parsing maps, facial component heatmaps, or 3D shapes as geometric priors.
  • Reference-based approaches require references with same identity as input degraded face.
  • DFDNet pre-constructs dictionaries composed of high-quality facial component features.
  • VQGAN-based methods explore a learned HQ dictionary.
  • Generative facial priors from pre-trained generators are used for blind face restoration.
  • Dictionary learning uses sparse representation with learned dictionaries.

Methodology

  • Exploits a discrete representation space to reduce uncertainty of restoration mapping and complement high-quality details for degraded inputs
  • Employs a Transformer module to model global composition of natural faces to remedy local information loss
  • Incorporates idea of vector quantization to pre-train a quantized autoencoder for a codebook and decoder
  • Employs a Transformer for accurate prediction of code combination from low-quality inputs
  • Introduces a controllable feature transformation module to exploit a flexible trade-off between restoration quality and fidelity

Codebook learning (Stage I)

  • Pre-train a quantized autoencoder to learn a context-rich codebook
  • Compress HQ face image into a compressed feature
  • Replace each “pixel” in compressed feature with nearest item in codebook
  • Decoder reconstructs HQ face image
  • Use 3 image-level reconstruction losses to train autoencoder
  • Use code-level loss to reduce distance between codebook and input feature embeddings
  • Encoder and decoder of the autoencoder consist of 12 residual blocks and 5 resize layers, for downsampling and upsampling respectively
  • Codebook item number set to 1024, code dimension set to 256
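
The nearest-item replacement and the code-level loss described above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, the toy codebook has 3 items of dimension 2 instead of 1024 items of dimension 256, and the stop-gradient handling a real training loop would need is omitted.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def quantize(features, codebook):
    """Replace each feature vector with its nearest codebook item.

    features: list of d-dimensional vectors (the flattened "pixels"
              of the compressed feature map)
    codebook: list of N d-dimensional vectors
    Returns (indices, quantized_vectors).
    """
    indices = [
        min(range(len(codebook)), key=lambda k: squared_distance(z, codebook[k]))
        for z in features
    ]
    quantized = [codebook[k] for k in indices]
    return indices, quantized


def code_level_loss(features, quantized):
    """Mean squared distance between features and their assigned codes."""
    total = sum(squared_distance(z, c) for z, c in zip(features, quantized))
    return total / len(features)


# Toy sizes for illustration only (hypothetical values).
codebook = [[0.0, 0.0], [1.0, 1.0], [-1.0, 2.0]]
features = [[0.9, 1.2], [0.1, -0.2], [-0.8, 1.7]]

indices, quantized = quantize(features, codebook)
print(indices)  # [1, 0, 2]
print(code_level_loss(features, quantized))
```

The list of indices is the code sequence that the decoder consumes after lookup; minimizing the code-level loss pulls codebook items and encoder features toward each other.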

Codebook lookup transformer learning (Stage II)

  • Corruptions of textures in LQ faces make it difficult to find accurate codes for face restoration
  • A Transformer is used to model global interrelations for better code prediction
  • Transformer module contains nine self-attention blocks
  • Training objectives include a cross-entropy loss for code prediction and an L2 loss on code-level features
  • Global modeling makes the resulting network more robust to degradation than a local nearest-code lookup on corrupted features
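
The two Stage II objectives can be illustrated with a small pure-Python sketch. It assumes the Transformer outputs one vector of logits per position over the codebook items, and that the targets are the code indices obtained from the HQ image; the function names are hypothetical, and how the two losses are weighted against each other is left out.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def token_cross_entropy(logits_per_position, target_indices):
    """Mean cross-entropy between predicted code distributions and target codes."""
    losses = [
        -math.log(softmax(logits)[target])
        for logits, target in zip(logits_per_position, target_indices)
    ]
    return sum(losses) / len(losses)


def feature_l2(lq_features, target_codes):
    """Mean squared distance between LQ features and the target code items."""
    total = 0.0
    for z, c in zip(lq_features, target_codes):
        total += sum((x - y) ** 2 for x, y in zip(z, c))
    return total / len(lq_features)


def predict_codes(logits_per_position):
    """Pick the most likely codebook index at every position."""
    return [max(range(len(l)), key=l.__getitem__) for l in logits_per_position]


# Two positions, three codebook items (toy, hypothetical values).
logits = [[2.0, 0.0, 0.0], [0.0, 0.0, 3.0]]
targets = [0, 2]

print(predict_codes(logits))
print(token_cross_entropy(logits, targets))
print(feature_l2([[0.9, 1.2]], [[1.0, 1.0]]))
```

Treating restoration as classification over a finite set of codes is what makes cross-entropy applicable here; the L2 term additionally pulls the LQ features toward the target code items.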

Controllable feature transformation (Stage III)

  • The Stage II model already restores faces well; Stage III adds a controllable trade-off between quality and fidelity
  • The controllable feature transformation (CFT) module controls information flow from the LQ encoder to the decoder
  • An adjustable coefficient w controls the relative importance of the inputs
  • CFT modules are used at multiple scales between encoder and decoder
  • Small w reduces reliance on input LQ images with heavy degradation, producing high-quality outputs
  • Larger w introduces more information from LQ images to enhance fidelity in case of mild degradation
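
The effect of w can be sketched with a residual affine modulation of the decoder feature. This assumes the form F_hat = F_d + (alpha * F_d + beta) * w with w in [0, 1], which is my reading of the paper rather than something stated in the summary above. In the real model alpha and beta would be predicted by a small network from the encoder and decoder features; here they are passed in as plain numbers, and all names are hypothetical.

```python
def cft(decoder_feat, alpha, beta, w):
    """Residual affine modulation: F_hat = F_d + (alpha * F_d + beta) * w.

    decoder_feat, alpha, beta: equal-length lists of floats (elementwise)
    w: scalar in [0, 1]; 0 ignores the LQ-derived modulation entirely
    """
    if not 0.0 <= w <= 1.0:
        raise ValueError("w is assumed to lie in [0, 1]")
    return [fd + (a * fd + b) * w for fd, a, b in zip(decoder_feat, alpha, beta)]


decoder_feat = [1.0, 2.0]
alpha = [0.5, -0.5]  # hypothetical scale predicted from LQ features
beta = [0.2, 0.1]    # hypothetical shift predicted from LQ features

print(cft(decoder_feat, alpha, beta, w=0.0))  # decoder feature unchanged
print(cft(decoder_feat, alpha, beta, w=1.0))  # full modulation from the LQ branch
```

With w = 0 the output depends only on the codebook-driven decoder path, which favors quality under heavy degradation; increasing w lets more of the LQ encoder information through, which favors fidelity.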

Experiments

Datasets

  • Trained models on FFHQ dataset with 70,000 HQ images
  • Synthesized LQ images from HQ counterparts with degradation model
  • Evaluated method on synthetic CelebA-Test and three real-world datasets
  • LFW-Test has mild degradation, WebPhoto-Test has medium degradation, WIDER-Test has heavy degradation
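
The summary does not spell out the degradation model. A sketch, assuming the blur, downsample, noise, and JPEG pipeline that is common in blind face restoration work, would be:

```latex
I_l = \left\{ \left[ (I_h \otimes k_\sigma)\downarrow_r + n_\delta \right]_{\mathrm{JPEG}_q} \right\}\uparrow_r
```

Here I_h is the HQ image, k_sigma a Gaussian blur kernel, r the downsampling (and later upsampling) factor, n_delta additive Gaussian noise, and q the JPEG quality factor. The symbols and the order of operations are assumptions to be checked against the paper.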

Experimental settings and metrics

  • Represent face image as 16x16 code sequence
  • Adam optimizer with batch size of 16
  • Learning rate 8x10^-5 for stages I and II, 2x10^-5 for stage III
  • 1.5M, 200K, and 20K iterations for stages I, II, and III respectively
  • Implemented with PyTorch and trained with four NVIDIA Tesla V100 GPUs
  • Evaluate on CelebA-Test with PSNR, SSIM, and LPIPS
  • Evaluate identity preservation with cosine similarity of ArcFace network features
  • Evaluate on real-world datasets with FID and MUSIQ (KonIQ)
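
Two of the metrics above have short closed forms, sketched here in plain Python. PSNR is computed on flattened pixel lists, and the identity score is the cosine similarity of two embedding vectors. The embeddings would come from an ArcFace network in practice; the vectors below are made-up placeholders, and the function names are hypothetical.

```python
import math


def psnr(reference, restored, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two equal-length pixel lists."""
    if len(reference) != len(restored):
        raise ValueError("inputs must have the same length")
    mse = sum((a - b) ** 2 for a, b in zip(reference, restored)) / len(reference)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)


def cosine_similarity(u, v):
    """Cosine of the angle between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_u * norm_v)


# Placeholder data, not real pixels or ArcFace features.
print(psnr([255.0, 0.0], [0.0, 0.0]))
print(cosine_similarity([1.0, 2.0], [2.0, 4.0]))
```

Higher is better for both: PSNR measures pixel-level fidelity to the ground truth, while the cosine similarity measures how well identity is preserved.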

Comparisons with state-of-the-art methods

  • Compare proposed CodeFormer with state-of-the-art methods
  • Conduct comparison on synthetic and real-world datasets
  • CodeFormer achieves best scores in terms of image quality metrics, identity preservation, and PSNR
  • Compared methods fail to produce pleasant restoration results, introducing artifacts and over-smoothing
  • CodeFormer produces high-quality faces and preserves identity
  • Codebook space is key to ensure robustness and effectiveness of model
  • Fixing codebook and decoder produces better performance than fine-tuning decoder
  • Controllable feature transformation module allows flexible trade-off between quality and fidelity
  • CodeFormer has a running time similar to other methods and achieves the best LPIPS
  • CodeFormer finetuned on face color enhancement produces more natural color and faithful details
  • CodeFormer extended to face inpainting produces high-quality natural faces without strokes and artifacts
  • Capability and expressiveness of the autoencoder affect the performance of CodeFormer
  • CodeFormer shows limited advantage on side faces, because the training dataset contains few of them and the codebook lacks suitable codes