Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

Blind face restoration is a highly ill-posed problem that needs auxiliary guidance to improve the mapping from degraded inputs to desired outputs.
CodeFormer is a Transformer-based prediction network that models the global composition and context of low-quality faces for code prediction.
CodeFormer has a controllable feature transformation module that allows a flexible trade-off between fidelity and quality.

Paper Content

Introduction

Face images captured in the wild often suffer from various degradations
Restoring such images is highly ill-posed
Learning a mapping from low-quality (LQ) to high-quality (HQ) faces without additional guidance is intractable
Auxiliary information is needed to reduce uncertainty and complement details
Various priors have been used to mitigate the ill-posedness
Existing approaches often suffer from high sensitivity to degradation
This work casts blind face restoration as a code prediction task in a small finite proxy space
Codebook is learned by self-reconstruction of HQ faces
Combinations of codebook items form a discrete prior space with finite cardinality
Transformer-based code prediction network is proposed to exploit global compositions and long-range dependencies of LQ faces
Results demonstrate superior performance on existing datasets and on the newly introduced WIDER-Test dataset
Method also demonstrates effectiveness on other challenging tasks such as face inpainting

Related work

Face restoration methods use facial landmarks, face parsing maps, facial component heatmaps, or 3D shapes as geometric priors.
Reference-based approaches require references with the same identity as the input degraded face.
DFDNet pre-constructs dictionaries composed of high-quality facial component features.
VQGAN-based methods explore a learned HQ dictionary.
Generative facial priors from pre-trained generators are used for blind face restoration.
Dictionary learning uses sparse representation with learned dictionaries.

Methodology

Exploits a discrete representation space to reduce uncertainty of restoration mapping and complement high-quality details for degraded inputs
Employs a Transformer module to model global composition of natural faces to remedy local information loss
Incorporates idea of vector quantization to pre-train a quantized autoencoder for a codebook and decoder
Employs a Transformer for accurate prediction of code combination from low-quality inputs
Introduces a controllable feature transformation module to exploit a flexible trade-off between restoration quality and fidelity

Codebook learning (stage i)

Pre-train a quantized autoencoder to learn a context-rich codebook
Compress HQ face image into a compressed feature
Replace each “pixel” in compressed feature with nearest item in codebook
Decoder reconstructs HQ face image
Use 3 image-level reconstruction losses to train autoencoder
Use code-level loss to reduce distance between codebook and input feature embeddings
Encoder and decoder of the autoencoder consist of 12 residual blocks and 5 resize layers (for downsampling and upsampling, respectively)
Codebook item number set to 1024, code dimension set to 256
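
The nearest-item replacement described above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `quantize` and `squared_distance` and the toy values are hypothetical, and a real model would run the lookup on tensors covering the whole compressed feature map.

```python
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def squared_distance(a: Vector, b: Vector) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def quantize(
    features: Sequence[Vector], codebook: Sequence[Vector]
) -> Tuple[List[int], List[Vector]]:
    """Map each feature vector to the index and value of its nearest codebook item."""
    indices: List[int] = []
    quantized: List[Vector] = []
    for feature in features:
        # Nearest-neighbour search over all codebook items.
        best = min(
            range(len(codebook)),
            key=lambda k: squared_distance(feature, codebook[k]),
        )
        indices.append(best)
        quantized.append(codebook[best])
    return indices, quantized


# Toy example: 3 items of dimension 2 (the setting above uses 1024 items of dimension 256).
toy_codebook = [(0.0, 0.0), (1.0, 1.0), (-1.0, 2.0)]
toy_features = [(0.9, 1.2), (0.1, -0.2), (-0.8, 1.7)]
toy_indices, toy_quantized = quantize(toy_features, toy_codebook)
```

The returned indices form the code sequence that the later stages predict, and the quantized vectors are what the decoder receives.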

Codebook lookup transformer learning (stage ii)

Corruptions of textures in LQ faces make it difficult to find accurate codes for face restoration
A Transformer is used to model global interrelations for better code prediction
Transformer module contains nine self-attention blocks
Training objectives include a cross-entropy loss on the predicted codes and an L2 loss on the features
Predicting codes rather than pixels makes the network more robust to degradation in face restoration
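
The two training objectives named above can be illustrated with a small standard-library sketch. It assumes the Transformer outputs one vector of logits per code position and that the ground-truth indices come from the stage I codebook lookup. The function names and toy numbers are hypothetical, and how the two terms are weighted against each other is not specified here.

```python
import math
from typing import List, Sequence


def cross_entropy(logits: Sequence[float], target: int) -> float:
    """Cross-entropy of a softmax prediction against one ground-truth code index."""
    peak = max(logits)  # subtract the maximum for numerical stability
    log_norm = peak + math.log(sum(math.exp(v - peak) for v in logits))
    return log_norm - logits[target]


def code_prediction_loss(
    all_logits: Sequence[Sequence[float]], targets: Sequence[int]
) -> float:
    """Sum of per-position cross-entropy terms over the code sequence."""
    return sum(cross_entropy(l, t) for l, t in zip(all_logits, targets))


def feature_l2_loss(
    lq_features: Sequence[Sequence[float]], hq_quantized: Sequence[Sequence[float]]
) -> float:
    """Squared L2 distance between LQ features and the quantized HQ features."""
    return sum(
        (x - y) ** 2
        for lq, hq in zip(lq_features, hq_quantized)
        for x, y in zip(lq, hq)
    )


def predict_codes(all_logits: Sequence[Sequence[float]]) -> List[int]:
    """Pick the most likely codebook index at every position."""
    return [max(range(len(l)), key=l.__getitem__) for l in all_logits]


# Toy example: 2 code positions, 3 codebook items.
toy_logits = [[2.0, 0.5, -1.0], [0.0, 0.0, 3.0]]
toy_targets = [0, 2]
toy_prediction = predict_codes(toy_logits)
toy_loss = code_prediction_loss(toy_logits, toy_targets)
```

At inference time only `predict_codes` is needed: the predicted indices select codebook items, which the fixed decoder turns into the restored face.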

Controllable feature transformation (stage iii)

Stage II already yields a strong face restoration model
The controllable feature transformation (CFT) module controls the information flow from the LQ encoder to the decoder
An adjustable coefficient w in [0, 1] controls the relative importance of the inputs
CFT modules are used at multiple scales between encoder and decoder
Small w reduces reliance on input LQ images with heavy degradation, producing high-quality outputs
Larger w introduces more information from LQ images to enhance fidelity in case of mild degradation
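
The effect of the coefficient w can be shown with a short sketch. It assumes the modulation takes the affine form F_d + (alpha * F_d + beta) * w, where alpha and beta are predicted from the encoder and decoder features. In this sketch they are simply passed in as given lists, and the name `cft_blend` is hypothetical.

```python
from typing import List, Sequence


def cft_blend(
    decoder_feat: Sequence[float],
    alpha: Sequence[float],
    beta: Sequence[float],
    w: float,
) -> List[float]:
    """Modulate decoder features with an affine term scaled by w.

    alpha and beta stand in for parameters that a small network would
    predict from the LQ encoder features (an assumption of this sketch).
    """
    if not 0.0 <= w <= 1.0:
        raise ValueError("w is expected to lie in [0, 1]")
    return [fd + (a * fd + b) * w for fd, a, b in zip(decoder_feat, alpha, beta)]


toy_decoder = [1.0, 2.0]
toy_alpha = [0.5, -0.5]
toy_beta = [0.25, 1.0]

quality_first = cft_blend(toy_decoder, toy_alpha, toy_beta, w=0.0)   # decoder features unchanged
balanced = cft_blend(toy_decoder, toy_alpha, toy_beta, w=0.5)
fidelity_first = cft_blend(toy_decoder, toy_alpha, toy_beta, w=1.0)  # full LQ-driven modulation
```

With w = 0 the decoder features pass through untouched, so the output depends only on the predicted codes. Raising w lets more encoder information through.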

Experiments

Datasets

Trained models on FFHQ dataset with 70,000 HQ images
Synthesized LQ images from HQ counterparts with degradation model
Evaluated method on synthetic CelebA-Test and three real-world datasets
LFW-Test has mild degradation, WebPhoto-Test has medium degradation, WIDER-Test has heavy degradation
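
The summary does not spell out the degradation model. A form commonly used for blind face restoration, given here as an assumption rather than a quotation, blurs the HQ image, downsamples it, adds noise, applies JPEG compression, and resizes back:

```latex
I_l = \Big\{ \big[ (I_h \otimes k_\sigma)\downarrow_r + \, n_\delta \big]_{\mathrm{JPEG}_q} \Big\}\uparrow_r
```

Here I_h is the HQ image, I_l the synthesized LQ image, k_sigma a Gaussian blur kernel, r the resampling factor, n_delta additive Gaussian noise, and q the JPEG quality factor. The sampling ranges of these parameters are not given in this summary.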

Experimental settings and metrics

Represent face image as 16x16 code sequence
Adam optimizer with batch size of 16
Learning rate 8x10^-5 for stages I and II, 2x10^-5 for stage III
1.5M, 200K, and 20K iterations for stages I, II, and III respectively
Implemented with PyTorch and trained with four NVIDIA Tesla V100 GPUs
Evaluate on CelebA-Test with PSNR, SSIM, and LPIPS
Evaluate identity preservation with cosine similarity of ArcFace network features
Evaluate on real-world datasets with FID and MUSIQ (KonIQ)
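
Two of the metrics above are simple enough to sketch with the standard library. The code below shows PSNR on flattened pixel values and cosine similarity between two feature vectors. It is a minimal sketch: the function names and toy inputs are hypothetical, and the identity features would in practice come from an ArcFace network, which is not modelled here.

```python
import math
from typing import Sequence


def psnr(reference: Sequence[float], restored: Sequence[float], max_value: float = 255.0) -> float:
    """Peak signal-to-noise ratio in dB between two flattened images."""
    mse = sum((r - x) ** 2 for r, x in zip(reference, restored)) / len(reference)
    if mse == 0.0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# A 16x16 code grid gives this many code positions per face.
code_grid = 16
num_code_positions = code_grid * code_grid

toy_psnr = psnr([0.0, 255.0], [0.0, 245.0])
same_identity = cosine_similarity([3.0, 4.0], [3.0, 4.0])
unrelated = cosine_similarity([1.0, 0.0], [0.0, 1.0])
```

Higher PSNR and higher cosine similarity are better. SSIM, LPIPS, FID, and MUSIQ need windowed statistics or pretrained networks and are left out of this sketch.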

Comparisons with state-of-the-art methods

Compare proposed CodeFormer with state-of-the-art methods
Conduct comparison on synthetic and real-world datasets
CodeFormer achieves best scores in terms of image quality metrics, identity preservation, and PSNR
Compared methods fail to produce pleasant restoration results, introducing artifacts and over-smoothing
CodeFormer produces high-quality faces and preserves identity
Codebook space is key to ensure robustness and effectiveness of model
Fixing codebook and decoder produces better performance than fine-tuning decoder
Controllable feature transformation module allows flexible trade-off between quality and fidelity
CodeFormer has a running time similar to that of other methods and achieves the best performance in terms of LPIPS
CodeFormer finetuned on face color enhancement produces more natural color and faithful details
CodeFormer extended to face inpainting produces high-quality natural faces without strokes and artifacts
The capability and expressiveness of the autoencoder affect the performance of CodeFormer
CodeFormer shows limited superiority on side faces, because side faces are scarce in the training dataset and the codebook lacks accurate codes for them
