Link to paper
The full paper is available here.
You can also find the paper on PapersWithCode here.
Abstract
- Blind face restoration is a difficult problem that requires additional help to improve the mapping from degraded inputs to desired outputs.
- CodeFormer is a Transformer-based prediction network that models the global composition and context of low-quality faces for code prediction.
- CodeFormer has a controllable feature transformation module that allows a flexible trade-off between fidelity and quality.
Paper Content
Introduction
- Face images captured in the wild often suffer from various degradation
- Restoring such images is highly ill-posed
- Learning a LQ-HQ mapping without additional guidance is intractable
- Auxiliary information is needed to reduce uncertainty and complement details
- Various priors have been used to mitigate the ill-posedness
- Existing approaches often suffer from high sensitivity to degradation
- This work casts blind face restoration as a code prediction task in a small finite proxy space
- Codebook is learned by self-reconstruction of HQ faces
- Combinations of codebook items form a discrete prior space with finite cardinality
- Transformer-based code prediction network is proposed to exploit global compositions and long-range dependencies of LQ faces
- Results demonstrate superior performance in existing datasets and newly introduced WIDER-Test dataset
- Method also demonstrates effectiveness on other challenging tasks such as face inpainting
Related work
- Face restoration methods use facial landmarks, face parsing maps, facial component heatmaps, or 3D shapes as geometric priors.
- Reference-based approaches require references with same identity as input degraded face.
- DFDNet pre-constructs dictionaries composed of high-quality facial component features.
- VQGAN-based methods explore a learned HQ dictionary.
- Generative facial priors from pre-trained generators are used for blind face restoration.
- Dictionary learning uses sparse representation with learned dictionaries.
Methodology
- Exploits a discrete representation space to reduce uncertainty of restoration mapping and complement high-quality details for degraded inputs
- Employs a Transformer module to model global composition of natural faces to remedy local information loss
- Incorporates idea of vector quantization to pre-train a quantized autoencoder for a codebook and decoder
- Employs a Transformer for accurate prediction of code combination from low-quality inputs
- Introduces a controllable feature transformation module to exploit a flexible trade-off between restoration quality and fidelity
Codebook learning (stage i)
- Pre-train a quantized autoencoder to learn a context-rich codebook
- Compress HQ face image into a compressed feature
- Replace each “pixel” in compressed feature with nearest item in codebook
- Decoder reconstructs HQ face image
- Use 3 image-level reconstruction losses to train autoencoder
- Use code-level loss to reduce distance between codebook and input feature embeddings
- Codebook consists of 12 residual blocks and 5 resize layers
- Codebook item number set to 1024, code dimension set to 256
Codebook lookup transformer learning (stage ii)
- Corruptions of textures in LQ faces make it difficult to find accurate codes for face restoration
- A Transformer is used to model global interrelations for better code prediction
- Transformer module contains nine self-attention blocks
- Training objectives include crossentropy loss and L2 loss
- Network is equipped with superior robustness and effectiveness in face restoration
Controllable feature transformation (stage iii)
- Stage II has obtained a great face restoration model
- We propose the controllable feature transformation (CFT) module to control information flow from LQ encoder to decoder
- An adjustable coefficient is used to control the relative importance of the inputs
- CFT modules are used at multiple scales between encoder and decoder
- Small w reduces reliance on input LQ images with heavy degradation, producing high-quality outputs
- Larger w introduces more information from LQ images to enhance fidelity in case of mild degradation
Experiments
Datasets
- Trained models on FFHQ dataset with 70,000 HQ images
- Synthesized LQ images from HQ counterparts with degradation model
- Evaluated method on synthetic CelebA-Test and three real-world datasets
- LFW-Test has mild degradation, WebPhoto-Test has medium degradation, WIDER-Test has heavy degradation
Experimental settings and metrics
- Represent face image as 16x16 code sequence
- Adam optimizer with batch size of 16
- Learning rate 8x10^-5 for stages I and II, 2x10^-5 for stage III
- 1.5M, 200K, and 20K iterations for stages I, II, and III respectively
- Implemented with PyTorch and trained with four NVIDIA Tesla V100 GPUs
- Evaluate on CelebA-Test with PSNR, SSIM, and LPIPS
- Evaluate identity preservation with cosine similarity of ArcFace network features
- Evaluate on real-world datasets with FID and MUSIQ (KonIQ)
Comparisons with state-of-the-art methods
- Compare proposed CodeFormer with state-of-the-art methods
- Conduct comparison on synthetic and real-world datasets
- CodeFormer achieves best scores in terms of image quality metrics, identity preservation, and PSNR
- Compared methods fail to produce pleasant restoration results, introducing artifacts and over-smoothing
- CodeFormer produces high-quality faces and preserves identity
- Codebook space is key to ensure robustness and effectiveness of model
- Fixing codebook and decoder produces better performance than fine-tuning decoder
- Controllable feature transformation module allows flexible trade-off between quality and fidelity
- CodeFormer has similar running time as other methods and achieves best performance in terms of LPIPS
- CodeFormer finetuned on face color enhancement produces more natural color and faithful details
- CodeFormer extended to face inpainting produces high-quality natural faces without strokes and artifacts
- Autoencoder capability and expressiveness affects performance of CodeFormer
- Limited superiority of CodeFormer in side faces due to lack of codes in training dataset