Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

  • Proposed a new single-stage method for subject agnostic face swapping and identity transfer called FaceDancer
  • Adaptive Feature Fusion Attention (AFFA) and Interpreted Feature Similarity Regularization (IFSR) modules embedded in the decoder
  • AFFA adaptively learns to fuse attribute features and features conditioned on identity information without requiring any additional facial segmentation process
  • IFSR leverages intermediate features in an identity encoder to preserve important attributes while transferring the identity of the source face
  • Experiments show FaceDancer outperforms other state-of-the-art networks in terms of identity transfer and has better pose preservation than most previous methods

Paper Content

Introduction

  • Face swapping is a task that shifts the identity of a source face into a target face while preserving the target face’s attributes.
  • It has applications in the film, game, and entertainment industry.
  • There are two mainstream approaches for face synthesis: source-oriented and target-oriented methods.
  • FaceDancer is a novel, target-oriented, and single-stage method for face swapping.
  • FaceDancer uses an Adaptive Feature Fusion Attention (AFFA) module and an Interpreted Feature Similarity Regularization (IFSR) method to improve identity transfer and pose preservation.
  • Two leading approaches for face swapping: source-and target-oriented methods
  • Digital Emily project used 3D scanning of a single actor to perform face swapping
  • Banz et al. used 3D Morphable Models to generate source faces with matching target attributes
  • Nirkin et al. used 3DMM to extract pose and expression coefficients from the target face
  • FSGAN used reenactment network to reenact the source face based on the target landmarks
  • Target-oriented approaches rely on generative models to manipulate features of an encoded target face
  • FaceShifter used attribute encoder-decoder model trained in a semi-supervised fashion
  • SimSwap used modified version of feature matching loss from pix2pixHD
  • HifiFace used combination of GANs and 3DMMs to achieve state-of-the-art identity performance
  • Our approach relies on identity encoder and can handle harsh image distortions

Method

  • FaceDancer network architecture is described
  • AFFA module, IFSR method, and loss functions are mentioned
  • Notations are defined: X t (target face), X s (source face), X c (changed face)

Network architecture

  • FaceDancer involves a generator, discriminator, mapping network and ArcFace.
  • Generator uses U-Net like encoder-decoder architecture with residual blocks and AdaIN, AFFA module or concatenation layer.
  • Discriminator is same as StarGan-v2 and Hi-fiFace, but hinge loss is used.
  • Mapping network consists of four fully-connected layers and leaky ReLU.
  • ArcFace is used to extract and inject identity information from source image and for identity loss.

The adaptive feature fusion attention (affa) module

  • AFFA module is inspired by previous works
  • AFFA avoids the need to compute segmentation masks for each training sample
  • AFFA uses information from skip connections in the generator encoder
  • AFFA learns to extract relevant descriptive face features
  • AFFA produces an attention mask to gate and fuse h and za

Interpreted feature similarity regularization (ifsr)

  • Target-oriented face swapping methods rely on semi-supervised or unsupervised techniques to maintain target attributes
  • Use pretrained identity encoders to explore facial expressions
  • Pre-study on state-of-the-art face swapping model FaceShifter
  • Compare cosine distances between target and generated face swaps
  • Define margin for each layer based on mean distances
  • Regularization equation to match mean of distribution

Loss functions

  • FaceDancer uses various loss functions to train
  • Identity loss is used to transfer source identity
  • Reconstruction loss is used to make sure final result is equal to target image
  • Perceptual loss is used to improve semantic understanding of image
  • Cycle consistence loss is used to keep important attributes and structures

Results

  • FaceDancer is trained on VGGFace2 and LS3D-W datasets
  • Five point landmarks are extracted with RetinaFace
  • ArcFace is pretrained on MS1M with a ResNet50 backbone
  • Adam optimizer is used with a learning rate of 0.0001
  • Target and source images are randomly augmented with brightness, contrast, and saturation
  • Configurations are trained for 300K steps for the ablation study
  • Best performing configurations are trained up to 500K steps
  • Image resolution is 256x256
  • 20% chance that an image pair is the same
  • Margin scale is set to 1.2
  • Quantitative experiments on FaceForensics++

Quantitative results

  • FaceDancer is compared to other face swapping networks
  • Metrics evaluated are identity retrieval, pose error, expression error, and FID
  • Identity retrieval is done with a secondary identity encoder
  • Pose is measured with an average L2 error
  • Expression is measured with an average L2 error
  • FID is calculated between swapped and unaltered test set
  • 10K samples used for testing
  • FaceDancer outperforms other works in identity retrieval

Qualitative results

  • FaceDancer is compared to SimSwap, FaceShifter, HifiFace, and FaceController
  • FaceDancer performs similarly to SimSwap but with improved identity transfer
  • FaceShifter performs good identity transfer but struggles with lighting and gaze direction
  • FaceController has good identity transferability and decent pose error but fails with gaze direction
  • HifiFace has promising results for facial shape but not feasible to compare qualitatively

Ablation study

  • FaceDancer components (AFFA module and IFSR method) are ablated and compared to two baselines
  • Baseline 1 and 2 use concatenation and addition to fuse feature maps from the decoder and skip connection
  • Feature fusion is performed at top three resolutions (256, 128, 64)
  • Configurations A-E are evaluated on FaceForensic++ and AFLW2000-3D
  • Baseline 1 performs best for pose, but falls short on other metrics
  • Configurations A-D perform better for ID, but have comparable pose error and similar or better expression error
  • Configuration E falls short for identity performance and pose error

Analysis of the affa module

  • AFFA module was studied in FaceDancer
  • Training with AFFA in the three upper resolutions of the decoder (256, 128, 64) leads to noticeable color defects
  • Attention maps generated at the resolution of 256 are mostly gray
  • Baselines 1 and 2 do not have any problems with the color defect
  • To remedy the problem, AFFA module was replaced with a simple concatenation operation and an AFFA module was added to resolution 32

Analysis of ifsr

  • IFSR method is evaluated experimentally
  • Comparing cosine distance between feature maps of target, source, and changed faces
  • Changed face obtained from FaceShifter
  • Early layers of ArcFace contain information such as pose, expression, and occlusions
  • Final blocks store identity information
  • c2t and c2s distributions are completely separable until block 14
  • IFSR does not contain any learnable parameters

Conclusion

  • FaceDancer is a single-stage face swapping model that reaches state-of-the-art
  • FaceDancer has a regularization component IFSR to preserve attributes
  • AFFA module improves identity transfer without sacrificing visual quality
  • FaceDancer is limited in face shape transfer and IFSR margin calculation
  • IFSR can be used to compress complex face swap models
  • Combining IFSR with 3DMM could improve attribute preservation