Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

Proposed a new single-stage method for subject agnostic face swapping and identity transfer called FaceDancer
Adaptive Feature Fusion Attention (AFFA) and Interpreted Feature Similarity Regularization (IFSR) modules embedded in the decoder
AFFA adaptively learns to fuse attribute features and features conditioned on identity information without requiring any additional facial segmentation process
IFSR leverages intermediate features in an identity encoder to preserve important attributes while transferring the identity of the source face
Experiments show FaceDancer outperforms other state-of-the-art networks in terms of identity transfer and has better pose preservation than most previous methods

Paper Content

Introduction

Face swapping is a task that shifts the identity of a source face into a target face while preserving the target face’s attributes.
It has applications in the film, game, and entertainment industry.
There are two mainstream approaches for face synthesis: source-oriented and target-oriented methods.
FaceDancer is a novel, target-oriented, and single-stage method for face swapping.
FaceDancer uses an Adaptive Feature Fusion Attention (AFFA) module and an Interpreted Feature Similarity Regularization (IFSR) method to improve identity transfer and pose preservation.

Two leading approaches for face swapping: source-and target-oriented methods
Digital Emily project used 3D scanning of a single actor to perform face swapping
Banz et al. used 3D Morphable Models to generate source faces with matching target attributes
Nirkin et al. used 3DMM to extract pose and expression coefficients from the target face
FSGAN used reenactment network to reenact the source face based on the target landmarks
Target-oriented approaches rely on generative models to manipulate features of an encoded target face
FaceShifter used attribute encoder-decoder model trained in a semi-supervised fashion
SimSwap used modified version of feature matching loss from pix2pixHD
HifiFace used combination of GANs and 3DMMs to achieve state-of-the-art identity performance
Our approach relies on identity encoder and can handle harsh image distortions

Method

FaceDancer network architecture is described
AFFA module, IFSR method, and loss functions are mentioned
Notations are defined: X t (target face), X s (source face), X c (changed face)

Network architecture

FaceDancer involves a generator, discriminator, mapping network and ArcFace.
Generator uses U-Net like encoder-decoder architecture with residual blocks and AdaIN, AFFA module or concatenation layer.
Discriminator is same as StarGan-v2 and Hi-fiFace, but hinge loss is used.
Mapping network consists of four fully-connected layers and leaky ReLU.
ArcFace is used to extract and inject identity information from source image and for identity loss.

The adaptive feature fusion attention (affa) module

AFFA module is inspired by previous works
AFFA avoids the need to compute segmentation masks for each training sample
AFFA uses information from skip connections in the generator encoder
AFFA learns to extract relevant descriptive face features
AFFA produces an attention mask to gate and fuse h and za

Interpreted feature similarity regularization (ifsr)

Target-oriented face swapping methods rely on semi-supervised or unsupervised techniques to maintain target attributes
Use pretrained identity encoders to explore facial expressions
Pre-study on state-of-the-art face swapping model FaceShifter
Compare cosine distances between target and generated face swaps
Define margin for each layer based on mean distances
Regularization equation to match mean of distribution

Loss functions

FaceDancer uses various loss functions to train
Identity loss is used to transfer source identity
Reconstruction loss is used to make sure final result is equal to target image
Perceptual loss is used to improve semantic understanding of image
Cycle consistence loss is used to keep important attributes and structures

Results

FaceDancer is trained on VGGFace2 and LS3D-W datasets
Five point landmarks are extracted with RetinaFace
ArcFace is pretrained on MS1M with a ResNet50 backbone
Adam optimizer is used with a learning rate of 0.0001
Target and source images are randomly augmented with brightness, contrast, and saturation
Configurations are trained for 300K steps for the ablation study
Best performing configurations are trained up to 500K steps
Image resolution is 256x256
20% chance that an image pair is the same
Margin scale is set to 1.2
Quantitative experiments on FaceForensics++

Quantitative results

FaceDancer is compared to other face swapping networks
Metrics evaluated are identity retrieval, pose error, expression error, and FID
Identity retrieval is done with a secondary identity encoder
Pose is measured with an average L2 error
Expression is measured with an average L2 error
FID is calculated between swapped and unaltered test set
10K samples used for testing
FaceDancer outperforms other works in identity retrieval

Qualitative results

FaceDancer is compared to SimSwap, FaceShifter, HifiFace, and FaceController
FaceDancer performs similarly to SimSwap but with improved identity transfer
FaceShifter performs good identity transfer but struggles with lighting and gaze direction
FaceController has good identity transferability and decent pose error but fails with gaze direction
HifiFace has promising results for facial shape but not feasible to compare qualitatively

Ablation study

FaceDancer components (AFFA module and IFSR method) are ablated and compared to two baselines
Baseline 1 and 2 use concatenation and addition to fuse feature maps from the decoder and skip connection
Feature fusion is performed at top three resolutions (256, 128, 64)
Configurations A-E are evaluated on FaceForensic++ and AFLW2000-3D
Baseline 1 performs best for pose, but falls short on other metrics
Configurations A-D perform better for ID, but have comparable pose error and similar or better expression error
Configuration E falls short for identity performance and pose error

Analysis of the affa module

AFFA module was studied in FaceDancer
Training with AFFA in the three upper resolutions of the decoder (256, 128, 64) leads to noticeable color defects
Attention maps generated at the resolution of 256 are mostly gray
Baselines 1 and 2 do not have any problems with the color defect
To remedy the problem, AFFA module was replaced with a simple concatenation operation and an AFFA module was added to resolution 32

Analysis of ifsr

IFSR method is evaluated experimentally
Comparing cosine distance between feature maps of target, source, and changed faces
Changed face obtained from FaceShifter
Early layers of ArcFace contain information such as pose, expression, and occlusions
Final blocks store identity information
c2t and c2s distributions are completely separable until block 14
IFSR does not contain any learnable parameters

Conclusion

FaceDancer is a single-stage face swapping model that reaches state-of-the-art
FaceDancer has a regularization component IFSR to preserve attributes
AFFA module improves identity transfer without sacrificing visual quality
FaceDancer is limited in face shape transfer and IFSR margin calculation
IFSR can be used to compress complex face swap models
Combining IFSR with 3DMM could improve attribute preservation

Link to paper#

Abstract#

Paper Content#

Introduction#

Related work#

Method#

Network architecture#

The adaptive feature fusion attention (affa) module#

Interpreted feature similarity regularization (ifsr)#

Loss functions#

Results#

Quantitative results#

Qualitative results#

Ablation study#

Analysis of the affa module#

Analysis of ifsr#

Conclusion#

Link to paper

Abstract

Paper Content

Introduction

Related work

Method

Network architecture

The adaptive feature fusion attention (affa) module

Interpreted feature similarity regularization (ifsr)

Loss functions

Results

Quantitative results

Qualitative results

Ablation study

Analysis of the affa module

Analysis of ifsr

Conclusion