Link to paper
The full paper is available here.
You can also find the paper on PapersWithCode here.
Abstract
- Proposed a new single-stage method for subject agnostic face swapping and identity transfer called FaceDancer
- Adaptive Feature Fusion Attention (AFFA) and Interpreted Feature Similarity Regularization (IFSR) modules embedded in the decoder
- AFFA adaptively learns to fuse attribute features and features conditioned on identity information without requiring any additional facial segmentation process
- IFSR leverages intermediate features in an identity encoder to preserve important attributes while transferring the identity of the source face
- Experiments show FaceDancer outperforms other state-of-the-art networks in terms of identity transfer and has better pose preservation than most previous methods
Paper Content
Introduction
- Face swapping is a task that shifts the identity of a source face into a target face while preserving the target face’s attributes.
- It has applications in the film, game, and entertainment industry.
- There are two mainstream approaches for face synthesis: source-oriented and target-oriented methods.
- FaceDancer is a novel, target-oriented, and single-stage method for face swapping.
- FaceDancer uses an Adaptive Feature Fusion Attention (AFFA) module and an Interpreted Feature Similarity Regularization (IFSR) method to improve identity transfer and pose preservation.
Related work
- Two leading approaches for face swapping: source-and target-oriented methods
- Digital Emily project used 3D scanning of a single actor to perform face swapping
- Banz et al. used 3D Morphable Models to generate source faces with matching target attributes
- Nirkin et al. used 3DMM to extract pose and expression coefficients from the target face
- FSGAN used reenactment network to reenact the source face based on the target landmarks
- Target-oriented approaches rely on generative models to manipulate features of an encoded target face
- FaceShifter used attribute encoder-decoder model trained in a semi-supervised fashion
- SimSwap used modified version of feature matching loss from pix2pixHD
- HifiFace used combination of GANs and 3DMMs to achieve state-of-the-art identity performance
- Our approach relies on identity encoder and can handle harsh image distortions
Method
- FaceDancer network architecture is described
- AFFA module, IFSR method, and loss functions are mentioned
- Notations are defined: X t (target face), X s (source face), X c (changed face)
Network architecture
- FaceDancer involves a generator, discriminator, mapping network and ArcFace.
- Generator uses U-Net like encoder-decoder architecture with residual blocks and AdaIN, AFFA module or concatenation layer.
- Discriminator is same as StarGan-v2 and Hi-fiFace, but hinge loss is used.
- Mapping network consists of four fully-connected layers and leaky ReLU.
- ArcFace is used to extract and inject identity information from source image and for identity loss.
The adaptive feature fusion attention (affa) module
- AFFA module is inspired by previous works
- AFFA avoids the need to compute segmentation masks for each training sample
- AFFA uses information from skip connections in the generator encoder
- AFFA learns to extract relevant descriptive face features
- AFFA produces an attention mask to gate and fuse h and za
Interpreted feature similarity regularization (ifsr)
- Target-oriented face swapping methods rely on semi-supervised or unsupervised techniques to maintain target attributes
- Use pretrained identity encoders to explore facial expressions
- Pre-study on state-of-the-art face swapping model FaceShifter
- Compare cosine distances between target and generated face swaps
- Define margin for each layer based on mean distances
- Regularization equation to match mean of distribution
Loss functions
- FaceDancer uses various loss functions to train
- Identity loss is used to transfer source identity
- Reconstruction loss is used to make sure final result is equal to target image
- Perceptual loss is used to improve semantic understanding of image
- Cycle consistence loss is used to keep important attributes and structures
Results
- FaceDancer is trained on VGGFace2 and LS3D-W datasets
- Five point landmarks are extracted with RetinaFace
- ArcFace is pretrained on MS1M with a ResNet50 backbone
- Adam optimizer is used with a learning rate of 0.0001
- Target and source images are randomly augmented with brightness, contrast, and saturation
- Configurations are trained for 300K steps for the ablation study
- Best performing configurations are trained up to 500K steps
- Image resolution is 256x256
- 20% chance that an image pair is the same
- Margin scale is set to 1.2
- Quantitative experiments on FaceForensics++
Quantitative results
- FaceDancer is compared to other face swapping networks
- Metrics evaluated are identity retrieval, pose error, expression error, and FID
- Identity retrieval is done with a secondary identity encoder
- Pose is measured with an average L2 error
- Expression is measured with an average L2 error
- FID is calculated between swapped and unaltered test set
- 10K samples used for testing
- FaceDancer outperforms other works in identity retrieval
Qualitative results
- FaceDancer is compared to SimSwap, FaceShifter, HifiFace, and FaceController
- FaceDancer performs similarly to SimSwap but with improved identity transfer
- FaceShifter performs good identity transfer but struggles with lighting and gaze direction
- FaceController has good identity transferability and decent pose error but fails with gaze direction
- HifiFace has promising results for facial shape but not feasible to compare qualitatively
Ablation study
- FaceDancer components (AFFA module and IFSR method) are ablated and compared to two baselines
- Baseline 1 and 2 use concatenation and addition to fuse feature maps from the decoder and skip connection
- Feature fusion is performed at top three resolutions (256, 128, 64)
- Configurations A-E are evaluated on FaceForensic++ and AFLW2000-3D
- Baseline 1 performs best for pose, but falls short on other metrics
- Configurations A-D perform better for ID, but have comparable pose error and similar or better expression error
- Configuration E falls short for identity performance and pose error
Analysis of the affa module
- AFFA module was studied in FaceDancer
- Training with AFFA in the three upper resolutions of the decoder (256, 128, 64) leads to noticeable color defects
- Attention maps generated at the resolution of 256 are mostly gray
- Baselines 1 and 2 do not have any problems with the color defect
- To remedy the problem, AFFA module was replaced with a simple concatenation operation and an AFFA module was added to resolution 32
Analysis of ifsr
- IFSR method is evaluated experimentally
- Comparing cosine distance between feature maps of target, source, and changed faces
- Changed face obtained from FaceShifter
- Early layers of ArcFace contain information such as pose, expression, and occlusions
- Final blocks store identity information
- c2t and c2s distributions are completely separable until block 14
- IFSR does not contain any learnable parameters
Conclusion
- FaceDancer is a single-stage face swapping model that reaches state-of-the-art
- FaceDancer has a regularization component IFSR to preserve attributes
- AFFA module improves identity transfer without sacrificing visual quality
- FaceDancer is limited in face shape transfer and IFSR margin calculation
- IFSR can be used to compress complex face swap models
- Combining IFSR with 3DMM could improve attribute preservation