Link to paper
The full paper is available here.
You can also find the paper on PapersWithCode here.
Abstract
- StyleGAN is limited to cropped aligned faces at a fixed image resolution.
- Dilated convolutions can be used to rescale the receptive fields of shallow layers in StyleGAN.
- This allows fixed-size small features at shallow layers to be extended into larger ones that can accommodate variable resolutions.
- An encoder is introduced to enable real face inversion and manipulation.
Paper Content
Related work
- StyleGAN inversion aims to project real face images into the latent space of StyleGAN
- StyleGAN-based face manipulation can be done by optimizing the latent code or searching for offline editing vectors
- Image-to-image translation frameworks can also be used to achieve face manipulation
- StyleGANEX can manipulate mid-layer features with various resolutions and supports diverse manipulation tasks
Styleganex
- StyleGAN has a fixed-crop limitation
- StyleGAN2 is the focus of most existing StyleGAN-based manipulation models
Analysis of the fixed-crop limitation
- StyleGAN has potential for handling normal FoV face images
- Generator of StyleGAN is a fully convolutional architecture that can handle different feature resolutions
- Translation equivariance of convolution operations supports feature translation
- Limiting factor is first-layer feature with fixed resolution of 4x4
- Sub-pixel translation fuses adjacent feature values, resulting in blurry face
- First-layer feature inadequate to characterize spatial information of unaligned faces
- Style-GANEX expands shallow layers to have same resolution as 7th layer, enabling face manipulation beyond cropped and aligned faces
Face manipulation with styleganex
Styleganex encoder
- StyleGANEX encoder E is used to project real face images into the W + -F space of StyleGANEX G
- E builds upon the pSp encoder
- For F space, a convolution layer is added to map concatenated features to the first-layer input feature f of G
- For W + space, the original pSp encoder takes a 256 x 256 image as input and convolves it to eighteen 1 x 1 x 512 features
- Global average pooling is added to resize all features to 1 x 1 x 512 before mapping to latent codes
- Scalar parameter indicates shallow layers of G receive encoder features
Styleganex inversion and editing
- We perform a two-step StyleGANEX inversion to find appropriate f and ŵ+ that precisely reconstruct a target image x.
- Step I projects x to initial f and w + with E. Step II optimizes f and w + to reduce the reconstruction error.
- We use x instead of x to predict more accurate w +.
- We measure the distance between the reconstructed x and the target x in terms of pixel similarity, perceptual similarity, and identity preservation.
- We can perform flexible editing over x as in StyleGAN.
Styleganex-based translation
- End-to-end image-to-image translation framework
- Can be trained to do different face manipulation tasks
- Face super-resolution: train encoder to recover high-resolution image from low-resolution image
- Sketch/mask-to-face translation: use trainable light-weight translation network to map source to intermediate domain
- Video face editing: train encoder to edit face with editing vector
- Video toonification: use StyleGAN fine-tuned on cartoon images
Experimental results
- Set λ 2 , λ 3 , λ 4 , λ 5 , λ 6 for tasks
- Translation network consists of two downsampling convolutional layers, two ResBlocks and two upsampling convolutional layers
- Experiments performed on single NVIDIA Tesla V100 GPU
- Training data from FFHQ, augmented with random geometric transformations
- Testing data from FaceForensics++, Unsplash and Pexels
Face manipulation
- Face editing works well on StyleGANEX
- Compared with other baselines, our approach processes the entire image as a whole and avoids discontinuity issues
- Our method successfully edits facial attributes and styles
- 32x super resolution results show both face and non-face regions are reasonably restored
- Our method surpasses other methods in precise detail restoration and uniform super-resolution
- Our method successfully translates whole images and achieves realism and structural consistency to the inputs
- Our method receives the best score in a user study
- Our method preserves more details of the non-face region and generates sharper faces than VToonify-T
Ablation study
- Step II of two-step inversion verified in Fig. 6
- Step I studied in Fig. 14
- 500-iteration optimization for precise reconstruction
- Domain transfer to Disney Princess
- Poor result with 2,000 iterations if directly optimize mean w+ and random f
- Input choice studied in Fig. 15
- Cropped aligned faces default choice
- Reasonable results with cropped input to decrease background proportion
- Skip connection studied in Fig. 16
- Smaller skip connection to enhance model robustness to low-quality inputs
Limitations
- Relies on inefficient optimization process for precise reconstruction
- Focuses on overcoming fixed-crop limitation of StyleGAN, not GAN inversion
- Limited by feature representation of StyleGAN
- May not handle out-of-distribution features
- Focuses on face manipulation, not non-facial regions
- May inherit model bias of StyleGAN
Conclusion
- Presented an approach to refactor StyleGAN to overcome fixed-crop limitation while retaining style control abilities
- Refactored model called StyleGANEX fully inherits parameters of pre-trained StyleGAN without retraining
- Introduced StyleGANEX encoder to project normal FoV face images to joint W + -F space of StyleGANEX for real face inversion and manipulation
- Training encoder uses one NVIDIA Tesla V100 GPU for 100,000 iterations
- Training time is about 2 days for 100,000 iterations
- Image inference uses one NVIDIA Tesla V100 GPU and a batch size of 1
- Inference on 796 testing images takes about 107.11 s
- Normal FoV face super-resolution results show detail restoration and uniform super-resolution without discontinuity between face and non-face regions
- Sketch/mask-to-face translation results show realism and structural consistency to the inputs
- StyleGANEX inversion and facial attribute/style editing results show full background with target style
- Comparison results show better scores for StyleGANEX encoder and full two-step inversion