Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

  • StyleGAN is limited to cropped aligned faces at a fixed image resolution.
  • Dilated convolutions can be used to rescale the receptive fields of shallow layers in StyleGAN.
  • This allows fixed-size small features at shallow layers to be extended into larger ones that can accommodate variable resolutions.
  • An encoder is introduced to enable real face inversion and manipulation.

Paper Content

  • StyleGAN inversion aims to project real face images into the latent space of StyleGAN
  • StyleGAN-based face manipulation can be done by optimizing the latent code or searching for offline editing vectors
  • Image-to-image translation frameworks can also be used to achieve face manipulation
  • StyleGANEX can manipulate mid-layer features with various resolutions and supports diverse manipulation tasks

Styleganex

  • StyleGAN has a fixed-crop limitation
  • StyleGAN2 is the focus of most existing StyleGAN-based manipulation models

Analysis of the fixed-crop limitation

  • StyleGAN has potential for handling normal FoV face images
  • Generator of StyleGAN is a fully convolutional architecture that can handle different feature resolutions
  • Translation equivariance of convolution operations supports feature translation
  • Limiting factor is first-layer feature with fixed resolution of 4x4
  • Sub-pixel translation fuses adjacent feature values, resulting in blurry face
  • First-layer feature inadequate to characterize spatial information of unaligned faces
  • Style-GANEX expands shallow layers to have same resolution as 7th layer, enabling face manipulation beyond cropped and aligned faces

Face manipulation with styleganex

Styleganex encoder

  • StyleGANEX encoder E is used to project real face images into the W + -F space of StyleGANEX G
  • E builds upon the pSp encoder
  • For F space, a convolution layer is added to map concatenated features to the first-layer input feature f of G
  • For W + space, the original pSp encoder takes a 256 x 256 image as input and convolves it to eighteen 1 x 1 x 512 features
  • Global average pooling is added to resize all features to 1 x 1 x 512 before mapping to latent codes
  • Scalar parameter indicates shallow layers of G receive encoder features

Styleganex inversion and editing

  • We perform a two-step StyleGANEX inversion to find appropriate f and ŵ+ that precisely reconstruct a target image x.
  • Step I projects x to initial f and w + with E. Step II optimizes f and w + to reduce the reconstruction error.
  • We use x instead of x to predict more accurate w +.
  • We measure the distance between the reconstructed x and the target x in terms of pixel similarity, perceptual similarity, and identity preservation.
  • We can perform flexible editing over x as in StyleGAN.

Styleganex-based translation

  • End-to-end image-to-image translation framework
  • Can be trained to do different face manipulation tasks
  • Face super-resolution: train encoder to recover high-resolution image from low-resolution image
  • Sketch/mask-to-face translation: use trainable light-weight translation network to map source to intermediate domain
  • Video face editing: train encoder to edit face with editing vector
  • Video toonification: use StyleGAN fine-tuned on cartoon images

Experimental results

  • Set λ 2 , λ 3 , λ 4 , λ 5 , λ 6 for tasks
  • Translation network consists of two downsampling convolutional layers, two ResBlocks and two upsampling convolutional layers
  • Experiments performed on single NVIDIA Tesla V100 GPU
  • Training data from FFHQ, augmented with random geometric transformations
  • Testing data from FaceForensics++, Unsplash and Pexels

Face manipulation

  • Face editing works well on StyleGANEX
  • Compared with other baselines, our approach processes the entire image as a whole and avoids discontinuity issues
  • Our method successfully edits facial attributes and styles
  • 32x super resolution results show both face and non-face regions are reasonably restored
  • Our method surpasses other methods in precise detail restoration and uniform super-resolution
  • Our method successfully translates whole images and achieves realism and structural consistency to the inputs
  • Our method receives the best score in a user study
  • Our method preserves more details of the non-face region and generates sharper faces than VToonify-T

Ablation study

  • Step II of two-step inversion verified in Fig. 6
  • Step I studied in Fig. 14
  • 500-iteration optimization for precise reconstruction
  • Domain transfer to Disney Princess
  • Poor result with 2,000 iterations if directly optimize mean w+ and random f
  • Input choice studied in Fig. 15
  • Cropped aligned faces default choice
  • Reasonable results with cropped input to decrease background proportion
  • Skip connection studied in Fig. 16
  • Smaller skip connection to enhance model robustness to low-quality inputs

Limitations

  • Relies on inefficient optimization process for precise reconstruction
  • Focuses on overcoming fixed-crop limitation of StyleGAN, not GAN inversion
  • Limited by feature representation of StyleGAN
  • May not handle out-of-distribution features
  • Focuses on face manipulation, not non-facial regions
  • May inherit model bias of StyleGAN

Conclusion

  • Presented an approach to refactor StyleGAN to overcome fixed-crop limitation while retaining style control abilities
  • Refactored model called StyleGANEX fully inherits parameters of pre-trained StyleGAN without retraining
  • Introduced StyleGANEX encoder to project normal FoV face images to joint W + -F space of StyleGANEX for real face inversion and manipulation
  • Training encoder uses one NVIDIA Tesla V100 GPU for 100,000 iterations
  • Training time is about 2 days for 100,000 iterations
  • Image inference uses one NVIDIA Tesla V100 GPU and a batch size of 1
  • Inference on 796 testing images takes about 107.11 s
  • Normal FoV face super-resolution results show detail restoration and uniform super-resolution without discontinuity between face and non-face regions
  • Sketch/mask-to-face translation results show realism and structural consistency to the inputs
  • StyleGANEX inversion and facial attribute/style editing results show full background with target style
  • Comparison results show better scores for StyleGANEX encoder and full two-step inversion