Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

  • Language-guided image editing has been successful
  • Investigated exemplar-guided image editing for more precise control
  • Leveraged self-supervised training to disentangle and re-organize source and exemplar
  • Proposed information bottleneck and strong augmentations to avoid copying and pasting
  • Designed an arbitrarily shaped mask for the exemplar image and leveraged classifier-free guidance
  • Requires only a single forward pass of the diffusion model, with no iterative optimization
  • Impressive performance and controllable editing on in-the-wild images with high fidelity

Paper Content

Introduction

  • Creative editing for photos is becoming more common due to advances in social media platforms
  • AI-based techniques can make image editing easier
  • Deep neural networks can produce results for various low-level image editing tasks
  • Semantic image editing is more challenging and requires manipulation of high-level semantics
  • Language-image models have enabled various image manipulation tasks
  • Exemplar-based image editing allows accurate semantic manipulation with an exemplar image
  • Image-conditioned diffusion model is trained in a self-supervised manner
  • Techniques are proposed to avoid the degenerate copy-and-paste solution
  • Approach performs favorably against prior art for in-the-wild image editing
  • Cutting and pasting one image onto another to create a realistic composite is a common photo editing operation
  • Many methods have been proposed to make the composite look more realistic
  • Traditional methods extract handcrafted features to match the color distribution
  • Recent works leverage deep semantic features to improve robustness
  • Existing methods are limited to specific image genres, such as faces, cars, birds, cats, etc.

Method

  • Automatically merge reference image into source image
  • Images better than verbal descriptions for expressing complex ideas
  • Source image (x_s), edit region (m), and reference image (x_r) are used to synthesize the output image (y)
  • Model needs to understand the object in the reference image, transform its view to fit the source, inpaint the area around the object, and handle super-resolution
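
The role of the edit region can be illustrated with a minimal sketch. It assumes m is a binary mask where 1 marks editable pixels, and it uses a placeholder array, generated, in place of a real model output. The function name compose is hypothetical and not taken from the paper.

```python
def compose(x_s, generated, m):
    """Keep source pixels where m == 0; take generated pixels where m == 1.

    x_s, generated, and m are 2D lists of equal shape.
    `generated` is a stand-in for model output (an assumption for illustration).
    """
    return [
        [g if editable else s for s, g, editable in zip(row_s, row_g, row_m)]
        for row_s, row_g, row_m in zip(x_s, generated, m)
    ]


# Toy 2x3 example: 1s are source pixels, 9s are synthesized pixels.
x_s = [[1, 1, 1], [1, 1, 1]]
generated = [[9, 9, 9], [9, 9, 9]]
m = [[0, 1, 0], [0, 1, 1]]

y = compose(x_s, generated, m)
print(y)
```

Only the positions marked in m change; everything else in y is copied from x_s.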

Preliminaries

  • Self-supervised training is proposed to simulate training data for exemplar-based image editing.
  • A naive solution is to replace the text condition with the reference image condition.
  • Diffusion models have been applied to many text-based image editing works.
  • Naive solution leads to copy-and-paste artifacts in the edit region.
  • Three principles are proposed to prevent the model from learning a trivial mapping function.
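
A rough sketch of how a self-supervised training example might be simulated from a single image follows. It assumes an object bounding box is available and that the masked region is filled with zeros; the function name make_training_pair and the bbox layout are hypothetical. Because the reference is cropped directly from the target, a model can reach low loss by pasting it back, which is the trivial mapping the principles above are meant to prevent.

```python
def make_training_pair(image, bbox):
    """Build (masked_source, mask, reference, target) from one image.

    image: 2D list of pixel values.
    bbox:  (top, left, bottom, right), with bottom and right exclusive.
    """
    top, left, bottom, right = bbox
    height, width = len(image), len(image[0])

    # Reference image: the object patch cropped from the image itself.
    reference = [row[left:right] for row in image[top:bottom]]

    # Edit region: 1 inside the bounding box, 0 elsewhere.
    mask = [
        [1 if top <= r < bottom and left <= c < right else 0 for c in range(width)]
        for r in range(height)
    ]

    # Source image: the original with the edit region blanked out.
    masked_source = [
        [0 if mask[r][c] else image[r][c] for c in range(width)]
        for r in range(height)
    ]

    # Training target: the untouched original image.
    return masked_source, mask, reference, image


image = [[r * 4 + c for c in range(4)] for r in range(4)]
masked_source, mask, reference, target = make_training_pair(image, (1, 1, 3, 3))
print(reference)
```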

Model designs

  • Re-analyzed difference between text and image conditions
  • Compressed representation ignores high-frequency details while maintaining semantic information
  • Added fully-connected layers to decode feature and inject into diffusion process
  • Leveraged well-trained diffusion model for initialization as strong image prior
  • Used data augmentation techniques on reference image to reduce domain gap
  • Generated arbitrarily shaped mask based on bounding box to reduce gap between training and testing
  • Used classifier-free sampling strategy to control similarity degree between edited area and reference image
  • Trained model for 40 epochs on 64 NVIDIA V100 GPUs
  • Built COCO Exemplar-based image Editing benchmark for qualitative and quantitative analysis
  • Used FID, Quality Score and CLIP score to evaluate generated images
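
The classifier-free sampling step mentioned above can be sketched with the standard guidance formula, which extrapolates from an unconditional noise prediction toward a conditional one. This is a minimal illustration, assuming the conditional prediction uses the reference image and the unconditional one drops it; the function name and the toy numbers are hypothetical.

```python
def classifier_free_guidance(eps_uncond, eps_cond, scale):
    """Combine two noise predictions: eps_uncond + scale * (eps_cond - eps_uncond).

    scale = 0 gives the unconditional prediction, scale = 1 the conditional one,
    and scale > 1 pushes the result further toward the condition.
    """
    return [u + scale * (c - u) for u, c in zip(eps_uncond, eps_cond)]


eps_uncond = [0.0, 1.0]  # toy prediction without the reference image
eps_cond = [1.0, 3.0]    # toy prediction with the reference image

guided = classifier_free_guidance(eps_uncond, eps_cond, scale=2.0)
print(guided)
```

A larger scale makes the edited area follow the reference more closely, which matches the similarity control described in the list above.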

Comparisons

  • Blended Diffusion uses CLIP model to provide gradients to guide diffusion sampling process
  • Blended Diffusion (image) uses reference image to calculate CLIP loss
  • Stable Diffusion uses text prompt to represent reference image
  • DCCF is state-of-the-art image harmonization method
  • Our method achieves photo-realistic result while being similar to reference image

Ablation study

  • Leverage image prior
  • Strong augmentation
  • Information bottleneck
  • Classifier-free guidance

From language to image condition

  • Language can be used to control the inpainting of a mask region
  • Image-guided results are more similar to the reference image than language-guided results
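
One way to make "more similar" concrete is cosine similarity between embedding vectors, in the spirit of the CLIP score used for evaluation. The sketch below uses made-up 2D vectors rather than real CLIP embeddings, and the variable names are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy vectors standing in for image embeddings (not real data).
reference_emb = [1.0, 0.0]
image_guided_emb = [4.0, 3.0]
language_guided_emb = [3.0, 4.0]

sim_image = cosine_similarity(reference_emb, image_guided_emb)
sim_language = cosine_similarity(reference_emb, language_guided_emb)
print(sim_image > sim_language)
```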

In-the-wild image editing

  • Our method can generate multiple outputs from the same input.
  • Generated images vary, but keep key identity of reference image.

Conclusion

  • Introduce novel image editing scenario: exemplar-based image editing
  • Leverage self-supervised training based on diffusion model
  • Carefully analyze and solve boundary artifacts issue
  • Enable user to precisely control editing
  • Impressive performance on in-the-wild images