Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

Language-guided image editing has been successful
Investigated exemplar-guided image editing for more precise control
Leveraged self-supervised training to disentangle and re-organize source and exemplar
Proposed information bottleneck and strong augmentations to avoid copying and pasting
Designed arbitrary shape mask for exemplar image and leveraged classifier-free guidance
Involves a single forward pass of the diffusion model without iterative optimization
Impressive performance and controllable editing on in-the-wild images with high fidelity

Paper Content

Introduction

Creative editing for photos is becoming more common due to advances in social media platforms
AI-based techniques can make image editing easier
Deep neural networks can produce results for various low-level image editing tasks
Semantic image editing is more challenging and requires manipulation of high-level semantics
Language-image models have enabled various image manipulation tasks
Exemplar-based image editing allows accurate semantic manipulation with an exemplar image
Image-conditioned diffusion model is trained in a self-supervised manner
Techniques are proposed to avoid the degenerate solution of directly copying and pasting the exemplar
Approach performs favorably over prior art for in-the-wild image editing

Related work

Cutting and pasting one image onto another to create a realistic composite is a common photo editing operation
Many methods have been proposed to make the composite look more realistic
Traditional methods extract handcrafted features to match the color distribution
Recent works leverage deep semantic features to improve robustness
Existing methods are limited to specific image genres, such as faces, cars, birds, cats, etc.

Method

Automatically merge reference image into source image
Images better than verbal descriptions for expressing complex ideas
Source image (x_s), edit region (m) and reference image (x_r) are used to synthesize the output image (y)
Model needs to understand the object in the reference image, transform the view of the object, inpaint the area around the object, and possibly perform super-resolution

Preliminaries

Self-supervised training is proposed to simulate training data for exemplar-based image editing.
A naive solution is to replace the text condition with the reference image condition.
Diffusion models have been applied to many text-based image editing works.
Naive solution leads to copy-and-paste artifacts in the edit region.
Three principles are proposed to prevent the model from learning a trivial mapping function.
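
The self-supervised setup above can be sketched in a few lines. This is a minimal illustration, assuming a training example is simulated by cropping a bounding-box patch out of an image to serve as the reference and masking that same region to form the source; the function name `make_training_triple` and the box layout `(top, left, bottom, right)` are hypothetical, chosen only for this sketch.

```python
import numpy as np

def make_training_triple(image, box):
    """Simulate one training example from a single image.

    image: (H, W, C) float array.
    box:   (top, left, bottom, right), end indices exclusive (assumed layout).
    Returns (source, mask, reference, target).
    """
    top, left, bottom, right = box
    # Edit region: True inside the bounding box.
    mask = np.zeros(image.shape[:2], dtype=bool)
    mask[top:bottom, left:right] = True
    # Reference: the patch that was inside the box.
    reference = image[top:bottom, left:right].copy()
    # Source: the original image with the edit region blanked out.
    source = image.copy()
    source[mask] = 0.0
    # Target: the untouched image the model should reconstruct.
    return source, mask, reference, image

rng = np.random.default_rng(0)
image = rng.random((8, 8, 3))
source, mask, reference, target = make_training_triple(image, (2, 3, 6, 7))
```

Because the reference here is an exact crop of the target, a model trained on such triples can minimise its loss by pasting the patch back unchanged, which is the trivial mapping the three principles are meant to prevent.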

Model designs

Re-analyzed difference between text and image conditions
Compressed representation ignores high-frequency details while maintaining semantic information
Added fully-connected layers to decode feature and inject into diffusion process
Leveraged well-trained diffusion model for initialization as strong image prior
Used data augmentation techniques on reference image to reduce domain gap
Generated arbitrarily shaped mask based on bounding box to reduce gap between training and testing
Used classifier-free sampling strategy to control similarity degree between edited area and reference image
Trained model for 40 epochs on 64 NVIDIA V100 GPUs
Built COCO Exemplar-based image Editing benchmark for qualitative and quantitative analysis
Used FID, Quality Score and CLIP score to evaluate generated images
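
The classifier-free sampling strategy listed above can be illustrated with the standard guidance rule, which blends a conditional and an unconditional noise prediction. This is a sketch under that assumption, not the authors' implementation; `guided_noise` and `scale` are illustrative names, and the arrays stand in for outputs of a denoising network.

```python
import numpy as np

def guided_noise(eps_cond, eps_uncond, scale):
    """Blend two noise predictions with the standard classifier-free guidance rule."""
    return eps_uncond + scale * (eps_cond - eps_uncond)

# Stand-in arrays; in practice both would come from the same denoiser,
# run with and without the reference-image condition.
eps_cond = np.array([1.0, 2.0, 3.0])
eps_uncond = np.array([0.0, 1.0, 1.0])

no_guidance = guided_noise(eps_cond, eps_uncond, scale=0.0)  # ignores the reference
plain_cond = guided_noise(eps_cond, eps_uncond, scale=1.0)   # ordinary conditional prediction
strong = guided_noise(eps_cond, eps_uncond, scale=3.0)       # extrapolates toward the condition
```

A larger `scale` moves the prediction further along the direction that separates the conditional from the unconditional output, which is how the similarity between the edited area and the reference image can be adjusted at sampling time.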

Comparisons

Blended Diffusion uses CLIP model to provide gradients to guide diffusion sampling process
Blended Diffusion (image) uses reference image to calculate CLIP loss
Stable Diffusion uses text prompt to represent reference image
DCCF is state-of-the-art image harmonization method
Our method achieves photo-realistic results while staying similar to the reference image

Ablation study

Leverage image prior
Strong augmentation
Information bottleneck
Classifier-free guidance

From language to image condition

Language can be used to control the inpainting of a mask region
Image-guided results are more similar to the reference image than language-guided results

In-the-wild image editing

Our method can generate multiple outputs from the same input.
Generated images vary, but keep the key identity of the reference image.

Conclusion

Introduce novel image editing scenario: exemplar-based image editing
Leverage self-supervised training based on diffusion model
Carefully analyze and solve boundary artifacts issue
Enable user to precisely control editing
Impressive performance on in-the-wild images
