Link to paper
The full paper is available here.
You can also find the paper on PapersWithCode here.
Abstract
- Language-guided image editing has been successful
- Investigated exemplar-guided image editing for more precise control
- Leveraged self-supervised training to disentangle and re-organize source and exemplar
- Proposed information bottleneck and strong augmentations to avoid copying and pasting
- Designed arbitrary shape mask for exemplar image and leveraged classifier-free guidance
- Involves single forward of diffusion model without iterative optimization
- Impressive performance and controllable editing on in-the-wild images with high fidelity
Paper Content
Introduction
- Creative editing for photos is becoming more common due to advances in social media platforms
- AI-based techniques can make image editing easier
- Deep neural networks can produce results for various low-level image editing tasks
- Semantic image editing is more challenging and requires manipulation of high-level semantics
- Language-image models have enabled various image manipulation tasks
- Exemplar-based image editing allows accurate semantic manipulation with an exemplar image
- Image-conditioned diffusion model is trained in a self-supervised manner
- Techniques are proposed to tackle degenerate challenge
- Approach performs favorably over prior arts for inthe-wild image editing
Related work
- Cutting and pasting one image onto another to create a realistic composite is a common photo editing operation
- Many methods have been proposed to make the composite look more realistic
- Traditional methods extract handcrafted features to match the color distribution
- Recent works leverage deep semantic features to improve robustness
- Existing methods are limited to specific image genres, such as faces, cars, birds, cats, etc.
Method
- Automatically merge reference image into source image
- Images better than verbal descriptions for expressing complex ideas
- Source image (x s ), edit region (m) and reference image (x r ) used to synthesize image (y)
- Model needs to understand object in reference image, transform view of object, inpaint area around object and involve super-resolution
Preliminaries
- Self-supervised training is proposed to simulate training data for exemplar-based image editing.
- A naive solution is to replace the text condition with the reference image condition.
- Diffusion models have been applied to many text-based image editing works.
- Naive solution leads to copy-and-paste artifacts in the edit region.
- Three principles are proposed to prevent the model from learning a trivial mapping function.
Model designs
- Re-analyzed difference between text and image conditions
- Compressed representation ignores high-frequency details while maintaining semantic information
- Added fully-connected layers to decode feature and inject into diffusion process
- Leveraged well-trained diffusion model for initialization as strong image prior
- Used data augmentation techniques on reference image to reduce domain gap
- Generated arbitrarily shaped mask based on bounding box to reduce gap between training and testing
- Used classifier-free sampling strategy to control similarity degree between edited area and reference image
- Trained model for 40 epochs on 64 NVIDIA V100 GPUs
- Built COCO Exemplar-based image Editing benchmark for qualitative and quantitative analysis
- Used FID, Quality Score and CLIP score to evaluate generated images
Comparisons
- Blended Diffusion uses CLIP model to provide gradients to guide diffusion sampling process
- Blended Diffusion (image) uses reference image to calculate CLIP loss
- Stable Diffusion uses text prompt to represent reference image
- DCCF is state-of-the-art image harmonization method
- Our method achieves photo-realistic result while being similar to reference image
Ablation study
- Leverage image prior
- Strong augmentation
- Information bottleneck
- Classifier-free guidance
From language to image condition
- Language can be used to control the inpainting of a mask region
- Image-guided results are more similar to the reference image than language-guided results
In-the-wild image editing
- Our method can generate multiple outputs from the same input.
- Generated images vary, but keep key identity of reference image.
Conclusion
- Introduce novel image editing scenario: exemplar-based image editing
- Leverage self-supervised training based on diffusion model
- Carefully analyze and solve boundary artifacts issue
- Enable user to precisely control editing
- Impressive performance on in-the-wild images