Link to paper
The full paper is available here.
You can also find the paper on PapersWithCode here.
Abstract
- Recent text-to-image generative models can generate diverse and creative imagery.
- Current state-of-the-art diffusion models may fail to generate images that fully convey the semantics in the given text prompt.
- We introduce the concept of Generative Semantic Nursing to help mitigate these failure cases.
- We use an attention-based formulation of GSN to guide the model to refine the cross-attention units.
- Our approach conveys the desired concepts more faithfully across a range of text prompts.
Paper Content
Introduction
- Recent advancements in text-based image generation have shown an ability to generate diverse imagery
- Images produced by such models do not always reflect the semantic meaning of the target prompt
- Two key semantic issues in state-of-the-art text-based image generation models: “catastrophic neglect” and incorrect “attribute binding”
- Introduce concept of “Generative Semantic Nursing” (GSN) to mitigate these issues
- Proposed form of GSN called Attend-and-Excite
- Attend-and-Excite tackles issue of catastrophic neglect and encourages correct bindings between attributes and their subjects
- Demonstrated superiority of Attend-and-Excite over Stable Diffusion and alternative methods
- Analyzed cross-attention maps with and without Attend-and-Excite
Related work
- Early works studied text-guided image synthesis with GANs
- More recently, impressive results with large-scale auto-regressive models and diffusion models
- To enforce reliance on text, classifier-free guidance used
- To provide users with more control, segmentation map or spatial conditioning used
- To introduce concepts, map images to “word” in embedding space of model
Preliminaries
- Latent Diffusion Models use a state-of-the-art Stable Diffusion model to operate in the latent space of an autoencoder.
- An encoder and decoder are trained to map an image to a latent code and reconstruct the image.
- A denoising diffusion probabilistic model is used to produce a denoised version of an input latent.
- The model is conditioned on an additional input vector, typically a text encoding.
- The denoising network consists of self-attention and cross-attention layers.
- At each timestep, the denoising network’s intermediate features receive information from the embedding of the guiding text via the cross-attention layers.
- An attention map is calculated over linear projections of the intermediate features and text embedding.
Attend-and-excite
- At the core of the method is the idea of generative semantic nursing
- Gradually shift the noised latent code at each timestep toward a more semantically-faithful generation
- Consider the attention maps of the subject tokens in the prompt
- Define a loss objective that attempts to maximize the attention values for each subject token
- Update the noised latent according to the gradient of the computed loss
- Extract a spatial attention map for each token in the prompt
- Perform a Softmax operation on the attention values
- Extract the normalized attention map for each subject token
- Optimization objective encourages the existence of at least one patch of attention with a high activation value
- Perform iterative updates at various denoising steps
- Use cross-attention maps as an explanation for the model
Results
- Constructed a new benchmark to evaluate methods
- Constructed prompts containing two subjects and a variety of attributes
- Considered three types of text prompts
- Considered 12 animals, 12 objects, and 11 colors
- Generated 64 images using 64 random seeds
Qualitative comparisons
- Attend-and-Excite is able to generate images that contain all subjects with correctly binded colors.
- Attend-and-Excite is able to improve attribute bindings between colors and subjects.
- Attend-and-Excite is able to generate images for complex prompts with three or more subjects, complex attributes, and interactions between subjects.
Quantitative analysis
- Quantify performance of each method using CLIP-space distances
- Evaluate image-text similarities between generated images and text prompts
- Analyze modality gap between CLIP’s image and text embeddings
- Compute average CLIP cosine similarity between text prompt and set of 64 generated images
- Consider only most neglected subject independently of full text
- Maximize smaller of two scores to minimize neglect
- Compute Upper Bound by collecting 50 images from classification and detection datasets
- Attend-and-Excite outperforms all baselines across all subsets
- Attend-and-Excite significantly improves Minimum Object Similarity
- Compute average CLIP similarity between prompt and all captions
- Attend-and-Excite outperforms all alternative methods
- User study to analyze fidelity of generated images
- Attend-and-Excite received highest percentage of votes across all subsets
Limitations
- Generative model has limited expressive power
- Results may be out of distribution when prompt is outside of what the model learned
- Combinations of subjects that don’t naturally appear together may lead to less realistic results
Conclusions
- Diffusion process can be corrected once it takes a wrong turn
- Introduce concept of Generative Semantic Nursing (GSN)
- GSN encourages all subject tokens in text to be attended to by some image patch
- Alleviate two core semantic issues on the fly
- Strengthen text conditioning along image generation process
- GSN can be applied to any image editing and generation task
- Cross-attention maps act as a good medium for explainability
- Iterative latent refinement ensures each subject token achieves a certain maximum activation value
- Stop latent modification after 25 denoising steps
- Attention re-weighting variant of ptp explored
- Attend-and-Excite outperforms ptp in mitigation of semantic issues
- Figures 15 and 16 present uncurated results
- Figure 17 provides additional results and comparisons
- Figure 18 contains additional results for prompts from StructureDiffusion paper
- Figure 22 presents additional results obtained using Composable Diffusion
- Figure 23 contains additional comparisons of cross-attention maps for subject tokens