Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

Recent text-to-image generative models can generate diverse and creative imagery.
Current state-of-the-art diffusion models may fail to generate images that fully convey the semantics in the given text prompt.
We introduce the concept of Generative Semantic Nursing to help mitigate these failure cases.
We use an attention-based formulation of GSN to guide the model to refine the cross-attention units.
Our approach conveys the desired concepts more faithfully across a range of text prompts.

Recent advancements in text-based image generation have shown an ability to generate diverse imagery
Images produced by such models do not always reflect the semantic meaning of the target prompt
Two key semantic issues in state-of-the-art text-based image generation models: “catastrophic neglect” and incorrect “attribute binding”
Introduce concept of “Generative Semantic Nursing” (GSN) to mitigate these issues
Proposed form of GSN called Attend-and-Excite
Attend-and-Excite tackles issue of catastrophic neglect and encourages correct bindings between attributes and their subjects
Demonstrated superiority of Attend-and-Excite over Stable Diffusion and alternative methods
Analyzed cross-attention maps with and without Attend-and-Excite

Early works studied text-guided image synthesis with GANs
More recently, impressive results with large-scale auto-regressive models and diffusion models
To enforce reliance on text, classifier-free guidance used
To provide users with more control, segmentation map or spatial conditioning used
To introduce new concepts, map a set of images to a new “word” in the embedding space of the model

Stable Diffusion is a state-of-the-art Latent Diffusion Model that operates in the latent space of an autoencoder rather than in pixel space.
An encoder and decoder are trained to map an image to a latent code and reconstruct the image.
A denoising diffusion probabilistic model is used to produce a denoised version of an input latent.
The model is conditioned on an additional input vector, typically a text encoding.
The denoising network consists of self-attention and cross-attention layers.
At each timestep, the denoising network’s intermediate features receive information from the embedding of the guiding text via the cross-attention layers.
An attention map is calculated over linear projections of the intermediate features and text embedding.
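
The attention-map computation described above can be illustrated with a minimal sketch. It assumes the common scaled dot-product form, in which each spatial patch's query is compared against each text token's key and the scores are normalized over the tokens. The function names, the toy vectors, and the choice to pass in already-projected queries and keys are assumptions made for illustration, not code from the paper.

```python
import math

def softmax(row):
    """Numerically stable softmax over a list of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention_maps(queries, keys):
    """Return a P x N matrix of attention weights.

    queries: P vectors, one per spatial patch (projected image features).
    keys:    N vectors, one per text token (projected text embedding).
    Each row is a distribution over the N tokens for one patch.
    """
    d = len(keys[0])
    scale = 1.0 / math.sqrt(d)
    maps = []
    for q in queries:
        scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
        maps.append(softmax(scores))
    return maps

# Toy example: 4 patches (a 2x2 feature map) and 3 text tokens, 2-d projections.
queries = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5], [1.0, 1.0]]
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
maps = cross_attention_maps(queries, keys)
for patch, row in enumerate(maps):
    print(patch, [round(p, 3) for p in row])
```

Reading column s of the result across all patches gives the spatial attention map for token s.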

At the core of the method is the idea of generative semantic nursing
Gradually shift the noised latent code at each timestep toward a more semantically-faithful generation
Consider the attention maps of the subject tokens in the prompt
Define a loss objective that attempts to maximize the attention values for each subject token
Update the noised latent according to the gradient of the computed loss
Extract a spatial attention map for each token in the prompt
Perform a Softmax operation on the attention values
Extract the normalized attention map for each subject token
Optimization objective encourages the existence of at least one patch of attention with a high activation value
Perform iterative updates at various denoising steps
Use cross-attention maps as an explanation for the model
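
The steps above can be sketched on a toy problem. The loss below is one form consistent with the description: for each subject token take its strongest patch activation, and penalize the most neglected subject. Everything else is an assumption for illustration: `toy_attention` is a hypothetical stand-in for the denoising network's cross-attention, the latent is a flat list of per-patch logits, and a finite-difference gradient replaces the backpropagation a real implementation would use.

```python
import math

def softmax(row):
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def toy_attention(latent, num_tokens):
    """Hypothetical model: split the latent into per-patch logits and
    normalize each patch's logits over the text tokens."""
    rows = [latent[i:i + num_tokens] for i in range(0, len(latent), num_tokens)]
    return [softmax(r) for r in rows]

def neglect_loss(attn_maps, subject_indices):
    """1 minus the strongest patch activation, for the most neglected subject."""
    per_subject = []
    for s in subject_indices:
        strongest = max(row[s] for row in attn_maps)
        per_subject.append(1.0 - strongest)
    return max(per_subject)

def numerical_gradient(loss_fn, latent, eps=1e-5):
    """Central finite differences (stand-in for automatic differentiation)."""
    grad = []
    for i in range(len(latent)):
        up = list(latent)
        up[i] += eps
        down = list(latent)
        down[i] -= eps
        grad.append((loss_fn(up) - loss_fn(down)) / (2 * eps))
    return grad

def update_latent(latent, loss_fn, step_size):
    """One gradient step that shifts the latent toward lower loss."""
    grad = numerical_gradient(loss_fn, latent)
    return [z - step_size * g for z, g in zip(latent, grad)]

# 2 patches x 3 tokens; token 0 dominates, tokens 1 and 2 are the subjects.
latent = [2.0, 0.5, 0.1, 1.5, 0.2, 0.3]
subjects = [1, 2]
loss_fn = lambda z: neglect_loss(toy_attention(z, 3), subjects)

before = loss_fn(latent)
after = loss_fn(update_latent(latent, loss_fn, step_size=1.0))
print("loss before:", before, "loss after:", after)
```

Because the loss only looks at the single strongest patch per subject, it is satisfied as soon as one patch attends strongly to each subject, which matches the stated objective of at least one high-activation patch.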

Attend-and-Excite is able to generate images that contain all subjects with correctly bound colors.
Attend-and-Excite is able to improve attribute bindings between colors and subjects.
Attend-and-Excite is able to generate images for complex prompts with three or more subjects, complex attributes, and interactions between subjects.

Quantify performance of each method using CLIP-space distances
Evaluate image-text similarities between generated images and text prompts
Analyze modality gap between CLIP’s image and text embeddings
Compute average CLIP cosine similarity between text prompt and set of 64 generated images
Consider only most neglected subject independently of full text
Maximize smaller of two scores to minimize neglect
Compute Upper Bound by collecting 50 images from classification and detection datasets
Attend-and-Excite outperforms all baselines across all subsets
Attend-and-Excite significantly improves Minimum Object Similarity
Compute average CLIP similarity between the prompt and all captions generated for the output images
Attend-and-Excite outperforms all alternative methods
User study to analyze fidelity of generated images
Attend-and-Excite received highest percentage of votes across all subsets
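
The two image-text scores described above can be sketched as follows. The sketch assumes the embeddings have already been produced by CLIP and are passed in as plain lists of floats; the function names and the toy 2-d vectors are hypothetical and stand in for real CLIP outputs.

```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def full_prompt_similarity(prompt_emb, image_embs):
    """Average similarity between the full prompt and every generated image."""
    sims = [cosine_similarity(prompt_emb, e) for e in image_embs]
    return sum(sims) / len(sims)

def minimum_object_similarity(subject_embs, image_embs):
    """For each image keep only the score of its most neglected subject,
    then average over images."""
    per_image = [
        min(cosine_similarity(s, e) for s in subject_embs)
        for e in image_embs
    ]
    return sum(per_image) / len(per_image)

# Toy embeddings: two subjects and two generated images.
subject_embs = [[1.0, 0.0], [0.0, 1.0]]
image_embs = [[1.0, 0.0], [1.0, 1.0]]  # first image ignores the second subject
prompt_emb = [1.0, 1.0]

print("full prompt:", full_prompt_similarity(prompt_emb, image_embs))
print("minimum object:", minimum_object_similarity(subject_embs, image_embs))
```

In the toy data the first image depicts only one subject, so it scores reasonably against the full prompt but contributes zero to the minimum-object score, which is why the latter is the more direct measure of neglect.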

Generative model has limited expressive power
Results may be out of distribution when prompt is outside of what the model learned
Combinations of subjects that don’t naturally appear together may lead to less realistic results

Diffusion process can be corrected once it takes a wrong turn
Introduce concept of Generative Semantic Nursing (GSN)
GSN encourages all subject tokens in text to be attended to by some image patch
Alleviate two core semantic issues on the fly
Strengthen text conditioning along image generation process
GSN can be applied to any image editing and generation task
Cross-attention maps act as a good medium for explainability
Iterative latent refinement repeats the latent update until each subject token's maximum attention value reaches a predefined threshold
Stop latent modification after 25 denoising steps
Attention re-weighting variant of Prompt-to-Prompt (ptp) explored
Attend-and-Excite outperforms ptp in mitigation of semantic issues
Figures 15 and 16 present uncurated results
Figure 17 provides additional results and comparisons
Figure 18 contains additional results for prompts from StructureDiffusion paper
Figure 22 presents additional results obtained using Composable Diffusion
Figure 23 contains additional comparisons of cross-attention maps for subject tokens