Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.


  • Recent text-to-image generative models can generate diverse and creative imagery.
  • Current state-of-the-art diffusion models may fail to generate images that fully convey the semantics in the given text prompt.
  • We introduce the concept of Generative Semantic Nursing to help mitigate these failure cases.
  • We use an attention-based formulation of GSN to guide the model to refine the cross-attention units.
  • Our approach conveys the desired concepts more faithfully across a range of text prompts.

Paper Content


  • Recent advancements in text-based image generation have shown an ability to generate diverse imagery
  • Images produced by such models do not always reflect the semantic meaning of the target prompt
  • Two key semantic issues in state-of-the-art text-based image generation models: “catastrophic neglect” and incorrect “attribute binding”
  • Introduce concept of “Generative Semantic Nursing” (GSN) to mitigate these issues
  • Proposed form of GSN called Attend-and-Excite
  • Attend-and-Excite tackles issue of catastrophic neglect and encourages correct bindings between attributes and their subjects
  • Demonstrated superiority of Attend-and-Excite over Stable Diffusion and alternative methods
  • Analyzed cross-attention maps with and without Attend-and-Excite
  • Early works studied text-guided image synthesis with GANs
  • More recently, impressive results with large-scale auto-regressive models and diffusion models
  • To enforce reliance on text, classifier-free guidance used
  • To provide users with more control, segmentation map or spatial conditioning used
  • To introduce new concepts, images are mapped to a “word” in the embedding space of the model


  • The method is applied to the state-of-the-art Stable Diffusion model, a latent diffusion model that operates in the latent space of an autoencoder.
  • An encoder and decoder are trained to map an image to a latent code and reconstruct the image.
  • A denoising diffusion probabilistic model is used to produce a denoised version of an input latent.
  • The model is conditioned on an additional input vector, typically a text encoding.
  • The denoising network consists of self-attention and cross-attention layers.
  • At each timestep, the denoising network’s intermediate features receive information from the embedding of the guiding text via the cross-attention layers.
  • An attention map is calculated over linear projections of the intermediate features and text embedding.
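
The cross-attention step described above can be sketched in a few lines. This is a minimal illustration in plain Python, not the Stable Diffusion implementation: it assumes the linear projections have already been applied, so `queries` stands for the projected image features (one vector per spatial patch) and `keys` for the projected text embeddings (one vector per token). The function names are hypothetical.

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of floats.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention_maps(queries, keys):
    """Return A where A[p][n] is the attention of patch p to token n.

    queries: P vectors (projected intermediate image features, assumed given).
    keys:    N vectors (projected text-token embeddings, assumed given).
    Each row is a softmax over tokens of the scaled dot products.
    """
    d = len(keys[0])
    maps = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in keys]
        maps.append(softmax(scores))
    return maps

def token_map(maps, n):
    # Spatial attention map of token n: one value per patch.
    return [row[n] for row in maps]

# Toy example: two patches, two tokens.
toy_maps = cross_attention_maps([[1.0, 0.0], [0.0, 1.0]], [[4.0, 0.0], [0.0, 4.0]])
```

Reading one column of the result with `token_map` gives the per-token spatial map that the rest of the method works with.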


  • At the core of the method is the idea of generative semantic nursing
  • Gradually shift the noised latent code at each timestep toward a more semantically-faithful generation
  • Consider the attention maps of the subject tokens in the prompt
  • Define a loss objective that attempts to maximize the attention values for each subject token
  • Update the noised latent according to the gradient of the computed loss
  • Extract a spatial attention map for each token in the prompt
  • Perform a Softmax operation on the attention values
  • Extract the normalized attention map for each subject token
  • Optimization objective encourages the existence of at least one patch of attention with a high activation value
  • Perform iterative updates at various denoising steps
  • Use cross-attention maps as an explanation for the model
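
The loss and latent update listed above can be illustrated on a toy problem. The sketch below is an assumption-laden stand-in, not the authors' code: the latent is a small list of patch vectors used directly as attention queries, the gradient is estimated with finite differences instead of backpropagation through the denoising network, and `step_size`, `target`, and `max_iters` are hypothetical values. Any smoothing of the attention maps is omitted.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention(latent, keys):
    # A[p][n]: attention of patch p to token n (softmax over tokens).
    d = len(keys[0])
    return [softmax([sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in keys])
            for q in latent]

def gsn_loss(latent, keys, subject_idx):
    # For each subject token take its highest attention value over all patches;
    # the loss is driven by the most neglected subject.
    maps = attention(latent, keys)
    return max(max(0.0, 1.0 - max(row[s] for row in maps)) for s in subject_idx)

def numeric_grad(latent, keys, subject_idx, eps=1e-4):
    # Central finite differences; a toy replacement for backpropagation.
    grad = [[0.0] * len(row) for row in latent]
    for i, row in enumerate(latent):
        for j in range(len(row)):
            old = row[j]
            row[j] = old + eps
            up = gsn_loss(latent, keys, subject_idx)
            row[j] = old - eps
            down = gsn_loss(latent, keys, subject_idx)
            row[j] = old  # restore the original value
            grad[i][j] = (up - down) / (2 * eps)
    return grad

def nurse_latent(latent, keys, subject_idx, step_size=1.0, target=0.8, max_iters=100):
    # Repeat gradient steps until every subject token reaches `target`
    # attention in at least one patch, or the iteration budget runs out.
    latent = [row[:] for row in latent]  # work on a copy
    for _ in range(max_iters):
        if gsn_loss(latent, keys, subject_idx) <= 1.0 - target:
            break
        grad = numeric_grad(latent, keys, subject_idx)
        latent = [[z - step_size * g for z, g in zip(zr, gr)]
                  for zr, gr in zip(latent, grad)]
    return latent

# Toy setup: three tokens, four patches, subjects are tokens 0 and 2.
keys = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
latent = [[2.0, 0.0, 0.0], [1.5, 0.2, 0.1], [1.0, 0.0, 0.5], [2.0, 0.5, 0.0]]
subjects = [0, 2]
loss_before = gsn_loss(latent, keys, subjects)
loss_after = gsn_loss(nurse_latent(latent, keys, subjects), keys, subjects)
print(loss_before, loss_after)
```

In this toy setup token 2 starts out weakly attended, so the first updates push one patch toward it. Taking the maximum over subject tokens means each step works on whichever subject is currently most neglected.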


  • Constructed a new benchmark to evaluate methods
  • Constructed prompts containing two subjects and a variety of attributes
  • Considered three types of text prompts
  • Considered 12 animals, 12 objects, and 11 colors
  • Generated 64 images per prompt using 64 random seeds
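
A benchmark of this shape could be assembled roughly as follows. This is a sketch under stated assumptions: the three prompt templates (animal-animal, animal-object with a colored object, and two colored objects) and the random choice of colors are my reading of the three prompt types, and the short word lists are hypothetical placeholders rather than the actual 12 animals, 12 objects, and 11 colors.

```python
import itertools
import random

# Hypothetical placeholder vocabularies; the benchmark itself uses
# 12 animals, 12 objects, and 11 colors.
ANIMALS = ["cat", "dog", "frog"]
OBJECTS = ["backpack", "bench", "bowl"]
COLORS = ["red", "blue", "green"]

def build_prompts(animals, objects, colors, seed=0):
    """Build the three prompt subsets (templates are assumptions)."""
    rng = random.Random(seed)
    prompts = {"animal-animal": [], "animal-object": [], "object-object": []}
    # Two subjects, no attributes.
    for a, b in itertools.combinations(animals, 2):
        prompts["animal-animal"].append(f"a {a} and a {b}")
    # One plain subject and one subject with a color attribute.
    for a, o in itertools.product(animals, objects):
        prompts["animal-object"].append(f"a {a} and a {rng.choice(colors)} {o}")
    # Two subjects, each with its own (distinct) color attribute.
    for o1, o2 in itertools.combinations(objects, 2):
        c1, c2 = rng.sample(colors, 2)
        prompts["object-object"].append(f"a {c1} {o1} and a {c2} {o2}")
    return prompts

prompts = build_prompts(ANIMALS, OBJECTS, COLORS)
seeds = list(range(64))  # one image per (prompt, seed) pair
```

Fixing the list of seeds up front makes it possible to compare methods on the same 64 starting noises for every prompt.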

Qualitative comparisons

  • Attend-and-Excite is able to generate images that contain all subjects with correctly bound colors.
  • Attend-and-Excite is able to improve attribute bindings between colors and subjects.
  • Attend-and-Excite is able to generate images for complex prompts with three or more subjects, complex attributes, and interactions between subjects.

Quantitative analysis

  • Quantify performance of each method using CLIP-space distances
  • Evaluate image-text similarities between generated images and text prompts
  • Analyze modality gap between CLIP’s image and text embeddings
  • Compute average CLIP cosine similarity between text prompt and set of 64 generated images
  • Consider only most neglected subject independently of full text
  • Maximize smaller of two scores to minimize neglect
  • Compute Upper Bound by collecting 50 images from classification and detection datasets
  • Attend-and-Excite outperforms all baselines across all subsets
  • Attend-and-Excite significantly improves Minimum Object Similarity
  • Compute average CLIP similarity between the prompt and captions generated for all output images
  • Attend-and-Excite outperforms all alternative methods
  • User study to analyze fidelity of generated images
  • Attend-and-Excite received highest percentage of votes across all subsets
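
The two image-text scores described above reduce to simple averages once embeddings are available. The sketch below assumes the CLIP image and text embeddings have already been computed and are passed in as plain lists of floats; the function names are hypothetical, and no CLIP model is called here.

```python
import math

def cosine(u, v):
    # Cosine similarity between two embedding vectors.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def full_prompt_similarity(image_embs, prompt_emb):
    # Average similarity between the full prompt and each generated image.
    return sum(cosine(img, prompt_emb) for img in image_embs) / len(image_embs)

def minimum_object_similarity(image_embs, subject_embs):
    # For each image, keep only the score of its most neglected subject
    # (the smallest per-subject similarity), then average over images.
    per_image = [min(cosine(img, s) for s in subject_embs) for img in image_embs]
    return sum(per_image) / len(per_image)

# Toy 2-D "embeddings": each image shows only one of the two subjects.
toy_images = [[1.0, 0.0], [0.0, 1.0]]
toy_subjects = [[1.0, 0.0], [0.0, 1.0]]
toy_prompt = [1.0, 1.0]
full_score = full_prompt_similarity(toy_images, toy_prompt)
min_score = minimum_object_similarity(toy_images, toy_subjects)
```

In the toy data each image matches one subject perfectly and ignores the other. The full-prompt score stays moderate while the minimum-object score drops to zero, which is the kind of neglect the second metric is meant to expose.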


  • Generative model has limited expressive power
  • Results may be out of distribution when prompt is outside of what the model learned
  • Combinations of subjects that don’t naturally appear together may lead to less realistic results


  • Diffusion process can be corrected once it takes a wrong turn
  • Introduce concept of Generative Semantic Nursing (GSN)
  • GSN encourages all subject tokens in text to be attended to by some image patch
  • Alleviate two core semantic issues on the fly
  • Strengthen text conditioning along image generation process
  • GSN can be applied to any image editing and generation task
  • Cross-attention maps act as a good medium for explainability
  • Iterative latent refinement ensures each subject token achieves a certain maximum activation value
  • Stop latent modification after 25 denoising steps
  • Attention re-weighting variant of Prompt-to-Prompt (ptp) explored
  • Attend-and-Excite outperforms ptp in mitigation of semantic issues
  • Figures 15 and 16 present uncurated results
  • Figure 17 provides additional results and comparisons
  • Figure 18 contains additional results for prompts from StructureDiffusion paper
  • Figure 22 presents additional results obtained using Composable Diffusion
  • Figure 23 contains additional comparisons of cross-attention maps for subject tokens