Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

Aim to provide user personalization to diffusion models
Focus on learning custom objects from few images
Alternative approach for personalization of text-to-image diffusion models
Goal is to guide generative process towards custom aesthetics defined by user
User chooses textual prompt to guide generation
Represent the user's aesthetic preferences with the average of the visual embeddings of a set of images
Measure agreement between the CLIP representation of the prompt and the user's preferences
Perform gradient steps that increase this agreement, taken with respect to the CLIP text encoder weights
Only modify weights of CLIP text encoder
Benefits: agnostic to the diffusion model, computationally cheap, and the user only needs to store one aesthetic embedding per set of images
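
The steps above can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not the paper's implementation. A small linear map stands in for the CLIP text encoder, the image embeddings are made-up vectors rather than CLIP outputs, and agreement is assumed to be the dot product of unit-length embeddings. The names aesthetic_embedding, encode_text, agreement and personalize are hypothetical.

```python
import math

def normalize(v):
    """Rescale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def aesthetic_embedding(image_embeddings):
    """Average the visual embeddings of the user's images, then normalize."""
    n = len(image_embeddings)
    dim = len(image_embeddings[0])
    mean = [sum(emb[i] for emb in image_embeddings) / n for i in range(dim)]
    return normalize(mean)

def encode_text(weights, prompt_features):
    """Toy stand-in for the text encoder: an unnormalized linear map."""
    return [dot(row, prompt_features) for row in weights]

def agreement(weights, prompt_features, e):
    """Assumed agreement measure: dot product of unit-length embeddings."""
    return dot(normalize(encode_text(weights, prompt_features)), e)

def personalize(weights, prompt_features, e, steps=20, lr=0.1):
    """Gradient descent on loss = -agreement, updating only the encoder weights."""
    weights = [row[:] for row in weights]  # leave the caller's weights untouched
    for _ in range(steps):
        u = encode_text(weights, prompt_features)
        norm_u = math.sqrt(dot(u, u))
        c = [x / norm_u for x in u]
        sim = dot(c, e)
        # Analytic gradient of -dot(u / |u|, e) with respect to u.
        grad_u = [-(e_i - sim * c_i) / norm_u for e_i, c_i in zip(e, c)]
        # Chain rule through u = W x gives dL/dW[i][j] = grad_u[i] * x[j].
        for i, row in enumerate(weights):
            for j, x_j in enumerate(prompt_features):
                row[j] -= lr * grad_u[i] * x_j
    return weights

# Made-up data: two "image embeddings" and one "prompt feature" vector.
images = [[0.0, 1.0, 0.0], [0.0, 0.8, 0.6]]
e = aesthetic_embedding(images)
W0 = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
x = [1.0, 0.0, 0.0]
W1 = personalize(W0, x, e)
print("agreement before:", agreement(W0, x, e))
print("agreement after: ", agreement(W1, x, e))
```

Only e needs to be stored per set of images, and the diffusion model never appears in the loop, which is where the cheapness and model-agnostic properties in the notes come from.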