Abstract
- Deep generative models have been used for text-to-image synthesis.
- Current models often generate images that are not well-aligned with text prompts.
- A fine-tuning method is proposed to improve alignment using human feedback.
Paper Content
Introduction
- Deep generative models have been successful in generating high-quality images from text prompts
- Scaling of deep generative models to large-scale datasets has been a factor in this success
- Challenges remain in domains where large-scale text-to-image models fail to generate images that are well-aligned with text prompts
- Learning from human feedback has emerged as a powerful solution for aligning model behavior with human intent
- Proposed fine-tuning method for aligning text-to-image models using human feedback
- Fine-tuning with human feedback significantly improves the image-text alignment of a text-to-image model
- Learned reward function predicts human assessments of image quality more accurately than the CLIP score
- Careful investigations on several design choices are important in balancing alignment-fidelity tradeoffs
Related work
- Variational auto-encoders, generative adversarial networks, auto-regressive models, and diffusion models have been proposed for modeling image distributions
- Combined with language encoders, these models have shown impressive results in text-to-image generation
- Text-to-image models struggle to generate images that are well-aligned with text prompts
- Techniques such as character-aware text encoders and structured representations of language inputs have been investigated to address these issues
- Human feedback has been used to improve various AI systems
- We propose a fine-tuning method with human feedback for improving text-to-image models
- Various evaluation protocols have been proposed to measure image-text alignment
- We train a reward function that is better aligned with human evaluations by exploiting pre-trained representations and human feedback data
Main method
- Generate a set of diverse images from text prompts
- Human raters provide binary feedback on images
- Train a reward model to predict human feedback
- Fine-tune text-to-image model using reward-weighted log likelihood
Human data collection
- Generated image-text dataset with prompts combining words or phrases from three categories (count, color, background)
- Collected binary feedback from human labelers on the image-text dataset
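As an illustration of the data collection described above, here is a minimal sketch of combining the three categories into prompts and storing binary labels. The word lists, the object noun, and the record layout are hypothetical placeholders and are not taken from the paper; the sketch also uses all three categories in every prompt, which may not match the actual dataset.

```python
import itertools

# Hypothetical vocabularies for the three categories (count, color, background).
COUNTS = ["one", "two", "three"]
COLORS = ["green", "red"]
BACKGROUNDS = ["in a city", "in the sea"]
OBJECT = "dog"  # hypothetical object noun


def pluralize(noun, count):
    """Naive pluralization, sufficient for this toy vocabulary."""
    return noun if count == "one" else noun + "s"


def build_prompts(obj=OBJECT):
    """Combine one entry from each category into a text prompt."""
    prompts = []
    for count, color, background in itertools.product(COUNTS, COLORS, BACKGROUNDS):
        prompts.append(f"{count} {color} {pluralize(obj, count)} {background}")
    return prompts


def label_record(prompt, image_id, is_good):
    """Store one piece of binary human feedback (1 = good, 0 = bad)."""
    return {"prompt": prompt, "image_id": image_id, "label": 1 if is_good else 0}
```

Calling `build_prompts()` with these toy lists yields 3 x 2 x 2 = 12 prompts, each of which would be paired with generated images and a binary label.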
Reward learning
- Measure image-text alignment by learning a reward function
- Data augmentation to improve data-efficiency and performance
- Generate N-1 text prompts with different semantics
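- Use the reward function as a classifier to identify the original prompt among the N candidate prompts
- Auxiliary loss to encourage low values for prompts with different semantics
- Combined loss adds the auxiliary loss, weighted by a penalty parameter, to the reward-prediction loss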
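The reward-learning objective above can be sketched for a single sample as follows. This assumes a squared-error term between the binary label and the predicted reward, plus a softmax cross-entropy term that asks the reward function to pick the original prompt out of N candidates. The function names, the temperature, and the default penalty value are assumptions made for illustration. Whether the auxiliary term applies to every sample or only to positively labeled ones is not stated here; the sketch applies it to every sample.

```python
import math


def prompt_classification_loss(rewards, temperature=1.0):
    """Cross-entropy of selecting the original prompt among N prompts.

    rewards[0] is the reward for the original prompt; rewards[1:] are the
    rewards for the N-1 prompts with different semantics.
    """
    scaled = [r / temperature for r in rewards]
    top = max(scaled)  # subtract the max for numerical stability
    log_norm = top + math.log(sum(math.exp(s - top) for s in scaled))
    return log_norm - scaled[0]


def reward_loss(label, rewards, penalty=0.5, temperature=1.0):
    """Squared error on the human label plus the weighted auxiliary loss."""
    mse = (label - rewards[0]) ** 2
    return mse + penalty * prompt_classification_loss(rewards, temperature)
```

The auxiliary term shrinks as the rewards for the perturbed prompts fall below the reward for the original prompt, which is the behavior the bullet on low values describes.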
Updating the text-to-image model
- Update text-to-image model with parameters θ by minimizing loss
- Minimize reward-weighted negative log-likelihood on model-generated dataset
- Minimize pre-training loss to reduce NLL on pre-training dataset
- Regularization in loss function enables model to generate more natural images
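A minimal sketch of how the two terms above might be combined, assuming per-sample negative log-likelihood (NLL) values are already available. For a diffusion model these would in practice be a surrogate such as the denoising loss, which is an assumption of this sketch. The weighting coefficient `beta` and the function name are also hypothetical.

```python
def finetune_loss(model_samples, pretrain_nlls, beta=1.0):
    """Reward-weighted NLL on model-generated data plus a pre-training NLL term.

    model_samples: list of (reward, nll) pairs for model-generated images.
    pretrain_nlls: list of nll values for images from the pre-training dataset.
    beta: hypothetical weight on the pre-training (regularization) term.
    """
    reward_weighted = sum(r * nll for r, nll in model_samples) / len(model_samples)
    pretrain = sum(pretrain_nlls) / len(pretrain_nlls)
    return reward_weighted + beta * pretrain
```

Samples with a reward near zero contribute almost nothing to the first term, so the update concentrates on images the reward function rates as well aligned, while the second term keeps the model close to its pre-training data.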
Experiments
- Conducted experiments to test efficacy of fine-tuning approach
- Used human feedback in experiments
Experimental setup
- Stable Diffusion v1.5 model used as baseline generative model
- CLIP language encoder frozen for fine-tuning
- ViT-L/14 CLIP model used for reward model
- 2700 English prompts used to generate 27K images
- 23K samples used for training; the remaining samples (about 4K) used for validation
- 16K unlabeled samples used for reward-weighted loss
- 625K subset of LAION-5B used for pre-training loss
Text-image alignment results
- 120 text prompts used to measure human ratings of image alignment
- 9 independent human raters evaluate each query
- 50% of samples from the fine-tuned model receive at least a two-thirds vote for image-text alignment
- Fine-tuning degrades image fidelity
- Issues with oversaturated and non-photorealistic images, duplication of entities, and lower-diversity images
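To make the two-thirds vote criterion above concrete: with 9 raters, a sample needs at least 6 favorable votes. The sketch below assumes each rater casts a single vote for or against a sample; the function names are hypothetical and the paper's exact rating protocol may differ.

```python
def passes_two_thirds_vote(votes_for, num_raters=9):
    """Return True if at least two-thirds of the raters voted in favor."""
    if not 0 <= votes_for <= num_raters:
        raise ValueError("votes_for must be between 0 and num_raters")
    # Integer comparison avoids floating-point rounding of 2/3.
    return 3 * votes_for >= 2 * num_raters


def fraction_passing(vote_counts, num_raters=9):
    """Fraction of samples whose favorable votes reach the two-thirds bar."""
    passed = sum(passes_two_thirds_vote(v, num_raters) for v in vote_counts)
    return passed / len(vote_counts)
```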
Results on reward learning
- Investigating the quality of a learned reward function by evaluating its prediction of human ratings
- Comparing reward function with CLIP score
- Auxiliary loss (prompt classification) improves reward performance
- Rejection sampling technique selects best output w.r.t. the learned reward function
- Effects of human dataset size and data diversity on reward learning
- Fine-tuning with human feedback improves image-text alignment
- Balancing alignment-fidelity tradeoffs with careful investigations on design choices
- Limitations and future directions: more nuanced human feedback, diverse and large human dataset, different objectives and algorithms
- Evaluating image fidelity with FID scores and image-text alignment with reward scores and CLIP scores
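The rejection sampling technique listed above, which selects the best output with respect to the learned reward function, can be sketched as a best-of-n selection. Here `reward_fn` stands in for the learned reward model and `candidates` for images generated from the same prompt; both are hypothetical placeholders.

```python
def best_of_n(prompt, candidates, reward_fn):
    """Return the candidate image with the highest reward for the prompt."""
    if not candidates:
        raise ValueError("candidates must not be empty")
    return max(candidates, key=lambda image: reward_fn(image, prompt))


def toy_reward(image, prompt):
    """Toy stand-in for a learned reward model: reads a precomputed score."""
    return image["score"]
```

For example, `best_of_n("two green dogs", images, toy_reward)` returns the entry of `images` with the largest `score` field.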