Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

Deep generative models have been used for text-to-image synthesis.
Current models often generate images that are not well-aligned with text prompts.
A fine-tuning method is proposed to improve alignment using human feedback.

Paper Content

Introduction

Deep generative models have been successful in generating high-quality images from text prompts
Scaling of deep generative models to large-scale datasets has been a factor in this success
Challenges remain in domains where large-scale text-to-image models fail to generate images that are well-aligned with text prompts
Learning from human feedback has emerged as a powerful solution for aligning model behavior with human intent
Proposed fine-tuning method for aligning text-to-image models using human feedback
Fine-tuning with human feedback significantly improves the image-text alignment of a text-to-image model
Learned reward function predicts human assessments of quality more accurately than the CLIP score
Careful investigations on several design choices are important in balancing alignment-fidelity tradeoffs

Related work

Variational auto-encoders, generative adversarial networks, auto-regressive models, and diffusion models have been proposed for modeling image distributions
Combined with language encoders, these models have shown impressive results in text-to-image generation
Text-to-image models struggle to generate images that are well-aligned with text prompts
Techniques such as character-aware text encoders and structured representations of language inputs have been investigated to address these issues
Human feedback has been used to improve various AI systems
We propose a fine-tuning method with human feedback for improving text-to-image models
Various evaluation protocols have been proposed to measure image-text alignment
We train a reward function that is better aligned with human evaluations by exploiting pre-trained representations and human feedback data

Main method

Generate a set of diverse images from text prompts
Human raters provide binary feedback on images
Train a reward model to predict human feedback
Fine-tune text-to-image model using reward-weighted log likelihood

Human data collection

Generated image-text dataset with prompts combining words or phrases from three categories (count, color, background)
Collected binary feedback from human labelers on the image-text dataset
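
As an illustration of how prompts might be assembled from the three categories, here is a minimal sketch. The word lists, the subject noun, and the helper names are hypothetical placeholders chosen for this example; the actual vocabulary used in the paper is not given in this summary.

```python
from itertools import product

# Hypothetical vocabularies (assumptions, not the paper's actual word lists).
COUNTS = ["one", "two", "three"]
COLORS = ["green", "red"]
BACKGROUNDS = ["in a city", "in a forest"]


def pluralize(noun: str, count_word: str) -> str:
    """Naive pluralization, sufficient for this illustration only."""
    return noun if count_word == "one" else noun + "s"


def build_prompts(counts, colors, backgrounds, noun):
    """Combine one entry from each category into a text prompt."""
    return [
        f"{count} {color} {pluralize(noun, count)} {background}"
        for count, color, background in product(counts, colors, backgrounds)
    ]


prompts = build_prompts(COUNTS, COLORS, BACKGROUNDS, "dog")
print(len(prompts), prompts[0])
```

Each generated image would then be paired with its prompt and shown to a labeler, who marks it as aligned or not aligned.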

Reward learning

Measure image-text alignment by learning a reward function
Data augmentation to improve data-efficiency and performance
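Generate N-1 text prompts with semantics different from the original prompt
Use the reward function to classify which of the N prompts is the original
Auxiliary loss encourages low reward values for prompts with different semantics
Combined loss adds the auxiliary loss, weighted by a penalty parameter, to the reward-prediction loss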
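
The reward-learning steps above can be sketched for a single image as follows. This is a simplified illustration, not the authors' implementation: it assumes a squared-error term for predicting the binary feedback and a softmax cross-entropy for the prompt-classification term, and the function names, the `penalty` default, and the `temperature` parameter are all hypothetical.

```python
import math


def softmax(scores, temperature=1.0):
    """Numerically stable softmax over a list of floats."""
    scaled = [s / temperature for s in scores]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def prompt_classification_loss(rewards, original_index=0, temperature=1.0):
    """Cross-entropy for identifying the original prompt among N candidates.

    rewards[i] is the reward the model assigns to (image, prompt_i).
    """
    probs = softmax(rewards, temperature)
    return -math.log(probs[original_index])


def feedback_loss(predicted_reward, label):
    """Squared error against the binary human label (an assumed choice)."""
    return (label - predicted_reward) ** 2


def combined_reward_loss(rewards, label, penalty=0.5, original_index=0):
    """Feedback-prediction loss plus the penalty-weighted auxiliary loss."""
    main = feedback_loss(rewards[original_index], label)
    aux = prompt_classification_loss(rewards, original_index)
    return main + penalty * aux


# Original prompt scores high, the three perturbed prompts score low.
print(combined_reward_loss([0.9, 0.1, 0.2, 0.0], label=1.0))
```

The auxiliary term shrinks as the reward for the original prompt rises above the rewards for the perturbed prompts, which is the behavior the bullets describe.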

Updating the text-to-image model

Update text-to-image model with parameters θ by minimizing loss
Minimize reward-weighted negative log-likelihood on model-generated dataset
Minimize pre-training loss to reduce NLL on pre-training dataset
Regularization in loss function enables model to generate more natural images
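
Written out, the objective described above might take the following form. The notation is assumed for this sketch rather than taken from this summary: r_φ is the learned reward, p_θ is the text-to-image model, x is an image, z is a text prompt, D^model and D^pre are the model-generated and pre-training datasets, and β is a weight on the pre-training term.

```latex
\mathcal{L}(\theta)
  = \mathbb{E}_{(x, z) \sim \mathcal{D}^{\text{model}}}
      \left[ -\, r_\phi(x, z) \, \log p_\theta(x \mid z) \right]
  + \beta \, \mathbb{E}_{(x, z) \sim \mathcal{D}^{\text{pre}}}
      \left[ -\log p_\theta(x \mid z) \right]
```

The first term is the reward-weighted negative log-likelihood on generated data; the second is the pre-training loss that acts as the regularizer.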

Experiments

Conducted experiments to test efficacy of fine-tuning approach
Used human feedback in experiments

Experimental setup

Stable Diffusion v1.5 model used as baseline generative model
CLIP language encoder frozen for fine-tuning
ViT-L/14 CLIP model used for reward model
2700 English prompts used to generate 27K images
23K samples used for training, remaining for validation
16K unlabeled samples used for reward-weighted loss
625K subset of LAION-5B used for pre-training loss

Text-image alignment results

120 text prompts used to measure human ratings of image alignment
9 independent human raters evaluate each query
50% of samples from our model receive at least two-thirds vote for image-text alignment
Fine-tuning degrades image fidelity
Issues with oversaturated and non-photorealistic images, duplication of entities, and lower-diversity images
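
The two-thirds vote criterion mentioned above can be sketched as below. The sketch assumes each rater's judgment is reduced to a boolean (True when the rater favors the sample on image-text alignment); the exact rating interface is not described in this summary, and the function names are hypothetical.

```python
def passes_vote(votes):
    """True if at least two-thirds of the raters voted in favor.

    Integer arithmetic avoids floating-point comparison at the boundary.
    """
    return 3 * sum(votes) >= 2 * len(votes)


def fraction_passing(all_votes):
    """Share of samples whose votes meet the two-thirds criterion."""
    passed = sum(1 for votes in all_votes if passes_vote(votes))
    return passed / len(all_votes)


# Two samples, nine raters each: 6 of 9 passes, 5 of 9 does not.
sample_a = [True] * 6 + [False] * 3
sample_b = [True] * 5 + [False] * 4
print(fraction_passing([sample_a, sample_b]))
```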

Qualitative comparison.

Results on reward learning

Investigating the quality of a learned reward function by evaluating its prediction of human ratings
Comparing reward function with CLIP score
Auxiliary loss (prompt classification) improves reward performance
Rejection sampling technique selects best output w.r.t. the learned reward function
Effects of human dataset size and data diversity on reward learning
Fine-tuning with human feedback improves image-text alignment
Balancing alignment-fidelity tradeoffs with careful investigations on design choices
Limitations and future directions: more nuanced human feedback, diverse and large human dataset, different objectives and algorithms
Evaluating image fidelity with FID scores and image-text alignment with reward scores and CLIP scores
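
The rejection sampling technique listed above can be read as best-of-n selection: draw several candidates for a prompt and keep the one the learned reward function scores highest. The sketch below uses toy stand-ins for the generator and the reward function; both are hypothetical and exist only to make the example runnable.

```python
from itertools import count


def best_of_n(prompt, generate, reward, n=4):
    """Generate n candidates and return the one with the highest reward."""
    candidates = [generate(prompt) for _ in range(n)]
    return max(candidates, key=lambda image: reward(image, prompt))


def make_toy_generator():
    """Toy generator: returns labeled strings instead of real images."""
    counter = count()

    def generate(prompt):
        return f"{prompt}#{next(counter)}"

    return generate


def toy_reward(image, prompt):
    """Toy scorer that arbitrarily prefers candidate number 2."""
    index = int(image.rsplit("#", 1)[1])
    return -abs(index - 2)


print(best_of_n("two green dogs", make_toy_generator(), toy_reward, n=4))
```

In practice `generate` would sample from the text-to-image model and `reward` would be the learned reward function; a larger n trades extra sampling cost for a better chance of a well-aligned output.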
