Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

  • Deep generative models have been used for text-to-image synthesis.
  • Current models often generate images that are not well-aligned with text prompts.
  • A fine-tuning method is proposed to improve alignment using human feedback.

Paper Content

Introduction

  • Deep generative models have been successful in generating high-quality images from text prompts
  • Scaling of deep generative models to large-scale datasets has been a factor in this success
  • Challenges remain in domains where large-scale text-to-image models fail to generate images that are well-aligned with text prompts
  • Learning from human feedback has emerged as a powerful solution for aligning model behavior with human intent
  • Proposed fine-tuning method for aligning text-to-image models using human feedback
  • Fine-tuning with human feedback significantly improves the image-text alignment of a text-to-image model
  • Learned reward function predicts human assessments of the quality more accurately than the CLIP score
  • Careful investigations on several design choices are important in balancing alignment-fidelity tradeoffs
  • Variational auto-encoders, generative adversarial networks, auto-regressive models, and diffusion models have been proposed for image distributions
  • Combined with language encoders, these models have shown impressive results in text-to-image generation
  • Text-to-image models struggle to generate images that are well-aligned with text prompts
  • Techniques such as character-aware text encoders and structured representations of language inputs have been investigated to address these issues
  • Human feedback has been used to improve various AI systems
  • We propose a fine-tuning method with human feedback for improving text-to-image models
  • Various evaluation protocols have been proposed to measure image-text alignment
  • We train a reward function that is better aligned with human evaluations by exploiting pre-trained representations and human feedback data

Main method

  • Generate a set of diverse images from text prompts
  • Human raters provide binary feedback on images
  • Train a reward model to predict human feedback
  • Fine-tune text-to-image model using reward-weighted log likelihood

Human data collection

  • Generated image-text dataset with prompts combining words or phrases from three categories (count, color, background)
  • Collected binary feedback from human labelers on the image-text dataset

Reward learning

  • Measure image-text alignment by learning a reward function
  • Data augmentation to improve data-efficiency and performance
  • Generate N-1 text prompts with different semantics
  • Use reward function to classify original prompt
  • Auxiliary loss to encourage low values for prompts with different semantics
  • Combined loss to combine penalty parameter

Updating the text-to-image model

  • Update text-to-image model with parameters ฮธ by minimizing loss
  • Minimize reward-weighted negative log-likelihood on model-generated dataset
  • Minimize pre-training loss to reduce NLL on pre-training dataset
  • Regularization in loss function enables model to generate more natural images

Experiments

  • Conducted experiments to test efficacy of fine-tuning approach
  • Used human feedback in experiments

Experimental setup

  • Stable diffusion v1.5 model used as baseline generative model
  • CLIP language encoder frozen for fine-tuning
  • ViT-L/14 CLIP model used for reward model
  • 2700 English prompts used to generate 27K images
  • 23K samples used for training, remaining for validation
  • 16K unlabeled samples used for reward-weighted loss
  • 625K subset of LAION-5B used for pre-training loss

Text-image alignment results

  • 120 text prompts used to measure human ratings of image alignment
  • 9 independent human raters evaluate each query
  • 50% of samples from our model receive at least two-thirds vote for image-text alignment
  • Fine-tuning degrades image fidelity
  • Issues with oversaturated and non-photorealistic images, duplication of entities, and lower-diversity images

Qualitative comparison.

Results on reward learning

  • Investigating the quality of a learned reward function by evaluating its prediction of human ratings
  • Comparing reward function with CLIP score
  • Auxiliary loss (prompt classification) improves reward performance
  • Rejection sampling technique selects best output w.r.t. the learned reward function
  • Effects of human dataset size and data diversity on reward learning
  • Fine-tuning with human feedback improves image-text alignment
  • Balancing alignment-fidelity tradeoffs with careful investigations on design choices
  • Limitations and future directions: more nuanced human feedback, diverse and large human dataset, different objectives and algorithms
  • Evaluating image fidelity with FID scores and image-text alignment with reward scores and CLIP scores