Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

  • Generative AI can generate text and images that look like they were made by humans.
  • MAUVE is a family of comparison measures that quantify how close generated data is to real data.
  • There are four approaches to estimate these scores.
  • MAUVE can be used to measure the gap between human-written text and modern neural language models.

Paper Content

Introduction

  • Large-scale generative AI models can produce human-like text and realistic images
  • Models such as ChatGPT, Stable Diffusion, DALL-E, and GPT-3 can produce original content
  • Evaluating these models requires substantial effort
  • Automatic measures can reduce the cost of evaluation
  • Comparing a model’s distribution with the target distribution requires considering two types of errors
  • MAUVE scores measure the gap between human-written text and modern neural language models
  • MAUVE scores can also be used to compare image distributions

Contributions

  • Provide a scalar summary of the discrepancy between a generative model Q and the target distribution P
  • Propose three scalar statistical summaries of divergence frontiers
  • Propose four methods for estimating divergence frontiers from i.i.d. samples
  • Develop new error bounds for the first quantization approach
  • Give an error bound that allows for long tails and countable support of the distribution P
  • Give a statistical error bound on the integral summary
  • Give a similar bound for general f-divergences
  • Apply add-constant smoothing to estimate the two distributions
  • Give a statistical error bound for the add-constant estimators
  • Show that there exists a quantization scheme with error O(1/k)
  • Combine the statistical and quantization error bounds
  • Study the effectiveness of the proposed measure for comparing text distributions
  • Compare the four estimation methods
  • Compare different variants of the proposed measure
  • Demonstrate that the proposed measures correlate with human quality judgments
  • Demonstrate that the proposed measures quantify known properties of generated text
  • Show that the measure correlates perfectly with the widely used Fréchet distance in the image domain

Background and setup

  • Reviews the basics of open-ended text generation
  • Sets up the problem of comparing multiple generative models

Language modeling and open-ended text generation

  • Neural autoregressive language models form the backbone of prevailing approaches to text generation
  • Language models are used to model the conditional distribution over the next token in a sequence
  • Neural language models are based on the transformer architecture
  • Training objective is to minimize the KL divergence between the distributions assigned by humans and the language model
  • Open-ended text generation task is to output text in continuation of a given context
  • Goal is to generate text that is coherent, fluent, creative, and engaging
  • Text generation system is modeled as a probability distribution
  • Text is generated in a serial, left-to-right fashion
  • Decoding algorithms are used to reshape the language model to promote more conservative outputs
  • Temperature rescaling, top-K sampling, and nucleus sampling are popular decoding algorithms
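
The three decoding algorithms named above all reshape the next-token distribution before sampling. The following is a minimal sketch on a toy distribution, not the implementation used in the paper; the function names `softmax`, `top_k`, and `nucleus` and the example probabilities are illustrative assumptions.

```python
import math

def softmax(logits, temperature=1.0):
    """Turn logits into probabilities after dividing them by the temperature."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def top_k(probs, k):
    """Keep the k most likely tokens and renormalize their probabilities."""
    keep = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    mass = sum(probs[i] for i in keep)
    return [probs[i] / mass if i in keep else 0.0 for i in range(len(probs))]

def nucleus(probs, p):
    """Keep the smallest set of most likely tokens whose total mass reaches p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep, mass = [], 0.0
    for i in order:
        keep.append(i)
        mass += probs[i]
        if mass >= p:
            break
    return [probs[i] / mass if i in keep else 0.0 for i in range(len(probs))]

# Toy next-token distribution over a hypothetical 4-token vocabulary.
probs = [0.5, 0.3, 0.15, 0.05]
top2 = top_k(probs, 2)      # only the two most likely tokens survive
nuc = nucleus(probs, 0.9)   # the three most likely tokens cover mass 0.9
```

A temperature below 1 sharpens the distribution, while top-K and nucleus sampling zero out the tail. Each of these makes the sampled text more conservative than ancestral sampling from the unmodified model.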

Comparing generative models

  • Evaluating a text generation model usually involves comparing the output of the model to human-written text.
  • Open-ended generation is difficult to evaluate because there can be multiple correct outputs.
  • The goal of open-ended text generation is to generate text that is human-like.
  • The goal of image generation is to generate photorealistic images.
  • Evaluating a generative model amounts to measuring the gap between the model distribution and the target distribution.

Information divergences

  • Definition of f-divergences
  • f is convex and nonnegative with f(1) = 0
  • Conjugate generator to f is also convex
  • Examples of f-divergences (KL, Interpolated KL, Jensen-Shannon, Interpolated χ²)
  • Definition of f-divergence frontiers

Tradeoff curves to evaluate generative models

  • Generative model Q attempts to model target distribution P
  • Two types of costs to evaluate Q with respect to P: type I cost (mass of Q that has low/zero probability mass under P) and type II cost (mass of P that Q does not adequately capture)
  • Type I cost measured by the surrogate KL(Q ‖ R)
  • Type II cost measured by KL(P ‖ R)
  • Pareto frontier of the multi-objective optimization min_R (KL(P ‖ R), KL(Q ‖ R))
  • Divergence frontier F(P, Q) carved out by the mixtures R_λ = λP + (1 − λ)Q
  • f-divergence frontier F_f(P, Q) for two distributions P, Q and divergence generator function f
  • Each coordinate of the f-divergence frontier is itself an f-divergence
  • D_{f_λ} is a valid f-divergence
  • If f is twice differentiable with f''(t) > 0 for all t > 0, then f_λ is strictly convex with f_λ''(t) > 0 for all t > 0
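
The KL divergence frontier can be traced numerically for discrete distributions by sweeping the mixture weight λ. This is a minimal sketch under the assumption that each frontier point is the pair (KL(Q ‖ R_λ), KL(P ‖ R_λ)) with R_λ = λP + (1 − λ)Q; the helper names and the toy distributions are illustrative.

```python
import math

def kl(p, q):
    """KL(P || Q) for discrete distributions; terms with p(x) = 0 contribute 0."""
    return sum(px * math.log(px / qx) for px, qx in zip(p, q) if px > 0)

def divergence_frontier(p, q, num_points=25):
    """Return the points (KL(Q || R), KL(P || R)) for R = lam*P + (1 - lam)*Q."""
    points = []
    for i in range(1, num_points + 1):
        lam = i / (num_points + 1)  # lam stays strictly inside (0, 1)
        r = [lam * px + (1 - lam) * qx for px, qx in zip(p, q)]
        points.append((kl(q, r), kl(p, r)))
    return points

# P puts no mass on the last outcome, yet every point stays finite because
# the mixture R has mass wherever P or Q does.
p = [0.7, 0.2, 0.1, 0.0]
q = [0.1, 0.2, 0.3, 0.4]
frontier = divergence_frontier(p, q)
```

As λ grows, R_λ moves toward P, so the type II coordinate KL(P ‖ R_λ) shrinks while the type I coordinate KL(Q ‖ R_λ) grows. That is the tradeoff the curve captures.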

Scalar summaries of divergence frontiers

  • Area Summary: A measure of similarity between two models, bounded between 0 and 1, with larger values denoting greater similarity.
  • Integral Summary: Average linearized cost over a range of values, bounded above by 1.
  • Mid-point Summary: Linearized cost with weight of 1/2, recovers Jensen-Shannon and Le Cam divergences.
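
The three summaries can be approximated numerically for discrete distributions. The sketch below makes several assumptions:

  • The area summary is the area under the curve (exp(−c·KL(Q ‖ R_λ)), exp(−c·KL(P ‖ R_λ))), closed with the points (0, 1) and (1, 0).
  • The linearized cost is λ·KL(P ‖ R_λ) + (1 − λ)·KL(Q ‖ R_λ).
  • The integral summary is twice the integral of that cost over λ in (0, 1).

The scaling constant c = 5 and the grid sizes are illustrative choices.

```python
import math

def kl(p, q):
    """KL(P || Q) for discrete distributions; terms with p(x) = 0 contribute 0."""
    return sum(px * math.log(px / qx) for px, qx in zip(p, q) if px > 0)

def mixture(p, q, lam):
    return [lam * px + (1 - lam) * qx for px, qx in zip(p, q)]

def linearized_cost(p, q, lam):
    r = mixture(p, q, lam)
    return lam * kl(p, r) + (1 - lam) * kl(q, r)

def area_summary(p, q, c=5.0, n=200):
    """Trapezoid-rule area under the exponentiated frontier, in [0, 1]."""
    # Walk lam from near 1 down to near 0 so the x-coordinate increases.
    pts = [(0.0, 1.0)]
    for i in range(n, 0, -1):
        r = mixture(p, q, i / (n + 1))
        pts.append((math.exp(-c * kl(q, r)), math.exp(-c * kl(p, r))))
    pts.append((1.0, 0.0))
    return sum((x1 - x0) * (y0 + y1) / 2 for (x0, y0), (x1, y1) in zip(pts, pts[1:]))

def integral_summary(p, q, n=1000):
    """Midpoint-rule approximation of 2 * integral of the linearized cost."""
    h = 1.0 / n
    return 2 * h * sum(linearized_cost(p, q, (i + 0.5) * h) for i in range(n))

def midpoint_summary(p, q):
    """Linearized cost at lam = 1/2, i.e. the Jensen-Shannon divergence for KL."""
    return linearized_cost(p, q, 0.5)
```

With identical distributions the area summary is 1 and the other two summaries are 0. With disjoint supports the mid-point summary equals log 2 and the integral summary approaches its upper bound of 1.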

Properties of divergence frontier summaries

  • Study properties of area summary MAUVE
  • Fix an f-divergence D_f and a scaling constant c > 0
  • MAUVE satisfies 0 ≤ MAUVE_f(P, Q) ≤ 1
  • Integral summary FI of f-divergence frontier generated by convex function
  • KL divergence frontier generated by f_KL
  • χ² divergence frontier generated by FI_χ²
  • Mid-point summary Mid_f of f-divergence frontier generated by convex function f_{1/2}
  • Estimate summaries MAUVE, FI, and Mid using i.i.d. samples
  • Vector quantization, nearest-neighbor estimation, classifier-based estimation, and parametric approximation methods used to estimate f-divergences

Estimation via vector quantization

  • Define quantization of P over S
  • Approximate intractable divergence frontier with lower-dimensional counterpart
  • Estimate frontier with plug-in estimator
  • Best quantization schemes are data-dependent
  • Missing mass phenomenon can lead to poor quality estimators
  • Add-constant smoothing technique used to address challenge
  • Nearest neighbor estimation requires continuous distributions
  • Smooth distributions by convolving with Gaussian
  • Estimator always underestimates f-divergence
  • Assumptions on f and f* for statistical error bound
  • Naïve upper bound on absolute error
  • Bound independent of p*
  • Parametric rate of convergence
  • Distribution-free bound
  • Estimate entire f-divergence frontier
  • Bounds hold for KL divergence frontier
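
The quantize-then-smooth pipeline can be illustrated on toy data. This sketch assumes that the quantizer assigns each point to its nearest center and that the add-constant estimator is (count + b) / (n + k·b) over k bins. The centers and points are hypothetical; in practice the centers would come from a data-dependent scheme such as k-means on embeddings.

```python
import math
from collections import Counter

def quantize(points, centers):
    """Map each point to the index of its nearest center (squared Euclidean)."""
    def dist2(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return [min(range(len(centers)), key=lambda j: dist2(x, centers[j])) for x in points]

def smoothed_histogram(labels, k, b=0.5):
    """Add-b estimate over k bins: (count + b) / (n + k * b)."""
    counts = Counter(labels)
    n = len(labels)
    return [(counts.get(j, 0) + b) / (n + k * b) for j in range(k)]

def kl(p, q):
    return sum(px * math.log(px / qx) for px, qx in zip(p, q) if px > 0)

# Hypothetical 2-D "embeddings" and quantization centers.
centers = [(0, 0), (1, 0), (0, 1), (1, 1)]
p_points = [(0.1, 0.0), (0.0, 0.2), (0.2, 0.1), (0.9, 0.1), (1.1, -0.1), (0.1, 0.9)]
q_points = [(0.0, 0.1), (0.9, 0.0), (1.0, 0.2), (0.8, 0.1), (0.9, 1.0), (1.1, 0.9)]

p_labels = quantize(p_points, centers)   # [0, 0, 0, 1, 1, 2]
q_labels = quantize(q_points, centers)   # [0, 1, 1, 1, 3, 3]
p_hat = smoothed_histogram(p_labels, len(centers))
q_hat = smoothed_histogram(q_labels, len(centers))
kl_smoothed = kl(p_hat, q_hat)
```

Here the Q sample never lands in bin 2, so an unsmoothed plug-in estimate of KL(P ‖ Q) would be infinite. Adding b = 1/2 to every count gives the unseen bin a small positive mass and keeps the estimate finite.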

Estimation via nearest neighbors

  • Estimate divergence frontier and its summaries by counting nearest neighbors in an embedding space
  • Define metric ρ on data space X
  • Estimate f-divergence with estimator
  • Estimate divergence frontier with kernel density estimator
  • Estimate divergence with plug-in estimate
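
One simple way to turn nearest-neighbor counts into a divergence estimate is to count how many of the k nearest pooled neighbors of a point come from each sample, then use the normalized count ratio as a density-ratio estimate. The sketch below is a simplified illustration of that idea on 1-D data and is not the paper's estimator. The smoothing constant `b`, the leave-one-out step, and the absolute-difference metric are assumptions; with embeddings, the metric ρ would replace `abs`.

```python
import math
import random

def knn_log_ratio(x, p_samples, q_samples, k=10, b=0.5):
    """Estimate log p(x)/q(x) from the k nearest neighbors of x in the pooled sample."""
    pooled = [(abs(x - s), 0) for s in p_samples] + [(abs(x - s), 1) for s in q_samples]
    nearest = sorted(pooled)[:k]
    from_p = sum(1 for _, source in nearest if source == 0)
    from_q = k - from_p
    # Normalize counts by sample size; b keeps the ratio finite when a count is 0.
    return math.log((from_p + b) / len(p_samples)) - math.log((from_q + b) / len(q_samples))

def knn_kl(p_samples, q_samples, k=10):
    """Plug-in estimate of KL(P || Q): average estimated log ratio over the P sample."""
    total = 0.0
    for i, x in enumerate(p_samples):
        others = p_samples[:i] + p_samples[i + 1:]  # leave x out of its own neighborhood
        total += knn_log_ratio(x, others, q_samples, k)
    return total / len(p_samples)

random.seed(0)
p_samples = [random.gauss(0.0, 1.0) for _ in range(200)]
q_same = [random.gauss(0.0, 1.0) for _ in range(200)]
q_far = [random.gauss(10.0, 1.0) for _ in range(200)]
est_same = knn_kl(p_samples, q_same)
est_far = knn_kl(p_samples, q_far)
```

Because of the smoothing, the estimate saturates when the samples are well separated, so it should be read as a rough indicator rather than a consistent estimator.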

Estimation via classification

  • Estimate likelihood ratio with probabilistic classifier
  • Set up binary classification problem to discriminate between two distributions
  • Estimate likelihood ratio with Bayes rule
  • Estimate f-divergence with Monte Carlo estimate
  • Train classifier with logistic regression model
  • Avoid issue with linear model on frozen embeddings
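
The classifier-based recipe can be sketched end to end on 1-D data. The sketch assumes equal sample sizes from P and Q, so that Bayes rule gives p(x)/q(x) = c(x) / (1 − c(x)), where c(x) is the classifier's probability that x came from P. The logistic regression is fit by plain gradient descent; the learning rate, step count, and Gaussian toy data are illustrative assumptions.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(xs, ys, lr=0.5, steps=300):
    """Fit P(y = 1 | x) = sigmoid(w * x + b) by full-batch gradient descent."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        gw = gb = 0.0
        for x, y in zip(xs, ys):
            err = sigmoid(w * x + b) - y  # gradient of the log loss w.r.t. the logit
            gw += err * x
            gb += err
        w -= lr * gw / n
        b -= lr * gb / n
    return w, b

random.seed(0)
p_samples = [random.gauss(0.0, 1.0) for _ in range(1000)]  # labeled 1
q_samples = [random.gauss(1.0, 1.0) for _ in range(1000)]  # labeled 0
w, b = fit_logistic(p_samples + q_samples, [1] * 1000 + [0] * 1000)

def log_ratio(x):
    # With balanced classes, log(c(x) / (1 - c(x))) is exactly the logit w * x + b.
    return w * x + b

# Monte Carlo estimate of KL(P || Q) = E_{x ~ P}[log p(x)/q(x)].
kl_estimate = sum(log_ratio(x) for x in p_samples) / len(p_samples)
```

For these two unit-variance Gaussians the true log ratio is linear in x and KL(P ‖ Q) = 0.5, so the linear classifier is well specified here. On real embeddings the quality of the estimate depends on how well the classifier approximates the true posterior.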

Estimation via parametric approximations

  • Estimate f-divergence with parametric approximation
  • Approximate probability density function with multivariate Gaussian distribution
  • Evaluate integration with Monte Carlo approach
  • Focus on information divergence based scores to evaluate generative models
  • Active research area analyzing generative models and establishing theoretical results
  • Line of research on statistical trade-off curves
  • Information divergence based scores for texts and images
  • Theoretical results on statistical estimation of information divergences
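
The parametric approach in this section (fit a Gaussian to each sample, then evaluate the divergence integral by Monte Carlo) can be sketched in one dimension. The sketch targets KL(P ‖ R_λ) for the mixture R_λ = λP + (1 − λ)Q, which has no closed form for Gaussians. The function names, sample sizes, and the restriction to 1-D are simplifying assumptions; the multivariate case would fit a mean vector and a covariance matrix instead.

```python
import math
import random
import statistics

def gaussian_logpdf(x, mu, sigma):
    """Log density of a 1-D Gaussian with mean mu and standard deviation sigma."""
    return -0.5 * math.log(2 * math.pi * sigma ** 2) - (x - mu) ** 2 / (2 * sigma ** 2)

def parametric_kl_to_mixture(p_samples, q_samples, lam, num_draws=20000, seed=0):
    """Monte Carlo estimate of KL(P_hat || lam*P_hat + (1-lam)*Q_hat) for fitted Gaussians."""
    mu_p, sd_p = statistics.mean(p_samples), statistics.stdev(p_samples)
    mu_q, sd_q = statistics.mean(q_samples), statistics.stdev(q_samples)
    rng = random.Random(seed)
    total = 0.0
    for _ in range(num_draws):
        x = rng.gauss(mu_p, sd_p)  # draw from the Gaussian fitted to P
        log_p = gaussian_logpdf(x, mu_p, sd_p)
        log_q = gaussian_logpdf(x, mu_q, sd_q)
        log_r = math.log(lam * math.exp(log_p) + (1 - lam) * math.exp(log_q))
        total += log_p - log_r
    return total / num_draws

random.seed(1)
p_samples = [random.gauss(0.0, 1.0) for _ in range(2000)]
q_samples = [random.gauss(1.0, 1.0) for _ in range(2000)]
kl_to_q = parametric_kl_to_mixture(p_samples, q_samples, lam=0.0)    # R = Q
kl_to_mid = parametric_kl_to_mixture(p_samples, q_samples, lam=0.5)  # R = (P + Q) / 2
```

Setting λ = 0 recovers KL(P ‖ Q), which is 0.5 for these two unit-variance Gaussians; larger λ pulls the mixture toward P and shrinks the divergence.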

Divergence frontiers for generative models

  • Sajjadi et al. (2018) and Kynkäänniemi et al. (2019) proposed to account for errors of generative modeling using trade-off curves.
  • Djolonga et al. (2020) proposed information divergence frontiers based on Rényi divergences.
  • Djolonga et al. (2020) showed how to compute the divergence frontiers in special cases.
  • Pillutla et al. (2021) showed that the original MAUVE score compares favorably to other automatic metrics for evaluating neural text.
  • Pimentel et al. (2022) showed correlation between MAUVE score and human judgment.

Divergence measures for text

  • Three broad categories of measures of similarity/divergence between machine text and human text: reference-based, statistics-based, and language modeling
  • Reference-based metrics evaluate generated text with respect to a (small set of) reference text sample(s)
  • Classical metrics for n-gram matching capture similarities in the surface form of the generated text and the human references
  • More recent reference-based metrics capture distributional semantics beyond superficial n-gram statistics
  • Statistics-based metrics compare the model distribution Q with respect to the human distribution P on the basis of some statistic T(P) and T(Q)
  • Language modeling metrics calculate how (un)likely human text x ∼ P is under the model distribution Q
  • Automatic metrics have been proposed for specific domains
  • MAUVE compares machine and human text in a domain-agnostic manner
  • HUSE combines human judgments of Type I errors with Type II errors measured using perplexity under Q

Divergence measures for images

  • Evaluation of generative models is an active area of research in computer vision
  • Inception Score is based on large-scale supervised classification tasks
  • Fréchet Inception Distance and Kernel Inception Distance are used for evaluating generative models
  • Fréchet distance fails to capture dependence on text length, while the proposed approach can

Statistical estimation of information divergences

  • Problem of estimating functionals of discrete distributions
  • Estimation of KL divergences studied in fixed and large alphabet regimes
  • Naïve plug-in estimator has infinite minimax quadratic risk
  • Missing mass phenomenon especially prominent in large alphabet regime
  • Add-constant smoothing used to address challenge
  • Deep neural networks used to find data-dependent vector quantization
  • Rich literature on statistical estimation of f-divergences using other methods

Experiments: setup

  • Open-ended text generation tasks require a model to generate text based on a given prompt.
  • The prompt is usually short (35-50 tokens) while the generated text is much longer (500-1000 tokens).

Task domains and models

  • Three different text domains are considered: web text, news, and stories
  • For each domain, size-based variants of transformer language models are used
  • Web Text Generation task involves generating articles from the Webtext dataset using pre-trained GPT-2 models
  • News Generation task involves generating the body of a news article given the title and metadata
  • Story Continuation task involves continuing a story given a situation and the beginning of the story as a prompt

Decoding algorithms

  • Greedy decoding attempts to maximize likelihood of text
  • Ancestral sampling produces unbiased samples from model distribution
  • Nucleus sampling truncates tail of distribution
  • Adversarial perplexity sampling generates low-quality text to match perplexity of human text

Baseline metrics

  • Compared proposed measures to automatic evaluation metrics used previously
  • Measured perplexity of generated text
  • Measured Zipf coefficient of rank versus unigram frequency plot
  • Measured fraction of generations which devolved into repetitions
  • Measured fraction of distinct n-grams from all possible n-grams across all generations

Human judgements and evaluation of automatic metrics

  • An effective metric should yield judgments that correlate highly with human judgments.
  • Human evaluation involves choosing a particular (model, decoder) setting based on the resultant generations.
  • Human evaluation is done on a 5-point Likert scale.
  • Evaluation of automatic metrics is done by comparing the ranking it induces to that obtained by the human evaluation using the Spearman rank correlation.
  • The end result is a correlation score in [-1, 1], with higher values meaning that quality judgments from the automatic metric agree more closely with those made by human evaluators.
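
The Spearman rank correlation used for this comparison is the Pearson correlation of the two rank vectors. A minimal standard-library sketch follows; the scores in the example are made-up numbers for hypothetical (model, decoder) settings, not results from the paper.

```python
import math

def ranks(values):
    """Ranks starting at 1; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for t in range(i, j + 1):
            result[order[t]] = avg
        i = j + 1
    return result

def pearson(a, b):
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / math.sqrt(va * vb)

def spearman(a, b):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(a), ranks(b))

# Made-up scores for four hypothetical (model, decoder) settings.
human_scores = [1.2, 3.4, 2.2, 4.8]
metric_scores = [0.1, 0.7, 0.3, 0.9]
rho = spearman(human_scores, metric_scores)
```

Only the induced ordering matters: any monotone rescaling of the automatic metric leaves the correlation unchanged.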

Hyperparameters

  • MAUVE_KL is computed using k-means vector quantization with 500 buckets
  • Krichevsky-Trofimov (add-1/2) smoothing is used
  • MAUVE_KL is different from the version used in Pillutla et al., 2021