Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

Generative AI can generate text and images that look like they were made by humans.
MAUVE is a family of comparison measures that quantify how close generated data is to real data.
These scores can be estimated with four different approaches.
MAUVE can be used to measure the gap between human-written text and modern neural language models.

Paper Content

Introduction

Large-scale generative AI models can produce human-like text and realistic images
Models such as ChatGPT, Stable Diffusion, DALL-E, and GPT-3 can produce original content
Evaluating these models requires substantial effort
Automatic measures can reduce the cost of evaluation
Comparing a model’s distribution with the target distribution requires considering two types of errors
MAUVE scores measure the gap between human-written text and modern neural language models
MAUVE scores can also be used to compare image distributions

Contributions

Provide a scalar summary of the discrepancy between a generative model Q and the target distribution P
Propose three scalar statistical summaries of divergence frontiers
Propose four methods for estimating divergence frontiers from i.i.d. samples
Develop new error bounds for the first quantization approach
Give an error bound that allows for long tails and countable support of the distribution P
Give a statistical error bound on the integral summary
Give a similar bound for general f-divergences
Apply add-constant smoothing to estimate the two distributions
Give a statistical error bound for the add-constant estimators
Show that there exists a quantization scheme with error O(1/k)
Combine the statistical and quantization error bounds
Study the effectiveness of the proposed measure for comparing text distributions
Compare the 4 estimation methods
Compare different variants of the proposed measure
Demonstrate that the proposed measures correlate with human quality judgments
Demonstrate that the proposed measures quantify known properties of generated text
Show that the measure correlates perfectly with the widely used Fréchet distance in the image domain

Background and setup

Covers the basics of open-ended text generation
Sets up the problem of comparing multiple generative models

Language modeling and open-ended text generation

Neural autoregressive language models form the backbone of prevailing approaches to text generation
Language models are used to model the conditional distribution over the next token in a sequence
Neural language models are based on the transformer architecture
Training objective is to minimize the KL divergence between the distributions assigned by humans and the language model
Open-ended text generation task is to output text in continuation of a given context
Goal is to generate text that is coherent, fluent, creative, and engaging
Text generation system is modeled as a probability distribution
Text is generated in a serial, left-to-right fashion
Decoding algorithms are used to reshape the language model to promote more conservative outputs
Temperature rescaling, top-K sampling, and nucleus sampling are popular decoding algorithms
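
The three decoding algorithms named above can be illustrated as transformations of a single next-token distribution. This is a minimal sketch using NumPy; the function names and the toy probability vector are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def temperature_rescale(probs, temperature):
    """Sharpen (temperature < 1) or flatten (temperature > 1) a distribution."""
    logits = np.log(probs) / temperature
    logits -= logits.max()  # subtract the max for numerical stability
    out = np.exp(logits)
    return out / out.sum()

def top_k(probs, k):
    """Keep the k most likely tokens and renormalize."""
    keep = np.argsort(probs)[::-1][:k]
    out = np.zeros_like(probs)
    out[keep] = probs[keep]
    return out / out.sum()

def nucleus(probs, top_p):
    """Keep the smallest set of most likely tokens whose mass reaches top_p."""
    order = np.argsort(probs)[::-1]
    cumulative = np.cumsum(probs[order])
    cutoff = int(np.searchsorted(cumulative, top_p)) + 1
    keep = order[:cutoff]
    out = np.zeros_like(probs)
    out[keep] = probs[keep]
    return out / out.sum()

# Toy next-token distribution over a 5-token vocabulary (made-up numbers).
next_token_probs = np.array([0.5, 0.25, 0.15, 0.07, 0.03])
sharpened = temperature_rescale(next_token_probs, 0.5)
truncated_k = top_k(next_token_probs, 2)
truncated_p = nucleus(next_token_probs, 0.8)
```

All three reshape the distribution toward its high-probability region, which is what makes the outputs more conservative than plain sampling.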

Comparing generative models

Evaluating a text generation model usually involves comparing the output of the model to human-written text.
Open-ended generation is difficult to evaluate because there can be multiple correct outputs.
The goal of open-ended text generation is to generate text that is human-like.
The goal of image generation is to generate photorealistic images.
The evaluation of the generative model is measuring the gap between the model distribution and the target distribution.

Information divergences

Definition of f-divergences
f is convex and nonnegative with f(1) = 0
Conjugate generator to f is also convex
Examples of f-divergences (KL, Interpolated KL, Jensen-Shannon, Interpolated χ2)
Definition of f-divergence frontiers
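
For discrete distributions, an f-divergence can be written as D_f(P ‖ Q) = Σ_x Q(x) f(P(x)/Q(x)). The sketch below assumes that form, assumes Q has full support, and uses the nonnegative generators f_KL(t) = t log t − t + 1 and f_χ²(t) = (t − 1)²; the function names are illustrative.

```python
import numpy as np

def f_kl(t):
    """Generator of the KL divergence: t log t - t + 1 (with 0 log 0 = 0)."""
    t = np.asarray(t, dtype=float)
    safe = np.where(t > 0, t, 1.0)  # avoid log(0); the term t * log(t) is 0 there
    return t * np.log(safe) - t + 1.0

def f_chi2(t):
    """Generator of the chi-squared divergence: (t - 1)^2."""
    return (np.asarray(t, dtype=float) - 1.0) ** 2

def f_divergence(p, q, f):
    """D_f(P || Q) = sum_x q(x) f(p(x) / q(x)); assumes q(x) > 0 everywhere."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return float(np.sum(q * f(p / q)))

p = np.array([0.5, 0.5])
q = np.array([0.25, 0.75])
kl_pq = f_divergence(p, q, f_kl)
chi2_pq = f_divergence(p, q, f_chi2)
```

Both generators are convex, nonnegative, and vanish at t = 1, so the divergence is zero when P = Q.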

Tradeoff curves to evaluate generative models

Generative model Q attempts to model target distribution P
Two types of costs to evaluate Q with respect to P: type I cost (mass of Q that has low/zero probability mass under P) and type II cost (mass of P that Q does not adequately capture)
Type I cost measured by surrogate KL(Q ‖ R)
Type II cost measured by KL(P ‖ R)
Pareto frontier of the multi-objective optimization min_R (KL(P ‖ R), KL(Q ‖ R))
Divergence frontier F(P, Q) carved out by mixtures R_λ = λP + (1 − λ)Q
f-divergence frontier F_f(P, Q) for two distributions P, Q and divergence generator function f
Each coordinate of the f-divergence frontier is itself an f-divergence
D_{f_λ} is a valid f-divergence
If f is twice differentiable with f''(t) > 0 for all t > 0, then f_λ is strictly convex with f_λ''(t) > 0 for all t > 0
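
To make the frontier concrete, the following sketch traces the KL divergence frontier of two small discrete distributions by sweeping the mixture weight λ over the open interval (0, 1). The grid size and the example distributions are assumptions chosen for illustration.

```python
import numpy as np

def kl(a, b):
    """KL(a || b) for discrete distributions, with the convention 0 log 0 = 0."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    mask = a > 0
    return float(np.sum(a[mask] * np.log(a[mask] / b[mask])))

def kl_frontier(p, q, num_points=25):
    """Return lambdas and the points (KL(Q || R), KL(P || R)) for R = lam*P + (1-lam)*Q."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    lambdas = np.linspace(0.0, 1.0, num_points + 2)[1:-1]  # exclude 0 and 1
    points = []
    for lam in lambdas:
        r = lam * p + (1.0 - lam) * q
        points.append((kl(q, r), kl(p, r)))
    return lambdas, np.array(points)

p = np.array([0.7, 0.2, 0.1, 0.0])
q = np.array([0.1, 0.2, 0.3, 0.4])
lambdas, frontier = kl_frontier(p, q)
```

As λ grows the mixture moves toward P, so the second coordinate KL(P ‖ R) shrinks while the first coordinate KL(Q ‖ R) grows, which is the tradeoff the frontier captures. Both stay finite even though P and Q have different supports.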

Scalar summaries of divergence frontiers

Area Summary: A measure of similarity between two models, bounded between 0 and 1, with larger values denoting greater similarity.
Integral Summary: Average linearized cost over a range of values, bounded above by 1.
Mid-point Summary: Linearized cost with weight of 1/2, recovers Jensen-Shannon and Le Cam divergences.
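
A rough sketch of two of these summaries for discrete distributions follows. It assumes the area summary is the area under the curve of points (exp(−c·KL(Q ‖ R_λ)), exp(−c·KL(P ‖ R_λ))) closed with the corner points (0, 1) and (1, 0), and that the mid-point summary with KL is the Jensen-Shannon divergence; the value c = 5 and the grid size are illustrative assumptions.

```python
import numpy as np

def kl(a, b):
    """KL(a || b) for discrete distributions, with the convention 0 log 0 = 0."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    mask = a > 0
    return float(np.sum(a[mask] * np.log(a[mask] / b[mask])))

def area_summary(p, q, c=5.0, num_points=200):
    """Area under the exponentiated KL frontier (trapezoid rule)."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    # Descending lambda moves R from near P to near Q, so x increases along the curve.
    lambdas = np.linspace(0.0, 1.0, num_points + 2)[1:-1][::-1]
    mixtures = [lam * p + (1.0 - lam) * q for lam in lambdas]
    x = np.array([0.0] + [np.exp(-c * kl(q, r)) for r in mixtures] + [1.0])
    y = np.array([1.0] + [np.exp(-c * kl(p, r)) for r in mixtures] + [0.0])
    return float(np.sum(np.diff(x) * (y[1:] + y[:-1]) / 2.0))

def midpoint_summary(p, q):
    """Linearized cost at lambda = 1/2, which is the Jensen-Shannon divergence for KL."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    r = 0.5 * (p + q)
    return 0.5 * kl(p, r) + 0.5 * kl(q, r)

identical = area_summary([0.5, 0.5], [0.5, 0.5])
disjoint = area_summary([1.0, 0.0], [0.0, 1.0])
js_disjoint = midpoint_summary([1.0, 0.0], [0.0, 1.0])
```

Identical distributions give an area of 1, while distributions with disjoint supports give an area close to 0 and a mid-point summary of log 2.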

Properties of divergence frontier summaries

Study properties of area summary MAUVE
Fix an f-divergence D_f and a scaling constant c > 0
MAUVE satisfies 0 ≤ MAUVE_f(P, Q) ≤ 1
Integral summary FI of f-divergence frontier generated by convex function
KL divergence frontier generated by f_KL
χ2 divergence frontier generated by FIχ2
Mid-point summary Mid_f of f-divergence frontier generated by convex function f_{1/2}
Estimate summaries MAUVE, FI, and Mid using i.i.d. samples
Vector quantization, nearest-neighbor estimation, classifier-based estimation, and parametric approximation methods used to estimate f-divergences

Estimation via vector quantization

Define quantization of P over S
Approximate intractable divergence frontier with lower-dimensional counterpart
Estimate frontier with plug-in estimator
Best quantization schemes are data-dependent
Missing mass phenomenon can lead to poor quality estimators
Add-constant smoothing technique used to address challenge
Nearest neighbor estimation requires continuous distributions
Smooth distributions by convolving with Gaussian
Estimator always underestimates f-divergence
Assumptions on f and f* for statistical error bound
Naïve upper bound on absolute error
Bound independent of p*
Parametric rate of convergence
Distribution-free bound
Estimate entire f-divergence frontier
Bounds hold for KL divergence frontier
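
The quantize, smooth, and plug-in steps can be sketched as follows. This is a simplified illustration on one-dimensional data: quantile bins computed from the pooled samples stand in for a data-dependent quantizer, the add-constant value b = 1/2 is one common choice, and all names are hypothetical.

```python
import numpy as np

def smoothed_histogram(labels, k, b=0.5):
    """Add-constant estimate: (count + b) / (n + k * b) for each of the k cells."""
    counts = np.bincount(labels, minlength=k).astype(float)
    return (counts + b) / (counts.sum() + k * b)

def quantize_1d(samples, edges):
    """Map each sample to a cell index in 0..len(edges)."""
    return np.digitize(samples, edges)

rng = np.random.default_rng(0)
x_p = rng.normal(0.0, 1.0, size=2000)  # samples standing in for the target P
x_q = rng.normal(0.5, 1.0, size=2000)  # samples standing in for the model Q

k = 16
# k - 1 interior edges taken from pooled quantiles give a data-dependent partition.
edges = np.quantile(np.concatenate([x_p, x_q]), np.linspace(0.0, 1.0, k + 1)[1:-1])

p_hat = smoothed_histogram(quantize_1d(x_p, edges), k)
q_hat = smoothed_histogram(quantize_1d(x_q, edges), k)

# Plug-in estimate of the quantized KL divergence.
kl_hat = float(np.sum(p_hat * np.log(p_hat / q_hat)))
```

Smoothing keeps every cell probability strictly positive, so the plug-in estimate stays finite even when a cell receives no samples from one of the two distributions.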

Estimation via nearest neighbors

Estimate divergence frontier and its summaries by counting nearest neighbors in an embedding space
Define metric ρ on data space X
Estimate f-divergence with estimator
Estimate divergence frontier with kernel density estimator
Estimate divergence with plug-in estimate
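
One way to picture neighbor counting is the simplified estimator below: for each sample from Q, it looks at the k nearest neighbors in the pooled sample and uses the fraction that came from P as a local estimate of the likelihood ratio. This is a hedged sketch of the general idea, not the exact estimator analyzed in the paper; the +1 in the denominator, the choice k = 20, and the Euclidean metric are assumptions.

```python
import numpy as np

def f_kl(t):
    """Generator of the KL divergence: t log t - t + 1 (with 0 log 0 = 0)."""
    t = np.asarray(t, dtype=float)
    safe = np.where(t > 0, t, 1.0)
    return t * np.log(safe) - t + 1.0

def knn_f_divergence(x_p, x_q, f, k=20):
    """Estimate D_f(P || Q) = E_Q[f(p/q)] by counting neighbors around Q samples."""
    n, m = len(x_p), len(x_q)
    pooled = np.concatenate([x_p, x_q])
    from_p = np.arange(n + m) < n
    # Euclidean distance from each Q sample to every pooled sample.
    dists = np.linalg.norm(x_q[:, None, :] - pooled[None, :, :], axis=-1)
    dists[np.arange(m), n + np.arange(m)] = np.inf  # exclude the point itself
    nearest = np.argsort(dists, axis=1)[:, :k]
    n_p = from_p[nearest].sum(axis=1)  # neighbors drawn from P
    n_q = k - n_p                      # neighbors drawn from Q
    ratio = (n_p / n) / ((n_q + 1) / m)  # +1 avoids division by zero
    return float(np.mean(f(ratio)))

rng = np.random.default_rng(0)
q_samples = rng.normal(size=(500, 2))
p_close = rng.normal(size=(500, 2))                 # same distribution as Q
p_far = rng.normal(size=(500, 2)) + [3.0, 0.0]      # shifted away from Q

est_same = knn_f_divergence(p_close, q_samples, f_kl)
est_far = knn_f_divergence(p_far, q_samples, f_kl)
```

The estimate is small when both samples come from the same distribution and grows as the two samples separate in the embedding space.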

Estimation via classification

Estimate likelihood ratio with probabilistic classifier
Set up binary classification problem to discriminate between two distributions
Estimate likelihood ratio with Bayes rule
Estimate f-divergence with Monte Carlo estimate
Train classifier with logistic regression model
Avoid issue with linear model on frozen embeddings
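
The classifier route can be sketched end to end with a small logistic regression written in NumPy. The sketch assumes label 1 for samples from P and label 0 for samples from Q, uses Bayes' rule to turn the classifier logit into a log likelihood ratio, and averages that ratio over P samples as a Monte Carlo estimate of KL(P ‖ Q). The raw one-dimensional features stand in for embeddings, and the learning rate and step count are illustrative.

```python
import numpy as np

def fit_logistic(features, labels, lr=0.5, steps=2000):
    """Fit logistic regression with a bias term by plain gradient descent."""
    x = np.hstack([features, np.ones((len(features), 1))])
    w = np.zeros(x.shape[1])
    for _ in range(steps):
        s = 1.0 / (1.0 + np.exp(-(x @ w)))
        w -= lr * x.T @ (s - labels) / len(labels)
    return w

def log_likelihood_ratio(features, w, n_p, n_q):
    """Bayes' rule: logit = log(p/q) + log(n_p/n_q), so subtract the prior term."""
    x = np.hstack([features, np.ones((len(features), 1))])
    return x @ w - np.log(n_p / n_q)

rng = np.random.default_rng(0)
x_p = rng.normal(1.0, 1.0, size=(2000, 1))  # samples from P = N(1, 1)
x_q = rng.normal(0.0, 1.0, size=(2000, 1))  # samples from Q = N(0, 1)

features = np.vstack([x_p, x_q])
labels = np.concatenate([np.ones(len(x_p)), np.zeros(len(x_q))])
w = fit_logistic(features, labels)

# Monte Carlo estimate of KL(P || Q) = E_P[log p(x)/q(x)]; the true value here is 0.5.
kl_hat = float(np.mean(log_likelihood_ratio(x_p, w, len(x_p), len(x_q))))
```

For this toy pair of Gaussians the true log ratio is linear in x, so a linear classifier is well specified; on real embeddings the quality of the estimate depends on how well the classifier is calibrated.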

Estimation via parametric approximations

Estimate f-divergence with parametric approximation
Approximate probability density function with multivariate Gaussian distribution
Evaluate integration with Monte Carlo approach
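
A minimal sketch of the parametric route: fit a multivariate Gaussian to each sample, then estimate KL(P ‖ Q) by averaging the log density ratio over draws from the fitted P. The helper names, sample sizes, and toy distributions are assumptions for illustration.

```python
import numpy as np

def fit_gaussian(samples):
    """Return the sample mean and covariance of an (n, d) array."""
    return samples.mean(axis=0), np.cov(samples, rowvar=False)

def gaussian_logpdf(x, mean, cov):
    """Log density of N(mean, cov) evaluated at each row of x."""
    d = len(mean)
    diff = x - mean
    solved = np.linalg.solve(cov, diff.T).T
    _, logdet = np.linalg.slogdet(cov)
    return -0.5 * (np.sum(diff * solved, axis=1) + logdet + d * np.log(2.0 * np.pi))

rng = np.random.default_rng(0)
x_p = rng.normal(size=(2000, 2))                # samples from P = N(0, I)
x_q = rng.normal(size=(2000, 2)) + [1.0, 0.0]   # samples from Q = N((1, 0), I)

mean_p, cov_p = fit_gaussian(x_p)
mean_q, cov_q = fit_gaussian(x_q)

# Monte Carlo integration: draw from the fitted P and average log p_hat - log q_hat.
draws = rng.multivariate_normal(mean_p, cov_p, size=20000)
kl_hat = float(np.mean(gaussian_logpdf(draws, mean_p, cov_p)
                       - gaussian_logpdf(draws, mean_q, cov_q)))
```

In this toy setup the true KL divergence is 0.5. The same Monte Carlo step applies to other f-divergences by replacing the log ratio with f applied to the density ratio under the appropriate sampling distribution.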

Related work

Focus on information divergence based scores to evaluate generative models
Active research area analyzing generative models and establishing theoretical results
Line of research on statistical trade-off curves
Information divergence based scores for texts and images
Theoretical results on statistical estimation of information divergences

Divergence frontiers for generative models

Sajjadi et al. (2018) and Kynkäänniemi et al. (2019) proposed to account for errors of generative modeling using trade-off curves.
Djolonga et al. (2020) proposed information divergence frontiers based on Rényi divergences.
Djolonga et al. (2020) showed how to compute the divergence frontiers in special cases.
Pillutla et al. (2021) showed that the original MAUVE score compares favorably to other automatic metrics for evaluating neural text.
Pimentel et al. (2022) showed correlation between MAUVE score and human judgment.

Divergence measures for text

Three broad categories of measures of similarity/divergence between machine text and human text: reference-based, statistics-based, and language modeling
Reference-based metrics evaluate generated text with respect to a (small set of) reference text sample(s)
Classical metrics for n-gram matching capture similarities in the surface form of the generated text and the human references
More recent reference-based metrics capture distributional semantics beyond superficial n-gram statistics
Statistics-based metrics compare the model distribution Q with the human distribution P on the basis of some statistic T(P) and T(Q)
Language modeling metrics calculate how (un)likely human text x ∼ P is under the model distribution Q
Automatic metrics have been proposed for specific domains
MAUVE compares machine and human text in a domain-agnostic manner
HUSE combines human judgments of Type I errors with Type II errors measured using perplexity under Q

Divergence measures for images

Evaluation of generative models is an active area of research in computer vision
Inception Score is based on large-scale supervised classification tasks
Fréchet Inception Distance and Kernel Inception Distance are used for evaluating generative models
Fréchet distance fails to capture dependence on text length, while proposed approach can

Statistical estimation of information divergences

Problem of estimating functionals of discrete distributions
Estimation of KL divergences studied in fixed and large alphabet regimes
Naïve plug-in estimator has infinite minimax quadratic risk
Missing mass phenomenon especially prominent in large alphabet regime
Add-constant smoothing used to address challenge
Deep neural networks used to find data-dependent vector quantization
Rich literature on statistical estimation of f-divergences using other methods

Experiments: setup

Open-ended text generation tasks require a model to generate text based on a given prompt.
The prompt is usually short (35-50 tokens) while the generated text is much longer (500-1000 tokens).

Task domains and models

Three different text domains are considered: web text, news, and stories
For each domain, size-based variants of transformer language models are used
Web Text Generation task involves generating articles from the Webtext dataset using pre-trained GPT-2 models
News Generation task involves generating the body of a news article given the title and metadata
Story Continuation task involves continuing a story given a situation and the start of the story as a prompt

Decoding algorithms

Greedy decoding attempts to maximize likelihood of text
Ancestral sampling produces unbiased samples from model distribution
Nucleus sampling truncates tail of distribution
Adversarial perplexity sampling generates low-quality text to match perplexity of human text

Baseline metrics

Compared proposed measures to automatic evaluation metrics used previously
Measured perplexity of generated text
Measured Zipf coefficient of rank versus unigram frequency plot
Measured fraction of generations which devolved into repetitions
Measured fraction of distinct n-grams from all possible n-grams across all generations
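
Two of these baseline statistics are simple enough to sketch directly. The code below assumes generations are already tokenized into lists of strings, and it estimates the Zipf coefficient as the negated slope of a least-squares line through the log-rank versus log-frequency points; both function names are illustrative.

```python
from collections import Counter

import numpy as np

def distinct_n(generations, n):
    """Fraction of distinct n-grams among all n-grams across the generations."""
    ngrams = [tuple(tokens[i:i + n])
              for tokens in generations
              for i in range(len(tokens) - n + 1)]
    return len(set(ngrams)) / len(ngrams)

def zipf_coefficient(generations):
    """Negated slope of log unigram frequency against log rank."""
    counts = Counter(tok for tokens in generations for tok in tokens)
    freqs = np.array(sorted(counts.values(), reverse=True), dtype=float)
    ranks = np.arange(1, len(freqs) + 1)
    slope, _ = np.polyfit(np.log(ranks), np.log(freqs), 1)
    return float(-slope)

# Toy tokenized generations (made-up data).
toy = [["a", "b", "a", "b"], ["a", "c"]]
distinct_1 = distinct_n(toy, 1)
distinct_2 = distinct_n(toy, 2)

# A corpus whose unigram counts are exactly 12 / rank.
zipfian = [["w1"] * 12 + ["w2"] * 6 + ["w3"] * 4 + ["w4"] * 3]
zipf = zipf_coefficient(zipfian)
```

A coefficient near 1 matches the classic Zipfian profile, while low distinct n-gram fractions flag repetitive generations.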

Human judgements and evaluation of automatic metrics

An effective metric should yield judgments that correlate highly with human judgments.
Human evaluation involves choosing a particular (model, decoder) setting based on the resultant generations.
Human evaluation is done on a 5-point Likert scale.
Evaluation of automatic metrics is done by comparing the ranking it induces to that obtained by the human evaluation using the Spearman rank correlation.
The end result is a correlation score in [-1, 1], with higher values meaning that quality judgments using the automatic metric correlate with quality judgments made by human evaluators.
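
The ranking comparison can be sketched with a small Spearman rank correlation routine. The scores below are made-up placeholders for four (model, decoder) settings, and the implementation assumes there are no ties.

```python
import numpy as np

def spearman(a, b):
    """Spearman rank correlation: Pearson correlation of the ranks (no ties assumed)."""
    rank_a = np.argsort(np.argsort(a)).astype(float)
    rank_b = np.argsort(np.argsort(b)).astype(float)
    return float(np.corrcoef(rank_a, rank_b)[0, 1])

# Hypothetical scores for four (model, decoder) settings.
metric_scores = [0.9, 0.5, 0.7, 0.2]   # automatic metric
human_scores = [4.1, 2.0, 3.5, 3.0]    # mean human rating

rho = spearman(metric_scores, human_scores)
```

Here the two rankings disagree only on the two lowest-rated settings, so the correlation is 0.8 rather than 1.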

Hyperparameters

MAUVE KL is computed using k-means vector quantization with 500 buckets
Krichevsky-Trofimov (add-1/2) smoothing is used
MAUVE KL is different from the version used in Pillutla et al., 2021
