Link to paper
The full paper is available here.
You can also find the paper on PapersWithCode here.
Abstract
- Generative AI can generate text and images that look like they were made by humans.
- MAUVE is a family of comparison measures to measure how close generated data is to real data.
- There are four approaches to estimate these scores.
- MAUVE can be used to measure the gap between human-written text and modern neural language models.
Paper Content
Introduction
- Large-scale generative AI models can produce human-like text and realistic images
- Models such as ChatGPT, Stable Diffusion, DALL-E, and GPT-3 can produce original content
- Evaluating these models requires substantial effort
- Automatic measures can reduce the cost of evaluation
- Comparing a model’s distribution with the target distribution requires considering two types of errors
- MAUVE scores measure the gap between human-written text and modern neural language models
- MAUVE scores can also be used to compare image distributions
Contributions
- Provide a scalar summary of the discrepancy between a generative model Q and the target distribution P
- Propose three scalar statistical summaries of divergence frontiers
- Propose four methods for estimating divergence frontiers from i.i.d. samples
- Develop new error bounds for the first quantization approach
- Give an error bound that allows for long tails and countable support of the distribution P
- Give a statistical error bound on the integral summary
- Give a similar bound for general f-divergences
- Apply add-constant smoothing to estimate the two distributions
- Give a statistical error bound for the add-constant estimators
- Show that there exists a quantization scheme with error O(1/k)
- Combine the statistical and quantization error bounds
- Study the effectiveness of the proposed measure for comparing text distributions
- Compare the 4 estimation methods
- Compare different variants of the proposed measure
- Demonstrate that the proposed measures correlate with human quality judgments
- Demonstrate that the proposed measures quantify known properties of generated text
- Show that the measure correlates perfectly with the widely used Fréchet distance in the image domain
Background and setup
- Discussing basics of open-ended text generation
- Setting up problem of comparing multiple generative models
Language modeling and open-ended text generation
- Neural autoregressive language models form the backbone of prevailing approaches to text generation
- Language models are used to model the conditional distribution over the next token in a sequence
- Neural language models are based on the transformer architecture
- Training objective is to minimize the KL divergence between the distributions assigned by humans and the language model
- Open-ended text generation task is to output text in continuation of a given context
- Goal is to generate text that is coherent, fluent, creative, and engaging
- Text generation system is modeled as a probability distribution
- Text is generated in a serial, left-to-right fashion
- Decoding algorithms are used to reshape the language model to promote more conservative outputs
- Temperature rescaling, top-K sampling, and nucleus sampling are popular decoding algorithms
Comparing generative models
- Evaluating a text generation model usually involves comparing the output of the model to human-written text.
- Open-ended generation is difficult to evaluate because there can be multiple correct outputs.
- The goal of open-ended text generation is to generate text that is human-like.
- The goal of image generation is to generate photorealistic images.
- The evaluation of the generative model is measuring the gap between the model distribution and the target distribution.
Information divergences
- Definition of f-divergences
- f is convex and nonnegative with f(1) = 0
- Conjugate generator to f is also convex
- Examples of f-divergences (KL, Interpolated KL, Jensen-Shannon, Interpolated χ2)
- Definition of f-divergence frontiers
Tradeoff curves to evaluate generative models
- Generative model Q attempts to model target distribution P
- Two types of costs to evaluate Q with respect to P: type I cost (mass of Q that has low/zero probability mass under P) and type II cost (mass of P that Q does not adequately capture)
- Type I cost measured by surrogate KL(Q R)
- Type II cost measured by KL(P R)
- Pareto frontier of multi-objective optimization min R KL(P R), KL(Q R)
- Divergence frontier F(P, Q) carved out by mixtures R λ = λP + (1 − λ)Q
- f -divergence frontier F f (P, Q) for two distributions P, Q and divergence generator function f
- Each coordinate of f -divergence frontier is itself an f -divergence
- D f λ is a valid f -divergence
- If f is twice differentiable with f (t) > 0 for all t > 0, then f λ is strictly convex with f λ (t) > 0 for all t > 0
Scalar summaries of divergence frontiers
- Area Summary: A measure of similarity between two models, bounded between 0 and 1, with larger values denoting greater similarity.
- Integral Summary: Average linearized cost over a range of values, bounded above by 1.
- Mid-point Summary: Linearized cost with weight of 1/2, recovers Jensen-Shannon and Le Cam divergences.
Properties of divergence frontier summaries
- Study properties of area summary MAUVE
- Fix f-divergence Df with c > 0
- MAUVE satisfies 0 ≤ MAUVE f (P, Q) < 1
- Integral summary FI of f-divergence frontier generated by convex function
- KL divergence frontier generated by fKL
- χ2 divergence frontier generated by FIχ2
- Mid-point summary Mid f of f-divergence frontier generated by convex function f1/2
- Estimate summaries MAUVE, FI, and Mid using i.i.d. samples
- Vector quantization, nearest-neighbor estimation, classifier-based estimation, and parametric approximation methods used to estimate f-divergences
Estimation via vector quantization
- Define quantization of P over S
- Approximate intractable divergence frontier with lower-dimensional counterpart
- Estimate frontier with plug-in estimator
- Best quantization schemes are data-dependent
- Missing mass phenomenon can lead to poor quality estimators
- Add-constant smoothing technique used to address challenge
- Nearest neighbor estimation requires continuous distributions
- Smooth distributions by convolving with Gaussian
- Estimator always underestimates f-divergence
- Assumptions on f and f* for statistical error bound
- Naïve upper bound on absolute error
- Bound independent of p*
- Parametric rate of convergence
- Distribution-free bound
- Estimate entire f-divergence frontier
- Bounds hold for KL divergence frontier
Estimation via nearest neighbors
- Estimate divergence frontier and its summaries by counting nearest neighbors in an embedding space
- Define metric ρ on data space X
- Estimate f-divergence with estimator
- Estimate divergence frontier with kernel density estimator
- Estimate divergence with plug-in estimate
Estimation via classification
- Estimate likelihood ratio with probabilistic classifier
- Set up binary classification problem to discriminate between two distributions
- Estimate likelihood ratio with Bayes rule
- Estimate f-divergence with Monte Carlo estimate
- Train classifier with logistic regression model
- Avoid issue with linear model on frozen embeddings
Estimation via parametric approximations
- Estimate f-divergence with parametric approximation
- Approximate probability density function with multivariate Gaussian distribution
- Evaluate integration with Monte Carlo approach
Related work
- Focus on information divergence based scores to evaluate generative models
- Active research area analyzing generative models and establishing theoretical results
- Line of research on statistical trade-off curves
- Information divergence based scores for texts and images
- Theoretical results on statistical estimation of information divergences
Divergence frontiers for generative models
- Sajjadi et al. (2018) and Kynkäänniemi et al. (2019) proposed to account for errors of generative modeling using trade-off curves.
- Djolonga et al. (2020) proposed information divergence frontiers based on Rényi divergences.
- Djolonga et al. (2020) showed how to compute the divergence frontiers in special cases.
- Pillutla et al. (2021) showed that the original MAUVE score compares favorably to other automatic metrics for evaluating neural text.
- Pimentel et al. (2022) showed correlation between MAUVE score and human judgment.
Divergence measures for text
- Three broad categories of measures of similarity/divergence between machine text and human text: reference-based, statistics-based, and language modeling
- Reference-based metrics evaluate generated text with respect to a (small set of) reference text sample(s)
- Classical metrics for n-gram matching capture similarities in the surface form of the generated text and the human references
- More recent reference-based metrics capture distributional semantics beyond superficial n-gram statistics
- Statistics-based metrics compare the model distribution Q with respect to the human distribution P on the basis of some statistic T (P ) and T (Q)
- Language modeling metrics calculate how (un)likely human text x ∼ P is under the model distribution Q
- Automatic metrics have been proposed for specific domains
- MAUVE compares machine and human text in a domain-agnostic manner
- HUSE combines human judgments of Type I errors with Type II errors measured using perplexity under Q
Divergence measures for images
- Evaluation of generative models is an active area of research in computer vision
- Inception Score is based on large-scale supervised classification tasks
- Fréchet Inception Distance and Kernel Inception Distance are used for evaluating generative models
- Fréchet distance fails to capture dependence on text length, while proposed approach can
Statistical estimation of information divergences
- Problem of estimating functionals of discrete distributions
- Estimation of KL divergences studied in fixed and large alphabet regimes
- Naïve plug-in estimator has infinite minimax quadratic risk
- Missing mass phenomenon especially prominent in large alphabet regime
- Add-constant smoothing used to address challenge
- Deep neural networks used to find data-dependent vector quantization
- Rich literature on statistical estimation of f-divergences using other methods
Experiments: setup
- Open-ended text generation tasks require a model to generate text based on a given prompt.
- The prompt is usually short (35-50 tokens) while the generated text is much longer (500-1000 tokens).
Task domains and models
- Three different text domains are considered: web text, news, and stories
- For each domain, size-based variants of transformer language models are used
- Web Text Generation task involves generating articles from the Webtext dataset using pre-trained GPT-2 models
- News Generation task involves generating the body of a news article given the title and metadata
- Story Continuation task involves continuing a story given a situation and a starting of the story as a prompt
Decoding algorithms
- Greedy decoding attempts to maximize likelihood of text
- Ancestral sampling produces unbiased samples from model distribution
- Nucleus sampling truncates tail of distribution
- Adversarial perplexity sampling generates low-quality text to match perplexity of human text
Baseline metrics
- Compared proposed measures to automatic evaluation metrics used previously
- Measured perplexity of generated text
- Measured Zipf coefficient of rank versus unigram frequency plot
- Measured fraction of generations which devolved into repetitions
- Measured fraction of distinct n-grams from all possible n-grams across all generations
Human judgements and evaluation of automatic metrics
- An effective metric should yield judgments that correlate highly with human judgments.
- Human evaluation involves choosing a particular (model, decoder) setting based on the resultant generations.
- Human evaluation is done on a 5-point Likert scale.
- Evaluation of automatic metrics is done by comparing the ranking it induces to that obtained by the human evaluation using the Spearman rank correlation.
- The end result is a correlation score in [-1, 1], with higher values meaning that quality judgments using the automatic metric correlate with quality judgments made by human evaluators.
Hyperparameters
- MAUVE KL is computed using k-means vector quantization with 500 buckets
- Krichevsky-Trofimov (add-1/2) smoothing is used
- MAUVE KL is different from the version used in Pillutla et al., 2021