Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

  • A framework for performing valid statistical inference when an experimental data set is supplemented with predictions from a machine-learning system
  • No assumptions are made on the machine-learning algorithm
  • Higher accuracy of predictions leads to smaller confidence intervals
  • Algorithms for computing valid confidence intervals for statistical objects
  • Demonstrated with data sets from proteomics, genomics, electronic voting, remote sensing, census analysis, and ecology

Paper Content

Introduction

  • Machine-learning algorithms are used to make predictions
  • Predictions can be used to generate predictions for entities not studied experimentally
  • Examples of predictions include molecular activity, tumor prognoses, and microclimatic modeling
  • Analysis was done to form a confidence interval for the fraction of the Amazon rainforest lost between 2000 and 2015
  • Gold-standard labels were collected through field visits, an expensive process
  • Machine-learning predictions of forest cover were also available, based on satellite imagery
  • Two natural alternatives for constructing confidence intervals were used
  • Imputation approach yields a small confidence interval that fails to cover the true deforestation fraction
  • Classical approach covers the truth at the expense of a wider interval
  • Second example used to construct a confidence interval for an odds ratio between phosphorylation and disorder
  • Imputation approach significantly overestimates the true odds ratio
  • Prediction-powered inference framework provides an affirmative answer to the question of whether predictions can improve inferential quality
  • Confidence intervals cover the truth and are smaller than those obtained using the classical approach

General principle

  • Goal is to estimate a quantity θ*
  • Have access to small gold-standard data set and large unlabeled data set
  • Use predictions from machine-learning algorithm to estimate θ*
  • Introduce problem-specific measure of prediction error called rectifier, ∆f
  • Use gold-standard data to construct confidence set for rectifier, R
  • Construct confidence set C PP by rectifying θf with each value in R
  • Confidence intervals and p-values for a broad class of statistical problems

Further preliminaries

  • Labeled data set is denoted as (X, Y) ∈ (X ×Y) n
  • Unlabeled data set is denoted as ( X, Y ) ∈ (X × Y) N
  • Data sets are assumed to be i.i.d. samples from a common distribution, P
  • Estimand of interest is denoted as θ*
  • Prediction rule is denoted as f : X → Y
  • Rectifier is a measure of the prediction rule’s accuracy
  • Empirical rectifier is an estimate of the rectifier based on labeled data

Warmup: mean estimation

  • Goal is to give valid confidence interval for average outcome
  • Classical estimate of average outcome is sample average of labeled data set
  • Prediction-powered estimate leads to tighter confidence intervals if prediction rule is accurate
  • Prediction-powered confidence interval is better than classical interval when model is good
  • Related to inference under misspecification
  • Leverages auxiliary data to improve inference from surveys
  • Mean estimator in Section 1.3 is the difference estimator
  • Predictive models fit on separate data
  • Handles wider range of estimands
  • Related to semisupervised learning and missing data literature

Main theory: convex estimation

  • Our main contribution is a technique for inference on estimands that can be expressed as the solution to a convex optimization problem.
  • We consider estimands of the form for a loss function θ : X × Y → R that is convex in θ ∈ R p.
  • We define a rectifier to capture a notion of prediction error.
  • We create a confidence set for the rectifier.
  • We form a confidence set for θ * by combining the rectifier confidence set with a term that accounts for finite-sample fluctuations.
  • We present a valid confidence set for θ * without assumptions about the data distribution or the machine-learning model.

Algorithms

  • Mean estimation algorithm and guarantee
  • Quantile estimation algorithm and guarantee
  • Logistic regression algorithm and guarantee
  • Linear regression algorithm and guarantee
  • Algorithms rely on confidence intervals derived from the central limit theorem
  • Mean estimation expressed as minimizer of average squared loss
  • Quantile estimation expressed as solution to variational form
  • Logistic regression target of inference defined by logistic loss
  • Linear regression natural estimator is linear in Y
  • Algorithms use z 1−δ to denote 1 − δ quantile of standard normal distribution

Applications

  • Demonstrated prediction-powered inference on real tasks
  • Computed prediction-powered confidence interval and compared to two alternatives
  • Showed that imputed interval does not account for prediction errors
  • Compared widths of prediction-powered and classical intervals
  • Studied audits of electronic voting in an election with two candidates
  • Constructed confidence interval for proportion of people voting for each candidate
  • Used optical ballot labeling system and small number of ballots labeled by hand
  • Used Algorithm 1 to construct prediction-powered confidence interval
  • Prediction-powered interval has roughly 1/4 the width of classical counterpart
  • Imputed interval is invalid, does not cover ground truth

Relating protein structure and post-translational modifications

  • Demonstrate how prediction-powered confidence intervals can be used to construct confidence intervals for the odds ratio
  • Goal of analysis is to characterize the structural context of post-translational modifications (PTMs)
  • Bludau et al. studied the relationship between PTMs and IDRs on a proteome-wide scale
  • Used AlphaFold-predicted structures to predict IDRs
  • Constructed 1-α/2 prediction-powered confidence intervals for µ 0 and µ 1
  • By union bound, C PP contains θ* with probability at least 1-α
  • Used census data to investigate the quantitative effects of age and sex on income
  • Used XGBoost to impute predictions and Algorithm 4 to construct intervals
  • Used census data to study the effect of income on the procurement of private health insurance
  • Used Algorithm 3 to construct intervals
  • Used prediction-powered confidence intervals on quantiles to study the effects of regulatory DNA on gene expression
  • Used XGBoost to impute predictions and Algorithm 2 to construct intervals
  • Used Imaging FlowCytobot to count plankton
  • Used label shift technique to construct prediction-powered confidence interval on frequency of observed plankton
  • Propagated confidence interval into a count

Extensions

  • Prediction-powered inference can be applied to settings beyond i.i.d. convex estimation
  • Strategy for prediction-powered inference when θ * is the optimum of any optimization problem
  • Prediction-powered inference under certain forms of distribution shift

Beyond convex estimation

  • Tools developed in Section 2 tailored to unconstrained convex optimization problems
  • Inferential targets can be defined in terms of nonconvex losses or have nonconvex constraints
  • Generalized approach to broad class of risk minimizers
  • Problem subsumes all previously studied settings
  • Solution handles problems of the form (8) in full generality
  • Additional step of data splitting needed
  • Theorem 2 applies to discrete and continuous cases, Tukey’s biweight robust mean, and model selection

Inference under distribution shift

  • We focus on forming prediction-powered confidence intervals when the labeled and unlabeled data come from different distributions.
  • We handle all estimation problems previously studied for covariate shift and certain types of linear problems for label shift.
  • We assume that the unlabeled data defines the target of inference θ*.
  • We assume that the distributions are related by either a label shift or a covariate shift.
  • We use the Radon-Nikodym derivative to relate risk minimizers on Q to risk minimizers on P.
  • We explain the approach in detail for convex risk minimizers.
  • We use the confusion matrix to estimate the label shift.
  • We provide corollaries for mean, quantile, and logistic regression estimation.
  • We note that linear regression can also be used, but it is not recommended in practice.

B.3 regularity conditions

  • Algorithms rely on confidence intervals derived from the central limit theorem
  • Require two quantities to have at least the first two moments
  • Weak conditions for classical linear regression intervals to cover the target
  • Data points labeled as one of {deforestation, no deforestation}
  • Train a histogram-based gradient boosting tree to predict deforestation labels
  • Use Corollary A.1 and Proposition D.4 to produce confidence intervals
  • Standard binomial confidence interval for imputation approach

E.2 relating protein structure and post-translational modifications

  • Model for predicting disorder: logistic regression model maps relative solvent-accessible surface area to probability of disorder