Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

A framework for performing valid statistical inference when an experimental data set is supplemented with predictions from a machine-learning system
No assumptions are made on the machine-learning algorithm
Higher accuracy of predictions leads to smaller confidence intervals
Algorithms for computing valid confidence intervals for statistical objects such as means, quantiles, and linear and logistic regression coefficients
Demonstrated with data sets from proteomics, genomics, electronic voting, remote sensing, census analysis, and ecology

Paper Content

Introduction

Machine-learning algorithms are used to make predictions
These algorithms can generate predictions for entities that have not been studied experimentally
Examples of predictions include molecular activity, tumor prognoses, and microclimatic modeling
First example: forming a confidence interval for the fraction of the Amazon rainforest lost between 2000 and 2015
Gold-standard labels were collected through field visits, an expensive process
Machine-learning predictions of forest cover were also available, based on satellite imagery
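Two natural baselines for constructing confidence intervals are considered: imputation, which treats predictions as gold-standard labels, and the classical approach, which uses only the gold-standard data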
Imputation approach yields a small confidence interval that fails to cover the true deforestation fraction
Classical approach covers the truth at the expense of a wider interval
Second example: constructing a confidence interval for the odds ratio between phosphorylation and disorder
Imputation approach significantly overestimates the true odds ratio
Prediction-powered inference framework provides an affirmative answer to the question of whether predictions can improve inferential quality
Confidence intervals cover the truth and are smaller than those obtained using the classical approach

General principle

Goal is to estimate a quantity θ*
Have access to small gold-standard data set and large unlabeled data set
Use predictions from machine-learning algorithm to estimate θ*
Introduce a problem-specific measure of prediction error called the rectifier, Δ^f
Use gold-standard data to construct a confidence set R for the rectifier
Construct the confidence set C^PP by rectifying θ^f with each value in R
Confidence intervals and p-values for a broad class of statistical problems
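
For mean estimation, the principle above reduces to a one-line identity. The following is an illustrative sketch; the notation θ^f for the mean prediction and Δ^f for the mean prediction error is an assumption made for this example:

```latex
\theta^* = \mathbb{E}[Y]
         = \underbrace{\mathbb{E}[f(X)]}_{\theta^f}
         - \underbrace{\mathbb{E}[f(X) - Y]}_{\Delta^f},
\qquad
\mathcal{C}^{\mathrm{PP}} = \{\, \theta^f - \Delta : \Delta \in \mathcal{R} \,\}
```

The first term can be estimated from the large unlabeled data set, while the rectifier requires labels and is estimated from the small gold-standard data set. The smaller the prediction errors, the tighter the set R and hence C^PP.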

Further preliminaries

Labeled data set is denoted as (X, Y) ∈ (𝒳 × 𝒴)^n
Unlabeled data set is denoted as (X̃, Ỹ) ∈ (𝒳 × 𝒴)^N, where the labels Ỹ are not observed
Data sets are assumed to be i.i.d. samples from a common distribution, P
Estimand of interest is denoted as θ*
Prediction rule is denoted as f : X → Y
Rectifier is a measure of the prediction rule’s accuracy
Empirical rectifier is an estimate of the rectifier based on labeled data

Warmup: mean estimation

Goal is to give valid confidence interval for average outcome
Classical estimate of average outcome is sample average of labeled data set
Prediction-powered estimate leads to tighter confidence intervals if prediction rule is accurate
Prediction-powered confidence interval is better than classical interval when model is good
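
The comparison between the two intervals can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the paper's reference implementation: the function names `pp_mean_ci` and `classical_mean_ci`, the synthetic data, and the slightly biased predictor are all hypothetical, and both intervals use a central-limit-theorem approximation.

```python
import random
from statistics import NormalDist, fmean, variance


def pp_mean_ci(y_labeled, f_labeled, f_unlabeled, alpha=0.1):
    """Sketch of a CLT-based prediction-powered interval for E[Y]."""
    n, N = len(y_labeled), len(f_unlabeled)
    # Empirical rectifier terms: prediction errors on the labeled data.
    errors = [f - y for f, y in zip(f_labeled, y_labeled)]
    # Mean prediction on unlabeled data, corrected by the mean error.
    point = fmean(f_unlabeled) - fmean(errors)
    z = NormalDist().inv_cdf(1 - alpha / 2)
    half = z * (variance(f_unlabeled) / N + variance(errors) / n) ** 0.5
    return point - half, point + half


def classical_mean_ci(y_labeled, alpha=0.1):
    """Classical CLT interval using only the labeled data."""
    n = len(y_labeled)
    z = NormalDist().inv_cdf(1 - alpha / 2)
    half = z * (variance(y_labeled) / n) ** 0.5
    center = fmean(y_labeled)
    return center - half, center + half


# Synthetic data (an assumption for illustration): Y = 2X + noise,
# with a predictor that is accurate but slightly biased.
rng = random.Random(0)


def draw(k):
    xs = [rng.gauss(0.0, 1.0) for _ in range(k)]
    ys = [2 * x + rng.gauss(0.0, 0.3) for x in xs]
    fs = [2 * x + 0.1 for x in xs]
    return ys, fs


y_lab, f_lab = draw(200)      # small gold-standard set
_, f_unl = draw(20000)        # large unlabeled set (labels discarded)
pp_lo, pp_hi = pp_mean_ci(y_lab, f_lab, f_unl)
cl_lo, cl_hi = classical_mean_ci(y_lab)
print("prediction-powered:", (pp_lo, pp_hi))
print("classical:", (cl_lo, cl_hi))
```

The prediction-powered width depends on the variance of the prediction errors rather than the variance of Y itself, which is why an accurate predictor shrinks the interval.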

Related work

Related to inference under misspecification
Leverages auxiliary data to improve inference from surveys
Mean estimator in Section 1.3 is the difference estimator
Predictive models fit on separate data
Handles wider range of estimands
Related to semisupervised learning and missing data literature

Main theory: convex estimation

Our main contribution is a technique for inference on estimands that can be expressed as the solution to a convex optimization problem.
We consider estimands of the form θ* = argmin_{θ ∈ R^p} E[ℓ_θ(X, Y)], for a loss function ℓ_θ : 𝒳 × 𝒴 → R that is convex in θ ∈ R^p.
We define a rectifier to capture a notion of prediction error.
We create a confidence set for the rectifier.
We form a confidence set for θ* by combining the rectifier confidence set with a term that accounts for finite-sample fluctuations.
We present a valid confidence set for θ* without assumptions about the data distribution or the machine-learning model.
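
The construction described above can be written out as a short derivation. This is a sketch under assumptions: g_θ denotes a (sub)gradient of the loss ℓ_θ with respect to θ, the expected loss is taken to be differentiable at its minimizer, and w_θ(α) is a placeholder for the finite-sample width term.

```latex
% Convex estimand and its first-order optimality condition
\theta^* = \arg\min_{\theta \in \mathbb{R}^p} \mathbb{E}[\ell_\theta(X, Y)]
\quad\Longrightarrow\quad
\mathbb{E}[g_{\theta^*}(X, Y)] = 0

% Rectifier: the bias in the gradient caused by using predictions as labels
\Delta^f(\theta) = \mathbb{E}\big[ g_\theta(X, Y) - g_\theta(X, f(X)) \big]

% Candidate values of theta whose rectified gradient is close to zero
\mathcal{C}^{\mathrm{PP}} = \Big\{ \theta :
  \Big| \tfrac{1}{N} \sum_{i=1}^{N} g_\theta(\tilde{X}_i, f(\tilde{X}_i))
        + \hat{\Delta}^f(\theta) \Big| \le w_\theta(\alpha) \Big\}
```

For the squared loss ℓ_θ(x, y) = (θ − y)²/2, the gradient is g_θ(x, y) = θ − y, so the rectifier is E[f(X) − Y], the mean prediction error.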

Algorithms

Mean estimation algorithm and guarantee
Quantile estimation algorithm and guarantee
Logistic regression algorithm and guarantee
Linear regression algorithm and guarantee
Algorithms rely on confidence intervals derived from the central limit theorem
Mean estimation expressed as minimizer of average squared loss
Quantile estimation expressed as solution to variational form
Logistic regression target of inference defined by logistic loss
Linear regression natural estimator is linear in Y
Algorithms use z_{1−δ} to denote the 1 − δ quantile of the standard normal distribution
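
As one concrete instance, quantile estimation can be sketched as a grid search over candidate values. This is a hedged illustration rather than the paper's exact algorithm: the function name `pp_quantile_ci`, the grid, and the synthetic data are assumptions, and the sketch reports the smallest and largest accepted grid points, which may be slightly conservative.

```python
import random
from statistics import NormalDist, fmean, variance


def pp_quantile_ci(y_labeled, f_labeled, f_unlabeled, q, grid, alpha=0.1):
    """Sketch: keep each candidate theta whose rectified CDF is close to q."""
    n, N = len(y_labeled), len(f_unlabeled)
    z = NormalDist().inv_cdf(1 - alpha / 2)
    kept = []
    for theta in grid:
        # CDF at theta computed from predictions on the unlabeled data.
        ind_unl = [1.0 if f <= theta else 0.0 for f in f_unlabeled]
        # Rectifier terms: how the indicator changes when true labels
        # replace predictions on the labeled data.
        diffs = [
            (1.0 if y <= theta else 0.0) - (1.0 if f <= theta else 0.0)
            for y, f in zip(y_labeled, f_labeled)
        ]
        width = z * (variance(ind_unl) / N + variance(diffs) / n) ** 0.5
        if abs(fmean(ind_unl) + fmean(diffs) - q) <= width:
            kept.append(theta)
    return (min(kept), max(kept)) if kept else None


# Synthetic data (an assumption): standard normal outcomes, noisy predictor.
rng = random.Random(0)


def draw(k):
    ys = [rng.gauss(0.0, 1.0) for _ in range(k)]
    fs = [y + rng.gauss(0.0, 0.2) for y in ys]
    return ys, fs


y_lab, f_lab = draw(300)
_, f_unl = draw(5000)
grid = [i / 100 for i in range(-100, 101)]
ci = pp_quantile_ci(y_lab, f_lab, f_unl, q=0.5, grid=grid)
print("interval for the median:", ci)
```

The grid resolution bounds the precision of the endpoints, so a finer grid gives a sharper interval at a higher computational cost.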

Applications

Demonstrated prediction-powered inference on real tasks
Computed prediction-powered confidence interval and compared to two alternatives
Showed that imputed interval does not account for prediction errors
Compared widths of prediction-powered and classical intervals
Studied audits of electronic voting in an election with two candidates
Constructed confidence interval for proportion of people voting for each candidate
Used optical ballot labeling system and small number of ballots labeled by hand
Used Algorithm 1 to construct prediction-powered confidence interval
Prediction-powered interval has roughly 1/4 the width of classical counterpart
Imputed interval is invalid, does not cover ground truth

Relating protein structure and post-translational modifications

Demonstrate how prediction-powered confidence intervals can be used to construct confidence intervals for the odds ratio
Goal of analysis is to characterize the structural context of post-translational modifications (PTMs)
Bludau et al. studied the relationship between PTMs and IDRs on a proteome-wide scale
Used AlphaFold-predicted structures to predict IDRs
Constructed (1 − α/2)-level prediction-powered confidence intervals for µ_0 and µ_1
By a union bound, C^PP contains θ* with probability at least 1 − α
Used census data to investigate the quantitative effects of age and sex on income
Used XGBoost to impute predictions and Algorithm 4 to construct intervals
Used census data to study the effect of income on the procurement of private health insurance
Used Algorithm 3 to construct intervals
Used prediction-powered confidence intervals on quantiles to study the effects of regulatory DNA on gene expression
Used XGBoost to impute predictions and Algorithm 2 to construct intervals
Used Imaging FlowCytobot to count plankton
Used label shift technique to construct prediction-powered confidence interval on frequency of observed plankton
Propagated confidence interval into a count
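
The union-bound step used for the odds ratio in the protein example can be sketched as follows. The code assumes that µ_0 and µ_1 are the probabilities in the two groups and that the odds ratio is odds(µ_1) / odds(µ_0); the function names and the numeric intervals are hypothetical and are not results from the paper.

```python
def odds(p):
    """Odds corresponding to a probability strictly between 0 and 1."""
    return p / (1.0 - p)


def odds_ratio_interval(ci_mu0, ci_mu1):
    """Combine two probability intervals into an odds-ratio interval.

    If each input interval has level 1 - alpha/2, a union bound gives the
    output level 1 - alpha. The odds ratio increases in mu1 and decreases
    in mu0, so the extremes occur at opposite endpoints.
    """
    lo0, hi0 = ci_mu0
    lo1, hi1 = ci_mu1
    if not (0.0 < lo0 <= hi0 < 1.0 and 0.0 < lo1 <= hi1 < 1.0):
        raise ValueError("intervals must lie strictly inside (0, 1)")
    return odds(lo1) / odds(hi0), odds(hi1) / odds(lo0)


# Hypothetical input intervals, chosen only to illustrate the arithmetic.
or_lo, or_hi = odds_ratio_interval((0.10, 0.12), (0.20, 0.25))
print(or_lo, or_hi)
```

Because the two input intervals are combined at their worst-case endpoints, the resulting interval is valid but can be wider than one built directly for the odds ratio.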

Extensions

Prediction-powered inference can be applied to settings beyond i.i.d. convex estimation
Strategy for prediction-powered inference when θ* is the optimum of any optimization problem
Prediction-powered inference under certain forms of distribution shift

Beyond convex estimation

Tools developed in Section 2 tailored to unconstrained convex optimization problems
Inferential targets can be defined in terms of nonconvex losses or have nonconvex constraints
Generalized approach to broad class of risk minimizers
Problem subsumes all previously studied settings
Solution handles problems of the form (8) in full generality
Additional step of data splitting needed
Theorem 2 applies to discrete and continuous cases, Tukey’s biweight robust mean, and model selection

Inference under distribution shift

We focus on forming prediction-powered confidence intervals when the labeled and unlabeled data come from different distributions.
We handle all estimation problems previously studied for covariate shift and certain types of linear problems for label shift.
We assume that the unlabeled data defines the target of inference θ*.
We assume that the distributions are related by either a label shift or a covariate shift.
We use the Radon-Nikodym derivative to relate risk minimizers on Q to risk minimizers on P.
We explain the approach in detail for convex risk minimizers.
We use the confusion matrix to estimate the label shift.
We provide corollaries for mean, quantile, and logistic regression estimation.
We note that linear regression can also be used, but it is not recommended in practice.
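
For covariate shift, the reweighting idea can be illustrated on mean estimation. This is a sketch under several assumptions that are not taken from the paper: the labeled data come from P and the unlabeled data from Q, the density ratio w(x) = dQ_X/dP_X is known exactly, and the function name and synthetic Gaussian setup are hypothetical.

```python
import math
import random
from statistics import NormalDist, fmean, variance


def pp_mean_ci_covariate_shift(y_lab, f_lab, w_lab, f_unl, alpha=0.1):
    """Sketch: CLT interval for E_Q[Y] with an importance-weighted rectifier."""
    n, N = len(y_lab), len(f_unl)
    # Weighting labeled prediction errors by w(X) makes their average
    # estimate the mean error under Q instead of under P.
    weighted_errors = [w * (f - y) for w, f, y in zip(w_lab, f_lab, y_lab)]
    point = fmean(f_unl) - fmean(weighted_errors)
    z = NormalDist().inv_cdf(1 - alpha / 2)
    half = z * (variance(f_unl) / N + variance(weighted_errors) / n) ** 0.5
    return point - half, point + half


# Synthetic setup (an assumption): X ~ N(0, 1) under P, X ~ N(0.5, 1) under Q.
# For these two Gaussians the density ratio is exp(shift * x - shift**2 / 2).
rng = random.Random(1)
shift = 0.5
x_lab = [rng.gauss(0.0, 1.0) for _ in range(500)]
y_lab = [2 * x + rng.gauss(0.0, 0.3) for x in x_lab]
f_lab = [2 * x + 0.1 for x in x_lab]
w_lab = [math.exp(shift * x - shift ** 2 / 2) for x in x_lab]
x_unl = [rng.gauss(shift, 1.0) for _ in range(20000)]
f_unl = [2 * x + 0.1 for x in x_unl]

lo, hi = pp_mean_ci_covariate_shift(y_lab, f_lab, w_lab, f_unl)
print("interval for E_Q[Y]:", (lo, hi))  # the true value in this setup is 1.0
```

In practice the density ratio usually has to be estimated, and large weights inflate the variance of the rectifier, which widens the interval.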

B.3 regularity conditions

Algorithms rely on confidence intervals derived from the central limit theorem
Require two quantities to have at least the first two moments
Weak conditions for classical linear regression intervals to cover the target
Data points labeled as one of {deforestation, no deforestation}
Train a histogram-based gradient boosting tree to predict deforestation labels
Use Corollary A.1 and Proposition D.4 to produce confidence intervals
Standard binomial confidence interval for imputation approach

E.2 relating protein structure and post-translational modifications

Model for predicting disorder: logistic regression model maps relative solvent-accessible surface area to probability of disorder
