Abstract

  • Proposed a new method for calibrating predictors of heterogeneous treatment effects
  • Introduced a data-efficient variant of calibration that avoids the need for hold-out calibration sets
  • Established that the proposed method achieves fast doubly-robust calibration rates
  • Wrapping the proposed method around any black-box learning algorithm provides strong calibration guarantees while preserving predictive performance

Paper Content

Introduction

  • Estimation of causal effects is important for understanding interventions and informing policy
  • Treatment effect heterogeneity can provide more insights than overall population effects
  • Applications of HTEs include prioritizing treatment and individualizing treatment assignments
  • CATE estimation is of great interest in statistics and data science
  • CATE estimators build upon estimators of the conditional mean outcome and the probability of treatment given covariates
  • CATE estimation can be challenging due to non-smooth, high-dimensional nuisance parameters
  • Predictions from a given treatment effect predictor can still be useful for decision-making
  • Theoretical guarantees for rational decision-making typically hinge on the predictor being a good approximation of the true CATE
  • Calibration is a desirable property of a treatment effect predictor
  • Calibration has been widely used to enhance prediction models for classification and regression
  • Little research has been done on calibration of treatment effect predictors
  • This paper proposes a nonparametric doubly-robust method for calibrating treatment effect predictors

Statistical setup

Notation and definitions

  • Data unit O consists of three components: W, A, and Y
  • W is a vector of baseline covariates
  • A is a binary indicator of treatment
  • Y is an outcome
  • Dn is the observed dataset
  • π0 is the propensity score
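  • µ0 is the outcome regression, i.e., the conditional mean of Y given A and W
  • Y1 and Y0 denote the potential outcomes under treatment and control; higher values of the individual treatment effect Y1-Y0 are assumed desirable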
  • τ0 is the true CATE
  • γ0 is the conditional mean of the individual treatment effect given the value of a predictor τ
  • The solution to the isotonic regression problem is generally non-unique
  • The paper follows the convention of Groeneboom and Lopuhaa (1993) to select a solution
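
For reference, the quantities listed above can be written out as follows. This is a sketch under the standard potential-outcomes setup; the exact notation, in particular the arguments of γ0, is an assumption rather than a quotation from the paper, and µ0 is taken here to be the outcome regression.

```latex
\begin{align*}
\pi_0(w)         &= P(A = 1 \mid W = w), \\
\mu_0(a, w)      &= E[Y \mid A = a, W = w], \\
\tau_0(w)        &= E[Y_1 - Y_0 \mid W = w], \\
\gamma_0(\tau, w) &= E[Y_1 - Y_0 \mid \tau(W) = \tau(w)].
\end{align*}
% Under the usual identification conditions (no unmeasured confounding
% and positivity), \tau_0(w) = \mu_0(1, w) - \mu_0(0, w).
```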

Measuring calibration and the calibration-distortion decomposition

  • Various definitions of risk predictor calibration have been proposed
  • The paper outlines its definition of calibration and the rationale behind it
  • Given only the prediction τ(w), the best predictor of the individual treatment effect is w ↦ γ0(τ, w)
  • Perfect calibration cannot be achieved in finite samples
  • The calibration measure is the ℓ2-expected calibration error
  • The calibration measure plays a role in the mean squared error between the treatment effect predictor and the true CATE
  • The calibration-distortion decomposition splits this mean squared error into a calibration term and a distortion term, so at a fixed distortion, better-calibrated treatment effect predictors have lower mean squared error
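
The decomposition can be checked numerically on a toy example. The sketch below assumes a covariate with finitely many values, takes the calibration term to be the mean squared gap between γ0 and the prediction, and takes the distortion term to be the mean squared gap between the true CATE and γ0. The function name and the toy numbers are hypothetical and are not taken from the paper.

```python
from collections import defaultdict
import math

def calibration_distortion(probs, tau0, tau):
    """Return (mse, cal, dis) for a covariate with finitely many values.

    probs[i] is P(W = w_i), tau0[i] is the true CATE at w_i,
    and tau[i] is the prediction at w_i.
    """
    # gamma0: mean of the true CATE among points sharing the same prediction.
    mass = defaultdict(float)
    weighted_sum = defaultdict(float)
    for p, t0, t in zip(probs, tau0, tau):
        mass[t] += p
        weighted_sum[t] += p * t0
    gamma0 = [weighted_sum[t] / mass[t] for t in tau]

    mse = sum(p * (t0 - t) ** 2 for p, t0, t in zip(probs, tau0, tau))
    cal = sum(p * (g - t) ** 2 for p, g, t in zip(probs, gamma0, tau))
    dis = sum(p * (t0 - g) ** 2 for p, t0, g in zip(probs, tau0, gamma0))
    return mse, cal, dis

# Hypothetical toy example: four covariate values, a predictor with two levels.
probs = [0.1, 0.4, 0.3, 0.2]
tau0 = [0.0, 1.0, 2.0, 4.0]
tau = [0.5, 0.5, 3.5, 3.5]
mse, cal, dis = calibration_distortion(probs, tau0, tau)
print(math.isclose(mse, cal + dis))  # the two terms add up to the MSE
```

The cross term vanishes because the gap between γ0 and the prediction depends only on the prediction, while the true CATE minus γ0 averages to zero within each level of the prediction.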

Calibrating predictors: desiderata and classical methods

  • Calibration methods aim to find a transformation of a given predictor's output such that the transformed predictor is well calibrated
  • Platt’s scaling is used for binary outcomes and is based on strong parametric assumptions
  • Histogram binning partitions the sorted values of the predictor into a fixed number of bins
  • Bayesian binning considers multiple binning models and their combinations
  • Isotonic calibration learns the bins from data using isotonic regression
  • Isotonic calibration satisfies a distribution-free calibration guarantee and is at least as predictive as the original predictor
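
As an illustration of the binning idea, here is a minimal sketch of equal-frequency histogram binning in plain Python. It assumes distinct predictions and at least as many observations as bins; the function names are hypothetical. Isotonic calibration differs in that the bins are not fixed in advance but chosen by a monotone least-squares fit.

```python
from bisect import bisect_left

def fit_histogram_binning(preds, outcomes, n_bins):
    """Fit equal-frequency histogram binning.

    Returns (edges, values): edges[k] is the largest prediction in bin k
    (for all but the last bin) and values[k] is the mean outcome in bin k.
    """
    order = sorted(range(len(preds)), key=lambda i: preds[i])
    n = len(order)
    edges, values = [], []
    for k in range(n_bins):
        idx = order[k * n // n_bins:(k + 1) * n // n_bins]
        values.append(sum(outcomes[i] for i in idx) / len(idx))
        if k < n_bins - 1:
            edges.append(preds[idx[-1]])
    return edges, values

def apply_histogram_binning(edges, values, pred):
    """Map a raw prediction to the mean outcome of the bin it falls in."""
    # bisect_left keeps a prediction equal to a cut point in the lower bin.
    return values[bisect_left(edges, pred)]

# Hypothetical data: six predictions with binary outcomes, three bins.
preds = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6]
outcomes = [0, 0, 1, 0, 1, 1]
edges, values = fit_histogram_binning(preds, outcomes, n_bins=3)
print(apply_histogram_binning(edges, values, 0.35))
```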

Causal isotonic calibration

  • Inspired by isotonic calibration, a doubly-robust calibration method for treatment effects is proposed, called causal isotonic calibration
  • Takes a given predictor trained on some dataset and performs calibration using an independent (or hold-out) dataset
  • Automatically learns uncalibrated regions of the given predictor
  • Consolidates individual predictions within each region into a single value using a doubly-robust estimator of the ATE
  • Introduces a novel data-efficient variant of calibration called cross-calibration
  • Cross-fitted predictors are used and a single calibrated predictor is obtained using all available data
  • Implemented using standard isotonic regression software
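  • An estimate of the pseudo-outcome function χ0 is obtained using the training set Em
  • χ0(O) is referred to as a pseudo-outcome; isotonic regression of the estimated pseudo-outcomes on the predictions τ(W) is used to find the calibrator θn
  • The calibrated predictor is given by θn ∘ τ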
  • Sample splitting or cross-fitting is recommended to obtain pseudo-outcomes
  • Algorithm 2 provides a means to fully utilize the entire dataset for both fitting an initial estimate of τ0 and calibration
  • Algorithm 3 is a computationally simpler variant of Algorithm 2
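
The hold-out version of the procedure can be sketched in plain Python as below. This is an illustration under stated assumptions, not the paper's implementation: the pseudo-outcome uses the standard augmented inverse-probability-weighted (AIPW) form, the nuisance estimates (mu1, mu0, pi) are assumed to have been fit on separate training data, and all function names and numbers are hypothetical.

```python
from bisect import bisect_right

def aipw_pseudo_outcome(y, a, mu1, mu0, pi):
    """Doubly-robust pseudo-outcome for one unit (standard AIPW form, assumed here).

    mu1, mu0: estimated mean outcome under treatment and control.
    pi: estimated propensity score, assumed strictly between 0 and 1.
    """
    mu_a = mu1 if a == 1 else mu0
    return mu1 - mu0 + (a - pi) / (pi * (1.0 - pi)) * (y - mu_a)

def isotonic_fit(x, z):
    """Least-squares nondecreasing fit of z on x (pool-adjacent-violators).

    Returns (knots, levels): the fit is the step function equal to
    levels[k] from knots[k] up to, but not including, the next knot.
    """
    # Aggregate tied x values first so each distinct x starts as one block.
    agg = {}
    for xi, zi in zip(x, z):
        s, c = agg.get(xi, (0.0, 0))
        agg[xi] = (s + zi, c + 1)
    blocks = []  # each block is [left endpoint, sum of z, count]
    for xi in sorted(agg):
        s, c = agg[xi]
        blocks.append([xi, s, c])
        # Pool adjacent blocks while their means are out of order.
        while len(blocks) > 1 and (
            blocks[-2][1] / blocks[-2][2] >= blocks[-1][1] / blocks[-1][2]
        ):
            _, s_last, c_last = blocks.pop()
            blocks[-1][1] += s_last
            blocks[-1][2] += c_last
    return [b[0] for b in blocks], [b[1] / b[2] for b in blocks]

def isotonic_predict(knots, levels, x_new):
    """Evaluate the step function; inputs below the first knot get levels[0]."""
    k = bisect_right(knots, x_new) - 1
    return levels[max(k, 0)]

# Hypothetical calibration set: predictions tau(W_i) and estimated pseudo-outcomes.
tau_cal = [0.1, 0.2, 0.3, 0.4]
chi_cal = [1.0, 0.0, 2.0, 3.0]
knots, levels = isotonic_fit(tau_cal, chi_cal)

def calibrated_tau(raw_prediction):
    """Calibrated predictor: the fitted calibrator applied to a raw prediction."""
    return isotonic_predict(knots, levels, raw_prediction)

print(calibrated_tau(0.25))
```

The first two predictions are out of order relative to their pseudo-outcomes, so the fit pools them into one region with a single averaged value, which is the sense in which uncalibrated regions are learned from the data.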

Large-sample theoretical properties

  • Algorithm 1 and Algorithm 2 are presented for causal isotonic calibration
  • Properties 1 and 2 are argued to be satisfied
  • Data is split into a training dataset and a calibration dataset
  • Conditions 1-5 are assumed
  • Theorem 1 establishes the calibration rate of the calibrated predictor
  • Theorem 2 states that the pointwise median preserves calibration
  • Theorem 3 states that the mean squared error is not inflated much
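
The pointwise median in Theorem 2 is simple to compute. The sketch below combines several predictors into one that returns their median at each input; the three example predictors are hypothetical stand-ins for calibrated predictors, for instance ones obtained from different cross-fitting folds (an assumption about how the result is used).

```python
from statistics import median

def pointwise_median(predictors):
    """Combine predictors into one that returns their median at each point."""
    def combined(w):
        return median(p(w) for p in predictors)
    return combined

# Hypothetical calibrated predictors of a scalar covariate w.
predictors = [
    lambda w: 0.5 * w,
    lambda w: w,
    lambda w: 2.0 * w + 1.0,
]
tau_median = pointwise_median(predictors)
print(tau_median(1.0), tau_median(-2.0))
```

With an even number of predictors, `statistics.median` returns the average of the two middle values.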

Data-generating mechanisms

  • Examined behavior of proposal under two data-generating mechanisms
  • Scenario 1: binary outcome, 4 confounders, treatment interactions
  • Scenario 2: continuous outcome, linear in covariates, 20 true confounders
  • Propensity score follows logistic regression model
  • Covariates independent and uniformly distributed on (-1, +1)
  • Sample sizes of 1,000, 2,000, 5,000 and 10,000
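
A data-generating mechanism of this general shape can be simulated as follows. Only the uniform covariates on (-1, +1) and the logistic form of the propensity score come from the summary above; every coefficient, the CATE function, and the noise distribution are hypothetical placeholders, not the paper's settings.

```python
import math
import random

def simulate(n, n_covariates=20, seed=0):
    """Draw n units with uniform covariates and a logistic propensity score."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        w = [rng.uniform(-1.0, 1.0) for _ in range(n_covariates)]
        # Hypothetical logistic propensity model in the first two covariates.
        pi = 1.0 / (1.0 + math.exp(-(0.5 * w[0] - 0.5 * w[1])))
        a = 1 if rng.random() < pi else 0
        # Hypothetical true CATE and a continuous outcome linear in covariates.
        tau0 = 1.0 + w[0]
        y = sum(w) / n_covariates + a * tau0 + rng.gauss(0.0, 1.0)
        data.append({"w": w, "a": a, "y": y, "pi": pi, "tau0": tau0})
    return data

sample = simulate(1000, seed=1)
print(sum(unit["a"] for unit in sample) / len(sample))  # share of treated units
```

Keeping the true propensity score and CATE in each record is convenient in simulations, since performance metrics can then be computed against the truth.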

CATE estimation

  • Implemented GBRT, RF, GLMnet, GAM, and MARS for Scenario 1
  • Implemented RF, GLMnet, and combination of variable screening with lasso regularization for Scenario 2
  • Used R package sl3 for implementation of estimators
  • Used causal isotonic cross-calibration for calibration

Performance metrics

  • Compared the performance of each predictor after causal isotonic calibration with that of its uncalibrated version
  • Used 3 metrics to compare performance: calibration measure, mean squared error, and calibration bias within bins
  • Estimated metric empirically using an independent sample of size 10,000
  • Averaged metric estimates across 500 simulations
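
In a simulation the true CATE is known, so metrics of this kind can be estimated on an independent test sample. The sketch below is one plausible plug-in version using equal-frequency bins of the predictions; the exact estimators used in the paper are not given in this summary, so the binning scheme and function names are assumptions.

```python
def equal_frequency_bins(preds, n_bins):
    """Split indices into n_bins groups of nearly equal size, ordered by prediction."""
    order = sorted(range(len(preds)), key=lambda i: preds[i])
    n = len(order)
    return [order[k * n // n_bins:(k + 1) * n // n_bins] for k in range(n_bins)]

def mean_squared_error(preds, true_cate):
    """Average squared gap between predictions and the true CATE."""
    return sum((p - t) ** 2 for p, t in zip(preds, true_cate)) / len(preds)

def calibration_bias_by_bin(preds, true_cate, n_bins):
    """Mean prediction minus mean true CATE within each bin."""
    biases = []
    for idx in equal_frequency_bins(preds, n_bins):
        mean_pred = sum(preds[i] for i in idx) / len(idx)
        mean_true = sum(true_cate[i] for i in idx) / len(idx)
        biases.append(mean_pred - mean_true)
    return biases

def binned_calibration_error(preds, true_cate, n_bins):
    """Bin-size-weighted average of squared within-bin calibration bias."""
    bins = equal_frequency_bins(preds, n_bins)
    biases = calibration_bias_by_bin(preds, true_cate, n_bins)
    return sum(len(idx) / len(preds) * b ** 2 for idx, b in zip(bins, biases))

# Hypothetical test sample of four units.
preds = [0.0, 0.0, 1.0, 1.0]
true_cate = [0.5, 0.5, 1.0, 1.0]
print(calibration_bias_by_bin(preds, true_cate, n_bins=2))
```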

Simulation results

  • In Scenario 1, GLMnet, RF, GAM, and MARS were already well calibrated and did not benefit from calibration
  • In Scenario 1, GBRT benefited from calibration, which reduced its calibration error and improved its mean squared error
  • In Scenario 2, RF and GBRT with GLMnet screening were poorly calibrated and benefited from calibration
  • Cross-calibration improved mean squared error and calibration more than conventional calibration

Conclusion

  • Proposed causal isotonic calibration as a novel method to calibrate treatment effect predictors
  • Established that the pointwise median of calibrated predictors is also calibrated
  • Developed a data-efficient variant of causal isotonic calibration using cross-fitted predictors
  • Calibration error vanishes at a fast rate of n^(-2/3) with little or no loss in predictive power
  • Directly calibrate HTE predictors without requiring trial data or parametric assumptions
  • Potential applications of method include data-driven decision-making with strong robustness guarantees
  • Limitations: need to estimate µ0 or π0 sufficiently well
  • Found that calibration generally preserves predictive power, and in some cases improves accuracy
  • Found that cross-calibration substantially improved mean squared error
  • Theoretical arguments can be adapted to provide guarantees for isotonic calibration in regression and classification problems
  • An implementation of the algorithms in R is provided in the GitHub package causalCalibration