Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

  • Introduce CIRCE, a measure of conditional independence for multivariate continuous-valued variables.
  • Used as a regularizer to learn neural features of data that are conditionally independent of a distractor given a target.
  • Requires a single ridge regression from the target to kernelized features of the distractor.
  • Comes with estimation properties and consistency guarantees.
  • Established that CIRCE is zero if and only if features are independent of distractor given target.
  • Experiments show superior performance to previous methods.

Paper Content

Introduction

  • We consider a learning setting where we would like to predict labels Y from features X and be ‘invariant’ to metadata Z.
  • Aim is to learn a representation function ϕ for the features such that ϕ(X) is independent of Z given Y.
  • Three motivating settings: fairness, domain invariant learning, and causal representation learning.
  • Challenge is to learn a representation ϕ that satisfies the target condition when X is high-dimensional and Y and Z are continuous or moderately high-dimensional.
  • Main contribution is a technique that reduces the problem of learning a conditionally independent representation to the problem of learning a marginally independent representation.
  • Construct a statistic ζ(Y, Z) such that enforcing the marginal independence of ϕ(X) and ζ(Y, Z) enforces the conditional independence ϕ(X) ⊥⊥ Z | Y.
  • CIRCE is suitable for any setting where the conditional independence relation ϕ(X) ⊥⊥ Z | Y should be enforced.
  • Demonstrate CIRCE in two practical settings: counterfactual invariance and feature extraction from image data.

Efficient conditional independence regularizer

  • Characterization of conditional independence introduced
  • CIRCE introduced as a conditional independence criterion
  • Finite sample estimate with convergence guarantees provided
  • Strategies for efficient estimation from data provided

Conditional independence

  • Defines Y-conditional independence of X and Z (X ⊥⊥ Z | Y) via test functions g and h
  • Gives an equivalent formulation in which the test function g does not depend on Y
  • This reduction to a g that does not depend on Y is crucial for the method
  • Evaluating the conditional expectations directly requires impractically many samples
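
For reference, the two formulations summarized above can be written out as follows. This is a sketch based on the Daudin (1980) results cited in the appendix; the function classes L²_{XY}, L²_{ZY} and L²_X (square-integrable functions of the indicated variables) are assumptions of this sketch, not notation taken from the summary.

```latex
% Definition: X and Z are Y-conditionally independent if, almost surely,
\[
\mathbb{E}\left[ g(X, Y)\, h(Z, Y) \mid Y \right]
  = \mathbb{E}\left[ g(X, Y) \mid Y \right] \, \mathbb{E}\left[ h(Z, Y) \mid Y \right]
  \quad \text{for all } g \in L^2_{XY},\ h \in L^2_{ZY}.
\]

% Equivalent formulation in which g depends on X only
\[
\mathbb{E}\left[ g(X) \left( h(Z, Y) - \mathbb{E}\left[ h(Z, Y) \mid Y \right] \right) \right] = 0
  \quad \text{for all } g \in L^2_{X},\ h \in L^2_{ZY}.
\]
```

In the second form, the only conditional expectation left is that of h(Z, Y) given Y, which does not involve X.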

Conditional Independence Regression Covariance (CIRCE)

  • Conditional independence is impractical to check.
  • Kernel methods are used to transform the condition into an easy-to-estimate measure.
  • A kernel is a symmetric positive-definite function.
  • A kernel can be represented as an inner product.
  • A Hilbert-Schmidt operator is a linear operator with a finite Hilbert-Schmidt norm.
  • The CIRCE operator reproduces the condition of conditional independence.
  • The Hilbert-Schmidt norm of the CIRCE operator characterizes conditional independence.
  • CIRCE is defined as the Hilbert-Schmidt norm of the CIRCE operator.
  • A differentiable estimator of CIRCE is constructed from samples.
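
Using the notation that appears in the appendix (feature map ψ(Z, Y) and conditional mean E_Z[ψ(Z, Y) | Y]), the operator and criterion described above can be sketched as follows. The operator symbol, the use of φ for the feature map of the kernel on X, and the squaring of the norm are assumptions of this sketch.

```latex
% CIRCE operator: covariance between features of X and the regression residual of psi(Z, Y) given Y
\[
C^{c}_{XZ \mid Y}
  = \mathbb{E}\left[ \phi(X) \otimes \left( \psi(Z, Y) - \mathbb{E}_{Z}\left[ \psi(Z, Y) \mid Y \right] \right) \right],
\qquad
\mathrm{CIRCE} = \left\| C^{c}_{XZ \mid Y} \right\|_{\mathrm{HS}}^{2}.
\]

% By the reproducing property, for g and h in the respective RKHSs
\[
\left\langle g,\; C^{c}_{XZ \mid Y}\, h \right\rangle
  = \mathbb{E}\left[ g(X) \left( h(Z, Y) - \mathbb{E}\left[ h(Z, Y) \mid Y \right] \right) \right].
\]
```

The second line is how the operator reproduces the conditional independence condition: the norm vanishes exactly when this expectation is zero for all g and h in the two RKHSs.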

Empirical CIRCE estimate and its use as a conditional independence regularizer

  • Estimate CIRCE by using two datasets: a holdout set and a main set
  • Estimate conditional expectation with kernel ridge regression
  • Choose ridge parameter and kernel parameters with leave-one-out crossvalidation
  • Estimator of CIRCE is consistent as number of training samples increases
  • Algorithm is summarized in Algorithm 2
  • Use empirical CIRCE as a regularizer for conditionally independent representation learning

Related work

  • Sun et al. (2007) introduced a measure of conditional dependence called the conditional kernel cross-covariance
  • Zhang et al. (2011) proposed a kernel-based conditional independence test (KCI)
  • Quinzan et al. (2022) introduced a variant of the Hilbert-Schmidt Conditional Independence Criterion (HSCIC)
  • Fukumizu et al. (2008) introduced the Hilbert-Schmidt norm of the normalized cross-covariance
  • Huang et al. (2020) proposed using the ratio of the maximum mean discrepancy (MMD)
  • Shah & Peters (2020) proposed the Generalized Covariance Measure (GCM)
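
In contrast to regularizers that need an extra regression on every minibatch, the CIRCE estimate summarized earlier (one kernel ridge regression on a holdout set, then a statistic on each batch) can be sketched in NumPy as follows. This is a minimal sketch, not the authors' implementation: the function names, the Gaussian kernels with unit bandwidth, the `(K + lam * I)` ridge scaling, and the `1 / (B (B - 1))` normalization are all assumptions.

```python
import numpy as np


def gaussian_gram(a, b, bandwidth=1.0):
    """Gaussian kernel Gram matrix between the rows of a (n, d) and b (m, d)."""
    sq = np.sum(a**2, axis=1)[:, None] + np.sum(b**2, axis=1)[None, :] - 2.0 * a @ b.T
    return np.exp(-np.maximum(sq, 0.0) / (2.0 * bandwidth**2))


def circe_estimate(feat_x, z, y, z_hold, y_hold, lam=0.1):
    """Sketch of an empirical CIRCE value on one batch.

    feat_x, z, y   : batch features phi(X), distractor Z, target Y, each (B, d_*)
    z_hold, y_hold : holdout samples used only for the ridge regression, (M, d_*)
    """
    M, B = len(y_hold), len(y)

    # Kernel ridge regression from Y to kernel features of Z (holdout set).
    # Column j of W holds the weights of mu_hat(y_j) on the holdout features psi(z_m).
    # In training, the factorization of (K_hh + lam I) could be cached once.
    K_hh = gaussian_gram(y_hold, y_hold)
    W = np.linalg.solve(K_hh + lam * np.eye(M), gaussian_gram(y_hold, y))  # (M, B)

    # Gram matrix of the residuals psi(z_i) - mu_hat(y_i) on the batch.
    Kz_bb = gaussian_gram(z, z)
    Kz_bh = gaussian_gram(z, z_hold)
    Kz_hh = gaussian_gram(z_hold, z_hold)
    cross = Kz_bh @ W  # cross[i, j] = <psi(z_i), mu_hat(y_j)>
    Kz_centered = Kz_bb - cross - cross.T + W.T @ Kz_hh @ W

    # psi(Z, Y) = psi(Z) (x) psi(Y), so the residual kernel is an elementwise product.
    K_c = gaussian_gram(y, y) * Kz_centered

    # Kernel on the learned features phi(X).
    K_x = gaussian_gram(feat_x, feat_x)

    return np.sum(K_x * K_c) / (B * (B - 1))
```

As a regularizer, this value would be added to the task loss with a weight, for example `loss = mse + gamma * circe_estimate(...)`, where `gamma` is a hypothetical regularization strength; a differentiable version would use an autodiff framework in place of NumPy.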

Experiments

  • Conduct experiments on synthetic and image data
  • Compare performance with HSCIC and GCM
  • Measure in-domain MSE loss and counterfactual invariance
  • Vary number of dimensions in multivariate cases
  • Plot Pareto front of MSE loss and VCF
  • CIRCE and HSCIC have similar trade-off profile
  • GCM sacrifices more in-domain performance
  • HSCIC becomes less efficient with increasing dimensions
  • Task is to learn a predictor that is conditionally independent of Z given Y
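
The Pareto front mentioned above can be extracted from a set of (MSE, VCF) pairs, one per regularization strength, with a few lines of NumPy. The function name and the numbers in the usage note are made-up placeholders, not results from the paper, and the sketch assumes that lower is better for both metrics.

```python
import numpy as np


def pareto_front(points):
    """Return indices of non-dominated rows, where lower is better in every column."""
    pts = np.asarray(points, dtype=float)
    keep = []
    for i, p in enumerate(pts):
        # A row dominates p if it is no worse everywhere and strictly better somewhere.
        no_worse = np.all(pts <= p, axis=1)
        strictly_better = np.any(pts < p, axis=1)
        if not np.any(no_worse & strictly_better):
            keep.append(i)
    return keep
```

For instance, with the hypothetical pairs `[[1, 5], [2, 3], [3, 4], [4, 1]]`, the third pair is dominated by the second, so the front consists of the rows at indices 0, 1 and 3.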

dSprites

  • The y-coordinate is chosen as the target and the x-coordinate as the distractor
  • Neural network with 3 convolutional layers and 3 fully-connected layers
  • Linear dependence between x and y coordinates
  • CIRCE performs best, followed by HSCIC, with GCM doing poorly
  • With non-linear dependence between the x and y coordinates, GCM is not suitable

Extended Yale-B

  • Evaluated CIRCE as a regularizer for supervised tasks on the natural image dataset of Extended Yale-B Faces
  • Task is to estimate camera pose Y from image X while being conditionally independent of illumination Z
  • Used ResNet-18 model pre-trained on ImageNet to extract image features
  • Sampled training data according to the non-linear relation Z = 0.5(Y + εY²)
  • CIRCE showed small advantage over HSCIC in OOD performance for best regularizer choice

Discussion

  • CIRCE is a kernel-based measure of conditional independence
  • CIRCE can be used as a regularizer to enforce conditional independence between a network’s predictions and a pre-specified variable
  • CIRCE can be used in many applications, including fairness, domain invariant learning, and causal representation learning
  • CIRCE enforces conditional independence via a marginal independence requirement during representation learning
  • Alternative conditional independence regularizers require an additional regression step on each minibatch, resulting in a higher-variance criterion

Appendix A: Conditional independence definitions

  • Theorem A.1 (Theorem 1 of Daudin 1980) states two conditions are equivalent
  • Corollary A.2 (Equation 3.8 of Daudin 1980) states two conditions are equivalent

Sufficient condition: E[gh

  • Theorem A.1 and Corollary A.3 (Equation 3.9 of Daudin 1980) are equivalent
  • Proof of Theorem 2.5 involves “pulling out” the Y expectation and applying conditional independence
  • The RKHS of an L²-universal kernel is dense in L²
  • Proof of Theorem 2.5 involves applying Cauchy-Schwarz

Appendix C: Proofs for estimators

C.1 Estimating the conditional mean embedding

  • The estimate of the term E_Z[ψ(Z, Y) | Y] is a function of Y
  • Established results on conditional feature mean estimation exist
  • To learn E[ψ(Q) | Y] for some feature map ψ(q) ∈ H_Q and random variable Q, minimize a ridge-regularized least-squares loss
  • Here Q := (Z, Y) and ψ(Z, Y) = ψ(Z) ⊗ ψ(Y)
  • Estimator has O(1/B) bias and O_p(1/√B) deviation from the mean
  • McDiarmid’s inequality is used in the proof
  • For bounded kernels over X, Z, Y and a (β, p)-kernel over Y, the estimator deviates from the true value as O_p(1/M^((β−1)/(2(β+p))))
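
The regression behind the conditional mean embedding, together with the leave-one-out selection of the ridge parameter mentioned earlier, can be sketched in NumPy. This is a minimal sketch, not the authors' code: the function names, the Gaussian kernel, the candidate grid, and the `(K + lam * I)` scaling are assumptions. It relies on the standard closed-form identity for ridge-type linear smoothers, in which the leave-one-out residual equals the in-sample residual divided by 1 − H_ii, with H the hat matrix.

```python
import numpy as np


def gaussian_gram(a, b, bandwidth=1.0):
    """Gaussian kernel Gram matrix between the rows of a (n, d) and b (m, d)."""
    sq = np.sum(a**2, axis=1)[:, None] + np.sum(b**2, axis=1)[None, :] - 2.0 * a @ b.T
    return np.exp(-np.maximum(sq, 0.0) / (2.0 * bandwidth**2))


def loo_error(K_in, L_out, lam):
    """Closed-form leave-one-out error of kernel ridge regression.

    K_in  : (M, M) Gram matrix of the regression inputs (here: Y)
    L_out : (M, M) Gram matrix of the regression targets (here: features of Z)
    """
    M = K_in.shape[0]
    H = K_in @ np.linalg.inv(K_in + lam * np.eye(M))  # hat matrix: fitted = H @ targets
    R = np.eye(M) - H
    # Squared norm of the i-th in-sample residual sum_j R[i, j] psi_j, via the output Gram matrix.
    res_sq = np.einsum("ij,jk,ik->i", R, L_out, R)
    return float(np.mean(res_sq / (1.0 - np.diag(H)) ** 2))


def select_ridge(K_in, L_out, lams):
    """Pick the ridge parameter with the smallest leave-one-out error."""
    errs = [loo_error(K_in, L_out, lam) for lam in lams]
    return lams[int(np.argmin(errs))], errs
```

The same score could be used to compare kernel bandwidths by recomputing `K_in` for each candidate; because only Gram matrices enter, the targets can live in an infinite-dimensional feature space.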