Link to paper
The full paper is available here.
You can also find the paper on PapersWithCode here.
Abstract
- Introduce CIRCE, a measure of conditional independence for multivariate continuous-valued variables.
- Used to learn neural features of data that are conditionally independent of a distractor given a target.
- Requires single ridge regression from target to kernelized features of distractor.
- Estimation properties and consistency guarantees.
- Established that CIRCE is zero if and only if features are independent of distractor given target.
- Experiments show superior performance to previous methods.
Paper Content
Introduction
- We consider a learning setting where we would like to predict labels Y from features X and be ‘invariant’ to metadata Z.
- Aim is to learn a representation function ϕ for the features such that ϕ(X) is independent of Z given Y.
- Three motivating settings: fairness, domain invariant learning, and causal representation learning.
- Challenge is to learn a representation ϕ that satisfies the target condition when X is high-dimensional and Y and Z are continuous or moderately high-dimensional.
- Main contribution is a technique that reduces the problem of learning a conditionally independent representation to the problem of learning a marginally independent representation.
- Construct a statistic ζ(Y, Z) such that enforcing the marginal independence ϕ(X) ⊥⊥ ζ(Y, Z) enforces the conditional independence ϕ(X) ⊥⊥ Z | Y.
- CIRCE is suitable for any setting where the conditional independence relation ϕ(X) ⊥⊥ Z | Y should be enforced.
- Demonstrate CIRCE in two practical settings: counterfactual invariance and image data extraction.
Efficient conditional independence regularizer
- Characterization of conditional independence introduced
- CIRCE introduced as a conditional independence criterion
- Finite sample estimate with convergence guarantees provided
- Strategies for efficient estimation from data provided
Conditional independence
- X and Z are Y-conditionally independent
- Equivalent formulation of X and Z being Y-conditionally independent
- Reduction to g not depending on Y is crucial for method
- Evaluating conditional expectations requires impractically many samples
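The equivalent formulation referred to above can be written out as follows. This is a sketch based on the standard characterization attributed to Daudin (1980); the notation (L²_X, L²_ZY, the test functions g, h, h′) is assumed here and may differ from the paper's.

```latex
% Assumed notation: L^2_X and L^2_{ZY} are the square-integrable
% functions of X and of (Z, Y), respectively.
X \perp\!\!\!\perp Z \mid Y
\quad\Longleftrightarrow\quad
\mathbb{E}\big[\, g(X)\, h(Z, Y) \,\big] = 0
\quad \text{for all } g \in L^2_X
\text{ and } h \in L^2_{ZY} \text{ with } \mathbb{E}[\, h(Z, Y) \mid Y \,] = 0 .

% Any admissible h can be written as a centered test function:
h(Z, Y) = h'(Z, Y) - \mathbb{E}[\, h'(Z, Y) \mid Y \,],
\qquad h' \in L^2_{ZY} .
```

In this form g depends on X only, so the only conditional expectation left to estimate is that of h′(Z, Y) given Y, which does not involve X.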
Conditional Independence Regression CovariancE (CIRCE)
- Conditional independence is impractical to check.
- Kernel methods are used to transform the condition into an easy-to-estimate measure.
- A kernel is a symmetric positive-definite function.
- A kernel can be represented as an inner product.
- A Hilbert-Schmidt operator is a linear operator with a finite Hilbert-Schmidt norm.
- The CIRCE operator reproduces the condition of conditional independence.
- The Hilbert-Schmidt norm of the CIRCE operator characterizes conditional independence.
- CIRCE is defined as the Hilbert-Schmidt norm of the CIRCE operator.
- A differentiable estimator of CIRCE is constructed from samples.
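As a rough guide to the objects named in this list, the operator and the criterion can be written as below. This is a sketch under assumed notation: φ denotes an RKHS feature map on X (or on the learned features), ψ a feature map on (Z, Y), and the symbol C^c_{XZ|Y} is a label chosen here for the CIRCE operator.

```latex
% Assumed notation: \varphi(X) \in \mathcal{H}_X, \ \psi(Z, Y) \in \mathcal{H}_{ZY}.
C^{c}_{XZ \mid Y}
  = \mathbb{E}\Big[\, \varphi(X) \otimes
      \big( \psi(Z, Y) - \mathbb{E}_{Z}[\, \psi(Z, Y) \mid Y \,] \big) \Big],
\qquad
\mathrm{CIRCE} = \big\| C^{c}_{XZ \mid Y} \big\|_{\mathrm{HS}}^{2} .

% For sufficiently rich (L^2-universal) kernels:
\mathrm{CIRCE} = 0
\quad\Longleftrightarrow\quad
X \perp\!\!\!\perp Z \mid Y .
```

The centered term ψ(Z, Y) − E_Z[ψ(Z, Y) | Y] plays the role of the test function h with zero conditional mean, and it depends only on (Z, Y), not on X.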
Empirical CIRCE estimate and its use as a conditional independence regularizer
- Estimate CIRCE by using two datasets: a holdout set and a main set
- Estimate conditional expectation with kernel ridge regression
- Choose ridge parameter and kernel parameters with leave-one-out cross-validation
- Estimator of CIRCE is consistent as number of training samples increases
- Algorithm is summarized in Algorithm 2
- Use empirical CIRCE as a regularizer for conditionally independent representation learning
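The two-dataset procedure above can be sketched in NumPy as follows. This is a minimal illustration, not the authors' implementation: the function names (`gaussian_gram`, `circe_estimate`), the Gaussian kernels with a fixed bandwidth, the fixed ridge value, the product kernel on (Z, Y), and the 1/B² normalization are all assumptions made for the example.

```python
import numpy as np


def gaussian_gram(a, b, bandwidth=1.0):
    """Gaussian-kernel Gram matrix between rows of a (n, d) and b (m, d)."""
    sq_dists = (
        np.sum(a**2, axis=1)[:, None]
        + np.sum(b**2, axis=1)[None, :]
        - 2.0 * a @ b.T
    )
    return np.exp(-np.maximum(sq_dists, 0.0) / (2.0 * bandwidth**2))


def circe_estimate(x, z, y, z_hold, y_hold, ridge=0.1, bandwidth=1.0):
    """Plug-in CIRCE-style estimate (hypothetical helper).

    (z_hold, y_hold): holdout set used for the ridge regression from Y
    to kernel features of Z. (x, z, y): main batch; x stands for the
    features whose conditional independence of z given y is measured.
    """
    B, M = x.shape[0], y_hold.shape[0]

    # Kernel ridge regression weights: column i holds the weights that
    # express the estimated E[psi(Z) | Y = y_i] in terms of psi(z_hold).
    K_hh = gaussian_gram(y_hold, y_hold, bandwidth)
    K_hb = gaussian_gram(y_hold, y, bandwidth)
    W = np.linalg.solve(K_hh + ridge * np.eye(M), K_hb)  # shape (M, B)

    # Gram matrix of the centered features psi(z_i) - E_hat[psi(Z) | y_i].
    L_bb = gaussian_gram(z, z, bandwidth)
    L_hb = gaussian_gram(z_hold, z, bandwidth)
    L_hh = gaussian_gram(z_hold, z_hold, bandwidth)
    cross = W.T @ L_hb
    L_centered = L_bb - cross - cross.T + W.T @ L_hh @ W

    # Squared Hilbert-Schmidt norm of the empirical operator.
    K_xx = gaussian_gram(x, x, bandwidth)
    K_yy = gaussian_gram(y, y, bandwidth)
    return np.sum(K_xx * K_yy * L_centered) / B**2
```

When used as a regularizer, only `K_xx` depends on the network's features, so the regression weights `W` can be computed once on the holdout set and reused across minibatches; an autodiff framework would be needed in place of NumPy to backpropagate through the estimate.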
Related work
- Sun et al. (2007) introduced a measure of conditional dependence called the conditional kernel cross-covariance
- Zhang et al. (2011) proposed a kernel-based conditional independence test (KCI)
- Quinzan et al. (2022) introduced a variant of the Hilbert-Schmidt Conditional Independence Criterion (HSCIC)
- Fukumizu et al. (2008) introduced the Hilbert-Schmidt norm of the normalized cross-covariance
- Huang et al. (2020) proposed using the ratio of the maximum mean discrepancy (MMD)
- Shah & Peters (2020) proposed the Generalized Covariance Measure (GCM)
Experiments
- Conduct experiments on synthetic and image data
- Compare performance with HSCIC and GCM
- Measure in-domain MSE loss and counterfactual invariance
- Vary number of dimensions in multivariate cases
- Plot Pareto front of MSE loss and VCF
- CIRCE and HSCIC have similar trade-off profile
- GCM sacrifices more in-domain performance
- HSCIC becomes less efficient with increasing dimensions
- Task is to learn predictor that is conditionally independent of Z
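The Pareto front mentioned above can be extracted from a set of (MSE, VCF) pairs with a short helper. This is a generic sketch, assuming both quantities are to be minimized; the function name `pareto_front` and the input layout are hypothetical and not taken from the paper.

```python
import numpy as np


def pareto_front(points):
    """Return the non-dominated rows of points (n, 2), minimizing both columns."""
    pts = np.asarray(points, dtype=float)
    # Sort by the first objective, breaking ties by the second.
    order = np.lexsort((pts[:, 1], pts[:, 0]))
    front = []
    best_second = np.inf
    for idx in order:
        # A point is kept only if it strictly improves the second objective
        # over every point with a smaller or equal first objective.
        if pts[idx, 1] < best_second:
            front.append(idx)
            best_second = pts[idx, 1]
    return pts[front]
```

Each run (for example, one regularization strength) contributes one point; the returned rows are the trade-offs that no other run improves on in both objectives at once.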
dSprites
- Chosen y-coordinate as target and x-coordinate as distractor
- Neural network with 3 convolutional layers and 3 fully-connected layers
- Linear dependence between x and y coordinates
- CIRCE performs best, followed by HSCIC, with GCM doing poorly
- Non-linear dependence between x and y coordinates, GCM not suitable
Extended Yale-B
- Evaluated CIRCE as a regressor for supervised tasks on natural image dataset of Extended Yale-B Faces
- Task is to estimate camera pose Y from image X while being conditionally independent of illumination Z
- Used ResNet-18 model pre-trained on ImageNet to extract image features
- Sampled training data according to non-linear relation Z = 0.5(Y + εY²)
- CIRCE showed small advantage over HSCIC in OOD performance for best regularizer choice
Discussion
- CIRCE is a kernel-based measure of conditional independence
- CIRCE can be used as a regularizer to enforce conditional independence between a network’s predictions and a pre-specified variable
- CIRCE can be used in many applications, including fairness, domain invariant learning, and causal representation learning
- CIRCE enforces conditional independence via a marginal independence requirement during representation learning
- Alternative conditional independence regularizers require an additional regression step on each minibatch, resulting in a higher variance criterion
Appendices
A Conditional independence definitions
- Theorem A.1 (Theorem 1 of Daudin 1980) states two conditions are equivalent
- Corollary A.2 (Equation 3.8 of Daudin 1980) states two conditions are equivalent
Sufficient condition: E[gh
- Theorem A.1 and Corollary A.3 (Equation 3.9 of Daudin 1980) are equivalent
- Proof of Theorem 2.5 involves “pulling out” the Y expectation and applying conditional independence
- The RKHS of an L²-universal kernel is dense in L²
- Proof of Theorem 2.5 involves applying Cauchy-Schwarz
C Proofs for estimators
C.1 Estimating the conditional mean embedding
- Estimate of term E_Z[ψ(Z, Y) | Y] is a function of Y
- Established results on conditional feature mean estimation exist
- To learn E[ψ(Q) | Y] for some feature map ψ(q) ∈ H_Q and random variable Q, minimize a ridge-regularized regression loss
- Q := (Z, Y) and ψ(Z, Y) = ψ(Z) ⊗ ψ(Y)
- Estimator has O(1/B) bias and O_p(1/√B) deviation from the mean
- McDiarmid’s inequality used
- For bounded kernels over X, Z, Y and a (β, p)-kernel over Y, estimator deviates from true value as O_p(1/M^((β−1)/(2(β+p))))
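The ridge regression behind the conditional mean embedding, together with leave-one-out selection of the ridge parameter, can be sketched as below. This is an illustration under assumptions, not the paper's procedure: it uses the standard closed-form leave-one-out identity for ridge regression, Gaussian kernels, and hypothetical helper names (`gaussian_gram`, `loo_error`, `select_ridge`); only the ridge parameter is selected here, not the kernel parameters.

```python
import numpy as np


def gaussian_gram(a, b, bandwidth=1.0):
    """Gaussian-kernel Gram matrix between rows of a (n, d) and b (m, d)."""
    sq_dists = (
        np.sum(a**2, axis=1)[:, None]
        + np.sum(b**2, axis=1)[None, :]
        - 2.0 * a @ b.T
    )
    return np.exp(-np.maximum(sq_dists, 0.0) / (2.0 * bandwidth**2))


def loo_error(K, L, ridge):
    """Leave-one-out error of kernel ridge regression from Y to psi(Q).

    K: Gram matrix of the inputs Y, shape (M, M).
    L: Gram matrix of the regression targets psi(Q), shape (M, M).
    Uses the identity: LOO residual_i = residual_i / (1 - H_ii),
    where H = K (K + ridge * I)^{-1} is the smoother ("hat") matrix.
    """
    M = K.shape[0]
    H = np.linalg.solve(K + ridge * np.eye(M), K)
    R = np.eye(M) - H
    # Squared RKHS norm of each in-sample residual: [R L R^T]_ii.
    sq_residuals = np.einsum("ij,jk,ik->i", R, L, R)
    return np.mean(sq_residuals / (1.0 - np.diag(H)) ** 2)


def select_ridge(K, L, candidates):
    """Pick the candidate ridge value with the smallest LOO error."""
    errors = [loo_error(K, L, lam) for lam in candidates]
    return candidates[int(np.argmin(errors))]
```

Because the targets enter only through their Gram matrix `L`, the same code covers infinite-dimensional feature maps ψ without ever forming the features explicitly.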