Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

Explanations of AI models must be both human-intelligible and consistent with the model’s internal structure.
Theory of causal abstraction provides the mathematical foundations for these explanations.
Contributions include generalizing causal abstraction to cyclic structures, using multi-source interventions, defining approximate causal abstraction, and formalizing XAI methods.

Paper Content

Introduction

XAI seeks to explain why deep learning models make the predictions they do
Causal analysis is the gold standard for explaining model behavior and internal reasoning
Low-level causal explanations of behavior and internal reasoning can be easily provided, but are not interpretable to humans
High-level explanations are easier to interpret, but difficult to trust
Causal abstraction provides a framework for analyzing a system at multiple levels of detail simultaneously
Causal abstraction has been applied to deep learning AI models, weather patterns, and human brains
This paper develops the theory of causal abstraction as a mathematical framework for XAI
Low-level variables are partitioned into clusters, each associated with a high-level variable
Approximate causal abstraction is explored, connecting interchange intervention analysis with existing definitions

Faithful and interpretable causal explanations of ai

Causal explanations are privileged when explaining how an artifact works
Causal explanations allow for manipulation and control of the system
Appropriate level of abstraction is important for causal explanations
Intervention is a fundamental operation of causal explanations
Causal abstraction supports interpretable explanations of AI
Faithfulness is defined as the degree to which an explanation accurately represents the ’true reasoning process behind a model’s behavior'

Methods for explaining ai behavior

AI model behavior is a function from inputs to outputs
Behavior can be represented by a two-variable causal model
XAI methods learn interpretable models to approximate uninterpretable models
XAI methods are model-agnostic and provide same explanations for models with same behavior
Need to ground notions of faithfulness in causality to compare XAI methods

Methods for explaining the internal structure of ai

AI models have internal reasoning that can be represented as a program or algorithm
Recent research aims to understand the causal mechanisms inside black box models
Causal abstraction provides a mathematical foundation for understanding the high-level semantics of neural representations
Interchange interventions are used to show that neural representations represent propositional content
Iterative nullspace projection is used to evaluate whether neural representations encode concepts with ‘mental’ causes and effects
Causal mediation analysis is used to analyze gender bias in pretrained language models
Circuit-based explanations reverse engineer the mechanisms of a network at the level of individual neurons
Probing is used to determine whether a concept is present in a neural representation
Feature attribution methods ascribe scores to neural representations to capture their ‘impact’ on model behavior

Causal models

Notation: V denotes a set of variables, X denotes a variable, x denotes a value, Val(X) denotes the range of possible values for X
No two variables can take on the same value
Capital letters denote variables, lower case letters denote values, bold letters denote sets of variables/values
Domain(f), Uniform(X), ½[ϕ] are useful constructs
Projection: given a partial setting u for a set of variables U, Proj(u, X) is the restriction of u to the variables in X
Definition 4: causal model is a pair (V, F) where V is a set of variables and F is a set of structural functions
Remark 5: no explicit reference to a graphical structure defining a causal ordering on the variables
Remark 6: acyclic model notation
Definition 7: set of solutions is the set of all v ∈ Val(V) such that all equations v = f V (v) are satisfied
Definition 8: intervention is a partial setting i ∈ Val(I) for I ⊆ V, M i is just like M except f X is replaced with constant function v → Proj(i, X) for each X ∈ I

Example of causal models: a symbolic algorithm and neural network

Two causal models are defined to demonstrate potential to model a variety of computational processes
The first model is a tree-structured algorithm
The second model is a fully-connected feed-forward neural network
Both models solve the same task

Hierarchical equality task

Hierarchical equality task is to determine if two pairs of objects have identical relations
Input is two pairs of objects, output is True if both pairs are equal or unequal, False otherwise
Domain of objects consists of triangle, square, and pentagon
Obvious tree-structured symbolic algorithm solves the task
Equality reasoning is ubiquitous and has been studied for broader questions about relational reasoning
Hierarchical equality serves as a case study for explaining how abstract tree-structured composition can be implemented by a fully-connected neural network
Neural network is trained to implement the hierarchical equality task

A tree-structured algorithm for hierarchical equality

Algorithm A consists of four input variables and one output variable
Acyclic causal graph is depicted in Figure 1a
Each f Xi is a constant function
Default total setting is [ , , , , True, True, True]
Counterfactual result is [ , , △, , True, True, True]

A fully connected neural network for hierarchical equality

Neural network N consists of 8 input neurons
Values for each variable are real numbers R
4 sets of variables for first 4 layers
Constant function f R k for 1 ≤ k ≤ 8
Output neurons determined by network weights
Network outputs True/False based on output logit values

Causal abstraction and interchange intervention analysis

Structural conditions must be in place for H to be a high-level abstraction of the low-level model L
N and A must be present from the previous section

Alignments between causal models

Abstraction involves associating high-level variables with clusters of low-level variables
Alignment between low-level and high-level causal models is introduced
Alignment consists of a partition and a family of maps
Alignment induces a unique translation
Translation is a partial function from low-level interventions to high-level interventions
Low-level interventions that correspond to high-level interventions are defined by cell-wise maps

Causal consistency and constructive abstraction

Definition 10: An alignment between two models is consistent if the high-level intervention corresponding to a low-level intervention results in the same high-level total settings.
Definition 11: A constructive abstraction is when the causal consistency condition is satisfied.
Remark 12: Constructive abstraction was introduced in Beckers and Halpern (2019) and Beckers et al. (2019).
Remark 13: Abstract interpretation is a special case of causal abstraction.
Definition 14: A typing is a function that assigns types to variables and an equivalence relation between values of the same type.
Definition 15: Typed causal abstraction is when an alignment is both causally and type consistent.

Interchange intervention analysis

Interchange intervention analysis is a method of operationalizing claims of causal abstraction.
Geiger et al. (2022b) provides a specialized theory of interchange interventions that only covers cases with a single intermediate variable.
This paper expands on this work, presenting a general theory of interchange interventions for high-level causal models with multiple intermediate variables.
Alignment between input and output variables is stipulated by the researcher.
Alignment between intermediate variables must be searched for.
Interchange interventions are limited to those that fix the entirety of a low-level partition cell or fix none of the cell.

Explanation and generalization

Interchange interventions set variables to values based on an input.
Definition 10 requires a commuting diagram to hold for all interventions in the range of the partial function τ.
Explanations should generalize to unseen real-world input.
Generalizing from training to testing data is a central question of machine learning.

Decomposing constructive causal abstraction

Marginalization removes a set of variables from a causal model
Variable merge collapses a partition of variables from a causal model
Value merge collapses a partition of values for each variable from a causal model
Marginalization links the parents and children of each variable
Variable merge and value merge are valid only if the partition cells respect the causal dynamics of the model
Marginalization guarantees perfect insensitivity/stability
Value merge alters the value space of each variable
Variable merge determines the children of their partition
Value merge is viable when collapsed values play the same role in the model
Constructive abstraction is a matter of being able to construct the high-level model from the low-level model with marginalization, variable merge, and value merge

Example of causal abstraction: tree-structure in neural computation

Hierarchical equality task is important for understanding relationship between artificial and biological neural networks and modular symbolic algorithms
Tree-structured algorithm and neural network trained to implement it (Geiger et al., 2022b)
Causal abstraction theory explains implementation relationship between network and algorithm

An alignment between the algorithm and the neural network

Neural network parameters have no obvious relationship to algorithm A
Network N was constructed to be abstracted by algorithm A
Intervention i has output values from real numbers and input values from {r , r , r △ } 4
Intermediate neurons are assigned high-level alignment by stipulation
Constructive abstraction will hold only if alignments to intermediate variables do not violate causal laws of A

The algorithm abstracts the neural network

Inputs are a sequence of four shapes from the set {△, , }
Domain of τ is restricted to 34 input interventions
Neural network was created using interchange intervention training with the alignment Π, τ and the high-level model A
Relation of constructive causal abstraction holds between the high-level model A and the low-level model N
Code provided for interchange intervention training and verifying that the network N is abstracted by the algorithm A
Interchange intervention performed on A and N with the base input ( , , △, ) and a single source input ( , , △, △)
Network and algorithm have the same counterfactual behavior

The algorithm can be constructed from the neural network

Network N can be transformed into algorithm A
Transformation involves marginalization, variable merge, and value merge
Visual depiction of transformation in Figure 5

Approximate abstraction and interchange intervention accuracy

Constructive causal abstraction is an all-or-nothing notion
Early applications of interchange interventions found subsets of the input space on which the causal abstraction holds
Geiger et al. proposed interchange intervention accuracy, which is the proportion of interchange interventions where the neural network and high-level algorithm have the same input-output behavior
Geiger et al. proposed a new notion of approximate abstraction, α-on-average constructive abstraction, which is tightly connected to interchange intervention accuracy
Theorem 31 states that if interchange intervention accuracy is α, then the high-level algorithm is a α-on-average constructive abstraction of the neural network

Xai methods grounded in causal abstraction

Causal abstraction can be used as a general theoretical foundation for XAI.
Many popular XAI methods can be viewed as special cases of causal abstraction analysis.
Causal abstraction can capture a variety of popular XAI methods with high-level models containing no more than three variables.

Lime: behavioral fidelity as approximate abstraction by a two-variable chain

LIME is a popular XAI method
LIME learns an interpretable model A that approximates the behavior of an uninterpretable model N
The fidelity of the explainer model A is a measure of how the input-output behavior of A differs from that of N
Iterative nullspace projection attempts to determine whether a concept is used by a model
Deep learning models are highly non-linear and can make decisions using information that is not linearly accessible
Elazar et al. (2022) and Lovering and Pavlick (2022) present methods to mitigate these concerns

Causal effect estimation as abstraction by a two-variable chain

CEBaB benchmark evaluates explainer models on their ability to estimate the causal effect of changing the quality of food, service, ambiance, and noise in a real-world dining experience on the prediction of a sentiment classifier.
CEBaB is represented by a single causal model with real-valued vectors for input data, prediction output, and neural representations.
Interested in the causal effect of food quality on model output, the model is marginalized to two endogenous variables.

Causal mediation as abstraction by a three-variable chain

Changing the value of a variable X affects a second variable Y
Causal mediation analysis determines how this effect is mediated by a third variable Z
Total, direct, and indirect effects can be defined with interchange interventions
This method has been applied to the analysis of neural networks
Goal is to identify sets of neurons that completely mediate the causal effect
Models in causal effect estimation are probabilistic models
Partial mediation is approximate abstraction by a three variable chain

Iterative nullspace projection as abstraction by a three variable chain

Iterative nullspace projection is a method of removing a concept C from a target hidden representation H of a neural network N.
The performance of N is measured to determine if it makes use of the concept C.
A three-variable causal model is used to model iterative nullspace projection as abstraction.
The high-level causal model is an abstraction of the low-level neural model under alignment.

Operationalizing circuit-based explanations with causal abstraction

Linear combinations of neural activations encode high-level concepts
Circuits defined by model weights encode meaningful algorithms over high-level concepts
Low-level causal model encodes neural representations and circuits
High-level causal model encodes high-level concepts and meaningful algorithms

Interchange interventions from integrated gradients

Integrated gradients is a neural network analysis method that assigns values to neurons based on their impact on model predictions.
Integrated gradients can be used to compute interchange interventions.

Future applications: types, infinite variables, and cycles

Causal abstraction is a general purpose framework
Existing XAI methods are limited to finite and acyclic models
Demonstrating expressive capacity of causal abstraction to support arbitrary symbolic algorithms
Articulating conditions for recursive deep learning model to implement bubble sort algorithm
Defining a causal model S with infinite variables and values
Structural equations of S defined for any natural numbers
Abstraction of S can be verified through behavioral evaluation
Abstraction with a model that encodes variables with types of integer and Boolean

Coda: abstraction for probabilistic models

Focus on deterministic neural models
Probabilistic models are quadruples (V, U, F, P)
Solutions to probabilistic models are probability distributions on total settings
Existing treatments of causal abstraction for probabilistic models require agreement between low-level and high-level models with respect to interventional distributions
Counterfactual distributions include important quantities such as probability of necessity
Preservation of interventional probabilities is not enough to preserve causal explanations
Example of two models with different explanations for outcomes
Alternative characterizations of abstraction relation explored in future work

Conclusion

Causal abstraction is a theoretical framework for XAI
Constructive causal abstraction can be decomposed into operations of marginalizing variables, merging variables, and merging values
Interchange interventions provide high-level interpretations for neural representations
Popular XAI methods can be seen as special cases of causal abstraction analysis
Causal abstraction lays groundwork for future development of XAI methods
Constructive probabilistic abstraction involves sets of interventions
Counterfactual quantities reduce to interventional quantities in the deterministic setting
Any sequence of variable merges, value merges, and marginalizations will produce a model that is a constructive abstraction of the original
Constructive abstraction is transitive
Aligned interchange intervention can be used to compare low-level and high-level models
Causal structure of ∆(Π(ǫB∪Y(S))) represents input-output behavior of bubble sort
Definition 42 (Constructive Probabilistic Abstraction) proposed
For all sets of interventions I ⊆ Domain(τ ), the following commutes: π(Solve(M i )) = Solve(Π(M) π(i) )

Link to paper#

Abstract#

Paper Content#

Introduction#

Faithful and interpretable causal explanations of ai#

Methods for explaining ai behavior#

Methods for explaining the internal structure of ai#

Causal models#

Example of causal models: a symbolic algorithm and neural network#

Hierarchical equality task#

A tree-structured algorithm for hierarchical equality#

A fully connected neural network for hierarchical equality#

Causal abstraction and interchange intervention analysis#

Alignments between causal models#

Causal consistency and constructive abstraction#

Interchange intervention analysis#

Explanation and generalization#

Decomposing constructive causal abstraction#

Example of causal abstraction: tree-structure in neural computation#

An alignment between the algorithm and the neural network#

The algorithm abstracts the neural network#

The algorithm can be constructed from the neural network#

Approximate abstraction and interchange intervention accuracy#

Xai methods grounded in causal abstraction#

Lime: behavioral fidelity as approximate abstraction by a two-variable chain#

Causal effect estimation as abstraction by a two-variable chain#

Causal mediation as abstraction by a three-variable chain#

Iterative nullspace projection as abstraction by a three variable chain#

Operationalizing circuit-based explanations with causal abstraction#

Interchange interventions from integrated gradients#

Future applications: types, infinite variables, and cycles#

Coda: abstraction for probabilistic models#

Conclusion#

Link to paper

Abstract

Paper Content

Introduction

Faithful and interpretable causal explanations of ai

Methods for explaining ai behavior

Methods for explaining the internal structure of ai

Causal models

Example of causal models: a symbolic algorithm and neural network

Hierarchical equality task

A tree-structured algorithm for hierarchical equality

A fully connected neural network for hierarchical equality

Causal abstraction and interchange intervention analysis

Alignments between causal models

Causal consistency and constructive abstraction

Interchange intervention analysis

Explanation and generalization

Decomposing constructive causal abstraction

Example of causal abstraction: tree-structure in neural computation

An alignment between the algorithm and the neural network

The algorithm abstracts the neural network

The algorithm can be constructed from the neural network

Approximate abstraction and interchange intervention accuracy

Xai methods grounded in causal abstraction

Lime: behavioral fidelity as approximate abstraction by a two-variable chain

Causal effect estimation as abstraction by a two-variable chain

Causal mediation as abstraction by a three-variable chain

Iterative nullspace projection as abstraction by a three variable chain

Operationalizing circuit-based explanations with causal abstraction

Interchange interventions from integrated gradients

Future applications: types, infinite variables, and cycles

Coda: abstraction for probabilistic models

Conclusion