

  • Explanations of AI models must be both human-intelligible and consistent with the model’s internal structure.
  • Theory of causal abstraction provides the mathematical foundations for these explanations.
  • Contributions include generalizing causal abstraction to cyclic structures, using multi-source interventions, defining approximate causal abstraction, and formalizing XAI methods.

Paper Content


  • XAI seeks to explain why deep learning models make the predictions they do
  • Causal analysis is the gold standard for explaining model behavior and internal reasoning
  • Low-level causal explanations of behavior and internal reasoning can be easily provided, but are not interpretable to humans
  • High-level explanations are easier to interpret, but difficult to trust
  • Causal abstraction provides a framework for analyzing a system at multiple levels of detail simultaneously
  • Causal abstraction has been applied to deep learning AI models, weather patterns, and human brains
  • This paper develops the theory of causal abstraction as a mathematical framework for XAI
  • Low-level variables are partitioned into clusters, each associated with a high-level variable
  • Approximate causal abstraction is explored, connecting interchange intervention analysis with existing definitions

Faithful and interpretable causal explanations of AI

  • Causal explanations are privileged when explaining how an artifact works
  • Causal explanations allow for manipulation and control of the system
  • Appropriate level of abstraction is important for causal explanations
  • Intervention is a fundamental operation of causal explanations
  • Causal abstraction supports interpretable explanations of AI
  • Faithfulness is defined as the degree to which an explanation accurately represents the ‘true reasoning process behind a model’s behavior’

Methods for explaining AI behavior

  • AI model behavior is a function from inputs to outputs
  • Behavior can be represented by a two-variable causal model
  • XAI methods learn interpretable models to approximate uninterpretable models
  • XAI methods are model-agnostic and provide the same explanations for models with the same behavior
  • Need to ground notions of faithfulness in causality to compare XAI methods

Methods for explaining the internal structure of AI

  • AI models have internal reasoning that can be represented as a program or algorithm
  • Recent research aims to understand the causal mechanisms inside black box models
  • Causal abstraction provides a mathematical foundation for understanding the high-level semantics of neural representations
  • Interchange interventions are used to show that neural representations represent propositional content
  • Iterative nullspace projection is used to evaluate whether neural representations encode concepts with ‘mental’ causes and effects
  • Causal mediation analysis is used to analyze gender bias in pretrained language models
  • Circuit-based explanations reverse engineer the mechanisms of a network at the level of individual neurons
  • Probing is used to determine whether a concept is present in a neural representation
  • Feature attribution methods ascribe scores to neural representations to capture their ‘impact’ on model behavior

Causal models

  • Notation: V denotes a set of variables, X denotes a variable, x denotes a value, Val(X) denotes the range of possible values for X
  • No two variables can take on the same value
  • Capital letters denote variables, lower case letters denote values, bold letters denote sets of variables/values
  • Domain(f), Uniform(X), and the indicator 1[ϕ] are useful constructs
  • Projection: given a partial setting u for a set of variables U, Proj(u, X) is the restriction of u to the variables in X
  • Definition 4: causal model is a pair (V, F) where V is a set of variables and F is a set of structural functions
  • Remark 5: no explicit reference to a graphical structure defining a causal ordering on the variables
  • Remark 6: acyclic model notation
  • Definition 7: the set of solutions is the set of all v ∈ Val(V) such that Proj(v, X) = f_X(v) for every X ∈ V
  • Definition 8: an intervention is a partial setting i ∈ Val(I) for I ⊆ V; M_i is just like M except that f_X is replaced with the constant function v ↦ Proj(i, X) for each X ∈ I
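
The definitions above can be sketched in a few lines of Python. This is a minimal illustration, not the paper's implementation: it assumes an acyclic model whose variables are listed in a causal order, and the class name `CausalModel`, its methods, and the example variables X, Y, Z are all hypothetical.

```python
from typing import Any, Callable, Dict, List

Setting = Dict[str, Any]


class CausalModel:
    """Hypothetical acyclic causal model: variables plus structural functions."""

    def __init__(self, functions: Dict[str, Callable[[Setting], Any]], order: List[str]):
        self.functions = dict(functions)
        self.order = list(order)  # assumed to be a valid causal (topological) order

    def solve(self) -> Setting:
        """Return the unique solution of an acyclic model."""
        v: Setting = {}
        for name in self.order:
            v[name] = self.functions[name](v)
        return v

    def intervene(self, i: Setting) -> "CausalModel":
        """Return a copy in which each intervened variable has a constant function."""
        new_functions = dict(self.functions)
        for name, value in i.items():
            new_functions[name] = lambda _v, value=value: value
        return CausalModel(new_functions, self.order)


model = CausalModel(
    {
        "X": lambda v: 2,
        "Y": lambda v: v["X"] + 1,
        "Z": lambda v: v["X"] * v["Y"],
    },
    order=["X", "Y", "Z"],
)
print(model.solve())                       # {'X': 2, 'Y': 3, 'Z': 6}
print(model.intervene({"Y": 10}).solve())  # {'X': 2, 'Y': 10, 'Z': 20}
```

The intervention leaves the original model untouched and only swaps the structural function of Y, which mirrors the wording of Definition 8.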

Example of causal models: a symbolic algorithm and neural network

  • Two causal models are defined to demonstrate potential to model a variety of computational processes
  • The first model is a tree-structured algorithm
  • The second model is a fully-connected feed-forward neural network
  • Both models solve the same task

Hierarchical equality task

  • Hierarchical equality task is to determine if two pairs of objects have identical relations
  • Input is two pairs of objects, output is True if both pairs are equal or unequal, False otherwise
  • Domain of objects consists of triangle, square, and pentagon
  • Obvious tree-structured symbolic algorithm solves the task
  • Equality reasoning is ubiquitous and has been studied for broader questions about relational reasoning
  • Hierarchical equality serves as a case study for explaining how abstract tree-structured composition can be implemented by a fully-connected neural network
  • Neural network is trained to implement the hierarchical equality task

A tree-structured algorithm for hierarchical equality

  • Algorithm A consists of four input variables, two intermediate variables, and one output variable
  • Acyclic causal graph is depicted in Figure 1a
  • Each f_{X_i} is a constant function
  • Default total setting is [ , , , , True, True, True]
  • Counterfactual result is [ , , △, , True, True, True]
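
As a rough sketch of the tree-structured algorithm, the function below computes two equality relations and then compares them. The variable names S1, S2, and O, the string encoding of shapes, and the `intervention` argument are assumptions made for illustration.

```python
def run_algorithm(x1, x2, x3, x4, intervention=None):
    """Hierarchical equality as a tree: two equality checks, then a comparison.

    `intervention` (hypothetical) maps a variable name to a value that
    overrides that variable's structural function.
    """
    intervention = intervention or {}
    s1 = intervention.get("S1", x1 == x2)  # are the first two objects equal?
    s2 = intervention.get("S2", x3 == x4)  # are the last two objects equal?
    o = intervention.get("O", s1 == s2)    # do the two pairs stand in the same relation?
    return {"X1": x1, "X2": x2, "X3": x3, "X4": x4, "S1": s1, "S2": s2, "O": o}


# Both pairs equal, so the output is True.
print(run_algorithm("square", "square", "triangle", "triangle")["O"])
# Both pairs unequal, so the output is also True.
print(run_algorithm("square", "triangle", "triangle", "pentagon")["O"])
# Counterfactual: force the second relation to False while keeping the inputs.
print(run_algorithm("square", "square", "triangle", "triangle", {"S2": False})["O"])
```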

A fully connected neural network for hierarchical equality

  • Neural network N consists of 8 input neurons
  • Values for each variable are real numbers R
  • 4 sets of variables for first 4 layers
  • Constant function f_{R_k} for 1 ≤ k ≤ 8
  • Output neurons determined by network weights
  • Network outputs True/False based on output logit values

Causal abstraction and interchange intervention analysis

  • Structural conditions must be in place for H to be a high-level abstraction of the low-level model L
  • The neural network N and the algorithm A from the previous section serve as the running example

Alignments between causal models

  • Abstraction involves associating high-level variables with clusters of low-level variables
  • Alignment between low-level and high-level causal models is introduced
  • Alignment consists of a partition and a family of maps
  • Alignment induces a unique translation
  • Translation is a partial function from low-level interventions to high-level interventions
  • Low-level interventions that correspond to high-level interventions are defined by cell-wise maps

Causal consistency and constructive abstraction

  • Definition 10: An alignment between two models is consistent if the high-level intervention corresponding to a low-level intervention results in the same high-level total settings.
  • Definition 11: A constructive abstraction is when the causal consistency condition is satisfied.
  • Remark 12: Constructive abstraction was introduced in Beckers and Halpern (2019) and Beckers et al. (2019).
  • Remark 13: Abstract interpretation is a special case of causal abstraction.
  • Definition 14: A typing is a function that assigns types to variables and an equivalence relation between values of the same type.
  • Definition 15: Typed causal abstraction is when an alignment is both causally and type consistent.
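
In symbols, the consistency condition of Definition 10 can be written roughly as follows. This is a sketch of the notation rather than a quotation: L and H stand for the low-level and high-level models, τ for the translation induced by the alignment, Solve(·) for the set of solutions, and a subscript for an intervened model. The exact form is an assumption modelled on the commuting condition stated for the probabilistic case later in this summary.

```latex
\tau\bigl(\mathrm{Solve}(\mathcal{L}_{i})\bigr)
  = \mathrm{Solve}\bigl(\mathcal{H}_{\tau(i)}\bigr)
\qquad \text{for all } i \in \mathrm{Domain}(\tau)
```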

Interchange intervention analysis

  • Interchange intervention analysis is a method of operationalizing claims of causal abstraction.
  • Geiger et al. (2022b) provides a specialized theory of interchange interventions that only covers cases with a single intermediate variable.
  • This paper expands on this work, presenting a general theory of interchange interventions for high-level causal models with multiple intermediate variables.
  • Alignment between input and output variables is stipulated by the researcher.
  • Alignment between intermediate variables must be searched for.
  • Interchange interventions are limited to those that fix the entirety of a low-level partition cell or fix none of the cell.
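
An interchange intervention can be sketched on the high-level algorithm alone: run the model on a source input, read off an intermediate variable, and fix that variable to the recorded value while running on the base input. The helper name `interchange_intervention`, the variable names S1, S2, O, and the example inputs are hypothetical; the multi-source case simply uses a different source input for each variable.

```python
def algorithm(x, fixed=None):
    """Tree-structured hierarchical equality; `fixed` overrides intermediate variables."""
    fixed = fixed or {}
    s1 = fixed.get("S1", x[0] == x[1])
    s2 = fixed.get("S2", x[2] == x[3])
    return {"S1": s1, "S2": s2, "O": s1 == s2}


def interchange_intervention(model, base, sources):
    """Run `model` on `base` with each variable in `sources` fixed to the value
    it takes when the model is run on that variable's own source input."""
    fixed = {var: model(src)[var] for var, src in sources.items()}
    return model(base, fixed)


base = ("square", "square", "triangle", "pentagon")           # S1=True,  S2=False
source_a = ("pentagon", "triangle", "triangle", "triangle")   # S1=False, S2=True
source_b = ("triangle", "square", "square", "square")         # S1=False, S2=True

# Single source: take S2 from source_a.
single = interchange_intervention(algorithm, base, {"S2": source_a})
# Multi-source: take S1 from source_b and S2 from source_a.
multi = interchange_intervention(algorithm, base, {"S1": source_b, "S2": source_a})

print(algorithm(base)["O"], single["O"], multi["O"])  # False True False
```

Applying the same procedure to the aligned low-level variables of a network and comparing outputs is what turns this into an analysis of abstraction.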

Explanation and generalization

  • Interchange interventions set variables to values based on an input.
  • Definition 10 requires a commuting diagram to hold for all interventions in the domain of the partial function τ.
  • Explanations should generalize to unseen real-world input.
  • Generalizing from training to testing data is a central question of machine learning.

Decomposing constructive causal abstraction

  • Marginalization removes a set of variables from a causal model
  • Variable merge collapses a partition of variables from a causal model
  • Value merge collapses a partition of values for each variable from a causal model
  • Marginalization links the parents and children of each variable
  • Variable merge and value merge are valid only if the partition cells respect the causal dynamics of the model
  • Marginalization guarantees perfect insensitivity/stability
  • Value merge alters the value space of each variable
  • Variable merge determines the children of their partition
  • Value merge is viable when collapsed values play the same role in the model
  • Constructive abstraction is a matter of being able to construct the high-level model from the low-level model with marginalization, variable merge, and value merge
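
To make the first of these operations concrete, here is a small sketch of marginalization on a three-variable chain. It assumes an acyclic model given as a dictionary of structural functions plus a causal order; the helper names `solve`, `intervene`, and `marginalize` are hypothetical, and variable merge and value merge are not shown.

```python
def solve(functions, order):
    """Compute the unique solution of an acyclic model."""
    v = {}
    for name in order:
        v[name] = functions[name](v)
    return v


def intervene(functions, setting):
    """Replace the structural functions of intervened variables with constants."""
    new = dict(functions)
    for name, value in setting.items():
        new[name] = lambda v, value=value: value
    return new


def marginalize(functions, order, removed):
    """Drop `removed`; later variables recompute it internally, which links
    the parents of the removed variable directly to its children."""
    f_removed = functions[removed]
    cut = order.index(removed)
    new_functions = {}
    for pos, name in enumerate(order):
        if name == removed:
            continue
        f = functions[name]
        if pos < cut:
            new_functions[name] = f  # earlier variables cannot depend on `removed`
        else:
            new_functions[name] = lambda v, f=f: f({**v, removed: f_removed(v)})
    return new_functions, [n for n in order if n != removed]


low = {"X": lambda v: 3, "Y": lambda v: v["X"] + 1, "Z": lambda v: 2 * v["Y"]}
low_order = ["X", "Y", "Z"]
high, high_order = marginalize(low, low_order, "Y")

print(solve(high, high_order))                       # {'X': 3, 'Z': 8}
print(solve(intervene(high, {"X": 5}), high_order))  # {'X': 5, 'Z': 12}
```

The marginalized model agrees with the original on every intervention that does not mention Y, which is the sense in which it abstracts the original.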

Example of causal abstraction: tree-structure in neural computation

  • Hierarchical equality task is important for understanding relationship between artificial and biological neural networks and modular symbolic algorithms
  • Tree-structured algorithm and neural network trained to implement it (Geiger et al., 2022b)
  • Causal abstraction theory explains implementation relationship between network and algorithm

An alignment between the algorithm and the neural network

  • Neural network parameters have no obvious relationship to algorithm A
  • Network N was constructed to be abstracted by algorithm A
  • Intervention i has output values from real numbers and input values from {r_□, r_⬠, r_△}^4
  • Intermediate neurons are assigned high-level alignment by stipulation
  • Constructive abstraction will hold only if alignments to intermediate variables do not violate causal laws of A

The algorithm abstracts the neural network

  • Inputs are a sequence of four shapes from the set {△, □, ⬠}
  • Domain of τ is restricted to 3^4 = 81 input interventions
  • Neural network was created using interchange intervention training with the alignment Π, τ and the high-level model A
  • Relation of constructive causal abstraction holds between the high-level model A and the low-level model N
  • Code provided for interchange intervention training and verifying that the network N is abstracted by the algorithm A
  • Interchange intervention performed on A and N with the base input ( , , △, ) and a single source input ( , , △, △)
  • Network and algorithm have the same counterfactual behavior

The algorithm can be constructed from the neural network

  • Network N can be transformed into algorithm A
  • Transformation involves marginalization, variable merge, and value merge
  • Visual depiction of transformation in Figure 5

Approximate abstraction and interchange intervention accuracy

  • Constructive causal abstraction is an all-or-nothing notion
  • Early applications of interchange interventions found subsets of the input space on which the causal abstraction holds
  • Geiger et al. proposed interchange intervention accuracy, which is the proportion of interchange interventions where the neural network and high-level algorithm have the same input-output behavior
  • Geiger et al. proposed a new notion of approximate abstraction, α-on-average constructive abstraction, which is tightly connected to interchange intervention accuracy
  • Theorem 31 states that if interchange intervention accuracy is α, then the high-level algorithm is an α-on-average constructive abstraction of the neural network
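
Interchange intervention accuracy can be illustrated with a toy computation. Everything below is an assumption made for the sketch: the integer shape codes, the two hand-written "low-level" models standing in for a trained network, the alignment of S1 with H1 and S2 with H2, and the restriction to single-source interventions on one variable at a time.

```python
from itertools import product

SHAPES = (0, 1, 2)  # integer codes standing in for the three shapes
INPUTS = list(product(SHAPES, repeat=4))


def high(x, fixed=None):
    """High-level algorithm for hierarchical equality."""
    fixed = fixed or {}
    s1 = fixed.get("S1", x[0] == x[1])
    s2 = fixed.get("S2", x[2] == x[3])
    return {"S1": s1, "S2": s2, "O": s1 == s2}


def low(x, fixed=None):
    """Toy low-level model whose hidden units H1, H2 are 0 exactly when a pair is equal."""
    fixed = fixed or {}
    h1 = fixed.get("H1", abs(x[0] - x[1]))
    h2 = fixed.get("H2", abs(x[2] - x[3]))
    return {"H1": h1, "H2": h2, "O": (h1 == 0) == (h2 == 0)}


def low_shortcut(x, fixed=None):
    """Same input-output behavior, but the output ignores H2 and reads the inputs."""
    fixed = fixed or {}
    h1 = fixed.get("H1", abs(x[0] - x[1]))
    h2 = fixed.get("H2", abs(x[2] - x[3]))
    return {"H1": h1, "H2": h2, "O": (h1 == 0) == (x[2] == x[3])}


ALIGNMENT = {"S1": "H1", "S2": "H2"}  # high-level variable -> low-level variable


def interchange_intervention_accuracy(low_model, high_model, alignment):
    """Fraction of aligned interchange interventions on which the outputs agree."""
    agree = total = 0
    for high_var, low_var in alignment.items():
        for base, source in product(INPUTS, repeat=2):
            high_out = high_model(base, {high_var: high_model(source)[high_var]})["O"]
            low_out = low_model(base, {low_var: low_model(source)[low_var]})["O"]
            agree += high_out == low_out
            total += 1
    return agree / total


print(interchange_intervention_accuracy(low, high, ALIGNMENT))           # 1.0
print(interchange_intervention_accuracy(low_shortcut, high, ALIGNMENT))  # about 0.778
```

Both toy models match the algorithm on every plain input, yet only the first reaches an accuracy of 1.0, which shows how the metric separates behavioral agreement from agreement on internal causal structure.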

XAI methods grounded in causal abstraction

  • Causal abstraction can be used as a general theoretical foundation for XAI.
  • Many popular XAI methods can be viewed as special cases of causal abstraction analysis.
  • Causal abstraction can capture a variety of popular XAI methods with high-level models containing no more than three variables.

LIME: behavioral fidelity as approximate abstraction by a two-variable chain

  • LIME is a popular XAI method
  • LIME learns an interpretable model A that approximates the behavior of an uninterpretable model N
  • The fidelity of the explainer model A is a measure of how the input-output behavior of A differs from that of N
  • Iterative nullspace projection attempts to determine whether a concept is used by a model
  • Deep learning models are highly non-linear and can make decisions using information that is not linearly accessible
  • Elazar et al. (2022) and Lovering and Pavlick (2022) present methods to mitigate these concerns

Causal effect estimation as abstraction by a two-variable chain

  • CEBaB benchmark evaluates explainer models on their ability to estimate the causal effect of changing the quality of food, service, ambiance, and noise in a real-world dining experience on the prediction of a sentiment classifier.
  • CEBaB is represented by a single causal model with real-valued vectors for input data, prediction output, and neural representations.
  • Interested in the causal effect of food quality on model output, the model is marginalized to two endogenous variables.

Causal mediation as abstraction by a three-variable chain

  • Changing the value of a variable X affects a second variable Y
  • Causal mediation analysis determines how this effect is mediated by a third variable Z
  • Total, direct, and indirect effects can be defined with interchange interventions
  • This method has been applied to the analysis of neural networks
  • Goal is to identify sets of neurons that completely mediate the causal effect
  • Models in causal effect estimation are probabilistic models
  • Partial mediation is approximate abstraction by a three-variable chain
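
The total, direct, and indirect effects can be sketched on a toy three-variable model in which X affects Y both directly and through a mediator Z. The linear structural equations and the `z_fixed` argument are assumptions chosen so the numbers are easy to check by hand.

```python
def model(x, z_fixed=None):
    """Toy model: Z = 2X, Y = Z + 3X. `z_fixed` overrides the mediator Z."""
    z = 2 * x if z_fixed is None else z_fixed
    y = z + 3 * x
    return {"X": x, "Z": z, "Y": y}


base_x, source_x = 0, 1

# Total effect: change X and let everything downstream respond.
total = model(source_x)["Y"] - model(base_x)["Y"]

# Indirect effect: keep X at the base value, but set Z to the value it
# takes under the source input (an interchange intervention on Z).
z_from_source = model(source_x)["Z"]
indirect = model(base_x, z_fixed=z_from_source)["Y"] - model(base_x)["Y"]

# Direct effect: change X, but hold Z at the value it takes under the base input.
z_from_base = model(base_x)["Z"]
direct = model(source_x, z_fixed=z_from_base)["Y"] - model(base_x)["Y"]

print(total, direct, indirect)  # 5 3 2
```

In this linear toy model the direct and indirect effects add up to the total effect; complete mediation would correspond to a direct effect of zero.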

Iterative nullspace projection as abstraction by a three-variable chain

  • Iterative nullspace projection is a method of removing a concept C from a target hidden representation H of a neural network N.
  • The performance of N is measured to determine if it makes use of the concept C.
  • A three-variable causal model is used to model iterative nullspace projection as abstraction.
  • The high-level causal model is an abstraction of the low-level neural model under alignment.

Operationalizing circuit-based explanations with causal abstraction

  • Linear combinations of neural activations encode high-level concepts
  • Circuits defined by model weights encode meaningful algorithms over high-level concepts
  • Low-level causal model encodes neural representations and circuits
  • High-level causal model encodes high-level concepts and meaningful algorithms

Interchange interventions from integrated gradients

  • Integrated gradients is a neural network analysis method that assigns values to neurons based on their impact on model predictions.
  • Integrated gradients can be used to compute interchange interventions.

Future applications: types, infinite variables, and cycles

  • Causal abstraction is a general purpose framework
  • Existing XAI methods are limited to finite and acyclic models
  • Demonstrating expressive capacity of causal abstraction to support arbitrary symbolic algorithms
  • Articulating conditions for recursive deep learning model to implement bubble sort algorithm
  • Defining a causal model S with infinite variables and values
  • Structural equations of S defined for any natural numbers
  • Abstraction of S can be verified through behavioral evaluation
  • Abstraction with a model that encodes variables with types of integer and Boolean

Coda: abstraction for probabilistic models

  • Focus on deterministic neural models
  • Probabilistic models are quadruples (V, U, F, P)
  • Solutions to probabilistic models are probability distributions on total settings
  • Existing treatments of causal abstraction for probabilistic models require agreement between low-level and high-level models with respect to interventional distributions
  • Counterfactual distributions include important quantities such as probability of necessity
  • Preservation of interventional probabilities is not enough to preserve causal explanations
  • Example of two models with different explanations for outcomes
  • Alternative characterizations of abstraction relation explored in future work


  • Causal abstraction is a theoretical framework for XAI
  • Constructive causal abstraction can be decomposed into operations of marginalizing variables, merging variables, and merging values
  • Interchange interventions provide high-level interpretations for neural representations
  • Popular XAI methods can be seen as special cases of causal abstraction analysis
  • Causal abstraction lays groundwork for future development of XAI methods
  • Constructive probabilistic abstraction involves sets of interventions
  • Counterfactual quantities reduce to interventional quantities in the deterministic setting
  • Any sequence of variable merges, value merges, and marginalizations will produce a model that is a constructive abstraction of the original
  • Constructive abstraction is transitive
  • Aligned interchange intervention can be used to compare low-level and high-level models
  • Causal structure of Δ(Π(ϵ_{B∪Y}(S))) represents input-output behavior of bubble sort
  • Definition 42 (Constructive Probabilistic Abstraction) proposed
  • For all sets of interventions I ⊆ Domain(τ), the following commutes: π(Solve(M_i)) = Solve(Π(M)_{π(i)})