Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

Counterfactual explanations and adversarial attacks have a related goal: flipping output labels with minimal perturbations.
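However, adversarial attacks cannot be used directly for counterfactual explanations, because their perturbations are perceived as noise rather than as actionable and understandable image modifications.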
The proposed approach hypothesizes that Denoising Diffusion Probabilistic Models are excellent regularizers for avoiding high-frequency and out-of-distribution perturbations when generating adversarial attacks.
The paper’s key idea is to build attacks through a diffusion model to polish them.

Paper Content

Introduction

Research branch of explainable AI has yielded remarkable results
Counterfactual explanations (CE) are a promising pipeline for explainability
CE answers what needs to change to alter the prediction
CE are easy to understand and have been adopted by companies
CE should be valid, sparse, proximal, diverse, and realistic
Adversarial attacks share a common goal with CE
ACE is a novel methodology based on adversarial attacks to generate semantically coherent CE
ACE performs competitively with other methods
ACE produces actionable modifications in real-world scenarios to flip the classifier decision

Related work

Ad-Hoc and Post-Hoc methods are two branches of Explainable AI
Post-Hoc methods can be divided into global and local explanations
Local explanations include saliency maps, concept attribution and model distillation
Counterfactual explanations try to explain how a model took a decision
Adversarial attacks and counterfactual explanations share the same goal
Adversarial attacks can be white-box or black-box
Adversarial attacks can be used to create semantic changes in undefended models

Adversarial counterfactual explanations

ACE produces counterfactual images in two steps
Step 1: Producing pre-explanation images by minimizing a loss function using an adversarial attack
Step 2: Bringing pre-explanations closer to the input images by using a binary mask
Filtering process F robustifies the classifier to generate semantic changes without modifying its weights
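
The idea of attacking through a filter can be sketched on a toy problem. The example below is only an illustration under strong assumptions: the classifier is a logistic regression, the filtering process F is replaced by a hypothetical linear smoothing matrix (not a DDPM), and the names `smoothing_filter` and `attack_through_filter` are invented for this sketch. It runs a signed-gradient attack with an ℓ2 distance penalty, where the classification gradient is propagated through the filter.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def smoothing_filter(n):
    """Hypothetical stand-in for the filtering process F: a circular 3-tap average."""
    eye = np.eye(n)
    return (eye + np.roll(eye, 1, axis=1) + np.roll(eye, -1, axis=1)) / 3.0

def attack_through_filter(x, w, b, S, target, steps=50, step_size=0.05,
                          eps=1.0, lam=0.01):
    """Push classifier(S @ x_adv) towards `target` while staying close to x."""
    x_adv = x.copy()
    for _ in range(steps):
        p = sigmoid(w @ (S @ x_adv) + b)        # prediction on the filtered input
        grad_cls = (p - target) * (S.T @ w)     # cross-entropy gradient through the filter
        grad_dist = 2.0 * lam * (x_adv - x)     # gradient of lam * ||x_adv - x||_2^2
        x_adv = x_adv - step_size * np.sign(grad_cls + grad_dist)
        x_adv = x + np.clip(x_adv - x, -eps, eps)  # projection onto the l_inf ball
    return x_adv

# Toy usage: an input classified as 0 is pushed towards class 1.
n = 8
S = smoothing_filter(n)
w, b = np.ones(n), 0.0
x = -0.5 * np.ones(n)
x_adv = attack_through_filter(x, w, b, S, target=1.0)
print(sigmoid(w @ (S @ x) + b) > 0.5, sigmoid(w @ (S @ x_adv) + b) > 0.5)
```

Because the gradient passes through the filter, perturbations that the filter would remove contribute little to the loss, which is the intuition behind using a diffusion model in that role.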

Pre-explanation generation with DDPMs

DDPMs have two key properties: removing high-frequency information and producing in-distribution images.
DDPMs rely on two Markov chains, one forward and one reverse.
The forward chain adds noise from a state t into t + 1 while the reverse chain removes it from t + 1 to t.
ACE pre-explanation generation involves applying the forward DDPM process and then denoising it recursively.
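
The noise-then-denoise procedure can be sketched with the standard DDPM equations. This is a minimal sketch, not the paper's implementation: the linear beta schedule, the noise level `tau`, and the `eps_model` callable (a placeholder for a trained noise-prediction network) are all assumptions made for illustration.

```python
import numpy as np

def make_schedule(T=1000, beta_start=1e-4, beta_end=0.02):
    """Linear beta schedule (an assumption; other schedules exist)."""
    betas = np.linspace(beta_start, beta_end, T)
    alphas = 1.0 - betas
    alpha_bar = np.cumprod(alphas)
    return betas, alphas, alpha_bar

def forward_noise(x0, tau, alpha_bar, noise):
    """Closed-form forward process: jump from x_0 directly to x_tau."""
    a = alpha_bar[tau - 1]
    return np.sqrt(a) * x0 + np.sqrt(1.0 - a) * noise

def reverse_denoise(x_tau, tau, eps_model, betas, alphas, alpha_bar, rng):
    """Ancestral sampling from step tau down to step 0."""
    x = x_tau
    for t in range(tau, 0, -1):
        i = t - 1
        eps_hat = eps_model(x, t)  # hypothetical trained noise predictor
        mean = (x - betas[i] / np.sqrt(1.0 - alpha_bar[i]) * eps_hat) / np.sqrt(alphas[i])
        # No noise is added on the final step.
        z = rng.standard_normal(x.shape) if t > 1 else np.zeros_like(x)
        x = mean + np.sqrt(betas[i]) * z
    return x

# Toy usage with a dummy predictor that always returns zeros.
rng = np.random.default_rng(0)
betas, alphas, alpha_bar = make_schedule(T=100)
x0 = np.ones((4, 4))
x_tau = forward_noise(x0, 10, alpha_bar, rng.standard_normal(x0.shape))
x_rec = reverse_denoise(x_tau, 10, lambda x, t: np.zeros_like(x),
                        betas, alphas, alpha_bar, rng)
```

Keeping `tau` small means the image is only lightly noised, so the denoised result stays close to the input while high-frequency content is smoothed away.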

Bringing the pre-explanations closer to the input images

DDPM generates a normal distribution
Post-processing phase can help keep irrelevant parts of the image untouched
Binary mask m delineates regions that qualify for modifications
Inpainting methods used to fuse CE inside the mask
Final image is counterfactual explanation
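
A minimal sketch of the masking step follows. It assumes the mask is obtained by thresholding and dilating the per-pixel difference between the pre-explanation and the input, and it uses plain blending as a simple stand-in for the inpainting method. The function names, the threshold and the dilation size are illustrative and not taken from the paper.

```python
import numpy as np

def binary_mask(x, pre, threshold=0.15, dilation=1):
    """Mark pixels whose mean absolute change exceeds `threshold`, then dilate."""
    diff = np.abs(pre - x).mean(axis=-1)  # change magnitude averaged over channels
    mask = diff > threshold
    h, w = mask.shape
    for _ in range(dilation):
        padded = np.pad(mask, 1, mode="constant", constant_values=False)
        grown = np.zeros_like(mask)
        # A pixel is kept if any pixel in its 3x3 neighbourhood is set.
        for dy in (0, 1, 2):
            for dx in (0, 1, 2):
                grown |= padded[dy:dy + h, dx:dx + w]
        mask = grown
    return mask

def blend(x, pre, mask):
    """Keep the pre-explanation inside the mask and the input everywhere else."""
    m = mask[..., None].astype(x.dtype)
    return m * pre + (1.0 - m) * x

# Toy usage: a single changed pixel yields a 3x3 editable region.
x = np.zeros((6, 6, 3))
pre = x.copy()
pre[2, 2] = 1.0
mask = binary_mask(x, pre, threshold=0.5, dilation=1)
explanation = blend(x, pre, mask)
```

Everything outside the mask is copied from the input, so regions that are irrelevant to the decision stay untouched.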

Experimentation

Evaluation protocols and datasets

Evaluated ACE on CelebA, CelebA HQ, and BDD100k
CelebA images are 128x128, CelebA HQ images are 256x256, BDD100k images are 512x256
Classifier is DenseNet121 for all datasets
Evaluated validity of explanations with Flip Rate (FR)
Evaluated diversity with average LPIPS distance between pairs of counterfactuals (σ_L)
Evaluated sparsity/proximity with Mean Number of Attributes Changed (MNAC) and Correlation Difference (CD)
Evaluated realism with FID and sFID
Used DDPM model with DiME weights for CelebA, re-spaced time steps for CelebA HQ, and trained diffusion model on 10,000 image subset of BDD100k
Used ℓ1 or ℓ2 distance for distance function and PGD for attack optimization
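
Two of the metrics above are simple enough to sketch directly. The snippet assumes that classifier predictions and oracle attribute scores are already available as arrays; the function names and the 0.5 attribute threshold are assumptions made for this illustration.

```python
import numpy as np

def flip_rate(pred_cf, target):
    """Fraction of counterfactuals that the classifier assigns to the target label."""
    return float(np.mean(np.asarray(pred_cf) == np.asarray(target)))

def mnac(attrs_in, attrs_cf, threshold=0.5):
    """Mean number of binarized attributes that differ between input and counterfactual."""
    a = np.asarray(attrs_in) > threshold
    b = np.asarray(attrs_cf) > threshold
    return float(np.mean(np.sum(a != b, axis=1)))

# Toy usage with hand-written predictions and attribute scores.
fr = flip_rate([1, 1, 0, 1], [1, 1, 1, 1])
changed = mnac([[0.9, 0.1, 0.2], [0.8, 0.7, 0.1]],
               [[0.2, 0.1, 0.9], [0.8, 0.2, 0.1]])
print(fr, changed)
```

A high flip rate indicates valid explanations, while a low MNAC indicates that few attributes other than the target one were altered.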

Comparison against the state-of-the-art

ACE outperforms previous State-of-the-Art methods on all datasets
DiME’s success rate increased from 41% to 97% on CelebA HQ when parameters were augmented
Fewer steps improved the quality on BDD100k
ACE outmatches DiME on all metrics in CelebA
DiME outperforms ACE on COUT and CD metrics in CelebA HQ
ACE consistently outperforms DiME and STEEX on BDD100k
ℓ1 generates sparser modifications, while ℓ2 tends to generate broader editing
Extensions for FVA and FID metrics validated
ACE tested on a small subset of classes on ImageNet with a ResNet50

Diversity assessment

ACE is able to generate diverse explanations
Diffusion models are designed to generate distributions of images
ACE and DiME are compared using a diversity score
ACE does not go deep into the forward noising chain to avoid changing the original class
Re-spacing is used to maintain accuracy and diversity
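
A diversity score of this kind can be sketched as the average distance over all pairs of counterfactuals produced for the same input. The paper uses LPIPS as the distance; the mean absolute difference below is only a stand-in so that the example runs without a pretrained network, and both function names are hypothetical.

```python
from itertools import combinations

import numpy as np

def diversity_score(counterfactuals, distance):
    """Average pairwise distance between counterfactuals of one input."""
    pairs = list(combinations(range(len(counterfactuals)), 2))
    if not pairs:
        raise ValueError("need at least two counterfactuals")
    return float(np.mean([distance(counterfactuals[i], counterfactuals[j])
                          for i, j in pairs]))

def mean_abs_distance(a, b):
    """Stand-in for a perceptual distance such as LPIPS."""
    return float(np.mean(np.abs(a - b)))

# Toy usage: three constant images with values 0, 1 and 2.
cfs = [np.zeros((2, 2)), np.ones((2, 2)), 2.0 * np.ones((2, 2))]
score = diversity_score(cfs, mean_abs_distance)
```

Passing the distance as an argument makes it straightforward to swap in a perceptual metric when one is available.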

Qualitative results

Qualitative results shown in Figure 2 for all datasets, including ImageNet examples
Sparser or coarser characteristics depending on attribute
Different distance losses impose different types of explanations
ℓ1 loss exposes the most local and concrete explanations
ℓ2 loss generates coarser editing
Generated mask is useful for spotting the location of changes
Supplementary material includes more qualitative results

Actionability

Counterfactual explanations are expected to teach the user how to modify the classifier’s prediction
We studied a batch of counterfactual-input tuples generated with our method
We studied the CelebA HQ classifier for the age and smile attributes
We identified two interesting results: frowning can change the prediction of the classifier and the classifier uses the morphological trait of high cheekbones to classify someone as smiling
We tested these hypotheses in real life and were successful
We compared pre-explanation and refined explanations quantitatively and qualitatively
We explored the effects of using different adversarial attacks
We showed that the S³ metric gives similar results to the FVA
We proposed ACE, an approach to generate counterfactual explanations using adversarial attacks
ACE is capable of changing key features while avoiding modifying unwanted structures
ACE is faster than previous methods
ACE is capable of showing natural features to find sparse and actionable modifications in real life
