Link to paper
The full paper is available here.
You can also find the paper on PapersWithCode here.
Abstract
- Counterfactual explanations and adversarial attacks have a related goal: flipping output labels with minimal perturbations.
- Adversarial attacks cannot be used directly in a counterfactual explanation perspective.
- The proposed approach hypothesizes that Denoising Diffusion Probabilistic Models are excellent regularizers for avoiding high-frequency and out-of-distribution perturbations when generating adversarial attacks.
- The paper’s key idea is to build attacks through a diffusion model to polish them.
Paper Content
Introduction
- Research branch of explainable AI has yielded remarkable results
- Counterfactual explanations (CE) is a promising pipeline for explainability
- CE answers what needs to change to alter the prediction
- CE are easy to understand and have been adopted by companies
- CE should be valid, sparse, proximal, diverse, and realistic
- Adversarial attacks share a common goal with CE
- ACE is a novel methodology based on adversarial attacks to generate semantically coherent CE
- ACE performs competitively with other methods
- ACE produces actionable modifications in real-world scenarios to flip the classifier decision
Related work
- Ad-Hoc and Post-Hoc methods are two branches of Explainable AI
- Post-Hoc methods can be divided into global and local explanations
- Local explanations include saliency maps, concept attribution and model distillation
- Counterfactual explanations try to explain how a model took a decision
- Adversarial attacks and counterfactual explanations share the same goal
- Adversarial attacks can be white-box or black-box
- Adversarial attacks can be used to create semantic changes in undefended models
Adversarial counterfactual explanations
- ACE produces counterfactual images in two steps
- Step 1: Producing pre-explanation images by minimizing a loss function using an adversarial attack
- Step 2: Bringing pre-explanations closer to the input images by using a binary mask
- Filtering process F robustifies the classifier to generate semantic changes without modifying its weights
Pre-explanation generation with ddpms
- DDPMs have two key properties: removing high-frequency information and producing in-distribution images.
- DDPMs rely on two Markov chains, one forward and one reverse.
- The forward chain adds noise from a state t into t + 1 while the reverse chain removes it from t + 1 to t.
- ACE pre-explanation generation involves applying the forward DDPM process and then denoising it recursively.
Bringing the pre-explanations closer to the input images
- DDPM generates a normal distribution
- Post-processing phase can help keep irrelevant parts of the image untouched
- Binary mask m delineates regions that qualify for modifications
- Inpainting methods used to fuse CE inside the mask
- Final image is counterfactual explanation
Experimentation
Evaluation protocols and datasets
- Evaluated ACE on CelebA, CelebA HQ, and BDD100k
- CelebA images are 128x128, CelebA HQ images are 256x256, BDD100k images are 512x256
- Classifier is DenseNet121 for all datasets
- Evaluated validity of explanations with Flip Rate (FR)
- Evaluated diversity with average LPIPS distance between pairs of counterfactuals (σ L )
- Evaluated sparsity/proximity with Mean Number of Attributes Changed (MNAC) and Cosine Distance (CD)
- Evaluated realism with FID and sFID
- Used DDPM model with DiME weights for CelebA, re-spaced time steps for CelebA HQ, and trained diffusion model on 10,000 image subset of BDD100k
- Used ℓ1 or ℓ2 distance for distance function and PGD for attack optimization
Comparison against the state-of-the-art
- ACE outperforms previous State-of-the-Art methods on all datasets
- DiME’s success rate increased from 41% to 97% on CelebA HQ when parameters were augmented
- Fewer steps improved the quality on BDD100k
- ACE outmatches DiME on all metrics in CelebA
- DiME outperforms ACE on COUT and CD metrics in CelebA HQ
- ACE consistently outperforms DiME and STEEX on BDD100k
- ℓ 1 generates sparser modifications, while ℓ 2 tends to generate broader editing
- Extensions for FVA and FID metrics validated
- ACE tested on a small subset of classes on ImageNet with a ResNet50
Diversity assessment
- ACE is able to generate diverse explanations
- Diffusion models are designed to generate distributions of images
- ACE and DiME are compared using a diversity score
- ACE does not go deep into the forward noising chain to avoid changing the original class
- Re-spacing is used to maintain accuracy and diversity
Qualitative results
- Qualitative results shown in Figure 2 for all datasets, including ImageNet examples
- Sparser or coarser characteristics depending on attribute
- Different distance losses impose different types of explanations
- ℓ 1 loss exposes most local and concrete explanations
- ℓ 2 loss generates coarser editing
- Generated mask useful to spot out location of changes
- Supplementary material includes more qualitative results
Actionability
- Counterfactual explanations are expected to teach the user how to modify the classifier’s prediction
- We studied a batch of counterfactual-input tuples generated with our method
- We studied the CelebA HQ classifier for the age and smile attributes
- We identified two interesting results: frowning can change the prediction of the classifier and the classifier uses the morphological trait of high cheekbones to classify someone as smiling
- We tested these hypotheses in real life and were successful
- We compared pre-explanation and refined explanations quantitatively and qualitatively
- We explored the effects of using different adversarial attacks
- We showed that the S 3 metric gives similar results as the FVA
- We proposed ACE, an approach to generate counterfactual explanations using adversarial attacks
- ACE is capable of changing key features while avoiding modifying unwanted structures
- ACE is faster than previous literature
- ACE is capable of showing natural features to find sparse and actionable modifications in real life