Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

XAI methods can explain predictions of DNNs
XAI methods have been applied in climate science
Missing ground truth explanations complicate evaluation and validation of XAI methods
This work introduces XAI evaluation in the context of climate research
Evaluation properties assessed: robustness, faithfulness, randomization, complexity, localization
MLP and CNN trained to predict decade based on temperature maps
Multiple XAI methods applied and performance quantified for each evaluation property
XAI methods Integrated Gradients, Layer-wise relevance propagation, and InputGradients show considerable robustness, faithfulness, and complexity
Explanations using input perturbations do not improve robustness and faithfulness

Paper Content

Introduction

Deep learning is used in climate science for tasks such as nowcasting, monitoring, forecasting, model enhancement, and upsampling of satellite data
Deep neural networks are considered a black box and lack transparency
Explainable artificial intelligence (XAI) can validate DNNs and provide researchers with new insights
XAI can be categorized using three aspects: local/global decision-making, self-explaining models, and model-aware/model-agnostic methods
Output of XAI can differ in terms of meaning
XAI evaluation quantitatively assesses the reliability of an explanation
XAI evaluation properties include robustness, complexity, localization, randomization, and faithfulness
Workflow includes training a model, applying XAI methods, and using XAI evaluation to compare and rank methods
Evaluation metrics are assessed for compatibility with climate data properties
Guideline established for using XAI evaluation to choose an optimal explanation

Data and methods

Data

Data is simulated by the general climate model, CESM1
Data consists of 40 ensemble members
Data is global 2-m air temperature maps from 1920 to 2080
Data is processed by computing annual averages and applying a bilinear interpolation
Data is standardized by removing the multi-year 1920-2080 mean and dividing by the corresponding standard deviation

Networks

MLP and CNN are trained to solve a fuzzy classification problem
MLP takes flattened temperature maps as input
MLP assigns each map to one of 20 different classes
Regression is used to predict the year of the input
MLP and CNN have comparable number of parameters
Datasets include a training and test set, 80% of data is split into training and validation set

Explainable artificial intelligence (xai)

Model-aware explanation methods in climate science are presented
Model-agnostic explanation methods are not considered due to computational time
Gradient/Saliency explains network decision by computing first partial derivative of output with respect to input
InputGradient extends information content towards input image
Integrated Gradients introduces baseline datapoint and computes explanation based on difference to baseline
Layerwise Relevance Propagation computes relevance for each input feature by feeding network’s prediction backwards
SmoothGrad, NoiseGrad, and FusionGrad perturb input features and/or network weights to account for uncertainties

Evaluation techniques

XAI research has developed metrics to assess different properties of explanation methods
Five different evaluation properties have been analyzed, based on a classification task from Labe and Barnes [2021]

Faithfulness

Table 5 refers to a perturbation function called ‘Indices’
‘Indices’ refers to the replacement of the highest value pixels in the explanation
‘Linear’ refers to noisy linear imputation

Randomisation

Calculations for MPT score use ‘bottom up’ approach from output layer to input layer
Pearson correlation used as similarity function for both metrics
Top-k considers 10% most relevant pixels of all pixels in temperature map
Hyperparameters of XAI methods and evaluation metrics reported in Tables 4 and 5 respectively
Maximum and minimum values of temperature maps in dataset denoted as xmax and xmin

Localisation

Quality of an explanation is measured based on agreement with user-defined region of interest
Localization metrics assume that ROI should be mainly responsible for network decision
Top-k-pixel and relevance-rank-accuracy are used to measure localization
Complexity assesses how evidence values are distributed across explanation map

Complexity

Complexity is a measure of conciseness
Explanations should consist of few strong features
Complexity and sparseness are used as metric functions
Low entropy is desirable

Network predictions, explanations and motivating example

Evaluated network performance and discussed application of explanation methods for both network architectures
Fixed hyperparameters and fuzzy classification setup for MLP and CNN during training
MLP and CNN have similar performance compared to primary publication
Classification accuracy of both networks agrees within error bounds
Calculated explanation maps for all temperature maps correctly predicted
Applied XAI methods to explain predictions of both MLP and CNN
Different XAI methods provide different relevances

Assessment of xai metrics

Evaluated XAI evaluation properties for classification task on MLP
Analyzed two representative metrics for each property
Based analysis on three criteria: coherence, score stability, and information value
Provided artificial random explanation baseline for each metric
Robustness metrics pass random baseline test
LRP-α-β has highest robustness scores
FusionGrad and NoiseGrad have lowest robustness scores
AS and LLE scores do not align
FC passes random baseline test, ROAD scores of NoiseGrad and FusionGrad overlap with random baseline
MPT and RL metrics evaluated, random baseline has lowest scores
Complexity and Sparseness metrics evaluated, LRP-α-β has highest complexity score, InputGradients and LRP-z have highest sparseness scores
Localization metrics evaluated, FusionGrad has highest score, all other explanation methods have lower but similar scores

Network-based comparison

MLP and CNN networks compared using one metric per property
Challenges in defining meaningful ROI for localization and defining localization as an explanation property
Table 3 displays results for both networks across all properties
Similarities in ranking across every category, but differences in localization and complexity due to structural differences in learned patterns
Input contribution methods (Integrated Gradients, Input-Gradients, LRP) best in faithfulness, robustness, and complexity
Gradient-based methods (Gradient, SmoothGrad, NoiseGrad, FusionGrad) best in randomization
LRP-α-β and LRP-composite low rankings in faithfulness category
Explanation-enhancing procedures (SmoothGrad, Integrated Gradients, FusionGrad, NoiseGrad) no improvement of explanation performance
Spyder plot (Table 3 and Figure 8) used to determine best-performing XAI method

Choosing a xai method

XAI evaluation can be used to select an appropriate XAI method.
Practitioners should determine which explanation properties are essential for their specific network task.
XAI evaluation scores can be used to rank XAI methods and determine the optimal one for the given task.

Discussion and conclusion

XAI methods aim to improve understanding of complex relationships learned by DNNs
XAI methods can provide novel insights into climate AI research
Increasing number of available XAI methods raises two questions: Which explanation method is trustworthy and which is an appropriate choice for a given task?
XAI evaluation introduced to climate science to address these questions
Evaluate various local, model-aware explanation methods
Evaluate based on five different properties: robustness, faithfulness, randomization complexity, localization
Normalized evaluation scores across properties calculated for different XAI methods
Results indicate that explanation methods considering input contributions perform better
XAI evaluation facilitates more trustworthy interpretation of explained evidence
XAI evaluation offers thorough and novel information about structural properties of explanation methods
XAI evaluation guideline proposed to choose optimal explanation method for specific research task
Robustness score affected by increased data and network noise
Explainations using averages across perturbations do not increase robustness, faithfulness and complexity
Gradient-based methods capture network parameter influence more reliably

Link to paper#

Abstract#

Paper Content#

Introduction#

Data and methods#

Data#

Networks#

Explainable artificial intelligence (xai)#

Evaluation techniques#

Faithfulness#

Randomisation#

Localisation#

Complexity#

Network predictions, explanations and motivating example#

Assessment of xai metrics#

Network-based comparison#

Choosing a xai method#

Discussion and conclusion#

Link to paper

Abstract

Paper Content

Introduction

Data and methods

Data

Networks

Explainable artificial intelligence (xai)

Evaluation techniques

Faithfulness

Randomisation

Localisation

Complexity

Network predictions, explanations and motivating example

Assessment of xai metrics

Network-based comparison

Choosing a xai method

Discussion and conclusion