Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

Identification of cancer genes is a challenging problem in cancer genomics research
Computational methods, including deep neural networks, have been developed to address this issue
These methods fail to exploit gene-gene interactions and provide little explanation for their predictions
Proposed EMGNN approach leverages multiple gene-gene interaction networks and multi-omics data
EMGNN outperforms existing approaches and provides valuable biological insights into its predictions

Paper Content

Introduction

Understanding gene function and disease pathogenicity depends on gene properties and interactions
High-throughput experiments enable profiling of genetic and molecular properties
Computational methods predict gene functions by combining gene properties and network connectivity
Predicting gene pathogenicity in disease-specific contexts is challenging
Cancer sequencing projects generate data for identifying novel cancer genes
EMOGI models multi-omics features of cancer genes in PPI networks to predict novel cancer genes
EMGNN proposed to address challenge of functional properties irrelevant to cancer disease physiology
EMGNN maximizes concordance of functional gene relationships with unknown disease physiology
EMGNN achieves state-of-the-art performance by combining information from 6 PPI networks
EMGNN identifies most important multi-omics features and most influential PPI networks

Datasets

Trained proposed model with 6 PPI Networks
Used mutation, copy number, DNA methylation and gene expression data from 29,446 samples from TCGA
Data from 16 different cancer types

Multilayer graph neural network

Graph neural networks (GNNs) are used to leverage both network structure and node features.
GNNs use a message-passing scheme with two steps: aggregating representations of neighbors and updating own representation.
Popular architectures include Graph Convolution Networks (GCNs) and Graph Attention Networks (GAT).
Multilayer Graph Construction method is used to handle multiple networks.
Model is trained using cross-entropy loss function and ADAM optimizer.
Data is divided into training, testing and validation sets.
Initial GNN has 3 layers with hidden dimension of 64, meta-GNN has 1 layer with hidden dimension of 64.

Model interpretation

Captum is a tool for understanding and interpreting machine learning models
It offers a range of interpretability methods to analyze predictions
Integrated gradient (IG) module assigns importance score to each input feature
IG interprets decisions of neural networks by estimating contribution of each input feature
Baseline input is typically chosen to be neutral or meaningless
Traditional IG method not applicable to graph neural networks
Modified approach proposed to compute IG in graph neural networks
Node feature interpretation and edge feature interpretation analysis used

Novel cancer gene discovery

Trained EMGNN model was used to predict novel cancer genes in 14019 unlabeled genes
Genes were ranked by their predicted cancer gene probability
EMOGI models were applied to each unlabeled gene to predict probability of it being a cancer gene
Results of 6591 unlabeled genes were analyzed

Gene set enrichment analysis

Used gene set enrichment analysis to analyze functional enrichment of important gene features in cancer pathways
Aggregated maximum feature importance of each node using Captum’s feature explanation results
Excluded genes with zero importance from analysis
Ranked neighboring gene nodes based on importance
Used ranked gene list as input for GSEA
Computed enrichment p-value and multiple testing corrected FDR against cancer hallmark gene sets

Results

Overview of emgnn framework

Developed a graph neural network model EMGNN
Input is a feature vector for each gene and multiple graphs
Model updates graph representation within each graph layer
Introduces a meta graph layer to combine layer-wise node representations
Multi-layer perceptron takes meta node representations and performs node classification
EMGNN generalizes single graph GNN by capitalizing on complementary information stored in multiple graphs

Multilayered graph improves emgnn performance

EMGNN was used to predict cancer genes using a dataset of 887 labeled cancer genes, 7753 non-cancer genes and 14019 unlabeled genes.
Six PPI networks were binarized to keep only high-confidence edges.
Performance increased as the number of input networks increased.
EMGNN achieved state-of-the-art performance for all test sets.
EMGNN outperformed EMOGI by a margin of over 5% AURPC.

Evaluating the performance of different gnn architectures and graph ablations.

Performed ablation study to assess performance of EMGNN model using different GNN architectures and input perturbations
Found that GCN is the best-performing GNN architecture in all datasets
Examined performance of EMGNN with GCN architecture under different types of input perturbations
Found that EMGNN decreased in performance for both random and all-one node features
Removal of edges slightly decreased EMGNN performance

Explaining emgnn reveals biological insights of cancer gene pathogenicity

Explainable and trustworthy models are essential for understanding cancer genes and discovering new ones.
EMGNN was used to analyze the relative contributions from each PPI network to cancer gene predictions.
ANOVA test showed significant difference in contributions from different PPI networks.
DNA methylation was found to be significantly more important for known cancer gene prediction than other omics data.

Emgnn identifies novel cancer genes by integrating multilayer graphs

Trained EMGNN model used to predict cancer genes on unlabeled genes
Non-trivial number of unlabeled genes with high probability of predicted novel cancer genes
EMGNN achieved accurate and unified novel cancer gene prediction by integrating multilayer graphs
Predictions of EMOGI models trained on individual PPI networks showed substantial divergence
EMGNN provided more accurate, unified predictions of cancer genes
EMGNN predictions of COL5A1 as cancer gene with high confidence
All individual PPI networks contributed similarly to EMGNN predictions
Gene set enrichment analysis used to illustrate potential biological mechanisms of COL5A1

Discussion

Biomedical and biological domain contains a wealth of information represented and analyzed using graph structures
Gene interaction and protein-protein interaction networks describe functional relationships of genes and proteins
Graph construction and integration methods render distinct predictive powers
Developed a new graph learning framework, EMGNN, to jointly model multilayered graphs
EMGNN outperforms previous graph neural networks trained on single graphs
EMGNN leverages complementary information from different graph layers and omics features to predict cancer genes
EMGNN recovers cancer genes missed by previous state-of-the-art predictors
EMGNN integrates homogeneous, undirected graphs
EMGNN can be extended to various types of graphs and to perform cross-data modality integration
EMGNN reveals molecular aberrations that may be leveraged for screening and re-purposing of drugs

Link to paper#

Abstract#

Paper Content#

Introduction#

Datasets#

Multilayer graph neural network#

Model interpretation#

Novel cancer gene discovery#

Gene set enrichment analysis#

Results#

Overview of emgnn framework#

Multilayered graph improves emgnn performance#

Evaluating the performance of different gnn architectures and graph ablations.#

Explaining emgnn reveals biological insights of cancer gene pathogenicity#

Emgnn identifies novel cancer genes by integrating multilayer graphs#

Discussion#

Link to paper

Abstract

Paper Content

Introduction

Datasets

Multilayer graph neural network

Model interpretation

Novel cancer gene discovery

Gene set enrichment analysis

Results

Overview of emgnn framework

Multilayered graph improves emgnn performance

Evaluating the performance of different gnn architectures and graph ablations.

Explaining emgnn reveals biological insights of cancer gene pathogenicity

Emgnn identifies novel cancer genes by integrating multilayer graphs

Discussion