Abstract
- Identification of cancer genes is a challenging problem in cancer genomics research
- Computational methods, including deep neural networks, have been developed to address this issue
- These methods fail to exploit gene-gene interactions and provide little explanation for their predictions
- Proposed EMGNN approach leverages multiple gene-gene interaction networks and multi-omics data
- EMGNN outperforms existing approaches and provides valuable biological insights into its predictions
Paper Content
Introduction
- Understanding gene function and disease pathogenicity depends on gene properties and interactions
- High-throughput experiments enable profiling of genetic and molecular properties
- Computational methods predict gene functions by combining gene properties and network connectivity
- Predicting gene pathogenicity in disease-specific contexts is challenging
- Cancer sequencing projects generate data for identifying novel cancer genes
- EMOGI models multi-omics features of cancer genes in PPI networks to predict novel cancer genes
- EMGNN is proposed to address the challenge that some functional properties may be irrelevant to cancer disease physiology
- EMGNN maximizes the concordance of functional gene relationships with the unknown disease physiology
- EMGNN achieves state-of-the-art performance by combining information from 6 PPI networks
- EMGNN identifies most important multi-omics features and most influential PPI networks
Datasets
- The proposed model was trained with 6 PPI networks
- Used mutation, copy number, DNA methylation and gene expression data from 29,446 samples from TCGA
- Data from 16 different cancer types
Multilayer graph neural network
- Graph neural networks (GNNs) are used to leverage both network structure and node features.
- GNNs use a message-passing scheme with two steps: aggregating representations of neighbors and updating own representation.
- Popular architectures include Graph Convolution Networks (GCNs) and Graph Attention Networks (GAT).
- Multilayer Graph Construction method is used to handle multiple networks.
- The model is trained with a cross-entropy loss function and the Adam optimizer.
- Data is divided into training, testing and validation sets.
- Initial GNN has 3 layers with hidden dimension of 64, meta-GNN has 1 layer with hidden dimension of 64.
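The two-step message-passing scheme described above can be illustrated with a minimal sketch in plain Python. It assumes mean aggregation over a node and its neighbors, followed by a linear map and ReLU. This is a simplification: it is not the normalization used by GCN, not the attention used by GAT, and not the paper's implementation. The function name `message_passing_layer`, the toy graph, and the weight matrix are hypothetical.

```python
def message_passing_layer(adj, features, weight):
    """One message-passing step: aggregate neighbor representations, then update.

    adj      : dict mapping node -> list of neighbor nodes
    features : dict mapping node -> list of floats (all of length d_in)
    weight   : d_in x d_out matrix as a list of lists (hypothetical parameters)
    """
    out = {}
    d_out = len(weight[0])
    for node, h in features.items():
        d_in = len(h)
        # Step 1 (aggregate): mean over the node itself and its neighbors.
        group = [node] + list(adj.get(node, []))
        agg = [sum(features[n][k] for n in group) / len(group) for k in range(d_in)]
        # Step 2 (update): linear transform followed by ReLU.
        out[node] = [
            max(0.0, sum(agg[k] * weight[k][j] for k in range(d_in)))
            for j in range(d_out)
        ]
    return out


# Toy example: a path graph A - B - C with 2-dimensional node features.
toy_adj = {"A": ["B"], "B": ["A", "C"], "C": ["B"]}
toy_features = {"A": [1.0, 0.0], "B": [0.0, 1.0], "C": [1.0, 1.0]}
identity = [[1.0, 0.0], [0.0, 1.0]]
toy_hidden = message_passing_layer(toy_adj, toy_features, identity)
```

Stacking several such layers lets information travel further along the graph, which is the role of the 3-layer initial GNN mentioned above.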
Model interpretation
- Captum is a tool for understanding and interpreting machine learning models
- It offers a range of interpretability methods to analyze predictions
- The Integrated Gradients (IG) module assigns an importance score to each input feature
- IG interprets decisions of neural networks by estimating contribution of each input feature
- Baseline input is typically chosen to be neutral or meaningless
- The traditional IG method is not applicable to graph neural networks
- A modified approach is proposed to compute IG in graph neural networks
- Both node feature interpretation and edge feature interpretation analyses are used
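To make the idea of IG concrete, here is a minimal sketch of the standard method on a toy logistic model, written in plain Python. It is not the modified graph variant the paper proposes, and it does not use Captum. The model, its weights, the all-zero baseline, and the number of integration steps are all assumptions chosen for illustration. Attributions are computed as (input minus baseline) times the average gradient along the straight path from baseline to input.

```python
import math


def model(x, w, b):
    """Toy differentiable model: logistic regression output in (0, 1)."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1.0 / (1.0 + math.exp(-z))


def model_grad(x, w, b):
    """Analytic gradient of the toy model with respect to the input x."""
    p = model(x, w, b)
    return [p * (1.0 - p) * wi for wi in w]


def integrated_gradients(x, baseline, w, b, steps=200):
    """Approximate IG with a midpoint Riemann sum over the interpolation path."""
    total = [0.0] * len(x)
    for s in range(steps):
        alpha = (s + 0.5) / steps
        point = [bi + alpha * (xi - bi) for xi, bi in zip(x, baseline)]
        grad = model_grad(point, w, b)
        total = [t + g for t, g in zip(total, grad)]
    # Scale the average gradient by the distance from the baseline.
    return [(xi - bi) * t / steps for xi, bi, t in zip(x, baseline, total)]


# Hypothetical input with three features and a neutral all-zero baseline.
w, b = [0.8, -0.4, 1.5], 0.1
x, baseline = [1.0, 2.0, 0.5], [0.0, 0.0, 0.0]
attributions = integrated_gradients(x, baseline, w, b)
```

A useful sanity check is the completeness property: the attributions should sum, up to numerical error, to the difference between the model output at the input and at the baseline.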
Novel cancer gene discovery
- Trained EMGNN model was used to predict novel cancer genes in 14,019 unlabeled genes
- Genes were ranked by their predicted cancer gene probability
- EMOGI models were applied to each unlabeled gene to predict probability of it being a cancer gene
- Results of 6,591 unlabeled genes were analyzed
Gene set enrichment analysis
- Used gene set enrichment analysis to analyze functional enrichment of important gene features in cancer pathways
- Aggregated maximum feature importance of each node using Captum’s feature explanation results
- Excluded genes with zero importance from analysis
- Ranked neighboring gene nodes based on importance
- Used ranked gene list as input for GSEA
- Computed enrichment p-value and multiple testing corrected FDR against cancer hallmark gene sets
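The ranking and enrichment steps above can be sketched in plain Python. This is a simplified illustration, not the paper's pipeline: it assumes importance is aggregated as the maximum absolute attribution per gene, and it uses an unweighted running-sum enrichment score, a simpler statistic than full GSEA tools compute. P-values and FDR correction are deliberately left out. The function names and all scores are hypothetical.

```python
def rank_genes_by_importance(importance):
    """Rank genes by their maximum absolute feature importance.

    importance: dict mapping gene -> list of per-feature attribution scores.
    Genes whose aggregated importance is zero are excluded.
    """
    scores = {g: max(abs(v) for v in vals) for g, vals in importance.items()}
    kept = {g: s for g, s in scores.items() if s > 0}
    return sorted(kept, key=kept.get, reverse=True)


def enrichment_score(ranked_genes, gene_set):
    """Unweighted running-sum statistic over a ranked gene list.

    The sum goes up at each gene that belongs to gene_set and down otherwise;
    the score is the largest deviation from zero (sign preserved).
    """
    hits = [g in gene_set for g in ranked_genes]
    n_hit = sum(hits)
    n_miss = len(ranked_genes) - n_hit
    if n_hit == 0 or n_miss == 0:
        return 0.0
    running, best = 0.0, 0.0
    for is_hit in hits:
        running += 1.0 / n_hit if is_hit else -1.0 / n_miss
        if abs(running) > abs(best):
            best = running
    return best


# Hypothetical attribution scores for four neighboring genes.
toy_importance = {
    "geneA": [0.1, 0.7],
    "geneB": [0.0, 0.0],   # zero importance, dropped from the ranking
    "geneC": [0.3, -0.9],
    "geneD": [0.2, 0.1],
}
toy_ranking = rank_genes_by_importance(toy_importance)
toy_es = enrichment_score(toy_ranking, {"geneC"})
```

A positive score means the members of the gene set concentrate near the top of the ranking; a negative score means they concentrate near the bottom.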
Results
Overview of EMGNN framework
- Developed a graph neural network model EMGNN
- Input is a feature vector for each gene and multiple graphs
- Model updates graph representation within each graph layer
- Introduces a meta graph layer to combine layer-wise node representations
- Multi-layer perceptron takes meta node representations and performs node classification
- EMGNN generalizes single graph GNN by capitalizing on complementary information stored in multiple graphs
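One way to picture the meta graph layer is as an extra node per gene that is connected to that gene's copy in every input network. The sketch below builds such a structure and combines the per-network representations by a simple mean. Both choices are assumptions made for illustration: the summary does not specify how the meta layer is wired or how it aggregates, and the network names, edges, and vectors are hypothetical.

```python
def build_multilayer_graph(networks):
    """Build intra-layer edges and gene-copy-to-meta-node edges.

    networks: dict mapping layer name -> list of (gene_u, gene_v) edges.
    Returns (intra_edges, meta_edges), where nodes are (gene, layer) pairs
    and each meta node is written as (gene, "meta").
    """
    intra_edges, meta_edges, seen = [], [], set()
    for layer, edges in networks.items():
        for u, v in edges:
            intra_edges.append(((u, layer), (v, layer)))
            for gene in (u, v):
                if (gene, layer) not in seen:
                    seen.add((gene, layer))
                    meta_edges.append(((gene, layer), (gene, "meta")))
    return intra_edges, meta_edges


def meta_aggregate(copy_repr, meta_edges):
    """Combine each gene's per-layer representations into one meta vector (mean)."""
    sums, counts = {}, {}
    for copy, (gene, _) in meta_edges:
        vec = copy_repr[copy]
        if gene not in sums:
            sums[gene] = [0.0] * len(vec)
            counts[gene] = 0
        sums[gene] = [s + x for s, x in zip(sums[gene], vec)]
        counts[gene] += 1
    return {g: [s / counts[g] for s in sums[g]] for g in sums}


# Two hypothetical PPI layers sharing one gene.
toy_networks = {
    "ppi_a": [("TP53", "MDM2")],
    "ppi_b": [("TP53", "BRCA1")],
}
toy_intra, toy_meta = build_multilayer_graph(toy_networks)
toy_repr = {
    ("TP53", "ppi_a"): [1.0, 0.0],
    ("MDM2", "ppi_a"): [0.0, 2.0],
    ("TP53", "ppi_b"): [3.0, 4.0],
    ("BRCA1", "ppi_b"): [1.0, 1.0],
}
toy_meta_repr = meta_aggregate(toy_repr, toy_meta)
```

In this sketch a gene only receives a copy in layers where it has at least one edge; the resulting meta vectors are what a downstream classifier would consume.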
Multilayered graph improves EMGNN performance
- EMGNN was used to predict cancer genes using a dataset of 887 labeled cancer genes, 7,753 non-cancer genes and 14,019 unlabeled genes.
- Six PPI networks were binarized to keep only high-confidence edges.
- Performance increased as the number of input networks increased.
- EMGNN achieved state-of-the-art performance for all test sets.
- EMGNN outperformed EMOGI by a margin of over 5% AUPRC.
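AUPRC, the area under the precision-recall curve, is commonly estimated by average precision. The sketch below computes it in plain Python under two assumptions: labels are 0 or 1, and tied scores are broken by input order. Whether the paper uses this exact estimator is not stated in this summary, and the labels and scores in the example are hypothetical.

```python
def average_precision(labels, scores):
    """Average precision: mean of precision values at the rank of each positive.

    labels: list of 0/1 ground-truth labels (1 = cancer gene)
    scores: list of predicted probabilities, same length as labels
    """
    n_pos = sum(labels)
    if n_pos == 0:
        raise ValueError("at least one positive label is required")
    # Sort indices from highest to lowest predicted score.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    true_positives = 0
    total = 0.0
    for rank, i in enumerate(order, start=1):
        if labels[i] == 1:
            true_positives += 1
            total += true_positives / rank  # precision at this recall step
    return total / n_pos


# Hypothetical predictions for four genes.
toy_labels = [1, 0, 1, 0]
toy_scores = [0.9, 0.8, 0.7, 0.1]
toy_ap = average_precision(toy_labels, toy_scores)
```

Because labeled cancer genes are far fewer than non-cancer genes in this dataset, a precision-recall metric is more sensitive to ranking quality on the positive class than accuracy would be.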
Evaluating the performance of different GNN architectures and graph ablations
- Performed ablation study to assess performance of EMGNN model using different GNN architectures and input perturbations
- Found that GCN is the best-performing GNN architecture in all datasets
- Examined performance of EMGNN with GCN architecture under different types of input perturbations
- Found that EMGNN performance decreased when node features were replaced with either random or all-one values
- Removal of edges slightly decreased EMGNN performance
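The three input perturbations in this ablation can be sketched in plain Python as follows. The details are assumptions, since the summary does not give them: random features are drawn from a standard normal distribution, and edges are removed uniformly at random at a chosen fraction. The function names and toy data are hypothetical.

```python
import random


def all_one_features(features):
    """Replace every node feature vector with ones of the same length."""
    return {g: [1.0] * len(v) for g, v in features.items()}


def random_features(features, seed=0):
    """Replace node features with standard-normal noise (assumed distribution)."""
    rng = random.Random(seed)
    return {g: [rng.gauss(0.0, 1.0) for _ in v] for g, v in features.items()}


def drop_edges(edges, fraction, seed=0):
    """Remove a given fraction of edges uniformly at random."""
    rng = random.Random(seed)
    n_drop = int(round(fraction * len(edges)))
    return rng.sample(edges, len(edges) - n_drop)


# Hypothetical multi-omics features and a small edge list.
toy_features = {"g1": [0.2, 0.4], "g2": [1.5, -0.3]}
toy_edges = [(i, i + 1) for i in range(10)]

ones = all_one_features(toy_features)
noise = random_features(toy_features, seed=1)
sparser = drop_edges(toy_edges, fraction=0.3, seed=0)
```

Each function returns a new object and leaves its input unchanged, so the same trained model can be evaluated on the original and perturbed inputs side by side.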
Explaining EMGNN reveals biological insights of cancer gene pathogenicity
- Explainable and trustworthy models are essential for understanding cancer genes and discovering new ones.
- EMGNN was used to analyze the relative contributions from each PPI network to cancer gene predictions.
- ANOVA test showed significant difference in contributions from different PPI networks.
- DNA methylation was found to be significantly more important for known cancer gene prediction than other omics data.
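For readers unfamiliar with the test, a one-way ANOVA compares the variance between groups with the variance within groups. The sketch below computes the F statistic in plain Python, treating each PPI network as one group of contribution scores. The scores are hypothetical, the exact ANOVA setup used in the paper is not given in this summary, and turning F into a p-value requires the F distribution (available, for example, through `scipy.stats.f_oneway`), which is omitted here.

```python
def one_way_anova_f(groups):
    """Return (F statistic, df_between, df_within) for a one-way ANOVA.

    groups: list of lists, one list of observations per group.
    """
    k = len(groups)
    n = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n
    means = [sum(g) / len(g) for g in groups]
    # Between-group sum of squares: spread of group means around the grand mean.
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    # Within-group sum of squares: spread of observations around their group mean.
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)
    df_between, df_within = k - 1, n - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


# Hypothetical contribution scores for three networks, three genes each.
toy_groups = [[1.0, 2.0, 3.0], [2.0, 3.0, 4.0], [5.0, 6.0, 7.0]]
toy_f, toy_df_b, toy_df_w = one_way_anova_f(toy_groups)
```

A large F value indicates that the group means differ by more than the within-group scatter would explain, which is the kind of evidence behind the reported difference in network contributions.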
EMGNN identifies novel cancer genes by integrating multilayer graphs
- Trained EMGNN model used to predict cancer genes on unlabeled genes
- A non-trivial number of unlabeled genes received a high predicted probability of being novel cancer genes
- EMGNN achieved accurate and unified novel cancer gene prediction by integrating multilayer graphs
- Predictions of EMOGI models trained on individual PPI networks showed substantial divergence
- EMGNN provided more accurate, unified predictions of cancer genes
- EMGNN predicted COL5A1 as a cancer gene with high confidence
- All individual PPI networks contributed similarly to EMGNN predictions
- Gene set enrichment analysis used to illustrate potential biological mechanisms of COL5A1
Discussion
- The biomedical and biological domains contain a wealth of information that is represented and analyzed using graph structures
- Gene interaction and protein-protein interaction networks describe functional relationships of genes and proteins
- Different graph construction and integration methods yield distinct predictive power
- Developed a new graph learning framework, EMGNN, to jointly model multilayered graphs
- EMGNN outperforms previous graph neural networks trained on single graphs
- EMGNN leverages complementary information from different graph layers and omics features to predict cancer genes
- EMGNN recovers cancer genes missed by previous state-of-the-art predictors
- EMGNN integrates homogeneous, undirected graphs
- EMGNN can be extended to various types of graphs and to perform cross-data modality integration
- EMGNN reveals molecular aberrations that may be leveraged for screening and re-purposing of drugs