Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

  • Identification of cancer genes is a challenging problem in cancer genomics research
  • Computational methods, including deep neural networks, have been developed to address this issue
  • These methods fail to exploit gene-gene interactions and provide little explanation for their predictions
  • Proposed EMGNN approach leverages multiple gene-gene interaction networks and multi-omics data
  • EMGNN outperforms existing approaches and provides valuable biological insights into its predictions

Paper Content

Introduction

  • Understanding gene function and disease pathogenicity depends on gene properties and interactions
  • High-throughput experiments enable profiling of genetic and molecular properties
  • Computational methods predict gene functions by combining gene properties and network connectivity
  • Predicting gene pathogenicity in disease-specific contexts is challenging
  • Cancer sequencing projects generate data for identifying novel cancer genes
  • EMOGI models multi-omics features of cancer genes in PPI networks to predict novel cancer genes
  • EMGNN proposed to address challenge of functional properties irrelevant to cancer disease physiology
  • EMGNN maximizes concordance of functional gene relationships with unknown disease physiology
  • EMGNN achieves state-of-the-art performance by combining information from 6 PPI networks
  • EMGNN identifies most important multi-omics features and most influential PPI networks

Datasets

  • Trained proposed model with 6 PPI Networks
  • Used mutation, copy number, DNA methylation and gene expression data from 29,446 samples from TCGA
  • Data from 16 different cancer types

Multilayer graph neural network

  • Graph neural networks (GNNs) are used to leverage both network structure and node features.
  • GNNs use a message-passing scheme with two steps: aggregating representations of neighbors and updating own representation.
  • Popular architectures include Graph Convolution Networks (GCNs) and Graph Attention Networks (GAT).
  • Multilayer Graph Construction method is used to handle multiple networks.
  • Model is trained using cross-entropy loss function and ADAM optimizer.
  • Data is divided into training, testing and validation sets.
  • Initial GNN has 3 layers with hidden dimension of 64, meta-GNN has 1 layer with hidden dimension of 64.

Model interpretation

  • Captum is a tool for understanding and interpreting machine learning models
  • It offers a range of interpretability methods to analyze predictions
  • Integrated gradient (IG) module assigns importance score to each input feature
  • IG interprets decisions of neural networks by estimating contribution of each input feature
  • Baseline input is typically chosen to be neutral or meaningless
  • Traditional IG method not applicable to graph neural networks
  • Modified approach proposed to compute IG in graph neural networks
  • Node feature interpretation and edge feature interpretation analysis used

Novel cancer gene discovery

  • Trained EMGNN model was used to predict novel cancer genes in 14019 unlabeled genes
  • Genes were ranked by their predicted cancer gene probability
  • EMOGI models were applied to each unlabeled gene to predict probability of it being a cancer gene
  • Results of 6591 unlabeled genes were analyzed

Gene set enrichment analysis

  • Used gene set enrichment analysis to analyze functional enrichment of important gene features in cancer pathways
  • Aggregated maximum feature importance of each node using Captum’s feature explanation results
  • Excluded genes with zero importance from analysis
  • Ranked neighboring gene nodes based on importance
  • Used ranked gene list as input for GSEA
  • Computed enrichment p-value and multiple testing corrected FDR against cancer hallmark gene sets

Results

Overview of emgnn framework

  • Developed a graph neural network model EMGNN
  • Input is a feature vector for each gene and multiple graphs
  • Model updates graph representation within each graph layer
  • Introduces a meta graph layer to combine layer-wise node representations
  • Multi-layer perceptron takes meta node representations and performs node classification
  • EMGNN generalizes single graph GNN by capitalizing on complementary information stored in multiple graphs

Multilayered graph improves emgnn performance

  • EMGNN was used to predict cancer genes using a dataset of 887 labeled cancer genes, 7753 non-cancer genes and 14019 unlabeled genes.
  • Six PPI networks were binarized to keep only high-confidence edges.
  • Performance increased as the number of input networks increased.
  • EMGNN achieved state-of-the-art performance for all test sets.
  • EMGNN outperformed EMOGI by a margin of over 5% AURPC.

Evaluating the performance of different gnn architectures and graph ablations.

  • Performed ablation study to assess performance of EMGNN model using different GNN architectures and input perturbations
  • Found that GCN is the best-performing GNN architecture in all datasets
  • Examined performance of EMGNN with GCN architecture under different types of input perturbations
  • Found that EMGNN decreased in performance for both random and all-one node features
  • Removal of edges slightly decreased EMGNN performance

Explaining emgnn reveals biological insights of cancer gene pathogenicity

  • Explainable and trustworthy models are essential for understanding cancer genes and discovering new ones.
  • EMGNN was used to analyze the relative contributions from each PPI network to cancer gene predictions.
  • ANOVA test showed significant difference in contributions from different PPI networks.
  • DNA methylation was found to be significantly more important for known cancer gene prediction than other omics data.

Emgnn identifies novel cancer genes by integrating multilayer graphs

  • Trained EMGNN model used to predict cancer genes on unlabeled genes
  • Non-trivial number of unlabeled genes with high probability of predicted novel cancer genes
  • EMGNN achieved accurate and unified novel cancer gene prediction by integrating multilayer graphs
  • Predictions of EMOGI models trained on individual PPI networks showed substantial divergence
  • EMGNN provided more accurate, unified predictions of cancer genes
  • EMGNN predictions of COL5A1 as cancer gene with high confidence
  • All individual PPI networks contributed similarly to EMGNN predictions
  • Gene set enrichment analysis used to illustrate potential biological mechanisms of COL5A1

Discussion

  • Biomedical and biological domain contains a wealth of information represented and analyzed using graph structures
  • Gene interaction and protein-protein interaction networks describe functional relationships of genes and proteins
  • Graph construction and integration methods render distinct predictive powers
  • Developed a new graph learning framework, EMGNN, to jointly model multilayered graphs
  • EMGNN outperforms previous graph neural networks trained on single graphs
  • EMGNN leverages complementary information from different graph layers and omics features to predict cancer genes
  • EMGNN recovers cancer genes missed by previous state-of-the-art predictors
  • EMGNN integrates homogeneous, undirected graphs
  • EMGNN can be extended to various types of graphs and to perform cross-data modality integration
  • EMGNN reveals molecular aberrations that may be leveraged for screening and re-purposing of drugs