Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

  • GNNs have potential in graph representation learning
  • Standard GNNs have two major limitations
  • ViT/MLP-Mixer architectures can solve these limitations but increase computational cost
  • Graph MLP-Mixer captures long-range dependency and mitigates over-squashing
  • Graph MLP-Mixer is faster and more memory efficient than related models
  • Graph MLP-Mixer is highly expressive and can distinguish non-isomorphic graphs

Paper Content

Generalizing vit/ml-mixer to graphs overcome mp-gnn limitations

  • MLP-Mixer architecture is designed to capture long-range interaction while keeping low computational cost
  • Generalizing MLP-Mixer from grids and sequences to arbitrary graph topology is challenging
  • Main contribution is to design a novel GNN architecture that captures long-range interaction, keeps low computational complexity, and is isomorphically expressive
  • GNNs have linear learning/inference complexities but low representation power and poor long-range dependency
  • Graph MLP-Mixer overcomes the computational bottleneck of Graph Transformers and solves the issue of long-distance dependency
  • Competitive results on multiple benchmarks
  • Capacity to capture long-range dependency with SOTA performance while keeping low complexity
  • Forms a bridge between CV, NLP and graphs under a unified architecture

Generalization challenges

  • MLP-Mixer is adapted from images to graphs
  • Table 1 summarizes the differences between standard MLP-Mixer and Graph MLP-Mixer
  • Graphs cannot be uniformly divided into similar patches across all examples in the dataset
  • Graph patches need to be transformed into a fixed-length vectorial representation
  • Graph patches are unordered and nodes in graph tokens are naturally unordered
  • MLP-Mixer architectures are known to be strong overfitters

Overview

  • Graph MLP-Mixer is a computer science architecture
  • Graph MLP-Mixer is composed of a patch extraction module, patch embedding module, mixer layers, global average pooling layer, and a fully-connected layer
  • Graphs are represented by a set of nodes (V) and edges (E)
  • Graphs have a pre-defined number of patches (P)
  • Graph-level vectorial representation (h G ) and graph-level target (y G ) are used for prediction

Patch extraction

  • MLP-Mixer can be generalized to graphs by extracting patches.
  • Image patches are easy to extract because they are defined on a regular grid with the same resolution.
  • Graph patches are more challenging to extract because they have different sizes and meaningful sub-graphs must be identified.
  • Graph partitioning algorithms have been studied for decades and are known to be NP-hard.
  • METIS is used to extract graph patches, but it only finds non-overlapping patches.
  • To preserve edge information, patches are expanded to their one-hop neighbourhood.

Patch encoder

  • Patch encoding can be done with a linear transformation for images.
  • Graph patch encoder is a GNN.
  • Graph patch encoder transforms graph token into fixed-size representation in 3 steps.
  • Graph patch encoder has limitation of poor long-range dependency, but this is only an issue for large graphs.

Positional information

  • Regular grids offer an implicit arrangement for image patches and pixels.
  • General graphs do not have this arrangement.
  • Two explicit positional encodings are used to increase expressivity.
  • Adjacency matrix is smoothed out to improve long-distance interactions.

Mixer layer

  • The mixer layer is a network that alternates between token and channel mixing steps.
  • The self-attention mechanism in ViT is not the only critical component for good performance on visual classification tasks.
  • The mixer layer requires less computational cost than the self-attention mechanism in ViT.
  • Positional information is introduced between graph tokens in the modified mixer layer.

Data augmentation

  • MLP-Mixer architectures are known to over-fit.
  • To reduce this effect, we propose to introduce perturbations in METIS.
  • We randomly drop a small set of edges from the original graph.
  • At each epoch, we apply METIS graph partition algorithm on the modified graph.
  • We extract graph patches from the original graph, not the modified graph.
  • Our dropping edge mechanism is different to standard data augmentation techniques.

Experiments

  • Evaluated Graph MLP-Mixer on simulated and real-world datasets
  • Datasets include CSL, EXP, SR25, TreeNeighbourMatch, ZINC, MNIST, CIFAR10, MolTOX21, MolHIV, Peptides-func, Peptides-struct
  • Summary statistics and experimental details in Appendices A and B

Comparison with mp-gnns

  • Graph MLP-Mixer improves the performance of all base MP-GNNs on various datasets.
  • Graph MLP-Mixer outperforms the base MP-GNNs by large margins on two LRGB datasets.

Comparison with sotas

  • Graph MLP-Mixer achieved good results on small and large molecular graphs
  • Graph MLP-Mixer offers better space-time complexity and scalability than other GNN models

Graph mlp-mixer can achieve high expressivity

  • Graph PEs can distinguish non-isomorphic graphs that the 1-WL test fails
  • Graph MLP-Mixer is more powerful than 1-WL
  • Experiments conducted on 3 simulation datasets
  • Graph MLP-Mixer achieves perfect accuracy, MP-GNNs fail
  • Evaluated various choices for components of architecture
  • METIS provides benefits against random graph partitioning
  • Evaluated effect of number of graph patches, patch overlapping, and two kinds of positional encoding
  • Proposed novel GNN model to improve standard MP-GNN limitations
  • Evaluated on wide range of graph benchmarks
  • Summary statistics of datasets presented
  • Used PyTorch and PyG
  • Experiments run on NVIDIA RTX A5000 GPUs
  • Used Adam optimizer
  • Used ASAM optimizer to reduce fluctuations
  • Parameter budgets for real-world datasets
  • Used GCN, GatedGCN, GINE, and Graph Transformer as baselines
  • Graph MLP-Mixer hidden size set to 128, number of GNN layers and Mixer layers set to 4
  • Results referenced from literature or reproduced using authors’ official code
  • Batch size set to 128 for SOTA models and ours
  • Node positional encoding dataset and task dependent
  • METIS used as graph partition algorithm
  • Data augmentation applied to both algorithms
  • Performance increases first and then flattens out when increasing number of graph patches
  • Graph patches overlapping increases performance
  • Positional encoding increases expressivity power of GNNs but not necessarily their generalization performance