Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

GNNs have potential in graph representation learning
Standard GNNs have two major limitations
ViT/MLP-Mixer architectures can solve these limitations but increase computational cost
Graph MLP-Mixer captures long-range dependency and mitigates over-squashing
Graph MLP-Mixer is faster and more memory efficient than related models
Graph MLP-Mixer is highly expressive and can distinguish non-isomorphic graphs

Paper Content

Generalizing ViT/MLP-Mixer to graphs overcomes MP-GNN limitations

MLP-Mixer architecture is designed to capture long-range interaction while keeping low computational cost
Generalizing MLP-Mixer from grids and sequences to arbitrary graph topology is challenging
The main contribution is a novel GNN architecture that captures long-range interaction, keeps low computational complexity, and is highly expressive with respect to graph isomorphism
Standard MP-GNNs have linear learning and inference complexity but low representation power and a poor ability to capture long-range dependency
Graph MLP-Mixer overcomes the computational bottleneck of Graph Transformers and solves the issue of long-distance dependency
Competitive results on multiple benchmarks
Capacity to capture long-range dependency with SOTA performance while keeping low complexity
Forms a bridge between CV, NLP and graphs under a unified architecture

Generalization challenges

MLP-Mixer is adapted from images to graphs
Table 1 summarizes the differences between standard MLP-Mixer and Graph MLP-Mixer
Graphs cannot be uniformly divided into similar patches across all examples in the dataset
Graph patches need to be transformed into a fixed-length vectorial representation
Graph patches are unordered and nodes in graph tokens are naturally unordered
MLP-Mixer architectures are known to be strong overfitters

Overview

Graph MLP-Mixer is a neural network architecture for graphs
Graph MLP-Mixer is composed of a patch extraction module, patch embedding module, mixer layers, global average pooling layer, and a fully-connected layer
Graphs are represented by a set of nodes (V) and edges (E)
Each graph is divided into a pre-defined number of patches (P)
A graph-level vectorial representation (h_G) and a graph-level target (y_G) are used for prediction

Patch extraction

MLP-Mixer can be generalized to graphs by extracting patches.
Image patches are easy to extract because they are defined on a regular grid with the same resolution.
Graph patches are more challenging to extract because they have different sizes and meaningful sub-graphs must be identified.
Graph partitioning has been studied for decades, and the underlying problem is known to be NP-hard.
METIS is used to extract graph patches, but it only finds non-overlapping patches.
To preserve edge information, patches are expanded to their one-hop neighbourhood.
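The one-hop expansion step can be sketched in a few lines. This is a minimal illustration, not the authors' code: it assumes the non-overlapping partition is already available as a list `membership` (one patch id per node, as a partitioner such as METIS would return), and the function name `expand_one_hop` is hypothetical.

```python
from collections import defaultdict


def expand_one_hop(edges, membership):
    """Grow each patch by the one-hop neighbours of its nodes.

    edges: list of undirected (u, v) pairs.
    membership: membership[v] is the patch id of node v (assumed to come
    from a non-overlapping partitioner such as METIS).
    """
    neighbours = defaultdict(set)
    for u, v in edges:
        neighbours[u].add(v)
        neighbours[v].add(u)

    # Group nodes by their patch id.
    patches = defaultdict(set)
    for node, patch_id in enumerate(membership):
        patches[patch_id].add(node)

    # Add every neighbour of every node in the patch.
    expanded = {}
    for patch_id, nodes in patches.items():
        grown = set(nodes)
        for node in nodes:
            grown |= neighbours[node]
        expanded[patch_id] = grown
    return expanded


# Path graph 0-1-2-3-4-5 split into two non-overlapping halves.
edges = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5)]
membership = [0, 0, 0, 1, 1, 1]
patches = expand_one_hop(edges, membership)
```

In this toy case the edge (2, 3) is cut by the partition, so neither original patch contains it. After expansion both patches contain nodes 2 and 3, so the edge is preserved.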

Patch encoder

Patch encoding can be done with a linear transformation for images.
Graph patch encoder is a GNN.
The graph patch encoder transforms each graph token into a fixed-size representation in 3 steps.
Being a GNN, the patch encoder captures long-range dependency poorly, but this only matters for large graphs.

Positional information

Regular grids offer an implicit arrangement for image patches and pixels.
General graphs do not have this arrangement.
Two explicit positional encodings are used to increase expressivity.
Adjacency matrix is smoothed out to improve long-distance interactions.

Mixer layer

The mixer layer is a network that alternates between token and channel mixing steps.
The self-attention mechanism in ViT is not the only critical component for good performance on visual classification tasks.
The mixer layer requires less computational cost than the self-attention mechanism in ViT.
Positional information is introduced between graph tokens in the modified mixer layer.
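The alternation between token mixing and channel mixing can be illustrated with a small NumPy sketch of a standard MLP-Mixer layer. It is an illustration under assumptions, not the paper's implementation: the positional information that the modified graph mixer layer introduces is omitted, weights are random, biases are left out, and all function and variable names are hypothetical.

```python
import numpy as np


def layer_norm(x, eps=1e-5):
    # Normalise each row over its last axis (no learned scale or shift).
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)


def gelu(x):
    # Tanh approximation of GELU.
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))


def mlp(x, w1, w2):
    # Two-layer perceptron without biases, applied to the last axis.
    return gelu(x @ w1) @ w2


def mixer_layer(x, token_w1, token_w2, chan_w1, chan_w2):
    """x has shape (P, d): P patch tokens, d channels."""
    # Token mixing: transpose so the MLP acts across the P patches,
    # with the same weights shared by every channel.
    u = x + mlp(layer_norm(x).T, token_w1, token_w2).T
    # Channel mixing: the MLP acts across the d channels of each patch.
    return u + mlp(layer_norm(u), chan_w1, chan_w2)


rng = np.random.default_rng(0)
P, d, hidden = 8, 16, 32
x = rng.normal(size=(P, d))
token_w1 = rng.normal(scale=0.1, size=(P, hidden))
token_w2 = rng.normal(scale=0.1, size=(hidden, P))
chan_w1 = rng.normal(scale=0.1, size=(d, hidden))
chan_w2 = rng.normal(scale=0.1, size=(hidden, d))

out = mixer_layer(x, token_w1, token_w2, chan_w1, chan_w2)
graph_repr = out.mean(axis=0)  # global average pooling over patches
```

Because the token-mixing weights have a fixed shape tied to P, the number of patches must be the same for every graph, which is consistent with the pre-defined patch count described in the overview.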

Data augmentation

MLP-Mixer architectures are known to over-fit.
To reduce this effect, we propose to introduce perturbations in METIS.
We randomly drop a small set of edges from the original graph.
At each epoch, we apply METIS graph partition algorithm on the modified graph.
We extract graph patches from the original graph, not the modified graph.
Our dropping edge mechanism is different to standard data augmentation techniques.
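A minimal sketch of this per-epoch procedure follows. It rests on several assumptions: `toy_partition` is a hypothetical stand-in for METIS (a breadth-first ordering cut into equal chunks), the drop rate is an arbitrary illustrative value, and the one-hop patch expansion is left out for brevity. The point it shows is that the partition is computed on the perturbed graph while patch edges are read from the original one.

```python
import random
from collections import defaultdict, deque


def drop_edges(edges, drop_rate, rng):
    # Keep each edge independently with probability 1 - drop_rate.
    return [e for e in edges if rng.random() >= drop_rate]


def toy_partition(num_nodes, edges, num_patches):
    """Hypothetical stand-in for METIS: BFS order cut into equal chunks."""
    adj = defaultdict(list)
    for u, v in edges:
        adj[u].append(v)
        adj[v].append(u)
    order, seen = [], set()
    for start in range(num_nodes):
        if start in seen:
            continue
        seen.add(start)
        queue = deque([start])
        while queue:
            node = queue.popleft()
            order.append(node)
            for nxt in adj[node]:
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append(nxt)
    size = -(-num_nodes // num_patches)  # ceiling division
    membership = [0] * num_nodes
    for rank, node in enumerate(order):
        membership[node] = rank // size
    return membership


def patches_for_epoch(num_nodes, edges, num_patches, drop_rate, rng,
                      partition=toy_partition):
    # 1. Perturb the graph by dropping a small random set of edges.
    perturbed = drop_edges(edges, drop_rate, rng)
    # 2. Partition the perturbed graph.
    membership = partition(num_nodes, perturbed, num_patches)
    # 3. Collect patch edges from the ORIGINAL graph, so no edge inside
    #    a patch is lost to the perturbation.
    patch_edges = defaultdict(list)
    for u, v in edges:
        if membership[u] == membership[v]:
            patch_edges[membership[u]].append((u, v))
    return membership, dict(patch_edges)


rng = random.Random(0)
edges = [(i, (i + 1) % 8) for i in range(8)]  # 8-node cycle
for epoch in range(3):
    membership, patch_edges = patches_for_epoch(8, edges, 2, 0.25, rng)
```

Since the dropped edges only influence where the partition boundaries fall, the model sees differently shaped patches at each epoch while the patch contents still come from the unmodified graph.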

Experiments

Evaluated Graph MLP-Mixer on simulated and real-world datasets
Datasets include CSL, EXP, SR25, TreeNeighbourMatch, ZINC, MNIST, CIFAR10, MolTOX21, MolHIV, Peptides-func, Peptides-struct
Summary statistics and experimental details in Appendices A and B

Comparison with MP-GNNs

Graph MLP-Mixer improves the performance of all base MP-GNNs on various datasets.
Graph MLP-Mixer outperforms the base MP-GNNs by large margins on two LRGB datasets.

Comparison with SOTAs

Graph MLP-Mixer achieved good results on small and large molecular graphs
Graph MLP-Mixer offers better space-time complexity and scalability than other GNN models

Graph MLP-Mixer can achieve high expressivity

Graph PEs can distinguish non-isomorphic graphs that the 1-WL test fails to tell apart
Graph MLP-Mixer is more powerful than 1-WL
Experiments conducted on 3 simulation datasets
Graph MLP-Mixer achieves perfect accuracy, MP-GNNs fail
Evaluated various choices for components of architecture
METIS provides benefits against random graph partitioning
Evaluated effect of number of graph patches, patch overlapping, and two kinds of positional encoding
Proposed novel GNN model to improve standard MP-GNN limitations
Evaluated on wide range of graph benchmarks
Summary statistics of datasets presented
Used PyTorch and PyG
Experiments run on NVIDIA RTX A5000 GPUs
Used Adam optimizer
Used ASAM optimizer to reduce fluctuations
Parameter budgets for real-world datasets
Used GCN, GatedGCN, GINE, and Graph Transformer as baselines
Graph MLP-Mixer hidden size set to 128, number of GNN layers and Mixer layers set to 4
Results referenced from literature or reproduced using authors’ official code
Batch size set to 128 for SOTA models and ours
Node positional encoding is dataset- and task-dependent
METIS used as graph partition algorithm
Data augmentation applied to both algorithms
Performance increases first and then flattens out when increasing number of graph patches
Graph patches overlapping increases performance
Positional encoding increases the expressive power of GNNs but not necessarily their generalization performance
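To make the 1-WL limitation mentioned in this section concrete, the sketch below runs plain colour refinement on two small graphs. It is a textbook illustration, not one of the paper's experiments: the graphs (a 6-cycle and two disjoint triangles) and the helper name `wl_histogram` are chosen here for demonstration only.

```python
from collections import Counter


def wl_histogram(num_nodes, edges, rounds=3):
    """Colour histogram after a few rounds of 1-WL colour refinement."""
    adj = [[] for _ in range(num_nodes)]
    for u, v in edges:
        adj[u].append(v)
        adj[v].append(u)
    # All nodes start with the same colour.
    colours = [0] * num_nodes
    for _ in range(rounds):
        # New colour = own colour plus the sorted multiset of neighbour
        # colours. Nested tuples are used as colours directly, so results
        # are comparable across graphs without a shared relabelling table.
        colours = [
            (colours[v], tuple(sorted(colours[u] for u in adj[v])))
            for v in range(num_nodes)
        ]
    return Counter(colours)


hexagon = [(i, (i + 1) % 6) for i in range(6)]
two_triangles = [(0, 1), (1, 2), (2, 0), (3, 4), (4, 5), (5, 3)]
path = [(i, i + 1) for i in range(5)]

same = wl_histogram(6, hexagon) == wl_histogram(6, two_triangles)
different = wl_histogram(6, hexagon) != wl_histogram(6, path)
```

The hexagon and the two triangles are not isomorphic, yet every node in both graphs has degree 2, so colour refinement never separates them and `same` is true. The path graph has end nodes of degree 1, so it is distinguished and `different` is true. A model bounded by 1-WL inherits the first failure case.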
