Link to paper
The full paper is available here.
You can also find the paper on PapersWithCode here.
Abstract
- GNNs have potential in graph representation learning
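- Standard message-passing GNNs have two major limitations: over-squashing and poor long-range dependency
- Global attention can address these limitations but raises computational cost; ViT/MLP-Mixer architectures offer a cheaper alternative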
- Graph MLP-Mixer captures long-range dependency and mitigates over-squashing
- Graph MLP-Mixer is faster and more memory efficient than related models
- Graph MLP-Mixer is highly expressive and can distinguish non-isomorphic graphs
Paper Content
Generalizing ViT/MLP-Mixer to graphs overcomes MP-GNN limitations
- MLP-Mixer architecture is designed to capture long-range interaction while keeping low computational cost
- Generalizing MLP-Mixer from grids and sequences to arbitrary graph topology is challenging
- Main contribution is a novel GNN architecture that captures long-range interaction, keeps computational complexity low, and is highly expressive in terms of graph isomorphism
- GNNs have linear learning/inference complexities but low representation power and poor long-range dependency
- Graph MLP-Mixer overcomes the computational bottleneck of Graph Transformers and solves the issue of long-distance dependency
- Competitive results on multiple benchmarks
- Capacity to capture long-range dependency with SOTA performance while keeping low complexity
- Forms a bridge between CV, NLP and graphs under a unified architecture
Generalization challenges
- MLP-Mixer is adapted from images to graphs
- Table 1 summarizes the differences between standard MLP-Mixer and Graph MLP-Mixer
- Graphs cannot be uniformly divided into similar patches across all examples in the dataset
- Graph patches need to be transformed into a fixed-length vectorial representation
- Graph patches are unordered and nodes in graph tokens are naturally unordered
- MLP-Mixer architectures are known to be strong overfitters
Overview
- Graph MLP-Mixer is a GNN architecture that adapts MLP-Mixer to graphs
- Graph MLP-Mixer is composed of a patch extraction module, patch embedding module, mixer layers, global average pooling layer, and a fully-connected layer
- Graphs are represented by a set of nodes (V) and edges (E)
- Each graph is split into a pre-defined number of patches (P)
- A graph-level vector representation (h_G) is used to predict the graph-level target (y_G)
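The last two stages of the pipeline listed above (global average pooling, then a fully-connected layer) can be sketched as follows. This is a minimal NumPy illustration, not the authors' implementation: the function name `readout`, the sizes `P`, `d`, `num_targets`, and the random weights are all assumptions made for the example.

```python
import numpy as np

def readout(patch_embeddings, weight, bias):
    """Global average pooling over patches, then a fully-connected layer."""
    h_G = patch_embeddings.mean(axis=0)   # (d,) graph-level representation
    return weight @ h_G + bias            # (num_targets,) prediction for y_G

rng = np.random.default_rng(0)
P, d, num_targets = 8, 16, 1              # hypothetical sizes
X = rng.normal(size=(P, d))               # stand-in for the mixer layers' output
W = rng.normal(size=(num_targets, d))     # hypothetical (untrained) weights
b = np.zeros(num_targets)

y_hat = readout(X, W, b)
print(y_hat.shape)                        # (1,)
```

Because the pooling step is a mean over patches, the prediction does not depend on the order in which the patches are listed.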
Patch extraction
- MLP-Mixer can be generalized to graphs by extracting patches.
- Image patches are easy to extract because they are defined on a regular grid with the same resolution.
- Graph patches are more challenging to extract because they have different sizes and meaningful sub-graphs must be identified.
- Graph partitioning has been studied for decades, and the problem is known to be NP-hard.
- METIS is used to extract graph patches, but it only finds non-overlapping patches.
- To preserve edge information, patches are expanded to their one-hop neighbourhood.
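The one-hop expansion can be illustrated with a small sketch. It assumes the partitioner's output is available as a node-to-patch dictionary; here that dictionary is written by hand rather than produced by METIS, and `expand_one_hop` is a hypothetical helper name.

```python
def expand_one_hop(edges, membership, num_patches):
    """Grow each non-overlapping patch by its one-hop neighbourhood.

    edges:      iterable of undirected (u, v) pairs
    membership: dict node -> patch index (stand-in for a partitioner's output)
    """
    expanded = [set() for _ in range(num_patches)]
    for node, p in membership.items():
        expanded[p].add(node)
    for u, v in edges:
        pu, pv = membership[u], membership[v]
        if pu != pv:
            # A cut edge: add each endpoint to the other endpoint's patch
            expanded[pu].add(v)
            expanded[pv].add(u)
    return expanded

# Path graph 0-1-2-3-4-5 split into two halves (hand-written partition)
edges = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5)]
membership = {0: 0, 1: 0, 2: 0, 3: 1, 4: 1, 5: 1}
patches = expand_one_hop(edges, membership, num_patches=2)
print([sorted(p) for p in patches])  # [[0, 1, 2, 3], [2, 3, 4, 5]]
```

Without the expansion, the cut edge (2, 3) would belong to neither patch; after it, both patches contain both endpoints, so the edge is kept.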
Patch encoder
- Patch encoding can be done with a linear transformation for images.
- Graph patch encoder is a GNN.
- Graph patch encoder transforms graph token into fixed-size representation in 3 steps.
- As a GNN, the patch encoder inherits poor long-range dependency, but this is only an issue for large graphs, not for small patches.
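The summary does not list the three steps, so the sketch below assumes one plausible reading: (1) embed raw node features linearly, (2) run a few rounds of neighbourhood aggregation inside the patch, (3) mean-pool the nodes. The aggregation rule, the shared weight matrix `w_gnn`, and all sizes are assumptions for illustration, not the paper's encoder.

```python
import numpy as np

def encode_patch(node_feats, adj, w_in, w_gnn, num_layers=2):
    """Map one patch (any number of nodes) to a fixed-size vector.

    node_feats: (n, f) raw features of the n nodes in the patch
    adj:        (n, n) adjacency matrix of the patch sub-graph
    """
    h = node_feats @ w_in                              # step 1: linear embedding, (n, d)
    a_hat = adj + np.eye(adj.shape[0])                 # add self-loops
    a_hat = a_hat / a_hat.sum(axis=1, keepdims=True)   # row-normalise (mean aggregation)
    for _ in range(num_layers):                        # step 2: message passing in the patch
        h = np.maximum(a_hat @ h @ w_gnn, 0.0)         # ReLU; weights shared across layers
    return h.mean(axis=0)                              # step 3: mean pooling, (d,)

rng = np.random.default_rng(0)
f, d = 3, 8                                            # hypothetical feature sizes
w_in, w_gnn = rng.normal(size=(f, d)), rng.normal(size=(d, d))

# Two patches of different sizes: a triangle and a 4-node path
tri = np.array([[0, 1, 1], [1, 0, 1], [1, 1, 0]], dtype=float)
path = np.array([[0, 1, 0, 0], [1, 0, 1, 0], [0, 1, 0, 1], [0, 0, 1, 0]], dtype=float)
x_tri, x_path = rng.normal(size=(3, f)), rng.normal(size=(4, f))

z_tri = encode_patch(x_tri, tri, w_in, w_gnn)
z_path = encode_patch(x_path, path, w_in, w_gnn)
print(z_tri.shape, z_path.shape)                       # (8,) (8,)
```

Both patches end up as vectors of length `d` regardless of their node count, and mean pooling makes the result independent of node ordering within the patch.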
Positional information
- Regular grids offer an implicit arrangement for image patches and pixels.
- General graphs do not have this arrangement.
- Two explicit positional encodings are used to increase expressivity.
- Adjacency matrix is smoothed out to improve long-distance interactions.
Mixer layer
- The mixer layer is a network that alternates between token and channel mixing steps.
- The self-attention mechanism in ViT is not the only critical component for good performance on visual classification tasks.
- The mixer layer requires less computational cost than the self-attention mechanism in ViT.
- Positional information is introduced between graph tokens in the modified mixer layer.
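The alternation of token and channel mixing can be sketched with the standard MLP-Mixer layer applied to a matrix of P patch embeddings. This is a NumPy illustration under stated assumptions: it omits the positional information that the modified layer adds between graph tokens, the layer norm has no learned scale or shift, and the hidden widths `ds` and `dc` are hypothetical.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def gelu(x):
    # tanh approximation of GELU
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def mixer_layer(x, w1, w2, w3, w4):
    """One mixer layer on patch embeddings x of shape (P, d)."""
    # Token mixing: an MLP across the P patches, applied to every channel
    u = x + (gelu(layer_norm(x).T @ w1) @ w2).T
    # Channel mixing: an MLP across the d channels, applied to every patch
    return u + gelu(layer_norm(u) @ w3) @ w4

rng = np.random.default_rng(0)
P, d, ds, dc = 8, 16, 32, 64                 # hypothetical sizes
x = rng.normal(size=(P, d))
w1, w2 = rng.normal(size=(P, ds)), rng.normal(size=(ds, P))
w3, w4 = rng.normal(size=(d, dc)), rng.normal(size=(dc, d))

y = mixer_layer(x, w1, w2, w3, w4)
print(y.shape)                               # (8, 16)
```

The token-mixing weights have shapes tied to P, which is why the number of patches per graph has to be fixed in advance. For a fixed hidden width, both steps are plain matrix products whose cost grows linearly with P, whereas self-attention forms a P x P score matrix.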
Data augmentation
- MLP-Mixer architectures are known to over-fit.
- To reduce this effect, we propose to introduce perturbations in METIS.
- We randomly drop a small set of edges from the original graph.
- At each epoch, we apply METIS graph partition algorithm on the modified graph.
- We extract graph patches from the original graph, not the modified graph.
- Our dropping edge mechanism is different to standard data augmentation techniques.
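The per-epoch procedure above can be sketched as follows. The partitioner here is a toy placeholder (BFS order cut into equal chunks), not METIS, and the function names, drop rate, and example graph are assumptions. The sketch also leaves out the one-hop patch expansion to stay short.

```python
import random
from collections import deque

def drop_edges(edges, drop_rate, rng):
    """Return a copy of the edge list with each edge removed with probability drop_rate."""
    return [e for e in edges if rng.random() >= drop_rate]

def toy_partition(num_nodes, edges, num_patches):
    """Placeholder partitioner (NOT METIS): BFS order cut into equal-size chunks."""
    nbrs = {n: [] for n in range(num_nodes)}
    for u, v in edges:
        nbrs[u].append(v)
        nbrs[v].append(u)
    order, seen = [], set()
    for start in range(num_nodes):
        if start in seen:
            continue
        seen.add(start)
        queue = deque([start])
        while queue:
            n = queue.popleft()
            order.append(n)
            for m in nbrs[n]:
                if m not in seen:
                    seen.add(m)
                    queue.append(m)
    size = -(-num_nodes // num_patches)  # ceiling division
    return {n: i // size for i, n in enumerate(order)}

def patches_for_epoch(num_nodes, edges, num_patches, drop_rate, rng):
    perturbed = drop_edges(edges, drop_rate, rng)
    # Partition the MODIFIED graph ...
    membership = toy_partition(num_nodes, perturbed, num_patches)
    patches = [set() for _ in range(num_patches)]
    for n, p in membership.items():
        patches[p].add(n)
    # ... but take patch edges from the ORIGINAL edge list, so no edge is lost
    patch_edges = [[(u, v) for u, v in edges if u in p and v in p] for p in patches]
    return patches, patch_edges

rng = random.Random(0)
ring = [(i, (i + 1) % 8) for i in range(8)]      # 8-node ring as a toy graph
for epoch in range(3):
    patches, patch_edges = patches_for_epoch(8, ring, 2, 0.2, rng)
    print(epoch, [sorted(p) for p in patches])
```

The dropped edges only influence where the partition boundaries fall; the patches themselves still carry the original graph's edges, which is what distinguishes this from augmentation that feeds a perturbed graph to the model.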
Experiments
- Evaluated Graph MLP-Mixer on simulated and real-world datasets
- Datasets include CSL, EXP, SR25, TreeNeighbourMatch, ZINC, MNIST, CIFAR10, MolTOX21, MolHIV, Peptides-func, Peptides-struct
- Summary statistics and experimental details in Appendices A and B
Comparison with MP-GNNs
- Graph MLP-Mixer improves the performance of all base MP-GNNs on various datasets.
- Graph MLP-Mixer outperforms the base MP-GNNs by large margins on two LRGB datasets.
Comparison with SOTA models
- Graph MLP-Mixer achieved good results on small and large molecular graphs
- Graph MLP-Mixer offers better space-time complexity and scalability than other GNN models
Graph MLP-Mixer can achieve high expressivity
- Graph PEs can distinguish non-isomorphic graphs that the 1-WL test fails to distinguish
- Graph MLP-Mixer is more powerful than 1-WL
- Experiments conducted on 3 simulation datasets
- Graph MLP-Mixer achieves perfect accuracy, MP-GNNs fail
- Evaluated various choices for components of architecture
- METIS provides benefits against random graph partitioning
- Evaluated effect of number of graph patches, patch overlapping, and two kinds of positional encoding
- Proposed novel GNN model to improve standard MP-GNN limitations
- Evaluated on wide range of graph benchmarks
- Summary statistics of datasets presented
- Used PyTorch and PyG
- Experiments run on NVIDIA RTX A5000 GPUs
- Used Adam optimizer
- Used ASAM optimizer to reduce fluctuations
- Parameter budgets for real-world datasets
- Used GCN, GatedGCN, GINE, and Graph Transformer as baselines
- Graph MLP-Mixer hidden size set to 128, number of GNN layers and Mixer layers set to 4
- Results referenced from literature or reproduced using authors’ official code
- Batch size set to 128 for SOTA models and ours
- Choice of node positional encoding is dataset- and task-dependent
- METIS used as graph partition algorithm
- Data augmentation applied to both partitioning algorithms (METIS and random partitioning)
- Performance increases first and then flattens out when increasing number of graph patches
- Overlapping graph patches increases performance
- Positional encoding increases the expressive power of GNNs but not necessarily their generalization performance
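The 1-WL limitation mentioned in the expressivity bullets can be reproduced with a classic pair of graphs: a 6-cycle and two disjoint triangles. The sketch below is a plain colour-refinement routine written for illustration (the function name and the example graphs are not taken from the paper's datasets); it keeps colours as nested tuples so that histograms from different graphs are directly comparable.

```python
from collections import Counter

def wl_colour_histogram(num_nodes, edges, rounds=3):
    """Run 1-WL colour refinement and return the final colour histogram."""
    nbrs = {n: [] for n in range(num_nodes)}
    for u, v in edges:
        nbrs[u].append(v)
        nbrs[v].append(u)
    colours = {n: 0 for n in range(num_nodes)}   # uniform initial colour
    for _ in range(rounds):
        # New colour = (own colour, sorted multiset of neighbour colours)
        colours = {
            n: (colours[n], tuple(sorted(colours[m] for m in nbrs[n])))
            for n in nbrs
        }
    return Counter(colours.values())

hexagon = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5), (5, 0)]
two_triangles = [(0, 1), (1, 2), (2, 0), (3, 4), (4, 5), (5, 3)]
path = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5)]

# Non-isomorphic (one is connected, the other is not), yet 1-WL sees no difference
print(wl_colour_histogram(6, hexagon) == wl_colour_histogram(6, two_triangles))  # True
# A 6-node path is told apart from the hexagon, since node degrees differ
print(wl_colour_histogram(6, path) == wl_colour_histogram(6, hexagon))           # False
```

Every node in both 2-regular graphs sees the same neighbourhood colours at every round, so refinement never splits them. A standard MP-GNN is bounded by this test, which is why additional structural signals such as positional encodings are needed to separate such pairs.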