Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

Transformer models have pushed deep learning model scale to billions of parameters.
Choosing the optimal parallel strategy requires domain expertise in both deep learning and parallel computing.
Colossal-AI system provides a unified interface to scale sequential code of model training to distributed environments.
Colossal-AI can achieve up to 2.76 times training speedup on large-scale models.

Paper Content

I. introduction

Deep learning has been successful in many applications
Neural networks can learn high-dimensional features and make predictions better than humans
Deep learning models are getting larger
Traditional training methods are no longer effective
Distributed training is necessary to enable large-scale model training
Limited fast memory of hardware is the bottleneck of scaling the model
Data parallelism is the most popular parallelism method
Many methods have been proposed to reduce memory footprint
GShard, FairScale, Megatron-LM and DeepSpeed are the current state-of-the-art systems
Colossal-AI is an open-source system to democratize complicated distributed training
Colossal-AI provides the fullest set of acceleration techniques
Optimized parallelism and heterogeneous training methods are provided
In-depth analysis was conducted to investigate suitable parallelism strategies

Ii. background

Transformer architecture has improved deep learning models
BERT and Vision Transformer are examples of this architecture

A. data parallelism

Data parallelism is the most common form of parallelism.
Model is replicated across devices and dataset is split into shards.
Each shard is allocated to a device and fed to the model.
Model weights must be synchronized for every training iteration.

B. model parallelism

Data parallel training uses a copy of the whole model weights on each GPU
Memory requirement of large-scale models is difficult to fulfill
Model parameters need to be sharded so each device only holds part of the model weights
Model parallelism proposed to tackle this problem, with three subcategories: tensor parallelism, pipeline parallelism, and sequence parallelism
Tensor parallelism shards the tensor over an array of devices and requires a distributed matrix-matrix multiplication algorithm
Megatron-LM proposed 1D tensor parallelism which splits the linear layer in the row or column dimensions
1D tensor parallelism relies on all-reduce operation for collective communication
1D tensor parallelism has redundant memory usage in layer inputs and outputs
More advanced tensor parallelism introduced for large-scale model training: 2D, 2.5D, and 3D tensor parallelism
2D tensor parallelism relies on SUMMA and Cannon matrix multiplication algorithm
Tensor split into N chunks and each chunk owned by one GPU

√

Tensor parallelism splits a tensor into chunks to reduce memory consumption
2D, 2.5D and 3D tensor parallelism methods exist
Pipeline parallelism splits the model by layer
Activation and gradients are communicated between pipeline stages
Sequence parallelism splits the input data along the sequence dimension
Ring Self Attention module is used to exchange partial query, key and value embeddings
Sequence parallelism is orthogonal to Pipeline Parallelism

C. zero redundancy data parallelism

Memory challenge caused by model parameters
Optimizer states can take up more memory than model parameters
N copies of model parameters, gradients and optimizer states in data parallel training
Zero Redundancy Optimizer partitions model data over different devices
All-gather used in forward and backward passes
Each device only holds a partition of gradients and parameters

D. heterogeneous training

GPUs have less memory capacity than CPUs.
A typical Nvidia DGX1 workstation has 8 Tesla V100 GPUs with 256 GB of memory and 512 GB of CPU memory.
Utilizing CPU memory and NVMe SSDs can increase the training capacity of a single node.
DeepSpeed proposed zero-offload to move tensors from GPU to CPU or NVMe disks.
It is possible to train a model with billions of parameters on a single GPU.

Iii. design

Computing resources have become diverse in the past decade
Difficult to have a universal method for deep learning training due to hardware variance
Megatron-LM and DeepSpeed can help with this problem
Colossal-AI has an array of acceleration techniques to cover a wide range of training settings

A. multi-dimensional model parallelism

Model parallelism reduces memory burden on single GPU
Colossal-AI provides array of model parallelism methods
Transformer model relies heavily on linear layer
Tensor parallelism suitable for acceleration of Transformer model training
Advanced tensor parallelism has lower communication cost than 1D tensor parallelism
Colossal-AI includes sequence and pipeline parallelism for hybrid parallelism

B. enhanced sharding and offloading

DeepSpeed supports zero redundancy training and offloading
Re-designed sharding and offloading module for better performance
Unified sharded tensor interface for zero redundancy data parallel training
Extra optimizations proposed by PatrickStar for memory space
Re-use fp16 storage space in memory
Adaptive hybrid Adam optimizer for lower time cost

Iv. implementation

Colossal-AI is a deep learning system implemented with PyTorch
Parallel context manager stores meta information about the environment
Automatically switches to corresponding parallel mode
User-friendly APIs for model training

A. experiment setup

Conducted experiments on different hardware
Used Megatron-LM and DeepSpeed as baselines for experiments

1) convergence

Conducted experiments with Vision Transformer on ImageNet-1k dataset
Configured ViT model with 12 Transformer layers, hidden size of 384, 6 attention heads
Used Jax initialization, AdamW optimizer, 0.003 learning rate, 0.3 weight decay
Input image shape 224, patch size 16, global batch size 4k, trained for 250 epochs
Testing accuracy curves of Multi-Dimensional tensor parallelism align with baseline, demonstrating no effect on model convergence

2) memory efficiency

2D, 2.5D, and 3D tensor parallelism partition input data, layer weight, and output activation, while 1D tensor parallelism only partitions layer weight.
2D, 2.5D, and 3D tensor parallelism have lower memory consumption than 1D tensor parallelism.
Range tests show that 2.5D and 3D tensor parallelism have 44% and 65% lower memory consumption than 1D tensor parallelism.
Experiments on System I and II show that 1D tensor parallelism has higher throughput than 2D, 2.5D, and 3D tensor parallelism.
On System II, 2D and 2.5D tensor parallelism have 40% and 20.6% higher throughput than 1D tensor parallelism.

4) throughput comparison

Tested performance of tensor parallelism with more GPUs
Adjusted ViT model configuration due to 16 GB GPU memory
Results show speedup of advanced tensor parallelism up to 2.76
Low communication volume and memory efficiency make 2D, 2.5D, and 3D tensor parallelism better for large-scale distributed training

C. sequence parallelism

Sequence Parallelism is more memory efficient than 1D tensor parallelism
Maximum batch size of Sequence Parallelism is 4.44 times larger than 1D tensor parallelism
Maximum sequence length of Sequence Parallelism is 1.18 times larger than 1D tensor parallelism
Training throughput of Sequence Parallelism is 1.43 times higher than 1D tensor parallelism

D. sharding and offloading

Evaluated own sharding and offloading module against DeepSpeed
Used DeepSpeed Stage 3 as baseline
Trained GPT-2 model with 10 billion parameters on Wikipedia dataset
Scaled data parallel training from 1 GPU to 8 GPU
Colossal-AI 2.33 times faster than DeepSpeed on 8 GPUs
Flexible system design supports easy combination of different parallelism methods

Link to paper#

Abstract#

Paper Content#

I. introduction#

Ii. background#

A. data parallelism#

B. model parallelism#

√#

C. zero redundancy data parallelism#

D. heterogeneous training#

Iii. design#

A. multi-dimensional model parallelism#

B. enhanced sharding and offloading#

Iv. implementation#

A. experiment setup#

1) convergence#

2) memory efficiency#

4) throughput comparison#

C. sequence parallelism#

D. sharding and offloading#

Link to paper

Abstract

Paper Content

I. introduction

Ii. background

A. data parallelism

B. model parallelism

√

C. zero redundancy data parallelism

D. heterogeneous training

Iii. design

A. multi-dimensional model parallelism

B. enhanced sharding and offloading

Iv. implementation

A. experiment setup

1) convergence

2) memory efficiency

4) throughput comparison

C. sequence parallelism

D. sharding and offloading