Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

  • Transformer models have pushed deep learning model scale to billions of parameters.
  • Choosing the optimal parallel strategy requires domain expertise in both deep learning and parallel computing.
  • Colossal-AI system provides a unified interface to scale sequential code of model training to distributed environments.
  • Colossal-AI can achieve up to 2.76 times training speedup on large-scale models.

Paper Content

I. introduction

  • Deep learning has been successful in many applications
  • Neural networks can learn high-dimensional features and make predictions better than humans
  • Deep learning models are getting larger
  • Traditional training methods are no longer effective
  • Distributed training is necessary to enable large-scale model training
  • Limited fast memory of hardware is the bottleneck of scaling the model
  • Data parallelism is the most popular parallelism method
  • Many methods have been proposed to reduce memory footprint
  • GShard, FairScale, Megatron-LM and DeepSpeed are the current state-of-the-art systems
  • Colossal-AI is an open-source system to democratize complicated distributed training
  • Colossal-AI provides the fullest set of acceleration techniques
  • Optimized parallelism and heterogeneous training methods are provided
  • In-depth analysis was conducted to investigate suitable parallelism strategies

Ii. background

  • Transformer architecture has improved deep learning models
  • BERT and Vision Transformer are examples of this architecture

A. data parallelism

  • Data parallelism is the most common form of parallelism.
  • Model is replicated across devices and dataset is split into shards.
  • Each shard is allocated to a device and fed to the model.
  • Model weights must be synchronized for every training iteration.

B. model parallelism

  • Data parallel training uses a copy of the whole model weights on each GPU
  • Memory requirement of large-scale models is difficult to fulfill
  • Model parameters need to be sharded so each device only holds part of the model weights
  • Model parallelism proposed to tackle this problem, with three subcategories: tensor parallelism, pipeline parallelism, and sequence parallelism
  • Tensor parallelism shards the tensor over an array of devices and requires a distributed matrix-matrix multiplication algorithm
  • Megatron-LM proposed 1D tensor parallelism which splits the linear layer in the row or column dimensions
  • 1D tensor parallelism relies on all-reduce operation for collective communication
  • 1D tensor parallelism has redundant memory usage in layer inputs and outputs
  • More advanced tensor parallelism introduced for large-scale model training: 2D, 2.5D, and 3D tensor parallelism
  • 2D tensor parallelism relies on SUMMA and Cannon matrix multiplication algorithm
  • Tensor split into N chunks and each chunk owned by one GPU

โˆš

  • Tensor parallelism splits a tensor into chunks to reduce memory consumption
  • 2D, 2.5D and 3D tensor parallelism methods exist
  • Pipeline parallelism splits the model by layer
  • Activation and gradients are communicated between pipeline stages
  • Sequence parallelism splits the input data along the sequence dimension
  • Ring Self Attention module is used to exchange partial query, key and value embeddings
  • Sequence parallelism is orthogonal to Pipeline Parallelism

C. zero redundancy data parallelism

  • Memory challenge caused by model parameters
  • Optimizer states can take up more memory than model parameters
  • N copies of model parameters, gradients and optimizer states in data parallel training
  • Zero Redundancy Optimizer partitions model data over different devices
  • All-gather used in forward and backward passes
  • Each device only holds a partition of gradients and parameters

D. heterogeneous training

  • GPUs have less memory capacity than CPUs.
  • A typical Nvidia DGX1 workstation has 8 Tesla V100 GPUs with 256 GB of memory and 512 GB of CPU memory.
  • Utilizing CPU memory and NVMe SSDs can increase the training capacity of a single node.
  • DeepSpeed proposed zero-offload to move tensors from GPU to CPU or NVMe disks.
  • It is possible to train a model with billions of parameters on a single GPU.

Iii. design

  • Computing resources have become diverse in the past decade
  • Difficult to have a universal method for deep learning training due to hardware variance
  • Megatron-LM and DeepSpeed can help with this problem
  • Colossal-AI has an array of acceleration techniques to cover a wide range of training settings

A. multi-dimensional model parallelism

  • Model parallelism reduces memory burden on single GPU
  • Colossal-AI provides array of model parallelism methods
  • Transformer model relies heavily on linear layer
  • Tensor parallelism suitable for acceleration of Transformer model training
  • Advanced tensor parallelism has lower communication cost than 1D tensor parallelism
  • Colossal-AI includes sequence and pipeline parallelism for hybrid parallelism

B. enhanced sharding and offloading

  • DeepSpeed supports zero redundancy training and offloading
  • Re-designed sharding and offloading module for better performance
  • Unified sharded tensor interface for zero redundancy data parallel training
  • Extra optimizations proposed by PatrickStar for memory space
  • Re-use fp16 storage space in memory
  • Adaptive hybrid Adam optimizer for lower time cost

Iv. implementation

  • Colossal-AI is a deep learning system implemented with PyTorch
  • Parallel context manager stores meta information about the environment
  • Automatically switches to corresponding parallel mode
  • User-friendly APIs for model training

A. experiment setup

  • Conducted experiments on different hardware
  • Used Megatron-LM and DeepSpeed as baselines for experiments

1) convergence

  • Conducted experiments with Vision Transformer on ImageNet-1k dataset
  • Configured ViT model with 12 Transformer layers, hidden size of 384, 6 attention heads
  • Used Jax initialization, AdamW optimizer, 0.003 learning rate, 0.3 weight decay
  • Input image shape 224, patch size 16, global batch size 4k, trained for 250 epochs
  • Testing accuracy curves of Multi-Dimensional tensor parallelism align with baseline, demonstrating no effect on model convergence

2) memory efficiency

  • 2D, 2.5D, and 3D tensor parallelism partition input data, layer weight, and output activation, while 1D tensor parallelism only partitions layer weight.
  • 2D, 2.5D, and 3D tensor parallelism have lower memory consumption than 1D tensor parallelism.
  • Range tests show that 2.5D and 3D tensor parallelism have 44% and 65% lower memory consumption than 1D tensor parallelism.
  • Experiments on System I and II show that 1D tensor parallelism has higher throughput than 2D, 2.5D, and 3D tensor parallelism.
  • On System II, 2D and 2.5D tensor parallelism have 40% and 20.6% higher throughput than 1D tensor parallelism.

4) throughput comparison

  • Tested performance of tensor parallelism with more GPUs
  • Adjusted ViT model configuration due to 16 GB GPU memory
  • Results show speedup of advanced tensor parallelism up to 2.76
  • Low communication volume and memory efficiency make 2D, 2.5D, and 3D tensor parallelism better for large-scale distributed training

C. sequence parallelism

  • Sequence Parallelism is more memory efficient than 1D tensor parallelism
  • Maximum batch size of Sequence Parallelism is 4.44 times larger than 1D tensor parallelism
  • Maximum sequence length of Sequence Parallelism is 1.18 times larger than 1D tensor parallelism
  • Training throughput of Sequence Parallelism is 1.43 times higher than 1D tensor parallelism

D. sharding and offloading

  • Evaluated own sharding and offloading module against DeepSpeed
  • Used DeepSpeed Stage 3 as baseline
  • Trained GPT-2 model with 10 billion parameters on Wikipedia dataset
  • Scaled data parallel training from 1 GPU to 8 GPU
  • Colossal-AI 2.33 times faster than DeepSpeed on 8 GPUs
  • Flexible system design supports easy combination of different parallelism methods