Link to paper
The full paper is available here.
You can also find the paper on PapersWithCode here.
Abstract
- Transformer models have pushed deep learning model scale to billions of parameters.
- Choosing the optimal parallel strategy requires domain expertise in both deep learning and parallel computing.
- Colossal-AI system provides a unified interface to scale sequential code of model training to distributed environments.
- Colossal-AI can achieve up to 2.76 times training speedup on large-scale models.
Paper Content
I. introduction
- Deep learning has been successful in many applications
- Neural networks can learn high-dimensional features and make predictions better than humans
- Deep learning models are getting larger
- Traditional training methods are no longer effective
- Distributed training is necessary to enable large-scale model training
- Limited fast memory of hardware is the bottleneck of scaling the model
- Data parallelism is the most popular parallelism method
- Many methods have been proposed to reduce memory footprint
- GShard, FairScale, Megatron-LM and DeepSpeed are the current state-of-the-art systems
- Colossal-AI is an open-source system to democratize complicated distributed training
- Colossal-AI provides the fullest set of acceleration techniques
- Optimized parallelism and heterogeneous training methods are provided
- In-depth analysis was conducted to investigate suitable parallelism strategies
Ii. background
- Transformer architecture has improved deep learning models
- BERT and Vision Transformer are examples of this architecture
A. data parallelism
- Data parallelism is the most common form of parallelism.
- Model is replicated across devices and dataset is split into shards.
- Each shard is allocated to a device and fed to the model.
- Model weights must be synchronized for every training iteration.
B. model parallelism
- Data parallel training uses a copy of the whole model weights on each GPU
- Memory requirement of large-scale models is difficult to fulfill
- Model parameters need to be sharded so each device only holds part of the model weights
- Model parallelism proposed to tackle this problem, with three subcategories: tensor parallelism, pipeline parallelism, and sequence parallelism
- Tensor parallelism shards the tensor over an array of devices and requires a distributed matrix-matrix multiplication algorithm
- Megatron-LM proposed 1D tensor parallelism which splits the linear layer in the row or column dimensions
- 1D tensor parallelism relies on all-reduce operation for collective communication
- 1D tensor parallelism has redundant memory usage in layer inputs and outputs
- More advanced tensor parallelism introduced for large-scale model training: 2D, 2.5D, and 3D tensor parallelism
- 2D tensor parallelism relies on SUMMA and Cannon matrix multiplication algorithm
- Tensor split into N chunks and each chunk owned by one GPU
โ
- Tensor parallelism splits a tensor into chunks to reduce memory consumption
- 2D, 2.5D and 3D tensor parallelism methods exist
- Pipeline parallelism splits the model by layer
- Activation and gradients are communicated between pipeline stages
- Sequence parallelism splits the input data along the sequence dimension
- Ring Self Attention module is used to exchange partial query, key and value embeddings
- Sequence parallelism is orthogonal to Pipeline Parallelism
C. zero redundancy data parallelism
- Memory challenge caused by model parameters
- Optimizer states can take up more memory than model parameters
- N copies of model parameters, gradients and optimizer states in data parallel training
- Zero Redundancy Optimizer partitions model data over different devices
- All-gather used in forward and backward passes
- Each device only holds a partition of gradients and parameters
D. heterogeneous training
- GPUs have less memory capacity than CPUs.
- A typical Nvidia DGX1 workstation has 8 Tesla V100 GPUs with 256 GB of memory and 512 GB of CPU memory.
- Utilizing CPU memory and NVMe SSDs can increase the training capacity of a single node.
- DeepSpeed proposed zero-offload to move tensors from GPU to CPU or NVMe disks.
- It is possible to train a model with billions of parameters on a single GPU.
Iii. design
- Computing resources have become diverse in the past decade
- Difficult to have a universal method for deep learning training due to hardware variance
- Megatron-LM and DeepSpeed can help with this problem
- Colossal-AI has an array of acceleration techniques to cover a wide range of training settings
A. multi-dimensional model parallelism
- Model parallelism reduces memory burden on single GPU
- Colossal-AI provides array of model parallelism methods
- Transformer model relies heavily on linear layer
- Tensor parallelism suitable for acceleration of Transformer model training
- Advanced tensor parallelism has lower communication cost than 1D tensor parallelism
- Colossal-AI includes sequence and pipeline parallelism for hybrid parallelism
B. enhanced sharding and offloading
- DeepSpeed supports zero redundancy training and offloading
- Re-designed sharding and offloading module for better performance
- Unified sharded tensor interface for zero redundancy data parallel training
- Extra optimizations proposed by PatrickStar for memory space
- Re-use fp16 storage space in memory
- Adaptive hybrid Adam optimizer for lower time cost
Iv. implementation
- Colossal-AI is a deep learning system implemented with PyTorch
- Parallel context manager stores meta information about the environment
- Automatically switches to corresponding parallel mode
- User-friendly APIs for model training
A. experiment setup
- Conducted experiments on different hardware
- Used Megatron-LM and DeepSpeed as baselines for experiments
1) convergence
- Conducted experiments with Vision Transformer on ImageNet-1k dataset
- Configured ViT model with 12 Transformer layers, hidden size of 384, 6 attention heads
- Used Jax initialization, AdamW optimizer, 0.003 learning rate, 0.3 weight decay
- Input image shape 224, patch size 16, global batch size 4k, trained for 250 epochs
- Testing accuracy curves of Multi-Dimensional tensor parallelism align with baseline, demonstrating no effect on model convergence
2) memory efficiency
- 2D, 2.5D, and 3D tensor parallelism partition input data, layer weight, and output activation, while 1D tensor parallelism only partitions layer weight.
- 2D, 2.5D, and 3D tensor parallelism have lower memory consumption than 1D tensor parallelism.
- Range tests show that 2.5D and 3D tensor parallelism have 44% and 65% lower memory consumption than 1D tensor parallelism.
- Experiments on System I and II show that 1D tensor parallelism has higher throughput than 2D, 2.5D, and 3D tensor parallelism.
- On System II, 2D and 2.5D tensor parallelism have 40% and 20.6% higher throughput than 1D tensor parallelism.
4) throughput comparison
- Tested performance of tensor parallelism with more GPUs
- Adjusted ViT model configuration due to 16 GB GPU memory
- Results show speedup of advanced tensor parallelism up to 2.76
- Low communication volume and memory efficiency make 2D, 2.5D, and 3D tensor parallelism better for large-scale distributed training
C. sequence parallelism
- Sequence Parallelism is more memory efficient than 1D tensor parallelism
- Maximum batch size of Sequence Parallelism is 4.44 times larger than 1D tensor parallelism
- Maximum sequence length of Sequence Parallelism is 1.18 times larger than 1D tensor parallelism
- Training throughput of Sequence Parallelism is 1.43 times higher than 1D tensor parallelism
D. sharding and offloading
- Evaluated own sharding and offloading module against DeepSpeed
- Used DeepSpeed Stage 3 as baseline
- Trained GPT-2 model with 10 billion parameters on Wikipedia dataset
- Scaled data parallel training from 1 GPU to 8 GPU
- Colossal-AI 2.33 times faster than DeepSpeed on 8 GPUs
- Flexible system design supports easy combination of different parallelism methods