Link to paper
The full paper is available here.
You can also find the paper on PapersWithCode here.
Abstract
- Deep neural networks (DNN) are increasingly trained over massive GPU accelerators.
- Contemporary parallelization plan generators rely on empirical rules that couple transformation and scheduling.
- SuperScaler is a system that facilitates the design and generation of highly flexible parallelization plans.
- SuperScaler can generate empirical parallelization plans and construct new plans that achieve up to 3.5X speedup.
Paper Content
Introduction
- DNNs have grown significantly in size over the past few years
- GPUs are necessary for training large models
- Single GPU memory has not kept up with model sizes
- Parallel GPUs are used to distribute model weights
- Efficiency on distributed GPU clusters is a major research problem
- PyTorch and TensorFlow simplify expressing DNN architecture
- Data flow graph (DFG) is a directed acyclic graph (DAG)
- DFG is executed numerous times with different inputs
- Limited compute power and memory of single GPU requires partitioning large operators
- Parallelization plan is needed to balance trade-offs between spatial and temporal scheduling
- Granularity of operator aggregation and mapping to pipeline stages is important
- Existing systems use predefined, empirical parallelization plans
- SuperScaler helps developers design and generate flexible parallelization plans
- SuperScaler has 3 phases: partitioning, scheduling, and materialization
- SuperScaler can achieve 3.5x speedup compared to existing systems
Background and motivation
- Parallel deep learning training relies on parallelization plans composed by empirical rules
- Data parallelism assumes the whole model can fit in a single device
- Tensor parallelism partitions a model and places it on disjoint devices
- Pipeline parallelism groups layers of a model into stages and temporally schedules them
- Emerging DNN models require new parallelization plans
- Co-shard partitions the model along its multi-head dimension and co-locates it in one GPU
Design
- SuperScaler allows developers to focus on model partitioning and space-time scheduling.
- SuperScaler takes a DNN model computation graph as input.
- SuperScaler exploits inherent parallel of DNN model and tracks data relations.
- SuperScaler performs space-time scheduling and builds data dependency.
- SuperScaler materializes data dependency into communications and generates parallel execution.
Operator transformation
- DNN models can be partitioned into finer-grain tasks to exploit parallelism
- SuperScaler performs this operation with op-trans
- op-trans partitions an operator and its input/output data tensors into a set of functional equivalent operators and tensors
- vTensor tracks the changing data dependency during operator transformation
- op-trans partitions vTensors and leaves pTensors unchanged
- SuperScaler tracks data dependency through vTensor and masks
Space-time scheduling
- SuperScaler introduces two functions to enable flexible space-time scheduling
- SuperScaler records assignments by annotating the data flow graph
- SuperScaler performs scheduling validation to avoid potential deadlock
Dependency materialization
- Operators in a Super-Scaler graph may have mismatched input and output vTensors.
- Data dependency materialization fixes this by identifying the overlapped portion of producer and consumer vTensors, inserting split, send-recv, and concat/reduce operators, and recording the changes in the Super-Scaler graph.
Exploring more parallelization plans
- SuperScaler can support existing and new parallelization plans
- Algorithm 1 shows an example sProgram for data parallelism
- SuperScaler captures operator type and dimension information from DFG
- mBART is a language translation model with imbalance layers
- Existing pipeline parallelisms can only place different stages on disjoint devices
- Interlaced Pipeline is a tailored parallelism for mBART
- Algorithm 2 shows an sProgram for Interlaced Pipeline
- Decoupled primitives of op-trans, op-assign and op-order are expressive enough to cover all parallelization plans
Communication optimization
- SuperScaler optimizes communications by aligning with efficient communication collectives
- SuperScaler can replace a group of peer-to-peer communications with high-performance collectives
- For complex communication patterns, SuperScaler uses an algorithm to compose the communication with multiple communication primitives
- RVD representation is used to express input or output tensors
- Communication primitive search over RVD graph is used to connect producer RVD to consumer RVD
- Dijkstra algorithm is used to search the shortest path from producer RVD to consumer RVD
Implementation
- Implemented SuperScaler on PyTorch
- Captured computation graph using TorchScript
- Autograd for forward operator transformation
- Op-trans assistant for transformation algorithms
- Generate code given SuperScaler graph
- Support 15 out of 18 parallelization plans
- Memory optimization techniques supported
Evaluation
- Super-Scaler improves performance of DNN training systems
- Super-Scaler improves memory usage, computation efficiency and communication
Experimental setup
- 32 NVIDIA Tesla V100 GPUs
- Connected via NVLink
- Interconnected with 100 Gbps InfiniBand network
- 4 emerging models from diverse domains
- Swin-Transformer, GPT-3, mBART, AlphaFold2
- 4 parallel DNN training systems
- Megatron-LM, Alpa, DeepSpeed, DAP+DP
- Hyper-parameters tuned for optimal settings
End-to-end performance
- SuperScaler enables new parallelization plans to improve performance
- Co-shard partitions attention heads and feedforward hidden dimensions
- Interlaced pipeline replaces 1F1B pipeline
- 3F1B pipeline scheduling used for AlphaFold2
- Batch size set to 128 for AlphaFold2, 512 for other models
- SuperScaler achieves up to 3.5x speedup on Swin Transformer and 1.5x speedup on GPT-3
- SuperScaler achieves up to 2.8x speedup on mBART and 1.4x speedup on AlphaFold2
- Megatron-LM and Alpa require more GPUs to fit larger model size
- DeepSpeed suffers from extra communication cost
- SuperScaler leverages 3F1B to partition model weights across devices
Reducing memory with decoupling
- Decoupling transformation and scheduling allows for more flexible scheduling choices.
- Co-shard, recompute and ZeRO3-Offload are memory-saving techniques.
- Co-shard can support training on single GPU with 1.3B model parameters.
- Co-shard can support 1.2x and 1.7x the sequence length of ZeRO3-Offload and Recompute, respectively.
Improving efficiency with co-scheduling
- SuperScaler can explore new parallelization plans for better resource utilization
- Breakdown analysis of mBART shows improved resource utilization with less idle time
- SuperScaler compared to Megatron-LM and Interlaced-block
- Interlaced-block achieves 1.9x performance speedup over Megatron-LM
- SuperScaler achieves 1.5x speedup over Interlaced-block by reducing bubble time
Serving diverse communication patterns
- GPT-3 model optimized by intra-RVD and inter-RVD
- End-to-end GPT-3 training with strong scaling
- Intra-RVD significantly outperforms P2P send/recv
- Inter-RVD further improves scalability
- Inter-RVD search can improve P2P send/recv up to 57x
Related work
- Parallel DNN training is widely used
- Memory optimizations are used to exploit large-scale model training
- Systems combine multiple parallelisms and memory optimizations
- SuperScaler provides a different angle of parallelization
- Distributed tensor abstraction and communication is used
- Parallelization plan search is used to improve training performance
- SuperScaler works with other systems
Conclusions
- SuperScaler is a generation engine of parallelization plans used for parallel deep learning training
- Existing training systems rely on empirical parallelization plans which have limited flexibility
- SuperScaler decouples intertwined factors of model partitioning, scheduling, and data dependency preserving
- SuperScaler can support existing popular parallelization plans and explore new ones that improve training performance
- SuperScaler system workflow includes op-trans, vTensors, and interlaced pipeline scheduling
- Communication collectives as RVD transitions are shown in Figure 10
- End-to-end evaluation, GPT-3 single-GPU memory consumption and latency, mBART end-to-end performance breakdown, and inter-RVD latency are shown in Figures 12, 14, 15, and 17 respectively
- 18 cases of 6 categories for RVD search benchmark are shown in Figure 18