Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

Deep neural networks (DNN) are increasingly trained over massive GPU accelerators.
Contemporary parallelization plan generators rely on empirical rules that couple transformation and scheduling.
SuperScaler is a system that facilitates the design and generation of highly flexible parallelization plans.
SuperScaler can generate empirical parallelization plans and construct new plans that achieve up to 3.5X speedup.

Paper Content

Introduction

DNNs have grown significantly in size over the past few years
GPUs are necessary for training large models
Single GPU memory has not kept up with model sizes
Parallel GPUs are used to distribute model weights
Efficiency on distributed GPU clusters is a major research problem
PyTorch and TensorFlow simplify expressing DNN architecture
Data flow graph (DFG) is a directed acyclic graph (DAG)
DFG is executed numerous times with different inputs
Limited compute power and memory of single GPU requires partitioning large operators
Parallelization plan is needed to balance trade-offs between spatial and temporal scheduling
Granularity of operator aggregation and mapping to pipeline stages is important
Existing systems use predefined, empirical parallelization plans
SuperScaler helps developers design and generate flexible parallelization plans
SuperScaler has 3 phases: partitioning, scheduling, and materialization
SuperScaler can achieve 3.5x speedup compared to existing systems

Background and motivation

Parallel deep learning training relies on parallelization plans composed by empirical rules
Data parallelism assumes the whole model can fit in a single device
Tensor parallelism partitions a model and places it on disjoint devices
Pipeline parallelism groups layers of a model into stages and temporally schedules them
Emerging DNN models require new parallelization plans
Co-shard partitions the model along its multi-head dimension and co-locates it in one GPU

Design

SuperScaler allows developers to focus on model partitioning and space-time scheduling.
SuperScaler takes a DNN model computation graph as input.
SuperScaler exploits inherent parallel of DNN model and tracks data relations.
SuperScaler performs space-time scheduling and builds data dependency.
SuperScaler materializes data dependency into communications and generates parallel execution.

Operator transformation

DNN models can be partitioned into finer-grain tasks to exploit parallelism
SuperScaler performs this operation with op-trans
op-trans partitions an operator and its input/output data tensors into a set of functional equivalent operators and tensors
vTensor tracks the changing data dependency during operator transformation
op-trans partitions vTensors and leaves pTensors unchanged
SuperScaler tracks data dependency through vTensor and masks

Space-time scheduling

SuperScaler introduces two functions to enable flexible space-time scheduling
SuperScaler records assignments by annotating the data flow graph
SuperScaler performs scheduling validation to avoid potential deadlock

Dependency materialization

Operators in a Super-Scaler graph may have mismatched input and output vTensors.
Data dependency materialization fixes this by identifying the overlapped portion of producer and consumer vTensors, inserting split, send-recv, and concat/reduce operators, and recording the changes in the Super-Scaler graph.

Exploring more parallelization plans

SuperScaler can support existing and new parallelization plans
Algorithm 1 shows an example sProgram for data parallelism
SuperScaler captures operator type and dimension information from DFG
mBART is a language translation model with imbalance layers
Existing pipeline parallelisms can only place different stages on disjoint devices
Interlaced Pipeline is a tailored parallelism for mBART
Algorithm 2 shows an sProgram for Interlaced Pipeline
Decoupled primitives of op-trans, op-assign and op-order are expressive enough to cover all parallelization plans

Communication optimization

SuperScaler optimizes communications by aligning with efficient communication collectives
SuperScaler can replace a group of peer-to-peer communications with high-performance collectives
For complex communication patterns, SuperScaler uses an algorithm to compose the communication with multiple communication primitives
RVD representation is used to express input or output tensors
Communication primitive search over RVD graph is used to connect producer RVD to consumer RVD
Dijkstra algorithm is used to search the shortest path from producer RVD to consumer RVD

Implementation

Implemented SuperScaler on PyTorch
Captured computation graph using TorchScript
Autograd for forward operator transformation
Op-trans assistant for transformation algorithms
Generate code given SuperScaler graph
Support 15 out of 18 parallelization plans
Memory optimization techniques supported

Evaluation

Super-Scaler improves performance of DNN training systems
Super-Scaler improves memory usage, computation efficiency and communication

Experimental setup

32 NVIDIA Tesla V100 GPUs
Connected via NVLink
Interconnected with 100 Gbps InfiniBand network
4 emerging models from diverse domains
Swin-Transformer, GPT-3, mBART, AlphaFold2
4 parallel DNN training systems
Megatron-LM, Alpa, DeepSpeed, DAP+DP
Hyper-parameters tuned for optimal settings

End-to-end performance

SuperScaler enables new parallelization plans to improve performance
Co-shard partitions attention heads and feedforward hidden dimensions
Interlaced pipeline replaces 1F1B pipeline
3F1B pipeline scheduling used for AlphaFold2
Batch size set to 128 for AlphaFold2, 512 for other models
SuperScaler achieves up to 3.5x speedup on Swin Transformer and 1.5x speedup on GPT-3
SuperScaler achieves up to 2.8x speedup on mBART and 1.4x speedup on AlphaFold2
Megatron-LM and Alpa require more GPUs to fit larger model size
DeepSpeed suffers from extra communication cost
SuperScaler leverages 3F1B to partition model weights across devices

Reducing memory with decoupling

Decoupling transformation and scheduling allows for more flexible scheduling choices.
Co-shard, recompute and ZeRO3-Offload are memory-saving techniques.
Co-shard can support training on single GPU with 1.3B model parameters.
Co-shard can support 1.2x and 1.7x the sequence length of ZeRO3-Offload and Recompute, respectively.

Improving efficiency with co-scheduling

SuperScaler can explore new parallelization plans for better resource utilization
Breakdown analysis of mBART shows improved resource utilization with less idle time
SuperScaler compared to Megatron-LM and Interlaced-block
Interlaced-block achieves 1.9x performance speedup over Megatron-LM
SuperScaler achieves 1.5x speedup over Interlaced-block by reducing bubble time

Serving diverse communication patterns

GPT-3 model optimized by intra-RVD and inter-RVD
End-to-end GPT-3 training with strong scaling
Intra-RVD significantly outperforms P2P send/recv
Inter-RVD further improves scalability
Inter-RVD search can improve P2P send/recv up to 57x

Parallel DNN training is widely used
Memory optimizations are used to exploit large-scale model training
Systems combine multiple parallelisms and memory optimizations
SuperScaler provides a different angle of parallelization
Distributed tensor abstraction and communication is used
Parallelization plan search is used to improve training performance
SuperScaler works with other systems

Conclusions

SuperScaler is a generation engine of parallelization plans used for parallel deep learning training
Existing training systems rely on empirical parallelization plans which have limited flexibility
SuperScaler decouples intertwined factors of model partitioning, scheduling, and data dependency preserving
SuperScaler can support existing popular parallelization plans and explore new ones that improve training performance
SuperScaler system workflow includes op-trans, vTensors, and interlaced pipeline scheduling
Communication collectives as RVD transitions are shown in Figure 10
End-to-end evaluation, GPT-3 single-GPU memory consumption and latency, mBART end-to-end performance breakdown, and inter-RVD latency are shown in Figures 12, 14, 15, and 17 respectively
18 cases of 6 categories for RVD search benchmark are shown in Figure 18

Link to paper#

Abstract#

Paper Content#

Introduction#

Background and motivation#

Design#

Operator transformation#

Space-time scheduling#

Dependency materialization#

Exploring more parallelization plans#

Communication optimization#

Implementation#

Evaluation#

Experimental setup#

End-to-end performance#

Reducing memory with decoupling#

Improving efficiency with co-scheduling#

Serving diverse communication patterns#

Related work#

Conclusions#

Link to paper

Abstract

Paper Content

Introduction

Background and motivation

Design

Operator transformation

Space-time scheduling

Dependency materialization

Exploring more parallelization plans

Communication optimization

Implementation

Evaluation

Experimental setup

End-to-end performance

Reducing memory with decoupling

Improving efficiency with co-scheduling

Serving diverse communication patterns

Related work

Conclusions