Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

  • Deep neural networks (DNN) are increasingly trained over massive GPU accelerators.
  • Contemporary parallelization plan generators rely on empirical rules that couple transformation and scheduling.
  • SuperScaler is a system that facilitates the design and generation of highly flexible parallelization plans.
  • SuperScaler can generate empirical parallelization plans and construct new plans that achieve up to 3.5X speedup.

Paper Content

Introduction

  • DNNs have grown significantly in size over the past few years
  • GPUs are necessary for training large models
  • Single GPU memory has not kept up with model sizes
  • Parallel GPUs are used to distribute model weights
  • Efficiency on distributed GPU clusters is a major research problem
  • PyTorch and TensorFlow simplify expressing DNN architecture
  • Data flow graph (DFG) is a directed acyclic graph (DAG)
  • DFG is executed numerous times with different inputs
  • Limited compute power and memory of single GPU requires partitioning large operators
  • Parallelization plan is needed to balance trade-offs between spatial and temporal scheduling
  • Granularity of operator aggregation and mapping to pipeline stages is important
  • Existing systems use predefined, empirical parallelization plans
  • SuperScaler helps developers design and generate flexible parallelization plans
  • SuperScaler has 3 phases: partitioning, scheduling, and materialization
  • SuperScaler can achieve 3.5x speedup compared to existing systems

Background and motivation

  • Parallel deep learning training relies on parallelization plans composed by empirical rules
  • Data parallelism assumes the whole model can fit in a single device
  • Tensor parallelism partitions a model and places it on disjoint devices
  • Pipeline parallelism groups layers of a model into stages and temporally schedules them
  • Emerging DNN models require new parallelization plans
  • Co-shard partitions the model along its multi-head dimension and co-locates it in one GPU

Design

  • SuperScaler allows developers to focus on model partitioning and space-time scheduling.
  • SuperScaler takes a DNN model computation graph as input.
  • SuperScaler exploits inherent parallel of DNN model and tracks data relations.
  • SuperScaler performs space-time scheduling and builds data dependency.
  • SuperScaler materializes data dependency into communications and generates parallel execution.

Operator transformation

  • DNN models can be partitioned into finer-grain tasks to exploit parallelism
  • SuperScaler performs this operation with op-trans
  • op-trans partitions an operator and its input/output data tensors into a set of functional equivalent operators and tensors
  • vTensor tracks the changing data dependency during operator transformation
  • op-trans partitions vTensors and leaves pTensors unchanged
  • SuperScaler tracks data dependency through vTensor and masks

Space-time scheduling

  • SuperScaler introduces two functions to enable flexible space-time scheduling
  • SuperScaler records assignments by annotating the data flow graph
  • SuperScaler performs scheduling validation to avoid potential deadlock

Dependency materialization

  • Operators in a Super-Scaler graph may have mismatched input and output vTensors.
  • Data dependency materialization fixes this by identifying the overlapped portion of producer and consumer vTensors, inserting split, send-recv, and concat/reduce operators, and recording the changes in the Super-Scaler graph.

Exploring more parallelization plans

  • SuperScaler can support existing and new parallelization plans
  • Algorithm 1 shows an example sProgram for data parallelism
  • SuperScaler captures operator type and dimension information from DFG
  • mBART is a language translation model with imbalance layers
  • Existing pipeline parallelisms can only place different stages on disjoint devices
  • Interlaced Pipeline is a tailored parallelism for mBART
  • Algorithm 2 shows an sProgram for Interlaced Pipeline
  • Decoupled primitives of op-trans, op-assign and op-order are expressive enough to cover all parallelization plans

Communication optimization

  • SuperScaler optimizes communications by aligning with efficient communication collectives
  • SuperScaler can replace a group of peer-to-peer communications with high-performance collectives
  • For complex communication patterns, SuperScaler uses an algorithm to compose the communication with multiple communication primitives
  • RVD representation is used to express input or output tensors
  • Communication primitive search over RVD graph is used to connect producer RVD to consumer RVD
  • Dijkstra algorithm is used to search the shortest path from producer RVD to consumer RVD

Implementation

  • Implemented SuperScaler on PyTorch
  • Captured computation graph using TorchScript
  • Autograd for forward operator transformation
  • Op-trans assistant for transformation algorithms
  • Generate code given SuperScaler graph
  • Support 15 out of 18 parallelization plans
  • Memory optimization techniques supported

Evaluation

  • Super-Scaler improves performance of DNN training systems
  • Super-Scaler improves memory usage, computation efficiency and communication

Experimental setup

  • 32 NVIDIA Tesla V100 GPUs
  • Connected via NVLink
  • Interconnected with 100 Gbps InfiniBand network
  • 4 emerging models from diverse domains
  • Swin-Transformer, GPT-3, mBART, AlphaFold2
  • 4 parallel DNN training systems
  • Megatron-LM, Alpa, DeepSpeed, DAP+DP
  • Hyper-parameters tuned for optimal settings

End-to-end performance

  • SuperScaler enables new parallelization plans to improve performance
  • Co-shard partitions attention heads and feedforward hidden dimensions
  • Interlaced pipeline replaces 1F1B pipeline
  • 3F1B pipeline scheduling used for AlphaFold2
  • Batch size set to 128 for AlphaFold2, 512 for other models
  • SuperScaler achieves up to 3.5x speedup on Swin Transformer and 1.5x speedup on GPT-3
  • SuperScaler achieves up to 2.8x speedup on mBART and 1.4x speedup on AlphaFold2
  • Megatron-LM and Alpa require more GPUs to fit larger model size
  • DeepSpeed suffers from extra communication cost
  • SuperScaler leverages 3F1B to partition model weights across devices

Reducing memory with decoupling

  • Decoupling transformation and scheduling allows for more flexible scheduling choices.
  • Co-shard, recompute and ZeRO3-Offload are memory-saving techniques.
  • Co-shard can support training on single GPU with 1.3B model parameters.
  • Co-shard can support 1.2x and 1.7x the sequence length of ZeRO3-Offload and Recompute, respectively.

Improving efficiency with co-scheduling

  • SuperScaler can explore new parallelization plans for better resource utilization
  • Breakdown analysis of mBART shows improved resource utilization with less idle time
  • SuperScaler compared to Megatron-LM and Interlaced-block
  • Interlaced-block achieves 1.9x performance speedup over Megatron-LM
  • SuperScaler achieves 1.5x speedup over Interlaced-block by reducing bubble time

Serving diverse communication patterns

  • GPT-3 model optimized by intra-RVD and inter-RVD
  • End-to-end GPT-3 training with strong scaling
  • Intra-RVD significantly outperforms P2P send/recv
  • Inter-RVD further improves scalability
  • Inter-RVD search can improve P2P send/recv up to 57x
  • Parallel DNN training is widely used
  • Memory optimizations are used to exploit large-scale model training
  • Systems combine multiple parallelisms and memory optimizations
  • SuperScaler provides a different angle of parallelization
  • Distributed tensor abstraction and communication is used
  • Parallelization plan search is used to improve training performance
  • SuperScaler works with other systems

Conclusions

  • SuperScaler is a generation engine of parallelization plans used for parallel deep learning training
  • Existing training systems rely on empirical parallelization plans which have limited flexibility
  • SuperScaler decouples intertwined factors of model partitioning, scheduling, and data dependency preserving
  • SuperScaler can support existing popular parallelization plans and explore new ones that improve training performance
  • SuperScaler system workflow includes op-trans, vTensors, and interlaced pipeline scheduling
  • Communication collectives as RVD transitions are shown in Figure 10
  • End-to-end evaluation, GPT-3 single-GPU memory consumption and latency, mBART end-to-end performance breakdown, and inter-RVD latency are shown in Figures 12, 14, 15, and 17 respectively
  • 18 cases of 6 categories for RVD search benchmark are shown in Figure 18