Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

  • Alpa automates model-parallel training of large deep learning models
  • Existing model-parallel training systems require manual creation of parallelization plans or limited space of model parallelism configurations
  • Alpa views parallelisms as two hierarchical levels: inter-operator and intra-operator parallelisms
  • Alpa designs compilation passes to automatically derive efficient parallel execution plans
  • Alpa implements an efficient runtime to orchestrate two-level parallel execution on distributed compute devices
  • Evaluation shows Alpa generates parallelization plans that match or outperform hand-tuned model-parallel training systems

Paper Content

Introduction

  • Deep learning advances are due to increases in model size
  • Training large models on distributed clusters requires engineering effort
  • Tuning parallelization strategy can improve training performance
  • Automating parallelization of large-scale models would accelerate ML research and production
  • Complex space of plans grows exponentially with model and cluster size
  • Recent efforts to automatically parallelize model training are limited
  • Hierarchical space of plans can be used to optimize
  • Alpa is a compiler system for distributed DL on GPU clusters
  • Alpa features compilation passes, runtime architecture, and system optimizations
  • Alpa is evaluated on large models with billions of parameters
  • Alpa can match specialized systems and achieve speedups compared to hand-tuned systems

Background: distributed deep learning

  • DL computation is represented by ML frameworks as a dataflow graph.
  • Edges in the graph represent multi-dimensional tensors.
  • Nodes are computational operators that transform input tensors into output tensors.
  • Training a DL model consists of computing a loss, deriving updates, and applying updates to parameters.
  • Model developers define the dataflow graph.
  • Execution engine optimizes and executes it on a compute device.
  • ML parallelization approaches parallelize computation on distributed devices.

Conventional view of ml parallelism

  • Data parallelism: Training data is partitioned across distributed workers, model is replicated
  • Operator parallelism: Computation of a specific operator is partitioned across multiple devices
  • Pipeline parallelism: Different groups of ops from the model graph are placed on different workers
  • Manual combination of parallelisms: Hand-designed operator and data parallelism configuration
  • Automatic combination of parallelisms: Intractable space, prior explorations limited to combining data parallelism with one model parallelism approach

Intra-and inter-operator parallelisms

  • Two categories of parallelization approaches: intra-operator and inter-operator
  • Intra-operator parallelism involves partitioning operators along tensor axes
  • Examples of intra-operator parallelism include data parallelism and operator parallelism
  • Inter-operator parallelism assigns different operators of the graph to execute on distributed devices
  • Inter-operator parallelism typically uses point-to-point communication between device pairs
  • Intra-operator parallelism results in collective communication among distributed devices

Overview

  • Alpa is a compiler that generates model-parallel execution plans
  • Alpa optimizes the plan at two levels: intra-op and inter-op
  • Intra-op level minimizes cost of executing a stage on a given device mesh
  • Inter-op level minimizes inter-op parallelization latency
  • Hierarchical optimization process generates execution plan consisting of intra-op and inter-op plans
  • Developers use Python decorator @parallelize to annotate functions to be parallelized

Intra-operator parallelism

  • Alpa optimizes intra-operator parallelism within a device mesh.
  • Alpa adopts SPMD-style intra-op parallelism which partitions operators evenly across devices.
  • Alpa covers data parallelism, ZeRO, Megatron-LM’s operator parallelism, and their combinations.
  • Alpa formalizes the problem as an integer linear programming (ILP) and can be solved efficiently for graphs with tens of thousands of operators.

The space of intra-operator parallelism

  • Operator in computational graph can be parallelized in multiple ways
  • Parallelization of matrix multiplication requires 3-level for-loop
  • Parallelization of loop i, loop j, loop k, or combinations of them across devices can be done
  • Layout of input tensors must satisfy requirement, otherwise layout conversion is needed
  • Device mesh is a 2-dimensional logical view of physical devices
  • Sharding spec defines layout of tensor
  • Resharding requires cross-device communication
  • Parallel algorithms of an operator can be derived from math expression
  • Model graph is represented in XLA’s HLO format with less than 80 primitive operators

Ilp formulation

  • Total execution cost of a computational graph is the sum of compute and communication costs on all nodes and resharding costs on all edges
  • Formulated cost minimization as an ILP and solved it optimally with an off-the-shelf solver
  • For each node, there is a communication cost vector and a compute cost vector
  • Defined an one-hot decision vector to represent the algorithm used
  • Defined a resharding cost matrix for the cost between nodes
  • Objective is to minimize the compute and communication cost of nodes and the resharding cost of edges
  • Quadratic term linearized by introducing a new decision vector
  • Costs estimated by computing number of communicated bytes and dividing by mesh dimension bandwidth
  • Graph simplified by merging computationally-trivial operators
  • Post-ILP communication optimizations applied to reduce number of replicated tensors and corresponding computations

Inter-operator parallelism

  • Develop methods to slice model and device cluster into stage-mesh pairs
  • Minimize end-to-end pipeline execution latency for entire computational graph

The space for inter-operator parallelism

  • Computational graph contains a sequence of operators
  • Operators are sliced into stages
  • Each stage is assigned to a submesh of a computer cluster
  • Latency of executing stage is minimized by ILP
  • Overall latency contains two terms: total latency of all stages and latency of slowest stage
  • Two additional constraints: colocate forward and backward operators and fully cover cluster mesh

Dp formulation

  • Submeshes of sizes (1, 1), (1, 2), (1, 4) . . . (1, 2 m ) and (2, M), (3, M), . . . , (N, M) can always fully cover the cluster mesh
  • Devices are assigned to larger submeshes first and then to smaller ones
  • Submeshes with n > 1 and m < M lead to inferior results
  • Algorithm finds T* in Eq. 2 using a DP algorithm
  • Intra-op pass determines stage latency and memory required on each device
  • Complexity of DP algorithm is O(K 5 NM(N + log(M)) 2 )
  • Early pruning and operator clustering used to speed up DP algorithm

Parallelism orchestration

  • Alpa compiles each stage against its assigned device mesh
  • Automatically inserts collective communication primitives
  • Implements an additional parallelism orchestration pass to address cross-mesh communication
  • Cross-mesh resharding is a many-to-many multicast problem
  • Generates static execution instructions to launch training on clusters
  • Adopts an MPMD-style runtime to orchestrate inter-op parallel execution

Limitations and discussion

  • Advantages of Alpa’s view of parallelisms
  • Flexibility of Alpa: uneven operators, different shapes, non-uniform configuration
  • Limitations of Alpa: communication cost, hyperparameter, static linear schedule, no overlapping

Evaluation

  • Alpa is implemented using 16K LoC in Python and 6K LoC in C++.
  • Alpa uses Jax and XLA for the frontend and backend.
  • Alpa is evaluated on large-scale models with billions of parameters.
  • Alpa is compared to two state-of-the-art distributed systems.
  • Ablation studies of optimization algorithms are performed.
  • A case study of execution plans is included.

End-to-end performance

  • Target three types of models with homogeneous and heterogeneous architectures
  • Increase model size along with number of GPUs
  • Compare Alpa against strong baseline systems
  • Measure training throughput
  • GPT-3: Alpa achieves slightly better scaling than Megatron-LM
  • MoE: Alpa combines intra- and inter-operator parallelism
  • Wide-ResNet: Alpa achieves scalable performance on 32 GPUs
  • Baselines “PP-DP” and “Inter-op only” run out of memory
  • “Intra-only” requires a lot of communication on slow connections

Intra-op parallelism ablation study

  • We studied the effectiveness of an intra-operator parallelism optimization algorithm
  • We compared our solution to alternatives such as ZeRO optimizer and rule-based partitioning strategies
  • We ran a weak scaling benchmark on one AWS p3.16xlarge instance with 8 GPUs
  • We compared automatic solutions for intra-operator parallelism
  • Results showed that our solution performed best in all cases and maintained a near-linear scaling

Inter-op parallelism ablation study

  • Studied effectiveness of inter-operator parallelism optimization algorithm
  • Used “DP” algorithm for experiments
  • Compared “DP” algorithm with two rule-based algorithms
  • “DP” outperformed “Equal operator”
  • Whether “DP” outperforms “Equal layer” depends on model architecture
  • On Wide-ResNet, “DP” outperformed “Equal operator” and “Equal layer” by 2.6x and 1.6x respectively

Cross-mesh resharding

  • Evaluated generalized local all-gather optimization for cross-mesh resharding between meshes with different shapes
  • Used Wide-ResNet for evaluation
  • Synthetic case of sending 1 signal byte between stages as upper bound of performance

Case study: wide-resnet

  • Alpa finds parallelization strategies for Wide-ResNet on 16 GPUs
  • Visualizations of results on 4 and 8 GPUs are included in Appendix C
  • On 4 GPUs, Alpa uses only intra-operator parallelism
  • On 16 GPUs, Alpa slices the model into 3 stages and assigns 4, 4, 8 GPUs to each stage
  • Data parallelism is preferred in the first two stages
  • In the third stage, the ILP solver finds a non-trivial way of partitioning the convolution operators
  • Horovod and Py-TorchDDP are data-parallel training systems
  • BytePS unifies all-reduce and parameter servers
  • AutoDist uses learning-based approaches
  • ZeRO reduces replicated tensors
  • MiCS minimizes communication scale
  • Alpa reduces data parallelism to a special case
  • Mesh-TensorFlow, GSPMD and OneFlow provide annotation APIs
  • ColocRL and Gpipe use different approaches for model parallelism
  • PipeDream improves GPipe
  • Tofu, FlexFlow, TensorOpt and Varuna search for model-parallel plans
  • Memory optimization, communication compression and low-precision training can be incorporated into Alpa
  • Compilers for deep learning optimize execution of DL models
  • Distributed tensor computation used for linear algebra and stencil computations

Conclusion

  • Alpa is a new architecture for automated model-parallel distributed training
  • Alpa uses a hierarchical space and compilation passes to create efficient parallel execution plans
  • Alpa orchestrates the parallel execution on distributed compute devices on two different granularities
  • Alpa will make distributed model-parallel learning easier and faster
  • Theorem 1 proves that a solution can always be found to cover the cluster mesh
  • GPT-3 models use sequence length = 1024 and vocabulary size = 51200
  • GShard MoE models use sequence length = 1024 and vocabulary size = 32000
  • Alpa’s API for Jax uses a Python decorator @parallelize to annotate functions that need to be parallelized
  • Sharding specs of a 2-dimentional tensor on a 2 × 2 device mesh
  • Inter-op pass summary
  • Cross-mesh resharding
  • End-to-end evaluation results
  • Inter-operator parallelism ablation study
  • Alpa’s compilation time scales linearly with model size and number of GPUs
  • Compilation time breakdown for the largest GPT model
  • Parallel strategy of Wide-ResNet on 4 and 8 GPUs
  • Models used in the end-to-end evaluation