Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

Alpa automates model-parallel training of large deep learning models
Existing model-parallel training systems either require users to create parallelization plans manually or generate them automatically from a limited space of model-parallelism configurations
Alpa views parallelisms as two hierarchical levels: inter-operator and intra-operator parallelisms
Alpa designs compilation passes to automatically derive efficient parallel execution plans
Alpa implements an efficient runtime to orchestrate two-level parallel execution on distributed compute devices
Evaluation shows Alpa generates parallelization plans that match or outperform hand-tuned model-parallel training systems

Paper Content

Introduction

Several recent deep learning advances are a direct result of significant increases in model size
Training large models on distributed clusters requires engineering effort
Tuning parallelization strategy can improve training performance
Automating parallelization of large-scale models would accelerate ML research and production
Complex space of plans grows exponentially with model and cluster size
Recent efforts to automatically parallelize model training are limited
Structuring the plan space hierarchically, into inter-operator and intra-operator parallelism, makes the optimization tractable
Alpa is a compiler system for distributed DL on GPU clusters
Alpa features compilation passes, runtime architecture, and system optimizations
Alpa is evaluated on large models with billions of parameters
Alpa can match specialized systems and achieve speedups compared to hand-tuned systems

Background: distributed deep learning

DL computation is represented by ML frameworks as a dataflow graph.
Edges in the graph represent multi-dimensional tensors.
Nodes are computational operators that transform input tensors into output tensors.
Training a DL model consists of computing a loss, deriving updates, and applying updates to parameters.
Model developers define the dataflow graph.
Execution engine optimizes and executes it on a compute device.
ML parallelization approaches parallelize computation on distributed devices.

Conventional view of ML parallelism

Data parallelism: Training data is partitioned across distributed workers, model is replicated
Operator parallelism: Computation of a specific operator is partitioned across multiple devices
Pipeline parallelism: Different groups of ops from the model graph are placed on different workers
Manual combination of parallelisms: Hand-designed operator and data parallelism configuration
Automatic combination of parallelisms: Intractable space, prior explorations limited to combining data parallelism with one model parallelism approach
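
The data-parallel entry above can be illustrated with a small sketch. It assumes a toy scalar model y = w * x with a mean-squared-error loss; the function names are hypothetical and the averaging step only stands in for the all-reduce a real system would issue. With equal-sized shards, the averaged local gradients match the full-batch gradient.

```python
def grad_mse(w, xs, ys):
    """Gradient of mean((w*x - y)^2) with respect to the scalar weight w."""
    n = len(xs)
    return sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys)) / n


def data_parallel_grad(w, xs, ys, num_workers):
    """Split the batch evenly, compute local gradients, then average them.

    The model (w) is replicated on every worker; only the data is partitioned.
    The final average plays the role of an all-reduce across workers.
    """
    assert len(xs) % num_workers == 0, "batch must divide evenly across workers"
    shard = len(xs) // num_workers
    local_grads = [
        grad_mse(w, xs[r * shard:(r + 1) * shard], ys[r * shard:(r + 1) * shard])
        for r in range(num_workers)
    ]
    return sum(local_grads) / num_workers


xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]
full = grad_mse(0.5, xs, ys)
sharded = data_parallel_grad(0.5, xs, ys, num_workers=2)
```

Here `full` and `sharded` agree, which is why data parallelism leaves the training result unchanged while dividing the per-worker computation.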

Intra- and inter-operator parallelisms

Two categories of parallelization approaches: intra-operator and inter-operator
Intra-operator parallelism involves partitioning operators along tensor axes
Examples of intra-operator parallelism include data parallelism and operator parallelism
Inter-operator parallelism assigns different operators of the graph to execute on distributed devices
Inter-operator parallelism typically uses point-to-point communication between device pairs
Intra-operator parallelism results in collective communication among distributed devices

Overview

Alpa is a compiler that generates model-parallel execution plans
Alpa optimizes the plan at two levels: intra-op and inter-op
Intra-op level minimizes cost of executing a stage on a given device mesh
Inter-op level minimizes inter-op parallelization latency
Hierarchical optimization process generates execution plan consisting of intra-op and inter-op plans
Developers use Python decorator @parallelize to annotate functions to be parallelized
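
The annotation pattern can be sketched without the Alpa library. The decorator below, `parallelize_sketch`, is a hypothetical stand-in and not Alpa's implementation: it only records a placeholder "plan" the first time a function is called with a given argument signature, which is the point where a real compiler would trace the function and search for a parallel plan. The cache key and the plan contents are assumptions made for illustration.

```python
import functools


def parallelize_sketch(fn):
    """Toy stand-in for a parallelizing decorator (NOT Alpa's implementation)."""
    plans = {}  # hypothetical cache: argument signature -> placeholder plan

    @functools.wraps(fn)
    def wrapper(*args):
        key = tuple(len(a) if hasattr(a, "__len__") else None for a in args)
        if key not in plans:
            # A real compiler would trace fn here and derive an execution plan.
            plans[key] = {"fn": fn.__name__, "arg_lengths": key}
        return fn(*args)

    wrapper.num_compilations = lambda: len(plans)
    return wrapper


@parallelize_sketch
def train_step(weights, batch):
    """Plain gradient step for y = w * x with squared error (toy model)."""
    lr = 0.1
    grads = [sum(2.0 * (w * x - y) * x for x, y in batch) / len(batch) for w in weights]
    return [w - lr * g for w, g in zip(weights, grads)]


new_weights = train_step([0.0], [(1.0, 2.0), (2.0, 4.0)])
new_weights = train_step(new_weights, [(3.0, 6.0), (1.0, 2.0)])
```

The training function itself is unchanged apart from the decorator line, and the second call reuses the plan recorded by the first because the argument signature is the same.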

Intra-operator parallelism

Alpa optimizes intra-operator parallelism within a device mesh.
Alpa adopts SPMD-style intra-op parallelism which partitions operators evenly across devices.
Alpa covers data parallelism, ZeRO, Megatron-LM’s operator parallelism, and their combinations.
Alpa formalizes the problem as an integer linear program (ILP), which can be solved efficiently for graphs with tens of thousands of operators.

The space of intra-operator parallelism

Operator in computational graph can be parallelized in multiple ways
Matrix multiplication corresponds to a three-level for-loop over i, j, and k
Loop i, loop j, loop k, or combinations of them can be partitioned across devices
Layout of input tensors must satisfy requirement, otherwise layout conversion is needed
Device mesh is a 2-dimensional logical view of physical devices
Sharding spec defines layout of tensor
Resharding requires cross-device communication
Parallel algorithms of an operator can be derived from math expression
Model graph is represented in XLA’s HLO format with fewer than 80 primitive operators
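
The loop view of matrix multiplication can be made concrete with plain Python lists. This is a minimal sketch in which "devices" are simulated sequentially and the function names are hypothetical. Partitioning loop i needs no communication, while partitioning loop k leaves each device with a partial sum that must be combined, which corresponds to an all-reduce. Both helpers assume the partitioned dimension divides evenly by the device count.

```python
def matmul(a, b):
    """Reference C[i][j] = sum_k A[i][k] * B[k][j] as a three-level loop."""
    n, k_dim, m = len(a), len(b), len(b[0])
    c = [[0.0] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            for k in range(k_dim):
                c[i][j] += a[i][k] * b[k][j]
    return c


def matmul_split_i(a, b, devices):
    """Partition loop i: each device owns a block of rows of A and of C."""
    rows = len(a) // devices
    out = []
    for d in range(devices):
        # Each device computes its own rows of C; no communication is needed.
        out.extend(matmul(a[d * rows:(d + 1) * rows], b))
    return out


def matmul_split_k(a, b, devices):
    """Partition loop k: each device produces a full-size partial sum of C."""
    chunk = len(b) // devices
    partials = [
        matmul([row[d * chunk:(d + 1) * chunk] for row in a],
               b[d * chunk:(d + 1) * chunk])
        for d in range(devices)
    ]
    # Summing the partial results plays the role of an all-reduce.
    n, m = len(a), len(b[0])
    return [[sum(p[i][j] for p in partials) for j in range(m)] for i in range(n)]


a = [[1.0, 2.0], [3.0, 4.0]]
b = [[5.0, 6.0], [7.0, 8.0]]
reference = matmul(a, b)
by_rows = matmul_split_i(a, b, devices=2)
by_k = matmul_split_k(a, b, devices=2)
```

All three results are identical; what differs between the parallel algorithms is the layout they require for the inputs and the communication they incur, which is what the sharding specs and resharding costs capture.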

ILP formulation

Total execution cost of a computational graph is the sum of compute and communication costs on all nodes and resharding costs on all edges
Formulated cost minimization as an ILP and solved it optimally with an off-the-shelf solver
For each node, there is a communication cost vector and a compute cost vector
Defined a one-hot decision vector to represent the algorithm used
Defined a resharding cost matrix for the cost between nodes
Objective is to minimize the compute and communication cost of nodes and the resharding cost of edges
Quadratic term linearized by introducing a new decision vector
Costs estimated by computing number of communicated bytes and dividing by mesh dimension bandwidth
Graph simplified by merging computationally-trivial operators
Post-ILP communication optimizations applied to reduce number of replicated tensors and corresponding computations
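
The objective described above can be evaluated on a toy graph. The sketch below swaps the ILP solver for exhaustive search, which is only feasible for tiny graphs, and all cost numbers are made up for illustration. Each node picks one algorithm (the position of the 1 in its one-hot decision vector); the total cost adds the per-node compute plus communication cost and the per-edge resharding cost selected by the two endpoints.

```python
from itertools import product


def plan_cost(choice, node_cost, edges, reshard):
    """Per-node (compute + communication) cost plus per-edge resharding cost."""
    total = sum(node_cost[v][choice[v]] for v in range(len(node_cost)))
    total += sum(reshard[(v, u)][choice[v]][choice[u]] for v, u in edges)
    return total


def best_plan(node_cost, edges, reshard):
    """Exhaustively try every combination of per-node algorithms (tiny graphs only)."""
    options = [range(len(costs)) for costs in node_cost]
    return min(product(*options),
               key=lambda ch: plan_cost(ch, node_cost, edges, reshard))


# Hypothetical two-node chain; each node has two candidate parallel algorithms.
node_cost = [[4.0, 1.0], [2.0, 3.0]]
edges = [(0, 1)]
# reshard[(v, u)][a][b]: cost of converting the output layout of algorithm a
# on node v into the input layout required by algorithm b on node u.
reshard = {(0, 1): [[0.0, 5.0], [5.0, 0.0]]}

choice = best_plan(node_cost, edges, reshard)
cost = plan_cost(choice, node_cost, edges, reshard)
```

Picking the cheapest algorithm for each node independently would give the choice (1, 0) with cost 8.0, whereas the joint optimum is (1, 1) with cost 4.0. This coupling through the resharding term is why the problem is solved globally rather than node by node.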

Inter-operator parallelism

Develop methods to slice model and device cluster into stage-mesh pairs
Minimize end-to-end pipeline execution latency for entire computational graph

The space for inter-operator parallelism

Computational graph contains a sequence of operators
Operators are sliced into stages
Each stage is assigned to a submesh of a computer cluster
Latency of executing stage is minimized by ILP
Overall latency contains two terms: the total latency of all stages, and the latency of the slowest stage multiplied by B - 1, where B is the number of microbatches
Two additional constraints: colocate forward and backward operators and fully cover cluster mesh
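
The two-term latency can be written as a short function. This sketch assumes the form used in the paper's Eq. 2, in which the slowest stage is weighted by B - 1 for B microbatches; the function name and the example latencies are hypothetical.

```python
def pipeline_latency(stage_latencies, num_microbatches):
    """Sum of all stage latencies plus (B - 1) times the slowest stage."""
    total = sum(stage_latencies)
    slowest = max(stage_latencies)
    return total + (num_microbatches - 1) * slowest


unbalanced = pipeline_latency([2.0, 3.0, 1.0], num_microbatches=4)
balanced = pipeline_latency([2.0, 2.0, 2.0], num_microbatches=4)
```

Both pipelines do the same total work per microbatch (6.0), but the unbalanced one takes 15.0 versus 12.0 because its slowest stage dominates the second term. Balancing stages therefore matters more as the number of microbatches grows.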

DP formulation

Submeshes of sizes (1, 1), (1, 2), (1, 4), ..., (1, 2^m) and (2, M), (3, M), ..., (N, M) can always fully cover the cluster mesh
Devices are assigned to larger submeshes first and then to smaller ones
Submeshes with n > 1 and m < M lead to inferior results
Algorithm finds T* in Eq. 2 using a DP algorithm
Intra-op pass determines stage latency and memory required on each device
Complexity of DP algorithm is O(K^5 NM (N + log(M))^2)
Early pruning and operator clustering used to speed up DP algorithm
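
A simplified version of the stage-slicing search can be sketched as follows. It is not the paper's algorithm in full: submeshes are reduced to plain device counts, memory constraints are ignored, and the stage latency comes from a caller-supplied function rather than from the intra-op pass. The function name `slice_pipeline` and the toy cost model are assumptions. The sketch keeps the overall structure of enumerating a bound on the slowest stage and, for each bound, running a DP that minimizes the summed stage latency while using every device exactly once.

```python
import math
from functools import lru_cache


def slice_pipeline(num_ops, num_devices, num_microbatches, stage_latency, mesh_sizes):
    """Return (best_latency, stages); each stage is (first_op, last_op, devices)."""
    # Every achievable stage latency is a candidate bound for the slowest stage.
    candidates = sorted({
        stage_latency(i, j, n)
        for i in range(num_ops) for j in range(i, num_ops) for n in mesh_sizes
    })
    best = (math.inf, [])
    for t_max in candidates:
        @lru_cache(maxsize=None)
        def f(k, d):
            # Minimum summed latency for operators k.. using exactly d devices,
            # with every stage latency bounded by t_max.
            if k == num_ops:
                return (0.0, ()) if d == 0 else (math.inf, ())
            result = (math.inf, ())
            for i in range(k, num_ops):
                for n in mesh_sizes:
                    if n > d:
                        continue
                    t = stage_latency(k, i, n)
                    if t > t_max:
                        continue
                    rest, plan = f(i + 1, d - n)
                    if t + rest < result[0]:
                        result = (t + rest, ((k, i, n),) + plan)
            return result

        total, plan = f(0, num_devices)
        latency = total + (num_microbatches - 1) * t_max
        if latency < best[0]:
            best = (latency, list(plan))
    return best


work = [4.0, 4.0, 4.0, 4.0]  # hypothetical per-operator compute cost


def toy_latency(first, last, devices):
    # Assumed cost model: perfect compute scaling plus a communication
    # penalty that grows with the number of devices in the stage.
    return sum(work[first:last + 1]) / devices + 1.0 * (devices - 1)


many_latency, many_stages = slice_pipeline(4, 4, 8, toy_latency, mesh_sizes=(1, 2, 4))
one_latency, one_stages = slice_pipeline(4, 4, 1, toy_latency, mesh_sizes=(1, 2, 4))
```

Under this toy cost model, eight microbatches favor four single-device stages, while a single microbatch favors one stage spanning all four devices. The same search yields different stage-mesh pairs depending on how the two latency terms trade off.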

Parallelism orchestration

Alpa compiles each stage against its assigned device mesh
Automatically inserts collective communication primitives
Implements an additional parallelism orchestration pass to address cross-mesh communication
Cross-mesh resharding is a many-to-many multicast problem
Generates static execution instructions to launch training on clusters
Adopts an MPMD-style runtime to orchestrate inter-op parallel execution
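
The many-to-many nature of cross-mesh resharding shows up even in one dimension. The sketch below is a deliberate simplification: it assumes a 1-D tensor sharded evenly and without replication on both the sending and the receiving mesh, and it only lists the transfers needed rather than scheduling or optimizing them. The function names are hypothetical.

```python
def shard_ranges(length, num_devices):
    """Even 1-D sharding: device d holds the half-open index range [d*s, (d+1)*s)."""
    assert length % num_devices == 0, "length must divide evenly across devices"
    size = length // num_devices
    return [(d * size, (d + 1) * size) for d in range(num_devices)]


def resharding_tasks(length, num_senders, num_receivers):
    """List the (sender, receiver, start, stop) transfers needed to re-shard."""
    tasks = []
    for r, (r0, r1) in enumerate(shard_ranges(length, num_receivers)):
        for s, (s0, s1) in enumerate(shard_ranges(length, num_senders)):
            lo, hi = max(r0, s0), min(r1, s1)
            if lo < hi:  # the two shards overlap, so this slice must move
                tasks.append((s, r, lo, hi))
    return tasks


tasks = resharding_tasks(length=12, num_senders=2, num_receivers=3)
```

With 2 senders and 3 receivers, the middle receiver needs data from both senders and each sender feeds two receivers, so the transfer pattern is neither one-to-one nor a standard collective within a single mesh.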

Limitations and discussion

Advantages of Alpa’s view of parallelisms
Flexibility of Alpa: uneven operators, different shapes, non-uniform configuration
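Limitations of Alpa: cross-stage communication cost is not modeled, the number of microbatches is a hyperparameter that is not optimized, pipelines follow a static linear schedule, and overlapping of computation and communication is not optimized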

Evaluation

Alpa is implemented using 16K LoC in Python and 6K LoC in C++.
Alpa uses JAX as the frontend and XLA as the backend.
Alpa is evaluated on large-scale models with billions of parameters.
Alpa is compared to two state-of-the-art distributed systems.
Ablation studies of optimization algorithms are performed.
A case study of execution plans is included.

End-to-end performance

Target three types of models with homogeneous and heterogeneous architectures
Increase model size along with number of GPUs
Compare Alpa against strong baseline systems
Measure training throughput
GPT-3: Alpa achieves slightly better scaling than Megatron-LM
MoE: Alpa combines intra- and inter-operator parallelism
Wide-ResNet: Alpa achieves scalable performance on 32 GPUs
Baselines “PP-DP” and “Inter-op only” run out of memory
“Intra-only” requires a lot of communication on slow connections

Intra-op parallelism ablation study

We studied the effectiveness of an intra-operator parallelism optimization algorithm
We compared our solution to alternatives such as ZeRO optimizer and rule-based partitioning strategies
We ran a weak scaling benchmark on one AWS p3.16xlarge instance with 8 GPUs
We compared automatic solutions for intra-operator parallelism
Results showed that our solution performed best in all cases and maintained a near-linear scaling

Inter-op parallelism ablation study

Studied effectiveness of inter-operator parallelism optimization algorithm
Used “DP” algorithm for experiments
Compared “DP” algorithm with two rule-based algorithms
“DP” outperformed “Equal operator”
Whether “DP” outperforms “Equal layer” depends on model architecture
On Wide-ResNet, “DP” outperformed “Equal operator” and “Equal layer” by 2.6x and 1.6x respectively

Cross-mesh resharding

Evaluated generalized local all-gather optimization for cross-mesh resharding between meshes with different shapes
Used Wide-ResNet for evaluation
A synthetic case that sends only 1 signal byte between stages serves as the upper bound of performance

Case study: Wide-ResNet

Alpa finds parallelization strategies for Wide-ResNet on 16 GPUs
Visualizations of results on 4 and 8 GPUs are included in Appendix C
On 4 GPUs, Alpa uses only intra-operator parallelism
On 16 GPUs, Alpa slices the model into 3 stages and assigns 4, 4, 8 GPUs to each stage
Data parallelism is preferred in the first two stages
In the third stage, the ILP solver finds a non-trivial way of partitioning the convolution operators

Related work

Horovod and PyTorch DDP are data-parallel training systems
BytePS unifies all-reduce and parameter servers
AutoDist uses learning-based approaches
ZeRO reduces replicated tensors
MiCS minimizes communication scale
Alpa reduces data parallelism to a special case
Mesh-TensorFlow, GSPMD and OneFlow provide annotation APIs
ColocRL and GPipe use different approaches for model parallelism
PipeDream improves GPipe
Tofu, FlexFlow, TensorOpt and Varuna search for model-parallel plans
Memory optimization, communication compression and low-precision training can be incorporated into Alpa
Compilers for deep learning optimize execution of DL models
Distributed tensor computation used for linear algebra and stencil computations

Conclusion

Alpa is a new architecture for automated model-parallel distributed training
Alpa uses a hierarchical space and compilation passes to create efficient parallel execution plans
Alpa orchestrates the parallel execution on distributed compute devices on two different granularities
Alpa will make distributed model-parallel learning easier and faster
Theorem 1 proves that a solution can always be found to cover the cluster mesh
GPT-3 models use sequence length = 1024 and vocabulary size = 51200
GShard MoE models use sequence length = 1024 and vocabulary size = 32000
Alpa’s API for JAX uses a Python decorator @parallelize to annotate functions that need to be parallelized
Sharding specs of a 2-dimensional tensor on a 2 × 2 device mesh
Inter-op pass summary
Cross-mesh resharding
End-to-end evaluation results
Inter-operator parallelism ablation study
Alpa’s compilation time scales linearly with model size and number of GPUs
Compilation time breakdown for the largest GPT model
Parallel strategy of Wide-ResNet on 4 and 8 GPUs
Models used in the end-to-end evaluation
