Link to paper
The full paper is available here.
You can also find the paper on PapersWithCode here.
Abstract
- Alpa automates model-parallel training of large deep learning models
- Existing model-parallel training systems require manual creation of parallelization plans or limited space of model parallelism configurations
- Alpa views parallelisms as two hierarchical levels: inter-operator and intra-operator parallelisms
- Alpa designs compilation passes to automatically derive efficient parallel execution plans
- Alpa implements an efficient runtime to orchestrate two-level parallel execution on distributed compute devices
- Evaluation shows Alpa generates parallelization plans that match or outperform hand-tuned model-parallel training systems
Paper Content
Introduction
- Deep learning advances are due to increases in model size
- Training large models on distributed clusters requires engineering effort
- Tuning parallelization strategy can improve training performance
- Automating parallelization of large-scale models would accelerate ML research and production
- Complex space of plans grows exponentially with model and cluster size
- Recent efforts to automatically parallelize model training are limited
- Hierarchical space of plans can be used to optimize
- Alpa is a compiler system for distributed DL on GPU clusters
- Alpa features compilation passes, runtime architecture, and system optimizations
- Alpa is evaluated on large models with billions of parameters
- Alpa can match specialized systems and achieve speedups compared to hand-tuned systems
Background: distributed deep learning
- DL computation is represented by ML frameworks as a dataflow graph.
- Edges in the graph represent multi-dimensional tensors.
- Nodes are computational operators that transform input tensors into output tensors.
- Training a DL model consists of computing a loss, deriving updates, and applying updates to parameters.
- Model developers define the dataflow graph.
- Execution engine optimizes and executes it on a compute device.
- ML parallelization approaches parallelize computation on distributed devices.
Conventional view of ml parallelism
- Data parallelism: Training data is partitioned across distributed workers, model is replicated
- Operator parallelism: Computation of a specific operator is partitioned across multiple devices
- Pipeline parallelism: Different groups of ops from the model graph are placed on different workers
- Manual combination of parallelisms: Hand-designed operator and data parallelism configuration
- Automatic combination of parallelisms: Intractable space, prior explorations limited to combining data parallelism with one model parallelism approach
Intra-and inter-operator parallelisms
- Two categories of parallelization approaches: intra-operator and inter-operator
- Intra-operator parallelism involves partitioning operators along tensor axes
- Examples of intra-operator parallelism include data parallelism and operator parallelism
- Inter-operator parallelism assigns different operators of the graph to execute on distributed devices
- Inter-operator parallelism typically uses point-to-point communication between device pairs
- Intra-operator parallelism results in collective communication among distributed devices
Overview
- Alpa is a compiler that generates model-parallel execution plans
- Alpa optimizes the plan at two levels: intra-op and inter-op
- Intra-op level minimizes cost of executing a stage on a given device mesh
- Inter-op level minimizes inter-op parallelization latency
- Hierarchical optimization process generates execution plan consisting of intra-op and inter-op plans
- Developers use Python decorator @parallelize to annotate functions to be parallelized
Intra-operator parallelism
- Alpa optimizes intra-operator parallelism within a device mesh.
- Alpa adopts SPMD-style intra-op parallelism which partitions operators evenly across devices.
- Alpa covers data parallelism, ZeRO, Megatron-LM’s operator parallelism, and their combinations.
- Alpa formalizes the problem as an integer linear programming (ILP) and can be solved efficiently for graphs with tens of thousands of operators.
The space of intra-operator parallelism
- Operator in computational graph can be parallelized in multiple ways
- Parallelization of matrix multiplication requires 3-level for-loop
- Parallelization of loop i, loop j, loop k, or combinations of them across devices can be done
- Layout of input tensors must satisfy requirement, otherwise layout conversion is needed
- Device mesh is a 2-dimensional logical view of physical devices
- Sharding spec defines layout of tensor
- Resharding requires cross-device communication
- Parallel algorithms of an operator can be derived from math expression
- Model graph is represented in XLA’s HLO format with less than 80 primitive operators
Ilp formulation
- Total execution cost of a computational graph is the sum of compute and communication costs on all nodes and resharding costs on all edges
- Formulated cost minimization as an ILP and solved it optimally with an off-the-shelf solver
- For each node, there is a communication cost vector and a compute cost vector
- Defined an one-hot decision vector to represent the algorithm used
- Defined a resharding cost matrix for the cost between nodes
- Objective is to minimize the compute and communication cost of nodes and the resharding cost of edges
- Quadratic term linearized by introducing a new decision vector
- Costs estimated by computing number of communicated bytes and dividing by mesh dimension bandwidth
- Graph simplified by merging computationally-trivial operators
- Post-ILP communication optimizations applied to reduce number of replicated tensors and corresponding computations
Inter-operator parallelism
- Develop methods to slice model and device cluster into stage-mesh pairs
- Minimize end-to-end pipeline execution latency for entire computational graph
The space for inter-operator parallelism
- Computational graph contains a sequence of operators
- Operators are sliced into stages
- Each stage is assigned to a submesh of a computer cluster
- Latency of executing stage is minimized by ILP
- Overall latency contains two terms: total latency of all stages and latency of slowest stage
- Two additional constraints: colocate forward and backward operators and fully cover cluster mesh
Dp formulation
- Submeshes of sizes (1, 1), (1, 2), (1, 4) . . . (1, 2 m ) and (2, M), (3, M), . . . , (N, M) can always fully cover the cluster mesh
- Devices are assigned to larger submeshes first and then to smaller ones
- Submeshes with n > 1 and m < M lead to inferior results
- Algorithm finds T* in Eq. 2 using a DP algorithm
- Intra-op pass determines stage latency and memory required on each device
- Complexity of DP algorithm is O(K 5 NM(N + log(M)) 2 )
- Early pruning and operator clustering used to speed up DP algorithm
Parallelism orchestration
- Alpa compiles each stage against its assigned device mesh
- Automatically inserts collective communication primitives
- Implements an additional parallelism orchestration pass to address cross-mesh communication
- Cross-mesh resharding is a many-to-many multicast problem
- Generates static execution instructions to launch training on clusters
- Adopts an MPMD-style runtime to orchestrate inter-op parallel execution
Limitations and discussion
- Advantages of Alpa’s view of parallelisms
- Flexibility of Alpa: uneven operators, different shapes, non-uniform configuration
- Limitations of Alpa: communication cost, hyperparameter, static linear schedule, no overlapping
Evaluation
- Alpa is implemented using 16K LoC in Python and 6K LoC in C++.
- Alpa uses Jax and XLA for the frontend and backend.
- Alpa is evaluated on large-scale models with billions of parameters.
- Alpa is compared to two state-of-the-art distributed systems.
- Ablation studies of optimization algorithms are performed.
- A case study of execution plans is included.
End-to-end performance
- Target three types of models with homogeneous and heterogeneous architectures
- Increase model size along with number of GPUs
- Compare Alpa against strong baseline systems
- Measure training throughput
- GPT-3: Alpa achieves slightly better scaling than Megatron-LM
- MoE: Alpa combines intra- and inter-operator parallelism
- Wide-ResNet: Alpa achieves scalable performance on 32 GPUs
- Baselines “PP-DP” and “Inter-op only” run out of memory
- “Intra-only” requires a lot of communication on slow connections
Intra-op parallelism ablation study
- We studied the effectiveness of an intra-operator parallelism optimization algorithm
- We compared our solution to alternatives such as ZeRO optimizer and rule-based partitioning strategies
- We ran a weak scaling benchmark on one AWS p3.16xlarge instance with 8 GPUs
- We compared automatic solutions for intra-operator parallelism
- Results showed that our solution performed best in all cases and maintained a near-linear scaling
Inter-op parallelism ablation study
- Studied effectiveness of inter-operator parallelism optimization algorithm
- Used “DP” algorithm for experiments
- Compared “DP” algorithm with two rule-based algorithms
- “DP” outperformed “Equal operator”
- Whether “DP” outperforms “Equal layer” depends on model architecture
- On Wide-ResNet, “DP” outperformed “Equal operator” and “Equal layer” by 2.6x and 1.6x respectively
Cross-mesh resharding
- Evaluated generalized local all-gather optimization for cross-mesh resharding between meshes with different shapes
- Used Wide-ResNet for evaluation
- Synthetic case of sending 1 signal byte between stages as upper bound of performance
Case study: wide-resnet
- Alpa finds parallelization strategies for Wide-ResNet on 16 GPUs
- Visualizations of results on 4 and 8 GPUs are included in Appendix C
- On 4 GPUs, Alpa uses only intra-operator parallelism
- On 16 GPUs, Alpa slices the model into 3 stages and assigns 4, 4, 8 GPUs to each stage
- Data parallelism is preferred in the first two stages
- In the third stage, the ILP solver finds a non-trivial way of partitioning the convolution operators
Related work
- Horovod and Py-TorchDDP are data-parallel training systems
- BytePS unifies all-reduce and parameter servers
- AutoDist uses learning-based approaches
- ZeRO reduces replicated tensors
- MiCS minimizes communication scale
- Alpa reduces data parallelism to a special case
- Mesh-TensorFlow, GSPMD and OneFlow provide annotation APIs
- ColocRL and Gpipe use different approaches for model parallelism
- PipeDream improves GPipe
- Tofu, FlexFlow, TensorOpt and Varuna search for model-parallel plans
- Memory optimization, communication compression and low-precision training can be incorporated into Alpa
- Compilers for deep learning optimize execution of DL models
- Distributed tensor computation used for linear algebra and stencil computations
Conclusion
- Alpa is a new architecture for automated model-parallel distributed training
- Alpa uses a hierarchical space and compilation passes to create efficient parallel execution plans
- Alpa orchestrates the parallel execution on distributed compute devices on two different granularities
- Alpa will make distributed model-parallel learning easier and faster
- Theorem 1 proves that a solution can always be found to cover the cluster mesh
- GPT-3 models use sequence length = 1024 and vocabulary size = 51200
- GShard MoE models use sequence length = 1024 and vocabulary size = 32000
- Alpa’s API for Jax uses a Python decorator @parallelize to annotate functions that need to be parallelized
- Sharding specs of a 2-dimentional tensor on a 2 × 2 device mesh
- Inter-op pass summary
- Cross-mesh resharding
- End-to-end evaluation results
- Inter-operator parallelism ablation study
- Alpa’s compilation time scales linearly with model size and number of GPUs
- Compilation time breakdown for the largest GPT model
- Parallel strategy of Wide-ResNet on 4 and 8 GPUs
- Models used in the end-to-end evaluation