Link to paper
The full paper is available here.
You can also find the paper on PapersWithCode here.
Abstract
- Model parallelism is a way to scale a single large deep learning model beyond the memory limits of a single device.
- Model parallelism can also be used to serve multiple models on multiple devices, even when a single model can fit into a single device.
- AlpaServe is a novel serving system that determines an efficient strategy for placing and parallelizing collections of large deep learning models across a distributed cluster.
- Evaluation results show that AlpaServe can process requests at up to 10x higher rates or 6x more burstiness while staying within latency constraints.
Paper Content
Introduction
- Advances in self-supervised learning have enabled exponential scaling in model sizes.
- Serving large models is challenging due to high computational and memory requirements.
- Provisioning sufficient resources to serve these models can be expensive due to bursty request rates.
- It is increasingly common to serve multiple models and multiple variations of the same large model.
- Model parallelism refers to partitioning and executing a model on distributed devices.
- Model parallelism has been studied in the throughput-oriented training setting, but not in latency-sensitive settings.
- Model parallelism can improve the statistical multiplexing of the system under bursty workloads.
- Model parallelism adds overheads that may negate its benefits for less bursty workloads.
- AlpaServe is a system that automatically and efficiently explores the tradeoffs among different parallelization and placement strategies for model serving.
- AlpaServe can increase the request processing rate, achieve lower latency deadlines, or tolerate burstier traffic compared to existing systems.
Background
- Models have been developed for various tasks
- Serving predictions from these models is an essential workload in cloud systems
- Requests for models are queued, dispatched to GPUs/TPUs and results are returned
- Serving systems must adhere to aggressive SLO on latency
- Serving systems must minimize operational costs
- Multiple instances of the same model architecture are used
Model parallelism in model serving
- Distributed parallel model execution is necessary for large models or high performance requirements
- Intra-operator parallelism splits a single operator across multiple devices
- Inter-operator parallelism assigns different operators to execute on distributed devices in a pipeline fashion
Motivation and tradeoff analysis
- Model parallelism reduces memory usage by partitioning a model on multiple devices
- Aim is to fit more models on one device to handle bursty requests
- Series of empirical examinations and theoretical analysis to explore idea
- Experiments performed on AWS EC2 p3.16xlarge instance with 8 NVIDIA 16GB V100 GPUs
Case study: a two-model example
- Model parallelism can benefit the serving of multiple models
- Two GPUs are used to serve two Transformer models with 6.7 billion parameters each
- Two model placements are compared: simple placement and model-parallel placement
- Model-parallel placement reduces the average latency of the simple placement by 1.3x
- With higher burstiness, the speedup on mean latency is increased to 1.9x
- Model-parallel placement reduces the mean latency by 6.6x
When is model parallelism beneficial
- Model parallelism can benefit model serving
- Model parallelism uses statistical multiplexing
- Model parallelism is beneficial when device memory is limited, request rate is low, request CV is high, or SLO is tight
- 8 GPUs and 8 Transformer models with 2.6B parameters each were used
- Requests set to each model as a Gamma process with an average rate of 20 request/s and CV of 3
- Two placement methods: Replication and Model Parallelism
- More GPU memory can fit more models onto a single GPU
- Model Parallelism can greatly reduce latency when request rate is low
- Model Parallelism can improve SLO attainment when SLO is tight
Overhead of model parallelism
- Model parallelism can outperform replication when there is no overhead
- Model parallelism can still outperform replication when there is overhead and the SLO is low
- Inter-op parallelism has two sources of overhead: communication and uneven partition
- Intra-op parallelism has higher communication overhead than inter-op
- Inter-op parallelism can have higher throughput than intra-op
- Both parallel methods have the same memory usage with increasing numbers of GPUs
Queueing theory analysis
- Queuing theory is used to mathematically verify conclusions from previous sections
- Request serving time is assumed to be deterministic
- Average number of requests and average latency are calculated for single device case
- Percentage of requests for two models is controlled by p
- Average latency for simple placement is derived
- Average latency for model-parallel case is derived
- Model-parallel execution has half the waiting time of simple placement
- Overhead factors are calculated for model-parallelism
Method
- Model parallelism is difficult to use for deep learning serving
- Challenges include data parallelism, model parallelism, and communication overhead
Automatic parallelization for inference
- AlpaServe generates a list of possible configurations for a given model.
- AlpaServe builds on an existing auto-parallelization training system, Alpa.
- AlpaServe uses a dynamic programming algorithm to figure out the optimal inter-op parallel plan.
- AlpaServe profiles the latency of each stage with an integer linear programming problem.
Placement algorithm
- AlpaServe partitions a cluster into groups of devices and assigns a subset of models to each group
- Goal is to find a placement that maximizes SLO attainment
- Finding optimal placement is a difficult combinatorial optimization problem
- AlpaServe uses a simulator-guided greedy algorithm to compute SLO attainment
- Algorithm 1 uses a beam search to select model replicas
- Algorithm 2 enumerates various potential cluster partitions and parallel configurations
- Algorithm 2 clusters models into buckets and assigns devices to each bucket
- Algorithm 1 runs the simulator to compute SLO attainment
- Algorithm 2 uses heuristics to prune the search space
Runtime scheduling
- Requests sent to centralized controller
- Controller dispatches requests to group with shortest queue length
- Group manages first-come-first-serve queue
- Execution time of DNN model is predictable
- Advanced runtime policies (batching, swapping, preemption) discussed
- Batching enabled in some experiments
- Preemption can help correct suboptimal decisions
- Swapping not implemented
- Fault tolerance not a focus
Implementation
- Implemented a real system and a simulator for AlpaServe with 4k lines of code in Python
- Real system is built on top of existing model-parallel training system, Alpa
- Auto-parallelization algorithms extended for inference settings
- Centralized controller dispatches requests to groups
- Simulator is a continuous-time, discrete-event simulator
- Simulator is orders of magnitude faster than real experiments
- Simulator has high fidelity due to predictability of DNN model execution
Evaluation
- AlpaServe’s serving ability is evaluated under different model and workload conditions
- AlpaServe outperforms strong baselines across all model sizes
- AlpaServe is evaluated for robustness against changing arrival patterns and ablation studies are conducted
- AlpaServe can save up to 2.3x devices, handle 10x higher rates, 6x more burstiness, or 2.5x more stringent SLO while meeting latency SLOs for over 99% requests
Experiment setup
- Deployed AlpaServe on 8 node cluster with 64 GPUs
- Evaluated BERT and GShard MoE model families
- Selected several model sizes and created multiple model instances
- Used SLO attainment as major evaluation metric
- Used simulator to study system behavior under extensive models, workload, and resource settings
End-to-end results with real workloads
- AlpaServe is compared to two baseline methods on two real traces
- The two traces have distinct traffic patterns
- SLO attainment depends on many factors, with a default value of 5x inference latency
- Variables (3) and (4) are changed by slicing the original traces into time windows
- AlpaServe outperforms the two baselines and uses fewer devices to achieve 99% SLO attainment
- AlpaServe can handle a higher rate than baselines for a stable trace
- AlpaServe reveals a hidden opportunity to handle burstiness by model parallelism
- AlpaServe favors intra-op parallelism to reduce inference latency when SLO is tight
- AlpaServe switches to inter-op parallelism to get higher throughput when SLO is looser
Serving very large models
- Large models can have hundreds of billions of parameters
- Common practice is to manually choose model parallelism strategy and use dedicated GPUs
- AlpaServe searches for optimal GPU group allocation and model placement
- Traffic is generated via a Gamma Process with an average rate of 8 requests/s and CV of 4
- AlpaServe slices cluster into two groups and balances requests between them
- AlpaServe exploits statistical multiplexing for bursty workloads
Robustness to changing traffic patterns
- AlpaServe’s placement algorithm assumes knowledge of the arrival process in advance
- In practice, the arrival process can be approximated but the prediction may be inaccurate
- Experiment studied how AlpaServe performs when traffic patterns change
- AlpaServe maintained good performance and outperformed Clockwork++
- Statistical multiplexing with model parallelism is a better alternative than existing replication-or replacement-based algorithms
Benefits of dynamic batching
- Batching is an optimization used in other serving systems
- Choice of batch size is important for performance
- Batching is limited in large model scenarios due to two reasons
- Batching disabled in experiments to isolate benefits of model parallelism
- Batching algorithm implemented and evaluated
- SLO attainment improved when SLO requirement is loose
Ablation study
- Auto-parallelization reduces parallelism overheads and improves serving performance
- Auto-parallelization partitions models at the computational graph level
- Placement algorithm tested on synthetic workload
- Round robin, Greedy placement and Greedy placement + Group partitioning evaluated
- Group partitioning increases rate and traffic burstiness that can be handled to meet 99% SLO attainment
Related work
- Model serving systems range from general-purpose to specialized
- AlpaServe targets a broader set of models than other systems
- Other systems do not consider latency of co-located models
- Optimizations for inference over large models exist
- AlpaServe is largely orthogonal to model parallelism in training
- Resource allocation and multiplexing is an old topic in computer science
Conclusion
- AlpaServe is a system for prediction servings of multiple large deep-learning models
- Integrates model parallelism into multi-model serving
- Model parallelism is traditionally applied conservatively
- AlpaServe demonstrates model parallelism is useful for many scenarios
- Automatically navigates tradeoff space
- Reduces average completion time of bursty requests
- Model parallelism is beneficial for limited memory budget and larger CVs
- Inter-op and intra-op parallelism
- AlpaServe utilizes a centralized controller to dispatch requests
- Algorithm 1 and 2 for efficient model parallel strategies
- Comparison of SLO attainment reported by simulator and real system