Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

Model parallelism is a way to scale a single large deep learning model beyond the memory limits of a single device.
Model parallelism can also be used to serve multiple models on multiple devices, even when a single model can fit into a single device.
AlpaServe is a novel serving system that determines an efficient strategy for placing and parallelizing collections of large deep learning models across a distributed cluster.
Evaluation results show that AlpaServe can process requests at up to 10x higher rates or 6x more burstiness while staying within latency constraints.

Paper Content

Introduction

Advances in self-supervised learning have enabled exponential scaling in model sizes.
Serving large models is challenging due to high computational and memory requirements.
Provisioning sufficient resources to serve these models can be expensive due to bursty request rates.
It is increasingly common to serve multiple models and multiple variations of the same large model.
Model parallelism refers to partitioning and executing a model on distributed devices.
Model parallelism has been studied extensively in the throughput-oriented training setting, but remains largely unexplored in latency-sensitive serving settings.
Model parallelism can improve the statistical multiplexing of the system under bursty workloads.
Model parallelism adds overheads that may negate its benefits for less bursty workloads.
AlpaServe is a system that automatically and efficiently explores the tradeoffs among different parallelization and placement strategies for model serving.
AlpaServe can increase the request processing rate, achieve lower latency deadlines, or tolerate burstier traffic compared to existing systems.

Background

Models have been developed for various tasks
Serving predictions from these models is an essential workload in cloud systems
Requests for models are queued and dispatched to GPUs/TPUs, and the results are returned
Serving systems must adhere to aggressive latency service level objectives (SLOs)
Serving systems must minimize operational costs
Multiple instances of the same model architecture are used

Model parallelism in model serving

Distributed parallel model execution is necessary for large models or high performance requirements
Intra-operator parallelism splits a single operator across multiple devices
Inter-operator parallelism assigns different operators to execute on distributed devices in a pipeline fashion
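
The difference between the two forms of parallelism can be illustrated with a toy sketch. Everything here is an assumption made for illustration: the "devices" are ordinary Python function calls, and the names matvec, intra_op, and inter_op are hypothetical rather than part of AlpaServe or Alpa.

```python
# Toy illustration only: "devices" are plain Python calls, not real GPUs.

def matvec(weights, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]


def intra_op(weights, x, num_devices=2):
    """Intra-operator parallelism: one operator is split across devices.

    Each device owns a slice of the output rows; results are concatenated.
    """
    chunk = (len(weights) + num_devices - 1) // num_devices
    shards = [weights[i:i + chunk] for i in range(0, len(weights), chunk)]
    partial_outputs = [matvec(shard, x) for shard in shards]  # one per device
    return [y for part in partial_outputs for y in part]


def inter_op(layers, x):
    """Inter-operator parallelism: each layer (stage) lives on its own device.

    The activation is handed from stage to stage, as in a pipeline.
    """
    for stage_weights in layers:  # stage i would run on device i
        x = matvec(stage_weights, x)
    return x


layer1 = [[1, 2], [3, 4]]
layer2 = [[0, 1], [1, 0]]
x = [1, 1]

single_device = matvec(layer2, matvec(layer1, x))
intra_result = matvec(layer2, intra_op(layer1, x))
inter_result = inter_op([layer1, layer2], x)
```

All three variants compute the same output; they differ only in which device would hold which part of the weights.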

Motivation and tradeoff analysis

Model parallelism reduces memory usage by partitioning a model on multiple devices
The aim is to fit more models on each device so that bursty requests can be absorbed
A series of empirical experiments and theoretical analyses explores this idea
Experiments are performed on an AWS EC2 p3.16xlarge instance with 8 NVIDIA V100 GPUs (16 GB each)

Case study: a two-model example

Model parallelism can benefit the serving of multiple models
Two GPUs are used to serve two Transformer models with 6.7 billion parameters each
Two model placements are compared: simple placement and model-parallel placement
Model-parallel placement reduces the average latency of the simple placement by 1.3x
With higher burstiness, the speedup on mean latency is increased to 1.9x
When requests are split unevenly between the two models, model-parallel placement reduces the mean latency by 6.6x

When is model parallelism beneficial

Model parallelism can benefit model serving
Model parallelism uses statistical multiplexing
Model parallelism is beneficial when device memory is limited, request rate is low, request CV is high, or SLO is tight
8 GPUs and 8 Transformer models with 2.6B parameters each were used
Requests sent to each model follow a Gamma process with an average rate of 20 requests/s and a CV of 3
Two placement methods are compared: replication and model parallelism
With more memory per GPU, more models fit onto a single GPU
Model Parallelism can greatly reduce latency when request rate is low
Model Parallelism can improve SLO attainment when SLO is tight
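
A workload of this kind can be generated with a short sketch. It assumes the Gamma process is a renewal process whose inter-arrival times are Gamma distributed, with the shape parameter set from the CV; the function name gamma_arrivals and the seed parameter are hypothetical.

```python
import random


def gamma_arrivals(rate, cv, duration, seed=0):
    """Return arrival timestamps in [0, duration) for a Gamma renewal process.

    rate is the average number of requests per second and cv is the
    coefficient of variation of the inter-arrival times (cv = 1 is Poisson).
    """
    rng = random.Random(seed)
    shape = 1.0 / (cv * cv)        # CV of a Gamma distribution is 1/sqrt(shape)
    scale = 1.0 / (rate * shape)   # mean inter-arrival = shape * scale = 1/rate
    t, arrivals = 0.0, []
    while True:
        t += rng.gammavariate(shape, scale)
        if t >= duration:
            return arrivals
        arrivals.append(t)


# Example: the setting described above, 20 requests/s with a CV of 3.
trace = gamma_arrivals(rate=20, cv=3, duration=1000)
```

A larger CV keeps the same average rate but concentrates requests into bursts separated by long idle gaps.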

Overhead of model parallelism

Model parallelism can outperform replication when there is no overhead
Model parallelism can still outperform replication when there is overhead, provided the SLO is tight
Inter-op parallelism has two sources of overhead: communication and uneven partition
Intra-op parallelism has higher communication overhead than inter-op
Inter-op parallelism can have higher throughput than intra-op
Both parallel methods reduce per-GPU memory usage by a similar amount as the number of GPUs increases
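
The two inter-op overhead sources can be made concrete with a small sketch. It assumes a simplified pipeline model in which communication adds a fixed delay per stage hand-off and overlaps with compute, so it does not reduce throughput; the function names and all numbers are hypothetical.

```python
def inter_op_latency(stage_latencies, comm_latency=0.0):
    """Single-request latency: every stage runs in sequence, plus hand-offs."""
    hops = len(stage_latencies) - 1
    return sum(stage_latencies) + hops * comm_latency


def inter_op_throughput(stage_latencies):
    """Steady-state requests/s of a full pipeline: bound by the slowest stage."""
    return 1.0 / max(stage_latencies)


single_gpu = [0.40]     # hypothetical: the whole model takes 0.40 s on one GPU
even = [0.20, 0.20]     # perfectly balanced two-stage split
uneven = [0.25, 0.15]   # same total work, unbalanced split

latency_with_comm = inter_op_latency(even, comm_latency=0.01)
throughput_even = inter_op_throughput(even)
throughput_uneven = inter_op_throughput(uneven)
```

In this model, communication raises single-request latency, while an uneven partition lowers throughput because the slowest stage sets the pace.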

Queueing theory analysis

Queueing theory is used to mathematically verify the conclusions from previous sections
Request serving time is assumed to be deterministic
Average number of requests and average latency are calculated for single device case
Percentage of requests for two models is controlled by p
Average latency for simple placement is derived
Average latency for model-parallel case is derived
Model-parallel execution has half the waiting time of simple placement
Overhead factors are calculated for model-parallelism
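
The halving of waiting time can be checked numerically with a short sketch. It assumes Poisson arrivals in addition to the deterministic service time, so each queue is M/D/1 with mean wait rate * D^2 / (2 * (1 - rate * D)), and it models the pipelined group as one queue that admits a request every max_stage_latency seconds. The function and parameter names are hypothetical.

```python
def md1_latency(rate, service_time):
    """Mean latency (wait + service) of an M/D/1 queue."""
    utilization = rate * service_time
    if utilization >= 1.0:
        raise ValueError("unstable queue: rate * service_time must be < 1")
    wait = rate * service_time ** 2 / (2.0 * (1.0 - utilization))
    return service_time + wait


def simple_placement_latency(total_rate, service_time, p):
    """Each model on its own device; a fraction p of requests goes to model 1."""
    lat1 = md1_latency(p * total_rate, service_time)
    lat2 = md1_latency((1.0 - p) * total_rate, service_time)
    return p * lat1 + (1.0 - p) * lat2


def model_parallel_latency(total_rate, single_latency, max_stage_latency):
    """Both models pipelined over both devices, sharing one queue."""
    utilization = total_rate * max_stage_latency
    if utilization >= 1.0:
        raise ValueError("unstable queue: rate * max_stage_latency must be < 1")
    wait = total_rate * max_stage_latency ** 2 / (2.0 * (1.0 - utilization))
    return single_latency + wait
```

With a total rate of 10 requests/s, a 0.1 s service time, an even split (p = 0.5), and no parallelism overhead (single latency 0.1 s, stage latency 0.05 s), the simple placement averages 0.15 s and the model-parallel placement 0.125 s, so the waiting time drops from 0.05 s to 0.025 s. Overhead can be explored by raising single_latency or max_stage_latency above these ideal values.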

Method

Model parallelism is difficult to use for deep learning serving
Challenges include choosing how to parallelize each model, deciding where to place it, and limiting the resulting communication overhead

Automatic parallelization for inference

AlpaServe generates a list of possible configurations for a given model.
AlpaServe builds on an existing auto-parallelization training system, Alpa.
AlpaServe uses a dynamic programming algorithm to figure out the optimal inter-op parallel plan.
Within each stage, the intra-op parallel plan is found by solving an integer linear programming problem, and the latency of each stage is profiled.
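
The idea of a dynamic program over pipeline stages can be illustrated with a minimal sketch. It assumes the goal is to split consecutive layers into a fixed number of stages so that the slowest stage is as fast as possible, and that per-layer latencies are already known. This is a textbook partitioning DP written for illustration, not the formulation used by Alpa or AlpaServe, and partition_layers is a hypothetical name.

```python
from functools import lru_cache


def partition_layers(layer_latencies, num_stages):
    """Split consecutive layers into stages, minimizing the slowest stage.

    Returns (max_stage_latency, list of (start, end) layer index ranges).
    """
    n = len(layer_latencies)
    if not 1 <= num_stages <= n:
        raise ValueError("need 1 <= num_stages <= number of layers")
    prefix = [0.0]
    for lat in layer_latencies:
        prefix.append(prefix[-1] + lat)

    @lru_cache(maxsize=None)
    def best(start, stages_left):
        # Best split of layers[start:] into exactly stages_left stages.
        if stages_left == 1:
            return prefix[n] - prefix[start], ((start, n),)
        best_cost, best_plan = float("inf"), ()
        # Leave at least one layer for each remaining stage.
        for end in range(start + 1, n - stages_left + 2):
            stage_cost = prefix[end] - prefix[start]
            rest_cost, rest_plan = best(end, stages_left - 1)
            cost = max(stage_cost, rest_cost)
            if cost < best_cost:
                best_cost, best_plan = cost, ((start, end),) + rest_plan
        return best_cost, best_plan

    cost, plan = best(0, num_stages)
    return cost, list(plan)
```

For example, partition_layers([1, 2, 3, 4], 2) groups the first three layers into one stage and the last layer into another, giving a slowest stage of 6.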

Placement algorithm

AlpaServe partitions a cluster into groups of devices and assigns a subset of models to each group
Goal is to find a placement that maximizes SLO attainment
Finding optimal placement is a difficult combinatorial optimization problem
AlpaServe uses a simulator-guided greedy algorithm: the simulator estimates the SLO attainment of each candidate placement
Algorithm 1 uses a beam search to select model replicas
Algorithm 2 enumerates various potential cluster partitions and parallel configurations
Algorithm 2 clusters models into buckets and assigns devices to each bucket
Algorithm 1 runs the simulator to compute SLO attainment
Algorithm 2 uses heuristics to prune the search space
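
A much-simplified sketch of simulator-guided greedy selection is given below. It is not Algorithm 1 or Algorithm 2 from the paper: it keeps a single candidate instead of a beam, takes the group partition as given, and treats the simulator as a callback. The names greedy_placement, estimate_slo, model_size, and memory_budget are all hypothetical.

```python
def greedy_placement(models, groups, memory_budget, model_size, estimate_slo):
    """Greedily add model replicas while estimated SLO attainment improves.

    estimate_slo(placement) stands in for the simulator; placement maps each
    group name to the list of models it hosts.
    """
    placement = {g: [] for g in groups}
    best_score = estimate_slo(placement)
    while True:
        best_move = None
        for g in groups:
            used = sum(model_size[m] for m in placement[g])
            for m in models:
                if m in placement[g] or used + model_size[m] > memory_budget:
                    continue
                placement[g].append(m)           # try the candidate replica
                score = estimate_slo(placement)
                placement[g].pop()               # undo the trial
                if score > best_score:
                    best_score, best_move = score, (g, m)
        if best_move is None:                    # no replica improves the score
            return placement, best_score
        placement[best_move[0]].append(best_move[1])


# Toy stand-in for the simulator: fraction of models hosted at least once.
def toy_estimate(placement, all_models=("a", "b", "c")):
    hosted = {m for hosted_models in placement.values() for m in hosted_models}
    return len(hosted) / len(all_models)


result, score = greedy_placement(
    models=["a", "b", "c"],
    groups=["g0", "g1"],
    memory_budget=2,
    model_size={"a": 1, "b": 1, "c": 1},
    estimate_slo=toy_estimate,
)
```

In a real setting the callback would replay a request trace against the candidate placement, which is why a fast simulator matters for this kind of search.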

Runtime scheduling

Requests sent to centralized controller
Controller dispatches requests to group with shortest queue length
Group manages first-come-first-serve queue
Execution time of DNN model is predictable
Advanced runtime policies (batching, swapping, preemption) discussed
Batching enabled in some experiments
Preemption can help correct suboptimal decisions
Swapping not implemented
Fault tolerance not a focus
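
The dispatch policy described above can be sketched in a few lines. The sketch assumes that a request may only go to a group hosting its model and that ties go to the first such group; the class and method names are hypothetical and do not reflect AlpaServe's actual interfaces.

```python
from collections import deque


class Group:
    """A device group hosting some models, with a first-come-first-serve queue."""

    def __init__(self, name, models):
        self.name = name
        self.models = set(models)
        self.queue = deque()

    def pop_next(self):
        """Return the oldest queued request, or None if the queue is empty."""
        return self.queue.popleft() if self.queue else None


class Controller:
    """Centralized controller that sends each request to the shortest queue."""

    def __init__(self, groups):
        self.groups = groups

    def dispatch(self, request_id, model):
        candidates = [g for g in self.groups if model in g.models]
        if not candidates:
            raise KeyError(f"no group hosts model {model!r}")
        target = min(candidates, key=lambda g: len(g.queue))
        target.queue.append((request_id, model))
        return target.name


g0 = Group("g0", ["a", "b"])
g1 = Group("g1", ["a"])
controller = Controller([g0, g1])
routes = [controller.dispatch(i, m) for i, m in enumerate(["a", "a", "b", "a"])]
```

Because model "a" is hosted by both groups, its requests are spread across them, while requests for "b" can only go to g0.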

Implementation

Implemented a real system and a simulator for AlpaServe with 4k lines of code in Python
Real system is built on top of existing model-parallel training system, Alpa
Auto-parallelization algorithms extended for inference settings
Centralized controller dispatches requests to groups
Simulator is a continuous-time, discrete-event simulator
Simulator is orders of magnitude faster than real experiments
Simulator has high fidelity due to predictability of DNN model execution
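
Because execution time is predictable, even a very small simulator can estimate SLO attainment. The sketch below is an illustration under strong assumptions, not the paper's simulator: it models one group with a first-come-first-serve queue, a fixed per-request latency, and an optional pipeline admission interval. The name simulate_slo_attainment and its parameters are hypothetical.

```python
def simulate_slo_attainment(arrivals, latency, slo, max_stage_latency=None):
    """Replay arrival timestamps through one FCFS group.

    latency is the fixed time a request takes once it starts.
    max_stage_latency is how soon the group can admit the next request;
    it defaults to latency, which means no pipelining.
    Returns the fraction of requests that finish within slo of arriving.
    """
    if not arrivals:
        return 1.0
    interval = latency if max_stage_latency is None else max_stage_latency
    next_free = 0.0
    met = 0
    for t in sorted(arrivals):
        start = max(t, next_free)      # wait until the group can admit it
        finish = start + latency
        next_free = start + interval
        if finish - t <= slo:
            met += 1
    return met / len(arrivals)


burst = [0.0, 0.0, 0.0, 0.0]           # four requests arriving at once
no_pipeline = simulate_slo_attainment(burst, latency=1.0, slo=2.0)
two_stage = simulate_slo_attainment(burst, latency=1.0, slo=2.0,
                                    max_stage_latency=0.5)
```

In this toy burst, admitting a request every 0.5 s instead of every 1.0 s raises attainment from 0.5 to 0.75, since each step is a handful of arithmetic operations rather than a model execution.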

Evaluation

AlpaServe’s serving ability is evaluated under different model and workload conditions
AlpaServe outperforms strong baselines across all model sizes
AlpaServe is evaluated for robustness against changing arrival patterns and ablation studies are conducted
AlpaServe can use up to 2.3x fewer devices, or handle 10x higher rates, 6x more burstiness, or 2.5x more stringent SLOs, while meeting the latency SLO for over 99% of requests

Experiment setup

Deployed AlpaServe on 8 node cluster with 64 GPUs
Evaluated BERT and GShard MoE model families
Selected several model sizes and created multiple model instances
Used SLO attainment as major evaluation metric
Used simulator to study system behavior under extensive models, workload, and resource settings

End-to-end results with real workloads

AlpaServe is compared to two baseline methods on two real traces
The two traces have distinct traffic patterns
SLO attainment depends on many factors; by default, the SLO is set to 5x the inference latency
Factors (3) and (4) are varied by slicing the original traces into time windows
AlpaServe outperforms the two baselines and uses fewer devices to achieve 99% SLO attainment
AlpaServe can handle a higher rate than baselines for a stable trace
AlpaServe reveals a hidden opportunity to handle burstiness by model parallelism
AlpaServe favors intra-op parallelism to reduce inference latency when SLO is tight
AlpaServe switches to inter-op parallelism to get higher throughput when SLO is looser

Serving very large models

Large models can have hundreds of billions of parameters
Common practice is to manually choose model parallelism strategy and use dedicated GPUs
AlpaServe searches for optimal GPU group allocation and model placement
Traffic is generated via a Gamma Process with an average rate of 8 requests/s and CV of 4
AlpaServe slices cluster into two groups and balances requests between them
AlpaServe exploits statistical multiplexing for bursty workloads

Robustness to changing traffic patterns

AlpaServe’s placement algorithm assumes knowledge of the arrival process in advance
In practice, the arrival process can be approximated but the prediction may be inaccurate
Experiment studied how AlpaServe performs when traffic patterns change
AlpaServe maintained good performance and outperformed Clockwork++
Statistical multiplexing with model parallelism is a better alternative than existing replication- or replacement-based algorithms

Benefits of dynamic batching

Batching is an optimization used in other serving systems
Choice of batch size is important for performance
Batching is limited in large model scenarios due to two reasons
Batching disabled in experiments to isolate benefits of model parallelism
Batching algorithm implemented and evaluated
SLO attainment improved when SLO requirement is loose

Ablation study

Auto-parallelization reduces parallelism overheads and improves serving performance
Auto-parallelization partitions models at the computational graph level
Placement algorithm tested on synthetic workload
Round robin, Greedy placement and Greedy placement + Group partitioning evaluated
Group partitioning increases rate and traffic burstiness that can be handled to meet 99% SLO attainment

Related work

Model serving systems range from general-purpose to specialized
AlpaServe targets a broader set of models than other systems
Other systems do not consider latency of co-located models
Optimizations for inference over large models exist
AlpaServe is largely orthogonal to model parallelism in training
Resource allocation and multiplexing is an old topic in computer science

Conclusion

AlpaServe is a system for prediction serving of multiple large deep-learning models
Integrates model parallelism into multi-model serving
Model parallelism is traditionally applied conservatively
AlpaServe demonstrates model parallelism is useful for many scenarios
Automatically navigates tradeoff space
Reduces average completion time of bursty requests
Model parallelism is beneficial for limited memory budget and larger CVs
Inter-op and intra-op parallelism
AlpaServe utilizes a centralized controller to dispatch requests
Algorithm 1 and 2 for efficient model parallel strategies
Comparison of SLO attainment reported by simulator and real system
