Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

  • Model parallelism is a way to scale a single large deep learning model beyond the memory limits of a single device.
  • Model parallelism can also be used to serve multiple models on multiple devices, even when a single model can fit into a single device.
  • AlpaServe is a novel serving system that determines an efficient strategy for placing and parallelizing collections of large deep learning models across a distributed cluster.
  • Evaluation results show that AlpaServe can process requests at up to 10x higher rates or 6x more burstiness while staying within latency constraints.

Paper Content

Introduction

  • Advances in self-supervised learning have enabled exponential scaling in model sizes.
  • Serving large models is challenging due to high computational and memory requirements.
  • Provisioning sufficient resources to serve these models can be expensive due to bursty request rates.
  • It is increasingly common to serve multiple models and multiple variations of the same large model.
  • Model parallelism refers to partitioning and executing a model on distributed devices.
  • Model parallelism has been studied in the throughput-oriented training setting, but not in latency-sensitive settings.
  • Model parallelism can improve the statistical multiplexing of the system under bursty workloads.
  • Model parallelism adds overheads that may negate its benefits for less bursty workloads.
  • AlpaServe is a system that automatically and efficiently explores the tradeoffs among different parallelization and placement strategies for model serving.
  • AlpaServe can increase the request processing rate, achieve lower latency deadlines, or tolerate burstier traffic compared to existing systems.

Background

  • Models have been developed for various tasks
  • Serving predictions from these models is an essential workload in cloud systems
  • Requests for models are queued, dispatched to GPUs/TPUs and results are returned
  • Serving systems must adhere to aggressive SLO on latency
  • Serving systems must minimize operational costs
  • Multiple instances of the same model architecture are used

Model parallelism in model serving

  • Distributed parallel model execution is necessary for large models or high performance requirements
  • Intra-operator parallelism splits a single operator across multiple devices
  • Inter-operator parallelism assigns different operators to execute on distributed devices in a pipeline fashion

Motivation and tradeoff analysis

  • Model parallelism reduces memory usage by partitioning a model on multiple devices
  • Aim is to fit more models on one device to handle bursty requests
  • Series of empirical examinations and theoretical analysis to explore idea
  • Experiments performed on AWS EC2 p3.16xlarge instance with 8 NVIDIA 16GB V100 GPUs

Case study: a two-model example

  • Model parallelism can benefit the serving of multiple models
  • Two GPUs are used to serve two Transformer models with 6.7 billion parameters each
  • Two model placements are compared: simple placement and model-parallel placement
  • Model-parallel placement reduces the average latency of the simple placement by 1.3x
  • With higher burstiness, the speedup on mean latency is increased to 1.9x
  • Model-parallel placement reduces the mean latency by 6.6x

When is model parallelism beneficial

  • Model parallelism can benefit model serving
  • Model parallelism uses statistical multiplexing
  • Model parallelism is beneficial when device memory is limited, request rate is low, request CV is high, or SLO is tight
  • 8 GPUs and 8 Transformer models with 2.6B parameters each were used
  • Requests set to each model as a Gamma process with an average rate of 20 request/s and CV of 3
  • Two placement methods: Replication and Model Parallelism
  • More GPU memory can fit more models onto a single GPU
  • Model Parallelism can greatly reduce latency when request rate is low
  • Model Parallelism can improve SLO attainment when SLO is tight

Overhead of model parallelism

  • Model parallelism can outperform replication when there is no overhead
  • Model parallelism can still outperform replication when there is overhead and the SLO is low
  • Inter-op parallelism has two sources of overhead: communication and uneven partition
  • Intra-op parallelism has higher communication overhead than inter-op
  • Inter-op parallelism can have higher throughput than intra-op
  • Both parallel methods have the same memory usage with increasing numbers of GPUs

Queueing theory analysis

  • Queuing theory is used to mathematically verify conclusions from previous sections
  • Request serving time is assumed to be deterministic
  • Average number of requests and average latency are calculated for single device case
  • Percentage of requests for two models is controlled by p
  • Average latency for simple placement is derived
  • Average latency for model-parallel case is derived
  • Model-parallel execution has half the waiting time of simple placement
  • Overhead factors are calculated for model-parallelism

Method

  • Model parallelism is difficult to use for deep learning serving
  • Challenges include data parallelism, model parallelism, and communication overhead

Automatic parallelization for inference

  • AlpaServe generates a list of possible configurations for a given model.
  • AlpaServe builds on an existing auto-parallelization training system, Alpa.
  • AlpaServe uses a dynamic programming algorithm to figure out the optimal inter-op parallel plan.
  • AlpaServe profiles the latency of each stage with an integer linear programming problem.

Placement algorithm

  • AlpaServe partitions a cluster into groups of devices and assigns a subset of models to each group
  • Goal is to find a placement that maximizes SLO attainment
  • Finding optimal placement is a difficult combinatorial optimization problem
  • AlpaServe uses a simulator-guided greedy algorithm to compute SLO attainment
  • Algorithm 1 uses a beam search to select model replicas
  • Algorithm 2 enumerates various potential cluster partitions and parallel configurations
  • Algorithm 2 clusters models into buckets and assigns devices to each bucket
  • Algorithm 1 runs the simulator to compute SLO attainment
  • Algorithm 2 uses heuristics to prune the search space

Runtime scheduling

  • Requests sent to centralized controller
  • Controller dispatches requests to group with shortest queue length
  • Group manages first-come-first-serve queue
  • Execution time of DNN model is predictable
  • Advanced runtime policies (batching, swapping, preemption) discussed
  • Batching enabled in some experiments
  • Preemption can help correct suboptimal decisions
  • Swapping not implemented
  • Fault tolerance not a focus

Implementation

  • Implemented a real system and a simulator for AlpaServe with 4k lines of code in Python
  • Real system is built on top of existing model-parallel training system, Alpa
  • Auto-parallelization algorithms extended for inference settings
  • Centralized controller dispatches requests to groups
  • Simulator is a continuous-time, discrete-event simulator
  • Simulator is orders of magnitude faster than real experiments
  • Simulator has high fidelity due to predictability of DNN model execution

Evaluation

  • AlpaServe’s serving ability is evaluated under different model and workload conditions
  • AlpaServe outperforms strong baselines across all model sizes
  • AlpaServe is evaluated for robustness against changing arrival patterns and ablation studies are conducted
  • AlpaServe can save up to 2.3x devices, handle 10x higher rates, 6x more burstiness, or 2.5x more stringent SLO while meeting latency SLOs for over 99% requests

Experiment setup

  • Deployed AlpaServe on 8 node cluster with 64 GPUs
  • Evaluated BERT and GShard MoE model families
  • Selected several model sizes and created multiple model instances
  • Used SLO attainment as major evaluation metric
  • Used simulator to study system behavior under extensive models, workload, and resource settings

End-to-end results with real workloads

  • AlpaServe is compared to two baseline methods on two real traces
  • The two traces have distinct traffic patterns
  • SLO attainment depends on many factors, with a default value of 5x inference latency
  • Variables (3) and (4) are changed by slicing the original traces into time windows
  • AlpaServe outperforms the two baselines and uses fewer devices to achieve 99% SLO attainment
  • AlpaServe can handle a higher rate than baselines for a stable trace
  • AlpaServe reveals a hidden opportunity to handle burstiness by model parallelism
  • AlpaServe favors intra-op parallelism to reduce inference latency when SLO is tight
  • AlpaServe switches to inter-op parallelism to get higher throughput when SLO is looser

Serving very large models

  • Large models can have hundreds of billions of parameters
  • Common practice is to manually choose model parallelism strategy and use dedicated GPUs
  • AlpaServe searches for optimal GPU group allocation and model placement
  • Traffic is generated via a Gamma Process with an average rate of 8 requests/s and CV of 4
  • AlpaServe slices cluster into two groups and balances requests between them
  • AlpaServe exploits statistical multiplexing for bursty workloads

Robustness to changing traffic patterns

  • AlpaServe’s placement algorithm assumes knowledge of the arrival process in advance
  • In practice, the arrival process can be approximated but the prediction may be inaccurate
  • Experiment studied how AlpaServe performs when traffic patterns change
  • AlpaServe maintained good performance and outperformed Clockwork++
  • Statistical multiplexing with model parallelism is a better alternative than existing replication-or replacement-based algorithms

Benefits of dynamic batching

  • Batching is an optimization used in other serving systems
  • Choice of batch size is important for performance
  • Batching is limited in large model scenarios due to two reasons
  • Batching disabled in experiments to isolate benefits of model parallelism
  • Batching algorithm implemented and evaluated
  • SLO attainment improved when SLO requirement is loose

Ablation study

  • Auto-parallelization reduces parallelism overheads and improves serving performance
  • Auto-parallelization partitions models at the computational graph level
  • Placement algorithm tested on synthetic workload
  • Round robin, Greedy placement and Greedy placement + Group partitioning evaluated
  • Group partitioning increases rate and traffic burstiness that can be handled to meet 99% SLO attainment
  • Model serving systems range from general-purpose to specialized
  • AlpaServe targets a broader set of models than other systems
  • Other systems do not consider latency of co-located models
  • Optimizations for inference over large models exist
  • AlpaServe is largely orthogonal to model parallelism in training
  • Resource allocation and multiplexing is an old topic in computer science

Conclusion

  • AlpaServe is a system for prediction servings of multiple large deep-learning models
  • Integrates model parallelism into multi-model serving
  • Model parallelism is traditionally applied conservatively
  • AlpaServe demonstrates model parallelism is useful for many scenarios
  • Automatically navigates tradeoff space
  • Reduces average completion time of bursty requests
  • Model parallelism is beneficial for limited memory budget and larger CVs
  • Inter-op and intra-op parallelism
  • AlpaServe utilizes a centralized controller to dispatch requests
  • Algorithm 1 and 2 for efficient model parallel strategies
  • Comparison of SLO attainment reported by simulator and real system