Link to paper
The full paper is available here.
You can also find the paper on PapersWithCode here.
Abstract
- LLM inference traditionally requires multiple high-end accelerators.
- This paper studies LLM inference using limited resources, such as a single commodity GPU.
- FlexGen is a high-throughput generation engine for running LLMs with limited GPU memory.
- FlexGen can be flexibly configured under various hardware resource constraints.
- FlexGen compresses weights and the attention cache to 4 bits with negligible accuracy loss.
- FlexGen significantly increases maximum throughput.
- FlexGen achieves significantly higher throughput compared to state-of-the-art offloading systems.
- FlexGen can benchmark a 30B model with a 16GB GPU in 21 hours.
Paper Content
Introduction
- Large language models (LLMs) have strong performance across a range of tasks
- LLMs can have billions or trillions of parameters
- This leads to high computational and memory requirements
- Lowering LLM inference resource requirements has attracted interest
- Throughput-oriented generative inference is a setting where latency can be traded off for higher throughput
- Three directions to lower resource requirements: model compression, collaborative inference, and offloading
- Challenges: efficient offloading strategy and effective compression strategies
- FlexGen: offloading framework for high-throughput LLM inference
- Search space of possible offloading strategies
- Linear programming-based search algorithm to optimize throughput
- Fine-grained group-wise quantization for compression
- FlexGen achieves much higher throughput than existing systems
- FlexGen outperforms decentralized Petals cluster in terms of per-GPU throughput
Related work
- LLM inference is an important workload
- Systems have been developed to enable LLM inference
- Offloading is an essential technique for LLM inference on commodity hardware
- Algorithm-oriented works have been developed to accelerate LLM inference
- Memory optimizations and offloading have been studied for training and linear algebra
Background: llm inference
- LLM inference workflow consists of two stages: prefill and decoding
- Prefill stage generates key-value cache for each transformer layer
- Decoding stage utilizes and updates KV cache to generate tokens
- Memory footprint of LLM inference comes from model weights and KV cache
- KV cache is a new bottleneck of large-batch high-throughput inference
Offloading strategy
- Formulate the problem and construct the search space of possible offloading strategies in FlexGen
- Build an analytical cost model and search for configurations with an optimizer based on linear programming
- Extend FlexGen to support multi-GPU settings
- Compute a block with multiple GPU batches
Problem formulation
- Machine has 3 devices: GPU, CPU, and disk
- GPU has smallest but fastest memory, disk has largest but slowest memory
- LLM can’t fit in GPU, so need to offload to secondary storage
- Graph traversal problem to generate inference with offloading
- 4 layers, 3 tokens per prompt
- Valid path must traverse all squares, subject to constraints
- Goal is to minimize total execution time
Search space
- Construct a search space for possible valid strategies in FlexGen
- Two orders to traverse the graph: row-by-row and column-by-column
- Existing systems traverse the graph row-by-row
- Traversing the graph column-by-column reduces I/O costs
- Converge to a zig-zag block schedule
- Introduce two parameters: GPU batch size and number of GPU batches in a block
- Tensor placement: define percentages of weights, activations and KV cache stored on GPU, CPU and disk
- Partition tensors at layer granularity for weights and tensor granularity for activations and KV cache
- Computation delegation: compute attention scores on CPU to reduce I/O costs
Cost model and policy search
- Constructs a search space with several parameters
- Develops an analytical cost model to estimate execution time
- Cost model predicts latency during prefill and decoding for one layer
- Total latency for computing a block estimated as T = T pre • l + T gen • (n − 1) • l
- Latency of read from CPU to GPU, write from GPU to CPU, read from disk to CPU, write from CPU to disk, computation estimated during prefill and decoding
- Policy includes 11 variables: block size bls, GPU batch size gbs, weight placement wg, wc, wd, activation placement hg, hc, hd, and KV cache placement cg, cc, cd
- Solved as two-level optimization problem
- Cost model can usually return a good policy, but better policy can be obtained by tuning manually
Extension to multiple gpus
- FlexGen can extend offloading strategy if multiple GPUs are available
- Model parallelism can reduce memory pressure and lead to super-linear scaling
- Two kinds of model parallelism: tensor and pipeline
- FlexGen implements pipeline parallelism by partitioning LLM on multiple GPUs
- Policy search developed for one GPU can be reused
- Micro-batch pipelining added to Fig. 4 to combine iteration-level pipeline execution schedule
Approximate methods
- LLMs are typically robust to careful approximations
- Two approximations introduced: group-wise quantization and sparse attention
- Group-wise quantization: weights and KV cache quantized into 4-bit integers without retraining or calibration
- Group-wise quantization: fine-grained group-wise asymmetric quantization method used
- Sparse Attention: top 10% attention value cache loaded on OPT-175B while maintaining model quality
Evaluation
- Experiments run on NVIDIA T4 GPU instances from Google Cloud
- OPT models with 6.7B to 175B parameters used in evaluation
- Focus is high-throughput generation on given dataset
- Prompts padded to same length, system required to generate 32 tokens
- Evaluation metric is generation throughput
- Baselines are DeepSpeed ZeRO-Inference and Hugging Face Accelerate
- FlexGen implemented on top of PyTorch
Offloading
- FlexGen outperforms all baselines in maximum generation throughput
- FlexGen uses block scheduling to reuse weights
- FlexGen achieves 69x higher throughput than baselines on OPT-175B
- With compression enabled, FlexGen achieves 112x higher generation throughput
- FlexGen achieves super-linear scaling on decoding throughput
- FlexGen sets a new Pareto-optimal frontier that significantly outperforms baselines
- FlexGen uses partial offloading and more space for weights on low-latency side
- FlexGen aggressively offloads all things out of the GPU on high-throughput side
- Ablation study shows importance of good policy and using CPU compute and overlapping
Approximations
- Two tasks used to show negligible accuracy loss: next-word prediction and language modeling
- 4-bit compression used to compress weights and KV cache into 4-bit integers
- 4-bit-S combines quantization and sparse attention with 10% sparsity on value cache
- Negligible accuracy loss compared to FP16
- 3-bit compression does not preserve accuracy
Offloading vs. collaborative inference
- Decentralized collaborative inference is an option to reduce resource requirement for LLM utilization.
- Petals and FlexGen are compared under different network conditions.
- FlexGen outperforms Petals in terms of throughput and latency.
Conclusion
- Introduction of FlexGen, a high-throughput generation engine for LLMs
- Resource requirements of running 175B-scale models reduced to a single 16GB GPU
- Generation throughput of 1 token/s with an effective batch size of 144
- FlexGen provides a viable option for resource-constrained and throughput-oriented scenarios
- Graph traversal problem discussed in Section 4.1
- Model cannot fit in a single GPU, no application of CPU computation
- Two schedules discussed: zig-zag block schedule and I/O-optimal diagonal block schedule
- Three things need to be stored during generation process: weights, activations, and KV cache
- Three observations made from computational graph
- Zig-zag block schedule: compute first column for bls samples, save caches and activations, then compute second column for bls samples, until last column for bls samples
- Diagonal block schedule: block containing 4 GPU batches, one-time warm-up phase, compute diagonal containing 4 sub-diagonals, repeat in next row
- Discussions: I/O is a significant bottleneck, diagonal block schedule can give considerable gain, can reduce peak memory and enlarge batch size
- Full cost model: maximize throughput, piece-wise functions and regularization terms used, linear programming problem with respect to policy variables
- Tables 10-19 show results for different sequence lengths, systems, and tasks