Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

LLM inference traditionally requires multiple high-end accelerators.
This paper studies LLM inference using limited resources, such as a single commodity GPU.
FlexGen is a high-throughput generation engine for running LLMs with limited GPU memory.
FlexGen can be flexibly configured under various hardware resource constraints.
FlexGen compresses weights and the attention cache to 4 bits with negligible accuracy loss.
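Together, these techniques give FlexGen a larger range of feasible batch sizes and thus significantly increase maximum throughput.
When running OPT-175B on a single 16GB GPU, FlexGen achieves significantly higher throughput compared to state-of-the-art offloading systems.
On the HELM benchmark, FlexGen can benchmark a 30B model with a 16GB GPU on 7 representative sub-scenarios in 21 hours.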

Paper Content

Introduction

Large language models (LLMs) have strong performance across a range of tasks
LLMs can have billions or trillions of parameters
This leads to high computational and memory requirements
Lowering LLM inference resource requirements has attracted interest
Throughput-oriented generative inference is a setting where latency can be traded off for higher throughput
Three directions to lower resource requirements: model compression, collaborative inference, and offloading
Challenges: efficient offloading strategy and effective compression strategies
FlexGen: offloading framework for high-throughput LLM inference
Search space of possible offloading strategies
Linear programming-based search algorithm to optimize throughput
Fine-grained group-wise quantization for compression
FlexGen achieves much higher throughput than existing systems
FlexGen outperforms decentralized Petals cluster in terms of per-GPU throughput

Related work

LLM inference is an important workload
Systems have been developed to enable LLM inference
Offloading is an essential technique for LLM inference on commodity hardware
Algorithm-oriented works have been developed to accelerate LLM inference
Memory optimizations and offloading have been studied for training and linear algebra

Background: LLM inference

LLM inference workflow consists of two stages: prefill and decoding
Prefill stage generates key-value cache for each transformer layer
Decoding stage utilizes and updates KV cache to generate tokens
Memory footprint of LLM inference comes from model weights and KV cache
KV cache is a new bottleneck of large-batch high-throughput inference
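
To make the memory footprint concrete, here is a rough back-of-the-envelope sketch. It assumes FP16 storage (2 bytes per element), one key tensor and one value tensor per layer, and OPT-175B-like dimensions (96 layers, hidden size 12288). The batch size, prompt length, and generation length are illustrative values, and the function names are hypothetical.

```python
GIB = 1024 ** 3


def weight_bytes(num_params: int, bytes_per_elem: int = 2) -> int:
    """Memory needed to hold all model weights (FP16 assumed by default)."""
    return num_params * bytes_per_elem


def kv_cache_bytes(batch: int, prompt_len: int, gen_len: int,
                   num_layers: int, hidden: int, bytes_per_elem: int = 2) -> int:
    """Peak KV cache size: a key and a value tensor per layer,
    each of shape batch x (prompt_len + gen_len) x hidden."""
    seq_len = prompt_len + gen_len
    return 2 * batch * seq_len * num_layers * hidden * bytes_per_elem


# Illustrative setting (assumed, not measured): OPT-175B-like model, batch 512.
weights = weight_bytes(175_000_000_000)
kv = kv_cache_bytes(batch=512, prompt_len=512, gen_len=32,
                    num_layers=96, hidden=12288)
print(f"weights: {weights / GIB:.0f} GiB, KV cache: {kv / GIB:.0f} GiB")
```

Under these assumptions the KV cache is several times larger than the weights, which is why it becomes the bottleneck at large batch sizes.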

Offloading strategy

Formulate the problem and construct the search space of possible offloading strategies in FlexGen
Build an analytical cost model and search for configurations with an optimizer based on linear programming
Extend FlexGen to support multi-GPU settings
Compute a block with multiple GPU batches

Problem formulation

Machine has 3 devices: GPU, CPU, and disk
GPU has smallest but fastest memory, disk has largest but slowest memory
The LLM cannot fit in GPU memory, so it must be offloaded to secondary storage
Generative inference with offloading is formulated as a graph traversal problem
The illustrative example graph has 4 layers and generates 3 tokens per prompt
A valid path must traverse (compute) all squares, subject to constraints
Goal is to minimize total execution time

Search space

Construct a search space for possible valid strategies in FlexGen
Two orders to traverse the graph: row-by-row and column-by-column
Existing systems traverse the graph row-by-row
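Traversing the graph column-by-column reduces I/O costs because weights are reused across batches
Because memory limits how far a column can extend, the traversal converges to a zig-zag block schedule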
Introduce two parameters: GPU batch size and number of GPU batches in a block
Tensor placement: define percentages of weights, activations and KV cache stored on GPU, CPU and disk
Partition tensors at layer granularity for weights and tensor granularity for activations and KV cache
Computation delegation: compute attention scores on CPU to reduce I/O costs
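
The difference between the traversal orders can be sketched with a toy simulation. It assumes only one layer's weights fit on the GPU at a time, so every change of layer costs one weight load. The function names and the (layer, token, batch) tuple layout are hypothetical and are not taken from FlexGen's code.

```python
def row_by_row(num_layers, num_tokens, num_batches):
    """Finish generation for one batch completely before starting the next."""
    for b in range(num_batches):
        for t in range(num_tokens):
            for j in range(num_layers):
                yield (j, t, b)


def zigzag_block(num_layers, num_tokens, num_batches, block):
    """Within a block of batches, run each layer for every batch in the
    block before moving on, so the loaded weights are reused."""
    for start in range(0, num_batches, block):
        batches = range(start, min(start + block, num_batches))
        for t in range(num_tokens):
            for j in range(num_layers):
                for b in batches:
                    yield (j, t, b)


def count_weight_loads(schedule):
    """Count layer switches, assuming a single layer's weights are resident."""
    loads, resident = 0, None
    for layer, _token, _batch in schedule:
        if layer != resident:
            loads += 1
            resident = layer
    return loads


rows = list(row_by_row(num_layers=4, num_tokens=3, num_batches=8))
zigzag = list(zigzag_block(num_layers=4, num_tokens=3, num_batches=8, block=4))
print(count_weight_loads(rows), count_weight_loads(zigzag))
```

Both schedules compute the same set of squares. In this toy model the block schedule divides the number of weight loads by the block size, at the cost of keeping activations and KV cache for the whole block alive.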

Cost model and policy search

Constructs a search space with several parameters
Develops an analytical cost model to estimate execution time
Cost model predicts latency during prefill and decoding for one layer
Total latency for computing a block estimated as $T = T_{pre} \cdot l + T_{gen} \cdot (n - 1) \cdot l$, where $l$ is the number of layers and $n$ is the number of tokens to generate
Latency of read from CPU to GPU, write from GPU to CPU, read from disk to CPU, write from CPU to disk, computation estimated during prefill and decoding
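Policy includes 11 variables: block size $bls$, GPU batch size $gbs$, weight placement $w_g, w_c, w_d$, activation placement $h_g, h_c, h_d$, and KV cache placement $c_g, c_c, c_d$
Solved as a two-level optimization problem: enumerate a few choices of $(bls, gbs)$, then solve a linear program for the nine placement variables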
Cost model can usually return a good policy, but better policy can be obtained by tuning manually
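
The block latency formula above can be evaluated with a few lines of code. This is a minimal sketch: it assumes that I/O and computation overlap perfectly, so each per-layer latency is the maximum of its five terms, and it assumes throughput is counted as generated tokens per second for the whole block. The class, function names, and the numbers in the example are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class LayerCosts:
    """Per-layer latency terms in seconds (hypothetical container)."""
    ctog: float  # read from CPU to GPU
    gtoc: float  # write from GPU to CPU
    dtoc: float  # read from disk to CPU
    ctod: float  # write from CPU to disk
    comp: float  # computation

    def latency(self) -> float:
        # Assumption: perfect overlap, so the slowest term dominates.
        return max(self.ctog, self.gtoc, self.dtoc, self.ctod, self.comp)


def block_latency(pre: LayerCosts, gen: LayerCosts,
                  num_layers: int, gen_len: int) -> float:
    """T = T_pre * l + T_gen * (n - 1) * l."""
    return (pre.latency() * num_layers
            + gen.latency() * (gen_len - 1) * num_layers)


def throughput(bls: int, gen_len: int, total_latency: float) -> float:
    """Generated tokens per second for a block of bls sequences (assumed metric)."""
    return bls * gen_len / total_latency


# Made-up costs, chosen only to keep the arithmetic easy to follow.
pre = LayerCosts(ctog=0.5, gtoc=0.125, dtoc=0.25, ctod=0.0, comp=0.25)
gen = LayerCosts(ctog=0.25, gtoc=0.125, dtoc=0.0625, ctod=0.0, comp=0.125)
T = block_latency(pre, gen, num_layers=4, gen_len=3)
print(T, throughput(bls=8, gen_len=3, total_latency=T))
```

A policy search would evaluate such a function over candidate block sizes and placements and keep the configuration with the highest estimated throughput that still fits in memory.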

Extension to multiple GPUs

FlexGen can extend offloading strategy if multiple GPUs are available
Model parallelism can reduce memory pressure and lead to super-linear scaling
Two kinds of model parallelism: tensor and pipeline
FlexGen implements pipeline parallelism by partitioning LLM on multiple GPUs
Policy search developed for one GPU can be reused
Micro-batch pipelining is added to the schedule in Fig. 4, combining an iteration-level pipeline-parallel execution schedule with the single-GPU offloading runtime
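
The partitioning step behind pipeline parallelism can be sketched as follows. The helper is hypothetical and assumes layers are split as evenly as possible into contiguous stages, one stage per GPU; each stage can then be treated as a smaller model on a single GPU, which is why the single-GPU policy search can be reused.

```python
def partition_layers(num_layers: int, num_gpus: int) -> list[range]:
    """Split layer indices into contiguous, near-equal pipeline stages."""
    if num_gpus <= 0:
        raise ValueError("num_gpus must be positive")
    base, extra = divmod(num_layers, num_gpus)
    stages, start = [], 0
    for gpu in range(num_gpus):
        # The first `extra` stages take one additional layer each.
        size = base + (1 if gpu < extra else 0)
        stages.append(range(start, start + size))
        start += size
    return stages


stages = partition_layers(num_layers=96, num_gpus=4)
print([len(stage) for stage in stages])
```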

Approximate methods

LLMs are typically robust to careful approximations
Two approximations introduced: group-wise quantization and sparse attention
Group-wise quantization: weights and KV cache quantized into 4-bit integers without retraining or calibration
Group-wise quantization: fine-grained group-wise asymmetric quantization method used
Sparse attention: loading only the top 10% of the attention value cache on OPT-175B maintains model quality
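
Group-wise asymmetric quantization can be illustrated in pure Python. This is a sketch under stated assumptions, not FlexGen's implementation: each group of consecutive values is mapped to integer codes using its own minimum and maximum, the group size of 64 is an assumed default, and all function names are hypothetical.

```python
def quantize_group(values, bits=4):
    """Min-max (asymmetric) quantization of one group to `bits`-bit codes."""
    lo, hi = min(values), max(values)
    levels = (1 << bits) - 1          # 15 for 4 bits
    scale = (hi - lo) / levels if hi > lo else 1.0
    codes = [round((v - lo) / scale) for v in values]
    return codes, lo, scale


def dequantize_group(codes, lo, scale):
    """Map integer codes back to approximate floating-point values."""
    return [lo + c * scale for c in codes]


def quantize_groupwise(values, group_size=64, bits=4):
    """Quantize a flat list group by group; each group keeps its own lo and scale."""
    return [quantize_group(values[i:i + group_size], bits)
            for i in range(0, len(values), group_size)]


def dequantize_groupwise(groups):
    out = []
    for codes, lo, scale in groups:
        out.extend(dequantize_group(codes, lo, scale))
    return out


data = [i / 10 for i in range(-64, 64)]          # 128 toy values
groups = quantize_groupwise(data, group_size=64, bits=4)
restored = dequantize_groupwise(groups)
max_err = max(abs(a - b) for a, b in zip(data, restored))
print(len(groups), max_err)
```

Because each group has its own range, an outlier only degrades the precision of the group it falls in, which is the motivation for using fine-grained groups.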

Evaluation

Experiments run on NVIDIA T4 GPU instances from Google Cloud
OPT models with 6.7B to 175B parameters used in evaluation
Focus is high-throughput generation on given dataset
Prompts padded to same length, system required to generate 32 tokens
Evaluation metric is generation throughput
Baselines are DeepSpeed ZeRO-Inference and Hugging Face Accelerate
FlexGen implemented on top of PyTorch

Offloading

FlexGen outperforms all baselines in maximum generation throughput
FlexGen uses block scheduling to reuse weights
FlexGen achieves 69x higher throughput than baselines on OPT-175B
With compression enabled, FlexGen achieves 112x higher generation throughput
FlexGen achieves super-linear scaling on decoding throughput
FlexGen sets a new Pareto-optimal frontier that significantly outperforms baselines
On the low-latency side of the frontier, FlexGen uses partial offloading and keeps more space for weights
On the high-throughput side, FlexGen aggressively offloads everything out of the GPU
The ablation study shows the importance of a good policy, CPU computation, and overlapping

Approximations

Two tasks used to show negligible accuracy loss: next-word prediction and language modeling
4-bit compression used to compress weights and KV cache into 4-bit integers
4-bit-S combines quantization and sparse attention with 10% sparsity on value cache
Negligible accuracy loss compared to FP16
3-bit compression does not preserve accuracy

Offloading vs. collaborative inference

Decentralized collaborative inference is an option to reduce resource requirement for LLM utilization.
Petals and FlexGen are compared under different network conditions.
FlexGen outperforms Petals in per-GPU throughput, and in slow-network cases also in latency.

Conclusion

Introduction of FlexGen, a high-throughput generation engine for LLMs
Resource requirements of running 175B-scale models reduced to a single 16GB GPU
Generation throughput of 1 token/s with an effective batch size of 144
FlexGen provides a viable option for resource-constrained and throughput-oriented scenarios
Graph traversal problem discussed in Section 4.1
The analysis assumes the model cannot fit in a single GPU and that CPU computation is not used
Two schedules discussed: zig-zag block schedule and I/O-optimal diagonal block schedule
Three things need to be stored during generation process: weights, activations, and KV cache
Three observations made from computational graph
Zig-zag block schedule: compute first column for bls samples, save caches and activations, then compute second column for bls samples, until last column for bls samples
Diagonal block schedule: block containing 4 GPU batches, one-time warm-up phase, compute diagonal containing 4 sub-diagonals, repeat in next row
Discussions: I/O is a significant bottleneck, diagonal block schedule can give considerable gain, can reduce peak memory and enlarge batch size
Full cost model: maximize throughput, piece-wise functions and regularization terms used, linear programming problem with respect to policy variables
Tables 10-19 show results for different sequence lengths, systems, and tasks
