Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

  • LLM inference traditionally requires multiple high-end accelerators.
  • This paper studies LLM inference using limited resources, such as a single commodity GPU.
  • FlexGen is a high-throughput generation engine for running LLMs with limited GPU memory.
  • FlexGen can be flexibly configured under various hardware resource constraints.
  • FlexGen compresses weights and the attention cache to 4 bits with negligible accuracy loss.
  • FlexGen significantly increases maximum throughput.
  • FlexGen achieves significantly higher throughput compared to state-of-the-art offloading systems.
  • FlexGen can benchmark a 30B model with a 16GB GPU in 21 hours.

Paper Content

Introduction

  • Large language models (LLMs) have strong performance across a range of tasks
  • LLMs can have billions or trillions of parameters
  • This leads to high computational and memory requirements
  • Lowering LLM inference resource requirements has attracted interest
  • Throughput-oriented generative inference is a setting where latency can be traded off for higher throughput
  • Three directions to lower resource requirements: model compression, collaborative inference, and offloading
  • Challenges: efficient offloading strategy and effective compression strategies
  • FlexGen: offloading framework for high-throughput LLM inference
  • Search space of possible offloading strategies
  • Linear programming-based search algorithm to optimize throughput
  • Fine-grained group-wise quantization for compression
  • FlexGen achieves much higher throughput than existing systems
  • FlexGen outperforms decentralized Petals cluster in terms of per-GPU throughput
  • LLM inference is an important workload
  • Systems have been developed to enable LLM inference
  • Offloading is an essential technique for LLM inference on commodity hardware
  • Algorithm-oriented works have been developed to accelerate LLM inference
  • Memory optimizations and offloading have been studied for training and linear algebra

Background: llm inference

  • LLM inference workflow consists of two stages: prefill and decoding
  • Prefill stage generates key-value cache for each transformer layer
  • Decoding stage utilizes and updates KV cache to generate tokens
  • Memory footprint of LLM inference comes from model weights and KV cache
  • KV cache is a new bottleneck of large-batch high-throughput inference

Offloading strategy

  • Formulate the problem and construct the search space of possible offloading strategies in FlexGen
  • Build an analytical cost model and search for configurations with an optimizer based on linear programming
  • Extend FlexGen to support multi-GPU settings
  • Compute a block with multiple GPU batches

Problem formulation

  • Machine has 3 devices: GPU, CPU, and disk
  • GPU has smallest but fastest memory, disk has largest but slowest memory
  • LLM can’t fit in GPU, so need to offload to secondary storage
  • Graph traversal problem to generate inference with offloading
  • 4 layers, 3 tokens per prompt
  • Valid path must traverse all squares, subject to constraints
  • Goal is to minimize total execution time

Search space

  • Construct a search space for possible valid strategies in FlexGen
  • Two orders to traverse the graph: row-by-row and column-by-column
  • Existing systems traverse the graph row-by-row
  • Traversing the graph column-by-column reduces I/O costs
  • Converge to a zig-zag block schedule
  • Introduce two parameters: GPU batch size and number of GPU batches in a block
  • Tensor placement: define percentages of weights, activations and KV cache stored on GPU, CPU and disk
  • Partition tensors at layer granularity for weights and tensor granularity for activations and KV cache
  • Computation delegation: compute attention scores on CPU to reduce I/O costs
  • Constructs a search space with several parameters
  • Develops an analytical cost model to estimate execution time
  • Cost model predicts latency during prefill and decoding for one layer
  • Total latency for computing a block estimated as T = T pre • l + T gen • (n − 1) • l
  • Latency of read from CPU to GPU, write from GPU to CPU, read from disk to CPU, write from CPU to disk, computation estimated during prefill and decoding
  • Policy includes 11 variables: block size bls, GPU batch size gbs, weight placement wg, wc, wd, activation placement hg, hc, hd, and KV cache placement cg, cc, cd
  • Solved as two-level optimization problem
  • Cost model can usually return a good policy, but better policy can be obtained by tuning manually

Extension to multiple gpus

  • FlexGen can extend offloading strategy if multiple GPUs are available
  • Model parallelism can reduce memory pressure and lead to super-linear scaling
  • Two kinds of model parallelism: tensor and pipeline
  • FlexGen implements pipeline parallelism by partitioning LLM on multiple GPUs
  • Policy search developed for one GPU can be reused
  • Micro-batch pipelining added to Fig. 4 to combine iteration-level pipeline execution schedule

Approximate methods

  • LLMs are typically robust to careful approximations
  • Two approximations introduced: group-wise quantization and sparse attention
  • Group-wise quantization: weights and KV cache quantized into 4-bit integers without retraining or calibration
  • Group-wise quantization: fine-grained group-wise asymmetric quantization method used
  • Sparse Attention: top 10% attention value cache loaded on OPT-175B while maintaining model quality

Evaluation

  • Experiments run on NVIDIA T4 GPU instances from Google Cloud
  • OPT models with 6.7B to 175B parameters used in evaluation
  • Focus is high-throughput generation on given dataset
  • Prompts padded to same length, system required to generate 32 tokens
  • Evaluation metric is generation throughput
  • Baselines are DeepSpeed ZeRO-Inference and Hugging Face Accelerate
  • FlexGen implemented on top of PyTorch

Offloading

  • FlexGen outperforms all baselines in maximum generation throughput
  • FlexGen uses block scheduling to reuse weights
  • FlexGen achieves 69x higher throughput than baselines on OPT-175B
  • With compression enabled, FlexGen achieves 112x higher generation throughput
  • FlexGen achieves super-linear scaling on decoding throughput
  • FlexGen sets a new Pareto-optimal frontier that significantly outperforms baselines
  • FlexGen uses partial offloading and more space for weights on low-latency side
  • FlexGen aggressively offloads all things out of the GPU on high-throughput side
  • Ablation study shows importance of good policy and using CPU compute and overlapping

Approximations

  • Two tasks used to show negligible accuracy loss: next-word prediction and language modeling
  • 4-bit compression used to compress weights and KV cache into 4-bit integers
  • 4-bit-S combines quantization and sparse attention with 10% sparsity on value cache
  • Negligible accuracy loss compared to FP16
  • 3-bit compression does not preserve accuracy

Offloading vs. collaborative inference

  • Decentralized collaborative inference is an option to reduce resource requirement for LLM utilization.
  • Petals and FlexGen are compared under different network conditions.
  • FlexGen outperforms Petals in terms of throughput and latency.

Conclusion

  • Introduction of FlexGen, a high-throughput generation engine for LLMs
  • Resource requirements of running 175B-scale models reduced to a single 16GB GPU
  • Generation throughput of 1 token/s with an effective batch size of 144
  • FlexGen provides a viable option for resource-constrained and throughput-oriented scenarios
  • Graph traversal problem discussed in Section 4.1
  • Model cannot fit in a single GPU, no application of CPU computation
  • Two schedules discussed: zig-zag block schedule and I/O-optimal diagonal block schedule
  • Three things need to be stored during generation process: weights, activations, and KV cache
  • Three observations made from computational graph
  • Zig-zag block schedule: compute first column for bls samples, save caches and activations, then compute second column for bls samples, until last column for bls samples
  • Diagonal block schedule: block containing 4 GPU batches, one-time warm-up phase, compute diagonal containing 4 sub-diagonals, repeat in next row
  • Discussions: I/O is a significant bottleneck, diagonal block schedule can give considerable gain, can reduce peak memory and enlarge batch size
  • Full cost model: maximize throughput, piece-wise functions and regularization terms used, linear programming problem with respect to policy variables
  • Tables 10-19 show results for different sequence lengths, systems, and tasks