Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

  • GPT-family models can be pruned to at least 50% sparsity in one shot, without retraining, at minimal loss of accuracy
  • SparseGPT is a new pruning method designed for GPT-family models
  • SparseGPT can reach 60% sparsity with minimal increase in perplexity
  • SparseGPT is compatible with weight quantization approaches

Paper Content

Introduction

  • Large Language Models (LLMs) from the Generative Pretrained Transformer (GPT) family have shown remarkable performance on a wide range of tasks.
  • The GPT-175B model has 175 billion parameters and requires at least five A100 GPUs with 80GB of memory each for inference.
  • Model compression approaches have focused on quantization and pruning.
  • Pruning has a long history and has been applied to vision and smaller-scale language models and tasks.
  • Existing pruning methods require extensive retraining of the model to recover from accuracy loss.
  • One-shot pruning methods exist, but are too computationally expensive to be applied to models with billions of parameters.
  • SparseGPT is the first accurate one-shot pruning method that works efficiently at the scale of models with 10-100 billion parameters.
  • SparseGPT induces 50-60% sparsity in one shot, with minor accuracy loss.
  • Larger models are easier to sparsify, with 50% sparsity resulting in practically no accuracy decrease on the largest models.
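
The GPU count quoted for GPT-175B follows from simple arithmetic. A minimal sketch, assuming float16 storage (2 bytes per parameter), decimal gigabytes, and weights only (no activations or caches); the function name is hypothetical:

```python
import math

def min_gpus_for_weights(n_params, bytes_per_param=2, gpu_mem_gb=80):
    """Lower bound on the GPUs needed just to hold the weights."""
    weight_gb = n_params * bytes_per_param / 1e9
    return weight_gb, math.ceil(weight_gb / gpu_mem_gb)

dense_gb, dense_gpus = min_gpus_for_weights(175e9)
print(dense_gb, dense_gpus)  # 350.0 5
```

Four 80GB GPUs provide 320GB, which is less than the weights alone occupy under these assumptions, hence the fifth device.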

Background

Fast approximate reconstruction

  • Motivation: Solving the sparse reconstruction problem exactly for a fixed pruning mask requires inverting a separately masked Hessian for each of the $d_{row}$ rows. Each inversion takes $O(d_{col}^3)$ time, for $O(d_{row} \cdot d_{col}^3)$ in total, which is infeasible for large GPT variants.
  • Different Row-Hessian Challenge: The row masks are generally different, and the inverse of a masked Hessian does not equal the masked version of the full inverse, so a single inverse cannot be shared across rows.
  • Equivalent Iterative Perspective: The Optimal Brain Surgeon (OBS) update can be used to prune the weights iteratively, one at a time, reducing an initially complete mask to the desired mask.
  • Optimal Partial Updates: The OBS update can also be restricted to a subset $U$ of the remaining unpruned weights.
  • Hessian Synchronization: A sequence of inverse Hessians $(H_{U_j})^{-1}$ can be calculated recursively in $O(d_{col}^3)$ time and reused to remove weight $j$ in every row where it is part of the pruning mask.
  • Weight Freezing Interpretation: Compressing a weight ultimately means fixing it to some specific value and ensuring that it is never “decompressed” again.
  • Goal Achieved: The overall runtime scales with the third power of the hidden dimension $d_{hidden}$, a more than 10000× compute reduction for models with more than 100 billion parameters.
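
The OBS update and the recursive inverse-Hessian step can be illustrated on a single row. This is a minimal NumPy sketch, not the authors' implementation: it assumes the layer-wise squared error between the dense and pruned outputs as the loss, takes the Hessian as H = X Xᵀ, and uses hypothetical function names.

```python
import numpy as np

def obs_prune_one(w, H_inv, m):
    """Zero weight m of one row and compensate the other weights (OBS update)."""
    delta = -(w[m] / H_inv[m, m]) * H_inv[:, m]
    error = w[m] ** 2 / H_inv[m, m]   # resulting increase in squared output error
    w_new = w + delta
    w_new[m] = 0.0                    # clear floating-point residue
    return w_new, error

def drop_index_from_inverse(H_inv, m):
    """Inverse of H without row/column m, derived from H_inv without re-inverting."""
    reduced = H_inv - np.outer(H_inv[:, m], H_inv[m, :]) / H_inv[m, m]
    keep = [i for i in range(H_inv.shape[0]) if i != m]
    return reduced[np.ix_(keep, keep)]

rng = np.random.default_rng(0)
d_col, n_samples, m = 6, 64, 2
X = rng.standard_normal((d_col, n_samples))   # layer inputs, one column per sample
w = rng.standard_normal(d_col)                # one row of the weight matrix
H = X @ X.T
H_inv = np.linalg.inv(H)

w_pruned, err = obs_prune_one(w, H_inv, m)
H_inv_reduced = drop_index_from_inverse(H_inv, m)
```

For this quadratic loss the update is exact: `w_pruned` matches a least-squares refit of the remaining weights, and `H_inv_reduced` matches the inverse of `H` with row and column `m` deleted. Because each elimination step costs only a rank-one update, one sequence of inverses can serve all rows.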

Adaptive mask selection

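  • The preceding section covers only weight reconstruction and assumes a fixed pruning mask
  • The pruning mask can be chosen by a magnitude criterion or by using second-order information
  • The updates applied during pruning change the weights significantly because of correlations, so mask selection benefits from taking them into account
  • Selecting the same fraction of weights in every column would force uniform sparsity, whereas a good mask may distribute sparsity non-uniformly across columns
  • Iterative blocking is proposed to alleviate this disadvantage while keeping the accuracy gains of adaptive weight selection: the mask is selected for blocks of $B_s$ columns at a time, based on the weights as updated so far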
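
Blockwise mask selection can be sketched as follows. This is a minimal illustration, assuming the saliency of a weight is its squared value divided by the squared inverse-Hessian diagonal entry of its column; the function name and the toy values are hypothetical.

```python
import numpy as np

def select_block_mask(W_block, hinv_diag, sparsity):
    """Boolean mask (True = prune) over one block of columns."""
    saliency = W_block ** 2 / hinv_diag[None, :] ** 2
    k = int(round(saliency.size * sparsity))
    mask = np.zeros(saliency.size, dtype=bool)
    # Prune the k least salient weights anywhere in the block, not per column.
    mask[np.argsort(saliency, axis=None)[:k]] = True
    return mask.reshape(saliency.shape)

W_block = np.array([[0.10,  2.0, -3.0, 0.2],
                    [0.30, -1.5,  2.5, 0.1],
                    [0.20,  1.0, -2.0, 0.4],
                    [0.05,  3.0,  1.2, 0.3]])
mask = select_block_mask(W_block, np.ones(4), sparsity=0.5)
per_column = mask.sum(axis=0)   # pruned weights per column
```

With a unit diagonal the criterion reduces to magnitude, and the block reaches 50% sparsity by pruning columns 0 and 3 entirely while leaving columns 1 and 2 dense. A per-column selection could not produce this non-uniform distribution.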

Extension to semi-structured sparsity

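  • SparseGPT can be adapted to semi-structured pruning patterns, such as the n:m sparsity format, in which exactly n out of every m consecutive weights are zero.
  • A popular instance is the 2:4 pattern supported on NVIDIA Ampere GPUs, where 2 out of every 4 consecutive weights are zero.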
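
An n:m mask can be built by ranking weights inside each group of m consecutive entries. A minimal sketch, assuming a precomputed saliency score per weight (squared magnitude is used here only as a stand-in) and a hypothetical function name:

```python
import numpy as np

def nm_prune_mask(saliency, n=2, m=4):
    """Mask (True = prune) with exactly n pruned entries per m consecutive weights in a row."""
    rows, cols = saliency.shape
    if cols % m != 0:
        raise ValueError("number of columns must be a multiple of m")
    groups = saliency.reshape(rows, cols // m, m)
    order = np.argsort(groups, axis=-1)          # least salient first within each group
    mask = np.zeros_like(groups, dtype=bool)
    np.put_along_axis(mask, order[..., :n], True, axis=-1)
    return mask.reshape(rows, cols)

W = np.array([[0.5, -0.1, 0.3, 2.0, 1.0, -4.0, 0.2, 0.0]])
mask_24 = nm_prune_mask(W ** 2, n=2, m=4)
W_24 = np.where(mask_24, 0.0, W)
```

In this example the first group keeps 0.5 and 2.0, and the second keeps 1.0 and -4.0.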

Full algorithm pseudocode

  • Algorithm 1 is the SparseGPT algorithm
  • It prunes the layer matrix W to p% unstructured sparsity
  • It takes the inverse Hessian $H^{-1} = (XX^\top + \lambda I)^{-1}$ as input
  • It uses a lazy batch-update blocksize $B$ and an adaptive mask selection blocksize $B_s$
  • It follows the column-wise greedy framework of GPTQ
  • The inverse Hessian sequence information is precomputed via a Cholesky decomposition for numerical robustness
  • Weight matrix updates are applied lazily in batches to improve the compute-to-memory ratio
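
The column-wise loop can be sketched in a few lines of NumPy. This is a simplified illustration rather than the paper's Algorithm 1: lazy batching is omitted (it changes speed, not results), the dampening term is an assumed form, and the function and parameter names are hypothetical.

```python
import numpy as np

def sparsegpt_sketch(W, X, sparsity=0.5, mask_block=4, damp=0.01):
    """Prune W (d_row x d_col) using calibration inputs X (d_col x n_samples)."""
    W = np.array(W, dtype=np.float64)            # work on a copy
    d_row, d_col = W.shape
    H = X @ X.T
    H = H + damp * np.mean(np.diag(H)) * np.eye(d_col)   # assumed dampening
    # Upper-triangular factor U with H^{-1} = U.T @ U.
    U = np.linalg.cholesky(np.linalg.inv(H)).T
    diag = np.diag(U)
    prune = np.zeros((d_row, d_col), dtype=bool)
    for j in range(d_col):
        if j % mask_block == 0:
            # Choose the mask for the next block from the weights as updated so far.
            blk = slice(j, min(j + mask_block, d_col))
            saliency = W[:, blk] ** 2 / diag[blk] ** 2
            k = int(saliency.size * sparsity)
            chosen = np.zeros(saliency.size, dtype=bool)
            chosen[np.argsort(saliency, axis=None)[:k]] = True
            prune[:, blk] = chosen.reshape(saliency.shape)
        # Pruned weights contribute an error; kept weights are frozen as they are.
        err = np.where(prune[:, j], W[:, j], 0.0) / U[j, j]
        # Zero column j where pruned and compensate in the columns to its right.
        W[:, j:] -= np.outer(err, U[j, j:])
    W[prune] = 0.0
    return W, prune

rng = np.random.default_rng(0)
W_dense = rng.standard_normal((8, 16))
X_calib = rng.standard_normal((16, 256))
W_sparse, prune_mask = sparsegpt_sketch(W_dense, X_calib)
```

Each column is visited once, and only columns to its right are updated, which is what lets the same inverse-Hessian information be shared across all rows.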

Joint sparsification & quantization

  • Algorithm 1 operates in the column-wise greedy framework of GPTQ
  • SparseGPT and GPTQ can therefore be merged into a single joint procedure
  • Weights frozen by SparseGPT are additionally quantized
  • Quantization and pruning are performed jointly in a single pass
  • Prior techniques sparsify a layer and then quantize the remaining weights, where quantization has no influence on pruning outcomes
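
One column step of such a joint pass can be sketched as follows. This is a minimal illustration under assumptions: a hypothetical round-to-nearest quantizer on a uniform grid without clipping, an upper-triangular factor `U` of the inverse Hessian as in the column-wise framework, and hypothetical function names.

```python
import numpy as np

def quantize_rtn(w, scale):
    """Round to the nearest point of a uniform grid (hypothetical quantizer)."""
    return np.round(w / scale) * scale

def joint_column_step(W, U, j, prune_col, scale):
    """Prune or quantize column j in place, then compensate later columns."""
    target = np.where(prune_col, 0.0, quantize_rtn(W[:, j], scale))
    # The error now includes both pruning and quantization effects.
    err = (W[:, j] - target) / U[j, j]
    W[:, j:] -= np.outer(err, U[j, j:])
    W[:, j] = target                     # clear floating-point residue
    return W

W = np.array([[0.26, 1.0],
              [0.90, -0.4]])
U = np.array([[2.0, 1.0],
              [0.0, 1.0]])
W = joint_column_step(W, U, j=0, prune_col=np.array([False, True]), scale=0.25)
```

Because the quantization error feeds into the same compensation step, it changes the weights that later mask selection sees, which is how quantization can influence pruning outcomes in a joint pass.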

Experiments

  • Implemented SparseGPT in PyTorch
  • Used HuggingFace Transformers library
  • Experiments conducted on a single NVIDIA A100 GPU
  • SparseGPT can fully sparsify 175-billion-parameter models in approximately 4 hours
  • Sparsify Transformer layers sequentially
  • Compression experiments performed without finetuning
  • Calibration data from C4 dataset
  • Primarily worked with OPT model family
  • Evaluated perplexity on WikiText2 test set
  • Also evaluated zero-shot accuracy on LAMBADA, ARC, PIQA and StoryCloze
  • Compared with magnitude pruning criterion
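
Perplexity, the main metric above, is the exponential of the mean negative log-likelihood per token. A minimal sketch, assuming natural-log token probabilities are already available from the model; the function name is hypothetical:

```python
import math

def perplexity(token_log_probs):
    """exp of the mean negative log-likelihood (natural log) per token."""
    if not token_log_probs:
        raise ValueError("need at least one token")
    return math.exp(-sum(token_log_probs) / len(token_log_probs))

# A model that spreads probability uniformly over 4 choices at every step
# has a perplexity of about 4.
uniform_ppl = perplexity([math.log(1 / 4)] * 10)
```

Lower is better, so a pruned model is judged by how little this number rises relative to the dense baseline.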

Results

  • Pruning difficulty of LLMs changes with size: with SparseGPT, larger models are easier to sparsify
  • Magnitude pruning causes accuracy to collapse across all scales
  • SparseGPT enables up to 60% sparsity at a perplexity increase comparable to the one magnitude pruning incurs at far lower sparsity
  • For BLOOM-176B, magnitude pruning can reach up to 30% sparsity without significant accuracy loss
  • SparseGPT can remove about 100 billion weights with limited impact on accuracy
  • SparseGPT models stay close to the original accuracy on zero-shot tasks
  • Combining sparsity and quantization can save memory
  • SparseGPT models with 50% sparsity and 4-bit weights are more accurate than 3-bit quantized versions

Related Work

  • Model compression reduces the size of models to make them more efficient
  • Pruning is one approach to model compression
  • Pruning of massive GPT-scale models (10+ billion parameters) has not been investigated before
  • Existing pruning methods require extensive retraining, which is difficult for GPT-scale models
  • SparseGPT is a post-training method for GPT-scale models
  • Post-training quantization has been investigated for GPT-scale models
  • SparseGPT can be used in conjunction with quantization approaches
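
The memory argument for combining sparsity with quantization can be checked with a short calculation. This is a sketch under an assumed storage format: only nonzero values are stored, plus one bit per original weight to mark their positions. The function name is hypothetical.

```python
def bits_per_weight(sparsity, value_bits, mask_bits=1.0):
    """Average storage per original weight: kept values plus a position bitmask."""
    return (1.0 - sparsity) * value_bits + mask_bits

sparse_4bit = bits_per_weight(0.5, 4)                  # 0.5 * 4 + 1 = 3.0
dense_3bit = bits_per_weight(0.0, 3, mask_bits=0.0)    # 3.0, no mask needed
model_gb = 175e9 * sparse_4bit / 8 / 1e9               # 65.625 GB for 175B weights
```

Under this assumption a 50% sparse, 4-bit model and a dense 3-bit model occupy the same space, so comparing their accuracy is a like-for-like comparison.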

Discussion

  • SparseGPT is a post-training pruning method for GPT family language models
  • SparseGPT can compress large-scale GPT-family models to high sparsity with low accuracy fluctuations
  • More than 100 billion weights of these models can be ignored at inference time
  • SparseGPT is local and performs weight updates to preserve the input-output relationship for each layer
  • Larger models are easier to sparsify
  • At 50% sparsity, the largest models show practically no accuracy decrease
  • Future work could investigate fine-tuning mechanisms and applicability during training
  • SparseGPT results are also confirmed on the C4 dataset
  • SparseGPT can achieve 50-60% sparsity with only a minor perplexity increase