Link to paper
The full paper is available here.
You can also find the paper on PapersWithCode here.
Abstract
- GPT family models can be pruned to 50% sparsity without retraining
- SparseGPT is a new pruning method designed for GPT-family models
- SparseGPT can reach 60% sparsity with minimal increase in perplexity
- SparseGPT is compatible with weight quantization approaches
Paper Content
Introduction
- Large Language Models (LLMs) from the Generative Pretrained Transformer (GPT) family have shown remarkable performance on a wide range of tasks.
- GPT-175B model has 175 billion parameters, requiring five A100 GPUs with 80GB of memory each for inference.
- Model compression approaches have focused on quantization and pruning.
- Pruning has a long history and has been applied to vision and smaller-scale language models and tasks.
- Existing pruning methods require extensive retraining of the model to recover from accuracy loss.
- One-shot pruning methods exist, but are too computationally-expensive to be applied to models with billions of parameters.
- SparseGPT is the first accurate one-shot pruning method which works efficiently at the scale of models with 10-100 billion parameters.
- SparseGPT induces 50-60% sparsity in one-shot, with minor accuracy loss.
- Larger models are easier to sparsify, with 50% sparsity resulting in practically no accuracy decrease on the largest models.
Background
Fast approximate reconstruction
- Motivation: Solving the sparse reconstruction problem for a fixed pruning mask requires inverting the Hessian matrix for each row, which takes O(d3col) time and is infeasible for large GPT variants.
- Different Row-Hessian Challenge: The row masks are generally different, so the inverse of a masked Hessian does not equal the masked version of the full inverse.
- Equivalent Iterative Perspective: OBS update can be used to iteratively prune the weights one-at-a-time, reducing an initially complete mask to the desired mask.
- Optimal Partial Updates: OBS update can be used to update only the weights in a subset U of the remaining unpruned weights.
- Hessian Synchronization: A sequence of inverse Hessians (HUj)−1 can be calculated recursively in O(d3col) time and reused to remove weight j in all rows where it is part of the pruning mask.
- Weight Freezing Interpretation: Compressing a weight ultimately means fixing it to some specific value and ensuring that it is never “decompressed” again.
- Goal Achieved: The overall runtime scales with the 3rd power of the hidden dimension dhidden, resulting in a > 10000× compute reduction for models with more than 100 billion parameters.
Adaptive mask selection
- Reconstruction aspect of pruning mask is discussed
- Pruning mask can be chosen using magnitude criterion or second-order information
- Updates applied during pruning process change weights significantly due to correlations
- Sparsity can be distributed non-uniformly across columns
- Iterative blocking proposed to alleviate disadvantage while exploiting accuracy gains of adaptive weight selection
Extension to semi-structured sparsity
- SparseGPT can be adapted to semi-structured pruning patterns, such as n:m sparsity format.
- For the 2:4 implementation on Ampere NVIDIA GPUs, every m weights should contain n zeros.
Full algorithm pseudocode
- Algorithm 1 is the SparseGPT algorithm
- Prune layer matrix W to p% unstructured sparsity
- Inverse Hessian H-1 is used
- Lazy batch-update blocksize B and adaptive mask selection blocksize Bs used
- Column-wise greedy framework of GPTQ used
- Precomputing inverse Hessian sequence information via Cholesky decomposition for numerical robustness
- Lazy batched weight matrix updates to improve compute-to-memory ratio
Joint sparsification & quantization
- Algorithm 1 operates in the column-wise greedy framework of GPTQ
- SparseGPT and Algorithm 1 can be merged into a single joint procedure
- Weights frozen by SparseGPT are additionally quantized
- Quantization and pruning are performed jointly in a single pass
- Prior techniques sparsify a layer and then quantize the remaining weights, where quantization has no influence on pruning outcomes
Experiments
- Implemented SparseGPT in PyTorch
- Used HuggingFace Transformers library
- Experiments conducted on single NVIDIA A100 GPU
- SparseGPT can fully sparsify 175-billion-parameter models in 4 hours
- Sparsify Transformer layers sequentially
- Compression experiments performed without finetuning
- Calibration data from C4 dataset
- Primarily worked with OPT model family
- Evaluated perplexity on WikiText2 test set
- Also evaluated ZeroShot accuracy on LAMBADA, ARC, PIQA and StoryCloze
- Compared with magnitude pruning criterion
Results
- Pruning difficulty of LLMs increases with size
- Magnitude pruning causes accuracy to collapse across all scales
- SparseGPT enables up to 60% sparsity with comparable perplexity increase
- Magnitude pruning can achieve up to 30% sparsity for BLOOM-176B
- SparseGPT can remove 100 billion weights with limited impact on accuracy
- SparseGPT models stay close to original accuracy in ZeroShot tasks
- Combining sparsity and quantization can save memory
- SparseGPT 50% + 4-bit models are more accurate than 3-bit versions
Related work
- Model compression reduces the size of models to make them more efficient
- Pruning is one approach to model compression
- Pruning of massive GPT-scale models (10+ billion parameters) has not been investigated before
- Existing pruning methods require extensive retraining, which is difficult for GPT-scale models
- SparseGPT is a post-training method for GPT-scale models
- Post-training quantization has been investigated for GPT-scale models
- SparseGPT can be used in conjunction with quantization approaches
Discussion
- SparseGPT is a post-training pruning method for GPT family language models
- SparseGPT can compress large-scale GPT-family models to high sparsity with low accuracy fluctuations
- SparseGPT can ignore more than 100 billion weights from these models at inference time
- SparseGPT is local and performs weight updates to preserve the input-output relationship for each layer
- Larger models are easier to sparsify
- SparseGPT can achieve 50-60% sparsity with no accuracy decrease on the largest models
- Future work could investigate fine-tuning mechanisms and applicability during training
- SparseGPT results are confirmed on C4 dataset
- SparseGPT can achieve 50-60% sparsity with only a minor perplexity increase