Link to paper
The full paper is available here.
You can also find the paper on PapersWithCode here.
Abstract
- Training large models with billions of parameters is expensive and requires specialized HPC clusters.
- Alternative setups for training large models include using cheap “preemptible” instances or pooling resources from multiple regions.
- SWARM parallelism is a model-parallel training algorithm designed for poorly connected, heterogeneous and unreliable devices.
- SWARM parallelism was used to train a large Transformer language model with 1B shared parameters on preemptible T4 GPUs.
Paper Content
Introduction
- Deep learning community is increasingly reliant on large pretrained neural networks
- Parameter count of models has grown from hundreds of millions to hundreds of billions
- Models no longer fit into a single accelerator and require specialized training algorithms
- Intensive device-to-device communication is needed
- Training requires dedicated high-performance computing clusters or supercomputers
- Cost-efficient distributed training strategies leverage fleets of temporary “preemptible” instances
- Training in collaborations by pooling together preexisting resources or using volunteers
- Specialized algorithms needed to adapt to changing number of workers, utilize heterogeneous devices and recover from hardware and network failures
- Square-Cube Law of distributed training: training larger models can decrease network overhead
- SWARM parallelism: decentralized model-parallel algorithm for billion-scale training on heterogeneous unreliable devices with slow interconnect
- 8-bit compression allows training billion-scale Transformer language model on preemptible servers with low-power GPUs and network bandwidth of less than 200Mb/s
Background & related work
Model-parallel training
- Deep learning algorithms divide models between multiple workers
- Traditional model parallelism assigns each device to compute a subset of each layer
- Pipeline parallelism assigns each device with one or several layers
- Data parallelism with dynamic parameter loading is another strategy
- Mixture-of-Experts is a model-specific algorithm
Distributed training outside hpc
- Techniques designed for HPC setup not always available
- Preemptible instances or volunteer computing used as cost-efficient alternative
- Training in these environments more difficult due to machine disconnection and limited instances
- Elastic and asynchronous training methods proposed to handle unstable peers and heterogeneous devices
- Global scheduling used to optimize training over heterogeneous devices
- Model-parallel algorithms vulnerable to hardware and network failures
- Two methods allow training large models with unreliable devices
Communication efficiency and compression
- Training with limited network bandwidth or high latency can be addressed with gradient compression or overlapping computation with communication phases.
- Efficient gradient communication can be achieved with Deep Gradient Compression, PowerSGD, or 8-bit quantization.
- Overlapping communication and computation can be achieved with parallelism techniques, layer-by-layer synchronization, or pure pipeline parallelism.
Communication-efficient model parallelism
- Training large models with unreliable devices can be done using model-parallel algorithms
- SWARM parallelism is a decentralized algorithm for training large models with unreliable devices
The square-cube law of distributed training
- Model parallelism can be abstracted away from application-specific parameters
- Model consists of k stages, each represented by nxn matrices
- Compute time for matrix multiplication scales as O(n^3)
- Communication phase requires at most O(n^2) time
- Square-cube law applies to many real-world neural network architectures
- Pipeline parallelism grows more communication-efficient with model size
- Data-parallel training does not scale as well
Swarm parallelism
- Traditional pipeline parallelism can be communication efficient, but not enough for certain setups
- Weakest link in pipeline can bottleneck entire process
- Rigid pipeline structure replaced with temporary “pipelines” built stochastically on the fly
- Each participant can send outputs to any peer that serves the next pipeline stage
- Fault tolerance of SWARM parallelism
- Model partitioned into evenly sized stages
- Peers receive inputs from predecessors and send activations to peers in the next stage
- Gradients for outputs received, gradients for inputs computed, gradients for parameters accumulated
- All-Reduce to average gradients within their pipeline stages and optimizer step
- Delayed Parameter Updates (DPU) to improve hardware utilization
- Queues for incoming and outgoing requests to maintain high GPU utilization
- Activation checkpointing to reduce memory footprint
- Stochastic wiring to better utilize heterogeneous devices and recover from faults
- Adaptive swarm rebalancing to maximize throughput
Experiments
Communication efficiency at scale
- Experiments measure GPU utilization and network usage for different model sizes
- Experiments run on homogeneous V100 GPU nodes
- Model size affects computation to communication ratio
- 4 model configurations evaluated
- Larger models achieve better GPU utilization rate
- Activation compression can reduce network usage with little effect on convergence
- Large models maintain most of their training efficiency at 100ms latency
Detailed performance comparison
- Investigating how SWARM parallelism compares to existing systems for training large models
- Comparing training throughput in “ideal” conditions with homogeneous reliable devices and balanced layers
- Benchmarking individual SWARM components in preemptible setups
- Evaluating training performance for sequences of 4 Transformer layers of identical size distributed over 16 workers
- Measuring two separate performance statistics: training throughput and All-Reduce time
- Hardware setup: V100-PCIe GPU with 16 CPU threads and 128 GB RAM
- ZeRO-Offload outperforms both SWARM and GPipe when training smaller models
- SWARM is competitive to HPC baselines even in an idealized homogeneous environment
Large-scale distributed training
- Conducted a series of large-scale distributed experiments using cloud T4 and A100 GPUs
- Trained a Transformer language model with architecture similar to prior work
- Used 8-bit compression for activations and gradients
- Verified that model parallelism with asynchronous updates does not have significant convergence issues
- Compared training dynamics of two approaches to demonstrate viability of SWARM parallelism
- Measured pipeline throughput in different hardware conditions and compared to best-case pipeline performance
- Found that with layer sharing and 8-bit compression, medium-performance GPUs can be saturated with moderate network speeds
Adaptive rebalancing evaluation
- Peer rebalancing algorithm proposed in Section 3.2 is validated for efficiency
- Results show significant improvement over baseline, which grows over time
Conclusion
- Evaluating feasibility of high-throughput training of billion-scale neural networks on unreliable peers with low network bandwidth
- Proposing SWARM parallelism to overcome challenges of pipeline parallelism for preemptible devices with heterogeneous network bandwidths and computational throughputs
- Showing SWARM parallelism is effective at rebalancing peers and maximizing aggregate training throughput
- Showing training large models with SWARM parallelism and compression-aware architectures enables high utilization of cheap preemptible instances with slow interconnect
- Making training of large models accessible to researchers without access to dedicated compute infrastructure
- Regular data parallelism requires all-reduce steps which can be prohibitively expensive for large models
- Sharded data parallelism requires all-to-all communication of parameter buffers at each layer which can be inefficient for setups with different network bandwidths
- SWARM is worse than traditional parallelism due to extra complexity
- SWARM can be useful in case of supercomputers with heterogeneous devices
- SWARM can accelerate training while allowing training of large models
- SWARM can be more economical than on-demand cloud VMs or dedicated HPC setups
- SWARM is more efficient for models with more than 1B parameters
- SWARM can still be effective without layer sharing or quantization
- SWARM can suffer from pipeline “bubble” problem with long pipelines
- Sharded data parallelism, locally connected layers, Mixture-of-Experts, Switch layers, Product Key Memory can scale to large number of parameters
- Optimal scheduling for distributed training can be done with global optimization techniques
- Elastic training can rebalance load between remaining nodes if a worker leaves or fails
- Asynchronous training can utilize each device but may reduce convergence rate
- Stochastic wiring routes each training microbatch through random devices from each pipeline stage
- Adaptive rebalancing moves peers with lowest queue size from stage with minimum load to stage with maximum load