Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

  • Training large models with billions of parameters is expensive and requires specialized HPC clusters.
  • Alternative setups for training large models include using cheap “preemptible” instances or pooling resources from multiple regions.
  • SWARM parallelism is a model-parallel training algorithm designed for poorly connected, heterogeneous and unreliable devices.
  • SWARM parallelism was used to train a large Transformer language model with 1B shared parameters on preemptible T4 GPUs.

Paper Content

Introduction

  • Deep learning community is increasingly reliant on large pretrained neural networks
  • Parameter count of models has grown from hundreds of millions to hundreds of billions
  • Models no longer fit into a single accelerator and require specialized training algorithms
  • Intensive device-to-device communication is needed
  • Training requires dedicated high-performance computing clusters or supercomputers
  • Cost-efficient distributed training strategies leverage fleets of temporary “preemptible” instances
  • Training in collaborations by pooling together preexisting resources or using volunteers
  • Specialized algorithms needed to adapt to changing number of workers, utilize heterogeneous devices and recover from hardware and network failures
  • Square-Cube Law of distributed training: training larger models can decrease network overhead
  • SWARM parallelism: decentralized model-parallel algorithm for billion-scale training on heterogeneous unreliable devices with slow interconnect
  • 8-bit compression allows training billion-scale Transformer language model on preemptible servers with low-power GPUs and network bandwidth of less than 200Mb/s

Model-parallel training

  • Deep learning algorithms divide models between multiple workers
  • Traditional model parallelism assigns each device to compute a subset of each layer
  • Pipeline parallelism assigns each device with one or several layers
  • Data parallelism with dynamic parameter loading is another strategy
  • Mixture-of-Experts is a model-specific algorithm

Distributed training outside hpc

  • Techniques designed for HPC setup not always available
  • Preemptible instances or volunteer computing used as cost-efficient alternative
  • Training in these environments more difficult due to machine disconnection and limited instances
  • Elastic and asynchronous training methods proposed to handle unstable peers and heterogeneous devices
  • Global scheduling used to optimize training over heterogeneous devices
  • Model-parallel algorithms vulnerable to hardware and network failures
  • Two methods allow training large models with unreliable devices

Communication efficiency and compression

  • Training with limited network bandwidth or high latency can be addressed with gradient compression or overlapping computation with communication phases.
  • Efficient gradient communication can be achieved with Deep Gradient Compression, PowerSGD, or 8-bit quantization.
  • Overlapping communication and computation can be achieved with parallelism techniques, layer-by-layer synchronization, or pure pipeline parallelism.

Communication-efficient model parallelism

  • Training large models with unreliable devices can be done using model-parallel algorithms
  • SWARM parallelism is a decentralized algorithm for training large models with unreliable devices

The square-cube law of distributed training

  • Model parallelism can be abstracted away from application-specific parameters
  • Model consists of k stages, each represented by nxn matrices
  • Compute time for matrix multiplication scales as O(n^3)
  • Communication phase requires at most O(n^2) time
  • Square-cube law applies to many real-world neural network architectures
  • Pipeline parallelism grows more communication-efficient with model size
  • Data-parallel training does not scale as well

Swarm parallelism

  • Traditional pipeline parallelism can be communication efficient, but not enough for certain setups
  • Weakest link in pipeline can bottleneck entire process
  • Rigid pipeline structure replaced with temporary “pipelines” built stochastically on the fly
  • Each participant can send outputs to any peer that serves the next pipeline stage
  • Fault tolerance of SWARM parallelism
  • Model partitioned into evenly sized stages
  • Peers receive inputs from predecessors and send activations to peers in the next stage
  • Gradients for outputs received, gradients for inputs computed, gradients for parameters accumulated
  • All-Reduce to average gradients within their pipeline stages and optimizer step
  • Delayed Parameter Updates (DPU) to improve hardware utilization
  • Queues for incoming and outgoing requests to maintain high GPU utilization
  • Activation checkpointing to reduce memory footprint
  • Stochastic wiring to better utilize heterogeneous devices and recover from faults
  • Adaptive swarm rebalancing to maximize throughput

Experiments

Communication efficiency at scale

  • Experiments measure GPU utilization and network usage for different model sizes
  • Experiments run on homogeneous V100 GPU nodes
  • Model size affects computation to communication ratio
  • 4 model configurations evaluated
  • Larger models achieve better GPU utilization rate
  • Activation compression can reduce network usage with little effect on convergence
  • Large models maintain most of their training efficiency at 100ms latency

Detailed performance comparison

  • Investigating how SWARM parallelism compares to existing systems for training large models
  • Comparing training throughput in “ideal” conditions with homogeneous reliable devices and balanced layers
  • Benchmarking individual SWARM components in preemptible setups
  • Evaluating training performance for sequences of 4 Transformer layers of identical size distributed over 16 workers
  • Measuring two separate performance statistics: training throughput and All-Reduce time
  • Hardware setup: V100-PCIe GPU with 16 CPU threads and 128 GB RAM
  • ZeRO-Offload outperforms both SWARM and GPipe when training smaller models
  • SWARM is competitive to HPC baselines even in an idealized homogeneous environment

Large-scale distributed training

  • Conducted a series of large-scale distributed experiments using cloud T4 and A100 GPUs
  • Trained a Transformer language model with architecture similar to prior work
  • Used 8-bit compression for activations and gradients
  • Verified that model parallelism with asynchronous updates does not have significant convergence issues
  • Compared training dynamics of two approaches to demonstrate viability of SWARM parallelism
  • Measured pipeline throughput in different hardware conditions and compared to best-case pipeline performance
  • Found that with layer sharing and 8-bit compression, medium-performance GPUs can be saturated with moderate network speeds

Adaptive rebalancing evaluation

  • Peer rebalancing algorithm proposed in Section 3.2 is validated for efficiency
  • Results show significant improvement over baseline, which grows over time

Conclusion

  • Evaluating feasibility of high-throughput training of billion-scale neural networks on unreliable peers with low network bandwidth
  • Proposing SWARM parallelism to overcome challenges of pipeline parallelism for preemptible devices with heterogeneous network bandwidths and computational throughputs
  • Showing SWARM parallelism is effective at rebalancing peers and maximizing aggregate training throughput
  • Showing training large models with SWARM parallelism and compression-aware architectures enables high utilization of cheap preemptible instances with slow interconnect
  • Making training of large models accessible to researchers without access to dedicated compute infrastructure
  • Regular data parallelism requires all-reduce steps which can be prohibitively expensive for large models
  • Sharded data parallelism requires all-to-all communication of parameter buffers at each layer which can be inefficient for setups with different network bandwidths
  • SWARM is worse than traditional parallelism due to extra complexity
  • SWARM can be useful in case of supercomputers with heterogeneous devices
  • SWARM can accelerate training while allowing training of large models
  • SWARM can be more economical than on-demand cloud VMs or dedicated HPC setups
  • SWARM is more efficient for models with more than 1B parameters
  • SWARM can still be effective without layer sharing or quantization
  • SWARM can suffer from pipeline “bubble” problem with long pipelines
  • Sharded data parallelism, locally connected layers, Mixture-of-Experts, Switch layers, Product Key Memory can scale to large number of parameters
  • Optimal scheduling for distributed training can be done with global optimization techniques
  • Elastic training can rebalance load between remaining nodes if a worker leaves or fails
  • Asynchronous training can utilize each device but may reduce convergence rate
  • Stochastic wiring routes each training microbatch through random devices from each pipeline stage
  • Adaptive rebalancing moves peers with lowest queue size from stage with minimum load to stage with maximum load