Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

  • Renewed and growing interest in alternatives to digital computers to reduce energy cost of running neural networks
  • Optical matrix-vector multipliers best suited to performing computations with large operands
  • Small-scale optical experiments with prototype accelerator to demonstrate Transformer operations can run on optical hardware
  • Simulations and experiments explored energy efficiency of optical implementations of Transformers
  • Optical energy per multiply-accumulate (MAC) scales as $\frac{1}{d}$ where $d$ is the Transformer width
  • With well-engineered, large-scale optical hardware, possible to achieve $100 \times$ energy-efficiency advantage for running some of the largest current Transformer models
  • Assumptions about future improvements to electronics and Transformer quantization techniques could grow optical computers’ advantage to $>100,000\times$

Paper Content

Introduction

  • Deep learning models are becoming increasingly large, leading to concerns about energy usage, speed, and practicality.
  • Transformer architectures have been found to improve with larger parameters.
  • This has caused an exponential growth of deep learning compute usage.
  • Transformer models are being used in natural language processing, computer vision, graphs, and multi-modal settings.
  • Digital hardware’s energy efficiency has not kept up with the growing FLOP requirements of deep learning models.
  • Transformer models have transfer learning capabilities.
  • There is growing interest in analog computing for deep learning, such as optical neural networks.
  • Optical neural networks may offer superior efficiency and latency over digital computers.
  • We demonstrate that Transformers can run on optical systems.

Optical accelerators

  • There are many designs for optical accelerators for deep learning
  • Optical systems manipulate different types of optical modes to perform matrix-vector multiplications, vector-vector dot products, or convolutions
  • Optical systems are subject to noise from the hardware and photon detection
  • The number of photons used affects the signal-to-noise ratio
  • The efficiency of photon usage increases with more multiply-accumulate operations
  • The energy cost of optical neural networks is broken down into optical and electrical costs
  • Optical systems are more efficient than digital electronic ones

Previous optical neural network architectures

  • Previous work has studied deep learning models on benchmark tasks
  • Previous work has studied simulations of larger convolutional models on more difficult datasets

Optical transformers

  • Designed models to simulate behavior and energy consumption of other Transformers on optical hardware
  • Approach and model summarized in Figure 2

Architecture and task

  • Created optical Transformer models with GPT2-like architecture
  • Replaced GELU activation with ReLU6
  • Used raw Wikitext-103 dataset for language modelling

Transformer computations on optical hardware

  • SLM-based system computes vector-vector dot products
  • Sampled dot products from some layers to run on real hardware
  • Model of energy consumption for optical accelerators based on assumptions and results from experiment/simulations
  • Ran experiments using real Transformer’s weights to characterize system behavior
  • Sampled vector-vector dot products from different parts of the model to run on data with typical model statistics

Simulation of transformers on optical hardware

  • Constructed simulations of Transformers running on an optical accelerator
  • Hybrid Scheme used to compute activation and normalization functions digitally
  • Linear operations accelerated optically
  • Non-Negative Weights and Inputs used to work around display and SLM limitations
  • Device Quantization and Quantization-Aware Training used to emulate hardware
  • Systematic Errors simulated by adding Gaussian noise
  • Optical Encoding and Shot Noise used to simulate optical computations

Results

Error tolerance and simulation accuracy (experiments using optical hardware and comparison to simulation)

  • Transformer operations can run on real hardware without degraded performance.
  • Experiments used high photon numbers and multiple runs to eliminate noise.
  • Simulated noise distributions match experimental data.
  • Model performs well with significant noise, within 1 perplexity from noise-free performance.

Optical scaling laws (simulation)

  • Optical shot noise limits a model’s energy efficiency.
  • Simulations removed systematic errors and used same embedding dimension and photons for inference.
  • Optical hardware can achieve same perplexity as digital-electronic hardware with enough photons.
  • Perplexity degrades from ideal as fewer photons are used.
  • Optical Transformers can achieve same perplexity as digital-electronic processors with modest photon budgets.
  • Optical Transformers use fewer photons/MAC as they scale.
  • Larger Transformers may require less output precision.

Estimated energy usage (simulation)

  • Efficient photon scaling trend observed in Section 4.2 suggests Transformers running on optical hardware could achieve energy efficiency advantages over digital hardware
  • Designed an ONN system based on current hardware with measured precision and photon scaling
  • Inference system with in-place weights, 10 GHz light modulator array, and optical “core” which can perform 10M multiplications per cycle
  • Photon-per-MAC scaling versus model dimension taken from Figure 5
  • Estimated energy usage of Transformer models on optical hardware for single inference
  • Hypothetical future model designs labelled FUTURE-*
  • Energy advantage of running on optics over estimated NVIDIA A100 usage grows with model compute
  • 5-bit input precision, 8-bit weight precision, and 7-bit output precision
  • Calculated cost of loading and detecting every value in operation of model
  • For attention, weights-in-place not possible, must load both operands dynamically
  • Electrical costs composed of amplification, modulation, memory access, and digital/analog conversion
  • Optical energy simple: total number of MACs for model multiplied by photons/MAC and energy of a photon
  • As models grow, running Transformers on optical hardware has large and asymptotic efficiency advantage over digital hardware

Discussion

  • Optical Transformers can be used to design future hardware and software systems.
  • Results from Section 4.3 show the efficiency of Optical Transformers.

Implications for designing optical-neural-network (onn) accelerators

  • ONN systems require computation units (cores) for performing MACs, detectors, modulators, and memory.
  • Biggest challenge in creating ONN hardware is having enough (or large enough) cores.
  • Strategy to deal with hardware requirement is to split up large weight matrix into several “chunks” and cycle through them.
  • ONN hardware should implement at least 10^4 x 10^4 parallelism to achieve >100x efficiency improvements over current state-of-the-art GPUs.
  • ONN hardware should operate in regime where single matrix-vector multiplication is performed every 0.1 nanoseconds.
  • Weights-in-place is a good choice for large-scale neural network designs.
  • Improvements in CMOS technology (DAC, ADC, memory access costs) will be beneficial for ONN efficiency.
  • Future ONNs might achieve energy advantages of >100,000x running models the size of FUTURE-4q.

Scalable optical deep learning

  • Transformer designs are favorable for optics
  • Sparse-expert architectures shard computations to different parts of the system
  • Research investigating how to run large models at lower precision
  • Speed/energy tradeoffs and communication cost concerns in ONNs
  • Design and scale models for optimal efficiency on optical hardware
  • Maximize data reuse for optimal optical matrix-vector multipliers

Summary and conclusion

  • Transformer models can run accurately and efficiently on optical hardware
  • Optical systems have a large and asymptotic energy advantage over digital ones that grows with the model size
  • Optical hardware may achieve an over 100× energy advantage when running the largest Transformer models today
  • Larger, future Transformers may be realized with an >8000× optical energy advantage
  • Neural-network models designed to be more efficient on digital-electronic hardware will be more efficient on optical hardware
  • Optical hardware needs to meet certain specifications to realize energy-efficiency advantages
  • Future architectural changes and improvements to electronics would further improve ONN efficiency
  • Digital-electronic neural-network accelerators are improving at a rate of ∼10× every 7 years
  • Code, data, and instructions to reproduce results are available
  • Pretrained models used the Wikitext-103 dataset and Xavier uniform initialization
  • AdamW optimizer, with weight decay applied to parameters, and dropout applied after every linear layer
  • 8-bit QAT scheme used with RMSProp optimizer
  • Evaluated perplexity over the entire validation set
  • SLM-HWP-PBS setup used for optical dot products
  • Data-pixel encoding scheme used to reduce systematic errors
  • Lookup tables used for display and SLM
  • Calibration curve maps output intensity measurements to neuron values