Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

Renewed and growing interest in alternatives to digital computers to reduce energy cost of running neural networks
Optical matrix-vector multipliers best suited to performing computations with large operands
Small-scale optical experiments with a prototype accelerator demonstrate that Transformer operations can run on optical hardware
Simulations and experiments explored energy efficiency of optical implementations of Transformers
Optical energy per multiply-accumulate (MAC) scales as $\frac{1}{d}$ where $d$ is the Transformer width
With well-engineered, large-scale optical hardware, it may be possible to achieve a $100\times$ energy-efficiency advantage for running some of the largest current Transformer models
Under assumptions about future improvements to electronics and Transformer quantization techniques, optical computers’ advantage could grow to $>100{,}000\times$

Paper Content

Introduction

Deep learning models are becoming increasingly large, leading to concerns about energy usage, speed, and practicality.
Transformer architectures have been found to improve as their parameter counts grow.
This has caused an exponential growth of deep learning compute usage.
Transformer models are being used in natural language processing, computer vision, graphs, and multi-modal settings.
Digital hardware’s energy efficiency has not kept up with the growing FLOP requirements of deep learning models.
Transformer models have transfer learning capabilities.
There is growing interest in analog computing for deep learning, such as optical neural networks.
Optical neural networks may offer superior efficiency and latency over digital computers.
We demonstrate that Transformers can run on optical systems.

Optical accelerators

There are many designs for optical accelerators for deep learning
Optical systems manipulate different types of optical modes to perform matrix-vector multiplications, vector-vector dot products, or convolutions
Optical systems are subject to noise from the hardware and photon detection
The number of photons used affects the signal-to-noise ratio
Photon usage becomes more efficient as the number of multiply-accumulate operations per detected output grows
The energy cost of optical neural networks is broken down into optical and electrical costs
Optical systems can be more energy-efficient than digital electronic ones, particularly for computations with large operands
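
To make the link between photon number and signal-to-noise ratio concrete, the following minimal Python sketch draws Poisson-distributed photon counts and compares the empirical SNR with the $\sqrt{N}$ value expected for shot-noise-limited detection. The function names, the trial count, and the choice of Knuth's sampling method are illustrative assumptions and are not taken from the paper.

```python
import math
import random


def sample_poisson(mean, rng):
    """Draw one Poisson sample with Knuth's multiplication method.

    Suitable for modest means only, since it relies on exp(-mean).
    """
    threshold = math.exp(-mean)
    count = 0
    product = rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count


def empirical_snr(mean_photons, trials=2000, seed=0):
    """Estimate mean / standard deviation of detected photon counts."""
    rng = random.Random(seed)
    samples = [sample_poisson(mean_photons, rng) for _ in range(trials)]
    mu = sum(samples) / trials
    var = sum((s - mu) ** 2 for s in samples) / (trials - 1)
    return mu / math.sqrt(var)


if __name__ == "__main__":
    for n in (25, 100, 400):
        print(n, round(empirical_snr(n), 2), round(math.sqrt(n), 2))
```

Quadrupling the photon number should roughly double the SNR, which is why the photon budget sets a floor on the optical energy for a given output precision.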

Previous optical neural network architectures

Previous work has studied deep learning models on benchmark tasks
Previous work has studied simulations of larger convolutional models on more difficult datasets

Optical transformers

Designed models to simulate behavior and energy consumption of other Transformers on optical hardware
Approach and model summarized in Figure 2

Architecture and task

Created optical Transformer models with GPT2-like architecture
Replaced GELU activation with ReLU6
Used raw Wikitext-103 dataset for language modelling
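
As a reminder of what the activation swap changes, here is a small Python sketch of the two functions. ReLU6 clips its output to the range [0, 6], whereas GELU is unbounded above. A bounded, non-negative output range is plausibly easier to map onto fixed-precision hardware, but that motivation is an assumption here and is not stated in this summary.

```python
import math


def relu6(x: float) -> float:
    """ReLU6: zero for negative inputs, identity up to 6, then clipped at 6."""
    return min(max(x, 0.0), 6.0)


def gelu(x: float) -> float:
    """Exact GELU: x times the standard normal CDF evaluated at x."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


if __name__ == "__main__":
    for x in (-2.0, 0.0, 3.0, 10.0):
        print(x, relu6(x), gelu(x))
```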

Transformer computations on optical hardware

SLM-based system computes vector-vector dot products
Sampled dot products from some layers to run on real hardware
Model of energy consumption for optical accelerators based on assumptions and results from experiment/simulations
Ran experiments using real Transformer’s weights to characterize system behavior
Sampled vector-vector dot products from different parts of the model to run on data with typical model statistics

Simulation of transformers on optical hardware

Constructed simulations of Transformers running on an optical accelerator
Hybrid scheme: activation and normalization functions computed digitally, linear operations accelerated optically
Non-negative weights and inputs: used to work around display and SLM limitations
Device quantization and quantization-aware training: used to emulate hardware
Systematic errors: simulated by adding Gaussian noise
Optical encoding and shot noise: modelled to simulate the optical computations
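
Two of the simulation ingredients listed above, non-negative operands and device quantization, can be sketched in a few lines of Python. The decomposition below (splitting each signed vector into positive and negative parts and combining four non-negative dot products) is one common way to handle intensity-only encoding. It is shown as an assumption and may differ from the scheme the authors used. The helper names and the uniform quantizer are likewise hypothetical.

```python
def quantize(value, max_value, bits):
    """Uniformly quantize a value clipped to [0, max_value] onto 2**bits levels."""
    steps = 2 ** bits - 1
    clipped = min(max(value, 0.0), max_value)
    return round(clipped / max_value * steps) / steps * max_value


def split_signed(vector):
    """Split a signed vector into non-negative positive and negative parts."""
    positive = [max(v, 0.0) for v in vector]
    negative = [max(-v, 0.0) for v in vector]
    return positive, negative


def dot(a, b):
    """Plain dot product of two equal-length sequences."""
    return sum(p * q for p, q in zip(a, b))


def signed_dot_from_nonnegative(w, x):
    """Recover a signed dot product from four non-negative dot products."""
    w_pos, w_neg = split_signed(w)
    x_pos, x_neg = split_signed(x)
    return (dot(w_pos, x_pos) - dot(w_pos, x_neg)
            - dot(w_neg, x_pos) + dot(w_neg, x_neg))


if __name__ == "__main__":
    w = [1.0, -2.0, 3.0]
    x = [-1.0, 0.5, 2.0]
    print(signed_dot_from_nonnegative(w, x), dot(w, x))
    print(quantize(0.37, 1.0, bits=8))
```

The cost of this particular decomposition is up to four optical passes per signed dot product, which is one reason an accelerator design might prefer other encodings.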

Results

Error tolerance and simulation accuracy (experiments using optical hardware and comparison to simulation)

Transformer operations can run on real hardware without degraded performance.
Experiments used high photon numbers and averaged over multiple runs to suppress noise.
Simulated noise distributions match experimental data.
Model performs well with significant noise, staying within 1 perplexity point of noise-free performance.

Optical scaling laws (simulation)

Optical shot noise limits a model’s energy efficiency.
Simulations removed systematic errors and used same embedding dimension and photons for inference.
Optical hardware can achieve same perplexity as digital-electronic hardware with enough photons.
Perplexity degrades from ideal as fewer photons are used.
Optical Transformers can achieve same perplexity as digital-electronic processors with modest photon budgets.
Optical Transformers use fewer photons/MAC as they scale.
Larger Transformers may require less output precision.
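
The falling photons/MAC trend can be read as a simple worked relation. Assume, as a simplification, that each dot product of length $d$ needs a roughly fixed number of detected photons $N_{\text{out}}$ to reach the required output precision (the symbol $N_{\text{out}}$ and the numbers below are illustrative, not values from the paper). Each output sums $d$ MACs, so

$$
\text{photons per MAC} \approx \frac{N_{\text{out}}}{d}, \qquad E_{\text{optical per MAC}} \approx \frac{N_{\text{out}}\, h\nu}{d},
$$

where $h\nu$ is the energy of one photon. With a hypothetical $N_{\text{out}} = 10^4$, a model of width $d = 1024$ would use about $9.8$ photons per MAC, while $d = 8192$ would use about $1.2$. This is the $\frac{1}{d}$ scaling quoted in the abstract.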

Estimated energy usage (simulation)

Efficient photon scaling trend observed in Section 4.2 suggests Transformers running on optical hardware could achieve energy efficiency advantages over digital hardware
Designed an ONN system based on current hardware with measured precision and photon scaling
Inference system with in-place weights, 10 GHz light modulator array, and optical “core” which can perform 10M multiplications per cycle
Photon-per-MAC scaling versus model dimension taken from Figure 5
Estimated energy usage of Transformer models on optical hardware for single inference
Hypothetical future model designs labelled FUTURE-*
Energy advantage of running on optics over estimated NVIDIA A100 usage grows with model compute
5-bit input precision, 8-bit weight precision, and 7-bit output precision
Calculated cost of loading and detecting every value in operation of model
For attention, weights-in-place not possible, must load both operands dynamically
Electrical costs composed of amplification, modulation, memory access, and digital/analog conversion
Optical energy simple: total number of MACs for model multiplied by photons/MAC and energy of a photon
As models grow, running Transformers on optical hardware has large and asymptotic efficiency advantage over digital hardware
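
The optical part of the energy estimate described above (total MACs multiplied by photons/MAC and by the energy of a photon) can be written out directly. In this Python sketch the 1550 nm default wavelength and the example MAC count are assumptions chosen for illustration. Electrical costs such as modulation, memory access, and data conversion are deliberately left out.

```python
PLANCK_J_S = 6.62607015e-34      # Planck constant in joule-seconds
LIGHT_SPEED_M_S = 299_792_458.0  # speed of light in metres per second


def photon_energy_joules(wavelength_m):
    """Energy of a single photon, E = h * c / wavelength."""
    return PLANCK_J_S * LIGHT_SPEED_M_S / wavelength_m


def optical_energy_joules(total_macs, photons_per_mac, wavelength_m=1550e-9):
    """Optical energy only: MACs x photons per MAC x energy per photon.

    The 1550 nm default is an assumed wavelength, not a value from the paper.
    """
    return total_macs * photons_per_mac * photon_energy_joules(wavelength_m)


if __name__ == "__main__":
    # Hypothetical workload: 1e12 MACs at 1 photon per MAC.
    print(optical_energy_joules(total_macs=1e12, photons_per_mac=1.0))
```

Because the photon energy is tiny, the optical term tends to be small next to the electrical costs unless photons/MAC is large, which is why the electrical breakdown matters for the overall estimate.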

Discussion

Optical Transformers can be used to design future hardware and software systems.
Results from Section 4.3 show the efficiency of Optical Transformers.

Implications for designing optical-neural-network (ONN) accelerators

ONN systems require computation units (cores) for performing MACs, detectors, modulators, and memory.
Biggest challenge in creating ONN hardware is having enough (or large enough) cores.
Strategy to deal with hardware requirement is to split up large weight matrix into several “chunks” and cycle through them.
ONN hardware should implement at least $10^4 \times 10^4$ parallelism to achieve $>100\times$ efficiency improvements over current state-of-the-art GPUs.
ONN hardware should operate in a regime where a single matrix-vector multiplication is performed every 0.1 nanoseconds.
Weights-in-place is a good choice for large-scale neural network designs.
Improvements in CMOS technology (DAC, ADC, memory access costs) will be beneficial for ONN efficiency.
Future ONNs might achieve energy advantages of >100,000x running models the size of FUTURE-4q.
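
The chunking strategy mentioned above, splitting a large weight matrix into core-sized pieces and cycling through them, can be sketched as a tiled matrix-vector product. This Python sketch uses plain lists. The function name, the tile ordering, and the cycle counter are hypothetical, and a real system would also have to account for reloading weights between cycles.

```python
def chunked_matvec(weights, vector, core_rows, core_cols):
    """Compute weights @ vector one core-sized tile at a time.

    Returns the output vector and the number of core cycles used.
    """
    n_rows, n_cols = len(weights), len(vector)
    output = [0.0] * n_rows
    cycles = 0
    for row_start in range(0, n_rows, core_rows):
        row_end = min(row_start + core_rows, n_rows)
        for col_start in range(0, n_cols, core_cols):
            col_end = min(col_start + core_cols, n_cols)
            # One "cycle": the core handles a single tile of the matrix.
            for r in range(row_start, row_end):
                output[r] += sum(
                    weights[r][c] * vector[c] for c in range(col_start, col_end)
                )
            cycles += 1
    return output, cycles


if __name__ == "__main__":
    w = [[1, 2, 3, 4, 5], [0, 1, 0, 1, 0], [2, 0, 2, 0, 2]]
    x = [1, 1, 2, 2, 3]
    print(chunked_matvec(w, x, core_rows=2, core_cols=2))
```

The number of cycles grows as the matrix outgrows the core, so a smaller core trades hardware size for more time and more partial sums to accumulate electronically.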

Scalable optical deep learning

Transformer designs are favorable for optics
Sparse-expert architectures shard computations to different parts of the system
Research investigating how to run large models at lower precision
Speed/energy tradeoffs and communication cost concerns in ONNs
Design and scale models for optimal efficiency on optical hardware
Maximize data reuse for optimal optical matrix-vector multipliers

Summary and conclusion

Transformer models can run accurately and efficiently on optical hardware
Optical systems have a large and asymptotic energy advantage over digital ones that grows with the model size
Optical hardware may achieve an over 100× energy advantage when running the largest Transformer models today
Larger, future Transformers may be realized with an >8000× optical energy advantage
Neural-network models designed to be more efficient on digital-electronic hardware will be more efficient on optical hardware
Optical hardware needs to meet certain specifications to realize energy-efficiency advantages
Future architectural changes and improvements to electronics would further improve ONN efficiency
Digital-electronic neural-network accelerators are improving at a rate of ∼10× every 7 years
Code, data, and instructions to reproduce results are available
Pretrained models used the Wikitext-103 dataset and Xavier uniform initialization
AdamW optimizer, with weight decay applied to parameters, and dropout applied after every linear layer
8-bit QAT scheme used with RMSProp optimizer
Evaluated perplexity over the entire validation set
SLM-HWP-PBS setup used for optical dot products
Data-pixel encoding scheme used to reduce systematic errors
Lookup tables used for display and SLM
Calibration curve maps output intensity measurements to neuron values
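
For reference, the perplexity metric used in the evaluation above is the exponential of the mean negative log-likelihood per token. The sketch below assumes natural-log token probabilities are already available. The function name and input format are illustrative and do not reflect the authors' evaluation code.

```python
import math


def perplexity(token_log_probs):
    """Perplexity from per-token natural-log probabilities."""
    mean_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(mean_nll)


if __name__ == "__main__":
    # A model that assigns probability 0.25 to every token.
    print(perplexity([math.log(0.25)] * 8))
```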
