Link to paper
The full paper is available here.
You can also find the paper on PapersWithCode here.
Abstract
- Increasing interest in TinyML is pushing toward enabling TinyML-class training.
- Current training algorithms rely on floating-point matrix operations.
- This paper presents RedMulE, a low-power specialized accelerator for multi-precision floating-point operations.
- RedMulE-augmented PULP cluster achieves high GFLOPS/W and TFLOPS/W.
- RedMulE consumes less than 60 mW on average, enabling on-device training of deep learning models.
Paper Content
I. introduction
- Number of IoT devices connected and executing ML/DL algorithms has increased
- Computation has been moved from data centers to energy-efficient IoT end-nodes
- Tiny-ML field of research and application has emerged
- Extreme-edge applications rely on GEMM-Ops
- TensorCores have been proposed to augment acceleration capabilities
- Low-precision formats (FP16, FP8) have been adapted to low-power embedded systems
- RedMulE is a TinyML-class open-source hardware accelerator
- RedMulE enables on-chip learning and adaptation capabilities
- RedMulE achieves high speedup and energy efficiency
Ii. related work
- Linear algebra-based algorithms are used for inference and training of neural networks.
- NVIDIA’s Hopper H100 GPU is an example of a data-center computing platform for DL tasks.
- Extreme-edge inference is achievable with low-precision integer arithmetic, but training faces large memory requirements and the need for FP calculations.
A. inference accelerators
- Hardware accelerators can provide alternatives to software-based executions for low-power DL inference.
- Diana is a low-power NN SoC with a digital NN inference accelerator and an analog in-memory computing core.
- DNPU is a fully-digital energy-efficient DL processor for convolutional and recursive NN inference acceleration.
- Gemmini is a 16x16 systolic accelerator designed for inference of deep NNs with 8-bit multiply-accumulate units.
B. on-device learning
- On-device learning is a challenge for training DL models on ultra-low-power microcontrollers
- Direct feedback alignment and equilibrium propagation are less effective than backpropagation
- TinyOL and PULP Trainlib are approaches to enable on-device learning on microcontrollers
- Low speed and number of available floating point units limit the performance of these libraries
C. training accelerators
- Researchers have used hardware acceleration to improve training performance on low-power processors.
- Most training-oriented chips use floating-point operations, but they consume too much power for TinyML devices.
- Recently, some training-oriented SoCs have been proposed that fit the power budget of extreme-edge applications.
- Many training-oriented chips use pruning to increase sparsity during training, but this lacks generality.
D. gemm-ops chips
- Targets most common DL operations
- Includes GEMM-Ops
- SIMD 2 provides up to 15.8x speedup
- Red-MulE combines features for efficient training and inference of DL models
Iii. background
A. generalized matrix-matrix operations
- GEMM-Ops are operations of the kind f 2(Y, f 1(X, W))
- Group 1 includes GEMM-Ops with +/× operators
- Group 2 includes GEMM-Ops with min/max operators
- X, W, Z and Y are matrices of different sizes
- GEMM-Ops are well-suited for ML applications
- Equation 1 is symmetric, so X and W can be flexibly exchanged
B. asymptotic optimality of linear algebra acceleration strategies
- Memory load/store operations reduce the gap between theoretical and practical performance and efficiency.
- Maximizing the number of operations per memory access is important for efficient design.
- Scalar dot products and vector units do not guarantee the best trade-off between the number of operations and memory load/store access.
Iv. architecture
- Describe PULP cluster
- Describe hardware template
- Describe RedMulE micro-architecture
A. pulp cluster and redmule
- PULP cluster contains 8 RISC-V cores with 128 kB of TCDM
- Event unit and DMAC for internal synchronization and external memory access
- Peripheral interconnect and AXI4 full cross-bar interconnect
- Hardware Processing Engines (HWPEs) for application-specific hardware accelerators
- HCI with logarithmic and shallow branches
- DataMover for 3-dimensional tensor transposition
- Streamer for memory access
- Computing Element Microarchitecture with FMA, FNCOMP, multiplexer, clock gating module
- Casting Module for hybrid FP8 precision formats
C. redmule computational model
- RedMulE performs GEMM-Op by visualizing it on the computed matrices
- RedMulE features L = 12, H = 4, and P = 3
- RedMulE starts by pre-loading the Z-Buffer with L rows from the Ymatrix
- W-elements are broadcasted to all the L CEs in the first Datpath column
- After P + 1 cycles, each of the L CEs in the first column forwards its computed partial result to the neighbour CE in the second column
- RedMulE activates its feedback to provide the intermediate results to the accumulation input of the first CEs of the given row
- W-buffer accesses the memory once every (P + 1)-cycles to load a new set of H × (P + 1) W-elements
- Store operations are interleaved between two adjacent W load accesses until the Z-Buffer is empty
- RedMulE optimizes the bandwidth utilization using a single wide memory port and achieves up to 99.4% CEs utilization
V. implementation and measurements
A. experimental setup
- Experiments conducted on RedMulE 12x4 and 12x8 instances
- Memory interface is 288-bit wide
- Experiments target GlobalFoundries 22 nm technology
- Synopsys Design Compiler used for synthesis
- Cadence Innovus used for full-cluster Place&Route
- Prime Time used for timing analysis and power extraction
- Two operating points targeted: 470 MHz at 0.65 V and 613 MHz at 0.8 V
B. performance evaluation 1) gemm performance evaluation:
- Used square and rectangular matrices to evaluate RedMulE’s computation latency
- RedMulE achieved 95.4 OP/cycle and 99.4% CEs utilization
- RedMulE reached 15x average speedup over software on large matrices
- RedMulE achieved 3.5x speedup over software on small 8x8x8 case
- Evaluated RedMulE performance on real-case NN training
- RedMulE accelerated matrix multiplication execution 14.6x with respect to parallel RISC-V execution
- Data reorganization during Im2Col accounted for 3 million computing cycles
- DataMover engine halved number of computing cycles required to perform Im2Col, speeding up overall training step execution up to 4.9x
- Evaluated GEMM-Ops performance, RedMulE achieved up to 62x speedup
- RedMulE 12x4 occupies 0.15 mm2, 23.8% of entire PULP cluster area
- RedMulE’s area occupation becomes comparable to entire PULP cluster when it contains 256 CEs
- Changing the shape of the Datapath affects the size of the Streamer
D. redmule power
- Power consumption in efficiency point is 59.3 mW
Vi. comparison with the state-of-the-art
- RedMulE targets training and inference
- DNPU has 1.9x higher performance than RedMulE but 16x more CEs
- DNPU features 2.7x higher efficiency than RedMulE but only works with fixed-point precision
- Diana has 44.5% less performance than RedMulE 12x8 and 12% less performance than RedMulE 12x4
- Diana has lower power consumption but RedMulE 12x4 consumes 7.65 mW at 50 MHz
- Gemmini has one order of magnitude less energy efficiency than RedMulE 12x4 despite 5x more CEs
- IBM chip is 2.4x more energy-efficient, 33.2x larger, and 74x more power-consuming than RedMulE 12x4
- LNPU has 6.67x higher power envelope than RedMulE 12x4
- Vega has 7.8x higher performance and 3.2x higher energy efficiency than RedMulE 12x4
- Cambricon-Q is 2.9x more energy-efficient but uses 8-bit fixed-point arithmetic
- Cambricon-Q is 17.7x more power-hungry than RedMulE 12x4
- T-PIM works with 16-bit integer precision but does not satisfy precision requirements
- TSUNAMI and Trainer use pruning and sparse matrices to increase energy efficiency
- RedMulE 12x4 has 1.72x better performance and 20.5% higher energy efficiency than Anders et al.
- SIMD 2 has 36.1x higher power consumption than RedMulE
Vii. conclusion
- RedMulE is a cluster-coupled accelerator for TinyML training
- RedMulE supports FP16 GEMM-Ops computation and compressed FP8 inputs
- RedMulE achieves 99.4% CEs utilization and 15x speedup for GEMM execution
- RedMulE accelerates matrix multiplication by up to 14.6x and 28.5x with 16-bit and 8-bit inputs respectively