Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

Increasing interest in TinyML is pushing toward enabling TinyML-class training.
Current training algorithms rely on floating-point matrix operations.
This paper presents RedMulE, a low-power specialized accelerator for multi-precision floating-point operations.
A RedMulE-augmented PULP cluster achieves high energy efficiency, in the GFLOPS/W to TFLOPS/W range.
RedMulE consumes less than 60 mW on average, enabling on-device training of deep learning models.

Paper Content

I. introduction

The number of connected IoT devices executing ML/DL algorithms has increased
Computation has moved from data centers to energy-efficient IoT end-nodes
TinyML has emerged as a field of research and application
Extreme-edge applications rely on GEMM-Ops
TensorCores have been proposed to augment acceleration capabilities
Low-precision formats (FP16, FP8) have been adapted to low-power embedded systems
RedMulE is a TinyML-class open-source hardware accelerator
RedMulE enables on-chip learning and adaptation capabilities
RedMulE achieves high speedup and energy efficiency

Linear algebra-based algorithms are used for inference and training of neural networks.
NVIDIA’s Hopper H100 GPU is an example of a data-center computing platform for DL tasks.
Extreme-edge inference is achievable with low-precision integer arithmetic, but training faces large memory requirements and the need for FP calculations.

II. related work

A. inference accelerators

Hardware accelerators can provide alternatives to software-based executions for low-power DL inference.
Diana is a low-power NN SoC with a digital NN inference accelerator and an analog in-memory computing core.
DNPU is a fully-digital energy-efficient DL processor for convolutional and recursive NN inference acceleration.
Gemmini is a 16x16 systolic accelerator designed for inference of deep NNs with 8-bit multiply-accumulate units.

B. on-device learning

Training DL models on-device is challenging on ultra-low-power microcontrollers
Direct feedback alignment and equilibrium propagation are less effective than backpropagation
TinyOL and PULP Trainlib are approaches to enable on-device learning on microcontrollers
The low speed and small number of available floating-point units limit the performance of these libraries

C. training accelerators

Researchers have used hardware acceleration to improve training performance on low-power processors.
Most training-oriented chips use floating-point operations, but they consume too much power for TinyML devices.
Recently, some training-oriented SoCs have been proposed that fit the power budget of extreme-edge applications.
Many training-oriented chips use pruning to increase sparsity during training, but this lacks generality.

D. GEMM-Ops chips

Targets most common DL operations
Includes GEMM-Ops
SIMD² provides up to 15.8x speedup
RedMulE combines features for efficient training and inference of DL models

III. background

A. generalized matrix-matrix operations

GEMM-Ops are operations of the kind $Z = f_2(Y, f_1(X, W))$
Group 1 includes GEMM-Ops with +/× operators
Group 2 includes GEMM-Ops with min/max operators
X, W, Z and Y are matrices of different sizes
GEMM-Ops are well-suited for ML applications
Equation 1 is symmetric, so X and W can be flexibly exchanged
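
As a minimal sketch of this definition, the snippet below assumes that the inner operator f1 is a pair of operators (an element-wise one applied to entries of X and W, followed by a reduction over the shared dimension) and that the outer operator f2 combines the result with Y element-wise. The name `gemm_op`, its parameters, and the assumed shapes (X is M x N, W is N x K, Y and Z are M x K) are hypothetical and chosen for illustration only.

```python
from functools import reduce
import operator


def gemm_op(X, W, Y, f1_elem, f1_reduce, f2):
    """Compute Z[i][j] = f2(Y[i][j], reduce_n f1_elem(X[i][n], W[n][j])).

    Illustrative helper, not the paper's implementation.
    """
    M, N, K = len(X), len(W), len(W[0])
    Z = []
    for i in range(M):
        row = []
        for j in range(K):
            # Inner operator: combine X and W entries, then reduce over n.
            partial = reduce(
                f1_reduce, (f1_elem(X[i][n], W[n][j]) for n in range(N))
            )
            # Outer operator: merge the partial result with Y.
            row.append(f2(Y[i][j], partial))
        Z.append(row)
    return Z


X = [[1.0, 2.0], [3.0, 4.0]]
W = [[5.0, 6.0], [7.0, 8.0]]
Y = [[1.0, 1.0], [1.0, 1.0]]

# Classic GEMM with the +/x operators: Z = X * W + Y.
Z_gemm = gemm_op(X, W, Y, operator.mul, operator.add, operator.add)

# A min/+ variant (as used in shortest-path style problems),
# with Y set to +inf so that it does not affect the result.
Y_inf = [[float("inf")] * 2 for _ in range(2)]
Z_minplus = gemm_op(X, W, Y_inf, operator.add, min, min)

print(Z_gemm)     # [[20.0, 23.0], [44.0, 51.0]]
print(Z_minplus)  # [[6.0, 7.0], [8.0, 9.0]]
```

Swapping the three callables is enough to move between operator groups, which is the flexibility the GEMM-Op formulation is meant to capture.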

B. asymptotic optimality of linear algebra acceleration strategies

Memory load/store operations reduce the gap between theoretical and practical performance and efficiency.
Maximizing the number of operations per memory access is important for efficient design.
Scalar dot products and vector units do not guarantee the best trade-off between the number of operations and memory load/store access.
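
The trade-off can be illustrated with a simple counting model. The sketch below assumes ideal data reuse (every input element is loaded exactly once and every output element is stored exactly once) and counts each multiply-add as two operations; the function name `ops_per_access` is hypothetical.

```python
def ops_per_access(kind, n):
    """Operations per memory access under an ideal-reuse counting model."""
    if kind == "dot":        # two length-n vectors -> one scalar
        ops, accesses = 2 * n, 2 * n + 1
    elif kind == "matvec":   # n x n matrix times a length-n vector
        ops, accesses = 2 * n * n, n * n + 2 * n
    elif kind == "matmul":   # product of two n x n matrices
        ops, accesses = 2 * n ** 3, 3 * n * n
    else:
        raise ValueError(f"unknown kernel: {kind}")
    return ops / accesses


for kind in ("dot", "matvec", "matmul"):
    print(kind, [round(ops_per_access(kind, n), 2) for n in (4, 12, 64)])
```

Under this model the ratio stays below 1 for dot products and below 2 for matrix-vector products regardless of size, while for matrix-matrix products it grows linearly with n, which is why a matrix-oriented datapath can amortize memory accesses better.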

IV. architecture

This section describes the PULP cluster
It then describes the hardware template
Finally, it describes the RedMulE micro-architecture

A. PULP cluster and RedMulE

PULP cluster contains 8 RISC-V cores with 128 kB of TCDM
Event unit and DMAC for internal synchronization and external memory access
Peripheral interconnect and AXI4 full cross-bar interconnect
Hardware Processing Engines (HWPEs) for application-specific hardware accelerators
HCI with logarithmic and shallow branches
DataMover for 3-dimensional tensor transposition
Streamer for memory access
Computing Element Microarchitecture with FMA, FNCOMP, multiplexer, clock gating module
Casting Module for hybrid FP8 precision formats
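
To make the idea of casting between FP16 and an 8-bit format concrete, here is a small sketch. It assumes an FP8 layout with 1 sign, 5 exponent, and 2 mantissa bits (often called E5M2), which shares its exponent width with FP16; the summary above does not state which layouts the Casting Module supports, and the truncating conversion below (round toward zero) is a simplification, not the hardware's rounding behaviour. The function names are hypothetical.

```python
import struct


def fp16_to_e5m2_bits(value):
    """Return the 8-bit E5M2 pattern obtained by truncating an FP16 value.

    FP16 is 1 sign + 5 exponent + 10 mantissa bits, so keeping the upper
    byte leaves 1 sign + 5 exponent + 2 mantissa bits.
    Values outside the FP16 range raise OverflowError in struct.pack.
    """
    low, high = struct.pack("<e", value)  # little-endian: high byte is last
    return high


def e5m2_bits_to_float(bits):
    """Expand an E5M2 pattern back to a Python float via FP16."""
    return struct.unpack("<e", bytes([0, bits]))[0]


for x in (1.0, 1.9, -2.0, 0.1):
    bits = fp16_to_e5m2_bits(x)
    print(f"{x:>5} -> 0x{bits:02X} -> {e5m2_bits_to_float(bits)}")
```

With only two mantissa bits, 1.9 collapses to 1.75, which shows how coarse an 8-bit representation is and why it is typically used for compressed inputs rather than for accumulation.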

C. RedMulE computational model

RedMulE's execution of a GEMM-Op can be visualized on the computed matrices
RedMulE features L = 12, H = 4, and P = 3
RedMulE starts by pre-loading the Z-Buffer with L rows from the Y-matrix
W-elements are broadcast to all the L CEs in the first Datapath column
After P + 1 cycles, each of the L CEs in the first column forwards its computed partial result to the neighbour CE in the second column
RedMulE activates its feedback to provide the intermediate results to the accumulation input of the first CEs of the given row
The W-buffer accesses the memory once every P + 1 cycles to load a new set of H × (P + 1) W-elements
Store operations are interleaved between two adjacent W load accesses until the Z-Buffer is empty
RedMulE optimizes the bandwidth utilization using a single wide memory port and achieves up to 99.4% CEs utilization
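
The parameters above allow a quick consistency check. The sketch below assumes the Datapath contains L x H CEs and that each CE completes one FMA (counted as 2 operations) per cycle; both are assumptions made for illustration. It uses the 95.4 OP/cycle figure from the performance evaluation and the 16-bit width of FP16 elements.

```python
L, H, P = 12, 4, 3           # RedMulE 12x4 parameters from the text
OPS_PER_FMA = 2              # assumption: one multiply plus one add
FP16_BITS = 16

num_ces = L * H                               # assumed CE count
peak_ops_per_cycle = num_ces * OPS_PER_FMA    # ideal throughput
measured_ops_per_cycle = 95.4                 # reported GEMM throughput
utilization = measured_ops_per_cycle / peak_ops_per_cycle

# W-buffer traffic: one load of H x (P + 1) elements every P + 1 cycles.
w_elements_per_load = H * (P + 1)
w_bits_per_load = w_elements_per_load * FP16_BITS

print(f"CEs: {num_ces}, peak: {peak_ops_per_cycle} OP/cycle")
print(f"utilization: {utilization:.1%}")
print(f"W load: {w_elements_per_load} elements = {w_bits_per_load} bits")
```

Under these assumptions the reported throughput corresponds to about 99.4% utilization, and one W load of 16 FP16 elements amounts to 256 bits, which fits within the 288-bit memory interface described in the experimental setup.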

V. implementation and measurements

A. experimental setup

Experiments conducted on RedMulE 12x4 and 12x8 instances
Memory interface is 288-bit wide
Experiments target GlobalFoundries 22 nm technology
Synopsys Design Compiler used for synthesis
Cadence Innovus used for full-cluster Place&Route
Synopsys PrimeTime used for timing analysis and power extraction
Two operating points targeted: 470 MHz at 0.65 V and 613 MHz at 0.8 V

B. performance evaluation

1) GEMM performance evaluation

Used square and rectangular matrices to evaluate RedMulE’s computation latency
RedMulE achieved 95.4 OP/cycle and 99.4% CEs utilization
RedMulE reached 15x average speedup over software on large matrices
RedMulE achieved 3.5x speedup over software on small 8x8x8 case
Evaluated RedMulE performance on real-case NN training
RedMulE accelerated matrix multiplication execution 14.6x with respect to parallel RISC-V execution
Data reorganization during Im2Col accounted for 3 million computing cycles
DataMover engine halved number of computing cycles required to perform Im2Col, speeding up overall training step execution up to 4.9x
Evaluated GEMM-Ops performance, RedMulE achieved up to 62x speedup
RedMulE 12x4 occupies 0.15 mm², 23.8% of entire PULP cluster area
RedMulE’s area occupation becomes comparable to entire PULP cluster when it contains 256 CEs
Changing the shape of the Datapath affects the size of the Streamer
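
The Im2Col data reorganization mentioned in the list above turns a convolution into a matrix multiplication, which is what lets a GEMM engine accelerate convolutional layers. The following is a minimal sketch for a single-channel 2D input with stride 1 and no padding; the function name and the example data are hypothetical and unrelated to the network used in the paper.

```python
def im2col(image, kh, kw):
    """Unroll every kh x kw patch of a 2D image into one row."""
    height, width = len(image), len(image[0])
    rows = []
    for i in range(height - kh + 1):
        for j in range(width - kw + 1):
            rows.append(
                [image[i + di][j + dj] for di in range(kh) for dj in range(kw)]
            )
    return rows


image = [[1, 2, 3],
         [4, 5, 6],
         [7, 8, 9]]
kernel = [[1, 0],
          [0, 1]]

patches = im2col(image, 2, 2)
flat_kernel = [k for kernel_row in kernel for k in kernel_row]

# The convolution is now a matrix-vector product over the patch matrix.
output = [sum(p * k for p, k in zip(patch, flat_kernel)) for patch in patches]

print(patches)  # [[1, 2, 4, 5], [2, 3, 5, 6], [4, 5, 7, 8], [5, 6, 8, 9]]
print(output)   # [6, 8, 12, 14]
```

The patch matrix duplicates overlapping input elements, so building it is pure data movement; this is consistent with Im2Col showing up as a significant share of the cycle count once the matrix multiplication itself is accelerated.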

D. RedMulE power

Power consumption at the efficiency operating point is 59.3 mW
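
A back-of-the-envelope estimate ties this power figure to the throughput numbers reported earlier. The sketch assumes that 59.3 mW is measured at the 470 MHz, 0.65 V operating point and that the workload sustains the 95.4 OP/cycle reported for GEMM; the result is a derived estimate, not a value quoted in this summary.

```python
ops_per_cycle = 95.4     # reported GEMM throughput
freq_hz = 470e6          # assumed efficiency operating point (0.65 V)
power_w = 59.3e-3        # reported power consumption

throughput_gflops = ops_per_cycle * freq_hz / 1e9
efficiency_gflops_per_w = throughput_gflops / power_w

print(f"throughput: {throughput_gflops:.1f} GFLOPS")
print(f"efficiency: {efficiency_gflops_per_w:.0f} GFLOPS/W")
```

Under these assumptions the cluster delivers roughly 45 GFLOPS at about 756 GFLOPS/W, which is in line with the "high GFLOPS/W" claim in the abstract.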

VI. comparison with the state-of-the-art

RedMulE targets training and inference
DNPU has 1.9x higher performance than RedMulE but 16x more CEs
DNPU features 2.7x higher efficiency than RedMulE but only works with fixed-point precision
Diana has 44.5% lower performance than RedMulE 12x8 and 12% lower performance than RedMulE 12x4
Diana has lower power consumption but RedMulE 12x4 consumes 7.65 mW at 50 MHz
Gemmini has one order of magnitude less energy efficiency than RedMulE 12x4 despite 5x more CEs
IBM chip is 2.4x more energy-efficient, 33.2x larger, and 74x more power-consuming than RedMulE 12x4
LNPU has 6.67x higher power envelope than RedMulE 12x4
Vega has 7.8x higher performance and 3.2x higher energy efficiency than RedMulE 12x4
Cambricon-Q is 2.9x more energy-efficient but uses 8-bit fixed-point arithmetic
Cambricon-Q is 17.7x more power-hungry than RedMulE 12x4
T-PIM works with 16-bit integer precision but does not satisfy precision requirements
TSUNAMI and Trainer use pruning and sparse matrices to increase energy efficiency
RedMulE 12x4 has 1.72x better performance and 20.5% higher energy efficiency than Anders et al.
SIMD² has 36.1x higher power consumption than RedMulE

VII. conclusion

RedMulE is a cluster-coupled accelerator for TinyML training
RedMulE supports FP16 GEMM-Ops computation and compressed FP8 inputs
RedMulE achieves 99.4% CEs utilization and 15x speedup for GEMM execution
RedMulE accelerates matrix multiplication by up to 14.6x and 28.5x with 16-bit and 8-bit inputs respectively
