Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

  • Increasing interest in TinyML is pushing toward enabling TinyML-class training.
  • Current training algorithms rely on floating-point matrix operations.
  • This paper presents RedMulE, a low-power specialized accelerator for multi-precision floating-point operations.
  • RedMulE-augmented PULP cluster achieves high GFLOPS/W and TFLOPS/W.
  • RedMulE consumes less than 60 mW on average, enabling on-device training of deep learning models.

Paper Content

I. introduction

  • Number of IoT devices connected and executing ML/DL algorithms has increased
  • Computation has been moved from data centers to energy-efficient IoT end-nodes
  • Tiny-ML field of research and application has emerged
  • Extreme-edge applications rely on GEMM-Ops
  • TensorCores have been proposed to augment acceleration capabilities
  • Low-precision formats (FP16, FP8) have been adapted to low-power embedded systems
  • RedMulE is a TinyML-class open-source hardware accelerator
  • RedMulE enables on-chip learning and adaptation capabilities
  • RedMulE achieves high speedup and energy efficiency
  • Linear algebra-based algorithms are used for inference and training of neural networks.
  • NVIDIA’s Hopper H100 GPU is an example of a data-center computing platform for DL tasks.
  • Extreme-edge inference is achievable with low-precision integer arithmetic, but training faces large memory requirements and the need for FP calculations.

A. inference accelerators

  • Hardware accelerators can provide alternatives to software-based executions for low-power DL inference.
  • Diana is a low-power NN SoC with a digital NN inference accelerator and an analog in-memory computing core.
  • DNPU is a fully-digital energy-efficient DL processor for convolutional and recursive NN inference acceleration.
  • Gemmini is a 16x16 systolic accelerator designed for inference of deep NNs with 8-bit multiply-accumulate units.

B. on-device learning

  • On-device learning is a challenge for training DL models on ultra-low-power microcontrollers
  • Direct feedback alignment and equilibrium propagation are less effective than backpropagation
  • TinyOL and PULP Trainlib are approaches to enable on-device learning on microcontrollers
  • Low speed and number of available floating point units limit the performance of these libraries

C. training accelerators

  • Researchers have used hardware acceleration to improve training performance on low-power processors.
  • Most training-oriented chips use floating-point operations, but they consume too much power for TinyML devices.
  • Recently, some training-oriented SoCs have been proposed that fit the power budget of extreme-edge applications.
  • Many training-oriented chips use pruning to increase sparsity during training, but this lacks generality.

D. gemm-ops chips

  • Targets most common DL operations
  • Includes GEMM-Ops
  • SIMD 2 provides up to 15.8x speedup
  • Red-MulE combines features for efficient training and inference of DL models

Iii. background

A. generalized matrix-matrix operations

  • GEMM-Ops are operations of the kind f 2(Y, f 1(X, W))
  • Group 1 includes GEMM-Ops with +/× operators
  • Group 2 includes GEMM-Ops with min/max operators
  • X, W, Z and Y are matrices of different sizes
  • GEMM-Ops are well-suited for ML applications
  • Equation 1 is symmetric, so X and W can be flexibly exchanged

B. asymptotic optimality of linear algebra acceleration strategies

  • Memory load/store operations reduce the gap between theoretical and practical performance and efficiency.
  • Maximizing the number of operations per memory access is important for efficient design.
  • Scalar dot products and vector units do not guarantee the best trade-off between the number of operations and memory load/store access.

Iv. architecture

  • Describe PULP cluster
  • Describe hardware template
  • Describe RedMulE micro-architecture

A. pulp cluster and redmule

  • PULP cluster contains 8 RISC-V cores with 128 kB of TCDM
  • Event unit and DMAC for internal synchronization and external memory access
  • Peripheral interconnect and AXI4 full cross-bar interconnect
  • Hardware Processing Engines (HWPEs) for application-specific hardware accelerators
  • HCI with logarithmic and shallow branches
  • DataMover for 3-dimensional tensor transposition
  • Streamer for memory access
  • Computing Element Microarchitecture with FMA, FNCOMP, multiplexer, clock gating module
  • Casting Module for hybrid FP8 precision formats

C. redmule computational model

  • RedMulE performs GEMM-Op by visualizing it on the computed matrices
  • RedMulE features L = 12, H = 4, and P = 3
  • RedMulE starts by pre-loading the Z-Buffer with L rows from the Ymatrix
  • W-elements are broadcasted to all the L CEs in the first Datpath column
  • After P + 1 cycles, each of the L CEs in the first column forwards its computed partial result to the neighbour CE in the second column
  • RedMulE activates its feedback to provide the intermediate results to the accumulation input of the first CEs of the given row
  • W-buffer accesses the memory once every (P + 1)-cycles to load a new set of H × (P + 1) W-elements
  • Store operations are interleaved between two adjacent W load accesses until the Z-Buffer is empty
  • RedMulE optimizes the bandwidth utilization using a single wide memory port and achieves up to 99.4% CEs utilization

V. implementation and measurements

A. experimental setup

  • Experiments conducted on RedMulE 12x4 and 12x8 instances
  • Memory interface is 288-bit wide
  • Experiments target GlobalFoundries 22 nm technology
  • Synopsys Design Compiler used for synthesis
  • Cadence Innovus used for full-cluster Place&Route
  • Prime Time used for timing analysis and power extraction
  • Two operating points targeted: 470 MHz at 0.65 V and 613 MHz at 0.8 V

B. performance evaluation 1) gemm performance evaluation:

  • Used square and rectangular matrices to evaluate RedMulE’s computation latency
  • RedMulE achieved 95.4 OP/cycle and 99.4% CEs utilization
  • RedMulE reached 15x average speedup over software on large matrices
  • RedMulE achieved 3.5x speedup over software on small 8x8x8 case
  • Evaluated RedMulE performance on real-case NN training
  • RedMulE accelerated matrix multiplication execution 14.6x with respect to parallel RISC-V execution
  • Data reorganization during Im2Col accounted for 3 million computing cycles
  • DataMover engine halved number of computing cycles required to perform Im2Col, speeding up overall training step execution up to 4.9x
  • Evaluated GEMM-Ops performance, RedMulE achieved up to 62x speedup
  • RedMulE 12x4 occupies 0.15 mm2, 23.8% of entire PULP cluster area
  • RedMulE’s area occupation becomes comparable to entire PULP cluster when it contains 256 CEs
  • Changing the shape of the Datapath affects the size of the Streamer

D. redmule power

  • Power consumption in efficiency point is 59.3 mW

Vi. comparison with the state-of-the-art

  • RedMulE targets training and inference
  • DNPU has 1.9x higher performance than RedMulE but 16x more CEs
  • DNPU features 2.7x higher efficiency than RedMulE but only works with fixed-point precision
  • Diana has 44.5% less performance than RedMulE 12x8 and 12% less performance than RedMulE 12x4
  • Diana has lower power consumption but RedMulE 12x4 consumes 7.65 mW at 50 MHz
  • Gemmini has one order of magnitude less energy efficiency than RedMulE 12x4 despite 5x more CEs
  • IBM chip is 2.4x more energy-efficient, 33.2x larger, and 74x more power-consuming than RedMulE 12x4
  • LNPU has 6.67x higher power envelope than RedMulE 12x4
  • Vega has 7.8x higher performance and 3.2x higher energy efficiency than RedMulE 12x4
  • Cambricon-Q is 2.9x more energy-efficient but uses 8-bit fixed-point arithmetic
  • Cambricon-Q is 17.7x more power-hungry than RedMulE 12x4
  • T-PIM works with 16-bit integer precision but does not satisfy precision requirements
  • TSUNAMI and Trainer use pruning and sparse matrices to increase energy efficiency
  • RedMulE 12x4 has 1.72x better performance and 20.5% higher energy efficiency than Anders et al.
  • SIMD 2 has 36.1x higher power consumption than RedMulE

Vii. conclusion

  • RedMulE is a cluster-coupled accelerator for TinyML training
  • RedMulE supports FP16 GEMM-Ops computation and compressed FP8 inputs
  • RedMulE achieves 99.4% CEs utilization and 15x speedup for GEMM execution
  • RedMulE accelerates matrix multiplication by up to 14.6x and 28.5x with 16-bit and 8-bit inputs respectively