

# Energy efficient computing from Exascale to MicroWatts: The RISC-V playground

Zürich, RISC-V Workshop

11.06.2019

Luca Benini<sup>1,2</sup>


European Research Council



<sup>1</sup>Department of Electrical, Electronic and Information Engineering

<sup>2</sup>Integrated Systems Laboratory





## IoT Hierarchical Processing (Compute Continuum)







## **Energy efficiency is THE Challenge**



#### 2013: Parallel Ultra Low Power $\rightarrow$ PULP!



Near-Threshold Computing (NTC):

1. Don't waste energy pushing devices into strong inversion
2. Recover performance with parallel execution
3. Manage leakage, PVT variability and SRAM limitations in NT!!!
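The first two points can be illustrated with a toy model. This is a minimal sketch that assumes dynamic energy per operation scales as $C V^2$ and that the workload parallelizes perfectly. The operating points (`V_NOMINAL`, `F_NOMINAL_MHZ`, and so on) are illustrative placeholders, not measurements from the talk, and leakage (point 3) is ignored.

```python
import math


def dynamic_energy_ratio(v_low: float, v_high: float) -> float:
    """Dynamic energy per operation at v_low relative to v_high (E ~ C * V^2)."""
    return (v_low / v_high) ** 2


def cores_to_recover(f_high_mhz: float, f_low_mhz: float) -> int:
    """Cores needed at f_low to match one core at f_high (ideal parallel scaling)."""
    return math.ceil(f_high_mhz / f_low_mhz)


# Illustrative operating points only (assumptions, not data from the slides).
V_NOMINAL, F_NOMINAL_MHZ = 1.2, 400.0
V_NEAR_THRESHOLD, F_NEAR_THRESHOLD_MHZ = 0.6, 50.0

ratio = dynamic_energy_ratio(V_NEAR_THRESHOLD, V_NOMINAL)
cores = cores_to_recover(F_NOMINAL_MHZ, F_NEAR_THRESHOLD_MHZ)
print(f"dynamic energy per op: {ratio:.2f}x, cores to recover throughput: {cores}")
```

Under these assumed numbers each operation costs a quarter of the dynamic energy, and eight cores restore the original throughput. Real designs lose part of that gain to leakage and variability, which is what point 3 addresses.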



## **Near-Threshold Multiprocessing**



Need a strong ISA, need full access to "deep" core interfaces, need to tune the pipeline!

Open ISA: **RISC-V** RV32IMC + **new, open microarchitecture** → <u>**RI5CY**</u>!



D. Rossi *et al.*, "Energy-Efficient Near-Threshold Parallel Computing: The PULPv2 Cluster," in *IEEE Micro*, Sep./Oct. 2017.


#### 8-Processor PULP Cluster: Parallel Speed-up





#### **Bespoke ISA needed!** Enter Xpulp extensions

Sub-32-bit precision $\rightarrow$ SIMD2/4 $\rightarrow$ 2x to 4x better efficiency and memory footprint

The RISC-V ISA is extensible by construction (great!)

- V1: Baseline RISC-V RV32IMC, HW loops
- V2: Post-modified load/store, MAC
- V3: SIMD 2/4 + dot product + shuffling, bit manipulation unit, lightweight fixed point (EML centric)
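To make the V3 packed-SIMD and dot-product idea concrete, here is a minimal software model. The helper names `pack4_i8`, `unpack4_i8` and `sdot4` are hypothetical and only illustrate the concept of four signed 8-bit lanes in one 32-bit word. They are not the actual Xpulp instruction definitions.

```python
import struct


def pack4_i8(lanes):
    """Pack four signed 8-bit values into one 32-bit word (little-endian lanes)."""
    return struct.unpack("<I", struct.pack("<4b", *lanes))[0]


def unpack4_i8(word):
    """Split a 32-bit word back into four signed 8-bit lanes."""
    return list(struct.unpack("<4b", struct.pack("<I", word)))


def sdot4(acc, word_a, word_b):
    """Accumulate the dot product of two packed 4x8-bit vectors (one 'instruction')."""
    return acc + sum(x * y for x, y in zip(unpack4_i8(word_a), unpack4_i8(word_b)))


a = [1, 2, 3, 4, -5, -6, 7, 8]
b = [10, -20, 30, 40, 50, 60, -70, 80]

# Scalar reference: one multiply-accumulate per element.
scalar = sum(x * y for x, y in zip(a, b))

# Packed version: four elements per 32-bit word, one sdot4 per word.
packed_a = [pack4_i8(a[i:i + 4]) for i in range(0, len(a), 4)]
packed_b = [pack4_i8(b[i:i + 4]) for i in range(0, len(b), 4)]
acc = 0
for word_a, word_b in zip(packed_a, packed_b):
    acc = sdot4(acc, word_a, word_b)

print(scalar, acc, len(a), "elements in", len(packed_a), "words")
```

Eight elements fit in two words, so the packed loop issues a quarter of the operations and touches a quarter of the memory that a 32-bit-per-element layout would need.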



#### $25KG \rightarrow 40KG$ (1.6x)



M. Gautschi et al., "Near-Threshold RISC-V Core With DSP Extensions for Scalable IoT Endpoint Devices," in IEEE TVLSI, Oct. 2017.


#### **RI5CY – are xPULP ISA Extensions (1.6x) worthwhile?**

Example loop: `for (i = 0; i < 100; i++) d[i] = a[i] + b[i];`

...YES! 10x on 2D convolutions.

| Variant | Inner loop | Cycles/output |
|---|---|---|
| Baseline | Explicit counter (`mv x5, 0`, `mv x4, 100`), two `lb` loads, separate `addi` updates for pointers and counter, `add`, `sb`, `bne x4, x5, Lstart` | 11 |
| Auto-incr load/store | Post-increment loads and stores (`lb x2, 0(x10!)`, `lb x3, 0(x11!)`) replace the pointer `addi` instructions; counter `addi` and `bne x4, x5, Lstart` remain | 8 |
| HW Loop | `lp.setupi 100, Lend` replaces counter and branch; body is `lb x2, 0(x10!)`, `lb x3, 0(x11!)`, `add x2, x3, x2` and a post-increment `sb` at `Lend:` | 5 |
| Packed-SIMD | `lp.setupi 25, Lend`<br>`lw x2, 0(x10!)`<br>`lw x3, 0(x11!)`<br>`pv.add.b x2, x3, x2`<br>`Lend: sw x2, 0(x12!)` | 1.25 |
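The cycle counts quoted for the four loop variants translate directly into totals and speed-ups. The sketch below only does that arithmetic. Reading the packed-SIMD figure as a 5-cycle loop body that yields 4 byte results per iteration is an interpretation of the numbers, not a statement from the slides.

```python
# Cycles per output element, as quoted for d[i] = a[i] + b[i].
CYCLES_PER_OUTPUT = {
    "baseline": 11.0,
    "auto-incr load/store": 8.0,
    "hw loop": 5.0,
    "packed-simd": 1.25,
}
N_OUTPUTS = 100  # trip count of the scalar C loop


def total_cycles(variant: str) -> float:
    """Total cycles to produce all N_OUTPUTS results."""
    return CYCLES_PER_OUTPUT[variant] * N_OUTPUTS


def speedup_vs_baseline(variant: str) -> float:
    """Speed-up of a variant relative to the baseline RV32IMC loop."""
    return CYCLES_PER_OUTPUT["baseline"] / CYCLES_PER_OUTPUT[variant]


# Interpretation (assumption): 25 iterations, 5 cycles each, 4 bytes per iteration.
simd_cycles_per_output = 5 / 4

for name in CYCLES_PER_OUTPUT:
    print(f"{name:22s}{total_cycles(name):7.0f} cycles  {speedup_vs_baseline(name):5.2f}x")
```

On this byte-wise addition kernel the combined extensions give roughly 8.8x over the baseline. The 10x figure quoted for 2D convolutions is a separate result.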



#### The Evolution of the 'Species'




Ultra-simplified Open HW release: 1-core PULPINO


#### More efficiency: Heterogeneous PULP Cluster



#### PULP cluster+MCU+HWCE $\rightarrow$ GWT's GAP8 (TSMC 55 nm)

Two independent clock and voltage domains, from 0-133MHz/1V up to 0-250MHz/1.2V

- **MCU Function** (FC clock & voltage domain): extended RISC-V core (fabric controller), extensive I/O set (LVDS, serial I/Q, UART, SPI, CPI, HyperBus, GPIO / PWM), micro DMA, L2 memory, embedded DC/DC converter, secured execution, PMU, RTC, debug, ROM
- **Computation engine** (cluster clock & voltage domain): 8 extended RISC-V cores (Core 0 to Core 7), fully programmable, efficient parallelization, shared L1 memory, logarithmic interconnect, shared instruction cache, multi-channel DMA, HW synchronization, HW convolution engine

| What                | Freq (MHz) | Exec time (ms) | Cycles     | Power (mW) |
|---------------------|------------|----------------|------------|------------|
| 40nm Dual Issue MCU | 216        | 99.1           | 21 400 000 | 60         |
| GAP8 @1.0V          | 15.4       | 99.1           | 1 500 000  | 3.7        |
| GAP8 @1.2V          | 175        | 8.7            | 1 500 000  | 70         |
| GAP8 @1.0V w HWCE   | 4.7        | 99.1           | 460 000    | 0.8        |

**11 X** faster at 1.2V (99.1 ms → 8.7 ms), **16 X** lower power at 1.0V (60 mW → 3.7 mW) for the same execution time.

4x More efficiency at less than 10% area cost
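The ratios in the GAP8 comparison can be rechecked from the table's own columns. A minimal sketch, assuming cycles = frequency x time and energy = power x time. The `Run` class and its field names are introduced here for illustration only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Run:
    name: str
    freq_mhz: float
    time_ms: float
    power_mw: float

    @property
    def cycles(self) -> float:
        # MHz * ms = 1e6 * 1e-3 = 1e3 cycles
        return self.freq_mhz * self.time_ms * 1e3

    @property
    def energy_uj(self) -> float:
        # mW * ms = microjoules
        return self.power_mw * self.time_ms


mcu = Run("40nm Dual Issue MCU", 216.0, 99.1, 60.0)
gap8_1v0 = Run("GAP8 @1.0V", 15.4, 99.1, 3.7)
gap8_1v2 = Run("GAP8 @1.2V", 175.0, 8.7, 70.0)
gap8_hwce = Run("GAP8 @1.0V w HWCE", 4.7, 99.1, 0.8)

speedup_1v2 = mcu.time_ms / gap8_1v2.time_ms          # faster at similar power
power_saving_1v0 = mcu.power_mw / gap8_1v0.power_mw   # lower power at equal time
hwce_gain = gap8_1v0.energy_uj / gap8_hwce.energy_uj  # benefit of the HWCE

for run in (mcu, gap8_1v0, gap8_1v2, gap8_hwce):
    print(f"{run.name:20s}{run.cycles / 1e6:6.2f} Mcycles {run.energy_uj:8.1f} uJ")
print(f"{speedup_1v2:.1f}x faster, {power_saving_1v0:.1f}x lower power, HWCE {hwce_gain:.1f}x")
```

The computed cycle counts land within a few percent of the rounded values in the table, and the HWCE row works out to a bit more than 4x less energy than the cluster alone at 1.0V.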



#### **Back to the Cloud**


#### 1 PFLOPS, top 20 in GREEN500'17





| Parameter                  | Value                                                    |
|----------------------------|----------------------------------------------------------|
| Total number (racks)       | 5                                                        |
| Total number of nodes      | 45 (compute) + 2 (service & login nodes)                 |
| Compute node form factor   | 2 OU                                                     |
| SoC                        | 2xPOWER8 NVLink                                          |
| GPU                        | 4xNVIDIA Tesla P100 SXM2                                 |
| Network                    | 2xIB EDR, 2x 1GbE                                        |
| Cooling                    | SoC and GPU with direct hot water                        |
| Heat exchanger             | Liquid-liquid, redundant pumps                           |
| Max performance (per node) | 22 TFLOPs (double precision), 44 TFLOPs single precision |
| Storage                    | 1xSSD SATA                                               |
| Power                      | DC power distribution                                    |

#### 2 kW/node, 300 W+ per GPU
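The per-node figures add up to the system-level headline. The sketch below is back-of-the-envelope arithmetic on the numbers quoted in this section. It treats 2 kW and 300 W as exact values, although the slide gives the GPU figure as a lower bound, and it yields theoretical peak efficiency, not a measured Green500 result.

```python
COMPUTE_NODES = 45
PEAK_TFLOPS_DP_PER_NODE = 22.0  # double precision, per node
NODE_POWER_KW = 2.0             # quoted as "2 kW/node"
GPUS_PER_NODE = 4
GPU_POWER_W = 300.0             # quoted as "300 W+", so a lower bound

system_peak_pflops = COMPUTE_NODES * PEAK_TFLOPS_DP_PER_NODE / 1000.0
system_power_kw = COMPUTE_NODES * NODE_POWER_KW
gpu_power_share = GPUS_PER_NODE * GPU_POWER_W / (NODE_POWER_KW * 1000.0)
peak_gflops_per_watt = PEAK_TFLOPS_DP_PER_NODE * 1000.0 / (NODE_POWER_KW * 1000.0)

print(f"peak: {system_peak_pflops:.2f} PFLOPS at about {system_power_kw:.0f} kW")
print(f"GPUs draw at least {gpu_power_share:.0%} of node power")
print(f"peak efficiency: {peak_gflops_per_watt:.1f} GFLOPS/W")
```

Forty-five nodes at 22 TFLOPs give 0.99 PFLOPS, which matches the "1 PFLOPS" headline, and the GPUs account for at least 60% of each node's power budget.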



#### **Addressing GPU Weaknesses**

- Peak compute reaches 15 Tflop/s these days
- Only about 5% of GPU power is estimated to be spent in the FPUs [1]:
  - [1] reports 2.9%, but their kernels don't reach TDP/max perf.
  - In dubio pro NVIDIA: we scale power on the assumption that modern GPUs do not exceed TDP at max perf. (which makes them look more efficient)
  - Key point: GPU RF is SRAM (remember FMUL32 4pJ, SRAM 20pJ)



Graph extracted and cropped from [1].

| 1 Volta SM |
|---|
| 64 FPUs |
| **256 kB RF**<br>128 kB L0 Cache |
| 32-2048 threads |

#### Volta Assembly

`LDS R2, [R0]`<br>`LDS R3, [R1]`<br>`FFMA R4, R2, R3, R2`

2 mem. acc. ("[...]") + 8 reg. acc. into RF SRAM = 10 SRAM R/W total



[1] S. Hong and H. Kim, "An integrated gpu power and performance model," in ACM SIGARCH Computer Architecture News, 2010.

## **Network Training Accelerator (NTX)**

- Processor configures Reg IF and manages DMA double-buffering in L1 memory
- Controller issues AGU, HWL, and FPU micro-commands based on configuration
  - AGUs generate address streams for data access
- FMAC with extended precision + ML functions

Reads/writes data via 2 memory ports (2 operand and 1 writeback streams)
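The streaming behaviour described above can be sketched in software. This is a simplified, hypothetical model: the `AGU` class and `fmac_stream` function are invented for illustration, each AGU has a single base and stride, and the hardware loop is a plain Python loop. The real NTX configuration registers are not modelled.

```python
from dataclasses import dataclass


@dataclass
class AGU:
    """Hypothetical address generation unit: base, base+stride, base+2*stride, ..."""
    base: int
    stride: int

    def stream(self, count: int):
        for i in range(count):
            yield self.base + i * self.stride


def fmac_stream(memory, agu0: AGU, agu1: AGU, count: int, accu: float = 0.0) -> float:
    """Model of 'FMAC accu, [AGU0], [AGU1]' repeated by a hardware loop.

    Both operands come straight from memory; the accumulator never leaves the FPU.
    """
    for addr_a, addr_b in zip(agu0.stream(count), agu1.stream(count)):
        accu += memory[addr_a] * memory[addr_b]
    return accu


# Flat memory: a 2x3 row-major matrix at addresses 0..5, a vector at 6..8.
memory = [1.0, 2.0, 3.0,
          4.0, 5.0, 6.0,
          0.5, 1.0, 2.0]

# Second matrix row times the vector: unit stride on both streams.
row1_dot = fmac_stream(memory, AGU(base=3, stride=1), AGU(base=6, stride=1), count=3)

# First matrix column (stride 3) times the first two vector entries.
col0_dot = fmac_stream(memory, AGU(base=0, stride=3), AGU(base=6, stride=1), count=2)

print(row1_dot, col0_dot)
```

Changing only `base` and `stride` switches between row and column access, which is the sense in which address calculation comes for free once the processor has written the configuration.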



Again: specialized "deep interfaces" + Instruction extensions



## NTX Power Breakdown & GPU SM Comparison

- NTX dissipates a significant fraction of its power in the FPU (more is better):
  - 31% of cluster
  - 14% of entire HMC
  - Recall: GPU is just around 5% [1]
- Compared to NVIDIA Volta GPU [2]:
  - Register file in GPU holds registers and thread-local data
  - Each register read/write is an SRAM access
  - Register and data accesses compete for SRAM

| 1 Volta SM                      | 8 NTX cl.   |
|---------------------------------|-------------|
| 64 FPUs                         | 64 FPUs     |
| 256 kB RF<br>128 kB L0<br>Cache | 512 kB TCDM |
| 32-2048 threads                 | 8 threads   |



| Volta Assembly                                      | NTX Pseudocode                                                  |
|-----------------------------------------------------|-----------------------------------------------------------------|
| LDS R2, [R0]<br>LDS R3, [R1]<br>FFMA R4, R2, R3, R2 | FMAC accu, [AGU0], [AGU1]                                       |
| 2 mem. acc. ("[...]")<br>8 reg. acc.                | 2 mem. acc. ("[...]")<br>0 reg. acc.<br>(+ addr. calc for free) |
| = 10 SRAM hits total                                | = 2 SRAM hits total                                             |
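The access counts can be turned into a rough energy estimate using the per-operation figures quoted earlier (FMUL32 4 pJ, SRAM 20 pJ). This is a back-of-the-envelope sketch: it charges every register-file and memory access as one SRAM access and uses the FMUL32 figure as a stand-in for the arithmetic cost of one multiply-accumulate. It is not the measured power breakdown.

```python
SRAM_ACCESS_PJ = 20.0  # energy per SRAM access, as quoted
FMUL32_PJ = 4.0        # energy per 32-bit FP multiply, as quoted


def sram_accesses(mem_accesses: int, reg_accesses_in_sram: int) -> int:
    """Total SRAM reads/writes for one multiply-accumulate."""
    return mem_accesses + reg_accesses_in_sram


def energy_per_mac_pj(mem_accesses: int, reg_accesses_in_sram: int) -> float:
    """SRAM energy plus arithmetic energy for one multiply-accumulate."""
    return sram_accesses(mem_accesses, reg_accesses_in_sram) * SRAM_ACCESS_PJ + FMUL32_PJ


# Volta: 2 loads from memory, 8 register accesses that also hit SRAM.
volta_accesses = sram_accesses(2, 8)
volta_pj = energy_per_mac_pj(2, 8)

# NTX: 2 streamed operands, accumulator stays inside the FPU.
ntx_accesses = sram_accesses(2, 0)
ntx_pj = energy_per_mac_pj(2, 0)

print(f"Volta: {volta_accesses} SRAM accesses, about {volta_pj:.0f} pJ")
print(f"NTX:   {ntx_accesses} SRAM accesses, about {ntx_pj:.0f} pJ")
```

Under these assumptions the data movement per multiply-accumulate drops from about 200 pJ to about 40 pJ, which is why a larger share of the NTX power budget ends up in the FPU.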



## **Low Precision Formats for Training**






## NTX→European Processor Initiative



#### **Europe Needs its own Processors**

- Processors now control almost every aspect of our lives
- Security (back doors etc.)
- Possible future restrictions on exports to EU due to increasing protectionism
- A competitive EU supply chain for HPC technologies will create jobs and growth in Europe
- Sovereignty (data, economical, embargo)



- High Performance General Purpose Processor for HPC
- High-performance RISC-V based accelerator (NTX)
- Computing platform for autonomous cars
- Will also target the AI, Big Data and other markets in order to be economically sustainable



## Putting it all together: The Open PULP platform



But this is way too much for a university (or two)!


## A not-for-profit Company: Enter lowRISC!

lowRISC Community Interest Company



enabling open source silicon through collaborative engineering





#### lowRISC is up and... hiring



Alex Bradbury, Dr Gavin Ferris, Dr Robert Mullins, Prof. Luca Benini, Ron Minnich, Dominic Rizzo










#### Will just one NFP Company be Enough?



## **OpenHW Group Charter**

**OpenHW Group** is a not-for-profit, global organization driven by its members and individual contributors where hardware and software designers collaborate in the development of open-source cores, related IP, tools and software such as the **CORE-V Family of cores**. OpenHW provides an infrastructure for hosting high quality open-source HW developments in line with industry best practices.



see R. O'Connor (OpenHW CEO) talk





#### www.pulp-platform.org



The fun is just beginning...


