The European Processor Initiative (EPI) is building a new central processing unit (CPU) with European technology. This CPU will bundle an accelerator based on the open source RISC-V architecture, which will include support for the upcoming V-extension of RISC-V. At the Barcelona Supercomputing Center (BSC), we have been hard at work building software tools and infrastructure to explore and learn about the benefits and challenges that this extension brings to the table.
RISC-V is a relative newcomer to the Instruction Set Architecture (ISA) space, alongside well-established ISAs like x86-64 and AArch64. Its main distinctive feature is that it is an open source specification that anyone can implement and extend, be it for commercial purposes or for research. The specification is maintained by the RISC-V Foundation and its members.
Another important aspect of the RISC-V ISA is that it has been designed to be modular. Aware that not all the features present in modern ISAs are of interest to everyone, its designers structured the ISA as a base specification plus a set of standard extensions. Also, given its open source nature, anyone can define their own non-standard extensions, and the ISA caters to that possibility by providing customization mechanisms. Standard extensions have the advantage that they are eventually ratified by the RISC-V Foundation and its members, after collaborative development of the specification and experience gathered from early implementations.
Standard extensions are commonly identified by a single letter. One of them is the V-extension, where V stands for vector. This extension aims to provide vector computation capabilities to the RISC-V ecosystem. It is currently under development, so no hardware or software exists yet that supports it.
Programs running on a computer are made up of a number of instructions. Each instruction is executed by the CPU, and most of them do a simple operation on a single data value. For instance, an instruction may add two numbers. While this is perfectly reasonable for most applications, there are several domains, spanning from High Performance Computing (HPC) to Digital Signal Processing (DSP), where applications need to repeatedly perform the same computation over a regular set of data. In our running example, rather than just adding two numbers, applications in these domains are better served by an instruction that can add, pair-wise, two sets of numbers. These sets of numbers are called vectors (conversely, a single number is called a scalar). In this scenario, using vector instructions can result in applications that run more efficiently, both in time and in energy consumption.
These kinds of instructions have traditionally been called Single Instruction Multiple Data (SIMD), as a single instruction operates over a set of data instead of an individual element. Many well-established vendors like Intel and Arm already provide SIMD instructions in the ISAs they maintain. Examples are Intel’s SSE, AVX2 and AVX-512, and Arm’s Advanced SIMD and SVE.
There are a number of distinctive features in the V-extension that make it very interesting and also pose some challenges to implementers. An important one is that the extension does not prescribe the length of the vectors. Traditional SIMD ISAs fix the size of a vector at 128, 256 or 512 bits. A drawback of this approach is that a new ISA is required every time there is interest in enlarging the vectors. The alternative is to let the vendor choose the size depending on its market needs. This is, for instance, what Arm’s SVE does, where vectors can range from 128 to 2048 bits, in multiples of 128 bits. The V-extension currently only requires the width to be a power of two, so it is possible to cover markets that are well served by shorter vectors, like DSP, as well as markets that benefit from longer vectors, like HPC.
Another interesting feature of the V-extension is that it restores the concept of vector length, a feature reminiscent of ancient vector architectures of the 1970s. The vector length tells the CPU how many elements in the vector have to be processed.
SIMD ISAs often process the whole vector, so this concept does not exist there. An application then has to consider the case where there is not enough data to fill a full vector. One option is to fall back to regular, scalar instructions. Another is to keep using vector instructions but discard some of the computed results, using a common feature called masking, which may come with an extra penalty. The vector length can be used to reduce the number of elements being processed without requiring extra instructions, as in the first option, or having to compute a mask. For some implementations, like EPI’s VPU, shortening the vector length also shortens the latency of instructions – no computation cycle is needed for the unused “tail” of the vector.
To explore the software side of the V-extension, we took the LLVM open source compiler, which, except for the V-extension, already has good support for a number of the standard extensions of RISC-V. LLVM is an umbrella project for open source compilers and other toolchain-related projects like linkers and static analyzers.
In order to enable exploration of the V-extension, we designed and implemented an initial set of C/C++ builtins. These builtins allow C/C++ developers to target the V-extension instructions from their applications. Together with our partners at EPI, we have used the developed builtins to port several computational kernels, the core parts of applications that are executed very frequently. Some of those kernels are HPC classics, such as matrix multiply, sparse matrix-vector multiplication and the FFTW implementation of the Fast Fourier Transform (FFT). Algorithms from other domains, such as cryptography, have also been evaluated under the V-extension.
Finally, we also implemented initial autovectorization support in the compiler. With autovectorization, the compiler determines by itself when vector instructions can be used, without the developer having to write builtins. Because of the two distinctive features of the V-extension described earlier, compilers do not yet have good support in this area. Most of our work here is infrastructural: making sure the compiler can vectorize in the way we believe is best for the V-extension. We hope to provide better support in this area, which is an intense work in progress.
The V-extension is still being defined, so no hardware exists that can execute V-extension instructions. This limits users and developers of the compiler, as they cannot tell whether their programs and the compiler work correctly. To unblock this, we developed an emulator, called Vehave, that runs on top of existing RISC-V Linux platforms. This way, the correctness of both applications and the compiler can be validated using the emulator.
We also implemented in the emulator a mechanism to generate traces of the executed vector instructions. These traces can be loaded into BSC’s trace visualization tool, Paraver. This provides valuable information to both users and developers of the compiler.
For instance, application developers can determine whether the compiler emits, for their application, instructions that are known to be slow in a specific implementation of the V-extension. Compiler developers can identify redundant instructions or complicated instruction sequences. At BSC, we identified some of those complicated instruction sequences and reported them to the V-extension working group. The specification was extended with new individual instructions that achieve the same functionality as the original sequences.
Since all compilers are large pieces of software, and in order to allow quicker experimentation, we installed a version of Compiler Explorer that works with the compiler we developed. This way it is possible to share small snippets of code and evaluate the quality of the code emitted by the compiler. This tool is publicly available at https://repo.hca.bsc.es/epic