The MRISC32 approach to Single Instruction Multiple Data (SIMD) operation is very similar to the early vector processors (such as the Cray-1):
- There are 32 vector registers, V0-V31, with at least 16 elements in each register.
- All vector elements are the same size (32 bits), regardless of whether they represent bytes, half-words, words or floats.
- A Vector Length (VL) register controls the length of the vector operation.
- There are vector,vector and vector,scalar versions of most integer and floating-point operations.
- Vector loads and stores can either be stride-based or gather-scatter (see addressing modes for more details).
- Folding operations are provided for doing horizontal vector operations (e.g. sum, min/max).
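To make the register model above concrete, here is a minimal C sketch of the architectural vector state (illustrative only: the type and constant names are made up here, and 16 is just the minimum element count):

```c
#include <stdint.h>

#define NUM_VECTOR_REGS 32     /* V0-V31 */
#define MIN_VECTOR_ELEMENTS 16 /* At least 16 elements per register */

/* Illustrative model of the MRISC32 vector state: every element is 32 bits
 * wide, regardless of whether it holds a byte, half-word, word or float. */
typedef struct {
    uint32_t v[NUM_VECTOR_REGS][MIN_VECTOR_ELEMENTS]; /* Vector registers */
    uint32_t vl; /* Vector Length: controls the length of vector operations */
} mrisc32_vector_state_t;
```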
Planned (not yet implemented; a sketch of the intended semantics follows the list):
- Each vector register has a Register Length (RL) state.
  - Writing to a register updates the Register Length to the operation Vector Length.
  - Elements with an index >= RL are zero.
  - Clearing all Register Lengths to zero reduces stack overhead (e.g. registers with RL = 0 need not be saved and restored).
- Each vector register has a Register Mask (RM) state.
  - The Register Mask can have three possible values:
    - 0 (zero): All elements are zero (`0x00000000`).
    - -1 (set): All elements are minus one (`0xffffffff`).
    - +1: At least one element is non-zero and at least one element is non-minus-one.
  - The Register Mask is suitable as a branch condition, e.g. in combination with `S[cc]` instructions with vector operands.
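The intended RL/RM semantics can be illustrated with a small C model (a sketch of the planned behavior, not an actual implementation; the interaction between RL and RM is not spelled out above, so this model derives RM from the written elements only):

```c
#include <stdint.h>

#define VECTOR_ELEMENTS 16

typedef struct {
    uint32_t elem[VECTOR_ELEMENTS];
    uint32_t rl; /* Register Length: elements with index >= rl read as zero */
    int      rm; /* Register Mask: 0, -1 or +1 (see above) */
} vreg_t;

/* Reading an element honors the Register Length. */
static uint32_t read_element(const vreg_t* r, uint32_t idx) {
    return (idx < r->rl) ? r->elem[idx] : 0u;
}

/* Writing a result updates RL to the operation Vector Length (vl) and
 * re-derives RM. Assumption: RM classifies the elements that were written. */
static void write_result(vreg_t* r, const uint32_t* result, uint32_t vl) {
    int any_nonzero = 0, any_not_minus_one = 0;
    for (uint32_t i = 0; i < vl; ++i) {
        r->elem[i] = result[i];
        if (result[i] != 0x00000000u) any_nonzero = 1;
        if (result[i] != 0xffffffffu) any_not_minus_one = 1;
    }
    r->rl = vl;
    if (!any_nonzero)            r->rm = 0;  /* all elements zero */
    else if (!any_not_minus_one) r->rm = -1; /* all elements minus one */
    else                         r->rm = 1;  /* mixed contents */
}
```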
The dominating SIMD solution today (SSE, AVX, NEON) is based on an ISA that is largely separate from the scalar ISA of the CPU. That model, however, comes with relatively high costs for hardware and software:
- All SIMD instructions operate on fixed-width registers (you have to use all of the elements or none of them).
- A completely separate instruction set and separate execution units are used for operating on the SIMD registers.
- Each register is split into a different number of elements depending on the data type (i.e. the data is packed in the registers).
- It is hard to write software that utilizes the SIMD hardware efficiently, partially because compilers have a hard time mapping traditional software constructs to the SIMD ISA, so you often have to hand-write code at a very low level.
- In order to exploit more parallelism in new hardware generations, new instruction sets and registers have to be added (e.g. MMX vs SSE vs AVX vs ...), leading to a very complex software model.
In comparison, the MRISC32 vector model is easier to implement in hardware and easier to use in software. For instance:
- The same execution units can be used for both vector operations and scalar operations, meaning less hardware.
- The software model maps better to traditional software patterns, and it should be easier for compilers to auto-vectorize code.
- The same ISA can be used for many different levels of hardware parallelism. In other words, the vector model scales well from very simple, scalar architectures, all the way up to highly parallel superscalar architectures.
Consider the following C code:
```c
#include <math.h>

void abs_diff(float* c, const float* a, const float* b, const int n) {
    for (int i = 0; i < n; ++i) {
        c[i] = fabs(a[i] - b[i]);
    }
}
```
Assuming that the arguments (c, a, b, n) are in registers S1, S2, S3 and S4, respectively (according to the calling convention), this can be implemented using scalar operations as:
```
abs_diff:
    bz   s4, #done           ; n == 0? (nothing to do)
    ldi  s5, #0
loop:
    add  s4, s4, #-1         ; Decrement the loop counter
    ldw  s6, s2, s5*4        ; s6 = a
    ldw  s7, s3, s5*4        ; s7 = b
    fsub s6, s6, s7          ; s6 = a - b
    and  s6, s6, #0x7fffffff ; s6 = fabs(a - b) (i.e. clear the sign bit)
    stw  s6, s1, s5*4        ; c = fabs(a - b)
    add  s5, s5, #1          ; Increment the array offset
    bgt  s4, #loop
done:
    ret
```
...or using vector operations as:
```
abs_diff:
    bz    s4, #done           ; n == 0? (nothing to do)

    ; Prepare the vector operation
    mov   s5, vl              ; Preserve VL
    cpuid s6, z, z            ; s6 is the max number of vector elements
loop:
    min   vl, s4, s6          ; vl = min(s4, s6)
    sub   s4, s4, vl          ; Decrement the loop counter
    ldw   v1, s2, #4          ; v1 = a
    ldw   v2, s3, #4          ; v2 = b
    fsub  v1, v1, v2          ; v1 = a - b
    and   v1, v1, #0x7fffffff ; v1 = fabs(a - b) (i.e. clear the sign bit)
    stw   v1, s1, #4          ; c = fabs(a - b)
    ldea  s1, s1, vl*4        ; Increment the memory pointers
    ldea  s2, s2, vl*4
    ldea  s3, s3, vl*4
    bgt   s4, #loop
    mov   vl, s5              ; Restore VL
done:
    ret
```
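For reference, the vector version above is equivalent to the following strip-mined C loop (a sketch; `MAX_VL` is a hypothetical stand-in for the maximum vector length returned by `cpuid`):

```c
#include <math.h>

#define MAX_VL 16 /* Hypothetical: max elements per vector register (cpuid) */

void abs_diff_strip_mined(float* c, const float* a, const float* b, int n) {
    while (n > 0) {
        int vl = (n < MAX_VL) ? n : MAX_VL; /* min vl, s4, s6 */
        /* The inner loop corresponds to the ldw/fsub/and/stw vector
         * instructions, each operating on vl elements. */
        for (int i = 0; i < vl; ++i) {
            c[i] = fabsf(a[i] - b[i]);
        }
        c += vl; a += vl; b += vl; /* ldea: advance the memory pointers */
        n -= vl;                   /* sub: decrement the loop counter */
    }
}
```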
Notice that:
- The same instructions are used in both cases, only with vector operands for the vector version.
- It is easy to mix scalar and vector operands for vector operations.
- The loop overhead is actually lower in the vector version, since fewer loop iterations are required:
  - Scalar version: 3 overhead instructions / array element.
  - Vector version: 6/16 = 0.375 overhead instructions / array element (for a machine with 16 elements per vector register).
- Any data dependency latencies in the scalar version (e.g. due to memory loads and the FPU pipeline) have vanished in the vector version.
- Each vector instruction will iterate over several cycles, allowing the first vector elements to be produced before the next instruction starts executing.
  - Typically, a machine with an `M` elements wide, `N` stages long pipeline will have at least `M * N` elements per vector register.
It is possible to implement vector operations in various ways, with different degrees of parallelism and different levels of operation throughput.
In the simplest implementation each vector operation is implemented as a pipeline interlocking loop that executes a single vector element operation per clock cycle. This is essentially just a hardware assisted loop.
Even in this implementation, the vectorized operation will be faster than a corresponding repeated scalar operation, for a number of reasons (illustrated by the sketch after this list):
- Less overhead from loop branches, counters and memory index calculation.
- Improved throughput thanks to reduced number of data dependency stalls (vector operations effectively hide data dependencies).
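As a C sketch, a vector instruction in this element-serial model executes roughly like the following hardware-assisted loop (illustrative names; one element operation per clock cycle):

```c
#include <stdint.h>

/* Illustrative model of element-serial vector execution: the control logic
 * iterates over the active elements and issues one element operation per
 * clock cycle, without any per-element fetch, decode or branch overhead. */
static void exec_vector_op(uint32_t* dst, const uint32_t* src_a,
                           const uint32_t* src_b, uint32_t vl,
                           uint32_t (*op)(uint32_t, uint32_t)) {
    for (uint32_t i = 0; i < vl; ++i) {
        dst[i] = op(src_a[i], src_b[i]); /* One element per clock cycle */
    }
}
```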
Low-hanging fruit:
- Deep pipelining:
  - Long pipelines (e.g. for division instructions) can be blocking/stalling in scalar mode, but fully pipelined in vector mode, without breaking the promise of in-order instruction retirement.
- Improved cache performance:
  - With relatively little effort, accurate (non-speculative) data cache prefetching can be implemented in hardware for vector loads and stores.
An extension to the simplest model is to keep two (or more) vector loops running in parallel, which would enable a single-issue CPU (fetching only a single instruction per cycle) to execute multiple operations in parallel.
This is similar to the concept of "chaining" in the Cray-1, which allowed it to reach 160 MFLOPS at 80 MHz.
This requires slightly more hardware logic:
- Duplicated vector loop logic.
- Duplicated register fetch and instruction issue logic.
- More register read ports (with some restrictions it may be possible to rely entirely on operand forwarding though).
- Logic for determining if two vector operations can run in parallel, and how.
- Possibly more execution units, in order to maximize parallelism.
One advantage of this implementation is that the instruction fetch pipeline can be kept simple, and the logic for running multiple instructions in parallel is simpler than that of a traditional superscalar architecture.
Another advantage, compared to implementing a wider pipeline (see below), is that you can improve parallelism without adding more execution units.
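A toy C model of the timing benefit (hypothetical numbers, assuming a one-cycle result latency; instruction B consumes the elements produced by instruction A):

```c
#include <stdio.h>

#define VL 16 /* Elements per vector operation */

/* Toy timing model for two dependent vector instructions, A and B.
 * Without chaining, B cannot start until A has produced its last element.
 * With chaining (two parallel vector loops), B trails A by one cycle,
 * consuming each element as soon as it is produced. */
int main(void) {
    int serial_cycles  = VL + VL; /* A runs to completion, then B runs */
    int chained_cycles = 1 + VL;  /* B starts one cycle after A */
    printf("Without chaining: %d cycles\n", serial_cycles);
    printf("With chaining:    %d cycles\n", chained_cycles);
    return 0;
}
```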
Instead of processing one element at a time, each vector loop iteration can process multiple elements at a time. For instance, if there are four identical floating point units, four elements can be read from a vector register and processed in parallel per clock cycle.
This is essentially the same principle as for SIMD ISAs such as SSE or NEON.
This puts additional requirements on the hardware logic, which must be able to issue multiple elements per vector operation (see the sketch after the list below). In particular, the hardware needs:
- A sufficient number of execution units.
- Wider read/write ports for the vector registers and the data cache(s).
- A more advanced data cache interface (e.g. for wide gather-scatter operations).
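Extending the earlier element-serial sketch, a wide implementation processes `W` elements (lanes) per clock cycle (a sketch assuming `W` identical execution units; names are illustrative):

```c
#include <stdint.h>

#define W 4 /* Number of parallel lanes (e.g. four identical FPUs) */

/* Illustrative model of wide vector execution: each iteration of the outer
 * loop models one clock cycle in which up to W elements are read from the
 * vector registers and processed in parallel. */
static void exec_vector_op_wide(uint32_t* dst, const uint32_t* src_a,
                                const uint32_t* src_b, uint32_t vl,
                                uint32_t (*op)(uint32_t, uint32_t)) {
    for (uint32_t i = 0; i < vl; i += W) { /* One clock cycle per iteration */
        uint32_t lanes = (vl - i < W) ? (vl - i) : W;
        for (uint32_t lane = 0; lane < lanes; ++lane) {
            dst[i + lane] = op(src_a[i + lane], src_b[i + lane]);
        }
    }
}
```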