Chapter — Vector processing & RVV

A different way to do more than one operation per cycle: apply the same operation to many data items with a single instruction. The RISC-V Vector extension (RVV) is the modern, scalable take.

Why vector — three levels of parallelism
Scalar vs vector — the punchline
Traditional SIMD (SSE/AVX) and its limits
RVV programming model
Strip-mining loop walkthrough
Why RVV beats SSE/AVX for portability

1. Three levels of parallelism

Type	Example	Granularity
Instruction-Level	Pipelining, superscalar, OoO	Within one instruction stream
Thread-Level	Multicore, SMT	Multiple instruction streams
Data-Level	SIMD / Vector / GPU	One instruction, many data elements

Data-level parallelism

Run the same operation on many data items at once. Examples: pixel-wise image filters, matrix add, dot products, AI tensor ops.

2. Scalar vs vector: the punchline

Scalar RISC-V (10-element vector add)

li     t0, 0
li     t1, 10
loop:
  bge    t0, t1, done
  slli   t2, t0, 2
  add    t3, a0, t2
  flw    fa0, 0(t3)
  add    t3, a1, t2
  flw    fa1, 0(t3)
  fadd.s fa2, fa0, fa1
  add    t3, a2, t2
  fsw    fa2, 0(t3)
  addi   t0, t0, 1
  j      loop
done:

10 iterations × 11 instructions each ≈ 110 dynamic instructions.

RVV equivalent

vsetvli  t0, a3, e32, m1, ta, ma
vle32.v  v0, (a0)
vle32.v  v1, (a1)
vfadd.vv v2, v0, v1
vse32.v  v2, (a2)

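5 instructions, one pass through 10 elements, assuming a3 = 10 and VLMAX ≥ 10 (for example VLEN = 512 at SEW = 32, LMUL = 1). Narrower hardware needs several passes through a strip-mining loop. Roughly 20× fewer dynamic instructions.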
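The instruction counts above can be checked with a small model. This is a minimal sketch, not a performance model: the function names are hypothetical, the scalar count simply tallies the instructions in the listing, and the vector count is a lower bound of 5 instructions per pass that ignores the pointer updates and branch a real strip-mined loop adds.

```python
import math

SCALAR_SETUP = 2         # li t0, 0 ; li t1, 10
SCALAR_PER_ELEMENT = 11  # bge, slli, add, flw, add, flw, fadd.s, add, fsw, addi, j
SCALAR_EXIT = 1          # the final bge that branches to done
VECTOR_PER_PASS = 5      # vsetvli, vle32.v, vle32.v, vfadd.vv, vse32.v


def scalar_dynamic_instructions(n):
    """Dynamic instruction count of the scalar loop for n elements."""
    return SCALAR_SETUP + SCALAR_PER_ELEMENT * n + SCALAR_EXIT


def vector_passes(n, vlmax):
    """Number of passes needed when each pass handles at most vlmax elements."""
    return math.ceil(n / vlmax)


def vector_dynamic_instructions(n, vlmax):
    """Lower bound: 5 vector instructions per pass, no loop overhead counted."""
    return VECTOR_PER_PASS * vector_passes(n, vlmax)


if __name__ == "__main__":
    print(scalar_dynamic_instructions(10))      # 113
    print(vector_dynamic_instructions(10, 16))  # 5  (one pass)
    print(vector_dynamic_instructions(10, 4))   # 15 (three passes)
```

With VLMAX = 16 the ratio is 113 / 5, a little over 20×. With VLMAX = 4 it drops to roughly 7×, which is why the single-pass figure depends on the hardware width.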

Analogy: Scalar = pairing socks one at a time. Vector = a sock-pairing machine that swallows 8 pairs and spits them out matched in one motion. Same total work, far less control overhead.

3. Traditional SIMD and its limits

Family	Width	Vendor
MMX, SSE, SSE2…SSE4.2	64 → 128 bits	Intel/AMD
AVX, AVX2	256 bits	Intel/AMD
AVX-512	512 bits	Intel/AMD
NEON	128 bits	ARM (mobile)
SVE / SVE2	Scalable	ARM (server)

Pain point: Traditional (fixed-width) SIMD bakes the vector width into the opcode. A binary compiled for SSE (128 b) cannot use the 512-bit registers of an AVX-512 CPU unless it is recompiled or ships a separate code path per width. You also have to write a separate "tail" loop for the leftover elements that don't fill a full vector.
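The tail-loop problem can be sketched in a few lines. This is an illustrative Python model, not real SSE/AVX code: `simd_add_fixed_width` and its `width` parameter are hypothetical names, and each slice assignment stands in for one full-width SIMD instruction.

```python
def simd_add_fixed_width(x, y, width=4):
    """Model of a fixed-width SIMD add: full-width main loop plus a scalar tail loop."""
    assert len(x) == len(y)
    n = len(x)
    z = [0] * n
    main_end = n - n % width  # last index covered by whole vectors

    full_vector_ops = 0
    for i in range(0, main_end, width):
        # One "SIMD instruction": always exactly `width` lanes.
        z[i:i + width] = [a + b for a, b in zip(x[i:i + width], y[i:i + width])]
        full_vector_ops += 1

    tail_ops = 0
    for i in range(main_end, n):
        # Leftover elements fall back to one scalar add each.
        z[i] = x[i] + y[i]
        tail_ops += 1

    return z, full_vector_ops, tail_ops


if __name__ == "__main__":
    x = list(range(10))
    y = [10] * 10
    print(simd_add_fixed_width(x, y, width=4))
```

For 10 elements and a width of 4, the model issues two full-width operations and two scalar tail operations. Changing `width` changes both loops, which mirrors why fixed-width code has to be rebuilt per target.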

4. RVV programming model

4.1 VLEN, SEW, LMUL

Hardware fixed

VLEN: width of each vector register in bits (128, 256, 512, …). Set by chip designer.

Software-set per pass

SEW (Selected Element Width): bits per element — 8, 16, 32, or 64.
LMUL (Length Multiplier): 1, 2, 4, 8, or fractional ½, ¼, ⅛. LMUL > 1 groups multiple v-regs into one logical larger register; fractional LMUL uses only part of a single register.
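Register grouping can be illustrated with a small model. This is a sketch under two facts from the RVV specification: there are 32 vector registers (v0 to v31), and for LMUL > 1 the base register number of a group must be a multiple of LMUL. The function names are hypothetical.

```python
from fractions import Fraction

NUM_VREGS = 32  # RVV defines vector registers v0..v31


def register_group(base, lmul):
    """Return the architectural registers covered by a group starting at v<base>."""
    if not 0 <= base < NUM_VREGS:
        raise ValueError("base register must be in 0..31")
    lmul = Fraction(lmul)
    if lmul <= 1:
        # LMUL = 1 or fractional: the operand lives in a single register.
        return [f"v{base}"]
    span = int(lmul)
    if base % span != 0:
        raise ValueError(f"v{base} is not aligned to LMUL={span}")
    return [f"v{base + i}" for i in range(span)]


def usable_groups(lmul):
    """How many distinct register groups exist at this LMUL."""
    return NUM_VREGS // max(1, int(Fraction(lmul)))


if __name__ == "__main__":
    print(register_group(8, 4))  # ['v8', 'v9', 'v10', 'v11']
    print(usable_groups(8))      # 4
```

The trade-off this shows: a larger LMUL gives longer logical vectors per instruction but leaves fewer independent register names, 4 groups at LMUL = 8 versus 32 at LMUL = 1.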

4.2 AVL, VLMAX, VL

The runtime triad

AVL (Application Vector Length): how many elements the program wants to process.
VLMAX: hardware capacity = LMUL × VLEN / SEW.
VL: what the hardware will process this iteration = min(AVL, VLMAX).

4.2.1 Numeric example

VLEN = 256 bits, SEW = 32 bits, LMUL = 1, AVL = 5

VLMAX = 1 · 256 / 32 = 8 elements
VL    = min(5, 8) = 5

[ e0  e1  e2  e3  e4  ·  ·  · ]   ← 3 trailing tail lanes (indices 5..7)
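The same arithmetic can be tried for other configurations with a short helper. This is a minimal sketch of the two formulas from section 4.2; the function names `vlmax` and `vl_for` are hypothetical, and `Fraction` is used only so that fractional LMUL values stay exact.

```python
from fractions import Fraction


def vlmax(vlen, sew, lmul=1):
    """VLMAX = LMUL * VLEN / SEW, in elements."""
    return int(Fraction(lmul) * vlen / sew)


def vl_for(avl, vlen, sew, lmul=1):
    """VL = min(AVL, VLMAX), as defined in the text."""
    return min(avl, vlmax(vlen, sew, lmul))


if __name__ == "__main__":
    print(vlmax(256, 32, 1))                # 8
    print(vl_for(5, 256, 32, 1))            # 5
    print(vlmax(128, 32, Fraction(1, 2)))   # 2
    print(vl_for(100, 512, 64, 2))          # 16
```

The first two calls reproduce the numeric example (VLMAX = 8, VL = 5). The last two show a fractional LMUL shrinking VLMAX and a request larger than VLMAX being clamped.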

4.3 The `vsetvli` instruction

vsetvli  rd, rs1, vtypei
         │   │     └─ encoding of SEW, LMUL, tail/mask policy
         │   └─ AVL (elements remaining)
         └─ destination: VL is written here (use it as loop step)

One instruction sets the element width, LMUL, and tail/mask policy, and returns the runtime VL the hardware will use.
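To make the `vtypei` operand concrete, the sketch below packs the four settings into a number. It assumes the vtype field layout of RVV 1.0 (vlmul in bits 2:0, vsew in bits 5:3, vta in bit 6, vma in bit 7); the function name `encode_vtype` and its parameters are hypothetical, and the model covers only the encoding, not the instruction's effect on VL.

```python
# Field encodings assumed from the RVV 1.0 vtype layout.
VSEW = {8: 0b000, 16: 0b001, 32: 0b010, 64: 0b011}
VLMUL = {
    "m1": 0b000, "m2": 0b001, "m4": 0b010, "m8": 0b011,
    "mf8": 0b101, "mf4": 0b110, "mf2": 0b111,
}


def encode_vtype(sew, lmul="m1", tail_agnostic=True, mask_agnostic=True):
    """Pack SEW, LMUL and the tail/mask policy bits into a vtype value."""
    return (
        VLMUL[lmul]                    # bits 2:0
        | (VSEW[sew] << 3)             # bits 5:3
        | (int(tail_agnostic) << 6)    # bit 6: vta
        | (int(mask_agnostic) << 7)    # bit 7: vma
    )


if __name__ == "__main__":
    # e32, m1, ta, ma as used in the listings in this chapter
    print(hex(encode_vtype(32, "m1", True, True)))  # 0xd0
```

In assembly you write the mnemonic form (`e32, m1, ta, ma`) and the assembler produces this encoding, so the sketch is mainly useful for reading disassembly or CSR dumps.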

4.4 Tail & mask policies

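Policy	Behaviour for inactive lanes
Undisturbed	Old values preserved.
Agnostic	Hardware may either keep the old value or overwrite the lane with all 1s; do not rely on the value.

The tail policy (ta/tu) covers elements at index VL and above; the mask policy (ma/mu) covers elements below VL that the mask disables.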
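The effect of the two policies on a destination register can be modelled as follows. This is a sketch with hypothetical names; it models "agnostic" as writing all 1s, which is only one of the permitted behaviours, since real hardware may keep the old value instead.

```python
ALL_ONES_32 = 0xFFFFFFFF  # one allowed result for an agnostic 32-bit lane


def write_result(old, new, vl, mask, tail="agnostic", masking="agnostic"):
    """Return the destination register contents after a masked vector op.

    old  : destination contents before the instruction
    new  : values the operation computes for each lane
    vl   : number of active elements; lanes at index >= vl are tail lanes
    mask : per-lane enable bits; a 0 inside vl is a masked-off lane
    """
    out = []
    for i, (o, n) in enumerate(zip(old, new)):
        if i >= vl:
            # Tail lane: governed by the tail policy.
            out.append(o if tail == "undisturbed" else ALL_ONES_32)
        elif not mask[i]:
            # Masked-off lane inside vl: governed by the mask policy.
            out.append(o if masking == "undisturbed" else ALL_ONES_32)
        else:
            out.append(n)
    return out


if __name__ == "__main__":
    old = [9] * 8
    new = list(range(8))
    mask = [1, 0, 1, 1, 1, 1, 1, 1]
    print(write_result(old, new, 5, mask, "undisturbed", "undisturbed"))
    print(write_result(old, new, 5, mask, "agnostic", "agnostic"))
```

With VL = 5 and lane 1 masked off, the undisturbed run keeps 9 in lane 1 and lanes 5 to 7, while the agnostic run in this model leaves 0xFFFFFFFF there. Code that reads those lanes afterwards is only portable under the undisturbed policy.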

5. Strip-mining loop walkthrough

Add two arrays element-wise. a0 = count, a1/a2 = x/y bases, a3 = z base.

vvaddint32:
    vsetvli  t0, a0, e32, m1, ta, ma   # VL = min(a0, VLMAX); t0 = VL
    vle32.v  v0, (a1)                  # load VL elements of x
    sub      a0, a0, t0                # remaining -= VL
    slli     t0, t0, 2                 # VL words → bytes
    add      a1, a1, t0
    vle32.v  v1, (a2)                  # load VL elements of y
    add      a2, a2, t0
    vadd.vv  v2, v0, v1                # element-wise add
    vse32.v  v2, (a3)                  # store VL results
    add      a3, a3, t0
    bnez     a0, vvaddint32            # loop while remaining > 0
    ret

Iteration with AVL=6, VLMAX=4

Iter	vsetvli inputs	VL	elements processed	a0 after
1	AVL=6	4	x[0..3], y[0..3] → z[0..3]	2
2	AVL=2	2	x[4..5], y[4..5] → z[4..5]	0

No tail loop, no recompile if VLMAX changes — the hardware just handles fewer/more elements per pass.
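The walkthrough can be replayed in software. This is a minimal Python model of the `vvaddint32` loop, assuming VL = min(AVL, VLMAX) as in section 4.2; the function name `strip_mine_add` and the `trace` tuples are hypothetical, and a single `offset` stands in for the three pointer bumps on a1, a2 and a3.

```python
def strip_mine_add(x, y, vlmax):
    """Model of the strip-mined add: returns (z, trace).

    Each trace entry is (AVL passed to vsetvli, VL granted, a0 after the pass).
    """
    assert len(x) == len(y)
    if vlmax < 1:
        raise ValueError("vlmax must be at least 1")
    z = [0] * len(x)
    remaining = len(x)  # a0
    offset = 0          # stands in for the a1/a2/a3 pointer updates
    trace = []
    while True:
        avl = remaining
        vl = min(avl, vlmax)  # vsetvli t0, a0, e32, m1, ta, ma
        # vle32.v, vle32.v, vadd.vv, vse32.v on vl elements
        z[offset:offset + vl] = [
            a + b for a, b in zip(x[offset:offset + vl], y[offset:offset + vl])
        ]
        remaining -= vl       # sub a0, a0, t0
        offset += vl          # add a1/a2/a3, ..., t0 (in elements here, not bytes)
        trace.append((avl, vl, remaining))
        if remaining == 0:    # bnez a0, vvaddint32
            break
    return z, trace


if __name__ == "__main__":
    z, trace = strip_mine_add([1, 2, 3, 4, 5, 6], [10] * 6, vlmax=4)
    print(z)      # [11, 12, 13, 14, 15, 16]
    print(trace)  # [(6, 4, 2), (2, 2, 0)]
```

The trace matches the two rows of the table. Calling the same function with `vlmax=16` finishes in a single pass with identical results, which is the portability argument in miniature.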

6. Why RVV beats SSE/AVX

Feature	Traditional SIMD	RVV
Vector width	Fixed (opcode-encoded)	Variable (runtime)
Portability	Recompile per width	Same binary scales
Tail handling	Manual tail loop	Automatic via VL
Scalability	Limited (max width)	128 b → 2048 b without code change

One-line summary

RVV makes vector length a runtime variable. The same RVV binary runs on a 128-bit phone CPU and a 2048-bit HPC chip, automatically using whatever width the hardware has.

Source: Vector Processing/Vector_concise.pdf, programming model on pp. 9-13, full strip-mining example on pp. 17-24. See Vector_longnotes.pdf for a deep dive.

