Chapter — Vector processing & RVV

A different way to get more work done per cycle: do the same operation on many data items with a single instruction. The RISC-V Vector extension (RVV) is the modern, scalable take.

Contents

  1. Why vector — three levels of parallelism
  2. Scalar vs vector — the punchline
  3. Traditional SIMD (SSE/AVX) and its limits
  4. RVV programming model
    1. VLEN, SEW, LMUL
    2. AVL, VLMAX, VL
    3. The vsetvli instruction
    4. Tail & mask policies
  5. Strip-mining loop walkthrough
  6. Why RVV beats SSE/AVX for portability

1. Three levels of parallelism

Type                Example                         Granularity
Instruction-level   Pipelining, superscalar, OoO    Within one instruction stream
Thread-level        Multicore, SMT                  Multiple instruction streams
Data-level          SIMD / Vector / GPU             One instruction, many data elements
Data-level parallelism
Run the same operation on many data items at once. Examples: pixel-wise image filters, matrix add, dot products, AI tensor ops.

2. Scalar vs vector: the punchline

Scalar RISC-V (10-element vector add)

li     t0, 0
li     t1, 10
loop:
  bge    t0, t1, done
  slli   t2, t0, 2
  add    t3, a0, t2
  flw    fa0, 0(t3)
  add    t3, a1, t2
  flw    fa1, 0(t3)
  fadd.s fa2, fa0, fa1
  add    t3, a2, t2
  fsw    fa2, 0(t3)
  addi   t0, t0, 1
  j      loop
done:

10 iterations × 11 instructions each, plus 2 setup instructions and the final exit check = 113 dynamic instructions (~100).

RVV equivalent

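vsetvli  t0, a3, e32, m1, ta, ma   # a3 = element count (10); t0 = VL
vle32.v  v0, (a0)                  # load VL floats from the first input
vle32.v  v1, (a1)                  # load VL floats from the second input
vfadd.vv v2, v0, v1                # element-wise single-precision add
vse32.v  v2, (a2)                  # store VL results

5 instructions, one pass through all 10 elements, provided VLMAX ≥ 10 (for e32, m1 that means VLEN ≥ 512, since VLEN is a power of two). That is roughly 20× fewer dynamic instructions. On narrower hardware the same five instructions sit inside a strip-mining loop (section 5).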
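
The instruction counts above can be checked with a small model. This is only a sketch: the function names are made up for illustration, and the vector count assumes the single-pass case where all 10 elements fit in one vector register group.

```python
def scalar_dynamic_instructions(n: int) -> int:
    """Dynamic instruction count of the scalar loop for n elements."""
    setup = 2            # the two li instructions before the loop
    per_iteration = 11   # bge, slli, add, flw, add, flw, fadd.s, add, fsw, addi, j
    exit_check = 1       # the final bge that branches to done
    return setup + per_iteration * n + exit_check


def vector_single_pass_instructions() -> int:
    """The RVV snippet: vsetvli, two loads, one add, one store."""
    return 5


n = 10
scalar = scalar_dynamic_instructions(n)
vector = vector_single_pass_instructions()
print(scalar, vector, round(scalar / vector, 1))
```

The ratio grows with n for as long as the whole array fits in one pass, because the scalar count scales with n while the vector count stays constant.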

Analogy: Scalar is pairing socks one at a time. Vector is a sock-pairing machine that swallows 8 pairs and spits them out matched in one motion. Same total work, far less control overhead.

3. Traditional SIMD and its limits

Family                   Width            Vendor
MMX, SSE, SSE2…SSE4.2    64 → 128 bits    Intel/AMD
AVX, AVX2                256 bits         Intel/AMD
AVX-512                  512 bits         Intel/AMD
NEON                     128 bits         ARM (mobile)
SVE / SVE2               Scalable         ARM (server)

Pain point: Traditional SIMD bakes the vector width into the opcode. Code compiled for SSE (128 b) keeps issuing 128-bit operations even on a CPU with AVX-512 (512 b); using the wider registers means recompiling or adding a separate code path. You also have to write a separate "tail" loop for the leftover elements that don't fill a full vector.
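
The shape of fixed-width SIMD code, with its main loop and separate tail loop, can be sketched in plain Python. This is a model, not real intrinsics: `add_fixed_width` and its `width` parameter are hypothetical, with `width=4` standing in for the four 32-bit lanes of a 128-bit register.

```python
def add_fixed_width(x, y, width=4):
    """Element-wise add structured like fixed-width SIMD code.

    Returns (result, number_of_vector_steps, number_of_tail_elements).
    """
    n = len(x)
    z = [0] * n
    full = n - n % width  # elements covered by whole vectors

    # Main loop: each step models ONE vector instruction on `width` lanes.
    for i in range(0, full, width):
        z[i:i + width] = [a + b for a, b in zip(x[i:i + width], y[i:i + width])]

    # Tail loop: leftover elements are handled one at a time.
    for i in range(full, n):
        z[i] = x[i] + y[i]

    return z, full // width, n - full


z, vector_steps, tail_elements = add_fixed_width(list(range(10)), [1] * 10)
print(z, vector_steps, tail_elements)
```

Changing `width` here corresponds to recompiling for a different instruction set; the tail loop stays no matter which width is chosen.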

4. RVV programming model

4.1 VLEN, SEW, LMUL

Hardware fixed
VLEN: width of each vector register in bits (128, 256, 512, …), always a power of two. Set by the chip designer.
Software-set per pass
SEW (Selected Element Width): bits per element (8, 16, 32, or 64).
LMUL (Length Multiplier): 1, 2, 4, 8, or fractional ½, ¼, ⅛. Integer values group multiple v-regs into one logical larger register; fractional values use only part of a single register.
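
To see what grouping costs, the sketch below lists which register numbers can name a group for a given LMUL. It assumes the 32 architectural vector registers v0..v31 and the RVV 1.0 rule that a group's base register number is a multiple of LMUL; `register_groups` is an illustrative helper, not part of any toolchain.

```python
from fractions import Fraction


def register_groups(lmul):
    """Return the legal base register numbers for a given LMUL."""
    lmul = Fraction(lmul)
    if lmul <= 1:
        return list(range(32))           # every register stands alone
    return list(range(0, 32, int(lmul)))  # groups start at multiples of LMUL


for lmul in (Fraction(1, 2), 1, 2, 4, 8):
    print(lmul, len(register_groups(lmul)))
```

Larger LMUL gives longer vectors per instruction but fewer independent register groups: with LMUL = 8 only v0, v8, v16, and v24 remain.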

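4.2 AVL, VLMAX, VL

The runtime triad
AVL (Application Vector Length): how many elements the program still wants to process.
VLMAX: hardware capacity in elements per instruction = LMUL × VLEN / SEW.
VL: what the hardware will process this iteration. VL = min(AVL, VLMAX) in the common case; the spec lets an implementation pick a smaller VL when VLMAX < AVL < 2 × VLMAX, which is one more reason to always use the value vsetvli returns.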
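
The two formulas can be written as a pair of helper functions. This is a minimal sketch that models only the common VL = min(AVL, VLMAX) rule; the function names are illustrative, and `Fraction` is used so fractional LMUL values stay exact.

```python
from fractions import Fraction


def vlmax(vlen_bits, sew_bits, lmul=1):
    """VLMAX = LMUL × VLEN / SEW, in elements."""
    return int(Fraction(lmul) * vlen_bits / sew_bits)


def vl(avl, vlen_bits, sew_bits, lmul=1):
    """VL under the common min(AVL, VLMAX) rule."""
    return min(avl, vlmax(vlen_bits, sew_bits, lmul))


print(vlmax(256, 32, 1), vl(5, 256, 32, 1))
print(vlmax(128, 32, Fraction(1, 2)))
print(vlmax(128, 8, 8))
```

Note how the same 128-bit hardware yields anything from 2 to 128 elements per instruction depending on the SEW and LMUL that software selects.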

4.2.1 Numeric example

VLEN = 256 bits, SEW = 32 bits, LMUL = 1, AVL = 5

VLMAX = 1 · 256 / 32 = 8 elements
VL    = min(5, 8) = 5

[ e0  e1  e2  e3  e4  ·  ·  · ]   ← lanes 5..7 are the 3 tail lanes

4.3 The vsetvli instruction

vsetvli  rd, rs1, vtypei
         │   │     └─ encoding of SEW, LMUL, tail/mask policy
         │   └─ AVL (elements remaining)
         └─ destination: VL is written here (use it as loop step)

One instruction configures everything: element width, LMUL, tail/mask policy, and returns the runtime VL the hardware will use.
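
To make the vtypei field concrete, the sketch below packs SEW, LMUL, and the two policy bits into a number. The bit layout (vlmul in bits 2:0, vsew in bits 5:3, vta in bit 6, vma in bit 7) and the field codes follow my reading of the RVV 1.0 specification; treat them as an assumption to verify against the spec. `encode_vtype` is a hypothetical helper.

```python
# Field codes as I understand them from RVV 1.0 (assumption, verify before relying on them).
VSEW = {8: 0b000, 16: 0b001, 32: 0b010, 64: 0b011}
VLMUL = {"m1": 0b000, "m2": 0b001, "m4": 0b010, "m8": 0b011,
         "mf8": 0b101, "mf4": 0b110, "mf2": 0b111}


def encode_vtype(sew, lmul, tail_agnostic, mask_agnostic):
    """Pack the vsetvli settings into the low 8 bits of vtype."""
    return ((int(mask_agnostic) << 7)
            | (int(tail_agnostic) << 6)
            | (VSEW[sew] << 3)
            | VLMUL[lmul])


print(hex(encode_vtype(32, "m1", True, True)))   # e32, m1, ta, ma
print(hex(encode_vtype(8, "m1", False, False)))  # e8, m1, tu, mu
```

The assembler does this packing for you; the point is only that everything after rs1 in vsetvli collapses into one small immediate.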

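4.4 Tail & mask policies

The tail policy covers lanes at index ≥ VL; the mask policy covers lanes below VL that the mask register switches off.

Policy        Flag (tail / mask)   Behaviour for inactive lanes
Undisturbed   tu / mu              Old values preserved.
Agnostic      ta / ma              Hardware may either keep the old value or overwrite it with all 1s. Do not rely on the value.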
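
The effect of the two policies on a destination register can be modelled lane by lane. This sketch picks one permitted outcome for "agnostic", namely overwriting with all 1s; real hardware may instead leave the old value, so the agnostic results here are an assumption. `apply_policies` and its parameters are illustrative names.

```python
ALL_ONES_32 = 0xFFFFFFFF  # one permitted "agnostic" outcome for a 32-bit lane


def apply_policies(old, result, vl, mask, tail="ta", mask_policy="ma"):
    """Return the destination register contents after one masked vector op."""
    out = []
    for i, (old_value, new_value) in enumerate(zip(old, result)):
        if i >= vl:            # tail lane
            out.append(old_value if tail == "tu" else ALL_ONES_32)
        elif not mask[i]:      # masked-off lane inside the body
            out.append(old_value if mask_policy == "mu" else ALL_ONES_32)
        else:                  # active lane: takes the new result
            out.append(new_value)
    return out


old = [9] * 8
result = [10, 11, 12, 13, 14, 15, 16, 17]
mask = [1, 0, 1, 1, 1, 1, 1, 1]
undisturbed = apply_policies(old, result, 5, mask, tail="tu", mask_policy="mu")
agnostic = apply_policies(old, result, 5, mask, tail="ta", mask_policy="ma")
print(undisturbed)
print(agnostic)
```

Agnostic is the usual choice when the inactive lanes are never read again, since it frees the hardware from preserving them.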

5. Strip-mining loop walkthrough

Add two arrays element-wise. a0 = count, a1/a2 = x/y bases, a3 = z base.

vvaddint32:
    vsetvli  t0, a0, e32, m1, ta, ma   # VL = min(a0, VLMAX); t0 = VL
    vle32.v  v0, (a1)                  # load VL elements of x
    sub      a0, a0, t0                # remaining -= VL
    slli     t0, t0, 2                 # VL words → bytes
    add      a1, a1, t0
    vle32.v  v1, (a2)                  # load VL elements of y
    add      a2, a2, t0
    vadd.vv  v2, v0, v1                # element-wise add
    vse32.v  v2, (a3)                  # store VL results
    add      a3, a3, t0
    bnez     a0, vvaddint32            # loop while remaining > 0
    ret

Iteration with AVL=6, VLMAX=4

Iter   vsetvli input   VL   Elements processed             a0 after
1      AVL=6           4    x[0..3], y[0..3] → z[0..3]     2
2      AVL=2           2    x[4..5], y[4..5] → z[4..5]     0

No tail loop, no recompile if VLMAX changes — the hardware just handles fewer/more elements per pass.
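
The loop's control flow can be traced with a short simulation. This is a sketch under two assumptions: VL follows the plain min(AVL, VLMAX) rule, and `vlmax` is at least 1. The function name and the trace format are made up for illustration; the comments map each step back to the assembly.

```python
def vvadd_strip_mined(x, y, vlmax):
    """Simulate vvaddint32; returns (z, trace of (AVL in, VL, a0 after))."""
    n = len(x)
    z = [0] * n
    remaining, base, trace = n, 0, []
    while True:
        avl_in = remaining
        vl = min(remaining, vlmax)        # vsetvli t0, a0, e32, m1, ta, ma
        for i in range(base, base + vl):  # vle32.v, vle32.v, vadd.vv, vse32.v
            z[i] = x[i] + y[i]
        remaining -= vl                   # sub a0, a0, t0
        base += vl                        # pointer bumps, counted in elements
        trace.append((avl_in, vl, remaining))
        if remaining == 0:                # bnez a0, vvaddint32 falls through
            break
    return z, trace


z, trace = vvadd_strip_mined([1, 2, 3, 4, 5, 6], [10, 20, 30, 40, 50, 60], vlmax=4)
print(z)
print(trace)
```

Rerunning with a different `vlmax` changes only the trace, never the result, which is the property the walkthrough relies on.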

6. Why RVV beats SSE/AVX

Feature         Traditional SIMD          RVV
Vector width    Fixed (opcode-encoded)    Variable (runtime)
Portability     Recompile per width       Same binary scales
Tail handling   Manual tail loop          Automatic via VL
Scalability     Limited (max width)       128 b → 2048 b without code change
One-line summary
RVV makes vector length a runtime variable. Same RVV binary runs on a 128-bit phone CPU and a 2048-bit HPC chip — automatically using whatever width the hardware has.
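
How the same loop scales across implementations can be estimated by counting passes. This is a rough sketch: it assumes n > 0, integer LMUL, and the min(AVL, VLMAX) rule, and `passes` is an illustrative name.

```python
def passes(n, vlen_bits, sew_bits=32, lmul=1):
    """Number of strip-mining iterations needed for n elements."""
    vlmax = lmul * vlen_bits // sew_bits
    return -(-n // vlmax)  # ceiling division


for vlen in (128, 256, 512, 2048):
    print(vlen, passes(1000, vlen))
```

For 1000 32-bit elements the loop count falls from 250 passes at VLEN = 128 to 16 at VLEN = 2048, with no change to the code being run.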
Source: Vector Processing/Vector_concise.pdf (programming model on pp. 9-13, full strip-mining example on pp. 17-24). See Vector_longnotes.pdf for a deep dive.
