Chapter — Microarchitecture (from assembly to pipelining)
A bottom-up build: RISC-V base assembly → instruction formats → single-cycle datapath → performance math → multicycle → pipelining → hazards → pipelined performance.
1.RISC-V base — registers & instruction types
1.1Why we need a microarchitecture
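Architecture: the programmer-visible contract (instruction set, registers, memory model). Microarchitecture: the implementation of that contract in wires, gates, registers, and pipelines that actually run the instructions. It is invisible to the programmer.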
RISC-V is an architecture (an ISA). A single-cycle CPU, a pipelined CPU, and a fancy out-of-order CPU can all implement that same ISA — they're different microarchitectures with different performance/cost trade-offs.
1.2The RISC-V register file
RV32I gives you 32 general-purpose registers, each 32 bits wide, named x0–x31. ABI names overlay them:
| Register | ABI name | Role |
|---|---|---|
| x0 | zero | Hardwired to 0; writes are discarded |
| x1 | ra | Return address |
| x2 | sp | Stack pointer |
| x3 | gp | Global pointer |
| x4 | tp | Thread pointer |
| x5-x7, x28-x31 | t0-t6 | Temporaries (caller-saved) |
| x8-x9, x18-x27 | s0-s11 | Saved registers (callee-saved) |
| x10-x17 | a0-a7 | Arguments / return values |
1.3Categories of instructions
Every RISC-V instruction falls into one of four behavioural categories:
| Category | What it does | Examples |
|---|---|---|
| Compute | RegFile → ALU → RegFile | add, sub, and, or, slt, sll, addi |
| Memory | Move between RegFile and DMEM | lw, lb, lh, sw, sb, sh |
| Branch | Conditional jump if comparison holds | beq, bne, blt, bge |
| Jump | Unconditional jump (often with return-address save) | jal, jalr |
1.4Six instruction formats
The 32-bit instruction word is sliced differently depending on what fields it needs. The category drives the format:
| Format | Used by | Has rs1? | Has rs2? | Has rd? | Has immediate? |
|---|---|---|---|---|---|
| R-type | Reg-reg ALU (add, sub, …) | yes | yes | yes | no |
| I-type | addi, andi, lw, jalr | yes | no | yes | 12-bit |
| S-type | sw, sb, sh (stores) | yes (base) | yes (data) | no | 12-bit (split) |
| B-type | beq, bne, blt, bge | yes | yes | no | 12-bit (split) |
| U-type | lui, auipc | no | no | yes | 20-bit (upper) |
| J-type | jal | no | no | yes | 20-bit (offset) |
1.4.1Bit layouts
rs1 and rs2 occupy the same bit positions in every format that uses them (rs1 in bits 19:15, rs2 in bits 24:20). That regularity lets the hardware read rs1/rs2 in parallel with working out what type of instruction it is. The price: the immediate gets chopped into pieces.
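A minimal Python sketch of what the fixed field positions buy you, and of how the chopped-up immediates are reassembled. The bit ranges follow the standard RV32I encoding; the helper names (`bits`, `sign_extend`, `fields`, `imm_i`, `imm_s`, `imm_b`) and the four sample words are illustrative and do not come from the source.

```python
def bits(word, hi, lo):
    """Return word[hi:lo] (inclusive) as an unsigned integer."""
    return (word >> lo) & ((1 << (hi - lo + 1)) - 1)

def sign_extend(value, width):
    """Interpret the low `width` bits of value as a two's-complement number."""
    sign = 1 << (width - 1)
    return (value ^ sign) - sign

def fields(word):
    """Slice the fixed-position fields; no need to know the format first."""
    return {
        "opcode": bits(word, 6, 0),
        "rd": bits(word, 11, 7),
        "funct3": bits(word, 14, 12),
        "rs1": bits(word, 19, 15),
        "rs2": bits(word, 24, 20),
        "funct7": bits(word, 31, 25),
    }

def imm_i(word):
    """I-type: one contiguous 12-bit immediate in bits 31:20."""
    return sign_extend(bits(word, 31, 20), 12)

def imm_s(word):
    """S-type: imm[11:5] in bits 31:25, imm[4:0] in bits 11:7."""
    return sign_extend(bits(word, 31, 25) << 5 | bits(word, 11, 7), 12)

def imm_b(word):
    """B-type: imm[12|10:5] in bits 31:25, imm[4:1|11] in bits 11:7, imm[0] = 0."""
    raw = (bits(word, 31, 31) << 12 | bits(word, 7, 7) << 11
           | bits(word, 30, 25) << 5 | bits(word, 11, 8) << 1)
    return sign_extend(raw, 13)

ADD_X3_X1_X2 = 0x002081B3  # add x3, x1, x2
LW_X3_12_X1 = 0x00C0A183   # lw  x3, 12(x1)
SW_X3_8_X1 = 0x0030A423    # sw  x3, 8(x1)
BEQ_BACK_4 = 0xFE208EE3    # beq x1, x2, -4

print(fields(ADD_X3_X1_X2))
print(imm_i(LW_X3_12_X1), imm_s(SW_X3_8_X1), imm_b(BEQ_BACK_4))
```

Note that `fields` never looks at the opcode: the register numbers can be sent to the RegFile immediately, while only the immediate helpers depend on the format.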
2.The single-cycle datapath
2.1The five universal tasks
Regardless of which instruction is executing, the hardware must (in order):
- Fetch the instruction from instruction memory (IMEM) at address PC.
- Decode it: pull the opcode, read up to 2 source registers, sign-extend the immediate.
- Execute in the ALU: add, sub, compare, or compute an address.
- Memory access (only for lw/sw): read or write data memory (DMEM).
- Writeback: store the result into the destination register, and update PC.
2.2Datapath components
| Block | Role |
|---|---|
| PC | Holds the address of the current instruction. |
| IMEM | Reads the 32-bit instruction at IMEM[PC]. |
| RegFile | Reads rs1 and rs2 in parallel. Writes rd on clock edge. |
| Imm Gen | Sign-extends/reassembles the immediate from the instruction bits. |
| ALU | Performs the arithmetic/logic operation (or address calc for lw/sw). |
| DMEM | Reads (lw) or writes (sw) data at the address the ALU computed. |
| Control | Decodes the opcode → sets all the mux selects (not shown). |
2.3Walkthrough — R-type (add x3, x1, x2)
- PC drives IMEM → instruction fetched.
- RegFile reads x1 and x2 in parallel.
- ALU computes x1 + x2. DMEM is bypassed.
- The ALU result is muxed into the RegFile's write port; on the next clock edge x3 ← result.
- PC ← PC + 4.
2.4Walkthrough — lw x3, 12(x1)
- Fetch the instruction.
- RegFile reads x1; Imm Gen extracts 12.
- ALU computes x1 + 12 = effective address.
- DMEM reads the word at that address.
- DMEM result → RegFile write port; x3 ← memory value on clock edge.
lw is the only instruction that uses all five blocks in series: PC → IMEM → RegFile → ALU → DMEM → RegFile. That makes it the longest path through the datapath — keep that in mind for §3.
2.5Walkthrough — beq x1, x2, LABEL
- Fetch + decode.
- RegFile reads x1, x2. Imm Gen builds the branch offset.
- ALU computes x1 − x2; zero-detect bit determines "taken".
- A second adder (or the same ALU) computes PC + offset.
- PC ← taken ? (PC + offset) : (PC + 4).
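The three walkthroughs can be condensed into a behavioural sketch in Python. It models what one clock of the single-cycle datapath computes, not its timing. The `step` function, the dictionary-based `state`, and the pre-decoded instruction dictionaries are all assumptions made for illustration; only `add`, `lw`, and `beq` are handled.

```python
MASK = 0xFFFFFFFF  # keep values to 32 bits

def step(state, instr):
    """Run one instruction start to finish, as a single-cycle datapath would."""
    regs, dmem, pc = state["regs"], state["dmem"], state["pc"]
    op = instr["op"]
    a = regs[instr.get("rs1", 0)]   # RegFile read port 1
    b = regs[instr.get("rs2", 0)]   # RegFile read port 2
    imm = instr.get("imm", 0)       # Imm Gen output
    next_pc = pc + 4
    result = None                   # value headed for the RegFile write port, if any

    if op == "add":
        result = (a + b) & MASK                   # ALU result, DMEM bypassed
    elif op == "lw":
        result = dmem.get((a + imm) & MASK, 0)    # ALU forms the address, DMEM reads
    elif op == "beq":
        if ((a - b) & MASK) == 0:                 # zero-detect on rs1 - rs2
            next_pc = (pc + imm) & MASK
    else:
        raise ValueError(f"unsupported op: {op}")

    if result is not None and instr["rd"] != 0:   # x0 stays hardwired to 0
        regs[instr["rd"]] = result
    state["pc"] = next_pc
    return state

state = {"pc": 0, "regs": [0] * 32, "dmem": {0x10C: 42}}
state["regs"][1], state["regs"][2] = 0x100, 5

step(state, {"op": "add", "rd": 3, "rs1": 1, "rs2": 2})
after_add = (state["pc"], state["regs"][3])
step(state, {"op": "lw", "rd": 3, "rs1": 1, "imm": 12})
after_lw = (state["pc"], state["regs"][3])
step(state, {"op": "beq", "rs1": 1, "rs2": 2, "imm": 16})  # x1 != x2: not taken
step(state, {"op": "beq", "rs1": 1, "rs2": 1, "imm": 16})  # x1 == x1: taken
print(after_add, after_lw, state["pc"])
```

Every instruction passes through the same function once, which mirrors the single-cycle idea: one pass through all the blocks, with the control logic choosing which results are kept.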
3.Single-cycle performance
3.1Performance definitions
Tc (clock period): seconds per clock cycle. f = 1/Tc: clock frequency.
CPI: Cycles Per Instruction — average number of clocks each instruction takes.
IPC: Instructions Per Cycle = 1 / CPI. The reciprocal — equally common.
Execution Time: #Inst × CPI × Tc, the only metric that matters end-to-end.
To go faster, attack one of the three factors. Each microarchitecture style optimises a different one:
| Style | CPI | Tc | Notes |
|---|---|---|---|
| Single-cycle | 1 | very long | One slow clock per instruction |
| Multicycle | 3-5 | short | Fast clock but many cycles per inst |
| Pipelined | ≈ 1 | short | Throughput of the fast clock + CPI of single-cycle |
3.2The critical path
For single-cycle RISC-V, the critical path runs through lw (uses every block in series):
T_c ≥ t_IMEM + t_RFread + t_ALU + t_DMEM + t_RFwrite
That's roughly the sum of five block delays. The whole machine is paying for the worst case every cycle — even for an add that doesn't touch DMEM. This is why single-cycle is slow.
3.3Worked example — single-cycle (Sarah Harris Ch. 7)
Given delays: IMEM = 250 ps, RegFile read = 150 ps, ALU = 200 ps, DMEM = 250 ps, RegFile write (negligible).
T_c ≥ 250 + 150 + 200 + 250 = 850 ps with these delays (textbook values typically fall between 750 and 1000 ps; the calculation below uses 750 ps)
CPI = 1
For 100 billion instructions:
Time = 10^11 × 1 × 750 ps = 75 seconds
This 75 s is the benchmark we'll beat with pipelining (43 s, see §7).
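The arithmetic of this section as a short Python sketch. The delays are the ones given above; the function names `critical_path_ps` and `execution_time_s` are made up for illustration. Both the 850 ps bound and the 750 ps figure used in the text are evaluated so the difference is visible.

```python
DELAYS_PS = {"IMEM": 250, "RegFile read": 150, "ALU": 200, "DMEM": 250}
LW_PATH = ["IMEM", "RegFile read", "ALU", "DMEM"]  # RegFile write treated as negligible

def critical_path_ps(path, delays):
    """Sum the block delays along one path through the datapath."""
    return sum(delays[block] for block in path)

def execution_time_s(instructions, cpi, tc_ps):
    """Execution time = #Inst x CPI x Tc, with Tc given in picoseconds."""
    return instructions * cpi * tc_ps * 1e-12

tc_bound_ps = critical_path_ps(LW_PATH, DELAYS_PS)
print(f"lw path: {tc_bound_ps} ps")
print(f"at 750 ps: {execution_time_s(10**11, 1, 750):.0f} s")
print(f"at {tc_bound_ps} ps: {execution_time_s(10**11, 1, tc_bound_ps):.0f} s")
```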
4.From single-cycle to multicycle
4.1The motivation — why split the cycle
In single-cycle, an add uses IMEM + RF + ALU + RF write (~600 ps) but the clock is sized for lw (~850 ps). The add wastes ~250 ps of every clock doing nothing.
4.2The multicycle datapath
Same blocks as single-cycle (PC, IMEM, RegFile, ALU, DMEM), but now only one block is active per clock. Between clocks, internal registers hold the intermediate values (the fetched instruction, the register operands, the ALU output, the memory data, etc.).
| State | What happens | Active block |
|---|---|---|
| S1: Fetch | IR ← IMEM[PC]; PC ← PC+4 | IMEM |
| S2: Decode | Read rs1, rs2; sign-extend imm | RegFile |
| S3: Execute | ALU op or address compute | ALU |
| S4: Memory | DMEM read/write (lw/sw only) | DMEM |
| S5: Writeback | RegFile ← result | RegFile |
4.3Per-instruction state count
Not every instruction needs all 5 states — that's the whole point. The control unit is a small FSM that walks through only the states each instruction needs:
| Instruction | States used | Cycles |
|---|---|---|
| R-type (add, etc.) | Fetch · Decode · Execute · Writeback | 4 |
| lw | Fetch · Decode · Execute · Memory · Writeback | 5 |
| sw | Fetch · Decode · Execute · Memory | 4 |
| beq | Fetch · Decode · Execute (compare + PC update) | 3 |
| jal | Fetch · Decode · Execute · Writeback | 4 |
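The control FSM behind this table can be sketched as a next-state function in Python. The state names match the table; the function names and the string labels for instruction kinds are illustrative, and a real controller would also emit control signals in each state.

```python
def next_state(state, kind):
    """Next FSM state for an instruction of the given kind."""
    if state == "Fetch":
        return "Decode"
    if state == "Decode":
        return "Execute"
    if state == "Execute":
        if kind in ("lw", "sw"):
            return "Memory"
        if kind in ("R-type", "jal"):
            return "Writeback"
        return "Fetch"            # beq finishes in Execute
    if state == "Memory":
        return "Writeback" if kind == "lw" else "Fetch"   # sw finishes here
    return "Fetch"                # Writeback always ends the instruction

def cycles(kind):
    """Count the states visited before the FSM returns to Fetch."""
    state, count = "Fetch", 1
    while True:
        state = next_state(state, kind)
        if state == "Fetch":
            return count
        count += 1

for kind in ("R-type", "lw", "sw", "beq", "jal"):
    print(kind, cycles(kind))
```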
4.4Performance — CPI rises, Tc falls
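Each state only has to fit one block, so Tc is set by the slowest single block instead of the sum: with the §3.3 delays, Tc ≈ max(250, 150, 200, 250) = 250 ps. With perfectly balanced blocks that would be one fifth of single-cycle's Tc; here it is 250 ps against 850 ps, roughly 0.3×.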
4.4.1Worked example
Using the SPECINT2000 mix from §7.3: 25% lw (5 states), 10% sw (4), 13% branch (3), 52% R-type (4).
CPI_multi = 0.25·5 + 0.10·4 + 0.13·3 + 0.52·4
= 1.25 + 0.40 + 0.39 + 2.08
= 4.12
With Tc = 250 ps and 10^11 instructions:
Time_multi = 10^11 × 4.12 × 250 ps = 103 seconds
(Sarah Harris's textbook gets 155 s using slightly different delays — same shape.)
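The same calculation as a Python sketch, with one extra step: the clock period below which multicycle would actually beat an 850 ps single-cycle design. The mix and state counts are the ones above; the names `MIX`, `CYCLES`, and `average_cpi` are illustrative.

```python
MIX = {"lw": 0.25, "sw": 0.10, "branch": 0.13, "R-type": 0.52}
CYCLES = {"lw": 5, "sw": 4, "branch": 3, "R-type": 4}

def average_cpi(mix, cycles):
    """Weighted average: sum of (share of program) x (cycles for that type)."""
    return sum(mix[kind] * cycles[kind] for kind in mix)

cpi_multi = average_cpi(MIX, CYCLES)
time_multi_s = 10**11 * cpi_multi * 250e-12

# Multicycle wins only if CPI x Tc_multi < 1 x Tc_single.
break_even_tc_ps = 850 / cpi_multi

print(f"CPI = {cpi_multi:.2f}, time = {time_multi_s:.0f} s")
print(f"break-even Tc = {break_even_tc_ps:.0f} ps")
```

With these numbers the break-even clock is a little over 200 ps, so a 250 ps multicycle clock is not short enough to compensate for a CPI of about 4.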
4.5Did we win or lose?
| | Single-cycle | Multicycle |
|---|---|---|
| CPI | 1 | ≈ 4 |
| Tc | ~850 ps | ~250 ps |
| Time per inst | 850 ps | ≈ 1000 ps |
Verdict: Multicycle is sometimes slower overall. The savings on simple instructions don't outweigh the FSM overhead.
5.The pipelined datapath
5.1The five stages
IF (fetch) → ID (decode) → EX (execute) → MEM (memory access) → WB (writeback). These map 1-to-1 onto the five universal tasks from §2.1.
5.2Pipeline registers
Between every pair of stages we drop a clocked register: IF/ID, ID/EX, EX/MEM, MEM/WB. They carry forward whatever state the next stage needs (rs values, immediate, control signals, the destination register number, the ALU result, etc.).
5.3Steady-state throughput & speedup
- Startup fill: first instruction finishes at cycle 5 (one full trip through the pipe).
- Steady state: from cycle 5 onward, one instruction completes every cycle ⇒ ideal CPI = 1.
- Tc: dictated by the slowest single stage, not the sum. With perfectly balanced stages that is up to 5× shorter than single-cycle; §7.4 uses 350 ps against single-cycle's 750 ps.
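A small Python sketch of the fill-then-stream behaviour for an ideal pipeline with no stalls. The helper names (`stage_at`, `total_cycles`, `diagram`) are illustrative; cycles are numbered from 1 and instructions from 0.

```python
STAGES = ["IF", "ID", "EX", "MEM", "WB"]

def stage_at(instr_index, cycle):
    """Stage occupied by instruction instr_index in the given cycle, or None."""
    s = cycle - 1 - instr_index
    return STAGES[s] if 0 <= s < len(STAGES) else None

def total_cycles(n):
    """Cycles to finish n instructions: fill the pipe once, then one per cycle."""
    return len(STAGES) + n - 1 if n > 0 else 0

def diagram(n):
    """Text pipeline diagram: one row per instruction, one column per cycle."""
    lines = []
    for i in range(n):
        cells = [(stage_at(i, c) or ".").ljust(4) for c in range(1, total_cycles(n) + 1)]
        lines.append(f"I{i + 1}: " + "".join(cells).rstrip())
    return "\n".join(lines)

print(diagram(4))
for n in (1, 10, 1000):
    print(n, total_cycles(n), round(total_cycles(n) / n, 3))
```

The ratio in the last column is the effective CPI: it starts at 5 for a single instruction and approaches 1 as the fill cost is amortised over a longer run.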
6.Pipeline hazards
6.1Data hazards — RAW, WAR, WAW
| Type | Pattern | True dependency? | Visible in 5-stage in-order? |
|---|---|---|---|
| RAW (Read-After-Write) | I2 reads what I1 just wrote | yes — real | yes — common |
| WAR (anti-dependency) | I2 writes a reg I1 still reads | no — naming | no (in-order) |
| WAW (output dep.) | I2 writes same reg as I1 | no — naming | no (in-order) |
WAR and WAW only matter in out-of-order pipelines (see Advanced µArch). RAW is the one we deal with in the 5-stage in-order pipe.
6.2Forwarding (bypassing)
Most RAW hazards can be solved without stalling. The ALU result is already computed at the end of EX — feed it directly to the next instruction's ALU input, instead of waiting for it to land in the RegFile two cycles later.
- EX/MEM → EX forward: previous instruction's ALU result → current instruction's ALU input.
- MEM/WB → EX forward: the result of the instruction two ahead in the pipe (two earlier in program order) → current instruction's ALU input.
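The selection logic for one ALU operand can be sketched as follows in Python. The dictionary fields `reg_write` and `rd` and the function name `forward_select` are assumptions made for illustration; the sketch also assumes that load-use cases have already been handled by a stall, so anything sitting in EX/MEM with a matching rd really is an ALU result.

```python
def forward_select(src, ex_mem, mem_wb):
    """Choose where one ALU operand in EX comes from.

    src     -- source register number (rs1 or rs2) of the instruction in EX
    ex_mem  -- pipeline register contents of the instruction one ahead
    mem_wb  -- pipeline register contents of the instruction two ahead
    """
    # x0 is never forwarded: it always reads as 0.
    if src != 0 and ex_mem["reg_write"] and ex_mem["rd"] == src:
        return "EX/MEM"    # newest value wins
    if src != 0 and mem_wb["reg_write"] and mem_wb["rd"] == src:
        return "MEM/WB"
    return "RegFile"

# add x3, x1, x2 is one ahead; sub x4, x3, x5 is in EX.
one_ahead = {"reg_write": True, "rd": 3}
two_ahead = {"reg_write": True, "rd": 7}
print(forward_select(3, one_ahead, two_ahead))  # operand x3
print(forward_select(5, one_ahead, two_ahead))  # operand x5
```

Checking EX/MEM before MEM/WB matters when both older instructions wrote the same register: the more recent write is the one the program expects to see.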
6.3The load-use hazard (one unavoidable stall)
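A load's data is not available until the end of MEM, one stage later than an ALU result. If the very next instruction needs that value in EX, forwarding cannot deliver it in time, so the pipeline inserts one bubble; after the stall, the MEM/WB → EX forward supplies the data.
e.g. 40% of loads stall ⇒ CPI_lw = 0.6·1 + 0.4·2 = 1.4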
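A minimal Python sketch of the check a hazard unit might make for the load-use case, plus the per-load CPI it leads to. The field names (`mem_read`, `rd`, `rs1`, `rs2`) and function names are hypothetical; an instruction with no second source register is assumed to carry `rs2 = None`.

```python
def load_use_stall(id_ex, if_id):
    """True when the instruction in ID must wait one cycle for a load in EX."""
    return bool(
        id_ex["mem_read"]                             # instruction in EX is a load
        and id_ex["rd"] != 0                          # loads into x0 produce nothing
        and id_ex["rd"] in (if_id["rs1"], if_id["rs2"])
    )

def load_cpi(p_stall):
    """Average CPI of a load: 1 cycle normally, 2 when it causes a bubble."""
    return (1 - p_stall) * 1 + p_stall * 2

# lw x3, 12(x1) in EX, add x4, x3, x5 in ID: the add needs x3 one cycle too early.
lw_in_ex = {"mem_read": True, "rd": 3}
print(load_use_stall(lw_in_ex, {"rs1": 3, "rs2": 5}))
print(load_use_stall(lw_in_ex, {"rs1": 6, "rs2": 7}))
print(f"{load_cpi(0.4):.1f}")
```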
6.4Control hazards — branch flushes
A branch resolves in EX. By that point, two instructions are already in IF and ID. If the branch is taken (or mispredicted), both must be flushed (turned into NOPs) and the correct path re-fetched.
Misprediction cost = 1 (the branch itself) + 2 flushed instructions = 3 cycles, so CPI_branch = P(correct)·1 + P(mispredict)·3.
Branch prediction (covered in Advanced µArch) is how we get P(correct) close to 100%.
6.5Structural hazards
Two stages want the same hardware in the same cycle. The classic example: IF and MEM both touching memory.
- Fix: split caches, with a separate I-cache (for IF) and D-cache (for MEM). The datapath in this chapter already assumes this: IMEM and DMEM are separate blocks.
- Other fix: duplicate the unit, or stall one stage.
7.Pipelined performance math
7.1Per-type CPI
| Type | Best-case CPI | Stall scenario | Stall cost |
|---|---|---|---|
| R-type, store | 1 | — | — |
| Load | 1 | Load-use | +1 (bubble) |
| Branch | 1 | Misprediction | +2 (flushes) |
| Jump (jal) | 1 | Always: target known too late | +2 typically |
7.2Weighted-average CPI
The recipe:
- For each instruction type, find its share of the program (e.g. 25% loads).
- For each type, compute its CPI: P(no stall)·1 + P(stall)·(1 + stallCost), where stallCost is the extra cycles from §7.1.
- Multiply each type's CPI by its share and sum the products. Done.
7.3SPECINT2000 worked example
Given: 25% loads, 10% stores, 13% branches, 52% R-type. 40% of loads cause a load-use stall. 50% of branches are mispredicted.
| Type | Freq | CPI calculation | CPI |
|---|---|---|---|
| Loads | 0.25 | 0.6·1 + 0.4·2 = 1.4 | 1.4 |
| Stores | 0.10 | 1 | 1.0 |
| Branches | 0.13 | 0.5·1 + 0.5·3 = 2.0 | 2.0 |
| R-type | 0.52 | 1 | 1.0 |
CPI_avg = 0.25·1.4 + 0.10·1.0 + 0.13·2.0 + 0.52·1.0
= 0.35 + 0.10 + 0.26 + 0.52
= 1.23
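The recipe and the table above, as a Python sketch. Each entry holds (share of program, probability of a stall, extra cycles per stall) using the values given in this section; the names `PROFILE`, `type_cpi`, and `average_cpi` are illustrative. The last line changes the misprediction rate to 10%, a made-up figure, purely to show how sensitive the result is to branch prediction.

```python
def type_cpi(p_stall, penalty):
    """P(no stall) x 1 + P(stall) x (1 + penalty), simplified."""
    return 1 + p_stall * penalty

def average_cpi(profile):
    """Sum of share x per-type CPI over all instruction types."""
    return sum(share * type_cpi(p, penalty) for share, p, penalty in profile.values())

PROFILE = {
    "load":   (0.25, 0.4, 1),  # load-use bubble
    "store":  (0.10, 0.0, 0),
    "branch": (0.13, 0.5, 2),  # two flushed instructions
    "R-type": (0.52, 0.0, 0),
}

cpi_pipelined = average_cpi(PROFILE)
print(f"{cpi_pipelined:.2f}")

# Hypothetical: a better predictor that mispredicts only 10% of branches.
better = dict(PROFILE, branch=(0.13, 0.1, 2))
print(f"{average_cpi(better):.3f}")
```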
7.4Speedup recap
100 billion instructions, Tc = 350 ps:
Time_pipelined = 10^11 × 1.23 × 350 ps = 43 seconds
| Design | Time | Speedup |
|---|---|---|
| Single-cycle | 75 s | 1× |
| Multicycle | 155 s (textbook delays; 103 s with the §4.4.1 delays) | 0.5× |
| Pipelined | 43 s | 1.7× |
- RISC-V code lives in 6 instruction formats (R, I, S, B, U, J).
- Every instruction does five tasks: Fetch, Decode, Execute, Memory, Writeback.
- Single-cycle = all five in one big clock (CPI=1, Tc huge).
- Multicycle = one task per clock (Tc small, CPI≈4).
- Pipelined = all five tasks running in parallel, each on a different instruction (CPI≈1, Tc small).
- Hazards push CPI above 1: load-use (+1), branch mispredict (+2). Forwarding kills most RAW penalties.
- Execution time = #Inst × CPI × Tc. Optimise any factor to go faster.
RiscV_Sarah_Harris.pdf Ch. 7. Slide deck mirrors at Microarchitecture/Ch7_MicroArch.pdf and Branch Prediction.pptx.pdf (examples 7.4–7.9).
Next: Advanced µArch → (branch prediction, superscalar, OoO + renaming)