Chapter — Microarchitecture (from assembly to pipelining)

A bottom-up build: RISC-V base assembly → instruction formats → single-cycle datapath → performance math → multicycle → pipelining → hazards → pipelined performance.

RISC-V base — registers & instruction types
The single-cycle datapath
Single-cycle performance
Multicycle — a brief detour
The pipelined datapath
Pipeline hazards
Pipelined performance math

1.RISC-V base — registers & instruction types

Before we can talk about how the hardware executes an instruction, we need to know what an instruction is. This section is the foundation everything else hangs on.

1.1Why we need a microarchitecture

Definition

Architecture (ISA): the contract — what instructions exist, what they do, what registers there are. Visible to the programmer.
Microarchitecture: the implementation — wires, gates, registers, pipelines that actually run those instructions. Invisible to the programmer.

RISC-V is an architecture (an ISA). A single-cycle CPU, a pipelined CPU, and a fancy out-of-order CPU can all implement that same ISA — they're different microarchitectures with different performance/cost trade-offs.

Q26 trap: Verilog describes the microarchitecture, not the architecture. The ISA is specified in a manual; assembly is what compilers emit; Verilog wires up the actual hardware.

1.2The RISC-V register file

RV32I gives you 32 general-purpose registers, each 32 bits wide, named x0–x31. ABI names overlay them:

Register	ABI name	Role
`x0`	`zero`	Hardwired to 0 — writes are discarded
`x1`	`ra`	Return address
`x2`	`sp`	Stack pointer
`x3`	`gp`	Global pointer
`x5-x7, x28-x31`	`t0-t6`	Temporaries (caller-saved)
`x8-x9, x18-x27`	`s0-s11`	Saved registers (callee-saved)
`x10-x17`	`a0-a7`	Arguments / return values

Key takeaway

Up to two register reads and one register write happen every cycle in a single-cycle CPU (an R-type instruction uses all three). That's why the register file in our diagrams shows two read ports + one write port.
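As a minimal sketch of that behaviour (the class and method names are illustrative assumptions, not taken from the course material), a register file with two read ports, one write port, and a hardwired x0 could be modelled like this:

```python
class RegFile:
    """32 x 32-bit registers, two read ports, one write port; x0 always reads 0."""

    def __init__(self):
        self.regs = [0] * 32

    def read(self, rs1, rs2):
        # Two read ports: both source registers are read in the same cycle.
        return self.regs[rs1], self.regs[rs2]

    def write(self, rd, value, write_enable=True):
        # One write port; a write to x0 is silently discarded.
        if write_enable and rd != 0:
            self.regs[rd] = value & 0xFFFFFFFF  # keep results 32 bits wide


rf = RegFile()
rf.write(1, 5)
rf.write(2, 7)
rf.write(0, 99)          # discarded: x0 stays 0
a, b = rf.read(1, 2)     # the two reads of `add x3, x1, x2`
rf.write(3, a + b)       # the one write of `add x3, x1, x2`
```

The mask in `write` is a modelling choice: it keeps Python's unbounded integers within 32 bits.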

1.3Categories of instructions

Every RISC-V instruction falls into one of four behavioural categories:

Category	What it does	Examples
Compute	RegFile → ALU → RegFile	`add`, `sub`, `and`, `or`, `slt`, `sll`, `addi`
Memory	Move between RegFile and DMEM	`lw`, `lb`, `lh`, `sw`, `sb`, `sh`
Branch	Conditional jump if comparison holds	`beq`, `bne`, `blt`, `bge`
Jump	Unconditional jump (often with return-address save)	`jal`, `jalr`

1.4Six instruction formats

The 32-bit instruction word is sliced differently depending on what fields it needs. The category drives the format:

Format	Used by	Has rs1?	Has rs2?	Has rd?	Has immediate?
R-type	Reg-reg ALU (add, sub, …)	yes	yes	yes	no
I-type	addi, andi, lw, jalr	yes	no	yes	12-bit
S-type	sw, sb, sh (stores)	yes (base)	yes (data)	no	12-bit (split)
B-type	beq, bne, blt, bge	yes	yes	no	12 bits encoded (split); 13-bit offset, bit 0 always 0
U-type	lui, auipc	no	no	yes	20-bit (upper)
J-type	jal	no	no	yes	20 bits encoded (split); 21-bit offset, bit 0 always 0

1.4.1Bit layouts

Format	Fields, MSB → LSB (field name, then the instruction bits it occupies)
R-type	funct7 [31:25] · rs2 [24:20] · rs1 [19:15] · funct3 [14:12] · rd [11:7] · opcode [6:0]
I-type	imm[11:0] [31:20] · rs1 [19:15] · funct3 [14:12] · rd [11:7] · opcode [6:0]
S-type	imm[11:5] [31:25] · rs2 [24:20] · rs1 [19:15] · funct3 [14:12] · imm[4:0] [11:7] · opcode [6:0]
B-type	imm[12|10:5] [31:25] · rs2 [24:20] · rs1 [19:15] · funct3 [14:12] · imm[4:1|11] [11:7] · opcode [6:0]
U-type	imm[31:12] [31:12] · rd [11:7] · opcode [6:0]
J-type	imm[20|10:1|11|19:12] [31:12] · rd [11:7] · opcode [6:0]

Why immediates are split in S and B: rs1 and rs2 (and rd, when present) stay in the same bit positions in every format that uses them. That regularity lets the hardware read rs1/rs2 in parallel with deciding what type of instruction it is. The price: the immediate gets chopped into pieces that fit around the register fields.

Key takeaway

Six formats, but the register fields always sit in the same bit positions. The immediate generator (Imm Gen) is the block that knows how to reassemble whichever immediate shape the current instruction uses.
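To make the bit layouts concrete, here is a small sketch of an immediate generator that follows the field positions listed above. The function names (`bits`, `sign_extend`, `decode_regs`, `imm_gen`) and the idea of passing the format as a letter are assumptions made for illustration; real hardware derives the format from the opcode.

```python
def bits(word, hi, lo):
    """Return instruction bits [hi:lo] as an unsigned integer."""
    return (word >> lo) & ((1 << (hi - lo + 1)) - 1)


def sign_extend(value, width):
    """Interpret the low `width` bits of value as a two's-complement number."""
    sign = 1 << (width - 1)
    return (value ^ sign) - sign


def decode_regs(word):
    """Register fields sit at the same positions in every format that has them."""
    return {"rd": bits(word, 11, 7),
            "rs1": bits(word, 19, 15),
            "rs2": bits(word, 24, 20)}


def imm_gen(word, fmt):
    """Reassemble the immediate for fmt in {'I', 'S', 'B', 'U', 'J'}."""
    if fmt == "I":   # imm[11:0] in [31:20]
        return sign_extend(bits(word, 31, 20), 12)
    if fmt == "S":   # imm[11:5] in [31:25], imm[4:0] in [11:7]
        return sign_extend(bits(word, 31, 25) << 5 | bits(word, 11, 7), 12)
    if fmt == "B":   # imm[12|10:5] in [31:25], imm[4:1|11] in [11:7], bit 0 = 0
        raw = (bits(word, 31, 31) << 12 | bits(word, 7, 7) << 11
               | bits(word, 30, 25) << 5 | bits(word, 11, 8) << 1)
        return sign_extend(raw, 13)
    if fmt == "U":   # imm[31:12] in [31:12], low 12 bits are 0
        return sign_extend(bits(word, 31, 12) << 12, 32)
    if fmt == "J":   # imm[20|10:1|11|19:12] in [31:12], bit 0 = 0
        raw = (bits(word, 31, 31) << 20 | bits(word, 19, 12) << 12
               | bits(word, 20, 20) << 11 | bits(word, 30, 21) << 1)
        return sign_extend(raw, 21)
    raise ValueError(f"format {fmt!r} has no immediate")


# Hand-assembled encodings used as examples:
ADDI_X1_X0_M1 = 0xFFF00093   # addi x1, x0, -1
SW_X2_8_X1 = 0x0020A423      # sw x2, 8(x1)
BEQ_X0_X0_M4 = 0xFE000EE3    # beq x0, x0, -4
JAL_X1_8 = 0x008000EF        # jal x1, 8
```

For instance, `imm_gen(BEQ_X0_X0_M4, "B")` reassembles the scattered branch bits into the offset -4, while `decode_regs` never needs to know the format.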

2.The single-cycle datapath

"Single-cycle" means every instruction begins and finishes within one clock period. We'll see this is conceptually simple but performance-limited.

2.1The five universal tasks

Regardless of which instruction is executing, the hardware must (in order):

Fetch the instruction from instruction memory (IMEM) at address PC.
Decode it: pull the opcode, read up to 2 source registers, sign-extend the immediate.
Execute in the ALU: add, sub, compare, or compute an address.
Memory access (only for lw/sw): read or write data memory (DMEM).
Writeback: store the result into the destination register, and update PC.

Important

In single-cycle, all five tasks happen in one clock period. In pipelining, each task gets its own stage and overlaps with neighbours' tasks. Same five tasks, just timed differently.

2.2Datapath components

High-level single-cycle datapath. Dashed arrow = writeback into RegFile.

Block	Role
PC	Holds the address of the current instruction.
IMEM	Reads the 32-bit instruction at `IMEM[PC]`.
RegFile	Reads `rs1` and `rs2` in parallel. Writes `rd` on clock edge.
Imm Gen	Sign-extends/reassembles the immediate from the instruction bits.
ALU	Performs the arithmetic/logic operation (or address calc for lw/sw).
DMEM	Reads (lw) or writes (sw) data at the address the ALU computed.
Control	Decodes the opcode → sets all the mux selects (not shown).

2.3Walkthrough — R-type (`add x3, x1, x2`)

PC drives IMEM → instruction fetched.
RegFile reads x1 and x2 in parallel.
ALU computes x1 + x2. DMEM is bypassed.
The ALU result is muxed into the RegFile's write port; on the next clock edge x3 ← result.
PC ← PC + 4.

2.4Walkthrough — `lw x3, 12(x1)`

Fetch the instruction.
RegFile reads x1; Imm Gen extracts 12.
ALU computes x1 + 12 = effective address.
DMEM reads the word at that address.
DMEM result → RegFile write port; x3 ← memory value on clock edge.
PC ← PC + 4.

Notice that a load is the only kind of instruction that uses every block in series: PC → IMEM → RegFile → ALU → DMEM → RegFile. That makes it the longest path through the datapath; keep that in mind for §3.

2.5Walkthrough — `beq x1, x2, LABEL`

Fetch + decode.
RegFile reads x1, x2. Imm Gen builds the branch offset.
ALU computes x1 − x2; zero-detect bit determines "taken".
A second adder (or the same ALU) computes PC + offset.
PC ← taken ? (PC + offset) : (PC + 4).
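The three walkthroughs above can be sketched as one behavioural `step` function. This is a simplification under stated assumptions: instructions are pre-decoded tuples rather than 32-bit words, data memory is a dictionary keyed by byte address, and the names `step`, `imem`, `dmem` are illustrative only.

```python
MASK = 0xFFFFFFFF  # keep values 32 bits wide


def step(pc, regs, imem, dmem):
    """Run the instruction at pc to completion (one 'cycle'); return the next pc."""
    op, a, b, c = imem[pc]                       # Fetch (tuple is already decoded)
    next_pc = pc + 4
    if op == "add":                              # add rd=a, rs1=b, rs2=c
        result = (regs[b] + regs[c]) & MASK      # Execute in the ALU
        if a != 0:
            regs[a] = result                     # Writeback (x0 stays 0)
    elif op == "lw":                             # lw rd=a, c(rs1=b)
        address = (regs[b] + c) & MASK           # ALU computes the address
        value = dmem.get(address, 0)             # Memory access
        if a != 0:
            regs[a] = value                      # Writeback
    elif op == "beq":                            # beq rs1=a, rs2=b, offset=c
        if ((regs[a] - regs[b]) & MASK) == 0:    # ALU subtract + zero detect
            next_pc = pc + c                     # taken: PC + offset
    else:
        raise ValueError(f"unsupported instruction {op!r}")
    return next_pc


regs = [0] * 32
regs[1] = 100
dmem = {112: 7}
imem = {
    0: ("lw", 3, 1, 12),     # x3 = MEM[x1 + 12]
    4: ("add", 4, 3, 3),     # x4 = x3 + x3
    8: ("beq", 4, 4, 8),     # always taken: skip the next instruction
    12: ("add", 5, 3, 3),    # skipped
    16: ("add", 6, 4, 3),    # x6 = x4 + x3
}
pc = 0
while pc in imem:
    pc = step(pc, regs, imem, dmem)
```

Each call to `step` does all five tasks for one instruction before the next begins, which is exactly the property that makes the single-cycle clock so long.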

3.Single-cycle performance

Now that we know what the hardware does, we can ask: how fast does it run?

3.1Performance definitions

Definitions

T_c (clock period): seconds per clock cycle. f = 1/T_c: clock frequency.

CPI: Cycles Per Instruction — average number of clocks each instruction takes.

IPC: Instructions Per Cycle = 1 / CPI. The reciprocal — equally common.

Execution Time: the only metric that matters end-to-end.

Master formula Execution Time = #Instructions × CPI × T_c

To go faster, attack one of the three factors. Each microarchitecture style optimises a different one:

Style	CPI	T_c	Notes
Single-cycle	1	very long	One slow clock per instruction
Multicycle	3-5	short	Fast clock but many cycles per inst
Pipelined	≈ 1	short	Throughput of the fast clock + CPI of single-cycle
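The master formula is simple enough to write as a one-line helper. The function name and the choice of picoseconds for the clock period are assumptions made for this sketch; the sample call uses the 10^11-instruction, 750 ps figures that appear later in this chapter.

```python
def execution_time(instructions, cpi, t_c_ps):
    """Execution time in seconds; t_c_ps is the clock period in picoseconds."""
    return instructions * cpi * t_c_ps * 1e-12


# 100 billion instructions, CPI = 1, T_c = 750 ps
single_cycle_s = execution_time(10**11, 1, 750)
print(f"{single_cycle_s:.1f} s")
```

Halving any one of the three arguments halves the result, which is why each microarchitecture style in the table targets a different factor.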

3.2The critical path

Definition

Critical path: the longest combinational delay from one clocked element to the next. T_c must be ≥ this delay or the result won't latch correctly.

For single-cycle RISC-V, the critical path runs through lw (uses every block in series):

SC critical path T_c ≥ t_PC + t_IMEM + t_RFread + t_ALU + t_DMEM + t_RFwrite

That's roughly the sum of five block delays. The whole machine is paying for the worst case every cycle — even for an add that doesn't touch DMEM. This is why single-cycle is slow.

3.3Worked example — single-cycle (Sarah Harris Ch. 7)

Given delays: IMEM = 250 ps, RegFile read = 150 ps, ALU = 200 ps, DMEM = 250 ps, RegFile write negligible.

T_c   ≥ 250 + 150 + 200 + 250 = 850 ps
CPI   = 1
For 100 billion instructions:
  Time = 10^11 × 1 × 850 ps = 85 seconds

The textbook example uses slightly different delays and arrives at T_c = 750 ps:
  Time = 10^11 × 1 × 750 ps = 75 seconds

This 75 s textbook figure is the benchmark we'll beat with pipelining (43 s, see §7).
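Using the block delays given in this example, the critical-path idea can be checked by summing the delays along each instruction's path and taking the maximum. The dictionary names and the per-instruction block lists are assumptions based on the walkthroughs in §2; the register-file write is modelled as 0 ps because the example calls it negligible.

```python
DELAY_PS = {"IMEM": 250, "RF_read": 150, "ALU": 200, "DMEM": 250, "RF_write": 0}

# Blocks each instruction passes through in series (from the §2 walkthroughs).
PATHS = {
    "lw": ["IMEM", "RF_read", "ALU", "DMEM", "RF_write"],
    "sw": ["IMEM", "RF_read", "ALU", "DMEM"],
    "add": ["IMEM", "RF_read", "ALU", "RF_write"],
    "beq": ["IMEM", "RF_read", "ALU"],
}


def path_delay(instruction):
    """Total combinational delay (ps) along one instruction's path."""
    return sum(DELAY_PS[block] for block in PATHS[instruction])


# The clock must cover the slowest instruction, even when a fast one is running.
t_c_ps = max(path_delay(i) for i in PATHS)
wasted_by_add_ps = t_c_ps - path_delay("add")
```

With these delays the memory instructions set the clock period, and every `add` leaves part of the cycle idle.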

4.From single-cycle to multicycle

Single-cycle's killer flaw: every cycle is sized for the worst instruction. Multicycle fixes T_c by giving each step its own short clock — but at the cost of multiple clocks per instruction.

4.1The motivation — why split the cycle

In single-cycle, an add uses IMEM + RF + ALU + RF write (~600 ps) but the clock is sized for lw (~850 ps). The add wastes ~250 ps of every clock doing nothing.

Insight If we could let each task finish as soon as it's done — and only pay for the tasks an instruction actually needs — we'd get a much shorter average clock. That's the multicycle idea.

4.2The multicycle datapath

Same blocks as single-cycle (PC, IMEM, RegFile, ALU, DMEM), but now one block is active per clock. Between clocks, an internal register holds intermediate values (the ALU output, the fetched instruction, the memory data, etc.).

State	What happens	Active block
S1: Fetch	IR ← IMEM[PC]; PC ← PC+4	IMEM
S2: Decode	Read rs1, rs2; sign-extend imm	RegFile
S3: Execute	ALU op or address compute	ALU
S4: Memory	DMEM read/write (lw/sw only)	DMEM
S5: Writeback	RegFile ← result	RegFile

4.3Per-instruction state count

Not every instruction needs all 5 states — that's the whole point. The control unit is a small FSM that walks through only the states each instruction needs:

Instruction	States used	Cycles
R-type (`add`, etc.)	Fetch · Decode · Execute · Writeback	4
`lw`	Fetch · Decode · Execute · Memory · Writeback	5
`sw`	Fetch · Decode · Execute · Memory	4
`beq`	Fetch · Decode · Execute (compare + PC update)	3
`jal`	Fetch · Decode · Execute · Writeback	4

Multicycle FSM: instructions skip the Memory state if they don't need DMEM.
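The state table above can be sketched as a tiny FSM that walks only the states each instruction needs. The `STATES` dictionary mirrors the table; the function names and the trace format are illustrative assumptions.

```python
STATES = {
    "r-type": ["Fetch", "Decode", "Execute", "Writeback"],
    "lw": ["Fetch", "Decode", "Execute", "Memory", "Writeback"],
    "sw": ["Fetch", "Decode", "Execute", "Memory"],
    "beq": ["Fetch", "Decode", "Execute"],
    "jal": ["Fetch", "Decode", "Execute", "Writeback"],
}


def next_state(state, kind):
    """Next FSM state for an instruction of the given kind."""
    sequence = STATES[kind]
    position = sequence.index(state)
    if position + 1 < len(sequence):
        return sequence[position + 1]
    return "Fetch"  # last state reached: go fetch the next instruction


def run(program):
    """Return the (kind, state) pair visited in every clock cycle."""
    trace = []
    for kind in program:
        state = "Fetch"
        while True:
            trace.append((kind, state))
            state = next_state(state, kind)
            if state == "Fetch":
                break
    return trace


trace = run(["lw", "r-type", "beq"])
total_cycles = len(trace)  # 5 + 4 + 3
```

One trace entry corresponds to one clock cycle, so the length of the trace is the cycle count for the program.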

4.4Performance — CPI rises, T_c falls

Multicycle T_c

T_c ≥ t_{longest single block} ≈ max(t_IMEM, t_ALU, t_DMEM, …)
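At best roughly one fifth of single-cycle's T_c (if all five steps took equally long); with the delays used here it is 250 ps against 850 ps, closer to one third.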

Multicycle CPI

CPI_multi = Σ frequency_i × states_i

4.4.1Worked example

Using the SPECINT2000 mix from §7.3: 25% lw (5 states), 10% sw (4), 13% branch (3), 52% R-type (4).

CPI_multi = 0.25·5 + 0.10·4 + 0.13·3 + 0.52·4
          = 1.25 + 0.40 + 0.39 + 2.08
          = 4.12

With T_c = 250 ps and 10¹¹ instructions:

Time_multi = 10^11 × 4.12 × 250 ps = 103 seconds

(Sarah Harris's textbook gets 155 s using slightly different delays — same shape.)
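The arithmetic above can be reproduced in a few lines. The variable names are illustrative; the mix, state counts, 250 ps clock, and instruction count are the ones used in this worked example.

```python
# instruction type -> (fraction of the program, states per instruction)
MIX = {
    "lw": (0.25, 5),
    "sw": (0.10, 4),
    "branch": (0.13, 3),
    "r-type": (0.52, 4),
}

# CPI_multi = sum over types of frequency x states
cpi_multi = sum(freq * states for freq, states in MIX.values())

T_C_PS = 250
INSTRUCTIONS = 10**11
time_multi_s = INSTRUCTIONS * cpi_multi * T_C_PS * 1e-12

print(f"CPI = {cpi_multi:.2f}, time = {time_multi_s:.0f} s")
```

Changing the mix or the per-type state counts in `MIX` shows how sensitive the multicycle result is to the share of five-state loads.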

4.5Did we win or lose?

	Single-cycle	Multicycle
CPI	1	≈ 4
T_c	~850 ps	~250 ps
Time per inst	850 ps	≈ 1000 ps
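Verdict	Multicycle can be slower overall, and with these numbers it is (≈ 1030 ps vs 850 ps per instruction). The cycles saved on simple instructions don't make up for clocking every state at the slowest block's delay.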

Why we still study it

Multicycle teaches us how to chop an instruction into stages. Pipelining keeps those stages but overlaps them across instructions — that's where the speedup actually comes from. Multicycle alone = the wind-up. Pipelining = the pitch.

5.The pipelined datapath

Take the single-cycle datapath and split it into 5 sub-datapaths separated by clocked registers. Each sub-datapath is a stage. At any moment, up to 5 different instructions are in flight.

5.1The five stages

Stage	Task	What happens
IF	Fetch	IMEM read, PC+4
ID	Decode	Decode + RF read
EX	Execute	ALU
MEM	Memory	DMEM access (lw/sw)
WB	Writeback	RF write

These map 1-to-1 onto the five universal tasks from §2.1.

5.2Pipeline registers

Between every pair of stages we drop a clocked register: IF/ID, ID/EX, EX/MEM, MEM/WB. They carry forward whatever state the next stage needs (rs values, immediate, control signals, the destination register number, the ALU result, etc.).

Why they exist Without these latches, instruction A in EX would see B's wires flipping in IF. The latches freeze each stage's inputs at the clock edge so all 5 stages can operate in parallel without colliding.

5.3Steady-state throughput & speedup

Startup fill: first instruction finishes at cycle 5 (one full trip through the pipe).
Steady state: from cycle 5 onward, one instruction completes every cycle ⇒ ideal CPI = 1.
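T_c: dictated by the slowest single stage, not the sum. Ideally up to 5× shorter than single-cycle; in practice less, because the stages are not perfectly balanced (§7.4 uses 350 ps against single-cycle's 750 ps).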
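The fill-then-steady-state behaviour can be sketched by computing which stage each instruction occupies in each cycle, assuming an ideal pipeline with no hazards. The function names are illustrative; the stage names follow the pipeline-register names IF/ID, ID/EX, EX/MEM, MEM/WB.

```python
STAGES = ["IF", "ID", "EX", "MEM", "WB"]


def total_cycles(n_instructions):
    """Cycles to run n instructions with no stalls: fill once, then 1 per cycle."""
    if n_instructions == 0:
        return 0
    return n_instructions + len(STAGES) - 1


def stage_of(instr_index, cycle):
    """Stage held by instruction instr_index (0-based) in cycle (1-based), or None."""
    k = cycle - 1 - instr_index
    return STAGES[k] if 0 <= k < len(STAGES) else None


def diagram(n_instructions):
    """One text row per instruction, one column per cycle."""
    rows = []
    for i in range(n_instructions):
        cells = [(stage_of(i, c) or ".").ljust(4)
                 for c in range(1, total_cycles(n_instructions) + 1)]
        rows.append(f"I{i}: " + "".join(cells).rstrip())
    return rows


for row in diagram(5):
    print(row)
```

For large programs `total_cycles(n) / n` approaches 1, which is the ideal CPI; the four extra cycles are the one-time fill cost.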

Key takeaway

Pipelining keeps the single-cycle CPI of 1 and the multicycle short T_c. Best of both worlds — modulo hazards, which we tackle next.

6.Pipeline hazards

The price of overlap. Three families of hazard prevent the ideal CPI = 1.

6.1Data hazards — RAW, WAR, WAW

Type	Pattern	True dependency?	Visible in 5-stage in-order?
RAW (Read-After-Write)	I2 reads what I1 just wrote	yes — real	yes — common
WAR (anti-dependency)	I2 writes a reg I1 still reads	no — naming	no (in-order)
WAW (output dep.)	I2 writes same reg as I1	no — naming	no (in-order)

WAR and WAW only matter in out-of-order pipelines (see Advanced µArch). RAW is the one we deal with in the 5-stage in-order pipe.

6.2Forwarding (bypassing)

Most RAW hazards can be solved without stalling. The ALU result is already computed at the end of EX — feed it directly to the next instruction's ALU input, instead of waiting for it to land in the RegFile two cycles later.

EX/MEM → EX forward: previous instruction's ALU result → current instruction's ALU input.
MEM/WB → EX forward: the result of the instruction two ahead → current instruction's ALU input.
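A minimal sketch of the selection logic behind these two forwarding paths, assuming the usual convention that the newer result (EX/MEM) takes priority over the older one (MEM/WB). The signal names (`ex_mem_rd`, `ex_mem_reg_write`, and so on) are hypothetical, chosen for this example.

```python
def forward_select(rs, ex_mem_rd, ex_mem_reg_write, mem_wb_rd, mem_wb_reg_write):
    """Pick the source for one ALU operand whose register number is rs."""
    if rs != 0 and ex_mem_reg_write and ex_mem_rd == rs:
        return "EX/MEM"    # previous instruction's ALU result
    if rs != 0 and mem_wb_reg_write and mem_wb_rd == rs:
        return "MEM/WB"    # result of the instruction two ahead
    return "RegFile"       # no hazard: use the value read in ID


# add x3, x1, x2 ; sub x5, x3, x4  -> sub's rs1 (x3) comes from EX/MEM
first_operand = forward_select(3, ex_mem_rd=3, ex_mem_reg_write=True,
                               mem_wb_rd=9, mem_wb_reg_write=True)
# same sub, rs2 (x4): nobody in flight writes x4
second_operand = forward_select(4, ex_mem_rd=3, ex_mem_reg_write=True,
                                mem_wb_rd=9, mem_wb_reg_write=True)
```

The `rs != 0` guard keeps x0 reading as zero even if an in-flight instruction names x0 as its destination. This sketch does not cover the case where the EX/MEM instruction is a load, which is the subject of the next section.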

6.3The load-use hazard (one unavoidable stall)

Definition

Load-use hazard: a load is immediately followed by an instruction that consumes its result. Even with forwarding, one bubble is required — because the load doesn't produce its data until after MEM, which is one cycle too late for the dependent EX.

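Example pair: `lw x1` followed directly by `add x3,x1,x2`. The add needs x1 in EX while the load is still in MEM, so one bubble is inserted between them.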

Load CPI CPI_lw = P(no stall)·1 + P(stall)·2
e.g. 40% of loads stall ⇒ CPI_lw = 0.6·1 + 0.4·2 = 1.4
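The stall condition in the definition can be sketched as a single check made while the dependent instruction is in ID and the load is in EX. The signal names and the `(op, rd, rs1, rs2)` tuple format are assumptions for illustration; a register number of 0 stands for "not used".

```python
def load_use_stall(id_ex_is_load, id_ex_rd, if_id_rs1, if_id_rs2):
    """True when the instruction in ID needs the result of a load now in EX."""
    return (bool(id_ex_is_load)
            and id_ex_rd != 0
            and id_ex_rd in (if_id_rs1, if_id_rs2))


def count_bubbles(program):
    """Count load-use bubbles in a list of (op, rd, rs1, rs2) tuples."""
    bubbles = 0
    for previous, current in zip(program, program[1:]):
        if load_use_stall(previous[0] == "lw", previous[1], current[2], current[3]):
            bubbles += 1
    return bubbles


back_to_back = [("lw", 1, 2, 0), ("add", 3, 1, 2)]
separated = [("lw", 1, 2, 0), ("add", 4, 5, 6), ("add", 3, 1, 2)]


def load_cpi(p_stall):
    """CPI_lw = P(no stall) x 1 + P(stall) x 2."""
    return (1 - p_stall) * 1 + p_stall * 2
```

In `separated`, one independent instruction sits between the load and its consumer, so forwarding alone is enough and no bubble is counted; that is the effect a compiler gets by scheduling around loads.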

6.4Control hazards — branch flushes

A branch resolves in EX. By that point, two instructions are already in IF and ID. If the branch is taken (or mispredicted), both must be flushed (turned into NOPs) and the correct path re-fetched.

Branch CPI CPI_br = P(correct)·1 + P(mispredict)·3
Misprediction cost = 1 (the branch itself) + 2 flushed instructions = 3 cycles.

Branch prediction (covered in Advanced µArch) is how we get P(correct) close to 100%.
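To see how strongly prediction accuracy drives the branch CPI, the formula above can be evaluated for a few misprediction rates. The function name and the sample rates are illustrative assumptions; the 2-cycle flush penalty is the one stated in this section.

```python
def branch_cpi(p_mispredict, flush_penalty=2):
    """CPI_br = P(correct) x 1 + P(mispredict) x (1 + flush_penalty)."""
    return (1 - p_mispredict) * 1 + p_mispredict * (1 + flush_penalty)


for rate in (0.5, 0.1, 0.02):
    print(f"mispredict {rate:.0%}: CPI_br = {branch_cpi(rate):.2f}")
```

Algebraically this is `1 + p_mispredict * flush_penalty`, so each percentage point of misprediction costs 0.02 cycles per branch in this pipeline.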

6.5Structural hazards

Two stages want the same hardware in the same cycle. The classic example: IF and MEM both touching memory.

Fix: split caches, i.e. a separate I-cache (for IF) and D-cache (for MEM). Pipelined RISC-V cores typically do this.
Other fix: duplicate the unit, or stall one stage.

7.Pipelined performance math

Putting per-type CPIs and instruction mix together to get a real-program CPI.

7.1Per-type CPI

Type	Best-case CPI	Stall scenario	Stall cost
R-type, store	1	—	—
Load	1	Load-use	+1 (bubble)
Branch	1	Misprediction	+2 (flushes)
Jump (jal)	1	Always: target known too late	+2 typically

7.2Weighted-average CPI

Formula CPI_avg = Σ frequency_i × CPI_i

The recipe:

For each instruction type, find its share of the program (e.g. 25% loads).
For each type, compute its CPI: P(no stall)·1 + P(stall)·stallCost.
Sum the products. Done.

7.3SPECINT2000 worked example

Given: 25% loads, 10% stores, 13% branches, 52% R-type. 40% of loads cause a load-use stall. 50% of branches are mispredicted.

Type	Freq	CPI calculation	CPI
Loads	0.25	0.6·1 + 0.4·2 = 1.4	1.4
Stores	0.10	1	1.0
Branches	0.13	0.5·1 + 0.5·3 = 2.0	2.0
R-type	0.52	1	1.0

CPI_avg = 0.25·1.4 + 0.10·1.0 + 0.13·2.0 + 0.52·1.0
        = 0.35 + 0.10 + 0.26 + 0.52
        = 1.23
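The recipe from §7.2 applied to this mix can be written out as follows. The tuple layout and names are illustrative; the frequencies, stall probabilities, and penalties are the ones given above.

```python
def type_cpi(p_stall, penalty):
    """Per-type CPI: P(no stall) x 1 + P(stall) x (1 + penalty)."""
    return (1 - p_stall) * 1 + p_stall * (1 + penalty)


# (type, frequency, probability of a stall, stall penalty in cycles)
MIX = [
    ("load", 0.25, 0.40, 1),     # load-use bubble
    ("store", 0.10, 0.00, 0),
    ("branch", 0.13, 0.50, 2),   # two flushed instructions
    ("r-type", 0.52, 0.00, 0),
]

per_type = {name: type_cpi(p, pen) for name, _, p, pen in MIX}
cpi_avg = sum(freq * per_type[name] for name, freq, _, _ in MIX)

print(f"CPI_avg = {cpi_avg:.2f}")
```

Setting the branch stall probability to a smaller value in `MIX` shows directly how much better prediction would lower the average.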

7.4Speedup recap

100 billion instructions, T_c = 350 ps:

Time_pipelined = 10^11 × 1.23 × 350 ps = 43 seconds

Design	Time	Speedup
Single-cycle	75 s	1×
Multicycle	155 s (textbook figure; the simplified delays in §4.4.1 give 103 s)	0.5×
Pipelined	43 s	1.7×
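The recap numbers can be checked with a short script. The names are illustrative; the times are the ones in the table, and the pipelined time is recomputed from the CPI and T_c stated above.

```python
INSTRUCTIONS = 10**11

# Pipelined time from the master formula: #Inst x CPI x T_c
time_pipelined_s = INSTRUCTIONS * 1.23 * 350 * 1e-12

TIMES_S = {"Single-cycle": 75, "Multicycle": 155, "Pipelined": 43}

# Speedup is measured against the single-cycle design.
speedup = {design: TIMES_S["Single-cycle"] / t for design, t in TIMES_S.items()}

for design, factor in speedup.items():
    print(f"{design}: {TIMES_S[design]} s, {factor:.1f}x")
```

A speedup below 1 for the multicycle design simply means it is slower than the baseline.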

Chapter summary

RISC-V code lives in 6 instruction formats (R, I, S, B, U, J).
Every instruction does five tasks: Fetch, Decode, Execute, Memory, Writeback.
Single-cycle = all five in one big clock (CPI=1, T_c huge).
Multicycle = one task per clock (T_c small, CPI≈4).
Pipelined = all five tasks in parallel, each working on a different instruction (CPI≈1, T_c small).
Hazards push CPI above 1: load-use (+1), branch mispredict (+2). Forwarding kills most RAW penalties.
Execution time = #Inst × CPI × T_c. Optimise any factor to go faster.

Source RiscV_Sarah_Harris.pdf Ch. 7. Slide deck mirrors at Microarchitecture/Ch7_MicroArch.pdf and Branch Prediction.pptx.pdf (examples 7.4–7.9).

Next: Advanced µArch → (branch prediction, superscalar, OoO + renaming)
