Chapter — Advanced microarchitecture
From single-issue pipelining to multicycle pipelined superscalar — first in-order, then out-of-order with register renaming. Each step adds new pipeline stages and new hazards to resolve.
Contents
- Where the in-order pipeline runs out of road
- Deeper pipelines & µops
- Branch prediction (so we can afford a deeper pipe)
- Superscalar — issuing more than one per cycle
- Out-of-order — letting independent instructions overtake
- Register renaming — the hazard solver
- The Reorder Buffer (ROB) — keeping order at commit
- OoO performance calculations
1. Where the in-order pipeline runs out of road
| Knob | Move | New problem it creates |
|---|---|---|
| Shorten Tc | More stages (deeper pipeline) | Higher mispredict penalty |
| Crack instructions | Decode to µops (smaller, simpler ops) | More µops per instruction |
| Issue more per cycle | Superscalar (N-wide) | More dependencies to resolve |
2. Deeper pipelines & µops
2.1 Adding stages
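Splitting "EX" into two shorter EX stages roughly halves that stage's combinational delay. Tc only halves (and frequency only doubles) if EX was the slowest stage, the split is even, and pipeline-register overhead is negligible; otherwise the next-slowest stage sets Tc.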
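A minimal sketch of that cycle-time argument, assuming Tc is the slowest stage's delay plus a fixed pipeline-register overhead. All picosecond values and the name `REG_OVERHEAD_PS` are made up for illustration and do not describe any real design.

```python
# Hypothetical numbers for illustration only; none come from a real design.
REG_OVERHEAD_PS = 20  # assumed setup + clock-to-q cost of one pipeline register


def cycle_time_ps(stage_delays_ps):
    """Tc is set by the slowest stage plus the pipeline-register overhead."""
    return max(stage_delays_ps) + REG_OVERHEAD_PS


# Assumed combinational delays for IF, ID, EX, MEM, WB.
five_stage = [180, 150, 200, 190, 120]
# Split EX (200 ps) into two 100 ps halves; the other stages are unchanged.
six_stage = [180, 150, 100, 100, 190, 120]

tc_before = cycle_time_ps(five_stage)
tc_after = cycle_time_ps(six_stage)
speedup = tc_before / tc_after
print(tc_before, tc_after, round(speedup, 3))
```

With these assumed delays, splitting EX barely helps because MEM becomes the new critical stage; the frequency only doubles when every stage shrinks together.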
| Pipeline | Stages | Mispredict penalty (flushed cycles) |
|---|---|---|
| 5-stage classic RISC | IF · ID · EX · MEM · WB | 2 |
| Intel Pentium 4 "NetBurst" (Prescott) | ~31 stages | ~20 |
| Modern x86 (Skylake, Zen) | ~14-19 stages | ~12-15 |
2.2 Cracking instructions into µops
x86: add [rax+8], rbx ; one ISA instruction
↓ decode
µop1: load tmp, [rax+8]
µop2: add tmp, tmp, rbx
µop3: store [rax+8], tmp
The pipeline now schedules µops, not ISA instructions. RISC-V needs fewer cracks (most ops are already simple), but even RISC cores use µops for cases such as load-with-immediate-offset or fused branches.
3. Branch prediction
3.1 Static vs dynamic
| | Static | Dynamic |
|---|---|---|
| Rule | Compile-time guess (e.g. "backwards taken") | Per-branch state table updated at runtime |
| Hardware | None — just the decoder | Branch Target Buffer (BTB) + state bits |
| Typical accuracy | 60-70% | 85-99% |
3.2 1-bit predictor
Each branch stores one bit: "predict taken" (1) or "predict not taken" (0). Flip on a wrong guess.
[State diagram: two states, TAKEN (1) and NOT TAKEN (0); each wrong guess moves to the other state.]
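The flip-on-wrong rule can be sketched as a tiny simulator. The function name `run_one_bit` and the 7-iteration outcome list are illustrative assumptions; the state encoding (1 = predict taken, 0 = predict not taken) follows the text.

```python
def run_one_bit(outcomes, state=0):
    """Simulate a 1-bit predictor. state: 1 = predict taken, 0 = predict not taken.

    Returns (mispredict_count, final_state).
    """
    mispredicts = 0
    for taken in outcomes:
        prediction = bool(state)
        if prediction != taken:
            mispredicts += 1
            state = 1 if taken else 0  # flip to match the actual outcome
    return mispredicts, state


# A loop branch taken 6 times, then falling through once (7 iterations).
loop = [True] * 6 + [False]
first_call, state = run_one_bit(loop, state=0)
second_call, state = run_one_bit(loop, state)
print(first_call, second_call)  # 2 2
```

Every pass through the loop pays twice: once on entry (the bit still says "not taken" from the last exit) and once on exit.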
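3.3 2-bit saturating predictor
Two bits, four states. From a strong state (11 or 00), it takes two consecutive wrong guesses to flip the predicted direction; the first only weakens it. From a weak state (10 or 01), a single wrong guess flips it.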
[State diagram: four states, TAKEN (11), TAKEN (10), NOT TAKEN (01), NOT TAKEN (00); a taken branch moves toward 11, a not-taken branch toward 00.]
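A minimal sketch of the four-state counter, assuming the usual encoding in which the high bit is the prediction. The helper names `predict` and `update` and the "Strongly"/"Weakly" labels for 11 and 10 are illustrative choices.

```python
STATE_NAMES = {0b00: "Strongly NT", 0b01: "Weakly NT", 0b10: "Weakly T", 0b11: "Strongly T"}


def predict(state):
    """Predict taken when the high bit is set (states 10 and 11)."""
    return state >= 0b10


def update(state, taken):
    """Saturating counter: count up on taken, down on not taken, clamp to 0..3."""
    return min(state + 1, 0b11) if taken else max(state - 1, 0b00)


# From Strongly T, one not-taken outcome only weakens the prediction...
s = update(0b11, taken=False)
print(STATE_NAMES[s], predict(s))   # Weakly T True
# ...and a second consecutive one flips it.
s = update(s, taken=False)
print(STATE_NAMES[s], predict(s))   # Weakly NT False
```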
3.4 Worked example — 7-iteration do-while loop
| Predictor | Mispredicts (per call) | Branch CPI |
|---|---|---|
| None (always not taken) | 6 of 7 | 1/7·1 + 6/7·3 = 2.71 |
| 1-bit | 2 of 7 | 5/7·1 + 2/7·3 = 1.57 |
| 2-bit (steady state; a cold Weakly NT start costs 2 of 7) | 1 of 7 | 6/7·1 + 1/7·3 = 1.29 |
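The Branch CPI column can be reproduced with a short helper. The costs (1 cycle for a correct prediction, 3 for a mispredict) are the ones used in the table's arithmetic; the function name `branch_cpi` is an illustrative choice.

```python
from fractions import Fraction


def branch_cpi(mispredicts, branches, hit_cycles=1, miss_cycles=3):
    """Average cycles per branch: hits cost hit_cycles, mispredicts cost miss_cycles."""
    miss_rate = Fraction(mispredicts, branches)
    return (1 - miss_rate) * hit_cycles + miss_rate * miss_cycles


for name, wrong in [("always not taken", 6), ("1-bit", 2), ("2-bit", 1)]:
    cpi = branch_cpi(wrong, 7)
    print(f"{name}: {cpi} = {float(cpi):.2f}")
# always not taken: 19/7 = 2.71
# 1-bit: 11/7 = 1.57
# 2-bit: 9/7 = 1.29
```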
3.4.1 GQ Q9 — 100-iteration loop with 2-bit predictor
addi x2, x0, 100
loop:
addi x2, x2, -1
bne x2, x0, loop # taken 99 times, not-taken on iter 100
Starting state = 01 (Weakly NT). Trace:
| Iter | State in | Predict | Actual | Wrong? | State out |
|---|---|---|---|---|---|
| 1 | 01 | NT | T | YES | 10 |
| 2 | 10 | T | T | — | 11 |
| 3-99 | 11 | T | T | — | 11 |
| 100 | 11 | T | NT | YES | 10 |
✓ Answer: 2 mispredictions.
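The trace can be checked mechanically. This sketch assumes the same saturating-counter rule as the table (up on taken, down on not taken, predict taken in states 10 and 11); the function name `count_mispredicts` is illustrative.

```python
def count_mispredicts(outcomes, state=0b01):
    """Run a 2-bit saturating counter over branch outcomes (True = taken)."""
    wrong = 0
    for taken in outcomes:
        if (state >= 0b10) != taken:   # high bit set means "predict taken"
            wrong += 1
        # Saturating update: up on taken, down on not taken.
        state = min(state + 1, 0b11) if taken else max(state - 1, 0b00)
    return wrong, state


# bne is taken on iterations 1..99 and falls through on iteration 100.
outcomes = [True] * 99 + [False]
wrong, final_state = count_mispredicts(outcomes)
print(wrong, format(final_state, "02b"))   # 2 10
```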
4. Superscalar — issuing more than one per cycle
4.1 In-order superscalar — same pipe, doubled lanes
An N-wide in-order superscalar fetches N instructions per cycle and feeds them through N parallel pipelines that still preserve program order.
4.2 The new constraints
- Intra-bundle hazards: the two instructions issued together can't share a structural unit (only one load-store unit, etc.) and can't depend on each other (the producer wouldn't have finished yet).
- RAW with the previous bundle: if instruction i+1 needs the result of instruction i, issued in the previous cycle, it stalls at Issue (IS) until the value is ready.
- In-order: instruction i+1 cannot pass instruction i even if i+1's operands are ready and i's aren't. That's the killer.
4.3 Worked example — 2-wide IO superscalar IPC
Latencies: ALU = 1 cycle, MUL = 3 (pipelined), LW = 2 cycles.
I1: lw x1, 0(x10)
I2: lw x2, 4(x10)
I3: add x3, x1, x2 ← needs x1, x2
I4: mul x4, x3, x5 ← needs x3
I5: add x6, x4, x2 ← needs x4
I6: add x3, x7, x5 ← independent
The pipeline can issue two instructions per cycle but must stall when a dependency isn't ready. Walking it out:
| Cyc | Lane 0 | Lane 1 |
|---|---|---|
| 1 | I1 IF | I2 IF |
| 2 | I1 ID | I2 ID |
| 3 | I1 IS | I2 IS |
| 4 | I1 EX | I2 EX |
| 5 | I1 MEM | I2 MEM |
| 6 | I3 (waits, x1 ready) | I4 (waits for I3) |
| … | I3 finishes EX cycle 7, I4 starts EX cycle 8, MUL 3 cycles, … | |
| 9 | I5, I6 complete | |
Six instructions in 9 cycles ⇒ IPC = 6/9 ≈ 0.67. (Matches GQ Q7.)
5. Out-of-order — letting independent instructions overtake
5.1 The three sections of an OoO core
| Section | Order | Speculative? | Key hardware |
|---|---|---|---|
| Front-End | In-Order | Yes | Branch predictor, Renamer |
| Execution Engine | Out-of-Order | Yes | Issue queue, Physical RegFile, FUs |
| Back-End | In-Order | No (at commit) | Reorder Buffer (ROB) |
5.2 New pipeline stages introduced
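The 5-stage RISC pipeline expands into something like the following, with Rename, Dispatch, Issue, and Commit as the new stages:
[Figure: expanded out-of-order pipeline]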
5.3 Why WAR & WAW return — and renaming kills them
In our 5-stage in-order pipe, WAR and WAW could never bite because instructions read and wrote registers in program order. Once instructions can complete out of order, a younger instruction can write x1 before an older one has read it: that's a WAR hazard. If both write x1 and the older write lands last, x1 is left holding the stale value: that's a WAW hazard.
| Hazard | Pattern | True dep? | Fixed by renaming? |
|---|---|---|---|
| RAW | read after write — real data flow | YES | NO — survives |
| WAR | write after read — name collision | NO — just naming | YES |
| WAW | write after write — name collision | NO — just naming | YES |
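The three patterns in the table can be detected from register names alone. A minimal sketch, assuming each instruction is reduced to a `(dest, sources)` pair; the function name `hazards` is illustrative, and the sample instructions are I3, I4, and I6 from §4.3.

```python
def hazards(older, younger):
    """Classify register hazards between two instructions in program order.

    Each instruction is (dest, [sources]); register names are plain strings.
    """
    o_dest, o_srcs = older
    y_dest, y_srcs = younger
    found = set()
    if o_dest in y_srcs:
        found.add("RAW")   # younger reads what older writes: true dependency
    if y_dest in o_srcs:
        found.add("WAR")   # younger overwrites a name older still has to read
    if y_dest == o_dest:
        found.add("WAW")   # both write the same name
    return found


i3 = ("x3", ["x1", "x2"])   # add x3, x1, x2
i4 = ("x4", ["x3", "x5"])   # mul x4, x3, x5
i6 = ("x3", ["x7", "x5"])   # add x3, x7, x5
print(hazards(i3, i4))   # {'RAW'}
print(hazards(i3, i6))   # {'WAW'}
print(hazards(i4, i6))   # {'WAR'}
```

Only the I3/I4 pair carries real data; the two hazards involving I6 exist purely because it reuses the name x3.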
6. Register renaming — the hazard solver
6.1 Architectural vs physical registers
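Architectural registers (ARF): the registers the ISA names, x0..x31 in RISC-V. What the program sees.
Physical registers (PRF): a larger pool inside the hardware, p0..p63 or more. What the OoO engine actually writes to.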
Rename table: maps each architectural reg → the physreg holding its current value.
6.2 Renaming walkthrough
Start map: x1→p1, x2→p2, x3→p3, x5→p5, x10→p10.
Bundle A — cycle 1
lw x1, 0(x10) → lw p20, 0(p10) ; allocate new physreg p20 for x1
lw x2, 4(x10) → lw p21, 4(p10) ; allocate new physreg p21 for x2
Rename table after bundle A: x1→p20, x2→p21, others unchanged.
Bundle B — cycle 2
add x3, x1, x2 → add p22, p20, p21 ; reads the NEW p20/p21
mul x4, x3, x5 → mul p23, p22, p5 ; chains to p22 (true RAW)
Final map: x1→p20, x2→p21, x3→p22, x4→p23.
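The walkthrough can be replayed with a toy renamer. This is a sketch under simplifying assumptions: instructions are reduced to `(dest, sources)` pairs, the free list is a plain counter starting at p20 as in the walkthrough, and physical registers are never freed. The function name `rename` is illustrative.

```python
def rename(program, rename_map, next_free):
    """Rename (dest, [sources]) instructions; returns the renamed list.

    rename_map maps architectural names to physical names and is updated in place.
    """
    renamed = []
    for dest, srcs in program:
        # Sources read the current mapping BEFORE the destination is remapped.
        phys_srcs = [rename_map[s] for s in srcs]
        phys_dest = f"p{next_free}"   # allocate a fresh physical register
        next_free += 1
        rename_map[dest] = phys_dest
        renamed.append((phys_dest, phys_srcs))
    return renamed


table = {"x1": "p1", "x2": "p2", "x3": "p3", "x5": "p5", "x10": "p10"}
program = [
    ("x1", ["x10"]),        # lw  x1, 0(x10)
    ("x2", ["x10"]),        # lw  x2, 4(x10)
    ("x3", ["x1", "x2"]),   # add x3, x1, x2
    ("x4", ["x3", "x5"]),   # mul x4, x3, x5
]
for phys_dest, phys_srcs in rename(program, table, next_free=20):
    print(phys_dest, phys_srcs)
print(table)
```

Looking up the sources before remapping the destination is what makes an instruction like `add x5, x5, x3` read the old x5 and write a new one.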
6.3 Why RAW still survives
In the example above, mul reads p22 which add just wrote. That's a real data dependency — mul can't execute until add's result lands in p22. Renaming gave us fresh names, but it can't invent data out of thin air.
7. The Reorder Buffer (ROB)
7.1 How it works
- When the front-end dispatches an instruction, it allocates a new ROB entry at the tail, in program order.
- The instruction executes whenever its operands are ready (OoO). Its result lands in its physical register and a "done" flag flips in its ROB entry.
- The head of the ROB is the oldest in-flight instruction. Each cycle, if the head is done, it commits (retires) — its architectural state becomes visible.
- If the head is a mispredicted branch or an exception, the entire ROB after the head is flushed: physregs are freed, the rename map is rolled back, fetch restarts.
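The allocate-at-tail, commit-at-head discipline above can be sketched as a small queue. This toy `ReorderBuffer` class is an illustrative assumption, not a description of any real core: it tracks only a "done" flag per entry and leaves out physical-register freeing and rename-map rollback.

```python
from collections import deque


class ReorderBuffer:
    """Toy ROB: entries are allocated in program order and commit only from the head."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = deque()   # head = oldest in-flight instruction

    def dispatch(self, name):
        if len(self.entries) == self.capacity:
            return False          # ROB full: dispatch must stall
        self.entries.append({"name": name, "done": False})
        return True

    def mark_done(self, name):
        for entry in self.entries:
            if entry["name"] == name:
                entry["done"] = True

    def commit(self):
        """Retire the head if it has finished; otherwise nothing commits this cycle."""
        if self.entries and self.entries[0]["done"]:
            return self.entries.popleft()["name"]
        return None

    def flush_after_head(self):
        """Squash everything younger than the head (mispredict or exception at head)."""
        while len(self.entries) > 1:
            self.entries.pop()


rob = ReorderBuffer(capacity=4)
for name in ["I1", "I2", "I3"]:
    rob.dispatch(name)
rob.mark_done("I3")          # youngest finishes first (out of order)
print(rob.commit())          # None: head I1 is not done, so I3 cannot retire
rob.mark_done("I1")
print(rob.commit())          # I1
```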
7.2 When the ROB stalls issue
The dispatch stage can't allocate a new ROB entry when the buffer is full. With 96 entries and 4-wide issue, and assuming the worst case in which nothing commits yet, the ROB fills after 96/4 = 24 cycles. After that, dispatch has to wait each cycle for a commit at the head to free a slot.
7.3 Mispredict / exception recovery
Because the ROB commits in order, the architectural state is always a correct prefix of the program. On any speculative failure (wrong branch, page fault, divide-by-zero), the ROB just throws away everything past the offending instruction. To software, it looks as if nothing speculative ever happened.
8. OoO performance calculations
8.1 2-wide pipe IPC with RAW stalls
Same 6 instructions, same latencies as §4.3, but now OoO with renaming. Independent instructions can execute as soon as their operands are ready.
I1: lw x1, 0(x10)
I2: lw x2, 4(x10)
I3: add x3, x1, x2
I4: mul x4, x3, x5
I5: add x6, x4, x2
I6: add x3, x7, x5 ← totally independent — can run early
With renaming, I6 writes a fresh physreg (no WAW with I3). The OoO scheduler can issue I6 in parallel with the I1-I5 chain. With the right back-end this brings IPC closer to 1 even on a 2-wide machine.
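To see I6 overtake, here is a rough scheduling sketch under loudly stated assumptions: only execution latency is modelled (no fetch, decode, rename, or commit stages), renaming has already removed WAR and WAW, and each cycle the scheduler issues up to `width` ready instructions, oldest first. The `schedule` function and the `PROGRAM` encoding are illustrative, and its cycle count uses a different accounting from the in-order table in §4.3, so the two numbers are not directly comparable.

```python
LATENCY = {"lw": 2, "add": 1, "mul": 3}   # latencies from the in-order example

# (name, op, names of older instructions whose results it reads) after renaming,
# so only true RAW dependencies remain.
PROGRAM = [
    ("I1", "lw", []),
    ("I2", "lw", []),
    ("I3", "add", ["I1", "I2"]),
    ("I4", "mul", ["I3"]),
    ("I5", "add", ["I4", "I2"]),
    ("I6", "add", []),
]


def schedule(program, width=2):
    """Greedy oldest-first issue of up to `width` ready instructions per cycle."""
    finish = {}            # name -> first cycle in which its result is usable
    issued_at = {}
    pending = list(program)
    cycle = 0
    while pending:
        slots = width
        for instr in list(pending):
            name, op, deps = instr
            ready = all(d in finish and finish[d] <= cycle for d in deps)
            if ready and slots > 0:
                issued_at[name] = cycle
                finish[name] = cycle + LATENCY[op]
                pending.remove(instr)
                slots -= 1
        cycle += 1
    return issued_at, max(finish.values())


issued_at, total_cycles = schedule(PROGRAM)
print(issued_at)      # I6 issues in cycle 1, ahead of I3, I4 and I5
print(total_cycles)   # 7
```

In this model the I1-to-I5 RAW chain still sets the total, which is the point of §6.3: renaming frees I6 but cannot shorten the true dependency chain.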
For GQ Q7 specifically, the in-order assumption is fixed; OoO would do better.
8.2 Distinct physical registers for a renamed sequence
add x5, x1, x2 ; x5 → p_new_1 (1st write to x5 since initial map)
add x5, x5, x3 ; x5 → p_new_2
mul x7, x5, x4 ; reads p_new_2
add x5, x6, x8 ; x5 → p_new_3
sub x9, x5, x10 ; reads p_new_3
Count of new physregs claimed by x5 writes = 3.
✓ GQ answer: 3.
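The count follows from one rule: under renaming, every write to a register claims a fresh physical register. A minimal sketch, with the hypothetical helper `count_new_physregs` applied to the destination registers of the five instructions above:

```python
def count_new_physregs(dests, reg):
    """Each write to `reg` claims one fresh physical register under renaming."""
    return sum(1 for d in dests if d == reg)


# Destination register of each instruction in the sequence above, in order.
dests = ["x5", "x5", "x7", "x5", "x9"]
print(count_new_physregs(dests, "x5"))   # 3
```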
8.3 Cycles until the ROB blocks issue
Variant A — only ROB constraint
4-wide processor, 96-entry ROB: 96 / 4 = 24 cycles.
Variant B — PRF can be tighter than ROB
If PRF = 64 and ARF = 32 ⇒ only 32 free physregs available. At 4 writes/cycle: 32 / 4 = 8 cycles. Whichever resource is scarcest stalls first.
GQ Q8 — 3 inst/cycle
96-entry ROB at 3 inst/cycle = 96 / 3 = 32 cycles.
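All three variants are the same calculation: divide each resource by the per-cycle demand and take the smallest result. A sketch assuming the worst case used above (nothing commits, every instruction writes a register); the function name `cycles_until_stall` and its parameters are illustrative.

```python
def cycles_until_stall(width, rob_entries, prf_size=None, arf_size=None):
    """Cycles of full-width dispatch before the scarcest resource runs out.

    Assumes nothing commits and every instruction writes a register (worst case).
    """
    limits = [rob_entries / width]
    if prf_size is not None and arf_size is not None:
        free_physregs = prf_size - arf_size   # the rest hold architectural state
        limits.append(free_physregs / width)
    return min(limits)


print(cycles_until_stall(4, 96))                            # 24.0
print(cycles_until_stall(4, 96, prf_size=64, arf_size=32))  # 8.0
print(cycles_until_stall(3, 96))                            # 32.0
```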
- Three ways to push past in-order CPI=1: deeper pipe, wider issue, OoO.
- Deeper pipes need better branch prediction to be worth it (2-bit ≥ 1-bit).
- µops let the backend schedule simpler operations than the ISA exposes.
- In-order superscalar boosts IPC but is throttled by intra-bundle deps.
- OoO unlocks IPC by letting independent instructions overtake — at the cost of new stages: Rename, Dispatch, Issue, Commit.
- Renaming kills WAR & WAW; RAW survives as the only real data dependency.
- ROB keeps execution out-of-order but commit in-order, preserving the illusion of sequentiality.
- Performance bottlenecks: physical register count, ROB depth, issue width — whichever runs out first.
Microarchitecture/Advanced_Microarchitecture_Notes.pdf — §1 overview, §2 front-end + renaming, §3 ROB, §5 summary table. Branch Prediction.pptx.pdf for prediction worked CPIs.