Chapter — Advanced microarchitecture

From single-issue pipelining to multicycle pipelined superscalar — first in-order, then out-of-order with register renaming. Each step adds new pipeline stages and new hazards to resolve.

Prereq: Comfortable with the 5-stage in-order pipeline, hazards, and CPI math from the Microarchitecture chapter. We build directly on top of it.

Where the in-order pipeline runs out of road
Deeper pipelines & µops
1. Adding stages — frequency vs. mispredict cost
2. Cracking instructions into µops
Branch prediction (so we can afford a deeper pipe)
Superscalar — issuing more than one per cycle
Out-of-order — letting independent instructions overtake
Register renaming — the hazard solver
The Reorder Buffer (ROB) — keeping order at commit
OoO performance calculations

1. Where the in-order pipeline runs out of road

The 5-stage in-order pipe can do no better than CPI = 1. To go faster, we have three knobs.

Knob	Move	New problem it creates
Shorten T_c	More stages (deeper pipeline)	Higher mispredict penalty
Crack instructions	Decode to µops (smaller, simpler ops)	More µops per instruction
Issue more per cycle	Superscalar (N-wide)	More dependencies to resolve

Roadmap

Section 2 attacks depth and cracking, and Section 3 covers the branch prediction that a deeper pipe depends on. Section 4 attacks issue width (in-order). Sections 5-7 introduce out-of-order to work around the data-hazard stalls that block superscalar IPC. Section 8 brings the calculations together.

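2. Deeper pipelines & µops

2.1 Adding stages

Splitting "EX" into two shorter EX stages roughly halves that stage's combinational delay. T_c is set by the slowest stage, so only if every stage is split the same way does T_c roughly halve and frequency roughly double (pipeline-register overhead keeps it short of an exact 2×).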

Pipeline	Stages	Mispredict penalty
5-stage classic RISC	IF · ID · EX · MEM · WB	2 flushes
Intel Pentium 4 "NetBurst"	20 (Willamette) to 31 (Prescott) stages	~20 flushes
Modern x86 (Skylake, Zen)	~14-19 stages	~12-15 flushes

Trade-off: Doubling depth at best doubles frequency, but it also roughly doubles the cost of a wrong branch. That's why deeper pipes require good branch prediction (§3) to be worth it.
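The trade-off can be put into numbers with a short sketch. The workload figures (20% branches, 10% of them mispredicted), the cycle times, and the function name are illustrative assumptions, not values from these notes.

```python
def time_per_instruction(t_cycle_ns, branch_fraction, mispredict_rate, flush_cycles):
    """Average time per instruction when branch flushes are the only stalls."""
    cpi = 1.0 + branch_fraction * mispredict_rate * flush_cycles
    return t_cycle_ns * cpi

# Assumed workload: 20% of instructions are branches, 10% of those mispredict.
shallow = time_per_instruction(1.0, 0.20, 0.10, flush_cycles=2)  # 5-stage style
deep = time_per_instruction(0.5, 0.20, 0.10, flush_cycles=4)     # 2x depth, half T_c

print(f"shallow: {shallow:.3f} ns/inst")
print(f"deep:    {deep:.3f} ns/inst")
print(f"speedup: {shallow / deep:.2f}x (an ideal doubling would be 2.00x)")
```

Raising `mispredict_rate` in this model widens the gap between the actual and the ideal speedup, which is the argument for pairing depth with a better predictor.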

2.2 Cracking instructions into µops

Definition

µop (micro-operation): a simpler internal operation that the backend can schedule. Decode splits one ISA instruction into one or more µops.

x86: add  [rax+8], rbx     ; one ISA instruction
                              ↓ decode
     µop1: load tmp, [rax+8]
     µop2: add  tmp, tmp, rbx
     µop3: store [rax+8], tmp

The pipeline now schedules µops, not ISA instructions. RISC-V needs fewer cracks (most ops are already simple), but even RISC cores may crack some instructions into µops (e.g. a load with an immediate offset) or fuse a branch with a neighbouring instruction into a single µop.

Why it matters

When a question asks about IPC, ask: ISA-instruction IPC or µop IPC? They can differ by 2-3×. Performance counters can count either retired instructions or retired µops, so check which one a reported IPC is based on.
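The conversion between the two IPC figures is a single division. A minimal sketch, where the function name and the example numbers are assumptions chosen only for illustration:

```python
def isa_ipc_from_uop_ipc(uop_ipc, uops_per_instruction):
    """Convert a µop-level IPC into an ISA-instruction IPC."""
    return uop_ipc / uops_per_instruction

# Assumed example: 3.0 µops retire per cycle, and the instruction mix
# cracks into 1.5 µops per ISA instruction on average.
print(isa_ipc_from_uop_ipc(3.0, 1.5))
```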

3. Branch prediction

If a deeper pipe makes mispredictions expensive, we'd better mispredict less often. Two strategies: static (cheap, dumb) and dynamic (smarter, hardware-tracked).

3.1 Static vs dynamic

	Static	Dynamic
Rule	Compile-time guess (e.g. "backwards taken")	Per-branch state table updated at runtime
Hardware	None — just the decoder	Branch Target Buffer (BTB) + state bits
Typical accuracy	60-70%	85-99%

3.2 1-bit predictor

Each branch stores one bit: "predict taken" (1) or "predict not taken" (0). Flip on a wrong guess.

State diagram: two states, Predict TAKEN (1) and Predict NOT TAKEN (0); each wrong guess moves to the other state.

Loop tax: A tight loop that exits once mispredicts twice per call: once on the first iteration (the bit still says "not taken" from the previous exit) and once on the exit itself (which flips the bit back). Bad for nested loops, where the inner loop is entered again and again.
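The loop tax can be reproduced with a few lines of simulation. This is a minimal sketch: the function name, the boolean encoding (True = taken), and the initial "not taken" state are assumptions made for the example.

```python
def one_bit_mispredicts(outcomes, state=False):
    """Count mispredictions of a 1-bit predictor.

    outcomes: list of booleans, True = branch taken.
    state:    current prediction bit, True = predict taken.
    Returns (mispredict_count, final_state).
    """
    wrong = 0
    for taken in outcomes:
        if state != taken:
            wrong += 1
        state = taken  # 1-bit rule: the bit always remembers the last outcome
    return wrong, state

# Loop branch of a 7-iteration do-while: taken 6 times, then falls through.
one_call = [True] * 6 + [False]

state = False  # assumed initial state: predict not taken
for call in range(3):
    wrong, state = one_bit_mispredicts(one_call, state)
    print(f"call {call + 1}: {wrong} mispredicts")
```

Every call pays the same two mispredicts, because the exit always leaves the bit pointing the wrong way for the next entry.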

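3.3 2-bit saturating predictor

Two bits, four states. From a strong state it takes two consecutive wrong guesses to flip the predicted direction; one wrong guess only weakens it.

State diagram: Strongly TAKEN (11) ↔ Weakly TAKEN (10) ↔ Weakly NOT TAKEN (01) ↔ Strongly NOT TAKEN (00). A taken branch moves one step toward 11; a not-taken branch moves one step toward 00.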

Hysteresis

A single loop exit doesn't destabilise the predictor. The exit only drops the state from Strongly Taken to Weakly Taken, and on the next call the first correct guess returns it to Strongly Taken. Result: 1 mispredict per call in steady state instead of 2.
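The hysteresis shows up clearly in simulation. A minimal sketch, assuming the usual saturating-counter encoding (0 = Strongly NT, 1 = Weakly NT, 2 = Weakly T, 3 = Strongly T); the function name and the choice of a 7-iteration loop are illustrative.

```python
def two_bit_mispredicts(outcomes, counter=1):
    """Count mispredictions of a 2-bit saturating counter.

    counter: 0 = strongly NT, 1 = weakly NT, 2 = weakly T, 3 = strongly T.
    Returns (mispredict_count, final_counter).
    """
    wrong = 0
    for taken in outcomes:
        predicted_taken = counter >= 2
        if predicted_taken != taken:
            wrong += 1
        # Saturating update: step toward 3 on taken, toward 0 on not taken.
        counter = min(counter + 1, 3) if taken else max(counter - 1, 0)
    return wrong, counter

one_call = [True] * 6 + [False]  # loop branch: taken 6 times, then exit

counter = 1  # assumed cold start in Weakly NT
for call in range(3):
    wrong, counter = two_bit_mispredicts(one_call, counter)
    print(f"call {call + 1}: {wrong} mispredicts, counter ends at {counter}")
```

The first call pays an extra warm-up mispredict from the cold Weakly NT start; later calls start in Weakly Taken and only miss on the exit.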

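3.4 Worked example: 7-iteration do-while loop

The loop branch is taken 6 times and falls through once per call. A correct prediction costs 1 cycle; a mispredict costs 3 (1 plus the 2 flushed slots).

Predictor	Mispredicts (per call)	Branch CPI
None (always not taken)	6 of 7	1/7·1 + 6/7·3 = 2.71
1-bit	2 of 7	5/7·1 + 2/7·3 = 1.57
2-bit (steady state, Weakly T start)	1 of 7	6/7·1 + 1/7·3 = 1.29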
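The branch CPI column is a weighted average, which a short helper makes explicit. This is a sketch; the function name is an assumption, and the 1-cycle and 3-cycle costs are the ones used in the table's formulas.

```python
from fractions import Fraction

def branch_cpi(iterations, mispredicts, correct_cost=1, mispredict_cost=3):
    """Average cycles per execution of the loop branch, as an exact fraction."""
    correct = iterations - mispredicts
    total_cycles = correct * correct_cost + mispredicts * mispredict_cost
    return Fraction(total_cycles, iterations)

for label, wrong in [("always not taken", 6), ("1-bit", 2), ("2-bit, one miss", 1)]:
    cpi = branch_cpi(7, wrong)
    print(f"{label}: {cpi} = {float(cpi):.2f}")
```

Keeping the result as a fraction avoids rounding drift when several of these terms are later combined into an overall CPI.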

3.4.1 GQ Q9: 100-iteration loop with 2-bit predictor

addi x2, x0, 100
loop:
    addi x2, x2, -1
    bne  x2, x0, loop   # taken 99 times, not-taken on iter 100

Starting state = 01 (Weakly NT). Trace:

Iter	State in	Predict	Actual	Wrong?	State out
1	01	NT	T	YES	10
2	10	T	T	—	11
3-99	11	T	T	—	11
100	11	T	NT	YES	10

✓ Answer: 2 mispredictions.

4. Superscalar: issuing more than one per cycle

CPI = 1 is the ceiling for a single-issue pipe. To go below 1 (i.e. IPC > 1), fetch and execute multiple instructions per cycle. Each pipeline stage becomes N-wide.

4.1 In-order superscalar: same pipe, doubled lanes

An N-wide in-order superscalar fetches N instructions per cycle and feeds them through N parallel pipelines that still preserve program order.

2-wide in-order superscalar — two instructions advance through every stage each cycle.

4.2 The new constraints

Intra-bundle hazards: the two instructions issued together can't share a structural unit (only one load-store unit, etc.) and can't depend on each other (the producer wouldn't have finished yet).
RAW with the previous bundle: if instruction N+1 needs the result of N from the previous cycle, it stalls at Issue (IS) until the value is ready.
In-order: instruction N+1 cannot pass N even if N+1's operands are ready and N's aren't. That's the killer.
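The intra-bundle checks above can be sketched as a pairing test. Everything here is a simplified assumption for illustration: the `Inst` record, the unit names, and the hypothetical machine with two ALUs and a single load-store unit. Only the two checks named in the list (shared unit, producer/consumer pair) are modelled.

```python
from collections import namedtuple

# name: text for display, unit: functional unit type, dest/srcs: register names
Inst = namedtuple("Inst", "name unit dest srcs")

def can_dual_issue(first, second, units_available):
    """Return True if `first` (older) and `second` (younger) may issue together."""
    # Structural check: both want the same unit type and there is only one.
    if first.unit == second.unit and units_available.get(first.unit, 0) < 2:
        return False
    # Data check: the younger instruction reads what the older one writes.
    if first.dest is not None and first.dest in second.srcs:
        return False
    return True

units = {"alu": 2, "lsu": 1}  # hypothetical machine

lw_a = Inst("lw x1, 0(x10)", "lsu", "x1", ("x10",))
lw_b = Inst("lw x2, 4(x10)", "lsu", "x2", ("x10",))
add_dep = Inst("add x3, x1, x2", "alu", "x3", ("x1", "x2"))
add_ind = Inst("add x6, x7, x5", "alu", "x6", ("x7", "x5"))

print(can_dual_issue(lw_a, lw_b, units))        # two loads, one LSU
print(can_dual_issue(lw_a, add_dep, units))     # add needs x1 from the load
print(can_dual_issue(lw_a, add_ind, units))     # different units, no dependency
print(can_dual_issue(add_dep, add_ind, units))  # two ALUs, no dependency
```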

4.3 Worked example: 2-wide IO superscalar IPC

Latencies: ALU = 1 cycle, MUL = 3 (pipelined), LW = 2 cycles.

I1: lw  x1, 0(x10)
I2: lw  x2, 4(x10)
I3: add x3, x1, x2     ← needs x1, x2
I4: mul x4, x3, x5     ← needs x3
I5: add x6, x4, x2     ← needs x4
I6: add x3, x7, x5     ← independent

The pipeline can issue two instructions per cycle but must stall when a dependency isn't ready. Counting IF and ID once (cycles 1-2) and then each instruction's execute latency:

Cyc	Lane 0	Lane 1
1	I1 IF	I2 IF
2	I1 ID	I2 ID
3-4	I1 LW (2 cycles)	I2 LW (2 cycles)
5	I3 ADD (x1, x2 ready)	I4 stalls (waits for x3)
6-8	I4 MUL (3 cycles)	I5 stalls (waits for x4)
9	I5 ADD	I6 ADD (could not pass I5)

Six instructions in 9 cycles ⇒ IPC = 6/9 ≈ 0.67. (Matches GQ Q7.)
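The 9-cycle total can be checked with a small scheduler model. This is a sketch under assumptions that the notes do not spell out: IF and ID occupy cycles 1 and 2, execution starts in cycle 3, a result is usable the cycle after its latency elapses, and both loads can execute in the same cycle. The function and variable names are illustrative.

```python
LATENCY = {"lw": 2, "add": 1, "mul": 3}

# (opcode, destination, source registers) for I1..I6
PROGRAM = [
    ("lw",  "x1", ["x10"]),
    ("lw",  "x2", ["x10"]),
    ("add", "x3", ["x1", "x2"]),
    ("mul", "x4", ["x3", "x5"]),
    ("add", "x6", ["x4", "x2"]),
    ("add", "x3", ["x7", "x5"]),
]

def in_order_schedule(program, width=2, front_end_cycles=2):
    """Return a list of (start_cycle, done_cycle) per instruction."""
    ready_at = {}         # register -> first cycle its value can be consumed
    issued_in_cycle = {}  # cycle -> how many instructions began executing
    prev_start = 0
    schedule = []
    for index, (op, dest, srcs) in enumerate(program):
        # Earliest cycle allowed by fetch/decode bandwidth.
        start = front_end_cycles + 1 + index // width
        # In-order rule: never start before the previous instruction.
        start = max(start, prev_start)
        # RAW: wait until every source operand is available.
        for src in srcs:
            start = max(start, ready_at.get(src, 0))
        # Issue-width limit.
        while issued_in_cycle.get(start, 0) >= width:
            start += 1
        issued_in_cycle[start] = issued_in_cycle.get(start, 0) + 1
        done = start + LATENCY[op] - 1
        ready_at[dest] = done + 1
        prev_start = start
        schedule.append((start, done))
    return schedule

schedule = in_order_schedule(PROGRAM)
total_cycles = max(done for _, done in schedule)
for number, (start, done) in enumerate(schedule, start=1):
    print(f"I{number}: executes in cycles {start}-{done}")
print(f"IPC = {len(PROGRAM)}/{total_cycles} = {len(PROGRAM) / total_cycles:.2f}")
```

In this model I6 has its operands from the start but still waits until cycle 9, because the in-order rule keeps it behind I5.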

5. Out-of-order: letting independent instructions overtake

The in-order rule wastes the back-end whenever I_k stalls — even if I_k+1 is ready to go. OoO lifts that rule, but at the cost of new pipeline machinery.

5.1 The three sections of an OoO core

Section	Order	Speculative?	Key hardware
Front-End	In-Order	Yes	Branch predictor, Renamer
Execution Engine	Out-of-Order	Yes	Issue queue, Physical RegFile, FUs
Back-End	In-Order	No (at commit)	Reorder Buffer (ROB)

The trick

Front-end and back-end keep program order alive (so exceptions and architectural state look sequential). The middle is free to reorder for performance. The illusion of sequential execution is preserved at commit, not during execution.

5.2 New pipeline stages introduced

The 5-stage RISC pipeline expands into something like:

Fetch → Decode → Rename → Dispatch → Issue → Execute → Writeback → Commit

In-order → OoO → In-order. New stages: Rename, Dispatch, Issue (replaces in-order issue), and Commit (separate from writeback).

5.3 Why WAR & WAW return, and renaming kills them

In our 5-stage in-order pipe, WAR and WAW were silent because instructions read and wrote registers in program order. The moment we let instructions complete out of order, a younger instruction can finish writing x1 before an older one has read it. That's a WAR. If both write x1 and the older write lands last, x1 is left holding the stale value. That's a WAW.

Hazard	Pattern	True dep?	Fixed by renaming?
RAW	read after write — real data flow	YES	NO — survives
WAR	write after read — name collision	NO — just naming	YES
WAW	write after write — name collision	NO — just naming	YES
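The three patterns in the table reduce to set-membership checks on register names. A minimal sketch, using instructions from the §4.3 example; the function name and the (dest, srcs) tuple encoding are assumptions for illustration.

```python
def classify_hazards(older, younger):
    """older, younger: (dest, srcs) pairs in program order. Returns a set of names."""
    older_dest, older_srcs = older
    younger_dest, younger_srcs = younger
    hazards = set()
    if older_dest is not None and older_dest in younger_srcs:
        hazards.add("RAW")  # younger reads what older writes: true data flow
    if younger_dest is not None and younger_dest in older_srcs:
        hazards.add("WAR")  # younger overwrites a register older still reads
    if older_dest is not None and older_dest == younger_dest:
        hazards.add("WAW")  # both write the same architectural register
    return hazards

i3 = ("x3", ["x1", "x2"])  # add x3, x1, x2
i4 = ("x4", ["x3", "x5"])  # mul x4, x3, x5
i6 = ("x3", ["x7", "x5"])  # add x3, x7, x5

print(classify_hazards(i3, i4))  # I4 consumes I3's result
print(classify_hazards(i4, i6))  # I6 rewrites x3, which I4 reads
print(classify_hazards(i3, i6))  # I3 and I6 both write x3
```

Only the first pair reflects real data flow; the other two disappear once I6 is given a different destination register, which is what renaming does.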

6. Register renaming: the hazard solver

6.1 Architectural vs physical registers

Definitions

Architectural registers (ARF): the ones the ISA defines — x0..x31. What assembly code names.
Physical registers (PRF): a larger pool inside the hardware — p0..p63 or more. What the OoO engine actually writes to.
Rename table: maps each architectural reg → the physreg holding its current value.

6.2 Renaming walkthrough

Start map: x1→p1, x2→p2, x3→p3, x5→p5, x10→p10.

Bundle A — cycle 1

lw x1, 0(x10)   →  lw p20, 0(p10)     ; allocate new physreg p20 for x1
lw x2, 4(x10)   →  lw p21, 4(p10)     ; allocate new physreg p21 for x2

Rename table after bundle A: x1→p20, x2→p21, others unchanged.

Bundle B — cycle 2

add x3, x1, x2  →  add p22, p20, p21  ; reads the NEW p20/p21
mul x4, x3, x5  →  mul p23, p22, p5   ; chains to p22 (true RAW)

Final map: x1→p20, x2→p21, x3→p22, x4→p23; x5→p5 and x10→p10 are unchanged.
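The walkthrough follows a mechanical rule: translate the sources through the current map, then give the destination the next free physical register. A minimal sketch, assuming a free list that hands out p20, p21, ... in order; the function name and tuple encoding are illustrative, and immediates are left out.

```python
def rename(instructions, rename_table, free_list):
    """Rename (op, dest, srcs) tuples; returns (renamed_instructions, final_table)."""
    table = dict(rename_table)  # work on copies so the inputs stay untouched
    free = list(free_list)
    renamed = []
    for op, dest, srcs in instructions:
        # Read sources first, so an instruction like add x5, x5, x3
        # would see the old mapping of x5 before it is replaced.
        phys_srcs = [table[src] for src in srcs]
        phys_dest = free.pop(0)  # fresh physical register on every write
        table[dest] = phys_dest
        renamed.append((op, phys_dest, phys_srcs))
    return renamed, table

start_map = {"x1": "p1", "x2": "p2", "x3": "p3", "x5": "p5", "x10": "p10"}
free_regs = ["p20", "p21", "p22", "p23"]

program = [
    ("lw",  "x1", ["x10"]),
    ("lw",  "x2", ["x10"]),
    ("add", "x3", ["x1", "x2"]),
    ("mul", "x4", ["x3", "x5"]),
]

renamed, final_map = rename(program, start_map, free_regs)
for op, dest, srcs in renamed:
    print(op, dest, ", ".join(srcs))
print(final_map)
```

A real renamer also returns physical registers to the free list at commit and checkpoints the table for recovery; neither is modelled here.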

6.3 Why RAW still survives

In the example above, mul reads p22 which add just wrote. That's a real data dependency — mul can't execute until add's result lands in p22. Renaming gave us fresh names, but it can't invent data out of thin air.

One-line summary

Renaming = fresh physical register on every write. Kills WAR + WAW. RAW is a true dependency and cannot be removed by renaming: the consumer has to wait for the producer, and forwarding from the result bus can only shorten that wait.

7. The Reorder Buffer (ROB)

7.1 How it works

When the front-end dispatches an instruction, it allocates a new ROB entry at the tail, in program order.
The instruction executes whenever its operands are ready (OoO). Its result lands in its physical register and a "done" flag flips in its ROB entry.
The head of the ROB is the oldest in-flight instruction. Each cycle, if the head is done, it commits (retires) — its architectural state becomes visible.
If the head is a mispredicted branch or an exception, the entire ROB after the head is flushed: physregs are freed, the rename map is rolled back, fetch restarts.
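The four steps above can be sketched as a toy data structure. This is a simplified model, not a description of any particular core: the class and method names are assumptions, and freeing physical registers and rolling back the rename map on a flush are not modelled.

```python
from collections import deque

class ReorderBuffer:
    """Toy ROB: in-order allocate, out-of-order completion, in-order commit."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = deque()  # leftmost entry is the head (oldest instruction)

    def allocate(self, name):
        """Add an instruction at the tail; return False if the ROB is full."""
        if len(self.entries) >= self.capacity:
            return False
        self.entries.append({"name": name, "done": False})
        return True

    def mark_done(self, name):
        """Execution finished (possibly out of order): set the done flag."""
        for entry in self.entries:
            if entry["name"] == name:
                entry["done"] = True
                return
        raise KeyError(name)

    def commit(self, width=1):
        """Retire up to `width` finished instructions from the head, in order."""
        retired = []
        while self.entries and self.entries[0]["done"] and len(retired) < width:
            retired.append(self.entries.popleft()["name"])
        return retired

    def flush_younger_than_head(self):
        """Squash every entry after the head; return the squashed names."""
        squashed = []
        while len(self.entries) > 1:
            squashed.append(self.entries.pop()["name"])
        return squashed[::-1]


rob = ReorderBuffer(capacity=4)
for name in ["I1", "I2", "I3", "I4"]:
    rob.allocate(name)

rob.mark_done("I3")         # I3 finishes first, out of order
print(rob.commit(width=2))  # head I1 is not done, so nothing retires
rob.mark_done("I1")
print(rob.commit(width=2))  # only I1 retires; unfinished I2 blocks I3
rob.mark_done("I2")
print(rob.commit(width=2))  # I2 and I3 retire together, in program order
```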

7.2 When the ROB stalls issue

ROB-stall formula: Cycles until the front-end stalls = ROB_entries / issue_width
(Worst case: assumes no instruction commits in the meantime.)

The dispatch stage can't allocate a new ROB entry when the buffer is full. With 96 entries and 4-wide issue: 96/4 = 24 cycles. After that, every cycle has to wait for a commit at the head to free a slot.

7.3 Mispredict / exception recovery

Because the ROB commits in order, the architectural state is always a correct prefix of the program. On any speculative failure (wrong branch, page fault, divide-by-zero), the ROB just throws away everything past the offending instruction. The hardware looks like nothing speculative ever happened.

8. OoO performance calculations

8.1 2-wide pipe IPC with RAW stalls

Same 6 instructions, same latencies as §4.3, but now OoO with renaming. Independent instructions can execute as soon as their operands are ready.

I1: lw  x1, 0(x10)
I2: lw  x2, 4(x10)
I3: add x3, x1, x2
I4: mul x4, x3, x5
I5: add x6, x4, x2
I6: add x3, x7, x5     ← totally independent — can run early

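With renaming, I6 writes a fresh physreg (no WAW with I3), so the OoO scheduler can issue I6 alongside the I1-I5 chain instead of behind it. The RAW chain I1/I2 → I3 → I4 → I5 still needs 2 + 1 + 3 + 1 = 7 execute cycles, so this short snippet cannot finish much sooner on its own; the gain comes from filling the chain's stall slots with I6 and any independent instructions that follow, which raises IPC on longer instruction streams.

For GQ Q7 specifically, the in-order assumption is fixed, so use the §4.3 result.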
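A dataflow-style model shows where I6 lands once the in-order rule is lifted. This is a sketch under assumed timing: IF and ID occupy cycles 1 and 2, at most two instructions begin executing per cycle, older instructions get first pick of a slot, and renaming is perfect so only RAW dependencies delay an instruction. All names are illustrative.

```python
LATENCY = {"lw": 2, "add": 1, "mul": 3}

# (name, opcode, destination, source registers)
PROGRAM = [
    ("I1", "lw",  "x1", ["x10"]),
    ("I2", "lw",  "x2", ["x10"]),
    ("I3", "add", "x3", ["x1", "x2"]),
    ("I4", "mul", "x4", ["x3", "x5"]),
    ("I5", "add", "x6", ["x4", "x2"]),
    ("I6", "add", "x3", ["x7", "x5"]),
]

def ooo_schedule(program, width=2, front_end_cycles=2):
    """Return (start, done) dicts mapping instruction name to cycle numbers."""
    last_writer = {}  # architectural register -> newest older instruction writing it
    start, done, issued = {}, {}, {}
    for index, (name, op, dest, srcs) in enumerate(program):
        # Fetch/decode bandwidth sets the earliest possible cycle.
        cycle = front_end_cycles + 1 + index // width
        # Only true (RAW) dependencies delay the instruction.
        for src in srcs:
            if src in last_writer:
                cycle = max(cycle, done[last_writer[src]] + 1)
        # At most `width` instructions may begin executing per cycle.
        while issued.get(cycle, 0) >= width:
            cycle += 1
        issued[cycle] = issued.get(cycle, 0) + 1
        start[name] = cycle
        done[name] = cycle + LATENCY[op] - 1
        last_writer[dest] = name  # later readers of dest depend on this writer
    return start, done

start, done = ooo_schedule(PROGRAM)
for name in start:
    print(f"{name}: executes in cycles {start[name]}-{done[name]}")
print("last cycle:", max(done.values()))
```

Under these assumptions I6 executes alongside I3 rather than waiting for I5, while the last cycle is still set by the I1 → I3 → I4 → I5 chain. The slot I6 vacates at the end is what a longer instruction stream would use for further independent work.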

8.2 Distinct physical registers for a renamed sequence

add x5, x1, x2     ; x5 → p_new_1   (1st write to x5 since initial map)
add x5, x5, x3     ; x5 → p_new_2
mul x7, x5, x4     ; reads p_new_2
add x5, x6, x8     ; x5 → p_new_3
sub x9, x5, x10    ; reads p_new_3

Count of new physregs claimed by x5 writes = 3. (The writes to x7 and x9 each take a fresh physreg too; they are not part of this count.)

✓ GQ answer: 3.

8.3 Cycles until the ROB blocks issue

Variant A — only ROB constraint

4-wide processor, 96-entry ROB: 96 / 4 = 24 cycles.

Variant B — PRF can be tighter than ROB

If PRF = 64 and ARF = 32 ⇒ only 64 - 32 = 32 free physregs available (the other 32 hold the current architectural values). At 4 writes/cycle: 32 / 4 = 8 cycles. Whichever resource is scarcest stalls first.

GQ Q8 — 3 inst/cycle

96-entry ROB at 3 inst/cycle = 96 / 3 = 32 cycles.
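All three variants are the same calculation: divide each resource by the rate at which it is consumed and take the smallest result. A minimal sketch, where the function name is an assumption and, as in the text, nothing commits in the meantime and every instruction in Variant B writes one register:

```python
def cycles_until_stall(rob_entries, width, free_physregs=None):
    """Cycles of full-width dispatch before the scarcest resource runs out."""
    limits = [rob_entries // width]
    if free_physregs is not None:
        limits.append(free_physregs // width)
    return min(limits)

print(cycles_until_stall(96, 4))                         # Variant A
print(cycles_until_stall(96, 4, free_physregs=64 - 32))  # Variant B
print(cycles_until_stall(96, 3))                         # GQ Q8
```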

Chapter summary

Three ways to push past the single-issue in-order pipe: deeper pipe (shorter cycle), wider issue, OoO.
Deeper pipes need better branch prediction to be worth it (2-bit ≥ 1-bit).
µops let the backend schedule simpler operations than the ISA exposes.
In-order superscalar boosts IPC but is throttled by intra-bundle deps and by the rule that no instruction may pass a stalled older one.
OoO unlocks IPC by letting independent instructions overtake — at the cost of new stages: Rename, Dispatch, Issue, Commit.
Renaming kills WAR & WAW; RAW survives as the only real data dependency.
ROB keeps execution out-of-order but commit in-order, preserving the illusion of sequentiality.
Performance bottlenecks: physical register count, ROB depth, issue width — whichever runs out first.

Source Microarchitecture/Advanced_Microarchitecture_Notes.pdf — §1 overview, §2 front-end + renaming, §3 ROB, §5 summary table. Branch Prediction.pptx.pdf for prediction worked CPIs.

