Chapter — Advanced microarchitecture

From single-issue pipelining to multicycle pipelined superscalar — first in-order, then out-of-order with register renaming. Each step adds new pipeline stages and new hazards to resolve.

Prereq: Comfortable with the 5-stage in-order pipeline, hazards, and CPI math from the Microarchitecture chapter. We build directly on top of it.

Contents

  1. Where the in-order pipeline runs out of road
  2. Deeper pipelines & µops
    1. Adding stages — frequency vs. mispredict cost
    2. Cracking instructions into µops
  3. Branch prediction (so we can afford a deeper pipe)
    1. Static vs dynamic
    2. 1-bit predictor — and its loop tax
    3. 2-bit saturating predictor — hysteresis fixes it
    4. Branch CPI & misprediction counts (worked)
  4. Superscalar — issuing more than one per cycle
    1. In-order superscalar: same pipe, doubled lanes
    2. New constraints when issuing multiple
    3. In-order superscalar IPC (worked)
  5. Out-of-order — letting independent instructions overtake
    1. The three sections of an OoO core
    2. New pipeline stages introduced
    3. WAR & WAW return — and how renaming kills them
  6. Register renaming — the hazard solver
    1. Architectural vs physical registers
    2. Renaming walkthrough
    3. Why RAW survives
  7. The Reorder Buffer (ROB) — keeping order at commit
    1. How the ROB works
    2. When the ROB stalls issue
    3. Mispredict / exception recovery
  8. OoO performance calculations
    1. IPC with RAW stalls in a 2-wide pipe
    2. Physical regs needed for a renamed sequence
    3. Cycles until the ROB blocks issue

1. Where the in-order pipeline runs out of road

The 5-stage in-order pipe peaks at CPI = 1. To go faster, we have three knobs.
Knob                 | Move                                  | New problem it creates
Shorten Tc           | More stages (deeper pipeline)         | Higher mispredict penalty
Crack instructions   | Decode to µops (smaller, simpler ops) | More µops per instruction
Issue more per cycle | Superscalar (N-wide)                  | More dependencies to resolve
Roadmap
Sections 2-3 cover deeper pipelines and µop cracking, plus the branch prediction that makes depth affordable. Section 4 attacks issue width (in-order). Sections 5-7 introduce out-of-order execution to dodge the data hazards that block superscalar IPC. Section 8 brings the calculations together.

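2. Deeper pipelines & µops

2.1 Adding stages

Splitting "EX" into two shorter EX stages roughly halves that stage's combinational delay. If EX was the slowest stage and every other stage is shortened to match, Tc roughly halves and frequency roughly doubles (pipeline-register overhead keeps the gain slightly below 2×).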

Pipeline                   | Stages                  | Mispredict penalty (flushed cycles)
5-stage classic RISC       | IF · ID · EX · MEM · WB | 2
Intel Pentium 4 "Netburst" | ~31 stages              | ~20
Modern x86 (Skylake, Zen)  | ~14-19 stages           | ~12-15
Trade-off: Depth buys frequency, but the mispredict penalty grows with it: if doubling the depth doubles the frequency, it also roughly doubles the cycles lost to a wrong branch. That's why deeper pipes require good branch prediction (§3) to be worth it.
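To put rough numbers on the trade-off, here is a minimal sketch. It assumes a base CPI of 1, that branch mispredictions are the only stalls, and two made-up machines; the clock periods, the 20% branch fraction and the mispredict rates are illustrative assumptions, not measurements.

```python
def time_per_instruction(tc_ns, branch_frac, mispredict_rate, penalty_cycles):
    """Average time per instruction = (1 + stall cycles per instruction) * clock period."""
    cpi = 1.0 + branch_frac * mispredict_rate * penalty_cycles
    return cpi * tc_ns

# Hypothetical machines running the same workload (20% branches).
SHALLOW = dict(tc_ns=1.00, penalty_cycles=2)    # 5-stage-like pipe
DEEP = dict(tc_ns=0.25, penalty_cycles=20)      # 4x the clock, 10x the penalty

for label, miss in [("static-like, 35% wrong", 0.35), ("dynamic-like, 5% wrong", 0.05)]:
    t_shallow = time_per_instruction(branch_frac=0.2, mispredict_rate=miss, **SHALLOW)
    t_deep = time_per_instruction(branch_frac=0.2, mispredict_rate=miss, **DEEP)
    print(f"{label}: deep pipe speedup = {t_shallow / t_deep:.2f}x")
```

Under these assumptions the deep pipe keeps about 1.9× of its ideal 4× with the poor predictor and about 3.4× with the good one.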

2.2 Cracking instructions into µops

Definition
µop (micro-operation): a simpler internal operation that the backend can schedule. Decode splits one ISA instruction into one or more µops.
x86: add  [rax+8], rbx     ; one ISA instruction
                              ↓ decode
     µop1: load tmp, [rax+8]
     µop2: add  tmp, tmp, rbx
     µop3: store [rax+8], tmp

The pipeline now schedules µops, not ISA instructions. RISC-V needs less cracking (most ops are already simple), but even RISC cores may crack some instructions, such as a load with an immediate offset, into µops, or fuse a branch with a neighbouring instruction into a single µop.

Why it matters
When the question asks about IPC, ask: ISA-instruction IPC or µop IPC? They can differ by 2-3×. Performance counters can count both retired instructions and retired µops, so check which one a reported IPC is based on.

3. Branch prediction

If a deeper pipe makes mispredictions expensive, we'd better mispredict less often. Two strategies — static (cheap, dumb) and dynamic (smarter, hardware-tracked).

3.1 Static vs dynamic

Property         | Static                                      | Dynamic
Rule             | Compile-time guess (e.g. "backwards taken") | Per-branch state table updated at runtime
Hardware         | None (just the decoder)                     | Branch Target Buffer (BTB) + state bits
Typical accuracy | 60-70%                                      | 85-99%

3.2 1-bit predictor

Each branch stores one bit: "predict taken" (1) or "predict not taken" (0). Flip on a wrong guess.

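[Figure: 1-bit predictor state diagram. Two states, Predict TAKEN (1) and Predict NOT TAKEN (0); a wrong guess moves the predictor to the other state.]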
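Loop tax: A tight loop that runs to completion mispredicts twice per call: once on the first iteration (the previous exit left the bit at "not taken") and once on the exit (the final not-taken branch flips the bit back). In nested loops the inner loop pays this on every outer iteration.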
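The loop tax can be checked with a few lines of simulation. This is a minimal sketch of a single 1-bit entry, assuming the bit starts at "not taken" and the loop is a 7-iteration do-while (branch taken 6 times, then not taken); the function name is made up for illustration.

```python
def run_1bit(outcomes, state):
    """Run one 1-bit predictor entry over branch outcomes (True = taken).

    Returns (mispredict_count, final_state).
    """
    wrong = 0
    for taken in outcomes:
        if state != taken:      # the prediction is simply the stored bit
            wrong += 1
            state = taken       # flip on a wrong guess
    return wrong, state

loop_call = [True] * 6 + [False]   # 7-iteration do-while loop
state = False                      # assumed initial state: predict not taken
for call in range(3):
    wrong, state = run_1bit(loop_call, state)
    print(f"call {call + 1}: {wrong} mispredicts")
```

Every call reports 2 mispredicts, because each exit leaves the bit pointing the wrong way for the next entry.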

3.3 2-bit saturating predictor

Two bits, four states. From a strong state it takes two consecutive wrong guesses to flip the predicted direction; a single wrong guess only weakens it.

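[Figure: 2-bit saturating predictor state diagram. Four states in a chain: Strongly TAKEN (11), Weakly TAKEN (10), Weakly NOT TAKEN (01), Strongly NOT TAKEN (00). A taken branch moves one step toward 11; a not-taken branch moves one step toward 00.]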
Hysteresis
A single loop-exit doesn't destabilise the predictor. The state walks to Weakly Taken, then next call it goes back to Strongly Taken on the first correct guess. Result: 1 mispredict per loop instead of 2.
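The hysteresis is easy to see in simulation. A minimal sketch of one 2-bit saturating counter, assuming the encoding 0 = Strongly NT, 1 = Weakly NT, 2 = Weakly T, 3 = Strongly T and a cold start in Weakly NT; the function name is illustrative only.

```python
def run_2bit(outcomes, state):
    """Run one 2-bit saturating counter over branch outcomes (True = taken).

    state: 0 = Strongly NT, 1 = Weakly NT, 2 = Weakly T, 3 = Strongly T.
    Returns (mispredict_count, final_state).
    """
    wrong = 0
    for taken in outcomes:
        predict_taken = state >= 2
        if predict_taken != taken:
            wrong += 1
        # Saturating update: step toward 3 on taken, toward 0 on not taken.
        state = min(state + 1, 3) if taken else max(state - 1, 0)
    return wrong, state

loop_call = [True] * 6 + [False]   # 7-iteration do-while loop
state = 1                          # assumed cold start: Weakly NT
for call in range(3):
    wrong, state = run_2bit(loop_call, state)
    print(f"call {call + 1}: {wrong} mispredicts, exits in state {state}")
```

The cold first call costs 2 mispredicts; every later call enters in Weakly Taken and costs only the exit mispredict. The same function run on 99 taken outcomes followed by one not-taken, from state 1, also reports 2.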

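3.4 Worked example: 7-iteration do-while loop

A correctly predicted branch costs 1 cycle; a mispredicted one costs 3 (1 + the 2-cycle flush of the 5-stage pipe).

Predictor                                   | Mispredicts (per call) | Branch CPI
None (always not taken)                     | 6 of 7                 | 1/7·1 + 6/7·3 = 2.71
1-bit                                       | 2 of 7                 | 5/7·1 + 2/7·3 = 1.57
2-bit (steady state, Weakly Taken at entry) | 1 of 7                 | 6/7·1 + 1/7·3 = 1.29

On a cold start from Weakly NT, the first call of the 2-bit predictor costs 2 mispredicts, as the trace in §3.4.1 shows.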
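The branch CPI column is a weighted average, which a short helper makes explicit. A minimal sketch using exact fractions; the 1-cycle and 3-cycle costs come from the table's formulas, and the function name is made up.

```python
from fractions import Fraction

def branch_cpi(mispredicts, branches, hit_cycles=1, miss_cycles=3):
    """Average cycles per branch: correct guesses cost hit_cycles, wrong ones miss_cycles."""
    correct = branches - mispredicts
    return Fraction(correct * hit_cycles + mispredicts * miss_cycles, branches)

for label, wrong in [("always not taken", 6), ("1-bit", 2), ("2-bit, 1 mispredict", 1)]:
    cpi = branch_cpi(wrong, 7)
    print(f"{label}: {cpi} = {float(cpi):.2f}")
```

The three rows come out as 19/7, 11/7 and 9/7.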

3.4.1 GQ Q9: 100-iteration loop with 2-bit predictor

addi x2, x0, 100
loop:
    addi x2, x2, -1
    bne  x2, x0, loop   # taken 99 times, not-taken on iter 100

Starting state = 01 (Weakly NT). Trace:

Iter | State in | Predict | Actual | Wrong? | State out
1    | 01       | NT      | T      | YES    | 10
2    | 10       | T       | T      | no     | 11
3-99 | 11       | T       | T      | no     | 11
100  | 11       | T       | NT     | YES    | 10

✓ Answer: 2 mispredictions.

4. Superscalar: issuing more than one per cycle

CPI = 1 is the floor for a single-issue pipe. To go below 1 (i.e. IPC > 1), fetch and execute multiple instructions per cycle. Each pipeline stage becomes N-wide.

4.1 In-order superscalar: same pipe, doubled lanes

An N-wide in-order superscalar fetches N instructions per cycle and feeds them through N parallel pipelines that still preserve program order.

[Figure: 2-wide in-order superscalar. Lane 0 and Lane 1 each run IF · ID · IS · EX · MEM · WB; two instructions advance through every stage each cycle.]

4.2 The new constraints

4.3 Worked example: 2-wide IO superscalar IPC

Latencies: ALU = 1 cycle, MUL = 3 (pipelined), LW = 2 cycles.

I1: lw  x1, 0(x10)
I2: lw  x2, 4(x10)
I3: add x3, x1, x2     ← needs x1, x2
I4: mul x4, x3, x5     ← needs x3
I5: add x6, x4, x2     ← needs x4
I6: add x3, x7, x5     ← independent

The pipeline can issue two instructions per cycle but must stall when a dependency isn't ready. Walking it out:

Cycle | Lane 0                | Lane 1
1     | I1 IF                 | I2 IF
2     | I1 ID                 | I2 ID
3     | I1 IS                 | I2 IS
4     | I1 EX                 | I2 EX
5     | I1 MEM                | I2 MEM
6     | I3 (waits, x1 ready)  | I4 (waits for I3)
…     | I3 finishes EX cycle 7, I4 starts EX cycle 8, MUL 3 cycles, …
9     | I5, I6 complete

Six instructions in 9 cycles ⇒ IPC = 6/9 ≈ 0.67. (Matches GQ Q7.)

5. Out-of-order: letting independent instructions overtake

The in-order rule wastes the back-end whenever I_k stalls, even if I_(k+1) is ready to go. OoO lifts that rule, but at the cost of new pipeline machinery.

5.1 The three sections of an OoO core

Section          | Order        | Speculative?   | Key hardware
Front-End        | In-Order     | Yes            | Branch predictor, Renamer
Execution Engine | Out-of-Order | Yes            | Issue queue, Physical RegFile, FUs
Back-End         | In-Order     | No (at commit) | Reorder Buffer (ROB)
The trick
Front-end and back-end keep program order alive (so exceptions and architectural state look sequential). The middle is free to reorder for performance. The illusion of sequential execution is preserved at commit, not during execution.

5.2 New pipeline stages introduced

The 5-stage RISC pipeline expands into something like:

Fetch (IF) → Decode (ID) → Rename (arch→phys) → Dispatch (ROB + IQ) → Issue (oldest-ready) → Execute (FUs) → Writeback → Commit (in-order)
In-order → OoO → In-order. New stages: Rename, Dispatch, Issue (replaces in-order issue), and Commit (separate from writeback).

5.3 Why WAR & WAW return, and how renaming kills them

In our 5-stage in-order pipe, WAR and WAW were silent because instructions read and wrote registers in program order. Once instructions can complete out of order, a younger instruction can write x1 before an older one has read it: that's a WAR hazard. If both write x1 and the younger write lands first, the older write then overwrites it: that's a WAW hazard.

Hazard | Pattern                            | True dep?        | Fixed by renaming?
RAW    | read after write (real data flow)  | YES              | NO (survives)
WAR    | write after read (name collision)  | NO (just naming) | YES
WAW    | write after write (name collision) | NO (just naming) | YES
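The three patterns reduce to simple set checks on register names. A minimal sketch, assuming each instruction is described as (dest, [sources]) and comparing just one older/younger pair; it ignores any intervening instruction that redefines a register, and the helper name is made up.

```python
def hazards(older, younger):
    """Classify register hazards between an older and a younger instruction.

    Each instruction is (dest, [sources]). Returns a set of hazard names.
    """
    o_dst, o_src = older
    y_dst, y_src = younger
    found = set()
    if o_dst in y_src:
        found.add("RAW")    # younger reads what older writes: true dependency
    if y_dst in o_src:
        found.add("WAR")    # younger overwrites a name older still has to read
    if y_dst == o_dst:
        found.add("WAW")    # both write the same name
    return found

I3 = ("x3", ["x1", "x2"])   # add x3, x1, x2
I4 = ("x4", ["x3", "x5"])   # mul x4, x3, x5
I6 = ("x3", ["x7", "x5"])   # add x3, x7, x5

print(hazards(I3, I4))      # {'RAW'}
print(hazards(I4, I6))      # {'WAR'}
print(hazards(I3, I6))      # {'WAW'}
```

Only the first pair carries data; the other two collide purely on the name x3, which is what renaming removes.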

6. Register renaming: the hazard solver

6.1 Architectural vs physical registers

Definitions
Architectural registers (ARF): the ones the ISA defines — x0..x31. What assembly code names.
Physical registers (PRF): a larger pool inside the hardware — p0..p63 or more. What the OoO engine actually writes to.
Rename table: maps each architectural reg → the physreg holding its current value.

6.2 Renaming walkthrough

Start map: x1→p1, x2→p2, x3→p3, x5→p5, x10→p10.

Bundle A — cycle 1

lw x1, 0(x10)   →  lw p20, 0(p10)     ; allocate new physreg p20 for x1
lw x2, 4(x10)   →  lw p21, 4(p10)     ; allocate new physreg p21 for x2

Rename table after bundle A: x1→p20, x2→p21, others unchanged.

Bundle B — cycle 2

add x3, x1, x2  →  add p22, p20, p21  ; reads the NEW p20/p21
mul x4, x3, x5  →  mul p23, p22, p5   ; chains to p22 (true RAW)

Final map: x1→p20, x2→p21, x3→p22, x4→p23.
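The walkthrough can be replayed with a small rename-table model. This is a minimal sketch, assuming a free list that simply hands out p20, p21, ... in order and instructions written as (op, dest, [source registers]) with immediates left out; the function name and tuple layout are illustrative.

```python
def rename(program, rename_map, next_free):
    """Rename architectural registers to physical ones.

    program: list of (op, dest, [sources]) using architectural names.
    rename_map is updated in place; returns the renamed instructions.
    """
    renamed = []
    for op, dest, srcs in program:
        # Sources are looked up BEFORE the destination is remapped,
        # so "add x5, x5, x3" would read the old x5.
        phys_srcs = [rename_map[s] for s in srcs]
        phys_dest = f"p{next_free}"      # fresh physical register on every write
        next_free += 1
        rename_map[dest] = phys_dest
        renamed.append((op, phys_dest, phys_srcs))
    return renamed

rename_map = {"x1": "p1", "x2": "p2", "x3": "p3", "x5": "p5", "x10": "p10"}
program = [
    ("lw", "x1", ["x10"]),
    ("lw", "x2", ["x10"]),
    ("add", "x3", ["x1", "x2"]),
    ("mul", "x4", ["x3", "x5"]),
]
for instr in rename(program, rename_map, next_free=20):
    print(instr)
print(rename_map)
```

The output reproduces the bundles above: the loads land in p20 and p21, add writes p22 from p20 and p21, and mul writes p23 from p22 and p5.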

6.3 Why RAW still survives

In the example above, mul reads p22 which add just wrote. That's a real data dependency — mul can't execute until add's result lands in p22. Renaming gave us fresh names, but it can't invent data out of thin air.

One-line summary
Renaming = fresh physical register on every write. Kills WAR + WAW. RAW is a true dependency and cannot be removed by renaming: the consumer has to wait for the producer, and forwarding from the result bus can only shorten that wait.

7. The Reorder Buffer (ROB)

7.1 How it works

  1. When the front-end dispatches an instruction, it allocates a new ROB entry at the tail, in program order.
  2. The instruction executes whenever its operands are ready (OoO). Its result lands in its physical register and a "done" flag flips in its ROB entry.
  3. The head of the ROB is the oldest in-flight instruction. Each cycle, if the head is done, it commits (retires) — its architectural state becomes visible.
  4. If the head is a mispredicted branch or an exception, the entire ROB after the head is flushed: physregs are freed, the rename map is rolled back, fetch restarts.
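Steps 1-3 above can be sketched as a small queue model. This is a simplification under stated assumptions: instruction names are unique, commit retires every finished instruction at the head in one call (no commit-width limit), and flush/recovery (step 4) is left out. The class and method names are made up for illustration.

```python
from collections import deque

class ReorderBuffer:
    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = deque()      # left = head (oldest), right = tail (youngest)

    def dispatch(self, name):
        """Allocate an entry at the tail; returns False if the ROB is full."""
        if len(self.entries) >= self.capacity:
            return False            # dispatch has to stall
        self.entries.append({"name": name, "done": False})
        return True

    def complete(self, name):
        """Mark an instruction as done; this can happen in any order."""
        for entry in self.entries:
            if entry["name"] == name:
                entry["done"] = True
                return

    def commit(self):
        """Retire finished instructions from the head, strictly in program order."""
        retired = []
        while self.entries and self.entries[0]["done"]:
            retired.append(self.entries.popleft()["name"])
        return retired

rob = ReorderBuffer(capacity=4)
for name in ["I1", "I2", "I3"]:
    rob.dispatch(name)
rob.complete("I3")          # youngest finishes first
print(rob.commit())         # [] because the head (I1) is not done yet
rob.complete("I1")
print(rob.commit())         # ['I1']
rob.complete("I2")
print(rob.commit())         # ['I2', 'I3']
```

I3 finished first but retired last, which is the in-order commit property the ROB exists to enforce.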

7.2 When the ROB stalls issue

ROB-stall formula: cycles until the front-end stalls = ROB_entries / issue_width
(Worst case, assuming no commits have happened yet.)

The dispatch stage can't allocate a new ROB entry when the buffer is full. With 96 entries and 4-wide issue: 96/4 = 24 cycles. After that, every cycle has to wait for a commit at the head to free a slot.

7.3 Mispredict / exception recovery

Because the ROB commits in order, the committed architectural state is always a correct prefix of the program. On any speculative failure (wrong branch, page fault, divide-by-zero), the ROB throws away everything younger than the offending instruction. Architecturally, it looks as if nothing speculative ever happened.

8. OoO performance calculations

8.1 2-wide pipe IPC with RAW stalls

Same 6 instructions, same latencies as §4.3, but now OoO with renaming. Independent instructions can execute as soon as their operands are ready.

I1: lw  x1, 0(x10)
I2: lw  x2, 4(x10)
I3: add x3, x1, x2
I4: mul x4, x3, x5
I5: add x6, x4, x2
I6: add x3, x7, x5     ← totally independent — can run early

With renaming, I6 writes a fresh physreg (no WAW with I3, no WAR with I4). The OoO scheduler can issue I6 in parallel with the I1-I5 chain instead of queueing it behind I5. With the right back-end this brings IPC closer to 1 even on a 2-wide machine.

For GQ Q7 specifically, the in-order assumption is fixed; OoO would do better.
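How far OoO can go on this snippet is bounded by its RAW chain. A minimal sketch of that bound, assuming only true dependencies constrain execution (unlimited issue width and functional units), counting execution latency only (no pipeline fill), and treating x5, x7 and x10 as ready at cycle 0. The tuple layout and function name are illustrative.

```python
LATENCY = {"lw": 2, "add": 1, "mul": 3}    # latencies from the worked example

program = [
    ("I1", "lw",  "x1", ["x10"]),
    ("I2", "lw",  "x2", ["x10"]),
    ("I3", "add", "x3", ["x1", "x2"]),
    ("I4", "mul", "x4", ["x3", "x5"]),
    ("I5", "add", "x6", ["x4", "x2"]),
    ("I6", "add", "x3", ["x7", "x5"]),
]

def finish_times(program):
    """Earliest finish cycle per instruction when only RAW dependencies matter."""
    ready = {}      # register -> cycle at which its newest value is available
    done = {}
    for name, op, dest, srcs in program:
        start = max((ready.get(s, 0) for s in srcs), default=0)
        done[name] = start + LATENCY[op]
        # Walking in program order, a later write only affects later readers,
        # which mirrors what renaming guarantees.
        ready[dest] = done[name]
    return done

done = finish_times(program)
print(done)
print("RAW-chain bound:", max(done.values()), "cycles")
```

The chain lw → add → mul → add needs 2 + 1 + 3 + 1 = 7 cycles, while I6 finishes after 1, so under these assumptions the six instructions cannot beat 6/7 ≈ 0.86 IPC no matter how wide the machine is.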

8.2 Distinct physical registers for a renamed sequence

add x5, x1, x2     ; x5 → p_new_1   (1st write to x5 since initial map)
add x5, x5, x3     ; x5 → p_new_2
mul x7, x5, x4     ; reads p_new_2
add x5, x6, x8     ; x5 → p_new_3
sub x9, x5, x10    ; reads p_new_3

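Count of new physregs claimed by x5 writes = 3 (one per write to x5). The writes to x7 and x9 each claim one more, so the whole sequence allocates 5.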

✓ GQ answer: 3.
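Since renaming allocates one fresh physical register per write, the count is just the number of writes to each architectural register. A minimal sketch, assuming the destination list is read off the five instructions above; the helper name is made up.

```python
from collections import Counter

def allocations_per_register(dests):
    """One fresh physreg per write, so allocations = writes per architectural register."""
    return Counter(dests)

# Destination register of each instruction, in program order.
dests = ["x5", "x5", "x7", "x5", "x9"]
counts = allocations_per_register(dests)
print("x5:", counts["x5"])                      # 3
print("whole sequence:", sum(counts.values()))  # 5
```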

8.3 Cycles until the ROB blocks issue

Variant A — only ROB constraint

4-wide processor, 96-entry ROB: 96 / 4 = 24 cycles.

Variant B — PRF can be tighter than ROB

If PRF = 64 and ARF = 32 ⇒ only 32 free physregs available. At 4 writes/cycle: 32 / 4 = 8 cycles. Whichever resource is scarcest stalls first.

GQ Q8 — 3 inst/cycle

96-entry ROB at 3 inst/cycle = 96 / 3 = 32 cycles.
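All three variants follow the same rule: divide each resource by the issue width and take the smallest. A minimal sketch, assuming no commits free anything, every instruction writes a register, and the sizes divide evenly (otherwise the result is rounded down); the function name and return shape are illustrative.

```python
def cycles_until_stall(issue_width, rob_entries, prf_size=None, arf_size=None):
    """Return (cycles, limiting_resource) before dispatch first runs out of a resource."""
    limits = {"ROB": rob_entries // issue_width}
    if prf_size is not None and arf_size is not None:
        free_physregs = prf_size - arf_size     # the rest hold architectural state
        limits["PRF"] = free_physregs // issue_width
    resource = min(limits, key=limits.get)      # scarcest resource stalls first
    return limits[resource], resource

print(cycles_until_stall(4, 96))                            # (24, 'ROB')  Variant A
print(cycles_until_stall(4, 96, prf_size=64, arf_size=32))  # (8, 'PRF')   Variant B
print(cycles_until_stall(3, 96))                            # (32, 'ROB')  GQ Q8
```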

Chapter summary
  1. Three ways to push past in-order CPI=1: deeper pipe, wider issue, OoO.
  2. Deeper pipes need better branch prediction to be worth it (2-bit ≥ 1-bit).
  3. µops let the backend schedule simpler operations than the ISA exposes.
  4. In-order superscalar boosts IPC but is throttled by intra-bundle deps.
  5. OoO unlocks IPC by letting independent instructions overtake — at the cost of new stages: Rename, Dispatch, Issue, Commit.
  6. Renaming kills WAR & WAW; RAW survives as the only real data dependency.
  7. ROB keeps execution out-of-order but commit in-order, preserving the illusion of sequentiality.
  8. Performance bottlenecks: physical register count, ROB depth, issue width — whichever runs out first.
Source Microarchitecture/Advanced_Microarchitecture_Notes.pdf — §1 overview, §2 front-end + renaming, §3 ROB, §5 summary table. Branch Prediction.pptx.pdf for prediction worked CPIs.