Chapter — Combined system performance

Putting it all together: the real CPI, IPC, and execution time of a processor when the pipeline and the memory hierarchy are both in the picture.

What this chapter is for
Each prior chapter computed one piece of the puzzle: the µArch chapter gave the pipeline stalls, and the Memory chapter gave the cache-miss stalls. Real processors pay both. This page combines them into the full equation and walks through a worked example.

Contents

  1. Why we need a combined model
  2. The effective CPI equation
    1. Per-stall contribution accounting
    2. From CPIeff to time & IPC
  3. End-to-end worked example
    1. Step 1 — pipeline CPI (from µArch ch.)
    2. Step 2 — memory penalty (from Memory ch.)
    3. Step 3 — combined effective CPI
    4. Step 4 — IPC & total time
  4. What to optimise — where is the bottleneck?
  5. Speedup analyses you'll be asked about
    1. Perfect cache speedup
    2. Better branch predictor speedup
    3. Doubling clock frequency

1. Why we need a combined model

If you only count pipeline stalls, you get a CPI like 1.23 for SPECINT, which is the textbook number. But no real program achieves that CPI, because memory stalls haven't been modelled.

If you only count memory stalls, you ignore the cost of branch mispredictions, load-use hazards, etc.

Big idea
Pipeline and memory penalties add. They're independent contributions to CPI. The real CPI is the sum (plus second-order interactions, which we usually neglect at this level).

2. The effective CPI equation

2.1 Per-stall contribution accounting

Start from the ideal pipelined CPI of 1, then add every average penalty each instruction type contributes:

General form
CPI_eff = 1
        + f_lw  × P(load-use stall) × 1
        + f_br  × P(mispredict) × 2
        + f_mem × MR_cache × Penalty_miss

Or — equivalently — start from the pipelined CPI you already computed and add only the memory term:

Compact form
CPI_eff = CPI_pipe + f_mem × MR × MissPenalty

Term               Where it comes from   Typical magnitude
1                  Ideal pipeline        1
Load-use penalty   µArch §6.3            ~0.1 (e.g. 0.25·0.4·1)
Branch mispredict  µArch §6.4 + Adv §3   ~0.1 (e.g. 0.13·0.5·2)
Memory stall       Memory §7             ~0.05–2.0 (huge range!)
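
As a minimal sketch of the two forms above, the Python below evaluates both and checks that they agree. The function and parameter names are illustrative assumptions, not part of the chapter. The load-use and branch numbers are the table's own examples, and the memory parameters (35% memory instructions, 4% miss rate, 100-cycle penalty) are the ones used in the worked example in Section 3.

```python
def cpi_general(f_lw, p_load_use, f_br, p_mispredict,
                f_mem, miss_rate, miss_penalty):
    """General form: ideal CPI of 1 plus each average stall per instruction."""
    load_use = f_lw * p_load_use * 1      # 1-cycle load-use stall
    branch = f_br * p_mispredict * 2      # 2-cycle mispredict penalty
    memory = f_mem * miss_rate * miss_penalty
    return 1 + load_use + branch + memory


def cpi_compact(cpi_pipe, f_mem, miss_rate, miss_penalty):
    """Compact form: precomputed pipeline CPI plus the memory stall term."""
    return cpi_pipe + f_mem * miss_rate * miss_penalty


general = cpi_general(0.25, 0.4, 0.13, 0.5, 0.35, 0.04, 100)
compact = cpi_compact(1.23, 0.35, 0.04, 100)
print(f"general = {general:.2f}, compact = {compact:.2f}")  # both 2.63
```

The two agree because CPI_pipe = 1.23 already contains the ideal 1 plus the 0.10 load-use and 0.13 branch contributions.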

2.2 From CPI_eff to time & IPC

Final timings
IPC_eff    = 1 / CPI_eff
Time       = N_instructions × CPI_eff × T_c
Throughput = IPC_eff × frequency
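
The three timing formulas can be sketched as small helpers. The function names and the machine below (CPI of 2.0, 500 ps clock, one billion instructions) are hypothetical values chosen only to make the arithmetic easy to check by hand.

```python
def ipc_eff(cpi_eff):
    """Instructions per cycle is the reciprocal of CPI."""
    return 1.0 / cpi_eff


def exec_time_s(n_instructions, cpi_eff, tc_s):
    """Time = N x CPI_eff x T_c, with T_c in seconds."""
    return n_instructions * cpi_eff * tc_s


def throughput_ips(cpi_eff, tc_s):
    """Throughput = IPC_eff x frequency, in instructions per second."""
    frequency_hz = 1.0 / tc_s
    return ipc_eff(cpi_eff) * frequency_hz


# Hypothetical machine: not the chapter's worked example.
n, cpi, tc = 1e9, 2.0, 500e-12
time_s = exec_time_s(n, cpi, tc)
rate = throughput_ips(cpi, tc)
print(f"IPC = {ipc_eff(cpi):.2f}, time = {time_s:.2f} s, throughput = {rate:.2e} instr/s")
```

A quick sanity check on any such calculation is that throughput multiplied by time gives back the instruction count.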

3. End-to-end worked example

Scenario
Pipelined 5-stage RISC-V, T_c = 350 ps (f = 2.86 GHz). SPECINT2000-style mix: 25% lw, 10% sw, 13% branches, 52% R-type. 100 billion instructions.

3.1 Step 1: pipeline CPI

From the µArch chapter (§7.3):

Type  Freq  CPI
lw    0.25  0.6·1 + 0.4·2 = 1.4
sw    0.10  1.0
br    0.13  0.5·1 + 0.5·3 = 2.0
R     0.52  1.0

CPI_pipe = 0.25·1.4 + 0.10·1.0 + 0.13·2.0 + 0.52·1.0 = 1.23
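
The weighted average in Step 1 can be reproduced with a short sketch. The dictionary layout and variable names are assumptions made for illustration; the frequencies and per-type CPIs are the ones in the table above.

```python
# Instruction mix: type -> (frequency, average CPI for that type).
mix = {
    "lw": (0.25, 0.6 * 1 + 0.4 * 2),  # 40% of loads stall one extra cycle
    "sw": (0.10, 1.0),
    "br": (0.13, 0.5 * 1 + 0.5 * 3),  # 50% of branches cost two extra cycles
    "R":  (0.52, 1.0),
}

# The frequencies must cover the whole instruction stream.
total_freq = sum(freq for freq, _ in mix.values())
assert abs(total_freq - 1.0) < 1e-9

cpi_pipe = sum(freq * cpi for freq, cpi in mix.values())
print(f"CPI_pipe = {cpi_pipe:.2f}")  # CPI_pipe = 1.23
```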

3.2 Step 2: memory penalty

From the Memory chapter (§7): assume L1 hit time = 1 cycle, L1 miss rate = 4%, miss penalty (to L2 or main memory) = 100 cycles. f_mem = loads + stores = 0.25 + 0.10 = 0.35.

Memory stall term = f_mem × MR × Penalty
                  = 0.35 × 0.04 × 100
                  = 1.40 cycles/instruction
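
To see how sensitive this term is to the miss rate, here is a small sketch that sweeps it. Only the 4% row corresponds to the chapter's scenario; the other miss rates are hypothetical values added for comparison, and the function name is an assumption.

```python
def memory_stall_cpi(f_mem, miss_rate, miss_penalty):
    """Average memory stall cycles added to every instruction."""
    return f_mem * miss_rate * miss_penalty


F_MEM = 0.35          # loads + stores
MISS_PENALTY = 100    # cycles

# Only 4% is the chapter's value; the others are hypothetical.
for miss_rate in (0.01, 0.02, 0.04, 0.08):
    stall = memory_stall_cpi(F_MEM, miss_rate, MISS_PENALTY)
    print(f"MR = {miss_rate:.0%}: {stall:.2f} stall cycles/instruction")
```

Because the term is linear in the miss rate, halving the miss rate halves the memory contribution to CPI.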

3.3 Step 3: combined effective CPI

CPI_eff = CPI_pipe + memory stall
        = 1.23     + 1.40
        = 2.63

Notice: memory contributes more than the entire pipeline did. That's why memory dominates real-world performance.

3.4 Step 4: IPC & total time

IPC_eff   = 1 / 2.63 = 0.38
Frequency = 1 / 350 ps = 2.86 GHz
Throughput = 2.86 × 10^9 × 0.38 = 1.09 × 10^9 instructions/s

Time = 10^11 × 2.63 × 350 ps
     = 92.05 seconds

Model                      CPI   Time
Single-cycle (µArch §3.3)  1.0   75 s
Multicycle (µArch §4.4)    4.12  103 s
Pipelined, perfect cache   1.23  43 s
Pipelined + real cache     2.63  92 s

A "real" pipelined processor with realistic memory (92 s) ends up slower than a single-cycle CPU modelled with no memory penalty (75 s), unless you also tame the memory hierarchy.
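
The rows of the comparison table do not share a clock period, which is easy to miss. As a rough sketch, the code below inverts Time = N × CPI × T_c to recover the clock period each row implies. The derived periods are back-computed from the table's rounded entries and are not stated in this chapter, so treat them as approximate.

```python
N_INSTRUCTIONS = 1e11  # 100 billion instructions, from the scenario

# Model -> (CPI, time in seconds), copied from the comparison table.
models = {
    "Single-cycle": (1.0, 75.0),
    "Multicycle": (4.12, 103.0),
    "Pipelined, perfect cache": (1.23, 43.0),
    "Pipelined + real cache": (2.63, 92.0),
}


def implied_tc_ps(cpi, time_s, n=N_INSTRUCTIONS):
    """Invert Time = N x CPI x T_c and convert seconds to picoseconds."""
    return time_s / (n * cpi) * 1e12


implied = {name: implied_tc_ps(cpi, t) for name, (cpi, t) in models.items()}
for name, tc in implied.items():
    print(f"{name:<26}{tc:7.1f} ps")
```

Both pipelined rows land within rounding of the scenario's 350 ps, while the single-cycle row implies a clock more than twice as slow.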

4. What to optimise: where is the bottleneck?

Break down CPI_eff = 2.63 by source:

Source              Contribution  % of CPI
Ideal pipeline      1.00          38%
Load-use stalls     0.10          4%
Branch mispredicts  0.13          5%
Memory misses       1.40          53%

Where to invest
Memory is the single biggest chunk of execution time (53% of CPI_eff in this example). That's why real CPUs devote a large share of their silicon to caches and to out-of-order machinery that hides memory latency (prefetchers, larger ROBs, non-blocking caches). Pipeline tuning typically gives you a few percent; cache improvements can give you tens of percent.
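
The breakdown can be recomputed from the underlying parameters with a sketch like the following. The variable names are illustrative assumptions; the numbers are those of the worked example.

```python
# Per-source contribution to CPI_eff, in cycles per instruction.
contributions = {
    "Ideal pipeline": 1.00,
    "Load-use stalls": 0.25 * 0.4 * 1,
    "Branch mispredicts": 0.13 * 0.5 * 2,
    "Memory misses": 0.35 * 0.04 * 100,
}

cpi_eff = sum(contributions.values())
shares = {source: c / cpi_eff for source, c in contributions.items()}

# Print the largest contributors first.
for source, share in sorted(shares.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{source:<20}{contributions[source]:5.2f}{share:6.0%}")

largest = max(shares, key=shares.get)
print(f"Largest contributor: {largest}")
```

Sorting by share makes the optimisation priority explicit: in this scenario memory misses come first, and the two pipeline hazards together account for under 10% of CPI.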

5. Speedup analyses you'll be asked about

5.1 Perfect cache (MR = 0)

CPI_eff_perfect = 1.23 (memory term → 0)
Speedup         = CPI_eff_real / CPI_eff_perfect
                = 2.63 / 1.23 = 2.14×

Equivalent formulation (GQ Q4-style):

MSCPI       = CPI_base + f · MR · Penalty
MSCPI_ideal = CPI_base
Speedup     = MSCPI / MSCPI_ideal
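
A minimal sketch of the MSCPI formulation, using hypothetical function names. Comparing CPIs alone is enough here because the instruction count and clock period are the same for both machines.

```python
def mscpi(cpi_base, f, miss_rate, penalty):
    """Memory-stall CPI: base CPI plus average miss stalls per instruction."""
    return cpi_base + f * miss_rate * penalty


def perfect_cache_speedup(cpi_base, f, miss_rate, penalty):
    """Speedup from removing all misses; N and T_c cancel out of the ratio."""
    return mscpi(cpi_base, f, miss_rate, penalty) / cpi_base


speedup = perfect_cache_speedup(1.23, 0.35, 0.04, 100)
print(f"Perfect-cache speedup = {speedup:.2f}x")  # 2.14x
```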

5.2 Better branch predictor

Drop mispredict rate from 50% to 10%:

CPI_br new   = 0.9·1 + 0.1·3 = 1.20    (vs 2.0 before)
CPI_pipe new = 0.25·1.4 + 0.10·1 + 0.13·1.20 + 0.52·1 = 1.126
CPI_eff new  = 1.126 + 1.40 = 2.526
Speedup      = 2.63 / 2.526 = 1.04×

Modest gain because memory dominates. With a perfect cache, the same change gives 1.23 / 1.126 = 1.09×, more than double the improvement.
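
The comparison between the real-cache and perfect-cache cases can be sketched as follows. The helper names are assumptions for illustration; the mix, penalties, and mispredict rates are the chapter's.

```python
F_LW, F_SW, F_BR, F_R = 0.25, 0.10, 0.13, 0.52
CPI_LW = 0.6 * 1 + 0.4 * 2           # 1.4, unchanged by the predictor
MEMORY_STALL = 0.35 * 0.04 * 100     # 1.40 cycles/instruction


def cpi_pipe(p_mispredict):
    """Pipeline CPI as a function of the branch mispredict rate."""
    cpi_br = (1 - p_mispredict) * 1 + p_mispredict * 3
    return F_LW * CPI_LW + F_SW * 1.0 + F_BR * cpi_br + F_R * 1.0


def predictor_speedup(memory_stall, p_old=0.5, p_new=0.1):
    """Speedup from the better predictor for a given memory stall term."""
    old = cpi_pipe(p_old) + memory_stall
    new = cpi_pipe(p_new) + memory_stall
    return old / new


real = predictor_speedup(MEMORY_STALL)
perfect = predictor_speedup(0.0)
print(f"real cache: {real:.2f}x, perfect cache: {perfect:.2f}x")
```

The same 0.104-cycle saving is divided by a smaller total when the memory term is zero, which is why the relative gain is larger with a perfect cache.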

5.3 Doubling clock frequency

Halve T_c from 350 ps to 175 ps. The miss latency is fixed in wall-clock time (100 cycles × 350 ps = 35 ns), so the penalty measured in cycles doubles: the same miss now costs 200 cycles at the new clock.

Memory term new = 0.35 · 0.04 · 200 = 2.80
CPI_eff new     = 1.23 + 2.80 = 4.03
Time new        = 10^11 × 4.03 × 175 ps = 70.5 s

Speedup vs original = 92 / 70.5 = 1.30×

Far less than the 2× you'd "expect" from doubling frequency, because memory latency is fixed in real time, not cycles.
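
A sketch of the same calculation with the miss latency held constant in picoseconds rather than cycles. It assumes, as the text does, that the pipeline stall cycles are unaffected by the faster clock; the 87.5 ps point and the "memory floor" are hypothetical extrapolations beyond the chapter's example.

```python
N_INSTRUCTIONS = 1e11
CPI_PIPE = 1.23
F_MEM = 0.35
MISS_RATE = 0.04
MISS_LATENCY_PS = 100 * 350  # 35 ns, fixed in wall-clock time


def exec_time_s(tc_ps):
    """Execution time when the miss penalty scales with the clock period."""
    penalty_cycles = MISS_LATENCY_PS / tc_ps
    cpi_eff = CPI_PIPE + F_MEM * MISS_RATE * penalty_cycles
    return N_INSTRUCTIONS * cpi_eff * tc_ps * 1e-12


base = exec_time_s(350)
for tc_ps in (350, 175, 87.5):
    t = exec_time_s(tc_ps)
    print(f"Tc = {tc_ps:>5} ps: {t:6.2f} s, speedup {base / t:.2f}x")

# Time spent waiting on misses does not shrink with the clock at all.
memory_floor_s = N_INSTRUCTIONS * F_MEM * MISS_RATE * MISS_LATENCY_PS * 1e-12
print(f"Memory floor: {memory_floor_s:.1f} s")
```

Under these assumptions the miss stalls alone cost about 49 s regardless of clock speed, so no frequency increase can push execution time below that.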

Chapter summary
  1. Pipeline and memory penalties add in CPI_eff.
  2. Compact form: CPI_eff = CPI_pipe + f_mem · MR · Penalty.
  3. Real-world CPI is usually dominated by memory, not pipeline hazards.
  4. Time = N · CPI_eff · T_c. IPC_eff = 1 / CPI_eff.
  5. Doubling clock frequency rarely doubles performance, because memory latency is fixed in real time.
  6. Perfect-cache speedup is 2.14× in our example: that's the prize for nailing the memory hierarchy.
