Chapter — Combined system performance

Putting it all together: the real CPI, IPC, and execution time of a processor when the pipeline and the memory hierarchy are both in the picture.

What this chapter is for

Each prior chapter computed one piece of the puzzle:

Microarchitecture gave us CPI_pipe — pipeline hazards (load-use, branch mispredict, etc.).
Memory Systems gave us AMAT and the memory-stall CPI term.

Real processors pay both. This page combines them into the full equation and walks through a worked example.

1. Why we need a combined model
2. The effective CPI equation
2.1 Per-stall contribution accounting
2.2 From CPI_eff to time & IPC
3. End-to-end worked example
4. What to optimise: where is the bottleneck?
5. Speedup analyses you'll be asked about

1. Why we need a combined model

If you only count pipeline stalls, you get a CPI like 1.23 for SPECINT, the textbook number. But no real program runs at that CPI, because the figure assumes a perfect cache: memory stalls have not been counted.

If you only count memory stalls, you ignore the cost of branch mispredictions, load-use hazards, etc.

Big idea

Pipeline and memory penalties add. They're independent contributions to CPI. The real CPI is the sum (plus second-order interactions, which we usually neglect at this level).

2. The effective CPI equation

2.1 Per-stall contribution accounting

Start from the ideal pipelined CPI of 1, then add the average stall cycles per instruction contributed by each kind of hazard:

General form

CPI_eff = 1
        + f_lw × P(load-use stall) × 1
        + f_br × P(mispredict) × 2
        + f_mem × MR × MissPenalty

Or — equivalently — start from the pipelined CPI you already computed and add only the memory term:

Compact form

CPI_eff = CPI_pipe + f_mem × MR × MissPenalty

Term	Where it comes from	Typical magnitude
1	Ideal pipeline	1
Load-use penalty	µArch §6.3	~0.1 (e.g. 0.25·0.4·1)
Branch mispredict	µArch §6.4 + Adv §3	~0.1 (e.g. 0.13·0.5·2)
Memory stall	Memory §7	~0.05–2.0 (huge range!)
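As a rough sketch, the general form can be written as a small Python function. The name `effective_cpi` and its parameter names are illustrative choices, not notation from the earlier chapters. The default penalties of 1 and 2 cycles are the ones in the general form, and the 4% miss rate and 100-cycle penalty are borrowed from the worked example in Section 3.

```python
def effective_cpi(f_lw, p_load_use, f_br, p_mispredict,
                  f_mem, miss_rate, miss_penalty,
                  load_use_penalty=1, mispredict_penalty=2):
    """Return (CPI_eff, per-term breakdown) for the additive stall model."""
    terms = {
        "ideal": 1.0,
        "load_use": f_lw * p_load_use * load_use_penalty,
        "branch": f_br * p_mispredict * mispredict_penalty,
        "memory": f_mem * miss_rate * miss_penalty,
    }
    return sum(terms.values()), terms


# Example values from the "typical magnitude" column, plus the
# miss rate and miss penalty used in the worked example.
cpi_eff, terms = effective_cpi(0.25, 0.4, 0.13, 0.5, 0.35, 0.04, 100)

# The compact form groups the first three terms into CPI_pipe.
cpi_pipe = terms["ideal"] + terms["load_use"] + terms["branch"]
print(f"CPI_pipe = {cpi_pipe:.2f}, CPI_eff = {cpi_eff:.2f}")
```

Returning the per-term breakdown alongside the total makes it easy to see that the general and compact forms are the same sum, just grouped differently.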

2.2 From CPI_eff to time & IPC

Final timings

IPC_eff = 1 / CPI_eff
Time = N_instructions × CPI_eff × T_c
Throughput = IPC_eff × frequency
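The three timing formulas translate directly into code. This is a minimal sketch: the function name `timing` and the input values are hypothetical, chosen only so the arithmetic is easy to check by hand.

```python
def timing(n_instructions, cpi_eff, t_c_seconds):
    """Apply the three timing formulas to one configuration."""
    ipc = 1.0 / cpi_eff
    frequency = 1.0 / t_c_seconds          # Hz
    return {
        "ipc": ipc,
        "time_s": n_instructions * cpi_eff * t_c_seconds,
        "throughput_ips": ipc * frequency,  # instructions per second
    }


# Hypothetical inputs: 1e9 instructions, CPI_eff = 2.0, T_c = 500 ps.
result = timing(1e9, 2.0, 500e-12)
print(result)
```

Time and throughput are two views of the same quantity: execution time is the instruction count divided by throughput.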

3. End-to-end worked example

Scenario

Pipelined 5-stage RISC-V, T_c = 350 ps (f = 2.86 GHz). SPECINT2000-style mix: 25% lw, 10% sw, 13% branches, 52% R-type. 100 billion instructions.

3.1 Step 1: pipeline CPI

From the µArch chapter (§7.3):

Type	Freq	CPI
lw	0.25	0.6·1 + 0.4·2 = 1.4
sw	0.10	1.0
br	0.13	0.5·1 + 0.5·3 = 2.0
R	0.52	1.0

CPI_pipe = 0.25·1.4 + 0.10·1.0 + 0.13·2.0 + 0.52·1.0 = 1.23
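The weighted average above is easy to mis-add by hand, so here is the same calculation as a short Python sketch. The dictionary layout and the name `mix` are illustrative choices; the numbers are the ones in the table.

```python
# (frequency, per-type CPI) for each instruction class, from the table.
mix = {
    "lw": (0.25, 0.6 * 1 + 0.4 * 2),   # 40% of loads stall one cycle
    "sw": (0.10, 1.0),
    "br": (0.13, 0.5 * 1 + 0.5 * 3),   # 50% of branches lose two cycles
    "R":  (0.52, 1.0),
}

# Sanity check: the frequencies must cover the whole instruction stream.
assert abs(sum(freq for freq, _ in mix.values()) - 1.0) < 1e-9

cpi_pipe = sum(freq * cpi for freq, cpi in mix.values())
print(f"CPI_pipe = {cpi_pipe:.2f}")
```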

3.2 Step 2: memory penalty

From the Memory chapter (§7): assume L1 hit time = 1 cycle, L1 miss rate = 4%, miss penalty (to L2 or main memory) = 100 cycles. f_mem = loads + stores = 0.25 + 0.10 = 0.35 data accesses per instruction.

Memory stall term = f_mem × MR × Penalty
                  = 0.35 × 0.04 × 100
                  = 1.40 cycles/instruction
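To get a feel for how sensitive this term is to the miss rate, the sketch below evaluates it at a few values. Only the 4% case comes from the scenario; the other miss rates are hypothetical and the function name `memory_stall_cpi` is an illustrative choice.

```python
def memory_stall_cpi(f_mem, miss_rate, miss_penalty):
    """Average memory-stall cycles per instruction."""
    return f_mem * miss_rate * miss_penalty


F_MEM = 0.35     # loads + stores, from the scenario
PENALTY = 100    # cycles, from the scenario

# 4% is the worked example; the other miss rates are hypothetical.
for miss_rate in (0.01, 0.02, 0.04, 0.08):
    stall = memory_stall_cpi(F_MEM, miss_rate, PENALTY)
    print(f"MR = {miss_rate:.0%}: memory stall = {stall:.2f} cycles/instruction")
```

Because the term is a plain product, it scales linearly with the miss rate: halving the miss rate halves the stall cycles.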

3.3 Step 3: combined effective CPI

CPI_eff = CPI_pipe + memory stall
        = 1.23     + 1.40
        = 2.63

Notice: memory contributes more than the entire pipeline did. That's why memory dominates real-world performance.

3.4 Step 4: IPC & total time

IPC_eff   = 1 / 2.63 = 0.38
Frequency = 1 / 350 ps = 2.86 GHz
Throughput = 2.86 × 10^9 × 0.38 = 1.09 × 10^9 instructions/s

Time = 10^11 × 2.63 × 350 ps
     = 92.05 seconds

Model	CPI	Time
Single-cycle (µArch §3.3)	1.0	75 s
Multicycle (µArch §4.4)	4.12	103 s
Pipelined, perfect cache	1.23	43 s
Pipelined + real cache	2.63	92 s
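The times in the table can be reproduced with Time = N × CPI × T_c. One caveat: this chapter does not state the single-cycle and multicycle cycle times, so the 750 ps and 250 ps used below are back-derived from the table's own CPI and time columns and should be read as assumptions.

```python
N = 100e9  # instructions, from the scenario

# model name -> (CPI, T_c in seconds).
# 750 ps and 250 ps are back-derived from the table, not quoted
# from the µArch chapter; 350 ps is the scenario's pipelined T_c.
models = {
    "Single-cycle":             (1.0,  750e-12),
    "Multicycle":               (4.12, 250e-12),
    "Pipelined, perfect cache": (1.23, 350e-12),
    "Pipelined + real cache":   (2.63, 350e-12),
}

times = {name: N * cpi * t_c for name, (cpi, t_c) in models.items()}
for name, seconds in times.items():
    print(f"{name:26s} {seconds:6.1f} s")
```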

A "real" pipelined processor with realistic memory (92 s) actually ends up slower than the single-cycle CPU (75 s), whose figure assumes no memory penalty at all. Pipelining only pays off if you also tame the memory hierarchy.

4. What to optimise: where is the bottleneck?

Break down CPI_eff = 2.63 by source:

Source	Contribution	% of CPI
Ideal pipeline	1.00	38%
Load-use stalls	0.10	4%
Branch mispredicts	0.13	5%
Memory misses	1.40	53%
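The same breakdown can be computed rather than tabulated, which is handy when you change one parameter and want to see the shares shift. A minimal sketch, with variable names chosen for illustration:

```python
# Per-source CPI contributions, using the scenario's numbers.
contributions = {
    "Ideal pipeline": 1.00,
    "Load-use stalls": 0.25 * 0.4 * 1,
    "Branch mispredicts": 0.13 * 0.5 * 2,
    "Memory misses": 0.35 * 0.04 * 100,
}

cpi_eff = sum(contributions.values())
shares = {name: value / cpi_eff for name, value in contributions.items()}

for name, value in contributions.items():
    print(f"{name:20s} {value:5.2f}  {shares[name]:4.0%}")

# The bottleneck is simply the largest share.
bottleneck = max(shares, key=shares.get)
print("Largest contributor:", bottleneck)
```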

Where to invest

Memory is the single biggest chunk of execution time. That's why real CPUs devote a large share of their silicon to caches and to out-of-order machinery that hides memory latency (prefetchers, larger ROBs, non-blocking caches). Pipeline tuning gives you a few percent (all hazards together are only 0.23 of the 2.63 CPI); cache improvements give you tens of percent.

5. Speedup analyses you'll be asked about

5.1 Perfect cache (MR = 0)

CPI_eff_perfect = 1.23 (memory term → 0)
Speedup         = CPI_eff_real / CPI_eff_perfect
                = 2.63 / 1.23 = 2.14×

Equivalent formulation (GQ Q4-style):

MSCPI       = CPI_base + f · MR · Penalty
MSCPI_ideal = CPI_base
Speedup     = MSCPI / MSCPI_ideal
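The MSCPI formulation maps onto a short function. This is a sketch under the example's numbers; the function name `mscpi` just mirrors the symbol above.

```python
def mscpi(cpi_base, f, miss_rate, penalty):
    """Memory-stall CPI: base CPI plus average miss stall cycles."""
    return cpi_base + f * miss_rate * penalty


real = mscpi(1.23, 0.35, 0.04, 100)
ideal = mscpi(1.23, 0.35, 0.0, 100)   # MR = 0, so the memory term vanishes

# N and T_c are unchanged, so the speedup is just the ratio of CPIs.
speedup = real / ideal
print(f"Perfect-cache speedup = {speedup:.2f}x")
```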

5.2 Better branch predictor

Drop mispredict rate from 50% to 10%:

CPI_br new   = 0.9·1 + 0.1·3 = 1.20    (vs 2.0 before)
CPI_pipe new = 0.25·1.4 + 0.10·1 + 0.13·1.20 + 0.52·1 = 1.126
CPI_eff new  = 1.126 + 1.40 = 2.526
Speedup      = 2.63 / 2.526 = 1.04×

Modest gain because memory dominates. With a perfect cache, the same change gives 1.23 / 1.126 = 1.09×, more than double the relative improvement.
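To see how much the memory term dilutes the predictor's benefit, the sketch below runs the same 50% to 10% improvement with the real cache and with a perfect one. The helper names `cpi_pipe` and `speedup` are illustrative.

```python
def cpi_pipe(p_mispredict):
    """Pipeline CPI for the scenario's mix at a given mispredict rate."""
    cpi_lw = 0.6 * 1 + 0.4 * 2
    cpi_br = (1 - p_mispredict) * 1 + p_mispredict * 3
    return 0.25 * cpi_lw + 0.10 * 1.0 + 0.13 * cpi_br + 0.52 * 1.0


MEM_STALL = 0.35 * 0.04 * 100   # 1.40 cycles/instruction


def speedup(mem_stall):
    """Speedup from cutting the mispredict rate from 50% to 10%."""
    old = cpi_pipe(0.5) + mem_stall
    new = cpi_pipe(0.1) + mem_stall
    return old / new


print(f"Real cache:    {speedup(MEM_STALL):.2f}x")
print(f"Perfect cache: {speedup(0.0):.2f}x")
```

The absolute CPI saving (0.104) is identical in both cases; only the denominator it is measured against changes.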

5.3 Doubling clock frequency

Halve T_c from 350 ps to 175 ps. The miss latency is fixed in wall-clock time (100 × 350 ps = 35 ns), so the same miss now costs 200 of the shorter cycles.

Memory term new = 0.35 · 0.04 · 200 = 2.80
CPI_eff new     = 1.23 + 2.80 = 4.03
Time new        = 10^11 × 4.03 × 175 ps = 70.5 s

Speedup vs original = 92 / 70.5 = 1.30×

Only 1.30× instead of the 2× you'd "expect" from doubling frequency, because memory latency is fixed in real time, not cycles.
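The same reasoning generalises to any clock multiplier. The sketch below assumes, as this section does, that CPI_pipe is unaffected by the faster clock and that the miss latency stays at 35 ns; the function name `speedup_from_clock` and the parameter `k` are illustrative.

```python
def speedup_from_clock(k, cpi_pipe=1.23, f_mem=0.35, miss_rate=0.04,
                       penalty_cycles=100, t_c=350e-12):
    """Speedup when frequency is multiplied by k and the miss latency
    stays fixed in wall-clock time."""
    old_time = (cpi_pipe + f_mem * miss_rate * penalty_cycles) * t_c
    new_t_c = t_c / k
    new_penalty = penalty_cycles * k   # same nanoseconds, more cycles
    new_time = (cpi_pipe + f_mem * miss_rate * new_penalty) * new_t_c
    return old_time / new_time


for k in (1, 2, 4, 8):
    print(f"{k}x clock -> {speedup_from_clock(k):.2f}x speedup")
```

In this model the per-instruction memory time never shrinks, so the speedup is bounded by 2.63 / 1.40 ≈ 1.88× however fast the clock gets.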

Chapter summary

Pipeline and memory penalties add in CPI_eff.
Compact form: CPI_eff = CPI_pipe + f_mem · MR · Penalty.
Real-world CPI is usually dominated by memory, not pipeline hazards.
Time = N · CPI_eff · T_c. IPC_eff = 1 / CPI_eff.
Doubling clock frequency rarely doubles performance — memory latency is fixed in real time.
Perfect-cache speedup is 2.14× in our example: that's the prize for nailing the memory hierarchy.

Last stop: the Worked Examples chapter, for more GQ-style numeric drills.
