Quiz — CAAL Finals

Pipelining

1. An ideal pipelined processor has a CPI of:

1

0

5 (number of stages)

Equal to the number of instructions

1. One instruction completes per cycle in steady state — that's what pipelining buys you. Hazards push CPI above 1.

2. Why is a multicycle processor sometimes slower than single-cycle for the same program?

It uses more transistors

Each instruction takes multiple cycles, raising CPI well above 1

It has no forwarding

Branch misprediction penalty is higher

CPI well above 1. The multicycle clock is faster, but every instruction takes 3-5 cycles. Total time = #inst × CPI × T_c, and the higher CPI can outweigh the shorter T_c. Pipelining keeps T_c short AND CPI ≈ 1.
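The formula Total time = #inst × CPI × T_c can be tried out with a small sketch. The instruction count, CPI values, and clock periods below are hypothetical numbers picked only to illustrate the trade-off; they are not measurements of any real design.

```python
def execution_time(instructions, cpi, clock_period_ns):
    """Total time in ns = instruction count x CPI x clock period."""
    return instructions * cpi * clock_period_ns

# Hypothetical numbers, chosen only to illustrate the trade-off.
n = 1_000_000
single_cycle = execution_time(n, cpi=1.0, clock_period_ns=10.0)  # long clock, CPI = 1
multicycle = execution_time(n, cpi=4.0, clock_period_ns=3.0)     # short clock, CPI = 4
pipelined = execution_time(n, cpi=1.2, clock_period_ns=3.0)      # short clock, CPI near 1

print(single_cycle, multicycle, pipelined)
```

With these assumed values the multicycle design loses to single-cycle because its CPI grew faster than its clock period shrank, while the pipelined design wins on both factors.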

3. Which hazard cannot be eliminated by forwarding alone?

ALU → ALU RAW

Load-use (lw immediately followed by a dependent op)

EX → MEM forwarding

WAW hazards

Load-use. The load result isn't available until after MEM, so a dependent instruction in EX must stall by 1 bubble even with forwarding.
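A minimal sketch of load-use detection, assuming a classic 5-stage pipeline with full forwarding. The tuple layout `(op, dest, sources)` and the function name are illustrative choices, not part of any real toolchain.

```python
def count_load_use_stalls(program):
    """Count bubbles caused by a load immediately followed by a dependent op.

    program: list of (op, dest_register, source_registers) tuples.
    """
    stalls = 0
    for prev, cur in zip(program, program[1:]):
        prev_op, prev_dest, _ = prev
        _, _, cur_sources = cur
        # Only a load followed directly by a consumer needs a bubble;
        # ALU -> ALU dependencies are covered by forwarding.
        if prev_op == "lw" and prev_dest in cur_sources:
            stalls += 1
    return stalls

program = [
    ("lw", "t0", ("s0",)),
    ("add", "t1", ("t0", "s1")),  # uses t0 right after the load: 1 bubble
    ("sub", "t2", ("t1", "s2")),  # ALU -> ALU RAW: forwarding covers it
]
print(count_load_use_stalls(program))
```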

4. In a 5-stage pipeline, a mispredicted branch costs how many flushed instructions?

5

3

2

0

2. The branch resolves in EX, by which point 2 younger instructions are already in IF and ID. Both must be flushed, so a mispredicted branch effectively costs 3 cycles (its own slot plus 2 flushed slots).

5. SPECINT2000 mix: 25% loads (40% stall), 10% stores, 13% branches (50% mispredict), 52% R-type. Average CPI ≈

1.00

1.23

1.57

2.00

1.23. CPI_lw=0.6·1+0.4·2=1.4. CPI_br=0.5·1+0.5·3=2. Avg = 0.25·1.4 + 0.10·1 + 0.13·2 + 0.52·1 = 1.23.
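The weighted average can be reproduced with a short sketch. The function name and dictionary layout are illustrative; the fractions and per-class CPIs are the ones given in the question.

```python
def average_cpi(mix):
    """mix maps class name -> (fraction of instructions, CPI of that class)."""
    return sum(fraction * cpi for fraction, cpi in mix.values())

cpi_load = 0.6 * 1 + 0.4 * 2    # 40% of loads stall for one extra cycle
cpi_branch = 0.5 * 1 + 0.5 * 3  # 50% of branches mispredict and cost 3 cycles

mix = {
    "load": (0.25, cpi_load),
    "store": (0.10, 1),
    "branch": (0.13, cpi_branch),
    "rtype": (0.52, 1),
}
print(round(average_cpi(mix), 2))
```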

Advanced µArch

1. Register renaming eliminates which hazards?

RAW only

RAW and WAW

WAR and WAW

All hazards

WAR & WAW. These are name hazards — fresh physical registers eliminate the conflict. RAW is a true data dependency and survives renaming.
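A rough sketch of renaming, assuming an unlimited supply of physical registers (no free list or reclamation) and three-operand instructions. The `rN`/`pN` names and the `rename` helper are hypothetical.

```python
def rename(instructions, num_arch_regs=4):
    """instructions: list of (dest, src1, src2) architectural register names."""
    mapping = {f"r{i}": f"p{i}" for i in range(num_arch_regs)}
    next_phys = num_arch_regs
    renamed = []
    for dest, src1, src2 in instructions:
        # Sources read the current mapping, so RAW dependencies are preserved.
        s1, s2 = mapping[src1], mapping[src2]
        # Each destination gets a fresh physical register, which removes
        # WAR and WAW conflicts on the architectural name.
        mapping[dest] = f"p{next_phys}"
        next_phys += 1
        renamed.append((mapping[dest], s1, s2))
    return renamed

program = [
    ("r1", "r2", "r3"),  # r1 = r2 op r3
    ("r2", "r1", "r3"),  # RAW on r1, WAR on r2
    ("r1", "r3", "r3"),  # WAW on r1
]
for instruction in rename(program):
    print(instruction)
```

After renaming, the second instruction still reads the register written by the first (the RAW survives), but no two instructions write the same physical register.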

2. The execution-order pattern of a modern OoO core is:

In-order, In-order, Out-of-order

Out-of-order, In-order, Out-of-order

In-order, Out-of-order, In-order

Out-of-order, Out-of-order, In-order

In, Out, In. Front-end fetches/decodes/renames in order. Execution engine schedules out of order. Back-end (ROB) commits in order so the program appears sequential.

3. A 2-bit predictor starting at "Weakly Not Taken" runs a 100-iteration loop. Total mispredictions =

1

2

3

100

2. One on iteration 1 (state was 01, branch taken). The state walks to Strongly Taken and stays. The 100th iteration is the loop exit (not taken) → mispredict #2.
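The count can be checked by simulating a 2-bit saturating counter. This sketch assumes the loop branch is taken on iterations 1-99 and not taken on iteration 100, and encodes the states as 0-3 (an encoding chosen here for illustration).

```python
def count_mispredictions(outcomes, state=1):
    """2-bit saturating counter.

    States: 0 = Strongly NT, 1 = Weakly NT, 2 = Weakly T, 3 = Strongly T.
    The predictor says "taken" when state >= 2.
    """
    misses = 0
    for taken in outcomes:
        predicted_taken = state >= 2
        if predicted_taken != taken:
            misses += 1
        # Move one step toward the actual outcome, saturating at 0 and 3.
        state = min(state + 1, 3) if taken else max(state - 1, 0)
    return misses

loop = [True] * 99 + [False]  # taken 99 times, then the loop exit
print(count_mispredictions(loop))
```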

4. A 4-wide superscalar with a 96-entry ROB. Each instruction writes one register. Cycles before the ROB blocks issue (assuming no commits yet)?

96

32

24

8

24. ROB capacity / issue width = 96 / 4 = 24 cycles before the buffer fills.

5. 2-bit predictors are better than 1-bit predictors because they cause:

Increased CPI

Reduced IPC

Fewer mispredictions on stable loops (need 2 wrongs to flip direction)

No hardware cost

Hysteresis. 1-bit flips on any wrong guess; 2-bit needs two consecutive wrongs. Loop-exits don't destabilise the predictor.
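The effect of hysteresis shows up when the same loop runs repeatedly. The sketch below compares both predictors on a hypothetical 10-iteration inner loop executed 10 times; the initial states (not taken for 1-bit, Weakly Not Taken for 2-bit) are assumptions.

```python
def one_bit_misses(outcomes, last=False):
    """1-bit predictor: predict whatever the branch did last time."""
    misses = 0
    for taken in outcomes:
        if last != taken:
            misses += 1
        last = taken
    return misses

def two_bit_misses(outcomes, state=1):
    """2-bit saturating counter, states 0-3, predict taken when state >= 2."""
    misses = 0
    for taken in outcomes:
        if (state >= 2) != taken:
            misses += 1
        state = min(state + 1, 3) if taken else max(state - 1, 0)
    return misses

# Inner loop: taken 9 times, then exit; the whole loop is run 10 times.
outcomes = ([True] * 9 + [False]) * 10
print(one_bit_misses(outcomes), two_bit_misses(outcomes))
```

The 1-bit predictor misses twice per pass (on the exit and again on re-entry), while the 2-bit predictor misses only on the exit once it has warmed up.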

Memory Systems

1. If the degree of associativity is 1, the cache is:

1-Way Set Associative only

Direct Mapped only

Both 1 and 2 (they're the same thing)

Neither

Both. A 1-way set-associative cache is a direct-mapped cache. Two names, one structure.

2. The TLB is:

A small cache for data

A small cache for instructions

A small cache for address translations

A small cache for page-fault records

Translations. Translation Lookaside Buffer holds recent VPN → PPN mappings so we skip the page-table walk on hits.
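A toy model of the lookup, assuming 4 KiB pages, a single-level page table held in a dictionary, and a TLB with no capacity limit or eviction. All names here are illustrative.

```python
PAGE_OFFSET_BITS = 12  # assumption: 4 KiB pages

def translate(vaddr, tlb, page_table):
    """Return (physical address, tlb_hit) for a virtual address."""
    vpn = vaddr >> PAGE_OFFSET_BITS
    offset = vaddr & ((1 << PAGE_OFFSET_BITS) - 1)
    hit = vpn in tlb
    if not hit:
        # TLB miss: fall back to the page table and cache the mapping.
        tlb[vpn] = page_table[vpn]
    return (tlb[vpn] << PAGE_OFFSET_BITS) | offset, hit

page_table = {0x1: 0x80}  # VPN 0x1 -> PPN 0x80
tlb = {}
print(translate(0x1234, tlb, page_table))  # first access misses
print(translate(0x1238, tlb, page_table))  # same page, now a hit
```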

3. Page size = 2 Kiword, 4 segments (text/data/heap/stack), one page per segment, 4-byte words. Total bytes across all segments?

2¹³

2¹⁵

2¹⁷

2¹⁰

2¹⁵. 4 segments × 2 × 1024 words = 2¹³ words. At 4 bytes/word that is 2¹⁵ bytes (assuming one page per segment).
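The same arithmetic as a quick check, assuming 32-bit (4-byte) words and exactly one page per segment:

```python
WORDS_PER_PAGE = 2 * 1024  # 2 Kiword
BYTES_PER_WORD = 4         # assumption: 32-bit words
SEGMENTS = 4               # text, data, heap, stack (one page each, assumed)

total_words = SEGMENTS * WORDS_PER_PAGE
total_bytes = total_words * BYTES_PER_WORD
print(total_words, total_bytes)
```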

4. CPI_base = 2, miss rate 4%, miss penalty 100 cyc, f = 30%. Speedup from a perfect cache:

1.20×

1.60×

2.00×

3.20×

1.60×. Memory-stall CPI (MSCPI) = 0.30·0.04·100 = 1.20, so CPI with misses = 2 + 1.20 = 3.20. A perfect cache runs at CPI_base = 2, so speedup = 3.20 / 2 = 1.60.
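The calculation can be sketched as below. The function name is illustrative, and the model only counts data-access misses (f is the fraction of memory instructions), as in the question.

```python
def cpi_with_misses(cpi_base, mem_fraction, miss_rate, miss_penalty):
    """Base CPI plus the memory-stall cycles per instruction."""
    stall_cpi = mem_fraction * miss_rate * miss_penalty
    return cpi_base + stall_cpi

cpi_base = 2
cpi_real = cpi_with_misses(cpi_base, mem_fraction=0.30, miss_rate=0.04, miss_penalty=100)
speedup = cpi_real / cpi_base  # perfect cache runs at cpi_base
print(round(cpi_real, 2), round(speedup, 2))
```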

5. Increasing the cache block size will:

Always improve miss rate

Always hurt miss rate

Exploit spatial locality but may raise conflict misses & penalty

Have no effect on miss penalty

Trade-off. Bigger blocks pre-fetch more useful neighbours (spatial locality) but reduce the # of blocks ⇒ more conflicts, and each miss transfers more data ⇒ higher penalty.

Vector / RVV

1. SIMD / Vector exploits which kind of parallelism?

Instruction-Level

Thread-Level

Data-Level

Bit-Level

Data-level. Same operation across many data elements simultaneously.

2. VLEN = 256 bits, SEW = 32 bits, LMUL = 1. VLMAX is:

256

32

8

1

8. VLMAX = LMUL × VLEN / SEW = 1 × 256 / 32 = 8 elements per vector register.

3. With AVL = 5 and VLMAX = 8, the runtime VL set by vsetvli is:

8

5

3

13

5. VL = min(AVL, VLMAX) = min(5, 8) = 5. Elements 5-7 of the register are tail elements.
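A simplified model of the two formulas, assuming integer LMUL. The Python functions `vlmax` and `vsetvli` only mimic the arithmetic; they are not an RVV intrinsic API.

```python
def vlmax(vlen_bits, sew_bits, lmul=1):
    """VLMAX = LMUL x VLEN / SEW (integer LMUL assumed)."""
    return lmul * vlen_bits // sew_bits

def vsetvli(avl, vlmax_elems):
    """Simplified model of the VL chosen by vsetvli: min(AVL, VLMAX)."""
    return min(avl, vlmax_elems)

limit = vlmax(vlen_bits=256, sew_bits=32, lmul=1)
vl = vsetvli(avl=5, vlmax_elems=limit)
tail = list(range(vl, limit))  # element indices past VL
print(limit, vl, tail)
```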

4. The biggest advantage of RVV over traditional SIMD (SSE/AVX) is:

It's faster on every operation

It uses less power

VL is runtime-variable, so the same binary scales across hardware widths

It avoids cache misses

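Portability. SSE/AVX fix the vector width in the instruction encoding (128-bit movdqa for SSE, 512-bit registers for AVX-512). RVV sets VL at run time, so the same binary runs on different vector lengths without a recompile, and the tail is handled automatically.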
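The portability argument can be illustrated by modelling a strip-mined loop on machines with different VLEN. This is a sketch under simplifying assumptions: VL is taken as min(remaining, VLMAX) on every pass, whereas the RVV specification gives implementations some freedom when the remaining count lies between VLMAX and 2 × VLMAX. The function name is hypothetical.

```python
def strip_mine_vls(n, vlen_bits, sew_bits=32, lmul=1):
    """Return the VL used on each pass of a strip-mined loop over n elements."""
    vlmax = lmul * vlen_bits // sew_bits
    vls = []
    remaining = n
    while remaining > 0:
        vl = min(remaining, vlmax)  # what vsetvli would return in this model
        vls.append(vl)
        remaining -= vl
    return vls

# Same "binary" (same loop), three hypothetical hardware widths.
for vlen in (128, 256, 512):
    print(vlen, strip_mine_vls(21, vlen))
```

The loop body never mentions the hardware width; only the number of passes changes, and the final short pass covers the leftover elements.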

5. With "tail agnostic", the hardware may leave tail elements:

Always zeroed

Always preserved (old values)

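Either old values or all 1s, whichever the hardware picks; don't rely on them

Set to NaN

Unpredictable. Under the tail-agnostic policy each tail element may keep its old value or be overwritten with all 1s, and software must not depend on which. Use tail undisturbed if you need preservation.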

Green Computing + Design Verification

1. Dynamic power scales as:

P ∝ V

P ∝ V·f

P ∝ ½ C V² f

P ∝ I·V (leakage only)

½ C V² f. Voltage is squared — that's why voltage scaling delivers the biggest power wins.
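To see why the squared term matters, the sketch below compares a 20% voltage reduction with a 20% frequency reduction. The capacitance, voltage, and frequency values are hypothetical; only the ratios are meaningful.

```python
def dynamic_power(c_farads, v_volts, f_hz):
    """P = 1/2 * C * V^2 * f."""
    return 0.5 * c_farads * v_volts ** 2 * f_hz

base = dynamic_power(1e-9, 1.0, 2e9)
lower_v = dynamic_power(1e-9, 0.8, 2e9)    # 20% lower voltage
lower_f = dynamic_power(1e-9, 1.0, 1.6e9)  # 20% lower frequency

print(round(lower_v / base, 2), round(lower_f / base, 2))
```

Cutting voltage by 20% leaves 64% of the dynamic power, while cutting frequency by 20% leaves 80%.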

2. "Dark silicon" refers to:

Defective transistors

Areas of the die left unpowered because we can't dissipate the heat

Carbon emissions from chip manufacture

Off-state leakage

Unpowered areas. Dennard scaling broke down in the mid-2000s: more transistors still fit on a die, but per-transistor power stopped falling in step, so portions of the die must stay dark to stay thermally safe.

3. Lower PUE means a data centre is:

Hotter

More efficient (less overhead beyond IT load)

Less reliable

Slower

Efficient. PUE = total facility power / IT power. Ideal = 1.0 (no overhead). Industry avg ~1.5, Google ~1.1.
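A small worked example of the ratio, using made-up facility numbers:

```python
def pue(total_facility_kw, it_kw):
    """Power Usage Effectiveness = total facility power / IT power."""
    return total_facility_kw / it_kw

# Hypothetical facility: 1000 kW of IT load plus 500 kW of cooling and losses.
it_load_kw = 1000
overhead_kw = 500
print(pue(it_load_kw + overhead_kw, it_load_kw))
```

A PUE of 1.5 here means every watt delivered to IT equipment costs another half watt of overhead.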

4. Verilog is used to code:

Microarchitecture (datapath, control, hardware)

Architecture (ISA)

Assembly

High-level language programs

Microarchitecture. Verilog is an HDL — it describes the actual hardware that implements the ISA.

5. In a UVM testbench, the component that observes DUT signals passively and broadcasts transactions via an analysis port is the:

Driver

Sequencer

Monitor

Scoreboard

Monitor. Passive — never drives signals. Captures activity → transaction → analysis port → scoreboard/coverage.
