































## **Blocked MatMul Example**

Example: N = 8; sub-block size = 4

Key idea: Sub-blocks (i.e., $A_{xy}$) can be treated just like scalars.

$$\begin{split} &C_{11} \ = \ A_{11}B_{11} + A_{12}B_{21} \qquad C_{12} \ = \ A_{11}B_{12} + A_{12}B_{22} \\ &C_{21} \ = \ A_{21}B_{11} + A_{22}B_{21} \qquad C_{22} \ = \ A_{21}B_{12} + A_{22}B_{22} \end{split}$$
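The four sub-block equations above can be sketched in code. This is a minimal illustration, assuming square matrices stored as Python lists of lists; the names `blocked_matmul` and `naive_matmul` and the test matrices are hypothetical, not from the lecture. The three outer loops pick one sub-block product $A_{xk}B_{ky}$ at a time, and the three inner loops are an ordinary scalar matmul confined to that tile.

```python
def blocked_matmul(A, B, block=4):
    """Multiply square matrices A and B one sub-block (tile) at a time."""
    n = len(A)
    C = [[0] * n for _ in range(n)]
    # (bi, bj) selects the sub-block C_xy; bk walks the products A_xk * B_ky.
    for bi in range(0, n, block):
        for bj in range(0, n, block):
            for bk in range(0, n, block):
                # Ordinary matmul restricted to one block-sized tile.
                for i in range(bi, min(bi + block, n)):
                    for j in range(bj, min(bj + block, n)):
                        acc = C[i][j]
                        for k in range(bk, min(bk + block, n)):
                            acc += A[i][k] * B[k][j]
                        C[i][j] = acc
    return C


def naive_matmul(A, B):
    """Reference triple-loop product, used only to check the blocked version."""
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


# Slide example: N = 8, sub-block size = 4 (matrix contents are made up).
N = 8
A = [[i + j for j in range(N)] for i in range(N)]
B = [[i * j + 1 for j in range(N)] for i in range(N)]
assert blocked_matmul(A, B, block=4) == naive_matmul(A, B)
```

Both versions perform the same multiply-adds; only the order changes, so that each tile is reused while it is still likely to be in the cache.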

CS6810



## **Others**

- Prefetch
  - reduces miss penalty and miss rate
    - » if done right
    - » added complexity, power, and screw-up potential
  - discussed last lecture
  - can be done either by HW or SW
- Next level cache
  - reduces miss penalty
    - » in best case
  - increases miss penalty
    - » in worst case
    - » "swing to miss" principle
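The best-case and worst-case behavior of a next-level cache can be made concrete with a small calculation. This is a sketch under stated assumptions: the L2 is looked up before memory (a serial lookup), and the cycle counts and the function name `l1_miss_penalty` are hypothetical values chosen only for illustration.

```python
def l1_miss_penalty(l2_hit_time, l2_local_miss_rate, memory_latency):
    """Average cycles an L1 miss costs when an L2 is checked before memory."""
    return l2_hit_time + l2_local_miss_rate * memory_latency


MEMORY_LATENCY = 100  # cycles, hypothetical
L2_HIT_TIME = 10      # cycles, hypothetical

no_l2 = MEMORY_LATENCY                                        # 100
best = l1_miss_penalty(L2_HIT_TIME, 0.0, MEMORY_LATENCY)      # 10.0: every L1 miss hits in L2
worst = l1_miss_penalty(L2_HIT_TIME, 1.0, MEMORY_LATENCY)     # 110.0: L2 lookup was wasted
typical = l1_miss_penalty(L2_HIT_TIME, 0.25, MEMORY_LATENCY)  # 35.0
```

With these made-up numbers, an access that also misses in the L2 pays the L2 lookup on top of the memory latency, which is the worst case noted in the list above.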




## **Ancillary Caches**

- Victim cache (Jouppi)
  - small cache to hold victimized lines
  - idea allows arbitrary associativity for a small number of lines
    - » total extra associativity = size of victim cache
  - downside
    - » parallel check of regular and victim
    - » fully associative
- Trace cache (Weiser, Peleg)
  - Intel P4
    - » expensive: many instruction copies
- Assist cache (HP and somebody you know)
  - 1st touch goes to assist
  - 2nd touch goes to regular cache
    - » makes prefetch less likely to contaminate cache
  - downside
    - » similar to victim cache
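The victim cache idea can be sketched as a toy simulator. This is a minimal model, assuming a direct-mapped main cache indexed by block address modulo the number of sets, and a small fully associative victim cache with oldest-first replacement; the class name `VictimCachedDM` and the replacement policy are assumptions for illustration, not details from the lecture. The checks run one after the other here, whereas hardware would probe both structures in parallel.

```python
from collections import OrderedDict


class VictimCachedDM:
    """Direct-mapped cache backed by a small fully associative victim cache."""

    def __init__(self, num_sets, victim_entries):
        self.num_sets = num_sets
        self.victim_entries = victim_entries
        self.main = [None] * num_sets   # one block address per set
        self.victim = OrderedDict()     # evicted blocks, oldest first
        self.stats = {"hit": 0, "victim_hit": 0, "miss": 0}

    def _to_victim(self, block):
        """Store a line evicted from the main cache, dropping the oldest if full."""
        if block is None or self.victim_entries == 0:
            return
        self.victim[block] = True
        self.victim.move_to_end(block)
        if len(self.victim) > self.victim_entries:
            self.victim.popitem(last=False)

    def access(self, block):
        idx = block % self.num_sets
        if self.main[idx] == block:
            outcome = "hit"
        else:
            displaced = self.main[idx]
            self.main[idx] = block
            if block in self.victim:
                # Swap: the requested line returns to the main cache and
                # the line it displaces takes its place in the victim cache.
                del self.victim[block]
                outcome = "victim_hit"
            else:
                outcome = "miss"
            self._to_victim(displaced)
        self.stats[outcome] += 1
        return outcome


# Blocks 0 and 8 map to the same set of an 8-set direct-mapped cache.
trace = [0, 8] * 4

plain = VictimCachedDM(num_sets=8, victim_entries=0)
with_victim = VictimCachedDM(num_sets=8, victim_entries=1)
for b in trace:
    plain.access(b)
    with_victim.access(b)

print(plain.stats)        # {'hit': 0, 'victim_hit': 0, 'miss': 8}
print(with_victim.stats)  # {'hit': 0, 'victim_hit': 6, 'miss': 2}
```

In this ping-pong trace a single victim entry turns six of the eight conflict misses into victim hits, which illustrates the extra associativity described in the list above.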

School of Computing University of Utah


| Technique                             | Miss Rate | Miss Penalty | Hit Time | HW Complexity | Comments                                                        |
|---------------------------------------|-----------|--------------|----------|---------------|-----------------------------------------------------------------|
| Larger Cache                          | win       |              | lose     | easy          | cost is approx. linear                                          |
| Larger Block Size                     | win       | lose         |          | easy          | trivial engineering effort                                      |
| Higher Associativity                  | win       |              | lose     | 1             | associative match isn't free                                    |
| Victim Caches                         | win       |              |          | 2             | e.g. HP 7200                                                    |
| Pseudo-Associative                    | win       |              |          | 2             | Used in L2 of MIPS R10000                                       |
| HW Prefetch of I and D                | win       |              |          | 2             | D fetch hard to do correctly                                    |
| Compiler controlled prefetch          | win       |              |          | 3             | Needs non-blocking cache                                        |
| Compiler cache scheduling (blocking,) | win       |              |          | 0             | Too bad it's hard to do for all applications; loop focus for now |



