







| Loop Unrolling → Bigger Basic Block | 4x Unroll |
|-------------------------------------|-----------|
| <ul> <li>Basic idea <ul> <li>take n loop bodies and concatenate them</li> <li>can't reuse the same target registers, or WAR/WAW stalls become a problem</li></ul></li></ul> | |
| School of Computing 5 CS6810 | University of Utah 6 CS6810 |
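
The basic idea above can be sketched in a high-level language. This is only an illustration, not the slides' code: the function names and the temporaries `t0`..`t3` are hypothetical, and the unrolled version assumes the array length is a multiple of 4. The distinct temporaries stand in for the distinct target registers that keep the four copies of the body independent.

```python
def add_scalar_rolled(x, s):
    """Original loop: one element per iteration, walking down from the end."""
    i = len(x) - 1
    while i >= 0:
        x[i] = x[i] + s
        i -= 1
    return x


def add_scalar_unrolled4(x, s):
    """4x unrolled loop. Assumption: len(x) is a multiple of 4."""
    assert len(x) % 4 == 0, "this sketch handles only multiples of 4"
    i = len(x) - 1
    while i >= 0:
        # Four loop bodies back to back, each writing its own temporary
        # (t0..t3 play the role of distinct target registers).
        t0 = x[i] + s
        t1 = x[i - 1] + s
        t2 = x[i - 2] + s
        t3 = x[i - 3] + s
        x[i], x[i - 1], x[i - 2], x[i - 3] = t0, t1, t2, t3
        i -= 4  # one index update and one branch test per four elements
    return x
```

Both functions produce the same array; the unrolled one pays the index update and loop test once per four elements instead of once per element.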

| 4x unrolled (unscheduled) | 4x unrolled (scheduled) |
|---------------------------|-------------------------|
| `Loop: L.D    F0, 0(R1)`  | `Loop: L.D    F0, 0(R1)` |
| `ADD.D  F4, F0, F2`       | `L.D    F6, -8(R1)`      |
| `S.D    F4, 0(R1)`        | `L.D    F10, -16(R1)`    |
| `L.D    F6, -8(R1)`       | `L.D    F14, -24(R1)`    |
| `ADD.D  F8, F6, F2`       | `ADD.D  F4, F0, F2`      |
| `S.D    F8, -8(R1)`       | `ADD.D  F8, F6, F2`      |
| `L.D    F10, -16(R1)`     | `ADD.D  F12, F10, F2`    |
| `ADD.D  F12, F10, F2`     | `ADD.D  F16, F14, F2`    |
| `S.D    F12, -16(R1)`     | `S.D    F4, 0(R1)`       |
| `L.D    F14, -24(R1)`     | `S.D    F8, -8(R1)`      |
| `ADD.D  F16, F14, F2`     | `DADDUI R1, R1, #-32`    |
| `S.D    F16, -24(R1)`     | `S.D    F12, 16(R1)`     |
| `DADDUI R1, R1, #-32`     | `BNE    R1, R2, Loop`    |
| `BNE    R1, R2, Loop`     | `S.D    F16, 8(R1)`      |
| How many cycles per loop?<br>4 x (1 post-L.D stall + 2 post-ADD.D stalls) + 1 post-DADDUI stall + 1 post-BNE stall = 14 stalls<br>14 instructions + 14 stalls = 28 cycles total<br>still only 50% efficient | no stalls<br>14 instructions & 4 loops: 3.5 cycles per iteration<br>2.857x speedup over … |
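
The cycle accounting for the two listings can be checked with a few lines of arithmetic. This is a sketch under stated assumptions: the helper name `cycles_per_iteration` is hypothetical, each instruction is taken to issue in one cycle, and the 10-cycle baseline for the original (non-unrolled) loop is an assumption chosen because it is consistent with the quoted 2.857x figure.

```python
def cycles_per_iteration(instructions, stalls, iterations):
    """Cycles per original loop iteration, assuming one issue cycle per instruction."""
    return (instructions + stalls) / iterations


INSTRUCTIONS = 14  # 4 x (L.D, ADD.D, S.D) + DADDUI + BNE
ITERATIONS = 4     # original loop iterations covered by one unrolled pass

# Unscheduled: one stall after each L.D, two after each ADD.D,
# plus one after DADDUI and one after BNE.
unscheduled_stalls = 4 * (1 + 2) + 1 + 1
unscheduled = cycles_per_iteration(INSTRUCTIONS, unscheduled_stalls, ITERATIONS)

# Scheduled: the reordered code has no stalls.
scheduled = cycles_per_iteration(INSTRUCTIONS, 0, ITERATIONS)

# Fraction of cycles doing useful work in the unscheduled version.
efficiency = INSTRUCTIONS / (INSTRUCTIONS + unscheduled_stalls)

# ASSUMPTION: the original loop costs 10 cycles per iteration.
ASSUMED_BASELINE_CYCLES = 10
speedup = ASSUMED_BASELINE_CYCLES / scheduled

print(unscheduled_stalls, unscheduled, scheduled, efficiency, round(speedup, 3))
```

Under these assumptions the unscheduled version costs 7 cycles per iteration at 50% efficiency, and scheduling halves that to 3.5.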























| (m,n) Predictor Problem |
|-------------------------|
| <ul> <li>Assume <ul> <li>m=10, n=2</li> <li>and branch ID is 10 bits</li></ul></li> <li>If we use all 20 bits <ul> <li>need a 1M x 2-bit = 256 KB BHT</li> <li>TOO EXPENSIVE</li></ul></li> <li>What should we do? <ul> <li>hash the 20 bits into something smaller</li> <li>XOR is a good hash function <ul> <li>cheap and fast</li></ul></li></ul></li></ul> |
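
One way to picture the XOR hash is the sketch below. It rests on assumptions not stated on the slide: the 20 bits are folded to a 10-bit index by XORing the 10-bit branch ID with the 10-bit global history (a gshare-style choice), counters start weakly not-taken, and all names (`xor_index`, `bht_bytes`, `XorHashedPredictor`) are hypothetical.

```python
HISTORY_BITS = 10    # m: bits of global branch history
COUNTER_BITS = 2     # n: bits per saturating counter
BRANCH_ID_BITS = 10  # low-order branch address bits
INDEX_BITS = 10      # ASSUMPTION: size of the hashed index


def xor_index(branch_id, history):
    """Fold branch ID and history into one INDEX_BITS-wide table index."""
    mask = (1 << INDEX_BITS) - 1
    return (branch_id ^ history) & mask


def bht_bytes(index_bits, counter_bits=COUNTER_BITS):
    """Storage in bytes for a table of 2**index_bits counters."""
    return (1 << index_bits) * counter_bits // 8


class XorHashedPredictor:
    """Minimal (m,n)-style predictor indexed by the XOR hash."""

    def __init__(self):
        # 2-bit counters: 0..1 predict not taken, 2..3 predict taken.
        self.counters = [1] * (1 << INDEX_BITS)
        self.history = 0

    def predict(self, branch_id):
        return self.counters[xor_index(branch_id, self.history)] >= 2

    def update(self, branch_id, taken):
        i = xor_index(branch_id, self.history)
        if taken:
            self.counters[i] = min(3, self.counters[i] + 1)
        else:
            self.counters[i] = max(0, self.counters[i] - 1)
        # Shift the outcome into the global history register.
        history_mask = (1 << HISTORY_BITS) - 1
        self.history = ((self.history << 1) | int(taken)) & history_mask


print(bht_bytes(INDEX_BITS))  # 1024 counters x 2 bits = 256 bytes
```

The price of the smaller table is aliasing: different (branch ID, history) pairs can share a counter, which is the usual trade-off for a hash this cheap.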











