## Big Iron

Today's topics: Vector Processors and Supercomputers

- Vector processors (VPs) came first
  - now exist as GPGPUs
  - figure source: text Appendix F
- Supercomputers
  - lots of microprocessors with a fancy interconnect
  - a look at the top500
- Datacenter "cloud" Computing
  - lots of blades w/ fancy interconnect AND fancy storage systems (this is not DRAM!)

School of Computing













| Processor (year)                     | Vector<br>clock<br>rate<br>(MHz) | Vector<br>registers | Elements per<br>register<br>(64-bit<br>elements) | Vector arithmetic units                                                                                      | Vector<br>load-store<br>units | Lanes                    |
|--------------------------------------|----------------------------------|---------------------|--------------------------------------------------|--------------------------------------------------------------------------------------------------------------|-------------------------------|--------------------------|
| Cray-1 (1976)                        | 80                               | 8                   | 64                                               | 6: FP add, FP multiply, FP reciprocal,<br>integer add, logical, shift                                        | 1                             | 1                        |
| Cray X-MP (1983)<br>Cray Y-MP (1988) | 118<br>166                       | 8                   | 64                                               | 8: FP add, FP multiply, FP reciprocal,<br>integer add, 2 logical, shift, population<br>count/parity          | 2 loads<br>1 store            | 1                        |
| Cray-2 (1985)                        | 244                              | 8                   | 64                                               | 5: FP add, FP multiply, FP reciprocal/sqrt,<br>integer add/shift/population count, logical                   | 1                             | 1                        |
| Fujitsu VP100/<br>VP200 (1982)       | 133                              | 8-256               | 32-1024                                          | 3: FP or integer add/logical, multiply, divide                                                               | 2                             | 1 (VP100)<br>2 (VP200)   |
| Hitachi S810/S820<br>(1983)          | 71                               | 32                  | 256                                              | 4: FP multiply-add, FP multiply/divide-add<br>unit, 2 integer add/logical                                    | 3 loads<br>1 store            | 1 (S810)<br>2 (S820)     |
| Convex C-1 (1985)                    | 10                               | 8                   | 128                                              | 2: FP or integer multiply/divide, add/logical                                                                | 1                             | 1 (64 bit)<br>2 (32 bit) |
| NEC SX/2 (1985)                      | 167                              | 8 + 32              | 256                                              | 4: FP multiply/divide, FP add, integer add/<br>logical, shift                                                | 1                             | 4                        |
| Cray C90 (1991)<br>Cray T90 (1995)   | 240<br>460                       | 8                   | 128                                              | 8: FP add, FP multiply, FP reciprocal,<br>integer add, 2 logical, shift, population<br>count/parity          | 2 loads<br>1 store            | 2                        |
| NEC SX/5 (1998)                      | 312                              | 8 + 64              | 512                                              | 4: FP or integer add/shift, multiply, divide,<br>logical                                                     | 1                             | 16                       |

| Instruction                  | Operands                         | Function                                                                                                                                                                               |
|------------------------------|----------------------------------|----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| ADDV.D<br>ADDVS.D            | V1,V2,V3<br>V1,V2,F0             | Add elements of V2 and V3, then put each result in V1. Add F0 to each element of V2, then put each result in V1.                                                                       |
| SUBV.D<br>SUBVS.D<br>SUBSV.D | V1,V2,V3<br>V1,V2,F0<br>V1,F0,V2 | Subtract elements of V3 from V2, then put each result in V1. Subtract F0 from elements of V2, then put each result in V1. Subtract elements of V2 from F0, then put each result in V1. |
| MULV.D<br>MULVS.D            | V1,V2,V3<br>V1,V2,F0             | Multiply elements of V2 and V3, then put each result in V1. Multiply each element of V2 by F0, then put each result in V1.                                                             |
| DIVV.D<br>DIVVS.D<br>DIVSV.D | V1,V2,V3<br>V1,V2,F0<br>V1,F0,V2 | Divide elements of V2 by V3, then put each result in V1. Divide elements of V2 by F0, then put each result in V1. Divide F0 by elements of V2, then put each result in V1.             |

| Instruction       | Operands       | Function                                                                                                                                                                                                                                                                                        |
|-------------------|----------------|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| LV                | V1,R1          | Load vector register V1 from memory starting at address R1.                                                                                                                                                                                                                                     |
| SV                | R1,V1          | Store vector register V1 into memory starting at address R1.                                                                                                                                                                                                                                    |
| LVWS              | V1,(R1,R2)     | Load V1 from address at R1 with stride in R2, i.e., R1 + i × R2.                                                                                                                                                                                                                                |
| SVWS              | (R1,R2),V1     | Store V1 to address at R1 with stride in R2, i.e., R1 + i × R2.                                                                                                                                                                                                                                 |
| LVI               | V1,(R1+V2)     | Load V1 with vector whose elements are at R1 + V2(i), i.e., V2 is an index.                                                                                                                                                                                                                     |
| SVI               | (R1+V2),V1     | Store V1 to vector whose elements are at R1 + V2(i), i.e., V2 is an index.                                                                                                                                                                                                                      |
| CVI               | V1,R1          | Create an index vector by storing the values 0, 1 × R1, 2 × R1, ..., 63 × R1 into V1.                                                                                                                                                                                                           |
| S--V.D<br>S--VS.D | V1,V2<br>V1,F0 | Compare the elements (EQ, NE, GT, LT, GE, LE) in V1 and V2. If condition is true, put a 1 in the corresponding bit vector; otherwise put 0. Put resulting bit vector in vector-mask register (VM). The instruction S--VS.D performs the same compare but using a scalar value as one operand. |
| POP               | R1,VM          | Count the 1s in the vector-mask register and store count in R1.                                                                                                                                                                                                                                 |
| CVM               |                | Set the vector-mask register to all 1s.                                                                                                                                                                                                                                                         |
| MTC1              | VLR,R1         | Move contents of R1 to the vector-length register.                                                                                                                                                                                                                                              |
| MFC1              | R1,VLR         | Move the contents of the vector-length register to R1.                                                                                                                                                                                                                                          |
| MVTM              | VM,F0          | Move contents of F0 to the vector-mask register.                                                                                                                                                                                                                                                |
| MVFM              | F0,VM          | Move contents of vector-mask register to F0.                                                                                                                                                                                                                                                    |
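The addressing rules in the table can be sketched in plain Python. This is only an illustrative model, not VMIPS itself: memory is assumed to be a flat list indexed by word address (real VMIPS addresses are in bytes), and the function names `lvws`, `lvi`, `cvi`, `compare_gt`, and `pop` are hypothetical stand-ins for LVWS, LVI, CVI, the GT compare, and POP.

```python
MVL = 64  # assumed maximum vector length (64 elements per register)

def lvws(mem, r1, r2, vl=MVL):
    """Strided load: element i comes from address r1 + i * r2."""
    return [mem[r1 + i * r2] for i in range(vl)]

def lvi(mem, r1, v2):
    """Indexed load (gather): element i comes from address r1 + v2[i]."""
    return [mem[r1 + idx] for idx in v2]

def cvi(r1, vl=MVL):
    """Create the index vector 0, 1*r1, 2*r1, ..., (vl-1)*r1."""
    return [i * r1 for i in range(vl)]

def compare_gt(v1, v2):
    """Build a mask with one bit per element: 1 where v1[i] > v2[i], else 0."""
    return [1 if a > b else 0 for a, b in zip(v1, v2)]

def pop(vm):
    """Count the 1s in the mask."""
    return sum(vm)

mem = list(range(1000))             # mem[a] == a, so a load returns its addresses
col = lvws(mem, 10, 4, vl=4)        # reads addresses 10, 14, 18, 22
same = lvi(mem, 10, cvi(4, vl=4))   # CVI followed by LVI reproduces the strided load
mask = compare_gt(col, [12, 12, 12, 12])
```

Here `col` and `same` hold the same four values, which shows that an indexed load with a CVI-generated index vector is a generalization of a strided load.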

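MIPS code for DAXPY:

| Label | Instruction | Operands   | Comment                  |
|-------|-------------|------------|--------------------------|
|       | L.D         | F0,a       | ;load scalar a           |
|       | DADDIU      | R4,Rx,#512 | ;last address to load    |
| Loop: | L.D         | F2,0(Rx)   | ;load X(i)               |
|       | MUL.D       | F2,F2,F0   | ;a × X(i)                |
|       | L.D         | F4,0(Ry)   | ;load Y(i)               |
|       | ADD.D       | F4,F4,F2   | ;a × X(i) + Y(i)         |
|       | S.D         | 0(Ry),F4   | ;store into Y(i)         |
|       | DADDIU      | Rx,Rx,#8   | ;increment index to X    |
|       | DADDIU      | Ry,Ry,#8   | ;increment index to Y    |
|       | DSUBU       | R20,R4,Rx  | ;compute bound           |
|       | BNEZ        | R20,Loop   | ;check if done           |

Here is the VMIPS code for DAXPY.

| Label | Instruction | Operands | Comment                 |
|-------|-------------|----------|-------------------------|
|       | L.D         | F0,a     | ;load scalar a          |
|       | LV          | V1,Rx    | ;load vector X          |
|       | MULVS.D     | V2,V1,F0 | ;vector-scalar multiply |
|       | LV          | V3,Ry    | ;load vector Y          |
|       | ADDV.D      | V4,V2,V3 | ;add                    |
|       | SV          | Ry,V4    | ;store the result       |

IC = 6 vs 600: the VMIPS version executes 6 instructions, while the MIPS loop executes roughly 600.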
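The instruction-count gap can be checked with a short Python sketch. It assumes 64-element vectors, as implied by the `#512` bound (512 bytes at 8 bytes per double), and counts 2 set-up instructions plus 9 loop-body instructions per element for the scalar MIPS version. The function names are illustrative only.

```python
def daxpy(a, x, y):
    """Reference semantics of DAXPY: Y = a * X + Y, element by element."""
    return [a * xi + yi for xi, yi in zip(x, y)]

def scalar_instruction_count(n, setup=2, per_iteration=9):
    """Dynamic instructions executed by the scalar MIPS loop for n elements."""
    return setup + per_iteration * n

def vector_instruction_count():
    """The VMIPS version: L.D, LV, MULVS.D, LV, ADDV.D, SV."""
    return 6

n = 512 // 8  # 64 doubles
print(scalar_instruction_count(n), "vs", vector_instruction_count())
# prints: 578 vs 6
```

Under these assumptions the scalar count is 578, which the slide rounds to 600.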

## **Performance**

- Vector execution time
  - f(vector length, structural hazards, data hazards)
    - » initiation rate: # of operands consumed or produced per cycle
    - » multi-lane architecture: each vector lane can carry n values per cycle, often 2 or more
    - » # vector lanes × lane width = initiation rate
  - also dependent on pipeline fill and spill
- Convoys (made-up term)
  - set of independent vector instructions
    - » similar to an EPIC VLIW bundle
- Chime
  - time it takes to execute 1 convoy
- Start-up time
- All contribute to execution time
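A rough way to see how these terms combine is the Python sketch below. It is a deliberately simplified model, not a formula from the slides: it assumes convoys execute one after another with no overlap, that each convoy costs its start-up time plus one chime, and that a chime is the vector length divided by the initiation rate, rounded up. The start-up values in the example are made-up numbers.

```python
import math

def initiation_rate(lanes, lane_width):
    """Operands consumed or produced per cycle: # lanes * lane width."""
    return lanes * lane_width

def chime_cycles(vector_length, rate):
    """Cycles for one convoy to stream through, ignoring start-up."""
    return math.ceil(vector_length / rate)

def execution_cycles(convoy_startups, vector_length, rate):
    """Total cycles, assuming convoys run back to back with no overlap."""
    chime = chime_cycles(vector_length, rate)
    return sum(startup + chime for startup in convoy_startups)

startups = [12, 7, 12]  # hypothetical start-up times for three convoys
one_lane = execution_cycles(startups, 64, initiation_rate(1, 1))
four_lanes = execution_cycles(startups, 64, initiation_rate(4, 1))
```

In this model, adding lanes shrinks the chime but leaves the start-up cost untouched, so start-up time matters more as the lane count grows.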


CS6810

## **Memory**

- Lots of bandwidth required to feed lots of XUs
  - very wide data bus
  - banked memory
    - » each bank independently addressed (not interleaved)
  - multiple loads and stores issued per cycle
  - each bank serves a particular load or store, assuming no bank conflict
  - compiler tries hard to avoid conflicts
  - latency can be high for DRAM-based memory, but bandwidth can be quite good
  - early CRAY machines used SSRAMs, which are too expensive today
    - » addressing? where are the bank select bits?
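The bank-select question can be explored with a small Python sketch. It assumes one common mapping, where the low-order bits of the word address select the bank, and one access issued per cycle; the slide itself leaves the mapping open. The function names and the 8-bank, 6-cycle example are illustrative only.

```python
from math import gcd

def bank_of(word_address, num_banks):
    """Assumed mapping: bank select = low-order bits of the word address."""
    return word_address % num_banks

def banks_touched(stride, num_banks):
    """Number of distinct banks a strided access stream cycles through."""
    return num_banks // gcd(stride, num_banks)

def has_bank_conflict(stride, num_banks, bank_busy_cycles):
    """A stream revisits a bank every banks_touched() cycles;
    it stalls if that bank is still busy with the earlier access."""
    return banks_touched(stride, num_banks) < bank_busy_cycles

unit = has_bank_conflict(1, 8, 6)     # stride 1 spreads over all 8 banks
worst = has_bank_conflict(32, 8, 6)   # stride 32 hits the same bank every time
```

With this mapping, strides that share a large common factor with the number of banks are the ones the compiler has to work hardest to avoid.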


School of Computing University of Utah































## **Concluding Remarks**

- For supercomputers what matters most?
  - blade configuration
  - rack configuration
  - interconnect
    - » on the blade
    - » in the rack
    - » between racks
  - how memory is partitioned
    - » remote vs. local access latencies and bandwidths
  - memory capacity and organization
  - power and performance on big benchmarks
- If you choose one for your final HW
  - there's a lot of advertising copy
  - try to dig past that

