Lecture Notes CS/EE 6810 Chapter 2 and Appendix G: Static ILP (Lectures 5, 6, 7) Most low-end processors cannot invest enough silicon or power to implement fancy hardware mechanisms. They have to rely on good compiler scheduling to find parallelism and minimize stalls. Hence, most embedded and low-end processors typically employ simple in-order pipelines and rely on the compiler for good performance. This argument also applies to some modern high-performance processors that choose to maximize throughput by incorporating many simple cores on a single die. The argument can also be made that the simple pipeline will enable higher clock speeds, thereby out-performing out-of-order pipelines. This was one of the motivations behind Intel and HP creating the IA-64 architecture (the Itanium processor). Unfortunately, that project was not a success and we'll examine some of the reasons why. But even though static (compiler-based) scheduling has been unable to out-perform dynamic (hardware-based) scheduling, static scheduling is the only option for better performance in a simple core. We'll assume the simple 5-stage in-order pipeline for most of our examples. We'll focus on loops because a program spends most of its time within loops and loops are easier to analyze because of their regularity. In earlier exercises, we had seen that an int-ALU operation completes in a single cycle and does not cause stalls for dependent instructions. A load, on the other hand, has a 2-cycle latency, resulting in a single stall cycle for any dependent instruction. The FP-ALU operation takes 4 cycles to complete, meaning that a dependent FP-ALU operation incurs 3 stall cycles, while a dependent store operation incurs 2 stall cycles. These assumptions are important when determining an optimal schedule for instructions. Consider the loop on slide 4 (lecture 5) -- to each element of an array, we are adding a scalar value. The loop body consists of 3 main instructions: one to load the value from memory, one floating-point add, and then the store back to memory. There are two loop overhead instructions that decrement the address (so we can access the next array element), and branch back if the loop-terminate condition has not been met. If the default schedule executes on our in-order processor, a number of stalls are introduced and each loop iteration takes 10 cycles to execute. On slide 6, we can see how some of these stalls can be eliminated by just re-ordering the instructions. For example, the store goes in the branch delay slot and the integer add (DADDUI) is moved much higher. Since R1 gets updated early, a displacement has been provided when the Store accesses R1 -- this was easy to do because R1 is updated by a constant value every iteration. With the new schedule, there is a single stall cycle and each iteration takes 6 cycles (3 cycles of work, 2 cycles of loop overhead, and 1 cycle stall). Loop unrolling helps reduce the loop overhead and also provides the opportunity to reduce stalls if there are no dependences across loops (i.e., if there are no dependences across loops, an instruction from the next iteration can be hoisted to fill in a stall cycle). Slide 7 shows an unrolled schedule with a degree of unroll = 4. The default schedule reduces loop overhead, but still suffers from stalls. The schedule on slide 8 re-orders instructions so there are no stalls. Note that more register names are required and displacements have to be adjusted for each load/store instruction. Loop unrolling also increases the total code size (perhaps, a cause for concern on embedded systems that only have a few kilobytes of memory). The degree of unroll k may not be an integral divisor of n, the number of loop iterations. Hence, the compiler generates two successive loops, the first executes the original loop for (n mod k) iterations and then we can enter the unrolled loop and execute it for n/k (truncated) iterations. If the pipeline allows multiple instructions to execute at the same time, even greater speedups are possible. On slide 12, we show the schedule on an in-order processor that has an Int pipeline and an FP pipeline. In order to not have any stalls in the Int pipeline, the loop must be unrolled at least 5 times. There are different ways that the compiler can specify that the L.D and the ADD.D can execute together in the 3rd cycle. It can introduce bits in the executable that specify the "end of parallelism" for that cycle and the hardware must wait until the next cycle to execute any subsequent instructions. Alternatively, the compiler produces a fixed number of instructions in each packet and inserts NOPs if we can't find useful work in one of the slots. The latter technique is known as VLIW and simplifies the hardware, but can lead to large code size because of numerous NOPs in the code. While loops can be relatively easily unrolled, instructions within the loop can be re-ordered to avoid stalls only if dependences do not exist. Hence, a parallel loop is likely to yield more benefits than a loop where there are dependences across loop iterations. Hence, the compiler carries out analyses to determine the degree of parallelism within loops. Consider the examples on slide 4 (lecture 6). The first example is our case study so far and has no loop-carried dependences (it's a parallel loop and we were able to re-order instructions freely after the unroll). For a loop to be parallel, the only dependences must be within an iteration. The second example has an intra-loop dependence (A[i+1] is written by S1 and read by S2). It also has two loop-carried dependences as S1 depends on S1 from the previous iteration (and similarly for S2). Hence, loop unrolling may not give us a big win as instructions cannot be re-ordered (loop unrolling will reduce loop overhead, though). The fourth example has a loop-carried dependence as well, but the dependence distance is 3, which means that three consecutive loops are parallel and unrolling by degree 3 will allow us to hide stalls by re-ordering instructions. Unrolling by degree 4 or more will require us to be cognizant of dependences. The third example on slide 4 has a loop carried dependence as S1 depends on S2 from the previous iteration. If each instruction is represented by a vertex in a graph and we draw edges for dependences, it turns out that a loop is parallel if there are no cycles in the graph. For this third example, while there are loop-carried dependences, the graph is acyclic, which means that the loop can be restructured so that the loops are indeed parallel. Slide 5 shows such a re-structuring. Each new loop iteration now has S2 from the old loop and the dependent S1 from the next iteration of the old loop. Thus, all dependences are now within a single iteration and the loops are indeed parallel. When dealing with loops that access arrays with affine indices (an affine index is expressible as a integer linear combination of the loop index), to determine if the loop is parallel, we must find out if two different accesses to the same array can have the same index in different iterations. A naive algorithm would substitute the possible values for the loop index in each array access, try all possible combinations and see if there was ever a match (a dependence). This is not always do-able as array bounds are often not known and even if array bounds are known, the complexity for this analysis can quickly become very high, especially if there are many accesses to that array within the loop. A quick test that can identify some examples of parallelism is the GCD test. For any two array accesses to be the same, the array indices for each access must be equal for some iterations j and k, i.e., aj + b = ck + d. After re-organizing and dividing by the greatest common divisor (GCD) of a and c, we can see that the term on the left is guaranteed to be an integer. If the term on the right is not an integer, we can be sure that the equality is never true and there will never be any dependence across loops. As an example, verify this test for two array accesses, one that accesses all the odd elements of the array with index 2i+1 and the other that accesses alternate even elements of the array with index 4i+2. The compiler may be able to schedule instructions to mimic a pipeline. As shown on slide 7, in cycle 6, the compiler can try to schedule the S.D of the 1st iteration, the ADD.D of the 4th iteration, and the L.D and DADDUI of the 6th iteration. Slide 8 shows how the compiler can produce a software pipelined version of the original code. Within each new iteration, we are computing S.D of old iteration i, the ADD.D of old iteration i+1, and the L.D of old iteration i+2. The order of instructions is reversed so that the correct register dependences are arranged. The L.D writes into F0, which is read next by the ADD.D in the next iteration and the ADD.D writes into F4, which is read next by the S.D in the following iteration. To eliminate all stalls, a combination of loop unrolling and software pipelining may be required. In the examples so far, we were easily able to re-order instructions within a basic block (a sequence of instructions without a branch). Branches are problematic if they are hard to predict and can severely impede the re-ordering of instructions. Some branches can be done away with if the hardware provides support for "predication". Predication allows control dependences to be converted into data dependences. Consider the example on slide 10, lecture 6. A short if-then-else sequence is converted to a sequence that has no branches. However, each instruction now has an additional operand, known as the predicate. The instruction is finally allowed to complete and write its result into a register only if the predicate is true. If the predicate is not known, there is a stall in the pipeline. Thus, a control hazard has now been converted into a data hazard. Note that we now need a predicate (R7) for instructions in the then-part and another predicate (R1) for instructions in the else-part. Also note that an instruction in the else-part is reading a register value (R2) that gets updated by the then-part. So that the hardware does not flag this as a data dependence, different register names must be used. Hence, at the start, R2 is copied into another register R8. Thus, predication increases register pressure, but eliminates the penalty from branch mispredicts. Of course, some cycles will be wasted executing instructions along the wrong path. Overheads associated with predication: longer instruction length because of the additional operand, wasted power, more register/bypassing ports. When the compiler re-orders instructions, it is easy to ensure that register dependences are not violated. But what if a load instruction is hoised over an earlier store (shown in slide 12, lecture 6)? What if an instruction is hoisted over a branch and raises an exception (that should really not be signaled if the branch goes the other way)? Some exceptions do not need special treatment. They are the kind that may only impact performance. For example if the hoisted instruction causes a page fault, the only penalty is time spent servicing the page fault even though the branch goes the other way. However, if the exception terminates the program, we do not want to service it unless the branch does go left in the figure. Hence, such hoisted instructions are marked speculative. They do not raise exceptions right away -- instead, they write a special Not A Thing (NAT) value into their register. If some other (hoisted) instructions read a NAT value, they also write NAT into their destination registers. When we reach the original location of the hoisted instruction, a "sentinel" instruction is executed that examines the value in the destination register. If it finds a NAT, it signals an exception at this point and potentially executes some instructions to un-do the effects of the hoisted instructions. A similar policy is adopted when hoisting loads above stores. The hoisted load is marked as speculative. When it executes, it inserts its address into a special table (known as the ALAT in the IA-64 architecture). Subsequent stores will compare their store address with the addresses in the ALAT. If there is a match, that load address is marked as having a conflict. Finally, when we get to the hoisted load's original location, a special instruction is issued to check the ALAT. If it turns out that the hoisted load's address is marked as having a conflict, we jump to some recovery code that re-executes the load and its dependents. At the very basic level, an in-order processor must have low design complexity (low power, potential for fast clocks, low design effort, low verification effort). While the compiler can improve the execution time of the code, if we need significant flexibility to move instructions around, we'll need support for predication, and speculative/sentinel instructions. All of this additional hardware starts to erode some of the simplicity benefits of the in-order processor. The IA-64 project made this mistake and kept adding frills to improve performance. The result was a processor that had more area/power and less performance/clock speed than the competing out-of-order processors (such as the Pentium 4 and Alpha 21264). While it is questionable if static ILP is the path to high performance, static ILP is certainly the path to power and area-efficient computing, which is what embedded domains (and potentially high-throughput domains) require.