Lecture Notes CS/EE 6810 Appendix A: Basic Pipelining (Lectures 2 and 3 and 4) Consider an unpipelined processor implementation. The first instruction is fetched, decoded, executed, and finally produces a result. When all circuits have stabilized, a clock is provided so that the result can be stored in a latch (register). At that point, the processor can start working on the second instruction, and so on. If it takes time t1 to complete one instruction, the clock cycle time is t1+t0 (where t0 is the time taken to store a result in the latch), the clock speed is 1/(t1+t0), a new instruction enters/leaves the processor every 1 cycle (time t1+t0). Next, we'll try to speed up the processor with pipelining. The execution of the instruction is broken up into three equal stages: instruction fetch and decode; register read and ALU compute; register write. These stages are labeled A, B, and C on slide 12, lecture 2. Stage A begins by operating on instruction 1. After all circuits have stabilized, a clock is provided and the result is stored in a latch. This result will remain in the latch until the next clock edge is provided. This latch essentially maintains a constant input to the second stage B until the next clock edge. Thus, after three clock edges, the final result produced by the instruction is stored in the final latch. The whole operation takes 3 cycles, where the time between clock edges equals (t1/3)+t0. While instruction 1 is in stage B, instruction 2 can be operated on by stage A. This introduces parallelism within the processor. In the steady state, stage C will operate on instruction i, stage B will operate on instruction i+1, and stage A will operate on instruction i+2. Hence, up to 3 instructions can be simultaneously processed. How does this result in better performance? A new instruction can leave/enter the pipeline every clock cycle. Since the clock speed has nearly tripled, the pipelined implementation can be expected to have nearly thrice the throughput. Note that this has happened in spite of the fact that a single instruction now takes longer to execute (t1 + 3*t0) than in the unpipelined case (t1 + t0). We also assumed above that the work can be split into 3 equal parts of length t1/3...usually, some stages are longer than others and the cycle time is limited by the longest stage. The "magical" improvement with pipelining came about because we assumed that there was no dependence between the operations that happened in parallel. What if there was a dependence? Consider two successive instructions: R1 <-- R2 + R3; R5 <-- R1 + R4. Instruction 1 writes the value into R1 during stage C. Hence, only at the end of stage C are we guaranteed to have the new value residing in the register file. Meanwhile, instruction 2 must read the value of R1 at the start of stage B so it can complete the addition in stage B. If you observe the orange boxes on slide 12, lecture 2, you'll notice that the read of R1 (in stage B of instr2) happens before the write of R1 (in stage C of instr1), meaning that instruction 2 ends up reading some old value of R1. Because of this dependence, instruction 2 cannot begin in the second cycle. It must begin in the third cycle. Hence, if we have two back-to-back dependent instructions, we can expect one stall cycle (a cycle where no result is produced). Therefore, the time gap between two independent instructions leaving the pipeline is (t1/3 + t0) and the time gap between dependent instructions leaving the pipeline is (2*t1/3 + 2*t0). If t0 is relatively small, we can safely conclude that the pipelined implementation is faster than the unpipelined implementation, where the time gap between successive instructions (whether dependent or not) is always t1+t0. Having examined the above specific example, let us see how performance changes as a function of the number of pipeline stages. Before we can do that, let us first see how a dependence can introduce stalls between successive dependent instructions. Consider a single instruction's execution below with two key moments marked: A regvalue gets read here A regvalue gets written here | | | | V V -------------------X---------------------------------------Y Assume that moment-X happens after 10 units of time and moment-Y happens after 30 units of time. If there are two successive dependent instructions (with the dependence between Y of the first instruction and X of the second instruction), for the dependence to be honored, instruction 2 would have to start 20 units of time after instruction 1's start. In other words, if it takes time t1 to execute an entire instruction, if there was a dependence as above, the gap between the two instructions must be at least enough to complete (2*t1/3) time units of work. Note that this discussion never made any assumptions about the number of pipeline stages. All we have done is identify the time gap between instructions imposed because of a dependence. Now, suppose we break the processor into N pipeline stages. The clock cycle time would be (t1/N + t0). The gap between independent instructions = t1/N + t0 For dependent instructions, we need a gap of 2*t1/3 time units worth of work. Since each pipeline stage does about t1/N units worth of work, the gap of 2*t1/3 corresponds to (2*t1/3) / (t1/N) = 2*N/3 pipeline stages. Hence, the gap between dependent instructions = Num pipeline stages * time per pipeline stage = (2*N/3) * (t1/N + t0) = (2*t1/3) + (2*N*t0/3) Compare this with the unpipelined implementation, where gap between independent/dependent instructions = t1 + t0 We can see that as we increase N, gap between independent instructions keeps reducing, while gap between dependent instructions keeps increasing. Depending on which is dominant, performance may either improve or worsen. If a graph is plotted, we frequently observe that performance improves at first, reaches a peak around N=30 to 50, and then starts dropping beyond that. If we break up an instruction's execution into 30-50 pipeline stages, there is usually enough time within each pipeline stage to do about 6 to 10 gate delay's worth of work (note that most circuits consist of a series of logic gates). Next, we will design a pipeline that is more complex than the 3-stage example we used on slide 2, lecture 3. The figure on slide 4, lecture 3, shows a 5-stage pipeline. The first stage (Instruction Memory) brings in an instruction from some storage (either cache or memory). The instruction is then stored in the latch. In the second stage, the instruction is decoded and the appropriate input register values are read from the register file and stored in the latch. The ALU then operates on these register values and produces a result. If the instruction is simply trying to do ALU arithmetic, it can skip the fourth stage. If the instruction is a load or store operation, the third stage is used for computing the memory address. This memory address is then used in the fourth stage to actually fetch/store a value from/to memory. In the fifth stage, the result of the ALU computation or the load instruction is finally stored into the register file. Now consider a series of complications with this pipeline implementation. We will assume that a branch instruction completes in the second stage -- in other words, the second stage reads register values, makes a comparison, figures out which way the branch is going, and has time to update the PC with the new target. This means, we can start fetching from the correct location in cycle 3. But what about the instruction that was fetched in cycle 2? We will later examine multiple ways to handle this instruction, but keep in mind for now that branches pose this problem. If it takes even longer to resolve a branch, more cycles will go by without us knowing what instruction to fetch. These are known as control hazards. The second potential complication with the above pipeline is the conflict for storage. In cycle 4 in the figure, instruction 1 is trying to access data storage, while instruction 4 is trying to access instruction storage. If the processor had a single unified storage for both (for example, a unified cache with a single access port), both operations cannot simultaneously happen. This would give rise to stalls. In fact, performance would be slowed down by a factor of 2. If the processor has separate storage for instructions and data, a new instruction can enter/leave the pipeline every cycle. If there is a unified storage, 3 instructions enter the pipeline in the first 3 cycles, and then there are 3 stall cycles when no instructions can enter the pipeline because of conflict for the storage. This is known as a structural hazard and can be usually easily dealt with by throwing more hardware at the problem -- for example, separate caches for instructions and data. Another example of a structural hazard is posed by the register file. In cycle 5 in the figure, instruction 1 is attempting to write to the register file while instruction 4 is attempting to read from the register file. By allowing multiple read and write ports, we can do away with the resource conflict. For the rest of this discussion, we'll make a slightly different assumption. We'll assume that read and write are operations that take half a clock cycle. So writes are completed in the first half of the cycle and reads are completed in the second half of the cycle, thereby posing no conflict. (In modern-day processors, this is not a reasonable assumption as reads and writes can often take more than one full cycle.) Finally, we'll examine the third type of hazard: the data hazard. As we had seen before, we may have to introduce gaps between successive instructions if there is a dependence. Consider the example on slide 13, lecture 3. The first instruction writes the result into register $2 during cycle 5 (based on our assumption above, the write completes half-way through the cycle). All the subsequent instructions read $2 from the register file. The second instruction reads $2 during cycle 3, so it clearly receives some old value. Similarly, for the third instruction. For the fourth and fifth instructions, their register reads yield the latest updated value. Hence, for the pipeline to work correctly, there have to be two cycles of inactivity after the first instruction and the second instruction must enter the pipeline at cycle 4 in order to read the correct input operands. However, we can eliminate these stalls with a technique called bypassing/forwarding/short-circuiting. Even though the result of an operation is written into the register file in the fifth stage, the value is often known earlier. For an integer add, the value that is being written is known as early as at the end of stage 3. The result in the latch can be forwarded to the ALU and a multiplexor unit before the ALU selects the right input -- it can select the value in the latch after stage 2 (whatever it read from the register file) or the value in the latch after stage 3 (whatever was produced by the previous instruction). We can extend the concept further and have the multiplexor also select the value in the latch after stage 4 (whatever was produced by the previous-to-previous instruction). Clearly, we must also provide some control bits to the multiplexor so it can make the appropriate selection of inputs -- we will examine that later. This "forwarding" mechanism "bypasses" the register file and allows dependent instructions to execute in back-to-back cycles. Of course, bypassing does not eliminate all stalls. For example, if the first instruction was a load, the result is produced only at the end of stage 4. Since a dependent instruction needs its input at the start of stage 3, it is impossible to forward the result of the load to the ALU in time. Hence, to detect if we need stalls between dependent instructions, we have to identify when the first instruction produces its result and when the second instruction needs that particular result. Control Hazards On a branch, we have the following options: (i) Don't do anything until we figure out which way the branch is going. (ii) Assume the branch is either taken or not taken and start fetching from the new target. If there is a mis-predict, again we have two options: (a) squash the mis-fetched instruction (b) let the mis-fetched instruction complete -- the completed instruction is said to be in the branch delay slot. The compiler is responsible for placing an instruction in the branch delay slot such that it does useful work most of the time and never causes wrong results. Slide 10, Lecture 4 shows three ways in which the delay slot can be filled with useful instructions. In the first example, an instruction before the branch can be moved after the branch -- we always need its result and luckily it is independent of the branch condition. In the second example, the instruction before the branch cannot be moved after the branch as the branch condition depends on it. Hence, we try to move some later instruction into that slot. Somehow (perhaps with profiling), we figured out that the branch is often taken -- so we place an instruction from the taken part into the delay slot. This instruction happens to write to R4. If it turns out that the branch was not taken, we may go the other way. If we try to read the value of R4, we end up getting an incorrect value. Hence, we can move a later instruction into the delay slot only if it writes to a register that is not live (in other words, if we end up going the other way, at least we didn't destroy relevant state). The third example moves an instruction from the not-taken part into the delay slot. If a pipeline had no stalls, a new instruction leaves the pipeline every cycle to yield a CPI of 1. In a realistic pipeline, performance (CPI) can be expressed by 1 + stalls-per-instruction. In the in-order processor described so far, the need for stalls, bypassing, etc., must be determined in the instruction decode stage. Accordingly, control signals for multiplexors in the pipeline are generated so that each stage knows where to get its inputs from. Slide 14 shows how the pipeline would be extended to handle multi-cycle operations, such as multiply, FP-add, divide. Some units, such as the multiplier/adder can be pipelined so that when the first operation moves to the second stage, a new operation can enter the first stage (initiation interval of 1). The divide unit re-uses the same adder over 25 cycles, so a single instruction ties up the resource for 25 cycles (initiation interval of 25). Now, even though an instruction enters the pipeline in order and starts executing in order, completion has now suddenly become out of order. This leads to various complications. Two instructions that began execution in different cycles may complete in the same cycle and raise a structural hazard for the next pipeline stage -- more stall cycles. Also, two instructions that both write to R1 may complete out-of-order and return the wrong value to future reads of R1. If there is a use of R1 between the two writes to R1, this will not happen -- the use will be stalled until the first write to R1 produces its result. Since instructions start execution in order, the stall of the use of R1 will also end up stalling all subsequent instructions (and the second write to R1). The other concern with out-of-order completion is the lack of imprecise exceptions. When an exception is raised, we'd like to make a clean checkpoint of the program state at some PC, so that the programmer can see what has happened until that point (and debug), and so that the program can resume execution from that PC after it returns from the exception servicing. It is difficult to preserve in-order program state if we allow out-of-order completion. A later instruction may have written into a register even if it happens to be after the excepting instruction. To deal with this problem, all completing instructions write their results into some separate temporary registers. These temporary register values are moved to the register file in program order -- when the oldest instruction in the pipeline has completed, it "commits" and its temporary result is moved into the register file.