Lecture Notes CS/EE 6810 Chapter 2: Dynamic ILP (Lectures 7 and 8 and 9 and 10) Branch prediction Most processors, whether they rely on the compiler or hardware for high performance, rely on hardware-based dynamic branch prediction in order to fetch worthwhile instructions. The branch predictor is responsible for estimating which way a branch will fork at run-time. In order to make this prediction at run-time, we maintain a "cache" of recent branch outcomes. For example, if I know which way a branch went the last time it was in the pipeline, I can predict that it will have the same behavior and get a pretty high prediction accuracy. Consider the following simple branch predictor. We maintain 1024 1-bit values. The last 10 bits of the branch PC are used to index into one of these 1-bit values. The 1-bit value stores which way the branch went last time (1 for taken and 0 for not-taken) and that is used as the prediction. If multiple branches map to the same 1-bit value, there will be interference -- a larger table of 1-bit values will probably have a higher prediction accuracy. In the example code on slide 7, lecture 7, there are two loops that execute for 10 and 20 iterations and the program keeps visiting these loops. Every time we enter the first loop, we predict not-taken because that is what that loop did the last time (the last time we were in the loop, we exited it -- branch not-taken). That is clearly the wrong prediction as the branch is taken for the next 10 instances. For the 10th branch, we predict taken because that is what we did the last time, but the branch is obviously not-taken. Thus, for every loop, there is a mispredict when we execute the first and last iteration of the loop. The prediction accuracy for the first loop is 80% and for the second loop is 90%. By using 2 bits per entry in the branch predictor, we can capture more than just what happened the last time we hit this branch -- it helps us capture the common branch outcome in recent history. Each entry of the branch predictor stores a value between 0 and 3. Every taken branch increments the value (never going above 3) and every not-taken branch decrements the value (never going below 0). If the value is 0 or 1, the next branch is predicted as not-taken, else it is predicted taken. The advantage of a 2-bit saturating counter is that it helps mitigate the effects of noise. If a branch consistently goes one way, an occasional branch that goes the other way will not influence the branch prediction (helps reduce the number of mispredicts for a loop). Similarly, if two branches map to the same entry in the predictor, the effects of interference will be reduced. We can also employ more than 2 bits per entry, but most designs have deemed it cost-effective to only employ 2 bits. Finally, we try to capture more than just the common case. We'll examine what has happened to recent branches and depending on the recent pattern, we'll figure out which way the next branch will go. In the branch predictors so far, a subset of PC bits are used to index into the branch predictor -- one specific counter keeps track of the common case for that branch. Now, we'll combine some bits of the branch PC with some bits of the branch history pattern to index and pick out one counter from the branch predictor table. That counter keeps track of the common case for that branch, with that history pattern. Thus, each branch can now map to multiple counters, one for each possible history pattern. This can help a loop, for example, figure out when it has reached the last iteration and needs to predict not-taken. If we have a single branch history register that keeps track of the last N outcomes for all branches, we end up having a global predictor. The expectation is that if we know how neighboring branches behave, it'll help indicate how the branch in question behaves. This predictor has relatively low overhead as we've only added a single branch history register to our table of 2-bit saturating counters. If we maintain multiple branch history registers, and each register keeps track of the last N outcomes for its branch, we have a local predictor. The branch PC is used to first index into a level-1 table that has (say) 64 branch history registers. We pick out the local history for that branch. That history is then (sometimes) combined with some branch PC bits to index into the second-level, which is our table of saturating counters. Thus, for each branch, we have a bunch of saturating counters, one for each history pattern for that branch. This predictor has a higher overhead as we must maintain a table of history registers. Some programs benefit from a local predictor, while others benefit from a global predictor. To get the best of both worlds, processors can have one predictor of each kind and use a third predictor to figure out which of the local or global predictor has been yielding better predictions for each branch. This third predictor is known as a tournament or combining predictor. Branch direction prediction happens in the first pipeline stage so we can re-direct fetch if necessary. If the branch is predicted not-taken, fetch continues at PC+4. If the branch is predicted taken, we must also compute what the branch target is so that the PC can be updated. Hence, it is also necessary to have a branch target predictor. Typically, it keeps track of where the branch went the last time and that is good enough most of the time. Out-of-Order Processor Implementations A high-performance processor today usually employs out-of-order execution. In the basic pipelines we have studied so far, we encountered structural hazards, data hazards, and control hazards. Data hazards can be further broken down as true data hazards and name data hazards. A true data dependence exists between two instructions if the second instruction reads a value written by the first instruction -- a Read After Write (RAW dependence). A name dependence exists if the second instruction is writing to a register that is either being read or written by an earlier instruction -- a WAW or WAR dependence. This is not a true dependence as there is no value being passed between the two instructions -- the second instruction cannot execute before the first simply because they both happened to use the same name for their output value. Register renaming can handle the problem and is the second stage of our out-of-order processor (slide 2, lecture 9). In the first stage, branches are predicted and instructions from the predicted path are brought into the instruction fetch queue. These instructions then go through the decode and rename stage. Since we care about committing instructions in program order, entries are created for each instruction in the reorder buffer (ROB). The ROB tracks all instructions that are presently in the processor in some pipeline stage. It also keeps track of the results produced by these instructions as this state is still speculative. A temporary register (in this example, T1-T6) is associated with each ROB entry. An instruction is not speculative until it "commits", at which point, the temporary results are copied into the register file. The rename stage assigns the result of each instruction to the temporary register associated with the ROB entry. The input operands of subsequent instructions are also renamed so that they refer to the correct register. The renamed instruction is placed in the issue queue. The issue queue keeps track of whether the instruction's register operands are available. As soon as registers are available, the instruction is sent to the ALUs to execute (before which, input registers are read from the ROB and the register file). This is the key out-of-order step -- in the in-order processor, when one instruction is stalled, all subsequent instructions are stalled. After completion, the result is written into the ROB and the register name is broadcast to the issue queue so that dependent instructions can mark the input operand as ready. When the oldest instruction in the ROB completes, it proceeds to the commit stage. In the example, the value in T1 is written into R1 and we have to go through the issue queue to make sure that instructions accessing T1 now refer to R1. Note that the renaming process eliminates WAW and WAR hazards. In the example in slide 2, the first and last instructions write to R1 and cannot be issued out-of-order in the in-order processor. By using T1 and T5 as the destination registers for these two instructions, there is no name dependence and they can be executed out-of-order (not in this case, since there are other dependences to be worried about, but at least there isn't a WAW dependence). When an exception happens, we must wait until that instruction reaches the top of the ROB. At that point, all previous instructions have committed and saved their results in the register file. Instructions after the excepting instruction have only written results into the ROB and not into the register file. The PC and register file are saved, so we can return to that PC after servicing the exception. If a branch is mispredicted, all instructions after it are squashed -- since they have only written results into the ROB, no damage was done by executing the mispredicted instructions. The only problem with the above implementation is that when an instruction commits, we have to go into the issue queue and change input operand names. Consider the following implementation that eliminates the problem. We have a single "physical register file" with 38 entries. At the start of the program, "logical registers" (the registers that are used by the program's assembly code) R1-R32 are mapped to physical registers P1-P32. The remaining 6 registers are used to store the temporary results of the first 6 instructions that enter the pipeline. At that point, we run out of registers and stall. When the first instruction completes, it proceeds to commit -- registers are not copied, but a register map table is updated to keep track of where the latest committed value of R1 can be found. We also don't have to change names within the issue queue. At this point, we can also free P1 as no instruction will ever need the value P1. P1 can now be used to store the result of the 7th instruction. Thus, if the processor has 6 rename registers, we can have 6 in-flight speculative instructions in the pipeline. Every time an instruction commits and leaves the pipeline, it frees up a register that allows a new instruction to enter the pipeline. Therefore, physical register file size - logical register file size gives an estimate of how many speculative instructions can enter the pipeline, i.e., the ROB size. Note that the ROB can actually be a little larger as some instructions (stores, branches) do not need registers to store their results. The formal rules for instruction dispatch and commit are described below: At decode/dispatch: Select a physical register from the pool of free (unused) physical registers as the destination register for the instruction. The input operands to the instruction are renamed to point to physical registers that contain the most recent write (in program order) to the corresponding input logical register. At commit: The physical register that contains the result of the instruction is now considered the "committed mapping" for the corresponding destination logical register. The physical register that was the earlier committed mapping for the instruction's destination logical register is put back in the free pool of physical registers as it is guaranteed that we do not need that register value. The examples so far have assumed that the issue queue resolves register dependences and that is enough to determine when an instruction can issue. That is true only for arithmetic operations. For loads and stores, we must first honor register dependences. Once the address has been computed, we have to identify memory dependences. RAW, WAW, and WAR hazards for memory have to be honored as well. If an earlier store address has still not been computed, subsequent loads will have to wait even if their addresses are known -- if the store happens to write to the same address as the load, the load should return the value written by the store, not the value that currently resides in memory. Stores cannot write to memory until we know for sure that they are not speculative -- hence, a store has to wait until it is the oldest in the ROB before it can write to memory. We have seen multiple options for improving performance. We can increase the bandwidth of each stage: in every cycle, let us fetch 10 instructions, decode 10 instructions, issue 10, commit 10, etc. Studies have shown this is not always the most cost-effective approach (note it also increases complexity). Deep pipelining is another way to improve performance, assuming we're not already at the optimum pipeline depth. There are implementation concerns here as well -- some structures such as the register file are just hard to pipeline. Note that pipelining also increases bypassing complexity as an ALU (sitting in the 15th stage) may need bypassing from the 16th, 17th, 18th,... stages. Higher capacity is another option -- larger branch predictors can improve prediction accuracy, larger ROB/issueq size can improve our chances of finding ready instructions. Of course, the larger the structures, the harder it may be to accommodate their access time within a fast clock cycle time. Some structures can take multiple cycles to access. For example, increasing the cache size may cause the latency to go up from 1 to 2 cycles and that means dependent instructions will have to wait longer for every single load. The following are useful guidelines: (i) Optimum pipeline depth is around 40-50. But since we've hit the power wall, no processor has exceeded a pipeline depth of around 25-30. In the future, 10-20 is likely common. (ii) A fetch/decode/dispatch/issue/commit width of 4 is very popular. Increasing this to 6 may improve ILP marginally, but has a penalty in terms of clock speed. (iii) Having 128 in-flight instructions yields close to optimal performance. A larger window will yield higher ILP and better tolerate long memory latencies, but again there is a possible clock speed penalty. Much research has explored techniques to improve ILP with structures that do not negatively impact cycle time. A ROB size of 128 must be used in tandem with a register file that has ~160 entries, an issue queue with ~40 entries, an LSQ with ~30 entries, an IFQ with ~16 entries. Consider the following techniques to improve performance (usually by eliminating stall cycles and hazards): Trace Cache: A line of instructions fetched from the instruction-cache has useful instructions until we encounter a taken branch. The remaining instructions in that line are useless for now. A trace cache maintains instructions in the order they get executed, which means that a frequently taken branch is followed by instructions from the branch target. This improves average instruction fetch bandwidth. This technique is employed in the Intel Pentium4, not for performance reasons, but because it reduces energy. The Intel x86 CISC architecture instructions are converted within the processor into multiple micro-ops (RISC-like instructions). To avoid this translation for every instruction, these micro-ops are stored in a trace cache. If the processor doesn't find its instructions in the trace cache, it looks up the regular instruction cache for the original CISC instructions. Register life times: A register is assigned to an instruction in the rename stage. The instruction then waits in the issue queue for its input operands. It eventually executes and writes the result into the register. This register is then read by dependents. Then the register is kept around until a later instruction commits. Hence, the register is active (written and read) for a brief period in between. There is a period of inactivity initially when there is no result in the register and there is another lull later when the value is simply being kept around in case something bad (mispredict/exception) happens. The first problem can be fixed by assigning the register to some virtual name. This virtual name is eventually mapped to a physical register when we are ready to write a value. The second problem can be solved by copying possibly inactive register values into a second level register file that is usually not read from and can have many more entries with fewer ports. Memory dependence prediction: In the LSQ, a load is not allowed to issue in case there is an earlier store with an unresolved address. We could implement a branch predictor-like structure that keeps track of whether a load has a habit of depending on a recent store. If not, we can speculatively issue the load instead of keeping it waiting in the LSQ. If the store address does happen to match the load address, a mispredict (similar to a branch mispredict) is signalled and we roll back. Value prediction: Most processors are severely limited by RAW hazards. To circumvent this dependence, we could predict the value that the instruction is waiting upon and start executing the instruction. This works surprisingly well because many programs frequently work with 0s or 1s or have continuously increasing/decreasing values (0,1,2,3,4 or 4,8,12,16..). When the predicted value is known, we must confirm that the prediction was correct, else we must roll back (similar to a branch mispredict). An example code execution on an out-of-order processor (lecture 11, slides 2-5) The discussion of this example problem is based heavily on Assignment 4 and the assumptions stated in that. The 7-instruction code sequence is being executed on a pipeline where an instruction gets placed in the issue queue at the end of the 5th pipeline stage. The instruction may sit in the IQ for as long as is required for its input operands to be made ready. When an instruction leaves the IQ, it goes through 5 more pipeline stages before it finally completes (writes a result into the register file, updates the ROB, etc.). We will also assume that "completion" coincides with "commit" if this instruction happens to be the oldes in the pipeline. For this example, we won't bother checking for memory dependences, so we will assume a single unified IQ that deals with all instructions. The only difference for a LD/ST is that it goes through 6 pipeline stages once it leaves the issue queue. At the start, logical registers R1-R32 are mapped to physical registers P1-P32 and registers P33-P36 are in the free pool. We are assuming a width of 4, so only 4 instructions can be handled at a time by any of the stages. When the program begins, in cycle i (i=5 in our processor model), the first four instructions get placed in the issue queue. They get renamed with physical registers P33, P34, and P35. Note that the store does not write to a physical register, so it does not need a physical register from the free pool. The final free physical register is used to rename the 5th instruction in the next cycle and place it in the issue queue. At this point, we have run out of physical registers and must wait for the first few instructions to commit before having registers at our disposal to rename the 6th and 7th instructions. When the first instruction commits, it releases the prior mapping of R1 into the free pool, i.e., before the 1st ADD was renamed, R1 was mapped to P1, and after the 1st ADD was renamed, R1 got mapped to P33. When this 1st ADD commits, it frees up the old mapping (P1) that is now no longer required by any instruction. Similarly, when the 2nd instruction (LD) commits, it frees up P2. Hence, the 6th and 7th instructions use P1 and P2 as their destination registers. Next, we'll look at the timing for each of these instructions. Since the first 4 instructions are placed in the issue queue at the end of cycle i, the issue queue will check for dependences in cycle i+1 and issue those instructions that have all their register operands ready. In cycle i+1, only the 1st instruction is ready and it issues. Instructions that are only waiting for the result produced by the 1st instruction can now leave the issue queue in the next cycle (we assume that producers and consumers can flow back-to-back through the pipeline, unless the producer is a load, in which case, there is a cycle gap between producer and consumer). The instructions waiting for P33 are the 2nd, 4th, and 5th... so they all issue in cycle i+2 (thankfully, they are already in the issue queue before then, else, they would have to first wait to be placed in the IQ). The third instruction is waiting for the result of the LD, so it issues in cycle i+4 (since a cycle gap is required between the producer and consumer). The completion time for each instruction is easy: it is simply 5 + the issue time (6 + issue time for LD/ST instructions). The commit time for the 1st instruction equals the completion time. The 2nd instruction has to wait until cycle i+8 to commit. The 3rd instrs waits until cycle i+9 to commit. In cycle i+9, we can also commit the next two instructions as they have also completed. Now that the we know the commit times for the first 5 instructions, we can tell that P1 is freed in cycle i+6 and P2 is freed in cycle i+8. Accordingly, the 6th and 7th instructions can be renamed and placed in the IQ in cycles i+7 and i+9, respectively. Their dependences are resolved and they issue in the next cycle, complete and commit 6/5 cycles later.