Lecture Notes CS/EE 6810

Chapter 1: Fundamentals of Computer Design (Lectures 1 and 2)

Microprocessor performance has improved at a rate of approximately 50% every year for the past two decades. Part of this improvement can be attributed to material scientists, physicists, and VLSI engineers -- process technology improvements have resulted in faster transistors nearly every year. If architects had stuck to a single processor design from 1985 and then retired, we would still have enjoyed 35% annual performance improvements because of better technology. This means that architectural innovations have contributed roughly 15% improvement every year, and these innovations are the primary subject of this course.

While clock speed improvements were dramatic in the 80's and 90's, the rate has slowed considerably in recent years. Clock speeds improve for two reasons: (i) designers choose to do less work in each pipeline stage, (ii) transistors get faster. Since we are already doing very little work in each stage today, improvements from the first factor will soon be non-existent. Hence, clock speeds can no longer improve at a dramatic rate. Further, power consumption is a linear function of clock speed and we are reaching power limits as well. Even though a chip may be sold with a high peak frequency rating, it will often run at a much lower frequency if on-chip sensors detect a thermal emergency.

It is also harder to make a single processor run faster with architectural techniques because most of the low-hanging fruit has been picked. Most attempts to improve parallelism within a single thread entail huge overheads in terms of complexity and power and are no longer attractive to industry. For these reasons, multi-core processors have emerged as the next major trend for processors. Since the number of transistors on a chip doubles every two years, a chip can accommodate multiple processing cores in the same space previously occupied by a single core.
Intel, AMD, Sun, IBM, etc. can convince users to buy new processors every few years by providing more cores, even though each individual core is not much faster.

Technology trends

A significant fraction of the performance improvement over the last two decades can be attributed to constantly shrinking transistor sizes. Transistor dimensions are projected to be as low as 35nm in 2014 (the dimensions in 1997 were 250nm). By increasing transistor density by 35% every year and die size by 10-20% every year, it has been possible to increase functionality on a chip. Smaller transistors also lead to a corresponding linear reduction in transistor delay, making it possible to improve the rate at which work gets completed and hence the clock speed.

Transistors are connected by wires on metal layers that lie above the silicon layer. As transistors become smaller, wires have to traverse shorter distances. Unfortunately, the wires are also becoming narrower, which causes their resistance per unit length to go up. This increases the delay of a wire, and the net result is that wires have not been improving at the same rate as transistor logic. While a signal could go across the chip in a single cycle in the past, that delay could become tens of cycles in the future. The Pentium 4 already incurs a few cycles' worth of delay to send messages across the chip (leading to multiple pipeline stages).

Similar to the wire delay problem on a chip, memory delays (measured in processor cycles) have also been growing. Memory chips are composed of DRAM cells (more details in a later chapter) and with technology improvements, we have been able to pack more cells into a chip. So while memory capacity, cost, and bandwidth have been improving dramatically, latency has improved only marginally. While a memory access would take tens of cycles in the past, it is projected to take on the order of 500 cycles within the next decade. The same story is true of disks as well.
While disk density has improved dramatically, disk latency has been languishing in the millisecond range forever. Network bandwidths have also improved, but end-to-end latency is limited by the speed of light.

The shrinking of transistors decreases the capacitance of individual transistors. Likewise, on-chip voltage levels have been decreasing. Both trends reduce the energy of each switching event. Unfortunately, dynamic power (dissipated when transistors switch) is also a function of the number of transistors and the clock frequency, both of which have been going up. The net effect is that dynamic power has been increasing every year. Leakage power, which is the result of current leaking from the supply to ground at all times, whether or not the transistor switches, has also been increasing. Processors already dissipate more than 100W of power over an area of about 1cm^2, which exceeds the power density of a hot plate! If the upward trend continues, we will soon reach the power density of a nuclear reactor. Clearly, architects must engage in power-aware design so we never get to that point.

Measuring Performance

Performance is usually measured as response time or throughput. The two are related. If we are able to improve the processor so it has better response time, then it usually also implies better throughput (assuming that other factors such as scheduling policies remain unchanged). If we improve the processor to have better throughput, it does not necessarily imply better response times -- in fact, response time usually gets worse. High throughput implies that we are not wasting resources -- this may mean a lot of context-switching and threading to ensure that we are able to find enough work for the resources. This in turn increases contention for each program, and each program ends up with a slower response time.

There exist three main market segments today (embedded, desktop, and server) and it is important to understand what drives each segment. For example, the embedded domain requires processors to consume little energy and be relatively cheap.
The processor one designs for the embedded domain is likely to be very different from the processor one would design for the desktop domain, where performance is a much greater factor. The server domain, on the other hand, cares about metrics such as throughput, availability, and scalability. (Keep in mind that power has become such a major constraint in recent years that even the desktop and server domains can no longer ignore it.)

To measure performance, power, etc., industry associations come up with sets of programs (benchmark suites) that they feel represent their market. There exist many benchmark sets, such as SPEC CPU (to measure CPU performance), SPECapc and SPECviewperf (to measure graphics performance), SPECSFS (to measure disk and network I/O), TPC (database performance), and EEMBC (embedded processor performance). All the programs within the suite are run (following the guidelines from the association) and the performance numbers (execution times) are finally combined into one single number -- for example, the SPEC rating for a CPU. The execution times for all programs can be combined using one of many metrics -- the arithmetic mean (AM) of normalized execution times and the geometric mean (GM) of execution times are the two most popular (the latter is used to derive the SPEC rating).

AM of normalized execution times

A workload is first determined: for example, run program A to completion, then run program B to completion, then program C to completion, and so on. This choice of workload is executed on a reference machine and we determine execution times for A, B, C... To get the rating for a new machine X, the exact same workload is run on the new machine to determine execution times. These execution times are normalized with respect to the execution times for the reference machine. For example, it may turn out that the normalized execution times on the new machine are 0.8, 1.1, 0.5. The normalized execution times for the reference machine are of course 1.0, 1.0, 1.0.
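The normalization step can be sketched in a few lines of Python. This is only an illustration: the raw execution times below are made up (the notes give only the normalized values), and the function names are my own.

```python
# Hypothetical execution times in seconds. The notes give only the
# normalized values (0.8, 1.1, 0.5), so these raw numbers are made up.
ref_times = {"A": 10.0, "B": 20.0, "C": 40.0}   # reference machine
new_times = {"A": 8.0, "B": 22.0, "C": 20.0}    # new machine X


def normalize(new, ref):
    """Divide each program's time on the new machine by its reference time."""
    return {prog: new[prog] / ref[prog] for prog in ref}


def arithmetic_mean(values):
    """Plain arithmetic mean of an iterable of numbers."""
    values = list(values)
    return sum(values) / len(values)


normalized = normalize(new_times, ref_times)
am = arithmetic_mean(normalized.values())

print({prog: round(t, 2) for prog, t in normalized.items()})
print("AM of normalized times:", round(am, 2))
```

With these made-up inputs, the normalized times come out to 0.8, 1.1, and 0.5, the same values used in the example, and their arithmetic mean is about 0.8.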
From this we can conclude that a workload where every program would have executed for 1 second on the reference machine would end up taking 0.8, 1.1, and 0.5 seconds on the new machine X. In other words, a certain amount of work took 3 seconds on the reference machine and only 2.4 seconds on X --> X takes 20% less time, which corresponds to a speedup of 3/2.4 = 1.25. Therefore, the conclusions that one derives by examining AM of normalized execution times will indeed be true if the workload is such that every program in the benchmark set executes for an equal amount of time (an equal number of cycles) on the reference machine.

GM of execution times

This requires no reference machine. The workload is fixed and is executed on the new machine. The execution times of the n programs are multiplied, the nth root is taken, and we get a measure for how fast the machine was. Unfortunately, this measure really doesn't tell us how much faster the new machine will run on some specific workload. The performance improvement for some workload may very likely be in the vicinity of that predicted by the GM, but there is no guarantee. As an example, consider slide 12 of lecture 1. If I were to trust the GM, I would conclude that A had the same performance as B and C was about 1.6 times faster than A. However, no workload mix will ever yield execution times that are consistent with these conclusions. Therefore, while GM has the nice property that it does not require a reference machine, it is basically just a warm, fuzzy number that need not necessarily predict performance for any specific workload.

While execution time is the ultimate metric for performance, it has a number of components, as given by the equation on slide 15, lecture 1. Clock cycle time gives an indication of the peak rate at which instructions enter and leave the processor. IPC (instructions per cycle) is the realistic empirical rate at which instructions leave the processor. Instruction count takes into account the efficiency of translating the original source code into the processor's assembly instructions.
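As a rough sketch of how the three components combine, the snippet below uses the usual textbook form of the equation (execution time = instruction count x CPI x clock cycle time). I am assuming this is the form on the slide, which is not reproduced in these notes. The instruction counts, CPIs, and clock rates are made-up numbers chosen only for illustration.

```python
def execution_time(instruction_count, cpi, clock_hz):
    """Seconds = instructions x (cycles / instruction) x (seconds / cycle).

    Assumed textbook form of the execution time equation.
    """
    cycle_time = 1.0 / clock_hz
    return instruction_count * cpi * cycle_time


def cpi_from_ipc(ipc):
    """CPI is simply the reciprocal of IPC."""
    return 1.0 / ipc


# Hypothetical program: 2 billion instructions, CPI of 1.5, 3 GHz clock.
base = execution_time(2e9, 1.5, 3e9)

# Two different ways to get the same gain: a better CPI or a faster clock.
better_cpi = execution_time(2e9, 1.2, 3e9)
faster_clock = execution_time(2e9, 1.5, 3.75e9)

print(round(base, 3), round(better_cpi, 3), round(faster_clock, 3))
```

Here a 20% lower CPI and a 25% faster clock give the same execution time, which is why none of the three terms should be judged in isolation.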
Computer architecture influences all of these aspects, most significantly IPC (or CPI -- cycles per instruction). Let's say we came up with an architectural technique that only impacted CPI. We ran a benchmark suite and measured CPIs with and without our innovations. To summarize the CPIs, we could use GM, but that would not necessarily reflect execution time for any workload. However, it would reflect the GM of overall execution time, which is used in some summaries such as the SPEC rating. AM of CPIs would end up adding the cycles assuming each program ran for one instruction. Therefore, AM of CPIs does reflect performance (execution time) for a workload where every program executes for an equal number of instructions. Likewise, you'll find that the harmonic mean (HM) of CPIs (which is 1/AM of IPCs) reflects performance for a workload where every program executes for an equal number of cycles.

When comparing the performance of two systems, it is common to use the terms "speedup" and "improvement". Speedup is always a ratio. For the example on slide 18, lecture 1, the speedup with the new laptop is 100/70. Note that performance = 1/execution time and speedup of A over B is performance_A/performance_B = exectime_B/exectime_A. A percentage change is always represented as (newvalue - oldvalue)/oldvalue. Hence, the improvement in performance with the new laptop is (1/70 - 1/100) / (1/100), or about 43%.

Principles of Computer Design

There are some basic principles of computer design that will be frequently applied throughout this course. One of the more important ones is Amdahl's Law. It effectively states that we must target bottlenecks. If some component accounts for 60% of overall execution time, optimizing that component can reduce execution time by at most 60%. Likewise, if some component only accounts for 2% of execution time, it is best left alone, as our best efforts will yield a 2% overall benefit at best. The second important principle is that of locality.
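Before looking at locality, the speedup, improvement, and Amdahl's Law arithmetic above can be checked with a short sketch. The Amdahl formula used here is the standard textbook form (it is not written out in these notes), and the function names are my own.

```python
def speedup(old_time, new_time):
    """Speedup is a ratio of execution times (old over new)."""
    return old_time / new_time


def percent_improvement(old_time, new_time):
    """Percentage change in performance, where performance = 1 / time."""
    old_perf = 1.0 / old_time
    new_perf = 1.0 / new_time
    return 100.0 * (new_perf - old_perf) / old_perf


def amdahl_speedup(fraction, component_speedup):
    """Overall speedup when `fraction` of execution time is sped up.

    Standard textbook form of Amdahl's Law (an assumption here).
    """
    return 1.0 / ((1.0 - fraction) + fraction / component_speedup)


# Laptop example from the notes: execution time goes from 100 to 70.
print(round(speedup(100, 70), 3))
print(round(percent_improvement(100, 70), 1))

# Best case: the targeted component becomes infinitely fast.
print(round(amdahl_speedup(0.60, float("inf")), 3))
print(round(amdahl_speedup(0.02, float("inf")), 3))
```

Even with an infinitely fast component, the 60% case tops out at a speedup of 2.5 (execution time drops by 60%), while the 2% case barely moves, at roughly 1.02.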
It turns out that most program instructions and data are repeatedly accessed within a short time interval (temporal locality). It also turns out that most program instructions and data are accessed in contiguous chunks (spatial locality). The third important principle is the mining of parallelism. Often, operations within the processor are not inter-dependent and can happen in parallel. Parallelism manifests at the circuit, architecture, and system level, as shown by the examples on slide 4, lecture 2.

Before we jump into the details of designing architectures, let's examine how our designs might influence cost. As the equations on slide 7, lecture 2 show, the area of the chip (die) can significantly influence cost. Basically, silicon wafers are circular and undergo a number of processing steps (to etch out transistors on silicon, draw wires on metal layers, etc.). Producing a silicon wafer has a fixed cost (assuming that the number of metal layers is also fixed). The cost of each individual chip is then a function of how many good chips we can extract out of one wafer. Clearly, area plays a major role and the relationship is expressed in the third bullet (the last term accounts for the fact that all the dies on the circumference have to be discarded). A number of dies must also be discarded because of defects. The number of defects that show up on a die is a function of the defect rate for that technology as well as a parameter alpha, which tries to capture the complexity of the manufacturing steps. Bullet 4 is an empirically-determined formula that expresses the relationship. It turns out that smaller chips are likely to have higher yield rates because a small area improves the chances of a chip escaping a defect. As seen on slide 8, doubling the area of a chip results in roughly one third as many chips being produced from one wafer.

While the relationship between cost and price is rather complex, in general, one would expect them to correlate.
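The die-cost reasoning above can be sketched as follows. The slide equations are not reproduced in these notes, so the formulas below are the usual textbook forms and are only assumed to match the slide. The wafer size, wafer cost, defect density, and alpha are made-up illustrative values.

```python
import math


def dies_per_wafer(wafer_diameter_cm, die_area_cm2):
    """Wafer area over die area, minus dies lost along the circumference."""
    wafer_area = math.pi * (wafer_diameter_cm / 2.0) ** 2
    edge_loss = math.pi * wafer_diameter_cm / math.sqrt(2.0 * die_area_cm2)
    return wafer_area / die_area_cm2 - edge_loss


def die_yield(defects_per_cm2, die_area_cm2, alpha, wafer_yield=1.0):
    """Empirical yield model (assumed textbook form)."""
    return wafer_yield * (1.0 + defects_per_cm2 * die_area_cm2 / alpha) ** (-alpha)


def good_dies(wafer_diameter_cm, die_area_cm2, defects_per_cm2, alpha):
    """Expected number of defect-free dies from one wafer."""
    return (dies_per_wafer(wafer_diameter_cm, die_area_cm2)
            * die_yield(defects_per_cm2, die_area_cm2, alpha))


def cost_per_good_die(wafer_cost, wafer_diameter_cm, die_area_cm2,
                      defects_per_cm2, alpha):
    """Fixed wafer cost spread over the good dies."""
    return wafer_cost / good_dies(wafer_diameter_cm, die_area_cm2,
                                  defects_per_cm2, alpha)


# Hypothetical parameters: 30 cm wafer, 0.6 defects per cm^2, alpha = 4.
small = good_dies(30.0, 1.0, 0.6, 4.0)
large = good_dies(30.0, 2.0, 0.6, 4.0)
print(round(small), round(large), round(small / large, 2))
print(round(cost_per_good_die(5000.0, 30.0, 1.0, 0.6, 4.0), 2))
```

With these made-up parameters, doubling the die area cuts the number of good dies by well over half, because the larger die loses out on both count and yield.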
In addition to the cost of producing the chip, the price also includes items such as direct costs (labor and warranty that are directly tied to that product), indirect costs (R&D and building rental that are tied to all products the company makes), and retail mark-up. It is also worth noting that the processor in a desktop system only accounts for about 22% of the total cost -- this should be kept in mind while considering performance-cost trade-offs. While purchasing any system, it is important to examine three metrics: price, performance, and performance/price. In the embedded domain, it is also important to inspect performance/watt. In some cases, systems with a poor performance/price ratio also do well because they have other favorable properties such as better reliability, availability, customer support, opportunity for upgrades, etc.

Dependability

The next major metric for a computer system is its dependability. The examples on slide 10 illustrate a few basic definitions. A fault is a bad event; it becomes an error as soon as it influences the computations of a program; and a failure is said to have happened when the output of the program deviates from the norm. Failures and restorations cause the system to toggle between states of service accomplishment and service interruption. Reliability is a measure of mean time to failure (MTTF). The inverse of MTTF is the FIT rate (failures in time), commonly reported as failures per billion hours of operation. If a system has multiple components that fail independently, the FIT rate of the system is the sum of the FIT rates of the individual components. Availability is a measure of the fraction of time that a system is in a service accomplishment state.
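These definitions can be tried out with a minimal sketch. The component MTTF values are made up. The mean time to repair (MTTR) is a symbol introduced here, not one from the notes, and availability is computed as MTTF / (MTTF + MTTR), which is the usual way to express the fraction of time spent in service accomplishment.

```python
HOURS_PER_BILLION = 1e9


def fit_rate(mttf_hours):
    """Failures per billion hours for a component with the given MTTF."""
    return HOURS_PER_BILLION / mttf_hours


def system_mttf(component_mttfs_hours):
    """MTTF of a system whose components fail independently.

    The system FIT rate is the sum of the component FIT rates.
    """
    total_fit = sum(fit_rate(m) for m in component_mttfs_hours)
    return HOURS_PER_BILLION / total_fit


def availability(mttf_hours, mttr_hours):
    """Fraction of time in service, assuming MTTF / (MTTF + MTTR)."""
    return mttf_hours / (mttf_hours + mttr_hours)


# Hypothetical components: disk, fan, and power supply MTTFs in hours.
components = [1_000_000.0, 500_000.0, 200_000.0]

mttf = system_mttf(components)
print(round(sum(fit_rate(m) for m in components)))   # system FIT rate
print(round(mttf))                                    # system MTTF in hours
print(round(availability(mttf, 24.0), 6))             # assumed 24-hour repair
```

Note how the system MTTF ends up lower than that of the weakest component, since every additional component adds to the overall failure rate.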