Lecture Notes CS/EE 6810 Chapter 5: Memory Hierarchies (Lectures 11, 12, 13, 14, 15, 16) For our discussions so far, we have assumed that the memory stage of the pipeline takes a single cycle. We've already seen that memory accesses are extremely slow and getting slower. A single memory access today takes about 300 cycles, so we need cache hierarchies to reduce stalls. The cache stores a subset of memory on the chip, allowing fast access to some memory. For example, a 32KB L1 data cache can service about 96% of all requests. Such a cache may have an access time of 2 cycles, resulting in few stalls. If there is a miss, for an in-order pipeline, the pipeline is frozen until the value is fetched from memory. For an out-of-order pipeline, other instructions can continue to execute while the memory fetch proceeds in the background, but eventually, the long-latency instruction will hold up commit and stall execution. Typically, separate L1 instruction and data caches are implemented so that instruction fetch and data fetch can happen at the same time. Smaller caches also have smaller access times. The L1 caches are backed up by a much larger (and slower) L2 cache. There is usually a single L2 cache that is shared by instructions and data. Higher access times for the L2 are not a great problem because the L1 takes care of most requests. The L2 is primarily in place to avoid expensive memory accesses and it is worth it even if it has a hit rate of less than 10%. Note that each level of the hierarchy is accessed sequentially -- that is, if there is an L1 miss, we do not look up the L2 and memory in parallel (to reduce power consumption and complexity). Therefore, in some cases, the addition of another level of cache can lead to poor performance if it yields very few hits (can happen when we're streaming through large data arrays that do not fit in cache). We'll begin by looking at how the cache is accessed. As an example, we first design a small cache that consists of 8 8-byte words. Since the CPU can address individual bytes, the last three bits of the address refer to the offset within a word. We need three more bits to select one of the 8 words. Thus, the last six bits of the address are used to select a byte out of this cache. If an address has 64 bits, the remaining 58 bits are stored in an adjoining tag array. Note that multiple different addresses can be stored in the same location in the cache. The tag keeps track of which address is currently stored in the cache. Hence, while picking a byte out of the data array, we must also pick the tag out of the tag array and compare it with the address requested by the CPU to make sure we're returning the correct data. If the tags do not match, a cache miss is signaled and the request is sent to the next level in the cache hierarchy. The amount of data stored in each row of the cache can be increased. For example, we could store 32 bytes in each row. The contiguous bytes stored in a row are referred to as a "block" or "cache line". We must now use the last 5 bits of the address as offset bits (to select 1 of the 32 bytes in that block). If two different words are frequently accessed and map to the same row, they can't co-exist together and this leads to many cache misses. A set-associative cache allows us to maintain two different blocks in each row. At a time, two blocks are read out of the data and tag arrays. If one of the tags matches the tag of the requested address, a hit is signalled. Each block is referred to as a "way" and all the ways in a row are referred to as a "set". Thus, the number of offset bits = log(block size). The number of index bits = log(number of sets). Tag bits = address size - offset bits - index bits. The size of the data array = block size * number of ways * number of sets. The size of the tag array = tag bits * number of ways * number of sets. If a cache has N blocks and 1 way, it is direct-mapped. If a cache has N blocks and N ways, it is fully-associative. Next, we'll examine what happens on a miss. On a read miss, you bring in an entire block (with the expectation that spatial locality will exist) and replace one of the existing blocks in the cache. If the cache is set-associative, we have choices when replacing a block. An LRU policy yields the best performance (since programs exhibit temporal locality). LRU policy is easy to implement for a 2-way cache, but for higher associativities, processors are forced to implement simpler pseudo-LRU mechanisms. On a write miss, it may often not be necessary to bring the entire block in (such a policy is known as write-no-allocate) (usually, when you write to memory, you tend not to read the value again in the near future). If the block is brought in on a write miss, it is known as a write-allocate policy. On a write hit, one may choose to immediately update the copy of the block in the L2 or in memory. That would be a write-through policy -- it is clearly more bandwidth intensive. However, it simplifies coherence if there are multiple processors sharing data (more details on this in the next chapter). To save on bandwidth, we could wait until a cache block is evicted before updating the L2 or memory. Such a policy is known as a writeback policy. Typically, the L1-L2 interface is write-through, while the L2-memory interface is write-back. There are many ways to improve cache behavior (more than 2000 papers were written on cache optimizations in the 1990s). We have already seen that multiple levels can improve performance (L1/L2/L3/memory). Note that we can attempt optimizations on the L2 that we cannot on the L1. For example, we can first look up the tags, figure out which way has the block we need and then only read out that way to save energy. There is an access time penalty in doing this because the tag and data arrays have to be accessed sequentially. This is worth doing for the L2 because L2 access time is not as critical and the large L2 size enables more energy savings with such optimizations. On a write miss, instructions are typically not waiting for the write miss to complete. On the other hand, the load instruction and its dependents are typically waiting for the result of a read miss. Hence, reads always have higher priority when accessing the next level. Writes typically are stored in a write buffer and the write to the next level happens when there is a lull in activity. Reads can bypass earlier writes and must look in the write buffers before advancing to the next level. A direct-mapped cache often suffers from misses because multiple words may map to the same location in cache. When a block is evicted from the cache, it can be stored in a small victim buffer in case we need the word in the immediate future. By keeping track of the last 8 (say) evictions, the victim buffer provides additional associativity for up to 8 different cache sets. This may yield most of the benefits of a 2-way set-associative cache. Cache misses can be categorized into three main classes: compulsory, capacity, and conflict. It is hard to exactly identify what causes a cache miss, so these are only intended to serve as a guideline. A compulsory miss happens when a word is accessed the first time. It can only be avoided through some kind of prefetch mechanism. A large line size is the simplest form of prefetch as it brings in neighboring words on every cache miss. To count the number of compulsory misses, simulate an infinitely sized cache. Capacity misses are caused because the program's working set size is larger than the cache capacity. For example, if we access N different blocks between successive accesses to a block X, the second access to X is likely to yield a cache miss if the cache can only accommodate fewer than N blocks. Thus, the additional misses encountered in moving from an infinite cache to a fully-associative finite cache can be attributed to capacity misses. Conflict misses are caused because a block got evicted only because a different block was also mapped to the same location. The number of additional misses caused by moving from a fully-associative cache to a direct-mapped cache can be attributed to conflict misses. As stated, these are only guidelines. For example it is possible that the number of misses may reduce in moving from a fully-associative to a direct-mapped cache. Take the following example: a program accesses blocks 1 to N+1 in round-robin fashion and the cache can only store N blocks. With a fully-associative cache, every access is a miss because N other blocks are accessed between successive accesses to the same block. With a direct-mapped cache, only blocks 1 and N+1 raise conflicts and lead to cache misses, while none of the other blocks pose conflicts and have perfect hit rates. This so-called paradox can be attributed to the LRU replacement policy. Let us examine the effect of different parameters for a fixed cache size. If we double the number of sets and halve the block size, the number of compulsory misses go up (since a smaller block size leads to less prefetch). The number of capacity and conflict misses should remain roughly constant. If we double the number of sets and halve the number of ways, there may or may not be more conflict misses (note that the probability of two blocks mapping to the same set goes down, but if there is a conflict, the probability that we have enough ways to accommodate both blocks goes down too). If we double block size and halve the number of ways, compulsory misses go down and conflict misses may or may not increase (an argument similar to the one above). DRAM We have already seen that cache misses are expensive because memory accesses cost about 300 cycles. We will next see why memory latency is difficult to improve, while memory bandwidth is easy to improve. A memory chip contains an array of bits (in this example, we show a 1Mb chip with a 1024x1024 array of bits). The memory controller first provides 10 row address bits to pick out an entire row of bits. 1024 bits are read out into the column decoder register. The memory controller then provides 10 column address bits (usually on the same pins that provided the RAS bits) that pick out at least 1 bit out of the column decoder and send it back to the CPU. Column decoder access is fast because it is mostly logic. Using the RAS bits to pick a row out of the array involves long wire delays and is usually much slower (hence, memory latency does not improve much every year). Once we've read out an entire row, the column decoder can quickly forward a series of bits back to the CPU (thereby, improving bandwidth). Bandwidth can also be improved by stacking multiple memory chips and having each chip provide 4 bits (say) and providing enough wires to send all of these bits back to the CPU in parallel. A cache line is therefore interleaved across multiple chips (the first chip stores the first 4 bits of the cache line, the second stores the next 4 bits and so on). Memory bandwidth can also be improved by employing synchronous transfer between memory and CPU -- the next word is sent on every clock edge. With an asynchronous transfer, overhead is incurred for every word as we have to wait for acks, etc., before sending the next word. Memory chips employ DRAM technology because it allows high storage density and low cost. This is possible because a single transistor stores charge for a bit to keep track of 0 or 1 (SRAM on the other hand, employs 6 transistors per bit, affording good speed, but low density and high cost). Unfortunately, the charge on the single transistor leaks away over time and is often affected when the value is read. In order to not lose this value, the memory controller periodically reads every row within memory and writes it back. Because of this periodic refresh operation, the memory is referred to as "dynamic" RAM or DRAM. This also leads to unpredictable access times as we may have to wait for the refresh operation to complete before our request can be serviced. Finally, we'll see how virtual memory is organized on a system. Virtual memory makes it possible for each process to have the illusion that a lot of memory is available to it. For example, each process may believe that it can have 4GB (say) of virtual memory, but the physical memory on the system may only have a capacity of 1GB (that is shared by all processes). Thus, a part of the process' memory is stored in physical memory, while the rest is stored on disk. Hopefully, most of our requests can be found in memory -- if not, a page fault is incurred and data is copied from disk to memory (very, very expensive). Data is typically organized at the granularity of pages (say, 8KB page size). Each virtual page is mapped to some physical page. The CPU and the process only deal with virtual memory addresses. Before we can access memory, we must know the physical address, which requires us to translate the virtual address into a physical address. Each process maintains a page table that tracks the virtual to physical page translation for every virtual page. Obviously, the page table itself may contain millions of bytes of data and will often require a memory access. To reduce this penalty, a subset of virtual to physical page translations are stored in a small buffer called the translation look-aside buffer (TLB). Before accessing memory, the TLB is looked up to find the physical page number and hopefully, most accesses will be hits. On a miss, the page tables will have to be looked up (much slower). Note that out of a 64-bit (say) virtual address, the last 13 bits pick a byte out of an 8KB page size (the page offset). Since we are allocating memory at page granularity, the page offset is the same in the virtual and physical address. The remaining 51 bits of the virtual address comprise the virtual page number. These are translated to yield the bits of the physical page number. When accessing the cache, the implementation would be greatly simplified if we only dealt with physical addresses. However, this requires us to first do the virtual-to-physical translation which adds to the cache access latency. It would be good for performance to begin accessing the cache with just the virtual address. Let's first look at potential pitfalls if we use the virtual address to index into the cache. Note that different processes can share data -- a single location in physical memory may be mapped to the virtual address spaces of different processes (potentially at different virtual addresses). When one process updates that data, it should be visible to the other process. Thus, one physical location has two different names. If these virtual names are used to index into the cache, it is possible that the virtual names may map to different sets. Thus, the same physical memory location may end up being stored in two different locations in the cache. Each process will end up making updates to their own cached copies, without these updates ever being visible to the other process. It is therefore important that multiple virtual names for the same physical memory all map to the same location (set) in the cache. If we use an 8KB direct-mapped cache, the last 13 bits of the address are used to index into the cache and pick a byte out. Luckily, the last 13 bits of the virtual and physical address are the same, so we can begin indexing into the cache with just the virtual address. This is also true for a 16KB 2-way cache, or a 32KB 4-way cache, and so on. If we use a 16KB 1-way cache, the last 14 bits of the virtual address are used for indexing. The 14th bit will undergo translation through the TLB and may pose problems (i.e., the same physical address may map to different sets depending on whether that 14th bit of the virtual address is 0 or 1). To avoid these problems, the Operating System may have to implement page coloring -- i.e, pages have to be allocated such that the 14th bit of the virtual and physical page numbers are the same, i.e., odd physical pages are mapped to odd virtual page numbers, and likewise for even pages. We have ensured that two different virtual names for the same physical address are mapped to the same set. However, if we save the virtual address in the tag array, a process may not register a hit if the name that is currently in the cache is that of the other process. Not only does this lead to more misses, it can also lead to correctness issue in a multi-way cache. Two different virtual addresses for the same location may be saved in adjacent ways and each process registers a hit, but end up accessing different copies in the cache -- the same problem that we were trying to avoid. Hence, the tag array must save the physical page number so that every process registers a hit for the same location in cache. Slide 8, lecture 14 shows the cache access pipeline. The virtual address is enough to index into the data and tag array and pick out all the ways in that set. Before doing the tag comparison, though, we must do the virtual to physical page translation. This happens through the TLB, an access that is done in parallel with looking up the tag and data arrays. Finally, physical tag comparison is done and data is returned to the CPU. This is a virtually indexed, physically taged cache. A virtually tagged cache can lead to correctness problems (for a set-associative cache) or to performance problems (for a direct-mapped cache) as we may end up not recognizing different copies of the same physical location. A physically indexed, physically tagged cache will work, but will have poor performance because every cache access will begin with a sequential look-up of the TLB. A TLB with 128 entries and 8KB page size has a coverage of 1MB, i.e., about 1MB worth of memory can be mapped without incurring any TLB misses. For programs that have much larger working set sizes, it would be nice to selectively use a larger page size for those programs so that TLB hit rates are higher. Larger page sizes tend to waste memory (note that if each active process wastes a few pages worth of memory, the overhead is quite significant), so we'd like some dynamic mechanism that monitors program needs and creates larger pages (called superpages) if necessary. The TLB will now have to maintain a few more bits that keep track of the page size for each entry. Obviously, a superpage occupies contiguous locations in memory. Slide 10, lecture 14 shows an example of dynamic superpage creation. At the start, we use minimum page size and 16 contiguous virtual pages are mapped to physical pages that are scattered all over the place. If we realize that these 16 pages are frequently accessed together, it might make sense to combine the 16 pages into 1 superpage so they can all be mapped with a single TLB entry. For this to happen, we will have to copy the 16 physical pages into contiguous locations in memory. Thus, there is a significant cost associated with superpage creation. Let's say that we have a dynamic monitoring mechanism that realizes that had we created a superpage, N TLB misses could have been avoided. If the cost of a TLB miss is y cycles and the cost of creating a superpage is x cycles, we create a superpage as soon as N equals x/y. This ensures that the time associated with these pages will never exceed twice the optimal time (as determined by some oracle mechanism). This is very similar to the ski rental problem described on slide 16 (lecture 14). Of course, these heuristics are not required if we have already profiled the program and the compiler is able to pass on information to the architecture/OS that certain pages need to be created as superpages to begin with. Note that in addition to maintaining the virtual to physical translation, the TLB also maintains the process id as well as information on permissions. The TLB look-up guarantees that a process does not access data that it does not have permissions for. As a virtual memory case study, we'll examine the Alpha 21264. Each process divides up its virtual space in the manner shown on slide 14. The thing to note here is that page tables are stored in virtual memory. We'll soon see why. The minimum page size allowed is 8KB, although larger pages are also possible. Each page table entry takes up 8 bytes (including the physical address, permissions, etc.). Now consider the example where the process makes a request for a virtual address with virtual page number abc. It looks up the TLB, finds the translation abc --> xyz and the request is appropriately serviced (by the cache or the memory). If there is a TLB miss, we must look up the page tables to determine the translation. The page tables are stored in virtual memory, starting from some fixed location (W). If we are looking for the translation for the first virtual page, its translation can be found at virtual address W. If we are looking for the translation for the second virtual page, its translation can be found at virtual address W+8. And so on. Thus, we immediately know the exact virtual address that stores the PTE that we're looking for. Let's say that that virtual address is lmn = W + abc*8. Before we can access that PTE, we must know the virtual to physical translation for virtual address lmn. Let's say that we find the translation lmn --> pqr in the TLB. This allows us to pick out the PTE for abc-->xyz from the memory location pqr and we can now service the original request for abc. If the translation lmn-->pqr is not found in the TLB, we're now stuck. We could continue to repeat the process as before, but we're only bound to keep incurring TLB misses. At this point, we decide to take a different approach (shown on slide 16, lecture 14). For each process, we maintain a 3-level structure of page table entries. The first level has a single page of PTEs. The physical address for that page can be found in a special register (the page table base register). This single page in the first level can store 1024 PTEs (since each PTE takes up 8 bytes and the page size is 8KB). The first 10 bits of the original page number (abc) are used to pick out one of these 1024 PTEs. The value stored in the PTE is the physical address of a page in the second level. The second level has 1024 pages and a pointer to each of these pages is stored within the 1024 entries in the first level. Now that we have picked out a specific page in the second level, we use the next 10 bits of the virtual page number abc to pick out one of the PTE entries in this page. This PTE entry points to a page in the third level (again using physical addresses -- virtual addresses are not used at all during these look-ups). There are 1M pages in the third level and pointers to these 1M pages are stored in the 1K entries within each of the 1K pages in the 2nd level. Now that we've picked out a single page in the third level, we use the next 10 bits of virtual address abc to pick out a PTE from that page. This entry now stores the translation abc-->xyz and we can now proceed with the original request. We don't bother to maintain virtual address mappings for the first 2 levels. We maintain virtual address mappings for the 3rd level because we can then quickly determine the virtual address of the PTE and if we're lucky, we'll find the translation in the TLB. This avoids the 3 memory look-ups required if we decide to traverse the 3-level structure. A 3-level structure also allows us to have non-contiguous pages of PTEs, which is necessary for programs that dynamically allocate large portions of memory (simple linear indexing requires that pages be placed contiguously and we can have scattered pages if we instead use more levels of indirection). Based on this organization, we can now calculate that the virtual address has 43 bits, 13 for the offset and 30 to traverse the 3-level structure. The size of the physical page number is fixed at 32 bits, which means that the physical address is at most 45 bits. If we double the page size, the physical address size becomes 46 bits. The virtual address size grows much more. We can now accommodate twice as many PTEs in each page, so we now need 11 bits to index into each level of the PTE structure and an additional bit of offset --> virtual address size of 11+11+11+14=47 bits. Large Cache Design Most of the cache hierarchies that we have discussed so far are at most a handful of mega-bytes in size. In the future, cache hierarchies will likely be many mega-bytes in size. The design of these large caches opens up new and interesting problems, which are subsequently discussed. We are already seeing examples today of these large caches: the Intel Montecito has two cores, each with a private 12 MB L3 cache (such a large cache will require nearly 1.2 billion transistors: 24MB = 192 Mb, each SRAM bit needs 6 transistors). Intel has also announced an 80-core prototype chip: all the 80 cores (and their corresponding L1 caches) fit on a single die; the L2 cache will actually be implemented on a separate die that is bonded on top of the processor die to produce a 3D stacked chip. Assuming that 3D technology shortly becomes mainstream, we are possibly entering an era where a chip is made up of multiple dies, with some dies exclusively dedicated for large multi-mega-byte caches. We can also get a hint of things to come by examining recent evaluations from Intel: an example model in some of these papers is shown on slide 4, lecture 15: they employ a complex hierarchy, where L1 caches are private to each core, L2 caches are shared by two cores each, and a large 32 MB L3 cache is shared by all cores on chip. In most multi-core processors, each core has its own private L1 data and instruction caches. Each core may also have a private L2 cache, or all the cores may share a combined large L2 cache. Both of these options are shown on slide 5. A shared L2 allows the space to be dynamically allocated between the various L1s, i.e., if the L2 is 8MB in capacity, perhaps 4MB is occupied by one application, 2MB by another, and 1MB by two other cores. This allocation is done automatically by the cache based on its LRU policy, i.e., if a core doesn't make too many requests, its data is gradually pushed out of the L2. On the other hand, a private L2 organization would have 4 private L2 caches, each of size 2 MB...this means that the large application will suffer many misses while the small applications under-utilize their cache space. If a multi-threaded application is running and it has a large amount of data shared among the threads, then with a private L2 organization, each L2 cache ends up making a copy of the data, leading to replicated blocks and lower effective capacity. With a single shared L2, the L2 serves as a single repository for all this shared data. There's no replication of blocks and higher effective capacity. The primary advantage of a private L2 cache is that it is smaller and quicker to access. A large shared L2 will have a longer access time. The private L2 also has a dedicated bus between the L1 and L2. The bus used to access a shared L2 is used by all cores and may lead to additional contention delays. The access time for a cache is typically determined by the time taken to access the most distant corner of the cache (in other words, the worst-case delay). When implementing a large multi-megabyte cache, it is rather expensive to incur a worst-case delay for every L2 access. To avoid this, non-uniform cache architecture or NUCA has been proposed. In such an organization, the cache is partitioned into many banks and these banks are connected with an on-chip network (as shown on slide 8, lecture 15). If we get lucky and find data in a nearby bank, the access time is fairly short. With such a NUCA architecture, it is beneficial if we can organize data so that the probability of finding data nearby is high. A simple design for NUCA would simply map data to a given bank based on the address index bits. For example, if there are 128 banks, seven bits of the address (part of the index bits) are used to route the request to the correct bank. Once there, the bank look-up is just as with any cache (all the ways for a set can be found in that bank). However, such a policy does not help a great deal with reducing average access time. Typically, data will be found in each bank with equal probability. Such an organization is known as static-NUCA (S-NUCA). In an alternative and more complex design, the ways of a set can be scattered among the banks. If the L2 is 8-way set associative, some of the ways may be in a nearby bank, while the other ways may be scattered further away. When the CPU issues a request, it first looks up the nearby banks in the hope that those ways contain the requested data. If data is not found here, the request propagates to the ways/banks that are further away. This "search" for data is expensive, but it provides flexibility when placing data, which means we can attempt to organize data to improve the probability of finding data nearby. One possible policy: initially, bring data to the most distant way (replacing the least recently used data), as data gets accessed, it keeps migrating closer to the CPU (causing a swap of data on every migration). Such an architecture is known as dynamic-NUCA (D-NUCA). In a multi-core processor, such as the one shown on slide 10, the L2 cache is centrally located. If there is a block of data shared by multiple cores, the block essentially migrates to a bank that is at the "center of gravity" of the requests, i.e., if CPU-1 issues 10 requests for the block and CPU-4 issues 1 request, the block is placed in a bank that is 10 times as close to CPU-1 as it is to CPU-4. In the future, processors may also employ 3D integration, where multiple layers of silicon are stacked upon each other. This primarily helps reduce the distances that must be traveled on chip (remember that traversal over wires is expensive)(the distances traveled in the vertical dimension are very short). When using such a 3D organization, CPUs and cache banks will likely be interspersed (such as the checkerboard organizations on slide 13) to reduce average cache access latencies and overall temperature (since CPUs are hotter than caches and stacking CPU upon CPU can lead to high temperatures). Cache Innovations Hardware prefetching is commonly employed in most modern processors. Typically, the pattern of accesses is monitored and the hardware prefetch engine issues requests for what it believes will be the next set of addresses accessed by the CPU. Aggressively bringing in many blocks improves our chances of reducing misses (improves "coverage"), but reduces our "accuracy" (a smaller fraction of prefetched blocks are actually useful). If prefetched blocks are placed in the cache, they may evict other blocks that might have been useful. This is known as cache pollution and this phenomenon can be avoided if the prefetched blocks are placed in a separate small prefetch buffer. When accessing the cache, the prefetch buffer must also be looked up. If data is found in the prefetch buffer, it is brought into the main cache. It is important that prefetches must be timely: issuing the prefetch too late may result in the load having to wait for data arrival, issuing the prefetch too early may cause the prefetched block to be evicted before it is actually used or it evict another block that may still have had some utility. Stream buffers are the most popular form of prefetch engine. On a miss, a "stream" is allocated within the stream buffer and the next many sequential cache lines are brought into this stream. If the first word of the stream is subsequently accessed, the stream buffer brings in the next cache line. Thus, if the size of the "stream" is 8 cache lines, the stream buffer attempts to always stay 8 cache lines ahead of the current access. Sometimes, prefetch engines also keep track of previous addresses accessed by a load to detect possible consistent strides (for example, if the code is only touching every 16th word, we need not prefetch the remaining 15 words). Next, we'll see how application re-structuring can have a big impact on cache misses. Consider the matrix multiplication code shown on the left on slide 5, lecture 16. Assume that the cache line size is large enough to only accommodate one array element of either x, y, or z (to make it easier to count misses) (for a larger line size, misses typically can reduce by a constant factor). We'll also assume that N is very large (even a single row of an array cannot fit in the cache). For the baseline code, 2N^3 + N^2 misses are encountered. Each element of x is only accessed once, leading to N^2 misses for x. Each element of y is accessed N times and many other elements are accessed between successive accesses of that element. Hence, each access to y is a miss, leading to N^3 misses. The same argument applies to z. The newly re-structured code is shown on slide 7, lecture 16. The code has been "blocked by a factor B". B is chosen such that the cache is large enough to accommodate B^2 elements of z as well as have space left over for some more elements. Our objective here is to bring in a square portion of z and re-use it as many times as possible before throwing it out. If we are able to do this, each element of z will only incur one miss. In order to make this happen, we cannot compute the value of x[1,1] in a single multiply-and-accumulate (MAC) step. Each MAC step is now broken into smaller stages; within each stage we compute a partial sum, obtained by doing the MAC step on B elements of y and z. We begin by computing the partial-MACs for the first B columns of x (shown by the first three sets of figures on slide 7, lecture 16). We then move y by B columns and move the z block down by B rows (shown by the fourth set of figures). This allows us to compute the next set of partial-MACs for the first B columns of x. We must repeat this until we have moved y until the right edge and moved z until the bottom edge. Note that x is re-visited every time we move y to the right by B units. Since each row has N elements, y can be moved to the right N/B times. Thus, each element of x is re-visited N/B times (in other words, we're computing N/B partial-MACs for each element of x). Therefore, each element of x has N/B misses, leading to N^3/B total misses for the x array. After the first B columns of x have been computed, we then start computing the next B columns of x. We start scanning y from the start and the z-grid is now moved to the right and to the top (shown in the fifth set of figures). Thus, we are now finally re-visiting the elements of y. An element of y is re-visited every time we move x to the right by B units. Hence, each element of y is re-visited N/B times, leading to N^3/B total misses for the y array. Each element of z incurs only 1 miss -- once it is brought into the cache, it gets re-used repeatedly (cache hits) before it gets thrown out again. Thus, the total number of misses with the new code is 2N^3/B + N^2, about B times less than the original code.