Lecture Notes CS/EE 6810 Chapter 4: Multiprocessors (Lectures 17, 18, 19, 20) Microprocessors can generally be classified into four forms, shown on slide 2, lecture 17. Most multiprocessors today are of the form MIMD -- each CPU operates instructions independently and on arbitrary sets of data, hence, multiple-instruction-multiple-data. Multiprocessors are primarily distinguished by the memory organization. The first organization is shown on slide 4, known as a symmetric multiprocessor (SMP) or uniform memory access (UMA) architecture. All CPUs are connected to a single set of memory chips. Every CPU has an identical view of memory and equal memory access times. If many CPUs share a single centralized memory, it can emerge as a bottleneck. Hence, for scalable designs, a distributed memory organization, such as the one shown on slide 6, is used. When a CPU has a miss in its cache, it hopefully will find data in its local memory. If data is not found there, memory associated with a remote node will have to be accessed. This is a non-uniform memory access (NUMA) architecture as memory latency is a function of the physical location of the memory. Programs can be written with either a shared-memory programming model or a message-passing programming model. In a message-passing model, each thread accesses its own (disjoint) set of physical memory locations and to communicate data between threads, explicit messages have to be sent. In a shared-memory model, each thread can access any physical memory location, so data can be communicated between threads if one thread writes to a specific location and the other thread reads from that location. Either programming model will work easily on an SMP (UMA) multiprocessor. On a distributed-memory multiprocessor, again, a message-passing model works fine. For a shared-memory program to work, there has to be a mechanism to allow a thread to write to remote memory locations, so that threads can exchange data by simply writing to and reading from the same memory location. One option is to use a hardware mechanism that recognizes that the physical memory address being written to is in a different node and forward the write on. The other option is to use a software layer -- the OS can recognize that the write needs to be sent to a different node and messages are exchanged between the operating systems to effect the write. The difference between these two implementations is the usual cost-performance trade-off. In essence, the message-passing programming model expects the programmer to implement cache coherence at the software level by explicitly exchanging data through send and receive messages. In a shared-memory programming model, the programmer expects that cache coherence will be implemented by the layer below (either hardware or software). In this course, we will focus on hardware cache coherence implementations, so we will assume that programs are being written in the shared-memory model. To better understand the differences between the two programming models, consider how the single-thread program on slide 8 is written with shared-memory (slide 9) and message-passing (slide 10). The single-thread program walks through a 2d array and re-computes the value of each element by averaging the values of neighboring elements. The program stops when values converge (difference between old and new values is less than a pre-specified threshold). In the shared-memory program, the array is created in physical memory space that is visible to every thread. A number of parallel threads are created, that all execute function Solve(). Each thread has a different pid and that is used by each thread to determine a subset of rows that it will operate on. Each thread will need the new values computed by a neighboring CPU for its border rows. These are automatically propagated by the underlying cache coherence system -- when one CPU writes to the border rows, the neighboring CPU receives the latest value on its next read. Each thread computes its own diff and all diffs are added together at the end of the iteration to determine if the computation has converged. Each thread obtains a lock before updating the global diff counter. Because of the underlying cache coherence mechanism, the latest value of the diff counter is visible to every thread. When a thread reaches a barrier, it has to wait there until all threads reach that barrier. Barriers are used between every pair of reads and writes to "diff" so that every thread gets to see the value of diff before it gets updated again. The barrier at the start of the while-loop makes sure that every thread is done resetting diff to zero before individual threads start incrementing diff after executing the for-loops. With the message-passing model, each thread creates its own local physical memory space that only it can access. Each thread also creates a local copy of the border rows -- when the neighboring CPU computes new values for these border rows, explicit messages are sent to keep the local copies up-to-date. At the start of each iteration, every thread sends its border rows to its neighboring CPUs and likewise, receives their border rows. After going through the averaging step, mydiff values are sent to the thread with pid 0. The thread with pid 0 computes the global diff value and sends it back to each thread so they know if the algorithm has converged or not. No "barriers" are required -- sends and receives are enough to make sure that all processes have advanced the appropriate amount. In this example, non-blocking sends and blocking receives are used -- a thread can execute a send and move on to the next instruction even if the corresponding receive has not been executed -- but a thread waits until a receive has completed before moving on to the next instruction. Note that the three versions of the program on slides 8, 9, and 10 can all yield different final results. For the single-thread model, for each averaging step, the newly computed values of the top and left neighbor are used, while the old values of the bottom and right neighbor are used. For the message-passing model, old values are used for all but the left neighbor. For the shared-memory model, it is hard to predict if old or new values are employed for the top and bottom neighbors (it depends on the relative speeds of the threads on neighboring CPUs). Cache Coherence A system is said to be cache coherent if it fulfils two conditions: (i) write propagation: a write by one process is eventually visible to other processes, (ii) write serialization: every process sees two writes to the same memory location in the same order. Cache coherence protocols are either based on snooping mechanisms (where every cache monitors the requests made by other caches and updates its own state) or directory-based mechanisms (where a centralized directory keeps track of how a memory block is being shared and all requests are sent to this directory). Protocols are also classified based on whether a write causes other cached copies to be invalidated or updated. The latter is more bandwidth- intensive, while the former can impose a longer latency on a read. On slide 9, lecture 18, the protocol actions for a snooping-based cache coherence protocol are shown. A single centralized memory is used and a bus connects all CPUs to this memory. When a request is put out on the bus, every cache monitors the request and takes the required steps. At the outset, memory location X is not in any of the caches. When CPU-A attempts to read X, a cache miss is encountered and a read request is put on the bus. Every cache checks to see if it has a local copy of X. Since the result in memory is up-to-date, memory responds with X and it is cached by CPU-A in "Shared" state. When B tries to read X, the read request is put on the bus, A snoops the bus, checks its tags and realizes that it has a valid copy of that block. In some protocols, memory always responds with the block if it has a valid copy (this is what we'll assume for the rest of this discussion). In other protocols, cache-to-cache sharing is encouraged because cache latencies are lower than memory latencies. This will require us to designate one of the sharers as the "owner" so that only that node responds with a copy. In this case, memory responds and B stores the block in its cache in "Shared" state. The same process repeats when C reads X. When A attempts to write X, it discovers a miss because A has the block in shared state, which only grants read permissions. Hence, a request is placed on the bus asking for exclusive access. B and C realize that they have a copy of X and mark that block as invalid. A marks the block as exclusive and proceeds with a write. There is no data block transfer as A already has a valid copy of the block (this is simply a permission upgrade request). On subsequent writes by X, no bus traffic is generated as A already has the block in exclusive state. When C does the write to X, it broadcasts its request. A realizes that it has the only valid copy of the block and it responds with this block. It also sets its own state to invalid. C accepts the block, marks it as exclusive and proceeds with the write. A writeback into memory does not happen at this time. When B attempts to read X, it broadcasts its request, receives a valid copy from C, C downgrades its state from E to S. At this point, a writeback to memory also happens as memory is responsible for supplying valid data if the block is cached by others in shared state. Finally, when A reads Y (which happens to map to the same location in the direct-mapped cache), X is evicted from A's cache. Note that when a request is placed on the bus, other caches carry out a tag comparison and will not be bothered by requests for Y if they happen to have X. If multiple processes issue writes at the same time, they arbitrate for the bus and one of them will end up accessing the bus first. Thus, the order in which writes appear on the bus determines the order in which all processes see different writes (they all see writes in the same order). Slide 10, Lecture 18 shows the actions taken for each event. The first 9 entries show actions taken by a cache when the processor has a read/write hit/miss for the specified block state. The last 4 entries show actions taken by a cache when it sees the corresponding request on the bus (placed by some other cache) and it has a valid copy in shared or exclusive state. For the above examples, we have assumed that one transaction happens entirely before the next transaction can begin. Assuming such atomic transactions slows performance, but simplifies the coherence protocol. If we want non-atomic transactions, we'll have to keep track of additional state for every block. For example, we could maintain a busy state for a block and if some other transaction concerning that block shows up on the bus, some form of negative acknowledgment can be sent informing the new requestor to wait. In addition to capacity, conflict, and compulsory misses, multiprocessor caches now suffer from a new miss -- the coherence miss. If another process writes to a cache block, other cached copies of that block get invalidated and that results in additional "coherence" misses for other processors. If the data words being accessed by two different processors are different, but happen to be in the same cache line, the miss is termed "false coherence". If the two processors access the same word, it is a "true coherence miss". A larger cache results in fewer capacity and conflict misses, but more coherence misses (in other words, misses that were capacity or conflict before now get converted into coherence misses). A larger processor count (with the same problem size) results in fewer capacity and conflict misses (since each processor is dealing with less data, hopefully). The number of compulsory misses might go up because some data will likely get accessed by more processors. Coherence misses will also likely go up since the probability of data being accessed by multiple processors also goes up. A large block size will increase the number of false misses, reduce the number of compulsory misses, and reduce the capacity and compulsory misses if cache size also increases. Directory-based coherence Obviously, bus-based systems have little scalability -- the bus is a centralized resource and will not work well if 100 processors regularly compete for the bus. Hence, larger scale multiprocessors employ directory-based protocols. Consider the distributed-memory system (slide 5, lecture 19). For every "block" (say, 64 bytes) in memory, some state is maintained in an adjoining directory. The directory itself is quite large and is usually implemented as DRAM -- hence, the memory and directory are usually looked up in parallel. The directory keeps track of how a block is being shared within the entire multiprocessor. Every cache miss is now sent to the directory through a network and the directory takes necessary actions. The memory (and the corresponding directory) is now distributed, i.e., some addresses (say 0-1 GB) now reside at node 0, addresses 1-2 GB reside at node 1, 2-3 GB reside at node 2, and so on. Based on the address that is being accessed, we know the exact location of a memory address's home node (where its memory copy and directory are stored). The cache miss request is accordingly sent to that node. Each cache can no longer be expected to update its own state as misses are not broadcast to everyone. If multiple processors are attempting a write at the same time, order is determined by the order in which those requests arrive at the directory. Consider the example on slide 7. Assume that physical memory location X is stored on the second node (B). When A tries to read X and has a cache miss, the request is sent to the second node. Memory and directory are looked up, and the block is returned to A. The directory keeps track of the fact that A has the block in shared (read-only) state. Similarly, for B and C. When A attempts to write X, it has a cache miss (since it does not have write permissions), and it sends the request to the directory. The directory responds with the permission and sends messages to B and C, letting them know that they must invalidate their blocks. A cannot proceed with the write unless it receives acknowledgments from B and C that they have invalidated their blocks. When we study consistency models, we will understand what happens if A proceeds without waiting for the acks. Note that the directory has to maintain a bit for every processor (for every block) to keep track of whether that processor has the block in shared state (some optimizations are possible here). When C attempts a write to X, the request is sent to the directory, the directory forwards the request to A as A has the latest copy and A is responsible for sending the latest copy of the block to C. When B attempts a read of X, the request is again forwarded to C and C responds with data. At this point, we can also write-back the data into memory. Slide 8 summarizes all the actions taken on each event. Locks are a basic primitive within every parallel program. If the underlying system is cache coherent, when a lock is released by a process, that is eventually seen by other processes. Let's examine the coherence behavior of different lock algorithms. In order to construct a lock, the hardware must provide a basic atomic read-modify-write operation. An example is the atomic exchange operation, that swaps the contents of register and memory without any other operation intervening. If the content of the register is initially one, this operation is known as "test and set". The code on slide 10 ensures that a process can enter the critical section (CS) only if it finds a zero (meaning lock is free) in the memory location. The process will keep spinning and attempting test and sets until it finds a zero in the memory location. The lock is released by writing a zero into the memory location. The lock variable can be placed in the cache. The loop in the previous example is highly inefficient because processes spin constantly on the test and set, keep attempting writes, resulting in high invalidate traffic. To reduce this traffic, a test&test&set was introduced. Each process keeps spinning on a test (a read that does not generate traffic) and a test and set is attempted only when the lock is observed to be free. Of course, when the lock is set free, all waiting processes suddenly attempt a test and set, resulting in high traffic. A load-linked store-conditional implementation provides flexibility -- it does not produce atomicity, but allows the process to detect if the operation was atomic or not. Thus, more computations can happen between the read and the write. A store conditional fails if some other process attempted a write to the location between the load-linked and store-conditional to that location. Every LL creates an entry in a table that stores the address and monitors bus traffic to see if some other process writes to that address. When an SC is executed, the table is checked to figure out if there was an intervening write to that location. To acquire the lock (slide 15), processes keep spinning on the LL. When the lock is set free, they all attempt the SC. Only the first SC succeeds. The other SCs fail because they realize some other process stored to the location since the last LL. Since they fail, they do not generate invalidate traffic (thereby reducing traffic, compared to the test&test&set implementation) (note that the textbook wrongly assumes that failed SCs also generate invalidate traffic). However, when each process loops back to the LL, they generate misses because they do not have the latest update to the memory location. If i processes are waiting for the lock to be released, and the lock holder releases the lock, the following transactions happen (on a bus-based system). The lock releaser gets the block in exclusive state (so it can write a zero) and every other cache invalidates its shared copy. When the i waiting processes execute their LL, they find the block in invalidate state and this causes a read miss. This results in i bus transactions (I'm counting the request and the response as 1 bus transaction). All processes see that the lock is free and move on to trying to acquire the lock with an SC. The first SC succeeds and results in a write transaction on the bus. Other processes see this write, invalidate their blocks, and update their LL-SC table, indicating that there was an intervening write. The other i-1 SCs will therefore fail and they die quietly without generating bus traffic. These i-1 processes will loop back to the LL instruction. Since the block is invalid in all of these i-1 caches, the LL results in a read miss. Thus, totally 2i+1 bus transactions are generated when the lock is released. To further reduce bandwidth needs, we introduce three more locks. A ticket lock maintains a now-serving variable. In order to acquire a lock, a process first joins the queue and figures out its wait number. It attempts a store conditional only when the now-serving variable matches its wait number. You'll realize that this implementation generates the same order of bus traffic as the LL-SC implementation. The reason is that every process is still waiting on a single variable that is repeatedly updated. An array-based lock solves the problem. When each process joins the queue, it is made to wait on a different memory location. When a 0 is written to that memory location, the process acquires the lock. While releasing the lock, the process writes a 0 into the next memory location, thereby passing the lock to the next process. Hence, every lock release results in bus traffic for only one other process. The overhead: more memory storage. A queueing lock works well with a directory-based protocol, where the directory controller keeps track of the order in which requests arrived and passes the lock to the next in line. The code on slide 7, lecture 20, implements the other basic synchronization primitive: the barrier. Barriers can be easily implemented with a single global counter to track the number of processes that have arrived and a shared variable that indicates when all processes have arrived. Improper use of the shared variable can lead to a deadlock situation. For example, on slide 7, when bar.flag is set to 1, all processes waiting on the while statement are expected to exit the while loop and move on. If the processes encounter another barrier, one of the processes resets bar.flag to 0. If one of the processes had not exited the while loop (let's say it had been switched out), it never gets to see bar.flag = 1. By the time the OS switches it back in, bar.flag has already been set to 0 and the process continues to loop -- deadlock! The problem can be easily dealt with by toggling the value of bar.flag in every instantiation (slide 8). This way, bar.flag is not set for a short window of time and no process will miss it. The value is toggled only when all processes leave the next barrier. Another problem with barriers is the potentially high traffic generated by the lock, unlock, and flag resetting operations. Consistency models Until now, we have examined coherence, which requires two conditions to be met: write propagation and write serialization (to a single memory location). The consistency model defines the ordering of writes and reads to different memory locations. A hardware is designed to exhibit a specific consistency model and the programmer must understand it and accordingly write correct programs. The consistency model that is easiest for the programmer to understand is sequential consistency (SC). A multiprocessor is said to be SC if the results of the execution are as if each process completed each of its memory operations atomically and in program order, and the operations of different processes are interleaved in some arbitrary fashion. Thus, there are two main constraints that need to be fulfilled -- program order and atomicity. Consider the parallel program examples on slide 10. If the programmer assumed the SC model while writing the programs, this is what he/she would expect of the programs. The first program would implement mutual exclusion (both processes cannot enter the CS at the same time). The second example would see the value 2000 when P2 reads Data. The third example would put the result 1 into the register. This will indeed happen if every instruction in every program completes entirely before we move on to the next instruction in every program (what we assumed when walking through all our coherence protocol examples). However, to improve performance, we often introduce optimizations. For example, we may use an out-of-order processor. For the first example, such a processor can execute the if-condition first before it writes the one since the if and the write do not have any RAW/WAR/WAW dependence (the two instructions refer to different locations). If this happens, both processes could end up in the critical section at the same time. A similar result would happen even if we used an in-order processor, but assumed a write buffer. The write is placed in the write buffer and the processor moves on to the next instruction even though the rest of the world has not seen the write. So let's not use an ooo processor or a write buffer. In the second example, P1 moves on to the second instruction only after it has sent out the write to Data on the network. But sending out the request is not the same as completion. If the network does re-ordering, P2 may see the write to Head before it sees the write to Data. This will lead to incorrect results. Hence, P1 cannot move on to the next instruction unless it receives acks from other caches that they have seen the write. Now let's move to the third example. P1 is waiting for P3 to send it an ack, but let's say that the write hasn't yet reached P3 because of some congestion. In the meantime, P2 sees the write, sends the ack and moves on. When it writes to B, P3 ends up seeing the write to B before it sees the write to A. Hence, we must make sure that a process does not proceed with data unless the last write to that data have been seen by everyone else. This would require a three-phase write -- after receiving acks from everyone, the writer has to now send another message to sharers letting them know that it is ok to actually read the data. There are efficient ways to do this...in an invalidate protocol, P2 would have to request the latest copy from P1 anyway and P1 will wait until it sees all acks before responding -- this avoid the three-phase write. The bottomline is this: SC requires program order, write serialization, and everyone has to see an update before that value can be read. This makes programming very intuitive, but the hardware very slow...imagine, no out-of-order execution, no write buffers, waiting for acks before doing the write, etc. To work around this problem, relaxed consistency models have been designed -- they make programming a little harder, but greatly boost performance. Fence instructions are introduced and the hardware has to only maintain SC between fence instructions. Between fence instructions, all hell can break loose. A conservative programmer would insert a fence after every instruction. This would guarantee SC. A smart programmer would realize that there are certain parts of the program where optimizations will not influence correctness and would remove fences from that part of the program, speeding up its execution.