Lecture Notes CS/EE 6810 Transactional Memory (Lectures 21, 22) Most large lock-based programs are difficult to design. The programmer must be careful to not introduce deadlocks or unnecessary stalls. The former problem is caused when there is a cycle of dependences, such as in the code below: Thread 1 Thread 2 ... ... lock(L1) lock(L2) ... ... lock(L2) lock(L1) ... ... unlock(L2) unlock(L1) unlock(L1) unlock(L2) A deadlock is caused when each thread acquires its first lock and then forever waits for the other thread to release its lock. Unnecessary stalls are caused when the code within the critical section does not usually pose conflicts, but may occasionally pose conflicts. Because of the possibility for a conflict, the programmer must encapsulate the code within a lock, and this forces every execution of that code to wait for a lock acquire (essentially serializing all the executions). The concept of transactional memory attempts to eliminate these and other problems. When any code is encapsulated within a critical section, the hardware does not wait to acquire any lock. It dives straight into the code and starts executing it. This means that each thread may be executing its critical section in parallel. But we don't allow a thread to make its changes permanent until it gets to the end of its transaction. The hardware also monitors and checks for conflicts. The hardware makes sure that the execution was as if each transaction executed atomically and in isolation (i.e., one transaction executes at a time and completely). Thus, the final behavior was as if a thread acquired a lock, executed its code, and then released its lock (with no interference from other threads because presumably they were all waiting for the lock to be released). In reality, of course, all the threads are executing their critical sections in parallel and it is possible that one thread modifies data that another thread is dealing with. If this happens, the execution is not as if each transaction executed atomically and in isolation: when the hardware detects such a conflict, it forces at least one of the threads to abort and re-start its execution from scratch (hoping that the next time it can execute without any conflicts). Consider the following banking example with locks and with transactions: Thread 1 Thread 2 lock(L1) lock(L1) rd balance rd balance increment increment wr balance wr balance unlock(L1) unlock(L1) Thread 1 Thread 2 Transact_begin Transact_begin rd balance rd balance increment increment wr balance wr balance Transact_end Transact_end We want the same behavior as the lock-based program, i.e., if Alice and Bob are both trying to update their bank balance, the two operations need to be completely serialized (say, Alice should finish her operation, then Bob should begin his operation). With the transactional code, both threads will optimistically start executing their code, expecting no conflicts. If both threads end up accessing different data (different bank accounts), they will both execute successfully (no conflicts) and finish. This gives high performance because threads can execute in parallel if there are no conflicts. If there is a conflict (Alice and Bob accessing the same account), the hardware must detect it, abort one of the transactions, let the other one complete, and re-execute the second one. So the hardware will make sure that the final outputs are as if each transaction executes sequentially. Since a thread never holds on to a resource, there can be no cycle of dependences in a transactional program and hence no deadlocks. It is possible for a programmer to get the same performance as a transactional program if they have a different lock for each memory location (fine-grain locks) and take care to acquire the correct lock for each location: this prevents one thread from waiting for another thread if they are dealing with different memory locations. But it is easy to see that this quickly becomes very complex for a programmer (if the critical section accesses many locations, many locks must be acquired and this increases the chances of a deadlock). Hence, the primary goal of transactions is to simplify programming: the programmer can encompass any code that they believe accesses shared data within a transaction and not worry about performance loss or deadlocks. One other advantage of transactions is fault resiliency. If a thread acquires a lock and then crashes, the lock is never released and other threads can't make forward progress. Since a transaction does not hold on to any resource, a crashed node does not impede progress for other threads. One could view transactions as a glorified form of load-linked-store-conditional. Recall that LL-SC succeeds only if no other thread changes the value of the loaded value between the LL and SC. Transactions are very similar: its like an LL and SC that handles multiple locations instead of the single location handled by LL-SC. There are many ways to implement transactional memory in hardware. Most implementations can be classified based on their policies for data versioning and conflict detection. We'll consider two examples. The first example uses lazy versioning and lazy conflict detection. For this discussion, we'll assume a small-scale multiprocessor with a bus-based snooping cache coherence protocol, and for now, we will allow only one transaction to commit at a time (note that multiple transactions can execute in parallel, but only one can proceed with the task of committing at a time). In this implementation, when a transaction issues a read, the block is fetched in read-only mode and placed in the cache (using the snooping-based cache coherence protocol). The cache maintains an additional bit per cache line to indicate that this cache line was read by the currently executing transaction. When a transaction issues a write, the transaction fetches the block in read-only mode and places it in cache. In a lazy conflict detection system, we don't announce the write just yet -- that is done at the end of the transaction. Hence, the need to fetch the block in read-only mode. Each cache line also maintains a write-bit to indicate that this line was written by the current transaction. Note that writing a result into the cache line is not a problem: since the old value is still in memory, it is not lost and there is no correctness issue even if the transaction is aborted. If another transaction attempts to read the same location, it will pick up the value in memory (not realizing someone has written to it): the conflict will be detected later since this is a lazy system. When the transaction finishes its execution, it proceeds to the commit stage. It contacts a central arbiter and acquires permission to commit (an arbiter is required because we are allowing only one transaction to commit at a time). When the transaction gets this permission, it is guaranteed to not be aborted and it proceeds with making its writes permanent. Over the next many cycles, it broadcasts its writes on the bus. The snooping-based cache coherence protocol will cause other copies of this line to be invalidated. If an invalidated line has its read-bit set, that transaction must abort (it has read a line whose contents have been modified before the transaction finished). After broadcasting its writes, the transaction is complete: it clears its read and write bits in cache and moves on to its next transaction. In summary, this lazy approach allows aborts to be quick, but commits are slow. The implementation has no deadlocks/livelocks -- the first transaction to contact the central arbiter will commit successfully. Starvation can happen: a short transaction will commit sooner than a long transaction and it is possible that the short transaction always writes a result that has been read by the long transaction: this will cause the long transaction to keep aborting (if the short transaction is executed within a loop). The second implementation we consider employs eager versioning and eager conflict detection. If a transaction does a write, we directly update memory. The transaction may abort and we can't lose the old value, hence, we make a copy into some location in virtual memory (known as the log). Since the write has been propagated, the directory (in a directory-based cache coherence protocol) will forward subsequent read requests for that location to this transaction. Likewise, when the transaction does its write, the invalidation messages are forwarded to other readers. Hence, conflicts for data are detected eagerly and we need not wait for the end of the transaction. When a conflict is detected, the requester is forced to wait and re-try after a while. For example, if a transaction does a read and the directory forwards the request to a transaction that has not finished and has written to this cache line in this transaction, the writing transaction sends a NACK to the requester. The requester attempts after a while and if the writing transaction has already committed, it responds with the required data. Thus, this "requester stalls" policy avoids aborts. In another example, let's say a transaction does a write. The directory sends invalidations to other sharers of the cache line. If any of the sharers has its read-bit set (i.e., it is in the middle of a transaction that read this line), it sends a NACK to the writing transaction and it waits and re-tries later. Hopefully, on the second re-try, the reading transaction has finished and cleared its read-bit and this allows the writing transaction to proceed. A deadlock is possible with this implementation. See the code below: Thread 1 Thread 2 Transact_begin Transact_begin rd A rd B ... ... wr B wr A ... ... Transact_end Transact_end Each transaction will be nacked by the other when it attempts the write, leading to a deadlock. This requires some hardware/software contention manager that detects such situations and takes action (aborts one of the transactions and lets the other proceed). In summary, commits are very fast in this implementation, while aborts are slow (since values have to be copied back into memory from the log). But note that aborts are very uncommon (only when a deadlock situation arises).