Lecture Notes CS/EE 6810 Interconnection Networks and I/O (Lectures 23, 24, 25) This chapter focuses on interconnection networks required in multiprocessors. If we are trying to connect eight processors, we can either use a centralized or decentralized switch. The first example we consider is a centralized switch. All eight processors have output wires going in to the centralized switch and input wires coming in from the switch. Within the switch, there is logic to route the switch's eight input signals to any of its eight output ports. The crossbar incurs the most cost: it has an 8x8 array of small switches. When a switch is turned on, the signal in that row is propagated to the corresponding column. The maximum bandwidth of this system is eight, assuming that there are eight unique sources and eight unique destinations for the messages simultaneously in transit (this is the maximum possible since each processor can only handle one input message or one output message). The cost of the crossbar switch is also fairly high O(N^2). This is especially expensive since each link is actually composed of (say) 64 wires. This may lead to high latencies for the switch. The Omega network uses a different architecture within the centralized switch and provides lower performance at lower cost. It uses a number of small 2x2 switches as building blocks. These switches are connected with multiple levels and there is a unique path from every input port to every output port. The path taken by a message is a function of the address of the destination output port. If the message is destined to node 010, it goes through the top (0) port of the first switch it encounters, then through the bottom (1) port of the second switch it encounters, then through the top (0) port of the third switch it encounters. Hence, the paths are unique. Some message transfers that may have been possible with a crossbar are not possible now. For example, the crossbar would have allowed simultaneous transfers from P0 -> P5 and from P1 -> P6. But with the Omega network, both messages conflict for the same output port at the first switch they encounter, so only one of them proceeds in one cycle and the other waits until the next cycle. Hence, performance is lower than the crossbar. The cost is also lower. There are N/2 2x2 switches in each level and there are log(N) levels (we need one level for each destination address bit). The performance (and cost) of the Omega network can be increased by having more levels. If we have six levels (and mirror the 3 levels), we now have eight possible paths between any two processors: in essence, a message from P1 to P6 can go through any of the eight half-way points. This allows the switch to choose routes that minimize conflicts and improve performance. The tree network is another point on the cost-performance trade-off curve. It allows high performance for neighborly communication, but does poorly for random traffic as there is only one link between the two halves of the network and this quickly becomes the bottleneck. This leads us to defining bisection bandwidth, a metric to gauge the performance possible with a network topology for random traffic patterns. It is the minimum bandwidth encountered between two halves of the network. We next deal with distributed switches. Each node is connected to its own (distributed) switch. The switch is connected to other switches. It is responsible for either propagating requests it receives from other switches or from its own node. We first consider a ring topology, where every switch is connected to two neighbors. Usually, a bidirectional ring is constructed with two unidirectional rings. The switch in each unidirectional ring is a 2x2 switch since it has two inputs (left neighbor and local node) and two outputs (right neighbor and local node). A unidirectional ring on average requires N/2 hops for message delivery and has a bisection bandwidth of 2 (low performance), but also low cost (N links and 2x2 switches). A fully connected network is the other end of the spectrum. Every switch has a link to every other switch. The cost is high (each switch is NxN and O(N^2) links are required. Performance is high (single hop message deliveries and (N^2)/4 bisection bandwidth). Most commercial designs use topologies in between. A grid is a 2D array of processors. A torus is a grid, but with wrap-around edges. A hypercube is a system with 2^d processors, where d is the dimension. Each node has a binary label (with d digits). Two nodes have a link if their labels differ in a single digit. Thus, every node has links to d other nodes. A cube is an example of a hypercube of dimension 3. The figure on the slide shows a hyperucube of dimension 4. The table in the slide (for a 64-node network) shows that performance (in terms of bisection bandwidth) keeps improving from left to right. Similarly, hops on the network (not shown in the table) also keeps reducing as we go from left to right. The cost increases from left to right (in terms of total links and the ports per switch). To study the properties of networks, we will consider a general k-ary d-cube, which consists of a d-dimension array with k elements in each dimension. There are links between elements that differ in only one dimension by 1 (mod k). Many of the above topologies are special cases of the k-ary d-cube. For example, a 7 x 7 mesh is a 7-ary 2-cube. A 64 node hypercube is a 2-ary 6-cube. In general, a k-ary d-cube has N nodes = k^d. Slide 4 (lecture 24) lists some of the properties of such a network. Each node has a switch that is connected to 2d neighbors and each link has a width of w bits. A packet may have to travel along all d dimensions through an average of (k-1)/2 hops, resulting in the average routing distance of d(k-1)/2. The maximum routing distance (diameter) is d(k-1). Based on the above metrics, we pose the question: should we build high or low dimension networks? For fixed N, as d is increased, k decreases dramatically. This helps reduce the values for diameter and average routing distance. However, high dimensionality also leads to higher cost, as is seen by the equations for switch degree, number of links, and pins per node. (Switch degree is also referred to as radix.) If we keep some of these costs constant and try to increase d, the curves depict specific optimal points. For example, if we kept the number of pins constant, an increase in d would imply a decrease in w. Therefore, message latency decreases at first because of shorter routing distances, but then increases because the low w causes the message to be broken up into many smaller packets. The same curve is observed when we instead try to keep bisection bandwidth at a constant. The design of network topologies must take cost and latency considerations into account to determine the optimal values for k and d. There are many forms of routing on such networks. Deterministic routing follows a fixed path through the network based on the source and destination addresses. Adaptive routing can alter the route if there is heavy congestion or faults along the original route. The routing strategy has to be designed to be deadlock-free. A deadlock happens when there is a cycle of resource dependencies, i.e., a process holds on to a resource (A) and attempts to acquire another resource (B) (A is not relinquished until B is acquired) and another process holds on to B and attempts to acquire A. An example is shown on slide 8. Four different messages are travelling through the network. The yellow colors depict the input and output buffers that are held by message 3. Message 3 is trying to make a left turn and is trying to acquire the light-blue output port. Until that port is acquired, the yellow buffers are not relinquished. It can be seen that no message can make forward progress under this scenario. We can prove that dimension-order routing can have no deadlocks. In dimension-order routing, we first route along the x-dimension and then along the y-dimension. Slide 9 shows a specific numbering of edges in a 2-d mesh. Any path that follows dimension-order routing will travel on edges that are increasing in number. This proves that it is impossible to construct a cycle of dependencies. If the mesh was replaced by a torus, the above argument would not hold and it would be possible to construct a deadlock situation. Slide 11 shows one strategy for breaking deadlocks. Cycles can be either clockwise or anti-clockwise. Each of these cycles is composed of four turning messages. If we were to disallow some of these turns, the cycle can be avoided. For example, in dimension-order routing, we disallow turns from the vertical direction to the horizontal direction. The figures on slide 11 show 4 different ways in which two turns can be suppressed. The last example does not avoid deadlock. Even though we have prevented the West-ward turn in the first loop, the message can turn towards the East, then follow the path in the second loop until it eventually turns West and generates a cycle. The first example disallows any turns to the West. Therefore, a message must travel as far West as it needs to before it starts making any turns. Similarly, the second example prevents any turns from the Northward direction and the third example prevents any turns towards a negative direction. With these rules, we can avoid deadlocks even with an adaptive routing mechanism. Messages are typically partitioned into smaller packets. Each packet can follow an independent path to its destination. This prevents a single large message from consuming resources along a single path. Each packet requires headers so that the message can be re-constructed at the destination. In on-chip networks, messages are already very short: we are typically sending no more than a cache line at a time. Hence, in almost every case, a "message" is equivalent to a "packet". However, a packet must be broken into multiple flits depending on the link width. If the link between routers is only 64 wires wide, a 128-byte cache line (packet) will have to be partitioned into 16 flits of 64 bits each. Flits do not contain any header information. The packet's header is included in the first flit (referred to as the head flit) and there is a demarcation in the last tail flit to indicate that this is the last flit. Hence, flits much follow one behind the other so that the receiver can easily re-construct the packet by simply appending the flits. This also means that once a packet starts to flow across a link, another packet cannot cut in, else the flits of the two packets will be confused (more on such optimizations to follow). We have already looked at routing mechanisms (adaptive, dimension-order). Next, we'll look at flow control mechanisms (mechanisms to allocate resources to packets). The first and simplest form is bufferless flow control, where there are no buffers at intermediate routers. If two incoming packets (on different input ports), both want to go on the same output port, there is a conflict and one of the packets is simply dropped. A NACK is sent back to the sender and it must re-try sending the packet. This clearly leads to poor performance and buffering overheads at the sender and is usually not employed. The next flow control mechanism is circuit-switching. This also does not employ buffers for data packets in intermediate routers. Before the data transmission, a probe message travels from the sender to receiver, reserving all the links along the way. The probe then travels back and lets the sender know that the packet can now be transmitted with no conflicts at all. Since there can be no conflicts along the way, there is no need for buffers at intermediate routers. The tail flit finally frees up the links as it goes through. This mechanism does entail a high initial cost to set up the connection and this overhead can be better amortized if we are transmitting large packets. This may be a compelling mechanism in modern on-chip networks where buffers consume lots of power. They may be especially useful for bulk transfers at the end of a transaction or when prefetching many blocks at a time. Note that the probe message may have to deal with conflicts and will have to be buffered in such situations -- so some limited buffering will help even in cct-switching. The next few forms of flow control employ buffers at routers. First, we'll look at packet-buffer flow control, where resources (channels and buffers) are allocated on a packet granularity. Store-and-forward allocates a packet's worth of space at the next router before transferring the entire packet over. Only after receiving the entire packet does the head flit move on to the next channel. This is shown in the time-space diagram on slide 4 (lecture 25): the five flits go across channel 0 in the first five cycles and then across channel 1 in the next five cycles and so on. Cut-through allows the head flit (and other flits) to proceed to the next channel without waiting for the entire packet to arrive. This is clearly faster, but consumes more resources across the network. In store-and-forward, at a time, a packet only holds on to two packet-sized buffers (on either end of the current channel). In cut-through, a packet can hold on to many packet-sized buffers if the packet is spread across the network. Hence, the buffer utilization of cut-through is quite inefficient. Wormhole routing attempts to address this inefficiency. It is exactly like cut-through, but allocates and de-allocates buffers on a per-flit basis. Now, a flit may be stuck in a router because there is no free buffer space at the next router. This does not happen in cut-through: if the head flit is able to move to the next router, there is guaranteed buffer space for the rest of the packet and the remaining flits will go across in subsequent cycles. But with wormhole routing, a packet may be stalled with its flits spread across multiple routers. This holds up traffic in multiple routers as we can't allow another packet to cut in (else, the flits will be confused). To finally deal with the above problem, we introduce virtual-channel flow control. We'll act as if there are multiple channels between two routers and as if we can have multiple packets simultaneously in transit between the routers. Obviously, there is only one *physical channel*, but we'll act as if there are (say) two *virtual channels*. So two packets believe that they are simultaneously in transit (on virtual channels 0 and 1), but only one of them will actually transfer a flit across in a cycle on the single physical channel. Each flit will be appended with a 0 or 1 so that the receiver can make the distinction between the two packets. The receiver has some control state and some buffers (for each virtual channel). The control state keeps track of the routing decision that was made for the head flit based on the headers it contained (for example, the headers indicated that the packet was heading in the North-West direction, the router used its adaptive routing mechanism to decide that the packet would be sent on the North output port and this is stored so that the remaining flits can be blindly sent to the North output port as well). When a router (A) sends a tail flit on VC-0, it knows that VC-0 will soon be free, so it assigns VC-0 to a different packet. Note that the next router (B) may not have forwarded the previous packet on just yet (that is, its tail flit may still be sitting in the buffer and the control state still indicates "North"). When router-B receives the head flit for the next packet on VC-0, it simply buffers it behind the tail flit of the previous packet. The head will eventually be handled later and will change the output port to something else based on its headers. Virtual channel flow control allows multiple packets to share a physical link: if one packet is stalled, the other packet can make forward progress: this was not possible for any of the previously discussed flow control mechanisms. Before a head flit can be forwarded to the next router, the current router must make sure there is a free virtual channel, it must make sure there is at least one free buffer at the other router, and it must wait for the physical channel to be idle. The body and tail flits need not allocate a virtual channel: they can hop across as long as the physical channel is not in use and there is a free buffer at the other end. To determine if buffers are available, we need mechanisms to estimate buffer count at neighboring routers. Credit-based mechanisms send short messages to their previous routers every time a flit is forwarded and frees up buffer space. Alternatively, the router can inform its previous router if the buffers are close to being full. The latter entials less overhead but leads to a little more under-utilization of buffers. Another major advantage of virtual channels is that it can help in deadlock avoidance. Note that deadlocks can be avoided if packets always acquire resources in numerically increasing order. In our previous examples, we associated a number with each link (and the corresponding input buffer) and showed that dimension-order routing always sends messages on higher-numbered links. With virtual channels, we can now associate multiple (higher numbers) with each link. We can therefore attempt adaptive routing using virtual channels numbered less than 100 (see example on slide 9) and if its not possible to go in some direction with a higher-numbered VC (smaller than 100), we simply pick the VC numbered between 100-200. Again, we use adaptive routing and use a VC greater than 200 if necessary. Once we run out of VCs, we have to resort to dimension-order routing. Thus, having many VC names makes it possible to acquire resources in numerically increasing order even with adaptive routing (most of the time). We have seen that a router undertakes a number of non-trivial steps. It has to first examine the header flit and invoke the routing mechanism to determine the output port (the RC stage). It has to then compete for virtual channels with other packets that are also using the same output port (this is the VA stage -- virtual channel allocation). Once it acquires a VC, it must compete for the physical link (this is the switch allocation SA stage) and finally, the flit goes through the crossbar and travels on the physical link (the switch traversal ST stage). Thus, every router traversal can take four cycles and there are some proposals for speculative routers that can reduce this delay. Some recent data from Intel shows that an on-chip network can account for 20-36% of total chip power and a large fraction of this network power is dissipated within buffers and crossbar (and a non-trivial component within the links).