1. Introduction
With the shrinking of transistor sizes more intellectual property (IP) cores are being integrated onto a single chip to implement more complicated system functions[1]. As the number of IP cores continue to scale in system-on-chip (SoC) the architecture of bus-based communication and crossbar interconnections is often the performance bottleneck due to the growing of bandwidth requirement and non-predictable wire delay[2]. Solution to address the communication demands of future multi-core system,NoC has become an emerging solution due to its considerable advantages such as reusability,scalability,and parallelism in communication infrastructure[3].
The routing policy is among the most important considerations in NoC design[4]. It has an important impact on some performance criteria such as the average latency and system throughput in the network[5]. A typical wormhole or virtual channel router for NoC is commonly comprised of input/output ports,buffers,routing logic and a crossbar switch connecting input ports to output ports. The use of buffers in the router can improve the bandwidth efficiency and prevent the packet being dropped. However,buffers in the router lead to a high hardware implementation overhead and increase design complexity. As mentioned in Reference [6],buffers in the router are responsible for 46% of router power consumption and 30% of router area. With buffers elimination,bufferless routing emerges as a potential solution for cost-efficient in NoC design[7].
The basic principle of bufferless routing for NoC is that all packets arriving at a router must immediately be forwarded to an adjacent router. As there are no buffers,output port contention in the router is mostly resolved by the deflection packet,which is called bufferless deflection routing[8]. For bufferless deflection routing,deflections occur more frequently at a high network load because more packets may contend the same output port. Each deflection sends a flit further from its destination,which causes extra cost for latency and power consumption. In addition,the bufferless deflection routing with a high deflection rate may severely degrade the performance due to the processing of a packet involved in livelock[9]. Thus,to reduce deflection is a key for improving the performance of bufferless deflection routing.
Two schemes are mostly discussed to address this issue for the bufferless deflection router. One is to reduce the deflections of packets by adding a few buffers,such as the approach mentioned in Reference [10]. However,this strategy weakens the primary advantage of the bufferless router in cost-efficiency. The other is to reduce deflections by using a priority strategy or control mechanism. Based on this idea,a location based priority and a distributed source-throttling congestion control mechanism were proposed in References [11, 12] respectively to reduce deflection and relieve congestion for a bufferless deflection router. The experiment results have shown that throughput can be improved by 12% compared with the baseline bufferless deflection router as mentioned in Reference [12]. To analyze the causes of deflections,three deflection models were constructed and a low-deflection bufferless router (LDBR) was proposed in Reference [13] which adopted a multi-channel network interface and control mechanism to address contention. However,the routers above yield performance improvement by a complex mechanism.
In this paper,we propose a load balancing bufferless deflection router called LBBDR for NoC. To explain the working mechanism,two typical bufferless deflection routers are introduced. The common deficiency of design complexity for those routers is analyzed,which disagrees with the primary advantages of the bufferless router. Based on the current research results,the proposed LBBDR employs the strategy of load balancing for flit initial routing combined with an improved priority policy to simplify the design of the deflection router. To evaluate the effectiveness of LBBDR,the performance of deflection routing and hardware overhead are assessed by a simulation platform using System C and Design Complier of\,Synopsys with the TSMC 28 nm HPM technology. The results illustrate its superiority under low-to-medium network load compared with the reported schemes.
The remainder of this paper is organized as follows. Two typical bufferless deflection routers are described to explain its working mechanism and the deficiency of current research results in Section 2. The architecture of the proposed LBBDR and its algorithm are presented in Section 3. A comparison of output port contention between the proposed LBBDR and the reported bufferless deflection routers are given to illustrate the superiority of the former in Section 4. The simulation results are presented in Section 5. The conclusions of this paper are drawn in Section 6.
2. Typical bufferless deflection router
The bufferless deflection routing is highly concerned as an alternative to the traditional buffered router in NoC for saving hardware cost. As shown in Figure 1,the baseline bufferless deflection router (BBDR) architecture is based on 2D mesh in Reference [14]. There are commonly five pairs of input/output ports in a router and each input port has only one register. It has a three-stage pipeline. The incoming packets from input ports are written into the registers and routing computation is applied in parallel to decide the output port by the specific algorithm. When two or more packets contend the same one output port,the priority scheme is used to arbitrate which packet occupies the desired port and then only one of the packets can be sent to this port. The others are deflected to any of the available outputs ports. The crossbar switch is turned on that connects the input port to the required output port Obviously,the priority scheme and port allocation mechanism of BBDR play an important role for its performance and lead to a high complexity as well. In order to simplify the design of the bufferless router,an improved bufferless deflection router called CHIPPER was presented in Reference [15],which used a novel permutation network instead of a traditional crossbar switch to accomplish output ports allocation. Meanwhile,an efficient flow control approach with packet reassembling was designed to improve performance and to achieve deadlock freedom. It is a novel implementation method for cheap bufferless deflection routing. However,as mentioned in Reference [13],deflections cause great performance loss in the bufferless network. Thus,to reduce contention and deflection is a major issue for the bufferless deflection router.
To reduce deflections of bufferless deflection routing,a LDBR in Reference [13] was presented using multi-channel network interface and a deflection routing based on the turn model. The architecture of LDBR is given in Figure 2. Its pipeline consists of an input stage and output stage and each pipeline stage takes one cycle to execute. The H-V transmitting-enable counter is the generator of tick_h and tick_v signals for direct connection and cross connection.
In the input stage,a multi-channel ejector including four switches is used for selecting the ejecting flits. Each switch works in two modes for direct and cross connection. The direct connection mode is set when an input packet should be sent to the output stage. The cross connection mode is set when an input packet arrives at its destination router and should be ejected,or a local packet needs to be injected. In the output stage,the permutation unit composed of two switches is used for output ports allocation. The selecting mechanism of the input stage and the routing method of the output stage can effectively reduce congestion. However,the processing logic of the input stage for LDBR leads to a high complexity. Moreover,XY routing for LDBR is the lack of a load balancing mechanism which may degrade the routing performance criteria.
3. The proposed bufferless deflection router
As shown in Figure 3,the proposed LBBDR architecture differs slightly from that of BBDR and LDBR for NoC. The flit injection from the local input port is allowed when one input register is unoccupied and one incoming channel is free at least. If the input flit from the routing port reaches its destination router,it can be ejected from the local output port by a multi-channel ejector. Otherwise,the proposed load balancing deflection routing algorithm is used to decide the output port. It is completed by a group demultiplexer and multiplexer
Some necessary definitions are given in order to explain the proposed load balancing deflection routing algorithm.
Definition 1. The (Xs,Ys),(Xc,Yc) and (Xd,Yd) stand for the coordinates of the flit routing for source,current and destination nodes in network respectively.
Definition 2. The R stands for the toggle identifier of load balancing deflection routing.
The function of the proposed load balancing deflection routing algorithm is to decide the routing output ports and arbitrate the output ports contention for input flits. It employs a balance toggle identifier in the source router to control the initial routing X or Y direction for flit. Each flit is routed according to XY or YX routing in the network afterwards. Once output ports contention arises,a nearer-first priority policy which means the flit is nearest to the destination router in the network with the highest priority. The detail of the load balancing deflection routing algorithm for port allocation and priority scheme is given in Algorithm 1.
Algorithm 1. Load balancing deflection routing algorithm
1: input: input flit f from L,E,W,S,N
2: output: output port allocation of L,E,W,S,N
3: if \,((Xs== Xc) and (Ys== Yc)) then // flit injection
4: if \,(R==0) then
5: if \,(Xc>Xd) then
6: E←f;
7: if\,(Xc<Xd) then
8: W←f;
9: R=1;
10: else
11: if \,(Yc<Yd) then
12: N←f;
13: if \,(Yc< Yd) then
14: S←f;
15: R=0;
16: end if
17: if \,((Xc== Xd) and (Yc== Yd)) then // flit ejection
18: L←f;
19: end if
20: if \,(Xc < Xd) then //flit routing in first Demux of W
21: E←f;
22: else if \,((Xc== Xd) and (Yc > Yd)) then
23: S←f;
24: else if \,((Xc== Xd) and (Yc < Yd)) then
25: N←f;
26: if \,(Xc > Xd) then //flit routing in first Demux of E
27: W←f;
28: else if \,((Xc== Xd) and (Yc > Yd)) then
29: S←f;
30: else if \,((Xc== Xd) and (Yc < Yd)) then
31: N←f;
32: if \,(Yc > Yd) then //flit routing in first Demux of N
33: S←f;
34: else if \,((Yc== Yd) and (Xc > Xd)) then
35: W←f;
36: else if \,((Yc== Yd) and (Xc < Xd)) then
37: E←f;
38: if \,(Yc < Yd) then //flit routing in first Demux of S
39: N←f;
40: else if \,((Yc== Yd) and (Xc > Xd)) then
41: W←f;
42: else if \,((Yc== Yd) and (Xc < Xd)) then
43: E←f;
44: end if
45: if\,(the number of f\,> 1) then // in Mux of E,W,S,N
46: f with minimum \textbar Xc -- Xd\textbar{} + \textbar Yc -- Yd\textbar\,get desired port
47: other flits are deflected to free output ports
48: else
49: E,W,S or N←f; //each output port only one flit
50: end if
As shown in Algorithm 1,the proposed load balancing deflection routing algorithm mainly consists of initial routing direction decision in the source router and the nearer-first priority policy for addressing output ports contention. It is assumed that the east and west build the expression of the X direction,and the south and north build the expression of the Y direction in the network. Firstly,the incoming flits from the local input port and routing input ports are written into input registers. For the flit injected from the local input port,its initial routing direction of X or Y in the network is decided by the identifier R,which is an extended scheme of our work[16]. If the value of R is equal to 0,the flit is routed from the X direction within the region of the source-destination pair. Otherwise,it will be routed from the Y direction. The R is toggled when the initial routing direction for each flit injected from the local input port is decided. Using this mechanism,the ports contention can be reduced compared with the traditional routing approach with a single initial direction,such as XY routing. For the incoming flits from routing ports,they are routed according to XY or YX routing along the initial routing direction X or Y in the network. Thus,the demultiplexer and multiplexer of LBBDR are responsible for flits routing at a specific direction and addressing the output ports contention respectively. The nearer-first policy means that the flit is nearer to the destination router with the higher priority. This flit is forwarded to the desired port and is routed to the destination router as early as possible with the advantage of reducing load in the network. Once the network load is reduced,the deflection of flits will be decreased as well. It helps to improve bufferless deflection routing performance.
4. Discussion and analysis
In order to demonstrate the superiority of LBBDR the comparisons between the typical XY routing algorithm and the proposed load balancing deflection routing are given in Figure 4 and Figure 5. We briefly discuss the difference of ports contention and how to address the ports contention for three routers in 2D mesh topology.
Figure 4 shows the routing paths of handling output ports contention for three routing schemes. It is assumed that three flits of f1,f2 and f3 are routed from source node (0,2),(1,3) and (1,2) to destination node (1,1),(1,0) and (2,2) in the network respectively. The output port contention arises in node (1,2) according to XY routing. As shown in Figure 4(b),the f1 can be routed from another path to avoid the port contending in node (1,2) due to the turn of the initial routing direction in LBBDR. Thus,the output port contention can be reduced in LBBDR.
Figure 5 shows the routing paths for two schemes of oldest-first priority policy and nearer-first priority policy to address output ports contention. Continuing with the discussion of the port contention in node (1,2),it is assumed that the flit f1 is deflected to node (1,3) according to the Oldest-First priority scheme. In node (1,3),the desired output port of the south is obtained by flit f1 due to the higher priority and is routed to destination node (1,1) via node (1,2) afterwards. While flit f2 in source node (1,3) is deflected to node (2,3) due to the lower priority compared with flit f1. Then it is routed to the destination node through the necessary five hops routing in the network. Thus,the total number of routing hops for flit f1 and f2 to the destination node is nine in the network.
For LBBDR,the flit f1 has a higher priority compared with the flit f2 according to the nearer-first priority policy for addressing ports contention in node (1,2). Therefore,the desired output port is obtained by flit f1 and is routed to destination node (1,1) directly. While flit f2 is deflected to node (0,2) and is routed to destination node (1,0) through the necessary four hops routing in the network. Thus,the total number of routing hops for flit f1 and f2 to the destination node is seven. The fewer necessary routing hops of addressing the output ports contention means a lower load in the network which is beneficial to reduce deflection.
Based on the analysis above the proposed load balancing deflection routing algorithm can significantly reduce output ports contention and network load in addressing output ports contention. It has a positive influence on performance metrics of bufferless deflection routing.
5. Experimental results
A simulator of NoC based cycle is developed using System C to evaluate the performance criteria of different routing schemes. For comparison,the BBDR in Reference [14] and LDBR in Reference [13] are implemented. Our discussion is also concentrated on 2D mesh due to its facility in simplicity of routing strategy and the network scalability. For all the routers,the data width is set to 64 bits. Each packet length is set to 5 flits. The simulation clock is selected as 500 MHz.
The widely used synthetic traffic profiles such as random traffic pattern and transpose traffic pattern are considered as a benchmark to evaluate the routing performance criteria. To demonstrate the feasibility of the proposed LBBDR,we study and compare the routing performance criteria of the flit deflection rate,average packet latency and throughput for bufferless deflection routing. Moreover,the hardware overhead of the layout area and power consumption is assessed with TSMC 28nm HPM technology.
In the random traffic pattern,each node sends data packets to another at random in the network. In the transpose traffic pattern,each node sends data packets to another node with the address of the reversed dimension index,such as from source node (i,j) to destination node (j,i) in the network. The flit deflection rate is an important performance for bufferless deflection routing which means the proportion of deflections in the total number of forwarding flits. In Figure 6,the flit deflection rate is plotted for three routing schemes as a function of the injection rate. As observed from the results,the flit deflection rate of LBBDR is reduced by 13% and 15% compared with others under two traffic patterns. The reduction of the deflection rate for bufferless deflection routing means the reducing of network load. It has a positive impact on the performance of the average packet latency.
The performance of average packet latency is illustrated in Figure 7. As observed from the results,the average packet latency of LBBDR is reduced by 10% and 11% compared with others under two traffic patterns. Note that the average packet latency of LBBDR is reduced up to 20% with the increasing of injection rate. This result illustrates that the LBBDR achieves the better performance,especially under the higher injection rate.
In this work,the throughput is used as the performance metric of bufferless deflection routing as well,which is defined as the number of flits received by all destination nodes at specific cycles under the injection rate. The specific cycles means the expected time for routing flit with the longest distance between source node and destination node according to the pipeline of the routing scheme in the network. Thus,if there is no deflection,the throughput will be equal to the injection rate. In Figure 8,throughput is plotted for three routing schemes as a function of the injection rate. As observed from the results,the throughput of LBBDR is improved by 8% and 6% compared with others under two traffic patterns.
In order to evaluate the hardware implementation overhead,the RTL (register transfer level) code of the network mentioned is realized with Verilog HDL. The parameters of routers and network such as data-width and network size have been mentioned in the previous section. The TSMC 28 nm HPM CMOS technology is used at the operating frequency of 500MHz and with a voltage supply of 0.9 V to assess the area cost and power consumption of different routers using Design Compiler 2011SP5 of Synopsys. Each network is synthesized to obtain netlist when the timing meets the requirement. The timing critical path for three routers is given out in Table 1. It can be seen that the BBDR has the shortest critical path of register to register due to there being more stages in the pipeline. The timing critical path of LBBDR is less 21% compared with LDBR for its simplicity of architecture.
![]() |
The same random seed for each routing scheme is provided to get the waveform for the precise power consumption. The power dissipation is calculated using PT-PX 2013 of Synopsys. The layout area and power consumption of three routing schemes with a different sized network are shown in Figure 9. As observed from the results,the area cost and power consumption of the network composed of LBBDR are lower,by 12% and 7% compared with others respectively.
In summary,the experimental results demonstrate that the mechanism of controlling the initial routing direction and the priority policy of nearer-first provide a guarantee to improve the performance of LBBDR. Moreover,the proposed LBBDR is cost-effective for NoC due to the simplicity of its architecture.
6. Conclusion
This paper presents a load balancing bufferless deflection router for NoC. In the proposed router,a balance toggle identifier is used in source router to control the flit initial routing direction for reducing the output ports contention. To reduce deflection further,a nearer-first priority scheme is proposed. The experimental results show that the proposed LBBDR can improve the performance metrics of bufferless deflection routing. The hardware overheads of layout area and power consumption are reduced due to the simplicity of its architecture as well.