# An Improved High Fan-in Domino Circuit for High Performance Microprocessors\*

Feng Chaochao<sup>†</sup>, Chen Xun, Yi Xiaofei, and Zhang Minxuan

(National Laboratory for Paralleling and Distributed Processing, School of Computer, National University of Defense Technology, Changsha 410073, China)

Abstract: An improved high fan-in domino circuit is proposed. The nMOS pull-down network of the circuit is divided into several blocks to reduce the capacitance of the dynamic node, and each block needs only a small keeper transistor to maintain the noise margin. Because the footer transistor is omitted, the circuit outperforms the standard domino circuit. A 64-input OR-gate implemented with this structure is simulated using HSPICE under typical conditions in $0.13\,\mu\text{m}$ CMOS technology. The average delay of the circuit is 63.9 ps, the average power dissipation is $32.4\,\mu\text{W}$, and the area is $115\,\mu\text{m}^2$. Compared to compound domino logic, the proposed circuit reduces delay and power dissipation by 55% and 38%, respectively.

Key words: high fan-in; domino logic; high performance; keeper transistor

EEACC: 1265A; 2570D

### 1 Introduction

High fan-in AND/OR gates are widely used in the critical paths of high performance microprocessors, for example in cache tag comparison circuits and zero detection circuits<sup>[1]</sup>. As bit-widths in microprocessors increase, high fan-in AND/OR gates can limit the achievable speed, so it is important to optimize these circuits to meet high performance requirements.

Several circuit structures can implement a high fan-in AND/OR gate. In a static CMOS circuit, the nMOS or pMOS network always contains series transistors. Because of the large source and drain capacitance of series transistors, a static CMOS gate usually contains fewer than 4 series transistors, so a multi-level tree structure is needed to implement a high fan-in AND/OR gate in static CMOS.
The disadvantage of this kind of circuit is that, as the number of inputs increases, the implementation requires more levels, which increases delay and area. A pseudo-nMOS circuit is suitable for implementing a high fan-in OR/NOR gate. It reduces the number of transistors and also improves performance because it eliminates the series pMOS transistors. However, it has high power dissipation because of its high bias current.

Dynamic circuits, which are faster than static CMOS circuits, are widely used to implement high fan-in circuits<sup>[2]</sup>. A dynamic circuit has low latency because, unlike a static CMOS circuit, it does not require a pMOS transistor stack. Figure 1(a) shows an *n*-input dynamic domino OR-gate with a footer transistor, and Figure 1(b) shows a footless dynamic domino OR-gate. To enhance the noise margin of the gate, a keeper transistor is added at the output to compensate for the charge lost through the pull-down leakage path. As the number of inputs increases, the capacitance of the dynamic node also increases, which slows the circuit and degrades its reliability. A wide keeper transistor maintains a large noise margin in the dynamic circuit, but it also increases contention between the keeper transistor and the nMOS pull-down network, which increases power dissipation and reduces speed.

Fig. 1 Traditional domino logic for *n*-input OR-gate

Fig. 2 Domino structure in Ref. [6]

<sup>\*</sup> Project supported by the National High-Tech Research and Development Program of China (No. 2005AA110020)

<sup>†</sup> Corresponding author. Email: fengchaochao@nudt.edu.cn

High-speed domino<sup>[3]</sup>, skew-tolerant high-speed domino<sup>[4]</sup>, the conditional keeper<sup>[5]</sup>, and the four-phase, non-full swing keeper<sup>[2]</sup> have been proposed to trade off power dissipation, performance, and noise margin in high fan-in domino gates.
Unlike these more complicated keeper techniques, we divide the high fan-in domino logic into several blocks, which simplifies the keeper transistor design while maintaining a reasonable noise margin.

In this paper, we propose an improved high fan-in domino circuit. In $0.13\,\mu\text{m}$ CMOS technology, the delay of the circuit is 63.9 ps and the area is $115\,\mu\text{m}^2$; the circuit is faster and smaller than static CMOS circuits and traditional domino logic.

### 2 An improved high fan-in domino circuit

Reference [6] proposed an improved domino logic, shown in Fig. 2. In contrast to traditional domino logic, this circuit uses an nMOS transistor, MN1, to precharge the dynamic node A. The footer transistor is absorbed into the clock tree, and discharge is faster because one series transistor is removed from the pull-down path.

Based on this circuit structure, we propose an improved high fan-in domino circuit. For high fan-in domino logic with more than 16 inputs, we divide the nMOS pull-down network into several blocks, which effectively reduces the capacitance of each dynamic node. Figure 3 shows the proposed circuit structure. For a 64-input OR-gate, the nMOS pull-down network is divided into 4 blocks. In each block, as in the structure in Fig. 2, the CLK input is connected directly to the source of the nMOS pull-down network, which is equivalent to omitting the footer transistor of standard domino logic. The 4 dynamic nodes are connected to the gates of 4 parallel pMOS transistors, respectively, and the output is connected to the gates of the 4 pMOS keeper transistors.

The circuit works as follows. In the precharge phase (CLK = 1), dynamic nodes D0, D1, D2, and D3 are precharged to $V_{\rm DD}-V_{\rm Tn}$ by nMOS transistors MN1, MN2, MN3, and MN4, because an nMOS transistor passes only a "weak one". At the same time, nMOS transistor MN0 is turned on and pulls the output down to 0, which turns on the 4 pMOS keeper transistors and pulls the 4 dynamic nodes up to $V_{\rm DD}$.
In the evaluation phase (CLK = 0), nMOS transistor MN0 is turned off. If at least one input in any of the 4 blocks is 1, the dynamic node of that block (D0, D1, D2, or D3) is discharged to 0, which turns on the corresponding one of the 4 parallel pull-up transistors (MP1, MP2, MP3, and MP4) and pulls the output up to 1.

This circuit has three advantages.

First, the nMOS pull-down network is divided into several small blocks, which efficiently reduces the capacitance of each dynamic node and simplifies the keeper transistor design. A large keeper transistor is often used to enhance the noise margin of a wide dynamic OR gate, but it slows the circuit. In the proposed circuit, because the dynamic node of each block has a small capacitance, a small keeper transistor is enough to maintain the noise margin with very little performance degradation. In addition, the pMOS keeper transistor compensates for the drawback of precharging with an nMOS transistor, which by itself reaches only $V_{\rm DD}-V_{\rm Tn}$.

Fig. 3 Improved high fan-in domino circuit

Fig. 4 Pulse generator

Second, because the footer transistor is omitted, the circuit has a smaller delay than traditional domino logic with a footer transistor. In traditional domino logic, the footer transistor turns off the nMOS pull-down network during the precharge stage; footless domino logic lacks this protection, so it usually cannot be used as the first level of domino logic. The proposed circuit, however, can be used as the first level because the CLK input is connected to the source of the pull-down network and an nMOS transistor is used to precharge. During the precharge phase (CLK = 1), even if one of the inputs switches from 0 to 1, the dynamic node cannot be pulled down. Moreover, the circuit eliminates the inverter between two standard domino stages, which further improves performance.

Third, the circuit is better than compound domino logic<sup>[7]</sup>, which uses a complex static CMOS gate to combine the outputs of multiple dynamic gates.
Because the static CMOS gate in compound domino logic usually has a fan-in of no more than 4, the number of blocks is also no more than 4, so each block contains more transistors as the number of inputs increases. In contrast, the second level of the proposed circuit uses only one nMOS transistor instead of the series nMOS transistors of a complex static CMOS gate, so the circuit has a smaller high-to-low delay and can contain more blocks than compound domino logic, with fewer transistors in each block. In addition, the circuit avoids the back-gate coupling effect<sup>[7]</sup>, which arises when a dynamic gate drives a multiple-input static CMOS gate, as in compound domino logic.

In a dynamic circuit, half of the clock cycle is used for precharge, and during that time the logic in the gate cannot be used. To hide the precharge time, we design a circuit that generates a pulse from the global clock to precharge the gate. The circuit and the precharge waveform are shown in Fig. 4. The width of the pulse depends on the number of inverters in the circuit; the number of inverters must be odd, and the pulse must be wide enough to precharge the dynamic node to $V_{\rm DD}$. The circuit generates a high-level pulse. To generate a low-level pulse, the AND gate in the circuit is simply replaced by a NAND gate.

Fig. 5 Layout of the improved 64-input domino OR-gate

With the pulse generator, the precharge of the circuit can proceed in parallel with other operations, which effectively hides the precharge time. For example, if a register is followed by the high fan-in circuit, the width of the pulse can be set to the delay of the register, so that the precharge of the high fan-in circuit coincides with the data propagation of the register. When the register data is ready, the precharge is complete and the high fan-in circuit can evaluate immediately, which improves performance significantly.
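The precharge and evaluation behaviour described in this section can be sketched at the logic level as follows. This is only an idealized illustration, not the authors' simulation setup: it ignores all analog effects (the $V_{\rm DD}-V_{\rm Tn}$ precharge level, leakage, keeper contention, and timing). The function name `domino_or64` and the split into 4 equal blocks of 16 inputs are assumptions made for the example.

```python
from typing import Dict, List, Sequence, Union

BLOCKS = 4        # the text divides the 64-input pull-down network into 4 blocks
BLOCK_WIDTH = 16  # assumption: equal blocks, 64 / 4 inputs each


def domino_or64(clk: int, inputs: Sequence[int]) -> Dict[str, Union[List[int], int]]:
    """Idealized logic-level model of one clock phase (hypothetical helper)."""
    if len(inputs) != BLOCKS * BLOCK_WIDTH:
        raise ValueError("expected 64 inputs")

    if clk == 1:
        # Precharge: MN1..MN4 charge D0..D3, MN0 holds the output at 0, and the
        # keepers (gated by the low output) restore the nodes to a full logic 1.
        # CLK also drives the sources of the pull-down networks, so the inputs
        # cannot discharge the dynamic nodes in this phase.
        dynamic = [1] * BLOCKS
        out = 0
    else:
        # Evaluation: a block's dynamic node discharges if any of its inputs is 1.
        dynamic = []
        for b in range(BLOCKS):
            block = inputs[b * BLOCK_WIDTH:(b + 1) * BLOCK_WIDTH]
            dynamic.append(0 if any(block) else 1)
        # A discharged node turns on its pMOS pull-up (MP1..MP4), raising the output.
        out = 1 if any(d == 0 for d in dynamic) else 0

    return {"dynamic": dynamic, "out": out}


if __name__ == "__main__":
    one_hot = [0] * 64
    one_hot[37] = 1  # a single active input, located in the third block
    print(domino_or64(1, one_hot))  # precharge phase
    print(domino_or64(0, one_hot))  # evaluation phase
```

In this model the output depends on the inputs only while CLK = 0, which mirrors the property that lets the proposed gate serve as the first level of domino logic.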
### 3 Implementation and results

We implement the improved high fan-in domino circuit in $0.13\,\mu\text{m}$ CMOS technology. The layout of the 64-input OR-gate is shown in Fig. 5. The 64 nMOS transistors of the pull-down network are placed on the left of the layout, while the precharge transistors, foot transistors, and keeper transistors are on the right, which is beneficial for placement and routing. In addition, the layout uses only two metal layers.

To evaluate the performance of the improved circuit, we also implement a 64-input static CMOS OR-gate, a 64-input standard domino OR-gate, and a 64-input compound domino OR-gate in the same $0.13\,\mu\text{m}$ CMOS technology. The 64-input static CMOS OR-gate is built from the $0.13\,\mu\text{m}$ standard cell library and has three logic levels: the first level contains 16 4-input NOR gates, followed by 4 4-input AND gates, and the last level is a 4-input NAND gate. The 64-input standard domino OR-gate has two levels: the first level contains 4 16-input domino OR-gates with the footer transistor, and the second level is a 4-input footless domino OR-gate. The 64-input compound domino OR-gate is implemented with 4 16-input dynamic NOR gates with the footer transistor, followed by a 4-input static CMOS NAND gate.

Table 1 Simulation results of four structures

| Structure        | Delay/ps | Power/µW | PDP/fJ | Number of transistors |
|------------------|----------|----------|--------|-----------------------|
| Static CMOS      | 256.9    | 24.4     | 6.3    | 176                   |
| Standard domino  | 162.3    | 62.3     | 10.1   | 92                    |
| Compound domino  | 141.3    | 52.6     | 7.4    | 84                    |
| Proposed circuit | 63.9     | 32.4     | 2.1    | 77                    |

We simulate the layouts of the four structures using HSPICE under the typical conditions ($V_{\rm DD}$ = 1.2 V, temperature = 25°C) of $0.13\,\mu\text{m}$ CMOS technology.
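As a quick arithmetic check on Table 1, the sketch below recomputes the power-delay product (PDP, the product of average power and average delay) for each row and the relative savings of the proposed circuit over compound domino logic. The delay and power values are copied from Table 1; the helper names `pdp_fj` and `reduction_pct` are illustrative and not part of the original work.

```python
# Average delay (ps) and average power (uW) copied from Table 1.
TABLE1 = {
    "Static CMOS": (256.9, 24.4),
    "Standard domino": (162.3, 62.3),
    "Compound domino": (141.3, 52.6),
    "Proposed circuit": (63.9, 32.4),
}


def pdp_fj(delay_ps: float, power_uw: float) -> float:
    """Power-delay product in fJ: 1 ps * 1 uW = 1e-18 J = 1e-3 fJ."""
    return delay_ps * power_uw * 1e-3


def reduction_pct(reference: float, value: float) -> float:
    """Relative reduction of `value` with respect to `reference`, in percent."""
    return 100.0 * (reference - value) / reference


for name, (delay_ps, power_uw) in TABLE1.items():
    print(f"{name}: PDP = {pdp_fj(delay_ps, power_uw):.1f} fJ")

ref_delay, ref_power = TABLE1["Compound domino"]
new_delay, new_power = TABLE1["Proposed circuit"]
print(f"Delay reduction vs compound domino: {reduction_pct(ref_delay, new_delay):.0f}%")
print(f"Power reduction vs compound domino: {reduction_pct(ref_power, new_power):.0f}%")
```

Rounded to one decimal place, the recomputed products match the PDP column of Table 1, and the two reductions round to the 55% and 38% quoted in the abstract.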
The simulation results are shown in Table 1, which lists the average delay, average power dissipation, power-delay product (PDP), and number of transistors. Compared to the static CMOS circuit, the dynamic domino circuit is better in speed and area, although its power dissipation is higher (for example, 62.3 µW for the standard domino circuit versus 24.4 µW for static CMOS). The compound domino circuit has a smaller delay than the standard domino circuit because it omits the inverter of the first level. Compared to the compound domino circuit, the proposed circuit reduces delay and power by 55% and 38%, respectively, and it has the smallest number of transistors. The simulation waveform of the proposed circuit is shown in Fig. 6. The low-to-high transition delay is 103.7 ps and the high-to-low transition delay is 24.1 ps, so the average delay of the circuit is 63.9 ps.

Fig. 6 Simulation waveform of the layout

### 4 Summary and conclusion

High fan-in circuits, which are used in the control logic of high performance microprocessors, are often implemented with domino logic. In this paper, we propose an improved high fan-in domino circuit that divides the nMOS pull-down network into several blocks, so that no block requires a large keeper transistor. The delay of the circuit is smaller than that of traditional domino logic because the footer transistor is omitted, and its power dissipation is lower than that of the other domino circuits. The circuit also uses fewer transistors than the static CMOS circuit and the other domino circuits. We apply the circuit to the floating-point unit of a 64-bit high performance microprocessor and achieve better results.

#### References

- [1] Clark L T, Taylor G F. High fan-in circuit design. IEEE J Solid-State Circuits, 1996, 31(1):91
- [2] Yang G, Wang Z, Kang S M. Low power and high performance circuit techniques for high fan-in dynamic gates. The 5th International Symposium on Quality Electronic Design, 2004:421
- [3] Anis M H, Allam M W, Elmasry M I.
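Energy-efficient noise-tolerant dynamic styles for scaled-down CMOS and MTCMOS technologies. IEEE Trans Very Large Scale Integration Systems, 2002, 10(2):71
- [4] Jung S O, Yoo S M, Kim K W, et al. Skew-tolerant high-speed (STHS) domino logic. The 2001 IEEE International Symposium on Circuits and Systems, 2001:154
- [5] Alvandpour A, Krishnamurthy R, Soumyanath K, et al. A conditional keeper technique for sub-0.13µm wide dynamic gates. The 2001 Symposium on VLSI Circuits, 2001:29
- [6] Jia S, Liu F, Ji L. Improved domino logic for high speed design. IEE Electron Lett, 2003, 39(8):644
- [7] Rabaey J M, Chandrakasan A, Nikolic B. Digital integrated circuits: a design perspective. 2nd ed. New Jersey: Prentice Hall, 2003

## Design and Implementation of an Improved High Fan-in Domino Circuit for High Performance Microprocessors\*

Feng Chaochao<sup>†</sup>, Chen Xun, Yi Xiaofei, and Zhang Minxuan

(National Laboratory for Paralleling and Distributed Processing, School of Computer, National University of Defense Technology, Changsha 410073, China)

Abstract: An improved high fan-in domino circuit structure is designed and implemented. The nMOS pull-down network of the circuit is divided into several blocks, which effectively reduces the capacitance of the dynamic node, and each block needs only a small keeper transistor. Because the footer transistor of standard domino logic is omitted, the performance of the circuit is effectively improved. A 64-input OR-gate implemented with this structure is simulated in a $0.13\,\mu\text{m}$ process; the delay is 63.9 ps, the power dissipation is $32.4\,\mu\text{W}$, and the area is $115\,\mu\text{m}^2$. Compared with compound domino logic, delay and power dissipation are reduced by 55% and 38%, respectively.

Key words: high fan-in; domino logic; high performance; keeper transistor

**EEACC:** 1265A; 2570D

CLC number: TN402 Document code: A Article ID: 0253-4177(2008)09-1740-05

<sup>\*</sup> Project supported by the National High-Tech Research and Development Program of China (No. 2005AA110020)

<sup>†</sup> Corresponding author. Email: fengchaochao@nudt.edu.cn

Received 16 January 2008, revised manuscript received 2 June 2008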