2004年6月 # A Novel Divider Based on Dual-Bit Algorithm\* # Li Xia, Sun Hui and Zhang Qianling (State Key Laboratory of ASIC & System, Fudan University, Shanghai 200433, China) **Abstract:** A novel divider based on dual bit algorithm and its VLSI implementation are presented. Compared with the divider of MIPS microprocessor, it decreases the average executing cycles by 52.5% while its maximum delay is almost the same and its transistor count increases by 60%. Furthermore, the simulation result indicates that the power consumption decreases to 11.3% with the same processing ability. Key words: dual-bit algorithm; division; VLSI implementation; power consumption EEACC: 2570D; 1265B # 1 Introduction The divider is one of the most important modules in microprocessor<sup>[1]</sup>. Originally, there was no hardware divider in the microprocessor and all the divisions were done by software. The efficiency of software division is very low, which can not meet the requirement of larger and larger calculation. Hence the hardware divider appears<sup>[1,2]</sup>. For hardware divider, usually there are four algorithms, i. e. non-restoring algorithm, restoring algorithm, SRT algorithm, and Newton Raphson algorithm [1,3]. The hardware implementation for Newton Raphson algorithm needs a lookup table, which is done by a ROM, and is unsuitable for microprocessor design. The restoring algorithm and SRT algorithm are relatively complicated, so the non-restoring algorithm is widely used in microprocessor for its simple principle and easy to be implemented by hardware, which is adopted in MIPS microprocessor [4,5]. But the non-restoring division is not efficient though and the execution cycles may be up to 32 for one single division in MIPS microprocessor<sup>[4]</sup>. A novel division named dual-bit algorithm and its implementation are presented. Compared with the divider of MIPS microprocessor, it decreases the average executing cycles by 52.5% while its maximum delay is almost the same and its transistor count increases a little. The theoretical analysis indicates that the power consumption can be decreased dramatically, and the simulation result shows the power consumption decreases to 11.3% with the same operation ability. For the transistor count, it is not very important issue with the decrease of the device characteristic dimension, the modification is very attractive. The dual-bit divider is very suitable for super-performance, low-power application. <sup>\*</sup> Project supported by National High Technology Research and Development Program of China (No. 2002AA1Z1060) Li Xia male, was born in 1978, PhD candidate. His research interests include low-power IC design. Email: xiali@ fudan. edu. cn Sun Hui female, was born in 1978, master candidate. Her research interests include high performance memory design. 646 半 导 体 学 报 25 卷 ### 2 Division in MIPS The division operations in MIPS are implemented using the simple non-restoring division algorithm<sup>[4]</sup>. The algorithm can be described in the flow chart as shown in Fig. 1<sup>[4,6,7]</sup>. Fig. 1 Flow chart for division in MIPS In the flow chart, the variable "i" stands for the current iteration cycle, and the variable "dividend", "divisor", and "quo" represent the dividend, divisor, and quotient of the division respectively. The variable N represents the cycle count that is necessary for completing the division. In normal case, the N represents the active bits of the dividend (the active bits which are recorded as " $C_{\rm ab}$ " mean the left bits when all the most significant zeroes are removed, e. g., for the binary data "00100101", the active bits is 6). While in MIPS microprocessor, if the dividend is zero extended on the upper most 8, 16, or 24bits, the N will be 24, 16, or 8; else the N will be 32. The N for division in MIPS can be described as expression (1). $$N_{\text{mips}} = \begin{cases} 32, & C_{ab} > 24 \\ 24, & 16 < C_{ab} \le 24 \\ 16, & 8 < C_{ab} \le 16 \end{cases}$$ $$8, & C_{ab} \le 8$$ $$(1)$$ $C_{\rm ab}$ represents the active bits of the dividend. The corresponding hardware architecture is shown in Fig. 2<sup>[7]</sup>. The counter gives the control signals for shifting the data in the "tmp" and "quo" registers. In order to decrease the hardware, the dividend is latched to the "quo" register at the first cycle, and the quotient is moved into the "quo" bit by bit when the following steps are going on. When the division is completed, the quotient is in the "quo" and the remainder in the "tmp". Fig. 2 Architecture of divider in MIPS # 3 Division based on dual bit algorithm According to the Fig. 1, only one single bit of the dividend is moved in for processing each cycle. Therefore, the calculating will take 32 cycles if the active bits of dividend are 32. It is obvious that the non-restoring algorithm is not very efficient. To improve this situation, a novel division algorithm named dual-bit algorithm is presented. The difference between the non-restoring algorithm and dual-bit algorithm is that two bits are moved in each cycle, so the efficiency increases dramatically. In Fig. 3, two bits are moved into the variable "tmp" each cycle, so two quotient bits should be generated each cycle. As the two quotient bits may be 00, 01, 10, or 11, three subtract operations have to be done, which are one divisor subtraction from the variable "tmp", double divisor from "tmp", and thrice divisor from "tmp". The corresponding results are the variable "tmp1", "tmp2", and "tmp3". The two quotient bits can be got according to the "tmp1", "tmp2", and "tmp3", then the "tmp1", "tmp2", or "tmp3" should be moved into the "tmp" which will act as the new subtraction operand for the next cycle together with the two bits from the dividend. Since two bits of the dividend are moved in each cycle, it is named as dual-bit algorithm. Fig. 3 Flow chart for division based on dual bit algorithm The variable N is the necessary iteration cycle count. For the case that two bits of the dividend is moved in each cycle, the N can be calculated by $[(C_{ab}+1)/2]$ . In Fig. 3, the "divisor $\times$ 2" can be got by shifting left one bit for the divisor, so there is almost no more operation. While the "divisor $\times$ 3" should be done by an add operation, therefore one more cycle is needed to get it. So the necessary cycle for dual-bit algorithm, $N_{dual-bit}$ , can be expressed as formula (2). $$N_{\text{dual-bit}} = [(C_{\text{ab}} + 1)/2] + 1$$ (2) It is considered that the dividend with different active bits has the same probability, and the average executing cycles of division in MIPS and dual-bit division can be calculated according to formulae (1) and (2), which are listed in Table 1. From Table 1, it is obvious that the du- al-bit division decreases the execution cycles by 52.5% compared with the division in MIPS. Table 1 Executing cycles of division in MIPS and dual-bit division | $C_{ m ab}$ | MIPS | Dual-bit | |----------------------|------|----------| | 1 | 8 | 2 | | 2 | 8 | 2 | | 3 | 8 | 3 | | 4 | 8 | 3 | | 5 | 8 | 4 | | 6 | 8 | 4 | | 7 | 8 | 5 | | 8 | 8 | 5 | | 9 | 16 | 6 | | 10 | 16 | 6 | | 11 | 16 | 7 | | 12 | 16 | 7 | | 13 | 16 | 8 | | 14 | 16 | 8 | | 15 | 16 | 9 | | 16 | 16 | 9 | | 17 | 24 | 10 | | 18 | 24 | 10 | | 19 | 24 | 11 | | 20 | 24 | 11 | | 21 | 24 | 12 | | 22 | 24 | 12 | | 23 | 24 | 13 | | 24 | 24 | 13 | | 25 | 32 | 14 | | 26 | 32 | 14 | | 27 | 32 | 15 | | 28 | 32 | 15 | | 29 | 32 | 16 | | 30 | 32 | 16 | | 31 | 32 | 17 | | 32 | 32 | 17 | | verage execution cy- | 20 | 9.5 | # 4 VLSI implementation and power analysis The implementation architecture of dual-bit division is shown in Fig. 4. The counter gives the control signals and the "quo" and "tmp" registers keep the quotient and temporary result. The "divisor3" register stores the thrice-divisor got at the first cycle, so it will not increase the maximum delay. The three adders named as "adder1", "adder2", and "adder3" process the three subtract operations. The least significant 30bits of the multiplexer output and the highest 2 bits of the "quo" register is stored to the "tmp" register each cycle. Fig. 4 Implementation architecture for dual-bit divider The operation of the three adders is parallel, so the delay is the same as that of one single adder. The data to be written to the "tmp" and the "quo" registers is a simple combinational logic output of the result of the adders, which has little timing delay. Therefore the critical path delay of the dual bit divider should be almost the same as the divider in MIPS microprocessor. For the case that average execution cycle decreases by 52.5% according to the analysis in section 3, the frequency of the divider can be decreased to 47.5% with the same processing ability. As we know, the critical path delay of dual-bit divider is almost the same as the divider in MIPS, so the frequency decreasing to 47.5% means that the maximum delay can increase to 2.1 times of its original maximum delay. According to formula (3)<sup>[8]</sup>, the propagation delay is almost inverse proportion to the supply voltage, so the voltage can also decrease to 47.5% of the original voltage. $$t_{\rm p} \approx \frac{C_{\rm L}}{2V_{\rm DD}} (\frac{1}{k_{\rm p}} + \frac{1}{k_{\rm n}}) \tag{3}$$ $$P = \alpha c v^2 f \tag{4}$$ In formula (4)<sup>[9]</sup>, P is the average power consumption, $\alpha$ is the average switching probability of the nodes, c is the sum of all capacitance, v is the supply voltage and f is the operating frequency. Supposing $\alpha_{\text{dual-bit}}$ is $k_1\alpha$ , and $c_{\text{dual-bit}}$ is $k_2c$ . For $v_{\text{dual-bit}}$ is 0.475v and $v_{\text{dual-bit}}$ is 0.475v, the average power of dual-bit divider, $v_{\text{dual-bit}}$ , can be expressed as (5). $$P_{\text{dual-bit}} = \alpha_{\text{dual-bit}} c_{\text{dual-bit}} v_{\text{dual-bit}}^2 f_{\text{dual-bit}}$$ $$= 0. 107 k_1 k_2 \alpha c v^2 f$$ (5) The $k_1k_2$ depends on the modification rate of the av- erage switching probability and the sum of all capacitance. For the case that capacitance and average switching probability does not change much, the $k_1k_2$ should be 1 or so, as a result, the average power, $P_{\rm dual-bit}$ , must decrease dramatically according to formula(5). # 5 Simulation results and analysis #### 5.1 Comparison of speed and area The divider in MIPS microprocessor and the dual-bit divider are implemented in 0.18 µm CMOS process. They are described in Verilog HDL and synthesized by Synopsys Design Compiler. The two dividers are synthesized under the same condition, that is, "TYPICAL" operating condition, "reference \_ area \_ 20000" wire-load, 1.0ns clock constraint, and '0' max-area. To get a more accurate result, the gate-level netlist is transformed to transistor-level by Cadence ICFB tool, then the transistor-level netlist is simulated by Avanti Hspice. The simulation results are listed in Table 2. From Table 2, it is obvious that the average execution cycle decreases by 52.5%, and the maximum delay is almost the same. Only the transistor count increases by 60% or so. Table 2 Performance comparison between divider in MIPS and dual bit divider | Parameter | Divider<br>in MIPS | Dual-bit<br>divider | Percent* | |-------------------------|--------------------|---------------------|----------| | Average execution cycle | 20 | 9.5 | - 52.5% | | Maximum delay/ns | 2. 80 | 2. 88 | + 2.8% | | Transistor count | 15572 | 24870 | + 59.7% | <sup>\*</sup> Percent means the modification rate for dualbit divider compared with divider in MIPS. ## 5. 2 Power simulation result To get a universal result, we choose the following test vector, the divisor is 0xffffffff and the dividend is expressed as (6). Dividend = $$0x 00000001 \ll i$$ , $i = 0, 1, 2, 3, ..., 31$ In formula (6), when i varies, the dividend is got by logic shifting left i-bit of 0x00000001. So the test set contains 32 test vectors, and the active bits of the dividend varies from 1bit to 32bit. The divider in MIPS is running at 2.8ns clock cycle with the supply voltage 1.8V. Considering the characteristic of the technology library<sup>[10]</sup> and the calculation ability of the divider, the dual-bit divider is working at 5.8ns clock with the supply voltage 0.9V. Table 3 shows the Hspice simulation result for the two dividers as completing all the test vectors. Table 3 Power simulation result for divider in MIPS and dual-bit divider as running all test vectors | Parameter | Divider<br>in MIPS | Dual-bit<br>divider | Percent | |---------------------|--------------------|---------------------|---------| | Clock cycle/ ns | 2. 8 | 5.8 | / | | Supply voltage/V | 1.8 | 0.9 | / | | Completing time/ ns | 1983. 8 | 1957. 5 | / | | Average power/mW | 4. 52 | 0. 51 | - 88.7% | The simulation indicates that the average power consumption of dual-bit divider decreases to 11.3% of the divider in MIPS with the same calculation ability, which is consistent with the theoretical analysis. # 6 Conclusion A novel divider based on dual-bit algorithm is presented. The divider based on this algorithm decreases the average executing cycle by 52.5% compared with the divider in MIPS microprocessor. The dual-bit divider is also implemented in 0.18µm CMOS process. The implementation result shows that the maximum delay is almost the same and the transistor count increases by 60%. The sim- ulation indicates that with the same operation ability, the power consumption decreases to 11.3% compared to the divider in MIPS. Consequently, the dual-bit divider is very suitable for super-performance, low-power application. #### References - [ 1 ] John L H, David A P. Computer organization & design: The hardware/ software interface. China Machine Press, 1999: 265 - [2] Ding Baoyan, Zhang Qianling. Design of constant divider and its BIST implement. Chinese Journal of Semiconductors, 2000, 21(5): 491(in Chinese)[丁宝延,章倩苓.常数除法器的设计及其 BIST 实现. 半导体学报, 2000, 21(5): 491] - [3] Li Yamin. Computer organization and architecture. Beijing: Tsinghua University Press, 2000: 72(in Chinese)[李亚民. 计算机组成与系统结构. 北京: 清华大学出版社, 2000: 72] - [4] MIPS Technologies Inc. MIPS32 4KE processor core family software user's manual. No. MD00101, 2002; 29 - [5] MIPS Technologies Inc. MIPS32 architecture for programmers, Volume II: The MIPS32 instruction set. 2001: 109 - [6] Hao Jianxin, Xie Jianbin. The realization of precedent cellular arrays divider by means of FPGA. Journal of National University of Defense Technology, 1997, 19(1):66(in Chinese)[郝建新,谢剑斌.用FP-GA 实现先行进位单元阵列除法器.国防科技大学学报, 1997, 19(1):66] - [7] Xue Shengjun, Wang Changyin. The principle of computer organization. Wuhan: Huazhong University of Science and Technology Press, 2001: 48 - [8] Rabaey J.M. Digital integrated circuits: a design perspective. Prentice-Hall International Inc, 1999: 129 - [ 9 ] Gary Y. Practical low power digital VLSI design. Kluwer Academic Publishers, 1998: 4 - [10] VeriSilicon Microelectronics (Shanghai) Co, Ltd. VeriSilicon SMIC 0. 18µm high density standard cell library databook. Ver 1. 0, 2002; 5 650 半 导 体 学 报 25 卷 # 基于双比特算法的新型除法器\* #### 李 侠 孙 慧 章倩苓 (复旦大学专用集成电路与系统国家重点实验室, 上海 200433) 摘要:提出了基于双比特算法的新型除法器及其 VLSI 实现结构.与 MIPS 微处理器中的除法器相比,它的平均执行周期减少了 52.5%,而其最大延时几乎不变,仅晶体管数目增加了 60%.仿真结果表明,在相同的运算能力下,其功耗仅为 MIPS 中除法器的 11.3%. 关键词: 双比特算法; 除法运算; VLSI 实现; 功耗 **EEACC:** 2570D; 1265B 中图分类号: TN402 文献标识码: A 文章编号: 0253-4177(2004)06-0645-06 <sup>\*</sup> 国家高技术研究发展计划资助项目(编号: 2002AA1Z1060) 李 侠 男,1978年出生,博士研究生,研究方向为低功耗集成电路设计. 孙 慧 女,1978年出生,硕士研究生,研究方向为高性能存储器设计. 章倩苓 女, 1936 年出生, 博士生导师, 主要研究方向为 VLSI 系统集成等.