# A New Architecture of Twiddle Factor Generator for Radix-2 1024-Point FFT\* ## Li Jincheng and Yang Huazhong (Department of Electronic Engineering, Tsinghua University, Beijing 100084, China) Abstract: A new ROM less twiddle factor generator for radix-2 1024-point FFT is proposed. It consists of several simple logic units and each of them synthesizes some data, which will be used to compose the twiddle factors. The power analysis with Synopsys Power Compiler shows that it consumes about 2mW with TSMC 0.25μm CMOS process at 50MHz. This twiddle factor generator is designed for the low power applications, especially for the mobile communications and other portable devices. Key words: FFT; twiddle factor; low power; ROMless; mobile communications EEACC: 1517; 6730 CLC number: T N 43 Document code: A Article ID: 0253-4177(2004) 04-0377-06 #### 1 Introduction As the development of the CMOS processes, system-on-chip (SOC) is becoming popular in the implementation of the mobile communication systems for the features of low-cost, low-power, and small physical size. Low power design is becoming more and more important to the mobile communications. Fast Fourier transform (FFT) and inverse fast Fourier transform (IFFT) are the key building blocks for the mobile communications, especially for the orthogonal frequency division multiplexing (OFDM) transceiver systems[1]. Although the algorithms and the hardware realization techniques of FFT are well developed, they do not perfectly meet with the low power requisition for the mobile communications. It is more important to modify the twiddle factor reading methods to further cut down power consumption of FFT. Besides, the algorithm itself of FFT also needs to be optimized to reduce the power consumption. It is well known that the twiddle factor generating method in the traditional way is to read a ROM lookup table that stores the twiddle factors. But for a N-point radix-2 FFT with W-bit resolution of the twiddle factors, however, the ROM size should be N/2 locations with each location storing a 2W-bit data (W bits for the real part, W bits for the imaginary part). When the N and W are large, for example in the applications in the OFDM systems, a large ROM is needed and there will be a great deal of power cost in the accesses to the ROM for the twiddle factors. A lot of papers had been found researching on minimizing the power cost in the accesses to the ROM that stores the twiddle-factors by reducing the ROM size or decreasing the access or falling <sup>\*</sup> Project supported by National Natural Science Foundation of China (Nos. 60025101 and 90207001) and State Key Development Program for Basic Research of China (No. G1999032903) Li Jincheng PhD. He is interested in the research on mixed-signal and RF CMOS ICs. Yang Huazhong professor. His research interest includes integrated circuits design and electronic design automation. the power consumption for one $access^{[2-7]}$ . Even with the reduced ROM size, when the N and the W are large enough, the ROM still consumes the largest part of the power consumed by the entire twiddle factor generator. FFT is a well-organized algorithm of discrete Fourier transform (DFT), which significantly reduces the number of required arithmetic operations and greatly increases the speed of data processing. DFT of discrete signal x(n) can be directly computed as $$x(k) = \sum_{n=0}^{N-1} x(n) W_N^{nk}, \quad k = 0, 1, \dots, N-1(1)$$ where $$W_N^{nk} = e^{-j2\pi nk/N} = \cos(2\pi nk/N) - j\sin(2\pi nk/N)$$ (2) is known as the twiddle factor. Here x(n) and x(k) are sequences of complex numbers. We can see from Eq. (2) that the twiddle factor $W_N^{nk}$ can be regarded as the combination of the evenly sampled sine and cosine, where the real part is the cosine and the imaginary part is the sine. By making full use of the symmetric feature of a sine wave and the relation between sine and cosine, this design suggested a new low-power ROM less architecture for the twiddle factor generators. The power analysis with Synopsys Power Compiler shows that the power consumed by this twiddle factor generator is about 2mW with TSMC 0. $25\mu m$ CMOS process at 50M Hz. ## 2 Algorithm If $$p = nk$$ , equation (2) can be rewritten as $$W_N^p = e^{-j2\pi p/N} = \cos(\frac{2\pi}{N} \times p) - j\sin(\frac{2\pi}{N} \times p)$$ (3) where $p = 0, 1, 2, \dots, 511, N = 1024$ in this proposed design. Figure 1(a) shows the waveforms of the twiddle factors, and Figure 1(b) shows the absolute value of the real part and the imaginary part. We can find from Fig. 1(a) that the sign bits of the real and the imaginary parts can be easily set, where the sign bit of the imaginary part is "and the sign bit of the real part is 0", when p = 0, 1, 2, ..., 255, is "1" when p = 256, 257, ..., 511, so if we ignore the sign bits of the real and the imaginary parts, a more simple relation between the absolute values of the real and the imaginary parts can be found in Fig. 1(b), means the former half and the latter half of the picture composed by the two lines of the absolute values of the real and imaginary parts are identical to each other. When a circuit generates one half of the picture of the absolute values of the real and imaginary parts, the same circuit can generate the other half of the picture of the absolute values of the real and imaginary parts just by exchanging the generated values. To a 9-bit input address for 512 data, the MSB of the input address will only be used to control whether to exchange the generated values and the residual bits of the input address are fed to the generating circuit to generate one half of the picture shown in Fig. 1 (b). There is just an exchanging operation at the output of the generating circuit and no other additional operation, so this algorithm realizes a simple implementation of the entire system. Fig. 1 Waveforms of the twiddle factor (a) Real and imaginary part; (b) Absolute values of (a) ## 3 Architecture and operation Figure 2 shows the architecture of the twiddle factor generator. This proposed design generates the twiddle factors with 16-bit resolution. The addr[8:0] is the input address, p in Eq. (3); clk is the system clock; real[15:0] and imag[15:0] are the output of the real and the imaginary parts, respectively. The sign bit of the imaginary part in Fig. 1(a) is 1' to all of the 512 twiddle factors, so imag[15], the sign bit of the imaginary part, is directly connected to 1' as shown in Fig. 2; the sign bit of the real part in Fig. 1(a) can be derived to equal to the MSB of the input address, so in Fig. 2, real[15], the sign bit of the real part, is equal to the signal of one period delay of the addr[8]. Fig. 2 Architecture of the twiddle factor generator Because the sign bits of the real and the imaginary parts can be easily got, the function of the entire architecture as illustrated in Fig. 2 mainly generates the absolute values of the twiddle factors. The data sequence of the imaginary part is rearranged as illustrated in Fig. 3, then the total 512 15-bit data are transferred to 256 30-bit data, which as shown in Fig. 1(b), composed by the former 15-bit absolute values of the imaginary part and latter 15-bit absolute values of the real part. The blocks of L0 to L7 are 8 separate logic circuits and each of which synthesizes 32 30-bit data, for 256/8= 32, L0 to L7 can reproduce all the 256 data. n0[29:0], n1[29:0], ..., n7[29:0], the outputs of L0 to L7, are summed together by the block of SUM. For simplifying the circuit of SUM, before n0[29:0], n1[29:0], ..., n7[29:0] are fed to SUM, these data are processed by the AND gates, U0, U1, ..., U7, each of which has one input from p1[7:0]. Because p[7:0] is the output of 3 to 8 decoder, only one bit is "among the bits of the output of the decoder. p1[7:0] is the triggered output of p[7:0]. The processed data, y0[29:0], y1[29:0], ..., y7[29:0], have just one none-zero value and can be added easily by OR gates instead of the large area-occupied and power-hungry adders. Fig. 3 Data arrangement (a) Original data; (b) Rearranged data Let's introduce the operation of the diagram assuming addr[8:0] = $9^{\circ}b0 - 0010 - 0110$ , p = 38 in Eq. (3). Because addr[7:5] = 3'b001, p[7:0], the output of 3 to 8 decoder is 8'b0000 - 0010, means p[1] = 1'b1 and the clock signal, clk, can pass through the AND gate switched by p[1].m1[4:0] can get a new value of addr[4:0]. L1 creates a new value to its output, n1[29:0], corresponding its input address, m1[4:0]. As mentioned above, the inputs of SUM has only one none-zero value, y1[29: 0], and out [29: 0] gets the new value of y1 [29: 0]. out [29:0], which is composed by the former 15-bit absolute values of the imaginary part and the latter 15-bit absolute values of the real part as mentioned above, is fed to the block of SWITCH, which redistributes its input data to real [14:0] and imag[14:0] according to its controlling signal, real[15]. Table 1 shows the redistribution regular, which transfers out [29: 15] and out [14: 0] to imag[14: 0] and real [14: 0], respectively, when real[15] is "0" and transfers out [14: 0] and out [29: 15] to imag[14: 0] and real [14: 0], respectively, when real [15] is "1". To the assuming of addr[8: 0] = 9'b0\_010\_0110, for addr[8] = 1'b0, real [15], one period delay of addr[8], is 1'b0, then according to Table 1, imag [14: 0] is equal to out [29: 15] and real [14: 0] is equal to out [14: 0]. The sign bit of the imaginary part is "1", and the sign bit of the real part is real [15], one period delay of addr[8], "1". Table 1 Characteristic value of SWITCH | real[ 15] | imaginary[ 14: 0] | real[ 14: 0] | |-----------|-------------------|--------------| | 0 | out[ 29: 15] | out[ 14: 0] | | 1 | out[ 14: 0] | out[ 29: 15] | ### 4 Power analysis and comparison To any portable devices, as the rapid developing of the CMOS processes, low power design is becoming more and more critical. Minimizing the switching activity of the logic gates in digital CMOS VLSI is a prospective method to reduce power consumption. By reducing the switching activity, the design in this paper is indeed a real power-efficient twiddle factor generator. First, to decrease the switching of the DFFs and to reduce the load of the clock tree, a 3 to 8 decoder is used to generate controlling signal to select the DFFs that transfer subaddr[5:0] to L0 to L7 and reduces the load of the clock tree by clock-gating; second, only one logic block among L0 to L7 works within one period of the clock; third, as introduced above, the summation of the outputs of L0 to L7 is realized by OR gates instead of the power- and area-hungry adders. The Power Compiler of Synopsys Inc. is a tool for power analysis and optimizations. It can analyze the switching power and the leakage power of digital circuits. The switching power analysis is based on the activity of the nets in the design. When analyzing a gate-level design, Power Compiler requires a gate-level netlist and some forms of switching activity for the netlist. The Power Compiler tool computes average power consumption based on the switching activity of the nets in the design, which is available by RTL or gate-level simulation. As the switching power is much larger than the leakage power in $0.25\mu m$ CMOS technology and the relative precision of the gate-level power simulation by the Power Compiler is pretty accurate [8], we take it as a reference to measure the power of our design. To get the precise power characteristic of the design in this paper, a gate-level power analysis by means of Synopsys Power Compiler is performed. Table 2 shows the results of the power and area analysis with TSMC 0.25µm CMOS process at 50M Hz (an ordinary inverter occupies 11 in area). We can see from Table 2 that the total power consumption at 50M Hz with TSMC 0.25 µm CMOS technology is about 2mW and the entire area occupation is about 4000 equivalent gates (4000 ordinary inverters). In part0, as shown in Fig. 2, the load of the clock tree is decreased by clock-gating that realized by a 3-to-8 decoder and 8 AND gates, which reduces the power of the clock tree to minimum; in part1, which occupies 54% of the entire area, by portioned 8 separate logic units and only one of them works at a time, the switching activity is very low and thus a low power consumption is realized; part2 and part3 perform the adding function, for the low power consumption required, we first make the outputs of L0 to L7 become only one none-zero values, which then can be summed together directly by OR gates, power synthesized results shows that part2 and part3 consume only 113 + $213 = 326 \mu W$ , less than the power consumed by LO to L7. If we sum the outputs of LO to L7 by adders, the adding function will consume a lot of extra power than that of part1; from Table 1, part4 should consist of 2 15-bit 2-to-1 switches and 30 DFFs and consumes $709\mu W$ as shown in Table 2, more than one third of the total power consumption, $1856\mu W$ . Table 2 Synthesized results of power and area | part | part0 | part 1 | part2 | part3 | part4 | Total | |-------------------------------|-------|--------|-------|-------|-------|-------| | Pow er | 328 | 493 | 113 | 213 | 709 | 1856 | | $/\mu \mathrm{W}$ | 19% | 26% | 6% | 11% | 38% | 100% | | Area | 6687 | 24099 | 5529 | 2592 | 5356 | 44263 | | ( times to a<br>certain cell) | 15% | 54% | 12% | 6% | 13% | 100% | If L0~ L7 are replaced by 8 small ROMs, each of which is one part of an evenly divided ROM that storing all the twiddle factors, the new circuit performs the same function. Because only one small ROM among the 8 small ROMs is accessed for one address, the power consumption of this new circuit is smaller than that of an entire ROM. Obviously, the area occupation of the proposed design is bigger than that of a ROM, so there will be a trade-off between area and power to a certain application. To a given FFT, it needs an optimization among the number of the logic blocks, power, and the area occupation. The more the logic blocks, the smaller the power consumption of part 1 in Table 2, the larger the power consumption of part2, part3, and part4. To our knowledge, the number of logic blocks in part1 is much critical to the optimization of the total power consumption and the area occupation. #### 5 Conclusion FFT finds its extensive applications in wireless OFDM transceiver systems in recent years, and this makes it very important in design of low power FFTs. It costs much power in an N-point FFT operation to read the coefficients of the twiddle factors when the data are stored in conventional ROM for a large N. Many methods have been introduced to solve the problem of large power consumption caused by accessing to ROM by means of reducing the ROM size, decreasing the memory access or minimizing the power consumption of one access. This paper proposes a low power twiddle factor generator that consumes about 2mW at 50MHz working frequency. The area occupied by this ROMless twiddle factor generator is about 4000 gates. Besides the features of low power consumption, this twiddle factor generator is entirely realized by logic gates, which can be synthesized by EDA tools without any additional modification or IPs from any third part. Thus it is very much suitable for the modern CMOS VLSI implementations. #### References - [1] Sadat A, Mikhael W B. Fast Fourier transform for high speed OFDM wireless multimedia system, circuits and systems. Proceedings of the 44th IEEE 2001 Midwest Symposium on MWSCAS, 2001, 2: 938 - [2] Jiang Y, Zhou T, Tang Y, et al. Twiddle-factor-based FFT algorithm with reduced memory access. Parallel and Distributed Processing Symposium, 2002: 70 - [3] Hasan M, Arslan T. Scheme for reducing size of coefficient memory in FFT processor. Electron Lett, 2002, 38(4):163 - [4] Ma Y, Wanhammar L. A hardware efficient control of memory addressing for high-performance FFT processors. IEEE Trans Signal Processing, 2000, 48(3): 917 - [5] Park S Y, Cho N I, Lee S U, et al. Design of 2K/4K/8K-point FFT processor based on CORDIC algorithm in OFDM receiver. 2001 IEEE Pacific Rim Conference on Communications, Computers and Signal Processing, 2001, 2: 457 - [6] Ma Y, Wanhammar L. A coefficient access control for low power FFT processors. 1999 42nd Midwest Symposium on Circuits and Systems, 1999, 1: 512 - [7] Ma Y. A VLSI-oriented parallel FFT algorithm. IEEE Trans Signal Processing, 1996, 44(2): 445 - [8] Synopsys Inc. Power compiler reference manual, 2001 382 半 导 体 学 报 25 卷 ## 一种新的 1024 点基-2 FFT 旋转因子产生电路的结构\* #### 李金城 杨华中 (清华大学电子工程系, 北京 100084) 摘要: 设计了一个新的无存储器的基-2 1024 点 FFT 旋转因子产生电路. 这个旋转因子产生电路用若干逻辑模块来产生数据, 然后用这些数据合成所需要的旋转因子. 用 Synopsys Power Compiler 进行功耗分析表明, 用 TSM C 0. $25\mu m$ CM OS 工艺综合出来的电路在 50 M Hz 时的功耗为 2 m W. 这种旋转因子产生电路非常适合用于低功耗的设计中, 尤其是移动通信和其他手持设备中. 关键词: FFT; 旋转因子; 低功耗; 无存储器; 移动通信 **EEACC**: 1517; 6730 中图分类号: TN43 文献标识码: A 文章编号: 0253-4177(2004)04-0377-06 <sup>\*</sup> 国家自然科学基金(批准号: 60025101, 90207001) 和国家重点基础研究发展规划(合同号: G1999032903) 资助项目 李金城 博士,主要从事混合信号和 RF CMOS 研究. 杨华中 教授,博士生导师,主要从事集成电路设计及电子系统设计自动化方面的研究.