# Low-power digital ASIC for on-chip spectral analysis of low-frequency physiological signals\*

Nie Zedong(聂泽东)<sup>1,2</sup>, Zhang Fengjuan(张凤娟)<sup>1</sup>, Li Jie(李杰)<sup>1</sup>, and Wang Lei(王磊)<sup>1,†</sup>

<sup>1</sup>Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, Shenzhen 518055, China

<sup>2</sup>Graduate University of the Chinese Academy of Sciences, Beijing 100049, China

Abstract: A digital ASIC chip customized for battery-operated body sensing devices is presented. The ASIC incorporates a novel hybrid-architecture fast Fourier transform (FFT) unit that is capable of scalable spectral analysis, a licensed ARM7TDMI IP hardcore and several peripheral IP blocks. Extensive experimental results suggest that the complete chip works as intended. The power consumption of the FFT unit is 0.69 mW @ 1 MHz with 1.8 V power supply. The low-power and programmable features of the ASIC make it suitable for 'on-the-fly' low-frequency physiological signal processing.

Key words: low-power; processing-on-node; spectral analysis

DOI: 10.1088/1674-4926/33/6/065004

EEACC: 1265

# 1. Introduction

Recent progress in body sensor networks (BSNs) has enabled the development of smart sensing devices for healthcare and leisure applications. A BSN typically comprises several sensing nodes for on-body/in-body physiological measurements and a base-station node for information aggregation. In most circumstances, the sensing nodes have to be very small, consume little power and be able to communicate wirelessly with the base-station node. It is therefore desirable to give the sensing devices processing-on-node abilities, whereby the raw data is processed 'locally' and the key features are extracted on-node prior to wireless transmission.

Thus far, different off-the-shelf solutions have been adopted to tackle the bottleneck of on-node computation. An FPGA, however, involves relatively high power consumption. DSPs have also been suggested, but their complicated architecture and dedicated instruction sets hinder their practical use in BSN applications. As an alternative, an application-specific integrated circuit (ASIC) can be fully customized, providing maximal design flexibility at the lowest possible power consumption<sup>[1, 2]</sup>. A low-frequency, low-noise and low-power IC design methodology for physiological measurements has also been proposed<sup>[3]</sup>.

Most physiological signals are periodic, and many of the body's vital signs, such as electroencephalogram (EEG) and respiration signals, have their primary features in the low-frequency range (below 1 kHz)<sup>[4]</sup>. In addition, frequency domain information can contribute to more effective data compression and to the removal of motion artifacts. Spectral analysis is therefore often a fundamental building block of BSN sensing devices. In this paper we present a low-power digital ASIC for spectral analysis of physiological signals based on the fast Fourier transform (FFT). In contrast to previous works<sup>[5-8]</sup>, a hybrid FFT architecture is proposed and a standard digital cell library is employed, which leads to high compatibility and fast prototyping.

# 2. System design

In the system, the FFT unit was implemented as a coprocessor that can be scaled up to 256-point FFT computations. An ARM7TDMI IP was incorporated as the system controller, and a memory management unit (MMU) was designed to manage peripherals such as SRAM and flash memory. All the system components were connected by an AMBA bus. For the FFT unit, the radix-2 algorithm was adopted for its simple control logic and its small number of multiplications per butterfly, which suit ASIC implementation and peak power savings. The FFT structure is illustrated in Fig. 1. Four design tactics were implemented, as described below.

## 2.1. Hybrid architecture

Parallel and fully pipelined FFT processor structures were adopted in most previous works. Their high computing speed, however, was achieved at the cost of die area and power consumption, because more than one butterfly unit operates at a time. Considering the low-frequency characteristics of physiological signals and the battery-operated nature of BSN devices, this design combines sequential and pipelined architectures to enhance computation efficiency. As Fig. 1 shows, only one butterfly unit is incorporated and all data are computed sequentially; at the same time, each butterfly operation is divided into a three-stage pipeline (read data from memory, computation, and write data to memory). Each pipeline stage needs only one clock cycle owing to the techniques described in the following subsections.

† Corresponding author. Email: wang.lei@siat.ac.cn Received 25 November 2011, revised manuscript received 24 December 2011

<sup>\*</sup> Project supported by the National Natural Science Foundation of China (Nos. 60932001, 61072031), the Guangdong Innovation Research Team Fund for Low-Cost Healthcare Technologies, the National Basic Research Program of China (No. 2010CB732606), and the 'One-hundred Talent' and the 'Low-Cost Healthcare' Programs of the Chinese Academy of Sciences.



Fig. 1. Architecture of the FFT unit, which has only one butterfly unit and a three-stage pipeline for each butterfly operation.

Table 1. Address generation for SRAM initialization.

| Points | Sequence number: a               | Address: d                       |
|--------|----------------------------------|----------------------------------|
| 8      | 00000a[2]a[1]a[0]                | 00000a[0]a[1]a[2]                |
| 16     | 0000a[3]a[2]a[1]a[0]             | 0000a[0]a[1]a[2]a[3]             |
| 32     | 000a[4]a[3]a[2]a[1]a[0]          | 000a[0]a[1]a[2]a[3]a[4]          |
| 64     | 00a[5]a[4]a[3]a[2]a[1]a[0]       | 00a[0]a[1]a[2]a[3]a[4]a[5]       |
| 128    | 0a[6]a[5]a[4]a[3]a[2]a[1]a[0]    | 0a[0]a[1]a[2]a[3]a[4]a[5]a[6]    |
| 256    | a[7]a[6]a[5]a[4]a[3]a[2]a[1]a[0] | a[0]a[1]a[2]a[3]a[4]a[5]a[6]a[7] |

## 2.2. Memory construction

To ensure computation accuracy, the width of 'data\_out' was extended to 40 bits, with the real and imaginary parts taking 20 bits each. For 256-point computation, the depths of the SRAM and the ROM must be 256 and 128, respectively. Because a $40 \times 256$ SRAM consumes more power than two $20 \times 256$ SRAMs during read and write cycles, and because read and write operations are executed almost continuously during FFT computation, the SRAM was organized as two banks ('ram\_real' and 'ram\_imaginary') that store the real and imaginary parts, respectively, so as to reduce the average power consumption. Furthermore, the SRAM was made dual-port in order to improve throughput and reduce the number of operation cycles.

## 2.3. Complex multiplication

Let the complex input of the radix-2 butterfly be $Y = Y_r + jY_i$ and the twiddle factor be $W = W_r + jW_i$. The complex multiplication $W \times Y$ can then be expanded as in Eq. (1):

$$W \times Y = (W_{\rm r}Y_{\rm r} - W_{\rm i}Y_{\rm i}) + j(W_{\rm r}Y_{\rm i} + W_{\rm i}Y_{\rm r}). \tag{1}$$

Equation (1) can be transformed as follows:

$$W \times Y = [W_{\rm i}(Y_{\rm r} - Y_{\rm i}) + Y_{\rm r}(W_{\rm r} - W_{\rm i})] + j[W_{\rm r}(Y_{\rm r} + Y_{\rm i}) - Y_{\rm r}(W_{\rm r} - W_{\rm i})]. \tag{2}$$

Compared with Eq. (1), Eq. (2) reduces the number of real multiplications from 4 to 3, whilst the number of additions increases from 2 to 5. Since a multiplication uses far more computation resources than an addition, Eq. (2) was employed in our design.
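The equivalence of the two forms can be checked numerically. The following is a minimal sketch, not the hardware implementation: the function names are hypothetical, and plain Python integers stand in for the fixed-point operands of the butterfly.

```python
def cmul_4mult(wr, wi, yr, yi):
    """Eq. (1): direct form, 4 real multiplications and 2 additions."""
    return wr * yr - wi * yi, wr * yi + wi * yr


def cmul_3mult(wr, wi, yr, yi):
    """Eq. (2): 3 real multiplications and 5 additions."""
    shared = yr * (wr - wi)          # product reused by real and imaginary parts
    real = wi * (yr - yi) + shared
    imag = wr * (yr + yi) - shared
    return real, imag


# Integer operands stand in for fixed-point samples and twiddle factors.
print(cmul_4mult(3, -4, 5, 7))
print(cmul_3mult(3, -4, 5, 7))
```

Both calls return the same pair, since the shared product $Y_{\rm r}(W_{\rm r} - W_{\rm i})$ cancels the extra cross terms in the real and imaginary parts.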

## 2.4. Address generation

Three types of addresses have to be generated: for SRAM initialization, for twiddle factor ROM access and for SRAM temporary data storage. SRAM initialization stores the reordered input data before the FFT operation. We denote the sequence number of the input data as "a" and the generated address as "d". Because our design scales up to 256 points, "a" and "d" are 8 bits wide. The address generation method for SRAM initialization is illustrated in Table 1.
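The mapping in Table 1 amounts to reversing the low $\log_2$(points) bits of "a" while the remaining upper bits of the 8-bit address stay zero. A minimal sketch follows; the function name `init_address` and its parameters are hypothetical and are not taken from the design.

```python
def init_address(a, points, width=8):
    """Map sequence number `a` to the SRAM address `d` as in Table 1.

    The low log2(points) bits of `a` are reversed; the upper bits of the
    `width`-bit address remain zero.
    """
    bits = points.bit_length() - 1          # log2(points) for a power of two
    assert 1 << bits == points and bits <= width
    assert 0 <= a < points
    d = 0
    for i in range(bits):
        # Bit i of `a` becomes bit (bits - 1 - i) of `d`.
        d |= ((a >> i) & 1) << (bits - 1 - i)
    return d


print([init_address(a, 8) for a in range(8)])
```

For an 8-point transform this yields the familiar bit-reversed order 0, 4, 2, 6, 1, 5, 3, 7.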

The address generation algorithms for twiddle factor ROM access and SRAM temporary data storage are given in Eqs. (3)–(5), where MASK$(b, \log_2 N_1-1-p)$ sets the lowest $(\log_2 N_1-1-p)$ bits of "b" to "0", and R(x, y, m) rotates the lowest "y" bits of "x" left by "m" bits. "N" is the maximum point length of the FFT unit and "$N_1$" is the point length being computed. "p" is the FFT computation stage number and "b" is the butterfly counter within the current stage.

$$\mathrm{rom\_add} = \operatorname{MASK}(b, \log_2 N_1 - 1 - p) \ll (\log_2 N - \log_2 N_1), \tag{3}$$

Table 2. Performance comparisons of 256-point FFT computations.

| Solution             | Clock cycles | Energy per 256-point FFT ($\mu$J) | Comment                          |
|----------------------|--------------|-----------------------------------|----------------------------------|
| This design          | 2063         | 1.42                              | SMIC 0.18 $\mu$m tech.           |
| Generated FPGA IP    | 1024         | 29.9                              | Spartan XC3SD3400A 90 nm tech.   |
| FPGA implementation  | 2063         | 8.25                              | Spartan XC3S500E 90 nm tech.     |
| Commercial DSP\*     | 25001        | 41.25                             | TMS320C6416TBGLZ1 90 nm tech.    |
| Embedded<sup>#</sup> | 608043       | 160.23                            | ARM7TDMI SMIC 0.18 $\mu$m tech.  |

\*: 'TMS320C6000 programmer's Guide (REV.G)' Texas Instruments. Aug.1.2002.

<sup>#</sup>: Simulated using the Realview Debugger; energy was calculated based on the 'LF039 Configuration and Core Performance Summary'.



Fig. 2. (a) Chip microphotograph and metrics; the die size is $5000 \times 3300 \ \mu m^2$. The equivalent NAND gate number of the FFT unit was approximately $2 \times 10^5$. (b) A snapshot of our test bench board in which the ASIC was socketed. (c) On-board testing results using the Tektronix logic analyzer. The falling edge of 'FFTOK\_OUT\_PAD' indicates a finished FFT compute procedure. A: 8 points, 30 cycles; B: 16 points, 72 cycles; C: 32 points, 170 cycles; D: 64 points, 396 cycles; E: 128 points, 910 cycles; F: 256 points, 2063 cycles.

$$\mathrm{sram\_add0} = R(2b, \log_2 N_1, p), \tag{4}$$

$$\mathrm{sram\_add1} = R(2b + 1, \log_2 N_1, p). \tag{5}$$
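To illustrate how Eqs. (3)–(5) drive a sequential single-butterfly FFT, the sketch below generates the three addresses for every butterfly and checks the result against a direct DFT. Several details are assumptions made here rather than statements from the design: a decimation-in-time butterfly ($A \pm W B$), a twiddle ROM holding $e^{-2\pi jk/N}$ for $k = 0, \ldots, N/2-1$, floating-point arithmetic instead of the chip's 20-bit fixed point, and all function names. The point length must be a power of two between 2 and `n`.

```python
import cmath


def mask(b, k):
    """MASK(b, k): clear the lowest k bits of b."""
    return (b >> k) << k


def rotl(x, y, m):
    """R(x, y, m): rotate the lowest y bits of x left by m bits (0 <= m < y)."""
    low = x & ((1 << y) - 1)
    return ((low << m) | (low >> (y - m))) & ((1 << y) - 1)


def addresses(b, p, n1, n=256):
    """Eqs. (3)-(5): (rom_add, sram_add0, sram_add1) for butterfly b, stage p."""
    log_n1 = n1.bit_length() - 1
    log_n = n.bit_length() - 1
    rom_add = mask(b, log_n1 - 1 - p) << (log_n - log_n1)
    return rom_add, rotl(2 * b, log_n1, p), rotl(2 * b + 1, log_n1, p)


def fft_with_generated_addresses(samples, n=256):
    """Radix-2 FFT of len(samples) points driven only by the addresses above."""
    n1 = len(samples)
    log_n1 = n1.bit_length() - 1
    # Assumed twiddle ROM contents: n/2 entries, rom[k] = exp(-2*pi*j*k/n).
    rom = [cmath.exp(-2j * cmath.pi * k / n) for k in range(n // 2)]
    # Bit-reversed initialization, as in Table 1.
    ram = [0j] * n1
    for a, x in enumerate(samples):
        ram[int(format(a, f"0{log_n1}b")[::-1], 2)] = x
    for p in range(log_n1):              # computation stages
        for b in range(n1 // 2):         # butterflies, processed sequentially
            w, a0, a1 = addresses(b, p, n1, n)
            t = rom[w] * ram[a1]
            ram[a0], ram[a1] = ram[a0] + t, ram[a0] - t
    return ram


def naive_dft(samples):
    """O(N^2) reference DFT used only to check the sketch."""
    n1 = len(samples)
    return [sum(x * cmath.exp(-2j * cmath.pi * k * t / n1)
                for t, x in enumerate(samples)) for k in range(n1)]


x = [complex(i % 5, -(i % 3)) for i in range(16)]
err = max(abs(u - v) for u, v in zip(fft_with_generated_addresses(x), naive_dft(x)))
print(f"max deviation from the reference DFT: {err:.2e}")
```

Under these assumptions, the rotation in Eqs. (4) and (5) places the two operands of stage "p" exactly $2^p$ addresses apart, and the masked butterfly counter in Eq. (3) selects the matching twiddle factor from a ROM sized for the maximum point length.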

# 3. Results and discussion

The design was fabricated using an SMIC 0.18 $\mu$m CMOS process. Figure 2(a) shows the microphotograph of the complete die, which measures 5000 × 3300 $\mu$m<sup>2</sup> (including the pads). Without the pads, the core size of the ASIC is 4000 × 2300 $\mu$m<sup>2</sup>, of which the FFT unit, the on-chip memory, the ARM7 (with wrapper and AMBA bus), and the MMU occupy approximately 10%, 40%, 10%, and 40%, respectively. The equivalent NAND gate count of the FFT unit is approximately 2 × 10<sup>5</sup>. The ASIC was assembled into an LQFP-128 package for bench tests. Figure 2(b) illustrates the test bench with the packaged chip. Figure 2(c) presents the on-board testing results from a Tektronix logic analyzer: the falling edge of 'FFTOK\_OUT\_PAD' indicates a finished FFT compute procedure, which verified that the FFT unit performed well.

Fig. 3. (a) One EEG data episode from the MIT-BIH polysomnographic database; the sample rate was 250 Hz. (b) Comparison of data processed by floating-point calculation and by this chip.

On-chip spectral analyses using the FFT unit were intensively validated against PC-based floating-point computations. Figure 3(a) shows an EEG signal episode from the MIT-BIH polysomnographic database<sup>[4]</sup>. Figure 3(b) demonstrates the spectral analysis outputs from both approaches. Statistical analysis indicated that the relative spectral analysis error was less than 3%, which is tolerable for most physiological signal spectral analysis<sup>[2, 4]</sup>.

Table 3. Performance comparisons with other works.

| Parameter                             | Wu<sup>[6]</sup> | Yu<sup>[7]</sup> | Chen<sup>[8]</sup> | This work     |
|---------------------------------------|------------------|------------------|--------------------|---------------|
| Technology ($\mu$m)                   | 0.18             | 0.18             | 0.18               | 0.18          |
| Clock speed (MHz)                     | 100              | 20               | 51                 | 1             |
| Radix                                 | Radix-2          | Radix-8          | Radix-2/22         | Radix-2       |
| Word length (bit)                     | 16               | $2 \times 11$    | $2 \times 13$      | $2 \times 20$ |
| Core area (mm<sup>2</sup>)            | 4.73             | 4.84             | 1.47               | 0.92          |
| Voltage (V)                           | 3.3/1.8          | 3.3/1.8          | 1.8                | 1.8/3.3       |
| Power dissipation (mW)                | 89.18            | 25.2             | 33.3               | 0.69          |
| Normalized power dissipation (mW/MHz) | 0.88             | 1.26             | 0.65               | 0.69          |

Please note that Chen's work<sup>[8]</sup> was designed specifically for OFDMA communication and could not be simply adopted for low frequency applications.

The power consumptions of the FFT unit and the complete chip were 0.69 mW and 1.12 mW, respectively, at a supply voltage of 1.8 V and a clock rate of 1 MHz. The FFT unit took approximately 2 ms to complete a 256-point FFT calculation at this clock rate. Table 2 compares the energy consumption of 256-point FFT calculations using this ASIC and other approaches, which included an unlicensed FPGA IP, an FPGA implementation of our FFT unit, a DSP implementation with unlicensed codes, and a pure software implementation (embedded on the ARM7 processor) with unlicensed codes. In terms of energy consumption, our digital ASIC outperformed all other approaches. Specifically, the energy consumption of our ASIC was less than 0.9% of that of the ARM7-based embedded solution, demonstrating a dramatic system-level advantage.
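As a quick cross-check, the energy entry for this design in Table 2 appears consistent with multiplying the measured power of the FFT unit by the duration of one 256-point transform. The symbols $E$, $P$, $N_{\rm cycles}$ and $f_{\rm clk}$ are introduced here for illustration only and are not the authors' notation:

$$E = P \times \frac{N_{\rm cycles}}{f_{\rm clk}} = 0.69\ \text{mW} \times \frac{2063}{1\ \text{MHz}} \approx 1.42\ \mu\text{J}, \qquad \frac{1.42\ \mu\text{J}}{160.23\ \mu\text{J}} \approx 0.89\%.$$

The second ratio reproduces the "less than 0.9%" figure quoted for the comparison with the ARM7-based software implementation.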

Comparisons with other prior works (digital ASICs for 256-point FFT) are listed in Table 3. Equation (6) was used to normalize the power consumptions across different running frequencies. It could be concluded that our ASIC is low power and suitable for on-chip FFT calculations.

$$\text{Normalized power} = \frac{\text{power}}{\text{clock rate}} \quad (\text{mW/MHz}). \tag{6}$$
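Equation (6) can be applied directly to the power and clock rows of Table 3. The short sketch below does so; the dictionary layout and the function name are illustrative choices, and the numbers are copied from the table.

```python
# (power dissipation in mW, clock speed in MHz), copied from Table 3.
designs = {
    "Wu [6]": (89.18, 100),
    "Yu [7]": (25.2, 20),
    "Chen [8]": (33.3, 51),
    "This work": (0.69, 1),
}


def normalized_power(power_mw, clock_mhz):
    """Eq. (6): power divided by clock rate, in mW/MHz."""
    return power_mw / clock_mhz


normalized = {name: normalized_power(p, f) for name, (p, f) in designs.items()}
for name, value in normalized.items():
    print(f"{name}: {value:.2f} mW/MHz")
```

Evaluated this way, the entry for Wu<sup>[6]</sup> comes out at about 0.89 mW/MHz rather than the tabulated 0.88, which is presumably a rounding difference; the other three entries match the table.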

# 4. Conclusion

This paper proposed a low-power digital ASIC chip for on-chip spectral analysis of low-frequency physiological signals. In contrast to the conventional pipelined or parallel FFT units employed for middle- or high-speed applications, this design employs a hybrid architecture intended for low-frequency physiological signal processing. Effective memory construction and address generation methods were established, and a reduced-multiplication complex multiplier was adopted to further reduce the power consumption. Extensive tests suggest that the chip works as intended, and we are developing several BSN-based sensing devices incorporating this digital chip.

# References


- [1] Kim H, Jun H. Bio-medical CMOS ICs. New York: Springer, 2010
- [2] Mai S P, Zhang C, Chao J, et al. A new cochlear prosthetic system with an implanted DSP. Journal of Semiconductors, 2008, 29(9):1745
- [3] Zhang J Y, Nie Z D, Huang J, et al. Towards low frequency low noise low power body sensor network-on-chip. International Conference on Green Circuits and Systems (ICGCS), 2010: 115
- [4] Goldberger A L, Amaral L A N, Glass L, et al. PhysioBank, PhysioToolkit, and PhysioNet: components of a new research resource for complex physiologic signals. Circulation, 2000, 101(23): e215
- [5] Sridhara S R, DiRenzo M, Lingam S, et al. Microwatt embedded processor platform for medical system-on-chip applications. IEEE J Solid-State Circuits, 2011, 46(4): 721
- [6] Wu G, Ying L. A register array based low power FFT processor for speech recognition. Journal of Information Science and Engineering, 2008, 24: 981
- [7] Lin Y W, Liu H Y, Lee C Y. A dynamic scaling FFT processor for DVB-T applications. IEEE J Solid-State Circuits, 2004, 39: 2005
- [8] Chen C M, Hung C C, Huang Y H. An energy-efficient partial FFT processor for the OFDMA communication system. IEEE Trans Circuits Syst II: Express Briefs, 2010, 57: 136