# A cascaded charge-sharing technique for an EDP-efficient match-line design in CAMs

Zhang Jianwei(张建伟)<sup>1,†</sup>, Ye Yizheng(叶以正)<sup>1</sup>, Liu Binda(刘滨达)<sup>2</sup>, and Lan Jinbao(兰金宝)<sup>1</sup>

(1 Microelectronics Center, Harbin Institute of Technology, Harbin 15001, China)

(2 Department of Electrical Engineering, National Cheng Kung University, Taiwan 70101, China)

**Abstract:** A novel cascaded charge-sharing technique is presented in content-addressable memories (CAMs), which not only effectively reduces the match-line (ML) power by using a pre-select circuit, but also realizes a high search speed. Pre-layout simulation results show a 75.9% energy-delay-product (EDP) reduction of the MLs over the traditional precharge-high ML scheme and 41.3% over the segmented ML method. Based on this technique, a test-chip of 64-word × 144-bit ternary CAM (TCAM) is implemented using a 0.18- $\mu$ m 1.8-V CMOS process, achieving an 1.0 ns search delay and 4.81 fJ/bit/search for the MLs.

Key words: content-addressable memory; energy-delay-product; cascaded charge-sharing DOI: 10.1088/1674-4926/30/6/065009 EEACC: 1280; 2570D

## 1. Introduction

A content-addressable memory (CAM) is an application specific memory that allows its entire contents to be searched in parallel and returns the matched address within a single clock cycle<sup>[1]</sup>. A binary CAM (BCAM) performs exact-match searches, while a more powerful ternary CAM (TCAM) provides a pattern matching function with the use of "don't care". A "don't care", or "x", represents both logic "0" and "1", allowing a wildcard operation that is particularly attractive in longest-prefix-match searches in network routers<sup>[2]</sup>. Other CAM applications mainly focus on search-intensive domains, such as database accelerator, image processing, cache memory, translation lookaside buffers (TLBs), and neural networks. Still there is an increasing demand for CAMs with low power and high search speed in these applications. The main CAMdesign challenge is to reduce the power consumption associated with the large amount of parallel active circuitry, without sacrificing speed or memory density.

A basic CAM architecture<sup>[3]</sup> is illustrated in Fig. 1. CAM cells are connected in parallel (NOR type) or in series (NAND type) by a wire match line (ML). Data stored in a CAM are searched by applying the reference word to the bit lines (BLs), which run vertically, through bit line drivers. In a precharge-high NOR type architecture<sup>[4]</sup>, the ML will be pulled down to ground if any mismatch is detected. Otherwise, ML will keep high if it is initially precharged to a high level. Sense amplifiers (SAs) are used to sense the ML level to determine match or mismatch. In case of multiple matches in a TCAM, a priority encoder is needed to select a single row, and to assert a hit signal and the corresponding address of the selected row. Usually, two SRAM bits are combined to represent 0, 1, and *x* in a TCAM. As can be seen from Fig. 1, any search word presented is searched in parallel. The major portion of the CAM

power is consumed during this parallel comparison, where all of the highly capacitive MLs are charged and discharged in every cycle.

Many circuits and architectures have been proposed to reduce the CAM power<sup>[5-10]</sup>. The NAND/AND ML architecture<sup>[11, 12]</sup> is useful to reduce CAM power due to the serial nature of search operation. In order to achieve higher speeds, a NOR ML architecture is preferred. Most approaches try to reduce the voltage swing across the  $MLs^{[13-16]}$ , whereas some approaches use system level optimization<sup>[17,18]</sup>. Other system level approaches try to minimize the pre-charging and evaluating events<sup>[19, 20]</sup>, utilizing the application specific properties, as in the cases of IP forwarding engines<sup>[21]</sup> and cache memories<sup>[22]</sup>. Statistical results show that testing four least (most) significant bits is sufficient to determine over 80% of the mismatches<sup>[21, 22]</sup>. Thereby, the number of transitions in the MLs is significantly reduced. However, further optimization can be made from the aspects of both power and speed. So, the energy-delay-product (EDP) of the CAM can be greatly improved.



Fig. 1. Simplified CAM architecture illustrating two types of CAM cells.

<sup>†</sup> Corresponding author. Email: Centuryue@gmail.com

Received 20 Noverber 2008, revised manuscript received 19 February 2009



Fig. 2. The proposed match line architecture of the word circuit: (a) The detailed circuit of PreSelect 8 bits; (b) General architecture of the proposed architecture; (c) The detailed circuit of the two cascaded Segment.

A segmented ML was presented in Ref. [23]. In order to implement charge sharing between segmented MLs to achieve low power, the search lines (SLs) have to be reset to ground in the precharge stage that doubles the SL power. SL power has been a crucial problem in today's CAM design. For example, the SL power consumption in Refs. [24–26] amounted to about 46%, 71%, and 82% of the total power, respectively.

In this paper, we present a cascaded charge-sharing technique that not only effectively reduces the ML power, but also realizes a high search speed. The SL drivers remain to be of a non-reset type. The main idea of the proposed scheme is that , based on the statistical results , we first filter out a large amount of mismatched MLs using a small section of the CAM, and then we accelerate the search speed in the remaining section. So, the EDP of the total CAM is greatly reduced.

## 2. Proposed match line architecture

A general architecture for the proposed ML is shown in Fig. 2(b). As can be seen, two parts are included in the word circuit: the first part (the pre-select 8 bits) and the second part (the remainder 136 bits). The CAM cell of the first part (M1) is shown in Fig. 3(a). As can be seen from this figure, transmission gates are used in order to achieve a full voltage swing at node N in order to alleviate the leakage power of the static logic (see Fig. 2(a)). The PMOS transistors MP1 and MP2 are used to pull the node N to  $V_{DD}$  when a "don't care" [(D1, D2) = (0, 0)] is stored. The CAM cell of the second part (M2) is shown in Fig. 3(b). For simplicity, the word line and the bit line are not shown for both types of CAM cells.

The first part is composed of the pre-select 8 bits, with its detailed circuit shown in Fig. 2(a). In Internet routers, testing the four most significant bits is sufficient to determine over 80% of all mismatches<sup>[21]</sup>. So, in this work, eight most significant bits are used as the pre-select 8 bits, aiming to filter out a large amount of mismatched MLs, which leads to a great deal of energy reduction. The pre-select 8 bits, including the gated clock logic, is implemented with static gates. Only if a match occurs in the eight most significant bits, can the second segment start its evaluation by asserting Pre\_0 low and Match\_0 high in the evaluation stage (i.e., the high level of CLK). Usually, the match circuit is in the critical path of a CAM. So, we implement the pre-select 8 bits with static logic in order to "hide" its evaluation process in the pre-charge stage of the CAM (i.e., the low level of CLK), during which the search word is driven on the search line and the word circuit is being pre-charged.

The second part consists of the remainder 136 bits. As can be seen from Fig. 2(b), it is comprised of two branches of two-stage cascaded segments, which are implemented by a charge-sharing ML architecture. The detailed circuit of the four segments is shown in Fig. 2(c). As can be seen from this figure, the ML inside a block is divided into ML\_a and ML\_b by the transistor MN1. In order to achieve a low voltage swing to reduce the energy consumption, a stray capacitance is used to implement charge sharing between ML\_a and ML\_b in the evaluation phase. The circuit works in a pre-charge and evaluation style. In the pre-charge phase, with Match\_(n - 1) being low and Pre\_(n - 1) being high, ML\_a is pre-charged to a high voltage level and ML\_b is pre-charged to ground. Then, in the



Fig. 3. The detailed CAM cell circuit for (a) M1 of the first part and (b) M2 of the second part.

evaluation phase with Match<sub>(n</sub> - 1) being high and  $Pre_{(n)}$ - 1) being low, charge sharing occurs between ML<sub>a</sub> and ML\_b. The voltage level of ML\_b after charge sharing is determined by the stray capacitance ratio of ML\_a and ML\_b [i.e.,  $V_0 = V_{DD}C_a/(C_a + C_b)$ , where  $V_0$  is the voltage level of ML<sub>b</sub> after charge sharing,  $C_a$  and  $C_b$  are the stay capacitance on ML<sub>a</sub> and ML<sub>b</sub>. In this design  $C_a : C_b = 2 : 1$ . That is, the number of TCAM cells connected to ML\_a is two times the number connected to ML\_b. So,  $V_0 = 2V_{DD}/3 = 1.2$  V]. Meanwhile, if any mismatches are found in ML\_a or ML\_b in the evaluation phase, the ML\_b voltage will remain at a voltage close to ground. Otherwise, the ML<sub>b</sub> voltage rises to  $V_0$ . A sense amplifier (SA) is used to sense the ML<sub>b</sub> voltage to detect a match or mismatch. Only if the preceding segment detects a match, can the evaluation of the next segment be activated by asserting Match\_n high and Pre\_n low. Such cascaded architecture leads to a further power reduction in addition to that made by charge sharing.

If more segments are serial cascaded, more power will be saved. But the search speed will be adversely affected. In addition, the size of MN1 in Fig. 2(c) determines the process speed of the charge sharing that affects the final search speed. So, there should be an optimization process for the number of cascaded segments and the size of MN1, which leads to an optimized EDP of the word circuit. Section 3 discusses this optimization process.

## 3. EDP optimization of the word circuit

In this section, the property of the proposed cascaded charge sharing circuit is discussed. And then the EDP opti-



Fig. 4. Performance variation with the number of stages: (a) Delay per stage as a function of the number of stages; (b) Word circuit search time as a function of the number of stages. The size of MN1 in Fig. 2 (c) varies from two times (2X) the minimum transistor size (0.44  $\mu$ m) to eight times (8X) the minimum transistor size.

mization method of the circuit is presented. The circuit simulation and implementation are carried out using a 0.18- $\mu$ m 1.8-V CMOS process.

### 3.1. Property of the cascaded charge sharing circuit

Assume that the 144-bit word circuit is composed of one branch *n* cascaded segments (so, one stage equals one segment here), whose detailed circuits are shown in Fig. 2(c), and that inside each segment, the number of TCAM cells connected to ML\_a is two times the number connected to ML\_b (i.e.,  $C_a$ :  $C_b = 2 : 1$ ).

Firstly, a different number of stages (or segments) greatly affect the circuit evaluation speed. Figure 4 shows the search time per stage and word circuit as a function of the number of stages. The size of MN1 in Fig. 2(c) also affects the circuit performance. Figure 4 shows this effect by plotting a group of curves with the MN1 size varied from two times (2X) the minimum transistor size ( $0.44 \ \mu m$ ) to eight times (8X) the minimum transistor size. As can be seen from Fig. 4(a), the delay per stage decreases when the number of stage increases. The delay time is defined here as the time period from the rising edge of Match\_(n - 1) to the falling edge of Pre\_n in a full match case in Fig. 2(c), both referred to 50% of  $V_{DD}$ . Also, it can be seen that the delay per stage decreases when MN1



Fig. 5. Voltage glitch of Pre\_n in case of a 1-bit mismatch on the ML\_a with MN1 being 2X side. When the word circuit is segmented from 1 to 3 stages, the voltage glitch decreases from 1.4925 to 0.09064 V.

becomes larger. Meanwhile, as can be seen from Fig. 4(b), the search time of the word circuit linearly increases when the number of stages increases. Also, the search time decreases when MN1 becomes larger.

Secondly, voltage glitches will be generated on Match\_*n* in Fig. 2(c) in case of mismatches. This situation will be even more pronounced when only a one bit mismatch occurs on ML\_a. Figure 5 shows such simulation results with MN1 having a 2X size. As can be seen from this figure, more stages lead to a smaller glitch. When the word circuit is segmented from one to three stages, the voltage glitch decreases from 1.4925 to 0.09064 V. Meanwhile, a larger MN1 augments the glitch. The latter effect is not shown in this figure for simplicity. A glitch is harmful to the circuit performance. A strong glitch not only generates a fake result, but also wastes a great deal of energy. Therefore, efforts should be made to reduce glitches. The shaded range in Fig. 6 shows the feasible number of stages for different sizes of MN1, given that the voltage of the glitch should not exceed  $V_{DD}/2$ .

#### 3.2. EDP optimization of the proposed circuit

As mentioned above, testing the four least (most) significant bits is statistical sufficient to determine over 80% of the mismatches, and so a great deal of energy is saved. Any additional effort made to reduce the energy will not achieve a remarkable reduction on EDP. Instead, we try to accelerate the search speed to achieve an optimized EDP.

Less stages lead to a faster search operation, but the glitch will be bigger. For the sake of safety, we configure the circuit as two stages with four segments. As can be seen from Fig. 2(b), every two cascaded blocks work in parallel with the other two cascaded blocks. Based on this architecture, we give the optimization process for the circuit. We first determine the size of MP1, MN2, and MN3 in Fig. 2(c). These parameters can be derived according to Fig. 4(a). As can be seen from this figure, the stage delay is about 200 ps when the number of stages is greater than two. Thus, the sizes of MP1, MN2, and MN3 are determined so as to make the circuit finish the pre-charging/



Fig. 6. Feasible number of stages for different sizes of MN1.

evaluation in about 200 ps. In detail,

(1) MP1: MP1 should be sized so that ML<sub>a</sub> can be precharged from ground to  $V_{DD}$  in about 200 ps.

(2) MN3: MN3 should not be a charge leakage bottleneck when only one bit mismatch occurs on ML\_a in the evaluation stage. The MN3 is sized experimentally greater than 5X.

(3) MN2: MN2 should be sized so that ML<sub>b</sub> can be precharged from  $2V_{DD}/3$  to ground in about 200 ps.

Secondly, the size of MN1 should be finely re-sized by sweeping the search time of the circuit. Figure 7 shows the sweeping results. As can be seen, 2 to 3  $\mu$ m is advisable (note that the glitch voltage should not exceed  $V_{DD}/2$ ). The word circuit search time is about 500 ps. Post-layout simulations show that the search time is 910 ps.

#### 3.3. Performance comparison

In order to see the performance achievement of the proposed circuit, we compare our work with two other types of scheme, i.e., the conventional precharge-high<sup>[4]</sup> and the segmented  $ML^{[23]}$ . For fair comparisons, all experiments are carried out in a 0.18- $\mu$ m 1.8-V CMOS process. For direct comparison, all three schemes have no pre-select circuit and so all the word circuits size 136 bits.

The original word length in a segmented  $ML^{[23]}$  was 72 bits. So, doubling 68-bit words that are connected in parallel by an NAND gate forms a 136-bit word (8-bit pre-select circuit is excluded). The number of TCAM cells connected to adjacent ML segments also keeps a ratio of 2 : 1, which is the same as in the proposed work. In the energy calculation, the energy overhead of the match sensor block in Ref. [23] is not included. Although the energy consumption correlates with the input pattern, we assume at least one mismatched bit in each ML segment during the experiment. This assumption is reasonable, given the fact that for a random and unbiased probability input pattern, the mismatch probability for an ML having a segment of 10 or more bits exceeds  $(1 - 1/2^{10})$ .

We defined the search time in Ref. [23] as the minimum time period for the word circuit to give a search result in case of a 1-bit mismatch that is preceded by seven continuous matches. SL driving and match result sensing, which take about 200 and 300 ps, respectively, both lie in this critical



Fig. 7. Search time of the word circuit for different sizes of MN1.



Fig. 8. Micrograph of the test chip.

path.

Table 1 shows a performance comparison of the proposed scheme with the other two circuits. As can be seen, the proposed scheme achieves 9.56 fJ/bit/search with 500 ps search time (The energy index seems high due to the charging of DML in Fig. 2(c) in the precharge phase, but note that a preselect 8 bits circuits in Fig. 2 is employed that can statistically reduce the ML power to  $1/2^8 = 3.9\%$ ). This achievement shows a 75.9% EDP reduction over the conventional pre-charge high method<sup>[4]</sup> and 41.3% over the segmented ML method<sup>[23]</sup>. Nevertheless, the non-resetting SL driving scheme in the proposed circuit will statistically result in half the energy consumption compared to the other two.

## 4. Experimental results

The proposed test chip is fabricated in a 0.18- $\mu$ m 1.8-V CMOS process, and the core area is  $533 \times 640 \ \mu$ m<sup>2</sup>. The micrograph of the proposed 64-word×144-bit TCAM is shown in Fig. 8. The clock signal of the proposed test chip is generated by a VCO. Random data is pre-stored in the chip; so, no decoder is included in the micrograph. The searching data is generated by a linear feedback shift register (LFSR), and the average power consumption can then be measured.

The measured energy is 4.81 fJ/bit/search. As mentioned before, this figure is smaller than the value listed in Table 1 due to the pre-select circuit. Meanwhile, due to the energy consumption of the pre-select circuit, the measured energy does



Fig. 9. Measured waveforms of the test chip. The signal in Channel 2 is inverted for ease of measurement.

Table 1. Comparison of pre-layout simulation results with no preselect circuits.

|                    | Conventional        | Segmented | This  |
|--------------------|---------------------|-----------|-------|
|                    | precharge           | ML [23]   | work  |
|                    | high <sup>[4]</sup> |           |       |
| SL driving style   | Yes                 | Yes       | No    |
| (needs resetting ) |                     |           |       |
| Search time (ns)   | 2.19                | 1.40      | 0.5   |
| Energy             | 9.07                | 5.82      | 9.56  |
| (fJ/bit/search)    |                     |           |       |
| EDP                | 19.86               | 8.15      | 4.78  |
| (ns·fJ/bit/search) |                     |           |       |
| Normalized EDP     | 100%                | 41.0%     | 24.1% |

not approach the theoretical value of  $1/2^8 = 3.9\%$ , as given in Table 1.

The measured waveforms of the test chip are shown in Fig. 9. The search time measured here is from the rising edge of Match\_0 to the falling edge of Word\_out (see Fig. 2). For ease of measurement, the Word\_out is inverted. As can be seen from Fig. 9, the search time is 1.0 ns. So, the measured EDP is 4.81 ns·fJ/bit/search.

## 5. Conclusion

A cascaded charge-sharing technique was presented in this paper that not only effectively reduces the ML power, but also realizes a high search speed. Using a small section of the CAM, we first filtered out a large amount of mismatched MLs, and then we accelerated the search speed in the remaining section. So, the EDP of the total CAM was greatly reduced. Prelayout simulation results showed a 75.9% EDP reduction over the conventional pre-charge high method<sup>[4]</sup> and 41.3% over the segmented ML method<sup>[23]</sup>. The proposed 64-word ×144bit TCAM was implemented using a 0.18- $\mu$ m 1.8-V CMOS process, and the measured EDP was 4.81 ns·fJ/bit/search.

## References

 Kohonen T. Content-addressable memories. New York: Springer-Verlag, 1987

- [2] Chao H J. Next generation routers. Proc IEEE, 2002, 90(9): 1518
- [3] Pagiamtzis K, Sheikholeslami A. Content-addressable memory (CAM) circuits and architectures: a tutorial and survey. IEEE J Solid-State Circuits, 2006, 41(3): 712
- [4] Lin P F, Kuo J B. A 0.8-V 128-kb four-way set-associative twolevel CMOS cache memory using two-stage wordline/bitlineoriented tag-compare (WLOTC/BLOTC) scheme. IEEE J Solid-State Circuits, 2002, 37(10): 1307
- [5] Huang P T, Chang S W, Liu W Y, et al. A 256×128 energyefficient TCAM with novel low power schemes. IEEE International Symposium on VLSI Design, Automation and Test, 2007: 1
- [6] Mohan N, Sachdev M. Low-capacitance and charge-shared match lines for low-energy high-performance TCAMs. IEEE J Solid-State Circuits, 2007, 42(9): 2054
- [7] Tyshchenko O, Sheikholeslami A. Match sensing using matchline stability in content-addressable memories (CAM). IEEE J Solid-State Circuits, 2008, 43(9): 1972
- [8] Ruan S J, Wu C Y, Hsieh J Y. Low power design of precomputation-based content-addressable memory. IEEE Trans Circuits Syst, 2008, 16(3): 331
- [9] Wang C C, Wang J S, Yeh C. High-speed and low-power design techniques for TCAM macros. IEEE J Solid-State Circuits, 2008, 43(2): 530
- [10] Zhang J W, Ye Y Z, Liu B D. A current-recycling technique for shadow-match-line sensing in content-addressable memories. IEEE Trans VLSI Syst, 2008, 16(6): 677
- [11] Shafai F, Schultz K J, Gibson G F R, et al. Fully parallel 30-MHz, 2.5-Mb CAM. IEEE J Solid-State Circuits, 1998, 33(11): 1690
- [12] Li H Y, Chen C C, Wang J S, et al. An AND-type match-line scheme for high-performance energy-efficient content addressable memories. IEEE J Solid-State Circuits, 2006, 41(5): 1108
- [13] Kasai G, Takarabe Y, Furumi K, et al. 200 MHz/200 MSPS 3.2
  W at 1.5 V V<sub>DD</sub>, 9.4 Mbits ternary CAM with new charge injection match detect circuits and bank selection scheme. Proc IEEE Custom Integr Circuits Conf, 2003: 387
- [14] Zhang J W, Ye Y Z, Liu B D. A low-power technique based

on charge injection and current-saving methods for match-line sensing in content-addressable memories. IEEE Asia-Pacific Conf Circuits Syst, 2006: 1293

- [15] Arsovski I, Sheikholeslami A. A mismatch-dependent power allocation technique for matchline sensing in content-addressable memories. IEEE J Solid-State Circuits, 2003, 38(11): 1958
- [16] Arsovski I, Chandler T, Sheikholeslami A. A ternary contentaddressable memory (TCAM) based on 4T static storage and including a current-race sensing scheme. IEEE J Solid-State Circuits, 2003, 38(1): 155
- [17] Lin C S, Chang J C, Liu B D. A low-power precomputationbased fully parallel content-addressable memory. IEEE J Solid-State Circuits, 2003, 38(4): 654
- [18] Panigrahy R, Sharma S. Reducing TCAM power consumption and increasing throughput. Proc Symp High Performance Interconnects, 2002: 107
- [19] Zukowski C, Wang S. Use of selective precharge for low-power CAMs. IEEE Int Symp Circuits Syst, 1997: 1788
- [20] Pagiamtzis K, Sheikholeslami A. A low-power contentaddressable memory (CAM) using pipelined hierarchical search scheme. IEEE J Solid-State Circuits, 2004, 39(9): 1512
- [21] Vijayasarathi D S, Nourani M, Akhbarizadeh M J, et al. Rippleprecharge TCAM: a low-power solution for network search engines. Int Conf Computer Design, 2005: 243
- [22] Efthymious A, Garside J D. A CAM with mixed serial-parallel comparison for use in low energy caches. IEEE Trans VLSI Syst, 2004, 12(3): 325
- [23] Baeg S. Low-power ternary content-addressable memory design using a segmented match line. IEEE Trans Circuits Syst, 2008, 55(6): 1485
- [24] Arsovski I, Sheikholeslami A. A current-saving match-line sensing scheme for content-addressable memories. Int Solid-State Circuits Conf Dig Tech Papers, 2003: 304
- [25] Choi S, Sohn K, Lee M W, et al. A 0.7 fJ/bit/search, 2.2 ns search time hybrid type TCAM architecture. Int Solid-State Circuits Conf Dig Tech Papers, 2004: 498
- [26] Wang J S, Li H Y, Chen C C, et al. An AND-type matchline scheme for energy-efficient content addressable memories. Int Solid-State Circuits Conf Dig Tech Papers, 2005: 464