# A 10 Gb/s receiver with half rate period calibration CDR and CTLE/DFE combiner\*

Gao Zhuo(高茁)<sup>1,2,†</sup>, Yang Zongren(杨宗仁)<sup>1</sup>, Zhao Ying(赵莹)<sup>1</sup>, Yang Yi(杨祎)<sup>1,2</sup>, Zhang Lu(张璐)<sup>1</sup>, Huang Lingyi(黄令仪)<sup>1</sup>, and Hu Weiwu(胡伟武)<sup>1</sup>

(1 Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100080, China)
 (2 Graduate University of the Chinese Academy of Sciences, Beijing 100049, China)

**Abstract:** This paper presents the design of a 10 Gb/s low power wire-line receiver in a 65 nm CMOS process with a 1 V supply voltage. The receiver occupies $300 \times 500\ \mu\text{m}^2$. With the novel half rate period calibration clock data recovery (CDR) circuit, the receiver consumes 52 mW. The receiver can compensate for a wide range of channel loss by combining a low power wideband programmable continuous time linear equalizer (CTLE) with a decision feedback equalizer (DFE).

**Key words:** serial link; receiver; CDR; equalizer **DOI:** 10.1088/1674-4926/30/4/045008 **EEACC:** 6150D; 1280

# 1. Introduction

Advances in integrated circuit fabrication technology, together with innovative circuit and architectural techniques, allow complex signal processing systems to be implemented on a single chip (system on chip, SOC) with low cost and high performance. Examples include multi-core microprocessors, networks on chip and mixed-signal processing chips. These chips often consist of billions of transistors, operate at multi-gigahertz frequencies and require considerable off-chip bandwidth to communicate efficiently with the outside world. To prevent off-chip communication from becoming the bottleneck of overall system performance, the off-chip bandwidth of SOCs keeps growing<sup>[1–3]</sup>.

Moreover, today's personal computers, workstations and servers, routers and switches, consumer electronics and game consoles not only need higher bandwidth to meet the increasing performance demands of new applications, but also increasingly depend on the power efficiency of chip-to-chip communication.

Although increasing both the number of IO pins and the data rate of each IO channel can improve the interconnect bandwidth, in the authors' view the better way to meet the off-chip bandwidth requirement is to increase the bandwidth of each IO and reduce the number of IOs, saving area and power. As the number of IOs decreases, the clock distribution distance for the IOs, the number of impedance matching cells with their current sources, and the corresponding on-chip and off-chip loads also scale down, which in turn benefits both area and power.

In this paper, the design of a low power 10 Gb/s wire-line receiver in a 65 nm CMOS process with a 1 V supply voltage is presented. The proposed receiver architecture and the period calibration technique, which is used to reduce the power dissipated by the receiver clocking circuits (the dominant source of power dissipation in an I/O link), are discussed. The details of the circuit building blocks are also described.

# 2. Architecture

As shown in Fig. 1, a typical point-to-point wire-line transceiver system consists of a transmitter, a channel and a receiver. The receiver includes two main circuit modules: the equalizer and the clock data recovery (CDR) circuit. In this work, both circuits have been improved, and both contribute to the reduction of the total power dissipation.

### 2.1. Channel equalizer

The channel in this paper is an FR4 PCB trace. A physical channel always has some non-ideal effects. For an FR4 PCB trace, these are parasitic resistance loss combined with the skin effect, dielectric loss, reflections due to impedance mismatch, channel coupling and electromagnetic radiation loss. All of them distort the amplitude and phase of the transmitted signal and greatly degrade the signal-to-noise ratio (SNR) at the receiver end.

Equalization can compensate for channel non-idealities such as time domain inter-symbol interference (ISI), frequency dependent loss, dispersion and reflection<sup>[4,5]</sup>. With a channel equalizer, a wideband signal can be transferred at a high data rate through a non-ideal band-limited channel with a high SNR and a low bit error rate (BER).

<sup>\*</sup> Project supported by the State Key Development Program for Basic Research of China (No. 2005CB321600), the National High Technology Development Research and Program of China (No. 2008AA110901), the National Natural Science Foundation of China (Nos. 60801045, 60803029, 60673146, 60603049), and the Beijing Natural Science Foundation (No. 4072024).

<sup>†</sup> Corresponding author. Email: gaoz.mail@gmail.com Received 11 September 2008, revised manuscript received 1 November 2008

© 2009 Chinese Institute of Electronics

Fig. 2. (a) Pulse response of a band limited channel; (b) DFE system.

There are two types of equalizer: linear and nonlinear. A linear equalizer is a feed forward equalizer that uses continuous time or discrete time linear filters to compensate for the channel distortion. The nonlinear equalizer is conventionally a decision feedback equalizer (DFE), which uses previous nonlinear decisions to eliminate the ISI that previously detected symbols cause on the symbol currently being detected. For convenience of circuit implementation, the discrete time linear equalizer is usually placed on the transmitter side, where it is often called a pre-emphasis equalizer. On the receiver side, there are the continuous time linear equalizer (CTLE) and the DFE.

In the frequency domain, let the CTLE have the transfer function CTLE(s) and the channel the transfer function C(s). The high-pass equalizer should compensate for the frequency dependent loss of the low-pass channel, which amounts to flattening the spectrum over the desired transmission bandwidth. Ideally, perfect compensation would give

$$C(s) \times \text{CTLE}(s) = \text{Const},$$
(1)

where the constant $\text{Const} \in (0, 1]$.
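As a rough numerical illustration of Eq. (1), the following sketch assumes a single-pole low-pass channel and a CTLE with one zero and one pole. The pole and zero frequencies, the value of Const, and the function names are all hypothetical and are not taken from the measured FR4 channel or the implemented circuit. A realizable CTLE needs at least one pole above its zero, so the product is flat only over a limited band.

```python
import math

def channel(f_hz, f_pole_hz=1e9):
    """Assumed single-pole low-pass channel C(j*2*pi*f), unity gain at DC."""
    return 1.0 / complex(1.0, f_hz / f_pole_hz)

def ctle(f_hz, const=0.5, f_zero_hz=1e9, f_pole_hz=10e9):
    """Assumed CTLE: zero placed on the channel pole, own pole at 10 GHz."""
    return const * complex(1.0, f_hz / f_zero_hz) / complex(1.0, f_hz / f_pole_hz)

def db(h):
    """Magnitude of a complex gain in dB."""
    return 20.0 * math.log10(abs(h))

nyquist_hz = 5e9  # Nyquist frequency of a 10 Gb/s NRZ signal

# Loss at Nyquist relative to DC, without and with the CTLE.
loss_raw_db = db(channel(nyquist_hz)) - db(channel(0.0))
loss_eq_db = (db(channel(nyquist_hz) * ctle(nyquist_hz))
              - db(channel(0.0) * ctle(0.0)))

print(f"channel only : {loss_raw_db:6.2f} dB at 5 GHz relative to DC")
print(f"channel+CTLE : {loss_eq_db:6.2f} dB at 5 GHz relative to DC")
```

In this toy model the zero cancels the channel pole exactly, so the remaining roll-off comes only from the CTLE's own pole. The DC gain of the cascade equals the chosen Const.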

Figure 2 illustrates the pulse response of a band-limited channel and a DFE system for post-cursor cancellation, where $x\langle n\rangle$ is the received signal, $y\langle n\rangle$ is the voltage signal output of the DFE, $y_d\langle n-1\rangle$ is the data output of the DFE, and $b\langle k\rangle$ for $k = 1, 2, 3, \ldots$ are the DFE feedback coefficients. The magnitude of the pulse response at the end of a symbol interval is normalized to one, and the amounts of ISI contributed to the subsequent symbols are $c\langle i\rangle$ for $i = 1, 2, 3, \ldots$ These values can be obtained from channel measurements with a vector network analyzer (VNA) or a time domain reflectometer (TDR). Channel modeling with EM solvers such as HFSS or ADS can also be used. If the transmitted signal is $s\langle n\rangle$, then the received signal $x\langle n\rangle$ can be expressed as

$$x\langle n\rangle = 1 \times s\langle n\rangle + c\langle 1\rangle \times s\langle n-1\rangle + c\langle 2\rangle \times s\langle n-2\rangle + c\langle 3\rangle \times s\langle n-3\rangle + \cdots$$
(2)



Fig. 3. Conventional CDR: (a) Phase tracking CDR; (b) Phase Picking CDR.

It can be seen from Fig. 2 that the voltage signal output of the DFE, $y\langle n\rangle$, is

$$y\langle n\rangle = x\langle n\rangle - b\langle 1\rangle \times y_d\langle n-1\rangle - b\langle 2\rangle \times y_d\langle n-2\rangle - b\langle 3\rangle \times y_d\langle n-3\rangle - \cdots$$
(3)

Substituting $x\langle n\rangle$ from Eq. (2) into Eq. (3) gives

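$$y\langle n\rangle = s\langle n\rangle + (c\langle 1\rangle \times s\langle n-1\rangle - b\langle 1\rangle \times y_d\langle n-1\rangle) + (c\langle 2\rangle \times s\langle n-2\rangle - b\langle 2\rangle \times y_d\langle n-2\rangle) + (c\langle 3\rangle \times s\langle n-3\rangle - b\langle 3\rangle \times y_d\langle n-3\rangle) + \cdots$$
(4)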

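The optimal DFE coefficients $b\langle k\rangle$ should remove the post-cursor ISI $c\langle k\rangle$ and minimize the error between $y\langle n\rangle$ and $s\langle n\rangle$. The best choice is $b\langle k\rangle = c\langle k\rangle$: as long as the previous decisions are correct, i.e. $y_d\langle n-k\rangle = s\langle n-k\rangle$, every bracketed term in Eq. (4) then cancels and $y\langle n\rangle = s\langle n\rangle$. Finally, the DFE can be represented in the *z*-domain as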

$$Y(z) = X(z) - \sum_{k=1}^{N} b\langle k\rangle \times Y_d(z) \times z^{-k}.$$
(5)
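The post-cursor cancellation described by Eqs. (2) to (4) can be checked with a short behavioural sketch. It is an idealized model, not the implemented circuit: the three post-cursor values in `c`, the PRBS7 pattern, the ideal sign slicer and all function names are assumptions made for illustration.

```python
def prbs7(n_bits, state=0x7F):
    """Generate n_bits of a PRBS7 pattern (x^7 + x^6 + 1) as +1/-1 symbols."""
    symbols = []
    for _ in range(n_bits):
        new_bit = ((state >> 6) ^ (state >> 5)) & 1
        state = ((state << 1) | new_bit) & 0x7F
        symbols.append(1 if new_bit else -1)
    return symbols

def channel(s, c):
    """Eq. (2): x<n> = s<n> + sum_i c<i> * s<n-i>, main cursor normalized to 1."""
    x = []
    for n in range(len(s)):
        acc = s[n]
        for i, c_i in enumerate(c, start=1):
            if n - i >= 0:
                acc += c_i * s[n - i]
        x.append(acc)
    return x

def dfe(x, b):
    """Eq. (3): y<n> = x<n> - sum_k b<k> * y_d<n-k>, with an ideal sign slicer."""
    y, y_d = [], []
    for n in range(len(x)):
        acc = x[n]
        for k, b_k in enumerate(b, start=1):
            if n - k >= 0:
                acc -= b_k * y_d[n - k]
        y.append(acc)
        y_d.append(1 if acc >= 0 else -1)
    return y, y_d

c = [0.5, 0.25, 0.125]      # hypothetical post-cursor ISI c<1..3>
s = prbs7(254)              # transmitted symbols s<n>
x = channel(s, c)           # received signal x<n>
y, y_d = dfe(x, b=c)        # DFE with b<k> = c<k>

print("worst-case |x<n>|:", min(abs(v) for v in x))
print("worst-case |y<n>|:", min(abs(v) for v in y))
```

With `b` equal to `c` and correct decisions, every bracketed term of Eq. (4) is zero, so `y` reproduces `s` and the worst-case amplitude at the slicer input returns to the normalized main cursor value.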

A distinctive feature of the equalizer circuit in this work is that it combines the advantages of both the CTLE and the DFE. A low power wideband programmable CTLE and a third-order DFE are implemented in a shared circuit, which saves both area and power. The details of these circuits are described in Section 3.

### 2.2. CDR

As shown in Fig. 3, conventional CDR architectures fall into two types: the phase tracking CDR<sup>[6–9]</sup> and the phase picking CDR<sup>[10, 11]</sup>. In the phase tracking CDR, the phase detector (PD) continuously measures the phase relation between the incoming data (data) and the sample clock (cks). The resulting phase error signal is fed to a filter (analog or digital), whose main function is to generate the phase calibration control signal that adjusts the phase of cks; cks can be provided by a VCO, a VCDL or a phase rotator. In the phase picking CDR, cks oversamples every data bit. The following digital decimation filter uses the oversampled information to locate the data transitions and then decimates the data. The main drawback of both configurations is that the CDR circuits must perform phase calibration or oversampling all the time; otherwise they lose the synchronization between the clock and the data. This continuous running mode can hardly achieve good power efficiency.

Fig. 4. (a) Proposed half rate period calibration CDR; (b) Period calibration timing waveforms.

Fig. 5. CTLE/DFE combiner.

Figure 4 shows the block diagram of the proposed half rate period calibration CDR. The CDR loop is in full operation only during the calibration period (calibrate\_en = 1). With calibrate\_en = 0, the synchronization relation between the clock and the data is held in the CDR's state DFFs, while many circuits are disabled and put into a sleep state, so the average power dissipation of the CDR loop is reduced. A novel digital PD with memory holds the latest history of the effective phase results and prevents it from being modified by useless incoming phase information. The CDR's digital controller (CTR\_CDR) consists of three parts: a sub-sampler, a majority vote and a digital filter. The sub-sampler decimates the detected phase information and effectively suppresses metastability. The majority vote removes abrupt false phase information. Finally, the digital filter sets the bandwidth of the CDR loop and generates the phase state control word: phase\_sel (selx) and phase\_weight (w). The PD, CTR\_CDR, phase selection and phase interpolator/phase mixer can each be shut down by their enable signals. The calibrate\_en signal is a low frequency periodic signal that is triggered by a pseudo bit error signal. The pseudo bit error can be obtained as the XOR of the output of the sampler clocked by cks0 and the output of a sampler clocked by cks0 with a small phase offset. The accumulator, DFFs and NAND logic generate the calibrate\_en signal. The DFFs are reset to zero asynchronously when a pseudo bit error occurs. Alternatively, calibrate\_en can be reserved as an interface signal driven by the microprocessor's protocol controller, which calculates the recent bit error rate or packet error rate and then generates the calibrate\_en waveform. Together these form an adaptive half rate period calibration CDR.
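The chain of sub-sampler, majority vote and digital filter, gated by calibrate\_en, can be sketched behaviourally as follows. This is only an illustrative model under stated assumptions: the PD result is encoded as +1, -1 or 0, the digital filter is modelled as a simple up/down counter with a threshold, and the class name, parameters and default values are hypothetical rather than taken from the chip.

```python
def majority_vote(samples):
    """Return +1, -1 or 0 according to the sign of the summed PD results."""
    total = sum(samples)
    return (total > 0) - (total < 0)

class PeriodCalibrationController:
    """Hypothetical behavioural stand-in for CTR_CDR."""

    def __init__(self, n_phases=64, subsample=2, window=5, threshold=4):
        self.n_phases = n_phases    # number of phase positions
        self.subsample = subsample  # keep one PD result out of `subsample`
        self.window = window        # PD results per majority vote
        self.threshold = threshold  # votes needed before the phase moves
        self.phase = 0              # phase state control word (as an index)
        self._tick = 0
        self._window = []
        self._count = 0

    def step(self, pd_result, calibrate_en):
        """Process one PD result; return the current phase index."""
        if not calibrate_en:
            return self.phase       # sleep state: clock gated, phase held
        self._tick += 1
        if self._tick % self.subsample:
            return self.phase       # sub-sampler drops this PD result
        self._window.append(pd_result)
        if len(self._window) < self.window:
            return self.phase
        self._count += majority_vote(self._window)
        self._window = []
        if abs(self._count) >= self.threshold:
            step = 1 if self._count > 0 else -1
            self.phase = (self.phase + step) % self.n_phases
            self._count = 0
        return self.phase

ctrl = PeriodCalibrationController(subsample=1, window=3, threshold=2)
for _ in range(100):
    ctrl.step(+1, calibrate_en=False)
held = ctrl.phase                    # unchanged while calibrate_en = 0
for pd in (+1, -1, +1, +1, +1, -1):  # two windows, one false result in each
    ctrl.step(pd, calibrate_en=True)
moved = ctrl.phase                   # advanced by one position
```

In the usage lines, the isolated -1 results are outvoted inside each window, which mirrors the stated purpose of the majority vote. The phase index changes only while calibrate\_en is asserted.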

# 3. Circuit implementation

### 3.1. CTLE/DFE combiner

Combining the CTLE and the DFE obtains the advantages of both equalizers. To save area and power, a programmable CTLE and a third-order DFE are implemented in a shared circuit, as shown in Fig. 5.







The *RC* combination introduces a real zero at $1/RC$ in the transfer function, which can provide gain peaking at high frequency to compensate for the channel loss. Inductive peaking extends the combiner bandwidth up to 10 GHz with almost 40% power reduction. Because the middle metal layers have a higher sheet resistance than the top metal layer, the very low *Q* inductor is laid out as six turns in the M5-M4 metal layers and occupies a small area of $60 \times 60\ \mu\text{m}^2$. The gain-peaking value can be set continuously by the analog control signal NEQ. The three DFE coefficients are programmable through three current source decoders: LDEC0, LDEC1 and LDEC2.

### 3.2. PLL

The block diagram of the PLL used to supply 5 GHz clocks to the CDR loop is shown in Fig. 6. Thanks to the CDR loop's dynamic phase calibration, the CDR itself can compensate and suppress the low frequency input clock jitter.

A simple ring VCO is chosen. Compared with a high *Q* LC resonant VCO, it saves area and the power consumed in metal interconnect. The PLL employs a digital band-select VCO. This technique is widely used in LC VCOs to extend the PLL's frequency tuning range, but the motivation here is to reduce the VCO gain. A low VCO gain suppresses the reference spur induced by mismatch in the PFD or charge pump and reduces the effect of other noise sources that couple to the VCO. The band selection FSM monitors the dummy VCO control voltage and switches automatically to an appropriate VCO band by generating the corresponding thermometer-coded band select signal $b\langle 14{:}0\rangle$.

Figure 7 shows the ring VCO circuit in detail. The oscillator generates four clock phases, ck0, ck90, ck180 and ck270, which feed the CDR. A hard-wired capacitive phase trim is added to cancel the IQ phase offset that may be induced by mismatch among devices or clock distribution paths.



Fig. 8. Digital PD in proposed half rate period calibration CDR.



Fig. 9. Phase interpolator/mixer.

### 3.3. CDR loop circuits

A special digital PD used only in the proposed CDR loop is shown in Fig. 8. The circuit is implemented in CML, and its output is converted to CMOS logic levels by a current mirror amplifier. With the XOR encoder logic and the local MUX feedback, it holds the latest history of the effective phase results and prevents it from being modified by useless incoming phase information. This memory capability is important for the digital CDR loop. The recovered data bits $y_d\langle n-1\rangle$ and $y_d\langle n-2\rangle$ are fed back to the CTLE/DFE combiner circuit.

Four simple CML multiplexers form the phase selection and send the required clocks to the phase interpolator/mixer. Figure 9 shows the circuit of the phase interpolator. It acts as a digital-to-time converter and is driven by two pairs of differential clocks: (n) cka and (n) ckb. It interpolates between them and generates 16 phase positions. Together with the phase selection, it can generate a total of 64 phase positions over a 360 degree circle, i.e. 2 unit intervals (UI). The phase position is set by a current-steering DAC with the corresponding thermometer weight code $w\langle 14{:}0\rangle$. The 15 current-steering DAC cells are not uniform: the current cells are sized non-uniformly to give the most linear relationship between $w\langle 14{:}0\rangle$ and the output phase.
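The relation between a phase position and the control word can be sketched as below. The split of a position index into a quadrant select and a 15-bit thermometer weight is an assumption made for illustration; the text does not state how phase\_sel is encoded, and the function names are hypothetical. The sketch covers only the digital code, not the non-uniform analog cell weights.

```python
def thermometer(weight, n_bits=15):
    """Thermometer code w<14:0> as a list [w0, ..., w14] with `weight` ones."""
    if not 0 <= weight <= n_bits:
        raise ValueError("weight must be between 0 and n_bits")
    return [1] * weight + [0] * (n_bits - weight)

def decode_phase(position, steps_per_quadrant=16, n_quadrants=4):
    """Split a phase position into (quadrant select, thermometer weight code)."""
    position %= steps_per_quadrant * n_quadrants   # wrap around the circle
    sel, weight = divmod(position, steps_per_quadrant)
    return sel, thermometer(weight)

sel, w = decode_phase(17)
print(sel, sum(w))   # quadrant index and number of DAC cells switched on
```

Fifteen thermometer bits give sixteen weight levels, which is consistent with the 16 interpolated positions per selected clock pair.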

The CDR's digital controller is implemented with static CMOS standard cells. The sub-sampler is a chain of series-connected DFFs. The majority vote circuit consists of a carry save adder and compare logic. The programmable ring counter and the phase control FSM form the digital filter, which generates the phase state control word: phase\_sel (sela, selb) and phase\_weight ($w\langle 14{:}0\rangle$). The phase state control word holds the latest optimal phase position of the sample clock; it controls the phase selection and the phase interpolator so that correct data recovery is maintained in the sleep state. In the calibration period, the phase state control word is updated according to the incoming PD output. In the sleep state, the digital controller consumes no dynamic power because its clock CK\_CTR is gated by the calibrate\_en signal. The test chip of the receiver occupies an area of $300 \times 500\ \mu\text{m}^2$ and is shown in Fig. 10.

Fig. 10. Receiver chip layout.

# 4. Design results

To validate the designed receiver, a complete test bench has been built. It includes a pre-emphasis transmitter, an FR4 test board and the designed receiver. The programmable pre-emphasis transmitter sends a pseudo random bit sequence to the designed receiver through the 20 cm differential FR4 PCB trace. The transmitter and receiver are characterized by post-layout data. The 20 cm differential FR4 channel is characterized by its measured *S* parameters.

Table 1. Design summary.

| Parameter               | Ref. [12]   | Ref. [13]   | Ref. [14]  | This work                                                            |
|-------------------------|-------------|-------------|------------|----------------------------------------------------------------------|
| Technology              | 130 nm CMOS | 130 nm CMOS | 90 nm CMOS | 65 nm CMOS                                                           |
| Power supply (V)        | 1.5         | 1.2         | 1          | 1                                                                    |
| Data rate (Gb/s)        | 12.5        | 10          | 10         | 10                                                                   |
| Channel equalizer       | None        | DFE         | DFE        | CTLE+DFE                                                             |
| Power (mW)              | 400         | 250         | 130        | Total: 52 (core: 39 = CTLE+DFE: 10 + CDR: 29; test I/O driver path: 13) |
| Area (mm<sup>2</sup>)   | 1.1         | _           | 0.4        | 0.15                                                                 |

Fig. 11. Test board measured by VNA.

Fig. 12. Measurement results of time and frequency domain response.

The manufactured FR4 test board is shown in Fig. 11 and is measured with an Agilent VNA. The time domain pulse response and the frequency domain transfer and reflection responses of this 20 cm differential signaling channel are shown in Fig. 12.

Fig. 13. Eye-diagram of the receiver end: (a) Without the equalizer; (b) With the equalizer.

Figure 13 shows the eye diagram of the designed receiver with this FR4 PCB trace (about 15 dB loss at 5 GHz) and the transmitter. Without the equalizer (Fig. 13(a)), the eye at the receiver end is closed. When the equalizer is enabled (Fig. 13(b)), the eye is reopened and most of the ISI at the receiver end is removed.

Figure 14 shows that the phase of the CDR's sample clock is highly linear: the nonlinearity error is less than 0.4 ps and the minimum phase step is 3.33 ps.

Figure 15 shows the receiver's bathtub curve (BER versus sampling phase position) characterizing the designed receiver's performance at 10 Gb/s.

When the CDR loop operates in the period calibration mode (with a calibration duty cycle of about 10%), it consumes only about 29 mW on average, compared with about 50 mW in the continuous calibration mode. This is a 42% power reduction in the CDR loop. Comparisons with previously published receivers<sup>[12–14]</sup> are summarized in Table 1. Owing to the novel CDR loop and the optimized equalizer circuits, this work achieves comparatively better power efficiency at a similar 10 Gb/s data rate.

Fig. 14. Precision and linearity of the CDR's sample clock.

Fig. 15. Receiver's bathtub curve.
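The power figures quoted above can be cross-checked with simple arithmetic. The sketch below takes the CDR power numbers and the Table 1 entries at face value; the energy per bit metric (power divided by data rate) is added here for illustration and is not a figure reported in the paper, and the compared designs do not all include the same blocks.

```python
# CDR loop power in the two operating modes, in mW (values from the text).
cdr_continuous_mw = 50.0
cdr_period_cal_mw = 29.0
cdr_saving = (cdr_continuous_mw - cdr_period_cal_mw) / cdr_continuous_mw

# (total power in mW, data rate in Gb/s), copied from Table 1.
designs = {
    "Ref. [12]": (400.0, 12.5),
    "Ref. [13]": (250.0, 10.0),
    "Ref. [14]": (130.0, 10.0),
    "This work": (52.0, 10.0),
}

# mW divided by Gb/s gives pJ/bit.
energy_pj_per_bit = {name: p_mw / rate for name, (p_mw, rate) in designs.items()}

print(f"CDR power reduction: {cdr_saving:.0%}")
for name, energy in energy_pj_per_bit.items():
    print(f"{name}: {energy:.1f} pJ/bit")
```

On these numbers the 42% CDR saving is reproduced, and this work has the lowest energy per bit of the four entries.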

# 5. Conclusions

A low power 10 Gb/s I/O link receiver suitable for many chip-to-chip applications is presented. In contrast to conventional designs, a low power wideband programmable CTLE and a third-order DFE are implemented in a shared circuit, which saves both area and power. To reduce the average power dissipation of the CDR loop, a novel half rate period calibration CDR is proposed.

# References

- [1] Dorsey J, Searles S, Ciraula M, et al. An integrated quad-core Opteron<sup>TM</sup> processor. ISSCC Dig Tech Papers, 2007: 102
- [2] Konstadinidis G, Rashid M, Rashid M, et al. Implementation of a third-generation 16-core 32 thread chip-multithreading SPARC<sup>TM</sup> processor. ISSCC Dig Tech Papers, 2008: 84
- [3] Stackhouse B, Cherkauer B, Gowan M, et al. A 65nm 2-billion-transistor quad-core Itanium<sup>TM</sup> processor. ISSCC Dig Tech Papers, 2008: 92
- [4] Proakis J G. Digital communications. 4th ed. McGraw-Hill, 2001
- [5] Dally W J, Poulton J W. Digital systems engineering. Cambridge University Press, 1998
- [6] Rau M, Oberst T, Lares R, et al. Clock/data recovery PLL using half-frequency clock. IEEE J Solid-State Circuits, 1997, 32(7): 1156
- [7] Cao J, Green M, Momtaz A, et al. OC-192 transmitter and receiver in standard 0.18-μm CMOS. IEEE J Solid-State Circuits, 2002, 37(12): 1768
- [8] Sidiropoulos S, Horowitz M A. A semidigital dual delay-locked loop. IEEE J Solid-State Circuits, 1997, 32(11): 1683
- [9] Takauchi H, Tamura H, Matsubara S. A CMOS multichannel 10-Gb/s transceiver. IEEE J Solid-State Circuits, 2003, 38(12): 2094
- [10] Yang C K K, Horowitz M A. A 0.8-μm CMOS 2.5 Gb/s oversampling receiver and transmitter for serial links. IEEE J Solid-State Circuits, 1996, 31(12): 2015
- [11] Yang C K K, Farjad-Rad R, Horowitz M A. A 0.5-μm CMOS 4.0-Gbit/s serial link transceiver with data recovery using oversampling. IEEE J Solid-State Circuits, 1998, 33(5): 713
- [12] Ohtomo Y, Nishimura K, Nogawa M. A 12.5-Gb/s parallel phase detection clock and data recovery circuit in 0.13-μm CMOS. IEEE J Solid-State Circuits, 2006, 41(9): 2052
- [13] Momtaz A, Chung D, Kocaman N, et al. A fully integrated 10 Gbps receiver with adaptive optical dispersion equalizer in 0.13-μm CMOS. IEEE Symposium on VLSI Circuits Dig Tech Paper, 2006
- [14] Bulzacchelli J F, Meghelli M, Rylov S V, et al. A 10-Gb/s 5-tap DFE/4-tap FFE transceiver in 90-nm CMOS technology. IEEE J Solid-State Circuits, 2006, 41(12): 2885