# A 10–20 Gb/s PAM2-4 transceiver in 65 nm CMOS

Gao Zhuo(高茁)<sup>1,2,†</sup>, Yang Yi(杨袆)<sup>1,2</sup>, Zhong Shiqiang(钟石强)<sup>1</sup>, Yang Xu(杨旭)<sup>1</sup>, Huang Lingyi(黄令仪)<sup>1</sup>, and Hu Weiwu(胡伟武)<sup>1</sup>

(1 Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100083, China)
 (2 Graduate University of the Chinese Academy of Sciences, Beijing 100049, China)

Abstract: This paper presents the design of a 10 Gb/s PAM2, 20 Gb/s PAM4 high-speed, low-power wire-line transceiver equalizer in a 65 nm CMOS process with a 1 V supply voltage. The transmitter occupies $430 \times 240 \,\mu\text{m}^2$ and consumes 50.56 mW. With its programmable five-tap pre-emphasis equalizer, the transmitter can compensate for a wide range of channel loss and send a signal with an adjustable voltage swing. The receiver equalizer occupies $146 \times 186 \,\mu\text{m}^2$ and consumes 5.3 mW.

Key words: serial link; PAM; equalizer; pre-emphasis; CTLE

DOI: 10.1088/1674-4926/30/1/015004

EEACC: 6150D; 1280

### 1. Introduction

Advances in integrated circuit fabrication technology, together with innovative circuit and architectural techniques, have made it possible to implement highly complicated signal processing systems in a single chip (a system on chip, or SOC) with low cost and high performance, such as multi-core microprocessors, networks on chip and mixed-signal processing chips. These chips often consist of billions of transistors, operate at multi-gigahertz frequencies and require considerable off-chip bandwidth to communicate efficiently with the outside world. To keep off-chip communication from becoming a bottleneck to overall system performance, the SOC's off-chip bandwidth keeps growing<sup>[1-3]</sup>. Both increasing the number of IO pins and increasing the data rate of each IO channel can improve the interconnect bandwidth. From the authors' point of view, the better way to meet the off-chip bandwidth requirement is to increase the bandwidth of each IO and reduce the number of IOs as much as possible, saving area and power. As the number of IOs decreases, the clock distribution distance for the IOs, the number of impedance matching cells combined with current sources, and the corresponding on-chip and off-chip loading scale down accordingly, which in turn benefits both area and power. In this paper, the design of 10-20 Gb/s wire-line transceiver equalizers in a 65 nm CMOS process with a 1 V supply voltage is presented. The modeling methodology is discussed and the FR4 printed circuit board (PCB) channel is analyzed. The proposed transceiver and the PAM2 and PAM4 signaling schemes are analyzed, and the details of the circuit building blocks are described.

### 2. Channel modeling and analysis

A typical point-to-point interconnect transceiver system consists of a transmitter, a channel and a receiver, as shown in Fig.1. The channel in this paper is an FR4 PCB trace. Unfortunately, a physical channel always has some non-ideal effects. For our FR4 PCB trace, these are: parasitic resistance loss combined with the skin effect, dielectric loss, reflections caused by impedance mismatch, channel coupling and electromagnetic radiation loss. All of these distort the amplitude and phase of the transferred signal and greatly degrade the signal-to-noise ratio (SNR) at the receiver end.

To transfer the signal effectively, we must analyze the signal distortion induced by the channel's non-ideal effects. So first, an accurate model of the interconnect channel should be extracted and characterized.

Figure 2 shows a channel modeling methodology. There are two distinct paths for modeling. The first path relies on the layout with channel parameters and EM solvers such as HFSS and ADS. The second path relies on measurements using a vector network analyzer (VNA) or time domain reflectometer (TDR). Usually, both generate circuit models described by *S* parameters. To ensure accuracy, it is preferable to correlate the EM solver-based *S* parameter model with the measurement-based *S* parameter model, so that a consistent *S* parameter model is generated after some iterations. Finally, after mathematical optimization and fitting, the *S* parameter model is fed to the circuit-level simulators. Some circuit-level simulators such as SPECTRE can use the *S* parameters directly, so the error-prone conversion process can be avoided.

### 3. Equalizer and PAM2, PAM4 signaling

Equalization can compensate for non-ideal channel effects such as time domain inter-symbol interference (ISI), frequency-dependent loss, dispersion and reflection<sup>[4,5]</sup>. With a channel equalizer, a wideband signal can be transferred through the non-ideal, band-limited channel at a high data rate with high SNR and a low bit error rate (BER).

<sup>†</sup> Corresponding author. Email: gaoz.mail@gmail.com Received 18 July 2008, revised manuscript received 14 August 2008



Fig.2. Channel modeling methodology.

There are two types of equalizers: linear and nonlinear. The linear equalizer is a feed forward equalizer that uses continuous time or discrete time linear filters to compensate for the channel distortion. The nonlinear equalizer is conventionally a decision feedback equalizer (DFE) that uses previous nonlinear decisions to eliminate the ISI that previously detected symbols cause on the symbol currently being detected. Usually, for ease of circuit implementation, the discrete time linear equalizer is located on the transmitter side and is often called the pre-emphasis equalizer. On the receiver side, there are a continuous time linear equalizer (CTLE) and a DFE. The transmitter pre-emphasis can remove both the pre-cursor ISI and the post-cursor ISI simultaneously. However, the receiver CTLE and DFE can only remove the post-cursor ISI, because the receiver has no knowledge of the symbols that follow the one currently being detected.

The equalized discrete channel's pulse response sequence, $\{y\langle n\rangle \mid n = \ldots, -2, -1, 0, 1, 2, \ldots\}$, can be expressed in terms of the discrete channel's pulse response sequence, $\{c\langle i\rangle \mid i = \ldots, -2, -1, 0, 1, 2, \ldots\}$, and the discrete time equalizer tap coefficients, $\{a\langle k\rangle \mid k = \ldots, -2, -1, 0, 1, 2, \ldots\}$, as

$$y\langle n\rangle = \sum_{k=-\infty}^{+\infty} a\langle k\rangle \times c\langle n-k\rangle. \tag{1}$$

Equation (1) can be written in matrix form as

$$Y = CA. \tag{2}$$
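As a minimal sketch of how Eq. (1) turns into the matrix form of Eq. (2), the Python below builds the convolution (Toeplitz) matrix $C$ for a finite pulse response and checks it against a term-by-term evaluation of the sum. The function names, the example pulse response and the tap values are hypothetical and chosen only for illustration; indices are shifted to start at 0 instead of running over negative values.

```python
def convolution_matrix(c, num_taps):
    """Build the Toeplitz matrix C so that Y = C A reproduces Eq. (1).

    c is a finite channel pulse response (taken as zero outside its
    support) and num_taps is the number of equalizer taps. Row n,
    column k holds c[n - k], or 0 when that index is out of range.
    """
    rows = len(c) + num_taps - 1
    return [[c[n - k] if 0 <= n - k < len(c) else 0.0
             for k in range(num_taps)]
            for n in range(rows)]


def mat_vec(matrix, vector):
    """Plain matrix-vector product, written out to stay dependency-free."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def direct_convolution(a, c):
    """Evaluate the sum in Eq. (1) term by term for finite sequences."""
    y = [0.0] * (len(a) + len(c) - 1)
    for k, a_k in enumerate(a):
        for i, c_i in enumerate(c):
            y[k + i] += a_k * c_i
    return y


# Hypothetical pulse response and tap values (illustration only).
c_example = [0.1, 0.6, 0.3, 0.15, 0.05]
a_example = [-0.1, 1.0, -0.3]

y_matrix = mat_vec(convolution_matrix(c_example, len(a_example)), a_example)
y_direct = direct_convolution(a_example, c_example)
```

Both routes give the same equalized pulse response, which is why the tap optimization can be treated as an ordinary linear algebra problem.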

The optimal tap coefficients should minimize the error between Y and input pulse sequence P. In the minimum mean square error (MMSE) criterion sense, we define:

$$||E||^{2} = ||Y - P||^{2} = ||CA - P||^{2}. \tag{3}$$

Fig.3. PAM2 and PAM4 signaling.

To achieve the MMSE solution (i.e. the minimum $||E||^2$ solution), some matrix operations are carried out, and the optimal tap coefficient vector $A$ is given by

$$A = R_{\rm CC}^{-1} R_{\rm CP},\tag{4}$$

where  $R_{CC} = C^{T}C$  is called the autocorrelation matrix of vector *C* and  $R_{CP} = C^{T}P$  is called the cross-correlation vector between *C* and *P*.
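Equation (4) is the standard least-squares result: setting the gradient of $||CA - P||^2$ with respect to $A$ to zero gives $C^{T}CA = C^{T}P$. The following is a minimal, dependency-free sketch of that calculation for a five-tap equalizer. The channel pulse response, the target pulse position and all function names are assumptions made for illustration and are not taken from the measured channel; practical limits such as the transmitter's peak swing are not modeled.

```python
def convolution_matrix(c, num_taps):
    """Toeplitz matrix C whose product with the tap vector gives Eq. (1)."""
    rows = len(c) + num_taps - 1
    return [[c[n - k] if 0 <= n - k < len(c) else 0.0
             for k in range(num_taps)]
            for n in range(rows)]


def solve_linear(matrix, rhs):
    """Solve matrix x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [list(row) + [b] for row, b in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for j in range(col, n + 1):
                aug[r][j] -= factor * aug[col][j]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(aug[r][j] * x[j] for j in range(r + 1, n))
        x[r] = (aug[r][n] - tail) / aug[r][r]
    return x


def mmse_taps(c, num_taps, target):
    """Return A = R_CC^{-1} R_CP with R_CC = C^T C and R_CP = C^T P."""
    big_c = convolution_matrix(c, num_taps)
    r_cc = [[sum(row[i] * row[j] for row in big_c) for j in range(num_taps)]
            for i in range(num_taps)]
    r_cp = [sum(row[i] * p for row, p in zip(big_c, target))
            for i in range(num_taps)]
    return solve_linear(r_cc, r_cp)


def squared_error(c, taps, target):
    """Evaluate ||CA - P||^2 from Eq. (3)."""
    big_c = convolution_matrix(c, len(taps))
    y = [sum(m * a for m, a in zip(row, taps)) for row in big_c]
    return sum((y_n - p_n) ** 2 for y_n, p_n in zip(y, target))


# Hypothetical channel: one pre-cursor, a main cursor, three post-cursors.
c_example = [0.1, 0.6, 0.3, 0.15, 0.05]
NUM_TAPS = 5  # one pre-cursor tap, the main tap, three post-cursor taps

# Desired response P: a unit pulse at the combined main-cursor position
# (channel cursor index 1 plus equalizer main-tap index 1).
p_example = [0.0] * (len(c_example) + NUM_TAPS - 1)
p_example[2] = 1.0

taps = mmse_taps(c_example, NUM_TAPS, p_example)
err_equalized = squared_error(c_example, taps, p_example)
err_unequalized = squared_error(c_example, [0.0, 1.0, 0.0, 0.0, 0.0], p_example)
```

Because the unequalized setting (main tap only) is one of the candidate tap vectors, the MMSE solution can never have a larger squared error than it.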

In the frequency domain, the channel has the transfer function $C(s)$ and the continuous time equalizer has the transfer function $EQ(s)$. The high-pass equalizer should compensate for the frequency-dependent loss of the low-pass channel. This amounts to flattening the spectrum over the desired transmission bandwidth. Ideally, perfect compensation satisfies

$$C(s) \times EQ(s) = \text{Const}, \tag{5}$$

where constant Const  $\in (0,1]$ .

In this work, a five-tap pre-emphasis has been implemented in the transmitter. Its five tap coefficients $a\langle -1\rangle$, $a\langle 0\rangle$, $a\langle 1\rangle$, $a\langle 2\rangle$ and $a\langle 3\rangle$ are chosen in the MMSE sense. The pre-cursor ISI is removed by $a\langle -1\rangle$, and the post-cursor ISI is removed by $a\langle 1\rangle$, $a\langle 2\rangle$ and $a\langle 3\rangle$. The receiver equalizer is a CTLE, which compensates for the channel loss independently in continuous time.

The modulation scheme used in chip-to-chip wired electrical links is pulse amplitude modulation (PAM). This choice is primarily due to its good use of the available bandwidth of a wire-line electrical channel. Typical binary signaling is essentially PAM2, as shown in Fig.3 (a). With PAM4, the channel bandwidth efficiency is doubled, which means we transmit twice the data with the same sampling rate and signal bandwidth. As Fig.3 (b) shows, PAM4 signaling requires a DAC to map every 2 bits into 4 signaling levels. The price of PAM4 is relatively complicated circuitry. To get the benefits of both PAM2 and PAM4 signaling, our transmitter and equalizers can work in either PAM2 or PAM4 mode.
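The relation between bits, symbols and data rate can be sketched in a few lines of Python. The plain binary level ordering and the function names are assumptions for illustration; the 10 GBaud symbol rate follows from the 10 Gb/s PAM2 and 20 Gb/s PAM4 figures quoted in this paper.

```python
def pam_symbols(bits, bits_per_symbol):
    """Group a bit list into symbols and map each group to a level index.

    bits_per_symbol = 1 gives PAM2 (levels 0 and 1);
    bits_per_symbol = 2 gives PAM4 (levels 0 to 3).
    Plain binary weighting is assumed for the level ordering.
    """
    if len(bits) % bits_per_symbol:
        raise ValueError("bit count must be a multiple of bits_per_symbol")
    symbols = []
    for i in range(0, len(bits), bits_per_symbol):
        level = 0
        for bit in bits[i:i + bits_per_symbol]:
            level = (level << 1) | bit
        symbols.append(level)
    return symbols


def data_rate_gbps(symbol_rate_gbaud, bits_per_symbol):
    """Data rate carried at a given symbol rate."""
    return symbol_rate_gbaud * bits_per_symbol


bits = [1, 0, 0, 1, 1, 1, 0, 0]
pam2 = pam_symbols(bits, 1)  # one symbol per bit
pam4 = pam_symbols(bits, 2)  # one symbol per bit pair
```

The same eight bits need eight PAM2 symbols but only four PAM4 symbols, so at a fixed symbol rate PAM4 carries twice the data.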

### 4. Circuit implementation

#### 4.1. Transmitter

The top-level block diagram of the transmitter is shown in Fig.4. In PAM4 mode, the PRBS is clocked by a 5 GHz clock and generates four parallel 5 Gb/s pseudo-random bit sequences. The bit sequences are grouped in pairs, and each pair is fed to a binary to thermometer encoder (B2T). Each B2T produces a 10 Gb/s thermometer code sequence. The thermometer code switches monotonically, so the DAC output glitch can be reduced greatly by using a thermometer code input. The two parallel 10 Gb/s thermometer code sequences are retimed by the differential clocks CK and NCK, and form five pairs of delayed sequences {D0, D1}, {D0\_X1, D1\_X1}, {D0\_X2, D1\_X2}, {D0\_X3, D1\_X3}, {D0\_X4, D1\_X4}, which are sent to five mux2\_banks. The polarity of each pre-emphasis coefficient is set by sign <4:0>. Synchronized by CK and NCK, each mux2\_bank interleaves two parallel 10 Gb/s delayed sequences, such as {D0, D1}, into a serial 20 Gb/s sequence that serves as one of the pre-emphasis inputs. In effect, the transmitter sends data at double data rate (DDR).
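The two steps described above, thermometer encoding and 2:1 interleaving, can be sketched behaviorally as follows. This is only an illustration: the function names, the choice of which input bit is the MSB, and the ordering of the two clock phases are assumptions, not details taken from the circuit.

```python
def binary_to_thermometer(msb, lsb):
    """Encode a 2-bit value as a 3-bit thermometer code (D<0>, D<1>, D<2>).

    A value v sets the lowest v outputs to 1, so moving between adjacent
    values flips exactly one output (monotonic switching).
    """
    value = (msb << 1) | lsb
    return tuple(1 if value > i else 0 for i in range(3))


def mux2_interleave(stream_d0, stream_d1):
    """Interleave two half-rate streams into one full-rate stream (DDR)."""
    serial = []
    for first, second in zip(stream_d0, stream_d1):
        serial.append(first)   # assumed to be sent on the CK phase
        serial.append(second)  # assumed to be sent on the NCK phase
    return serial


codes = [binary_to_thermometer(m, l) for m in (0, 1) for l in (0, 1)]
levels = [sum(code) for code in codes]  # number of active DAC branches
serial = mux2_interleave(["a0", "a1"], ["b0", "b1"])
```

Summing the thermometer bits recovers the level index, which is why each bit can drive one equally weighted current branch in the DAC.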



Fig.7. LC resonant clock distribution.





Fig.9. Test board measured by VNA.

When the transmitter switches to PAM2 mode, each B2T has only one effective 5 Gb/s input bit sequence, and accordingly it produces just a 5 Gb/s thermometer code sequence. The two parallel 5 Gb/s thermometer code sequences are then retimed by the differential clocks CK and NCK. As in PAM4 mode, serial 10 Gb/s sequences are finally produced and fed to the pre-emphasis.

Five current-steering DACs make up the five-tap pre-emphasis, with five digitally programmable tap coefficients: a <-1><4:0>, a <0><2:0>, a <1><4:0>, a <2><4:0> and a <3><4:0>.

Fig.11. Post-layout simulated eye-diagram results.

Each DAC's current bank can be programmed by its En signals. In this way, the transmitter can set suitable equalizer coefficients for an arbitrary channel, and adjust the normal output voltage amplitude for power saving. The DAC\_Main, which sends the main-cursor sequence, is illustrated in Fig.5. The current source value associated with each bit (D <i>, i = 0, 1, 2) is controlled by 3 bits, En<2:0>. The current adjusting step is 0.8 mA, and the adjusting range is 0 mA to $0.8 \times 7 = 5.6$ mA. So the largest output current of DAC\_Main is $5.6 \times 3 = 16.8$ mA, and the voltage amplitude can be adjusted from 0 mV to $25 \times 16.8 = 420$ mV (double 50 $\Omega$ termination) with a step of $0.8 \times 3 \times 25 = 60$ mV. All the other DACs, which remove the pre-cursor and post-cursor ISI, are identical and are called DAC\_ISI. Apart from having more control bits for finer current adjustment, the architecture of DAC\_ISI is the same as that of DAC\_Main. The current source value associated with each bit (D <i>, i = 0, 1, 2) is controlled by 5 bits, En<4:0>. The current adjusting step is 0.08 mA, and the adjusting range is 0 mA to $0.08 \times 31 = 2.48$ mA. So the largest output current of DAC\_ISI is $2.48 \times 3 = 7.44$ mA, and the voltage amplitude can be adjusted from 0 mV to $25 \times 7.44 = 186$ mV with a step of $0.08 \times 3 \times 25 = 6$ mV.
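The DAC range arithmetic above can be reproduced with a short calculation. The sketch assumes three equally weighted current branches (one per thermometer bit) driving a 25 $\Omega$ load, as implied by the double 50 $\Omega$ termination; the function and key names are hypothetical.

```python
TERMINATION_OHMS = 25.0  # double 50-ohm termination (two 50-ohm loads in parallel)
BRANCHES = 3             # one current branch per thermometer bit D<0>, D<1>, D<2>


def dac_range(step_ma, control_bits):
    """Return the current and voltage range of one pre-emphasis DAC.

    step_ma is the per-branch current step in mA and control_bits is the
    width of the En control word. Currents are in mA, voltages in mV
    (mA times ohm gives mV).
    """
    max_code = 2 ** control_bits - 1
    branch_max_ma = step_ma * max_code
    total_max_ma = branch_max_ma * BRANCHES
    return {
        "branch_max_ma": branch_max_ma,
        "total_max_ma": total_max_ma,
        "max_swing_mv": total_max_ma * TERMINATION_OHMS,
        "step_mv": step_ma * BRANCHES * TERMINATION_OHMS,
    }


dac_main = dac_range(step_ma=0.8, control_bits=3)   # En<2:0>
dac_isi = dac_range(step_ma=0.08, control_bits=5)   # En<4:0>
```

With these inputs the main DAC spans up to 420 mV in 60 mV steps and each ISI DAC spans up to 186 mV in 6 mV steps, matching the figures in the text.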

To save DAC driving power, the DAC programming is actually implemented in its pre-driver, as shown in Fig.6. So, when a current source branch is disabled, all its corresponding drivers and multiplexers are also shut down.

It is critical to reduce the clock distribution power, because it accounts for a significant part of the power consumption of today's high-frequency SOCs. To mitigate the clocking power issue, we use a low current noise LC resonant clock distribution network at the cost of routing area, as shown in Fig.7. The 300 $\mu$m long clock line is routed in the Metal 4 layer, with 1 $\mu$m width and 1 $\mu$m spacing. A distributed RLC network has been used to model the clock line and its loading. A 4.8 nH differential square spiral inductor with 30 $\Omega$ series parasitic resistance resonates with a 2 × 400 fF clock distribution capacitance at 5 GHz. The inductor layout uses six turns in the top two metal layers (M7-M6) and occupies an area of $110 \times 110 \,\mu\text{m}^2$. For reliable operation across PVT variations, a low resonant quality factor of Q = 4 is selected, giving a 1.25 GHz resonant bandwidth, so no trim capacitors are needed. The resonant peak-to-peak differential voltage ($V_{\rm ppd}$) amplitude is about 1200 mV with a 234-930 mV single-ended peak-to-peak voltage swing. A CML buffer with a pMOS cross-coupled load and 2.2 mA bias current is used to drive and compensate for the resonant network. The total power consumption of the LC resonant clock distribution network is about 2.36 mW. Compared with a normal RC clock distribution network, a 4× power reduction is achieved. The CMOS logic DFF is much more attractive than its CML counterpart if only DC power is considered. For this reason, we develop the normal single-ended CMOS logic DFF into a fully differential version. It improves speed and satisfies the pipeline's setup and hold time requirements with margin across all PVT corners at 5 GHz. The PRBS consists of four LFSRs. The characteristic polynomial chosen for each LFSR is $1 + x^6 + x^7$<sup>[6]</sup>.
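As a behavioral illustration of one such LFSR, the sketch below implements a 7-bit Fibonacci LFSR for the polynomial $1 + x^6 + x^7$, which yields a maximal-length sequence of period $2^7 - 1 = 127$. The seed value, the bit ordering inside the state register and the function name are assumptions; how the four LFSRs are arranged in parallel on the chip is not modeled here.

```python
def prbs7(seed=0x7F, count=127):
    """Generate bits from a 7-bit Fibonacci LFSR for 1 + x^6 + x^7.

    Each new bit is the XOR of the bits produced 7 and 6 steps earlier.
    The seed must be a non-zero 7-bit value, because the all-zero state
    would lock the register.
    """
    if not 0 < seed < 128:
        raise ValueError("seed must be a non-zero 7-bit value")
    state = seed
    bits = []
    for _ in range(count):
        new_bit = ((state >> 6) ^ (state >> 5)) & 1  # taps at stages 7 and 6
        state = ((state << 1) | new_bit) & 0x7F      # shift in the new bit
        bits.append(new_bit)
    return bits


two_periods = prbs7(count=254)
first_period = two_periods[:127]
```

One full period contains 64 ones and 63 zeros, a property of maximal-length sequences that makes the pattern a reasonable stand-in for random data in link tests.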

The transmitter consumes 50.56 mW power and occupies an area of  $430 \times 240 \,\mu\text{m}^2$ . The transmitter test chip with 16 test PADs has an area of  $780 \times 690 \,\mu\text{m}^2$ .

#### 4.2. Receiver CTLE

The receiver CTLE circuit is shown in Fig.8. The RC combination introduces a real zero at 1/RC in the transfer function, which can provide gain peaking at high frequency to compensate for the channel loss. Inductive peaking is used to extend the CTLE bandwidth up to 10 GHz with almost 70% power reduction. The very low *Q* inductor is laid out with six turns in the M5-M4 layers and occupies a small area of $60 \times 60 \,\mu\text{m}^2$. The compensation gain can be set continuously from 0 dB (NEQ = 1 V) to 7 dB (NEQ $\leq 0.32$ V) by the analog control voltage NEQ.

Table 1. Design summary.

| Reference | [7] | [8] | [9] | This work (post-layout simulation) |
|---|---|---|---|---|
| Technology | 65 nm CMOS, 1.2 V power supply | 90 nm CMOS, 1.5 V power supply | 90 nm CMOS, 1 V power supply | 65 nm CMOS, 1 V power supply |
| Channel loss @5 GHz (dB) | 12 dB | 10 cm FR4 | | 20 cm FR4, 15 dB |
| $T_{\rm x}$ swing ($V_{\rm ppd}/2$, mV) | 80 | Fixed | Variable | Variable; Max: 420, Min: 60, Step: 60 |
| Data-rate (Gb/s) | PAM2: 15 | Duo-binary: 20 | PAM4: 24; PAM2: 12 | PAM4: 20; PAM2: 10 |
| Power (mW) | $T_{\rm x}$: 34; $R_{\rm x}$: 41 | $T_{\rm x}$: 120; $R_{\rm x}$: 75 | Clock: 77; Mux: 30; DAC: 83; Total: 190 | Clock: 2.36; PRBS: 10.4; Mux & DAC: 37.8; $T_{\rm x}$: 50.56; CTLE: 5.3 |
| Area (mm<sup>2</sup>) | $T_{\rm x}$: 0.033; $R_{\rm x}$: 0.055 | $T_{\rm x}$: 0.21; $R_{\rm x}$: 0.11 | $T_{\rm x}$: 0.23 | $T_{\rm x}$: 0.103; CTLE: 0.027 |

The CTLE consumes 5.3 mW power and occupies an area of  $146 \times 186 \,\mu\text{m}^2$ . The CTLE test chip with 14 test PADs has an area of  $516 \times 786 \,\mu\text{m}^2$ .
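The high-frequency peaking produced by the CTLE zero can be illustrated with a first-order behavioral model, $H(s) = (1 + s/\omega_z)/(1 + s/\omega_p)$, whose gain rises from 0 dB at DC towards $\omega_p/\omega_z$ at high frequency. This is only a sketch under stated assumptions: the zero frequency is a hypothetical value, the pole is placed so that the asymptotic boost equals the 7 dB maximum quoted above, and the inductive peaking and any further parasitic poles of the real circuit are ignored.

```python
import math


def ctle_gain_db(freq_hz, zero_hz, pole_hz):
    """Magnitude in dB of H(s) = (1 + s/wz) / (1 + s/wp) at s = j*2*pi*f."""
    numerator = complex(1.0, freq_hz / zero_hz)
    denominator = complex(1.0, freq_hz / pole_hz)
    return 20.0 * math.log10(abs(numerator / denominator))


ZERO_HZ = 2.0e9     # hypothetical zero frequency 1/(2*pi*R*C), not from the paper
PEAKING_DB = 7.0    # maximum compensation gain quoted in the text
POLE_HZ = ZERO_HZ * 10 ** (PEAKING_DB / 20.0)  # pole placed for a 7 dB boost

gain_dc = ctle_gain_db(0.0, ZERO_HZ, POLE_HZ)
gain_5g = ctle_gain_db(5.0e9, ZERO_HZ, POLE_HZ)
gain_hf = ctle_gain_db(1.0e12, ZERO_HZ, POLE_HZ)
```

In this simplified model, moving the pole closer to the zero lowers the boost, which mimics reducing the compensation gain through the control voltage.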

### 5. Conclusion

A low power 10 Gb/s PAM2 and 20 Gb/s PAM4 transceiver suitable for many chip to chip applications is presented.

This work integrates the following features: fully programmable pre-emphasis coefficients for an arbitrary channel, adjustable normal output voltage amplitude for power saving, PAM2/PAM4 dual mode operation, a low current noise LC resonant clock distribution with $4\times$ power reduction, fine-grained built-in power control, and a low power wideband programmable receiver CTLE.

The FR4 test board has been manufactured, as shown in Fig.9, and measured with an Agilent VNA. We obtain a consistent *S* parameter model for a 20 cm differential signaling channel, which can be converted by ADS into an HSPICE-compatible *W* element with RLGC parameters. The time domain pulse response and the frequency domain transfer and reflection responses are shown in Fig.10.

The post-layout simulated results of the designed transceiver chip with this differential 20 cm FR4 PCB trace (15 dB loss at 5 GHz) in PAM2 and PAM4 modes are shown in Fig.11. Without the transceiver equalizer, the eye diagram at the receiver end is closed. When the transceiver's equalizer function is enabled, the eye diagram at the receiver end reopens and most of the ISI is removed (the equalizer tap coefficients are solved by Eq. (4)).

Comparisons with recently published transceiver equalizers<sup>[7–9]</sup> are summarized in Table 1. Compared with Refs.[7,8], this work can operate in a PAM2/PAM4 dual mode, and has fully programmable pre-emphasis coefficients for an arbitrary channel and an adjustable output voltage amplitude for power saving. Compared with Ref.[9], this work has fine-grained power control and achieves a larger power reduction in clock distribution, Mux and DAC.

## References

- [1] Dorsey J, Searles S, Ciraula M, et al. An integrated quad-core Opteron<sup>TM</sup> processor. ISSCC Dig Tech Papers, 2007: 102
- [2] Konstadinidis G, Rashid M, Rashid M, et al. Implementation of a third-generation 16-core 32 thread chip-multithreading SPARC<sup>TM</sup> processor. ISSCC Dig Tech Papers, 2008: 84
- [3] Stackhouse B, Cherkauer B, Gowan M, et al. A 65 nm 2-billion-transistor quad-core Itanium<sup>TM</sup> processor. ISSCC Dig Tech Papers, 2008: 92
- [4] Dally W J, Poulton J W. Digital systems engineering. Cambridge University Press, 1998
- [5] Proakis J G. Digital communications. 4th ed. McGraw-Hill, 2001
- [6] Weste N H E, Harris D. CMOS VLSI design: a circuits and systems perspective. 3rd ed. Pearson Education Press, 1998
- [7] Balamurugan G, Kennedy J, Banerjee G, et al. A scalable 5–15 Gbps, 14–75 mW low-power I/O transceiver in 65 nm CMOS. IEEE J Solid-State Circuits, 2008, 43(4): 1010
- [8] Lee J, Chen M S, Wang H D. A 20 Gb/s duobinary transceiver in 90 nm CMOS. ISSCC Dig Tech Papers, 2008: 102
- [9] Savoj J, Abbasfar A, Amirkhany A, et al. A 12-GS/s phase-calibrated CMOS digital-to-analog converter for backplane communications. IEEE J Solid-State Circuits, 2008, 43(5): 1207