A 5 Gb/s low area CDR for embedded clock serial links

    Corresponding author: You Li, liyou1@ime.ac.cn
  • Institute of Microelectronics, Chinese Academy of Sciences, Beijing 100029, China

Key words: clock and data recovery; frequency and phase tracking; digital filter; bang-bang PD; phase interpolator

Abstract: A multi-standard compatible clock and data recovery circuit (CDR) with a programmable equalizer and wide tracking range is presented. Considering the jitter performance, tracking range and chip area, the CDR employs a first-order digital loop filter, two 6-bit DACs and high linearity phase interpolators to achieve high phase resolution and low area. Meanwhile, the tracking range is greater than ±2200 ppm, making the proposed CDR suitable for embedded clock serial links. A test chip was fabricated in a 55 nm CMOS process. The measurements show that the test chip achieves BER < 10$^{-12}$ and meets the jitter tolerance specification. The test chip occupies 0.19 mm$^{2}$ with a 0.0486 mm$^{2}$ CDR core, and consumes only 30 mW from a 1.2 V supply at 5 Gb/s.


1.   Introduction
  • Due to the scaling of CMOS technology, it is possible to pursue higher data communication bandwidth in telecommunication equipment and computer servers. As the data rate increases, conventional parallel links suffer from multiplexing overhead, signal skew, difficulties in clock and data synchronization, crosstalk and interference. This has resulted in a shift towards high-speed serial links, aided by the increased on-chip bandwidth. As a result, multi-gigabit per second (Gbps) serial links are rapidly replacing conventional parallel links.

    Figure 1 shows the simplified block diagrams of two common serial link architectures: the forward clock (FC) architecture and the embedded clock (EC) architecture. As shown in Figure 1(a), the FC architecture has a dedicated clock link sent from the transmitter (TX) to the receiver (RX), so it is often referred to more specifically as source-synchronous clocking. Although this architecture easily recovers the data and clock, it dissipates much power to drive the high speed clock and occupies a large area. As illustrated in Figure 1(b), to solve this problem of the FC architecture, the clock in the EC architecture is embedded in the transmitted signals. The EC architecture uses the edge transitions of the input signals to recover the data and clock. To ensure a high probability of data edge transitions, the transmitted data are encoded, for example with Manchester encoding, 8b/10b, or even statistical coding methods. The encoding circuit, however, occupies a smaller area and dissipates less power than the Clk driver in the FC architecture. Due to the low power and small area, EC architectures are widely used in many applications and standards.

    While the embedded clock high-speed serial link has found extensive application, such as in PCI Express[1], USB[2], SATA[3], and RapidIO[4], there are two major challenges for these wire-line transmission systems. One is the channel losses and the other is the timing jitter between the input signals and the local sampling clock in the RX. Meanwhile, the RX used in EC architectures should have frequency tracking ability because the incoming signals and the local clock have a fixed frequency offset. Unfortunately, the channel losses and timing jitter are uncertain and unknown to the RX, and they worsen the bit error ratio (BER) of the whole system. So it is necessary to design a robust RX with a programmable equalizer and a wide tracking range to meet the different applications and standards.

    This paper proposes a low power 5 Gb/s RX with a programmable equalizer and a wide tracking range. By comprehensively analyzing the system parameters and the first-order bang-bang (BB) CDR loop dynamics, a first-order digital CDR controller and two high linearity phase interpolators (PIs) are adopted to achieve the wide tracking range and good jitter performance. In the following, the proposed RX architecture is described in Section 2; Section 3 analyzes the system parameters; Section 4 gives the first-order BB CDR loop model and deduces the design parameters; Section 5 presents the detailed circuits; finally, the measurement results are shown in Section 6.

2.   Receiver architecture
  • Figure 2 presents the proposed RX architecture, which contains core blocks and testing blocks. The core blocks comprise an AC coupling circuit, a CTLE, four samplers, two 2-to-10 deserializers, two phase interpolators (PIs), two digital-to-analog converters (DACs), a digital controller with a first-order filter, clock calibration and a divider. The testing blocks, namely the 10-to-1 serializer, a current mode logic (CML) buffer and a PRBS7 checker, are used to verify the function of the RX. To relieve the timing constraints of the digital controller, the serial stream is sampled and deserialized to 10-bit parallel streams, as shown in Figure 2. This architecture has three main features: (1) most of the BB CDR consists of digital circuits, which work at 500 MHz, are easily ported to other processes and are insensitive to temperature, process and voltage; (2) the first-order filter has lower complexity and higher loop stability than the second-order filter[5]; (3) using the parallel bang-bang phase detector (BBPD) reduces the design difficulty of the phase detection and makes $K_{\rm PD}$ more stable in the lock state.

    In Figure 2, the AC coupling circuit adjusts the input common-mode voltage of the CTLE; the CTLE compensates for the losses of the transmission line and packages; the deserializer converts serial data to parallel data; clock calibration is used to achieve a good duty cycle and orthogonal clocks; and the PIs, DACs and digital controller recover the local clocks that sample the incoming serial signals (RXP/RXN) by adjusting the phase of the 4-phase sampling clocks. To test the function of the RX, this chip has a built-in PRBS7 checker to verify that the input PRBS7 is recovered correctly. The CML buffer drives the re-serialized data streams (TXP/TXN) off chip so that the eye of the retimed data can be observed to monitor the recovery clock performance.

3.   System parameters
  • Figure 3(a) shows the simplified block diagram of Figure 1(b). Ref1 is the reference clock in the TX while Ref2 is the reference clock in the RX. Ref1 and Ref2 may come from different crystal oscillators that have a fixed frequency offset ($\Delta f$). In general, many standards require that $\Delta f$ is smaller than $\pm$600 ppm. The RX not only tracks the frequency offset but also has to tolerate the sinusoidal jitter component superimposed on the existing jitter. Figure 3(b) shows the jitter tolerance specification. As shown in Figure 3(b), the frequency corner in the jitter tolerance specification is at 5 MHz ($f_{\rm c}$) while the sinusoidal jitter amplitude ($A_{\rm c}$) is 0.2 UI. Considering $\Delta f$ and the sinusoidal jitter, the embedded clock (Emclk($t$)) in the incoming signal is shown in Equation (1).

    $f_{\rm o}$ is the initial frequency of the sampling clock. The phase component $\theta (t)$ is the random jitter, comprising thermal noise, flicker noise, etc., and is much smaller than the phase jitter induced by the fixed frequency offset and the sinusoidal jitter. So the total phase of Emclk($t$) is

    The time derivative of Equation (2) is $f_{\rm offset}(t)$, shown in Equation (3),

    The maximum value of $f_{\rm offset}(t)$ is shown in Equation (4).

    Considering the random jitter $\theta (t)$, the maximum slew rate of the phase difference is smaller than 2200 ppm of $f_{\rm o}$. So if the RX can track a $\pm$2200 ppm fixed frequency offset, it can track the embedded input jitter in the incoming signal and recover the serial data.
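
    The 2200 ppm budget can be sanity-checked with a few lines of arithmetic. The sketch below is not taken from the paper: it assumes the deterministic phase difference, expressed in UI, has the form $\Delta f \cdot t + A_{\rm c}\sin(2\pi f_{\rm c}t)$, so that its peak slope is $\Delta f + 2\pi A_{\rm c} f_{\rm c}$. The function name and its parameters are illustrative only.

```python
import math

def max_offset_ppm(delta_f_ppm, jitter_amp_ui, jitter_freq_hz, bit_rate_hz):
    """Peak rate of phase drift, in ppm of the bit rate (UI of drift per UI).

    Assumed phase model (in UI): phi(t) = delta_f * t + A_c * sin(2*pi*f_c*t).
    Its derivative peaks at delta_f + 2*pi*A_c*f_c.
    """
    # Contribution of the sinusoidal jitter, normalised to the bit rate.
    sj_ppm = 2 * math.pi * jitter_amp_ui * jitter_freq_hz / bit_rate_hz * 1e6
    return delta_f_ppm + sj_ppm

# Values quoted in the text: 600 ppm offset, 0.2 UI at 5 MHz, 5 Gb/s link.
total = max_offset_ppm(600, 0.2, 5e6, 5e9)
print(f"{total:.1f} ppm")
```

    Under these assumptions the deterministic part comes to roughly 1857 ppm, which leaves some margin below 2200 ppm for the random jitter $\theta (t)$.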

4.   Bang-bang CDR loop dynamics
  • The bang-bang PD, due to its undefined $K_{\rm PD}$, makes the analysis of CDRs difficult. Fortunately, several researchers[6, 7] have recently tried to analyze such systems. Based on their analysis, Figure 4 presents a modified system view of the implemented first-order bang-bang (BB) CDR, which uses an accumulator, two phase interpolators (PI) and two DACs to replace the VCO in Reference [6].

    $\phi_{\rm e}(t_{\rm n})$ is defined as the difference between the data phase $\phi_{\rm d}(t_{\rm n})$ and the sampling clock phase $\phi_{\rm PI}(t_{\rm n})$ at the $n$th sampling time $t_{\rm n}$. For convenience, the initial sampling clock is taken as the ideal clock source. Assuming the initial sampling clock is $A$sin(2$\pi f_{\rm o}t$), the embedded clock in the incoming signal is $A$sin[2$\pi (f_{\rm o}+\Delta f)t+\pi(t)]$. The phase difference between the embedded clock and the initial sampling clock is shown in Equation (2). The phase step of the first-order BB CDR during every update cycle ($T_{\rm update}$ $=$ 10 UI) is

    $T_{\rm D}$ is the transition density of the incoming signals; in general, $T_{\rm D}$ is close to 0.5[8]. $K_{\rm P}$ is the gain of the first-order digital filter. $K_{\rm PI}$ is the phase step of the PI and DAC and equals 2$\pi $/2$^{N}$, where $N$ is the number of bits of the DAC. To track a $\pm$2200 ppm frequency offset, the phase update rate of the CDR should be larger than $\vert f_{\rm offset}\vert_{\rm max}$.

    Equation (6) derives the relationship of the $K_{\rm P}$ and $N$, shown in Equation (7).

    An $N$-bit DAC and a PI create 2$^{N}$ different phases to cover two unit intervals (UI) because of the half-rate architecture. The larger $N$ is, the smaller the minimum phase step and the jitter of the recovery clock, but the larger the DAC area and the smaller the tracking range. Considering the output jitter of the recovery clock, the DAC area and the tracking range, this paper employs a 6-bit DAC, while $K_{\rm P}$ is programmable (2$^{-1}$ to 2$^{-3}$) to adapt to different applications or standards.
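
    The trade-off between $N$, $K_{\rm P}$ and the tracking range can be illustrated numerically. The sketch below is a simplified model, not the paper's own equations: it assumes that in each 10 UI update cycle the phase detector delivers on average $T_{\rm D}\times 10$ votes of the same sign, that the filter scales them by $K_{\rm P}$, and that each DAC code step moves the phase by 2 UI/2$^{N}$. The saturating phase-detector output described in Section 5.2 is not modelled, and all names are hypothetical.

```python
def tracking_range_ppm(n_bits, k_p, transition_density=0.5,
                       bits_per_update=10, ui_span=2.0):
    """Largest trackable frequency offset, in ppm, under the assumed model."""
    phase_step_ui = ui_span / 2 ** n_bits          # one DAC/PI code step, in UI
    votes = transition_density * bits_per_update   # average same-sign votes per update
    steps_per_update = votes * k_p                 # code steps taken per update cycle
    # UI of phase moved per UI of elapsed time, expressed in ppm.
    return steps_per_update * phase_step_ui / bits_per_update * 1e6

# Sweep the DAC resolution at a fixed gain K_P = 2**-1.
for n in (5, 6, 7):
    print(n, tracking_range_ppm(n, 2 ** -1))
```

    In this model the trackable offset scales linearly with $K_{\rm P}$ and halves for every added DAC bit, which is the trade-off described above.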

5.   Circuit implementation

    5.1.   RX front-end and local clock calibration

  • In Figure 5(a), a programmable continuous-time linear equalizer (CTLE), realized with an adjustable capacitor and resistor, compensates for the losses of the packages and the transmission channel caused by non-ideal factors such as frequency-dependent losses, reflection and other interference sources. Before the CTLE, an AC-coupling circuit is adopted to improve the input sensitivity of the RX and achieve impedance matching. The equalized signals are sampled by the 4-phase clocks from the PIs shown in Figure 2 and de-serialized to 10-bit data and 10-bit edge words at a rate of 500 MHz. The samplers consist of a sense amplifier (SA) and a symmetric slave latch[9], shown in Figure 5(b).

    Figure 5(c) shows the CML to CMOS converters, which provide CMOS-level clocks for the samplers. Meanwhile, to achieve a good duty cycle of the differential clocks, two duty-cycle correctors (DCCs) are used after the CML to CMOS converters. Furthermore, to meet the orthogonal relationship of the 4-phase clocks for the PI, a phase adjustment block is used to re-interpolate the 4-phase clocks ($\phi_{0}$, $\phi_{90}$, $\phi_{180}$, $\phi_{270}$) from the DCCs to produce the new 4-phase clocks ($\theta_{0}$, $\theta_{90}$, $\theta_{180}$, $\theta_{270}$) shown in Figure 5(d). The clock calibration blocks in Figures 5(c) and 5(d) assure the orthogonal phase relationship and reduce the duty cycle distortion of the recovery clocks.

    5.2.   The digital controller of BB CDR

  • To minimize loop latency[6], the CDR is implemented in only three digital stages. Figure 6 shows the digital controller diagram of the BB CDR, which consists of a phase difference extraction (PDE) block, a first-order filter and a binary to thermometer converter. All of them operate at 500 MHz. Firstly, the digital controller calculates the phase difference between the incoming signal and the recovery clocks by using the PDE. The PDE compares Data$<$9 : 0$>$ with Edge$<$9 : 0$>$ to produce the early-late sum$<$4 : 0$>$. Because the transition density[8] of the incoming signal changes, the loop gain and the tracking range vary. So the PDE contains a transition counter, which counts the number (cnt$<$3 : 0$>$) of transitions of the incoming signal during 2 ns ($=$ 1/500 MHz). If the absolute value of the sum equals cnt, the PDE outputs the maximum value (01010 : $+10$) or the minimum value (10110 : $-10$). This method improves the tracking ability and tracking speed.
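
    The behaviour of the PDE can be sketched in software. The following is a behavioural illustration under stated assumptions, not the fabricated logic: it assumes that edge sample $i$ is taken between data sample $i-1$ and data sample $i$, that the last data bit of the previous word is carried over so that up to ten transitions can be counted, that an edge sample equal to the old bit means "early" ($+1$) and one equal to the new bit means "late" ($-1$), and that a word without transitions outputs 0. The function name and sign convention are hypothetical.

```python
def phase_difference(data, edge, prev_bit):
    """Return (total, cnt, out) for one 10-bit data/edge word pair."""
    total = 0          # early-late sum
    cnt = 0            # number of data transitions in this word
    before = prev_bit
    for d, e in zip(data, edge):
        if d != before:                      # a data transition occurred
            cnt += 1
            # Edge sample still shows the old bit: clock is early (assumed +1);
            # edge sample already shows the new bit: clock is late (assumed -1).
            total += 1 if e == before else -1
        before = d
    if cnt > 0 and abs(total) == cnt:        # all votes agree: saturate the output
        out = 10 if total > 0 else -10
    else:
        out = total
    return total, cnt, out

# Four transitions, all voting "early": the output saturates to +10.
data = [1, 1, 0, 0, 1, 1, 0, 0, 1, 1]
edge = [1, 1, 1, 0, 0, 1, 1, 0, 0, 1]
print(phase_difference(data, edge, prev_bit=1))
```

    With mixed votes the plain sum is passed on unchanged, so the saturation only takes effect when every observed transition points the same way.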

    Secondly, in Figure 6, a first-order filter processes the PDE output to filter the low frequency jitter and achieve the required frequency tracking ability analyzed in Section 4. Finally, a binary to thermometer converter generates the thermometer code for the DACs, which control the tail currents of the PIs to adjust the output phase of the recovery clocks, as shown in Figure 7.
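
    A first-order digital filter of this kind reduces to a scaled accumulator, and the binary to thermometer conversion is a simple unary encoding. The sketch below illustrates both under assumptions that are not stated in the paper: $K_{\rm P} = 2^{-{\rm shift}}$ is realised by keeping `shift` fractional bits in the accumulator, the phase code wraps modulo 2$^{N}$, and the thermometer word sets its lowest `value` bits. The class and function names are hypothetical.

```python
class FirstOrderFilter:
    """Accumulator with gain K_P = 2**-shift driving an n_bits phase code."""

    def __init__(self, shift, n_bits=6):
        self.shift = shift
        self.modulus = 2 ** n_bits
        self.acc = 0   # accumulated PDE output, in units of 2**-shift code steps

    def update(self, pde_out):
        # Wrap so that the phase code rotates through all 2**n_bits positions.
        self.acc = (self.acc + pde_out) % (self.modulus << self.shift)
        return self.acc >> self.shift   # integer part is the phase code


def binary_to_thermometer(value, width):
    """Unary encoding: the lowest `value` of `width` bits are set."""
    return [1 if i < value else 0 for i in range(width)]


loop_filter = FirstOrderFilter(shift=1)          # K_P = 2**-1
codes = [loop_filter.update(x) for x in (10, -10, -10)]
print(codes)
print(binary_to_thermometer(3, 15))
```

    The last update in the example drives the accumulator below zero, and the wrap-around returns a code near the top of the range, which corresponds to the phase rotating backwards through the full 2 UI span.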

    5.3.   2.5 GHz phase interpolator (PI) and digital-to-analog converter (DAC)

  • The main function of a phase interpolator is to use the 4-phase clocks from the phase adjustment block to produce different output phases with fine steps. Figure 7 shows the circuits of the two identical phase interpolators and the two identical 6-bit DACs. One DAC and one PI provide the differential clocks for the data path while the other DAC and PI provide the differential clocks for the edge path. The two most significant bits of the DAC determine the phase quadrant. The other four bits divide each quadrant into 16 sections (one section corresponds to one step size). To avoid glitches, a single fine step jump is used when the phase step is near a quadrant boundary.

    The most important performance metric of the PI is its linearity, which is related to the tracking range and BER performance. Two factors contribute to the nonlinearity of the PI: the output slew of outp/outn and the linearity of the DAC. If the output slew rate is larger, the linearity of the PI is better. So this paper inserts input buffers to increase the input slew rate of (M0, M1, M2, M3) and uses a larger load capacitance ($C_{\rm L}$) to increase the output RC time constant. These two measures increase the output slew rate of outp/outn to realize good linearity of the PI. If the bias current sources of the DAC do not all produce the same current step, the linearity of the PI degrades. To reduce the effect of the DAC, the output current of every bias current source ($I_{\rm bias}$), which always has one path to ground as shown in Figure 7, is invariably compared to the $I_{\rm bias}$ controlled by MOS switches. The current sources meet the following equations:
    $$\begin{split} I_1 + I_2 &= 15 I_{\rm bias}, \\ I_3 + I_4 &= 15 I_{\rm bias}, \end{split} \tag{8}$$
    where $I_{1}$ and $I_{2}$ are the current sources of the input transistors (M0, M1, M2, M3) of the interpolator for the data path and $I_{3}$ and $I_{4}$ are the current sources for the edge path. The methods above achieve a phase adjustment with higher linearity, shown in Figure 8. Within one quadrant, the 16 phase steps have a differential nonlinearity (DNL) smaller than one phase step.
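
    The constant-sum constraint of Equation (8) and its effect on step uniformity can be illustrated with an idealised model. The sketch below is an assumption-laden illustration, not the measured behaviour of Figure 8: it assumes the four fine bits set $I_{2} = k I_{\rm bias}$ and $I_{1} = (15-k) I_{\rm bias}$, that the two interpolated clocks are ideal quadrature sinusoids so the output phase inside a quadrant is $\arctan(I_{2}/I_{1})$, and it ignores how the real circuit handles quadrant boundaries. With 16 fine codes this model has 15 steps per quadrant. All function names are hypothetical.

```python
import math

def pi_currents(code):
    """Split a 6-bit code into (quadrant, I1, I2), currents in units of I_bias."""
    quadrant = (code >> 4) & 0b11    # two most significant bits
    fine = code & 0b1111             # four least significant bits
    return quadrant, 15 - fine, fine

def quadrant_dnl():
    """DNL (in average steps) of the 15 fine steps inside one quadrant."""
    # Ideal sinusoidal interpolation: phase = atan(I2 / I1), in degrees.
    phases = [math.degrees(math.atan2(k, 15 - k)) for k in range(16)]
    steps = [b - a for a, b in zip(phases, phases[1:])]
    average = 90 / 15
    return [s / average - 1 for s in steps]

dnl = quadrant_dnl()
print(pi_currents(0b100111))
print(f"worst |DNL| = {max(abs(d) for d in dnl):.3f} steps")
```

    Even this idealised interpolator is not perfectly uniform: its steps are shortest near the quadrant edges and longest in the middle, yet the deviation stays well below one step, which is consistent in spirit with the DNL claim above.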

6.   Measurement results
  • The test chip of the proposed RX has been implemented in 55 nm CMOS technology. Figure 9 shows a micrograph of the test chip, which includes the RX, the testing CML buffer & re-serializer and testing IOs & PADs. The test chip occupies an area of 0.194 mm$^{2}$ with a 0.0486 mm$^{2}$ proposed RX, which draws about 30 mW from a 1.2 V supply.

    The test chip has been directly mounted on a printed-circuit board (PCB) with the input and output connections provided by SMA interfaces. The test channel mainly includes the custom flip-chip wire-bond package, two SMA connectors, 8 cm of PCB trace on the evaluation board and a 50 cm cable.

    Figure 10 shows the test setups and measurement results. Figure 10(a) presents the test method for the jitter tolerance curve, measured with a Tektronix BSA 125C BERT, while Figure 10(b) shows the measured jitter tolerance curve, which passes the jitter tolerance specifications of standards such as PCI Express 2.0 and USB 3.0. Figure 10(c) shows the setup in which a Stratix IV GX F1517 FPGA transmits 5 Gb/s pseudo random bit sequence 7 (PRBS7) serial streams to the test chip and receives the retimed serial streams from the test chip. As shown in Figure 10(e), the test chip achieves BER $<$ 10$^{-13}$ when the FPGA and the test chip do not use synchronous reference clocks. Figure 10(d) shows the output eye diagram of the retimed data of the test chip. The eye jitter (RMS) is 7.1 ps, including the recovered clock jitter and the inter-symbol interference induced by the CML buffer and re-serializer. Table 2 compares the performance with recent papers. As shown in Table 2, the proposed CDR has a smaller area and a wider tracking range than the others.

7.   Conclusion
  • In this paper, a low power 5 Gb/s RX with a programmable equalizer and a wide tracking range CDR circuit was presented. Based on the system parameter analysis and the loop dynamics of the first-order BB CDR, a first-order digital CDR controller is adopted to achieve BER $<$ 10$^{-12}$. Considering the tracking range, jitter and area, this paper employs 6-bit DACs and high linearity PIs to realize a tracking range greater than $\pm$2200 ppm and a 7.1 ps RMS eye jitter. The proposed RX occupies only 0.0486 mm$^{2}$ and consumes 30 mW from a 1.2 V supply. Also, this RX with digital logic is easily ported to other processes and is compatible with many standards. It can also be used in multi-channel applications and does not induce much additional electromagnetic interference because it shares the same PLL with the TXs and other-lane RXs on the same chip.


Journal of Semiconductors © 2017 All Rights Reserved