Introduction
With the rapid growth of large-scale integrated circuits, such as processors, memory devices and SoCs, the synchronization of a unified clock signal across millions of transistors has become increasingly sophisticated. In most digital circuit blocks, clock synchronization is accomplished by automatic clock tree synthesis with the assistance of EDA software[1, 2, 3]. However, the clock arrival times at the terminals of sequential devices are balanced by inserting and sizing clock buffers and inverters[4, 5, 6, 7, 8], which brings disadvantages. A large number of redundant buffers and inverters are inserted to balance the clock propagation paths, wasting area and power. The phase error between registers connected to a clock distribution network, referred to as `skew', is only suppressed below an acceptable level, normally 5% of the clock cycle. Moreover, the precision is highly dependent on the methodology of buffer insertion and sizing.
To achieve higher clock accuracy, dedicated circuits have been researched and applied. The phase-locked loop (PLL) and its derivative, the delay-locked loop (DLL), are widely used in modern chips for high-frequency clock generation and distribution[9, 10, 11, 12]. However, the application of the PLL to clock distribution is severely limited by its analog structure, the long period required for locking and the area occupied[13, 14]. The DLL has a simpler structure and can be built from all-digital circuits. Its variants are dominant in complicated clock synchronization schemes and have received academic attention in recent years, focusing on how to decrease the time for full alignment of clock signals[15, 16, 17].
Another all-digital clock synchronization circuit, the synchronous mirror delay-line (SMD), was first introduced in memory devices with a pre-built circuit block that mirrors the external clock drivers[18]. The entire alignment period is extremely short, only two clock cycles. However, the structure is not suitable for ASIC applications, its tuning step is a clear disadvantage compared with that of the DLL, and the duty cycle of the input clock is restricted to a narrow range. The problems with duty cycle and applicability to ASICs are partly solved by its improved variants, but the shortcoming in precision remains[19, 20, 21]. A mixed structure of coarse and fine tuning was applied in Reference [22] with digitally-controlled varactors as fine tuning units, but the precision is not very high and the 10 clock cycles required for synchronization are still very long. The clock synchronization buffer (CSB), which applies a hybrid architecture of SMD and DLL, was proposed in References [23, 24], but its precision is 1.5 times the detection timing window and the fine tuning takes eight clock cycles owing to the state machine[23].
In this paper, a novel high-precision synchronization circuit (HPSC) is proposed, with a hybrid framework based on the SMD and a dedicated dynamic compensation circuit (DCC). The clock driver is outside the HPSC and its delay variation does not affect the alignment of the clock signals; thus the HPSC is applicable to clock distribution networks. The output signal is roughly synchronized in 2 clock cycles and precisely aligned in the next 3 clock cycles, with the phase error suppressed below 3.8 ps. All circuits are implemented in a SMIC 0.13 μm 1P6M process, and a test chip was fabricated to measure and verify the results of the proposed circuits.
Overview of the HPSC
As shown in Figure 1, the HPSC is composed of four parts: an input buffer (IB), a feedback buffer (FB), a coarse delay line (CDL) and a dynamic compensation circuit (DCC). It is worth mentioning that the clock driver (CD) is independent of and outside the HPSC. The primary parts of the HPSC are the coarse tuning component, the CDL, and the fine tuning component, the DCC, both described in detail below. The HPSC has three ports: the input EXTCLK, short for the external clock from the PLL; the output INTCLK, short for the internal clock in the distribution network; and SYNC, the control signal.
Three reference clocks (REFCLK, PCLK, NCLK) are generated in the IB with minor phase differences, and the delay from EXTCLK to REFCLK is $d_1$. The FB is almost a duplicate of the IB but with only one output port, and the delay of the FB is also $d_1$. As mentioned above, the CD is outside the HPSC and its delay $d_2$ varies over a large range.
The CDL is composed of several function blocks: an interleaved measurement delay line (IMDL), a control circuit (CC), a dual control compensation delay line (DCCDL) and two selector blocks, ISEL and OSEL, as shown in Figure 1. This structure is similar to the conventional SMD but with a smaller tuning step, $T_{\rm R}$, which is also the resolution and the critical parameter for the performance of the clock synchronization. In the proposed circuit, $T_{\rm R}$ is close to the minimal delay of one simple inverter, while the resolution of the conventional SMD is almost double that value.
The remaining uncertainty is then reduced by inserting the DCC into the HPSC for fine tuning. The DCC is composed of a phase detector (PD), a state machine (STM) and variable delay paths (VDP). Once the DCC starts to work, the phase error between EXTCLK and INTCLK is captured in every working cycle, even after the control state has become stable. The initial delay of the VDP is $d_0$, nearly 4 times the inverter delay and much smaller than $d_1$.
In the 1st clock cycle after the SYNC signal becomes active, the selector OSEL selects REFCLK as CCLK while the STM of the DCC is still uninitialized; thus the propagation path of the input clock EXTCLK is IB, VDP, CD and FB, with delays $d_1$, $d_0$, $d_2$ and $d_1$, as shown in Figure 1. The delay from EXTCLK to FBCLK is therefore $d_1+d_0+d_2+d_1$. The delays $d_0$, $d_1$ and $d_2$ follow the relationship in Equation (1):
In the 2nd clock cycle, the selector ISEL lets FBCLK propagate to CDLIN, and OSEL selects CDLOUT as CCLK, with the DCC still uninitialized. The reference clock REFCLK of the 2nd cycle is generated with delay $d_1$; thus the phase difference between FBCLK (CDLIN) and REFCLK is $d_1+d_0+d_2+d_1-d_1=d_1+d_0+d_2$. In practice, another quantity, $T_{\rm v}$, is measured by the IMDL, as Equation (2) shows. However, the measurement of the phase difference $T_{\rm v}$ in digital circuit blocks is limited by the resolution $T_{\rm R}$, the minimal delay of each measurement unit, as given in Equation (3).
Thus, there must be an error between the real phase difference $T_{\rm v}$ and the measured phase difference $T'_{\rm v}$. The error $\delta$ introduced by the IMDL, shown in Equation (4), is inevitable in all-digital circuits and is associated with the resolution $T_{\rm R}$. The measured phase difference $T'_{\rm v}$ is passed to the DCCDL, and the reference clock REFCLK then propagates through the DCCDL with the same delay. However, the compensation phase difference $T''_{\rm v}$ is not exactly the same as $T'_{\rm v}$. The error introduced by the DCCDL, $\sigma$, arises from the circuits and is smaller than $\delta$ in the proposed design. The propagated clock runs through the VDP and CD again with delay $d_0+d_2$ and is then output as INTCLK. The entire delay from EXTCLK to INTCLK can be calculated as in Equation (5). Thus the output clock is roughly aligned in 2 clock cycles with a phase error of $\phi=\sigma+\delta$.
From the 3rd clock cycle, fine tuning is applied to synchronize the clock signal via the DCC. OSEL and ISEL hold their previous states and the DCC starts to work. It is difficult to obtain the exact phase difference using all-digital circuits. Instead, the PD of the DCC captures the phase relationship between EXTCLK and INTCLK and passes the S+/S-/SL indication signal to the STM, where control codes are generated and sent to the VDP. Since the phase error $\phi$ is reduced to nearly half that of the conventional SMD structure, the number of delay paths is also reduced by half compared with other hybrid clock synchronization circuits, assuming the same resolution $\tau$. The selection of $\tau$ is a trade-off between precision and complexity; here $\tau$ follows $\tau < (\delta+\sigma)/7$, so INTCLK is fully aligned in 3 clock cycles by utilizing a binary search algorithm, as Equation (\ref{eqn:stm2}) shows. Finally, the phase error $\phi$ between INTCLK and EXTCLK is suppressed below $\tau$.
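The two-cycle coarse alignment can be illustrated with a minimal numerical sketch. It rests on assumptions that are not taken from the paper: the measured interval is modeled as the remainder of the clock period after the path delay $d_1+d_0+d_2$, the IMDL is modeled as rounding down to a multiple of $T_{\rm R}$, the DCCDL error $\sigma$ is neglected, and all numerical values are hypothetical.

```python
import math

def coarse_align(t_clk, d0, d1, d2, t_r):
    """Model the two-cycle coarse alignment; all times in picoseconds.

    Assumption: the interval measured by the IMDL is the remainder of the
    clock period after the path delay d1 + d0 + d2.
    """
    path = d1 + d0 + d2
    t_v = t_clk - path                 # real phase difference (assumed form)
    steps = math.floor(t_v / t_r)      # delay units fully traversed
    t_v_meas = steps * t_r             # measured difference, a multiple of t_r
    delta = t_v - t_v_meas             # quantization error, 0 <= delta < t_r
    total = d1 + t_v_meas + d0 + d2    # EXTCLK-to-INTCLK delay, sigma neglected
    return steps, delta, total

# Hypothetical delays for a 500 MHz clock (2000 ps period).
steps, delta, total = coarse_align(t_clk=2000.0, d0=120.0, d1=600.0, d2=400.0, t_r=31.6)
print(steps, round(delta, 1), round(2000.0 - total, 1))
```

In this model the residual misalignment after coarse tuning equals the quantization error, which is why a smaller $T_{\rm R}$ directly tightens the coarse result.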
Implementation of the CDL
As described in the section `Overview of the HPSC', the precision of the HPSC is related to the resolution of the CDL, $T_{\rm R}$. The improvement of $T_{\rm R}$ shown in this section is based on an interleaved structure of the measurement delay line and the compensation delay line. In contrast to the conventional SMD, three reference clocks are generated: REFCLK, NCLK and PCLK. REFCLK plays a role similar to that of REFCLK in conventional SMD blocks, while NCLK and PCLK are inversions of REFCLK. NCLK and PCLK differ only slightly: PCLK leads by a phase shift $\theta$. Even though all the reference clocks are generated in the IB, the driving load of each reference clock varies greatly, which leads to notable differences in the sizes of the generating transistors. Furthermore, the sensitivity of the reference clock generation circuits to PVT variation is shown in Figures 2(a), 2(b) and 2(c). The phase error between REFCLK and NCLK is well controlled within $\pm 2$ ps under most conditions.
The phase detector in the CDL is a modified TSPC register, which is edge-triggered and has a favorable setup time $T_{\rm setup}$ that diminishes the risk of detection failure, although $T_{\rm setup}$ is affected by the quality of the clock signal. If the rise or fall time is rather slow, the TSPC register will capture a false signal or miss the edge completely. This risk is associated with the driving ability of REFCLK, and the size of the buffer in the IB is limited to a reasonable range.
In the first two clock cycles, FBCLK, which is delayed by $d_1+d_0+d_2+d_1$, is sent into the CDL and propagates through the IMDL, a chain of measurement delay units (IMDUs), shown in Figure 3. Each IMDU is simplified to a minimal inverter, which is balanced in rise and fall time to avoid distorting the duty cycle of the propagating clock signal, as shown in Figure 2(d). In contrast, the delay units in conventional clock synchronization circuits are NAND gates or pairs of tri-state inverters. Because each unit inverts the clock, the entire IMDL is divided into odd and even parts; accordingly, the output of the propagated clock signal, MDLx, is alternately inverted and non-inverted, as shown in Figure 4. The odd IMDUs accept NCLK as the reference clock and the even IMDUs accept REFCLK; all propagated clocks are then phase-compared with the reference clocks, and the results are sent to the control circuit (CC) of the CDL as digital bits.
The phase information from the unified IMDL can be simplified and formatted to the pattern ``1...1100...0''. A pair of control signals is then generated in the CC: P for pass and I for clock injection. The signal I is similar to the control signal of conventional SMD circuits and is expressed in Equation (\ref{eqn:I}), while the signal P is added to control the behavior of the units in the DCCDL, as shown in Equation (\ref{eqn:P}). Signals Q and NQ come from the phase detectors of the CDL, and the total number of units in the IMDL is $N$, assuming the numbers of odd and even units are equal.
As Equations (\ref{eqn:I}), (\ref{eqn:P}) and (\ref{eqn:misc}) show, there is always exactly one active signal in the I array, denoted $I_k=1$, and all other clock injection signals are logical low. The P array follows the pattern ``1...1100...0'', in which the part ranging from $P_0$ to $P_k$ holds logical high, as shown in Figure 3. Since all devices are digital, the metastability of I and P is eliminated by the CC, which is noise-isolated. The time taken to generate I and P is not on the critical timing path and does not affect the performance of the CDL. The phase difference information is thus discretized into two series of control signals, I and P, with the continuous error $\delta$ associated with the delay of the units in the IMDL, $T_{\rm R}$. The waveforms of the internal signals are shown in Figure 4. MDLx is divided into odd and even groups, and compensation starts from PCLK since $I_{15}=1$ in this case.
In the conventional SMD, the delay units in the compensation component share the same circuit structure as the units in the measurement part. However, this is not suitable for the CDL owing to the division into odd and even parts. A novel circuit, the dual control compensation delay unit (DCCDU), is proposed to serve as the delay unit, as Figure 2(d) shows. The DCCDL is an $N$-stage chain of DCCDUs, which is also divided into odd and even parts that accept PCLK and REFCLK, respectively. The performance of the DCCDU is close to that of the IMDL units in this circuit.
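A behavioral sketch of the measurement chain may help to see how the digital bits arise. It is an idealized model under stated assumptions: every unit has exactly the delay `t_r`, the phase detectors are ideal, and the inverted reference clock on the odd taps is assumed to cancel the inversion of the propagating clock, so a single comparison rule serves both odd and even units. The function name and the numerical values are hypothetical.

```python
def imdl_sample(t_v, t_r, n_units):
    """Return the phase-detector bits of an n_units-stage measurement line.

    Bit i is 1 when the propagating edge has already left unit i at the
    moment the reference edge arrives, t_v after the edge entered the line.
    Times are in picoseconds.
    """
    bits = []
    for i in range(n_units):
        arrival = (i + 1) * t_r        # time at which the edge leaves unit i
        bits.append(1 if arrival <= t_v else 0)
    return bits

# Hypothetical example: a 100 ps interval measured with 31.6 ps units.
bits = imdl_sample(t_v=100.0, t_r=31.6, n_units=8)
print(bits)
```

The number of leading ones is the measured interval in units of `t_r`; the remainder that the chain cannot resolve is the quantization error.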
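The relationship between the detector pattern and the two control arrays can be sketched as follows. This is a minimal illustration, not the logic of the control circuit itself: it assumes the detector bits already form a clean ``1...1100...0'' pattern with at least one leading 1, takes $k$ as the index of the last 1, and uses a hypothetical function name.

```python
def control_signals(q):
    """Derive injection (I) and pass (P) arrays from a thermometer code q.

    Assumption: q has the form 1...10...0 with at least one leading 1, and
    k is the index of its last 1. Only I[k] is high; P[0..k] are high.
    """
    n = len(q)
    # I[i] marks the 1 -> 0 transition; a virtual 0 follows the last bit.
    i_sig = [q[i] & (1 - (q[i + 1] if i + 1 < n else 0)) for i in range(n)]
    k = i_sig.index(1)
    p_sig = [1 if i <= k else 0 for i in range(n)]
    return i_sig, p_sig

i_sig, p_sig = control_signals([1, 1, 1, 0, 0, 0, 0, 0])
print(i_sig)
print(p_sig)
```

With this encoding the I array is one-hot and the P array is a thermometer code ending at the same index, which matches the description of $I_k=1$ and of $P_0$ to $P_k$ held high.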
The working status of the DCCDU is controlled by the signals I and P. When $I=0$, $P=0$, the DCCDU is shut down and prevents any propagation of clock signals; the DCCDUs numbered from $k+1$ to $N$ follow this behavior. When $I=1$, $P=1$, the transmission gate TG is enabled and the DCCDU allows clock injection from the reference clock: REFCLK for an even unit or PCLK for an odd unit. Only one DCCDU, labeled $k$, accepts clock injection. When $I=0$, $P=1$, the DCCDU behaves as a balanced inverter and allows propagation through it; the units numbered from 1 to $k-1$ work in this state. The case $I=1$, $P=0$ never appears and is illegal in this circuit.
The output stage of every DCCDU is interpolated with the gated reference clock. However, as the analysis above shows, the case in which two signals are active simultaneously never occurs in practice, so the potential phase blending of $V_{k+1}$ and the valid REFCLK/PCLK, which would lead to a risk of confusion, is eliminated. To ensure this, the rise/fall time of the clocks must be guaranteed, which is related to the size of the buffers in the IB.
Another challenge for the performance of the DCCDU when it behaves as an inverter is solved by applying a positive feedback structure. The inverter $I_1$ is inserted to accelerate the falling transition of the clock signal and to match the performance of the IMDL units. Otherwise, the DCCDU would lose the balance of rise/fall time when a clock signal propagates through it, which leads to distortion of the duty cycle. Furthermore, the sizes of the transistors of the DCCDU are also optimized to achieve this balance. When the control signal P is logical low, the other inverter $I_0$ of the DCCDU is disabled and its output is held in a high-impedance state rather than logical high, which is critical to correct behavior. Moreover, the dynamic circuit structure is only suited to high-frequency clock applications; otherwise an unintended change of $V_{k+1}$ is caused by charge leakage.
As mentioned in the section `Overview of the HPSC', there is a relationship between $\delta$ and $\phi$. In this part, a more detailed discussion is given. What concerns us is how the input information is transferred to the output. In this system, the input signal is the phase difference between FBCLK and REFCLK, while the output signal is the phase difference between the output clock of the DCCDL and the reference clock. The phase error between EXTCLK and INTCLK, $\phi = \delta + \sigma$, is related to the resolution $T_{\rm R}$.
On the other hand, the delay owing to the transmission gate TG, $\sigma$, should also be noted. The loss of performance depends on many factors: the sizes of the transistors in the TG, the rise/fall time of the gated reference clock, the driving load of the TG, which is mainly affected by the size of the inverter $I_0$, and even the layout of the implementation. Moreover, the falling edge of the reference clock is disturbed more than the rising edge, so the parameter $\sigma$ can be separated into $\sigma_{\rm f}$ and $\sigma_{\rm r}$ with $\sigma_{\rm r} \ll \sigma_{\rm f} < T_{\rm R}$. The detailed phase error analysis is shown in Equations (10) and (11), where the symbol $\phi_{\rm o}$ stands for the odd parts of the DCCDL and the symbol $\phi_{\rm e}$ for the even parts.
Moreover, the phase shift $\theta$ is set by the generation circuits in the IB and can be matched with the falling delay of the TG, which means $\sigma_{\rm f} = \theta$; thus the maximum phase error can be obtained from Equation (14), since $\sigma_{\rm r}$ is insignificant. However, $\sigma$ is affected by the PVT conditions, and in some cases $\phi$ is even smaller.
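The four control combinations described above can be summarized as a small lookup, given here as an illustrative sketch. The mode names, the function names and the 0-based unit numbering are assumptions made for the example; only the mapping from $(I, P)$ to behavior comes from the text.

```python
# Hypothetical mode names; the (I, P) combinations follow the description above.
DCCDU_MODES = {
    (0, 0): "shutdown",    # blocks any clock propagation
    (1, 1): "inject",      # TG enabled, reference clock injected
    (0, 1): "propagate",   # behaves as a balanced inverter
}

def dccdu_mode(i_bit, p_bit):
    """Return the operating mode of one DCCDU for control bits I and P."""
    try:
        return DCCDU_MODES[(i_bit, p_bit)]
    except KeyError:
        raise ValueError(f"illegal control combination I={i_bit}, P={p_bit}") from None

def chain_modes(k, n):
    """Modes of units 0..n-1 when unit k receives the injection (0-based, assumed)."""
    return [dccdu_mode(int(i == k), int(i <= k)) for i in range(n)]

print(chain_modes(k=2, n=5))
```

Because the I array is one-hot at $k$ and the P array is high only up to $k$, the combination $I=1$, $P=0$ cannot be produced by this construction.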
Circuits of the DCC
From the 3rd clock cycle, the DCC starts to align INTCLK to EXTCLK. Following the results obtained in the sections above, the phase error is now confined to the range $(0, T_{\rm R})$, and the time required for fine tuning is greatly decreased. The PD in the DCC is designed to capture the phase difference between INTCLK and EXTCLK, more specifically between the rising edges of the two clocks, and then indicates the relationship shown in Figure 5(a). However, the exact phase discrepancy is not resolved, owing to the difficulty of resolving such a minute difference with digital circuits; the two signals may even overlap to some extent.
To detect the phase relationship, the two clocks are sent into the phase detector, with EXTCLK selected as the reference `clock' and INTCLK as the input `data'. Another reference clock, EXTCLKD, is added for assistance; it is EXTCLK delayed by a detection timing window, $\tau$, as demonstrated in Figure 5(b). Following the logical expression in Equation (\ref{eqn:pd}), S+ indicates that INTCLK arrives earlier and more delay units in the VDP need to be inserted, while S- means that INTCLK arrives later and less delay is required. Furthermore, if INTCLK falls into the gap between EXTCLK and EXTCLKD, the signal is set to SL, a `LOCK' state.
The detection timing window is critical to the precision of the DCC, and a proper $\tau$ is required to suppress the phase error to a low level. However, $\tau$ is limited by the structure of the PD and is associated with the number of clock cycles needed to complete synchronization. In this paper, we choose $\tau$ as in Equation (16), and the parameter $\tau$ satisfies the constraint $T_{\rm R}/8 < \tau < T_{\rm R}/7$.
There are 8 legal states in the STM: `S0' to `S6', which stand for non-lock cases and connect to the ports of the VDP with the same names, and another state indicating `READY'. Traditionally, all states are accessed by forward traversal using shift registers. However, this leads to a long period and unnecessary power consumption: 8 clock cycles are required in the conventional way, whereas a method utilizing a binary search algorithm is implemented here to reduce the time to 3 clock cycles.
In the 3rd clock cycle, the DCC starts to work with the initial state set to `S3' even before the clock signal arrives, and correspondingly the control code of the VDP is also `S3'. When the rising edge of the clock arrives, the PD receives the information about the phase error $\phi$ and releases the indication signal `S+/S-/SL' to the STM.
In the 4th clock cycle, the current state is modified by the indication signal. If `S+' is received, the current state moves to `S5'; if `S-' is received, it moves to `S1'. If `SL' is received, the current state moves to `READY', and the switch of the VDP holds `S3' until the `READY' state is released. Therefore, when no lock occurs, there are two candidates for the current state register in this clock cycle: `S1' or `S5'.
In the 5th clock cycle, as shown in Figure 6(a), the current state register holds one of four possible states: `S0', `S2', `S4' or `S6'. All possible states can be reached, since the phase error $\phi$ is suppressed below $T_{\rm R}$ and the detection timing window $\tau$ follows the constraint $T_{\rm R}/8 < \tau < T_{\rm R}/7$, and the fine tuning process is completed.
In the 6th and later clock cycles, the STM is already stable, with the current state register in the `READY' state and the switch of the VDP keeping the last value of the 5th clock cycle; the STM is locked.
To keep track of the static delay shift of the CD, which is caused by subtle changes in PVT conditions, the DCC leaves the stable state when a phase error $\phi$ is captured again, and another dynamic alignment is performed. The initial state is still chosen as `S3' to keep the control logic of the STM simple.
The structure of the VDP is shown in Figure 6(b). The delay of the uninitialized path is $d_0$, while the delays of the other paths, controlled by `Sk', increase in steps of the detection timing window $\tau$ and equal $d_0+(k+1)\tau$. This is achieved by adding a proper capacitive load to the wire and optimizing the sizes of the transistors. In this structure of delay paths, a MUX is not used, to avoid performance loss, and the interpolation of different delay paths is preferred.
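The three-way decision of the PD can be written as a short behavioral sketch. It compares ideal edge arrival times rather than modeling the register-based detector, and the treatment of the exact window boundaries (both counted as lock) is an assumption, as is the function name.

```python
def phase_detect(t_int, t_ext, tau):
    """Classify the rising edge of INTCLK against EXTCLK and EXTCLKD.

    t_int, t_ext: arrival times of the INTCLK and EXTCLK rising edges.
    tau: detection timing window; EXTCLKD arrives at t_ext + tau.
    Boundary handling (inclusive lock window) is an assumption.
    """
    if t_int < t_ext:
        return "S+"    # INTCLK early: more VDP delay is needed
    if t_int > t_ext + tau:
        return "S-"    # INTCLK late: less VDP delay is needed
    return "SL"        # INTCLK inside the window: lock

# Hypothetical arrival times in picoseconds, with a 4 ps window.
print(phase_detect(-5.0, 0.0, 4.0), phase_detect(2.0, 0.0, 4.0), phase_detect(9.0, 0.0, 4.0))
```

The width of the lock window sets the residual error: once SL is reported, INTCLK trails EXTCLK by at most `tau` in this model.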
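The fine-tuning sequence of the 3rd to 5th clock cycles, combined with the VDP delay $d_0+(k+1)\tau$, can be sketched as a three-step binary search. The sketch is a behavioral model under assumptions: `lead` is a hypothetical quantity describing how far INTCLK leads EXTCLK when the VDP contributes only $d_0$, the lock window is taken as inclusive at both ends, times are integers in femtoseconds to avoid rounding effects, and the value of `tau` in the example is illustrative.

```python
def fine_tune(lead, tau):
    """Run the S3 -> S1/S5 -> S0/S2/S4/S6 search.

    lead: amount by which INTCLK leads EXTCLK with the VDP at d0 only
          (hypothetical quantity, same unit as tau).
    Returns (state, error, evaluations, locked), where error is the INTCLK
    arrival time minus the EXTCLK arrival time in the final state.
    """
    state, step = 3, 2
    error = 0
    for evaluations in (1, 2, 3):
        error = (state + 1) * tau - lead   # extra VDP delay minus initial lead
        if 0 <= error <= tau:              # PD reports SL
            return state, error, evaluations, True
        # S+ (INTCLK early) adds delay, S- (INTCLK late) removes delay.
        state += step if error < 0 else -step
        step //= 2                         # step sequence 2, 1, 0
    return state, error, 3, False

# Hypothetical example in femtoseconds: tau = 4 ps, initial lead = 23.3 ps.
print(fine_tune(lead=23300, tau=4000))
```

In this model every initial lead between 0 and $7\tau$ reaches the lock window within three PD evaluations, which corresponds to the three fine-tuning cycles; leads outside that range are reported as not locked.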
Experimental results
The proposed HPSC is implemented in a SMIC 0.13 μm 1P6M CMOS process, with a supply voltage of 1.2 V. The total active area of the core circuits is $245 \times 134$ μm$^2$, as shown in Figure 7(a), the micro-photograph of the proposed HPSC. The remaining area around the core circuits is filled with decoupling capacitors. The power consumption is 1.64 mW at 500 MHz with a 1 pF output load.
The critical parameter $T_{\rm R}$, which is close to the delay of a simple inverter, is about 31.6 ps, as verified by post-layout simulation, while the parameter $\tau$ is nearly 3.8 ps. Figure 7(b) shows the simulation results at different input clock frequencies, compared with the conventional SMD circuits. The points are distributed randomly but within a limited range. The conclusion is that the phase error is suppressed within $\pm 4$ ps with the DCC enabled, which is much lower than in the case without the DCC, nearly $\pm 18$ ps. The third curve, corresponding to the conventional SMD, shows that its tuning step is roughly twice the resolution of the CDL.
The simulated timing diagram for an input clock at 200 MHz with duty cycles of 20% and 80% is shown in Figure 8. The total load of the drivers and wires is assumed to be 1 pF, and the post-layout simulation uses the HSPICE tool under normal conditions. As the figure demonstrates, INTCLK is coarsely aligned with EXTCLK in 2 clock cycles, with the phase error suppressed to -23.3 ps. In the following cycles, the dynamic compensation circuit aligns INTCLK at a much finer grain, and the phase error is suppressed to -12, -4.5 and -0.7 ps in the 3rd, 4th and 5th clock cycles, respectively. The jitter is associated with the stability of the circuits and is suppressed to 0.4 ps (RMS) and 4.5 ps (peak-to-peak).
Measuring the phase error of the clocks in an actual test chip is very difficult. A Tektronix TDG-5078 instrument is used to generate the clock signals and an Agilent E8361A PNA is used to measure the phase error between the external clock source and the probed output clock. The delays of the clock drivers are set to four cases, 280, 400, 600 and 720 ps, and the measurements are based on 24 individual test chips. Two test chips are broken due to yield problems, and each of the remaining chips is tested 10 times independently. The measured phase errors between the clock source and the aligned output are distributed randomly in the range [3.2, 8.6] ps, and the histogram of the distribution is shown in Figure 9. However, the measured results are limited by the precision of the clock generator and by the jitter of both the HPSC and the clock signal generator. It is still possible to observe the alignment of the signals: the results of the test chip with the generated clock at various clock frequencies and duty cycles are shown in Figure 10. As the figures show, the output clock is aligned with the input clock in several clock cycles, 2 for coarse tuning and 3 for fine tuning. The performance compared with other similar circuits is listed in Table 1.
Parameter | JSSC04 [19] | ISCAS05 [20] | VLSI12 [24] | APCCAS12 [23] | CICC13 [16] | This work |
Process (nm) | 350 | 180 | 130 | 180 | 130 | 130 |
Supply Voltage (V) | 3.3 | 1.8 | 1.2 | 1.8 | 1.5 | 1.2 |
Frequency (MHz) | 170-230 | 200-400 | 300-800 | 150-900 | 80-450 | 200-800 |
Area (mm^2) | 0.79 | -- | 0.015 | -- | -- | 0.03 |
Power (mW) | 14.85 @ 230 MHz | 9.87 @ 400 MHz | 2.4 @ 800 MHz | 15 | 26 | 1.6 @ 500 MHz |
RMS jitter (ps) | 9.9 @ 230 MHz | -- | 2.25 @ 800 MHz | -- | 2.3 @ 180 MHz | 2.2 @ 500 MHz |
Peak-to-peak jitter (ps) | 72.7 @ 230 MHz | -- | 21.53 @ 800 MHz | -- | 10 @ 180 MHz | 20.8 @ 500 MHz |
Alignment Period (cycles) | 10 | 2 | 6 | 6 | 8 | 5 |
Phase error (ps) | 140 | 58.7 | 31.2 | 45 | 15 | 3.8 |
Conclusion and future work
In this paper, a novel circuit for high-precision clock synchronization in large-scale distribution networks, the HPSC, is proposed. An interleaved measurement delay line and a dynamic compensation circuit are adopted to achieve this precision in a short time, 5 clock cycles. The clock driver buffers are moved outside the HPSC for application in ASICs and clock distribution networks. As the experimental results show, the operating frequency ranges from 200 to 750 MHz with the duty cycle of the input clock ranging over [20%, 80%]. The phase error is suppressed below 3.8 ps after 5 clock cycles.
Since the other phase error, `jitter', is affected by the dynamic compensation process, we are working on the reduction and suppression of clock jitter, including that from the external clock source. Other attempts to improve the resolution without much extra cost in area and complexity will also be made as future work.