# Hierarchical distribution network for low skew and high variation-tolerant bufferless resonant clocking\*

Xu Yi(徐毅)<sup>†</sup>, Chen Shuming(陈书明), and Liu Xiangyuan(刘祥远)

School of Computer Science and Technology, National University of Defense Technology, Changsha 410073, China

**Abstract:** We propose a hierarchical interconnection network with two-phase bufferless resonant clock distribution, which mixes the advantages of mesh and tree architectures. The problems of skew reduction and variation-tolerance in the mixed interconnection network are studied through a pipelined multiplier under a TSMC 65 nm standard CMOS process. The post-simulation results show that the hierarchical architecture reduces more than 75% and 65% of clock skew compared with pure mesh and pure H-tree networks, respectively. The maximum skew in the proposed clock distribution is less than 7 ps under imbalanced loading and PVT variations, which is no more than 1% of the clock cycle of about 760 ps.

**Key words:** resonant clock; clock distribution network; clock skew; PVT variation

**DOI:** 10.1088/1674-4926/32/9/095011

**EEACC:** 1180; 1280; 2570D

# 1. Introduction

The clock distribution network (CDN) is one of the most important parts of a high-performance synchronous system<sup>[1]</sup>. However, with clock frequency increasing and technology scaling, the problems of power consumption and timing uncertainty become more and more challenging, which results in complicated designs for global and local clock distribution<sup>[2-4]</sup>.

Charge-recovery resonant clocking is a promising clock technology for high-performance VLSI design because it can reduce both power dissipation and clock uncertainty. Low-skew global clock distributions employing single-phase<sup>[5]</sup> or differential-phase<sup>[6]</sup> resonant clocking have been studied. However, the use of a decoupling capacitor and conventional clock buffers for local clocks reduces the clock power savings, because most clock power is dissipated in the local distributions. Hence, to increase the energy efficiency of resonant clocking, the capacitances of the interconnection network and the clock load are driven directly by an LC tank without any clock buffers, which forms a bufferless resonant clock distribution<sup>[7-11]</sup>. A scheme of resonant clocking without buffers<sup>[7]</sup> has been studied, but the chip failed to work at the target clock frequency with the on-chip LC tank because the energy compensating cells were too small. The skew and power consumption of single-phase<sup>[8]</sup> and two-phase<sup>[9]</sup> H-tree-shaped resonant clocking networks have been analyzed. In these clock distribution schemes, an imbalanced clock load and variations in the wire geometric parameters become the main causes of skew between different sinks. A 1.56 GHz bufferless resonant clocking scheme with a realistic load that employs a mesh interconnection network has been implemented<sup>[10]</sup>, but the skew was not analyzed because of the size limit. The power efficiency of single-phase and two-phase latch-based resonant clocking is compared in Ref. [11].

Although bufferless resonant clocking has an obvious advantage in power reduction, several issues remain to be resolved. One of the major challenges is to keep the clock skew within its constraint. Without clock buffers, the skew in the local distribution is affected by many factors. Take a differential-phase bufferless resonant clocking network as an example. In a pure mesh interconnection network, different parasitic parameters in the clock wires induce skew, although the mesh architecture provides sufficient tolerance towards variations. In a balanced tree-based interconnection network, an imbalanced clock load and PVT (process, voltage and temperature) variations are the main causes of clock skew. This paper presents a hierarchical interconnection network for bufferless resonant clock distribution. With the mixed interconnection of H-tree and mesh architectures, the hierarchical bufferless resonant clock distribution network (HBRCDN) achieves a much lower clock skew under an imbalanced clock load, and shows high tolerance towards PVT variations. The main factors in HBRCDNs, such as the interconnect architecture, mesh network modeling, on-chip inductor modeling and energy compensating cell design, are analyzed in detail. The clock skew is studied using an HBRCDN-based pipelined multiplier circuit under a TSMC 65 nm standard CMOS process. Post-simulation results show that the proposed CDN has a clock skew of 4.2 ps at the highest frequency (1.32 GHz) and exhibits high tolerance under an imbalanced load and PVT variations.

# 2. HBRCDN architecture

As a differential-phase resonant clocking scheme has better power efficiency than a single-phase one<sup>[11]</sup>, and the differential clock signals in the former are insensitive to power supply noise and other common-mode noise sources<sup>[3]</sup>, we choose a differential-phase resonant clocking network in this work. The interconnection architecture and design method, however, are the same as for the single-phase one.

<sup>\*</sup> Project supported by the National Science and Technology Major Project of the Ministry of Science and Technology of China (No. 2009ZX01034-001-006) and the National Natural Science Foundation of China (No. 60906014).

<sup>†</sup> Corresponding author. Email: xuyi.nudt@gmail.com. Received 8 March 2011, revised manuscript received 10 May 2011

Fig. 1. Differential-phase hierarchical bufferless resonant clocking network.

The bufferless resonant clock based on the hierarchical interconnection network combines H-tree and mesh architectures, as shown in Fig. 1. The on-chip inductor is connected to a 4-level H-tree interconnection, which can be seen as the source of the local clock distribution network (the clock signals oscillate through the action of the energy compensating cells). All the leaf nodes of the H-tree network are directly connected to the mesh network. Via the mesh and the interconnects between the mesh and the clock sinks, the two-phase clock signals (CK and CKN) are distributed to the corresponding ports of all the clock sinks, such as flip-flops, latches or other synchronous cells. Because of the losses in the clocking network, energy compensating cells are necessary to maintain oscillation. The resonant frequency and power consumption of the CDN are determined by the total inductance, total capacitance and effective serial resistance<sup>[7]</sup>.
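The dependence of the resonant frequency on the total inductance and capacitance can be illustrated with the textbook expression for an ideal lossless LC tank, $f_0 = 1/(2\pi\sqrt{LC})$. The sketch below is only a first-order aid: it ignores the effective serial resistance and the distributed, differential nature of the network, and the function names and numerical values are illustrative assumptions rather than values from this work.

```python
import math

def resonant_frequency(l_total, c_total):
    """Resonant frequency (Hz) of an ideal lossless LC tank."""
    return 1.0 / (2.0 * math.pi * math.sqrt(l_total * c_total))

def required_lc_product(f_target):
    """L*C product (H*F) an ideal tank needs in order to resonate at f_target (Hz)."""
    return 1.0 / (2.0 * math.pi * f_target) ** 2

# Illustrative values only (not taken from the paper): 1 nH and 10 pF.
f0 = resonant_frequency(1e-9, 10e-12)
lc = required_lc_product(f0)
print(f"f0 = {f0 / 1e9:.3f} GHz, LC product = {lc:.3e} H*F")
```

In a real design the lumped values would have to be replaced by the extracted network seen by the inductor, so this estimate should be treated as a starting point only.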

#### 2.1. Skew and wire cost

To analyze the clock skew in the HBRCDN, we define a few common terms:

(1)  $S = \{s_1, s_2, ..., s_N\}$  is the set of all N clock sinks, and  $C_S = \{c_{s1}, c_{s2}, ..., c_{sN}\}$  denotes the corresponding load capacitance of each sink, where  $s_i$  and  $c_{si}$  denote the *i*th sink and its load capacitance.

(2)  $W_{\text{tree}} = \{w_1, w_2, \dots, w_M\}$, $L_{\text{tree}} = \{l_1, l_2, \dots, l_M\}$  and  $P_{\text{tree}} = \{p_1, p_2, \dots, p_M\}$  are the sets of wire widths (assuming the two-phase clock signals have the same width), lengths and pitches at the different levels of the H-tree. *M* is the depth of the H-tree, and the total number of leaf nodes in the tree is  $2^M$ .

(3) The X and Y dimensions of the chip are given as  $X_{\text{bound}}$  and  $Y_{\text{bound}}$ , and the mesh size is defined by the numbers of horizontal and vertical segments, denoted by *m* and *n*, respectively. Therefore the total number of nodes in the mesh is  $m \times n$ , numbered sequentially from 1 to  $m \times n$ .  $W_{\text{mesh}}$  and  $P_{\text{mesh}}$  are the wire width and pitch in the mesh network.

(4) Each clock sink  $s_i$  is attached to the nearest mesh node using an interconnect stub of width  $W_{i\_stub}$  and length  $L_{i\_stub}$ . The number of segments from the attached mesh node to the nearest leaf node of the H-tree is *k*.

(5)  $DLY_{i\_tree}(W_{\text{tree}}, L_{\text{tree}}, P_{\text{tree}}, M)$  is the delay from the clock source to the *i*th leaf node in the H-tree.

(6)  $DLY_{i\_mesh}(W_{\text{mesh}}, P_{\text{mesh}}, k, W_{i\_stub}, L_{i\_stub}, c_{si})$  is the delay from the nearest leaf node of the H-tree to clock sink  $s_i$ .

Assume that the clock signals start at the on-chip inductor, that  $s_i$  and  $s_j$  are two arbitrary clock sinks in the CDN, and that the *k*th and *l*th leaf nodes are the H-tree nodes nearest to these two sinks, respectively. The clock skew is defined as the maximum difference in clock arrival times between  $s_i$  and  $s_j$ , which can be expressed as follows:

$$\text{skew}_{\text{HBRCDN}} = \left| (DLY_{k\_tree} + DLY_{i\_mesh}) - (DLY_{l\_tree} + DLY_{j\_mesh}) \right|_{\max}. \tag{1}$$
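As a small numerical illustration of Eq. (1), the sketch below adds a tree delay and a mesh delay for each sink and takes the largest pairwise difference in arrival time, which equals the latest arrival minus the earliest. The delay values are hypothetical and chosen only to make the arithmetic easy to follow; in practice they would come from post-layout simulation.

```python
from itertools import combinations

def arrival_time(dly_tree, dly_mesh):
    """Arrival time at a sink: source-to-leaf delay plus leaf-to-sink delay."""
    return dly_tree + dly_mesh

def max_skew(arrivals):
    """Maximum pairwise difference in arrival times, i.e. latest minus earliest."""
    return max(arrivals) - min(arrivals)

# Hypothetical (tree delay, mesh delay) pairs in ps for three sinks.
sinks = [(50.0, 3.0), (50.5, 4.5), (49.0, 2.0)]
arrivals = [arrival_time(t, m) for t, m in sinks]

skew = max_skew(arrivals)
# Brute-force check over all sink pairs, as written in Eq. (1).
pairwise = max(abs(a - b) for a, b in combinations(arrivals, 2))
print(skew, pairwise)
```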

For each clock sink, the clock arrival time is divided into two parts: the delay from the clock source to the leaf node nearest to the sink, and the delay from that leaf node to the clock sink. The first part of the total delay is a function of the wire width, length, pitch and depth of the H-tree. The second part is determined by many factors, including the parameters of the interconnect between the leaf node and the mesh node attached to the sink, the interconnect parameters of the stub and the load capacitance of the clock sink. The maximum area cost of the HBRCDN interconnect network, excluding the on-chip inductor, can be expressed as a sum of three parts as follows:

$$A_{\text{HBRCDN}} = 2 \left[ \sum_{i=1}^{M} w_i l_i + W_{\text{mesh}} \left( m X_{\text{bound}} + n Y_{\text{bound}} \right) + \sum_{j=1}^{N} W_{j\_stub} L_{j\_stub} \right]. \tag{2}$$

The first term on the right-hand side of the above equation is the total wire area over all levels of the H-tree. The second term is the wire cost of the whole mesh network, which is determined by the chip size, the mesh size and the width of the mesh interconnects. The last term is due to the stubs connected to each clock sink. Assuming that the two differential-phase interconnection networks have the same geometric parameters, we include a factor of 2 to compute the total wire area cost.
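The three terms of Eq. (2) can be evaluated directly, as in the following sketch. All geometry values are hypothetical (in μm) and serve only to show how the tree, mesh and stub contributions add up; the function name and argument layout are assumptions for illustration, and $l_i$ is used exactly as it appears in Eq. (2).

```python
def hbrcdn_wire_area(tree_widths, tree_lengths, w_mesh, m, n,
                     x_bound, y_bound, stubs):
    """Wire area of Eq. (2); `stubs` is a list of (width, length) pairs."""
    tree_area = sum(w * l for w, l in zip(tree_widths, tree_lengths))
    mesh_area = w_mesh * (m * x_bound + n * y_bound)
    stub_area = sum(w * l for w, l in stubs)
    # Factor of 2: both clock phases are assumed to share the same geometry.
    return 2.0 * (tree_area + mesh_area + stub_area)

# Hypothetical example: 2-level tree, 4 x 4 mesh on a 1000 um x 1000 um die.
area = hbrcdn_wire_area(
    tree_widths=[12.0, 6.0], tree_lengths=[500.0, 250.0],
    w_mesh=2.0, m=4, n=4, x_bound=1000.0, y_bound=1000.0,
    stubs=[(1.0, 10.0), (1.0, 20.0)],
)
print(area, "um^2")
```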



Fig. 2. (a) single- $\pi$  and (b) 3- $\pi$  models for interconnect.

#### 2.2. Wire models in the clocking network

To accurately model the behavior of the wires in the clocking network, we use the same method as in Ref. [13]. For wires shorter than 100  $\mu$m, a single- $\pi$  model with two capacitors, a resistor and an inductor is applied, and for longer wires a 3- $\pi$  model is used, as shown in Fig. 2. In the models, *R*, *L* and *C* are the resistance, inductance and capacitance per unit length, respectively, and  $\Delta z$  is the wire length. Thus the whole clock interconnection network is composed of a large number of lossy transmission lines with different parasitic parameters.
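The choice between the two wire models can be sketched as follows. This is a minimal illustration under two assumptions that are not stated in the text: the 3- $\pi$  model is taken to be three equal cascaded  $\pi$  sections, and a wire of exactly 100 μm is treated as a long wire. The per-unit-length values are hypothetical.

```python
def pi_section(r, l, c, dz):
    """One lumped pi section: series R and L, half of the shunt C at each end."""
    return {"R": r * dz, "L": l * dz,
            "C_left": c * dz / 2.0, "C_right": c * dz / 2.0}

def wire_model(r, l, c, length_um, threshold_um=100.0):
    """Return a list of pi sections: one for short wires, three for long ones."""
    sections = 1 if length_um < threshold_um else 3
    dz = length_um / sections  # assumption: equal-length sections
    return [pi_section(r, l, c, dz) for _ in range(sections)]

# Hypothetical per-um parameters (ohm/um, nH/um, fF/um).
short_wire = wire_model(0.5, 0.001, 0.2, 50.0)
long_wire = wire_model(0.5, 0.001, 0.2, 300.0)
print(len(short_wire), len(long_wire))
```

Whatever the number of sections, the series resistance and shunt capacitance summed over the sections equal the totals for the whole wire.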

In the multi-gigahertz regime, the frequency dependence of the serial resistance of the clock wires must be considered. Ignoring this factor leads to an underestimation of the clock arrival times, which makes the clock skew inaccurate. Therefore, the serial resistance per unit length including the skin effect is calculated as in Ref. [12]:

$$R(f) = \frac{\sqrt{\pi f \mu \rho}}{2(H+W)},\tag{3}$$

where *f* is the oscillating frequency in the HBRCDN,  $\mu$  is the permeability of the surrounding dielectric, and  $\rho$  is the resistivity of the interconnect metal. *H* and *W* are the thickness and width of the corresponding wire.
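Equation (3) is straightforward to evaluate numerically. The sketch below assumes SI units, the free-space permeability for  $\mu$  and a copper-like resistivity of 1.7 × 10⁻⁸ Ω·m; these material constants and the 3 μm wire width are assumptions for illustration, while the 0.9 μm thickness corresponds to the top metal in Table 1.

```python
import math

MU_0 = 4e-7 * math.pi  # free-space permeability in H/m (assumed for the dielectric)

def skin_resistance_per_length(f, rho, h, w, mu=MU_0):
    """Eq. (3): high-frequency serial resistance per unit length in ohm/m."""
    return math.sqrt(math.pi * f * mu * rho) / (2.0 * (h + w))

# Assumed copper-like resistivity; thickness 0.9 um and width 3 um in metres.
r_target = skin_resistance_per_length(1.32e9, 1.7e-8, 0.9e-6, 3e-6)
r_4x = skin_resistance_per_length(4 * 1.32e9, 1.7e-8, 0.9e-6, 3e-6)
print(f"R(f) = {r_target:.1f} ohm/m, ratio at 4f = {r_4x / r_target:.2f}")
```

The square-root dependence on frequency means that quadrupling the frequency doubles the resistance per unit length in this model.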

#### 2.3. On-chip spiral inductor

The on-chip spiral inductor is the most important component of the HBRCDN. When this passive device is integrated in a silicon-based standard CMOS process, the substrate losses beneath it increase heavily because of eddy-current effects in the high-conductivity silicon substrate. Additionally, the skin effect, proximity effect, capacitive coupling and inductive coupling in the spiral coils must be considered to model the behavior of the inductor accurately.

To reduce the area cost of the spiral inductor, we limit the passive device to an area of  $200 \times 200 \ \mu m^2$ . To maximize the quality factor of the inductor, a differential on-chip spiral inductor employing multi-layer metals is chosen, as shown in Fig. 3. The coils are implemented in several metal layers, including the thickest top-layer metal and some intermediate metals. Metals in different layers are connected by via arrays. For the inductor, the key design parameters are the inner coil diameter  $D_{\text{inner}}$ , the wire width  $w_{\text{ind}}$ , the spacing  $s_{\text{ind}}$  between two adjacent coils, the number  $n_{\text{coil}}$  of coils and the number  $n_{\text{ind}}$  of metal layers.

To obtain an accurate, SPICE-format compact circuit of the multi-layer differential inductor, we use the electromagnetic field solver HFSS<sup>[13]</sup> to model and analyze the passive devices. Inductors with different parameters are designed under a TSMC 65 nm standard CMOS process. Table 1 gives the key parameters of the process. Note that  $\varepsilon_{\rm eff}$  is the effective dielectric constant of the multi-medium layers and  $\sigma_{\rm sub}$  is an estimated value for a silicon substrate with high conductivity.

Fig. 3. Differential on-chip spiral inductor with multi-layer metals.

Table 1. Process parameters for on-chip spiral inductors.

| Parameter               | Description                         | Value     |
|-------------------------|-------------------------------------|-----------|
| $t_{\rm m8}$            | Top metal thickness                 | 0.9 μm    |
| $t_{\rm m7} - t_{\rm m5}$ | Intermediate metal thickness      | 0.22 μm   |
| $R_{\Box, \rm m8}$      | Top metal sheet resistance          | 0.02 Ω/□  |
| $R_{\Box, \rm m7-m4}$   | Intermediate metal sheet resistance | 0.022 Ω/□ |
| $\varepsilon_{\rm eff}$ | Effective dielectric constant       | 3.16      |
| $t_{\rm diel}$          | Dielectric thickness                | 5.965 μm  |
| $\varepsilon_{\rm sub}$ | Substrate dielectric constant       | 11.7      |
| $\sigma_{\rm sub}$      | Substrate conductivity              | 1000 S/m  |
| $t_{\rm sub}$           | Substrate thickness                 | 300 μm    |

#### 2.4. Energy compensating cells

Because of the resistive losses in the interconnection network, the resonant clocks in the HBRCDN would attenuate in every cycle and finally fail to oscillate. To overcome the losses and keep the network oscillating, energy compensating cells are necessary for resonant clock distributions. In the proposed HBRCDN, we use cross-coupled inverter pairs for energy compensation. The inverter pair acts as a negative resistance in the interconnect network, which cancels the parasitic resistance of the clock wires.

Table 2. Geometries and design metrics of on-chip inductors by HFSS.

| Index | $w_{\rm ind}$ (μm) | $s_{\rm ind}$ (μm) | $D_{\rm inner}$ (μm) | $n_{\rm coil}$ | $n_{\rm ind}$ | $L_{\rm dc}$ (nH) | $Q_{\rm p}$ | SRF (GHz) | $Q_{\rm p}/A_{\rm ind}$ (mm$^{-2}$) |
|-------|--------------------|--------------------|----------------------|----------------|---------------|-------------------|-------------|-----------|--------------------------------------|
| I1    | 12                 | 2                  | 90                   | 3              | 1             | 2.248             | 4.32        | 9.61      | 342.4                                |
| I2    | 12                 | 3                  | 70                   | 4              | 2             | 2.687             | 5.04        | 9.06      | 335.1                                |
| I3    | 8                  | 3                  | 90                   | 4              | 2             | 3.375             | 5.51        | 8.95      | 417.4                                |
| I4    | 12                 | 3                  | 70                   | 3              | 3             | 2.416             | 4.77        | 7.82      | 447.2                                |
| I5    | 10                 | 2                  | 90                   | 3              | 3             | 3.052             | 5.45        | 10.05     | 498.4                                |
| I6    | 12                 | 3                  | 50                   | 4              | 3             | 2.37              | 5.28        | 7.74      | 438.5                                |
| I7    | 8                  | 3                  | 90                   | 4              | 3             | 3.296             | 5.53        | 10.2      | 418.9                                |
| I8    | 10                 | 3                  | 50                   | 5              | 3             | 2.758             | 5.33        | 6.93      | 394.8                                |
| I9    | 12                 | 3                  | 70                   | 5              | 3             | 3.864             | 5.9         | 8.26      | 292.6                                |
| I10   | 10                 | 2                  | 90                   | 3              | 4             | 2.962             | 5.72        | 9.23      | 523.1                                |
| I11   | 12                 | 3                  | 50                   | 4              | 4             | 2.375             | 5.67        | 7.54      | 470.9                                |
| I12   | 12                 | 3                  | 70                   | 4              | 4             | 3.178             | 5.86        | 8.44      | 389.6                                |

As in Fig. 1, the cross-coupled inverter pair is inserted close to the on-chip inductor. If we assume that the NMOS and PMOS transistors have the same trans-conductance parameters in each inverter ( $\beta_{\text{NMOS}} = \beta_{\text{PMOS}}$ ) and that the channel-length modulation effect and body effect of the transistors can be ignored, then the effective trans-conductance  $G_{\text{eff}}$  of the total energy compensating cells can be expressed as follows:

$$G_{\rm eff} = NG_{\rm inv} \approx -\frac{N\beta_{\rm NMOS}}{2} \left( V_{\rm DD} - V_{\rm thn} - V_{\rm thp} \right), \tag{4}$$

where *N* is the total width of the NMOS transistors,  $V_{\text{DD}}$  is the supply voltage,  $V_{\text{thn}}$  and  $V_{\text{thp}}$  are the threshold voltages of the NMOS and PMOS transistors, respectively, and  $\beta_{\text{NMOS}} = K' \frac{W_{\text{NMOS}}}{L_{\text{NMOS}}}$  is the gain factor of the NMOS transistor with unit length. To get full-swing oscillating signals,  $G_{\text{eff}}$  must satisfy the condition  $G_{\text{eff}} \ge (R_{\text{eff}}C_{\text{total}})/L_{\text{total}}$ , where  $R_{\text{eff}}$  is the effective serial resistance of the interconnection network, and  $C_{\text{total}}$  and  $L_{\text{total}}$  are the total capacitance and inductance of the CDN.
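The sizing condition can be checked with a few lines of code. The sketch below evaluates Eq. (4) and compares it with  $R_{\text{eff}}C_{\text{total}}/L_{\text{total}}$ . Because Eq. (4) yields a negative value (a negative resistance), the comparison here is made on the magnitude of  $G_{\text{eff}}$ , and  $V_{\text{thp}}$  is taken as a magnitude; both are interpretive assumptions. Apart from the 32.3 pF total capacitance quoted in Section 3, all numbers are hypothetical.

```python
def effective_transconductance(n, beta_nmos, vdd, vthn, vthp):
    """Eq. (4): effective trans-conductance of the compensating cells (siemens)."""
    return -(n * beta_nmos / 2.0) * (vdd - vthn - vthp)

def sustains_oscillation(g_eff, r_eff, c_total, l_total):
    """Assumed reading of the condition: |G_eff| >= R_eff * C_total / L_total."""
    return abs(g_eff) >= r_eff * c_total / l_total

# Hypothetical device values; 32.3 pF is the CDN capacitance from Section 3.
g_eff = effective_transconductance(n=100, beta_nmos=1e-3,
                                   vdd=1.0, vthn=0.3, vthp=0.3)
ok_low_loss = sustains_oscillation(g_eff, r_eff=1.0, c_total=32.3e-12, l_total=3e-9)
ok_high_loss = sustains_oscillation(g_eff, r_eff=5.0, c_total=32.3e-12, l_total=3e-9)
print(g_eff, ok_low_loss, ok_high_loss)
```

With these illustrative numbers the cells sustain oscillation for a 1 Ω effective resistance but not for 5 Ω, which shows why lossier networks need wider compensating transistors.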

# 3. Experimental results

In this section, an HBRCDN-based synchronous circuit including 4 parallel pipelined multipliers is designed to analyze the timing uncertainty in the proposed clocking network. The target frequency *f* of the circuit is 1.32 GHz, and the whole circuit contains 1732 flip-flops (FFs) of uniform size; all the FFs are modified to suit the clocking scheme. The total capacitance of the two-phase CDN is about 32.3 pF. We choose a chip size of  $X_{\text{bound}} = Y_{\text{bound}} = 1000 \,\mu\text{m}$ , and limit the on-chip inductor to an area of  $200 \times 200 \,\mu\text{m}^2$ . The widths of the PMOS and NMOS transistors for the energy compensating cell are fixed at 582.4  $\mu\text{m}$  and 457.6  $\mu\text{m}$ , respectively, which gives an effective trans-conductance large enough for full-swing oscillation in all the experiments.

The synchronous circuit is synthesized, placed and routed using commercial EDA tools under a TSMC 65 nm CMOS process. The netlist of the physical design without the clocking network is then extracted for SPICE by an RC extraction tool. The resistance, inductance and capacitance per unit length of the interconnects in the clocking network are extracted using the 2-D electromagnetic field solver in HSPICE<sup>[14]</sup>. The entire interconnection network, including the H-tree, clock mesh and stub wires, is created for post-simulation. The on-chip inductors with different design parameters are modeled and analyzed by HFSS, and the passive device with the maximal  $Q_p/A_{ind}$  is selected for the resonant clock network ( $A_{ind}$  is the area cost of the inductor). Additionally, none of the two-phase interconnects in the clocking network is shielded.

#### 3.1. On-chip inductor selection

In our experimental circuit, we design the on-chip inductors with different parameters using the 3-D full-wave field solver HFSS; although this is time consuming, it yields accurate SPICE netlists for simulation. Table 2 compares the dc inductance  $L_{dc}$  at 1.3 GHz, the peak quality factor  $Q_p$  and the self-resonant frequency (SRF) of the different inductors, where  $w_{ind}$  is taken from the set  $\{8, 10, 12\}$  in  $\mu$m,  $s_{ind}$  from the set  $\{2, 3\}$  in  $\mu$m,  $D_{inner}$  from the set  $\{50, 70, 90\}$  in  $\mu$m,  $n_{coil}$  from the set  $\{3, 4, 5\}$ , and  $n_{ind}$  from  $\{1, 2, 3, 4, 5\}$ . The ratio  $Q_p/A_{ind}$  is used to choose the inductor. I10 is selected for the experimental circuits because it has the largest  $Q_p/A_{ind}$ .
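The selection rule can be reproduced from the  $Q_p$  and  $Q_p/A_{ind}$  columns of Table 2, as in the short sketch below. The dictionary layout and variable names are illustrative; the numbers are copied from the table.

```python
# (Q_p, Q_p / A_ind in mm^-2) for each candidate inductor in Table 2.
INDUCTORS = {
    "I1": (4.32, 342.4), "I2": (5.04, 335.1), "I3": (5.51, 417.4),
    "I4": (4.77, 447.2), "I5": (5.45, 498.4), "I6": (5.28, 438.5),
    "I7": (5.53, 418.9), "I8": (5.33, 394.8), "I9": (5.90, 292.6),
    "I10": (5.72, 523.1), "I11": (5.67, 470.9), "I12": (5.86, 389.6),
}

# Selection metric used in the text: quality factor per unit area.
best_by_density = max(INDUCTORS, key=lambda name: INDUCTORS[name][1])
# For comparison: the candidate with the highest peak quality factor alone.
best_by_q = max(INDUCTORS, key=lambda name: INDUCTORS[name][0])

print(best_by_density, best_by_q)
```

Ranking by  $Q_p$  alone would favor I9, which has the lowest  $Q_p/A_{ind}$  in the table; normalizing by area is what leads to I10.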

#### 3.2. Clock skew analysis

The clock skew in the proposed HBRCDN is subject to many factors, such as the design parameters of the interconnect network, loading imbalances and PVT variations. We analyze the skew of the experimental circuit under different cases in post-simulation.

#### 3.2.1. Impact of interconnect architecture

To evaluate the impact of the interconnect architecture on the HBRCDN, we design several interconnect networks with different mesh sizes and H-tree levels. The maximal latency, clock skew and wire cost are compared in Fig. 4. For the case of a pure H-tree, networks with depths of 8, 16 and 20 levels are designed and analyzed.

In the case of a zero-level H-tree, the interconnect network is a pure mesh architecture. In our experiment, the mesh size varies from  $8 \times 8$  to  $20 \times 20$ , with the wire width taken from the set  $\{3, 2.1, 1.6, 1.3\} \mu$m. The maximal width in the H-tree is limited to  $12 \mu$m by the design rules of the process. The results in Fig. 4(a) show that, as the mesh size and H-tree level increase, the clock skew in the HBRCDN falls sharply.

The results show that as the depth increases, the clock skew in the pure H-tree falls by about 2 ps from the 8-level to the 20-level H-tree. Compared with the 8-level pure H-tree architecture, the HBRCDN with a  $16 \times 16$  mesh and a 6-level tree reduces the skew by more than 65%. Compared with the pure mesh architecture, the HBRCDN with an H-tree cuts the clock skew by at least half; for the networks with a 6-level H-tree in both the  $16 \times 16$  and  $20 \times 20$  mesh-based architectures, the skew is less than 5 ps, an improvement of more than 75% over the pure mesh. The total wire cost  $A_{\text{HBRCDN}}$  is held constant across the different cases, which guarantees that the CDNs oscillate at the same frequency. For pure H-tree CDNs, the wire area of the stub interconnect falls sharply with tree depth. For the pure mesh and the HBRCDNs, as the mesh size and H-tree level increase, the wire area of the stub interconnect falls slightly and the wire cost of the mesh and tree increases, as shown in Fig. 4(b). Additionally, as the depth of the H-tree grows, the wire cost of the H-tree even exceeds that of the mesh network. The results show that, for the design of an HBRCDN, an H-tree with few levels, such as 4 or 6, is effective in keeping the clock skew within an acceptable range.

Fig. 4. Comparison of (a) maximal latency and skew and (b) wire cost in different interconnect architectures.

#### 3.2.2. Impact of imbalanced clock loads

To analyze the impact of imbalanced clock loads on clock skew in the HBRCDN, we first divide the full design into 4 parts, and in each part a multiplier is placed with a different chip utilization. Figure 5 shows the placement of the circuit (only DFFs are shown): Mul\_1 and Mul\_2 have utilizations of 100% and 75%, respectively, and Mul\_3 and Mul\_4 have the same utilization of 65%. Then, the imbalanced clock loads are implemented by activating different multipliers. For each activated multiplier, an input stimulus with a different switching activity (SA) is applied. A 16-level pure H-tree and an HBRCDN with a  $16 \times 16$  mesh and a 6-level H-tree are designed and compared.

Fig. 5. Clock sinks with various chip utilizations and case studies for imbalanced loads.

Fig. 6. Clock skew variations of (a) pure H-tree and (b) HBRCDN versus switching activities.

Figure 6 gives the relationship between clock skew and SA in each CDN. For the pure H-tree architecture, clock skew increases quickly with SA, and the maximal skew variation of 9 ps occurs in case 4, in which Mul\_1, Mul\_2 and Mul\_3 are activated. For the HBRCDN architecture, the maximal clock skew variation of 1.9 ps occurs in case 3, in which Mul\_1 and Mul\_2 are activated while Mul\_3 and Mul\_4 are inactive. For the case in which all multipliers are inactive, the clock skews are 10.6 ps and 4.1 ps in the two CDNs, which are not given in Fig. 7.

Table 3.  $3\sigma$  variations for various parameters.

| Parameter                   | $3\sigma$ variation |
|-----------------------------|---------------------|
| NMOS/PMOS channel length    | 0.004 μm            |
| NMOS/PMOS threshold voltage | 30 mV               |
| Wire resistance             | 20%                 |
| Wire capacitance            | 10%                 |
| Temperature                 | 20 °C               |
| $V_{\rm dd}$                | 10%                 |

Fig. 7. Ratios of PVT variations impact on the average latency.

#### 3.2.3. PVT-variation induced uncertainty

In this paper, we consider the sources of PVT variations, including power supply noise, temperature variations and the effects of process variation on transistors and wires. Table 3 lists the  $3\sigma$  variations for different parameters of the process in our experiments.

To evaluate the effect of PVT variations, we choose three HBRCDNs with different interconnect architectures: an  $8 \times 8$  mesh + 2-level H-tree, a  $12 \times 12$  mesh + 4-level H-tree, and a  $16 \times 16$  mesh + 6-level H-tree. For each interconnect architecture, we carry out many simulations with one or more parameters varied, and select 500 paths in each CDN for latency calculation. The impact of the different parameters on the average latencies is shown in Fig. 7. For all the HBRCDNs, wire resistance and capacitance variations are the main contributors (more than 60%) to latency changes, followed by temperature and supply voltage. Channel length and threshold voltage variations of the transistors in the energy compensating cells have little effect on clock latency, less than 7% in all CDNs. For an HBRCDN with a denser mesh and a deeper H-tree, the impact of wire resistance grows gradually, which is caused by the narrower mesh wires and longer H-tree paths.
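One common way to set up simulations "with one or more parameters varied" is to draw random deviations whose spread matches the  $3\sigma$  values of Table 3. The sketch below assumes independent zero-mean Gaussian deviations with standard deviation equal to one third of the tabulated  $3\sigma$  value; the text does not state the sampling scheme used, so this is purely an illustrative assumption, as are the dictionary keys.

```python
import random

# 3-sigma values from Table 3 (absolute or relative, as indicated by the key).
THREE_SIGMA = {
    "channel_length_um": 0.004,
    "vth_V": 0.030,
    "wire_resistance_rel": 0.20,
    "wire_capacitance_rel": 0.10,
    "temperature_C": 20.0,
    "vdd_rel": 0.10,
}

def sample_deviations(rng, vary=None):
    """Draw one set of deviations; parameters not in `vary` stay at nominal (0)."""
    active = set(THREE_SIGMA) if vary is None else set(vary)
    return {
        name: rng.gauss(0.0, three_sigma / 3.0) if name in active else 0.0
        for name, three_sigma in THREE_SIGMA.items()
    }

rng = random.Random(0)  # fixed seed so that runs are repeatable
wire_only = sample_deviations(rng, vary={"wire_resistance_rel", "wire_capacitance_rel"})
all_params = sample_deviations(rng)
print(wire_only)
```

Each sampled dictionary would then be applied to the netlist parameters before a simulation run; varying one group at a time isolates that group's contribution to the latency change.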

Table 4 compares the clock skews without and with PVT variations in the above three HBRCDNs. All the HBRCDNs are significantly affected by the variations, with the clock skew increasing by more than 50% once PVT variations are considered. However, the skew values remain small compared with the clock cycle of about 760 ps at the target frequency; for the  $16 \times 16$  mesh + 6-level H-tree, the maximal skew is less than 1% of the clock cycle.

Table 4. Maximal clock skews induced by PVT variations.

| Interconnect architecture            | Max. skew without PVT var. (ps) | Max. skew with PVT var. (ps) | Inc. (%) |
|--------------------------------------|---------------------------------|------------------------------|----------|
| $8 \times 8$ mesh + 2-level H-tree   | 9.1                             | 13.8                         | 51.6     |
| $12 \times 12$ mesh + 4-level H-tree | 7.4                             | 11.5                         | 55.4     |
| $16 \times 16$ mesh + 6-level H-tree | 4.2                             | 6.7                          | 59.5     |
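The percentages in Table 4 and the comparison with the clock cycle can be verified with simple arithmetic, as sketched below. The skew values and the 1.32 GHz target frequency are taken from the text; the variable and function names are illustrative.

```python
F_TARGET_HZ = 1.32e9
PERIOD_PS = 1e12 / F_TARGET_HZ  # clock cycle in ps, roughly 760 ps

# (max skew without PVT, max skew with PVT) in ps, from Table 4.
TABLE_4 = {
    "8x8 mesh + 2-level H-tree": (9.1, 13.8),
    "12x12 mesh + 4-level H-tree": (7.4, 11.5),
    "16x16 mesh + 6-level H-tree": (4.2, 6.7),
}

def increase_pct(without_pvt, with_pvt):
    """Relative skew increase caused by PVT variations, in percent."""
    return 100.0 * (with_pvt - without_pvt) / without_pvt

def skew_vs_period_pct(skew_ps):
    """Skew as a percentage of the clock cycle."""
    return 100.0 * skew_ps / PERIOD_PS

for name, (nominal, varied) in TABLE_4.items():
    print(name, round(increase_pct(nominal, varied), 1),
          round(skew_vs_period_pct(varied), 2))
```

Only the  $16 \times 16$  mesh + 6-level H-tree stays below 1% of the clock cycle once PVT variations are included; the two coarser architectures exceed it.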

# 4. Conclusion

In this paper, a hierarchical interconnect network has been proposed for a bufferless resonant clock network, which can reduce clock skew effectively and show high tolerance towards PVT variations. The impacts of different variables have been analyzed in post-simulation by a real synchronous circuit, and guidelines for choosing an interconnect architecture are given. To the best of our knowledge, this is the first work to study the clock skew of a bufferless resonant clocking scheme that considers the impacts of PVT variations. There is more work left to be done, such as the optimization of the interconnect network, handling obstacles of macro or IP blocks and power optimization of the HBRCDN.

# References

- [1] Restle P J, McNamara T G, Webber D A, et al. A clock distribution network for microprocessors. IEEE J Solid-State Circuits, 2001, 36(5): 792
- [2] Xanthopoulos T, Bailey D W, Gangwar A K, et al. The design and analysis of the clock distribution network for a 1.2 GHz alpha microprocessor. IEEE Int Solid-State Circuits Conf Dig Tech Papers, 2001: 402
- [3] Tam S, Limaye R D, Desai U N. Clock generation and distribution for the 130-nm Itanium2 processor with 6-MB on-die L3 cache. IEEE J Solid-State Circuits, 2004, 39(4): 636
- [4] Stolt B, Mittlefehldt Y, Dubey S, et al. Design and implementation of the POWER6 micro-processor. IEEE J Solid-State Circuits, 2008, 43(1): 21
- [5] Chan S C, Shepard K L, Restle P J. Uniform-phase, uniform-amplitude, resonant-load global clock distributions. IEEE J Solid-State Circuits, 2005, 40(1): 102
- [6] Chan S C, Shepard K L, Restle P J. Distributed differential oscillators for global clock networks. IEEE J Solid-State Circuits, 2006, 41(9): 2083
- [7] Drake A J, Nowka K J, Nguyen T Y, et al. Resonant clocking using distributed parasitic capacitance. IEEE J Solid-State Circuits, 2004, 39(3): 1520
- [8] Chueh J Y, Sathe V, Papaefthymiou M. Experimental evaluation of resonant clock distribution. IEEE Comput Soc Annu Symp VLSI Proc, 2004: 135
- [9] Chueh J Y, Sathe V, Papaefthymiou M. 900 MHz to 1.2 GHz two-phase resonant clock network with programmable driver and loading. IEEE Custom Integrated Circuit Conf, 2006: 777

- [10] Hansson M, Mesgarzadeh B, Alvandpour A. 1.56 GHz on-chip resonant clocking in 130 nm CMOS. IEEE Custom Integrated Circuit Conf, 2006: 241
- [11] Sathe V S, Kao J C, Papaefthymiou M C. Resonant-clock latch-based design. IEEE J Solid-State Circuits, 2008, 43(4): 864
- [12] Rabaey J M, Chandrakasan A, Nikolic B. Digital integrated circuits: a design perspective. 2nd ed. Prentice Hall, 2003
- [13] Ansoft Corporation, HFSS. [Online]. Available: http://www.ansoft.com/products/hf/hfss/
- [14] Synopsys. HSPICE signal integrity user guide. Version Z-2007.03, 2007: 109