# Circuit design of a novel FPGA chip FDP2008\*

Wu Fang(吴方), Wang Yabin(王亚宾), Chen Liguang(陈利光), Wang Jian(王健), Lai Jinmei(来金梅)<sup>†</sup>, Wang Yuan(王元), and Tong Jiarong(童家榕)

(State Key Laboratory of ASIC & System, Fudan University, Shanghai 201203, China)

**Abstract:** A novel FPGA chip, FDP2008 (Fudan Programmable Logic), has been designed and implemented in the SMIC 0.18 $\mu$m CMOS logic 1P6M process. A new design method allows the configurable logic block to be configured as distributed RAM or as a shift register. A universal programmable routing circuit is also presented; by adopting offset lines, complementary hanged end-lines and MUX + Buffer routing switches, the whole FPGA chip is highly repeatable, and the signal delay is uniform and predictable across the chip. A standard configuration interface, SPI, is added to the configuration circuit, and a group of highly sensitive amplifiers is used to magnify the read back data. FDP2008 contains 20 × 30 logic TILEs, 200 programmable IOBs and 10 × 4 kbit dual port block RAMs. The hardware-software co-test shows that FDP2008 works correctly and efficiently.

**Key words:** FPGA; CLB; RAM; programmable routing resource; configuration

**DOI:** 10.1088/1674-4926/30/11/115009 **EEACC:** 1130; 1130B

# 1. Introduction

Field programmable gate arrays (FPGAs) have been widely used in communication, multimedia, industrial control, numerical computation, etc., and the demand for FPGA chips is increasing. However, most FPGA chips are supplied by large FPGA corporations abroad, and few have been designed domestically. FDP2008 has therefore been designed and implemented, and its function and performance are comparable with current mainstream FPGA chips.

Configurable logic blocks (CLBs), programmable routing resources, a configuration circuit, embedded IP cores, and programmable I/O modules are the five primary modules of an FPGA chip<sup>[1]</sup>. The basic design of these five modules has already been studied extensively, so this paper focuses on the additional function and performance gained from the circuit architecture. For instance, CLBs can be configured as distributed RAM or as a shift register; a novel hierarchical, repeatable routing resource is designed to improve the signal delay; high precision sensitive amplifiers are adopted to magnify the read back data, etc. All of these design methodologies are applied to the FDP2008 chip.

The FDP2008 chip has been taped out in the 0.18 $\mu$m CMOS logic 1P6M process. It contains 600 CLBs, three kinds of interconnect lines, 10 × 4 kbit dual port block RAMs, and 200 programmable IOBs. A systematic test and a software-hardware co-test both show that FDP2008 works correctly and efficiently.

# 2. Novel circuit design

FDP2008 contains a 20  $\times$  30 programmable logic TILE array, 200 programmable IOBs, a configuration circuit, block RAMs, a clock distribution network, and many other functional circuits. Figure 1 shows an overview of the FDP2008 circuit architecture.

A TILE, which is the basic repeatable unit of the proposed FPGA chip, comprises a CLB and its corresponding routing resource. Each CLB contains two slices, while its corresponding routing resource includes an input multiplexer (IM), an output multiplexer (OM), a general routing box (GRB) and programmable interconnect lines. Figure 2 shows a diagram of a TILE.



Fig. 1. Overview of the FDP2008 architecture.

<sup>\*</sup> Project supported by the National High Technology Research and Development Program of China (No. 2007AA01Z285) and the National Natural Science Foundation of China (No.60876015).

<sup>†</sup> Corresponding author. Email: jmlai@fudan.edu.cn Received 12 April 2009, revised manuscript received 22 June 2009



### 2.1. High-performance CLB circuit

As mentioned above, the proposed CLB of the FDP II chip contains two identical slices; each slice is mainly composed of:

(1) Two 4-input mainstream LUTs<sup>[2]</sup>(G and F), which can be configured to realize arbitrary 4-input combinational logic, several kinds of distributed RAM and two 16-bit shift registers.

(2) Two sequential memory cells, which can be configured as a DFF or a latch, with synchronous or asynchronous set/reset.

(3) Add logic and a vertical carry chain, which can realize fast multi-bit adding.

(4) 5 to 1 MUX, to implement arbitrary 5-input combinational logic.

(5) 6 to 1 MUX, to implement arbitrary 6-input combinational logic. Two slices are needed.

(6) A control unit, serving as a timing controller when the LUTs are configured as distributed RAMs or shift registers.

(7) A capture and write back circuit, which is used to capture and save the value in the sequential memory cells and write the saved value back into the memory cells.

(8) Some MUXes, to select the appropriate output.

By using the proposed slice architecture, two arbitrary 4-input logic functions, one arbitrary 5-input logic function, one arbitrary 6-input logic function, the adder, carry chain, multiplier, 4-input MUX, 8-input MUX, some 9-input MUXes and some 18-input MUXes, etc. can be easily realized in a single CLB.
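The way two 4-input LUTs and a selector combine into one 5-input function can be illustrated with a small behavioral model. This is only a sketch: it assumes the selector after LUT F and LUT G is driven by the fifth input (a Shannon expansion), which the text does not spell out, and the names `lut4`, `build_tables` and `slice_eval` are hypothetical.

```python
from itertools import product

def lut4(table, a, b, c, d):
    """Read one bit from a 16-entry truth table; (a, b, c, d) form the address, a is the LSB."""
    return table[a | (b << 1) | (c << 2) | (d << 3)]

def build_tables(func5):
    """Split a 5-input function into two 4-input truth tables by fixing the fifth input."""
    f_table = [func5(i & 1, (i >> 1) & 1, (i >> 2) & 1, (i >> 3) & 1, 0) for i in range(16)]
    g_table = [func5(i & 1, (i >> 1) & 1, (i >> 2) & 1, (i >> 3) & 1, 1) for i in range(16)]
    return f_table, g_table

def slice_eval(f_table, g_table, a, b, c, d, e):
    """Assumed structure: two LUTs followed by a 2:1 selector driven by the fifth input e."""
    return lut4(g_table, a, b, c, d) if e else lut4(f_table, a, b, c, d)

def parity5(a, b, c, d, e):
    """Example 5-input function used to exercise the model."""
    return a ^ b ^ c ^ d ^ e

f_table, g_table = build_tables(parity5)

# Compare the LUT-based model against the reference function on all 32 input patterns.
mismatches = sum(
    slice_eval(f_table, g_table, *bits) != parity5(*bits)
    for bits in product((0, 1), repeat=5)
)
print("mismatches:", mismatches)
```

The same splitting step applied once more, across the two slices of a CLB, would give a 6-input function, which is consistent with item (5) requiring two slices.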

However, there are additional benefits when considering the sequential functions.

Starting from the commonly used 5-transistor or 6-transistor memory cell, the distributed RAM<sup>[3]</sup> can be realized simply by adding some logic to it. wl-ram, D\_ram, and Db\_ram are added as the address and data lines of the distributed RAM (Fig. 3 without the highlighted part). When signal ws is 1, the memory cells are used as distributed RAM. The configuration pattern of distributed RAM is adjustable, and there are four patterns, $1 \times 16$ bit, $2 \times 16$ bit, $1 \times 32$ bit and dual port, all of which can be configured by the distributed RAM control circuit.

Fig. 3. Novel memory cell.

By adding some other logic to the memory cell of the distributed RAM, the shift register<sup>[4]</sup> function can be realized, as shown in the highlighted part of Fig. 3. PHI\_1 and PHI\_2 are two phase clocks, SIN is data input and SOUT is data output. When PHI\_2 is 1 and PHI\_1 is 0, the data written to the next memory cell is retained by the dynamic latch, namely the parasitic capacitance of point A in Fig. 3.

Also, as there is a 16 to 1 selecting circuit after the memory cells, the width of the shift register is adjustable. The width of the shift register can be changed by the value of G [4 : 1] or F [4 : 1]. For example, when the value of G [4 : 1] is 0111, the width of the shift register configured by LUT(G) is 8 bit.
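The relation between the LUT address and the shift register width can be sketched with a behavioral model. The rule "width = address + 1" is an assumption generalized from the single example above (0111 gives 8 bit); the class `LutShiftRegister` and the function `effective_length` are hypothetical names, and the two-phase clocking is collapsed into one `shift` call.

```python
class LutShiftRegister:
    """Behavioral model of a 16-cell shift register with an address-selected output tap."""

    def __init__(self):
        self.cells = [0] * 16  # cells[0] holds the most recently shifted-in bit

    def shift(self, sin):
        """One full PHI_1/PHI_2 cycle: SIN enters cell 0, every other cell takes its neighbor's value."""
        self.cells = [sin] + self.cells[:-1]

    def sout(self, address):
        """The 16-to-1 selecting circuit picks the cell addressed by the 4-bit LUT input."""
        return self.cells[address]

def effective_length(address):
    """Assumed rule: a tap at cell index `address` gives a register of address + 1 cells."""
    return address + 1

sr = LutShiftRegister()
pattern = [1, 0, 1, 1, 0, 0, 1, 0, 1, 1, 1, 0, 0, 1, 0, 1, 0, 0, 1, 1]
address = 0b0111  # the G[4:1] value used in the example above

outputs = []
for bit in pattern:
    sr.shift(bit)
    outputs.append(sr.sout(address))

print("width for address 0111:", effective_length(address))
```

In this model a bit read at the selected tap has passed through cells 0 to 7, that is, eight cells, which matches the 8 bit width quoted for G [4 : 1] = 0111.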

The design details of the distributed RAM, the shift register and the control circuit have been discussed before<sup>[5]</sup> and will be omitted in this paper.

### 2.2. Uniform programmable routing resource

The commonly-used hierarchy routing architecture is used to design the programmable routing resource, and the TILE routing resource can be divided into local routing resource and global routing resource.

Internal CLB feedback paths can provide high-speed connections to LUTs within the same CLB, chaining them together with minimal routing delay. Direct paths can provide high-speed connections between horizontally adjacent CLBs, eliminating the delay of the GRB.

For the global routing resource, the traditional CB (connecting box) and SB (switch box) of FPGA routing design are replaced by a novel GRB, IM and OM, which together make up the TILE routing resource. The GRB is employed to implement the connection between the three kinds of interconnect lines: single lines, length-6 lines, and long lines. The IM is employed to implement the signal input from the interconnect lines to the CLBs. The OM is employed to implement the CLB output to the three kinds of interconnect lines (Fig. 2). The routing parameters are shown in Table 1.

The routing relation between the three kinds of interconnect lines is shown in Fig. 4; it is designed to ease use by the software and also to alleviate the loading of the interconnect lines, which affects signal delay.

Table 1. Routing parameters.

| Routing parameter              | Value         |
|--------------------------------|---------------|
| Single lines                   | $24 \times 4$ |
| Uni-directional length-6 lines | $48 \times 2$ |
| Bi-directional length-6 lines  | $24 \times 2$ |
| Long lines                     | $12 \times 2$ |
| IM inputs                      | 132           |
| IM outputs                     | 26            |
| OM inputs                      | 12            |
| OM outputs                     | 8             |

Fig. 4. Routing relation of interconnect lines.

The optimization strategies of offset by sets<sup>[6]</sup> in TILE routing and complementary hanged end-lines<sup>[6]</sup> in I/O routing are used to design the top level of the routing resource. Both strategies guarantee that the interconnect circuit needs only one repeating unit, and that the relative position of each kind of routing resource is completely identical. So, the top level distribution of the programmable routing resource is unified. This provides the post-layout design with an effective repeatable unit, which greatly reduces the layout workload. Meanwhile, when the chip is scaled up, the physical layout can be easily extended by copying, which greatly shortens the research and development cycle of products.

In addition, the MUX + Buffer structure is employed to realize the routing switch, which ensures that the signal delay is highly uniform and predictable across the whole chip.

FDP2008 also includes ten 4 kbit dual-port block RAMs. There is a dedicated interconnection between the block RAM and the global routing resource, which adopts the same routing strategies as TILE routing. This not only makes the top level of the routing resource uniform but also reduces the routing congestion around the block RAM.

### 2.3. Flexible configuration circuit

The proposed configuration circuit<sup>[7]</sup> can write each single memory cell in FDP2008, providing more flexible configuration operations, and it can also read data back from FDP2008 successfully. Four main features have been improved in the FDP2008 configuration circuit: (1) Five different kinds of configuration interfaces<sup>[8]</sup> are provided: a standard Serial Peripheral Interface (SPI), slave parallel, slave serial, master serial, and JTAG; (2) A Write/Read asynchronous FIFO structure is used to divide the external interface and the internal configuration circuit into two clock domains, so that designers can set the external clock and internal clock separately; (3) A novel configuration read back circuit could write each single memory cell in FDP2008; (4) A group of high-precision sensitive amplifiers is designed to magnify the read back data value.

Table 2. Configuration performance of FDP2008.

| Parameter                                       | Value     |
|-------------------------------------------------|-----------|
| Total configuration data frames                 | 1536      |
| Bits per frame                                  | 396       |
| Total configuration bits (including dummy bits) | 612536    |
| Data write time of each frame                   | 1.6 µs    |
| Data read back time of each frame               | 2 µs      |
| Total write time                                | 2.6 ms    |

The area of the configuration circuit is about $1500 \times 200\ \mu\text{m}^2$. Post-layout timing analysis and function verification of the FDP2008 configuration circuit have been performed with Synopsys PrimeTime and Mentor ModelSim. Through analysis and simulation, each frame in FDP2008 can be written in less than 1.6 $\mu$s by using a 50 MHz clock and the parallel interface. The read back time of each frame is less than 2 $\mu$s, and the total configuration time is about 2.6 ms in FDP2008. All the parameters are shown in Table 2.
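The figures in Table 2 can be cross-checked with a few lines of arithmetic. The sketch below only recombines the published numbers; the variable names are illustrative, and treating the per-frame times as upper bounds that simply add up is an assumption, since the text does not break the total down.

```python
FRAMES = 1536
BITS_PER_FRAME = 396
TOTAL_BITS_WITH_DUMMY = 612536
FRAME_WRITE_US = 1.6   # upper bound per frame, parallel interface
FRAME_READ_US = 2.0    # upper bound per frame
CLOCK_MHZ = 50

# Payload bits carried by the frames, and the remainder attributed to dummy bits.
payload_bits = FRAMES * BITS_PER_FRAME
dummy_bits = TOTAL_BITS_WITH_DUMMY - payload_bits

# Summing the per-frame bounds gives an estimate of the frame traffic alone.
write_bound_ms = FRAMES * FRAME_WRITE_US / 1000
read_bound_ms = FRAMES * FRAME_READ_US / 1000

# Clock cycles available per frame write at 50 MHz.
cycles_per_frame = FRAME_WRITE_US * CLOCK_MHZ

print("payload bits:", payload_bits)
print("dummy bits:", dummy_bits)
print("frame write total (ms):", round(write_bound_ms, 4))
print("frame read back total (ms):", round(read_bound_ms, 4))
print("cycles per frame write:", round(cycles_per_frame))
```

The summed frame write time comes out slightly below the quoted total of about 2.6 ms; the difference presumably covers overhead outside the frame writes, which the source does not itemize.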

### 2.4. Other modules

FDP2008 contains ten 4 kbit block RAMs<sup>[9]</sup>, located at the left and right sides of the chip, five on each side. Each 4 kbit block RAM can be flexibly configured as one of the five work modes, namely, 4 kbit  $\times$  1, 2 kbit  $\times$  2, 1 kbit  $\times$  4, 512 bit  $\times$  8 and 256 bit  $\times$  16, which can satisfy different applications.
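As a quick consistency check, the five work modes all address the same 4 kbit array and differ only in how the address and data widths are traded off. The sketch below assumes 1 kbit = 1024 bits; `address_bits` is a hypothetical helper.

```python
BLOCK_RAM_BITS = 4 * 1024  # assumes 1 kbit = 1024 bits

# (depth in words, word width in bits) for the five modes named in the text
MODES = [(4096, 1), (2048, 2), (1024, 4), (512, 8), (256, 16)]

def address_bits(depth):
    """Number of address lines needed to select one of `depth` words (depth is a power of two)."""
    return depth.bit_length() - 1

summary = []
for depth, width in MODES:
    assert depth * width == BLOCK_RAM_BITS  # every mode uses the whole array
    summary.append((depth, width, address_bits(depth)))

for depth, width, abits in summary:
    print(f"{depth:>4} x {width:<2}  address bits: {abits:>2}  data bits: {width}")
```

Each doubling of the data width removes one address bit, so the narrowest mode needs 12 address lines and the widest needs 8.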

Programmable IOBs<sup>[9]</sup> are distributed at the exterior of the FDP2008 core which is made up of a TILE array. Each IOB can be configured as input mode, output mode, bi-directional mode, and three-state mode. All of the input, output and three-state control signals can be configured to be latched or not.

Four clock pads are placed at the center of the top and bottom of the chip. In order to minimize the clock skew, H-tree clock networks are used, and the clock buffers are carefully designed to optimize the clock distribution.

There are other modules<sup>[9]</sup> such as the JTAG test module, the boundary scan module, and the three-state bus etc. which will not be discussed here.

# 3. Implementation of FDP2008

The FDP2008 chip contains approximately ten thousand equivalent logic gates and fifty thousand transistors. The layout of the circuit modules such as slice, CLB, GRB, IM, OM, TILE, IOB, and the interconnect lines is designed with the full-custom method in Virtuoso under a Linux environment; meanwhile, the configuration circuit is described in Verilog HDL and synthesized with the Synopsys Design Compiler tools, and its layout is then generated by the Synopsys Astro tool. The total layout of FDP2008 is also DRC and LVS validated by Calibre.

J. Semicond. 30(11)

The most important characteristic of the FPGA layout is that the TILE array core is highly repeatable. So, in order to make TILEs easy to abut and the chip easy to scale up, the layout of FDP2008 is designed with the methods below:

(1) After flattening of the schematic, the total circuit is partitioned by function modules, then the area, length and width of each module is estimated.

(2) Decide the position of each module by the connection relation.

(3) Prescribe the wordline and bitline directions of SRAMs in each module and the wordline, bitline, and other connection line directions between the modules. Also prescribe the level of metal which is used for each kind of line.

(4) The inner layout of a TILE must keep the power and ground lines clear and regular enough that TILEs can be joined like standard cells.

Besides DRC, LVS and the design methods mentioned above, in order to ensure high yield<sup>[10]</sup>, the layout of FDP2008 is also specially designed against the latch-up and antenna effects. To suppress latch-up, several methods are used: NMOS transistors are placed as near GND as possible and PMOS transistors as near VDD as possible; enough space is kept between PMOS and NMOS to reduce the possibility of triggering the parasitic SCR; and guard rings are used, with NMOS surrounded by a P+ ring connected to GND and PMOS surrounded by an N+ ring connected to VDD. To suppress the antenna effect, the long lines are cut near the gate and then joined again with high level metal. When it is inconvenient to change metal for the signal lines, reverse diodes are directly inserted.

Figure 5 shows the total layout of FDP2008. The total TILE array is partitioned into four blocks, each containing a $10 \times 15$ TILE array. The vertical center is a bitline download chain and the horizontal center is a wordline download chain; block RAMs lie on the left and right sides of the TILE array, and the IO blocks and PADs occupy the periphery of the total layout.

The FDP2008 chip has been implemented with the SMIC 0.18  $\mu$ m logic 1P6M salicide 1.8 V/3.3 V process. The die size of FDP2008 is about 6.5 × 6.8 mm<sup>2</sup> with a QFP208 package, and the programmable routing resource costs approximately 60% of the total layout area. The total user available programmable resource of FDP2008 is the same as the Xilinx device Xc2s100e which is also implemented with a 0.18  $\mu$ m process and contains ten thousand equivalent logic gates; the programmable resources of FDP2008 and Xilinx Xc2s100e are listed in Table 3.



Fig. 5. Total layout of FDP2008.

Table 3. Comparison between system resources of FDP2008 and Xilinx Xc2s100e.

| System resource         | FDP2008          | Xc2s100e         |
|-------------------------|------------------|------------------|
| Technology (µm)         | 0.18             | 0.18             |
| Area (mm<sup>2</sup>)   | $6.5 \times 6.8$ | $5.8 \times 5.8$ |
| System gates            | 10 000           | 10 000           |
| Block RAM (kbit)        | 40               | 40               |
| Total CLBs              | 600              | 600              |
| Total slices            | 1200             | 1200             |
| Total LUTs              | 2400             | 2400             |
| GCLKs                   | 4                | 4                |
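The resource counts in Table 3 follow from the array dimensions given earlier in the paper. The short sketch below recomputes them and also compares the two die areas; the variable names are illustrative, and the area ratio is a derived figure that does not appear in the source.

```python
TILE_ROWS, TILE_COLS = 20, 30   # 20 x 30 TILE array, one CLB per TILE
SLICES_PER_CLB = 2
LUTS_PER_SLICE = 2              # LUT G and LUT F
BLOCK_RAMS, KBIT_PER_BLOCK_RAM = 10, 4

tiles = TILE_ROWS * TILE_COLS
slices = tiles * SLICES_PER_CLB
luts = slices * LUTS_PER_SLICE
block_ram_kbit = BLOCK_RAMS * KBIT_PER_BLOCK_RAM

# The layout splits the array into four 10 x 15 blocks; they must add up to the full array.
quadrant_tiles = 4 * 10 * 15

fdp_area_mm2 = 6.5 * 6.8
xc_area_mm2 = 5.8 * 5.8
area_ratio = fdp_area_mm2 / xc_area_mm2

print("CLBs:", tiles, "slices:", slices, "LUTs:", luts)
print("block RAM (kbit):", block_ram_kbit)
print("die area ratio FDP2008 / Xc2s100e:", round(area_ratio, 2))
```

Under these numbers the logic resources match exactly, while the FDP2008 die is roughly 1.3 times the area of the reference device.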



Fig. 6. FDP2008 test platform.

# 4. Test of FDP2008

To evaluate the performance of the FDP2008 chip, a test platform (Fig. 6) is built up, including an FDP2008 test board, a signal generator (Agilent 33120A), a high precision oscilloscope (Agilent 54622D mixed signal oscilloscope), a digital multimeter (Victor VC9810A), a regulated power supply (MOTECH LPS305), a PC, etc. Figure 7 shows the test board of FDP2008; the FDP2008 chip is placed at the middle of the test board.

Table 4. Comparison of performance between FDP2008 and Xilinx Xc2s100e.

| Parameter                  | Description                                | FDP2008 | Xc2s100e |
|----------------------------|--------------------------------------------|---------|----------|
| $V_{\rm OH}$ (V)           | Output high voltage                        | 2.9     | 2.28     |
| $V_{\rm OL}$ (V)           | Output low voltage                         | 0.08    | 0.3      |
| $I_{\rm CCQ\_CORE}$ (mA)   | Quiescent current                          | < 20    | < 200    |
| $T_{\rm CLB}$ (ns)         | Combinational logic delay of CLB           | 1.025   | 0.9      |
| $T_{\rm CKQ}$ (ns)         | Sequential logic delay of CLB              | 0.75    | 1        |
| $T_{\rm CIN2COUT}$ (ns)    | Carry logic delay of CLB                   | 0.28    | 0.15     |
| $T_{\rm LENGTH6}$ (ns)     | Length-6 line delay spanning 6 TILEs       | 0.48    | 0.50     |
| $T_{\rm LONG}$ (ns)        | Long line delay spanning 12 TILEs          | 0.96    | 0.99     |
| $T_{\rm PIN2PIN}$ (ns)     | Input to output delay of IOB (without PAD) | 3.8     | 3.4      |
| $T_{\rm CK2PIN}$ (ns)      | Clk to pad delay of IOB (without PAD)      | 3.6     | 3.4      |

Fig. 7. FDP2008 test board.

Fig. 8. FDP2008 test flow.

Since the function realization of an FPGA depends on the corresponding CAD software, an integrated CAD tool, the FPGA Design Environment (FDE)<sup>[11]</sup>, has been developed. It includes netlist conversion, partition, technology mapping, placement and routing, and a configuration bit file generation module. Details of the FDE tool are omitted in this paper.

The test flow of FDP2008 is shown in Fig. 8. First, decide the FPGA resource to be tested and its related configuration, and then generate the bit file with the FDP2008 test software or the FDE flow. Then download the bit file to FDP2008 through the configuration chip (Fig. 6, Cyclone EP1C6T144CB) or the PC parallel port, using the reference configuration circuit, to realize the specific function. Finally, load the test vector to the programmable I/Os, and observe the output response on the LEDs of the test board, the oscilloscope or the multimeter, etc. Moreover, there is a boundary scan circuit in FDP2008. It can receive the automatic test vector generated by the PC or the reference chip and, when the test response is generated, transfer it back to the PC or the reference chip. So the test flow can be automated.

Fig. 9. Output wave of the test benchmark on an oscilloscope.

FDP2008 has been systematically tested; the test result shows that the programmable logic blocks, the programmable routing resource, the configuration circuit, the programmable I/Os, block RAM, and the other circuit modules perform correctly and efficiently, and the test data are shown in Table 4. As the Xilinx Xc2s100e is the same as FDP2008 in terms of technology process and system resources, the performance of Xilinx Xc2s100e is also given in Table 4. It is obvious that the FDP2008 is quite comparable with Xilinx Xc2s100e in performance, and FDP2008 is a reasonable design.
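One way to read the interconnect rows of Table 4 is to normalize each delay by the number of TILEs spanned. The sketch below does this with the published values; using delay per TILE as a uniformity measure is an illustrative choice made here, not a metric taken from the paper.

```python
# (delay in ns, number of TILEs spanned), taken from Table 4
DELAYS_NS = {
    "FDP2008": {"length6": (0.48, 6), "long": (0.96, 12)},
    "Xc2s100e": {"length6": (0.50, 6), "long": (0.99, 12)},
}

# Delay per TILE for each line type on each device.
per_tile = {
    chip: {name: delay / span for name, (delay, span) in lines.items()}
    for chip, lines in DELAYS_NS.items()
}

for chip, lines in per_tile.items():
    for name, value in lines.items():
        print(f"{chip:9s} {name:8s} {value:.4f} ns per TILE")
```

For FDP2008 both line types work out to the same delay per TILE, which is consistent with the claim that signal delay is uniform and predictable across the chip.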

Also, a series of test benchmark circuits were successfully implemented in FDP2008; for example, a function circuit based on a 29-bit multiplexer, including one 29-bit multiplexer, two 29-bit counters and one 52-bit adder, has been realized on FDP2008. It uses 1198 slices (99%) and 2166 LUTs (90%); of these LUTs, 1952 (81%) implement logic functions, and the other 214 are used as routing transits; 3 IOBs and 1 global clock line are also used. Figure 9 shows the output wave of this test benchmark. Some of the test benchmark circuits implemented on FDP2008 are listed in Table 5, together with the number of slices used.
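The utilization percentages quoted for this benchmark can be reproduced from the resource totals in Table 3. The sketch assumes the percentages are taken against 1200 slices and 2400 LUTs and are rounded down; `percent_floor` is a hypothetical helper.

```python
TOTAL_SLICES = 1200
TOTAL_LUTS = 2400

used_slices = 1198
used_luts = 2166
logic_luts = 1952

def percent_floor(part, whole):
    """Integer percentage, rounded down (assumed to be the paper's rounding)."""
    return part * 100 // whole

routing_luts = used_luts - logic_luts

print("slice utilization (%):", percent_floor(used_slices, TOTAL_SLICES))
print("LUT utilization (%):", percent_floor(used_luts, TOTAL_LUTS))
print("logic LUTs (%):", percent_floor(logic_luts, TOTAL_LUTS))
print("LUTs used as routing transit:", routing_luts)
```

With these assumptions the computed values agree with the 99%, 90%, 81% and 214 given in the text.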

# 5. Conclusion

A novel FDP2008 architecture/circuit has been proposed and implemented with SMIC 0.18 $\mu$m CMOS technology. By re-designing the memory cell, the CLB can be configured as distributed RAM or as a shift register. The new design methodology of the programmable routing resource ensures that the whole FDP2008 chip is highly repeatable and that the signal delay is uniform and predictable. A standard configuration interface, SPI, is adopted, and a group of high precision sensitive amplifiers in the configuration circuit magnifies the read back data values. The hardware-software co-test indicates that FDP2008 works correctly and efficiently. In addition, the performance of FDP2008 is quite comparable with the Xilinx Xc2s100e device, which is the same as FDP2008 both in terms of technology process and system resources.

Table 5. Test benchmark circuits.

| Benchmarks | Function                 | No. of slices | Performance |
|------------|--------------------------|---------------|-------------|
| Multi29    | 29-bit multiplexer       | 1198          | Right       |
| Code       | Encrypt circuit          | 696           | Right       |
| Song       | Music player             | 242           | Right       |
| Colour     | Colour distinction       | 181           | Right       |
| Clock      | Clock frequency division | 47            | Right       |

# References

- [1] Betz V, Rose J, Marquardt A. Architecture and CAD for deep-submicron FPGAs. Kluwer Academic Publishers, 1999
- [2] Ahmed E. The effect of LUT and cluster size on deep submicron FPGA performance and density. IEEE Trans VLSI Syst, 2004, 12(3): 288
- [3] Johnson R A. RAM with synchronous write port using dynamic latches. US Patent, No. 5933369, Aug 30, 1999
- [4] Bauer T J. Look up tables which double as shift registers. US Patent, No. 5889413, Mar 30, 1999
- [5] Pan Guanghua, Lai Jinmei. The design and implementation of sequential circuits in FPGA configurable logic block. Acta Electronica Sinica, 2008, 36(8): 1480 (in Chinese)
- [6] Wu Fang, Zhang Huowen. A delay-optimized universal FPGA routing architecture. Proc ASPDAC, 2009
- [7] Wang Y B, Xie J. Design and implementation of the configuration circuit for FDP2008 FPGA. Proc IEEE, APCCAC, 2008
- [8] Schultz D P, Hung L C, Goetting E E. Configuration bus interface for FPGAs. US Patent, No. 6429682, Aug 6, 2002
- [9] Shen Q S. Programmable routing resource design of FPGA with embedded IP cores. MD Dissertation, University of Fudan, 2008 (in Chinese)
- [10] Saint C, Saint J. IC mask design-essential layout techniques. Tsinghua University Press, McGraw-Hill Company (S) Pte Ltd, 2004
- [11] Chen Liguang, Wang Yabin, Wu Fang. Design and implementation of an FDP chip. Journal of Semiconductors, 2008, 29(4): 713