# Design and Implementation of an FDP Chip\*

Chen Liguang, Wang Yabin, Wu Fang, Lai Jinmei<sup>†</sup>, Tong Jiarong, Zhang Huowen, Tu Rui, Wang Jian, Wang Yuan, Shen Qiushi, Yu Hui, Huang Junnai, Lu Haizhou, and Pan Guanghua

(ASIC and System State Key Laboratory, Fudan University, Shanghai 200433, China)

Abstract: A novel Fudan programmable logic chip (FDP) was designed and implemented in a SMIC $0.18\mu m$ CMOS logic process. The new 3-LUT based logic cell circuit increases logic density by about 11% compared with a traditional 4-input LUT. The unique hierarchy routing fabrics and effective switch box optimize the routing wire segments and make it possible for wire segments of different lengths to connect directly. The FDP contains 1,600 programmable logic cells, 160 programmable I/Os, and 16kbit of dual-port block RAM. Its die size is $6.104mm \times 6.620mm$, and it is packaged in a QFP208. Hardware and software cooperation tests indicate that the FDP chip works correctly and efficiently.

Key words: FPGA; programmable logic block; programmable routing resource; switch box

**EEACC:** 1130; 1130B

## 1 Introduction

Today, field programmable gate arrays (FPGAs) have evolved into complex systems-on-chip. This evolution poses challenges for the architecture, tools, and design methodology<sup>[1~9]</sup>. FPGAs are made up of many functional elements, which fall into four major categories: programmable logic cells, interconnect fabrics, embedded IP, and I/O modules. This paper explores some of the characteristics of these categories in order to provide insight into how the architecture and circuits of these functional elements can be designed as efficiently as possible.

In particular, this paper focuses on the benefits that can be gained from architectural choices such as programmable logic cells (PLC), hierarchy routing fabrics, and novel switch box (SB) circuits. The ideas are illustrated by examples taken from the FDP design. These examples show the progress to date and highlight important areas of focus for future development.

FDP was fabricated with a SMIC  $0.18\mu m$  CMOS logic process using a fully custom design method. The test results show that its logic cells and interconnect resources work correctly and efficiently.

## 2 A novel FDP circuit architecture

FDP contains a $20 \times 20$ array of programmable logic tiles (TILEs), 160 programmable I/O blocks, block RAM, a configuration circuit, a JTAG circuit, clock distribution networks, periphery interfaces, and many other circuits. Figure 1 shows the block diagram of the FDP circuit architecture.

The TILE is the atomic repetitive cell in FDP arrays. It is made up of PLC and hierarchy programmable interconnection resources. Each TILE contains a programmable logic cluster, a SB, a connection box (CB), and a bus control logic unit circuit.

The programmable I/O block (IOB) contains input/output control logic and dedicated I/O interconnections. Each IOB can be configured in input mode, output mode, bidirectional mode, and other modes.



Fig. 1 Overview of FDP architecture

<sup>\*</sup> Project supported by the Natural Science Foundation of China (No. 60776023) and the National High Technology Research and Development Program of China (No. 2007AA01Z285)

<sup>†</sup> Corresponding author. Email: jmlai@fudan.edu.cn



Fig. 2 Programmable logic slice of FDP

FDP also includes four 4kbit dual-port block RAMs. Each block RAM can work in 1-bit, 2-bit, 4-bit, or 8-bit mode. There is a dedicated interconnection between the block RAM and the programmable logic cluster.

There is a boundary scan circuit added into FDP, which controls each IOB and makes testing convenient.

FDP's configuration circuit contains an address chain, a data chain, and corresponding control circuits, which offer master serial mode and slave serial mode<sup>[1]</sup>.

#### 2.1 Circuit architecture of PLC

#### 2.1.1 Novel programmable logic slice

Figure 2 shows a programmable logic slice of FDP. It is made of two PLC circuits that have the same circuit architecture. The PLC is based on the mainstream LUT. Unlike the normal 4-input LUT structure, the FDP's PLC is made up of two 3-input LUTs, fast carry logic, fast shift logic, and sequential logic. The logic resources shared between PLCs are used to realize logic functions with more inputs. The FDP's slice can be used to realize logic functions of three to nine inputs.

The two 3-input LUT circuits in each PLC share the same three inputs. They can be used to realize two 3-input logic functions or, by combining the two LUTs, one 4-input logic function. Combining the fast carry logic with one 3-input LUT, the PLC can also realize a one-bit full adder, a one-bit full subtractor, and multiplexing logic. The fast shift logic can realize a built-in scan chain test. The sequential logic can be configured as a DFF or a latch.
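The two operating modes described above can be sketched behaviorally. The following is a minimal illustration only, not the FDP circuit: the function names, the bit ordering of the truth tables, and the use of a plain 2:1 selection on a fourth input `d` are assumptions made for this example. It shows two 3-input functions evaluated on the same inputs, and one 4-input function split across the two LUTs.

```python
from itertools import product

def lut3(table, a, b, c):
    """Evaluate a 3-input LUT; `table` holds 8 bits indexed by (a, b, c)."""
    return table[(a << 2) | (b << 1) | c]

def make_table(func):
    """Build the 8-entry truth table of a 3-input Boolean function."""
    return [func(a, b, c) for a, b, c in product((0, 1), repeat=3)]

def plc_dual3(table_f, table_g, a, b, c):
    """Mode 1 (assumed naming): two 3-input functions of the same inputs."""
    return lut3(table_f, a, b, c), lut3(table_g, a, b, c)

def plc_single4(table_d0, table_d1, a, b, c, d):
    """Mode 2 (assumed naming): the fourth input d selects one of the two LUTs."""
    return lut3(table_d1, a, b, c) if d else lut3(table_d0, a, b, c)

def split_4input(func):
    """Split a 4-input function into two 3-LUT tables, one per value of d."""
    d0 = make_table(lambda a, b, c: func(a, b, c, 0))
    d1 = make_table(lambda a, b, c: func(a, b, c, 1))
    return d0, d1

# Example functions (illustrative choices): XOR and majority of three inputs.
SUM = make_table(lambda a, b, c: a ^ b ^ c)
MAJ = make_table(lambda a, b, c: (a & b) | (a & c) | (b & c))

# Example 4-input function: a 4-input AND mapped onto the two LUTs.
AND4_D0, AND4_D1 = split_4input(lambda a, b, c, d: a & b & c & d)
```

In the real PLC the full adder uses one LUT together with the dedicated fast carry logic; the XOR/majority pair here only serves to show two functions sharing three inputs.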

Table 1 Comparison between 4-LUT and FDP-LUT

| Benchmark | 4-LUT | FDP-LUT | Gain/% |
|-----------|-------|---------|--------|
| Z4ml      | 70    | 57      | 18.5   |
| C880      | 158   | 149     | 5.6    |
| Frg1      | 277   | 247     | 10.8   |
| Alu2      | 355   | 291     | 18.0   |
| C5315     | 725   | 628     | 13.3   |
| I10       | 1244  | 1152    | 7.3    |
| rot       | 616   | 560     | 9.0    |
| Average   | 492   | 440     | 11.2   |

The main advantage of an FDP slice is that it can realize two 3-input logic functions in a single PLC, which makes the PLC more efficient: using a 4-input LUT to realize a 3-input logic function wastes half of the LUT. The experimental results show that the two 3-input LUTs improve the logic density by about 11% compared with a traditional 4-input LUT when a technology mapping algorithm such as FlowMap<sup>[2]</sup> is used. Table 1 shows the comparison between 4-LUT and FDP-LUT.
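The Gain column of Table 1 can be reproduced with a short calculation. The sketch below assumes that the gain is the relative reduction in cell count, $(N_{4\text{-LUT}} - N_{\rm FDP})/N_{4\text{-LUT}}$; the paper does not state the formula, but under this assumption the per-benchmark values agree with the table to within rounding.

```python
# Benchmark: (4-LUT count, FDP-LUT count, reported gain in %), copied from Table 1.
TABLE1 = {
    "Z4ml":  (70, 57, 18.5),
    "C880":  (158, 149, 5.6),
    "Frg1":  (277, 247, 10.8),
    "Alu2":  (355, 291, 18.0),
    "C5315": (725, 628, 13.3),
    "I10":   (1244, 1152, 7.3),
    "rot":   (616, 560, 9.0),
}

def gain_percent(n_4lut, n_fdp):
    """Relative reduction in cell count, in percent (assumed definition)."""
    return 100.0 * (n_4lut - n_fdp) / n_4lut

computed = {name: gain_percent(n4, nf) for name, (n4, nf, _) in TABLE1.items()}

# Two ways to summarize: the mean of the per-benchmark gains, and the gain of the totals.
mean_of_gains = sum(computed.values()) / len(computed)
total_4lut = sum(n4 for n4, _, _ in TABLE1.values())
total_fdp = sum(nf for _, nf, _ in TABLE1.values())
gain_of_totals = gain_percent(total_4lut, total_fdp)

if __name__ == "__main__":
    for name, value in computed.items():
        print(f"{name:6s} computed {value:5.2f}%  reported {TABLE1[name][2]:4.1f}%")
    print(f"mean of gains {mean_of_gains:.2f}%, gain of totals {gain_of_totals:.2f}%")
```

Both summary figures fall between 10% and 12%, consistent with the "about 11%" quoted in the text.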

#### 2.1.2 FDP programmable logic cluster

FDP's cluster consists of two programmable logic slices, a sequential control unit (SCU), and an input MUX array (IMUX) circuit. Figure 3 shows the block diagram of a Cluster circuit. An IMUX circuit is used to choose the Cluster's input signals, which include the outside inputs, inner feedback outputs, and logic zero. The SCU circuit is responsible for processing sequential control signals that come from outside interconnections and clock networks, and for providing a unit clock signal, an enable signal, a reset/set signal, and a function select signal.

There are two advantages of FDP's cluster over the general one:

Small area: The four logic cells in an FDP cluster share the same SCU circuit. An SCU's area is about $1/4 \sim 1/3$ that of a PLC. Thus, FDP's cluster has a smaller area than the traditional structure available on the Xilinx XC4010, in which each logic cell has its own SCU.

Fast speed: The logic cells in a cluster can be connected quickly through the IMUX circuit. The local interconnections in a cluster markedly reduce the wire length and the number of programmable switches between logic cells, which benefits the circuit speed within a cluster.

Fig. 3 Logic cluster

Fig. 4 FDP hierarchy routing architecture

#### 2.2 Programmable routing resource

#### 2.2.1 Unique hierarchy routing architecture

The FDP programmable routing architecture employs the widely used hierarchy routing approach. It consists of two levels: local interconnect routing inside the cluster and segment routing outside the cluster.

Figure 4 displays the top view of the hierarchy routing circuit architecture. The local interconnection includes an IMUX, which chooses between the outside inputs from the CB and the feedback inputs from the cluster outputs. The IMUX consists of 20 multiplexers with 16 inputs each. By selecting signals from the CB or from the feedback, the IMUX implements both the interconnections between clusters and the fast interconnections inside a cluster. The segment routing resource consists of wire segments in the routing tracks, an SB, and a CB. There are three types of segment wires in the FDP: length 2 wires, length 4 wires, and long wires. The SB circuit realizes the connections between horizontal and vertical wire segments, and the CB circuit connects the inputs and outputs of the clusters to the routing channels.
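As a rough behavioral picture of the IMUX, each of the 20 multiplexers can be thought of as picking one of 16 candidate sources under control of configuration data. The sketch below is an assumption-laden model: the binary-encoded select value, the names `make_imux`, `IMUX_COUNT`, and `SELECT_BITS`, and the resulting bit count are illustrative and are not taken from the FDP circuit, which may well use a different (for example one-hot pass-transistor) encoding.

```python
IMUX_COUNT = 20   # number of multiplexers in the IMUX (from the text)
MUX_INPUTS = 16   # inputs per multiplexer (from the text)

# Assumption: a binary-encoded select field per multiplexer.
SELECT_BITS = (MUX_INPUTS - 1).bit_length()

def make_imux(select):
    """Return a 16:1 multiplexer whose choice is fixed by the configuration value."""
    if not 0 <= select < MUX_INPUTS:
        raise ValueError("select must be in the range 0..15")

    def mux(sources):
        # `sources` stands for the 16 candidate signals: CB inputs,
        # cluster feedback outputs, or logic zero.
        if len(sources) != MUX_INPUTS:
            raise ValueError("expected exactly 16 source signals")
        return sources[select]

    return mux

# Configuration bits per cluster under the binary-encoding assumption.
IMUX_CONFIG_BITS = IMUX_COUNT * SELECT_BITS
```

With a binary encoding this model needs 4 select bits per multiplexer, or 80 bits per cluster; a one-hot scheme would need more bits but fewer series switches.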

#### 2.2.2 Effective SB circuit architecture

The traditional Wilton SB realizes the connections (through transistors) between "domains". It has better performance than other styles of SBs<sup>[4,5]</sup>. However, the Wilton SB cannot connect wires of different lengths, which causes a hierarchy routing problem and degrades performance.



Fig. 5 Novel switch box architecture

In FDP, wires of three different lengths were integrated in one Wilton SB<sup>[4,5]</sup>. With this method, all the wires can be connected efficiently<sup>[4,5]</sup>. The circuit architecture is shown in Fig. 5. In this figure, the horizontal and vertical connections are omitted. Outside the rectangle, the solid lines represent length 4 wires, the dashed lines represent length 2 wires, and the dotted lines represent long wires. Lines inside the rectangle represent the connections between different wires: the solid lines represent the connections between length 4 wires, the dashed lines represent the connections between length 2 wires, and the dotted lines represent the connections between long wires and the other wires. The connections are summarized in Table 2, where each number indicates the number of connections between the two wire types.
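One way to read Table 2 is to compare it with the per-type track counts listed in Table 3 (12 length 2 wires, 24 length 4 wires, 12 long wires, channel width 48). The sketch below checks an observation that the paper does not state explicitly: every entry of Table 2 equals $n_i n_j / W$, so each row sums to the number of tracks of that type. The dictionary names are illustrative.

```python
# Tracks per wire type and channel width, taken from Table 3.
TRACKS = {"Length 2": 12, "Length 4": 24, "Long wire": 12}
W = sum(TRACKS.values())

# Connection counts copied from Table 2.
TABLE2 = {
    "Length 2":  {"Length 2": 3, "Length 4": 6,  "Long wire": 3},
    "Length 4":  {"Length 2": 6, "Length 4": 12, "Long wire": 6},
    "Long wire": {"Length 2": 3, "Length 4": 6,  "Long wire": 3},
}

def predicted(kind_a, kind_b):
    """Connections if each type connects in proportion to its share of tracks."""
    return TRACKS[kind_a] * TRACKS[kind_b] // W

# The matrix is symmetric, matches the proportional model, and has row sums
# equal to the per-type track counts.
symmetric = all(TABLE2[a][b] == TABLE2[b][a] for a in TABLE2 for b in TABLE2)
proportional = all(TABLE2[a][b] == predicted(a, b) for a in TABLE2 for b in TABLE2)
row_sums = {a: sum(TABLE2[a].values()) for a in TABLE2}
```

Under this reading, the connections of each wire type are distributed over the three types in the same 1:2:1 ratio as the tracks themselves.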

#### 2.2.3 Block RAM and its interconnection

The FDP chip includes four 4kbit dual-port block RAMs. There is a dedicated interconnection between the block RAM and the programmable logic cluster, which makes it convenient for the block RAM to communicate with other components.

Compared with the traditional FPGA architecture, which extends or combines block RAMs by configuring normal interconnections and logic cells, these dedicated interconnections reduce the routing congestion around the block RAM and reduce the block RAM delay. Moreover, the dedicated block RAM control logic simplifies the software handling of the block RAM compared with the traditional FPGA architecture.

#### 2.3 FDP architecture evaluation

Table 2 Connections among 3 wire segments

| Segment   | Length 2 | Length 4 | Long wire |
|-----------|----------|----------|-----------|
| Length 2  | 3        | 6        | 3         |
| Length 4  | 6        | 12       | 6         |
| Long wire | 3        | 6        | 3         |

Fig. 6 FDP performance evaluation flow

Many factors have to be considered in FPGA circuit design, especially in the design of programmable routing resources: the number of inputs of the logic cluster, the routing channel width, the ratios among the wire segments, the SB topology, the programmable switch type, the metal layers, and the wire width and spacing in the layout.

These factors cannot be considered independently. Thus, special evaluation programs and statistical analysis are necessary in order to achieve the best performance. Figure 6 indicates the FDP performance evaluation flow.

First, the MCNC benchmarks were optimized by SIS<sup>[6]</sup>. FlowMap and FlowPack<sup>[2]</sup> were then used to map the benchmarks to LUTs and DFFs, and the LUTs and DFFs were packed into clusters by T-VPack<sup>[7]</sup>. Finally, placement and routing were carried out with versatile place and route (VPR)<sup>[3]</sup>. This flow was iterated until the minimal track width was achieved. Table 3 shows the parameters of FDP.
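The paper does not say how the iteration toward the minimal track width was organized. One common approach is a binary search over the channel width, sketched below under the assumption that routability is monotonic (if a design routes at width $W$, it also routes at any larger width). The predicate `routes_at` is a hypothetical stand-in for a full place-and-route run and is not an API of any tool named above.

```python
def min_channel_width(routes_at, lo=1, hi=128):
    """Return the smallest width in [lo, hi] for which routes_at(width) is True.

    `routes_at` is a caller-supplied predicate (hypothetical here) that would
    run placement and routing at the given channel width and report success.
    Assumes routability is monotonic in the channel width.
    """
    if not routes_at(hi):
        raise ValueError("design does not route even at the upper bound")
    while lo < hi:
        mid = (lo + hi) // 2
        if routes_at(mid):
            hi = mid        # success: the minimum is mid or smaller
        else:
            lo = mid + 1    # failure: the minimum is larger than mid
    return lo

if __name__ == "__main__":
    # Toy stand-in for the router: pretend the benchmark needs 48 tracks.
    print(min_channel_width(lambda width: width >= 48))
```

Each probe corresponds to one complete place-and-route run, so the search needs only on the order of $\log_2$ of the width range in runs per benchmark.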

Table 3 Parameters of FDP routing resource

| Category           | Parameter                  | Value          |
|--------------------|----------------------------|----------------|
| Architecture       | Array size                 | $20 \times 20$ |
|                    | Channel width, W           | 48             |
| Segments routing   | Long wire                  | 12             |
|                    | Length 4 wire              | 24             |
|                    | Length 2 wire              | 12             |
| Local interconnect | Cluster inputs (I)         | 12             |
|                    | Cluster outputs (O)        | 8              |
|                    | CB inputs conn (Fcin)      | 16/48          |
|                    | CB outputs conn (Fcout)    | 24/48          |
|                    | IMUX outputs conn          | 12/12          |
|                    | IMUX feedback conn         | 1/8            |
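The figures in Table 3 can be cross-checked against numbers quoted elsewhere in the paper. The short sketch below only does arithmetic on values that appear in the text (array size, slices per cluster, PLCs per slice, segment counts, block RAM sizes); the variable names are illustrative.

```python
from fractions import Fraction

ARRAY_ROWS, ARRAY_COLS = 20, 20   # TILE array size
SLICES_PER_CLUSTER = 2            # Section 2.1.2
PLCS_PER_SLICE = 2                # Section 2.1.1
LUT3_PER_PLC = 2                  # Section 2.1.1

CHANNEL_WIDTH = 48
SEGMENTS = {"Long wire": 12, "Length 4 wire": 24, "Length 2 wire": 12}

# One cluster per TILE, so the logic cell count follows from the array size.
tiles = ARRAY_ROWS * ARRAY_COLS
logic_cells = tiles * SLICES_PER_CLUSTER * PLCS_PER_SLICE
lut3_per_cluster = SLICES_PER_CLUSTER * PLCS_PER_SLICE * LUT3_PER_PLC

# The three segment types together fill the routing channel.
segment_total = sum(SEGMENTS.values())

# Connection-box flexibility expressed as a fraction of the channel width.
fc_in = Fraction(16, CHANNEL_WIDTH)
fc_out = Fraction(24, CHANNEL_WIDTH)

# Four 4kbit block RAMs.
block_ram_kbit = 4 * 4
```

These values reproduce the 1,600 logic cells and 16kbit of block RAM quoted in the abstract, and the three segment types add up to the channel width of 48; each cluster input can reach one third of the tracks and each output one half.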







Fig. 7 (a) FDP Layout (b) Photo (c) FDP chip test board

## 3 Implementation of FDP FPGA

The FDP chip was taped out in a SMIC $0.18\mu m$ logic 1P6M process. The layouts of the circuit modules, such as the LUT, PLC, slice, SB, CB, cluster, TILE, and the routing resource circuits, were designed with a full custom design method. The FDP's die size is about $6.1mm \times 6.6mm$, and the chip is packaged in a QFP208. Figure 7 shows the FDP's layout, a photo of the chip, and the test board.

Realizing functions on an FPGA depends on the corresponding CAD software. An integrated FPGA CAD design environment (FDE) tool was developed, which includes netlist conversion, partitioning, technology mapping, placement and routing, bitstream generation, and a configuration module. The FDE tool is not discussed further in this paper.

In order to evaluate the performance of an FPGA device, a full FPGA software development flow is needed. We use Synopsys's Design Compiler as the synthesis tool and FDE as the backend tool in the FDP development flow. In contrast, we use Xilinx Foundation as the XC4010's development flow. Table 4 shows the logic implementation results for the FDP and the XC4010: the third column gives the number of FDP clusters used, and the fourth column gives the number of XC4010 CLBs used.

Table 4 Test results of logic implementation between FDP chip and XC4010 chip

| Benchmarks | Function description    | FDP | XC4010E |
|------------|-------------------------|-----|---------|
| adder24    | 24-bit adder            | 7   | 13      |
| comp32     | 32-bit comparison       | 282 | 409     |
| count16    | 16-bit counter          | 7   | 8       |
| gcd        | Greatest common divisor | 88  | 126     |
| traffic    | Traffic signal light    | 62  | 29      |
| dsp        | Digital data processing | 118 | 148     |
| Music      | Music door              | 40  | 54      |
| UART       | UART                    | 366 | 456     |

Table 5 Performance test results of FDP chip

| Parameter       | Description                                      | Value |
|-----------------|--------------------------------------------------|-------|
| ICCQ_CORE       | Quiescent current                                | <15mA |
| $T_{\rm comb}$  | Combinational logic delay, cluster to cluster    | 1.7ns |
| $T_{\rm seq}$   | Sequential logic delay, register to register     | 2.3ns |
| $T_{\rm carry}$ | Carry logic delay, cluster to cluster            | 1.0ns |

It is difficult to obtain a precise comparison between the FDP device and the XC4010 device because of their different architectures and different development software. An XC4010 CLB consists of two 4-LUTs and a 3-LUT, while an FDP cluster is made up of eight 3-LUTs. Additionally, Synopsys's Design Compiler is an ASIC synthesis tool and cannot optimize for the FDP architecture, so the results usually do not reflect the best achievable performance. Still, the results in Table 4 indicate that FDP's logic density is as good as the XC4010's. We need to pay more attention to the interface between the hardware architecture and the synthesis tool in later research.
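Part of the difficulty is that one FDP cluster and one XC4010 CLB are not equal-sized units. A crude way to see this is to count LUT configuration bits per block, using only the LUT counts given above and the fact that a $k$-input LUT stores $2^k$ bits. This is a rough illustration that ignores carry logic, registers, and routing; it is not a metric used in the paper.

```python
def lut_bits(k):
    """Number of truth-table bits stored by a k-input LUT."""
    return 2 ** k

# XC4010 CLB: two 4-LUTs and one 3-LUT (as described in the text).
XC4010_CLB_BITS = 2 * lut_bits(4) + lut_bits(3)

# FDP cluster: eight 3-LUTs (as described in the text).
FDP_CLUSTER_BITS = 8 * lut_bits(3)

# Ratio of LUT storage per block; routing and other logic are ignored.
BITS_RATIO = FDP_CLUSTER_BITS / XC4010_CLB_BITS
```

By this measure an FDP cluster holds 64 LUT bits against 40 for an XC4010 CLB, so the block counts in Table 4 compare units of different capacity.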

Table 5 shows timing performance test results of the FDP chip.

## 4 Conclusion

In this paper, a new FDP architecture and circuit was proposed and fabricated in SMIC $0.18\mu m$ CMOS technology. The novel 3-LUT based PLC architecture increases logic density by about 11% compared with a traditional 4-input LUT. The new design of the hierarchy routing architecture and SB makes it possible for wire segments of different lengths to connect directly. The hardware and software cooperation tests indicate that the FDP chip works correctly and efficiently.

#### References

- [1] Wang J, Chen L, Lai J. FPGA downloading circuit design and implementation. The 8th Int Conf on Solid-State and Integrated-Circuit Technology (ICSICT 2006), 2006:1950
- [2] Cong J, Ding Y. FlowMap: an optimal technology mapping algorithm for delay optimization in lookup-table based FPGA designs. IEEE Trans CAD, 1994, 13(1):1
- [3] Betz V, Rose J, Marquardt A. Architecture and CAD for deep-submicron FPGAs. Kluwer Academic Publishers, 1999
- [4] Shen Q, Tan J, Tong J, et al. Methodology of the design for segments-switchable switch blocks. The 8th Int Conf on Solid-State and Integrated-Circuit Technology (ICSICT 2006), 2006:1944
- [5] Tan J, Shen Q, Chen Y, et al. FPGA routing architecture optimization. The 8th Int Conf on Solid-State and Integrated-Circuit Technology (ICSICT 2006), 2006:1934
- [6] Sentovich E M, Singh K J, Lavagno L, et al. SIS: a system for sequential circuit synthesis. Tech Report No. UCB/ERL M92/41, University of California, Berkeley, 1992
- [7] Marquardt A, Betz V, Rose J. Using cluster-based logic blocks and timing-driven packing to improve FPGA speed and density. Proc of ACM Int Symp on FPGAs, 1999:37
- [8] Betz V, Rose J. VPR: a new packing, placement and routing tool for FPGA research. Int Workshop on Field-Programmable Logic and Applications, 1997:213
- [9] Trimberger S. Redefining the FPGA. International Conference on Field Programmable Logic and Applications, 2007

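## Design and Implementation of an FDP FPGA Chip\*

Chen Liguang, Wang Yabin, Wu Fang, Lai Jinmei<sup>†</sup>, Tong Jiarong, Zhang Huowen, Tu Rui, Wang Jian, Wang Yuan, Shen Qiushi, Yu Hui, Huang Junnai, Lu Haizhou, and Pan Guanghua

(ASIC and System State Key Laboratory, Fudan University, Shanghai 200433, China)

Abstract: A novel FDP FPGA circuit architecture and its design and implementation are studied. The novel programmable cell structure based on 3-input look-up tables improves logic utilization by about 11% compared with the traditional structure based on 4-input look-up tables. The unique hierarchical, segmented programmable interconnect structure and the efficient switch box design allow different interconnect resources to be connected quickly and directly, which greatly improves the efficiency of the programmable routing resources. The FDP chip contains 1600 programmable logic cells, 160 usable I/Os, and 16k of embedded dual-port block RAM. It was designed with a full custom method and taped out in a SMIC $0.18\mu m$ CMOS process, with a die area of $6.104mm \times 6.620mm$. The final software and hardware test results show that the chip's programmable resources work efficiently with its software to correctly implement user circuit functions.

Key words: FPGA; programmable logic block; programmable routing resource; switch box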

**EEACC:** 1130; 1130B

CLC number: TN304 Document code: A Article ID: 0253-4177(2008)04-0713-06

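<sup>\*</sup> Project supported by the Natural Science Foundation of China (No. 60776023) and the National High Technology Research and Development Program of China (No. 2007AA01Z285)

<sup>†</sup> Corresponding author. Email: jmlai@fudan.edu.cn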