1. Introduction
3GPP LTE, a set of enhancements to the 3G Universal Mobile Telecommunications System (UMTS)[1], has received tremendous attention recently and is considered a very promising 4G wireless technology. Key requirements of LTE include packet data support with peak data rates of up to 150 Mbps on the downlink and 75 Mbps on the uplink, a low layer-2 round-trip latency of 10 ms, flexible bandwidths of up to 20 MHz, and improved system capacity and coverage[2]. Advanced technologies, including the Turbo code, were selected to meet these requirements, and the Turbo decoder is typically one of the major blocks in an LTE wireless receiver.
The bottleneck of Turbo decoder design is the high latency caused by the iterative decoding process, the forward-backward recursion of the maximum a posteriori (MAP) decoding algorithm, and the interleaving/de-interleaving between iterations. In the decoder presented in this brief, a codeword can be processed concurrently by four, eight, or sixteen soft-input soft-output (SISO) decoders. The design challenge in these parallel modes is how to transfer parallel data simultaneously across the interconnection between the SISO decoders and the memory module[3].
To achieve a high-throughput decoder, low latency in both the SISO unit and the interleaver/de-interleaver has to be considered. This brief focuses on these issues and proposes an improved architecture. In the novel SISO design, the forward state metrics α and the backward state metrics β are calculated alternately, so that both recursions share the same structure and the critical path delay is reduced. The improved interleaver/de-interleaver structure performs interleaving and de-interleaving in one cycle.
The rest of this paper is organized as follows. Section 2 describes the basic structure of the quadratic permutation polynomial (QPP) interleaver and several of its algebraic properties, and then proposes an interleaver/de-interleaver design based on these properties. Section 3 concentrates on the novel SISO architecture, which uses fewer resources and achieves low latency for the Log-MAP algorithm. Section 4 provides the implementation results, followed by the conclusion.
2. Interleaving
The interleaver is essential to the impressive performance of the Turbo code, but its pseudorandom property complicates the parallel processing of a single codeword. A specific mechanism is required to handle parallel data transmission with traditional interleavers[4].
The QPP interleaver of a size-N block can be expressed as
f(x) = (f_1 x + f_2 x^2) mod N,    (1)
f(x + d) = [f(x) + g(x)] mod N,    (2)
g(x) = (2 d f_2 x + d^2 f_2 + f_1 d) mod N,    (3)
g(x + d) = [g(x) + 2 d^2 f_2] mod N.    (4)
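As a concrete illustration of the recursion (1)-(4), the following sketch, which is only a behavioural model and not the decoder's hardware, generates the QPP addresses with step d using nothing but additions modulo N once g(0) is initialized, and checks them against the direct formula. The parameters (the K = 40 LTE pair f_1 = 3, f_2 = 10) are chosen only as an example.

```python
# Behavioural sketch (not the paper's RTL): QPP address generation by the
# recursion (1)-(4).  After g(0) is initialized, every new address needs
# only two additions modulo N.

def qpp_direct(x, f1, f2, N):
    """Direct evaluation of f(x) = (f1*x + f2*x^2) mod N, Eq. (1)."""
    return (f1 * x + f2 * x * x) % N

def qpp_recursive(f1, f2, N, d=1):
    """Yield f(0), f(d), f(2d), ... using Eqs. (2)-(4)."""
    f = 0                                # f(0) = 0
    g = (d * d * f2 + f1 * d) % N        # g(0), Eq. (3) with x = 0
    g_step = (2 * d * d * f2) % N        # constant increment of g, Eq. (4)
    while True:
        yield f
        f = (f + g) % N                  # Eq. (2)
        g = (g + g_step) % N             # Eq. (4)

# Check the recursion against the direct formula for an example block.
N, f1, f2, d = 40, 3, 10, 1              # K = 40 LTE pair, used as an example
gen = qpp_recursive(f1, f2, N, d)
assert all(next(gen) == qpp_direct(k * d, f1, f2, N) for k in range(N))
```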
As Figure 1 shows, the data to be decoded is stored in an m×n matrix, where m is the number of groups and n is the length of each group. Two step sizes are used in the parallel interleaving design: the step d is equal to 1 for the forward interleaver, while for the backward one d is n−1. Figure 1(a) shows the forward interleaver: two successive elements are mapped by the interleaver to different rows and columns. Figure 1(b) illustrates the backward interleaving process: two originally adjacent elements, which neighbor each other in both row and column, are likewise separated into different rows and columns after backward interleaving. This arrangement means the two interleavers can be handled uniformly.
Moreover, some properties that are very useful for interleaver processing can be deduced from the QPP formula. The QPP equation can be rewritten as

f(x + w) mod n = [f_1(x + w) + f_2(x + w)^2] mod n = (f_1 x + f_2 x^2) mod n = f(x) mod n,    (5)

where w is a multiple of the group length n; that is, positions that are a whole number of groups apart share the same column address.
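Property (5) can be checked numerically; the snippet below is a sketch with an assumed parameter set (K = 6144 with f_1 = 263, f_2 = 480, split into m = 16 groups of length n = 384).

```python
# Sketch: f(x + w) mod n equals f(x) mod n whenever w is a multiple of the
# group length n.  K, f1, f2 and m below are assumptions for illustration.
K, f1, f2, m = 6144, 263, 480, 16
n = K // m                                    # group length
f = lambda x: (f1 * x + f2 * x * x) % K       # Eq. (1)
assert all((f(x + n) - f(x)) % n == 0 for x in range(K - n))
```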
Figure 2 shows this process. The column interleaving address generator is built from two progressive adders. The forward column address generator equations, used when computing the forward state metrics α, are as follows.
f(x + 1) = [f(x) + G(x)] mod n,    (6)
f(x + 1) = [f(x) + G(x)] mod n,    (7)
G(x + 1) = [G(x) + delta] mod n,    (8)
delta = (2 f_2) mod n,    (9)
G(0) = (f_1 + f_2) mod n,    (10)
f(0) = 0.    (11)
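A behavioural sketch of the forward column address generator described by (6)-(11) follows; it is an illustration only, with the same assumed parameters as above.

```python
# Sketch of the forward column address generator, Eqs. (6)-(11): two
# progressive adders deliver one column address f(x) mod n per cycle.
def forward_column_addresses(f1, f2, n):
    f = 0                          # f(0) = 0,               Eq. (11)
    G = (f1 + f2) % n              # G(0) = (f1 + f2) mod n, Eq. (10)
    delta = (2 * f2) % n           # Eq. (9)
    out = []
    for _ in range(n):
        out.append(f)
        f = (f + G) % n            # Eqs. (6)-(7)
        G = (G + delta) % n        # Eq. (8)
    return out

# Check against the direct formula, again with assumed parameters.
f1, f2, n = 263, 480, 384
assert forward_column_addresses(f1, f2, n) == [(f1 * x + f2 * x * x) % n for x in range(n)]
```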
The backward column address generator equations are

f(x − 1) = [f(x) + G(x)] mod n,    (12)
G(x − 1) = [G(x) + delta] mod n,    (13)
delta = (2 f_2) mod n,    (14)
G(n − 1) = (3 f_2 − f_1) mod n,    (15)
f(n − 1) = (f_2 − f_1) mod n.    (16)
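The backward generator reuses the same two-adder structure with the initial values (15)-(16). The sketch below (same assumptions, reusing forward_column_addresses from the previous sketch) produces the forward column sequence in reverse order.

```python
# Sketch of the backward column address generator, Eqs. (12)-(16).
def backward_column_addresses(f1, f2, n):
    f = (f2 - f1) % n              # f(n-1),  Eq. (16)
    G = (3 * f2 - f1) % n          # G(n-1),  Eq. (15)
    delta = (2 * f2) % n           # Eq. (14)
    out = []
    for _ in range(n):
        out.append(f)
        f = (f + G) % n            # Eq. (12)
        G = (G + delta) % n        # Eq. (13)
    return out

# The backward sequence is the forward sequence read from x = n-1 down to 0.
f1, f2, n = 263, 480, 384
assert backward_column_addresses(f1, f2, n) == forward_column_addresses(f1, f2, n)[::-1]
```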
In this group of formulas, x is the backward sequence index, running from n − 1 down to 0, and f(x − 1) is the backward interleaved column address. The signal column_c_in is used in the intra-row permutation process.
Figure 2 illustrates only half of the interleaver/de-interleaver operation; the intra-row permutation is the other half, which Figure 3 shows in more detail.
The intra-row permutation uses 16 groups of progressive adders; Figure 3 illustrates the operation of one group. Equations (17)-(19) describe the intra-row permutation, where m is the number of groups (1, 4, 8 or 16). Since m is a power of two in the parallel modes, we can simply keep the lowest 2, 3 or 4 bits of the result instead of performing a remainder operation, as shown in the sketch following Eqs. (17)-(19).
f(x + 1) = {[f(x) + g(x)] % n} mod m,    (17)
g(x + 1) = {[g(x) + delta] % n} mod m,    (18)
delta = [(2 f_2) % n] mod m.    (19)
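Because m is a power of two in the parallel modes, the final mod-m reduction amounts to keeping the lowest log2(m) bits; the following minimal sketch of this observation is not the paper's circuit, only an illustration.

```python
# When m is 4, 8 or 16, x mod m equals the lowest 2, 3 or 4 bits of x,
# so no remainder unit is needed for the final reduction.
for m in (4, 8, 16):
    mask = m - 1
    assert all((x % m) == (x & mask) for x in range(1024))
```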
For the forward address calculation, the equation can be transformed to
row_f(w + x + 1) = [row_f(w + x) + row_g(w + x)] mod m,    (20)
row_g(w + x + 1) = [row_g(w + x) + row_delta] mod m,    (21)
row_delta = [(2 f_2) % n + column_c_in] mod m,    (22)
row_g(0) = [g(0) % n] mod m,    (23)
row_f(0) = [f(0) % n] mod m.    (24)
As with the forward intra-row permutation, the backward calculation can use the same architecture with different initial inputs, so the two permutations can share the same adders in different time slots. However, the backward intra-row permutation must be adjusted, because the next permutation lies in the following row. In this way, the Turbo decoder can generate a mapping/de-mapping address every cycle.
row_f(w + x − 1) = [row_f(w + x) + row_g(w + x)] mod m,    (25)
row_g(w + x − 1) = [row_g(w + x) + row_delta] mod m,    (26)
row_delta = [(2 f_2) % n + column_c_in] mod m,    (27)
row_g(n − 1) = [g(n − 1) % n] mod m,    (28)
row_f(n − 1) = [f(n − 1) % n] mod m.    (29)
3. Improved soft-in soft-out decoder
To illustrate the improved MAP decoder shown in Figure 5, we compare it with the original sliding-window MAP (SW-MAP) decoder. Figure 4 shows a common SW-MAP decoder architecture, which requires one set of α unit, β unit, branch unit, and Log-Likelihood-Ratio Calculator (LLRC) unit. An SMP buffer is used to save the states for use in the next Turbo iteration. In the SW algorithm, the parity information Lp is loaded from the symbol memory in sequential order. The a priori information La is loaded from the dual-port memory in sequential order for the first half-iteration and in interleaved order for the second half-iteration[6]. Like La, the systematic information Ls is loaded from the symbol memory. However, the ordinary SW-MAP design has two problems. One is the α and β calculation, which employs a fully parallel add-compare-select-add operation[7]; because this operation must be completed in one cycle, the decoder cannot run with a high-speed clock. The other problem is resource consumption. The new design focuses on these two issues and employs an improved structure.
In the new design, the add-compare-select-add operator is separated into two parts by a register. The two parts can handle different operations: one calculates α while the other calculates β, or vice versa. The separate α unit and β unit can therefore be replaced by a single α or β unit. In this way, two α or β units can finish all of the α and β computation with only one more clock cycle than a conventional SISO, while shortening the critical path delay. Since the output sequence differs from that of the conventional design, a corresponding reordering has to be applied.
Compared with a conventional SW-MAP decoder, the improved decoder does not use a sliding window and has only one α or β unit instead of the separate α unit and β unit of the old design, because the α and β calculator structures are almost identical except for the branch transition probability mapping, which can be controlled by a mode-select signal. Moreover, the last-in first-out (LIFO) memory for the branch metrics is saved, because the branch metrics take part in the LLRC calculation directly.
In the improved MAP decoder, α and β are calculated alternately. Both the α and β recursions employ an add-compare-select-add operator, which determines the critical path delay and hence the achievable system clock. To remove this bottleneck, a register is used to split the operator into an add-compare part and a select-add part. Because the forward state metrics α and the backward state metrics β are independent of each other, the α or β unit calculates α and β in turn. This scheme not only yields a fast system clock but also reduces resource consumption by reusing the unit. Calculating α and β alternately has another advantage: the transition probabilities computed by the branch unit need not be stored at all, since they take part in the LLRC calculation directly after a one-clock delay. The new design finishes a half-iteration in twice the group length in clock cycles, whereas SW-MAP needs the group length plus the window width. In short, two of the new MAP decoders are equivalent to one SW-MAP decoder in complexity, and they spend less time decoding.
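As a behavioural illustration of this split, under our own simplifying assumptions about the max* form (a sketch, not the paper's RTL), the add-compare half and the select-add half can be modelled as two functions separated by a pipeline register, with one physical unit serving α and β updates on alternating cycles.

```python
import math

# Behavioural sketch (our simplification, not the paper's RTL) of the split
# add-compare-select-add (ACSA) operator used by the Log-MAP recursions.

def add_compare(m0, g0, m1, g1):
    """Stage 1: add the branch metrics to the two candidate state metrics.
    The two sums are latched in the pipeline register."""
    return m0 + g0, m1 + g1

def select_add(s0, s1):
    """Stage 2: select the larger sum and add the Log-MAP correction term
    ln(1 + exp(-|s0 - s1|)), i.e. compute max*(s0, s1)."""
    return max(s0, s1) + math.log1p(math.exp(-abs(s0 - s1)))

def alternating_unit(jobs):
    """One shared unit: 'jobs' is a stream of ('alpha'|'beta', m0, g0, m1, g1)
    operand tuples, fed alternately.  While stage 2 finishes one job, stage 1
    already starts the next, so the critical path is roughly halved at the
    cost of one extra cycle of latency."""
    pipeline = None                                      # register between stages
    for kind, m0, g0, m1, g1 in jobs:
        if pipeline is not None:
            done_kind, s0, s1 = pipeline
            yield done_kind, select_add(s0, s1)          # stage 2, previous job
        pipeline = (kind, *add_compare(m0, g0, m1, g1))  # stage 1, current job
    if pipeline is not None:
        done_kind, s0, s1 = pipeline
        yield done_kind, select_add(s0, s1)
```

Feeding the unit with interleaved ('alpha', ...) and ('beta', ...) operand tuples reproduces the alternating schedule described above.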
Table 1 summarizes the implementation results of the proposed decoder and compares its hardware with existing decoders. So far the new SISO architecture has only been verified on an FPGA; it will be verified with a test chip in the future.
4. Conclusions
This paper has detailed the implementation of a parallel Turbo decoder for the LTE standard, covering the interleaver design, the SISO design, and the FPGA verification approach. The proposed circuit is designed to achieve high throughput: interleaver and Log-MAP optimizations in a parallelized architecture make it possible to achieve high throughput with low latency and without performance loss.