Hardware Acceleration of Rate-Distortion Optimized Quantization Algorithm

Yusuke Funayama, Takashi Kambe†, Gen Fujita‡
Graduate School of Science and Engineering, Kindai University
3-4-1 Kowakae, Higashi-Osaka City, Osaka, Japan
†Dept. of Electric and Electronic Engineering, Kindai University
‡Faculty of Information and Communication Engineering,
Osaka Electro-Communication University, Osaka, Japan

Abstract — Rate-distortion optimized quantization (RDOQ) is an important technology for improving video coding performance. RDOQ is able to determine the optimal value among multiple quantization candidates based on rate-distortion (RD). This paper describes a hardware design of the RDOQ processing for a 4 × 4 block for inter-frame prediction using high-level synthesis technology [1] based on the improved RDOQ algorithm [4]. The hardware design results are also evaluated.

I. INTRODUCTION

Although Rate-Distortion Optimized Quantization (RDOQ) improves the compression rate in both H.264/AVC and H.265/HEVC, its processing time is high and implementing it in hardware is complex.

In this paper we implement the algorithm in hardware using high-level synthesis and improve its performance.

II. RATE-DISTORTION OPTIMIZED QUANTIZATION (RDOQ)

The quantization process in both H.264/AVC and H.265/HEVC is implemented using the Rate-Distortion Optimized Quantization (RDOQ) technique. RDOQ determines the optimal quantization level among multiple quantization candidates by minimizing the sum of rate-distortion (RD) costs in each block. RDOQ consists of the following five steps and achieves higher encoding performance than conventional approaches.

1. Three \( l_i^{\text{float}} \) rounding candidates, (a) zero level \( l_i^0 \), (b) floor rounding level \( l_i^{\text{floor}} \), and (c) ceiling rounding level \( l_i^{\text{ceil}} \) are enumerated and their distortion values \( D_i^j \) are calculated. Here, \( i \) denotes the coefficient number, \( l_i^{\text{float}} \) denotes the quotient (before rounding) of DCT coefficient \( c_i \) divided by quantization step \( Q_{\text{step}} \), and \( j \) denotes the rounding candidate level.

2. The last non-zero (LNZ) coefficient is found in zigzag scanning order to minimize RD cost.

3. The CBP (Coded Block Pattern) is estimated.

4. The bit-rate of each rounding candidate is estimated and its RD cost is calculated using equation (1).

\[ D_i^j = \text{err}_i^j + \lambda \times \text{bits}_i^j \tag{1} \]

where \( \text{bits}_i^j \) represents the number of bits obtained by performing entropy coding on the quantized level \( l_i^j \), \( \text{err}_i^j \) indicates the quantization error if the coefficient \( c_i \) is quantized to value \( l_i^j \), and \( \lambda \) is a constant for each quantization parameter (QP).

5. Whether all coefficients are set to zero is determined by comparing the result of steps 1 to 4 with the sum of RD costs.
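The candidate enumeration of step 1 and the cost of equation (1) can be illustrated for a single coefficient with a minimal Python sketch. It is not the implementation of [2] or [4]: the squared-error form of \( \text{err}_i^j \), the toy bit-rate model `estimate_bits`, and all function names are assumptions made only for illustration.

```python
import math


def estimate_bits(level):
    """Hypothetical bit-rate model (an assumption, not the estimator of [2] or [4])."""
    if level == 0:
        return 1
    # Larger magnitudes cost more bits; the exact shape is illustrative only.
    return 2 * abs(level).bit_length() + 1


def rounding_candidates(c, q_step):
    """Step 1: zero, floor and ceiling levels of |c| / q_step (duplicates removed)."""
    l_float = abs(c) / q_step
    return sorted({0, math.floor(l_float), math.ceil(l_float)})


def best_level(c, q_step, lam):
    """Return (signed level, RD cost) minimising err + lam * bits, as in equation (1)."""
    sign = -1 if c < 0 else 1
    best = None
    for level in rounding_candidates(c, q_step):
        # Assumed error measure: squared difference after reconstruction.
        err = (abs(c) - level * q_step) ** 2
        cost = err + lam * estimate_bits(level)
        if best is None or cost < best[1]:
            best = (sign * level, cost)
    return best


if __name__ == "__main__":
    for lam in (1.0, 100.0):
        print(lam, best_level(10, 4, lam))
```

With a small \( \lambda \) the sketch keeps a non-zero level, whereas a large \( \lambda \) makes the bit cost dominate and drives the coefficient to zero, which is the trade-off that RDOQ exploits.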

RDOQ has a higher encoding performance than conventional techniques but it involves complex processes such as optimizing the quantization value for each quantization coefficient and updating the context of each coefficient. Also, because coefficients for bit-rate calculation are dependent on each other, hardware acceleration using parallel processing is difficult.

Reference [2] proposed reducing the number of quantization level candidates and simplifying the bit-rate estimation. In [4], we further improved the bit-rate estimation approach of [2] and proposed three new improvements: (a) acceleration of distortion calculation, (b) reduction of the LNZ coefficient search range, and (c) a decision about whether all coefficients are zero.

III. HARDWARE ACCELERATION USING HIGH-LEVEL SYNTHESIS TECHNOLOGY

Fig. 1 shows the architecture of the RDOQ processing based on the proposed RDOQ algorithm [4]. The target performance is real-time processing of a 4 × 4 block for inter-frame prediction using high-level synthesis technology [1]. If this is achieved, the entire RDOQ process for a 720 × 480 frame can also be performed in real time using a hierarchical parallel processing technique.

A. Parallel processing of RD cost calculation

Because [4] uses only the bit-rate estimated from equations for all coefficients, the proposed RDOQ algorithm can calculate the RD costs of all coefficients in parallel. However, enumerating rounding candidates for distortion calculation and deciding the LNZ coefficient search range involve data dependencies among coefficients. To deal with this problem, a search range flag is set during rounding candidate enumeration, and the LNZ coefficient search range is determined from this flag after distortion calculation. This modification eliminates the data dependency between the two processes, so all the coefficients can be processed in parallel (S2 and S4 in Fig. 1). All flags are bit-concatenated in reverse coefficient number order and the search range is decided using a one-cycle priority encoder in S2. One-cycle loop pipelining is also applied to each calculation in S3.
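The flag concatenation and priority encoding described above can be modeled in software as follows. This is a behavioral sketch only, not the synthesized circuit: it assumes that a set flag marks a coefficient that must stay inside the LNZ search range and that the search range ends at the highest flagged coefficient. The function names are hypothetical.

```python
def pack_flags(flags):
    """Concatenate per-coefficient flags so that coefficient i occupies bit i.

    The highest coefficient number therefore ends up in the most significant bit,
    which corresponds to concatenation in reverse coefficient number order.
    """
    word = 0
    for i, flag in enumerate(flags):
        if flag:
            word |= 1 << i
    return word


def priority_encode(word):
    """Index of the most significant set bit, or -1 when no bit is set."""
    return word.bit_length() - 1


def lnz_search_range(flags):
    """Coefficient indices to search for the LNZ coefficient (assumed semantics)."""
    last = priority_encode(pack_flags(flags))
    return range(0, last + 1)


if __name__ == "__main__":
    flags = [1, 0, 1, 1] + [0] * 12  # one flag per coefficient of a 4 x 4 block
    print(list(lnz_search_range(flags)))
```

In hardware the loop in `pack_flags` is plain wiring and the encoder is combinational logic, which is why the decision can fit in one cycle; the Python loop only mirrors the behavior.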

B. Function based pipelining of RDOQ process

The RDOQ process is implemented using six-stage functional pipelining. The pipeline consists of (S1) the input of DCT coefficients from RAM, (S2) rounding candidate enumeration and distortion calculation, (S3) optimization of the LNZ coefficient, (S4) bit-rate estimation and RD cost calculation, (S5) the decision on whether to set all coefficients to zero, and (S6) the output of the optimized quantization levels to RAM. The stages communicate using synchronization signals.
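The benefit of functional pipelining can be estimated with a simple cycle-count model. The sketch below assumes an idealized pipeline without stalls in which a new block enters whenever the slowest stage finishes. The per-stage cycle counts are hypothetical placeholders, not measurements of the designed circuit.

```python
def sequential_cycles(stage_cycles, n_blocks):
    """Each block passes through every stage before the next block starts."""
    return n_blocks * sum(stage_cycles)


def pipelined_cycles(stage_cycles, n_blocks):
    """Idealized pipeline: fill latency once, then one block per slowest-stage period."""
    if n_blocks == 0:
        return 0
    return sum(stage_cycles) + (n_blocks - 1) * max(stage_cycles)


if __name__ == "__main__":
    # Hypothetical cycle counts for S1..S6 (assumed values for illustration only).
    stages = [4, 6, 8, 6, 3, 4]
    n_blocks = 624
    seq = sequential_cycles(stages, n_blocks)
    pipe = pipelined_cycles(stages, n_blocks)
    print(seq, pipe, seq / pipe)
```

The model shows that throughput is bounded by the slowest stage, so balancing the stage lengths matters more than shortening any single fast stage.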

IV. HARDWARE DESIGN RESULTS

Five kinds of circuit, shown in Table I, were designed to evaluate each of the acceleration methods. The gate-level logic was synthesized from RT-level HDL using Synopsys' Design Compiler and mapped to Hitachi 0.18 μm CMOS library cells. The clock frequency of all the circuits was 100 MHz. The reported processing time of the RDOQ calculation is for one frame (624 × 4 × 4 blocks).

<table>
<thead>
<tr>
<th>Acceleration method</th>
<th>Circuit size [gates]</th>
<th>Processing time [μs]</th>
</tr>
</thead>
<tbody>
<tr>
<td>v0 (Sequential)</td>
<td>213,706</td>
<td>103.419</td>
</tr>
<tr>
<td>v1 (v0 + 16 parallel processing of S2 and S4)</td>
<td>778,654</td>
<td>30.654</td>
</tr>
<tr>
<td>v2 (v1 + Loop pipelining of S3)</td>
<td>794,906</td>
<td>28.556</td>
</tr>
<tr>
<td>v3 (v2 + Priority encoding in S2)</td>
<td>783,035</td>
<td>28.049</td>
</tr>
<tr>
<td>v4 (v3 + functional pipelining)</td>
<td>830,464</td>
<td>4.038</td>
</tr>
</tbody>
</table>

Circuit v0 is a sequential Bach C description of the proposed algorithm. Circuit v1 executes 16 processes in parallel to calculate the 4 × 4 block coefficients at S2 and S4 of v0. In circuit v2, one-cycle loop pipelining is applied at S3 of v1 by moving multi-cycle multiplications out of the loop. Circuit v3 adds a priority encoder for search range reduction to v2. Circuit v4 adds functional pipelining of the RDOQ process to v3 and is able to process a 4 × 4 block in real time. Using this circuit, RDOQ on a 720 × 480 frame can be performed in real time.
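The trade-off between circuit size and processing time in Table I can be summarized as ratios relative to the sequential circuit v0. The short script below only restates the table values; the dictionary layout and function names are chosen for illustration.

```python
# (circuit size [gates], processing time [us]) copied from Table I.
TABLE_I = {
    "v0": (213_706, 103.419),
    "v1": (778_654, 30.654),
    "v2": (794_906, 28.556),
    "v3": (783_035, 28.049),
    "v4": (830_464, 4.038),
}


def speedup(base, target):
    """Processing-time ratio: how many times faster target is than base."""
    return TABLE_I[base][1] / TABLE_I[target][1]


def area_ratio(base, target):
    """Gate-count ratio: how many times larger target is than base."""
    return TABLE_I[target][0] / TABLE_I[base][0]


if __name__ == "__main__":
    for name in ("v1", "v2", "v3", "v4"):
        print(f"{name}: speedup {speedup('v0', name):.2f}x, "
              f"area {area_ratio('v0', name):.2f}x")
```

For v4 this gives a speedup of about 25.6 over v0 at about 3.9 times the gate count.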

V. CONCLUSION

In this paper, we implemented an improved RDOQ algorithm in hardware using the Bach C high-level synthesis tool. We designed an RDOQ processing circuit module capable of processing a 4 × 4 block in real time, and we plan to extend our methods to the whole RDOQ process for larger frame sizes.

ACKNOWLEDGEMENTS

The authors would like to thank the Bach system development group in SHARP Corporation, Electronic Components and Devices Development Group. This work is supported by the VLSI Design and Education Center (VDEC), the University of Tokyo in collaboration with Synopsys Corporation.

REFERENCES