# HLS Utilizing Area Optimizing Method for High-Definition MRA-TV Denoise Circuit 

Eita KOBAYASHI ${ }^{\dagger}$, Kenta SENZAKI ${ }^{\ddagger}$, Atsufumi SHIBAYAMA ${ }^{\dagger}$, and Yuichi NAKAMURA ${ }^{\dagger}$<br>${ }^{\dagger}$ Green Platform Research Laboratories, ${ }^{\ddagger}$ Information and Media Processing Laboratories, NEC Corporation,<br>1753 Shimonumabe, Nakahara-ku, Kawasaki, Kanagawa 211-8666, JAPAN<br>\{e-kobayashi@fg, k-senzaki@bp, sibayama@cd, yuichi@az\}.jp.nec.com


#### Abstract

This work proposes an area optimization method for a high-definition image denoising circuit. Conventional denoising techniques share a common issue: the outlines of objects become more blurred as the strength of the noise reduction increases. To keep outlines clear, we have developed the MRA-TV algorithm, which combines the wavelet transform with TV norm optimization. This algorithm enables high-quality image denoising while maintaining clear outlines. A major obstacle to realizing an MRA-TV denoising circuit is its large circuit area, which results from the multiple iterated TV modules. In this work, we achieve a significant circuit area reduction by combining a reduction in the amount of calculation with a resource sharing technique that utilizes high-level synthesis (HLS). Evaluation results show a $52 \%$ area reduction while maintaining denoising throughput and latency.


## I. Introduction

Recently, various kinds of surveillance systems that combine cameras with image recognition software have been proposed. Such systems are especially important at night, when visibility is poor; however, their recognition performance also degrades at night because of insufficient light. To compensate, the image sensor signal must be amplified, but amplification also mixes a large amount of noise into the signal. This noise reduces the quality of the amplified images and causes detection errors in the later image recognition stage. Denoising is a fundamental step in digital image processing because noise is generated not only by signal amplification but also by a variety of other sources. Many conventional denoising approaches have focused only on the strength of noise reduction. With these simple approaches, it is difficult to restore the sharp outlines that are essential for image recognition, because outline blur becomes severe as the intensity of noise reduction increases. To solve this problem, we have developed the MRA-TV (Multi Resolution Analysis and Total Variation) algorithm, which achieves both noise reduction performance and outline reproducibility and thus supports high-performance image recognition. The MRA-TV algorithm reconciles noise reduction performance with reproducibility because the high-frequency components separated by the wavelet transformation are preserved faithfully, while the TV norm minimization method reduces noise effectively. On the other hand, the MRA-TV algorithm has the drawback of requiring a large circuit area, because its multiple TV modules must be equipped with many multipliers.

In this work, we propose a small-area circuit implementation technique for the MRA-TV algorithm. The large area of MRA-TV stems from extracting the image skeleton by iterating TV modules, which requires a large amount of calculation and implementation area. The number of iterations is strongly correlated with the clarity of outlines, so it is difficult to reduce the number of TV modules while maintaining high image quality. We therefore focus on reducing the area of the TV module itself. This approach can significantly reduce the area of the entire circuit because the circuit contains many TV modules. Specifically, we develop a technique that reduces the number of multiplications in a single iteration, together with a resource sharing technique that utilizes high-level synthesis. With these techniques, we achieve an area reduction of up to $73 \%$ in a single TV module and a $52 \%$ total area reduction in the five-hierarchy configuration of the MRA-TV algorithm.

The rest of this paper is organized as follows. Section II presents related work and the target denoising architecture of our research. Sections III and IV present the proposed area optimization method and the design methodology utilizing high-level synthesis. Section V gives the synthesis results and comparison. Finally, Section VI concludes the paper.

## II. Related Works

## A. Image Processing

Conventional image denoising techniques typically use smoothing filters [5], median filters [5], bilateral filters [1][3][7], and so on. The Total Variation norm minimization method (TV method) [4][6][8][9] was proposed as a constructive algorithm that considers not only the differences between the pixels themselves but also the energy vibration of adjacent pixels. There are also approaches that perform filtering after frequency domain transformation, either to reduce the number of calculations or to reduce the noise contained in a specific frequency band. Frequency domain transformation methods for image signals include the discrete Fourier transformation (DFT), the discrete wavelet transformation (DWT), and others. In this section, we explain the TV method as an example of a noise reduction technique and the DWT method as an example of frequency domain transformation. The TV method separates the original signal F into a skeleton component U, which includes edge and plane elements, and a residual component V, which includes noise and texture elements. A denoised image is obtained by adding the skeleton to the textures multiplied by a shrinkage coefficient. The skeleton component U is obtained by minimizing (2) below, which is the sum of the TV-norm regularization term expressed in (1) and a fidelity term.

$$
\begin{gather*}
\mathrm{J}(U)=\int|\nabla U| \,\mathrm{d}x\,\mathrm{d}y .  \tag{1}\\
\min _{U} \mathrm{~J}(U)+\frac{\mu}{2}\|F-U\|^{2} \tag{2}
\end{gather*}
$$

where $\mu$ is the parameter that controls the fidelity of the skeleton to the original image. The Chambolle projection is an effective tool for solving (2). The TV method delivers powerful denoising performance; however, it requires a large number of calculations because the computation is iterated until the solution converges. In addition, images obtained via the TV method often look unrealistic due to excessive loss of texture. Related works have proposed derivative TV algorithms such as the bilateral TV method and the digital TV method [2].
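To make the two terms of the objective concrete, the following minimal sketch evaluates a discrete TV norm and the TV objective for small images. It is an illustration only: the forward-difference discretization, the zero gradient at the border, and the names `tv_norm` and `tv_objective` are assumptions, not part of the original algorithm description. The fidelity term is added to the TV norm, as in the standard total variation denoising formulation.

```python
import math


def tv_norm(u):
    """Discrete isotropic TV norm: sum of gradient magnitudes over all pixels."""
    height, width = len(u), len(u[0])
    total = 0.0
    for y in range(height):
        for x in range(width):
            # Forward differences; the gradient is taken as zero past the border (assumption).
            dx = u[y][x + 1] - u[y][x] if x + 1 < width else 0.0
            dy = u[y + 1][x] - u[y][x] if y + 1 < height else 0.0
            total += math.hypot(dx, dy)
    return total


def tv_objective(u, f, mu):
    """TV norm of the skeleton u plus (mu / 2) times the squared distance to the input f."""
    fidelity = sum((fv - uv) ** 2
                   for f_row, u_row in zip(f, u)
                   for fv, uv in zip(f_row, u_row))
    return tv_norm(u) + 0.5 * mu * fidelity


noisy = [[0, 9, 0], [9, 0, 9], [0, 9, 0]]   # strongly oscillating image
flat = [[4, 4, 4], [4, 4, 4], [4, 4, 4]]    # perfectly smooth image
print(tv_norm(noisy), tv_norm(flat))
print(tv_objective(flat, noisy, mu=2.0))
```

A smooth candidate has a small TV norm but a large fidelity penalty, and the parameter `mu` sets the balance between the two.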
Discrete wavelet transformation is, like DFT, a transformation from the image signal domain into the frequency domain. The difference between the two is that DWT preserves the positional information that DFT lacks. It does this by separating the high-frequency components from the low-frequency component using the mother wavelet function. For 2-D images, applying horizontal and vertical wavelets divides an image into four components: LL, LH, HL, and HH. Still lower-frequency components can be extracted by recursively applying the wavelet transform to the separated LL component. The LL sub-band at the lowest frequency holds the geometric information of the original image signal. Thus, multi resolution analysis by multiple 2-D wavelet transforms achieves effective denoising in specific low-frequency bands. Fig. 1 shows an example of the result of two iterations of the wavelet transform. Several mother wavelet functions have been studied, e.g., the Haar wavelet, which is the most primitive wavelet and uses two pixels; the CDF97 wavelet, which is used in lossy JPEG2000; and the SSKF75 wavelet, which is used in lossless JPEG2000. The CDF97 wavelet has the highest fidelity of the three but has a high calculation cost because it references more pixels.

Fig. 1 Result of 2-dimensional wavelet transform

Fig. 2 Block diagram of MRA-TV algorithm
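The sub-band decomposition can be illustrated with the Haar wavelet, the simplest of the three. The sketch below is a minimal example under stated assumptions: it uses averaging normalization, requires even image dimensions, and the assignment of the names LH and HL to the two mixed bands follows one common convention (conventions differ between texts). The function names `haar_2d` and `inverse_haar_2d` are hypothetical.

```python
def haar_2d(image):
    """One level of a 2-D Haar transform on an image with even height and width."""
    ll, lh, hl, hh = [], [], [], []
    for y in range(0, len(image), 2):
        ll_row, lh_row, hl_row, hh_row = [], [], [], []
        for x in range(0, len(image[0]), 2):
            a, b = image[y][x], image[y][x + 1]
            c, d = image[y + 1][x], image[y + 1][x + 1]
            ll_row.append((a + b + c + d) / 4)  # low-pass in both directions
            lh_row.append((a + b - c - d) / 4)  # difference between the two rows
            hl_row.append((a - b + c - d) / 4)  # difference between the two columns
            hh_row.append((a - b - c + d) / 4)  # diagonal difference
        ll.append(ll_row)
        lh.append(lh_row)
        hl.append(hl_row)
        hh.append(hh_row)
    return ll, lh, hl, hh


def inverse_haar_2d(ll, lh, hl, hh):
    """Exact inverse of haar_2d: rebuilds each 2x2 block from the four sub-bands."""
    image = []
    for y in range(len(ll)):
        top, bottom = [], []
        for x in range(len(ll[0])):
            s, v, h, g = ll[y][x], lh[y][x], hl[y][x], hh[y][x]
            top += [s + h + v + g, s - h + v - g]
            bottom += [s + h - v - g, s - h - v + g]
        image += [top, bottom]
    return image


image = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]
ll1, lh1, hl1, hh1 = haar_2d(image)   # first iteration
ll2, lh2, hl2, hh2 = haar_2d(ll1)     # second iteration, applied to LL only
print(ll1, ll2)
```

Applying `haar_2d` again to the LL band, as in the last lines, corresponds to the recursive application described in the text.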

## B. Algorithm of MRA-TV

In this section, we explain the multi resolution analysis TV (MRA-TV) algorithm. Its block diagram is shown in Fig. 2. This algorithm is based on multi resolution analysis by wavelet transformation combined with the TV method. Applying the TV method at multiple levels effectively reduces the noise in the various frequency bands separated by the cascaded wavelet transforms. In addition, this data flow reduces the number of recursions of the TV method: the inverse-wavelet-transformed result of the TV method in a lower-frequency layer serves as the initial solution of the TV method in the next higher-frequency layer. Artifacts, which are a shortcoming of the wavelet transform, are also reduced by the TV method placed after the wavelet transform. The low-frequency component separated by wavelet transform (1) is fed into wavelet transform (2) to extract a still lower-frequency component. The TV method is applied iteratively to each frequency level contained in the original image signal, from the lowest to the highest. In this way, the noise contained in each level and the artifacts derived from the wavelet are reduced simultaneously. Another significant aspect of this algorithm is its interchangeable structure. As described before, there are several versions of both the TV method and the wavelet transform. The algorithm offers flexibility in denoising and fidelity performance because a TV or wavelet function can be swapped for another version. The performance can also be adjusted by changing the number of wavelet transformations. Thus, the proposed algorithm achieves high flexibility by exchanging each function and changing the number of iterations. Figure 3 shows example results of MRA-TV denoising and other algorithms. Compared with the TV method and with wavelet shrinkage, a wavelet-based denoising method, each used individually, MRA-TV reduces noise while maintaining edges and texture.

Fig. 3 Comparisons of wavelet shrinkage, conventional TV and MRA-TV algorithm

Fig. 4 Pipeline architecture of MRA-TV algorithm

## C. MRA-TV Architecture

Fast and low-power hardware IPs are desired for embedded systems, and we have responded to this need by developing a denoising architecture based on the algorithm described in the previous section. We assume the same operating frequency as the data rate of an image sensor such as a complementary MOS (CMOS) sensor or a charge-coupled device (CCD). Input data is assumed to be an image in raster scan order, and internal data is processed line by line using pipelined processing without any frame buffers. Figure 4 shows a circuit block diagram of the MRA-TV architecture. Each module is equipped with line buffers, without any frame buffers, to adhere to the original data flow of the MRA-TV algorithm. The recursion of the TV method in the original algorithm is realized as serially pipelined TV modules, so several identical TV modules exist in the same frequency band. Each individual TV module performs a single iteration of the recursive TV method. In this work, we use bilateral TV as a more effective TV method to reduce the number of wavelet iterations, which is related to the amount of line buffers.

Figure 5 shows the input pixels for the bilateral TV module. As shown in the figure, the bilateral TV module processes this $5 \times 3$-line pixel area. Figure 6 shows a core part of the bilateral TV data path. One bilateral TV module contains 29 multipliers: 14 for the calculation of the sum of square norms and 15 for parallel multiplications. Because these multipliers make up the majority of the circuit area, reducing their number yields a significant optimization effect.

Fig. 5 Input 15 pixels for bilateral TV

Fig. 6 Part of core data path of bilateral TV

## III. TV module Optimization Method

This section proposes a method for reducing the number of multipliers in the MRA-TV architecture described in Section II.C.

## A. Calculations Optimization

This section describes our idea for optimizing the area of the bilateral TV module. As described in Section II.C, bilateral TV involves 29 multiplications. If we create a module with a pipelined architecture whose data input interval is one ($\mathrm{DII}=1$), 29 multipliers are required because none of the functional units can be shared. Figure 7 shows the 15-pixel domain that is the input of a bilateral TV module. The sum of square norms is calculated between the center pixel and the 14 peripheral pixels in the core part of the bilateral TV module, as also shown in Fig. 6. Fourteen multipliers are normally required for this. To optimize the area of the bilateral TV module, we developed an improved calculation method for the sum of square norms that reduces those 14 multipliers to five.


Fig. 7 Main idea of our optimization

Our main idea is to reuse the result from one clock earlier and recalculate only the difference between the current clock and the previous one. Fig. 7 shows the concept. At each clock, the center pixel and the peripheral pixels move one pixel to the right. The domain from the top left U1 to the bottom right B5 is the previous window, and the domain from the top left U2 to the bottom right B6 is the current one. In our method, the current sum of square norms is obtained from the previous result by subtracting the difference over the overlap domain, subtracting the contribution of the three leftmost pixels (U1, C1, and B1), and adding the contribution of the three new rightmost pixels (U6, C6, and B6). Note that the center pixel also changes from C3 to C4. The difference of the square norm of a peripheral pixel in the overlap domain (e.g., U2) is calculated as

$$
\begin{equation*}
\Delta U_{2}=\left(U_{2}-C_{3}\right)^{2}-\left(U_{2}-C_{4}\right)^{2} \tag{3}
\end{equation*}
$$

Expanding the squares in (3) gives

$$
\begin{align*}
\Delta U_{2} & =\left(U_{2}^{2}-2 U_{2} C_{3}+C_{3}^{2}\right)-\left(U_{2}^{2}-2 U_{2} C_{4}+C_{4}^{2}\right)  \tag{4}\\
& =\left(C_{3}-C_{4}\right)\left(C_{3}+C_{4}-2 U_{2}\right)
\end{align*}
$$

Only one multiplication is required for $\Delta U_{2}$ because multiplication by a constant can be realized without multipliers. Thus, the sum over the ten peripheral pixels in the overlap domain (U2 to B5, excluding the two center pixels C3 and C4, whose contributions cancel) is described as

$$
\begin{equation*}
\sum \Delta=\left(C_{3}-C_{4}\right)\left(10\left(C_{3}+C_{4}\right)-2\left(U_{2}+U_{3}+\cdots+B_{5}\right)\right) \tag{5}
\end{equation*}
$$

According to the above, the overlap domain can be calculated with just one multiplier. Three multiplications are required for the newly added pixels (U6, C6, and B6), but the result computed for added pixels can be reused later when those pixels are deleted. The current deleted part (U1, C1, and B1) is calculated from the added part computed four clocks ago, using the difference between the sum of square norms for the current center pixel and that for the center pixel four clocks ago. This difference is calculated in the same way as for the overlapped part, i.e., with just one multiplication. Finally, the deleted part is derived by subtracting this difference from the added part of four clocks ago. In total, only five multiplications are required for the sum of square norms calculation with our optimization. As a result, the circuit in Fig. 6 is realized with just 20 multipliers, compared with 29 in the non-optimized circuit. Naturally, the number of adders and shifters increases compared with the straightforward implementation. Overall, however, our approach works particularly well for area optimization because a multiplier has a 10 times greater impact on area than an adder.

Fig. 8 Data input interval will increase with the increasing number of TV layers.

Fig. 9 Pipeline stage will be merged by resource sharing
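The incremental update can be checked numerically with a short software model. The sketch below is not the circuit description used in this work; it is a minimal illustration that assumes three image rows held as Python lists, integer pixel values, and a warm-up step that computes the first window directly. The names `direct_norm`, `column_term`, and `incremental_norms` are hypothetical. In the steady state each step uses one multiplication for the overlap, three for the added column, and one to convert a stored column term into the deleted part, five in total.

```python
import random


def direct_norm(rows, c):
    """Reference: squared differences between the center (row 1, column c) and its 3x5 window."""
    center = rows[1][c]
    return sum((rows[r][j] - center) ** 2
               for r in range(3) for j in range(c - 2, c + 3))


def column_term(rows, k, center):
    """Squared differences of the three pixels in column k against `center` (3 multiplications)."""
    return sum((rows[r][k] - center) ** 2 for r in range(3))


def incremental_norms(rows):
    """Sliding sum of square norms, updating the previous result instead of recomputing it."""
    width = len(rows[0])
    c = 2
    s = direct_norm(rows, c)  # warm-up: the first window is computed directly (assumption)
    results = [s]
    # For each column, remember its term and the center value it was computed against.
    stored = {k: (column_term(rows, k, rows[1][c]), rows[1][c]) for k in range(5)}
    while c + 3 < width:
        c_prev, c_cur = rows[1][c], rows[1][c + 1]
        # Overlap domain: ten pixels in columns c-1..c+2, excluding both centers (1 multiplication).
        overlap_sum = sum(rows[r][j] for r in range(3)
                          for j in range(c - 1, c + 3)) - c_prev - c_cur
        delta_overlap = (c_prev - c_cur) * (10 * (c_prev + c_cur) - 2 * overlap_sum)
        # Added part: the new rightmost column against the new center (3 multiplications).
        new_term = column_term(rows, c + 3, c_cur)
        stored[c + 3] = (new_term, c_cur)
        # Deleted part: reuse the stored term of the leftmost column (1 multiplication).
        old_term, c_ref = stored.pop(c - 2)
        col_sum = sum(rows[r][c - 2] for r in range(3))
        deleted = old_term - (c_ref - c_prev) * (3 * (c_ref + c_prev) - 2 * col_sum)
        s = s - delta_overlap - deleted + new_term
        c += 1
        results.append(s)
    return results


random.seed(0)
rows = [[random.randrange(256) for _ in range(16)] for _ in range(3)]
reference = [direct_norm(rows, c) for c in range(2, 14)]
print(incremental_norms(rows) == reference)
```

Once the warm-up columns have left the window, each stored term refers to the center pixel of four clocks before the previous one, which matches the reuse interval described in the text.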

## IV. Resource sharing strategy

## A. Strategy

This section proposes our resource sharing strategy, which focuses on the data flow of the MRA-TV algorithm. Fig. 8 shows the data input interval (DII) of the multiple TV modules in the MRA-TV algorithm. Each additional wavelet layer halves the amount of valid input data passed to the lower-layer wavelet transformation module. In other words, the blank time becomes longer. This is because only the lower-frequency components separated by the wavelet transformation in the multi resolution analysis are sent to the next wavelet transformation. By utilizing the surplus time created by the wavelet transformation, functional units can be shared within each TV module. Fig. 9 explains how resource sharing is achieved when the DII is increased. If inputs enter the pipeline at every cycle, the processes of the first cycle and the second cycle must be implemented as separate circuits in the pipeline. If a blank cycle exists every other cycle, however, the available number of cycles is doubled, so the two processes can be merged into one pipeline stage. For example, 15 multiplier units are required when a process must execute 15 multiplications in one cycle. By dividing this process into two cycles, the number of simultaneously executed multiplications is reduced to eight in the first cycle and seven in the second. The maximum required number of physical multiplier units is therefore eight. This method can reduce the number of multipliers required in bilateral TV modules because those modules contain many multipliers that execute simultaneously.

Fig. 10 Our proposed design methodology
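The multiplier count in this example follows from spreading the operations as evenly as possible over the available cycles. The sketch below is an idealized model, not the scheduler of any synthesis tool: it ignores data dependencies and multiplexer overhead, and the function names `schedule` and `multipliers_needed` are hypothetical.

```python
def schedule(num_multiplications, dii):
    """Distribute multiplications as evenly as possible over `dii` cycles."""
    base, extra = divmod(num_multiplications, dii)
    return [base + (1 if cycle < extra else 0) for cycle in range(dii)]


def multipliers_needed(num_multiplications, dii):
    """Physical multipliers required: the busiest cycle determines the count."""
    return max(schedule(num_multiplications, dii))


print(schedule(15, 2), multipliers_needed(15, 2))
for dii in (1, 2, 4):
    # 20 is the multiplier count of the optimized TV module; the DII values are illustrative.
    print(dii, multipliers_needed(20, dii))
```

With 15 multiplications and a DII of two, the model reproduces the split into eight and seven multiplications and the resulting eight physical multipliers.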

## B. Design methodology using high-level synthesis

This section describes our design methodology using high-level synthesis for realizing the resource sharing technique explained in the previous section. We expect to significantly reduce the area by the combination of two techniques: the sum of square norm optimization described in section III and the resource sharing technique described in section IV.A. However, there would be considerable difficulty in simultaneous application, for two reasons:

1) The optimization in Section III.A increases the complexity of the circuit. Because our proposal aims only at reducing multiplications, the number of adders and wires increases while the number of multipliers decreases. Designing resource sharing across multiple cycles is not easy for an architecture with such complicated connections between units.
2) The optimal internal architecture differs for each bilateral TV module because the blank period of the input data depends on the hierarchy of the wavelet. To implement an optimum circuit, the bilateral TV module must be redesigned with a different architecture for each hierarchy.

To solve these problems, we propose a design methodology utilizing automatic pipeline synthesis. This is one of the key functions of high-level synthesis tools, which can synthesize a pipelined circuit automatically from a behavioral description. Resource sharing that exploits the input blank interval is easily realized by changing the data input interval (DII), a parameter of pipeline synthesis that is equivalent to the input blank interval of the pipeline. Fig. 10 shows the outline of our design methodology. Designers only have to prepare a single C source description to which the optimization described in Section III has been applied. By applying behavioral synthesis to this single behavioral description while changing the DII, multiple optimized RTLs are generated. For example, the TV modules in the lowermost hierarchy are synthesized with DII $=4$ because their inputs arrive every four cycles. Optimized TV modules are thus generated automatically, with as much resource sharing as possible for each DII. With this method, there is no need to redesign each TV module to achieve maximum resource sharing because the optimization is performed automatically for each layer.

Fig. 11 Design area of non-optimized TV

Fig. 12 Design area of TV optimized by the proposed method

## V. Evaluation

We created an MRA-TV architecture with optimized bilateral TV modules according to the design methodology described in Section IV. The target video quality is defined as progressive full HD image resolution $(1920 \times 1080)$ at 30 frames $/ \mathrm{sec}$ in the YUV422 format. The operating frequency is assumed to be twice that of the image sensor so that the circuit can be shared between the UV components and the Y component. We set the circuit operating frequency to 150 MHz under the assumption that the image sensor outputs $1 \mathrm{pixel} / \mathrm{clk}$ at 75 MHz. Our design is synthesized using a $90-\mathrm{nm}$ library. Here, we define the multiplication optimization method as Method 1 and the resource sharing design methodology as Method 2.

Fig. 13 Before/After resource sharing of MUX

Fig. 14 Result of our optimization for MRA-TV
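The relation between these rates can be checked with simple arithmetic. The sketch below is only a sanity check under one assumption that goes beyond the text: in YUV422, each pixel carries one Y sample and one chroma sample (U and V alternate), so a circuit shared between luma and chroma handles two samples per pixel. The constant names are hypothetical.

```python
WIDTH, HEIGHT, FRAMES_PER_SECOND = 1920, 1080, 30
SENSOR_CLOCK_HZ = 75_000_000      # image sensor outputs one pixel per clock
SAMPLES_PER_PIXEL_YUV422 = 2      # one Y sample plus one U or V sample (assumption)

# Active pixels per second for full HD at 30 frames/sec (blanking not included).
active_pixel_rate = WIDTH * HEIGHT * FRAMES_PER_SECOND

# Sharing one circuit between Y and UV doubles the required sample rate.
circuit_clock_hz = SENSOR_CLOCK_HZ * SAMPLES_PER_PIXEL_YUV422

print(active_pixel_rate, circuit_clock_hz)
```

The active pixel rate stays below the assumed 75 MHz sensor clock, and doubling that clock gives the 150 MHz circuit frequency used in the evaluation.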

Fig. 11 shows how the actual gate count and the percentage of area occupied by multiplexers change with the DII in Method 2 for bilateral TV modules without Method 1. Fig. 12 shows the same results for bilateral TV modules with Method 1. In terms of area, both the arithmetic units and the registers decrease monotonically as the DII increases. The multiplexer area, in contrast, shows almost no variation after DII $=2$, so the percentage of the area it occupies increases. This is because the number of inputs to the same type of functional unit does not change even if the physical number of functional units is reduced by resource sharing. We conclude that there is no difference between the multiplexer areas in Fig. 13 (a) and (b). As the number of multiplexers grows, manual design can become difficult in terms of manpower because of the increase in control and signal wires and in the number of FSM states. The number of iterations of bilateral TV is set to three in every hierarchy because this is expected to reduce noise sufficiently. We evaluated the change in area with the number of wavelets, i.e., the number of hierarchies. Fig. 14 shows the total area of the MRA-TV architecture. The combined optimization method proposed in this work achieves a $52 \%$ area reduction with the 5-hierarchy architecture.

## VI. Conclusion

In this work, we proposed an area reduction method for the MRA-TV denoise algorithm that is a hybrid of wavelet transformation and bilateral TV. A $52 \%$ area reduction was confirmed by using a combination of two types of optimization: the sum of square norm optimization and resource sharing. Further, our design method using high-level synthesis reduces the difficulty of applying the optimizations. The proposed method is widely adaptable to similar architectures because multi-resolution analysis by multiple wavelet transform is commonly used in signal processing.

## References

[1] D. Barash. A fundamental relationship between bilateral filtering, adaptive smoothing, and the nonlinear diffusion equation. In IEEE Trans. On Pattern Analysis and Machine Intelligence, volume 24, page 844, 2002.
[2] T. Chan, S. Osher, and J. Shen. The digital TV filter and nonlinear denoising. In Image Processing, IEEE Transactions on, volume 10, pages 231-241, 2001.
[3] F. Durand and J. Dorsey. Fast bilateral filtering for the display of high-dynamic-range images. In Proc of SIGGRAPH, pages 257-266, 2002.
[4] J.-F. Aujol, G. Aubert, L. Blanc-Feraud, and A. Chambolle. Image decomposition into a bounded variation component and an oscillating component. In Journal of Mathematical Imaging and Vision, volume 22, pages 71-88, 2005.
[5] W. K. Pratt. Digital image processing. In Wiley, New York, 1978.
[6] L. Rudin, S. Osher, and E. Fatemi. Nonlinear total variation based noise removal algorithms. In Physica D, volume 60, pages 259-268, 1992.
[7] C. Tomasi and R. Manduchi. Bilateral filtering for gray and color images. In Proc. of Int. Conf. Computer Vision, pages 839-846, 1998.
[8] K. Senzaki, M. Toda, and M. Tsukada. Noise reduction controlling filtering-direction based on image features. In Proc. of Meeting on Image Recognition and Understanding, 2013.
[9] S. Farsiu, D. Robinson, M. Elad, and P. Milanfar. Robust shift and add approach to super resolution. In Proc. of SPIE 5203, Applications of Digital Image Processing, 2003.

