# Using Body Biasing for Energy Efficient Frequency Scaling in a Dynamically Reconfigurable Processor

Johannes Maximilian Kühn, University of Tübingen, Computer Engineering, Sand 13, 72076 Tübingen, Germany

Hideharu Amano, Keio University, Department of ICS, 3-14-1 Hiyoshi, Kohoku-ku, Yokohama, Japan, Email: muccra@am.ics.keio.ac.jp

Wolfgang Rosenstiel, University of Tübingen, Computer Engineering, Sand 13, 72076 Tübingen, Germany

Abstract— Reconfigurable architectures are equally affected by the challenges of continued technology scaling. Mission profiles require extremely versatile devices, with modes ranging from ultra-low-power to high-performance operation. In this study, STMicro's 28nm UTBB-FDSOI process is examined for use with coarse-grained body biasing in a Dynamically Reconfigurable Processor. We show that coarse-grained body biasing reaches performance levels similar to those accomplished by voltage scaling, but with 37.79% and 40.76% greater energy efficiency at supply voltages of 0.6V and 0.8V, respectively, compared to scaling the supply voltage to the next higher level of 0.8V and 1.0V, respectively. Furthermore, we show that coarse-grained body biasing can optimize energy efficiency and considerably mitigate the negative side-effects of body biasing without architectural changes. This yields energy efficiency improvements of 16.1%, 10.6% and 12.8% at 0.6V, 0.8V and 1.0V, respectively, over whole-chip body biasing.

# I. INTRODUCTION

Transistor scaling has long been the primary strategy to increase energy efficiency, but it has lost its appeal since process technologies entered the sub-100nm regime [1]. Mission profiles of modern computing systems require extreme flexibility, with ultra-low-power as well as very-high-performance modes in one chip. This requirement is exacerbated by worsening phenomena such as sub-threshold and direct gate tunneling leakage, and unchanging power budgets add to the severity of the problem. Ultimately, it is increasingly hard to uphold Moore's Law. One possible approach to counter these phenomena is multi-$V_{TH}$ design. Via $V_{TH}$, leakage and delay can be managed efficiently. However, while optimizing for leakage requires raising $V_{TH}$, optimizing for delay requires lowering it. Ideally, such adjustments are made whenever and only wherever required. STMicro's 28nm and future UTBB-FDSOI nodes allow exactly this through body biasing. By applying potentials to the transistor body (i.e. body biasing), the charge required to invert the channel can be increased or decreased, thus altering $V_{TH}$ [2, p. 79]. In STMicro's FDSOI technology nodes $\le$ 28nm, this effect is especially strong. As standard cells support this feature by default, body biasing can be realized at virtually any granularity. However, naive utilization will yield sub-optimal results at best or, more probably, erroneous behavior due to timing errors.

To avoid erroneous behavior and to use this technique beneficially, this study proposes coarse-grained, per-Processing Element (PE) body biasing and strategies for using it in a Dynamically Reconfigurable Processor (DRP) to address two major issues: leakage and energy efficiency. To this end, a slack-based body bias voltage (BBV) assignment algorithm is used to target increased clock frequencies at unchanged supply voltage by using Forward Body Biasing (FBB). Because the supply voltage stays unchanged, this approach avoids the quadratic increase of dynamic power consumption with supply voltage [2, p. 43]. At the same time, we can contain leakage in PEs that do not require FBB to meet timing by employing Reverse Body Biasing (RBB). Furthermore, because $V_{DD}$ is not increased, leakage itself is less severe than at higher $V_{DD}$ [2, p. 79]. We will show that Energy Delay Products (EDP) can be significantly improved compared to designs not using body biasing. Furthermore, we will quantify by how much coarse-grained body biasing improves EDPs over whole-chip body biasing.

This paper starts with a brief overview of related work in Section II and continues with a detailed description of the effects of body biasing in Section III. After a brief introduction to the evaluation flow in Section IV, Section V introduces the target architecture. Section VI presents strategies and algorithms to exploit body biasing for frequency scaling. Section VII describes the experiments used to validate these methods and presents the obtained results. Section VIII concludes this work.

# II. RELATED WORK AND CONTRIBUTIONS

In [3], a study on energy efficiency for STMicro's 28nm UTBB-FDSOI technology is conducted. For a full CPU core, a 40% improvement in energy efficiency was achieved by using forward body biasing (FBB) instead of voltage scaling to increase clock frequency. In [4], the authors present an ARM Cortex-A9 dual-core processor realized in the same 28nm UTBB-FDSOI process. Through strong FBB ($V_{BB} = 3V$), the maximum attainable clock frequency was increased by 450%, 60% and 34% at $V_{DD}$ of 0.5V, 1.0V and 1.3V, respectively. In [5], Altera's BB approach for Stratix III and IV is presented. It uses only reverse body biasing (RBB) to minimize static currents. RBB voltages are restricted to two levels, high-$V_{TH}$ for low-power mode and normal-$V_{TH}$ for high-speed mode. BBs are assigned via slack evaluation. BB is applied at Logic Array Block (LAB) granularity, with each unit comprising multiple Look-Up Tables (LUTs) with local routing and two adjacent routing facilities. In [6], an FPGA prototype supporting very fine-grained programmable BB is presented. It targets leakage current reduction. In this FPGA design, each multiplexer, LUT and flip-flop constitutes a BBI. The authors achieved a 91.4% leakage reduction. In [7], the authors propose a DVS- and BB-aware scheduling algorithm for a real-time embedded system, improving energy efficiency by 38%.

In [8], a CGRA-based accelerator supporting BB is proposed. The accelerator consists of a reconfigurable PE matrix and a microcontroller, each representing one BBI. Using these BB techniques, power savings of about 10% were realized. The authors of [9] propose a CGRA-optimized DVFS implementation operating at PE granularity.

This work contributes the following: i) an evaluation of a DRP for both RBB and FBB at PE granularity, ii) programmable BB strategies for a DRP in an advanced SOI process, allowing much greater intensities than previously possible due to process technology limitations, iii) an evaluation of frequency scaling via FBB with further optimization of energy efficiency via RBB in timing-uncritical PEs and iv) a comparison of coarse-grained BB versus global, whole-chip BB. None of the proposed methods requires changes to the architecture, and all are applicable to most reconfigurable architectures.

# III. BODY BIASING

Through the application of potentials to the transistor body, electrical characteristics such as the threshold voltage $V_{TH}$ can be changed. STMicro's 28nm FDSOI process in particular can be strongly influenced using BB. However, not all effects are desired, so the use of BB needs to be weighed carefully. Fig. 1 shows the influence of BB on leakage current at $125^{\circ}C$ for various supply voltages. Let the function $LF_{V_{DD}}(V_{BB})$ denote the depicted factor by which leakage changes at a certain $V_{DD}$ for BB voltage $V_{BB}$ [V]. These results were gathered through simulation of different standard cells using Synopsys HSPICE with STMicro's standard cell and device models. Fig. 2 visualizes the relationship between BB and delay. Let $DLF_{V_{DD}}(V_{BB})$ denote the factor by which delay, and hence the attainable clock frequency, is influenced. The delay measurements were conducted using a ring-oscillator setup with five inverter stages.

Fig. 2. Impact of BB on delay for $V_{DD}=\{0.4V, 0.6V, 0.8V, 1.0V, 1.2V\}$ and $T=125^\circ C$

It can be observed that with BB voltages $V_{BB} < 0V$, i.e. RBB, leakage decreases. At $V_{BB} = -0.3V$, leakage decreases by 23% to 24%. With FBB, leakage doubles at $V_{BB} = 0.5V$, and at $V_{BB} = 1.3V$ it reaches between 5.9x (for $V_{DD} = 1.2V$) and 6.42x (for $V_{DD} = 0.4V$) of the respective leakage without BB. BB's influence on delay depends mainly on $V_{DD}$ and $V_{BB}$ itself. For $V_{BB} < 0V$, delay increases. At $V_{BB} = -0.3V$, delay increases by about 4% for $V_{DD} = 1.2V$, whereas it increases by about 24% for $V_{DD} = 0.4V$. With FBB, delay is reduced by a factor of 2.65 for $V_{DD} = 0.4V$ and by a factor of 1.17 for $V_{DD} = 1.2V$. At lower $V_{DD}$, BB has a stronger impact on delay than for $V_{DD} > 0.8V$, while it reduces leakage current equally at all $V_{DD}$.

BB in this work has two primary beneficial applications: leakage reduction and delay reduction, i.e. frequency scaling, which is the focus of this work. Because of library restrictions for EDA tools, only supply voltages $V_{DD} = \{0.6V, 0.8V, 1.0V\}$ can be covered. For these $V_{DD}$, the clock frequency may be scaled to 164% for $V_{DD} = 0.6V$, to 136% for $V_{DD} = 0.8V$ and to 124% for $V_{DD} = 1.0V$. As described above, these two applications may be in conflict. Figs. 1 and 2 illustrate the trade-off that has to be struck between leakage and delay.
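The leakage factor $LF_{V_{DD}}(V_{BB})$ is only given graphically, but the few values quoted above are enough for a rough sketch of how such a lookup could be used. The following is a minimal example, assuming piecewise-linear interpolation between the quoted points for $V_{DD} = 1.2V$. The names `LF_POINTS_1V2` and `interpolate` are illustrative, and the pairing of the 23% reduction at $V_{BB} = -0.3V$ with $V_{DD} = 1.2V$ is an assumption. Real leakage curves are strongly nonlinear, so values between the tabulated points are indicative only.

```python
from bisect import bisect_right

# Leakage factors quoted in the text for V_DD = 1.2 V. The -0.3 V entry uses
# 0.77 (a 23% decrease); pairing that figure with V_DD = 1.2 V is an assumption.
LF_POINTS_1V2 = [(-0.3, 0.77), (0.0, 1.0), (0.5, 2.0), (1.3, 5.9)]


def interpolate(points, v_bb):
    """Piecewise-linear interpolation over (V_BB, factor) pairs sorted by V_BB."""
    xs = [x for x, _ in points]
    if not xs[0] <= v_bb <= xs[-1]:
        raise ValueError(f"V_BB={v_bb} V is outside the tabulated range")
    # Index of the segment containing v_bb (clamped to the last segment).
    i = min(bisect_right(xs, v_bb), len(xs) - 1) - 1
    (x0, y0), (x1, y1) = points[i], points[i + 1]
    return y0 + (y1 - y0) * (v_bb - x0) / (x1 - x0)


for v_bb in (-0.3, 0.0, 0.5, 0.9, 1.3):
    print(f"V_BB = {v_bb:+.1f} V -> leakage factor ~ {interpolate(LF_POINTS_1V2, v_bb):.2f}")
```

A table for the delay factor $DLF_{V_{DD}}(V_{BB})$ could be handled the same way once its data points are read off Fig. 2.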

# IV. EVALUATION FLOW

Due to the lack of freely adjustable BB in EDA flows, alternatives to account for the effect of BB had to be found. SPICE simulations constitute the most accurate method, but also by far the slowest. Thus, in this work we used static timing analysis tools in conjunction with SPICE simulations. Total power consumption is composed of dynamic and static power [2]:

$$P_{total} = P_{dynamic} + P_{static} \tag{1}$$

Dynamic power further consists of switching power, which scales linearly with frequency, and short-circuit power, which is neglected in this examination [2]. As $P_{dynamic}$ does not depend on BB, it is correctly approximated by standard static timing analysis tools using SAIF traces of the applications to be evaluated. $P_{static}$, however, depends strongly on BB and is thus accounted for through SPICE simulations of various standard cells. In these evaluations, the effect of BB on leakage current is computed (see Fig. 1) and then applied as a factor to the static power figure generated by the static timing analysis tool. While this allows coherent power estimation, the results might be pessimistic because, after extraction from layout, IO memory and hardware components surrounding the PE array cannot be separated from the array itself. However, due to the overall dominance of the PE array, this solution offers the best trade-off compared to full-chip SPICE simulations of representative traces.
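The estimation scheme described above can be summarized in a few lines. This is a minimal sketch, assuming the leakage factor is applied as a single multiplier on the static power reported by the timing tool. The function name and all power figures are hypothetical and serve only to illustrate Eq. (1).

```python
def total_power(p_dynamic_w, p_static_nobb_w, leakage_factor):
    """Eq. (1), with static power scaled by the leakage factor LF(V_BB).

    p_dynamic_w:      dynamic power from the timing tool (independent of BB)
    p_static_nobb_w:  static power from the timing tool without body bias
    leakage_factor:   LF(V_BB) obtained from SPICE simulation (1.0 = no BB)
    """
    if leakage_factor < 0:
        raise ValueError("leakage factor must be non-negative")
    return p_dynamic_w + leakage_factor * p_static_nobb_w


# Hypothetical figures: 10 mW dynamic and 2 mW static power without BB.
no_bb = total_power(10e-3, 2e-3, 1.0)  # no body bias
fbb = total_power(10e-3, 2e-3, 2.0)    # leakage doubled, as quoted for V_BB = 0.5 V
print(f"no BB: {no_bb * 1e3:.1f} mW, FBB 0.5 V: {fbb * 1e3:.1f} mW")
```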

# V. TARGET ARCHITECTURE

The target architecture is a performance-centric DRP focusing on mobile applications, which require both high energy efficiency and peak performance [10]. To increase the maximum attainable clock frequency, the Processing Elements (PEs) are fully pipelined. The pipeline has 5 stages: two instruction fetch stages followed by decode, execution and write-back stages. To avoid pipeline stalls due to inter-PE data dependencies, this DRP supports vector instructions to optimize PE utilization without requiring additional contexts. The particular instance used in this study is equipped with 16 PEs, each with 32 contexts and 8 registers (see the schematic illustration in Fig. 3). Each PE's ALU supports full-width multiplication as well as common arithmetic and logical functions. The PEs are interconnected using a double-nearest network.



Fig. 3. Schematic representation of a PE with Body Bias Selector

Programmable BB in this study is realized in an exemplary fashion to offer a conservative estimate of area overheads, without the intention of proposing a new BB circuit. Through global Vdds and Gnds rails, each BBI is supplied with an nmos low and a pmos high reference voltage. The respective counterparts, nmos high and pmos low, are supplied via the complementary high-voltage supply. Each of these reference voltages is then used in an operational amplifier controlled by a 4-bit Digital-to-Analog Converter (DAC). The DACs are built from R-2R ladders. This circuit covers about $858.5 \mu m^2$ each for nmos and pmos, totaling $1717 \mu m^2$. With one PE covering $0.021052 mm^2$, the area overhead amounts to 8.16%. Variants with much lower area overhead but reduced flexibility (e.g. analog multiplexers) have been demonstrated before.
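The area overhead follows directly from the figures above. The short calculation below reproduces it; the function and constant names are illustrative only.

```python
UM2_PER_MM2 = 1_000_000  # 1 mm^2 = 10^6 um^2


def bb_area_overhead_percent(generator_area_um2, generators_per_pe, pe_area_mm2):
    """Area of the BB generators relative to the area of one PE, in percent."""
    bb_area_um2 = generator_area_um2 * generators_per_pe
    return 100.0 * bb_area_um2 / (pe_area_mm2 * UM2_PER_MM2)


# One generator each for nmos and pmos at 858.5 um^2, PE area 0.021052 mm^2.
overhead = bb_area_overhead_percent(858.5, 2, 0.021052)
print(f"{overhead:.2f}%")  # prints 8.16%
```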

# VI. SCALING CLOCK FREQUENCY USING FBB

The maximum attainable clock frequency with FBB, $F_{max,V_{DD},BB}$ [Hz], can be computed directly using the functions introduced in Section III:

$$F_{max,V_{DD},BB} = F_{max,V_{DD}} \cdot DLF_{V_{DD}}(\max_{FBB}) \tag{2}$$

where $F_{max,V_{DD}}$ [Hz] is the maximum clock frequency without FBB. However, applying this BB to the whole chip or PE array would result in tremendous leakage currents. Our evaluations showed that, besides the critical path, only small portions of each PE require FBB at $F_{max,V_{DD},BB}$ at all. On the contrary, most paths can even tolerate RBB, allowing leakage reduction. Which portion of a PE is used depends on the operations performed on this particular PE. This can be expressed using the slack of the paths concerned. Fig. 4 displays the slack for each operation at each examined $V_{DD}$ for $F_{max,V_{DD},BB}$.

Fig. 4. Slack measured in fractions of a clock cycle for a PE for all operations supported by the ALU, $V_{DD} = \{0.6V, 0.8V, 1.0V\}$ and clock frequencies without and with FBB application for each $V_{DD}$

Depending on the positive slack available, RBB may be applied, and similarly FBB may have to be applied to compensate for negative slack if such an operation is used. Actual and required slack depend on the clock frequency. Thus, the required slack is represented by the functions

$$RSF_f(V_{BB}) = (1 - DLF_{V_{DD}}(V_{BB})) \cdot t_{clkp} \tag{3}$$

where $t_{clkp} = f^{-1}$ [s] is the clock period. The per-op slack function $PSF_f(op)$ is obtained through measurement; the slack at $F_{max,V_{DD},BB}$ may also be computed through translation of the slack curve obtained at $F_{max,V_{DD}}$. Figs. 5, 6 and 7 display the slack in one PE of each supported operation at $F_{max,V_{DD}}$ and $F_{max,V_{DD},BB}$ versus the amount of slack gained or consumed through the application of BB $V_{BB}$ [V]. One diagram is created for each $V_{DD}$. Graphically, these diagrams indicate which BB levels are possible for each operation.

Fig. 5. Required slack for $V_{BB}$ versus per-OP slack for clock frequencies without FBB (274MHz) and with FBB (450MHz)
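Eqs. (2) and (3) can be sketched numerically as follows. The delay factors at maximum FBB are taken from the percentages quoted in Section III (164%, 136% and 124%) and the base frequencies from the captions of Figs. 5 to 7; the function names are illustrative. Note that under Eq. (3) the required slack is negative for FBB ($DLF > 1$, slack is gained) and positive for RBB ($DLF < 1$, slack is consumed).

```python
def f_max_with_fbb(f_max_hz, dlf_at_max_fbb):
    """Eq. (2): scale the no-BB maximum frequency by the delay factor at maximum FBB."""
    return f_max_hz * dlf_at_max_fbb


def required_slack(dlf, f_hz):
    """Eq. (3): RSF_f(V_BB) = (1 - DLF(V_BB)) * t_clkp, in seconds."""
    t_clkp = 1.0 / f_hz  # clock period
    return (1.0 - dlf) * t_clkp


# (V_DD [V], F_max without BB [Hz], DLF at maximum FBB)
for v_dd, f_max, dlf in [(0.6, 274e6, 1.64), (0.8, 500e6, 1.36), (1.0, 704e6, 1.24)]:
    f_bb = f_max_with_fbb(f_max, dlf)
    rsf = required_slack(dlf, f_bb)
    print(f"V_DD = {v_dd} V: F_max,BB ~ {f_bb / 1e6:.0f} MHz, "
          f"RSF at max FBB = {rsf * 1e9:.2f} ns")
```

The computed frequencies land within a few MHz of the 450MHz, 681MHz and 871MHz given in the figure captions, which is consistent with the quoted percentages being rounded.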

If a data point of $PSF_f(op)$ lies above a certain data point of $RSF_f(V_{BB})$ for the corresponding operating frequency $f$, operation $op$ at BB $V_{BB}$ does not violate timing constraints. In the case of $V_{DD} = 0.6V$ (Fig. 5), all operations can of course be executed at $f = 274MHz$ without FBB; however, for RBB of $V_{BB} = -0.3V$, the timing of ADD/SUB and MULT would fail. At $f = 450MHz$, ADD/SUB, BS, CMP, MULTTB and MULT need FBB, although of differing intensity, to meet timing. As indicated in Fig. 2, the influence BB exerts on delay decreases with increasing $V_{DD}$. Comparing Fig. 5 to Fig. 6, this effect becomes visible.

Fig. 6. Required slack for $V_{BB}$ versus per-OP slack for clock frequencies without FBB (500MHz) and with FBB (681MHz)

For $f = 500MHz$, only MULT, which contains the critical path, does not meet timing at RBB of $V_{BB} = -0.3V$. Furthermore, for $f = 681MHz$, fewer operations need FBB, and those that do, with the exception of MULT, require lower levels of FBB. At the same time, more operations can tolerate RBB even at this frequency. This trend continues for $V_{DD} = 1.0V$ and beyond, where at $f = 704MHz$ all operations except MULT allow for RBB of $V_{BB} = -0.3V$. At $f = 871MHz$, only ADD/SUB and MULT need FBB to meet timing.

Fig. 7. Required slack for $V_{BB}$ versus per-OP slack for clock frequencies without FBB (704MHz) and with FBB (871MHz)

The analysis of these results leads to the interim conclusion that clock frequency is limited by a small number of circuits, while others could operate at higher frequencies. This can be leveraged if the delay of the limiting circuits is made to comply with constraints through BB. To minimize negative side-effects (Fig. 1), the number of affected circuits should be minimized. Thus, the following PE-granularity BB assignment algorithm is proposed.

**Require:** Operations assigned to PE $PE_{Ops}$

**Ensure:** Body bias at $V_{BB}$ or greater meets timing

$$
\begin{array}{rl}
1: & \min_{slack} \leftarrow \infty \\
2: & \textbf{for all } op \in PE_{Ops} \textbf{ do} \\
3: & \quad \textbf{if } PSF(op) < \min_{slack} \textbf{ then} \\
4: & \quad\quad \min_{slack} \leftarrow PSF(op) \\
5: & \quad \textbf{end if} \\
6: & \textbf{end for} \\
7: & V_{BB} \leftarrow \text{solve}(RSF(X) \le \min_{slack}, X) \\
8: & \textbf{if } V_{BB} > \max_{FBB} \textbf{ then} \\
9: & \quad \textbf{return } \text{undef} \\
10: & \textbf{end if} \\
11: & \textbf{if } V_{BB} < \max_{RBB} \textbf{ then} \\
12: & \quad \textbf{return } \max_{RBB} \\
13: & \textbf{else} \\
14: & \quad \textbf{return } V_{BB} \\
15: & \textbf{end if}
\end{array}
$$

Fig. 8. $V_{BB}$ assignment algorithm for one PE

Lines 1 through 6 determine the minimum slack $\min_{slack}$ of the set of operations $PE_{Ops}$ assigned to the PE in question. In line 7, the smallest value of $X$ for which $RSF(X)$ is less than or equal to $\min_{slack}$ is searched. Lines 8 through 15 serve as sanity checks for $V_{BB}$. Line 8 catches the case where more FBB is required to meet timing than the circuitry can provide. In this case, timing cannot be met and a lower clock frequency should be used. Line 11 checks whether the determined BB is stronger RBB than the circuitry provides. RBB is represented as a negative voltage; thus, if $V_{BB}$ is less than $\max_{RBB}$, $V_{BB}$ is stronger RBB than can be provided. In this case, $V_{BB}$ is set to the maximum feasible RBB. In all other cases, $V_{BB}$ is valid and thus returned.
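The algorithm of Fig. 8 can be sketched in Python as shown below. This is a minimal illustration under stated assumptions, not the authors' implementation: the `solve` step of line 7 is replaced by a scan over a discrete set of available bias levels, and the level grid, the linear `toy_dlf` and the per-op slack values are hypothetical rather than taken from Figs. 2 and 5 to 7.

```python
from typing import Callable, Iterable, Mapping, Optional, Sequence


def assign_vbb(
    pe_ops: Iterable[str],
    psf: Mapping[str, float],
    dlf: Callable[[float], float],
    t_clkp: float,
    levels: Sequence[float],
) -> Optional[float]:
    """Return the most negative available V_BB that meets timing, or None."""
    # Lines 1-6 of Fig. 8: the tightest operation on this PE bounds the bias.
    # (pe_ops must not be empty, otherwise min() raises ValueError.)
    min_slack = min(psf[op] for op in pe_ops)
    # Line 7, discretised: scan from strongest RBB towards strongest FBB and
    # stop at the first level whose required slack RSF fits into min_slack.
    for v_bb in sorted(levels):
        rsf = (1.0 - dlf(v_bb)) * t_clkp  # Eq. (3)
        if rsf <= min_slack:
            # Starting the scan at the strongest RBB level plays the role of
            # the clamp to max_RBB in lines 11-12.
            return v_bb
    # Lines 8-9: even the strongest FBB level is insufficient ("undef").
    return None


def toy_dlf(v_bb: float) -> float:
    """Hypothetical linear delay factor, NOT taken from Fig. 2."""
    return 1.0 + 0.5 * v_bb


LEVELS = [round(-0.3 + 0.2 * i, 1) for i in range(9)]  # -0.3 V ... 1.3 V (assumed grid)
PSF = {"CMP": 0.4, "ADD": 0.2, "MULT": -0.3}           # slack in fractions of a cycle

print(assign_vbb(["CMP"], PSF, toy_dlf, 1.0, LEVELS))              # -0.3
print(assign_vbb(["ADD", "MULT"], PSF, toy_dlf, 1.0, LEVELS))      # 0.7
print(assign_vbb(["MULT"], {"MULT": -1.0}, toy_dlf, 1.0, LEVELS))  # None
```

Scanning a sorted list of levels keeps the sketch independent of the shape of $DLF$; with a continuous, monotonic $DLF$ the same result could be obtained by root finding followed by the explicit range checks of lines 8 through 15.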

# VII. RESULTS

The presented results are three-fold: 1. BB assignments obtained through the proposed algorithm (Fig. 8), 2. a power consumption and energy efficiency evaluation using the proposed BB scheme and 3. a comparison of PE-grained BB and whole-chip BB. All evaluations are conducted for five different applications: **alpha**, an alpha blender; **dct**, a discrete cosine transform; **fir**, a finite impulse response filter; **sad**, a sum of absolute differences; and **sepia**, a sepia image filter.

Table I lists the BB voltages assigned to individual PEs by the algorithm described in Fig. 8. The displayed data format is $\#PEs \times Voltage$ [V], i.e. #PEs were assigned a BB of Voltage volts. The results conform to Figs. 5, 6 and 7 and can thus be grouped into two categories: 1. timing-critical operations that require FBB to conform to constraints and 2. uncritical operations that do not need FBB or may even tolerate RBB. These results show the decreasing impact of BB on delay with increasing $V_{DD}$. With one exception, all applications exhibit decreasing levels of FBB and increasing levels of RBB with increasing $V_{DD}$. The exception is the **fir** application, which makes considerable use of full-width multiplication. This operation constitutes the critical path in all PEs. As the maximum frequency with FBB, $F_{max,V_{DD},BB}$, is defined as the original maximum clock frequency scaled by the maximum FBB (Eq. 2), full-width multiplication needs the maximum FBB of 1.3V for all $V_{DD}$.

| Application | $V_{BB}$ [V] at $V_{DD}$ = 0.6V | $V_{BB}$ [V] at $V_{DD}$ = 0.8V | $V_{BB}$ [V] at $V_{DD}$ = 1.0V |
|-------------|----------------------|----------------------|-----------------------|
| alpha       | 4x0.7, 4x0.8, 8x-0.3 | 4x0.3, 4x0.5, 8x-0.3 | 4x-0.1, 4x0.1, 8x-0.3 |
| dct         | 3x0.3, 12x0.8, 1x-0.3 | 4x-0.3, 12x0.5      | 4x-0.3, 12x0.1        |
| fir         | 4x1.3, 4x0.8, 8x-0.3 | 3x1.3, 4x0.5, 8x-0.3 | 3x1.3, 4x0.1, 8x-0.3  |
| sad         | 7x-0.3, 9x0.8        | 7x-0.3, 9x0.5        | 7x-0.3, 9x0.1         |
| sepia       | 10x-0.3, 6x0.7       | 10x-0.3, 6x0.3       | 10x-0.3, 6x-0.1       |

TABLE I. Per-application PE-grained BB voltages $V_{BB}$ at $F_{max,V_{DD},BB}$, assigned through the algorithm described in Fig. 8

By using these settings for coarse-grained BB, the clock frequency can be scaled more efficiently to $F_{max,V_{DD},BB}$. In Fig. 9, power consumption, broken down into dynamic, internal and leakage power, is displayed along with the respective Energy Delay Products (EDP) as a measure of energy efficiency. Power measurements are shown for the DRP without BB at the original clock frequency $F_{max,V_{DD}}$ (no scheme), and with coarse-grained BB (CBB) and global BB (GBB) at $F_{max,V_{DD},BB}$. The EDP curve indicates considerable improvements in energy efficiency when scaling frequency using FBB instead of via $V_{DD}$. With the exception of **fir**, all applications exhibit better EDPs than their $V_{DD}$-scaled counterparts.

Fig. 9. Power consumption for $V_{DD} = \{0.6V, 0.8V, 1.0V\}$ at respective $F_{max,V_{DD}}$ and scaled counterpart using FBB $F_{max,V_{DD},BB}$ versus EDP
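For readers unfamiliar with the metric, the following sketch shows one common way to compute an EDP and a relative improvement between two operating points. It assumes a fixed cycle count per workload, so that execution time is cycles divided by frequency; the paper does not spell out its exact computation, and the power figures below are hypothetical placeholders rather than measured values.

```python
def edp(power_w, cycles, f_hz):
    """Energy Delay Product for a workload of `cycles` clock cycles at frequency f_hz."""
    t = cycles / f_hz     # execution time [s]
    energy = power_w * t  # energy [J]
    return energy * t     # EDP [J*s]


def improvement_percent(edp_candidate, edp_reference):
    """Relative EDP reduction of the candidate versus the reference, in percent."""
    return 100.0 * (1.0 - edp_candidate / edp_reference)


# Hypothetical operating points: FBB-scaled 0.6 V at 450 MHz versus the next
# higher supply voltage, 0.8 V at 500 MHz, without BB.
candidate = edp(power_w=8e-3, cycles=1_000_000, f_hz=450e6)
reference = edp(power_w=15e-3, cycles=1_000_000, f_hz=500e6)
print(f"EDP improvement: {improvement_percent(candidate, reference):.1f}%")
```

Because delay enters the EDP quadratically, a lower-voltage operating point must come close to the reference frequency before its power advantage translates into a better EDP.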



Fig. 10. EDP of app. alpha normalized to No-BB of next  $V_{DD}$ 

For alpha, FBB allows scaling from 274MHz to 450MHz at $V_{DD} = 0.6V$ at almost equal energy efficiency using CBB. For $V_{DD} = 0.8V$ and scaling from 500MHz to 681MHz, energy efficiency even improves when using CBB, as leakage is successfully contained and the gain from the linear increase in frequency is greater than the cost of the increased leakage. The decrease in energy efficiency with GBB is due to the overall leakage increase. For $V_{DD} = 1.0V$, a similar effect is visible. When scaling from 704MHz to 871MHz at $V_{DD} = 1.0V$, leakage can even be reduced with CBB. GBB still exhibits better energy efficiency than the original 704MHz sample, as the linear increase in frequency outweighs the slight increase in leakage. This results in an energy efficiency increase of 41.3% (CBB) or 27.7% (GBB) at $F_{max,0.6V,BB}$ versus $F_{max,0.8V}$ and 43.9% (CBB) or 37.2% (GBB) at $F_{max,0.8V,BB}$ versus $F_{max,1.0V}$.



Fig. 11. EDP of app. dct normalized to No-BB of next  $V_{DD}$ 

In dct, containing leakage with CBB is difficult due to the widespread use of additions. Thus, GBB is only slightly worse. This behavior is visible for all $V_{DD}$. As for alpha, CBB and GBB are more energy efficient than the baseline, as the increase in frequency outweighs the slight increase in leakage. The resulting increase in energy efficiency is 32.5% (CBB) or 27.9% (GBB) at $F_{max,0.6V,BB}$ and 39.9% (CBB) or 37.1% (GBB) at $F_{max,0.8V,BB}$.



Fig. 12. EDP of app. fir normalized to No-BB of next  $V_{DD}$ 

fir showcases the best-case leakage containment of CBB, or conversely the worst-case leakage scenario of GBB. Because full-width multiplication, which constitutes the critical path, is used, GBB requires a 1.3V FBB for the whole chip, while CBB limits the affected area to a sensible minimum. For $V_{DD} = 1.0V$, however, the leakage increase due to FBB and $V_{DD} = 1.0V$ cannot be outweighed with either CBB or GBB. Energy efficiency improvements are 31.6% (CBB) or -6.5% (GBB) at $F_{max,0.6V,BB}$ and 33.5% (CBB) or 0.05% (GBB) at $F_{max,0.8V,BB}$.

sad behaves similarly to dct. Leakage is mitigated with CBB compared to GBB. Energy efficiency improvement is 38.6% (CBB) or 27.6% (GBB) at $F_{max,0.6V,BB}$ and 42.1% (CBB) or 37.0% (GBB) at $F_{max,0.8V,BB}$.

Fig. 13. EDP of app. sad normalized to No-BB of next $V_{DD}$

The **sepia** application represents the ideal scenario, where CBB can improve energy efficiency in all cases through leakage containment, or even leakage reduction at higher $V_{DD}$, together with the linear increase in operating frequency. Energy efficiency improvements are 44.9% (CBB) or 31.9% (GBB) at $F_{max,0.6V,BB}$ and 44.3% (CBB) or 39.6% (GBB) at $F_{max,0.8V,BB}$.

Fig. 14. EDP of app. sepia normalized to No-BB of next $V_{DD}$

The benefit of CBB over GBB is application dependent. CBB always has better energy efficiency, as GBB affects the whole chip or component, including parts such as memories. CBB shows the largest energy efficiency gains at low $V_{DD}$. CBB is, however, also important at higher $V_{DD}$, as strong global FBB then becomes prohibitive. Furthermore, without CBB, a single full-width multiplication would require very high FBB for the whole chip, as GBB always has to accommodate the critical path, even if it is used in only one PE. Averaged over all applications, CBB improves energy efficiency over GBB by 16.1%, 10.6% and 12.8% for $V_{DD} = \{0.6V, 0.8V, 1.0V\}$, respectively.
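As a cross-check, the per-application improvements versus $V_{DD}$ scaling quoted in this section can be averaged. The sketch below uses only the percentages stated in the text; the dictionary layout and function name are illustrative. It does not reproduce the CBB-over-GBB averages of 16.1%, 10.6% and 12.8%, which would require the underlying EDP values.

```python
from statistics import mean

# Energy efficiency improvements in percent as quoted per application:
# (CBB, GBB) at F_max,0.6V,BB and at F_max,0.8V,BB, each versus the No-BB
# design at the next higher V_DD.
IMPROVEMENTS = {
    "alpha": {"0.6V": (41.3, 27.7), "0.8V": (43.9, 37.2)},
    "dct":   {"0.6V": (32.5, 27.9), "0.8V": (39.9, 37.1)},
    "fir":   {"0.6V": (31.6, -6.5), "0.8V": (33.5, 0.05)},
    "sad":   {"0.6V": (38.6, 27.6), "0.8V": (42.1, 37.0)},
    "sepia": {"0.6V": (44.9, 31.9), "0.8V": (44.3, 39.6)},
}


def average(v_dd, scheme):
    """Mean improvement over all applications for one V_DD and one BB scheme."""
    idx = {"CBB": 0, "GBB": 1}[scheme]
    return mean(app[v_dd][idx] for app in IMPROVEMENTS.values())


for v_dd in ("0.6V", "0.8V"):
    print(f"{v_dd}: CBB {average(v_dd, 'CBB'):.2f}%, GBB {average(v_dd, 'GBB'):.2f}%")
```

The CBB averages come out at about 37.78% and 40.74%, close to the 37.79% and 40.76% stated in the abstract; the small difference is presumably due to rounding of the per-application values.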

# VIII. CONCLUSION

When using modern SOI technologies, energy efficiency can be significantly increased in clock frequency scaling scenarios. This is done by scaling via BB instead of via $V_{DD}$, on which dynamic power consumption depends quadratically. Applying BB in a coarse-grained manner makes it possible to increase clock frequency while mitigating negative side-effects. This can be done without any changes to the architecture. To use the full potential of a chip, coarse-grained BB is essential, especially in coming SOI technology nodes. Global FBB would make these efforts unprofitable through an exponential increase in leakage. By maintaining timing through FBB only where critical paths are used and applying RBB elsewhere, leakage can even be reduced while clock frequency is increased at the same time. Together with the previous work on BB in reconfigurable architectures, it can be concluded that most reconfigurable architectures could benefit from BB.

# ACKNOWLEDGMENT

This work is supported by VDEC with Cadence Design Systems and Synopsys. The authors also express their gratitude to STARC, CMP and STMicro for their cooperation. Furthermore, this work was partially funded by the State of Baden-Württemberg, Germany, Ministry of Science, Research and Arts within the scope of Cooperative Research Training Group EAES, the DAAD, DFG SPP1500 RO-1030/17 and the Things2DO project BMBF 16ES0247.

# REFERENCES

- [1] H. Esmaeilzadeh, E. Blem, R. St Amant, K. Sankaralingam, and D. Burger, "Dark silicon and the end of multicore scaling," in *Computer Architecture (ISCA), 2011 38th Annual International Symposium on*. IEEE, 2011, pp. 365–376.
- [2] N. Weste and D. Harris, *CMOS VLSI Design: A Circuits and Systems Perspective*, 4th ed. USA: Addison-Wesley Publishing Company, 2010.
- [3] F. Arnaud, N. Planes, O. Weber, V. Barral, S. Haendler, P. Flatresse, and F. Nyer, "Switching energy efficiency optimization for advanced CPU thanks to UTBB technology," in *Electron Devices Meeting (IEDM), 2012 IEEE International*. IEEE, 2012, pp. 3–2.
- [4] D. Jacquet, F. Hasbani, P. Flatresse, R. Wilson, F. Arnaud, G. Cesana, T. Di Gilio, C. Lecocq, T. Roy, A. Chhabra *et al.*, "A 3 GHz dual core processor ARM Cortex-A9 in 28 nm UTBB FD-SOI CMOS with ultra-wide voltage range and energy efficiency optimization," 2014.
- [5] D. Lewis, E. Ahmed, D. Cashman, T. Vanderhoek, C. Lane, A. Lee, and P. Pan, "Architectural enhancements in Stratix-III and Stratix-IV," in *Proceedings of the ACM/SIGDA International Symposium on Field Programmable Gate Arrays*. ACM, 2009, pp. 33–42.
- [6] M. Hioki, T. Sekigawa, T. Nakagawa, H. Koike, Y. Matsumoto, T. Kawanami, and T. Tsutsumi, "Fully-functional FPGA prototype with fine-grain programmable body biasing," in *Proceedings of the ACM/SIGDA International Symposium on Field Programmable Gate Arrays*. ACM, 2013, pp. 73–80.
- [7] L. Yan, J. Luo, and N. K. Jha, "Joint dynamic voltage scaling and adaptive body biasing for heterogeneous distributed real-time embedded systems," *Computer-Aided Design of Integrated Circuits and Systems, IEEE Transactions on*, vol. 24, no. 7, pp. 1030–1041, 2005.
- [8] H. Su, W. Wang, K. Kitamori, and H. Amano, "A low power reconfigurable accelerator using a back-gate bias control technique," in *Field-Programmable Technology (FPT), 2013 International Conference on*. IEEE, 2013, pp. 390–393.
- [9] S. M. Jafri, O. Bag, A. Hemani, N. Farahini, K. Paul, J. Plosila, and H. Tenhunen, "Energy-aware coarse-grained reconfigurable architectures using dynamically reconfigurable isolation cells," in *Quality Electronic Design (ISQED), 2013 14th International Symposium on*. IEEE, 2013, pp. 104–111.
- [10] T. Katagiri and H. Amano, "A high speed design and implementation of dynamically reconfigurable processor using 28nm SOI technology," in *Proceedings of the 24th International Conference on Field Programmable Logic and Applications (FPL)*. IEEE, 2014.