SASIMI2010 Technical Program

The 16th Workshop on Synthesis And System Integration of Mixed Information Technologies

Paper Session IV: System Level Design and Design Experience (II)
Time: 15:15 - 16:50 Tuesday, October 19, 2010
Location: Ballroom
Chairs: Masanori Muroyama (Tohoku University, Japan), Lih-Yih Chiou (National Cheng Kung University, Taiwan)

R4-1 (Time: 15:15 - 15:17)

Title	A Regular Expression Matching Circuit Based on a Modular Non-Deterministic Finite Automaton with Multi-Character Transition
Author	*Hiroki Nakahara, Tsutomu Sasao, Munehiro Matsuura (Kyushu Institute of Technology, Japan)
Page	pp. 359 - 364
Keyword	Regular Expression, IDS, FPGA, reconfigurable device
Abstract	This paper shows an implementation of a regular expression circuit based on an NFA (Non-deterministic finite automaton). Also, it shows that the NFA based one is superior to the DFA (Deterministic finite automaton) based one, in terms of area and time complexity. A regular expression matching circuit is generated as follows: First, the given regular expressions are converted into an NFA. Then, to reduce the number of states, the NFA is converted into a modular non-deterministic finite automaton (MNFA(p)) with p-character transition. Finally, a finite-input memory machine (FIMM) to detect p-characters as well as the matching elements (MEs) realizing the states for the MNFA(p) are generated. We designed MNFA(p) for different p on a Xilinx FPGA. Then, we derived an optimal value p that efficiently uses both LUTs and embedded memories of the FPGA. As for the performance per FPGA area, our method is 6.2-18.6 times better than DFA-based methods, and is 1.8 times better than the NFA-based method. Since our method efficiently utilizes FPGA resource, a low-cost FPGA can be used to implement a high-performance regular expression matching circuit.

R4-2 (Time: 15:17 - 15:19)

Title	Acceleration of a SAT Based Solver for Minimum Cost Satisfiability Problems Using Optimized Boolean Constraint Propagation
Author	*Xin Zhang (The Graduate School of Information, Production and Systems, Waseda University, Japan), Peilin Liu (Department of Electronic Engineering Shanghai Jiao Tong University, China), Shinji Kimura (The Graduate School of Information, Production and Systems, Waseda University, Japan)
Page	pp. 365 - 370
Keyword	SAT-Solver, MinCostSAT, BCP
Abstract	In this paper, we have shown an efficient way to accelerate a SAT based Solver for Minimum-cost problems with fractional coefficients by using Early Conflict Detection based BCP. Various benchmarks are evaluated and our solver with optimized BCP proves to gain significant acceleration rate, especially for satisfiable formulas.

R4-3 (Time: 15:19 - 15:21)

Title	Circuit Synthesis for Fast Memory Access in System LSI
Author	*Kazuya Kishida, Takashi Kambe (Kinki University, Japan)
Page	pp. 371 - 376
Keyword	Behavioral synthesis, Memory Access, System LSI, pipelining, on-chip array
Abstract	High Level Design methodologies are becoming more and more important in the design of Large system LSI devices. Behavioral synthesis from C and other high level languages is a key to achieving the productivity required by such large designs. For memory intensive applications in particular the automatic identification, optimization and synthesis of memory access operations is essential. In this paper, some synthesis techniques for memory access logic are described. The three kinds of memory access method examined are (a) variable registerization (b) memory access pipelining and (c) on-chip arraying. This approach is applied to a speech recognition algorithm and its effectiveness is evaluated.

R4-4 (Time: 15:21 - 15:23)

Title	Clock Gating Optimization with Delay-Matching Cells
Author	*Shih-Jung Hsu, Rung-Bin Lin (Yuan Ze University, Taiwan)
Page	pp. 377 - 382
Keyword	Clock gating, Type matching, Delay matching, Clock skew, Low power
Abstract	Clock gating is an effective method of reducing power dissipation of a high-performance circuit. However, deployment of gated cells increases the difficulty of synthesizing a low-skew gated tree. In this paper, we propose a delay-matching approach to addressing this problem. Delay-matching is achieved using gated cells whose timing characteristics are similar to that of their clock buffer (inverter) counterparts. Experimental results show that delay-matching attains better slew and much better latency with comparable clock skew when compared with type-matching approach. Delay-matching achieves all of these using much less area than type-matching does. Besides, the skew of a delay-matching gated tree, just like the one generated by type-matching, is insensitive to process and operating condition variations. Moreover, the slews of a delay-matching gated tree are less sensitive to process and operating condition variations than that of a type-matching gated tree.

R4-5 (Time: 15:23 - 15:25)

Title	Design and Evaluation of Digital Receiver for Low Power Wireless Communication
Author	*Kazuki Ohya, Keishi Sakanushi, Yoshinori Takeuchi, Masaharu Imai (Osaka University, Japan)
Page	pp. 383 - 388
Keyword	Digital Radio Design, Near Distance Wireless Communication, Low Power
Abstract	This paper studies design and implementation trade-off of digital receivers for low power wireless communications. In these days, wireless communication becomes one of indispensable technologies in our daily life. For mobile terminal devices, which have been widely developed, low power and low energy are required. On the other hand, consideration for noise affects on transmissions are indispensable for wireless communication. According to the characteristics of noise on transmissions, suitable implementation considering power consumption should be varied. In this paper, we design digital circuits of receiver for RuBee, which is one of low power communication standards. We measured area, power consumption, and noise tolerance of designed digital receivers by changing noise characteristics. Experimental results show that trade-off between power consumption and noise tolerance exists and optimal implementation can be changed according to the noise characteristics.

R4-6 (Time: 15:25 - 15:27)

Title	Design and Verification of an Ultra-Low-Power Active RFID Tag with Multiple Power Domains
Author	*Kenichi Agawa, Massimo Alioto, Wenting Zhou, Tsung-Te Liu, Louis Alarcon, Kimiya Hajkazemshirazi, Mervin John, Jesse Richmond, Wen Li, Jan Rabaey (University of California at Berkeley, U.S.A.)
Page	pp. 389 - 394
Keyword	RFID, ultra low power, system-level verification, analog-digital-mixed-signal design, wake-up scheme transceiver
Abstract	An ultra-low-power active RFID tag chip with multiple power and clock domains has been designed and verified. The chip employs a wake-up scheme to reduce its power consumption. It includes a wake-up analog regulator and oscillator as well as a digital power manager, synchronizer, and protocol processor. Thus, we need to verify a complicated wake-up procedure and analog-digital-mixed-signal system for its correct operations. Matlab Simulink models and simulations are explored for system-level verification, and verification time can be suppressed efficiently.

R4-7 (Time: 15:27 - 15:29)

Title	Device Simulation and Experimental Measurement of High-Voltage Unified-CBiCMOS Buffer Driver for Ultra-High-Speed CCD Image Sensors
Author	Toshiaki Koike-Akino (Harvard University, U.S.A.), Takashi Hamahata, *Toshiro Akino, Takeharu Goji Etoh (Kinki University, Japan)
Page	pp. 395 - 400
Keyword	Lateral-BJT, CMOS, High-Speed, High-Voltage, High Capacitive Load
Abstract	We propose a high-performance buffer circuit termed a unified-complementary bipolar CMOS (U-CBiCMOS) technique for ultra-high-speed charge-coupled device (CCD) image sensors. In order to improve driving capability for realizing over-1M frames per second, the U-CBiCMOS buffer driver makes an effective use of a lateral BJT, which is inherent to an MOSFET structure. We designed a prototype of the U-CBiCMOS buffer driver which uses a trench isolated SOI-CMOS process with an asymmetric lightly-doped diffusion (LDD) for high-voltage applications. Through device simulations and experimental measurements, we reveal that the high-voltage CMOS process can offer a peak current gain of higher than 200. In addition, it is demonstrated that the driving performance of the U-CBiCMOS buffer circuit is significantly improved by activating the lateral BJT at a supply voltage of 15V and a load capacitance of 2nF.

R4-8 (Time: 15:29 - 15:31)

Title	Efficient Multiple Regular Expression Matching on FPGAs based on Extended SHIFT-AND Method
Author	*Yusaku Kaneta, Shingo Yoshizawa, Shin-ichi Minato, Hiroki Arimura, Yoshikazu Miyanaga (Hokkaido University, Japan)
Page	pp. 401 - 406
Keyword	regular expression, pattern matching, FPGA
Abstract	In this paper, we study efficient pattern matching over high-speed data streams on a reconfigurable hardware, FPGA (Field Programmable Gate Array) First, we introduce a subclass of regular expressions, called linear regular expressions. Secondly, for the subclass, we present a new architecture, static BP-NFA (static bit-parallel NFA) on FPGA based on one of the well-known pattern matching techniques, bit-parallel pattern matching. Thirdly, we give analysis on the complexity of our architecture. Finally, we show the experimental results on our hardware.

R4-9 (Time: 15:31 - 15:33)

Title	Energy-Aware Partitioning Using a Multi-Objective Genetic Algorithm
Author	Lih-Yih Chiou, Yi-Siou Chen, *Ya-Lun Jian (National Cheng Kung University, Taiwan)
Page	pp. 407 - 411
Keyword	Partitioning, Low Power
Abstract	Incorporating power management during partitioning significantly contributes to energy efficiency architecture. A partitioning approach that partitions a system with the maximum amount of idle time appears to be the most practical for power management, in which the efficiency of power saving inevitably decreases since the partitioning block size becomes unbalance. This work presents a novel energy-aware hardware clustering algorithm in an architecture partitioning domain using a multi-objective genetic algorithm. The proposed algorithm reduces the communication power and component power based on use of an energy estimation approach. Experimental results based on the pareto-optimal solutions demonstrate the effectiveness of the proposed algorithm in generating the close optimal solution faster than the exhaustive approach.

R4-10 (Time: 15:33 - 15:35)

Title	An Extension of Systolic Regular Expression Matching Hardware for Handling Iteration of Strings Using Quantifiers
Author	*Yoichi Wakaba, Masato Inagi, Shin'ichi Wakabayashi, Shinobu Nagayama (Hiroshima City University, Japan)
Page	pp. 412 - 417
Keyword	regular expression, string matching, NIDS, quantifier, FPGA
Abstract	We propose systolic regular expression matching hardware that can handle iteration of subclasses of regular expression (e.g. union of strings,) which cannot directly be handled by conventional systolic hardware. Our proposed hardware handles these patterns by using shift registers. FPGA implementation results showed that the proposed method significantly reduces the number of LUTs required to handle those patterns compared with the conventional one, in which these patterns are expanded into much longer ones.

R4-11 (Time: 15:35 - 15:37)

Title	A Novel Timing Synchronization Method for Fast and Accurate Multi-Core Instruction-Set Simulators
Author	Meng-Huan Wu, *Fan-Wei Yu, Cheng-Yang Fu, Peng-Chih Wang, Ren-Song Tsay (National Tsing Hua University, Taiwan)
Page	pp. 418 - 423
Keyword	Instruction-set simulator, binary translation, multi-core, synchronization
Abstract	This paper proposes a timing synchronization method for fast and accurate Multi-Core Instruction-Set Simulation (MCISS). In order to achieve accurate simulation results of MCISS, a lock-step approach, which synchronizes every cycle, is commonly used. However, this approach introduces immense overhead and lowers the simulation speed. Instead of synchronizing every cycle, our ap-proach synchronizes based on the data dependency among the simulated programs. Therefore, the synchronization overheads can be highly reduced with accurate simulation results. With the proposed approach, the simulation speed of MCISS is up to 40 ~ 1,000 million instructions per second (MIPS) in general. Our major contribution is on clarifying the data dependency issue of multi-core system. The experimental results show that the MCISS using our simulation method can perform fast and accurate simulation.

R4-12 (Time: 15:37 - 15:39)

Title	A Power Efficient Unified Gated Flip-Flop
Author	*Takumi Okuhira, Tohru Ishihara (Kyushu University, Japan)
Page	pp. 424 - 429
Keyword	Low power design, flip-flop, clock gating
Abstract	Since the clock power consumption in today's processors is considerably large, reducing the clock power consumption contributes to the reduction of the total power consumption in the processors. Recently, a gated flip-flop is proposed for reducing the clock power consumption of flip-flop circuits. The gated flip-flop employs a clock-gating circuit which cuts off an internal clock signal if the data stored in the flip-flop does not need to be updated. Although this reduces the clocking power consumption, the power dissipated in the clock-gating circuit is still large. For reducing the power dissipated in the clock-gating circuit, this paper proposes a technique for unifying the multiple clock-gating circuits, which reduces the overhead of the clock-gating circuit. Power measurement results obtained using a test chip demonstrate that our unified gated flip-flop reduces the power consumption of register circuits by 45% compared to the conventional gated flip-flops if the state transition probability of the flip-flop is 0.1 which is an average state transition probability of flip-flops in a commercial microprocessor used in our experiments.

R4-13 (Time: 15:39 - 15:41)

Title	Quantitative Graph-Based Minimal Queue Sizing for Throughput Optimization in Latency-Insensitive Designs
Author	Juinn-Dar Huang, *Yi-Hang Chen, Ya-Chien Ho (National Chiao Tung University, Taiwan)
Page	pp. 430 - 435
Keyword	Latency-Insensitive System, Latency-Insensitive-design, Throughput Optimization, Queue Size Minimization, Integer Linear Programming
Abstract	As manufacturing processes are constantly moving toward very deep submicron (VDSM), global interconnect delay is becoming one of the most critical performance obstacles in system-on-chip (SoC) designs today. Latency-insensitive-design (LID) methodology, which enables multicycle communication to tolerate latency variation at late stages of the design process without substantially modifying pre-designed IP cores, has been proposed accordingly to conquer this issue. However, the system throughput is still degraded due to imbalanced interconnect latency and communication back-pressure residing in an LID. In this paper, a performance optimization technique with minimal queue sizing is presented. We first model a given LID as a newly proposed quantitative graph (QG), which can be further compacted using the proposed compaction techniques, so that much bigger problems can be handled. On top of QG, integer linear programming is applied to achieve the exact solution with minimal queue size based on the proposed constraint formulation in a reasonable runtime. The experimental results show that our approach can deal with moderately large latency-incentive systems in an acceptable runtime and save about 28% of queues as compared to the prior art.

R4-14 (Time: 15:41 - 15:43)

Title	A Reconfigurable Layout Method and Evaluation for Network On Chip
Author	*Yuichi Nakamura (NEC Corp., Japan), Marcello Lajolo (NEC Labs. America, U.S.A.)
Page	pp. 436 - 441
Keyword	NoC, Layout, Reconf
Abstract	This paper presents a reconfigurable layout method for Networks-on-Chip (NoCs) based on partial re-layout. Currently, the layout design time is quite significant, because of the complexity involved with the verification of timing and signal integrity constraints. This limits the possibility to perform incremental changes at the physical design stage. However, a NoC which connects IP cores by network interfaces can be easily reconfigured during place and route. In general, a strict hierarchical design method can provide ease of reconfigurability, but it results in worse area and timing with respect to a flat layout method, which, on the other hand, does not provide reconfigurability. In this paper, we propose a rough hierarchical layout which combines the benefits of both hierarchical and flat layout design styles. Experimental results show area and performance numbers similar to the ones achieved by a flat layout as well as reconfigurability characteristics similar to the ones provided by a strict hierarchical layout.

R4-15 (Time: 15:43 - 15:45)

Title	RER: a Tuning Tool for Implementing a Computational Pipeline Across Multiple FPGAs
Author	Hirokazu Morishita, *Kenta Inakagata (Keio University, Japan), Yasunori Osana (Seikei University, Japan), Naoyuki Fujita (JAXA, Japan), Hideharu Amano (Keio University, Japan)
Page	pp. 442 - 447
Keyword	FPGA, Optimization, CFD
Abstract	RER (Resource Estimation and Reconfiguration) is a quick optimization tool for arithmetic units of Xilinx IP cores and facilitates partitioning deep computational pipelines across multiple FPGAs. It generates an appropriate computational pipeline structure by composing the optimal configuration of IP cores, from the designer's parameters and database of hardware amount for all possible configurations of the arithmetic IP cores. As an example, complicated pipeline used in MUSCL algorithm in computational Fluid dynamics is divided into two FPGAs, considering the results of optimization by RER. For any partition candidates, 5.46x times or more speed up compared to the software on Intel Core 2 Duo processor had achieved.

R4-16 (Time: 15:45 - 15:47)

Title	Soft-error Tolerability Analysis for Triplicated Circuit on an FPGA
Author	*Yoshihiro Ichinomiya, Motoki Amagasaki, Morihiro Kuga, Toshinori Sueyoshi (Kumamoto University, Japan)
Page	pp. 448 - 453
Keyword	FPGA, soft-error, TMR
Abstract	SRAM-based Field Programmable Gate Arrays (FPGAs) are vulnerable to a single event up- set(SEU), which is induced by radiation effects. In order to improve the robustness against a SEU, many dependable design techniques were studied. Also, the reliability analysis become important. In this work, we propose the reliability estimation techniques for triplicated circuits by using conditional probability.

R4-17 (Time: 15:47 - 15:49)

Title	A Tile Based Reconfigurable Architecture with Dual ALU-array/Processor Operating Mode Capability
Author	Shin'ichi Kouyama, Masayuki Hiromoto (Kyoto University, Japan), Yukihiro Nakamura (Ritsumeikan University, Japan), *Hiroyuki Ochi (Kyoto University, Japan)
Page	pp. 454 - 459
Keyword	Coarse-grained Reconfigurable Architecture
Abstract	ALU-based coarse-grained reconfigurable devices are suitable for accelerating simple stream processing by pipelining and parallelization. They are, however, not efficiently used for sequential processing with complicated controls. In this paper, we propose a tile based dynamically reconfigurable architecture in which each tile has dual ALU-array/RISC processor operating mode capability. The ratio of ALU-based hardware accelerator and RISC processor counterpart can be dynamically changed depending on the dominant computation in the application in order to maximize the efficiency.

R4-18 (Time: 15:49 - 15:51)

Title	VLSI Architecture of V-AMDF based Pitch Detection for Tonal Speech Recognizer
Author	*Jirabhorn Chaiwongsai, Werapon Chiracharit, Kosin Chamnongthai (King Mongkut’s University of Technology Thonburi, Thailand), Yoshikazu Miyanaga (Hokkaido University, Japan), Kohji Higuchi (University of Electro-Communications, Japan)
Page	pp. 460 - 465
Keyword	tone classification function, V-AMDF, 4-stage pipeline process
Abstract	Tonal speech recognizer needs tone classification function to guarantee the word correctness. However, tone classification function generally has high computation part because there are many repeated parameters. This paper proposes AMDF-based 4-stage pipeline process VLSI architecture of tone classification function for tonal speech recognizer (TONE-SPEC). The architecture is divided into pitch detection, fundamental frequency extraction and tone decision processes. In the speech recognition, only vowel signal is sent though the pitch detection process so that the sample number is reduced. AMDF detects the periodicity of vowel speech by using 4-stage pipeline process. After that, fundamental frequency is extracted and then tone result is decided by finding look-up table maximum vote in tone decision process. To evaluate the performance of the proposed architecture, the experiment is set and tested with 37 Thai words. The results show the improvement of processing time, comparing to the conventional method.