The 15th Workshop on Synthesis And System Integration of Mixed Information technologies

Poster III: High Performance and Special Feature Design
Time: 16:45 - 18:30 Monday, March 9, 2009
Location: Manza & Kaneohe
Chairs: Nobuyuki Nishiguchi (STARC, Japan), Hiroyuki Higuchi (Fujitsu Microelectronics Limited, Japan)

R3-1 (Time: 16:45 - 16:48)
TitleEvaluation of the Performance of the MIMD Mode of a Dynamically Switchable SIMD/MIMD Processor by Using an Image Recognition Application
Author*Shohei Nomoto, Shorin Kyo, Shinichiro Okazaki (NEC, Japan)
Pagepp. 201 - 206
KeywordSIMD, MIMD, Reconfigurable
AbstractWe have developed an “XC core” processor that achieves low cost, high performance, and low power consumption through the use of a highly parallel SIMD architecture (the SIMD mode), and achieves high flexibility by morphing into a MIMD architecture (the MIMD mode). In this paper, the effectiveness of the MIMD mode is evaluated by using a white line detection algorithm for open roads. The evaluation shows that real-time processing of the algorithm (less than 33 ms) can be achieved by using the MIMD mode to execute the verification process of white line segments, which is a part of the algorithm not suited to execution in the SIMD mode. Moreover, we show that verification can be executed five times faster by using region of interest (ROI) transfer instructions to efficiently transfer the ROI of an image. Furthermore, the execution time in the MIMD mode is measured as a function of the number of PUs used, from 2 to 32. The measured results show that the performance improvement rate slows down when using more than 16 PUs in the MIMD mode, mainly due to insufficient parallelism in the verification process. As a whole, by using the MIMD mode with 32 PUs, a 12.6 times speedup is achieved compared with using only the SIMD mode.

R3-2 (Time: 16:48 - 16:51)
TitlePipelining SHA-2 Implementations using Carry Save Adders
Author*Anh Tuan Hoang, Katsuhiro Yamazaki (Department of VLSI System Design, Ritsumeikan University, Japan), Shigeru Oyanagi (Department of Computer Science, Ritsumeikan University, Japan)
Pagepp. 207 - 212
KeywordSHA-2, fine-grained pipelining, cryptography, carry save adder
AbstractThe secure hash algorithm (SHA), which is used to verify the integrity of a message, involves iterative computation on data. The large computation delay generated in that iteration limits the throughput of the entire system and makes it difficult to pipeline the computation. We describe a way to pipeline the computation using fine-grained pipelining with balanced critical paths. One critical path is broken into two by using data forwarding. The other critical path is broken into three stages by using computation postponement. The resulting critical paths all have two full-adder layers with some data movements, and thus are balanced. The adders are implemented using carry save adders (CSA). The effectiveness of the two adder architectures is analyzed and compared in terms of hardware size, frequency, throughput, and performance/area ratio.
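
Illustrative note (not taken from the paper): the carry save adder mentioned in the abstract can be modelled in a few lines. The sketch below assumes 32-bit words, as used by SHA-256; the function names `carry_save_add` and `add3` are hypothetical. It shows only the general 3:2 compression idea, not the authors' pipeline.

```python
MASK32 = 0xFFFFFFFF  # assumption: 32-bit words, as in SHA-256


def carry_save_add(a, b, c):
    """3:2 compressor: reduce three operands to a sum word and a carry word.

    No carry propagates between bit positions here, which is why the delay
    of this step is one full-adder layer regardless of word width.
    """
    partial_sum = a ^ b ^ c                                # per-bit sum
    carry = (((a & b) | (a & c) | (b & c)) << 1) & MASK32  # per-bit carry, shifted
    return partial_sum, carry


def add3(a, b, c):
    """Add three words modulo 2**32 using one CSA layer plus one final adder."""
    s, cy = carry_save_add(a, b, c)
    return (s + cy) & MASK32  # the only carry-propagating addition


print(add3(1, 2, 3))
```

Chaining several `carry_save_add` steps and resolving the carries once at the end is the usual reason CSAs shorten a multi-operand addition path.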

R3-3 (Time: 16:51 - 16:54)
TitleHardware Accelerator for Feature Point Detection Part of SIFT Algorithm & Corresponding Hardware-Friendly Modification
Author*Jingbang Qiu, Tianci Huang, Takeshi Ikenaga (Graduate School of IPS, Waseda University, Japan)
Pagepp. 213 - 218
KeywordSIFT, hardware accelerator, integer solution, one time interpolation, real-time
AbstractWe propose a hardware accelerator structure for the Feature Point Detection part of SIFT that can be implemented on an FPGA. A Fully Integer Solution is applied. We also redesign the process as a 12-block structure and reduce the number of interpolations so as to lower hardware cost. In our experiment, we achieve a maximum clock frequency of 68.0 MHz, which can process about 100 images of 640x480 pixels per second. The proposal is suitable for real-time FPGA systems.

R3-4 (Time: 16:54 - 16:57)
TitleVariability Characterization and Tolerance on Throughput and Power for Chip-Multiprocessors
Author*Wan-Yu Lee, Iris Hui-Ru Jiang (Department of Electronics Engineering, National Chiao Tung University, Taiwan)
Pagepp. 219 - 223
Keywordprocess variability, chip-multiprocessor, voltage island, frequency island, Monte Carlo analysis
AbstractThis paper proposes a new architecture of variability-tolerant chip-multiprocessor. To mitigate the impact of process variability on throughput and power, voltage and frequency islands are introduced into chip-multiprocessors. Thus, voltage island frequency island chip-multiprocessors enable per-core scaling on the supply voltage and operating frequency. It can naturally collaborate with dynamic voltage frequency scaling. The process variations are characterized through an analytical model, and are quantified through Monte Carlo analysis. Compared with the design without process variations, when 70 threads are run on a chip of 70 small cores, our results show throughput degradation is 0.1%, while power reduction is 34.3%.

R3-5 (Time: 16:57 - 17:00)
TitleA Ternary Multi-Ported Content Addressable Memory Architecture utilizing Asynchronous Multiple Search-Operation Technology
Author*Takeshi Kumaki, Masaharu Tagami, Yuta Imai, Tetsushi Koide, Hans Jürgen Mattausch (Hiroshima University, Japan)
Pagepp. 224 - 229
KeywordCAM, multiport, ternary, routing, redundancy
AbstractThis paper presents a ternary multi-ported content addressable memory (CAM) architecture utilizing asynchronous multiple search-operation technology, aiming at efficient, high-throughput associative-search operations. The asynchronous multiple search-operation technology adopts a previously reported Flexible Multi-ported Content Addressable Memory (FMCAM) architecture. The proposed ternary multi-ported CAM architecture achieves a fast associative table-lookup solution for high-speed routing applications, such as IP packet forwarding, and effectively realizes a Ternary Flexible Multi-ported Content Addressable Memory, which we refer to as TFMCAM in this paper. The main novel points of the architecture are simultaneous multiple associative-search operations and a high implementation-yield ratio. Furthermore, the TFMCAM architecture realizes the necessary background table maintenance function without blocking the associative-search operation. To verify the effectiveness of the TFMCAM architecture, FPGA and ASIC implementation results will be evaluated in the final paper.
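
Illustrative note (not taken from the paper): a ternary CAM entry stores a value together with a mask of "care" bits, and a key matches when it agrees with the value on every care bit. The sketch below is a software model of that matching rule applied to IP forwarding; `ternary_match`, `lookup`, and the table contents are hypothetical, and preferring the entry with the most care bits is an assumed priority rule (longest-prefix match). A hardware CAM compares all entries in parallel, whereas this model loops over them.

```python
def ternary_match(key, value, care_mask):
    """True if key equals value on every bit set in care_mask (other bits are don't-care)."""
    return ((key ^ value) & care_mask) == 0


def lookup(table, key):
    """Return the result of the matching entry with the most care bits, or None."""
    best = None
    for value, care_mask, result in table:
        if ternary_match(key, value, care_mask):
            width = bin(care_mask).count("1")  # number of care bits
            if best is None or width > best[0]:
                best = (width, result)
    return None if best is None else best[1]


# Hypothetical routing table: (value, care mask, next hop)
ROUTES = [
    (0x0A000000, 0xFF000000, "port A"),  # 10.0.0.0/8
    (0x0A010000, 0xFFFF0000, "port B"),  # 10.1.0.0/16
]

print(lookup(ROUTES, 0x0A010203))  # key 10.1.2.3
```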

R3-6 (Time: 17:00 - 17:03)
TitleA Hardware Design for the First Pass of A Large Vocabulary Continuous Speech Recognition System
Author*Akihiko Eguchi, Joe Hashimoto (Kinki University, Japan), Makoto Saituji (NEC Electronics, Japan), Akihisa Yamada (Sharp Corporation, Japan), Takashi Kambe (Kinki University, Japan)
Pagepp. 230 - 235
Keywordspeech recognition, first pass, C-based architecture design, function based vector pipeline
AbstractSpeech recognition is becoming popular as a technology for the implementation of human interfaces. However, conventional approaches to large vocabulary continuous speech recognition require a high performance CPU. In this paper, we describe a speech recognition system designed using a C-based architecture design methodology, which avoids this limitation. Application specific hardware for the first pass data processing step is designed to achieve real-time recognition with a low-speed CPU on a portable terminal, and its performance is evaluated.

R3-7 (Time: 17:03 - 17:06)
TitleCoarse-Grained Dynamically Reconfigurable Architecture with Flexible Reliability
Author*Younghun Ko, Dawood Alnajjar, Yukio Mitsuyama, Masanori Hashimoto, Takao Onoye (Osaka University, Japan)
Pagepp. 236 - 241
Keywordreliability, soft error, coarse-grained, reconfigurable architecture, TMR
AbstractThe acceptable soft error rate of a VLSI chip varies depending on the application and operating environment, so VLSI designers have recently become concerned with reliability specifications. In this paper, we propose a novel coarse-grained dynamically reconfigurable architecture, which offers flexible reliability. The notion of a cluster is introduced as the basic element of the proposed architecture; each cluster can select among four operation modes with different levels of spatial redundancy and area-efficiency. In the TMR operation mode, which attains the highest reliability level, the outputs of three execution modules are voted on inside a cluster, making it possible to perform error recovery without any rollback operations. Evaluation of permanent error rates demonstrates that four different reliability levels can be achieved by the proposed architecture. The additional circuits that attain tolerance to soft errors and provide flexible reliability account for 30.5% of the area of the proposed coarse-grained reconfigurable device.
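
Illustrative note (not taken from the paper): the voting step of triple modular redundancy (TMR) is commonly a bitwise majority function over the three module outputs. The sketch below assumes that form; `majority_vote` and the example values are hypothetical and do not describe the authors' cluster design.

```python
def majority_vote(a, b, c):
    """Bitwise majority of three redundant outputs.

    Each result bit takes the value held by at least two of the inputs, so a
    fault confined to one module is masked without re-executing (no rollback).
    """
    return (a & b) | (a & c) | (b & c)


golden = 0b10110010            # hypothetical fault-free module output
faulty = golden ^ 0b00001000   # same output with one bit flipped by a soft error

print(bin(majority_vote(golden, faulty, golden)))
```

If two modules are corrupted in the same bit position, the vote follows the faulty majority, which is the usual limit of this scheme.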

R3-8 (Time: 17:06 - 17:09)
TitleLow Cost Design of an Advanced Encryption Standard (AES) Processor Using a New Common-Subexpression-Elimination Algorithm
Author*Ming-Chih Chen (Department of Electronic Engineering, National Kaohsiung First University of Science and Technology, Taiwan), Shen-Fu Hsiao (Department of Computer Science and Engineering, National Sun Yat-Sen University, Taiwan)
Pagepp. 242 - 247
KeywordAES, VLSI, chip, CSE, encryption
AbstractIn this paper, we propose an area-efficient design of an Advanced Encryption Standard (AES) processor by applying a new common-subexpression-elimination (CSE) method to the sub-functions of various transformations in AES. The proposed method reduces the area cost of realizing the sub-functions by extracting the common factors in the bit-level expressions of these sub-functions using a new CSE algorithm. Cell-based implementation results show that the AES processor with our proposed CSE method has significant area improvement compared with previous designs.

R3-9 (Time: 17:09 - 17:12)
TitleDSP Array Breadboard System for Application on Foreground Segmentation
Author*Bin Wu, Takao Nishitani (Tokyo Metropolitan University, Japan)
Pagepp. 248 - 253
KeywordDSP array, Gaussian Mixture Model, FPGA
AbstractThis paper describes a DSP array breadboard system for evaluating statistical signal processing architectures for various algorithms. An example algorithm, employed here, is foreground segmentation from a dynamic background. Although several different algorithms have been proposed, the simplest but most popular pixel based algorithm is introduced for the evaluation. A trial of a single chip FPGA implementation is also shown to pave the way to realize future signal processing architecture.

R3-10 (Time: 17:12 - 17:15)
TitleAn Interface for Representing Dynamically Reconfigurable Architectures by using Graph with Configuration Information
Author*Vasutan Tunbunheng, Hideharu Amano (Keio University, Japan)
Pagepp. 254 - 259
Keyworddynamically reconfigurable system, retargetable compiler, architecture representation
AbstractTo develop a new dynamically reconfigurable architecture, a designer requires a retargetable compiler that generates configuration data for evaluating the architecture within the architectural exploration space. The Black-Diamond compiler uses a Graph with Configuration Information (GCI) to represent the reconfigurable resources inside the target architectures. It translates a data-flow graph from a C-like front-end description, applies placement and routing by using the GCI, and generates configuration data for each element of the architecture. This paper presents an idea for an interface used to modify the design on the GCI.

R3-11 (Time: 17:15 - 17:18)
TitleA Case Study of Clockless Bundled-data On-chip Interconnect Design using Double Edge Triggered Flip-flops
Author*Katsunori Tanaka, Yuichi Nakamura (NEC Corporation, Japan)
Pagepp. 260 - 265
Keywordclockless (asynchronous) logic, on-chip interconnect, network-on-chip, bundled-data, four phase protocol
AbstractThis paper shows a case study of clockless (asynchronous) bundled-data on-chip interconnect design using double edge-triggered flip-flops (DET-FFs). Increasing power dissipation caused by shrinking process technology has led LSI designers to multi-core design, but due to load imbalance, multi-core LSIs still waste considerable power on cores with small loads. Therefore, dynamic and flexible adjustment of the clock frequencies of the cores and GALS (Globally-Asynchronous, Locally-Synchronous) design using clockless on-chip interconnect are key techniques for reducing the wasted power. Since the four-phase handshaking used in clockless logic requires two round-trips of signal communication, it has significant difficulty providing high-speed inter-core communication. This paper thus proposes the use of DET-FFs for higher throughput, and shows an experimental design with its area and performance.

R3-12 (Time: 17:18 - 17:21)
TitleA VLSI Architecture of Tone Classification Function-Based Isolated-Word Speech Recognition
Author*Jirabhorn Chaiwongsai, Werapon Chiracharit, Kosin Chamnongthai (King Mongkut's University of Technology Thonburi, Thailand), Yoshikazu Miyanaga (Hokkaido University, Japan), Kouji Higuchi (University of Electro-Communications, Japan)
Pagepp. 266 - 270
Keywordtone classification function, VLSI implementation, parallel computation, pipeline process, look-up table
AbstractSpeech recognition in tonal languages such as Thai and Chinese distinguishes word meaning by using tone. Therefore, the tone classification function is an essential part of improving the accuracy rate. This paper presents a novel VLSI architecture for tone classification function-based isolated word speech recognition. The architecture consists of two parts: feature extraction and the tone classification function. In the feature extraction part, voice detection, pitch period estimation, and slope classification are introduced. The pitch period is calculated by using parallel computation and a 3-stage pipeline process. In the classification function, a look-up table technique is employed to detect tone by using only F0 characteristic information. This reduces the computational complexity of the proposed architecture. Moreover, no training set is used. To evaluate the proposed architecture, the experiment is performed with 100-word vocabularies selected from dependent speakers aged 20 to 40. The architecture is implemented on Altera Cyclone II series FPGAs running at 50 MHz. The results show an 88.25% accuracy rate and a processing time of 8.27 ms/word.

R3-13 (Time: 17:21 - 17:24)
TitleSpeculative Configuration Prefetching for Multi-Context Architectures
Author*Sven Eisenhardt, Julio Oliveira, Tommy Kuhn, Wolfgang Rosenstiel (Universität Tübingen, Germany)
Pagepp. 271 - 276
Keywordcoarse-grained, reconfiguration, multi-context, array, prefetching
AbstractMulti-context reconfigurable arrays provide the ability for prefetching the subsequent configuration into the architecture's context memory during execution. This is difficult, however, if the subsequent configuration cannot be determined ahead of execution. In this paper we present a method to minimize the reconfiguration time overhead by speculatively prefetching configurations in non-deterministic sequences. As an example we reconfigured an array to process FFT kernels of different sizes. By applying speculative reconfiguration prefetching it was possible to reduce the reconfiguration overhead by 38%.

R3-14 (Time: 17:24 - 17:27)
TitleEfficient Mode Selection Algorithm for Inter-Layer Residual Prediction of H.264/SVC
Author*Yoshitaka Morigami, Shinpei Matsuoka, Tian Song, Takashi Shimamoto (Tokushima University, Japan)
Pagepp. 277 - 282
KeywordH.264, SVC
AbstractThis paper presents an efficient mode selection algorithm to reduce the computational complexity when using inter-layer residual prediction of H.264/SVC. The proposed two-step algorithm focuses on reducing the complexity of the inter-layer residual prediction. The experimental results show that the proposed algorithm can considerably reduce redundant computation with almost no loss of coding efficiency.

R3-15 (Time: 17:27 - 17:30)
TitleA Case Study on AES Encryption System Design with SystemBuilder
Author*Yuki Ando, Seiya Shibata, Shinya Honda, Hiroyuki Tomiyama, Hiroaki Takada (Nagoya University, Japan)
Pagepp. 283 - 288
KeywordAES, HLS, coarse-grain pipelining, SW/HW partitioning
AbstractThis paper presents a case study on designing an Advanced Encryption Standard (AES) Encryption System using our system-level design toolkit named SystemBuilder. We start with a sequential specification of the AES Encryption System behavior and generate an FPGA implementation. In order to improve the performance, we iteratively refine the behavioral description based on the analysis results obtained by a profiler. Finally, the AES Encryption System with a pipelined hardware implementation achieved 5.0 times better performance than the one with a software implementation.