Please help us test our new pre-print finding feature by giving the pre-print link a rating. A 5 star rating indicates the linked pre-print has the exact same content as the published article.
Abstract: Saman Biookaghazadeh, Pravin Kumar Ravi, Ming Zhao
High-throughput and low-latency Convolutional Neural Network (CNN) inference is increasingly important for many cloud- and edge-computing applications. FPGA-based acceleration of CNN inference has demonstrated various benefits compared to other high-performance devices such as GPGPUs. Current FPGA CNN-acceleration solutions are based on a single FPGA design, which are limited by the available resources on an FPGA. In addition, they can only accelerate conventional 2D neural networks. To address these limitations, we present a generic multi-FPGA solution, written in OpenCL, which can accelerate more complex CNNs (e.g., C3D CNN) and achieve a near linear speedup with respect to the available single-FPGA solutions. The design is built upon the Intel Deep Learning Accelerator architecture, with three extensions. PubDate: Thu, 29 Apr 2021 00:00:00 GMT
Please help us test our new pre-print finding feature by giving the pre-print link a rating. A 5 star rating indicates the linked pre-print has the exact same content as the published article.
Abstract: Michiel Van Beirendonck, Jan-Pieter D’anvers, Angshuman Karmakar, Josep Balasch, Ingrid Verbauwhede
The candidates for the NIST Post-Quantum Cryptography standardization have undergone extensive studies on efficiency and theoretical security, but research on their side-channel security is largely lacking. This remains a considerable obstacle for their real-world deployment, where side-channel security can be a critical requirement. This work describes a side-channel-resistant instance of Saber, one of the lattice-based candidates, using masking as a countermeasure. Saber proves to be very efficient to masking due to two specific design choices: power-of-two moduli and limited noise sampling of learning with rounding. A major challenge in masking lattice-based cryptosystems is the integration of bit-wise operations with arithmetic masking, requiring algorithms to securely convert between masked representations. PubDate: Fri, 23 Apr 2021 00:00:00 GMT
Please help us test our new pre-print finding feature by giving the pre-print link a rating. A 5 star rating indicates the linked pre-print has the exact same content as the published article.
Abstract: Nathan Zhang, Kevin Canini, Sean Silva, Maya Gupta
We present fast implementations of linear interpolation operators for piecewise linear functions and multi-dimensional look-up tables. These operators are common for efficient transformations in image processing and are the core operations needed for lattice models like deep lattice networks, a popular machine learning function class for interpretable, shape-constrained machine learning. We present new strategies for an efficient compiler-based solution using MLIR to accelerate linear interpolation. For real-world machine-learned multi-layer lattice models that use multidimensional linear interpolation, we show these strategies run 5-10× faster on a standard CPU compared to an optimized C++ interpreter implementation. PubDate: Thu, 15 Apr 2021 00:00:00 GMT
Please help us test our new pre-print finding feature by giving the pre-print link a rating. A 5 star rating indicates the linked pre-print has the exact same content as the published article.
Abstract: Palash Das, Hemangee K. Kapoor
Convolutional/Deep Neural Networks (CNNs/DNNs) are rapidly growing workloads for the emerging AI-based systems. The gap between the processing speed and the memory-access latency in multi-core systems affects the performance and energy efficiency of the CNN/DNN tasks. This article aims to alleviate this gap by providing a simple and yet efficient near-memory accelerator-based system that expedites the CNN inference. Towards this goal, we first design an efficient parallel algorithm to accelerate CNN/DNN tasks. The data is partitioned across the multiple memory channels (vaults) to assist in the execution of the parallel algorithm. Second, we design a hardware unit, namely the convolutional logic unit (CLU), which implements the parallel algorithm. PubDate: Thu, 15 Apr 2021 00:00:00 GMT
Please help us test our new pre-print finding feature by giving the pre-print link a rating. A 5 star rating indicates the linked pre-print has the exact same content as the published article.
Abstract: Mohit Khatwani, Hasib-Al Rashid, Hirenkumar Paneliya, Mark Horton, Nicholas Waytowich, W. David Hairston, Tinoosh Mohsenin
This article presents an energy-efficient and flexible multichannel Electroencephalogram (EEG) artifact identification network and its hardware using depthwise and separable convolutional neural networks. EEG signals are recordings of the brain activities. EEG recordings that are not originated from cerebral activities are termed artifacts. Our proposed model does not need expert knowledge for feature extraction or pre-processing of EEG data and has a very efficient architecture implementable on mobile devices. The proposed network can be reconfigured for any number of EEG channel and artifact classes. Experiments were done with the proposed model with the goal of maximizing the identification accuracy while minimizing the weight parameters and required number of operations. PubDate: Thu, 15 Apr 2021 00:00:00 GMT
Please help us test our new pre-print finding feature by giving the pre-print link a rating. A 5 star rating indicates the linked pre-print has the exact same content as the published article.
Abstract: Adi Eliahu, Ronny Ronen, Pierre-Emmanuel Gaillardon, Shahar Kvatinsky
Computationally intensive neural network applications often need to run on resource-limited low-power devices. Numerous hardware accelerators have been developed to speed up the performance of neural network applications and reduce power consumption; however, most focus on data centers and full-fledged systems. Acceleration in ultra-low-power systems has been only partially addressed. In this article, we present multiPULPly, an accelerator that integrates memristive technologies within standard low-power CMOS technology, to accelerate multiplication in neural network inference on ultra-low-power systems. This accelerator was designated for PULP, an open-source microcontroller system that uses low-power RISC-V processors. Memristors were integrated into the accelerator to enable power consumption only when the memory is active, to continue the task with no context-restoring overhead, and to enable highly parallel analog multiplication. PubDate: Thu, 15 Apr 2021 00:00:00 GMT
Please help us test our new pre-print finding feature by giving the pre-print link a rating. A 5 star rating indicates the linked pre-print has the exact same content as the published article.
Please help us test our new pre-print finding feature by giving the pre-print link a rating. A 5 star rating indicates the linked pre-print has the exact same content as the published article.
Abstract: Morteza Hosseini, Tinoosh Mohsenin
This article presents a low-power, programmable, domain-specific manycore accelerator, Binarized neural Network Manycore Accelerator (BiNMAC), which adopts and efficiently executes binary precision weight/activation neural network models. Such networks have compact models in which weights are constrained to only 1 bit and can be packed several in one memory entry that minimizes memory footprint to its finest. Packing weights also facilitates executing single instruction, multiple data with simple circuitry that allows maximizing performance and efficiency. The proposed BiNMAC has light-weight cores that support domain-specific instructions, and a router-based memory access architecture that helps with efficient implementation of layers in binary precision weight/activation neural networks of proper size. PubDate: Mon, 05 Apr 2021 00:00:00 GMT
Please help us test our new pre-print finding feature by giving the pre-print link a rating. A 5 star rating indicates the linked pre-print has the exact same content as the published article.
Abstract: Alexis Asseman, Nicolas Antoine, Ahmet S. Ozcan
Reinforcement learning, augmented by the representational power of deep neural networks, has shown promising results on high-dimensional problems, such as game playing and robotic control. However, the sequential nature of these problems poses a fundamental challenge for computational efficiency. Recently, alternative approaches such as evolutionary strategies and deep neuroevolution demonstrated competitive results with faster training time on distributed CPU cores. Here we report record training times (running at about 1 million frames per second) for Atari 2600 games using deep neuroevolution implemented on distributed FPGAs. Combined hardware implementation of the game console, image preprocessing and the neural network in an optimized pipeline, multiplied with the system level parallelism enabled the acceleration. PubDate: Mon, 05 Apr 2021 00:00:00 GMT
Please help us test our new pre-print finding feature by giving the pre-print link a rating. A 5 star rating indicates the linked pre-print has the exact same content as the published article.
There have been numerous studies on heterogeneous memory systems comprised of faster DRAM (e.g., 3D stacked HBM or HMC) and slower non-volatile memories (e.g., PCM, STT-RAM). However, most of these studies focused on static policies for managing data placement and migration among the different memory devices. These policies are based on the average behavior across a range of applications. Results show that these techniques do not always result in higher performance when compared to systems that do not migrate data across the devices: some applications show performance gains, but other applications show performance losses. It is possible to utilize offline analyses to identify which applications benefit from page migration (migration friendly) and use page migration only with those applications. PubDate: Wed, 24 Mar 2021 00:00:00 GMT
Please help us test our new pre-print finding feature by giving the pre-print link a rating. A 5 star rating indicates the linked pre-print has the exact same content as the published article.
Abstract: Dat Tran, Christof Teuscher
Emerging memcapacitive nanoscale devices have the potential to perform computations in new ways. In this article, we systematically study, to the best of our knowledge for the first time, the computational capacity of complex memcapacitive networks, which function as reservoirs in reservoir computing, one of the brain-inspired computing architectures. Memcapacitive networks are composed of memcapacitive devices randomly connected through nanowires. Previous studies have shown that both regular and random reservoirs provide sufficient dynamics to perform simple tasks. How do complex memcapacitive networks illustrate their computational capability, and what are the topological structures of memcapacitive networks that solve complex tasks with efficiency' PubDate: Sat, 27 Feb 2021 00:00:00 GMT
Please help us test our new pre-print finding feature by giving the pre-print link a rating. A 5 star rating indicates the linked pre-print has the exact same content as the published article.
Micro-architectural side-channel attacks are major threats to the most mathematically sophisticated encryption algorithms. In spite of the fact that there exist several defense techniques, the overhead of implementing the countermeasures remains a matter of concern. A promising strategy is to develop online detection and prevention methods for these attacks. Though some recent studies have devised online prevention mechanisms for some categories of these attacks, still other classes remain undetected. Moreover, to detect these side-channel attacks with minimal False Positives is a challenging effort because of the similarity of their behavior with computationally intensive applications. This article presents a generalized machine learning--based multi-layer detection technique that targets these micro-architectural side-channel attacks, while not restricting its attention only on a single category of attacks. PubDate: Fri, 29 Jan 2021 00:00:00 GMT
Please help us test our new pre-print finding feature by giving the pre-print link a rating. A 5 star rating indicates the linked pre-print has the exact same content as the published article.
Singe-Electron Transistor (SET) is considered as a promising candidate of low-power devices for replacement or co-existence with Complementary Metal-Oxide-Semiconductor (CMOS) transistors/circuits. In this work, we propose a diagnosis approach for SET array under a more generalized defect model. With the more generalized defect model, the diagnosis approach will become more practical but complicated. We conducted experiments on a set of SET arrays with different dimensions and defect rates. The experimental results show that our approach only has 3.8% false-negative rate and 0.7% misjudged-category rate on average without reporting any false-positive edge when the defect rate is 4%. PubDate: Thu, 21 Jan 2021 00:00:00 GMT
Please help us test our new pre-print finding feature by giving the pre-print link a rating. A 5 star rating indicates the linked pre-print has the exact same content as the published article.
The method to map the neural signals to the neuron from which it originates is spike sorting. A low-power spike sorting system is presented for a neural implant device. The spike sorter constitutes a two-step trainer module that is shared by the signal acquisition channel associated with multiple electrodes. A low-power Spiking Neural Network (SNN) module is responsible for assigning the spike class. The two-step shared supervised on-chip training module is presented for improved training accuracy for the SNN. Post implant, the relatively power-hungry training module can be activated conditionally based on a statistics-driven retraining algorithm that allows on the fly training and adaptation. PubDate: Wed, 20 Jan 2021 00:00:00 GMT
Please help us test our new pre-print finding feature by giving the pre-print link a rating. A 5 star rating indicates the linked pre-print has the exact same content as the published article.
Please help us test our new pre-print finding feature by giving the pre-print link a rating. A 5 star rating indicates the linked pre-print has the exact same content as the published article.
Due to the globalization of Integrated Circuit (IC) design in the semiconductor industry and the outsourcing of chip manufacturing, Third-Party Intellectual Properties (3PIPs) become vulnerable to IP piracy, reverse engineering, counterfeit IC, and hardware trojans. To thwart such attacks, ICs can be protected using logic encryption techniques. However, strong resilient techniques incur significant overheads. Side-channel attacks (SCAs) further complicate matters by introducing potential attacks post fabrication. One of the most severe SCAs is power analysis (PA) attacks, in which an attacker can observe the power variations of the device and analyze them to extract the secret key. PA attacks can be mitigated via adding large extra hardware; however, the overheads of such solutions can render them impractical, especially when there are power and area constraints. PubDate: Wed, 06 Jan 2021 00:00:00 GMT
Please help us test our new pre-print finding feature by giving the pre-print link a rating. A 5 star rating indicates the linked pre-print has the exact same content as the published article.
Abstract: Anwesha Chatterjee, Shouvik Musavvir, Ryan Gary Kim, Janardhan Rao Doppa, Partha Pratim Pande
Voltage/frequency island (VFI)-based power management is a popular methodology for designing energy-efficient manycore architectures without incurring significant performance overhead. However, monolithic 3D (M3D) integration has emerged as an enabling technology to design high-performance and energy-efficient circuits and systems. The smaller dimension of vertical monolithic inter-tier vias (MIVs) lowers effective wirelength and allows high integration density. However, sequential fabrication of M3D layers introduces inter-tier process variations that affect the performance of transistors and interconnects in different layers. Therefore, VFI-based power management in M3D manycore systems requires the consideration of inter-tier process variation effects. In this work, we present the design of an imitation learning (IL)-enabled VFI-based power-management strategy that considers the inter-tier process-variation effects in M3D manycore chips. PubDate: Wed, 06 Jan 2021 00:00:00 GMT