Abstract: The number of battery-powered devices is rapidly increasing due to the widespread use of IoT-enabled nodes in various fields. Energy harvesters, which can power embedded devices, are a feasible alternative to batteries. The harvester stores enough energy in a capacitor to power up the embedded device and compute a task; this style of computation is referred to as intermittent computing. Because energy harvesters cannot supply continuous power, and all registers and caches in conventional processors are volatile, a Non-Volatile Memory (NVM)-based Non-Volatile Processor (NVP) is required that can retain register and cache contents across a power failure. However, NVM-based caches reduce system performance and consume more energy than SRAM-based caches. This paper proposes efficient placement and migration policies for a hybrid cache architecture that uses SRAM and STT-RAM at the first-level cache. The proposed architecture includes cache block placement and migration policies that reduce the number of writes to STT-RAM. During a power failure, the backup strategy identifies the critical blocks in SRAM and migrates them to STT-RAM. Compared to the baseline architecture, the proposed architecture reduces STT-RAM writes from 63.35% to 35.93%, resulting in a 32.85% performance gain and a 23.42% reduction in energy consumption. Our backup strategy reduces backup time by 34.46% compared to the baseline. PubDate: 2023-05-05
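As a concrete illustration of the placement/migration idea in this abstract, here is a minimal Python sketch of a write-intensity-driven hybrid cache: write-heavy blocks are kept in SRAM to avoid costly STT-RAM writes, and only the dirty SRAM blocks are copied to STT-RAM on a power failure. The threshold, counters, and class names are illustrative assumptions, not the paper's exact policies.

```python
# Minimal sketch of a write-intensity-based placement policy for a
# hybrid SRAM/STT-RAM L1 cache. Threshold and counters are
# illustrative assumptions, not the paper's exact policy.

class HybridCacheSim:
    WRITE_THRESHOLD = 4  # hypothetical: writes before a block counts as write-heavy

    def __init__(self):
        self.sram = {}    # block -> write count (volatile, cheap writes)
        self.sttram = {}  # block -> write count (non-volatile, costly writes)

    def access(self, block, is_write):
        if is_write:
            # Keep write-heavy blocks in SRAM to cut STT-RAM writes.
            if block in self.sttram:
                self.sttram[block] += 1
                if self.sttram[block] >= self.WRITE_THRESHOLD:
                    self.sram[block] = self.sttram.pop(block)  # migrate to SRAM
            else:
                self.sram[block] = self.sram.get(block, 0) + 1
        else:
            # Read-mostly blocks may live in STT-RAM with no write penalty.
            if block not in self.sram and block not in self.sttram:
                self.sttram[block] = 0

    def backup_on_power_failure(self):
        # Only dirty (written) SRAM blocks need copying to STT-RAM,
        # which shortens backup time versus flushing everything.
        critical = {b: w for b, w in self.sram.items() if w > 0}
        self.sttram.update(critical)
        self.sram.clear()
        return len(critical)  # number of blocks actually backed up
```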
Abstract: Self-driving vehicles are expected to thrive in the coming years. These vehicles are designed to analyze their environment in real time to identify obstacles and hazards. One of the most important aspects of designing a self-driving vehicle is preserving the safety of pedestrians. This requires accurate and rapid pedestrian detection, which is also a key operation in various other applications, including video surveillance and assisted living. The covariance descriptor is one of the most effective descriptors for detecting pedestrians. However, it is compute-intensive, rendering it less favorable for real-time applications. This paper proposes an accelerated and optimized implementation of the descriptor. Instead of mapping the entire descriptor to a hardware accelerator, we opt for a heterogeneous architecture: compute-intensive components of the descriptor are accelerated in hardware, while the remaining components are executed on an embedded processor. The proposed architecture combines speed and flexibility while being frugal with precious hardware resources. The architecture was validated on a Zynq SoC platform, which hosts FPGA fabric along with an ARM processor. Executing the descriptor on this platform shows a performance gain of up to 13.52× compared to a pure software implementation. PubDate: 2023-04-28
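To make the descriptor concrete: a region covariance descriptor is the covariance matrix of per-pixel feature vectors over an image window. The sketch below uses one common feature set (coordinates, intensity, gradient magnitudes); the paper's exact feature set and its partitioning between FPGA fabric and the ARM core may differ.

```python
import numpy as np

def region_covariance(gray, x0, y0, w, h):
    """Covariance descriptor of a region: per-pixel features
    (x, y, intensity, |Ix|, |Iy|), then their covariance matrix."""
    region = gray[y0:y0 + h, x0:x0 + w].astype(np.float64)
    ys, xs = np.mgrid[0:h, 0:w]
    iy, ix = np.gradient(region)          # first-order image derivatives
    feats = np.stack([xs.ravel(), ys.ravel(), region.ravel(),
                      np.abs(ix).ravel(), np.abs(iy).ravel()])
    return np.cov(feats)                  # 5x5 symmetric matrix

# Usage: descriptor of a 64x128 pedestrian window in a random test image.
img = np.random.rand(240, 320)
C = region_covariance(img, 10, 20, 64, 128)
print(C.shape)  # (5, 5)
```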
Abstract: The convolutional neural network (CNN) is the deep learning technique most widely used in applications today. Recent years have seen rising demand for real-time CNN implementations on various embedded devices with restricted resources. Implementing CNN models on field-programmable gate arrays ensures flexible programmability and speeds up the development process. However, CNN acceleration is hampered by complex computations, limited bandwidth, and scarce on-chip memory. In this paper, a reusable quantized hardware architecture is proposed to accelerate deep CNN models by addressing these issues. Twenty-five processing elements compute the convolutions in the CNN model, and pipelining, loop unrolling, and array partitioning increase the speed of computation in both the convolution layers and the fully connected layers. The design is tested with MNIST handwritten-digit image classification on a low-cost, low-memory Xilinx PYNQ-Z2 system-on-chip edge device. The inference speed of the proposed hardware design is 92.7% higher than an INTEL core3 CPU, 90.7% higher than a Haswell core2 CPU, 87.7% higher than an NVIDIA Tesla K80 GPU, and 84.9% better than a conventional hardware accelerator with one processing element. The proposed quantized architecture achieves 4.4 GOP/s without compromising accuracy, twice the performance of the conventional architecture. PubDate: 2023-04-26
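The sketch below illustrates, in Python for readability, what 25 processing elements buy for a 5x5 convolution: all 25 multiply-accumulates for one output pixel are issued together, which is the parallelism that loop unrolling and array partitioning expose in the hardware. The dimensions and int8 quantization are illustrative assumptions, not the paper's exact design.

```python
import numpy as np

def conv2d_25pe(ifmap, kernel):
    """Sketch: a 5x5 convolution where all 25 multiplies for one output
    pixel happen 'in parallel', mirroring 25 processing elements."""
    H, W = ifmap.shape
    K = 5
    out = np.zeros((H - K + 1, W - K + 1), dtype=np.int32)
    kflat = kernel.reshape(-1).astype(np.int32)   # 25 weights, one per PE
    for r in range(out.shape[0]):
        for c in range(out.shape[1]):
            window = ifmap[r:r + K, c:c + K].reshape(-1).astype(np.int32)
            # 25 MACs issued together: one per processing element.
            out[r, c] = np.dot(window, kflat)
    return out

ifmap = np.random.randint(-128, 127, (28, 28), dtype=np.int8)   # quantized input
kernel = np.random.randint(-128, 127, (5, 5), dtype=np.int8)    # quantized weights
print(conv2d_25pe(ifmap, kernel).shape)  # (24, 24)
```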
Abstract: Fail-operational behavior of safety-critical software for autonomous driving is essential, as there is no driver available as a backup. In a failure scenario, safety-critical tasks can be restarted on other available hardware resources. Here, graceful degradation can be used as a cost-efficient solution in which hardware resources are redistributed from non-critical to safety-critical tasks at run-time: non-critical tasks may actively use resources that are reserved as a backup for critical tasks, which would otherwise sit unused and are only required in a failure scenario. In such a scenario, however, it is of paramount importance to achieve predictable timing behavior of safety-critical applications to allow safe operation; it must be ensured that a guarantee on execution times can be given even after safety-critical tasks are restarted. In this paper, we propose a graceful degradation approach using composable scheduling. We use this approach to present, for the first time, a performance analysis able to verify the timing constraints of fail-operational distributed applications using graceful degradation. Our method can verify that even during a critical Electronic Control Unit failure there is always a backup solution available that adheres to end-to-end timing constraints. Furthermore, we present a dynamic decentralized mapping procedure that performs constraint solving at run-time, combining our analytical approach with a backtracking algorithm. We evaluate our approach by comparing mapping success rates against state-of-the-art approaches such as active redundancy and an approach based on resource availability. In our experimental setup, the graceful degradation approach fits about twice as many critical applications on the same architecture as an active redundancy approach. Combined, our approaches enable, for the first time, dynamic and fail-operational behavior of gracefully degrading automotive systems with cost-efficient backup solutions for safety-critical applications. PubDate: 2023-04-11
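As a simplified view of what such an analysis checks, the following sketch verifies that every critical task has at least one backup ECU with enough reserved capacity and a feasible response time. The utilization-based test stands in for the paper's composable scheduling analysis and is only an illustrative assumption; all names and numbers are hypothetical.

```python
# Minimal sketch of a backup-feasibility check: for every critical task,
# some other ECU must have enough reserved capacity to host it after its
# home ECU fails, without violating the task's deadline.

def backup_exists(critical_tasks, ecus):
    """critical_tasks: list of (name, home_ecu, utilization, wcet, deadline).
    ecus: dict ecu -> spare utilization reserved for backups."""
    for name, home, util, wcet, deadline in critical_tasks:
        candidates = [e for e, spare in ecus.items()
                      if e != home and spare >= util and wcet <= deadline]
        if not candidates:
            return False, name   # no fail-operational guarantee for this task
    return True, None

ecus = {"ecu0": 0.3, "ecu1": 0.5, "ecu2": 0.2}           # reserved spare capacity
tasks = [("brake_ctrl", "ecu0", 0.25, 2.0, 10.0),
         ("steer_ctrl", "ecu1", 0.20, 3.0, 15.0)]
print(backup_exists(tasks, ecus))   # (True, None): every task has a backup home
```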
Abstract: Detecting and characterising vehicles is one of the purposes of embedded systems used in intelligent environments. Analysing a vehicle's characteristics can reveal inappropriate or dangerous behaviour, making it possible to sanction the driver or to notify emergency services so they can take early, practical action. Vehicle detection and characterisation systems employ complex sensors such as video cameras, especially in urban environments. These sensors offer high accuracy, but their price and computational requirements are directly proportional to their performance. This article introduces a system based on modular devices that is economical and has a low computational cost. The devices use ultrasonic sensors to detect the speed and length of vehicles, and measurement accuracy is improved through collaboration between the device modules. The experiments were performed using multiple modules oriented at different angles, coupled with a module specifically designed to estimate distance from the speed and length data of the previous modules. The collaboration between modules reduces the relative speed error to between 1 and 5%, depending on the angle configuration of the modules. PubDate: 2023-04-07
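The underlying measurement principle can be shown with a short worked example: two sensors a known distance apart yield speed from the time gap between detections, and length from how long a single sensor stays occluded. The spacing and timestamps below are illustrative values, not the article's data.

```python
# Worked example of the speed/length estimation principle behind the
# ultrasonic modules. All values are illustrative.

SENSOR_SPACING_M = 2.0          # assumed distance between the two sensor modules

def vehicle_speed(t_front_s1, t_front_s2):
    """Speed from the vehicle's front crossing sensor 1, then sensor 2."""
    return SENSOR_SPACING_M / (t_front_s2 - t_front_s1)

def vehicle_length(speed_mps, t_front, t_rear):
    """Length = speed * time the vehicle occupies a single sensor."""
    return speed_mps * (t_rear - t_front)

v = vehicle_speed(0.000, 0.144)           # 2.0 m / 0.144 s ≈ 13.9 m/s
length = vehicle_length(v, 0.000, 0.324)  # ≈ 4.5 m occupying one sensor for 0.324 s
print(f"{v * 3.6:.1f} km/h, {length:.2f} m")  # 50.0 km/h, 4.50 m
```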
Abstract: This paper addresses the problem of evaluating the effectiveness of system scheduling and energy savings in embedded real-time systems with limited computing resources. In such systems, the characteristics of the implemented scheduling policy play a relevant role in both schedulability and energy consumption. Ideally, the scheduling policy should provide high schedulability bounds and low runtime overheads, allowing better use of the available slack in the schedule for energy-saving purposes. Due to its low overhead and simple implementation, the usual scheduling policy employed in real-time embedded systems is fixed priority scheduling (FPS). Under this scheme, as the priorities of all system tasks are assigned at design time, a simple priority vector suffices to indicate the ready task to run next. System schedulability, however, is usually lower than that provided by dynamic priority scheduling (DPS), under which task priorities are assigned at runtime. Managing dynamic priority queues incurs higher overheads, though, so deciding whether DPS is a viable choice for such embedded systems requires careful evaluation. We evaluate two implementations of Earliest Deadline First (EDF), a classical DPS policy, in FreeRTOS running on an ARM-M4 architecture. EDF is compared against an optimal FPS policy, namely Rate-Monotonic (RM). Further, two mechanisms for energy saving are described; they differ in the manner they compute the slack available in an EDF schedule, statically (SS-EDF) or dynamically (DS-EDF). These two approaches are experimentally evaluated, and the results indicate that EDF can be used effectively for energy savings. PubDate: 2023-03-15 DOI: 10.1007/s10617-023-09267-7
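For context, the classical EDF schedulability test and the static-slack idea behind SS-EDF can be sketched as follows; the slack model here is a simplification and stands in for the paper's actual mechanisms.

```python
# Sketch of the EDF schedulability test and a static-slack estimate in
# the spirit of SS-EDF: under EDF a periodic task set with implicit
# deadlines is schedulable iff total utilization U <= 1, and the unused
# fraction 1 - U is slack that an energy-saving mechanism (e.g. DVFS)
# could reclaim. Simplified model, not the paper's exact formulation.

def edf_schedulable(tasks):
    """tasks: list of (wcet, period) with deadline == period."""
    return sum(c / t for c, t in tasks) <= 1.0

def static_slack_factor(tasks):
    """Fraction of CPU time statically known to be idle under EDF."""
    return 1.0 - sum(c / t for c, t in tasks)

tasks = [(1.0, 4.0), (2.0, 8.0), (1.0, 10.0)]   # U = 0.25 + 0.25 + 0.1 = 0.6
print(edf_schedulable(tasks))                    # True
print(f"{static_slack_factor(tasks):.2f}")       # 0.40 -> 40% static slack
```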
Abstract: Task mapping and routing are crucial steps in the design of Networks-on-Chip (NoC) based Multiprocessor Systems-on-Chip (MPSoCs). While the mapping must ensure an optimized arrangement of the applications' tasks on the system cores, the routing must ensure that tasks communicate with the minimum possible delay. These two problems are highly dependent, since finding a routing solution first requires a mapping solution. Based on this observation, this paper analyzes the mapping and routing problems in NoC-based MPSoCs and defines their joint version as the Mapping and Routing Problem (MRP). We propose a mathematical model that generates mapping and routing solutions based on a specific bandwidth of NoC links. We also propose three evolutionary metaheuristics to find optimized solutions to the MRP: Genetic (GA), Memetic (MA), and Transgenetic (TA) algorithms. Experimental results evaluating communication latency demonstrate that the proposed algorithms are well suited to the problem, with TA standing out among all the compared solutions. Overall, TA achieved up to 8% and 19% better performance than the compared algorithms in Global Average Delay and Maximum Delay, respectively, and outperformed the other strategies in 55.76% and 51.58% of all performed simulations in those respective metrics. PubDate: 2023-03-10 DOI: 10.1007/s10617-023-09269-5
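To illustrate the evolutionary encoding, the sketch below represents a mapping as a task-to-core vector and scores it by communication volume times XY-routing hop distance. This is a bare-bones GA loop under assumed data; the actual GA/MA/TA operators and the bandwidth-aware routing model in the paper are considerably richer.

```python
import random

def hops(core_a, core_b, mesh_w):
    """XY (Manhattan) routing distance between two cores in a mesh NoC."""
    ax, ay = core_a % mesh_w, core_a // mesh_w
    bx, by = core_b % mesh_w, core_b // mesh_w
    return abs(ax - bx) + abs(ay - by)

def fitness(mapping, edges, mesh_w):
    # edges: (src_task, dst_task, volume); lower cost = better mapping.
    return sum(vol * hops(mapping[s], mapping[d], mesh_w) for s, d, vol in edges)

def genetic_step(pop, edges, mesh_w, n_cores):
    pop.sort(key=lambda m: fitness(m, edges, mesh_w))
    survivors = pop[:len(pop) // 2]                  # elitist selection
    children = []
    while len(survivors) + len(children) < len(pop):
        child = random.choice(survivors)[:]
        child[random.randrange(len(child))] = random.randrange(n_cores)  # mutate
        children.append(child)
    return survivors + children

edges = [(0, 1, 10), (1, 2, 5), (0, 2, 2)]           # toy task graph
pop = [[random.randrange(9) for _ in range(3)] for _ in range(8)]  # 3x3 mesh
for _ in range(100):
    pop = genetic_step(pop, edges, 3, 9)
best = min(pop, key=lambda m: fitness(m, edges, 3))
# Note: a real model adds core-capacity and link-bandwidth constraints,
# so all tasks cannot simply collapse onto one core.
print(best, fitness(best, edges, 3))
```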
Abstract: Real-time resource access protocols are fundamental to bounding the maximum delay a task can suffer due to priority inversion. Several real-time protocols have been proposed for both static and dynamic scheduling on single- and multi-core processors. One of the main factors in the performance of such protocols is the way they are implemented within a real-time operating system (RTOS). In this paper we present an object-oriented design of real-time resource access protocols covering single- and multi-core systems as well as suspension- and spin-based protocols (7 protocols in total). Our design aims at reducing run-time overhead and increasing code reusability. By implementing the proposed design in an RTOS and running the protocols on a modern multi-core processor, we analyze the memory footprint, the run-time overhead, and the impact of that overhead on the schedulability of synthetically generated task sets. Our results indicate that a proper implementation yields low run-time overhead (up to 6.1 µs) and low impact on the schedulability of real-time tasks. PubDate: 2023-03-01 DOI: 10.1007/s10617-023-09268-6
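The object-oriented structure can be sketched as a common lock/unlock interface with protocol-specific priority rules in subclasses, so the RTOS can swap protocols without touching task code. The Python below is a simplified model (larger number = higher priority) with hypothetical names; real implementations manipulate scheduler queues and blocking lists rather than plain fields.

```python
# Sketch of an object-oriented protocol hierarchy: one shared interface,
# protocol-specific priority rules in subclasses. Simplified model.

class Task:
    def __init__(self, name, priority):
        self.name, self.priority, self.base = name, priority, priority

class ResourceProtocol:                      # shared lock/unlock interface
    def lock(self, task, resource):
        raise NotImplementedError
    def unlock(self, task, resource):
        task.priority = task.base            # restore base priority
        resource["holder"] = None

class PriorityInheritance(ResourceProtocol):
    def lock(self, task, resource):
        holder = resource.get("holder")
        if holder and holder.priority < task.priority:
            holder.priority = task.priority  # holder inherits blocker priority
        resource["holder"] = task            # (blocking of `task` omitted)

class PriorityCeiling(ResourceProtocol):
    def lock(self, task, resource):
        # Immediate ceiling: raise to the resource ceiling while holding it.
        task.priority = max(task.priority, resource["ceiling"])
        resource["holder"] = task

res = {"ceiling": 10, "holder": None}
t = Task("sensor", 3)
PriorityCeiling().lock(t, res)
print(t.priority)   # 10: raised to the ceiling, bounding priority inversion
```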
Abstract: This work proposes an energy-efficient diagonal mesh-based topology called DiamondMesh. By introducing diagonal links into the baseline mesh topology, DiamondMesh improves network performance while retaining the regular, simple, and scalable properties of the Mesh topology. The topological properties of DiamondMesh are explored and compared with those of other competitive diagonal mesh topologies. Using the Booksim2.0 simulator, the proposed topology is evaluated under a variety of traffic patterns and the results are compared to those obtained with Mesh and the existing diagonal mesh topologies. The proposed and competing topologies were also synthesised using Xilinx Vivado and the results analysed. The evaluation shows a significant reduction in latency compared to Mesh and the other diagonal mesh topologies except DMesh, and a considerable reduction in area and power compared to DMesh. DiamondMesh thus proves to be a highly efficient diagonal mesh-based topology for a variety of applications. PubDate: 2023-01-02 DOI: 10.1007/s10617-022-09266-0
Abstract: Nowadays, embedded systems have multiprocessing capabilities to meet the complexity of modern applications such as signal processing and multimedia. However, as an embedded system's functionality expands, complexity increases and numerous constraints become necessary; demands such as high performance, low power consumption, and short development time become critical. Emulation and verification are therefore necessary to assess the correctness and performance of such architectures and to accelerate the development phase. We propose a robust, scalable, and flexible hardware-software emulation framework that focuses on design space exploration for MPSoC architectures. Our framework supports 2D and 3D NoC-based architectures built on an open-source RISC-V core. According to the user configuration, the framework auto-generates the corresponding Universal Verification Methodology environment to explore the design space, evaluate performance, and compare results across a wide range of configurations and parameters, and then provides the best solution based on user-supplied criteria. The framework uses emulation co-modeling technology to enable the designer to explore and detect architecture failures. We provide numerous experimental results for different 2D and 3D NoC architectures to assess their correctness and performance, including energy and power consumption. Notably, the results show a 40× acceleration compared to software simulators. PubDate: 2022-09-14 DOI: 10.1007/s10617-022-09265-1
Abstract: Finite State Machines with Input Multiplexing (FSMIMs) were proposed in previous work as a technique for efficiently mapping Finite State Machines (FSMs) into ROM memory. In this paper, we present new contributions to the optimization process involved in implementing FSMIMs on Field Programmable Gate Array (FPGA) devices. This process consists of two stages: (1) simplification of the FSMIM's bank of input selectors, and (2) reduction of the ROM depth, which significantly affects both the number of Look-Up Tables (LUTs) used and the number of Embedded Memory Blocks (EMBs) required by the ROM. For the first stage, we present two optimization approaches based on the Minimum Maximal k-Partial Matching (MMKPM) problem: one applies the greedy algorithm for the MMKPM problem, and the other is based on a new multiobjective variant of MMKPM and its corresponding Integer Linear Programming formulation. We also propose a modification of the second stage in which the characteristics of EMBs are taken into account to improve implementation results. The new optimization process significantly reduces the number of FPGA resources used with respect to the previous one. In addition, the proposed approaches achieve an adequate trade-off between EMB and LUT usage with respect to conventional ROM-based and LUT-based FSM implementations. PubDate: 2022-06-02 DOI: 10.1007/s10617-022-09259-z
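For readers unfamiliar with FSMIMs, the core idea is that each state's transitions typically depend on only a few inputs, so a selector bank routes just those inputs into the ROM address, shrinking ROM depth. The sketch below computes this effect for a toy FSM; the per-state input grouping shown is a naive simplification, not the paper's MMKPM-based optimization.

```python
# Sketch of the input-multiplexing effect in an FSMIM: ROM depth grows
# with the maximum number of inputs any single state inspects, not with
# the total number of FSM inputs. Toy example with assumed names.

def inputs_per_state(transitions):
    """transitions: dict state -> list of (input_name, next_state)."""
    return {s: sorted({i for i, _ in edges}) for s, edges in transitions.items()}

def fsmim_rom_depth(transitions, n_state_bits):
    used = inputs_per_state(transitions)
    k = max(len(v) for v in used.values())    # selector bank width in inputs
    return 2 ** (n_state_bits + k), k

fsm = {
    "S0": [("start", "S1")],
    "S1": [("ack", "S2"), ("err", "S0")],
    "S2": [("done", "S0")],
}
depth, k = fsmim_rom_depth(fsm, n_state_bits=2)
# 4 distinct inputs overall, but no state inspects more than k = 2, so the
# ROM needs 2**(2+2) = 16 words instead of 2**(2+4) = 64 without multiplexing.
print(depth, k)  # 16 2
```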
Abstract: Present-day embedded system applications consume considerable energy during execution, a significant fraction of which is due to intensive register-file accesses in the processor architecture. This paper presents a novel architecture incorporating a multi-banked register file organization and a selective replacement technique, referred to as a selective register file cache, to capture actively reused and short-lived register operands. This alleviates the load on the register file during read and write operations. The proposed architecture achieves a maximum energy saving of 68% on register-file accesses over a conventional embedded processor architecture. It consumes an average energy of 8.48 µJ, which is 51% less than the energy consumption of a reduced instruction set computer (RISC-V) baseline processor architecture. PubDate: 2022-05-29 DOI: 10.1007/s10617-022-09264-2
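The selective register-file-cache concept can be sketched as a tiny buffer in front of the register file that retains actively reused operands and evicts the least-reused entry. The capacity and reuse heuristic below are illustrative assumptions (write-back of dirty victims is omitted for brevity), not the paper's design.

```python
# Sketch of a selective register file cache: a small buffer captures
# actively reused/short-lived operands so most accesses never touch the
# energy-costly main register file. Sizes and heuristic are assumptions.

class RegisterFileCache:
    def __init__(self, capacity=8):
        self.capacity = capacity
        self.entries = {}      # reg index -> (value, reuse_count)
        self.rf_accesses = 0   # accesses that hit the main register file

    def read(self, reg, register_file):
        if reg in self.entries:
            val, uses = self.entries[reg]
            self.entries[reg] = (val, uses + 1)   # track active reuse
            return val
        self.rf_accesses += 1                     # miss: costly RF read
        return register_file[reg]

    def write(self, reg, value):
        if reg not in self.entries and len(self.entries) >= self.capacity:
            # Selective replacement: evict the least-reused entry
            # (write-back of the victim to the RF omitted here).
            victim = min(self.entries, key=lambda r: self.entries[r][1])
            del self.entries[victim]
        self.entries[reg] = (value, 0)
```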
Abstract: Pattern matching using the Aho-Corasick (AC) algorithm is the most time-consuming task in an Intrusion Detection System, and therefore Field Programmable Gate Array (FPGA) based solutions are frequently employed. In this context, the two possibilities are memory-based solutions and hardwired solutions. The limitation of memory-based solutions is inefficient utilization of slices, while hardwired solutions require a tremendous amount of effort and time, as writing Hardware Description Language (HDL) code for thousands of rules is prone to human error. Consequently, the contributions of this article are twofold. The first is a tool for the automatic generation of Verilog-HDL code from the rule set. The second is an efficient parallel hardware implementation scheme, compared with a serial implementation scheme in terms of design parameters such as resource utilization, operating frequency, and throughput. The proposed parallel scheme divides the entire rule set into smaller subsets for parallel execution. Experimental results reveal that the tool can generate the target code for 10,000 rules in less than a minute without any error. The automatic generation of target code has allowed us to perform a comprehensive design space exploration for the parallel implementation of the AC algorithm in a short time. Finally, our prototype on a Xilinx ZC702 evaluation FPGA board can efficiently examine, for 10,000 rules, a packet stream arriving at a bit rate of 1.56 Gbps at an operating frequency of 195 MHz. PubDate: 2022-01-23 DOI: 10.1007/s10617-021-09257-7
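For reference, the automaton that such generated Verilog implements is the classic Aho-Corasick structure: a goto trie plus failure links, consuming one input symbol per step. Below is a compact software sketch of that structure (not the authors' generator or their hardware scheme).

```python
from collections import deque

def build_ac(patterns):
    """Build the Aho-Corasick goto trie, failure links, and output sets."""
    goto, fail, out = [{}], [0], [set()]
    for p in patterns:                       # insert each pattern into the trie
        s = 0
        for ch in p:
            if ch not in goto[s]:
                goto.append({}); fail.append(0); out.append(set())
                goto[s][ch] = len(goto) - 1
            s = goto[s][ch]
        out[s].add(p)
    q = deque(goto[0].values())
    while q:                                 # BFS sets failure links level by level
        s = q.popleft()
        for ch, t in goto[s].items():
            q.append(t)
            f = fail[s]
            while f and ch not in goto[f]:   # follow failures of the parent
                f = fail[f]
            fail[t] = goto[f].get(ch, 0)
            out[t] |= out[fail[t]]           # inherit matches ending here
    return goto, fail, out

def search(text, goto, fail, out):
    """Scan text one symbol per step, reporting (end_index, pattern) hits."""
    s, hits = 0, []
    for i, ch in enumerate(text):
        while s and ch not in goto[s]:
            s = fail[s]
        s = goto[s].get(ch, 0)
        hits += [(i, p) for p in sorted(out[s])]
    return hits

g, f, o = build_ac(["he", "she", "his", "hers"])
print(search("ushers", g, f, o))  # [(3, 'he'), (3, 'she'), (5, 'hers')]
```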
Abstract: Embedded devices are omnipresent in our daily routine, from smartphones to home appliances, running both data- and control-oriented applications. To maximize the energy-performance tradeoff, data- and instruction-level parallelism are exploited using superscalar processors and specific accelerators. However, as such devices face severe time-to-market pressure, binary compatibility should be maintained to avoid recurrent engineering effort, which current embedded processors do not consider. This work examines a set of embedded applications and shows the need for concurrent ILP and DLP exploitation. To that end, we propose a Hybrid Multi-Target Binary Translator (HMTBT) that transparently exploits ILP and DLP using a CGRA and an ARM NEON engine as target accelerators. Results show that the HMTBT transparently achieves 24% performance improvement and 54% energy savings over an out-of-order superscalar processor coupled to an ARM NEON engine, and improves performance and energy by 10% and 24%, respectively, over decoupled binary translators using the same accelerators with the same ILP and DLP capabilities. PubDate: 2022-01-14 DOI: 10.1007/s10617-021-09258-6
Abstract: With the rapid development of Artificial Intelligence, the Internet of Things, 5G, and other technologies, a number of emerging intelligent applications have appeared, represented by image recognition, voice recognition, autonomous driving, and intelligent manufacturing. These applications require efficient, intelligent processing systems for massive data calculations, so there is an urgent need to deploy better DNNs in faster ways. Compared with GPUs, FPGAs have a higher energy-efficiency ratio, and they offer a shorter development cycle and better flexibility than ASICs; even so, the FPGA is not a perfect hardware platform for computational intelligence. This paper surveys the latest acceleration work related to familiar DNNs and proposes three new directions to break the bottlenecks of DNN implementation. To improve the computing speed and energy efficiency of edge devices, intelligent embedded approaches including model compression and optimized data movement across the entire system are most commonly used. With the gradual slowdown of Moore's Law, the traditional Von Neumann architecture suffers from the "memory wall" problem, resulting in higher power consumption; in-memory computation will be the right medicine in the post-Moore era. A more complete software/hardware co-design environment will direct researchers' attention to exploring deep learning algorithms and running them at the hardware level in a faster way. These new directions start a relatively new paradigm in computational intelligence, which has attracted substantial attention from the research community and demonstrated greater potential than traditional techniques. PubDate: 2022-01-12 DOI: 10.1007/s10617-021-09256-8
Abstract: In highly integrated electronic circuit designs, power reduction must be properly addressed. The standardized ways of specifying power intent are unwieldy in modern complex designs, since they extensively prolong a product's time-to-market. In this article, we propose a simplified method of designing energy-efficient systems at the register-transfer level that is fully compatible with the existing design flow and industrial automation tools. The power-intent specification is abstract enough to be easily integrated into the HDL model, which also simplifies its maintainability. A connection to later design-flow stages (i.e., lower abstraction levels) is achieved by automated synthesis, which translates the simplified specification into the standard means supported by existing professional design-automation tools. The benefits of the proposed design method are a faster development process, a reduced number of possible power-intent errors, and easier design of energy-efficient systems. The method can be used by all designers, even those who were unable to use the standard means due to their high complexity. Experiments using 10,000 power-intent specification samples have shown that the proposed specification method is approximately 23 times less complex (in terms of lines of code) than the standard method, while achieving the same power-consumption reduction with much less designer effort. PubDate: 2021-08-15 DOI: 10.1007/s10617-021-09254-w