International Journal of Parallel Programming
Journal Prestige (SJR): 0.244 | Citation Impact (CiteScore): 1 | Number of Followers: 6 | ISSN (Print) 0885-7458 | ISSN (Online) 1573-7640 | Published by Springer-Verlag
- RMOWOA: A Revamped Multi-Objective Whale Optimization Algorithm for Maximizing the Lifetime of a Network in Wireless Sensor Networks
Abstract: Wireless sensor networks (WSNs) consist of sensor nodes that detect, process, and transmit various types of information to a base station unit. The development of energy-efficient routing protocols is a crucial challenge in WSNs. This study proposes a novel algorithm called RMOWOA, the Revamped Multi-Objective Whale Optimization Algorithm, which partitions the network using concentric circles with different radii. The circles are divided into eight equal sectors, and sections are formed at the intersections of sectors and layers. Each section contains a small number of nodes, and an agent is selected based on specific criteria. The nodes within each section transmit their detected information to the corresponding agent, or cluster head, and this process is repeated until the base station receives the data. Agents are selected with a whale-optimization-based approach known for extending the network's lifetime. The selected agent aggregates the data, performs redundant residue number system based error detection and rectification, and forwards the information to the lower segment's agent within that sector. The proposed RMOWOA algorithm is evaluated through simulation analysis and compared with established benchmark cluster head selection schemes such as SFA-Cluster Head Selection, FCGWO-Cluster Head Selection, and ABC-Cluster Head Selection. The experimental results demonstrate that RMOWOA reduces energy consumption and extends network lifespan by effectively balancing the ratio of alive and dead nodes in WSNs. (A minimal sketch of the ring-and-sector partitioning follows this entry.)
PubDate: 2024-08-06
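The sketch below is a minimal, hypothetical illustration of the ring-and-sector partitioning described in the abstract: a node's distance and angle from the base station determine its layer and one of eight sectors, and the pair identifies its section. The layer radii, coordinates, and section encoding are assumed values, not taken from the paper.

```cpp
// Minimal sketch (not the authors' code): assign a sensor node to a section formed by the
// intersection of a concentric layer and one of eight equal sectors around the base station.
#include <cmath>
#include <cstdio>
#include <vector>

struct Section { int layer; int sector; };

Section assign_section(double x, double y, const std::vector<double>& layer_radii) {
    const int kSectors = 8;                          // eight equal 45-degree sectors
    const double kPi = std::acos(-1.0);
    double r = std::hypot(x, y);                     // distance from the base station (origin)
    double theta = std::atan2(y, x);                 // angle in (-pi, pi]
    if (theta < 0) theta += 2.0 * kPi;               // normalize to [0, 2*pi)
    int layer = 0;                                   // innermost circle is layer 0
    while (layer < (int)layer_radii.size() && r > layer_radii[layer]) ++layer;
    int sector = (int)(theta / (2.0 * kPi / kSectors));
    return {layer, sector};
}

int main() {
    std::vector<double> radii = {25.0, 50.0, 75.0, 100.0};    // concentric-circle radii (assumed)
    Section s = assign_section(30.0, 40.0, radii);            // node at distance 50, ~53 degrees
    std::printf("layer=%d sector=%d\n", s.layer, s.sector);   // -> layer=1 sector=1
}
```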
- Optimizing Three-Dimensional Stencil-Operations on Heterogeneous Computing Environments
Abstract: Complex algorithms and enormous data sets require parallel execution of programs to attain results in a reasonable amount of time. Both aspects are combined in the domain of three-dimensional stencil operations, for example in computational fluid dynamics. This work contributes to research on high-level parallel programming by discussing the generalizable implementation of a three-dimensional stencil skeleton that works in heterogeneous computing environments. Two exemplary programs, a gas simulation with the Lattice Boltzmann method and a mean blur, are executed in a multi-node, multi-GPU environment, demonstrating the runtime improvements of heterogeneous computing environments over a sequential program. (A plain sequential stencil sketch follows this entry.)
PubDate: 2024-06-21
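As a point of reference for the kind of per-cell computation such a skeleton parallelizes, here is a plain sequential 7-point mean-blur step over a 3D grid; the grid size and boundary handling are simplified assumptions, and none of this reflects the paper's skeleton API.

```cpp
// Sequential reference sketch: one mean-blur step over the interior of a flattened 3D grid.
#include <vector>
#include <cstddef>

using Grid = std::vector<float>;  // flattened nx*ny*nz grid

inline std::size_t idx(std::size_t x, std::size_t y, std::size_t z,
                       std::size_t nx, std::size_t ny) {
    return (z * ny + y) * nx + x;
}

void mean_blur_step(const Grid& in, Grid& out,
                    std::size_t nx, std::size_t ny, std::size_t nz) {
    for (std::size_t z = 1; z + 1 < nz; ++z)
        for (std::size_t y = 1; y + 1 < ny; ++y)
            for (std::size_t x = 1; x + 1 < nx; ++x) {
                float sum = in[idx(x, y, z, nx, ny)]
                          + in[idx(x - 1, y, z, nx, ny)] + in[idx(x + 1, y, z, nx, ny)]
                          + in[idx(x, y - 1, z, nx, ny)] + in[idx(x, y + 1, z, nx, ny)]
                          + in[idx(x, y, z - 1, nx, ny)] + in[idx(x, y, z + 1, nx, ny)];
                out[idx(x, y, z, nx, ny)] = sum / 7.0f;  // average of the cell and its 6 neighbors
            }
}

int main() {
    std::size_t nx = 8, ny = 8, nz = 8;
    Grid a(nx * ny * nz, 1.0f), b(a);
    mean_blur_step(a, b, nx, ny, nz);   // interior cells stay 1.0f for a constant field
}
```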
- Orchestration Extensions for Interference- and Heterogeneity-Aware Placement for Data-Analytics
Abstract: Today, an ever-increasing number of workloads is pushed and executed on the Cloud. Data center operators and Cloud providers have embraced application co-location and multi-tenancy as first-class system design concerns to effectively serve and manage these huge computational demands. In addition, continuous advancements in hardware technology have made it possible to seamlessly leverage heterogeneous pools of physical machines in data center environments. Even though modern Cloud schedulers and orchestrators adopt application-aware policies to automate time-consuming management tasks at scale, e.g., resource provisioning, they still rely on coarse-grained system metrics such as CPU and/or memory utilization to place incoming applications, thus not considering (1) interference effects provoked by co-located tasks and (2) the impact on performance caused by the diversity of heterogeneous systems' characteristics. The lack of such knowledge in existing state-of-the-art orchestration solutions results in inefficient allocations, which negatively impacts the overall latency distribution delivered by the infrastructure. In this paper, to alleviate this inefficiency, we present a machine learning (ML) based Cloud orchestration extension that takes into account both resource interference and heterogeneity. The framework schedules data-analytics applications on a pool of heterogeneous resources. We evaluate our proposed solution on different application mixes and co-location scenarios and show that the proposed framework improves the tail latency of the distribution of the deployed applications by up to 3.6x compared to the state-of-the-art Kubernetes scheduler. (A toy placement-scoring sketch follows this entry.)
PubDate: 2024-05-28
DOI: 10.1007/s10766-024-00771-2
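Purely to illustrate the idea of interference- and heterogeneity-aware placement (not the paper's ML framework), the sketch below scores candidate nodes by combining a relative node speed with a predicted interference slowdown; every number, field name, and the scoring formula itself are made-up assumptions.

```cpp
// Illustrative sketch: pick the node with the lowest estimated runtime for a new task,
// accounting for hardware heterogeneity (relative_speed) and predicted interference.
#include <string>
#include <vector>
#include <limits>
#include <iostream>

struct Node {
    std::string name;
    double relative_speed;          // >1.0 means faster than the baseline machine
    double predicted_slowdown;      // e.g., output of an interference model, >= 1.0
};

// Lower score = better placement: estimated runtime of the task on this node.
double placement_score(const Node& n, double baseline_runtime) {
    return baseline_runtime / n.relative_speed * n.predicted_slowdown;
}

int main() {
    std::vector<Node> nodes = {
        {"fast-but-noisy", 1.8, 2.5},   // fast CPU, heavy co-location interference
        {"slow-but-idle",  1.0, 1.05},  // slower CPU, almost no interference
    };
    double baseline = 100.0;            // seconds on the reference machine, unloaded
    const Node* best = nullptr;
    double best_score = std::numeric_limits<double>::max();
    for (const auto& n : nodes) {
        double s = placement_score(n, baseline);
        if (s < best_score) { best_score = s; best = &n; }
    }
    std::cout << "place on " << best->name << " (est. " << best_score << " s)\n";
}
```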
- High-Level Programming of FPGA-Accelerated Systems with Parallel Patterns
Abstract: As a result of frequency and power limitations, multi-core processors and accelerators are becoming more and more prevalent in today's systems. To fully utilize such systems, heterogeneous parallel programming is needed, but this introduces new complexities to development. High-level frameworks such as SkePU have been introduced to help alleviate these complexities. SkePU is a skeleton programming framework based on a set of programming constructs implementing computational parallel patterns while presenting a sequential interface to the programmer. Using the various skeleton backends, SkePU programs can execute, without source code modification, on multiple types of hardware such as CPUs, GPUs, and clusters. This paper presents the design and implementation of a new backend for SkePU that adds support for FPGAs. We also evaluate the effect of FPGA-specific optimizations in the new backend and compare it with the existing GPU backend, where the actual devices used are of similar vintage and price point. For simple examples, we find that the FPGA backend's performance is similar to that of the existing GPU backend, while it falls behind in more complex tasks. Finally, some shortcomings of the backend are highlighted and discussed, along with potential solutions. (A toy map-skeleton sketch follows this entry.)
PubDate: 2024-05-27
DOI: 10.1007/s10766-024-00770-3
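To make the skeleton idea concrete, the sketch below shows a toy "map" pattern with a sequential interface and a swappable backend enum. It is not SkePU's actual API, only an illustration of how one user function could be dispatched unchanged to CPU, GPU, or FPGA backends.

```cpp
// Illustrative sketch of a map skeleton: sequential programmer interface, swappable backend.
#include <vector>
#include <iostream>

enum class Backend { Sequential /*, OpenMP, CUDA, FPGA, ... */ };

template <typename T, typename F>
std::vector<T> map_skeleton(const std::vector<T>& in, F user_function,
                            Backend backend = Backend::Sequential) {
    std::vector<T> out(in.size());
    switch (backend) {
        case Backend::Sequential:                           // reference backend; others would
            for (std::size_t i = 0; i < in.size(); ++i)     // offload the same user function
                out[i] = user_function(in[i]);
            break;
    }
    return out;
}

int main() {
    std::vector<float> v = {1.f, 2.f, 3.f};
    auto squared = map_skeleton(v, [](float x) { return x * x; });
    for (float x : squared) std::cout << x << ' ';           // prints: 1 4 9
    std::cout << '\n';
}
```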
- Erasure-Coded Hybrid Writes Based on Data Delta
Abstract: Erasure coding is extensively deployed in today's data centers to tackle prevalent failures, because it can offer higher reliability at lower storage overhead than data replication. However, for each small write, erasure-coded storage systems have to perform a partial write to an entire erasure coding group, resulting in a time-consuming write-after-read. This paper presents DABRI, an erasure-coded hybrid write approach based on data deltas for fast partial writes. DABRI uses data deltas, i.e., the differences between the latest and the original data values, instead of parity deltas to recover failed data. For each partial write, the data node sends the latest data instead of a parity delta to the parity nodes. The original data stored on the data node is read and sent to the parity nodes only when the data stored on the parity nodes is insufficient to maintain data reliability. This bypasses the computation of parity deltas and reduces the number of data reads. For a series of n partial writes to the same data, DABRI performs log-based updates for data and parity in the first write, and in-place data updates with log-based parity updates for the remaining n-1 writes. In addition, the I/O between data nodes and parity nodes is scheduled for parallel execution in each partial write. We implement an erasure-coded prototype storage system based on DABRI to evaluate its performance. Experimental results on real-world traces show that DABRI significantly improves I/O throughput compared with state-of-the-art approaches. (A sketch of the baseline parity-delta write that DABRI avoids follows this entry.)
PubDate: 2024-05-24
DOI: 10.1007/s10766-024-00773-0
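For contrast with DABRI's data-delta approach, the sketch below shows the conventional XOR-based parity-delta partial write (a read-modify-write) that the abstract describes as the costly baseline; the block contents are arbitrary and the single-parity XOR code is a simplification of general erasure codes.

```cpp
// Baseline sketch: classic RAID-5-style partial write, parity' = parity XOR old_data XOR new_data.
#include <vector>
#include <cstdint>
#include <cassert>

using Block = std::vector<uint8_t>;

void baseline_partial_write(Block& data_block, Block& parity_block, const Block& new_data) {
    assert(data_block.size() == parity_block.size() && new_data.size() == data_block.size());
    for (std::size_t i = 0; i < data_block.size(); ++i) {
        uint8_t delta = data_block[i] ^ new_data[i];   // requires reading the old data
        parity_block[i] ^= delta;                      // patch parity with the delta
        data_block[i] = new_data[i];                   // install the new data in place
    }
}

int main() {
    Block d = {1, 2, 3}, e = {4, 5, 6};
    (void)e;                                           // second data block in the stripe
    Block p = {1 ^ 4, 2 ^ 5, 3 ^ 6};                   // parity over d and e
    baseline_partial_write(d, p, Block{7, 8, 9});
    assert(p == (Block{7 ^ 4, 8 ^ 5, 9 ^ 6}));         // parity still consistent with d and e
}
```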
- LSH SimilarityJoin Pattern in FastFlow
Abstract: Similarity joins are recognized as being among the most widely used data processing and analysis operations. We introduce a C++-based high-level parallel pattern, implemented on top of the FastFlow Building Blocks, that provides the programmer with ready-to-use similarity join computations. The SimilarityJoin pattern is implemented according to the MapReduce paradigm enriched with locality-sensitive hashing (LSH) to optimize the whole computation. The new parallel pattern can be used with any C++ serializable data structure and executed on shared- and distributed-memory machines. We present experimental validations of the proposed solution on two different clusters with small and large input datasets to evaluate in-core and out-of-core executions. The performance of the SimilarityJoin pattern is assessed by comparing its execution time against the original hand-tuned Hadoop-based implementation of the LSH-based similarity join algorithms as well as a Spark-based version. The experiments show that the SimilarityJoin pattern (1) offers a significant performance improvement for small and medium datasets and (2) remains competitive for large input datasets that produce out-of-core executions. (An LSH banding sketch follows this entry.)
PubDate: 2024-05-23
DOI: 10.1007/s10766-024-00772-1
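The sketch below illustrates the LSH banding step that underlies such similarity joins: items whose (assumed, precomputed) signatures collide in at least one band become candidate pairs, and only candidates are verified exactly. It is a generic illustration, not the FastFlow pattern's interface.

```cpp
// LSH banding sketch: group items by band hashes and emit candidate pairs from shared buckets.
#include <cstdint>
#include <unordered_map>
#include <set>
#include <vector>
#include <iostream>

using Signature = std::vector<uint32_t>;

std::set<std::pair<int, int>> lsh_candidates(const std::vector<Signature>& sigs,
                                             std::size_t bands, std::size_t rows_per_band) {
    std::set<std::pair<int, int>> candidates;
    for (std::size_t b = 0; b < bands; ++b) {
        std::unordered_map<uint64_t, std::vector<int>> bucket;   // band hash -> item ids
        for (int id = 0; id < (int)sigs.size(); ++id) {
            uint64_t h = 1469598103934665603ull;                 // FNV-1a over one band
            for (std::size_t r = 0; r < rows_per_band; ++r) {
                h ^= sigs[id][b * rows_per_band + r];
                h *= 1099511628211ull;
            }
            bucket[h].push_back(id);
        }
        for (auto& kv : bucket) {                                // all pairs within a bucket
            auto& ids = kv.second;
            for (std::size_t i = 0; i < ids.size(); ++i)
                for (std::size_t j = i + 1; j < ids.size(); ++j)
                    candidates.insert({ids[i], ids[j]});
        }
    }
    return candidates;
}

int main() {
    // Three items, signatures of length 4, split into 2 bands of 2 rows each.
    std::vector<Signature> sigs = {{1, 2, 3, 4}, {1, 2, 9, 9}, {5, 6, 7, 8}};
    for (auto [a, b] : lsh_candidates(sigs, 2, 2))
        std::cout << "candidate pair: " << a << ", " << b << '\n';   // prints: 0, 1
}
```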
- GraphTango: A Hybrid Representation Format for Efficient Streaming Graph Updates and Analysis
Abstract: Streaming graph processing performs batched updates and analytics on a time-evolving graph. The underlying representation format of the graph largely determines the throughput of the update and analytics phases. Existing representation formats usually employ variations of hash tables or adjacency lists. However, a recent study showed that adjacency-list-based approaches perform poorly on heavy-tailed graphs, while hash-table-based approaches suffer on short-tailed graphs. We propose GraphTango, a hybrid representation format that provides excellent update and analytics throughput regardless of the graph's degree distribution. GraphTango dynamically switches among three formats based on a vertex's degree: (1) low-degree vertices store their edges directly with the neighborhood metadata, confining accesses to a single cache line; (2) medium-degree vertices use adjacency lists; and (3) high-degree vertices use hash tables as well as adjacency lists. In the last case, the adjacency list provides fast traversal during the analytics phase, while the hash table provides constant-time lookups during the update phase. We further optimized performance by designing an open-addressing-based hash table that fully utilizes every fetched cache line, and we developed a thread-local, lock-free memory pool that allows fast growing/shrinking of the adjacency lists and hash tables in a multi-threaded environment. We evaluated GraphTango with the help of the SAGA-Bench framework and compared it with four other representation formats: Stinger, Degree-aware Robin Hood Hashing, and two adjacency-list-based formats with different workload balancing schemes. On average, GraphTango provides 4.5x higher insertion throughput, 3.2x higher deletion throughput, and 1.1x higher analytics throughput than the next best format. Furthermore, we integrated GraphTango with the state-of-the-art graph processing frameworks DZiG and RisGraph. Compared to vanilla DZiG and vanilla RisGraph, [GraphTango + DZiG] and [GraphTango + RisGraph] reduce the average batch processing time by 2.3x and 1.5x, respectively. (A miniature degree-switching sketch follows this entry.)
PubDate: 2024-05-18
DOI: 10.1007/s10766-024-00768-x
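The sketch below captures the degree-dependent switching idea in miniature: a per-vertex container that stores a few edges inline, then a plain adjacency vector, then a vector plus hash index so traversal stays fast while lookups become constant time. The thresholds (4 and 64) are arbitrary, and the code is not GraphTango's implementation.

```cpp
// Miniature hybrid neighbor container: inline array -> vector -> vector + hash set by degree.
#include <array>
#include <vector>
#include <unordered_set>
#include <algorithm>
#include <cstdint>

class NeighborSet {
    static constexpr std::size_t kInlineCap = 4;    // "low degree" threshold (assumed)
    static constexpr std::size_t kHashAt    = 64;   // "high degree" threshold (assumed)
    std::array<uint32_t, kInlineCap> inline_{};     // low degree: edges next to metadata
    std::size_t inline_count_ = 0;
    std::vector<uint32_t> list_;                    // medium/high degree: fast traversal
    std::unordered_set<uint32_t> index_;            // high degree only: fast membership

public:
    void insert(uint32_t v) {
        if (contains(v)) return;
        if (list_.empty() && inline_count_ < kInlineCap) { inline_[inline_count_++] = v; return; }
        if (list_.empty())                               // spill inline edges into the vector
            list_.assign(inline_.begin(), inline_.begin() + inline_count_);
        list_.push_back(v);
        if (list_.size() == kHashAt)                     // upgrade: build the hash index
            index_.insert(list_.begin(), list_.end());
        else if (list_.size() > kHashAt)
            index_.insert(v);
    }
    bool contains(uint32_t v) const {
        if (!index_.empty()) return index_.count(v) != 0;                        // high degree
        if (!list_.empty())  return std::find(list_.begin(), list_.end(), v) != list_.end();
        return std::find(inline_.begin(), inline_.begin() + inline_count_, v)   // low degree
               != inline_.begin() + inline_count_;
    }
    std::size_t degree() const { return list_.empty() ? inline_count_ : list_.size(); }
};

int main() {
    NeighborSet n;
    for (uint32_t v = 0; v < 100; ++v) n.insert(v);
    return n.contains(99) && n.degree() == 100 ? 0 : 1;
}
```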
- Yet Another Lock-Free Atom Table Design for Scalable Symbol Management in Prolog
Abstract: Prolog systems rely on an atom table for symbol management, which is usually implemented as a dynamically resizable hash table. This is ideal for single-threaded execution but can become a bottleneck in a multi-threaded scenario. In this work, we replace the original atom table implementation in the YAP Prolog system with a lock-free hash-based data structure, named Lock-free Hash Tries (LFHT), in order to provide efficient and scalable symbol management. Being lock-free, the new implementation also provides better guarantees, namely immunity to priority inversion, deadlocks, and livelocks. Performance results show that the new lock-free LFHT implementation performs better in single-threaded execution and scales much better than the original lock-based, dynamically resizing hash table. (A simplified lock-free interning sketch follows this entry.)
PubDate: 2024-03-23
DOI: 10.1007/s10766-024-00766-z
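As a toy illustration of lock-free symbol interning (not the LFHT design itself), the sketch below publishes atoms into a fixed-size open-addressing table with compare-and-swap; resizing and deletion, which the paper's structure supports, are deliberately omitted.

```cpp
// Simplified lock-free atom table: CAS-publish into a fixed-size open-addressing slot array.
#include <atomic>
#include <array>
#include <string>
#include <functional>
#include <cstddef>

struct Atom { std::string name; };

class AtomTable {
    static constexpr std::size_t kSlots = 1 << 12;           // fixed capacity (assumption)
    std::array<std::atomic<Atom*>, kSlots> slots_{};         // null-initialized slots

public:
    // Returns the canonical Atom for `name`, creating it if no thread has done so yet.
    Atom* intern(const std::string& name) {
        std::size_t h = std::hash<std::string>{}(name) % kSlots;
        Atom* fresh = nullptr;
        for (std::size_t probe = 0; probe < kSlots; ++probe) {
            std::size_t i = (h + probe) % kSlots;
            Atom* cur = slots_[i].load(std::memory_order_acquire);
            if (cur == nullptr) {
                if (!fresh) fresh = new Atom{name};
                if (slots_[i].compare_exchange_strong(cur, fresh,
                                                      std::memory_order_acq_rel))
                    return fresh;                             // we published the atom
                // CAS failed: cur now holds the atom another thread installed.
            }
            if (cur && cur->name == name) { delete fresh; return cur; }
        }
        delete fresh;
        return nullptr;                                       // table full (not handled here)
    }
};

int main() {
    AtomTable t;
    return t.intern("foo") == t.intern("foo") ? 0 : 1;        // same pointer for the same name
}
```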
- Automatic Discovery of Collective Communication Patterns in Parallelized Task Graphs
Abstract: Collective communication APIs equip MPI vendors with the necessary context to optimize cluster-wide operations on the basis of theoretical complexity models and characteristics of the involved interconnects. Modern HPC runtime systems with a programmability focus can perform dependency analysis to eliminate the need for manual communication entirely. Profiting from optimized collective routines in this context often requires global analysis of the implicit point-to-point communication pattern or tight constraints on the data access patterns allowed inside kernels. The Celerity API provides a high degree of freedom for both runtime implementors and application developers by tying transparent work assignment to data access patterns through user-defined range-mapper functions. Canonically, data dependencies are resolved through an intra-node coherence model and inter-node point-to-point communication. This paper presents Collective Pattern Discovery (CPD), a fully distributed, coordination-free method for detecting collective communication patterns on parallelized task graphs. Through extensive scheduling and communication microbenchmarks as well as a strong scaling experiment on a compute-intensive application, we demonstrate that CPD can achieve substantial performance gains in the Celerity model.
PubDate: 2024-03-22
DOI: 10.1007/s10766-024-00767-y
- Special Issue on SAMOS 2022
PubDate: 2024-03-18
DOI: 10.1007/s10766-024-00765-0
- ControlPULP: A RISC-V On-Chip Parallel Power Controller for Many-Core HPC Processors with FPGA-Based Hardware-In-The-Loop Power and Thermal Emulation
Abstract: High-performance computing (HPC) processors are nowadays integrated cyber-physical systems demanding complex and high-bandwidth closed-loop power and thermal control strategies. To efficiently satisfy real-time multi-input multi-output (MIMO) optimal power requirements, high-end processors integrate an on-die power controller system (PCS). While traditional PCSs are based on a simple microcontroller (MCU)-class core, more scalable and flexible PCS architectures are required to support advanced MIMO control algorithms for managing the ever-increasing number of cores, power states, and process, voltage, and temperature variability. This paper presents ControlPULP, an open-source HW/SW RISC-V parallel PCS platform consisting of a single-core MCU with fast interrupt handling coupled with a scalable multi-core programmable cluster accelerator and a specialized DMA engine for the parallel acceleration of real-time power management policies. ControlPULP relies on FreeRTOS to schedule a reactive power control firmware (PCF) application layer. We demonstrate ControlPULP in a power management use case targeting a next-generation 72-core HPC processor. We first show that the multi-core cluster accelerates the PCF, achieving a 4.9x speedup compared to single-core execution and enabling more advanced power management algorithms within the control hyper-period at a small area overhead, about 0.1% of the area of a modern HPC CPU die. We then assess the PCS and PCF by designing an FPGA-based, closed-loop emulation framework that leverages the heterogeneous SoC paradigm, achieving DVFS tracking with a mean deviation within 3% of the plant's thermal design power (TDP) against a software-equivalent model-in-the-loop approach. Finally, we show that the proposed PCF compares favorably with an industry-grade control algorithm under compute-intensive workloads. (A generic power-capping sketch follows this entry.)
PubDate: 2024-02-26
DOI: 10.1007/s10766-024-00761-4
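The sketch below shows, in the most generic terms, one step of a power-capping control loop of the kind such power control firmware runs each control period; the control law, gains, and limits are invented for illustration and do not represent the ControlPULP PCF.

```cpp
// Generic sketch: proportional per-core frequency adjustment toward a TDP power cap.
#include <vector>
#include <algorithm>
#include <cstdio>

void power_cap_step(std::vector<double>& freq_mhz, double measured_power_w, double tdp_w) {
    const double kGainMhzPerWatt = 10.0;                 // proportional gain (assumed)
    const double kFmin = 800.0, kFmax = 3200.0;          // per-core frequency limits (assumed)
    double error_w = tdp_w - measured_power_w;           // positive: headroom, negative: over cap
    double delta = kGainMhzPerWatt * error_w / freq_mhz.size();
    for (double& f : freq_mhz)
        f = std::clamp(f + delta, kFmin, kFmax);
}

int main() {
    std::vector<double> freqs(4, 2000.0);                // 4 cores at 2 GHz
    power_cap_step(freqs, 120.0, 100.0);                 // 20 W over the cap
    std::printf("new per-core frequency: %.0f MHz\n", freqs[0]);   // -> 1950 MHz
}
```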
- Investigating Methods for ASPmT-Based Design Space Exploration in Evolutionary Product Design
Abstract: Nowadays, product development is challenged by increasing system complexity and stringent time-to-market. To handle demanding market requirements, knowledge from prior product generations is used to derive new, but partially similar, product versions. The concept of product generation engineering hence allows manufacturers to release high-quality products within short development times. In this paper, we propose a novel approach to evaluate the similarity of two product implementations based on the concept of the Hamming distance. This allows similarity information to be used in various heuristics as well as in strategies and, thus, to improve the product design process. In a wide set of cases, we investigate the quality and similarity of design points. In the experiments, the use of strategies leads to significantly shorter search times, but also tends to be too restrictive in certain cases. At the same time, the quality of the solutions found in the heuristic design space exploration is as good as or better than that of a search from scratch, and solutions considerably closer to the non-dominated solution front have been found. (A worked Hamming-distance example follows this entry.)
PubDate: 2024-02-24
DOI: 10.1007/s10766-024-00763-2
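A minimal worked example of the Hamming-distance similarity measure, with the two product implementations encoded as hypothetical bit vectors of design decisions:

```cpp
// Worked example: Hamming distance between two design points encoded as bit vectors.
#include <bitset>
#include <iostream>

int main() {
    std::bitset<8> previous_generation("10110100");   // design decisions of the old product
    std::bitset<8> candidate          ("10010110");   // a candidate new design point
    std::size_t hamming = (previous_generation ^ candidate).count();   // differing decisions
    std::cout << "Hamming distance: " << hamming << '\n';              // -> 2
    // A smaller distance means the candidate reuses more of the prior generation,
    // which can steer heuristics and search strategies toward similar designs.
}
```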
- Hardware-Aware Evolutionary Explainable Filter Pruning for Convolutional Neural Networks
Abstract: Filter pruning of convolutional neural networks (CNNs) is a common technique to effectively reduce the memory footprint, the number of arithmetic operations, and, consequently, inference time. Recent pruning approaches also consider the targeted device (e.g., graphics processing units) for CNN deployment to reduce the actual inference time. However, simple metrics, such as the \(\ell^1\)-norm, are used for deciding which filters to prune. In this work, we propose a hardware-aware technique to explore the vast multi-objective design space of possible filter pruning configurations. Our approach incorporates not only the targeted device but also techniques from explainable artificial intelligence for ranking and deciding which filters to prune. For each layer, the number of filters to be pruned is optimized with the objective of minimizing the inference time and the error rate of the CNN. Experimental results show that our approach can speed up inference by 1.40× and 1.30× for VGG-16 on the CIFAR-10 dataset and ResNet-18 on the ILSVRC-2012 dataset, respectively, compared to the state-of-the-art ABCPruner. (A plain \(\ell^1\)-ranking sketch follows this entry.)
PubDate: 2024-02-22
DOI: 10.1007/s10766-024-00760-5
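For reference, the sketch below implements the plain \(\ell^1\)-norm ranking that the paper argues is too coarse on its own; filter shapes and weights are made up.

```cpp
// Baseline sketch: rank convolutional filters by L1 norm, smallest first (pure L1 pruning order).
#include <vector>
#include <numeric>
#include <cmath>
#include <algorithm>
#include <cstdio>

// One filter = flattened weights (in_channels * k * k).
std::vector<std::size_t> l1_pruning_order(const std::vector<std::vector<float>>& filters) {
    std::vector<double> norms(filters.size());
    for (std::size_t f = 0; f < filters.size(); ++f)
        norms[f] = std::accumulate(filters[f].begin(), filters[f].end(), 0.0,
                                   [](double acc, float w) { return acc + std::fabs(w); });
    std::vector<std::size_t> order(filters.size());
    std::iota(order.begin(), order.end(), 0);
    std::sort(order.begin(), order.end(),
              [&](std::size_t a, std::size_t b) { return norms[a] < norms[b]; });
    return order;
}

int main() {
    std::vector<std::vector<float>> filters = {{0.9f, -0.8f}, {0.01f, 0.02f}, {0.5f, 0.4f}};
    auto order = l1_pruning_order(filters);
    std::printf("prune first: filter %zu\n", order[0]);   // -> filter 1 (smallest L1 norm)
}
```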
- A Practical Approach for Employing Tensor Train Decomposition in Edge Devices
Abstract: Deep Neural Networks (DNNs) have made significant advances in various fields, including speech recognition and image processing. Modern DNNs are typically both compute and memory intensive, so their deployment on low-end devices is a challenging task. A well-known technique to address this problem is Low-Rank Factorization (LRF), where a weight tensor is approximated by one or more lower-rank tensors, reducing both the memory size and the number of executed tensor operations. However, employing LRF is a multi-parametric optimization process involving a huge design space in which different design points represent different solutions trading off the number of FLOPs, the memory size, and the prediction accuracy of the DNN model. As a result, extracting an efficient solution is a complex and time-consuming process. In this work, a new methodology is presented that formulates the LRF problem as a (FLOPs vs. memory vs. prediction accuracy) Design Space Exploration (DSE) problem. The DSE space is then drastically pruned by removing inefficient solutions. Our experimental results show that the design space can be efficiently pruned, extracting only a limited set of solutions with improved accuracy, memory, and FLOPs compared to the original (non-factorized) model. Our methodology has been developed as a stand-alone, parameterized module integrated into the T3F library of TensorFlow 2.X. (A small worked factorization example follows this entry.)
PubDate: 2024-02-16
DOI: 10.1007/s10766-024-00762-3
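A small worked example of the trade-off that such a design space exploration navigates: factorizing a dense m x n layer into rank-r factors reduces parameters (and multiply-accumulates) from m*n to r*(m+n). The layer sizes and rank below are hypothetical.

```cpp
// Worked example: parameter count of a dense layer vs. its rank-r factorization W ~= U*V.
#include <cstdio>

int main() {
    long m = 1024, n = 4096, r = 64;                 // hypothetical layer shape and rank
    long dense_params    = m * n;                    // original weight tensor
    long factored_params = r * (m + n);              // U is m x r, V is r x n
    std::printf("dense: %ld params, factored: %ld params, compression: %.1fx\n",
                dense_params, factored_params,
                (double)dense_params / factored_params);   // -> 4194304 vs 327680, ~12.8x
}
```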
- Access Interval Prediction by Partial Matching for Tightly Coupled Memory Systems
Abstract: In embedded systems, tightly coupled memories (TCMs) are usually shared between multiple masters for the purpose of hardware efficiency and software flexibility. On the one hand, memory sharing improves area utilization, but on the other hand, it can lead to performance degradation due to an increase in access conflicts. To mitigate the associated performance penalty, access interval prediction (AIP) has been proposed. In a similar fashion to branch prediction, AIP exploits program flow regularity to predict the cycle of the next memory access. We show that this structural similarity allows for the adaptation of state-of-the-art branch predictors, such as Prediction by Partial Matching (PPM) and the TAgged GEometric history length (TAGE) branch predictor. Our analysis of memory access traces reveals that PPM correctly predicts 99 percent of memory accesses. As PPM does not lend itself to hardware implementation, we also present the PPM-based TAGE access interval predictor, which attains an accuracy of over 97 percent, outperforming all previously presented implementable AIP schemes. (A miniature context-based predictor sketch follows this entry.)
PubDate: 2024-02-13
DOI: 10.1007/s10766-024-00764-1
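The sketch below shows the context-based prediction idea in miniature: the last few observed access intervals index a table of likely next intervals. It is a simplification for illustration, not the PPM or TAGE predictor evaluated in the paper; the history length and counting scheme are assumptions.

```cpp
// Miniature context-based access interval predictor: recent intervals -> most frequent successor.
#include <deque>
#include <map>
#include <vector>
#include <cstdio>

class IntervalPredictor {
    static constexpr std::size_t kHistory = 2;                   // context length (assumed)
    std::deque<int> history_;
    std::map<std::vector<int>, std::map<int, int>> counts_;      // context -> next -> count

public:
    int predict() const {                                        // -1 means "no prediction"
        if (history_.size() < kHistory) return -1;
        auto it = counts_.find({history_.begin(), history_.end()});
        if (it == counts_.end()) return -1;
        int best = -1, best_count = 0;
        for (auto& [next, count] : it->second)
            if (count > best_count) { best = next; best_count = count; }
        return best;
    }
    void observe(int interval) {                                 // update after each access
        if (history_.size() == kHistory)
            counts_[{history_.begin(), history_.end()}][interval]++;
        history_.push_back(interval);
        if (history_.size() > kHistory) history_.pop_front();
    }
};

int main() {
    IntervalPredictor p;
    for (int i : {4, 8, 4, 8, 4}) p.observe(i);                  // a regular 4,8,4,8,... pattern
    std::printf("predicted next interval: %d\n", p.predict());   // -> 8 (follows context 8,4)
}
```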
- Accelerating Massively Distributed Deep Learning Through Efficient Pseudo-Synchronous Update Method
Abstract: In recent years, deep learning models have been successfully applied to large-scale data analysis, including image classification, video captioning, and natural language processing. Large-scale data analyses take advantage of parallel computing to accelerate model training, in which data parallelism has become the dominant method due to its high throughput. Synchronous stochastic gradient descent (SGD) optimization is a well-recognized way to ensure model convergence, but the overhead of gradient synchronization increases linearly as the number of workers increases, causing a huge waste of time. Although some efficiency-first asynchronous methods have been proposed, these methods cannot guarantee convergence in large-scale distributed training. To solve this problem, we propose an efficient pseudo-synchronous approach that updates the network with the previous gradient while performing the synchronization of the new gradient, overlapping computation and synchronization. Because this would otherwise affect the normal convergence of the model, we propose a novel adaptive exponential smoothing predicted gradient algorithm for model optimization, which adaptively adjusts the confidence coefficient of the history gradient to ensure normal convergence of the training process. Experiments show that our method speeds up the training process and achieves accuracy comparable to standard synchronous SGD. In addition, our method has better weak scalability than traditional synchronous SGD and previous related work. We apply our method to image recognition and video captioning applications on up to 12,288 cores with strong scalability on Tianhe II. Evaluations show that, when configured appropriately, our method attains near-linear scalability using 128 nodes, with 93.4% weak scaling efficiency on 64 nodes and 90.5% on 128 nodes. (A single-process simulation sketch follows this entry.)
PubDate: 2023-11-13
DOI: 10.1007/s10766-023-00759-4
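The sketch below simulates, in a single process, the pseudo-synchronous pattern described above: each step applies an exponentially smoothed prediction of the gradient immediately and folds in the freshly synchronized gradient afterwards. The smoothing coefficient is a fixed illustrative assumption, not the paper's adaptive scheme, and no real all-reduce is performed.

```cpp
// Single-process sketch: update with a smoothed gradient prediction, overlap "synchronization".
#include <vector>
#include <cstdio>

int main() {
    std::vector<double> w = {0.0, 0.0};          // model weights
    std::vector<double> predicted = {0.0, 0.0};  // exponentially smoothed gradient history
    const double lr = 0.1, beta = 0.7;           // learning rate and smoothing factor (assumed)

    // Pretend these arrive one step late from an overlapped all-reduce.
    std::vector<std::vector<double>> synced_gradients = {{1.0, -2.0}, {0.8, -1.6}, {0.9, -1.8}};

    for (const auto& fresh : synced_gradients) {
        // 1) Update immediately with the predicted (history-based) gradient, so the step
        //    does not wait for the collective to finish.
        for (std::size_t i = 0; i < w.size(); ++i) w[i] -= lr * predicted[i];
        // 2) When the overlapped synchronization completes, fold the fresh gradient into
        //    the exponential smoother used for the next step's prediction.
        for (std::size_t i = 0; i < w.size(); ++i)
            predicted[i] = beta * predicted[i] + (1.0 - beta) * fresh[i];
    }
    std::printf("w = (%.3f, %.3f)\n", w[0], w[1]);   // weights drift opposite the gradients
}
```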
- A Hybrid Machine Learning Model for Code Optimization
Abstract: The complexity of programming modern heterogeneous systems raises huge challenges. Over the past two decades, researchers have aimed to alleviate these difficulties by employing Classical Machine Learning and Deep Learning techniques within compilers to optimize code automatically. This work presents a novel approach that optimizes code using Classical Machine Learning and Deep Learning techniques together, maximizing their benefits while mitigating their drawbacks. Our proposed model extracts features from the code using Deep Learning and then applies Classical Machine Learning to map these features to specific outputs for various tasks. The effectiveness of our model is evaluated on three downstream tasks: device mapping, optimal thread coarsening, and algorithm classification. Our experimental results demonstrate that our model outperforms previous models in device mapping, with an average accuracy of 91.60% on two datasets, and in optimal thread coarsening, where we are the first to achieve a positive speedup on all four platforms, while achieving a comparable result of 91.48% in the algorithm classification task. Notably, our approach yields better results even with a small dataset, without requiring a pre-training phase or a complex code representation, offering the advantage of reduced training time and data volume requirements.
PubDate: 2023-09-22
DOI: 10.1007/s10766-023-00758-5
- GPU-Based Algorithms for Processing the k Nearest-Neighbor Query on Spatial Data Using Partitioning and Concurrent Kernel Execution
Abstract: Algorithms for answering the k nearest-neighbor (k-NN) query are widely used for queries in spatial databases and for distance classification of a group of query points against a reference dataset to derive the dominating feature class. GPU devices have significantly more processing cores than CPUs and faster device memory than the main memory accessed by CPUs, thus providing higher computing power for processing demanding queries like the k-NN. However, since device and/or main memory may not be able to host entire, rather big, reference and query datasets, storing these datasets on a fast secondary device, such as a solid-state disk (SSD), and retrieving only the partitions required at each stage is, in many practical cases, a feasible solution. We propose and implement the first GPU-based algorithms for processing the k-NN query for big reference and query spatial data stored on SSDs. Based on 3D synthetic and real big spatial data, we experimentally compare these algorithms and highlight the most efficient algorithmic variation, which utilizes a CUDA feature known as Concurrent Kernel Execution to further improve its performance. (A CPU-side per-partition sketch follows this entry.)
PubDate: 2023-07-21
DOI: 10.1007/s10766-023-00755-8
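As a CPU-side reference for the per-partition work (not the paper's GPU algorithms), the sketch below streams reference-point partitions through a bounded max-heap that keeps the k closest points to a query; a GPU variant would process many queries in parallel and overlap partitions with concurrent kernels and SSD reads.

```cpp
// Per-partition k-NN sketch: merge streamed partitions into a size-k max-heap of squared distances.
#include <vector>
#include <queue>
#include <cstdio>

struct Point { double x, y, z; };

double sq_dist(const Point& a, const Point& b) {
    double dx = a.x - b.x, dy = a.y - b.y, dz = a.z - b.z;
    return dx * dx + dy * dy + dz * dz;
}

void knn_merge_partition(const Point& q, const std::vector<Point>& partition, std::size_t k,
                         std::priority_queue<double>& heap) {
    for (const Point& p : partition) {
        double d = sq_dist(q, p);
        if (heap.size() < k) heap.push(d);
        else if (d < heap.top()) { heap.pop(); heap.push(d); }
    }
}

int main() {
    Point q{0, 0, 0};
    std::vector<Point> part1 = {{1, 0, 0}, {5, 5, 5}}, part2 = {{0, 2, 0}, {0, 0, 0.5}};
    std::priority_queue<double> heap;                       // max-heap keeps the k best so far
    knn_merge_partition(q, part1, 2, heap);
    knn_merge_partition(q, part2, 2, heap);
    std::printf("2nd-nearest squared distance: %.2f\n", heap.top());   // -> 1.00
}
```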
- Calculation of Distributed-Order Fractional Derivative on Tensor Cores-Enabled GPU
Abstract: Because calculating the values of the distributed-order Caputo fractional derivative is computationally more expensive than calculating the classical Caputo derivative, new techniques are needed to accelerate it. In this paper, we propose to use the fast matrix “multiply and accumulate” operation available in GPUs that contain so-called tensor cores. We present and experimentally analyze the properties of GPU algorithms that are based on the L1 finite-difference approximation of the derivative and incorporate them into the Crank-Nicolson scheme for the distributed-order time-fractional diffusion equation. The computation of the derivative's values on the GPU was faster than the multi-threaded CPU implementation only for a large number of time steps, with the performance gain growing as the number of time steps increases. Using the single-precision data type increased the error by up to \(2.7\%\) compared with the double-precision data type, while half-precision computations in tensor cores increased the error by up to \(29.5\%\). When solving a time-fractional diffusion equation, GPU algorithms using the single-precision data type were at least three times faster than the CPU implementation for more than 1280 time steps. Data type precision had only a slight influence on the solution error, but execution time increased significantly when the double-precision data type was used for data storage and processing. (The standard L1 approximation is recalled after this entry.)
PubDate: 2023-07-10
DOI: 10.1007/s10766-023-00754-9
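For orientation, a common textbook form of the L1 finite-difference approximation referenced above, for the Caputo derivative of order \(0<\alpha<1\) on a uniform grid \(t_j = j\,\Delta t\), together with a quadrature of the distributed-order integral, is as follows (the paper's exact discretization may differ in details):

\[
{}^{C}D_t^{\alpha} u(t_n) \approx \frac{\Delta t^{-\alpha}}{\Gamma(2-\alpha)} \sum_{j=0}^{n-1} b_j^{(\alpha)} \bigl(u(t_{n-j}) - u(t_{n-j-1})\bigr), \qquad b_j^{(\alpha)} = (j+1)^{1-\alpha} - j^{1-\alpha},
\]
\[
\mathbb{D}_t u(t_n) = \int_0^1 \omega(\alpha)\, {}^{C}D_t^{\alpha} u(t_n)\, d\alpha \approx \sum_{m} w_m\, {}^{C}D_t^{\alpha_m} u(t_n),
\]

where \(\omega(\alpha)\) is the distributed-order weight function and \(w_m\) are quadrature weights. The weighted sums over the solution history are what can be mapped onto matrix multiply-accumulate operations of the kind tensor cores accelerate.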
- Guest Editor’s Note: High-Level Parallel Programming 2021
PubDate: 2023-02-14
DOI: 10.1007/s10766-023-00752-x