International Journal of Parallel Programming
Journal Prestige (SJR): 0.244 · Citation Impact (CiteScore): 1 · Number of Followers: 6 · ISSN (Print): 0885-7458 · ISSN (Online): 1573-7640 · Published by Springer-Verlag
- Accelerating Massively Distributed Deep Learning Through Efficient Pseudo-Synchronous Update Method
Abstract: In recent years, deep learning models have been successfully applied to large-scale data analysis, including image classification, video captioning, and natural language processing. Such analyses exploit parallel computing to accelerate model training, and data parallelism has become the dominant approach owing to its high throughput. Synchronous stochastic gradient descent (SGD) is a well-recognized optimization method that ensures model convergence, but the overhead of gradient synchronization grows linearly with the number of workers, wasting considerable time. Although some efficiency-first asynchronous methods have been proposed, they cannot guarantee convergence in large-scale distributed training. To solve this problem, we propose an efficient pseudo-synchronous approach that updates the network with the previous gradient while the new gradient is being synchronized, overlapping computation and synchronization. Since this would otherwise disturb the normal convergence of the model, we propose a novel adaptive exponential-smoothing predicted-gradient algorithm for model optimization, which adaptively adjusts the confidence coefficient of the history gradient to keep the training process converging. Experiments show that our method speeds up the training process and achieves accuracy comparable to standard synchronous SGD, with better weak scalability than traditional synchronous SGD and previous related work. We apply our method to image recognition and video captioning applications on up to 12,288 cores of Tianhe-2 with strong scalability. Evaluations show that, when configured appropriately, our method attains near-linear scalability on 128 nodes, achieving 93.4% weak-scaling efficiency on 64 nodes and 90.5% on 128 nodes.
PubDate: 2023-11-13
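One concrete reading of the update rule is sketched below in plain C++: the worker steps with an exponentially smoothed predicted gradient so that computation can overlap the synchronization of the next real gradient, and the confidence coefficient `beta` is adapted from how well the prediction agrees with the gradient that eventually arrives. The adaptation rule, names, and constants here are illustrative assumptions, not the paper's exact algorithm.

```cpp
#include <cmath>
#include <cstdio>
#include <vector>

// One pseudo-synchronous SGD step (illustrative sketch).
// w        : model weights
// g_hist   : smoothed history gradient used for prediction
// g_arrived: gradient whose all-reduce has just completed
// beta     : confidence coefficient of the history gradient (adapted below)
void pseudo_sync_step(std::vector<double>& w, std::vector<double>& g_hist,
                      const std::vector<double>& g_arrived,
                      double& beta, double lr) {
    // Update weights with the *predicted* gradient so computation can
    // overlap the synchronization of the next real gradient.
    for (size_t i = 0; i < w.size(); ++i) {
        double g_pred = beta * g_hist[i] + (1.0 - beta) * g_arrived[i];
        w[i] -= lr * g_pred;
        g_hist[i] = g_pred;  // exponential smoothing of the gradient history
    }
    // Adapt beta: if the arrived gradient deviates strongly from the
    // smoothed history, trust the history less (hypothetical rule).
    double dot = 0, nh = 0, na = 0;
    for (size_t i = 0; i < w.size(); ++i) {
        dot += g_hist[i] * g_arrived[i];
        nh  += g_hist[i] * g_hist[i];
        na  += g_arrived[i] * g_arrived[i];
    }
    double cosine = dot / (std::sqrt(nh * na) + 1e-12);
    beta = 0.5 * (1.0 + cosine) * 0.9;   // keep beta in [0, 0.9]
}

int main() {
    std::vector<double> w(4, 0.0), g_hist(4, 0.0), g(4, 0.1);
    double beta = 0.5;
    for (int step = 0; step < 3; ++step)
        pseudo_sync_step(w, g_hist, g, beta, 0.01);
    std::printf("w[0]=%f beta=%f\n", w[0], beta);
}
```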
- A Hybrid Machine Learning Model for Code Optimization
Abstract: The complexity of programming modern heterogeneous systems raises huge challenges. Over the past two decades, researchers have sought to alleviate these difficulties by employing classical machine learning and deep learning techniques within compilers to optimize code automatically. This work presents a novel approach that combines classical machine learning and deep learning, maximizing the benefits of each while mitigating their drawbacks. Our model extracts features from the code using deep learning and then applies classical machine learning to map these features to task-specific outputs. Its effectiveness is evaluated on three downstream tasks: device mapping, optimal thread coarsening, and algorithm classification. Our experimental results show that the model outperforms previous work on device mapping, with an average accuracy of 91.60% across two datasets, and on optimal thread coarsening, where we are the first to achieve a positive speedup on all four platforms, while achieving a comparable 91.48% accuracy on algorithm classification. Notably, our approach yields these results even with a small dataset, without requiring a pre-training phase or a complex code representation, thereby reducing both training time and data-volume requirements.
PubDate: 2023-09-22
- GPU-Based Algorithms for Processing the k Nearest-Neighbor Query on Spatial Data Using Partitioning and Concurrent Kernel Execution
Abstract: Algorithms for answering the k nearest-neighbor (k-NN) query are widely used in spatial databases and for distance-based classification of a group of query points against a reference dataset to derive the dominating feature class. GPUs have significantly more processing cores than CPUs and faster device memory than the main memory accessed by CPUs, providing higher computing power for demanding queries such as k-NN. However, since device and/or main memory may not be able to host the entire reference and query datasets when they are large, storing these datasets on a fast secondary device, such as a solid-state drive (SSD), and retrieving only the partitions required at each stage is, in many practical cases, a feasible solution. We propose and implement the first GPU-based algorithms for processing the k-NN query for big reference and query spatial data stored on SSDs. Using 3d synthetic and real big spatial data, we experimentally compare these algorithms and highlight the most efficient variation, which exploits the CUDA feature known as Concurrent Kernel Execution to further improve performance.
PubDate: 2023-07-21
DOI: 10.1007/s10766-023-00755-8
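The partition-at-a-time processing central to such algorithms can be illustrated host-side: a bounded max-heap keeps the k best candidates seen so far while reference partitions are streamed in one by one (e.g., from an SSD). The following is a minimal CPU sketch of that merge logic only; the paper's contribution is the GPU/CUDA realization, which is not reproduced here.

```cpp
#include <cmath>
#include <cstdio>
#include <queue>
#include <vector>

struct Pt { double x, y, z; };

double dist2(const Pt& a, const Pt& b) {
    double dx = a.x - b.x, dy = a.y - b.y, dz = a.z - b.z;
    return dx*dx + dy*dy + dz*dz;
}

// Stream reference partitions past one query point, keeping the k nearest
// in a max-heap: the heap root is the current k-th best distance, so any
// candidate farther than the root is rejected immediately.
std::vector<double> knn_over_partitions(const Pt& q, size_t k,
        const std::vector<std::vector<Pt>>& partitions) {
    std::priority_queue<double> heap;              // max-heap of distances
    for (const auto& part : partitions) {          // one partition at a time
        for (const auto& r : part) {
            double d = dist2(q, r);
            if (heap.size() < k)       heap.push(d);
            else if (d < heap.top()) { heap.pop(); heap.push(d); }
        }
    }
    std::vector<double> out;
    while (!heap.empty()) { out.push_back(std::sqrt(heap.top())); heap.pop(); }
    return out;                                    // k distances, farthest first
}

int main() {
    std::vector<std::vector<Pt>> parts = {{{0,0,0},{1,1,1}}, {{2,2,2},{0.5,0,0}}};
    for (double d : knn_over_partitions({0,0,0}, 2, parts)) std::printf("%.3f\n", d);
}
```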
- Calculation of Distributed-Order Fractional Derivative on Tensor Cores-Enabled GPU
Abstract: Because calculating the values of the distributed-order Caputo fractional derivative is computationally more expensive than the classical Caputo derivative, new techniques to accelerate it are needed. In this paper we propose to use the fast matrix multiply-and-accumulate operation available in GPUs that contain so-called tensor cores. We present and experimentally analyze GPU algorithms based on the L1 finite-difference approximation of the derivative and incorporate them into the Crank–Nicolson scheme for the distributed-order time-fractional diffusion equation. Computing the derivative's values on the GPU was faster than a multi-threaded CPU implementation only for large numbers of time steps, with the performance gain growing as the number of time steps increases. Using the single-precision data type increased the error by up to \(2.7\%\) compared with double precision, while half-precision computations in tensor cores increased the error by up to \(29.5\%\). When solving a time-fractional diffusion equation, GPU algorithms using single precision were at least three times faster than the CPU implementation for more than 1280 time steps. The data-type precision had only a slight influence on the solution error, but execution time increased significantly when double precision was used for data storage and processing.
PubDate: 2023-07-10
DOI: 10.1007/s10766-023-00754-9
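For reference, the L1 finite-difference scheme that such GPU algorithms build on approximates the Caputo derivative of order \(\alpha \in (0,1)\) on a uniform grid \(t_n = n\,\Delta t\) as

\[
\frac{\partial^{\alpha} u(t_n)}{\partial t^{\alpha}} \approx \frac{\Delta t^{-\alpha}}{\Gamma(2-\alpha)} \sum_{j=0}^{n-1} b_j^{(\alpha)} \left( u_{n-j} - u_{n-j-1} \right), \qquad b_j^{(\alpha)} = (j+1)^{1-\alpha} - j^{1-\alpha},
\]

and the distributed-order derivative is obtained by numerical quadrature over the order with a weight function \(\omega(\alpha)\):

\[
\int_0^1 \omega(\alpha)\, \frac{\partial^{\alpha} u(t_n)}{\partial t^{\alpha}}\, d\alpha \approx \sum_{m=1}^{M} w_m\, \omega(\alpha_m)\, \frac{\partial^{\alpha_m} u(t_n)}{\partial t^{\alpha_m}}.
\]

The history sums for all \(n\) can be arranged as a lower-triangular matrix-vector product, which is what makes the tensor-core multiply-and-accumulate operation applicable. The notation above is the standard textbook form and may differ from the paper's.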
- Partitioning-Aware Performance Modeling of Distributed Graph Processing Tasks
Abstract: Much of the data being produced at large scale by modern applications represents connected entities and their relationships, which can be modeled as large graphs. To extract valuable information from these large datasets, several parallel and distributed graph processing engines have been proposed. These systems are designed to run on large clusters, where resources must be allocated efficiently. To address this problem, this paper presents a performance prediction model for GPS, a popular Pregel-based graph processing framework. By leveraging a micro-partitioning technique, our system can use various partitioning algorithms that greatly reduce execution time compared with the simple hash partitioning commonly used in graph processing systems. Experimental results show that the prediction model achieves an accuracy close to 90%, allowing it to be used in schedulers or to estimate the cost of running graph processing tasks.
PubDate: 2023-05-05
DOI: 10.1007/s10766-023-00753-w
- Guest Editor’s Note: High-Level Parallel Programming 2021
PubDate: 2023-02-14
DOI: 10.1007/s10766-023-00752-x
- Accelerating OCaml Programs on FPGA
Abstract: This paper aims to exploit the massive parallelism of Field-Programmable Gate Arrays (FPGAs) by programming them in OCaml, a multiparadigm, statically typed language. It first presents O2B, an implementation of the OCaml virtual machine using a softcore processor, to run the entire OCaml language on an FPGA. It then introduces Macle, a language for expressing, in ML style, hardware-accelerated user-defined functions, implemented as gates and registers on the same FPGA. Macle allows the programmer to implement pure computations and compose them in parallel. It also supports processing of dynamic data structures such as arrays, matrices, and trees allocated by the OCaml runtime in the memory of the softcore processor. Macle functions can then be called, as hardware accelerators, from OCaml programs executed by O2B. Combining Macle and OCaml code in a single source program makes it easy to prototype FPGA applications mixing numeric and symbolic computations.
PubDate: 2023-01-24
DOI: 10.1007/s10766-022-00748-z
- Distributed Calculations with Algorithmic Skeletons for Heterogeneous Computing Environments
Abstract: Contemporary HPC hardware typically provides several levels of parallelism, e.g. multiple nodes, each with multiple cores (possibly with vectorization) and accelerators. Efficiently programming such systems usually requires skills in combining several low-level frameworks such as MPI, OpenMP, and CUDA, which overburdens programmers without substantial parallel programming experience. One way to overcome this problem and abstract from the details of parallel programming is to use algorithmic skeletons. In this paper, we evaluate multi-node, multi-CPU, and multi-GPU implementations of the most essential skeletons: Map, Reduce, and Zip. Our main contribution is a discussion of the efficiency of using multiple parallelization levels and a consideration of which fine-tuning settings should be offered to the user.
PubDate: 2023-01-07
DOI: 10.1007/s10766-022-00742-5
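As a single-node analogue of what such skeletons offer, standard C++ already expresses Map, Zip, and Reduce without exposing threads: the sketch below uses parallel execution policies to the same effect. The multi-node, multi-GPU skeletons evaluated in the paper keep this calling style while distributing the data; the code here is purely illustrative and is not the paper's framework API.

```cpp
#include <algorithm>
#include <cstdio>
#include <execution>
#include <numeric>
#include <vector>

int main() {
    std::vector<double> a(1000, 2.0), b(1000, 3.0), c(1000);

    // Map: c[i] = a[i] * a[i]
    std::transform(std::execution::par, a.begin(), a.end(), c.begin(),
                   [](double x) { return x * x; });

    // Zip (binary map): c[i] = a[i] + b[i]
    std::transform(std::execution::par, a.begin(), a.end(), b.begin(),
                   c.begin(), [](double x, double y) { return x + y; });

    // Reduce: sum of all elements of c
    double sum = std::reduce(std::execution::par, c.begin(), c.end(), 0.0);
    std::printf("%f\n", sum);
}
```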
- Retraction Note: Designing a Framework for Communal Software: Based on the Assessment Using Relation Modelling
PubDate: 2023-01-02
DOI: 10.1007/s10766-022-00751-4
- Declarative Data Flow in a Graph-Based Distributed Memory Runtime System
Abstract: Runtime systems can significantly reduce the cognitive complexity of scientific applications, narrowing the gap between systems engineering and domain science in HPC. One of the most important aspects is automating data migration within a cluster. Traditional approaches require the application developer to model communication explicitly, for example through MPI primitives. Celerity, a runtime system for accelerator clusters heavily inspired by the SYCL programming model, instead provides a purely declarative approach centered on access patterns. In addition to eliminating the need for explicit data-transfer operations, this provides a basis for efficient and dynamic scheduling at runtime. However, it is currently only suitable for accessing array-like data from runtime-controlled tasks, while real programs often need to interact with opaque data local to each host, such as handles or database connections, and also need a defined way of transporting data into and out of the runtime's virtualized buffers. In this paper, we introduce a graph-based approach and a declarative API for expressing side-effect dependencies between tasks and for moving data from the runtime context to the application space.
PubDate: 2022-12-26
DOI: 10.1007/s10766-022-00743-4
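To give a flavor of the declarative, access-pattern-driven style the paper extends, here is a sketch in the documented Celerity idiom: the range mapper tells the runtime which buffer region each node touches, and data movement is inferred rather than programmed. Treat the header, names, and signatures as approximate and consult the Celerity documentation for the current API; the side-effect/host-object API introduced by the paper is not shown.

```cpp
#include <celerity.h>
#include <vector>

int main() {
    std::vector<float> host(1024, 1.0f);
    celerity::distr_queue q;
    celerity::buffer<float, 1> buf(host.data(), celerity::range<1>(1024));

    q.submit([=](celerity::handler& cgh) {
        // Declare the access pattern: each work item reads and writes its
        // own element. From this declaration alone the runtime derives all
        // required node-to-node transfers; no explicit communication.
        celerity::accessor acc{buf, cgh, celerity::access::one_to_one{},
                               celerity::read_write};
        cgh.parallel_for(celerity::range<1>(1024), [=](celerity::item<1> it) {
            acc[it] = acc[it] * 2.0f;
        });
    });
}
```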
- SMSG: Profiling-Free Parallelism Modeling for Distributed Training of DNN
Abstract: The increasing size of deep neural networks (DNNs) creates a high demand for distributed training. An expert can find good hybrid parallelism strategies, but designing suitable strategies is time-consuming and labor-intensive, so automating parallelism-strategy generation is crucial and desirable for DNN designers. Several automatic search approaches have recently been studied to free experts from this heavy strategy-design work. However, these approaches all rely on a numerical cost model, which requires extensive profiling results that lack portability; such profiling-based approaches cannot lighten the strategy-generation work because the profiled values are not reusable. Our intuition is that there is no need to estimate the actual execution time of distributed training, only to compare the relative cost of different strategies. We propose SMSG (Symbolic Modeling for Strategy Generation), which analyzes cost based on communication and computation semantics. With SMSG, the parallel cost analysis is decoupled from hardware characteristics. SMSG defines cost functions for each kind of operator to quantitatively evaluate the amount of data involved in computation and communication, eliminating the heavy profiling tasks. In addition, SMSG applies functional transformations based on the third homomorphism theorem to control the high search complexity. Our experiments show that SMSG finds good hybrid parallelism strategies whose training performance is similar to the state of the art, covers a wide variety of DNN models with good scalability, and provides the portability across training configurations that profiling-based approaches lack.
PubDate: 2022-12-12
DOI: 10.1007/s10766-022-00741-6
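The idea of comparing strategies by symbolic data volumes rather than measured times can be made concrete with a toy model: the sketch below scores a matrix multiply under data and tensor parallelism by the elements it must compute on and communicate per device. The cost terms are invented for illustration and are not SMSG's actual cost functions.

```cpp
#include <cstdio>

// Toy symbolic cost of C[M,N] = A[M,K] * B[K,N] on p devices, measured in
// element counts rather than seconds, so no hardware profiling is involved.
struct Cost { double compute, comm; };

// Data parallelism: rows of A are split across devices; B is replicated,
// and B's gradient must be all-reduced (~2*K*N elements per device).
Cost data_parallel(double M, double N, double K, double p) {
    return { M / p * K * N, 2.0 * K * N };
}

// Tensor parallelism: columns of B are split; the output C must be
// all-gathered (~M*N elements per device).
Cost tensor_parallel(double M, double N, double K, double p) {
    return { M * K * N / p, M * N };
}

int main() {
    double M = 8192, N = 1024, K = 1024, p = 8;
    Cost dp = data_parallel(M, N, K, p), tp = tensor_parallel(M, N, K, p);
    // A relative comparison is all that is needed to pick a strategy.
    std::printf("DP: compute %.0f  comm %.0f\n", dp.compute, dp.comm);
    std::printf("TP: compute %.0f  comm %.0f\n", tp.compute, tp.comm);
}
```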
- A Fault-Model-Relevant Classification of Consensus Mechanisms for MPI and HPC
Abstract: Large-scale HPC systems experience failures arising from faults in hardware, software, and/or networking, and failure rates continue to grow as systems scale up and out. Crash fault tolerance has so far been the focus of efforts to augment the Message Passing Interface (MPI) for fault-tolerant operation. This narrow fault model (usually restricted to process or node failures) is insufficient: without a more general model for consensus, gaps will arise in the ability to detect, isolate, mitigate, and recover HPC applications efficiently. Focusing on crash failures alone is insufficient because a chain of underlying components may lead to system failures in MPI. Moreover, clusters and leadership-class machines alike often have Reliability, Availability, and Serviceability (RAS) systems that convey predictive and real-time fault and error information, which does not map strictly to process and node crashes. A broader study of failures beyond crashes in MPI, together with consensus mechanisms, will therefore be useful to developers as they continue to design and implement fault-tolerant HPC systems that reflect the faults observable in real systems. We describe key factors that must be considered during consensus-mechanism design, illustrate some current MPI fault-tolerance models through use cases, and offer a novel classification of common consensus mechanisms based on factors such as the network model, failure types, and the role of consensus in the computation process (e.g., fault detection, synchronization), including crash fault tolerance as one category.
PubDate: 2022-12-12
DOI: 10.1007/s10766-022-00749-y
- Portable C++ Code that can Look and Feel Like Fortran Code with Yet Another Kernel Launcher (YAKL)
Abstract: This paper introduces the Yet Another Kernel Launcher (YAKL) C++ portability library, which strives to enable user-level code with the look and feel of Fortran. The intended audience includes both C++ developers and Fortran developers unfamiliar with C++. The C++ portability approach is briefly explained, YAKL's main features are described, and code examples demonstrate YAKL's usage. YAKL fills a niche particularly important to scientific applications seeking to port Fortran code quickly to a portable C++ library. It places heavy emphasis on simplicity, readability, and productivity, with performance efforts mainly targeting Graphics Processing Units (GPUs). Three features are central to YAKL's Fortran-like user-level code: (1) a multi-dimensional Array class with Fortran behavior; (2) a limited library of Fortran intrinsic functions; and (3) an efficient pool allocator that transparently enables cheap, frequent allocations and deallocations of YAKL Arrays. While YAKL allows Fortran-style code, its Arrays can also exhibit C-like behavior, including row-major index ordering and lower bounds of 0. YAKL currently supports CPUs, CPU threading, and Nvidia, AMD, and Intel GPUs.
PubDate: 2022-12-08
DOI: 10.1007/s10766-022-00739-0
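A short sketch in the style of YAKL's published examples shows the Fortran look and feel: a column-major Array with 1-based indices and a `parallel_for` over Fortran-style bounds. This is written from YAKL's documented conventions; exact headers, namespaces, and signatures may differ between versions.

```cpp
#include "YAKL.h"

int main() {
    yakl::init();
    {
        int constexpr nx = 128, ny = 64;
        // Fortran-style Array: column-major storage, lower bounds of 1.
        yakl::Array<double, 2, yakl::memDevice, yakl::styleFortran>
            a("a", nx, ny);
        // Launch over Fortran-style bounds; indices arrive 1-based.
        yakl::fortran::parallel_for(
            yakl::fortran::Bounds<2>(ny, nx),
            YAKL_LAMBDA(int j, int i) { a(i, j) = i + nx * (j - 1); });
    }   // Arrays destruct before finalize, as YAKL requires
    yakl::finalize();
}
```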
- Generic Exact Combinatorial Search at HPC Scale
Abstract: Exact combinatorial search is essential to a wide range of important applications, and there are many large problems that need to be solved quickly. Searches are extremely challenging to parallelise due to a combination of factors, e.g. searches are non-deterministic, dynamic pruning changes the workload, and search tasks have very different runtimes. YewPar is a C++/HPX framework that generalises parallel search by providing a range of sophisticated search skeletons. This paper demonstrates generic high-performance combinatorial search, i.e. that a variety of exact combinatorial searches can easily be parallelised for HPC using YewPar. We present a new mechanism for profiling key aspects of YewPar parallel combinatorial search and demonstrate its value. We exhibit, for the first time, generic exact combinatorial searches at HPC scale. We baseline YewPar against state-of-the-art sequential C++ and C++/OpenMP implementations, and we demonstrate that deploying YewPar on an HPC system can dramatically reduce the runtime of large problems, e.g. from days to around 100 seconds. The maximum relative speedups we achieve are near-linear up to 195 compute nodes (6825 workers) for an enumeration search, super-linear on up to 128 nodes (4480 workers) for an optimisation search (where pruning reduces the workload), and sub-linear on up to 64 nodes (2240 workers) for decision searches.
PubDate: 2022-12-07
DOI: 10.1007/s10766-022-00744-3
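The searches YewPar generalises share one shape: expand a node, prune, recurse. The sketch below is a plain sequential C++ branch-and-bound for a toy knapsack-style optimisation, included only to make that shape concrete; it is not YewPar's skeleton API, which distributes exactly this pattern over HPX localities.

```cpp
#include <cstdio>
#include <vector>

struct Item { double weight, value; };

// Plain branch-and-bound: maximise value under a weight budget.
// Subtrees are pruned against an optimistic bound (all remaining value).
void search(const std::vector<Item>& items, size_t i, double cap,
            double value, double& best) {
    if (value > best) best = value;
    if (i == items.size()) return;
    double optimistic = value;
    for (size_t j = i; j < items.size(); ++j) optimistic += items[j].value;
    if (optimistic <= best) return;          // prune this subtree
    if (items[i].weight <= cap)              // branch 1: take item i
        search(items, i + 1, cap - items[i].weight,
               value + items[i].value, best);
    search(items, i + 1, cap, value, best);  // branch 2: skip item i
}

int main() {
    std::vector<Item> items = {{3, 5}, {4, 6}, {2, 3}, {5, 8}};
    double best = 0;
    search(items, 0, 7.0, 0.0, best);
    std::printf("best value: %.1f\n", best);  // prints 11.0
}
```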
- Assessing Application Efficiency and Performance Portability in Single-Source Programming for Heterogeneous Parallel Systems
Abstract: We analyze the performance portability of the skeleton-based, single-source, multi-backend high-level programming framework SkePU across multiple CPU–GPU heterogeneous systems. We thereby provide a systematic characterization of the application efficiency of SkePU-generated code in comparison with equivalent hand-written code in lower-level parallel programming models such as OpenMP and CUDA. For this purpose, we contribute ports of the STREAM benchmark suite and of part of the NAS Parallel Benchmark suite to SkePU. We show that for STREAM and the EP benchmark, SkePU regularly scores efficiency values above 80%, and that on CPU systems in particular, SkePU can outperform hand-written code.
PubDate: 2022-12-06
DOI: 10.1007/s10766-022-00746-1
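Since the paper ports STREAM to SkePU, a sketch of the triad kernel in SkePU's published style illustrates the single-source model: one skeleton instance, compiled by SkePU's source-to-source precompiler for the chosen backend. Names follow SkePU 3 examples but should be checked against the current documentation.

```cpp
#include <skepu>

// STREAM triad user function: a[i] = b[i] + scalar * c[i].
float triad_f(float b, float c, float scalar) { return b + scalar * c; }

int main() {
    constexpr size_t N = 1 << 20;
    skepu::Vector<float> a(N), b(N, 1.0f), c(N, 2.0f);

    // Map skeleton over two element-wise inputs plus a trailing uniform
    // scalar; the same source targets OpenMP, CUDA, or OpenCL backends.
    auto triad = skepu::Map<2>(triad_f);
    triad(a, b, c, 3.0f);
}
```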
- Efficient High-Level Programming in Plain Java
Abstract: This paper introduces support for developing efficient parallel programs in plain Java in the Gaspar framework. The framework supports a complete set of data-locality optimisations that are introduced with Java annotations, keeping the original Java code platform-independent and postponing performance-tuning tasks. Performance results show that the data-layout improvements can provide a 2.5-fold speedup and, when combined with the exploitation of parallelism, can deliver a 50-fold speedup compared with an unoptimised sequential Java application running on a 24-core machine.
PubDate: 2022-12-05
DOI: 10.1007/s10766-022-00747-0
- Interruptible Nodes: Reducing Queueing Costs in Irregular Streaming Dataflow Applications on Wide-SIMD Architectures
Abstract: Streaming dataflow applications are an attractive target to parallelize on wide-SIMD processors such as GPUs. These applications can be expressed as a pipeline of compute nodes connected by edges, which feed outputs from one node to the next. Streaming applications often exhibit irregular dataflow, where the amount of output produced for one input is unknown a priori. Inserting finite queues between pipeline nodes can ameliorate the impact of irregularity and improve SIMD lane occupancy. The sizing of these queues is driven by both performance and safety considerations: relative queue sizes should be chosen to reduce runtime overhead and maximize throughput, but each node's output queue must be large enough to accommodate the maximum number of outputs produced by one SIMD vector of inputs to the node. When safety and performance considerations conflict, the application may incur excessive memory usage and runtime overhead. In this work, we identify properties of applications that lead to such undesirable behaviors, with examples from applications implemented in our MERCATOR framework for irregular streaming on GPUs. To address these issues, we propose extensions to support interruptible nodes that can be suspended mid-execution if their output queues fill. We illustrate the impact of adding interruptible nodes to the MERCATOR framework on representative irregular streaming applications from the domains of branching search and bioinformatics.
PubDate: 2022-12-05
DOI: 10.1007/s10766-022-00745-2
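The interruptible-node idea can be shown in miniature: a node that, upon finding its bounded output queue full, saves its input cursor and returns control instead of requiring worst-case queue capacity. This sequential C++ sketch only mirrors the control flow; MERCATOR's actual mechanism operates on SIMD vectors on the GPU.

```cpp
#include <cstdio>
#include <vector>

// A node that may emit several outputs per input. If the bounded output
// queue would overflow, it suspends: it records how far it got and
// returns, to be resumed once downstream has drained the queue.
struct Node {
    size_t cursor = 0;                    // next input to process
    bool run(const std::vector<int>& in, std::vector<int>& out_queue,
             size_t capacity) {
        for (; cursor < in.size(); ++cursor) {
            size_t fanout = static_cast<size_t>(in[cursor]);  // irregular
            if (out_queue.size() + fanout > capacity)
                return false;             // suspended mid-stream
            for (size_t k = 0; k < fanout; ++k)
                out_queue.push_back(in[cursor]);
        }
        return true;                      // all inputs consumed
    }
};

int main() {
    Node n;
    std::vector<int> in = {1, 3, 2, 4}, q;
    while (!n.run(in, q, /*capacity=*/4)) {
        std::printf("suspended with %zu queued outputs\n", q.size());
        q.clear();                        // downstream drains the queue
    }
    std::printf("done, %zu outputs in final batch\n", q.size());
}
```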
- Distributed-Memory FastFlow Building Blocks
Abstract: We present the new distributed-memory run-time system (RTS) of the C++-based open-source structured parallel programming library FastFlow. The new RTS enables the execution of FastFlow shared-memory applications written using its Building Blocks (BBs) on distributed systems with minimal changes to the original program. The changes required are all high-level and consist of introducing distributed groups (dgroups), i.e., logical partitions of the BBs composing the application's streaming graph. A dgroup, which in turn is implemented using FastFlow's BBs, can be deployed and executed on a remote machine and communicate with other dgroups according to the original shared-memory FastFlow streaming programming model. We show how distributed groups are defined and how we addressed the problems of data serialization and communication performance tuning, through transparent message batching and scheduling. Finally, we present a study of the overhead introduced by dgroups, considering several benchmarks on a sixteen-node cluster.
PubDate: 2022-12-02
DOI: 10.1007/s10766-022-00750-5
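For readers unfamiliar with FastFlow's Building Blocks, a minimal shared-memory pipeline in the classic FastFlow style looks as follows; per the paper, a distributed run only requires partitioning such a graph into dgroups, whose declaration API is not reproduced here. Signatures follow FastFlow's long-standing headers but may vary across versions.

```cpp
#include <ff/ff.hpp>
#include <iostream>

using namespace ff;

// Source Building Block: emits the integers 0..9 as a stream.
struct Source : ff_node_t<long> {
    long* svc(long*) override {
        for (long i = 0; i < 10; ++i) ff_send_out(new long(i));
        return EOS;                       // end of stream
    }
};

// Sink Building Block: consumes and prints each stream item.
struct Sink : ff_node_t<long> {
    long* svc(long* t) override {
        std::cout << *t << "\n";
        delete t;
        return GO_ON;                     // keep receiving
    }
};

int main() {
    Source src;
    Sink snk;
    ff_Pipe<> pipe(src, snk);             // two-stage pipeline of BBs
    // In the distributed RTS, contiguous parts of this graph would be
    // assigned to dgroups and deployed on different machines.
    return pipe.run_and_wait_end();
}
```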
- Retraction Note: QoS and QoE Enhanced Resource Allocation for Wireless Video Sensor Networks Using Hybrid Optimization Algorithm
PubDate: 2022-11-07
DOI: 10.1007/s10766-022-00738-1
- DSParLib: A C++ Template Library for Distributed Stream Parallelism
Abstract: Stream processing applications deal with millions of data items continuously generated over time. They must often be processed in real time and scale in performance, which requires the use of distributed parallel computing resources. In C/C++, the current state of the art for distributed architectures and high-performance computing is the Message Passing Interface (MPI). However, exploiting stream parallelism using MPI is complex and error-prone because it exposes many low-level details to the programmer. In this work, we introduce DSParLib, a new parallel programming abstraction for implementing distributed stream parallelism. Our abstraction over MPI simplifies parallel programming by providing pattern-based, building-block-oriented development to interconnect, model, and parallelize the data streams found in modern applications. Experiments conducted with five different stream processing applications and the representative PARSEC Ferret benchmark revealed that DSParLib is efficient and flexible. Moreover, DSParLib achieved similar or better performance, required less code, and provided simpler abstractions for expressing parallelism compared with handwritten MPI programs.
PubDate: 2022-10-29
DOI: 10.1007/s10766-022-00737-2