Batched dense linear algebra kernels are becoming ubiquitous in scientific applications, ranging from tensor contractions in deep learning to data compression in hierarchical low-rank matrix approximation. Within a single API call, these kernels are capable of simultaneously launching up to thousands of similar matrix computations, removing the expensive overhead of multiple API calls while increasing the occupancy of the underlying hardware. A challenge is that for the existing hardware landscape (x86, GPUs, etc.) , only a subset of the required batched operations is implemented by the vendors, with limited support for very small problem sizes. We describe the design and performance of a new class of batched triangular dense linear algebra kernels on very small data sizes (up to 256) using single and multiple GPUs. PubDate: Fri, 03 May 2019 00:00:00 GMT

Abstract: Jack Dongarra, Mark Gates, Azzam Haidar, Jakub Kurzak, Piotr Luszczek, Panruo Wu, Ichitaro Yamazaki, Asim Yarkhan, Maksims Abalenkovs, Negin Bagherpour, Sven Hammarling, Jakub Šístek, David Stevens, Mawussi Zounon, Samuel D. Relton

The recent version of the Parallel Linear Algebra Software for Multicore Architectures (PLASMA) library is based on tasks with dependencies from the OpenMP standard. The main functionality of the library is presented. Extensive benchmarks are targeted on three recent multicore and manycore architectures, namely, an Intel Xeon, Intel Xeon Phi, and IBM POWER 8 processors. PubDate: Fri, 03 May 2019 00:00:00 GMT

Abstract: Fabio Luporini, Michael Lange, Christian T. Jacobs, Gerard J. Gorman, J. Ramanujam, Paul H. J. Kelly

Sparse tiling is a technique to fuse loops that access common data, thus increasing data locality. Unlike traditional loop fusion or blocking, the loops may have different iteration spaces and access shared datasets through indirect memory accesses, such as A[map[i]]—hence the name “sparse.” One notable example of such loops arises in discontinuous-Galerkin finite element methods, because of the computation of numerical integrals over different domains (e.g., cells, facets). The major challenge with sparse tiling is implementation—not only is it cumbersome to understand and synthesize, but it is also onerous to maintain and generalize, as it requires a complete rewrite of the bulk of the numerical computation. PubDate: Fri, 03 May 2019 00:00:00 GMT

An algorithm for multiplying a chain of Kronecker products by a matrix is described. The algorithm does not require that the Kronecker chain actually be computed and the main computational work is a series of matrix-matrix multiplications. Use of the algorithm can lead to substantial savings in both memory requirements and computational speed. Although similar algorithms have been described before, this article makes two novel contributions. First, it shows how shuffling of data can be (largely) avoided. Second, it provides a simple method to determine the optimal ordering of the workflow. PubDate: Fri, 03 May 2019 00:00:00 GMT

Abstract: Vittorio Maniezzo, Marco A. Boschetti, Antonella Carbonaro, Moreno Marzolla, Francesco Strappaveccia

Mobile platforms have matured to a point where they can provide the infrastructure required to support sophisticated optimization codes. This opens the possibility to envisage new interest for distributed application codes and the opportunity to intensify research on optimization algorithms requiring limited computational resources, as provided by mobile platforms. In this article, we report on some exploratory experience in this area. We illustrate some practical, real-world cases where running optimization programs on mobile or embedded devices can be useful, with particular emphasis on matheuristics approaches. Then, we discuss a practical use case involving the feasibility version of the generalized assignment problem (GAP). PubDate: Mon, 29 Apr 2019 00:00:00 GMT

Abstract: Dalal Sukkari, Hatem Ltaief, Aniello Esposito, David Keyes

This article presents a high-performance software framework for computing a dense SVD on distributed-memory manycore systems. Originally introduced by Nakatsukasa et al. (2010) and Nakatsukasa and Higham (2013), the SVD solver relies on the polar decomposition using the QR Dynamically Weighted Halley algorithm (QDWH). Although the QDWH-based SVD algorithm performs a significant amount of extra floating-point operations compared to the traditional SVD with the one-stage bidiagonal reduction, the inherent high level of concurrency associated with Level 3 BLAS compute-bound kernels ultimately compensates for the arithmetic complexity overhead. Using the ScaLAPACK two-dimensional block cyclic data distribution with a rectangular processor topology, the resulting QDWH-SVD further reduces excessive communications during the panel factorization, while increasing the degree of parallelism during the update of the trailing submatrix, as opposed to relying on the default square processor grid. PubDate: Fri, 26 Apr 2019 00:00:00 GMT

Abstract: Jan Winkelmann, Paul Springer, Edoardo Di Napoli

Solving dense Hermitian eigenproblems arranged in a sequence with direct solvers fails to take advantage of those spectral properties that are pertinent to the entire sequence and not just to the single problem. When such features take the form of correlations between the eigenvectors of consecutive problems, as is the case in many real-world applications, the potential benefit of exploiting them can be substantial. We present the Chebyshev Accelerated Subspace iteration Eigensolver (ChASE), a modern algorithm and library based on subspace iteration with polynomial acceleration. Novel to ChASE is the computation of the spectral estimates that enter in the filter and an optimization of the polynomial degree that further reduces the necessary floating-point operations. PubDate: Fri, 26 Apr 2019 00:00:00 GMT

In this remark, we identify the cause of the loss of accuracy in the computation of the Faddeyeva function, w(z), near the real axis when using Algorithm 680. We provide a simple correction to this problem that allows us to restore this code as one of the important reference routines for accuracy comparisons. PubDate: Fri, 26 Apr 2019 00:00:00 GMT

We discuss the design decisions, design alternatives, and rationale behind the third generation of Peano, a framework for dynamically adaptive Cartesian meshes derived from spacetrees. Peano ties the mesh traversal to the mesh storage and supports only one element-wise traversal order resulting from space-filling curves. The user is not free to choose a traversal order herself. The traversal can exploit regular grid subregions and shared memory as well as distributed memory systems with almost no modifications to a serial application code. We formalize the software design by means of two interacting automata—one automaton for the multiscale grid traversal and one for the application-specific algorithmic steps. PubDate: Thu, 18 Apr 2019 00:00:00 GMT

This article shows how to use performance and data profile benchmarking tools to improve the performance of algorithms. We propose to achieve this goal by defining and approximately solving suitable optimization problems involving the parameters of the algorithm under consideration. Because these problems do not have derivatives and may involve integer variables, we suggest using a mixed-integer derivative-free optimizer for this task. A numerical illustration is presented (using the BFO package), which indicates that the obtained gains are potentially significant. PubDate: Thu, 18 Apr 2019 00:00:00 GMT