IBM Journal of Research and Development
Journal Prestige (SJR): 0.275
Citation Impact (citeScore): 1
Number of Followers: 18  
Hybrid journal (it can contain Open Access articles)
ISSN (Print) 0018-8646
Published by IBM
  • Umpire: Application-focused management and coordination of complex hierarchical memory
    • Authors: D. A. Beckingsale;M. J. McFadden;J. P. S. Dahm;R. Pankajakshan;R. D. Hornung;
      Abstract: Advanced architectures like Sierra provide a wide range of memory resources that must often be carefully controlled by the user. These resources have varying capacities, access timing rules, and visibility to different compute resources. Applications must intelligently allocate data in these spaces, and depending on the total amount of memory required, applications may also be forced to move data between different parts of the memory hierarchy. Finally, applications using multiple packages must coordinate effectively to ensure that each package can use the memory resources it needs. To address these challenges, we present Umpire, an application-oriented library for managing memory resources. Specifically, Umpire provides support for querying memory resources, provisioning and allocating memory, and memory introspection. It allows computer scientists and computational physicists to efficiently program the memory hierarchies of current and future high-performance computing architectures, without tying their application to specific hardware or software. In this article, we describe the design and implementation of Umpire and present case studies from the integration of Umpire into applications that are currently running on Sierra.
      PubDate: May-July 1 2020
      Issue No: Vol. 64, No. 3/4 (2020)
  • Preface: Summit and Sierra Supercomputers
    • Pages: 1 - 4
      PubDate: May-July 1 2020
      Issue No: Vol. 64, No. 3/4 (2020)
  • The CORAL supercomputer systems
    • Authors: W. A. Hanson;
      Pages: 1:1 - 1:10
      Abstract: In 2014, the U.S. Department of Energy (DoE) initiated a multiyear collaboration between Oak Ridge National Laboratory (ORNL), Argonne National Laboratory, and Lawrence Livermore National Laboratory (LLNL), known as “CORAL,” the next major phase in the DoE's scientific computing roadmap. The IBM CORAL systems are based on a fundamentally new data-centric architecture, where compute power is embedded everywhere data resides, combining powerful central processing units (CPUs) with graphics processing units (GPUs) optimized for scientific computing and artificial intelligence workloads. The IBM CORAL systems were built on the combination of mature technologies: 9th-generation POWER CPU, 6th-generation NVIDIA GPU, and 5th-generation Mellanox InfiniBand. These systems are providing scientists with computing power to solve challenges in many research areas beyond previously possible. This article provides an overview of the system solutions deployed at ORNL and LLNL.
      PubDate: May-July 1 2020
      Issue No: Vol. 64, No. 3/4 (2020)
  • Troubleshooting deep-learner training data problems using an evolutionary algorithm on Summit
    • Authors: M. Coletti;A. Fafard;D. Page;
      Pages: 1 - 12
      Abstract: Architectural and hyperparameter design choices can influence deep-learner (DL) model fidelity but can also be affected by malformed training and validation data. However, practitioners may spend significant time refining layers and hyperparameters before discovering that distorted training data were impeding the training progress. We found that an evolutionary algorithm (EA) can be used to troubleshoot this kind of DL problem. An EA evaluated thousands of DL configurations on Summit that yielded no overall improvement in DL performance, which suggested problems with the training and validation data. We suspected that contrast limited adaptive histogram equalization enhancement that was applied to previously generated digital surface models, for which we were training DLs to find errors, had damaged the training data. Subsequent runs with an alternative global normalization yielded significantly improved DL performance. However, the DL intersection over unions still exhibited consistent subpar performance, which suggested further problems with the training data and DL approach. Nonetheless, we were able to diagnose this problem within a 12-hour span via Summit runs, which prevented several weeks of unproductive trial-and-error DL configuration refinement and allowed for a more timely convergence on an ultimately viable solution.
      PubDate: May-July 1 2020
      Issue No: Vol. 64, No. 3/4 (2020)
  • Redefining IBM power system design for CORAL
    • Authors: S. Roberts;C. Mann;C. Marroquin;
      Pages: 2:1 - 2:10
      Abstract: Stipulations in the 2014 Collaboration of Oak Ridge, Argonne, and Livermore (CORAL) joint procurement activity not only motivated a fundamental change in IBM's high-performance computer design, which refocused IBM power systems on compute nodes that can scale to 200 petaflops with access to 2.5 PB of memory, but also served the commercial market for single-server applications. The distribution of both processing elements and memory required a careful look at data movement. The resultant AC922 POWER9 system features NVIDIA V100 GPUs with cache line access granularity, more than double the IO bandwidth of PCIe Gen3, and low-latency interfaces interconnected by the state-of-the-art dual-rail Mellanox CAPI EDR HCAs running at 50 Gb/s. With processing units designed to operate at 250 and 300 W, a single system can produce up to 3,080 kW. The overall CORAL solutions achieved power usage effectiveness rankings in the top ten on the Green500. Previous power designs used uniquely designed cabinets and scaled-up infrastructure to achieve efficiency. For successful commercial use, our design uses industry-standard 19-in drawers and racks. Both air- and water-cooled solutions allow for use in a wide range of customer environments. This article documents the novel design features that facilitate data movement and enable new coherent programming models. It describes how three generations of system designs became the foundation for the CORAL contract fulfillment and illustrates key features and specifications of the final product.
      PubDate: May-July 1 2020
      Issue No: Vol. 64, No. 3/4 (2020)
  • Sierra Center of Excellence: Lessons learned
    • Authors: J. P. Dahm;D. F. Richards;A. Black;A. D. Bertsch;L. Grinberg;I. Karlin;S. Kokkila-Schumacher;E. A. León;J. R. Neely;R. Pankajakshan;O. Pearce;
      Pages: 2:1 - 2:14
      Abstract: The introduction of heterogeneous computing via GPUs from the Sierra architecture represented a significant shift in direction for computational science at Lawrence Livermore National Laboratory (LLNL), and therefore required significant preparation. Over the last five years, the Sierra Center of Excellence (CoE) has brought employees with specific expertise from IBM and NVIDIA together with LLNL in a concentrated effort to prepare applications, system software, and tools for the Sierra supercomputer. This article shares the process we applied for the CoE and documents lessons learned during the collaboration, with the hope that others will be able to learn from both our success and intermediate setbacks. We describe what we have found to work for the management of such a collaboration and best practices for algorithms and source code, system configuration and software stack, tools, and application performance.
      PubDate: May-July 1 2020
      Issue No: Vol. 64, No. 3/4 (2020)
  • The high-speed networks of the Summit and Sierra supercomputers
    • Authors: C. B. Stunkel;R. L. Graham;G. Shainer;M. Kagan;S. S. Sharkawi;B. Rosenburg;G. A. Chochia;
      Pages: 3:1 - 3:10
      Abstract: Oak Ridge National Laboratory's Summit supercomputer and Lawrence Livermore National Laboratory's Sierra supercomputer utilize InfiniBand interconnect in a Fat-tree network topology, interconnecting all compute nodes, storage nodes, administration, and management nodes into one linearly scalable network. These networks are based on Mellanox 100-Gb/s EDR InfiniBand ConnectX-5 adapters and Switch-IB2 switches, with compute-rack packaging and cooling contributions from IBM. These devices support in-network computing acceleration engines such as Mellanox Scalable Hierarchical Aggregation and Reduction Protocol, graphics processor unit (GPU) Direct RDMA, advanced adaptive routing, Quality of Service, and other network and application acceleration. The overall IBM Spectrum Message Passing Interface (MPI) messaging software stack implements Open MPI, and was a collaboration between IBM, Mellanox, and NVIDIA to optimize direct communication between endpoints, whether compute nodes (with IBM POWER CPUs, NVIDIA GPUs, and flash memory devices), or POWER-hosted storage nodes. The Fat-tree network can isolate traffic among the compute partitions and to/from the storage subsystem, providing more predictable application performance. In addition, the high level of redundancy of this network and its reconfiguration capability ensures reliable high performance even after network component failures. This article details the hardware and software architecture and performance of the networks and describes a number of the high-performance computing (HPC) enhancements engineered into this generation of InfiniBand.
      PubDate: May-July 1 2020
      Issue No: Vol. 64, No. 3/4 (2020)
  • Building a high-performance resilient scalable storage cluster for CORAL using IBM ESS
    • Authors: R. Islam;G. Shah;
      Pages: 4:1 - 4:9
      Abstract: A high-performance, scalable, and resilient storage subsystem is essential for delivering and maintaining consistent performance and high utilization expected from a modern supercomputer. IBM delivered two systems under the CORAL program, both of which used IBM Spectrum Scale and IBM Elastic Storage Server (ESS) as the storage solution. The larger of the two CORAL clusters is composed of 77 building blocks of ESS, each of which consists of a pair of high-performance I/O Server nodes connected to four high-density storage enclosures. These ESS building blocks are interconnected via a redundant InfiniBand EDR network to form a storage cluster that provides a global namespace aggregating performance over 32,000 commodity disks. The IBM Spectrum Scale for ESS runs high-performance erasure coding on each building block and provides a single global name space across all the building blocks. The IBM Spectrum Scale features deliver a highly resilient, high-performance storage subsystem using ESS. These features include recent improvements for efficient buffer management and fast efficient low-latency communication. CORAL I/O performance results include large-block streaming throughput of over 2.4 TB/s, ability to create over 1 M 32-KB files per second, and enabling an aggregate rate of 30 K zero-length file creates per second in a shared directory from multiple nodes. This article describes the design and implementation of the ESS storage cluster; the innovations required to meet the performance, scale, manageability, and reliability goals; and challenges we had to overcome as we deployed a system of such unprecedented I/O capabilities.
      PubDate: May-July 1 2020
      Issue No: Vol. 64, No. 3/4 (2020)
  • Summit and Sierra supercomputer cooling solutions
    • Authors: S. Tian;T. Takken;V. Mahaney;C. Marroquin;M. Schultz;M. Hoffmeyer;Y. Yao;K. O'Connell;A. Yuksel;P. Coteus;
      Pages: 5:1 - 5:12
      Abstract: Achieving optimal data center cooling efficiency requires effective water cooling of high-heat-density components, coupled with optimal warmer water temperatures and the correct order of water preheating from any air-cooled components. The Summit and Sierra supercomputers implemented efficient cooling by using high-performance cold plates to directly water-cool all central processing units (CPUs) and graphics processing units (GPUs) processors with warm inlet water. Cost performance was maximized by directly air-cooling the 10% to 15% of the compute drawer heat load generated by the lowest heat density components. For the Summit system, a rear-door heat exchanger allowed zero net heat load to air; the overall system efficiency was optimized by using the preheated water from the heat exchanger as an input to cool the higher power CPUs and GPUs.
      PubDate: May-July 1 2020
      Issue No: Vol. 64, No. 3/4 (2020)
  • Concurrent installation and acceptance of Summit and Sierra supercomputers
    • Authors: T. Liebsch;
      Pages: 6:1 - 6:8
      Abstract: The deployment of any high-performance computer system typically includes an acceptance process to validate the system's specifications, covering hardware, software, and delivered services. In this article, we describe the efforts undertaken by IBM and its partners to accomplish early preparations and then concurrently deliver, stabilize, and accept the two fastest supercomputers in the world at the time of deployment.
      PubDate: May-July 1 2020
      Issue No: Vol. 64, No. 3/4 (2020)
  • Cluster system management
    • Authors: N. Besaw;L. Scheidenbach;J. Dunham;S. Kaur;A. Ohmacht;F. Pizzano;Y. Park;
      Pages: 7:1 - 7:9
      Abstract: Cluster system management (CSM) was co-designed with the Department of Energy Labs to provide the support necessary to effectively manage the Summit and Sierra supercomputers. The CSM system administration tools provide a unified view of a large-scale cluster and the ability to examine and understand data from multiple sources. CSM consists of five components: 1) application programming interfaces (APIs) and infrastructure; 2) Big Data Store; 3) support for reliability, availability, and serviceability (RAS); 4) Diagnostic and Health Check; and 5) support for job management. APIs and infrastructure provide lightweight daemons for compute nodes, hardware and software inventory collection, job accounting, and RAS. Logs, environmental data, and performance data are collected in the Big Data Store for analysis. RAS events can trigger corrective actions by CSM. Diagnostic and Health Check are provided through a diagnostic framework and test results collection. To support job management, CSM coordinates with the Job Step Manager to provide an overlay network of JSM daemons. CSM is open source; its source code and documentation are available online.
      PubDate: May-July 1 2020
      Issue No: Vol. 64, No. 3/4 (2020)
  • Scalable, fault-tolerant job step management for high-performance systems
    • Authors: D. Solt;J. Hursey;A. Lauria;D. Guo;X. Guo;
      Pages: 8:1 - 8:9
      Abstract: Scientific applications on the CORAL systems demanded a fault-tolerant, scalable job launch infrastructure for complex workflows with multiple job steps within an allocation. The distinct design of IBM's Job Step Manager (JSM) infrastructure, working in concert with Load Sharing Facility (LSF) and Cluster System Management (CSM), achieves these goals. JSM demonstrated launching over three-quarters of a million processes in under a minute while providing efficient process management interface for exascale-based services to communication libraries, such as parallel active messaging interface and message passing interface, and tools over the management network. JSM relies on the parallel task support library to provide a fault-tolerant, scalable communication medium between the JSM daemons. Application workflows using job steps harness the unique resource set abstraction concept in JSM to manage CPUs, GPUs, and memory between groups of processes, possibly in discrete job steps, sharing a node. The resource set concept gives JSM the opportunity to better organize process placement to optimize, for example, CPU-to-GPU communication. Applications that need complete control over the shaping of the resource sets and the placement, binding, and ordering of processes within them can leverage JSM's co-designed Explicit Resource File mechanism. This article explores the design decisions, implementation considerations, and performance optimizations of IBM's JSM infrastructure to support scientific discovery on the CORAL systems.
      PubDate: May-July 1 2020
      Issue No: Vol. 64, No. 3/4 (2020)
  • Communication protocol optimization for enhanced GPU performance
    • Authors: S. S. Sharkawi;G. A. Chochia;
      Pages: 9:1 - 9:9
      Abstract: The U.S. Department of Energy CORAL program systems SUMMIT and SIERRA are based on hybrid servers comprising IBM POWER9 CPUs and NVIDIA V100 graphics processing units (GPUs) connected by two extended data rate (EDR) links to a high-speed InfiniBand Network. A major challenge to the communication software stack is to optimize performance for all combinations of data origin and destination: host or GPU memory, same or different server. Alternate paths exist for routing data from GPU memory. When origin and destination are on different servers, it can be sent either via host memory or bypassing host memory with GPU direct feature. When origin and destination are on the same server, host memory can be bypassed with peer-to-peer inter process communication (IPC). For large messages pipelining makes host memory data path competitive with GPU direct. In this article, we explain the techniques used in Spectrum MPI Parallel Active Message Interface layer to cache memory types and attributes in order to reduce the overhead associated with calling the CUDA application programming interface (API); in addition, we detail the different protocols used for different memory types, device memory, managed memory, and host memory. To illustrate, the caching technique achieved a device-to-device latency improvement of 26% for intranode transfers and 19% for internode transfers.
      PubDate: May-July 1 2020
      Issue No: Vol. 64, No. 3/4 (2020)
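The caching idea in the abstract above (remembering the memory type of a buffer so that repeated transfers do not pay for a fresh driver query each time) can be illustrated with a minimal sketch. Everything named here is hypothetical: `MemoryTypeCache`, `fake_query`, and the address ranges are stand-ins, and `fake_query` only imitates an expensive attribute lookup. This is not the Spectrum MPI or PAMI implementation, just one plausible shape for such a cache.

```python
import bisect

# Hypothetical allocation table standing in for what a driver would know:
# (base address, size in bytes, memory type).
ALLOCATIONS = [(0x1000, 0x100, "device"), (0x2000, 0x80, "managed")]


def fake_query(addr):
    """Stand-in for an expensive pointer-attribute query (assumption)."""
    for base, size, kind in ALLOCATIONS:
        if base <= addr < base + size:
            return base, size, kind
    # Anything not in the table is treated as a 1-byte host range.
    return addr, 1, "host"


class MemoryTypeCache:
    """Caches the memory type of whole allocations, keyed by address range."""

    def __init__(self, query_fn):
        self._query_fn = query_fn
        self._starts = []   # sorted base addresses of cached ranges
        self._ranges = {}   # base -> (end address, memory type)
        self.queries = 0    # number of times the slow path was taken

    def mem_type(self, addr):
        # Fast path: find the cached range with the largest base <= addr.
        i = bisect.bisect_right(self._starts, addr) - 1
        if i >= 0:
            base = self._starts[i]
            end, kind = self._ranges[base]
            if addr < end:
                return kind
        # Slow path: ask the (expensive) query and remember the whole range.
        self.queries += 1
        base, size, kind = self._query_fn(addr)
        if base not in self._ranges:
            bisect.insort(self._starts, base)
        self._ranges[base] = (base + size, kind)
        return kind

    def invalidate(self, base):
        """Drop a cached range, e.g. when the allocation is freed."""
        if base in self._ranges:
            del self._ranges[base]
            self._starts.remove(base)
```

A cache like this has to be invalidated when memory is freed or reused, which is why the sketch exposes `invalidate`; how the real layer tracks frees is not stated in the abstract.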
  • Pre-exascale accelerated application development: The ORNL Summit
    • Authors: L. Luo;T. P. Straatsma;L. E. Aguilar Suarez;R. Broer;D. Bykov;E. F. D'Azevedo;S. S. Faraji;K. C. Gottiparthi;C. De Graaf;J. A. Harris;R. W. A. Havenith;H. J. Aa. Jensen;W. Joubert;R. K. Kathir;J. Larkin;Y. W. Li;D. I. Lyakh;O. E. B. Messer;M. R. Norman;J. C. Oefelein;R. Sankaran;A. F. Tillack;A. L. Barnes;L. Visscher;J. C. Wells;M. Wibowo;
      Pages: 11:1 - 11:21
      Abstract: High-performance computing (HPC) increasingly relies on heterogeneous architectures to achieve higher performance. In the Oak Ridge Leadership Facility (OLCF), Oak Ridge, TN, USA, this trend continues as its latest supercomputer, Summit, entered production in early 2019. The combination of IBM POWER9 CPU and NVIDIA V100 GPU, along with a fast NVLink2 interconnect and other latest technologies, pushes system performance to a new height and breaks the exascale barrier by certain measures. Due to Summit's powerful GPUs and much higher GPU–CPU ratio, offloading to accelerators becomes a requirement for any application, which intends to effectively use the system. To facilitate navigating a complex landscape of competing heterogeneous architectures, a collection of applications from a wide spectrum of scientific domains is selected for early adoption on Summit. In this article, the experience and lessons learned are summarized, in the hope of providing useful guidance to address new programming challenges, such as scalability, performance portability, and software maintainability, for future application development efforts on heterogeneous HPC systems.
      PubDate: May-July 1 2020
      Issue No: Vol. 64, No. 3/4 (2020)
  • An open-source solution to performance portability for Summit and Sierra
    • Authors: G. T. Bercea;A. Bataev;A. E. Eichenberger;C. Bertolli;J. K. O'Brien;
      Pages: 12:1 - 12:23
      Abstract: Programming models that use a higher level of abstraction to express parallelism can target both CPUs and any attached devices, alleviating the maintainability and portability concerns facing today's heterogeneous systems. This article describes the design, implementation, and delivery of a compliant OpenMP device offloading implementation for IBM-NVIDIA heterogeneous servers composing the Summit and Sierra supercomputers in the mainline open-source Clang/LLVM compiler and OpenMP runtime projects. From a performance perspective, reconciling the GPU programming model, best suited for massively parallel workloads, with the generality of the OpenMP model was a significant challenge. To achieve both high performance and full portability, we map high-level programming patterns to fine-tuned code generation schemes and customized runtimes that preserve the OpenMP semantics. In the compiler, we implement a low-overhead single-program multiple-data scheme that leverages the GPU native execution model and a fallback scheme to support the generality of OpenMP. Modular design enables the implementation to be extended with new schemes for frequently occurring patterns. Our implementation relies on key optimizations: sharing data among threads, leveraging unified memory, aggressive inlining of runtime calls, memory coalescing, and runtime simplification. We show that for commonly used patterns, performance on the Summit and Sierra GPUs matches that of hand-written native CUDA code.
      PubDate: May-July 1 2020
      Issue No: Vol. 64, No. 3/4 (2020)
  • Hybrid CPU/GPU tasks optimized for concurrency in OpenMP
    • Authors: A. E. Eichenberger;G.-T. Bercea;A. Bataev;L. Grinberg;J. K. O'Brien;
      Pages: 13:1 - 13:14
      Abstract: Sierra and Summit supercomputers exhibit a significant amount of intranode parallelism between the host POWER9 CPUs and their attached GPU devices. In this article, we show that exploiting device-level parallelism is key to achieving high performance by reducing overheads typically associated with CPU and GPU task execution. Moreover, manually exploiting this type of parallelism in large-scale applications is nontrivial and error-prone. We hide the complexity of exploiting this hybrid intranode parallelism using the OpenMP programming model abstraction. The implementation leverages the semantics of OpenMP tasks to express asynchronous task computations and their associated dependences. Launching tasks on the CPU threads requires a careful design of work-stealing algorithms to provide efficient load balancing among CPU threads. We propose a novel algorithm that removes locks from all task queueing operations that are on the critical path. Tasks assigned to GPU devices require additional steps such as copying input data to GPU devices, launching the computation kernels, and copying data back to the host CPU memory. We perform key optimizations to reduce the cost of these additional steps by tightly integrating data transfers and GPU computations into streams of asynchronous GPU operations. We further map high-level dependences between GPU tasks to the same asynchronous GPU streams to further avoid unnecessary synchronization. Results validate our approach.
      PubDate: May-July 1 2020
      Issue No: Vol. 64, No. 3/4 (2020)
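The last optimization in the abstract above, placing dependent GPU tasks on the same asynchronous stream so that in-stream ordering replaces explicit synchronization, can be sketched as a small scheduling routine. The function name `assign_streams`, the task tuples, and the round-robin policy for independent tasks are all assumptions made for illustration; the abstract does not describe the runtime's actual policy.

```python
def assign_streams(tasks, num_streams):
    """Assign tasks to streams so dependences stay in-stream where possible.

    tasks: list of (name, [names of earlier tasks it depends on]),
           given in submission order.
    Returns (stream_of, waits), where waits lists the (producer, consumer)
    pairs that still need an explicit cross-stream synchronization.
    """
    stream_of = {}
    waits = []
    next_stream = 0
    for name, deps in tasks:
        if deps:
            # Reuse the stream of the first dependence: work in one stream
            # runs in submission order, so that edge needs no extra wait.
            stream = stream_of[deps[0]]
        else:
            # Independent task: spread over streams round-robin (assumption).
            stream = next_stream
            next_stream = (next_stream + 1) % num_streams
        stream_of[name] = stream
        for dep in deps:
            if stream_of[dep] != stream:
                waits.append((dep, name))
    return stream_of, waits


# Hypothetical task graph: c depends on a; d depends on c and b.
example_tasks = [("a", []), ("b", []), ("c", ["a"]), ("d", ["c", "b"])]
example_streams, example_waits = assign_streams(example_tasks, num_streams=2)
```

In this example only the edge from `b` to `d` crosses streams, so a single explicit wait remains; the chain `a`, `c`, `d` is ordered by the stream itself.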
  • OpenMP 4.5 compiler optimization for GPU offloading
    • Authors: E. Tiotto;B. Mahjour;W. Tsang;X. Xue;T. Islam;W. Chen;
      Pages: 14:1 - 14:11
      Abstract: Ability to efficiently offload computational workloads to graphic processing units (GPUs) is critical for the success of hybrid CPU–GPU architectures, such as the Summit and Sierra supercomputing systems. OpenMP 4.5 is a high-level programming model that enables the development of architecture- and accelerator-independent applications. This article describes aspects of the OpenMP implementation in the IBM XL C/C++ and XL Fortran OpenMP compilers that aid programmers to achieve performance objectives. This includes an interprocedural static analysis the XL optimizer uses to specialize code generation of the OpenMP distribute parallel do loop within the dynamic context of a target region, and other compiler optimizations designed to reduce the overhead of data transferred to an offloaded target region. We introduce the heuristic used at runtime to select optimal grid sizes for offloaded target team constructs. These tuned heuristics lead to an average improvement of 2× in the runtime of several target regions in the SPEC ACCEL V1.2 benchmark suite. In addition to performance enhancement, this article also presents an advanced diagnostic feature implemented in the XL Fortran compiler to aid in debugging OpenMP applications offloaded to accelerators.
      PubDate: May-July 1 2020
      Issue No: Vol. 64, No. 3/4 (2020)
  • Transformation of application enablement tools on CORAL systems
    • Authors: S. Maerean;E. K. Lee;H.-F. Wen;I-H. Chung;
      Pages: 16:1 - 16:12
      Abstract: The CORAL project exhibits an important shift in the computational paradigm from homogeneous to heterogeneous computing, where applications run on both the CPU and the accelerator (e.g., GPU). Existing applications optimized to run only on the CPU have to be rewritten to adopt accelerators and retuned to achieve optimal performance. The shift in the computational paradigm requires application development tools (e.g., compilers, performance profilers and tracers, and debuggers) change to better assist users. The CORAL project places a strong emphasis on open-source tools to create a collaborative environment in the tools community. In this article, we discuss the collaboration efforts and corresponding challenges to meet the CORAL requirements on tools and detail three of the challenges that required the most involvement. A usage scenario is provided to show how the tools may help users adopt the new computation environment and understand their application execution and the data flow at scale.
      PubDate: May-July 1 2020
      Issue No: Vol. 64, No. 3/4 (2020)
  • Porting a 3D seismic modeling code (SW4) to CORAL machines
    • Authors: R. Pankajakshan;P.-H. Lin;B. Sjögreen;
      Pages: 17:1 - 17:11
      Abstract: Seismic waves fourth order (SW4) solves the seismic wave equations on Cartesian and curvilinear grids using large compute clusters with O (100,000) cores. This article discusses the porting of SW4 to run on the CORAL architecture using the RAJA performance portability abstraction layer. The performances of key kernels using RAJA and CUDA are compared to estimate the performance penalty of using the portability abstraction layer. Code changes required for efficiency on GPUs and minimizing time spent in Message Passing Interface (MPI) are discussed. This article describes a path for efficiently porting large code bases to GPU-based machines while avoiding the pitfalls of a new architecture in the early stages of its deployment. Current bottlenecks in the code are discussed along with possible architectural or software mitigations. SW4 runs 28× faster on one 4-GPU CORAL node than on a CTS-1 node (Dual Intel Xeon E5-2695 v4). SW4 is now in routine use on problems of unprecedented resolution (203 billion grid points) and scale on 1,200 nodes of Summit.
      PubDate: May-July 1 2020
      Issue No: Vol. 64, No. 3/4 (2020)
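For readers unfamiliar with what "fourth order" means in the SW4 entry above, the following is a minimal sketch of a generic fourth-order-accurate centered stencil for a second derivative on a uniform one-dimensional grid. It assumes that "fourth order" refers to the accuracy of the spatial discretization; the function name `second_derivative_4th` is hypothetical, and the stencil is the standard textbook one, not code taken from SW4.

```python
def second_derivative_4th(f, x, h):
    """Approximate f''(x) with a five-point, fourth-order centered stencil.

    The truncation error is proportional to h**4 times the sixth
    derivative of f, so the result is exact (up to rounding) for
    polynomials of degree five or lower.
    """
    return (
        -f(x - 2 * h)
        + 16 * f(x - h)
        - 30 * f(x)
        + 16 * f(x + h)
        - f(x + 2 * h)
    ) / (12 * h * h)


def quartic(x):
    """Test function x**4, whose exact second derivative is 12 * x**2."""
    return x ** 4
```

Halving the grid spacing `h` reduces the error of such a stencil by roughly a factor of 16 for smooth functions, which is the practical payoff of a higher-order scheme on large grids.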