ISC 2023 IXPUG Workshop


Second workshop on Communication, I/O, and Storage at Scale on Next-Generation Platforms – Scalable Infrastructures

Date: May 25, 2023, 9:00 a.m. - 6:00 p.m. (full-day workshop with proceedings)

Location: In-person at ISC 2023, Hamburg, Germany – Hall Y4 - 2nd Floor

Registration: The workshop is held in conjunction with ISC 2023, Hamburg, Germany. To attend the IXPUG Workshop, you must register for the ISC 2023 Workshop Pass. In the ISC registration form, select “Second workshop on Communication, I/O, and Storage at Scale on Next-Generation Platforms – Scalable Infrastructures” from the drop-down menu.

Zoom: Those unable to attend ISC 2023 in person may join the IXPUG Workshop via Zoom Webinar.

Event Description:

The workshop intends to attract system architects, code developers, research scientists, system providers, and industry luminaries who are interested in learning about the interplay of next-generation hardware and software solutions for communication, I/O, and storage subsystems tied together to support HPC and data analytics at the systems level, and how to use them effectively. The workshop will provide the opportunity to assess technology roadmaps to support AI and HPC at scale, sharing users’ experiences with early-product releases and providing feedback to technology experts. The overall goal is to make the ISC community aware of the emerging complexity and heterogeneity of upcoming communication, I/O, and storage subsystems as part of next-generation system architectures, and to examine how these components contribute to scalability in both AI and HPC workloads.

The workshop will pursue several objectives: (1) Develop and provide a holistic overview of next-generation platforms with an emphasis on communication, I/O, and storage at scale, (2) Showcase application-driven performance analysis with various HPC fabrics, (3) Present early experiences with emerging storage concepts like object stores using next-generation HPC fabrics, (4) Share experience with performance tuning on heterogeneous platforms from multiple vendors, and (5) Be a forum for sharing best practices for performance tuning of communication, I/O, and storage to improve application performance at scale, and for discussing the associated challenges.

Workshop Agenda: All times are shown in CEST / Hamburg Time, UTC+2. Final presentations will be made available for download after the workshop.

Session 1: Keynote 1

Session Chair: R. Glenn Brook



Keynote: Performance Portability for Next-Generation Heterogeneous Systems

There is a huge diversity in the processors used to power the leading supercomputers. Despite their differences in how they need to be programmed, these processors lie on a spectrum of design. GPU-accelerated systems are optimised for throughput calculations providing high memory bandwidth; CPUs provide deep and complex cache hierarchies to improve memory latency; and both use vector units to bolster compute performance. Competitive processors are available from a multitude of vendors, with each becoming more heterogeneous with every generation. This gives us as an HPC community a choice, but how do we write our applications to make the most of this opportunity?

Our high-performance applications must be written to embrace the full ecosystem of supercomputer design. They need to take advantage of the hierarchy of concurrency on offer, and utilise the whole processor. And writing these applications must be productive, because HPC software outlives any one system. Our applications need to address the “Three Ps”: they must be Performant, Portable, and Productive.

This talk will highlight the opportunities this variety of heterogeneous architectures brings to applications, and how application performance and portability can be rigorously measured and compared across diverse architectures. It will share a strategy for writing performance portable applications and present the roles that ISO languages C++ and Fortran, as well as parallel programming models and abstractions such as OpenMP, SYCL and Kokkos play in the ever changing heterogeneous landscape.

Dr. Tom Deakin, University of Bristol



Session 2: Technical Presentations

Session Chair: R. Glenn Brook


Bandwidth Limits in the Intel Xeon Max (Sapphire Rapids with HBM) Processors

The HBM memory of Intel Xeon Max processors provides significantly higher sustained memory bandwidth than their DDR5 memory, with corresponding increases in the performance of bandwidth-sensitive applications. However, the increase in sustained memory bandwidth is much smaller than the increase in peak memory bandwidth. Using custom microbenchmarks (instrumented with hardware performance counters) and analytical modeling, the primary bandwidth limiter is shown to be insufficient memory concurrency. Secondary bandwidth limitations due to non-uniform loading of the two-dimensional on-chip mesh interconnect are shown to arise not far behind the primary limiters.

John D. McCalpin, Texas Advanced Computing Center, The University of Texas at Austin  Slides

Modelling Next Generation High Performance Computing Fabrics

Simulation provides insight into physical phenomena that could otherwise not be understood. It allows detailed analysis that may not be possible to capture at a comparable level of detail from the physical world. This insight can be used to build better products and design better features inside next-generation low-latency fabrics. This extended abstract looks at the validation and development of a simulator to support the development of next-generation fabrics.

Dean Chester, Aruna Ramanan and Mark Atkins – Cornelis Networks  Slides
11:00-11:30 Coffee Break    

Session 3: Technical Presentations

Session Chair: Thomas Steinke


DAOS beyond PMem: Architecture and Initial Performance Results

The Distributed Asynchronous Object Storage (DAOS) is an open source scale-out storage system that is designed from the ground up to support Storage Class Memory (SCM) and NVMe storage in user space. Until now, the DAOS storage stack has been based on Intel Optane Persistent Memory (PMem) and the Persistent Memory Development Kit (PMDK). With the discontinuation of Optane PMem, and no persistent CXL.mem devices in the market yet, DAOS continues to support PMem-based servers but now also supports server configurations where its Versioning Object Store (VOS) is held in DRAM. In this case, the VOS data structures are persisted through a synchronous Write-Ahead-Log (WAL) combined with asynchronous checkpointing to NVMe SSDs. This paper describes the new non-PMem DAOS architecture, and reports first performance results based on a DAOS 2.4 technology preview.

Michael Hennecke, Johann Lombardi, Jeff Olivier, Tom Nabarro, Liang Zhen, Yawei Niu, Shilong Wang and Xuezhao Liu – Intel Corporation  Slides
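The write-ahead-log-plus-checkpoint scheme described above can be sketched generically. The toy C code below uses hypothetical names and in-memory buffers standing in for the NVMe-backed log and checkpoints, and is not DAOS's actual implementation; it only shows why replaying the log since the last checkpoint recovers all updates:

```c
#include <string.h>

/* Toy write-ahead log: every update is appended to the log before
 * being applied to the in-memory table, so the table can be rebuilt
 * after a crash by replaying the log since the last checkpoint.
 * (A real WAL would fsync each append; DAOS persists its log and
 * checkpoints to NVMe SSDs, not to memory buffers.) */

#define TABLE_SIZE 8
#define LOG_CAP    64

typedef struct { int key; int value; } LogEntry;

static int      table[TABLE_SIZE];
static LogEntry log_buf[LOG_CAP];
static int      log_len = 0;

void put(int key, int value) {
    log_buf[log_len].key   = key;    /* 1. append to WAL ...        */
    log_buf[log_len].value = value;
    log_len++;
    table[key] = value;              /* 2. ... then apply the update */
}

/* Checkpoint: once the table itself is persisted, the log can be
 * truncated; replay after a crash starts from this snapshot. */
void checkpoint(int *snapshot) {
    memcpy(snapshot, table, sizeof(table));
    log_len = 0;
}

/* Crash recovery: restore the last checkpoint, then replay the log. */
void replay(int *recovered, const int *snapshot) {
    memcpy(recovered, snapshot, sizeof(table));
    for (int i = 0; i < log_len; i++)
        recovered[log_buf[i].key] = log_buf[i].value;
}
```

The synchronous log append makes each update durable immediately, while the expensive full-state write is deferred to the asynchronous checkpoint, which is the trade-off the DAOS design exploits.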

Portability and Scalability of OpenMP Offloading on State-of-the-art Accelerators

Over the last decade, most of the increase in computing power has been gained by advances in accelerated many-core architectures, mainly in the form of GPGPUs. While accelerators achieve phenomenal performance in various computing tasks, their utilization requires code adaptations and transformations. Thus OpenMP, the most common standard for multi-threading in scientific computing applications, has offered offloading capabilities between hosts (CPUs) and accelerators since v4.0, with increasing support in the successive v4.5, v5.0, v5.1, and the latest v5.2 versions. Recently, two state-of-the-art GPUs -- the Intel Ponte Vecchio Max 1100 and the NVIDIA A100 -- were released to the market, with oneAPI and GNU LLVM-backed compilation for offloading, respectively. In this work, we present early performance results of OpenMP offloading to these devices, specifically analyzing the portability of advanced directives (using SOLLVE's OMPVV test suite) and the scalability of the hardware in a representative scientific mini-app (the LULESH benchmark). Our results show that the vast majority of the offloading directives in v4.5 and v5.0 are supported in the latest oneAPI and GNU compilers; however, support for v5.1 and v5.2 is still lacking. From the performance perspective, we found the PVC1100 and A100 to be broadly comparable on the LULESH benchmark: the A100 is slightly faster due to its higher memory bandwidth, while the PVC1100 scales to the next problem size (400^3) thanks to its larger memory capacity.

Yehonatan Fridman, NRCN and Ben-Gurion University; Guy Tamir, Intel Corporation; Gal Oren, NRCN and Department of Computer Science, Technion – Israel Institute of Technology  Slides

Enabling Multi-level Network Modeling in Structural Simulation Toolkit for Next-Generation HPC Network Design Space Exploration

The last decade has seen high-performance computing (HPC) systems become denser and denser. Higher node and rack density has led to the development of multi-level networks - at socket, node, 'pod', rack, and between nodes. As sockets become more complex with integrated or co-packaged heterogeneous architectures, this network complexity is going to increase. In this paper, we extend the Structural Simulation Toolkit (SST) to model these multi-level network designs. We demonstrate this newly introduced capability by modeling a combination of a few different network topologies at different levels of the system and simulating the performance of collectives and some popular HPC communication patterns.

Sai Prabhakar Rao Chenna, Nalini Kumar, Leonardo Borges, Michael Steyer, Philippe Thierry and Maria Garzaran – Intel Corporation  Slides
13:00-14:00 Lunch Break    

Session 4: Keynote 2

Session Chair: David Martin

14:00-15:00 Keynote: Building a Productive, Open, Accelerated Programming Model for Now and the Future

The needs for energy-efficient computing to solve the problems of exascale-class computing, emerging distributed artificial intelligence, and intelligent devices at the edge, among others, have driven what was referred to in 2019 as “A New Golden Age for Computer Architecture.”1 The growth of diverse CPU, GPU, and other accelerator architectures comes with programming complexity from the predicted domain-specific languages. In practice, the per-device programming models of novel architectures present challenges in developing, deploying, and maintaining software. This talk will cover Intel’s efforts to enable multi-architecture, multi-vendor accelerated programming and the progress made to date. Additionally, this session will discuss research directions to make accelerated computing increasingly productive.

1 J.L. Hennessy and D.A. Patterson: “A New Golden Age for Computer Architecture,” Communications of the ACM 62.2 (2019), pp. 48-60.

Joseph Curley, Intel Corporation  Slides

Session 5: Technical Presentations

Session Chair: David Martin


Application Performance Analysis: a Report on the Impact of Memory Bandwidth

As High-Performance Computing (HPC) applications involving massive data sets, including large-scale simulations, data analytics, and machine learning, continue to grow in importance, memory bandwidth has emerged as a critical performance factor in contemporary HPC systems. The rapidly escalating memory performance requirements, which traditional DRAM memories often fail to satisfy, necessitate the use of High-Bandwidth Memory (HBM), which offers high bandwidth, low power consumption, and high integration capacity, making it a promising solution for next-generation platforms. However, despite the notable increase in memory bandwidth on modern systems, no prior work has comprehensively assessed the memory bandwidth requirements of a diverse set of HPC applications or weighed the cost of HBM against its potential performance gain. This work presents a performance analysis of a diverse range of scientific applications as well as standard benchmarks on platforms with varying memory bandwidth. The study shows that while the performance improvement of scientific applications varies considerably, some applications in CFD, Earth Science, and Physics show significant performance gains with HBM. Furthermore, a cost-effectiveness analysis suggests that applications exhibiting at least a 30% speedup on the HBM platform would justify the additional cost of the HBM.

Yinzhi Wang, John McCalpin, Junjie Li, Matthew Cawood, John Cazes, Hanning Chen, Lars Koesterke, Hang Liu, Chun-Yaung Lu, Robert McLay, Kent Milfield, Amit Ruhela, Dave Semeraro and Wenyang Zhang – Texas Advanced Computing Center, The University of Texas at Austin  Slides

Omni-Path Express (OPX) Libfabric Provider Performance Evaluation

The introduction of the Omni-Path Express (OPX) Libfabric provider by Cornelis Networks delivers improved performance and capabilities for the current generation of Omni-Path 100 fabric and sets the software foundation for future generations of Omni-Path fabric. This session will study the performance characteristics of the OPX provider using MPI microbenchmarks and make competitive comparisons using application workloads.

John Swinburne, Robert Bollig and James Erwin – Cornelis Networks  Slides
16:00-16:30 Coffee Break    

Session 6: Technical Presentations

Session Chair: Amit Ruhela


Apache Spark performance optimization over Frontera HPC cluster

Apache Spark is a very popular computing engine that distributes computing tasks across a cluster. This paper briefly describes the Frontera supercomputer, summarizes the Apache Spark components and deployment methods, and proposes an optimized way to run Apache Spark jobs. Simulation results using different sample sizes allow us to understand the scalability and performance of the proposed implementation.

Samuel Bernardo, LIP; Amit Ruhela, John Cazes and Stephen Harrell – Texas Advanced Computing Center, The University of Texas at Austin  Slides

Session 7: Keynote 3

Session Chair: Amit Ruhela


Keynote: Next-Gen Acceleration with Multi-Hybrid Devices – Is GPU Enough?

Accelerated computing with GPUs is now the main player in HPC and AI, with its absolute performance in FLOPS at multiple precisions and its high memory bandwidth. It covers the major application fields that fit its performance characteristics; however, it is not perfect for some classes of computation with complicated structure.

We think one of the next solutions is multi-hybrid acceleration, where different kinds of accelerators, such as GPUs and FPGAs, compensate for each other's weaknesses. We are running a conceptual PC cluster named Cygnus under the concept of CHARM (Cooperative Heterogeneous Acceleration with Reconfigurable Multidevices). In this talk, I will present everything from the concept to real applications, showing the next generation of ideal accelerated computing.

Dr. Taisuke Boku, Center for Computational Sciences, University of Tsukuba  Slides


Workshop Format:

The workshop will have keynotes, full (30 min) talks, and lightning talks (10-15 min). While in-person presentations are preferred, pre-recorded videos will be allowed as presentations in exceptional cases.

Call for Submissions:

The submission process will close on March 31, 2023 AoE. Submitters should provide an Extended Abstract of 6-12 pages in LNCS format via the IXPUG EasyChair website. Notifications will be sent to submitters by April 17, 2023 AoE. The page limit is 12 pages per paper, with 2 possible extra pages after the review to address the reviewers' comments. The page limit includes bibliography and appendices.

Topics of Interest are (but not limited to):

  • Holistic view on performance of next-generation platforms (with emphasis on communication, I/O, and storage at scale)
  • Application-driven performance analysis on inter-node and intra-node HPC fabrics
  • Software-defined networks in HPC environments
  • Experiences with emerging scalable storage concepts, e.g., object stores using next-generation HPC fabrics
  • Performance tuning on heterogeneous platforms from multiple vendors including impact of I/O and storage
  • Performance and portability using network programmable devices (DPU, IPU)
  • Best-practice solutions for performance tuning of communication, I/O, and storage to improve application performance at scale, and the associated challenges


Keywords: high-performance fabrics, data and infrastructure processing units, scalable object stores as HPC storage subsystems, heterogeneous data processing

Review Process:

All submissions within the scope of the workshop will be peer-reviewed and will need to demonstrate the high quality of the results, originality and new insights, technical strength, and correctness. We apply a standard single-blind review process, i.e., the authors will be known to reviewers. The assignment of reviewers from the Program Committee will avoid conflicts of interest.

Important Dates:

  • Call for Papers/Contributions: Feb 24, 2023
  • Deadline for submissions: March 31, 2023
  • Acceptance notification: April 17, 2023
  • Camera ready presentation: May 22, 2023
  • Workshop date: May 25, 2023

Organizing Committee:


  • R. Glenn Brook, Cornelis Networks
  • Nalini Kumar, Intel Corporation
  • David Martin, Argonne Leadership Computing Facility
  • Amit Ruhela, Texas Advanced Computing Center
  • Thomas Steinke, Zuse Institute Berlin

Program Committee:

  • R. Glenn Brook, Cornelis Networks
  • Clayton Hughes, Sandia National Laboratories
  • Andrey Kudryavtsev, Intel Corporation
  • Nalini Kumar, Intel Corporation
  • Johann Lombardi, Intel Corporation
  • David Martin, Argonne National Laboratory
  • Christopher Mauney, Los Alamos National Laboratory
  • Kelsey Prantis, Intel Corporation

General questions should be sent to the IXPUG workshop organizers.