SC24 IXPUG Workshop

Communication, I/O, and Storage at Scale on Next-Generation Platforms - Scalable Infrastructures


Workshop date/time: Monday, November 18, 9:00 a.m. to 5:30 p.m. ET

Location: Room B312, in-person at SC24, Atlanta, Georgia

Registration: To attend the IXPUG Workshop, you must register for the SC24 Workshop Pass at https://sc24.supercomputing.org/attend/registration/

Agenda:

All times are shown in ET (Atlanta, Georgia local time). Event details are subject to change. The workshop is held in conjunction with SC24 in Atlanta, Georgia.

Session 1 | Chair: Amit Ruhela (Texas Advanced Computing Center (TACC))


9:05-10:00 a.m. Intel Keynote: From Tensor Processing Primitive towards Tensor Compilers using upstream MLIR (Slides)

Author: Alexander Heinecke, Parallel Computing Lab, Intel Corporation

During the past decade, Deep Learning (DL) algorithms, programming systems, and hardware have converged with their High Performance Computing (HPC) counterparts. Nevertheless, the programming methodology of DL and HPC systems is stagnant, relying on highly optimized, yet platform-specific and inflexible, vendor libraries. Such libraries provide close-to-peak performance on the specific platforms, kernels, and shapes to which vendors have dedicated optimization efforts, while they underperform in the remaining use cases, yielding non-portable codes with performance glass jaws. This talk will shed light on abstraction efforts, mainly targeting CPUs and widening to GPUs the closer the approaches get to DSLs/compilers. We will introduce the Tensor Processing Primitives (TPP) as a virtual and software-defined ISA abstraction in the form of ukernels. Subsequently, we will cover programming abstractions on top of TPPs, carried out in two steps: 1) expressing the computational core using Tensor Processing Primitives (TPPs), a compact, versatile set of 2D-tensor operators; 2) expressing the logical loops around TPPs in a high-level, declarative fashion, with the exact instantiation (ordering, tiling, parallelization) determined via simple knobs. We demonstrate the efficacy of our approach using standalone kernels and end-to-end workloads that outperform state-of-the-art implementations on diverse CPU platforms. We will close the talk by demonstrating how TPP can be the architectural target of a tensor compiler, which in turn is able to generate code that matches hand-coded performance.
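
To make the two-step idea concrete, here is a minimal C++ sketch. The names (tpp_gemm, Knobs) and signatures are invented for illustration and do not reflect the actual TPP/libxsmm API; the point is the separation between a small 2D-tensor ukernel and declaratively parameterized loops around it.

```cpp
// Hypothetical illustration of the TPP two-step approach; names and
// signatures are invented for this sketch, not taken from the real TPP API.
#include <cstddef>

// Step 1: the computational core as a small 2D-tensor ukernel.
// Computes C[M x N] += A[M x K] * B[K x N] on tiles, with leading dimensions.
void tpp_gemm(const float* A, const float* B, float* C,
              std::size_t M, std::size_t N, std::size_t K,
              std::size_t lda, std::size_t ldb, std::size_t ldc) {
    for (std::size_t m = 0; m < M; ++m)
        for (std::size_t n = 0; n < N; ++n)
            for (std::size_t k = 0; k < K; ++k)
                C[m * ldc + n] += A[m * lda + k] * B[k * ldb + n];
}

// Step 2: the logical loops around the ukernel; tiling (and, in the real
// system, ordering and parallelization) is exposed as knobs rather than
// hard-coded into the kernel itself.
struct Knobs { std::size_t tile_m, tile_n; };

void gemm(const float* A, const float* B, float* C,
          std::size_t M, std::size_t N, std::size_t K, Knobs knobs) {
    // Assumes tile sizes evenly divide M and N, for brevity.
    for (std::size_t mb = 0; mb < M; mb += knobs.tile_m)
        for (std::size_t nb = 0; nb < N; nb += knobs.tile_n)
            tpp_gemm(A + mb * K, B + nb, C + mb * N + nb,
                     knobs.tile_m, knobs.tile_n, K, K, N, N);
}
```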


(10:00-10:30 a.m. Coffee Break)


10:30-11:00 a.m. Can Current SDS Controllers Scale To Modern HPC Infrastructures? (Slides)

Authors: Mariana Miranda (INESC TEC & University of Minho), Yusuke Tanimura (AIST), Jason Haga (AIST), Amit Ruhela (UT Austin), Stephen Lien Harrell (UT Austin), John Cazes (UT Austin), Ricardo Macedo (INESC TEC & University of Minho), José Pereira (INESC TEC & University of Minho), João Paulo (INESC TEC & University of Minho)

Modern supercomputers host numerous jobs that compete for shared storage resources, causing I/O interference and performance degradation. Solutions based on software-defined storage (SDS) have emerged to address this issue by coordinating the storage environment through the enforcement of QoS policies. However, these often fail to consider the scale of modern HPC infrastructures. In this work, we explore the advantages and shortcomings of state-of-the-art SDS solutions and highlight the scale of current production clusters and their rising trends. Furthermore, we conduct the first experimental study that offers new insights into the performance and scalability of flat and hierarchical SDS control plane designs. Our results, using the Frontera supercomputer, show that a flat design with a single controller can scale up to 2,500 nodes with an average control cycle latency of 41 ms, while hierarchical designs can handle up to 10,000 nodes with an average latency ranging between 69 and 103 ms.
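
For readers unfamiliar with the terminology, the sketch below illustrates in simplified C++ what one control cycle of a flat versus a hierarchical SDS control plane might look like. The data structures and the proportional-share policy are invented for illustration; this is not the paper's implementation.

```cpp
// Conceptual sketch of an SDS control cycle (illustrative only).
#include <cstddef>
#include <numeric>
#include <vector>

struct Agent { double observed_bw = 0.0, rate_limit = 0.0; };

// Flat design: a single controller visits every node agent per cycle.
void flat_cycle(std::vector<Agent>& agents, double total_budget) {
    double demand = 0.0;
    for (const auto& a : agents) demand += a.observed_bw;   // collect phase
    for (auto& a : agents)                                  // enforce phase
        a.rate_limit = demand > 0 ? total_budget * a.observed_bw / demand
                                  : total_budget / agents.size();
}

// Hierarchical design: sub-controllers aggregate their partition and a root
// controller splits the budget across partitions; each level only talks to
// its children, which is what lets the design scale to more nodes.
void hierarchical_cycle(std::vector<std::vector<Agent>>& partitions,
                        double total_budget) {
    std::vector<double> part_demand;
    for (const auto& p : partitions) {
        double d = 0.0;
        for (const auto& a : p) d += a.observed_bw;
        part_demand.push_back(d);
    }
    double total = std::accumulate(part_demand.begin(), part_demand.end(), 0.0);
    for (std::size_t i = 0; i < partitions.size(); ++i)
        flat_cycle(partitions[i],
                   total > 0 ? total_budget * part_demand[i] / total : 0.0);
}
```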


11:00-11:30 a.m. Benchmarking Ethernet Interconnect for HPC/AI workloads (Slides)

Presenters: Lorenzo Pichetti (University of Trento), Daniele De Sensi (Sapienza University of Rome), Karthee Sivalingam (Open Edge and HPC Initiative), Stepan Nassyr (ParTec AG, FZ Jülich), Dirk Pleiter (Open Edge and HPC Initiative, KTH), Aldo Artigiani (Huawei Datacom), Flavio Vella (University of Trento)

Interconnects have always played a cornerstone role in HPC. Since the inception of the Top500 ranking, interconnect statistics have been dominated by two competing technologies: InfiniBand and Ethernet. However, even as Ethernet has grown in popularity due to its versatility and cost-effectiveness, InfiniBand has historically provided higher bandwidth and continues to feature lower latency. Industry seeks a further evolution of the Ethernet standards to enable fast and low-latency interconnects for emerging AI workloads by offering competitive, open-standard solutions. This paper analyzes the early results obtained from two systems relying on an HPC Ethernet interconnect, one relying on 100G and the other on 200G Ethernet. Preliminary findings indicate that the Ethernet-based networks exhibit competitive performance, closely aligning with InfiniBand, especially for large message exchanges.
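
As a point of reference for how such interconnect comparisons are typically made, below is a minimal MPI ping-pong microbenchmark that measures one-way latency and bandwidth between two ranks. It is a generic illustration, not the benchmark suite used in the paper.

```cpp
// Minimal MPI ping-pong microbenchmark; run with: mpirun -np 2 ./pingpong
#include <mpi.h>
#include <cstdio>
#include <vector>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    const int iters = 1000;
    for (int size = 1; size <= (1 << 22); size *= 2) {   // 1 B .. 4 MiB
        std::vector<char> buf(size);
        MPI_Barrier(MPI_COMM_WORLD);
        double t0 = MPI_Wtime();
        for (int i = 0; i < iters; ++i) {
            if (rank == 0) {
                MPI_Send(buf.data(), size, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
                MPI_Recv(buf.data(), size, MPI_CHAR, 1, 0, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
            } else if (rank == 1) {
                MPI_Recv(buf.data(), size, MPI_CHAR, 0, 0, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
                MPI_Send(buf.data(), size, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
            }
        }
        double t = (MPI_Wtime() - t0) / (2.0 * iters);   // one-way time
        if (rank == 0)
            std::printf("%8d bytes  %10.2f us  %8.2f MB/s\n",
                        size, t * 1e6, size / t / 1e6);
    }
    MPI_Finalize();
}
```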


11:30-12:00 p.m. Predicting Protein Folding on Intel’s Data Center GPU Max Series Architecture (PVC) (Slides)

Authors: Madhvan Prasanna (Purdue), Dhani Ruhela (Westwood HS), Aditya Saxena (Bob Jones HS)

Predicting the structure of proteins has been a grand challenge for over 60 years. Google's DeepMind team leveraged artificial intelligence in 2020 to develop AlphaFold, achieving an accuracy score above 90 for two-thirds of the proteins in the CASP competition. AlphaFold has been very successful in biology and medicine. However, the lack of public training code and the expansive computational requirements led to an open-source implementation, OpenFold. OpenFold is fast, memory-efficient, and provides the OpenProteinSet dataset with five million MSAs. MLCommons added OpenFold to their HPC benchmark suite in 2023, where it was evaluated by four institutions on NVIDIA GPU architectures. This work presents our endeavours to port, run, and tune OpenFold on Intel's Ponte Vecchio (PVC) GPUs. To the best of our knowledge, this is the first large-scale study of a distributed implementation of the OpenFold application on Intel PVC GPUs, presenting the challenges, opportunities, and performance of the application on Intel's Max series architecture.


12:00-12:30 p.m. An Efficient Checkpointing System for Large Machine Learning Model Training (Slides)

Authors: Wubiao Xu, Xin Huang, Shiman Meng, Weiping Zhang (Nanchang Hangkong University)

As machine learning models rapidly increase in size and complexity, the cost of checkpointing in ML training has become a bottleneck in both storage and performance (time). For example, the latest GPT-4 model has massive parameters at the scale of 1.76 trillion. It is highly time- and storage-consuming to frequently write a model with more than 1 trillion floating-point values to storage. This work aims to understand and mitigate this problem. First, we characterize the checkpointing interface in a collection of representative large machine learning/language models with respect to storage consumption and performance overhead. Second, we propose two optimizations: i) a periodic cleaning strategy that periodically removes outdated checkpoints to reduce the storage burden; ii) a data staging optimization that coordinates checkpoints between local and shared file systems to improve performance.
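
The two optimizations lend themselves to a compact illustration. The C++ sketch below shows one plausible shape for them: keeping only the K most recent checkpoints, and writing each checkpoint to fast node-local storage before staging it to the shared file system. Paths, policy, and function names are invented for this sketch and are not the authors' code.

```cpp
// Illustrative sketch of periodic checkpoint cleaning and data staging.
#include <algorithm>
#include <cstddef>
#include <filesystem>
#include <vector>
namespace fs = std::filesystem;

// i) Periodic cleaning: delete all but the newest `keep` checkpoints.
void clean_old_checkpoints(const fs::path& dir, std::size_t keep) {
    std::vector<fs::directory_entry> ckpts;
    for (const auto& e : fs::directory_iterator(dir))
        if (e.path().extension() == ".ckpt") ckpts.push_back(e);
    std::sort(ckpts.begin(), ckpts.end(), [](const auto& a, const auto& b) {
        return fs::last_write_time(a) > fs::last_write_time(b);  // newest first
    });
    for (std::size_t i = keep; i < ckpts.size(); ++i)
        fs::remove(ckpts[i].path());
}

// ii) Data staging: the checkpoint lands on node-local storage (fast), and
// is copied to the shared parallel file system off the critical path.
void stage_checkpoint(const fs::path& local_ckpt, const fs::path& shared_dir) {
    fs::copy_file(local_ckpt, shared_dir / local_ckpt.filename(),
                  fs::copy_options::overwrite_existing);
}
```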


(12:30-2:00 p.m. Lunch Break)


Session 2 | Chair: David Martin (Argonne Leadership Computing Facility, Argonne National Laboratory)


2:00-3:00 p.m. Keynote: Network and Communication Infrastructure powering Meta’s GenAI and Recommendation Systems (Slides)

Presenters: Adithya Gangidi, Mikel Jimenez Fernandez (Meta)

In 2020, Meta changed the way we did AI training. We moved to a synchronous training approach to power our recommendation systems. This pivot required us to build high-speed, low-latency RDMA networks to interconnect GPUs. Over the years, Meta has built some of the largest AI clusters in the world to support training increasingly complex models for rich user experiences. We initially built with Ethernet as our interconnect and later also onboarded InfiniBand into production. Such model complexity and scale recently increased by an order of magnitude with the evolution of GenAI, highlighted by our Llama series of foundation models. In this talk, we will take you through the evolution of Meta's AI network and communication library software over the past five years. We will talk about the problems we ran into as we scaled this infrastructure and how we customized our training systems software stack to work through them. We will highlight the changes we made to the scheduling, collective communication, sharding, and network transport layers to keep our clusters performant from a communication perspective.


(3:00-3:30 p.m. Coffee Break)


3:30-4:00 p.m. Protocol Buffer Deserialization DPU Offloading in the RPC Datapath (Slides)

Author/Presenters: Raphaël Frantz (Eindhoven University of Technology, Netherlands), Jerónimo Sánchez García (Aalborg University, Copenhagen), Marcin Copik (ETH Zurich), Idelfonso Tafur Monroy (Eindhoven University of Technology, Netherlands), Juan José Vegas Olmos (NVIDIA Corporation), Gil Bloch (NVIDIA Corporation), Salvatore Di Girolamo (NVIDIA Corporation)

In the microservice paradigm, monolithic applications are decomposed into finer-grained modules invoked independently in a data-flow fashion. The different modules communicate through remote procedure calls (RPCs), which constitute a critical component of the infrastructure. To ensure portable passage of RPC metadata, arguments, and return values between different microservices, RPCs involve serialization/deserialization activities, part of the RPC data center tax. We demonstrate how RPC server logic, including serialization/deserialization, can be offloaded to Data Processing Units (DPUs). This effectively reduces the RPC data center tax on the host, where applications' business logic runs. While we focus on offloading Protocol Buffers deserialization used by the popular gRPC framework, our findings can be applied to other RPC infrastructures. Our experimental results demonstrate that RPC offloading performs similarly to traditional methods while significantly reducing CPU usage.
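
To give a flavor of the work being moved off the host, the sketch below decodes a Protocol Buffers varint, the variable-length integer encoding underlying the protobuf wire format. It is an illustrative fragment of the deserialization "tax", not the paper's offload pipeline; real deployments rely on generated protobuf/gRPC parsing code.

```cpp
// Decoding a Protocol Buffers varint: each byte carries 7 payload bits and
// the high bit signals that more bytes follow.
#include <cstddef>
#include <cstdint>
#include <optional>

std::optional<uint64_t> decode_varint(const uint8_t* buf, std::size_t len,
                                      std::size_t& consumed) {
    uint64_t value = 0;
    for (std::size_t i = 0; i < len && i < 10; ++i) {  // max 10 bytes for 64-bit
        value |= static_cast<uint64_t>(buf[i] & 0x7F) << (7 * i);
        if ((buf[i] & 0x80) == 0) {    // continuation bit clear: last byte
            consumed = i + 1;
            return value;
        }
    }
    return std::nullopt;               // truncated or malformed input
}
```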

 

4:00-4:30 p.m. Modeling and Simulation of Collective Algorithms on HPC Network Topologies using Structural Simulation Toolkit (Slides)

Presenters: Sai P. Chenna (Intel Corporation), Michael Steyer (Intel Corporation/NVIDIA), Nalini Kumar (Intel Corporation), Maria Garzaran (Intel Corporation), Philippe Thierry (Intel Corporation)

In the last decade, DL training has emerged as an HPC-scale workload running on large clusters. The dominant communication pattern in distributed data-parallel DL training is allreduce, which is used to sum the model gradients across processes during the backpropagation phase. Various allreduce algorithms have been developed to optimize communication time in DL training. Given the scale of DL workloads, it is crucial to evaluate the scaling efficiency of these algorithms on a variety of system architectures. We have extended the Structural Simulation Toolkit (SST) to simulate allreduce and barrier algorithms: the Rabenseifner, ring, and dissemination algorithms. We performed a design space exploration (DSE) study with three allreduce algorithms and two barrier algorithms running on six system network topologies for various message sizes. We quantified the performance benefits of using allreduce algorithms that preserve locality between communicating processes. In addition, we evaluated the scaling efficiency of centralized and decentralized barrier algorithms.
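
As background for what such simulations estimate, the following snippet evaluates the standard alpha-beta cost model for ring allreduce, which a topology-aware simulator like SST then refines with contention and routing effects. The latency and bandwidth constants are illustrative assumptions, not measurements from the paper.

```cpp
// Alpha-beta cost model for ring allreduce: each of p processes sends
// 2(p-1) messages of n/p bytes (reduce-scatter followed by allgather):
//   T = 2(p-1) * alpha + 2 * ((p-1)/p) * n * beta
#include <cstdio>

double ring_allreduce_time(int p, double n_bytes, double alpha, double beta) {
    return 2.0 * (p - 1) * alpha + 2.0 * ((p - 1.0) / p) * n_bytes * beta;
}

int main() {
    const double alpha = 2e-6;        // per-message latency: 2 us (assumed)
    const double beta  = 1.0 / 25e9;  // seconds per byte at 25 GB/s (assumed)
    for (int p : {8, 64, 512})
        std::printf("p=%4d  1 GiB allreduce ~ %.3f s\n",
                    p, ring_allreduce_time(p, 1.0 * (1 << 30), alpha, beta));
}
```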


4:30-5:00 p.m. Performance analysis of a stencil code in modern C++ (Slides)

Presenters: Victor Eijkhout (UT Austin), Yojan Chitkara, Daksh Chaplot

In this paper we evaluate multiple parallel programming models with respect to both ease of expression and resulting performance. We do this by implementing the mathematical algorithm known as the 'power method' in a variety of ways, using modern C++ techniques.
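
For reference, one generic modern-C++ rendering of the power method, using the standard parallel algorithms, might look like the following. This is one of many possible formulations and not necessarily any of the variants compared in the paper.

```cpp
// Power method: repeatedly apply A and normalize; for a symmetric matrix,
// ||A x|| of the unit iterate converges to the dominant eigenvalue.
#include <algorithm>
#include <cmath>
#include <cstdio>
#include <execution>
#include <numeric>
#include <vector>

using Vec = std::vector<double>;
using Mat = std::vector<Vec>;   // dense, row-major

int main() {
    const Mat A = {{4, 1, 0}, {1, 3, 1}, {0, 1, 2}};
    Vec x(A.size(), 1.0);
    double lambda = 0.0;
    for (int it = 0; it < 100; ++it) {
        Vec y(x.size());
        // y = A * x: dot each row with x; the execution policy permits
        // the library to parallelize across rows.
        std::transform(std::execution::par, A.begin(), A.end(), y.begin(),
                       [&x](const Vec& row) {
                           return std::transform_reduce(row.begin(), row.end(),
                                                        x.begin(), 0.0);
                       });
        lambda = std::sqrt(std::transform_reduce(y.begin(), y.end(),
                                                 y.begin(), 0.0));  // ||A x||
        for (double& v : y) v /= lambda;   // normalize the next iterate
        x = std::move(y);
    }
    std::printf("dominant eigenvalue ~= %.6f\n", lambda);
}
```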


5:00 p.m. Workshop Closing Remarks

David Martin (Argonne Leadership Computing Facility, Argonne National Laboratory)


Event Description:

The workshop is a continuation of our effort to bring together HPC users, researchers, and developers from across the globe to share experiences around topics most pertinent to the future of large heterogeneous HPC systems.

Next-generation HPC platforms have to deal with increasing heterogeneity in their subsystems. These subsystems include internal high-speed fabrics for inter-node communication; storage systems integrated with programmable data processing units (DPUs) and infrastructure processing units (IPUs) to support software-defined networks; traditional storage infrastructures with global parallel POSIX-based filesystems complemented by scalable object stores; and heterogeneous compute nodes configured with a diverse spectrum of CPUs and accelerators (e.g., GPUs, FPGAs, AI processors) with complex intra-node communication.

The workshop will pursue multiple objectives, including: (1) develop and provide a holistic overview of next-generation platforms with an emphasis on communication, I/O, and storage at scale, (2) showcase application-driven performance analysis with various HPC network fabrics, (3) present experiences with emerging storage concepts like object stores and all-flash storage, (4) share experiences with performance tuning on heterogeneous platforms from multiple vendors, and (5) share best practices for application programming with complex communication, I/O, and storage at scale.

The workshop intends to attract system architects, code developers, research scientists, system providers, and industry luminaries who are interested in learning about the interplay of next-generation hardware and software solutions for communication, I/O, and storage subsystems tied together to support HPC and data analytics at the systems level, and how to use them effectively. The workshop will provide the opportunity to assess technology roadmaps to support AI and HPC at scale, sharing users’ experiences with early-product releases and providing feedback to technology experts. The overall goal is to make the SC community aware of the emerging complexity and heterogeneity of upcoming communication, I/O and storage subsystems as part of next-generation system architectures and inspect how these components contribute to scalability in both AI and HPC workloads.

Workshop Format:

Full-day workshop. The workshop program will feature two invited talks, from industry as well as academia, several technical talks, and a few shorter 'lightning talks' featuring late-breaking work in this area.

Call for Submissions:

Submissions are due 12am (AoE) Aug 24, 2024 (updated). The workshop will use the SC24-provided software (Linklings) for the review process.

  • Camera-ready papers will need to be submitted on Linklings no later than September 15.
  • Submissions must be 5-10 two-column pages (U.S. letter, 8.5 inches x 11 inches), excluding the bibliography, using the IEEE proceedings template.
  • Camera-ready papers are required to be formatted the same as the main conference papers.
  • IEEE conference proceedings, two-column, US letter.
  • IEEE will provide a unique copyright submission site and access to PDF eXpress to validate final PDFs.
  • Additional guidelines, including the copyright notice for the camera-ready version, will be provided at a later time.
  • For more details and templates see: https://sc24.supercomputing.org/program/papers/SC

Topics of Interest Are (but Not Limited To):

  • Holistic view on performance of next-generation platforms (with emphasis on communication, I/O, and storage at scale)
  • Application-driven performance analysis with various HPC fabrics
  • Software-defined networks in HPC environments
  • Experiences with emerging scalable storage concepts, e.g., object stores using next-generation HPC fabrics
  • Performance tuning on heterogeneous platforms from multiple vendors including impact of I/O and storage
  • Performance and portability using network programmable devices (DPU, IPU)
  • Best practice solutions for application programming with complex communication, I/O, and storage at scale

Keywords:

High-performance fabrics, data and infrastructure processing units, scalable object stores as HPC storage subsystems, heterogeneous data processing on accelerators, holistic system view on scalable HPC infrastructures

Review Process:

All submissions within the scope of the workshop will be peer-reviewed and will need to demonstrate high quality results, originality and new insights, technical strength, and correctness. We will apply a standard single-blind review process, i.e., the authors will be known to reviewers. The assignment of reviewers from the Program Committee will avoid conflicts of interest.

Important Dates:

  • Deadline for submissions via Linklings: Aug 24, 2024, 12am (AoE) (extended from August 16, 2024)
  • Acceptance notification: September 8, 2024 (updated from September 6, 2024)
  • Camera-ready presentation: September 27, 2024
  • Workshop date: November 18, 9:00 a.m.-5:30 p.m.

Organizers:

  • Glenn Brook, Cornelis Networks
  • Steffen Christgau, Zuse Institute Berlin
  • Clayton Hughes, Sandia National Laboratories
  • Nalini Kumar, Intel Corporation
  • Hatem Ltaief, King Abdullah University of Science & Technology
  • David Martin, Argonne National Laboratory
  • Christopher Mauney, Los Alamos National Laboratory
  • Amit Ruhela, Texas Advanced Computing Center (TACC)

Program Committee:

  • Aksel Alpay, Heidelberg University
  • Glenn Brook, Cornelis Networks
  • Steffen Christgau, Zuse Institute Berlin
  • Toshihiro Hanawa, The University of Tokyo
  • Clayton Hughes, Sandia National Laboratories
  • Nalini Kumar, Intel Corporation
  • James Lin, Shanghai Jiao Tong University
  • Hatem Ltaief, King Abdullah University of Science & Technology
  • David Martin, Argonne National Laboratory
  • Christopher Mauney, Los Alamos National Laboratory
  • Amit Ruhela, Texas Advanced Computing Center (TACC)

Contact:

Please contact the workshop organizers with any general questions.