Webinar Series


IXPUG Webinar Series

IXPUG webinars enable knowledge-sharing and greater collaboration on a range of topics across HPC, XPU architectures, storage, data analytics, artificial intelligence, and visualization development. The live session webinars are free and open to anyone who wishes to join—it's a great way to get involved in the IXPUG community!

The goals of the Webinar Series are:

  1. Direct IXPUG discussions to what is most relevant to the community.
  2. Disseminate results and techniques.
  3. Assist the community with performance debugging/troubleshooting on Intel HPC platforms.
  4. Provide a forum for collaboration between IXPUG members and Intel engineers.
  5. Help the community to prepare for deeper engagement at upcoming IXPUG events.

How to Participate

To receive updates on future webinars, subscribe to the IXPUG newsletter. If you are interested in a specific topic for a future webinar, or would like to share your work with the IXPUG community, please contact IXPUG by email.


Upcoming Meetings

Date Title Author(s) Description Registration
Next webinar to be announced soon!









Previous Meetings

Date Title Author(s) Description Presentation
April 28, 2022

Intel Fortran Compilers: A Tradition of Trusted Application Performance

Ron Green, Intel Corporation

Ron Green is the manager of the Intel Fortran OpenMP and Runtime Library development team. He is a moderator for the Intel Fortran Community Forum and an Intel Developer Zone "Black Belt". He has more than 30 years of experience as a developer and consultant in HPC and has been with Intel's compiler team for thirteen years. His technical interest is parallel application development with a focus on Fortran programming.

The Intel® Fortran Compiler is built on a long history of generating optimized code that supports industry standards while taking advantage of built-in technology for Intel® Xeon® Scalable processors and Intel® Core™ processors. Staying aligned with Intel's evolving and diverse architectures, the compiler now supports GPUs. This presentation will cover the compiler standards and path forward.

There are two versions of this compiler. Both versions integrate seamlessly with popular third-party compilers, development environments, and operating systems.
• Intel Fortran Compiler: provides CPU and GPU offload support
• Intel Fortran Compiler Classic: provides continuity with existing CPU-focused workflows

Key features:
• Improves development productivity by targeting CPUs and GPUs through single-source code while permitting custom tuning
• Supports broad Fortran language standards
• Incorporates industry-standard support for OpenMP* 4.5, plus initial OpenMP 5.0 and 5.1 features for GPU offload
• Uses well-proven LLVM compiler technology and Intel's history of compiler leadership
• Takes advantage of multicore processors, Single Instruction Multiple Data (SIMD) vectorization, and multiprocessor systems with OpenMP, automatic parallelism, and coarrays



March 10, 2022

DAOS: Storage Innovations Driven by Intel® Optane™ Persistent Memory

Zhen Liang

Zhen Liang is a technical architect involved in the architecture, design, and implementation of distributed storage systems. He has worked in the storage software industry since 2004 and has significant experience and expertise in filesystems, networking, high performance computing, and distributed storage system architecture. He is currently the technical architect of Distributed Asynchronous Object Storage (DAOS).

This presentation will provide a technical overview of DAOS, an open-source, software-defined object store designed from the ground up for massively distributed Non-Volatile Memory (NVM), including Intel® Optane™ DC persistent memory and Intel Optane DC SSDs. DAOS is the foundation of the Intel exascale storage stack. The presentation will also cover DAOS performance and explain its main features.



December 9, 2021

Multi-GPU Programming—Scale-Up and Scale-Out Made Easy, Using the Intel® MPI Library

Anatoliy Rozanov is Intel MPI Lead Developer responsible for Intel GPU enabling and Intel MPI process management/deployment infrastructure at Intel.

Dmitry Durnov is Intel MPI and oneCCL Products Architect at Intel.

Michael Steyer is a HPC Technical Consulting Engineer, supporting technical and high performance computing segments within the Software and Advanced Technology Group at Intel.

For shared-memory programming of GPGPU systems, users either have to manually decompose their domain across the available GPUs and GPU tiles, or leverage implicit scaling mechanisms that transparently scale their offload code across multiple GPU tiles. The former approach can be cumbersome, and the latter is not always the best performing. The Intel MPI Library can take that burden from users by letting them program for a single GPU/tile and leave the distribution to the library, which can make HPC/GPU programming much easier. To that end, Intel® MPI not only allows individual MPI ranks to be pinned to individual GPUs or tiles, but also lets users pass GPU memory pointers directly to the library.
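As a sketch of how this looks in practice: Intel MPI exposes its GPU support through environment variables. The variable names below follow the Intel MPI 2021 documentation but should be treated as assumptions to verify against your release notes; the application name is a placeholder.

```shell
# Enable Intel MPI's GPU buffer support so that device pointers can be
# passed directly to MPI calls, and let the library handle GPU/tile pinning.
export I_MPI_OFFLOAD=1                     # enable GPU buffer handling
export I_MPI_OFFLOAD_TOPOLIB=level_zero    # GPU topology detection backend

# One MPI rank per GPU tile; the library distributes ranks across tiles,
# so the application itself is written for a single GPU/tile only.
mpirun -n 8 ./single_tile_app
```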



August 12, 2021 IMPECCABLE: A Dream Pipeline for High-Throughput Virtual Screening, or a Pipe Dream? Dr. Shantenu Jha, Chair of Computation & Data Driven Discovery Department at Brookhaven National Laboratory and Professor of Computer Engineering at Rutgers University

The drug discovery process currently employed in the pharmaceutical industry typically requires about 10 years and $2-3 billion to deliver one new drug. This is both too expensive and too slow, especially in emergencies like the COVID-19 pandemic. In silico methodologies need to be improved to better select lead compounds that can proceed to later stages of the drug discovery protocol accelerating the entire process. No single methodological approach can achieve the necessary accuracy with required efficiency. Here we describe multiple methodological and supporting infrastructural innovations at scale. Specifically, how we used TACC’s Frontera on > 8000 compute nodes to sustain 144M/hour docking hits, and to screen ∼100 Billion drug candidates. These capabilities have been used by the US-DOE National Virtual Biotechnology Laboratory, and represent important progress towards improvement of computational drug discovery, both in terms of size of libraries screened, but also the possibility of generating training data fast enough for very powerful (docking) surrogate models.



April 22, 2021 Visual Analysis on TACC Frontera using the Intel oneAPI Rendering Toolkit Dr. Paul A. Navrátil, Research Scientist and Director of Visualization, Texas Advanced Computing Center (TACC) at the University of Texas at Austin TACC Frontera handles the largest simulation runs for open-science researchers supported by the National Science Foundation. Due to the data sizes involved, the scientific analysis is most easily performed on Frontera itself, often done "in situ" without writing the full data to disk. This talk will present recent work on Frontera that uses the Intel oneAPI Rendering Toolkit to perform both batch and interactive visual analysis across a range of scientific domains.



March 11, 2021 Performance Optimizations for End-to-End AI Pipelines Meena Arunachalam and Vrushabh Sanghavi, Intel Corporation The trifecta of high volumes of data, abundant compute availability on cloud and on-premise, and rapid algorithmic innovations enable data scientists and AI researchers to do fast experiments, prototyping, and model development at an accelerated pace that was never possible before. In this talk, we will touch upon a variety of software packages, libraries, and tools that can also help HPC practitioners push the envelope of applying AI in their application domains and simulations at-scale. We will cover examples and talk about how to create efficient end-to-end AI pipelines with large data sets in-memory, security, and other features through Intel-optimized software packages such as Intel® Distribution of Python, Intel® Optimized Modin, Intel® Optimized Sklearn, and XGBoost, as well as DL Frameworks such as Intel® Optimized Tensorflow and Intel® Optimized PyTorch tuned and enabled with new hardware features and instructions every new CPU generation.



February 18, 2021 Migrating from CUDA-only to Multi-Platform DPC++

Steffen Christgau, Zuse Institute Berlin (ZIB)

Marius Knaust (ZIB) will join to answer FPGA-related questions from the audience.

In this webinar we will demonstrate how an existing CUDA stencil application can be migrated to DPC++ with the help of the Compatibility Tool. We will highlight and discuss the crucial differences between the two programming environments in the context of migrating the tsunami simulation easyWave. The discussion also includes the steps for making the migrated code compliant with the SYCL standard. During the talk, we will also show that the migrated code can run on a wide range of platforms, from CPUs and GPUs to FPGAs.




July 1, 2020


Migrating Your Existing CUDA Code to DPC++


Edward Mascarenhas and Sunny Gogar, Intel Corporation

Best practices for using a one-time migration tool that converts CUDA applications into standards-based Data Parallel C++ (DPC++) code. Topics include:

• An overview of the DPC++ language, including why it was created and how it benefits developers
• An overview of the Intel DPC++ Compatibility Tool itself—what it is and what it does
• Real-world examples of the code-migration concept, including the process and expectations
• A demonstration of the steps involved to migrate CUDA code to DPC++ code, including what a complete migration looks like and best practices to follow
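The migration step itself is driven by the `dpct` command-line tool. The invocation below is an illustrative sketch (file and directory names are placeholders, not from the webinar):

```shell
# Run the Intel DPC++ Compatibility Tool on a CUDA source file.
# Migrated DPC++/SYCL sources are written under --out-root; constructs
# the tool cannot migrate automatically are marked with DPCT warning
# comments in the generated code for manual follow-up.
dpct vector_add.cu --out-root=migrated
```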


February 20, 2020 Performance Optimization of Intel® oneAPI Applications

Kevin O’Leary
Intel Corporation

Modern workloads are incredibly diverse—and so are architectures. No single architecture is best for every workload. Maximizing performance takes a mix of scalar, vector, matrix, and spatial (SVMS) architectures deployed in CPU, GPU, FPGA, and other future accelerators. Intel® oneAPI products will deliver the tools needed to deploy applications and solutions across SVMS architectures. This webinar will focus on the oneAPI performance-optimization features, including the analysis tools:

  • Intel® VTune™ Profiler (Beta) to find performance bottlenecks fast in CPU, GPU, and FPGA systems
  • Intel® Advisor (Beta) for vectorization, threading, and accelerator offload design advice

Part of the webinar will start with an application currently running on a CPU; we will use the oneAPI tools to port it to and optimize it on a GPU.

The Intel® oneAPI set of complementary toolkits—a base kit and specialty add-ons—simplify programming and help improve efficiency and innovation. Use it for: high performance computing, machine learning and analytics, IoT applications, video processing, rendering, etc. This webinar will include extra time for Q&A.

Presenter: Kevin O’Leary is a senior technical consulting engineer in Intel’s software tools group. Kevin was one of the original developers of Intel® Parallel Studio. Before coming to Intel, he spent several years on the IBM Rational Apex debugger development team.
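As a sketch of the command-line side of this workflow (our example, not from the webinar; application names, analysis types, and result directories are placeholders to adapt):

```shell
# Collect a CPU hotspots profile with VTune Profiler, then a GPU analysis.
vtune -collect hotspots -result-dir vtune_hs -- ./myapp
vtune -collect gpu-hotspots -result-dir vtune_gpu -- ./myapp

# Collect and report a Roofline analysis with Intel Advisor.
advisor --collect=roofline --project-dir=./advi -- ./myapp
advisor --report=roofline --project-dir=./advi
```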



November 14, 2019 The DREAM Framework and Binning Directories –or– Can We Analyze ALL Genomic Sequences on Earth?

Knut Reinert
Freie Universität Berlin

The recent improvements in full genome sequencing technologies, commonly subsumed under the term NGS (Next Generation Sequencing), have tremendously increased sequencing throughput. Within 10 years it rose from 21 billion base pairs collected over months to about 400 billion base pairs per day (the current throughput of Illumina's HiSeq 4000). The cost of producing one million base pairs has also fallen from 140,000 dollars to a few cents.

As a result of this dramatic development, the number of new data submissions, generated by various biotechnological protocols (ChIP-Seq, RNA-Seq, etc.), to genomic databases has grown dramatically and is expected to continue to increase faster than the cost and capacity of storage devices will decrease.

The main task in analyzing NGS data is to search for sequencing reads, short sequence patterns (e.g., exon/intron boundary read-through patterns), or expression profiles in large collections of sequences (i.e., a database). Searching the entirety of the databases mentioned above is usually only possible by searching the metadata or a set of results initially obtained from the experiment. Searching (approximately) for a specific genomic sequence in all the data has not been possible in reasonable computational time.

In this work we describe results for our new data structure, called the binning directory, which can distribute approximate search queries based on an extension of our recently introduced Interleaved Bloom Filters (IBF), the x-partitioned IBF (x-PIBF). The results presented here make use of the Intel® Optane™ DC persistent memory architecture and achieve significant speedups compared to a disk-based solution.

October 10, 2019 Optimize for Both Memory and Compute on Modern Hardware Using Roofline Model Automation in Intel® Advisor

Zakhar Matveev and Cédric Andreolli
Intel Corporation

Software must be optimized for both Compute (including SIMD vector parallelism) and effective memory sub-system utilization to achieve scaled performance on modern hardware.
In this talk we present the state-of-the-art Intel Advisor Roofline performance-model automation, which helps identify memory bottlenecks and balance CPU and memory utilization. The talk covers not only the "cache-aware" Roofline implementation, but also new capabilities to produce DRAM ("original") and multi-level (L1, L2, LLC, MCDRAM, and DRAM, all decoupled) Roofline model flavors to guide the optimization of DRAM- or cache-bound applications.


[PDF - Matveev]

[PDF - Andreolli]

September 12, 2019

Accelerate Your Inferencing with Intel® Deep Learning Boost

Shailen Sobhee
Intel Corporation

Learn about Intel® Deep Learning Boost (Intel® DL Boost) and its Vector Neural Network Instructions (VNNI). These are a new set of Intel® Advanced Vector Extensions 512 (Intel® AVX-512) instructions designed to deliver significantly more efficient deep learning inference acceleration. We will show a live demo of them in action and quickly show how you can get started with Intel® DL Boost today.


July 11, 2019 Scaling Distributed TensorFlow Training with Intel’s nGraph Library on Xeon® Processor Based HPC Infrastructure
Jianying Lang
Intel Corporation
Intel has released the nGraph library, a compiler and runtime API for multiple front-end deep learning frameworks such as TensorFlow, MXNet, PaddlePaddle, and others. nGraph represents a framework's computational graph as an intermediate representation (IR) that can be executed on multiple hardware backends from the edge to the data center, significantly improving the productivity of AI data scientists. In this talk, we will present details of the bridge that connects TensorFlow to nGraph for a Xeon CPU backend. We will demonstrate state-of-the-art (SOTA) accuracy and convergence for ResNet-50 on ImageNet-1K across multiple Xeon Skylake nodes. Using distributed nGraph, we are able to obtain ~75% Top-1 accuracy for ResNet-50 training on a small number of Xeon Skylake nodes. We will demonstrate convergence and excellent scaling efficiency on Skylake nodes connected with Ethernet, using nGraph TensorFlow with the open-source Horovod framework.



May 09, 2019 Deeply-Pipelined FPGA Clusters Make DNN Training Scalable
Tong Geng
Boston University

Tianqi Wang
University of Science and Technology of China

Deep Neural Networks (DNNs) have revolutionized numerous applications, but the demand for ever more performance remains unabated. Scaling DNN computations to larger clusters is generally done by distributing tasks in batch mode using methods such as distributed synchronous SGD. Among the issues with this approach is that to make the distributed cluster work with high utilization, the workload distributed to each node must be large, which implies nontrivial growth in the SGD mini-batch size. In this talk, we introduce a framework, FPDeep, which uses a hybrid of model and layer parallelism to configure distributed reconfigurable clusters to train DNNs. This approach has numerous benefits. First, the design does not suffer from batch size growth. Second, novel workload and weight partitioning leads to balanced loads of both among nodes. And third, the entire system is a fine-grained pipeline. This leads to high parallelism and utilization and also minimizes the time features need to be cached while waiting for back-propagation. As a result, storage demand is reduced to the point where only on-chip memory is used for the convolution layers. We evaluate FPDeep with the Alexnet, VGG-16, and VGG-19 benchmarks. FPDeep provides, on average, 6.36x higher energy efficiency than comparable GPU servers.



April 11, 2019 A Study of SIMD Vectorization for Matrix-Free Finite Element Method

Tianjiao Sun  
Imperial College London, UK

Lawrence Mitchell 
Durham University, UK

David A. Ham 
Imperial College London, UK

Paul H. J. Kelly  
Imperial College London, UK

Kaushik Kulkami  
University of Illinois at Urbana-Champaign, USA

Andreas Kloeckner
University of Illinois at Urbana-Champaign, USA

Vectorization is increasingly important for achieving high performance on modern hardware with SIMD instructions. Assembly of matrices and vectors in the finite element method, which is characterized by iterating a local assembly kernel over unstructured meshes, poses challenges to effective vectorization. Maintaining a user-friendly high-level interface with a suitable degree of abstraction while generating efficient, vectorized code for the finite element method is a challenge for numerical software systems and libraries. In this talk, we study cross-element vectorization in the finite element framework Firedrake and demonstrate the efficacy of the approach by evaluating a wide range of matrix-free operators spanning different polynomial degrees and discretizations on two recent Intel CPUs using three mainstream compilers. Our experiments show that cross-element vectorization achieves 30% of theoretical peak performance for many examples of practical significance, and exceeds 50% for cases with high arithmetic intensities, with consistent speed-ups over vectorization restricted to the local assembly kernels.



March 14, 2019 Scalable and Flexible Distributed Rendering with OSPRay's Distributed API and FrameBuffer

Will Usher 
Scientific Computing and Imaging Institute, University of Utah

Ingo Wald 
Formerly Intel, now NVIDIA

Jefferson Amstutz 
Intel Corporation

Johannes Günther 
Intel Corporation

Carson Brownlee 
Intel Corporation

Valerio Pascucci 
Scientific Computing and Imaging Institute, University of Utah

Image- and data-parallel rendering across multiple nodes of an HPC system is widely used in visualization to provide higher framerates, support large datasets, and render data in situ. Specifically for in situ visualization, reducing the bottlenecks incurred by the visualization and compositing tasks is key to reducing overall simulation run time, while for general interactive visualization, improving rendering performance, and thus interactivity, is always desirable. In this talk, Will Usher will present our work on an asynchronous image processing and compositing framework for multi-node rendering in OSPRay, dubbed the Distributed FrameBuffer. We demonstrate that this approach achieves performance superior to the state of the art for common use cases, while providing the flexibility to support a wide range of parallel rendering algorithms and data distributions. By building on this framework, we have extended OSPRay with a data-distributed API, enabling its use in data-distributed and in situ visualization applications. Will Usher will cover our approach to developing this framework, performance considerations, and use cases and examples of the new data-distributed API in OSPRay.

[Video] Recording begins at 2:50.


February 14, 2019 Evaluation of Intel Memory Drive Technology Performance for Scientific Applications

Vladimir Mironov, Lomonosov Moscow State University

Yuri Alexeev, Argonne National Laboratory

Alexander Moskovsky, RSC Technologies

Andrey Kudryavtsev, Intel Corporation

This talk will present benchmark data for Intel Memory Drive Technology (IMDT), a new generation of software-defined memory (SDM) based on a collaboration between Intel and ScaleMP and built on 3D XPoint-based Intel Optane SSDs. IMDT performance was studied using synthetic benchmarks, scientific kernels, and applications. We chose these benchmarks to represent different patterns of computation and of accessing data on disk and in memory. To put IMDT performance in context, we used two memory configurations: a hybrid IMDT DDR4/Optane system and a DDR4-only system. Performance was measured as a function of the percentage of memory used and analyzed in detail. We found that for some applications the DDR4/Optane hybrid configuration outperforms the DDR4 setup by up to 20%.

[Video] Recording begins at 5:35.


January 10, 2019 Massively Scalable Computing Method for Handling Large Eigenvalue Problem for Nanoelectronics Modeling Hoon Ryu
Korea Institute of Science and Technology Information (KISTI)

This talk will explain how the Lanczos iterative algorithm can be extended with parallel computing to solve highly degenerate systems. The talk will address the performance benefits of the core numerical operations in the Lanczos iteration when driven by manycore processors (KNL), compared to heterogeneous systems containing PCI-E add-in devices. This work will also present an extremely large-scale benchmark (~2,500 KNL compute nodes) recently performed on the KISTI-5 (NURION) HPC resource.

As this talk covers the numerical details of the algorithm, it should also be quite instructive for those considering KNL systems for solving large-scale eigenvalue problems.



October 11, 2018

Intel Optane Solutions in HPC

Andrey Kudryavtsev
Intel Corporation

This session focuses on the latest Intel Optane technologies and the ways they are used by HPC customers. Attendees will learn about the best usage models and the benefits Intel Optane brings to fast storage and to extending system memory.


August 9, 2018 

Machine Learning at Scale 

Deborah Bard and Karthik Kashinath, NERSC

Deep Learning has revolutionized the fields of computer vision, speech recognition, robotics and control systems. At NERSC, we have applied deep learning to problems in cosmology and climate science, focusing on areas that require supercomputing resources to solve real scientific challenges. In cosmology, we use deep learning to identify the underlying physical model that produced the matter distribution in the universe, and develop a deep learning-based emulator for cosmological observables that can reduce the need for computationally expensive simulations. In addition, we use feature introspection to examine the physical structures identified by the network as distinguishing between cosmological models. 


In climate, we apply deep learning to detect and localize extreme weather events such as tropical cyclones, atmospheric rivers and weather fronts in large-scale simulated and observed datasets. We will also discuss the challenges involved in scaling deep learning frameworks to supercomputer scale, and how to obtain optimal performance from supercomputing hardware. 


 June 14, 2018

Using Roofline Analysis to Analyze, Optimize, & Vectorize Iso3DFD with Intel® Advisor 

Kevin O’Leary
Intel Corporation
This presentation will introduce the use of Intel® Advisor to help you enable vectorization in your application. We will use the Roofline model in Intel Advisor to see the impact of our optimizations. We will also demonstrate how Intel Advisor can detect inefficient memory access patterns or loop-carried dependencies in your application. The case study is Iso3DFD, a kernel that propagates a wave in a 3D field using finite differences with a 16th-order stencil in an isotropic medium.


May 10, 2018

High Productivity Languages

Rollin Thomas, NERSC

Sergey Maidanov
Intel Corporation

This talk will cover the challenges of numerical analysis and simulation at scale. Tools such as Python, which are often used for prototyping, are not designed to scale to large problems. As a result, organizations must maintain a dedicated team that takes a prototype created by research scientists and deploys it in the production environment.

A new approach is required to address both the scalability and productivity aspects of applied science, combining two distinct worlds: the best of HPC and the best of databases.

Starting with a brief overview of scalability with respect to modern hardware architecture, we will characterize the problem at scale, its inherent characteristics, and how these map onto software design choices. We will also discuss selected experimental/observational science applications making use of Python at the National Energy Research Scientific Computing Center (NERSC), and what NERSC has done in partnership with the Intel Python team to help application developers improve performance while retaining scientist/developer productivity.

[Slides 1]

[Slides 2]


April 12, 2018

Topology and Cache Coherence in Knights Landing and Skylake Xeon Processors

John McCalpin
Intel's second-generation Xeon Phi (Knights Landing) and Xeon Scalable Processor ("Skylake Xeon") are both based on a new 2-D mesh architecture with significant changes to the cache coherence protocol. This talk will review some of the most important new features of the coherence protocol (such as "snoop filters", "memory directories", and non-inclusive L3 caches) from a performance analysis perspective. For both of these processor families, the mapping from user-visible information (such as core numbers) to spatial location on the mesh is both undocumented and obscured by low-level renumbering. A methodology is presented that uses microbenchmarks and performance counters to invert this renumbering. This allows the display of spatially relevant performance counter data (such as mesh traffic) in a topologically accurate two-dimensional view. Applying these visualizations to simple benchmark results provides immediate intuitive insights into the flow of data in these systems, and reveals ways in which the new cache coherence protocols modify these flows.



March 8, 2018

Compiler Prefetching on KNL

Rakesh Krishnaiyer
Intel Corporation
We will cover some of the recent changes in compiler-based prefetching (for Knights Landing and Skylake) and provide tips on how to tune for performance using compiler prefetching options, pragmas, and prefetch intrinsics.



February 8, 2018

Threading Building Blocks (TBB) Flow Graph: Expressing and Analyzing Dependencies in Your C++ Application

Pablo Reble
Intel Corporation

Developing for heterogeneous systems is challenging because applications may be composed of many layers of parallelism and employ a diverse set of programming models or libraries. This session focuses on Flow Graph, an extension to the Threading Building Blocks (TBB) interface that can be used as a coordination layer for heterogeneity that retains optimization opportunities and composes with existing models. This extension assists in expressing complex synchronization and communication patterns and in balancing load between CPUs, GPUs, and FPGAs. 

Because a Flow Graph can express complex interactions, we use Intel Advisor's Flow Graph Analyzer (FGA), released as a Technology Preview in Parallel Studio XE 2018, to visualize interactions in a graph and map the application structure to performance data. Finally, we validate this approach by presenting use cases of applications built with Flow Graph.



January 11, 2018



Vectorization of Inclusive/Exclusive Scans in Compiler 19.0

Nikolay Panchenko
Intel Corporation

We propose a new OpenMP syntax to support inclusive and exclusive scan patterns. In computer science, this pattern is also known as a prefix or cumulative sum. The proposal defines several new constructs to support inclusive and exclusive scans through OpenMP, defines their semantics, and covers possible combinations of parallelization and vectorization. In the 18.0 compiler, three new experimental OpenMP SIMD features were added: vectorization of loops with breaks, syntax for compress/expand patterns, and syntax for the histogram pattern.




For more information about previous meetings, please refer to the minutes.