Webinar Series

IXPUG Webinar Series

The IXPUG Webinar Series fosters knowledge-sharing and enables greater collaboration on a range of topics across HPC, data analytics, artificial intelligence, and visualization development. The webinars are a great way to get involved in the IXPUG community! If you are interested in a specific topic for future webinars and/or would like to share your work with the IXPUG community, please contact the IXPUG Webinar Series organizer.

The goals of the Webinar Series are as follows:

  1. Direct IXPUG discussions to what is most relevant to the community.
  2. Disseminate results and techniques.
  3. Assist the community with performance debugging/troubleshooting on Intel HPC platforms.
  4. Provide a forum for collaboration between IXPUG members and Intel engineers.
  5. Help the community to prepare for upcoming IXPUG events.

How to Join

The Webinar Series is open to anyone who wishes to join. The live webinars are held on the second Thursday of every month from 08:00 AM to 08:30 AM PST, using GoToWebinar. To join us, please reference the upcoming webinars listed below and register for each presentation that interests you. To receive updates on future webinars, subscribe to the mailing list. Please note that you must register for an IXPUG account in order to subscribe.

 

Upcoming Meetings

Date Title Author(s) Description Presentation
August 8, 2019 To be announced soon     [Registration]
October 10, 2019 To be announced soon     [Registration]

 


Previous Meetings

Date Title Author(s) Description Presentation
July 11, 2019 Scaling Distributed TensorFlow Training with Intel’s nGraph Library on Xeon® Processor Based HPC Infrastructure
Jianying Lang
Intel Corporation
Intel has released the nGraph library, a compiler and runtime APIs for multiple front-end deep learning frameworks such as TensorFlow, MXNet, PaddlePaddle, and others. nGraph represents a framework's computational graph as an intermediate representation (IR) that can be executed on multiple hardware backends, from the edge to the data center, significantly improving the productivity of AI data scientists. In this talk, we will present the details of the bridge that connects TensorFlow to nGraph for a Xeon CPU backend. We will demonstrate state-of-the-art (SOTA) accuracy and convergence for ResNet-50 trained against ImageNet-1K on multiple Xeon Skylake nodes. Using distributed nGraph, we are able to obtain ~75% Top-1 accuracy for ResNet-50 training on a small number of Xeon Skylake nodes. We will demonstrate convergence and excellent scaling efficiency on Skylake nodes connected with Ethernet, using nGraph TensorFlow with the open-source Horovod framework.

[Video]

[PDF]

May 09, 2019 Deeply-Pipelined FPGA Clusters Make DNN Training Scalable
Tong Geng
Boston University

Tianqi Wang
University of Science and Technology of China

Deep Neural Networks (DNNs) have revolutionized numerous applications, but the demand for ever more performance remains unabated. Scaling DNN computations to larger clusters is generally done by distributing tasks in batch mode using methods such as distributed synchronous SGD. Among the issues with this approach is that, to keep the distributed cluster at high utilization, the workload distributed to each node must be large, which implies nontrivial growth in the SGD mini-batch size. In this talk, we introduce a framework, FPDeep, which uses a hybrid of model and layer parallelism to configure distributed reconfigurable clusters to train DNNs. This approach has numerous benefits. First, the design does not suffer from batch size growth. Second, novel workload and weight partitioning leads to balanced loads of both among nodes. Third, the entire system is a fine-grained pipeline, which leads to high parallelism and utilization and also minimizes the time features need to be cached while waiting for back-propagation. As a result, storage demand is reduced to the point where only on-chip memory is used for the convolution layers. We evaluate FPDeep with the AlexNet, VGG-16, and VGG-19 benchmarks. FPDeep provides, on average, 6.36x higher energy efficiency than comparable GPU servers.

[Video]

[PDF]

April 11, 2019 A Study of SIMD Vectorization for Matrix-Free Finite Element Method

Tianjiao Sun  
Imperial College London, UK

Lawrence Mitchell 
Durham University, UK

David A. Ham 
Imperial College London, UK

Paul H. J. Kelly  
Imperial College London, UK

Kaushik Kulkarni
University of Illinois at Urbana-Champaign, USA

Andreas Kloeckner
University of Illinois at Urbana-Champaign, USA

Vectorization is increasingly important for achieving high performance on modern hardware with SIMD instructions. Assembly of matrices and vectors in the finite element method, which is characterized by iterating a local assembly kernel over unstructured meshes, poses challenges to effective vectorization. Maintaining a user-friendly high-level interface with a suitable degree of abstraction while generating efficient, vectorized code for the finite element method is a challenge for numerical software systems and libraries. In this talk, we study cross-element vectorization in the finite element framework Firedrake and demonstrate the efficacy of such an approach by evaluating a wide range of matrix-free operators spanning different polynomial degrees and discretizations on two recent Intel CPUs using three mainstream compilers. Our experiments show that cross-element vectorization achieves 30% of theoretical peak performance for many examples of practical significance, and exceeds 50% for cases with high arithmetic intensities, with consistent speed-up over vectorization restricted to the local assembly kernels.

[Video]

[PDF]

March 14, 2019 Scalable and Flexible Distributed Rendering with OSPRay's Distributed API and FrameBuffer

Will Usher 
Scientific Computing and Imaging Institute, University of Utah

Ingo Wald 
Formerly Intel, now NVIDIA

Jefferson Amstutz 
Intel Corporation

Johannes Günther 
Intel Corporation

Carson Brownlee 
Intel Corporation

Valerio Pascucci 
Scientific Computing and Imaging Institute, University of Utah

Image- and data-parallel rendering across multiple nodes of an HPC system is widely used in visualization to provide higher framerates, support large datasets, and render data in situ. Specifically for in situ visualization, reducing the bottlenecks incurred by the visualization and compositing tasks is of key concern to reduce the overall simulation run time, while for general interactive visualization, improving rendering performance, and thus interactivity, is always desirable. In this talk, Will Usher will present our work on an asynchronous image processing and compositing framework for multi-node rendering in OSPRay, dubbed the Distributed FrameBuffer. We demonstrate that this approach achieves performance superior to the state of the art for common use cases, while providing the flexibility to support a wide range of parallel rendering algorithms and data distributions. By building on this framework, we have extended OSPRay with a data-distributed API, enabling its use in data-distributed and in situ visualization applications. Will Usher will cover our approach to developing this framework, performance considerations, and use cases and examples of the new data-distributed API in OSPRay.

[Video] Recording begins at 2:50.

February 14, 2019 Evaluation of Intel Memory Drive Technology Performance for Scientific Applications

Vladimir Mironov
Lomonosov Moscow State University

Yuri Alexeev
Argonne National Laboratory

Alexander Moskovsky
RSC Technologies

Andrey Kudryavtsev
Intel Corporation

This talk will present benchmark data for IMDT, a new generation of software-defined memory (SDM) based on a collaboration between Intel and ScaleMP and built on 3D XPoint-based Intel SSDs (Optane). IMDT performance was studied using synthetic benchmarks, scientific kernels, and applications. We chose these benchmarks to represent different patterns of computation and of accessing data on disk and in memory. To put IMDT performance in context, we used two memory configurations: a hybrid IMDT DDR4/Optane system and a DDR4-only system. Performance was measured as a function of the percentage of memory used and analyzed in detail. We found that for some applications the DDR4/Optane hybrid configuration outperforms the DDR4 setup by up to 20%.

[Video] Recording begins at 5:35.

[PDF]

January 10, 2019 Massively Scalable Computing Method for Handling Large Eigenvalue Problem for Nanoelectronics Modeling
Hoon Ryu
Korea Institute of Science and Technology Information (KISTI)

This talk will help you learn how the Lanczos iterative algorithm can be extended with parallel computing to solve highly degenerate systems. The talk will address the performance benefits of the core numerical operations in the Lanczos iteration when driven with manycore processors (KNL), compared to heterogeneous systems containing PCI-E add-in devices. This work will also demonstrate an extremely large-scale benchmark (~2,500 KNL compute nodes) that has been recently performed with the KISTI-5 (NURION) HPC resource.

As this talk covers the numerical details of the algorithm, it should also be quite instructive to those who are considering KNL systems for solving large-scale eigenvalue problems.
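For readers unfamiliar with the method, the sketch below shows a plain serial Lanczos iteration (an illustrative example, not the presenter's code). In the distributed setting discussed in the talk, the matrix-vector product and the dot products are the operations spread across nodes.

```cpp
// Illustrative serial Lanczos iteration: builds the diagonal (alpha) and
// off-diagonal (beta) of a k x k tridiagonal matrix T whose eigenvalues
// approximate the extremal eigenvalues of a symmetric matrix A.
#include <cmath>
#include <cstddef>
#include <random>
#include <vector>

using Vec = std::vector<double>;
using Mat = std::vector<Vec>;   // dense symmetric matrix, for clarity only

Vec matvec(const Mat& A, const Vec& v) {
    Vec w(v.size(), 0.0);
    for (std::size_t i = 0; i < A.size(); ++i)
        for (std::size_t j = 0; j < v.size(); ++j)
            w[i] += A[i][j] * v[j];          // in practice a distributed sparse matvec
    return w;
}

double dot(const Vec& a, const Vec& b) {
    double s = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i) s += a[i] * b[i];
    return s;
}

void lanczos(const Mat& A, std::size_t k, Vec& alpha, Vec& beta) {
    const std::size_t n = A.size();
    std::mt19937 gen(42);
    std::uniform_real_distribution<double> dist(-1.0, 1.0);

    Vec v(n), v_prev(n, 0.0);
    for (auto& x : v) x = dist(gen);
    double norm = std::sqrt(dot(v, v));
    for (auto& x : v) x /= norm;             // normalized start vector

    double b = 0.0;
    for (std::size_t j = 0; j < k; ++j) {
        Vec w = matvec(A, v);                // dominant cost per iteration
        double a = dot(w, v);
        alpha.push_back(a);
        for (std::size_t i = 0; i < n; ++i)
            w[i] -= a * v[i] + b * v_prev[i];  // orthogonalize against last two vectors
        b = std::sqrt(dot(w, w));
        if (b == 0.0) break;                 // breakdown: invariant subspace found
        if (j + 1 < k) beta.push_back(b);
        v_prev = v;
        for (std::size_t i = 0; i < n; ++i) v[i] = w[i] / b;
    }
}
```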

[Video]

[PDF]

October 11, 2018

Intel Optane Solutions in HPC

Andrey Kudryavtsev
Intel Corporation

This session focuses on the latest Intel Optane technologies and the ways they are used by HPC customers. Attendees will learn about the best usage models and the benefits Intel Optane introduces for fast storage or for extending system memory.

[Video]

August 9, 2018 

Machine Learning at Scale 

Deborah Bard and Karthik Kashinath
NERSC

Deep Learning has revolutionized the fields of computer vision, speech recognition, robotics and control systems. At NERSC, we have applied deep learning to problems in cosmology and climate science, focusing on areas that require supercomputing resources to solve real scientific challenges. In cosmology, we use deep learning to identify the underlying physical model that produced the matter distribution in the universe, and develop a deep learning-based emulator for cosmological observables that can reduce the need for computationally expensive simulations. In addition, we use feature introspection to examine the physical structures identified by the network as distinguishing between cosmological models. 

 

In climate, we apply deep learning to detect and localize extreme weather events such as tropical cyclones, atmospheric rivers and weather fronts in large-scale simulated and observed datasets. We will also discuss the challenges involved in scaling deep learning frameworks to supercomputer scale, and how to obtain optimal performance from supercomputing hardware. 

[Video]

 June 14, 2018

Using Roofline Analysis to Analyze, Optimize, & Vectorize Iso3DFD with Intel® Advisor 

Kevin O’Leary
Intel Corporation
This presentation will introduce the use of Intel® Advisor to help you enable vectorization in your application. We will use the Roofline model in Intel Advisor to see the impact of our optimizations. We will also demonstrate how Intel Advisor can detect inefficient memory access patterns or loop-carried dependencies in your application. The case study is Iso3DFD, a kernel that propagates a wave in a 3D field using finite differences with a 16th-order stencil in an isotropic medium.
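To illustrate the shape of the kernel under study, here is a heavily simplified sketch (2nd-order in space rather than 16th, and without the tuning discussed in the presentation) of a finite-difference wave-propagation step; loops of this form are what Advisor's roofline and memory-access-pattern analyses are run against.

```cpp
// Simplified 2nd-order 3D finite-difference wave-propagation step.
// The real Iso3DFD kernel uses a 16th-order stencil plus further tuning;
// vel is assumed to hold (velocity * dt)^2 for each grid point.
#include <cstddef>

void fd_step(float* next, const float* prev, const float* vel,
             std::size_t nx, std::size_t ny, std::size_t nz,
             float coef0, float coef1) {
    auto idx = [=](std::size_t i, std::size_t j, std::size_t k) {
        return (k * ny + j) * nx + i;
    };
    for (std::size_t k = 1; k < nz - 1; ++k)
        for (std::size_t j = 1; j < ny - 1; ++j)
            // Innermost loop over the contiguous dimension: the candidate
            // for vectorization that Advisor's survey and roofline highlight.
            for (std::size_t i = 1; i < nx - 1; ++i) {
                float lap = coef0 * prev[idx(i, j, k)]
                          + coef1 * (prev[idx(i - 1, j, k)] + prev[idx(i + 1, j, k)]
                                   + prev[idx(i, j - 1, k)] + prev[idx(i, j + 1, k)]
                                   + prev[idx(i, j, k - 1)] + prev[idx(i, j, k + 1)]);
                // Leap-frog time update: next = 2*prev - next + vel * laplacian.
                next[idx(i, j, k)] = 2.0f * prev[idx(i, j, k)]
                                   - next[idx(i, j, k)]
                                   + vel[idx(i, j, k)] * lap;
            }
}
```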

[Video]

May 10, 2018

High Productivity Languages

Rollin Thomas
NERSC

Sergey Maidanov
Intel Corporation

This talk will cover the challenges of numerical analysis and simulation at scale. Tools such as Python, which are often used for prototyping, are not designed to scale to large problems. As a result, organizations have to maintain a dedicated team that takes a prototype created by research scientists and deploys it in the production environment.

A new approach is required to address both the scalability and the productivity aspects of applied science, one that combines two distinct worlds: the best of the HPC world and the best of the database world.

 

Starting with a brief overview of scalability aspects with respect to modern hardware architecture, we will characterize the problem at scale, its inherent characteristics, and how these map onto software design choices. We will also discuss selected experimental/observational science applications making use of Python at the National Energy Research Scientific Computing Center (NERSC), and what NERSC has done in partnership with the Intel Python team to help application developers improve performance while retaining scientist/developer productivity.

[Slides 1]

[Slides 2]

[Video]

April 12, 2018

Topology and Cache Coherence in Knights Landing and Skylake Xeon Processors

John McCalpin
TACC
Intel's second-generation Xeon Phi (Knights Landing) and Xeon Scalable Processor ("Skylake Xeon") are both based on a new 2-D mesh architecture with significant changes to the cache coherence protocol. This talk will review some of the most important new features of the coherence protocol (such as "snoop filters", "memory directories", and non-inclusive L3 caches) from a performance analysis perspective. For both of these processor families, the mapping from user-visible information (such as core numbers) to spatial location on the mesh is both undocumented and obscured by low-level renumbering. A methodology is presented that uses microbenchmarks and performance counters to invert this renumbering. This allows the display of spatially relevant performance counter data (such as mesh traffic) in a topologically accurate two-dimensional view. Applying these visualizations to simple benchmark results provides immediate intuitive insights into the flow of data in these systems, and reveals ways in which the new cache coherence protocols modify these flows.

[Slides]

[Video]

March 8, 2018

Compiler Prefetching on KNL
Rakesh Krishnaiyer
Intel Corporation
We will cover some of the recent changes in compiler-based prefetching (for Knights Landing and Skylake) and provide tips on how to tune for performance using compiler prefetching options, pragmas, and prefetch intrinsics.
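As a reminder of what these controls look like in source code, the sketch below shows an Intel compiler prefetch pragma and an explicit prefetch intrinsic; the distances shown are arbitrary placeholders, and choosing good values is precisely what the talk addresses. Such code is typically built with software prefetching enabled (e.g., the Intel compiler's -qopt-prefetch option).

```cpp
// Illustrative prefetching controls (Intel C/C++ compiler pragma syntax).
// Effective prefetch distances are workload- and hardware-dependent.
#include <xmmintrin.h>   // _mm_prefetch
#include <cstddef>

void scale(float* y, const float* x, std::size_t n, float a) {
    // Ask the Intel compiler to issue software prefetches for x:
    // hint level 1, 16 iterations ahead (placeholder distance).
#pragma prefetch x:1:16
    for (std::size_t i = 0; i < n; ++i)
        y[i] = a * x[i];
}

void gather_scale(float* y, const float* x, const int* idx,
                  std::size_t n, float a) {
    const std::size_t dist = 32;   // placeholder prefetch distance
    for (std::size_t i = 0; i < n; ++i) {
        if (i + dist < n)
            // Explicit intrinsic prefetch of an irregularly accessed element.
            _mm_prefetch(reinterpret_cast<const char*>(&x[idx[i + dist]]),
                         _MM_HINT_T0);
        y[i] = a * x[idx[i]];
    }
}
```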

[Slides]

[Video]

February 8, 2018

Threading Building Blocks (TBB) Flow Graph: Expressing and Analyzing Dependencies in Your C++ Application

Pablo Reble
Intel Corporation

Developing for heterogeneous systems is challenging because applications may be composed of many layers of parallelism and employ a diverse set of programming models or libraries. This session focuses on Flow Graph, an extension to the Threading Building Blocks (TBB) interface that can be used as a coordination layer for heterogeneity that retains optimization opportunities and composes with existing models. This extension assists in expressing complex synchronization and communication patterns and in balancing load between CPUs, GPUs, and FPGAs. 

Because a flow graph can express complex interactions, we use Intel Advisor's Flow Graph Analyzer (FGA), released as a Technology Preview in Parallel Studio XE 2018, to visualize interactions in a graph and to map the application structure to performance data. Finally, we validate this approach by presenting use cases of applications that use Flow Graph.
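For readers new to the interface, a minimal flow graph (unrelated to the use cases presented in the talk) looks roughly like this:

```cpp
// Minimal TBB flow graph: a two-stage pipeline expressed as dependent nodes.
#include <tbb/flow_graph.h>
#include <iostream>

int main() {
    tbb::flow::graph g;

    // First stage: square the incoming integer (may run concurrently).
    tbb::flow::function_node<int, int> square(
        g, tbb::flow::unlimited, [](int v) { return v * v; });

    // Second stage: print results; serial concurrency keeps output ordered.
    tbb::flow::function_node<int, tbb::flow::continue_msg> print(
        g, tbb::flow::serial, [](int v) {
            std::cout << "result: " << v << "\n";
            return tbb::flow::continue_msg{};
        });

    tbb::flow::make_edge(square, print);   // express the dependency

    for (int i = 0; i < 8; ++i)
        square.try_put(i);                 // feed work into the graph
    g.wait_for_all();                      // block until all messages are processed
    return 0;
}
```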

[Slides]

[Video]

January 11, 2018

Vectorization of Inclusive/Exclusive Scan in Compiler 19.0
Nikolay Panchenko
Intel Corporation

We propose a new OpenMP syntax to support inclusive and exclusive scan patterns. In computer science, this pattern is also known as a prefix or cumulative sum. The proposal defines several new constructs to support inclusive and exclusive scans in OpenMP, defines the semantics of these constructs, and covers possible combinations of parallelization and vectorization. In the 18.0 compiler, three new OpenMP SIMD experimental features were added: vectorization of loops with breaks, syntax for compress/expand patterns, and syntax for the histogram pattern.
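For context, the sketch below uses the scan syntax as it was eventually standardized in OpenMP 5.0, which may differ in detail from the proposal discussed in this talk:

```cpp
// Inclusive and exclusive prefix sums vectorized with the OpenMP 5.0
// scan construct (reduction(inscan, ...) plus the scan directive).
#include <cstddef>

void inclusive_scan(const float* a, float* b, std::size_t n) {
    float sum = 0.0f;
#pragma omp simd reduction(inscan, +: sum)
    for (std::size_t i = 0; i < n; ++i) {
        sum += a[i];                 // update the running sum
#pragma omp scan inclusive(sum)
        b[i] = sum;                  // b[i] = a[0] + ... + a[i]
    }
}

void exclusive_scan(const float* a, float* b, std::size_t n) {
    float sum = 0.0f;
#pragma omp simd reduction(inscan, +: sum)
    for (std::size_t i = 0; i < n; ++i) {
        b[i] = sum;                  // b[i] = a[0] + ... + a[i-1]
#pragma omp scan exclusive(sum)
        sum += a[i];
    }
}
```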

[Slides]

For more information about previous meetings, please refer to the minutes.
