Resources

We have collected presentations from IXPUG workshops, annual meetings, and BOF sessions, and made them accessible here to view or download. You may search by event, keyword, science domain or author’s name. The database will be updated as new talks are made available.


Search Results: Showing 1 - 10 of 490 Results

IXPUG Webinar Series Mar 15, 2021

The trifecta of high data volumes, abundant compute availability in the cloud and on-premises, and rapid algorithmic innovation enables data scientists and AI researchers to experiment, prototype, and develop models at a pace that was not possible before. In this talk, we will touch upon a variety of software packages, libraries, and tools that can also help HPC practitioners push the envelope of applying AI in their application domains and simulations at scale. We will cover examples and discuss how to create efficient end-to-end AI pipelines with large in-memory data sets, security, and other features through Intel-optimized software packages such as Intel® Distribution of Python, Intel® Optimized Modin, Intel® Optimized Sklearn, and XGBoost, as well as DL frameworks such as Intel® Optimized TensorFlow and Intel® Optimized PyTorch, which are tuned and enabled with new hardware features and instructions in every new CPU generation.

Keyword(s): oneAPI,Modin,Intel® AI Analytics Toolkit,Intel® Distribution of Modin,Scikit-learn,XGBoost,Machine Learning,Census,PLAsTiCC,SigOpt

Author(s): Meena Arunachalam, Vrushabh Sanghavi
Video(s): Performance Optimizations for End-to-End AI Pipelines

IXPUG Webinar Series Mar 08, 2021

In this webinar, we will demonstrate how an existing CUDA stencil application code can be migrated to DPC++ with the help of the Compatibility Tool. We will highlight and discuss the crucial differences between the two programming environments in the context of migrating the tsunami simulation easyWave. The discussion also includes the steps for making the code compliant with the SYCL standard. During the talk, we will also show that the migrated code can run on a wide range of platforms, from CPUs through GPUs to FPGAs.

Keyword(s): oneAPI,SYCL,easyWave,heterogeneous architectures,Data Parallel C++,stencil kernels,Compatibility Tool,Unified Shared Memory,DPC++

Author(s): Marius Knaust, Steffen Christgau
Video(s): Migrating from CUDA-only to Multi-Platform DPC++

IXPUG Workshop at HPC Asia 2021 Feb 08, 2021

The HPC industry is undergoing a seismic shift and growth due to global exascale initiatives, the emergence of AI, and the accelerated migration of workloads to the cloud. At the same time, increasing demands for high-performance data analytics and computational workloads have resulted in expanding ecosystems of diverse general-purpose processors and accelerator technologies. In this talk, we discuss how Intel is addressing the needs of the HPC community with a comprehensive portfolio of products and technologies that are built on top of an open, scalable, and standards-based ecosystem so that the community can advance HPC together.

Keyword(s): Exascale,Cloud,XPU,Heterogeneous Acceleration,oneAPI,DevCloud

Author(s): John K. Lee
Video(s): Keynote Address: Advancing HPC Together

IXPUG Workshop at HPC Asia 2021 Feb 08, 2021

Taisuke Boku (Workshop co-chair, University of Tsukuba)

Keyword(s): Optimization Targets,Heterogeneous Architectures

Author(s): Taisuke Boku
Video(s): Opening Remarks: IXPUG Workshop at HPC Asia 2021

IXPUG Workshop at HPC Asia 2021 Feb 08, 2021

The parallel multigrid method is expected to play an important role in scientific computing on exascale supercomputer systems for solving large-scale linear equations with sparse matrices. Because solving sparse linear systems is a very memory-bound process, an efficient storage format for coefficient matrices is crucial. In previous work, the authors applied the sliced ELL format to parallel conjugate gradient solvers with multigrid preconditioning (MGCG) for an application to 3D groundwater flow through heterogeneous porous media (pGW3D-FVM), and obtained excellent performance on large-scale multicore/manycore clusters. In the present work, the authors introduce SELL-C-σ to the MGCG solver and evaluate its performance with various OpenMP/MPI hybrid parallel programming models on the Oakforest-PACS (OFP) system at JCAHPC, using up to 1,024 Intel Xeon Phi nodes. Because SELL-C-σ is well suited to wide-SIMD architectures such as Xeon Phi, performance improved by more than 20% over sliced ELL. This is one of the first examples of SELL-C-σ applied to forward/backward substitutions in the ILU-type smoother of a multigrid solver. Furthermore, the effects of IHK/McKernel were investigated, and it achieved an 11% improvement on 1,024 nodes.
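As a rough illustration of the storage idea mentioned in this abstract, the sketch below builds a plain sliced ELL layout and runs a sparse matrix-vector product over it. It is not the authors' implementation: the function names `to_sliced_ell` and `spmv` and the input layout are hypothetical, and the row sorting that distinguishes SELL-C-σ from sliced ELL is deliberately left out.

```python
def to_sliced_ell(rows, slice_height):
    """Pack a sparse matrix into sliced ELL form.

    rows: one list per matrix row, holding (column, value) pairs.
    Each slice of `slice_height` rows is padded only to the longest
    row inside that slice, and stored column-major within the slice.
    """
    slices = []
    for start in range(0, len(rows), slice_height):
        chunk = rows[start:start + slice_height]
        width = max((len(r) for r in chunk), default=0)
        # Padding entries use column 0 with value 0.0, so they add nothing.
        cols = [[r[k][0] if k < len(r) else 0 for r in chunk]
                for k in range(width)]
        vals = [[r[k][1] if k < len(r) else 0.0 for r in chunk]
                for k in range(width)]
        slices.append((start, cols, vals))
    return slices


def spmv(slices, x, n_rows):
    """Compute y = A x for a matrix stored by to_sliced_ell."""
    y = [0.0] * n_rows
    for start, cols, vals in slices:
        for col_k, val_k in zip(cols, vals):
            # The inner loop walks across the rows of one slice; on
            # wide-SIMD hardware this is the loop that would vectorize.
            for i, (c, v) in enumerate(zip(col_k, val_k)):
                y[start + i] += v * x[c]
    return y


# Small tridiagonal example (4 x 4), split into slices of 2 rows.
rows = [
    [(0, 2.0), (1, -1.0)],
    [(0, -1.0), (1, 2.0), (2, -1.0)],
    [(1, -1.0), (2, 2.0), (3, -1.0)],
    [(2, -1.0), (3, 2.0)],
]
y = spmv(to_sliced_ell(rows, 2), [1.0, 2.0, 3.0, 4.0], 4)
```

Padding per slice rather than per matrix is what keeps the wasted storage small when row lengths vary; sorting rows by length within a window, as SELL-C-σ does, would reduce the padding further.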

Keyword(s): Multigrid Methods,Geometrical Multigrid,Algebraic Multigrid,Oakforest-PACS,Weak Scaling,Flat MPI

Author(s): Kengo Nakajima, Balazs Gerofi, Yutaka Ishikawa, Masashi Horikoshi
Video(s): Efficient Parallel Multigrid Method on Intel Xeon Phi Clusters

IXPUG Workshop at HPC Asia 2021 Feb 08, 2021

Performance monitoring is an important component of code optimization. It is also important for the beginning user, but can be difficult to configure appropriately. The overheads of the performance monitoring tools CrayPat, FPMPI, mpiP, Scalasca, and TAU are measured using default configurations likely to be chosen by a novice user, and are shown to be small when profiling Fast Fourier Transform based solvers for the Klein-Gordon equation built on 2decomp&FFT and on FFTE. Performance measurements help explain why FFTE, despite having a more efficient parallel algorithm, is not always faster than 2decomp&FFT: the compiled single-core FFT in FFTE is not as fast as the one in FFTW, which 2decomp&FFT uses.

Keyword(s): Fast Fourier Transform solver,Klein Gordon equation,Performance Profiling,TAU,IPM,CrayPat,FPMPI,mpiP,Scalasca

Author(s): Brian Leu, Samar Aseeri, Benson Muite
Video(s): A Comparison of Parallel Profiling Tools for Programs Utilizing the FFT

IXPUG Workshop at HPC Asia 2021 Feb 08, 2021

Single-Precision Calculation of Iterative Refinement of Eigenpairs of a Real Symmetric-Definite Generalized Eigenproblem by Using a Filter Composed of a Single Resolvent

By using a filter, we calculate approximate eigenpairs of a real symmetric-definite generalized eigenproblem Av = λBv whose eigenvalues are in a specified interval. In the experiments in this paper, the IEEE-754 single-precision floating-point (binary 32-bit) number system is used for calculations. In general, a filter is constructed from several resolvents R(ρ) with different shifts ρ. For a given vector x, the action of a resolvent, y := R(ρ)x, is obtained by solving the system of linear equations C(ρ)y = Bx for y, where the coefficient matrix C(ρ) = A−ρB is symmetric. We assume that this system is solved by a matrix factorization of C(ρ), for example by the modified Cholesky method (LDL^T decomposition method). When both matrices A and B are banded, C(ρ) is also banded, and the modified Cholesky method for banded systems can be used. The filter we use is either a polynomial of a resolvent with a real shift, or a polynomial of the imaginary part of a resolvent with an imaginary shift. We use only a single resolvent to construct the filter in order to reduce both the amount of calculation needed to factor matrices and, especially, the storage needed to hold the matrix factors. The main disadvantage of using a single resolvent rather than many is that such a filter has poor properties, especially when computation is done in single precision. Therefore, the required approximate eigenpairs are not obtained with good accuracy if they are extracted from the set of vectors produced by one application of a combination of B-orthonormalization and filtering to a set of initial random vectors. However, experiments show that the required approximate eigenpairs are refined well if they are extracted from the set of vectors obtained by a few applications of a combination of B-orthonormalization and filtering to a set of initial random vectors.
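The resolvent action described in this abstract can be sketched in a few lines. The example below assumes the conventional notation in which the eigenproblem is A v = λ B v and the shifted coefficient matrix is A − ρB (these symbol names are assumptions), uses small dense matrices instead of banded ones, and runs in Python's double precision rather than single precision. The helper names `ldlt_solve` and `resolvent_apply` are hypothetical, and the factorization has no pivoting, so it is only an illustration, not the author's code.

```python
def ldlt_solve(c, rhs):
    """Solve C y = rhs for symmetric C using an unpivoted LDL^T
    (modified Cholesky) factorization. Fails if a pivot d[j] is zero."""
    n = len(c)
    low = [[0.0] * n for _ in range(n)]
    d = [0.0] * n
    for j in range(n):
        d[j] = c[j][j] - sum(low[j][k] ** 2 * d[k] for k in range(j))
        low[j][j] = 1.0
        for i in range(j + 1, n):
            s = sum(low[i][k] * low[j][k] * d[k] for k in range(j))
            low[i][j] = (c[i][j] - s) / d[j]
    # Forward substitution: L z = rhs
    z = list(rhs)
    for i in range(n):
        z[i] -= sum(low[i][k] * z[k] for k in range(i))
    # Diagonal scaling: D w = z
    y = [z[i] / d[i] for i in range(n)]
    # Backward substitution: L^T y = w
    for i in reversed(range(n)):
        y[i] -= sum(low[k][i] * y[k] for k in range(i + 1, n))
    return y


def resolvent_apply(a, b, rho, x):
    """Return y = R(rho) x by solving (A - rho B) y = B x."""
    n = len(a)
    c = [[a[i][j] - rho * b[i][j] for j in range(n)] for i in range(n)]
    bx = [sum(b[i][j] * x[j] for j in range(n)) for i in range(n)]
    return ldlt_solve(c, bx)


# (1, 1) is an eigenvector of A with eigenvalue 1 when B is the identity,
# so the resolvent with shift 0.5 scales it by 1 / (1 - 0.5) = 2.
A = [[2.0, -1.0], [-1.0, 2.0]]
B = [[1.0, 0.0], [0.0, 1.0]]
y = resolvent_apply(A, B, 0.5, [1.0, 1.0])
```

The example shows why a resolvent works as a filter: components along eigenvectors whose eigenvalues lie near the shift are amplified by 1 / (λ − ρ), while components far from the shift are damped.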

Keyword(s): Single-Precision Calculation,Eigenproblem,Single Resolvent,Computation,Storage,GEVP

Author(s): Hiroshi Murakami
Video(s):

IXPUG Workshop at HPC Asia 2021 Feb 08, 2021

High-performance data analytics and computational workloads have created demand for diverse integrated and attached accelerator technologies. While this unlocks the potential for greater energy efficiency and improved time to results, architectural diversity can create both economic and technical challenges for application and framework developers. In this talk, we discuss the opportunity to create open, scalable, and standardized interfaces to resolve these problems through the oneAPI initiative.

Keyword(s): oneAPI,Accelerated Computing,Integrated Accelerators,Attached Accelerators,oneAPI Industry Initiative,SYCL,Data Parallel C++,EasyWave Application,GROMACS,DevCloud

Author(s): Joe Curley
Video(s): Invited Talk: oneAPI Industry Initiative for Accelerated Computing

IXPUG Workshop at HPC Asia 2021 Feb 08, 2021

MLPerf benchmarks, which measure the training and inference performance of ML hardware and software, have published three sets of ML training results so far. In all sets of results, ResNet50v1.5 was used as a standard benchmark to showcase the latest developments on image recognition tasks. The latest MLPerf training round (v0.7) featured Intel’s submission with TensorFlow. In this paper, we describe the recent optimization work that enabled this submission. In particular, we enabled the BFloat16 data type in the ResNet50v1.5 model as well as in Intel-optimized TensorFlow to exploit the full potential of 3rd Generation Intel Xeon Scalable processors, which have built-in BFloat16 support. We also describe the performance optimizations as well as the state-of-the-art accuracy/convergence results of the ResNet50v1.5 model, achieved with large-scale distributed training (with up to 256 MPI workers) with Horovod. These results lay a strong foundation for future MLPerf training submissions with large-scale Intel Xeon clusters.
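For readers unfamiliar with the BFloat16 data type mentioned in this abstract, the sketch below shows its relationship to IEEE-754 single precision: a BFloat16 value keeps the sign bit, the 8 exponent bits, and the top 7 mantissa bits of a float32. The function names are hypothetical, the rounding shown is round-to-nearest-even, and NaN inputs are not handled; this illustrates the number format only, not Intel's or TensorFlow's conversion code.

```python
import struct


def to_bfloat16_bits(x):
    """Round a Python float to BFloat16 and return its 16-bit pattern."""
    # Reinterpret the value as the 32-bit pattern of an IEEE-754 float32.
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    # Round to nearest, ties to even, on the 16 bits that get dropped.
    rounding_bias = 0x7FFF + ((bits >> 16) & 1)
    return ((bits + rounding_bias) >> 16) & 0xFFFF


def from_bfloat16_bits(b):
    """Expand a 16-bit BFloat16 pattern back to a Python float."""
    return struct.unpack("<f", struct.pack("<I", b << 16))[0]


def round_to_bfloat16(x):
    """Return the value x takes after a round trip through BFloat16."""
    return from_bfloat16_bits(to_bfloat16_bits(x))


print(round_to_bfloat16(1.0))      # 1.0 is exactly representable
print(round_to_bfloat16(3.14159))  # only about 3 significant decimal digits survive
```

Because the exponent field is the same width as in float32, BFloat16 covers the same dynamic range and trades away only mantissa precision, which is why it is attractive for training workloads.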

Keyword(s): MLPerf,Training & Inference,TensorFlow,BFloat16,ResNet50,oneDNN,Horovod

Author(s): Wei Wang, Niranjan Hasabnis
Video(s): Distributed MLPerf ResNet50 Training on Intel Xeon Architectures with TensorFlow

IXPUG Workshop at HPC Asia 2021 Feb 08, 2021

The Non-Equilibrium Green’s Function (NEGF) has been widely used in nanoscience and nanotechnology to predict carrier transport behavior in electronic device channels whose sizes are in the quantum regime. This work explores how much performance improvement can be achieved for NEGF computations with the unique features of manycore computing, where the core numerical step of NEGF computations involves a recursive process of matrix-matrix multiplication. The major techniques adopted for performance enhancement are data restructuring, matrix tiling, thread scheduling, and offload computing, and we present an in-depth discussion of why they are critical to fully exploiting the power of manycore computing hardware, including Intel Xeon Phi Knights Landing systems and NVIDIA general-purpose graphics processing unit (GPU) devices. The performance of the optimized algorithm has been tested on a single computing node, where the host is a Xeon Phi 7210 equipped with two NVIDIA Quadro GV100 GPU devices. The target structure of the NEGF simulations is a [100] silicon nanowire that consists of 100K atoms and involves a 1000K×1000K complex Hamiltonian matrix. Through rigorous benchmark tests, we show that, with the optimization techniques explained in detail, the workload can be accelerated by a factor of up to ~20 compared to the unoptimized case.

Keyword(s): Non-Equilibrium Green's Function (NEGF),Recursive Green's Function (RGF),MPI,OpenMP,Blocked Matrix Multiplication,Thread-scheduling

Author(s): Hoon Ryu, Yosang Jeong
Video(s):