IXPUG HPC Asia 2021

Conference Dates: January 20-22, 2021

Workshop Date: January 22, 2021 (Day 3; Korean Standard Time, GMT+9)

Location: HPC Asia 2021 (International Conference on HPC in Asia-Pacific Region), Jeju, South Korea—Online Conference

Event Description:

The Intel eXtreme Performance Users Group (IXPUG) is an active community-led forum for sharing industry best practices, techniques, tools, etc. for maximizing efficiency on Intel platforms and products. The IXPUG Workshop at HPC Asia 2021 is an open workshop on high-performance computing applications, systems, and architecture with Intel technologies. This is a half-day workshop with invited talks and contributed papers. The workshop aims to bring together software developers and technology experts to share challenges, experiences, and best-practice methods for the optimization of HPC, Machine Learning, and Data Analytics workloads on Intel® Xeon® Scalable processors, Intel® Xeon Phi™ processors, Intel® FPGAs, and any related hardware/software platforms. The workshop will cover application performance and scalability challenges at all levels, from intra-node performance up to large-scale compute systems. Research on any aspect of Intel HPC products is welcome at this workshop.

Workshop Agenda

All times are shown in KST (Korean Standard Time, GMT+9)

Time	Title	Author(s)	Presentation
9:00	Opening Remarks	Taisuke Boku (Workshop co-chair, University of Tsukuba)	Presentation Recording
9:10	Keynote Address: Advancing HPC Together. Abstract: The HPC industry is undergoing a seismic shift and growth due to global Exascale initiatives, the emergence of AI, and the accelerated migration of workloads to the Cloud. At the same time, increasing demands for high-performance data analytics and computational workloads have resulted in expanding ecosystems of diverse general-purpose processors and accelerator technologies. In this talk, we discuss how Intel is addressing the needs of the HPC community with a comprehensive portfolio of products and technologies that are built on top of an open, scalable, and standards-based ecosystem in order for the community to advance HPC together.	John K. Lee (Intel Corporation)	Presentation Recording
9:55	High Performance Simulations of Quantum Transport using Manycore Computing. Abstract: The Non-Equilibrium Green’s Function (NEGF) method has been widely used in nanoscience and nanotechnology to predict carrier transport behavior in electronic device channels whose sizes are in the quantum regime. This work explores how much performance improvement can be achieved for NEGF computations using the unique features of manycore computing, where the core numerical step of NEGF computations involves a recursive process of matrix-matrix multiplication. The major techniques adopted for performance enhancement are data restructuring, matrix tiling, thread scheduling, and offload computing, and we present an in-depth discussion of why they are critical to fully exploiting the power of manycore computing hardware, including Intel Xeon Phi Knights Landing systems and NVIDIA general-purpose graphics processing unit (GPU) devices. The performance of the optimized algorithm has been tested on a single computing node, where the host is a Xeon Phi 7210 equipped with two NVIDIA Quadro GV100 GPU devices. The target structure of the NEGF simulations is a [100] silicon nanowire that consists of 100K atoms and involves a 1000K×1000K complex Hamiltonian matrix. Through rigorous benchmark tests, we show that, with the optimization techniques explained in detail, the workload can be accelerated by a factor of up to ∼20 compared to the unoptimized case.	Yosang Jeong, Hoon Ryu (KISTI)	Presentation Recording
10:25	Distributed MLPerf ResNet50 Training on Intel Xeon Architectures with TensorFlow. Abstract: MLPerf benchmarks, which measure training and inference performance of ML hardware and software, have published three sets of ML training results so far. In all sets of results, ResNet50v1.5 was used as a standard benchmark to showcase the latest developments on image recognition tasks. The latest MLPerf training round (v0.7) featured Intel’s submission with TensorFlow. In this paper, we describe the recent optimization work that enabled this submission. In particular, we enabled the BFloat16 data type in the ResNet50v1.5 model as well as in Intel-optimized TensorFlow to exploit the full potential of 3rd Generation Intel Xeon Scalable processors, which have built-in BFloat16 support. We also describe the performance optimizations as well as the state-of-the-art accuracy/convergence results of the ResNet50v1.5 model, achieved with large-scale distributed training (with up to 256 MPI workers) using Horovod. These results lay a strong foundation for future MLPerf training submissions with large-scale Intel Xeon clusters.	Wei Wang, Niranjan Hasabnis (Intel Corporation)	Presentation Recording
10:55	BREAK (20 min.)
11:15	Invited Talk: oneAPI Industry Initiative for Accelerated Computing. Abstract: High-performance data analytics and computational workloads have created demand for diverse integrated and attached accelerator technologies. While this unlocks the potential for greater energy efficiency and improved time to results, architectural diversity can create both economic and technical challenges for application and framework developers. In this talk, we discuss the opportunity to create open, scalable, and standardized interfaces to resolve these problems through the oneAPI initiative.	Joe Curley (Intel Corporation)	Presentation Recording
11:50	Single-Precision Calculation of Iterative Refinement of Eigenpairs of a Real Symmetric-Definite Generalized Eigenproblem by Using a Filter Composed of a Single Resolvent. Abstract: By using a filter, we calculate approximate eigenpairs of a real symmetric-definite generalized eigenproblem 𝐴v = 𝜆𝐵v whose eigenvalues are in a specified interval. In the experiments in this paper, the IEEE 754 single-precision floating-point (32-bit binary) number system is used for calculations. In general, a filter is constructed from several resolvents R(𝜌) with different shifts 𝜌. For a given vector x, the action of a resolvent y := R(𝜌)x is obtained by solving the system of linear equations 𝐶(𝜌)y = 𝐵x for y, where the coefficient matrix 𝐶(𝜌) = 𝐴 − 𝜌𝐵 is symmetric. We assume this system is solved by a matrix factorization of 𝐶(𝜌), for example by the modified Cholesky method (𝐿𝐷𝐿^𝑇 decomposition). When both matrices 𝐴 and 𝐵 are banded, 𝐶(𝜌) is also banded, and the modified Cholesky method for banded systems can be used to solve the system of linear equations. The filter we use is either a polynomial of a resolvent with a real shift, or a polynomial of the imaginary part of a resolvent with an imaginary shift. We use only a single resolvent to construct the filter in order to reduce both the amount of calculation needed to factor matrices and, especially, the storage needed to hold the matrix factors. The main disadvantage of using only a single resolvent rather than many is that such a filter has poor properties, especially when computation is done in single precision. Therefore, the required approximate eigenpairs are not obtained with good accuracy if they are extracted from the set of vectors produced by one application of a combination of 𝐵-orthonormalization and filtering to a set of initial random vectors. However, experiments show that the required approximate eigenpairs are refined well if they are extracted from the set of vectors obtained by a few applications of a combination of 𝐵-orthonormalization and filtering to a set of initial random vectors.	Hiroshi Murakami (Tokyo Metropolitan University)	Presentation Recording
12:20	A Comparison of Parallel Profiling Tools for Programs Utilizing the FFT. Abstract: Performance monitoring is an important component of code optimization. Performance monitoring is also important for the beginning user, but can be difficult to configure appropriately. The overheads of the performance monitoring tools CrayPat, FPMP, mpiP, Scalasca, and TAU are measured using default configurations likely to be chosen by a novice user, and are shown to be small when profiling Fast Fourier Transform-based solvers for the Klein-Gordon equation built on 2decomp&FFT and on FFTE. Performance measurements help explain why, despite FFTE having a more efficient parallel algorithm, it is not always faster than 2decomp&FFT: the compiled single-core FFT in FFTE is not as fast as the one in FFTW, which is used in 2decomp&FFT.	Brian Leu (Applied Dynamics International), Samar Aseeri (KAUST), Benson K. Muite (Kichakato Kizito)	Presentation Recording
12:50	Efficient Parallel Multigrid Method on Intel Xeon Phi Clusters. Abstract: The parallel multigrid method is expected to play an important role in scientific computing on exascale supercomputer systems for solving large-scale linear equations with sparse matrices. Because solving sparse linear systems is a very memory-bound process, an efficient storage method for coefficient matrices is a crucial issue. In previous work, the authors applied the sliced ELL method to parallel conjugate gradient solvers with multigrid preconditioning (MGCG) for an application to 3D groundwater flow through heterogeneous porous media (pGW3D-FVM), and obtained excellent performance on large-scale multicore/manycore clusters. In the present work, the authors introduce SELL-C-σ to the MGCG solver and evaluate the performance of the solver with various types of OpenMP/MPI hybrid parallel programming models on the Oakforest-PACS (OFP) system at JCAHPC using up to 1,024 nodes of Intel Xeon Phi. Because SELL-C-σ is suitable for wide-SIMD architectures such as Xeon Phi, the performance improvement over the sliced ELL was more than 20%. This is one of the first examples of SELL-C-σ applied to forward/backward substitutions in the ILU-type smoother of a multigrid solver. Furthermore, the effects of IHK/McKernel have been investigated, and it achieved an 11% improvement on 1,024 nodes.	Kengo Nakajima (the University of Tokyo), Balazs Gerofi (RIKEN R-CCS), Yutaka Ishikawa (RIKEN R-CCS), Masashi Horikoshi (Intel Corporation)	Presentation Recording
13:05	Closing Remarks	Toshihiro Hanawa (Workshop co-chair, the University of Tokyo)	Recording

Call for Papers (Closed)

IXPUG is soliciting submissions for technical presentations on innovative work using Intel architecture from users in academia, industry, government/national labs, etc., describing original discoveries, experiences, and methods for obtaining efficient and scalable use of heterogeneous systems. IXPUG welcomes full papers of up to 10 pages with a 30-minute presentation, as well as short papers of up to 4 pages with a 15-minute presentation. Please submit your paper to IXPUG via EasyChair: https://easychair.org/cfp/IXPUGWorkshopatHPCAsia2021. We welcome any topics on Intel architecture, including but not limited to the following topics of interest:

Paper Topics of Interest:

Application porting and performance optimization
Vectorization, memory, communications, thread and process management
Multi-node application experiences
Programming models, algorithms and methods
Software environment and tools
Benchmarking and profiling tools
Visualization development
FPGA applications and system software

Paper Format:

Authors are invited to submit technical papers of at most 10 pages (full paper) or 4 pages (short paper) in PDF format, including figures, tables, and references. Papers should be formatted in the ACM Proceedings Style, which can be obtained at: http://www.acm.org/publications/proceedings-template

Important Dates:

Full paper due: November 6, 2020 (closed)
Notification of acceptance: November 20, 2020
Camera-ready due: November 30, 2020
Workshop: Friday, January 22, 2021

Paper Submission Site: https://easychair.org/cfp/IXPUGWorkshopatHPCAsia2021

Publication:

All accepted papers will be included in the ACM Digital Library as part of the HPC Asia 2021 Workshop Proceedings. In addition, the final presentations from the IXPUG Workshop at HPC Asia 2021 will be made available for download at https://www.ixpug.org/resources.

Organizing Co-Chairs:

Taisuke Boku (University of Tsukuba)
Toshihiro Hanawa (The University of Tokyo)

Program Committee:

Thomas Steinke (Zuse Institute Berlin)
R. Glenn Brook (University of Tennessee Knoxville)
Richard Gerber (NERSC/Lawrence Berkeley National Laboratory)
Clay Hughes (Sandia National Laboratories)
David Keyes (King Abdullah University of Science & Technology)
Nalini Kumar (Intel Corporation)
James Lin (Shanghai Jiao Tong University)
David Martin (Argonne National Laboratory)
Alberto Di Meglio (CERN openlab)
Vladimir Mironov (Lomonosov Moscow State University)
Sergi Siso (UK Science & Technology Facilities Council)

Questions? All questions about workshop paper submission and organization should be sent to the workshop contact address (email address not shown). General questions should be sent to the general IXPUG contact address (email address not shown).