Resources

We have collected presentations from IXPUG workshops, annual meetings, and BOF sessions, and made them accessible here to view or download. You may search by event, keyword, science domain or author’s name. The database will be updated as new talks are made available.

Search Results: Showing 1 - 10 of 361

IXPUG Webinar Series May 13, 2019

Deep Neural Networks (DNNs) have revolutionized numerous applications, but the demand for ever more performance remains unabated. Scaling DNN computations to larger clusters is generally done by distributing tasks in batch mode using methods such as distributed synchronous SGD. Among the issues with this approach is that, to keep the distributed cluster working at high utilization, the workload distributed to each node must be large, which implies nontrivial growth in the SGD mini-batch size. In this talk, we introduce a framework, FPDeep, which uses a hybrid of model and layer parallelism to configure distributed reconfigurable clusters to train DNNs. This approach has numerous benefits. First, the design does not suffer from batch size growth. Second, novel workload and weight partitioning leads to a balanced distribution of both across nodes. Third, the entire system is a fine-grained pipeline, which leads to high parallelism and utilization and also minimizes the time features need to be cached while waiting for back-propagation. As a result, storage demand is reduced to the point where only on-chip memory is used for the convolution layers. We evaluate FPDeep with the AlexNet, VGG-16, and VGG-19 benchmarks. FPDeep provides, on average, 6.36x higher energy efficiency than comparable GPU servers.
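
The talk targets FPGA clusters, but the workload-partitioning idea can be illustrated in a few lines. The sketch below is a hedged, made-up example, not FPDeep code: the per-layer costs and node count are hypothetical, and layers are assigned to pipeline stages in proportion to cumulative compute cost so each node carries a roughly equal share.

```python
# Illustrative only: assign DNN layers to pipeline stages by cumulative cost.
# Layer costs (relative FLOPs) and the node count are hypothetical values.

def partition_layers(costs, num_nodes):
    """Map each layer to a stage according to where the midpoint of its
    cost interval falls in the cumulative cost."""
    total = float(sum(costs))
    stages = [[] for _ in range(num_nodes)]
    running = 0.0
    for i, cost in enumerate(costs):
        stage = min(num_nodes - 1, int((running + cost / 2.0) / total * num_nodes))
        stages[stage].append(i)
        running += cost
    return stages

# Hypothetical relative costs for an 8-layer network, split over 4 nodes.
layer_costs = [4.0, 8.0, 6.0, 6.0, 4.0, 2.0, 2.0, 1.0]
for node, layers in enumerate(partition_layers(layer_costs, 4)):
    print(f"node {node}: layers {layers}")
```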

Keyword(s): Data Parallelism, Inference, FPGA, FPDeep, Convolutional Neural Networks (CNN), Convolutional Neural Networks (CNN) Training, Hybrid Model/Layer Parallelism, Workload Partitioning

Author(s): Tong Geng, Tianqi Wang, Ahmed Sanaullah, Chen Yang, Rushi Patel, Martin Herbordt
Video(s): Deeply-Pipelined FPGA Clusters Make DNN Training Scalable
Read more
IXPUG Webinar Series Apr 12, 2019

Image- and data-parallel rendering across multiple nodes of an HPC system is widely used in visualization to provide higher framerates, support large datasets, and render data in situ. Specifically for in situ use, reducing bottlenecks incurred by the visualization and compositing tasks is of key concern to reduce the overall simulation run time, while for general interactive visualization improving rendering performance, and thus interactivity, is always desirable. In this talk, Will Usher will present our work on an asynchronous image processing and compositing framework for multi-node rendering in OSPRay, dubbed the Distributed FrameBuffer. We demonstrate that this approach achieves performance superior to the state of the art for common use cases, while providing the flexibility to support a wide range of parallel rendering algorithms and data distributions. By building on this framework, we have extended OSPRay with a data-distributed API, enabling its use in data-distributed and in situ visualization applications. Will Usher will cover our approach to developing this framework, performance considerations, and use cases and examples of the new data-distributed API in OSPRay.
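
The Distributed FrameBuffer itself lives in OSPRay's C++/MPI implementation; purely as a conceptual sketch, the NumPy snippet below shows the sort-last (depth) compositing step that such a framework performs, on made-up partial images and depth buffers. It is not OSPRay's API.

```python
# Conceptual sketch of sort-last depth compositing (NOT OSPRay code):
# each "node" contributes a partial image plus a per-pixel depth buffer,
# and compositing keeps the nearest fragment at every pixel.
import numpy as np

def composite(colors, depths):
    """colors: (nodes, H, W, 3); depths: (nodes, H, W).
    Keep, per pixel, the color from the node with the smallest depth."""
    nearest = np.argmin(depths, axis=0)                # (H, W) winning node index
    h, w = nearest.shape
    return colors[nearest, np.arange(h)[:, None], np.arange(w)[None, :]]

rng = np.random.default_rng(0)
colors = rng.random((4, 64, 64, 3))   # partial renders from 4 "nodes"
depths = rng.random((4, 64, 64))      # matching per-pixel depth buffers
final = composite(colors, depths)
print(final.shape)                    # (64, 64, 3)
```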

Keyword(s): Xeon Phi, Xeon, Data Parallel, Parallel Rendering, FrameBuffer, In Situ Visualization

Author(s): Ingo Wald, Will Usher, Jefferson Amstutz, Johannes Gunther, Carson Brownlee, Valerio Pascucci
Video(s): Webinar recording (begins at 2:50)
Read more
IXPUG Webinar Series Apr 11, 2019

This talk will help you learn how the Lanczos iterative algorithm can be extended with parallel computing to solve highly degenerate systems. The talk will address the performance benefits of the core numerical operations in the Lanczos iteration when driven with manycore processors (KNL), compared to heterogeneous systems containing PCI-E add-in devices. This work will also demonstrate an extremely large-scale benchmark (~2,500 KNL compute nodes) recently performed on the KISTI-5 (NURION) HPC resource. As this talk covers the numerical details of the algorithm, it should also be quite instructive to those considering KNL systems for solving large-scale eigenvalue problems.
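
For reference, the serial Lanczos iteration at the heart of such solvers fits in a few lines. The sketch below is a minimal single-node NumPy version, not the parallel KNL implementation from the talk; it shows where the dominant matrix-vector and vector operations appear.

```python
# Minimal single-node Lanczos sketch on a dense symmetric matrix.
import numpy as np

def lanczos(A, k, rng=np.random.default_rng(0)):
    """Build a k-step Lanczos tridiagonalization of symmetric A."""
    n = A.shape[0]
    q = rng.standard_normal(n)
    q /= np.linalg.norm(q)
    Q = np.zeros((n, k)); alpha = np.zeros(k); beta = np.zeros(k - 1)
    q_prev = np.zeros(n)
    for j in range(k):
        Q[:, j] = q
        w = A @ q                       # dominant cost: the matrix-vector product
        alpha[j] = q @ w
        w -= alpha[j] * q + (beta[j - 1] * q_prev if j > 0 else 0)
        if j < k - 1:
            beta[j] = np.linalg.norm(w)
            q_prev, q = q, w / beta[j]
    T = np.diag(alpha) + np.diag(beta, 1) + np.diag(beta, -1)
    return Q, T

A = np.random.default_rng(1).standard_normal((200, 200))
A = (A + A.T) / 2
Q, T = lanczos(A, 30)
print(np.sort(np.linalg.eigvalsh(T))[-3:])   # approximates A's largest eigenvalues
```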

Keyword(s): Xeon Phi, MPI, OpenMP

Author(s): Hoon Ryu (KISTI)
Video(s): Webinar recording
Read more
IXPUG Webinar Series Apr 11, 2019

This talk will present benchmark data for Intel Memory Drive Technology (IMDT), a new generation of software-defined memory (SDM) developed in collaboration between Intel and ScaleMP and built on 3D XPoint-based Intel Optane SSDs. IMDT performance was studied using synthetic benchmarks, scientific kernels, and applications, chosen to represent different patterns of computation and of data access on disk and in memory. To put IMDT performance in perspective, we compared two memory configurations: a hybrid IMDT DDR4/Optane system and a DDR4-only system. Performance was measured as a function of the percentage of memory used and analyzed in detail. We found that for some applications the hybrid DDR4/Optane configuration outperforms the DDR4-only setup by up to 20%.
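
The actual study uses synthetic benchmarks, scientific kernels, and full applications; as a hedged toy illustration of the measurement idea only, the snippet below times a STREAM-triad-like kernel while growing the working set, the kind of sweep used to plot performance against the fraction of memory in use (array sizes here are arbitrary).

```python
# Toy illustration only, not the actual IMDT benchmarks: time a streaming
# kernel at increasing working-set sizes and report effective bandwidth.
import time
import numpy as np

def stream_triad(n, trials=3):
    a = np.zeros(n); b = np.random.random(n); c = np.random.random(n)
    best = float("inf")
    for _ in range(trials):
        t0 = time.perf_counter()
        a[:] = b + 2.0 * c                 # STREAM "triad"-like kernel
        best = min(best, time.perf_counter() - t0)
    return 3 * n * 8 / best / 1e9          # GB/s moved (3 float64 arrays)

for n in [10**6, 10**7, 5 * 10**7]:        # growing working set
    print(f"n = {n:>9}: {stream_triad(n):6.1f} GB/s")
```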

Keyword(s): 3D XPoint, Optane, MKL

Author(s): Vladimir Mironov
Video(s): Webinar recording (begins at 5:35)
Read more
IXPUG Webinar Series Apr 11, 2019

Vectorization is increasingly important for achieving high performance on modern hardware with SIMD instructions. Assembly of matrices and vectors in the finite element method, which is characterized by iterating a local assembly kernel over unstructured meshes, poses challenges to effective vectorization. Maintaining a user-friendly high-level interface with a suitable degree of abstraction while generating efficient, vectorized code for the finite element method is a challenge for numerical software systems and libraries. In this talk, we study cross-element vectorization in the finite element framework Firedrake and demonstrate the efficacy of the approach by evaluating a wide range of matrix-free operators spanning different polynomial degrees and discretizations on two recent Intel CPUs using three mainstream compilers. Our experiments show that cross-element vectorization achieves 30% of theoretical peak performance for many examples of practical significance, and exceeds 50% for cases with high arithmetic intensity, with consistent speed-up over vectorization restricted to the local assembly kernels.
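
A hedged NumPy analogue of the idea (Firedrake generates C kernels, and the shapes and data below are made up): batching the local assembly kernel over many elements turns it into array arithmetic that vectorizes naturally, instead of a scalar loop over one element at a time.

```python
# Cross-element batching illustrated with NumPy; not Firedrake-generated code.
import numpy as np

rng = np.random.default_rng(0)
n_elem, n_basis, dim = 10000, 3, 2
grads = rng.random((n_elem, n_basis, dim))   # per-element basis gradients (made up)
detJ = rng.random(n_elem) + 0.5              # per-element Jacobian determinants

# Per-element assembly: one small local matrix at a time.
K_loop = np.empty((n_elem, n_basis, n_basis))
for e in range(n_elem):
    K_loop[e] = detJ[e] * grads[e] @ grads[e].T

# Cross-element assembly: one batched contraction over all elements at once.
K_batched = np.einsum("e,eid,ejd->eij", detJ, grads, grads)

print(np.allclose(K_loop, K_batched))        # True
```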

Keyword(s): Vectorization, algorithms, OpenMP, Xeon

Author(s): Tianjiao Sun, Lawrence Mitchell, David A. Ham, Paul H. J. Kelly, Kaushik Kulkarni, Andreas Kloeckner
Video(s): Webinar recording
Read more
IXPUG Annual Fall Conference 2018 Dec 27, 2018

Here an optimization strategy based on code modernization concepts is proposed and applied to the global MASNUM surface wave model, which has been used in several operational forecasting systems and earth system models.

Keyword(s): MASNUM, wave model

Author(s): Zhenya Song
Video(s): Optimization strategy for MASNUM surface wave model
Read more
IXPUG Annual Fall Conference 2018 Dec 27, 2018

We present a complementary physics-based, unsupervised approach that exploits the causal nature of spatiotemporal data sets generated by local dynamics (e.g., hydrodynamic flows). We illustrate how novel patterns and coherent structures can be discovered in cellular automata and outline the path from them to climate data.
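
For context only, the snippet below generates the kind of spatiotemporal field such methods analyze, using an elementary cellular automaton (rule 110); it is not the DisCo algorithm itself.

```python
# Generate a time-by-space field from an elementary cellular automaton.
# This only produces example input data; it is not the DisCo method.
import numpy as np

def evolve_ca(rule, width=200, steps=100, rng=np.random.default_rng(0)):
    table = np.array([(rule >> i) & 1 for i in range(8)], dtype=np.uint8)
    state = rng.integers(0, 2, width, dtype=np.uint8)
    history = [state]
    for _ in range(steps - 1):
        left, right = np.roll(state, 1), np.roll(state, -1)
        state = table[4 * left + 2 * state + right]
        history.append(state)
    return np.array(history)            # shape (steps, width): time x space

field = evolve_ca(110)
print(field.shape, field.mean())
```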

Keyword(s): unsupervised learning, parallel programming

Author(s): Adam Rupe, Karthik Kashinath, James Crutchfield, Ryan James, Prabhat
Video(s): Project DisCo: Physics-based discovery of coherent structures in spatiotemporal systems
Read more
IXPUG Annual Fall Conference 2018 Dec 27, 2018

Our work proposes several optimization techniques to improve the performance of a wave propagation model provided by Petrobras, a multinational corporation in the petroleum industry.
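 
As a hedged illustration of the kind of kernel such work optimizes (not Petrobras' actual model or the techniques in the talk), the sketch below takes explicit finite-difference time steps of the 2D acoustic wave equation, the stencil computation that dominates this workload.

```python
# Simple 2nd-order finite-difference update for the 2D acoustic wave equation.
# Illustrative only; periodic boundaries via np.roll, arbitrary parameters.
import numpy as np

def wave_step(p, p_prev, c2dt2_over_dx2):
    """p_next = 2p - p_prev + (c^2 dt^2 / dx^2) * laplacian(p)."""
    lap = (np.roll(p, 1, 0) + np.roll(p, -1, 0) +
           np.roll(p, 1, 1) + np.roll(p, -1, 1) - 4.0 * p)
    return 2.0 * p - p_prev + c2dt2_over_dx2 * lap

n = 256
p = np.zeros((n, n)); p[n // 2, n // 2] = 1.0     # point source
p_prev = p.copy()
for _ in range(200):
    p, p_prev = wave_step(p, p_prev, 0.1), p       # 0.1 keeps the scheme stable
print(p.min(), p.max())
```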

Keyword(s): performance optimization, oil & gas

Author(s): Eduardo Cruz, Philippe Navaux
Video(s): Improving Oil and Gas Extraction Simulation Performance using Intel® Xeon® and Xeon Phi™ Architectures
Read more
IXPUG Annual Fall Conference 2018 Dec 27, 2018

The Energy Exascale Earth System Model (E3SM) is one of the top users of resources at NERSC, and the Model for Prediction Across Scales - Ocean Core (MPAS-O) is a significant component of it, comprising 800,000 lines of Fortran and the work of 50 contributors. When MPAS-O was migrated from the previous-generation NERSC production system, Edison (Ivy Bridge processors), to the newer Knights Landing-based Cori system, severe performance loss and scaling bottlenecks resulted. Performance analysis was used to rule out a number of possible causes, including load imbalance, cache behavior, and vectorization efficiency. It was found that a lower bound on the number of simulation cells mapped to each MPI rank, combined with MPAS framework overhead caused by a serialized thread structure, was the overwhelming contributor to MPAS performance loss on Xeon Phi systems. Two framework optimizations, which remove excessive thread barriers and recycle communication data structures, have been incorporated into the E3SM master codebase, yielding a 15% speed improvement when running MPAS-O at production scale on Xeon Phi processors.
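
A loose Python analogue of one of the two fixes, recycling communication data structures: buffers are allocated once at setup and reused every time step instead of being rebuilt. The real change lives in the MPAS Fortran framework; the names and sizes below are invented.

```python
# Sketch of buffer recycling across time steps (not MPAS code).
import numpy as np

class HaloExchange:
    def __init__(self, halo_size):
        # Allocate send/recv buffers once, at setup time ...
        self.send = np.empty(halo_size)
        self.recv = np.empty(halo_size)

    def exchange(self, field, halo_idx):
        # ... and reuse them every time step instead of reallocating.
        self.send[:] = field[halo_idx]
        # (an MPI send/recv of self.send would go here)
        self.recv[:] = self.send          # stand-in for the remote data
        return self.recv

halo = HaloExchange(halo_size=1000)
field = np.random.random(10000)
idx = np.arange(1000)
for _ in range(100):                       # time-stepping loop
    ghost = halo.exchange(field, idx)
```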

Keyword(s): MPI, Climate and weather

Author(s): William Arndt
Video(s): Optimization of the Model for Prediction Across Scales: Ocean Core targeting Production Scale Use of Knights Landing Processor Architecture
Read more
IXPUG Annual Fall Conference 2018 Dec 27, 2018

This work significantly improves the OpenMP threading performance of Quantum ESPRESSO (QE) on Xeon and Xeon Phi processors.

Keyword(s): OpenMP, Density functional theory, 3D FFT

Author(s): Ye Luo
Video(s): Improved threading performance of Quantum ESPRESSO
Read more