Events

 


IXPUG Workshop: Many-core Computing on Intel Processors: Applications, Performance and Best-Practice Solutions

 

Location: Marriott Hotel - Matrix Room (5th floor), Frankfurt, Germany

Date: Thursday, June 28, 2018, 9:00am-6:00pm

Registration: The workshop is held in conjunction with ISC 2018 in Frankfurt am Main. To attend the IXPUG workshop, you must register for the ISC workshops. More information is available on the ISC 2018 conference website.

Event Description: The workshop will bring together software developers and technology experts to share challenges, experiences and best‐practice methods for the optimization of HPC, Machine Learning (ML) and Data Analytics (DA) workloads on Intel Xeon Scalable Processors and Intel Xeon Phi Processors. The workshop will cover application performance and scalability challenges at all levels – from intra-node performance up to large-scale compute systems.

The keynote will introduce the main features of current-generation Intel processor models for HPC and ML/DA workloads - including the various memory configurations and modes of operation available - and provide a refresher on what’s public about future processor generations.

The submitted talks cover optimization and scalability topics in real-world HPC and ML applications, e.g. data layouts and code restructuring for efficient SIMD operation, utilization of new AVX-512 instructions, work distribution and thread management. Furthermore, aspects related to deeper memory hierarchies (High-Bandwidth Memory, node-local persistent storage) are of particular interest. The usability of tools for development, debugging and performance analysis will be covered.

The panel session provides an opportunity to discuss optimization strategies for Intel many-core processors including Intel Xeon SP series, “Knights Landing” (KNL), and “Knights Mill” (KNM), and to provide feedback to the toolchain developers.

Agenda

Start–End | Title | Speaker* and Authors
09:00–09:15 | IXPUG Welcome |
09:15–10:00 | KEYNOTE: HPC's Impact on the Digital Economy Transformation | Mark Seager (Intel)
10:00–10:30 | CSB_Coo sparse matrix vector performance on Intel Xeon and Xeon Phi architectures | Brandon Cook, Charlene Yang, Thorsten Kurth and Jack Deslippe (LBL)
10:30–11:00 | Lessons Learned from Optimizing Kernels for Algebraic Multigrid Solvers in Lattice QCD | Balint Joo (Jefferson Lab) and Thorsten Kurth (LBL)
11:00–11:30 | Break |
11:30–12:00 | INVITED TALK: Mapping SIMD into Kokkos | Simon Hammond (Sandia)
12:00–12:30 | Distributed Training of Generative Adversarial Network | Sofia Vallecorsa (CERN), Federico Carminati (CERN), Gulrukh Khattak (CERN), Damian Podareanu (SURFSARA), Valeriu Codreanu (SURFSARA), Vikram Saletore (Intel) and Hans Pabst (Intel)
12:30–12:45 | Deep Learning with Many-Core Processors and BigDL using Scientific Datasets | David Ojika (Univ. Florida) and Bhavesh Patel (Dell)
12:45–13:00 | Performance optimization for modern many-core architectures using PSYclone embedded-DSL | Sergi Siso, Rupert Ford and Andrew Porter (STFC)
13:00–14:00 | Lunch |
14:00–15:00 | KEYNOTE: OpenMP API Version 5.0: A Story about Threads, Tasks, and Devices | Michael Klemm (CEO OpenMP)
15:00–15:15 | Optimised Data Decomposition for Reduced Communication Costs | Manos Farsarakis and Adrian Jackson (Univ. Edinburgh)
15:15–16:00 | Site Updates: IPCC Asian activity (Taisuke Boku, University of Tsukuba); TACC Science Stories (John Cazes, TACC); JSC Site Update (Bernd Mohr, Jülich Supercomputing Centre); KNL/OPA based KISTI 5th Supercomputer (Oh-Kyoung Kwon, KISTI Korea); Recent Progress of Big Data Research in IPCC China (Shun Xu, Chinese Academy of Sciences); SSCC: Siberian supercomputer center for applied scientific computing (Igor Chernykh, Siberian Supercomputer Center) |
16:00–16:30 | Break |
16:30–17:00 | INVITED TALK: DM-HEOM: A Portable and Scalable Solver-Framework for the Hierarchical Equations of Motion | Matthias Noack (ZIB)
17:00–17:55 | Open Discussion: Quo Vadis IXPUG | Thomas Steinke (ZIB)
17:55–18:00 | Wrap-up |

 

 

Call for PAPERS: The submission process opened on March 7, 2018 and will close on April 29, 2018 (extended from April 15, 2018). All submitters should provide an abstract and FULL PAPER, uploaded to the IXPUG EasyChair site. Notifications will be issued on May 20, 2018. Please be sure to focus your content on the approach that was taken, obstacles encountered, solutions developed, performance results and next steps.

 

Topics of interest include (but are not limited to): techniques for vectorization, memory, communications, thread and process management, multi-node application experiences, programming models, algorithms and methods, software environments and tools, benchmarking and profiling tools, visualization development, etc.

 

Important Dates:

         Call for Papers: March 14, 2018

         Deadline for submissions: April 29, 2018 AoE (extended from April 15, 2018)

         Final acceptance notification: May 20, 2018

         Conference-ready paper: June 17, 2018

         Camera-ready paper: July 29, 2018

Review Process

Reviewers are expected to make judgments based on what was available at the time reviews were assigned. Subsequent updates to content may or may not be considered by the program committee as part of the selection decision. We encourage authors to use the time up until the presentation and camera-ready copy to provide the highest-quality product.

All submitted papers will be reviewed. We apply a standard single-blind review process, i.e., the authors will be known to reviewers. All submissions within the scope of the workshop will be peer-reviewed and will need to demonstrate quality of the results, originality and new insights, technical strength, and correctness. The submitted papers may not be published in or be in preparation for other conferences, workshops or journals.

 

 

4:00 Alexander Breuer

Title: Best Practices for the Xeon Phi Coprocessor: Tuning SG++, ls1 mardyn and SeisSol

In this presentation we show recent advances of our Intel Parallel Computing Center at Leibniz Supercomputing Centre and Technische Universität München regarding the Xeon Phi coprocessor. A broad field of applications is covered: high-dimensional problems, molecular dynamics and earthquake simulation. The first part of the talk covers characteristic challenges of SG++ and ls1 mardyn. We present solutions and best practices on the coprocessors for exploiting the different levels of parallelism at the highest performance. In particular, novel algorithms that utilize the advanced SIMD capabilities and Phi-specific vector instructions turn out to be of major importance. In the second part we focus on the end-to-end performance tuning of the SeisSol software package. SeisSol simulates dynamic rupture and seismic wave propagation at petascale performance in production runs. Sustained machine-size performance of more than 2 DP-PFLOPS on Stampede for a Landers 1992 earthquake strong-scaling setting concludes the presentation.


4:15 Alexander Gaenko

Title: Porting GAMESS to Xeon Phi: Advances and Challenges

GAMESS is a freely available, mature, powerful quantum chemistry software package developed at Ames Laboratory. GAMESS is parallelized at the process level, rather than the thread level, using the Generalized Distributed Data Interface (GDDI) library. The GDDI library supports both two-sided (message passing) and one-sided (distributed memory) communication models, and can use MPI-1, TCP/IP and/or shared memory as the underlying transport mechanisms. The support for a full-featured OS, a TCP/IP stack and shared memory on the Intel Xeon Phi makes it a very attractive candidate for running GAMESS in native mode. The presentation will outline our experience with running the quantum many-body methods of GAMESS on Intel Xeon Phi, the use of offload and native modes, the mixed process/thread-based parallelization with and without MPI, the challenges and successes, and the initial performance analysis results.


4:30 Milind Girkar

Title: Explicit Vector Programming

As processor designs have faced increasing power constraints, processor architecture has evolved to offer multiple cores and wider vector execution units on a single die to increase performance. Exploiting these innovations in hardware requires software developers to find the parallelism in their programs and to express it in high-level programming languages. While constructs to express threaded parallelism to utilize the multiple cores have been available for some time, language-level constructs for exploiting the vector execution units have only recently become practical. We show how vector execution can be expressed through the recently published OpenMP 4.0 standard and its implementation in Intel compilers on Intel processors supporting SIMD (Intel SSE4.2, Intel AVX, Intel AVX2, Intel AVX-512) instructions.
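As a small illustrative sketch of the explicit vector programming style described above (not code from the talk; function and variable names are placeholders), the loop below uses the OpenMP 4.0 simd construct to ask the compiler to vectorize a scale-and-add loop. With the Intel compilers this would typically be built with an option such as -qopenmp-simd; exact flags depend on the compiler version.

```cpp
#include <cstddef>

// y[i] = a * x[i] + y[i]; the OpenMP "simd" construct instructs the compiler
// to generate SIMD code (SSE/AVX/AVX-512, depending on the target).
void saxpy(float a, const float *x, float *y, std::size_t n) {
    #pragma omp simd
    for (std::size_t i = 0; i < n; ++i) {
        y[i] = a * x[i] + y[i];
    }
}
```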


4:45 John Michalakes

Title: Optimizing Weather Model Physics on Phi

I will discuss objectives, challenges, strategies, and experiences porting and optimizing physics packages from the Weather Research and Forecast (WRF) model and the NOAA Non-hydrostatic Multiscale Model (NMM-B) on Xeon Phi. The current focus is intra-node, improving performance on individual Phi processors, but the goal is scaling whole codes (not just kernels) to large numbers of Phi-enabled nodes.


5:00 Paul Peltz

Title: Best Practices for Administering a Medium Sized Cluster with Intel Xeon Phi Coprocessors

This work describes best practices for configuring and managing an Intel Xeon Phi cluster. The Xeon Phi presents a unique environment to the user, and preparing this environment requires unique procedures. This work will outline these procedures and provide examples for HPC administrators to use and then customize for their systems. Considerable effort has been put forth to help researchers determine how to maximize their performance on the Xeon Phi, but little has been done for the administrators of these systems. Now that Xeon Phis are being deployed in larger systems, there is a need for information on how to manage and deploy them. The information provided here will serve as a supplement to the documentation Intel provides in order to bridge the gap between workstation and cluster deployments. This work is based on the authors' experiences deploying and maintaining the Beacon cluster at the University of Tennessee's Application Acceleration Center of Excellence (AACE).


5:15 James Rosinski

Title: Porting and Optimizing NOAA/ESRL Weather Models on the Intel Xeon Phi Architecture

NOAA/ESRL is developing a numerical weather forecast model (named NIM) designed to run on a variety of architectures, including traditional CPUs as well as fine-grained hardware including Xeon Phi and GPU. One software constraint to this work is the need for a single-source solution for all supported platforms. In this talk we will describe the software development issues specific to porting and optimizing NIM for the Xeon Phi. In addition to performance results, tradeoffs of the symmetric vs. offload approaches to sharing the workload between the Phi and the host will be described. Issues associated with port validation, communication, and load balancing between host and coprocessor will also be discussed.


Lightning Talks


Speaker | Title/Description
Nambiar, Manoj | Performance Optimization of Scientific and Engineering workloads on Xeon/Xeon Phi
Brook, Glenn | HPC-BLAST: Scaling the Life Sciences for the Intel Many Integrated Core future
Noack, Matthias | Hierarchical Equations of Motion: What we can learn from OpenCL
Chow, Edmond | Large-Scale Hydrodynamic Brownian Simulations on Intel Xeon Phi
Poulsen, Jacob Weismann | Refactoring for Xeon Phi
Luszczek, Piotr | MAGMA MIC: HPC Linear Algebra for Intel Xeon Phi

Selected Submissions


Name | Title/Description
Breuer, Alex | Accelerated Earthquake Simulations
Deslippe, Jack | Lessons Learned From Optimizing Applications on Xeon Phi
Enkovaara, Jussi | Python-based software on MIC
Golembiowski, Albert | Maximizing parallelization of BLAST: Output Formatting Section (OFS)
Khaldi, Dounia | Native Mode-Based Optimizations of Remote Memory Accesses in OpenSHMEM for Intel Xeon Phi
Léger, Raphaël | Adapting a solver for bioelectromagnetics to the DEEP-ER architecture
Lehto, Olli-Pekka | Experiences with porting to Xeon Phi by CSC – IT Center for Science Ltd.
Lequiniou, Eric | RADIOSS Porting on Xeon Phi: A Developer’s Perspective
Rajan, Manesh; Doerfler, Doug; Hammond, Si; Trott, Christian; Barrett, Richard | Trinity Benchmarks on Xeon Phi (Knights Corner)
Ramos, Sabela; Hoefler, Torsten | Programming for Xeon Phi using Cache Line awareness
Romein, John | Accelerated Real-Time Processing of Radio Telescope Data
Wende, Florian | Enabling Manual Vectorization of Complex Code Patterns in Fortran

 

IXPUG

IXPUG Webinar Series

IXPUG webinars enable knowledge-sharing and greater collaboration on a range of topics across HPC, XPU architectures, storage, data analytics, artificial intelligence, and visualization development. The live session webinars are free and open to anyone who wishes to join—it's a great way to get involved in the IXPUG community!

The goals of the Webinar Series are:

  1. Direct IXPUG discussions to what is most relevant to the community.
  2. Disseminate results and techniques.
  3. Assist the community with performance debugging/troubleshooting on Intel HPC platforms.
  4. Provide a forum for collaboration between IXPUG members and Intel engineers.
  5. Help the community to prepare for deeper engagement at upcoming IXPUG events.

How to Participate

To receive updates on future webinars, subscribe to the IXPUG newsletter. If you are interested in a specific topic for future webinars and/or would like to share your work with the IXPUG community, please contact the IXPUG organizers by email.

 

 

Upcoming Webinars

Date/Time Title Author(s) Description Registration
April 25, 2024 8:00-9:00 a.m. PT

Leveraging LLMs and Differentiable Rendering for Automating Digital Twin Construction

 

Krishna Kumar is an assistant professor in the Fariborz Maseeh Department of Civil, Architectural, and Environmental Engineering and an affiliate member of the Oden Institute of Computational Sciences at UT Austin. Krishna received his PhD in Engineering from the University of Cambridge, UK, in multi-scale and multiphysics modeling. Krishna's work involves developing large-scale multiphysics numerical methods and in situ visualization techniques. His research interests span physics-based machine learning techniques, such as graph networks and differentiable simulators, for solving inverse and design problems. He leads the NSF-funded AI in Engineering Cyber Infrastructure Ecosystem and leads AI developments in DesignSafe, an NSF-funded Cyber Infrastructure facility for Natural Hazard Engineering.

 

This presentation introduces an innovative approach that combines Large Language Models (LLMs) and differentiable rendering techniques to automate the construction of digital twins. In our approach, we employ LLMs to guide and optimize the placement of objects in digital twin scenarios. This is achieved by integrating LLMs with differentiable rendering, a method traditionally used for optimizing object positions in computer graphics based on image pixel loss. Our technique enhances this process by incorporating a second modality, namely Lidar data, resulting in faster convergence and improved accuracy. This fusion of sensor inputs proves invaluable, especially for applications like autonomous vehicles, where establishing the precise location of multiple actors in a scene is crucial. Our methodology involves several key steps: (1) Generating a point cloud of the scene via ray casting, (2) Extracting lightweight geometry from the point cloud using PlaneSLAM, (3) Creating potential camera paths through the scene, (4) Selecting the most suitable camera path by leveraging the LLM in conjunction with image segmentation and classification, and (5) Rendering the camera flight path from its origin to the final destination.

The technical backbone of this system includes the use of Mitsuba for ray tracing, powered by Intel's Embree ray tracing library. This setup encompasses Lidar simulation, image rendering, and a final differentiable rendering step for precise camera positioning. Future iterations may incorporate Intel OSPRay for enhanced Lidar-like ray casting and image rendering, with a possible integration of Mitsuba for differentiable render camera positioning. The machine learning inference chain utilizes a pre-trained LLM from OpenAI accessed via LangChain, coupled with GroundingDINO for zero-shot image segmentation and classification within PyTorch. This entire workflow is optimized for performance on the latest generation of Intel CPUs.

This presentation will delve into the technical details of this approach, demonstrating its efficacy in automating digital twin construction and its potential applications in various industries, particularly in the realm of autonomous vehicle navigation and scene understanding.

 

[Registration link HERE]

 

 

 


Previous Webinars

Date Title Author(s) Description Presentation
August 10, 2023 Preparing for Exascale on Aurora Dr. Scott Parker is a computation scientist at the Argonne Leadership Computing Facility (ALCF) and a lead for the ALCF Performance Engineering team. His principal focus is on developing and deploying next-generation leadership-scale high-performance computing systems at the ALCF and developing scientific applications that utilize these systems. In addition, he is the lead for the Exascale Computing Project (ECP) Applications Integration effort, which seeks to enable ECP applications to utilize the new generation of exascale systems. He is also one of the co-organizers of the annual International Workshop on Performance, Portability, and Productivity in HPC. The Aurora exascale system is currently being deployed at Argonne National Lab. The system, utilizing Intel’s new Data Center Max Series GPUs (a.k.a. PVC) and Xeon Max Series CPUs with HBM, will provide a uniquely powerful platform for leading-edge HPC, AI, and data-intensive computing applications. Scientists at Argonne National Laboratory, in collaboration with the Exascale Computing Project, Intel, and several other institutions, are preparing several dozen applications and workflows to run at scale on the Aurora system. This talk will present an overview of the Aurora system and highlights from the experience of preparing applications for the system. In addition, promising early performance results on the Aurora hardware will be shown.

[Video]

[PDF]

April 28, 2022 Intel Fortran Compilers: A Tradition of Trusted Application Performance Ron Green is the manager of the Intel Fortran OpenMP and Runtime Library development team. He is a moderator for the Intel Fortran Community Forum and is an Intel Developer Zone “Black Belt”. He has extensive experience as a developer and consultant in HPC for the past 30+ years and has been with Intel’s compiler team for thirteen years. His technical interest area is in parallel application development with a focus on Fortran programming.

The Intel® Fortran Compiler is built on a long history of generating optimized code that supports industry standards while taking advantage of built-in technology for Intel® Xeon® Scalable processors and Intel® Core™ processors. Staying aligned with Intel's evolving and diverse architectures, the compiler now supports GPUs. This presentation will cover the compiler standards and path forward.

There are two versions of this compiler. Both versions integrate seamlessly with popular third-party compilers, development environments, and operating systems.
• Intel Fortran Compiler: provides CPU and GPU offload support
• Intel Fortran Compiler Classic: provides continuity with existing CPU-focused workflows

Features:
• Improves development productivity by targeting CPUs and GPUs through single-source code while permitting custom tuning
• Supports broad Fortran language standards
• Incorporates industry standards support for OpenMP* 4.5, and initial OpenMP 5.0 and 5.1 for GPU offload
• Uses well-proven LLVM compiler technology and Intel's history of compiler leadership
• Takes advantage of multicore, Single Instruction Multiple Data (SIMD) vectorization and multiprocessor systems with OpenMP, automatic parallelism, and coarrays

[Video]

[PDF]

March 10, 2022 DAOS: Storage Innovations Driven by Intel® Optane™ Persistent Memory Zhen Liang is a technical architect involved in the architecture, design, and implementation of distributed storage systems. He has been in the storage software industry since 2004 and has significant experience and expertise in filesystems, networking, high-performance computing, and distributed storage system architecture. He is currently the technical architect of Distributed Asynchronous Object Storage (DAOS). DAOS is an open-source software-defined object store designed from the ground up for massively distributed Non-Volatile Memory (NVM), including Intel® Optane™ DC persistent memory and Intel Optane DC SSDs; it is the foundation of the Intel exascale storage stack. This presentation will provide a technical overview of DAOS, present its performance, and explain its main features.

[Video]

[PDF]

December 9, 2021

Multi-GPU Programming—Scale-Up and Scale-Out Made Easy, Using the Intel® MPI Library

Anatoliy Rozanov is Intel MPI Lead Developer responsible for Intel GPU enabling and Intel MPI process management/deployment infrastructure at Intel.

Dmitry Durnov is Intel MPI and oneCCL Products Architect at Intel.

Michael Steyer is an HPC Technical Consulting Engineer, supporting technical and high-performance computing segments within the Software and Advanced Technology Group at Intel.

For shared-memory programming of GPGPU systems, users either have to manually distribute their domain decomposition across the available GPUs and GPU tiles, or leverage implicit scaling mechanisms that transparently scale their offload code across multiple GPU tiles. The former approach can be cumbersome, and the latter is not always the best performing. The Intel MPI Library can take that burden from users by letting them program for only a single GPU or tile and leave the distribution to the library. This can make HPC/GPU programming much easier. To that end, Intel® MPI not only allows individual MPI ranks to be pinned to individual GPUs or tiles, but also allows users to pass GPU memory pointers to the library.

[Video]

[PDF]
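As a rough sketch of the programming model described in the talk above (hand-written, not material from the webinar), the fragment below allocates a buffer in GPU device memory with SYCL and hands the device pointer directly to MPI. It assumes a GPU-aware Intel MPI build with GPU support enabled at runtime (for example via the library's GPU offload setting), which may differ between releases.

```cpp
#include <mpi.h>
#include <sycl/sycl.hpp>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    sycl::queue q{sycl::gpu_selector_v};            // assumes one GPU (or tile) visible per rank
    const std::size_t n = 1 << 20;
    float *buf = sycl::malloc_device<float>(n, q);  // device-resident buffer

    if (rank == 0) {
        q.fill(buf, 1.0f, n).wait();
        // Device pointer passed straight to MPI (GPU-aware path).
        MPI_Send(buf, n, MPI_FLOAT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(buf, n, MPI_FLOAT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    }

    sycl::free(buf, q);
    MPI_Finalize();
    return 0;
}
```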

August 12, 2021 IMPECCABLE: A Dream Pipeline for High-Throughput Virtual Screening, or a Pipe Dream? Dr. Shantenu Jha, Chair of Computation & Data Driven Discovery Department at Brookhaven National Laboratory and Professor of Computer Engineering at Rutgers University

The drug discovery process currently employed in the pharmaceutical industry typically requires about 10 years and $2-3 billion to deliver one new drug. This is both too expensive and too slow, especially in emergencies like the COVID-19 pandemic. In silico methodologies need to be improved to better select lead compounds that can proceed to later stages of the drug discovery protocol, accelerating the entire process. No single methodological approach can achieve the necessary accuracy with the required efficiency. Here we describe multiple methodological and supporting infrastructural innovations at scale. Specifically, we show how we used TACC’s Frontera on more than 8,000 compute nodes to sustain 144 million docking hits per hour and to screen roughly 100 billion drug candidates. These capabilities have been used by the US-DOE National Virtual Biotechnology Laboratory and represent important progress in computational drug discovery, both in the size of the libraries screened and in the possibility of generating training data fast enough for very powerful (docking) surrogate models.

[Video]

[PDF]

April 22, 2021 Visual Analysis on TACC Frontera using the Intel oneAPI Rendering Toolkit Dr. Paul A. Navrátil, Research Scientist and Director of Visualization, Texas Advanced Computing Center (TACC) at the University of Texas at Austin TACC Frontera handles the largest simulation runs for open-science researchers supported by the National Science Foundation. Due to the data sizes involved, the scientific analysis is most easily performed on Frontera itself, often done “in situ” without writing the full data to disk. This talk will present recent work on Frontera that uses the Intel oneAPI Rendering Toolkit to perform both batch and interactive visual analysis across a range of scientific domains.

[Video]

[PDF]

March 11, 2021 Performance Optimizations for End-to-End AI Pipelines Meena Arunachalam and Vrushabh Sanghavi, Intel Corporation The trifecta of high volumes of data, abundant compute availability on cloud and on-premise, and rapid algorithmic innovations enable data scientists and AI researchers to do fast experiments, prototyping, and model development at an accelerated pace that was never possible before. In this talk, we will touch upon a variety of software packages, libraries, and tools that can also help HPC practitioners push the envelope of applying AI in their application domains and simulations at-scale. We will cover examples and talk about how to create efficient end-to-end AI pipelines with large data sets in-memory, security, and other features through Intel-optimized software packages such as Intel® Distribution of Python, Intel® Optimized Modin, Intel® Optimized Sklearn, and XGBoost, as well as DL Frameworks such as Intel® Optimized Tensorflow and Intel® Optimized PyTorch tuned and enabled with new hardware features and instructions every new CPU generation.

[Video]

[PDF]

February 18, 2021 Migrating from CUDA-only to Multi-Platform DPC++

Steffen Christgau, Zuse Institute Berlin (ZIB)

Marius Knaust (ZIB) will join to answer FPGA-related questions from the audience.

In this webinar we will demonstrate how an existing CUDA stencil application code can be migrated to DPC++ with the help of the Compatibility Tool. We will highlight and discuss the crucial differences between the two programming environments in the context of migrating the tsunami simulation easyWave. The discussion also includes steps for making the code compliant with the SYCL standard. During the talk, we will also show that the migrated code can run on a wide range of platforms, from CPUs and GPUs to FPGAs.

[Video]

[PDF]
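To make the migration idea in the webinar above concrete, here is a minimal, hypothetical sketch (not taken from easyWave or the webinar; names and coefficients are illustrative): a CUDA-style 1D stencil kernel expressed as a DPC++/SYCL parallel_for. The pointers are assumed to be USM allocations visible to the device.

```cpp
#include <sycl/sycl.hpp>

// CUDA original, for comparison:
//   __global__ void stencil(const float *in, float *out, int n) {
//       int i = blockIdx.x * blockDim.x + threadIdx.x;
//       if (i > 0 && i < n - 1)
//           out[i] = 0.25f * in[i - 1] + 0.5f * in[i] + 0.25f * in[i + 1];
//   }

// DPC++/SYCL equivalent: the thread-index arithmetic becomes a range/id pair.
void stencil(sycl::queue &q, const float *in, float *out, int n) {
    q.parallel_for(sycl::range<1>(static_cast<std::size_t>(n)),
                   [=](sycl::id<1> idx) {
        int i = static_cast<int>(idx[0]);
        if (i > 0 && i < n - 1)
            out[i] = 0.25f * in[i - 1] + 0.5f * in[i] + 0.25f * in[i + 1];
    }).wait();
}
```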

 

July 1, 2020

 

Migrating Your Existing CUDA Code to DPC++

 

Edward Mascarenhas and Sunny Gogar, Intel Corporation

Best practices for using a one-time migration tool that migrates CUDA applications into standards-based Data Parallel C++ (DPC++) code. Topics include:

• An overview of the DPC++ language, including why it was created and how it benefits developers
• An overview of the Intel DPC++ Compatibility Tool itself—what it is and what it does
• Real-world examples of the code-migration concept, including the process and expectations
• A demonstration of the steps involved to migrate CUDA code to DPC++ code, including what a complete migration looks like and best practices to follow

[Video]
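As a rough illustration of the kind of mapping such a migration produces (a hand-written sketch under stated assumptions, not actual output of the Compatibility Tool), the snippet below shows how typical CUDA memory-management and launch calls correspond to SYCL unified shared memory operations and a queue-based kernel launch.

```cpp
#include <sycl/sycl.hpp>
#include <vector>

int main() {
    sycl::queue q{sycl::default_selector_v};
    const std::size_t n = 1024;
    std::vector<float> host(n, 2.0f);

    // cudaMalloc             -> sycl::malloc_device
    float *dev = sycl::malloc_device<float>(n, q);
    // cudaMemcpy (H2D)       -> queue::memcpy
    q.memcpy(dev, host.data(), n * sizeof(float)).wait();
    // kernel<<<grid,block>>> -> queue::parallel_for
    q.parallel_for(sycl::range<1>(n), [=](sycl::id<1> i) { dev[i] *= 3.0f; }).wait();
    // cudaMemcpy (D2H)       -> queue::memcpy
    q.memcpy(host.data(), dev, n * sizeof(float)).wait();
    // cudaFree               -> sycl::free
    sycl::free(dev, q);
    return 0;
}
```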

February 20, 2020 Performance Optimization of Intel® oneAPI Applications

Kevin O’Leary
Intel Corporation

Modern workloads are incredibly diverse—and so are architectures. No single architecture is best for every workload. Maximizing performance takes a mix of scalar, vector, matrix, and spatial (SVMS) architectures deployed in CPU, GPU, FPGA, and other future accelerators. Intel® oneAPI products will deliver the tools needed to deploy applications and solutions across SVMS architectures. This webinar will focus on the oneAPI features that focus on performance optimization, including the analysis tools:

  • Intel® VTune™ Profiler (Beta) to find performance bottlenecks fast in CPU, GPU, and FPGA systems
  • Intel® Advisor (Beta) for vectorization, threading, and accelerator offload design advice

Part of the webinar will start with an application that is currently running on a CPU, and we will use the oneAPI tools to port and optimize it on our GPU.

The Intel® oneAPI set of complementary toolkits—a base kit and specialty add-ons—simplify programming and help improve efficiency and innovation. Use it for: high performance computing, machine learning and analytics, IoT applications, video processing, rendering, etc. This webinar will include extra time for Q&A.

Presenter: Kevin O’Leary is a senior technical consulting engineer in Intel’s software tools group. Kevin was one of the original developers of Intel® Parallel Studio. Before coming to Intel, he spent several years on the IBM Rational Apex debugger development team.

[Video]

[PDF]

November 14, 2019 The DREAM Framework and Binning Directories –or– Can We Analyze ALL Genomic Sequences on Earth?

Knut Reinert
Freie Universität Berlin

The recent improvements of full genome sequencing technologies, commonly subsumed under the term NGS (Next Generation Sequencing), have tremendously increased the sequencing throughput. Within 10 years it rose from 21 billion base pairs collected over months to about 400 billion base pairs per day (the current throughput of Illumina's HiSeq 4000). The cost of producing one million base pairs has also been reduced from 140,000 dollars to a few cents.

As a result of this dramatic development, the number of new data submissions, generated by various biotechnological protocols (ChIP-Seq, RNA-Seq, etc.), to genomic databases has grown dramatically and is expected to continue to increase faster than the cost and capacity of storage devices will decrease.

The main task in analyzing NGS data is to search sequencing reads or short sequence patterns (e.g., exon/intron boundary read-through patterns) or expression profiles in large collections of sequences (i.e., a database). Searching the entirety of the databases mentioned above is usually only possible by searching the metadata or a set of results initially obtained from the experiment. Searching (approximately) for a specific genomic sequence in all the data has not been possible in reasonable computational time.

In this work we describe results for our new data structure, called a binning directory, which can distribute approximate search queries based on an extension of our recently introduced Interleaved Bloom Filter (IBF), the x-partitioned IBF (x-PIBF). The results presented here make use of the Intel® Optane™ DC persistent memory architecture and achieve significant speedups compared to a disk-based solution.

[Video]
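For readers unfamiliar with the underlying machinery, here is a generic Bloom-filter membership test in C++. This is a simplified sketch of the basic set/test idea only; it is not the Interleaved Bloom Filter or the x-PIBF described in the talk above, and the sizes and hash mixing are illustrative.

```cpp
#include <bitset>
#include <cstddef>
#include <functional>
#include <string>

// Minimal Bloom filter: k hash functions set/test k bits per element.
// False positives are possible; false negatives are not.
class BloomFilter {
    static constexpr std::size_t kBits = 1 << 20;   // filter size in bits
    static constexpr int kHashes = 3;               // number of hash functions
    std::bitset<kBits> bits_;

    std::size_t hash(const std::string &s, int seed) const {
        // Crude seed mixing via std::hash, for illustration only.
        return std::hash<std::string>{}(s + static_cast<char>('A' + seed)) % kBits;
    }
public:
    void insert(const std::string &kmer) {
        for (int i = 0; i < kHashes; ++i) bits_.set(hash(kmer, i));
    }
    bool possibly_contains(const std::string &kmer) const {
        for (int i = 0; i < kHashes; ++i)
            if (!bits_.test(hash(kmer, i))) return false;  // definitely absent
        return true;                                       // probably present
    }
};
```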
October 10, 2019 Optimize for Both Memory and Compute on Modern Hardware Using Roofline Model Automation in Intel® Advisor

Zakhar Matveev and Cédric Andreolli
Intel Corporation

Software must be optimized both for compute (including SIMD vector parallelism) and for effective memory sub-system utilization to achieve scaled performance on modern hardware. In this talk we present state-of-the-art Intel Advisor Roofline performance model automation, which helps identify memory bottlenecks and balance CPU and memory utilization. The talk will cover not only the “cache-aware” Roofline implementation, but also new capabilities to produce DRAM (“original”) and multi-level (L1, L2, LLC, MCDRAM and DRAM, all de-coupled) Roofline model flavors in order to guide the optimization of DRAM- or cache-bound applications.

[Video]

[PDF - Matveev]

[PDF - Andreolli]
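For reference, the Roofline model used in the talk above bounds attainable performance by a compute ceiling and a memory ceiling. In its basic (DRAM) form it can be written as

\[
P_{\text{attainable}}(I) \;=\; \min\bigl(P_{\text{peak}},\; I \cdot B_{\text{peak}}\bigr),
\qquad
I \;=\; \frac{\text{FLOPs}}{\text{bytes transferred}} ,
\]

where \(I\) is the arithmetic intensity of the kernel. The multi-level flavors mentioned in the talk evaluate the same bound with the peak bandwidth \(B_{\text{peak}}\) of each memory level (L1, L2, LLC, MCDRAM, DRAM) in turn.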

September 12, 2019

Accelerate Your Inferencing with Intel® Deep Learning Boost

Shailen Sobhee
Intel Corporation

Learn about Intel® Deep Learning Boost (Intel® DL Boost) and its Vector Neural Network Instructions (VNNI). These are a new set of Intel® Advanced Vector Extensions 512 (Intel® AVX-512) instructions designed to deliver significantly more efficient deep learning inference acceleration. We will show a live demo of them in action and quickly show you how you can get started with Intel® DL Boost today.

[Video]
[PDF]
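As a small, hedged illustration of what VNNI provides (not material from the webinar), the intrinsic below fuses the multiplication of unsigned 8-bit activations with signed 8-bit weights and the accumulation into 32-bit integers in a single instruction. It requires an AVX-512 VNNI capable CPU and an appropriate compiler target (for example -mavx512vnni with GCC or Clang).

```cpp
#include <immintrin.h>

// One VNNI step: for each 32-bit lane, acc += sum over 4 pairs of
// (uint8 activation * int8 weight). Replaces a multiply/add/convert sequence.
__m512i dot_accumulate(__m512i acc, __m512i activations_u8, __m512i weights_s8) {
    return _mm512_dpbusd_epi32(acc, activations_u8, weights_s8);
}
```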

 
July 11, 2019 Scaling Distributed TensorFlow Training with Intel’s nGraph Library on Xeon® Processor Based HPC Infrastructure
Jianying Lang
Intel Corporation
Intel has released the nGraph library, a compiler and set of runtime APIs for multiple front-end deep learning frameworks, such as TensorFlow, MxNet, PaddlePaddle, and others. nGraph represents a framework's computational graph as an intermediate representation (IR) that can be executed by multiple hardware backends from the edge to the data center, significantly improving the productivity of AI data scientists. In this talk, we will present the details of the bridge that connects TensorFlow to nGraph for a Xeon CPU backend. We will demonstrate state-of-the-art (SOTA) accuracy and convergence for ResNet-50 against ImageNet-1K on multiple Xeon Skylake nodes. Using distributed nGraph, we are able to obtain ~75% Top-1 accuracy for ResNet-50 training on a small number of Xeon Skylake nodes. We will demonstrate convergence and excellent scaling efficiency on Skylake nodes connected with Ethernet, using nGraph TensorFlow with the open-source Horovod code.

[Video]

[PDF]

May 09, 2019 Deeply-Pipelined FPGA Clusters Make DNN Training Scalable
Tong Geng
Boston University

Tianqi Wang
University of Science and Technology of China

Deep Neural Networks (DNNs) have revolutionized numerous applications, but the demand for ever more performance remains unabated. Scaling DNN computations to larger clusters is generally done by distributing tasks in batch mode using methods such as distributed synchronous SGD. Among the issues with this approach is that to make the distributed cluster work with high utilization, the workload distributed to each node must be large, which implies nontrivial growth in the SGD mini-batch size. In this talk, we introduce a framework, FPDeep, which uses a hybrid of model and layer parallelism to configure distributed reconfigurable clusters to train DNNs. This approach has numerous benefits. First, the design does not suffer from batch size growth. Second, novel workload and weight partitioning leads to balanced loads of both among nodes. And third, the entire system is a fine-grained pipeline. This leads to high parallelism and utilization and also minimizes the time features need to be cached while waiting for back-propagation. As a result, storage demand is reduced to the point where only on-chip memory is used for the convolution layers. We evaluate FPDeep with the Alexnet, VGG-16, and VGG-19 benchmarks. FPDeep provides, on average, 6.36x higher energy efficiency than comparable GPU servers.

[Video]

[PDF]

April 11, 2019 A Study of SIMD Vectorization for Matrix-Free Finite Element Method

Tianjiao Sun  
Imperial College London, UK

Lawrence Mitchell 
Durham University, UK

David A. Ham 
Imperial College London, UK

Paul H. J. Kelly  
Imperial College London, UK

Kaushik Kulkami  
University of Illinois at Urbana-Champaign, USA

Andreas Kloeckner
University of Illinois at Urbana-Champaign, USA

Vectorization is increasingly important to achieve high performance on modern hardware with SIMD instructions. Assembly of matrices and vectors in the finite element method, which is characterized by iterating a local assembly kernel over unstructured meshes, poses challenges to effective vectorization. Maintaining a user-friendly high-level interface with a suitable degree of abstraction while generating efficient, vectorized code for the finite element method is a challenge for numerical software systems and libraries. In this talk, we study cross-element vectorization in the finite element framework Firedrake and demonstrate the efficacy of such an approach by evaluating a wide range of matrix-free operators spanning different polynomial degrees and discretizations on two recent Intel CPUs using three mainstream compilers. Our experiments show that cross-element vectorization achieves 30% of theoretical peak performance for many examples of practical significance, and exceeds 50% for cases with high arithmetic intensities, with consistent speed-up over vectorization restricted to the local assembly kernels.

[Video]

[PDF]

March 14, 2019 Scalable and Flexible Distributed Rendering with OSPRay's Distributed API and FrameBuffer

Will Usher 
Scientific Computing and Imaging Institute, University of Utah

Ingo Wald 
Formerly Intel, now NVIDIA

Jefferson Amstutz 
Intel Corporation

Johannes Günther 
Intel Corporation

Carson Brownlee 
Intel Corporation

Valerio Pascucci 
Scientific Computing and Imaging Institute, University of Utah

Image- and data-parallel rendering across multiple nodes of an HPC system is widely used in visualization to provide higher framerates, support large datasets, and render data in situ. Specifically for in situ use, reducing bottlenecks incurred by the visualization and compositing tasks is of key concern for reducing the overall simulation run time, while for general interactive visualization, improving rendering performance, and thus interactivity, is always desirable. In this talk, Will Usher will present our work on an asynchronous image processing and compositing framework for multi-node rendering in OSPRay, dubbed the Distributed FrameBuffer. We demonstrate that this approach achieves performance superior to the state of the art for common use cases, while providing the flexibility to support a wide range of parallel rendering algorithms and data distributions. By building on this framework, we have extended OSPRay with a data-distributed API, enabling its use in data-distributed and in situ visualization applications. Will Usher will cover our approach to developing this framework, performance considerations, and use cases and examples of the new data-distributed API in OSPRay.

[Video] Recording begins at 2:50.

[PDF]

February 14, 2019 Evaluation of Intel Memory Drive Technology Performance for Scientific Applications

Vladimir Mironov   Lomonosov Moscow State University

Yuri Alexeev Argonne National Laboratory

Alexander Moskovsky    RSC Technologies

Andrey Kudryavtsev      Intel Corporation

This talk will present benchmark data for IMDT, a new generation of Software-defined Memory (SDM) based on a collaboration between Intel and ScaleMP and using the 3D XPoint-based Intel SSD called Optane. IMDT performance was studied using synthetic benchmarks, scientific kernels and applications. We chose these benchmarks to represent different patterns of computation and of accessing data on disk and in memory. To put IMDT performance in context, we used two memory configurations: a hybrid IMDT DDR4/Optane system and a DDR4-only system. Performance was measured at varying percentages of used memory and analyzed in detail. We found that for some applications the DDR4/Optane hybrid configuration outperforms the DDR4 setup by up to 20%.

[Video] Recording begins at 5:35.

[PDF]

January 10, 2019 Massively Scalable Computing Method for Handling Large Eigenvalue Problem for Nanoelectronics Modeling Hoon Ryu
Korea Institute of Science and Technology Information (KISTI)

This talk will help you learn how the Lanczos iterative algorithm can be extended with parallel computing to solve highly degenerate systems. The talk will address the performance benefits of the core numerical operations in the Lanczos iteration when driven by many-core processors (KNL), compared to heterogeneous systems containing PCI-E add-in devices. This work will also demonstrate an extremely large-scale benchmark (~2,500 KNL computing nodes) that was recently performed on the KISTI-5 (NURION) HPC resource.

As this talk covers the numerical details of the algorithm, it should also be quite instructive to those who are considering a KNL system for solving large-scale eigenvalue problems.

[Video]

[PDF]

October 11, 2018

Intel Optane Solutions in HPC

Andrey Kudryavtsev
Intel Corporation

This session focuses on the latest Intel Optane technologies and the way they are used by HPC customers. Attendees will learn about the best usage models and the benefits Intel Optane brings to fast storage and to extending system memory.

[Video]

August 9, 2018 

Machine Learning at Scale 

Deborah Bard and Karthik Kashinath
NERSC

Deep Learning has revolutionized the fields of computer vision, speech recognition, robotics and control systems. At NERSC, we have applied deep learning to problems in cosmology and climate science, focusing on areas that require supercomputing resources to solve real scientific challenges. In cosmology, we use deep learning to identify the underlying physical model that produced the matter distribution in the universe, and develop a deep learning-based emulator for cosmological observables that can reduce the need for computationally expensive simulations. In addition, we use feature introspection to examine the physical structures identified by the network as distinguishing between cosmological models. 

 

In climate, we apply deep learning to detect and localize extreme weather events such as tropical cyclones, atmospheric rivers and weather fronts in large-scale simulated and observed datasets. We will also discuss the challenges involved in scaling deep learning frameworks to supercomputer scale, and how to obtain optimal performance from supercomputing hardware. 

[Video]

 June 14, 2018

Using Roofline Analysis to Analyze, Optimize, & Vectorize Iso3DFD with Intel® Advisor 

Kevin O’Leary
Intel Corporation
This presentation will introduce the use of Intel® Advisor to help you enable vectorization in your application. We will use the Roofline model in Intel Advisor to see the impact of our optimizations. We will also demonstrate how Intel Advisor can detect inefficient memory access patterns or loop-carried dependencies in your application. The case study we will use is Iso3DFD. This kernel propagates a wave in a 3D field using finite differences with a 16th-order stencil in an isotropic medium.

[Video]
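For orientation, the fragment below sketches the general shape of such a wave-propagation stencil in a heavily simplified form: a 2nd-order stencil rather than the 16th-order one in Iso3DFD, with illustrative names and no boundary handling beyond skipping the outer layer. It is exactly the kind of innermost loop whose vectorization and memory-access behavior the Advisor analyses mentioned above are meant to examine.

```cpp
#include <cstddef>

// Flatten 3D coordinates into a linear index.
inline std::size_t idx(std::size_t x, std::size_t y, std::size_t z,
                       std::size_t nx, std::size_t ny) {
    return x + nx * (y + ny * z);
}

// Simplified isotropic wave update: next = 2*cur - prev + c * laplacian(cur).
// The innermost x loop is the vectorization target.
void wave_step(const float *prev, const float *cur, float *next, float c,
               std::size_t nx, std::size_t ny, std::size_t nz) {
    for (std::size_t z = 1; z + 1 < nz; ++z) {
        for (std::size_t y = 1; y + 1 < ny; ++y) {
            #pragma omp simd
            for (std::size_t x = 1; x + 1 < nx; ++x) {
                std::size_t i = idx(x, y, z, nx, ny);
                float lap = cur[i - 1] + cur[i + 1]
                          + cur[i - nx] + cur[i + nx]
                          + cur[i - nx * ny] + cur[i + nx * ny]
                          - 6.0f * cur[i];
                next[i] = 2.0f * cur[i] - prev[i] + c * lap;
            }
        }
    }
}
```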

May 10, 2018

High Productivity Languages

Rollin Thomas
NERSC

Sergey Maidanov
Intel Corporation

This talk will cover the challenges of numerical analysis and simulation at scale. Tools such as Python, which are often used for prototyping, are not designed to scale to large problems. As a result, organizations need a dedicated team that takes a prototype created by research scientists and deploys it in the production environment.

A new approach is required to address both the scalability and the productivity aspects of applied science, one that combines two distinct worlds: the best of the HPC world and the best of the database world.

Starting with a brief overview of scalability aspects with respect to modern hardware architecture, we will characterize the problem at scale, its inherent characteristics, and how these map onto software design choices. We will also discuss selected experimental/observational science applications making use of Python at the National Energy Research Scientific Computing Center (NERSC), and what NERSC has done in partnership with the Intel Python Team to help application developers improve performance while retaining scientist/developer productivity.

[Slides 1]

[Slides 2]

[Video]

April 12, 2018

Topology and Cache Coherence in Knights Landing and Skylake Xeon Processors

John McCalpin
TACC
Intel's second-generation Xeon Phi (Knights Landing) and Xeon Scalable Processor ("Skylake Xeon") are both based on a new 2-D mesh architecture with significant changes to the cache coherence protocol. This talk will review some of the most important new features of the coherence protocol (such as "snoop filters", "memory directories", and non-inclusive L3 caches) from a performance analysis perspective. For both of these processor families, the mapping from user-visible information (such as core numbers) to spatial location on the mesh is both undocumented and obscured by low-level renumbering. A methodology is presented that uses microbenchmarks and performance counters to invert this renumbering. This allows the display of spatially relevant performance counter data (such as mesh traffic) in a topologically accurate two-dimensional view. Applying these visualizations to simple benchmark results provides immediate intuitive insights into the flow of data in these systems, and reveals ways in which the new cache coherence protocols modify these flows.

[Slides]

[Video]

March 8, 2018

Compiler Prefetching on KNL Rakesh Krishaiyer
Intel Corporation
We will cover some of the recent changes in compiler-based prefetching (for Knights Landing and Skylake) and provide tips on how to tune for performance using compiler prefetching options, pragmas and prefetch intrinsics.

[Slides]

[Video]
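As a generic, hedged example of the prefetch intrinsics mentioned above (not content from the talk), the loop below issues a software prefetch a fixed distance ahead of the current element. The distance of 64 elements is purely illustrative; with the Intel compilers, similar effects can also be requested through prefetch pragmas or the -qopt-prefetch family of options.

```cpp
#include <xmmintrin.h>  // _mm_prefetch
#include <cstddef>

// Sum an array while prefetching data a fixed distance ahead into L1.
float sum_with_prefetch(const float *a, std::size_t n) {
    constexpr std::size_t dist = 64;   // illustrative prefetch distance (elements)
    float s = 0.0f;
    for (std::size_t i = 0; i < n; ++i) {
        if (i + dist < n)
            _mm_prefetch(reinterpret_cast<const char *>(&a[i + dist]), _MM_HINT_T0);
        s += a[i];
    }
    return s;
}
```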

February 8, 2018

Threading Building Blocks (TBB) Flow Graph: Expressing and Analyzing Dependencies in Your C++ Application

Pablo Reble
Intel Corporation

Developing for heterogeneous systems is challenging because applications may be composed of many layers of parallelism and employ a diverse set of programming models or libraries. This session focuses on Flow Graph, an extension to the Threading Building Blocks (TBB) interface that can be used as a coordination layer for heterogeneity that retains optimization opportunities and composes with existing models. This extension assists in expressing complex synchronization and communication patterns and in balancing load between CPUs, GPUs, and FPGAs. 

Because a Flow Graph can express complex interactions, we use Intel Advisor’s Flow Graph Analyzer (FGA), released as a Technology Preview in Parallel Studio XE 2018, to visualize interactions in a graph and map the application structure to performance data. Finally, we validate this approach by presenting use cases of applications using Flow Graph.

[Slides]

[Video]
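To give a flavor of the interface discussed in the session above (a minimal sketch, not code from the webinar), the example wires two TBB Flow Graph function nodes into a small pipeline and pushes a few messages through it.

```cpp
#include <tbb/flow_graph.h>
#include <iostream>

int main() {
    tbb::flow::graph g;

    // First stage: square the incoming integer (unlimited concurrency).
    tbb::flow::function_node<int, int> square(
        g, tbb::flow::unlimited, [](int x) { return x * x; });

    // Second stage: print the result (serial, concurrency 1).
    tbb::flow::function_node<int, tbb::flow::continue_msg> print(
        g, 1, [](int x) { std::cout << x << '\n'; return tbb::flow::continue_msg{}; });

    tbb::flow::make_edge(square, print);   // square -> print dependency

    for (int i = 1; i <= 4; ++i) square.try_put(i);
    g.wait_for_all();                      // wait for all messages to drain
    return 0;
}
```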

January 11, 2018

 


 

Vectorization of Inclusive/Exclusive Scans, Compiler 19.0 Nikolay Panchenko
Intel Corporation

We propose a new OpenMP syntax to support inclusive and exclusive scan patterns. In computer science, this pattern is also known as a prefix or cumulative sum. The proposal defines several new constructs to support inclusive and exclusive scans through OpenMP, defines the semantics for these constructs, and covers possible combinations of parallelization and vectorization. In the 18.0 compiler, three new experimental OpenMP SIMD features were added: vectorization of loops with breaks, syntax for compress/expand patterns, and syntax for the histogram pattern.

[Slides]
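For reference, the inclusive-scan pattern described in this talk was later standardized in OpenMP 5.0 through the scan directive; a minimal example of that standardized syntax (not necessarily identical to the syntax proposed in the talk) looks like this.

```cpp
#include <cstddef>

// Inclusive prefix sum: out[i] = a[0] + a[1] + ... + a[i].
void inclusive_scan(const float *a, float *out, std::size_t n) {
    float sum = 0.0f;
    #pragma omp simd reduction(inscan, +: sum)
    for (std::size_t i = 0; i < n; ++i) {
        sum += a[i];                    // update phase
        #pragma omp scan inclusive(sum) // separates update from use
        out[i] = sum;                   // use phase sees the inclusive prefix
    }
}
```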

 

 


For more information about previous meetings, please refer to the minutes.