IXPUG Webinar Series

Intel eXtreme Performance Users Group (IXPUG) live session webinars enable knowledge sharing and greater collaboration in the HPC and AI community. Topics range across HPC, AI, next-generation platforms, storage at scale, data analytics, visualization, and Intel software and tools. The webinars are free and open to all attendees!

What to expect:

Direct and collaborative discussions about results and techniques between IXPUG members, industry experts, and Intel engineers.
Live Q&A to assist the community with performance, debugging, and troubleshooting on Intel HPC platforms.
Networking with HPC-AI technical leaders and your peers to prepare for a deeper engagement at IXPUG events.

Previous Webinars

Date	Title	Author(s)	Description	Presentation
April 25, 2024	Leveraging LLMs and Differentiable Rendering for Automating Digital Twin Construction	Krishna Kumar is an assistant professor of the Fariborz Maseeh Department of Civil, Architectural, and Environmental Engineering and an affiliate member at the Oden Institute of Computational Sciences at UT Austin. Krishna received his PhD in Engineering at the University of Cambridge, UK, in multi-scale and multiphysics modeling. Krishna's work involves developing large-scale multiphysics numerical methods and in situ visualization techniques. His research interest spans physics-based machine learning techniques, such as graph networks and differentiable simulators, to solve inverse and design problems. He leads the NSF-funded AI in Engineering Cyber Infrastructure Ecosystem and leads AI developments in DesignSafe, an NSF-funded Cyber Infrastructure facility for Natural Hazard Engineering.	This presentation introduces an innovative approach that combines Large Language Models (LLMs) and differentiable rendering techniques to automate the construction of digital twins. In our approach, we employ LLMs to guide and optimize the placement of objects in digital twin scenarios. This is achieved by integrating LLMs with differentiable rendering, a method traditionally used for optimizing object positions in computer graphics based on image pixel loss. Our technique enhances this process by incorporating a second modality, namely Lidar data, resulting in faster convergence and improved accuracy. This fusion of sensor inputs proves invaluable, especially for applications like autonomous vehicles, where establishing the precise location of multiple actors in a scene is crucial. 
Our methodology involves several key steps: (1) Generating a point cloud of the scene via ray casting, (2) Extracting lightweight geometry from the point cloud using PlaneSLAM, (3) Creating potential camera paths through the scene, (4) Selecting the most suitable camera path by leveraging the LLM in conjunction with image segmentation and classification, and (5) Rendering the camera flight path from its origin to the final destination. The technical backbone of this system includes the use of Mitsuba for ray tracing, powered by Intel's Embree ray tracing library. This setup encompasses Lidar simulation, image rendering, and a final differentiable rendering step for precise camera positioning. Future iterations may incorporate Intel OSPRay for enhanced Lidar-like ray casting and image rendering, with a possible integration of Mitsuba for differentiable render camera positioning. The machine learning inference chain utilizes a pre-trained LLM from OpenAI accessed via LangChain, coupled with GroundingDINO for zero-shot image segmentation and classification within PyTorch. This entire workflow is optimized for performance on the latest generation of Intel CPUs. This presentation will delve into the technical details of this approach, demonstrating its efficacy in automating digital twin construction and its potential applications in various industries, particularly in the realm of autonomous vehicle navigation and scene understanding.	[Video] [PDF]
August 10, 2023	Preparing for Exascale on Aurora	Dr. Scott Parker is a computation scientist at the Argonne Leadership Computing Facility (ALCF) and a lead for the ALCF Performance Engineering team. His principal focus is on developing and deploying next-generation leadership-scale high-performance computing systems at the ALCF and developing scientific applications that utilize these systems. In addition, he is the lead for the Exascale Computing Projects (ECP) Applications Integration effort, which seeks to enable ECP applications to utilize the new generation of exascale systems. He is also one of the co-organizers of the annual International Workshop on Performance, Portability, and Productivity in HPC.	The Aurora exascale system is currently being deployed at Argonne National Lab. The system, utilizing Intel’s new Data Center Max Series GPUs (a.k.a. PVC) and Xeon Max Series CPU with HBM, will provide a uniquely powerful platform for leading-edge HPC, AI, and data-intensive computing applications. Scientists at Argonne National Laboratory, in collaboration with the Exascale Computing Project, Intel, and several other institutions, are preparing several dozen applications and workflows to run at scale on the Aurora system. This talk will present an overview of the Aurora system and highlights from the experience of preparing applications for the system. In addition, promising early performance results on the Aurora hardware will be shown.	[Video] [PDF]
April 28, 2022	Intel Fortran Compilers: A Tradition of Trusted Application Performance	Ron Green is the manager of the Intel Fortran OpenMP and Runtime Library development team. He is a moderator for the Intel Fortran Community Forum and is an Intel Developer Zone “Black Belt”. He has extensive experience as a developer and consultant in HPC for the past 30+ years and has been with Intel’s compiler team for thirteen years. His technical interest area is in parallel application development with a focus on Fortran programming.	The Intel® Fortran Compiler is built on a long history of generating optimized code that supports industry standards while taking advantage of built-in technology for Intel® Xeon® Scalable processors and Intel® Core™ processors. Staying aligned with Intel's evolving and diverse architectures, the compiler now supports GPUs. This presentation will cover the compiler standards and path forward. There are two versions of this compiler. Both versions integrate seamlessly with popular third-party compilers, development environments, and operating systems. • Intel Fortran Compiler: provides CPU and GPU offload support • Intel Fortran Compiler Classic: provides continuity with existing CPU-focused workflows Features: • Improves development productivity by targeting CPUs and GPUs through single-source code while permitting custom tuning • Supports broad Fortran language standards • Incorporates industry standards support for OpenMP* 4.5, and initial OpenMP 5.0 and 5.1 for GPU offload • Uses well-proven LLVM compiler technology and Intel's history of compiler leadership • Takes advantage of multicore, Single Instruction Multiple Data (SIMD) vectorization and multiprocessor systems with OpenMP, automatic parallelism, and coarrays	[Video] [PDF]
March 10, 2022	DAOS: Storage Innovations Driven by Intel® Optane™ Persistent Memory	Zhen Liang is a technical architect involved in the architecture, design, and implementation of distributed storage systems. He has been in the storage software industry since 2004 and has significant experience and expertise in filesystem, network, high performance computing, and distributed storage system architecture. He is currently the technical architect of Distributed Asynchronous Object Storage (DAOS). DAOS is an open-source software-defined object store designed from the ground up for massively distributed Non-Volatile Memory (NVM); it is the foundation of the Intel exascale storage stack.	This presentation will provide a technical overview of Distributed Asynchronous Object Storage (DAOS), a software-defined object store designed from the ground up for massively distributed Non-Volatile Memory (NVM), including Intel® Optane™ DC persistent memory and Intel Optane DC SSDs. The presentation will also cover DAOS performance and explain its main features.	[Video] [PDF]
December 9, 2021	Multi-GPU Programming—Scale-Up and Scale-Out Made Easy, Using the Intel® MPI Library	Anatoliy Rozanov is Intel MPI Lead Developer responsible for Intel GPU enabling and Intel MPI process management/deployment infrastructure at Intel. Dmitry Durnov is Intel MPI and oneCCL Products Architect at Intel. Michael Steyer is an HPC Technical Consulting Engineer, supporting technical and high performance computing segments within the Software and Advanced Technology Group at Intel.	For shared memory programming of GPGPU systems, users either have to manually run their domain decomposition across the available GPUs and GPU tiles, or leverage implicit scaling mechanisms that transparently scale their offload code across multiple GPU tiles. The former approach can be cumbersome, and the latter approach is not always the best performing one. The Intel MPI library can take that burden from users by enabling them to program for a single GPU or tile only and leave the distribution to the library. This can make HPC / GPU programming much easier. To that end, Intel® MPI not only allows users to pin individual MPI ranks to individual GPUs or tiles, but also allows them to pass GPU memory pointers to the library.	[Video] [PDF]
August 12, 2021	IMPECCABLE: A Dream Pipeline for High-Throughput Virtual Screening, or a Pipe Dream?	Dr. Shantenu Jha, Chair of Computation & Data Driven Discovery Department at Brookhaven National Laboratory and Professor of Computer Engineering at Rutgers University	The drug discovery process currently employed in the pharmaceutical industry typically requires about 10 years and $2-3 billion to deliver one new drug. This is both too expensive and too slow, especially in emergencies like the COVID-19 pandemic. In silico methodologies need to be improved to better select lead compounds that can proceed to later stages of the drug discovery protocol accelerating the entire process. No single methodological approach can achieve the necessary accuracy with required efficiency. Here we describe multiple methodological and supporting infrastructural innovations at scale. Specifically, how we used TACC’s Frontera on > 8000 compute nodes to sustain 144M/hour docking hits, and to screen ∼100 Billion drug candidates. These capabilities have been used by the US-DOE National Virtual Biotechnology Laboratory, and represent important progress towards improvement of computational drug discovery, both in terms of size of libraries screened, but also the possibility of generating training data fast enough for very powerful (docking) surrogate models.	[Video] [PDF]
April 22, 2021	Visual Analysis on TACC Frontera using the Intel oneAPI Rendering Toolkit	Dr. Paul A. Navrátil, Research Scientist and Director of Visualization, Texas Advanced Computing Center (TACC) at the University of Texas at Austin	TACC Frontera handles the largest simulation runs for open-science researchers supported by the National Science Foundation. Due to the data sizes involved, the scientific analysis is most easily performed on Frontera itself, often done “in situ” without writing the full data to disk. This talk will present recent work on Frontera that uses the Intel oneAPI Rendering Toolkit to perform both batch and interactive visual analysis across a range of scientific domains.	[Video] [PDF]
March 11, 2021	Performance Optimizations for End-to-End AI Pipelines	Meena Arunachalam and Vrushabh Sanghavi, Intel Corporation	The trifecta of high volumes of data, abundant compute availability on cloud and on-premise, and rapid algorithmic innovations enable data scientists and AI researchers to do fast experiments, prototyping, and model development at an accelerated pace that was never possible before. In this talk, we will touch upon a variety of software packages, libraries, and tools that can also help HPC practitioners push the envelope of applying AI in their application domains and simulations at-scale. We will cover examples and talk about how to create efficient end-to-end AI pipelines with large data sets in-memory, security, and other features through Intel-optimized software packages such as Intel® Distribution of Python, Intel® Optimized Modin, Intel® Optimized Sklearn, and XGBoost, as well as DL Frameworks such as Intel® Optimized Tensorflow and Intel® Optimized PyTorch tuned and enabled with new hardware features and instructions every new CPU generation.	[Video] [PDF]
February 18, 2021	Migrating from CUDA-only to Multi-Platform DPC++	Steffen Christgau, Zuse Institute Berlin (ZIB). Marius Knaust (ZIB) will join to answer FPGA-related questions from the audience.	In this webinar we will demonstrate how an existing CUDA stencil application code can be migrated to DPC++ with the help of the Compatibility Tool. We will highlight and discuss the crucial differences between the two programming environments in the context of migrating the tsunami simulation easyWave. The discussion also includes steps for making the code compliant with the SYCL standard. During the talk, we will also show that the migrated code can run on a wide range of platforms, from CPUs through GPUs to FPGAs.	[Video] [PDF]
July 1, 2020	Migrating Your Existing CUDA Code to DPC++	Edward Mascarenhas and Sunny Gogar, Intel Corporation	Best practices for using a one-time migration tool that migrates CUDA applications into standards-based Data Parallel C++ (DPC++) code. Topics include: • An overview of the DPC++ language, including why it was created and how it benefits developers • An overview of the Intel DPC++ Compatibility Tool itself—what it is and what it does • Real-world examples of the code-migration concept, including the process and expectations • A demonstration of the steps involved to migrate CUDA code to DPC++ code, including what a complete migration looks like and best practices to follow	[Video]
February 20, 2020	Performance Optimization of Intel® oneAPI Applications	Kevin O’Leary Intel Corporation	Modern workloads are incredibly diverse—and so are architectures. No single architecture is best for every workload. Maximizing performance takes a mix of scalar, vector, matrix, and spatial (SVMS) architectures deployed in CPU, GPU, FPGA, and other future accelerators. Intel® oneAPI products will deliver the tools needed to deploy applications and solutions across SVMS architectures. This webinar will focus on the oneAPI features that focus on performance optimization, including the analysis tools: Intel® VTune™ Profiler(Beta) to find performance bottlenecks fast in CPU, GPU, and FPGA systems Intel® Advisor(Beta) for vectorization, threading, and accelerator offload design advice Part of the webinar will start with an application that is currently running on a CPU, and we will use the oneAPI tools to port and optimize on our GPU. The Intel® oneAPI set of complementary toolkits—a base kit and specialty add-ons—simplify programming and help improve efficiency and innovation. Use it for: high performance computing, machine learning and analytics, IoT applications, video processing, rendering, etc. This webinar will include extra time for Q&A. Presenter: Kevin O’Leary is a senior technical consulting engineer in Intel’s software tools group. Kevin was one of the original developers of Intel® Parallel Studio. Before coming to Intel, he spent several years on the IBM Rational Apex debugger development team.	[Video] [PDF]
November 14, 2019	The DREAM Framework and Binning Directories –or– Can We Analyze ALL Genomic Sequences on Earth?	Knut Reinert Freie Universität Berlin	The recent improvements of full genome sequencing technologies, commonly subsumed under the term NGS (Next Generation Sequencing), have tremendously increased the sequencing throughput. Within 10 years it rose from 21 billion base pairs collected over months to about 400 billion base pairs per day (current throughput of Illumina's HiSeq 4000). The costs for producing one million base pairs could also be reduced from 140,000 dollars to a few cents. As a result of this dramatic development, the number of new data submissions, generated by various biotechnological protocols (ChIP-Seq, RNA-Seq, etc.), to genomic databases has grown dramatically and is expected to continue to increase faster than the cost and capacity of storage devices will decrease. The main task in analyzing NGS data is to search sequencing reads or short sequence patterns (e.g., exon/intron boundary read-through patterns) or expression profiles in large collections of sequences (i.e., a database). Searching the entirety of such databases mentioned above is usually only possible by searching the metadata or a set of results initially obtained from the experiment. Searching (approximately) for a specific genomic sequence in all the data has not been possible in reasonable computational time. In this work we describe results of our new data structure, called a binning directory, that can distribute approximate search queries based on an extension of our recently introduced Interleaved Bloom Filters (IBF) called x-partitioned IBF (x-PIBF). The results presented here make use of the Intel® Optane™ DC persistent memory architecture and achieve significant speedups compared to a disk-based solution.	[Video]
October 10, 2019	Optimize for Both Memory and Compute on Modern Hardware Using Roofline Model Automation in Intel® Advisor	Zakhar Matveev and Cédric Andreolli Intel Corporation	Software must be optimized for both Compute (including SIMD vector parallelism) and effective memory sub-system utilization to achieve scaled performance on modern hardware. In this talk we present state-of-the-art Intel Advisor Roofline performance model automation which helps to identify memory bottlenecks and balance between CPU and memory utilization. The talk will not only cover “cache-aware” Roofline implementation, but also new capabilities to produce DRAM (“original”) and multi-level (L1, L2, LLC, MCDRAM and DRAM – all de-coupled) Roofline model flavors in order to guide DRAM- or cache-bound applications optimization.	[Video] [PDF - Matveev] [PDF - Andreolli]
September 12, 2019	Accelerate Your Inferencing with Intel® Deep Learning Boost	Shailen Sobhee Intel Corporation	Learn about Intel® Deep Learning Boost (Intel® DL Boost), and its Vector Neural Network Instructions (VNNI). These are a new set of Intel® Advanced Vector Extension 512 (Intel® AVX-512) instructions that are designed to deliver significantly more efficient Deep Learning inference acceleration. We will show a live demo of them in action and quickly show you how you can get started with Intel® DL Boost today.	[Video] [PDF]
July 11, 2019	Scaling Distributed TensorFlow Training with Intel’s nGraph Library on Xeon® Processor Based HPC Infrastructure	Jianying Lang Intel Corporation	Intel has released the nGraph library, a compiler and runtime APIs for multiple front-end Deep Learning frameworks, such as TensorFlow, MXNet, PaddlePaddle, and others. nGraph represents a framework's computational graph as an intermediate representation (IR) that can be executed by multiple backend computational hardware from the edge to the data center, thus significantly improving the productivity of AI data scientists. In this talk, we will present the details of the bridge that connects TensorFlow to nGraph for a Xeon CPU backend. We will demonstrate state-of-the-art (SOTA) accuracy and convergence for ResNet-50 against ImageNet-1K on multiple Xeon Skylake nodes. Using distributed nGraph, we are able to obtain ~75% Top-1 accuracy for ResNet-50 training on a small number of Xeon Skylake nodes. We will demonstrate convergence and excellent scaling efficiency on Skylake nodes connected with Ethernet using nGraph TensorFlow with the open-source Horovod code.	[Video] [PDF]
May 09, 2019	Deeply-Pipelined FPGA Clusters Make DNN Training Scalable	Tong Geng Boston University Tianqi Wang University of Science and Technology of China	Deep Neural Networks (DNNs) have revolutionized numerous applications, but the demand for ever more performance remains unabated. Scaling DNN computations to larger clusters is generally done by distributing tasks in batch mode using methods such as distributed synchronous SGD. Among the issues with this approach is that to make the distributed cluster work with high utilization, the workload distributed to each node must be large, which implies nontrivial growth in the SGD mini-batch size. In this talk, we introduce a framework, FPDeep, which uses a hybrid of model and layer parallelism to configure distributed reconfigurable clusters to train DNNs. This approach has numerous benefits. First, the design does not suffer from batch size growth. Second, novel workload and weight partitioning leads to balanced loads of both among nodes. And third, the entire system is a fine-grained pipeline. This leads to high parallelism and utilization and also minimizes the time features need to be cached while waiting for back-propagation. As a result, storage demand is reduced to the point where only on-chip memory is used for the convolution layers. We evaluate FPDeep with the Alexnet, VGG-16, and VGG-19 benchmarks. FPDeep provides, on average, 6.36x higher energy efficiency than comparable GPU servers.	[Video] [PDF]
April 11, 2019	A Study of SIMD Vectorization for Matrix-Free Finite Element Method	Tianjiao Sun Imperial College London, UK Lawrence Mitchell Durham University, UK David A. Ham Imperial College London, UK Paul H. J. Kelly Imperial College London, UK Kaushik Kulkarni University of Illinois at Urbana-Champaign, USA Andreas Kloeckner University of Illinois at Urbana-Champaign, USA	Vectorization is increasingly important to achieve high performance on modern hardware with SIMD instructions. Assembly of matrices and vectors in the finite element method, which is characterized by iterating a local assembly kernel over unstructured meshes, poses challenges to effective vectorization. Maintaining a user-friendly high-level interface with a suitable degree of abstraction while generating efficient, vectorized code for the finite element method is a challenge for numerical software systems and libraries. In this talk, we study cross-element vectorization in the finite element framework Firedrake and demonstrate the efficacy of such an approach by evaluating a wide range of matrix-free operators spanning different polynomial degrees and discretizations on two recent Intel CPUs using three mainstream compilers. Our experiments show that cross-element vectorization achieves 30% of theoretical peak performance for many examples of practical significance, and exceeds 50% for cases with high arithmetic intensities, with consistent speed-up over vectorization restricted to the local assembly kernels.	[Video] [PDF]
March 14, 2019	Scalable and Flexible Distributed Rendering with OSPRay's Distributed API and FrameBuffer	Will Usher Scientific Computing and Imaging Institute, University of Utah Ingo Wald Formerly Intel, now NVIDIA Jefferson Amstutz Intel Corporation Johannes Günther Intel Corporation Carson Brownlee Intel Corporation Valerio Pascucci Scientific Computing and Imaging Institute, University of Utah	Image and data-parallel rendering across multiple nodes on HPC systems is widely used in visualization to provide higher framerates, support large datasets, and render data in situ. Specifically for in situ, reducing bottlenecks incurred by the visualization and compositing tasks is of key concern to reduce the overall simulation run time, while for general interactive visualization, improving rendering performance, and thus interactivity, is always desirable. In this talk, Will Usher will present our work on an asynchronous image processing and compositing framework for multi-node rendering in OSPRay, dubbed the Distributed FrameBuffer. We demonstrate that this approach achieves performance superior to the state of the art for common use cases, while providing the flexibility to support a wide range of parallel rendering algorithms and data distribution. By building on this framework, we have extended OSPRay with a data-distributed API, enabling its use in data-distributed and in situ visualization applications. Will Usher will cover our approach to developing this framework, performance considerations, and use cases and examples of the new data-distributed API in OSPRay.	[Video] Recording begins at 2:50. [PDF]
February 14, 2019	Evaluation of Intel Memory Drive Technology Performance for Scientific Applications	Vladimir Mironov Lomonosov Moscow State University Yuri Alexeev Argonne National Laboratory Alexander Moskovsky RSC Technologies Andrey Kudryavtsev Intel Corporation	This talk will present benchmark data for IMDT, a new generation of Software-defined Memory (SDM) based on a collaboration between Intel and ScaleMP and using 3D XPoint-based Intel SSDs called Optane. IMDT performance was studied using synthetic benchmarks, scientific kernels, and applications. We chose these benchmarks to represent different patterns for computation and accessing data on disks and memory. To put the performance of IMDT in context, we used two memory configurations: a hybrid IMDT DDR4/Optane system and a DDR4-only system. The performance was measured as percentage of used memory and analyzed in detail. We found that for some applications the DDR4/Optane hybrid configuration outperforms the DDR4 setup by up to 20%.	[Video] Recording begins at 5:35. [PDF]
January 10, 2019	Massively Scalable Computing Method for Handling Large Eigenvalue Problem for Nanoelectronics Modeling	Hoon Ryu Korea Institute of Science and Technology Information (KISTI)	This talk will help you learn how the Lanczos iterative algorithm can be extended with parallel computing to solve highly degenerate systems. The talk will address the performance benefits of the core numerical operations in Lanczos iteration, which can be driven with manycore processors (KNL) compared to the heterogeneous systems containing PCI-E add-in devices. This work will also demonstrate an extremely large-scale benchmark (~2500 KNL computing nodes) that has been recently performed with the KISTI-5 (NURION) HPC resource. As this talk covers the numerical details of the algorithm, it should also be quite instructive to those who are considering KNL systems to solve large-scale eigenvalue problems.	[Video] [PDF]
October 11, 2018	Intel Optane Solutions in HPC	Andrey Kudryavtsev Intel Corporation	This session focuses on the latest Intel Optane technologies and the way it’s used by HPC customers. Attendees will learn about the best usage models and benefits Intel Optane introduces for fast storage or extending system memory.	[Video]
August 9, 2018	Machine Learning at Scale	Deborah Bard and Karthik Kashinath NERSC	Deep Learning has revolutionized the fields of computer vision, speech recognition, robotics and control systems. At NERSC, we have applied deep learning to problems in cosmology and climate science, focusing on areas that require supercomputing resources to solve real scientific challenges. In cosmology, we use deep learning to identify the underlying physical model that produced the matter distribution in the universe, and develop a deep learning-based emulator for cosmological observables that can reduce the need for computationally expensive simulations. In addition, we use feature introspection to examine the physical structures identified by the network as distinguishing between cosmological models. In climate, we apply deep learning to detect and localize extreme weather events such as tropical cyclones, atmospheric rivers and weather fronts in large-scale simulated and observed datasets. We will also discuss the challenges involved in scaling deep learning frameworks to supercomputer scale, and how to obtain optimal performance from supercomputing hardware.	[Video]
June 14, 2018	Using Roofline Analysis to Analyze, Optimize, & Vectorize Iso3DFD with Intel® Advisor	Kevin O’Leary Intel Corporation	This presentation will introduce the use of Intel® Advisor to help you enable vectorization in your application. We will use the Roofline Model in Intel Advisor to see the impact of our optimizations. We will also demonstrate how Intel Advisor can detect inefficient memory access patterns or loop-carried dependencies in your application. The case study we will use is Iso3DFD. This kernel propagates a wave in a 3D field using finite differences with a 16th-order stencil in an isotropic medium.	[Video]
May 10, 2018	High Productivity Languages	Rollin Thomas NERSC Sergey Maidanov Intel Corporation	This talk will cover challenges of numerical analysis and simulations at scale. Tools such as Python, which are often used for prototyping, are not designed to scale to large problems. As a result, organizations need a dedicated team that takes a prototype created by research scientists and deploys it in the production environment. A new approach is required to address both the scalability and productivity aspects of applied science, one that combines the best of the HPC world and the best of the database world. Starting with a brief overview of scalability aspects with respect to modern hardware architecture, we will characterize what the problem at scale is, its inherent characteristics, and how these map onto software design choices. We will also discuss selected experimental/observational science applications making use of Python at the National Energy Research Scientific Computing Center (NERSC), and what NERSC has done in partnership with the Intel Python Team to help application developers improve performance while retaining scientist/developer productivity.	[Slides 1] [Slides 2] [Video]
April 12, 2018	Topology and Cache Coherence in Knights Landing and Skylake Xeon Processors	John McCalpin TACC	Intel's second-generation Xeon Phi (Knights Landing) and Xeon Scalable Processor ("Skylake Xeon") are both based on a new 2-D mesh architecture with significant changes to the cache coherence protocol. This talk will review some of the most important new features of the coherence protocol (such as "snoop filters", "memory directories", and non-inclusive L3 caches) from a performance analysis perspective. For both of these processor families, the mapping from user-visible information (such as core numbers) to spatial location on the mesh is both undocumented and obscured by low-level renumbering. A methodology is presented that uses microbenchmarks and performance counters to invert this renumbering. This allows the display of spatially relevant performance counter data (such as mesh traffic) in a topologically accurate two-dimensional view. Applying these visualizations to simple benchmark results provides immediate intuitive insights into the flow of data in these systems, and reveals ways in which the new cache coherence protocols modify these flows.	[Slides] [Video]
March 8, 2018	Compiler Prefetching on KNL	Rakesh Krishnaiyer Intel Corporation	We will cover some of the recent changes in compiler-based prefetching (for Knights Landing and Skylake) and provide tips on how to tune for performance using compiler prefetching options, pragmas, and prefetch intrinsics.	[Slides] [Video]
February 8, 2018	Threading Building Blocks (TBB) Flow Graph: Expressing and Analyzing Dependencies in Your C++ Application	Pablo Reble Intel Corporation	Developing for heterogeneous systems is challenging because applications may be composed of many layers of parallelism and employ a diverse set of programming models or libraries. This session focuses on Flow Graph, an extension to the Threading Building Blocks (TBB) interface that can be used as a coordination layer for heterogeneity that retains optimization opportunities and composes with existing models. This extension assists in expressing complex synchronization and communication patterns and in balancing load between CPUs, GPUs, and FPGAs. Because a Flow Graph can express complex interactions, we use Intel Advisor’s Flow Graph Analyzer (FGA), which has been released as a Technology Preview in Parallel Studio XE 2018 to visualize interactions in a graph and map the application structure to performance data. Finally, we validate this approach by presenting use cases of applications using Flow Graph.	[Slides] [Video]
January 11, 2018	Vectorization of Inclusive/Exclusive Compiler 19.0	Nikolay Panchenko Intel Corporation	We propose a new OpenMP syntax to support inclusive and exclusive scan patterns. In computer science, this pattern is also known as a prefix or cumulative sum. The proposal defines several new constructs to support inclusive and exclusive scans through OpenMP, and defines semantics for these constructs and possible combinations of parallelization and vectorization. In Compiler 18.0, three new experimental OMP SIMD features were added: vectorization of loops with breaks, syntax for compress/expand patterns, and syntax for the histogram pattern.	[Slides]
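
For readers unfamiliar with the scan pattern named in the January 11, 2018 abstract, the following is a minimal sequential sketch in Python of what inclusive and exclusive scans compute. It is only an illustration of the pattern, not the OpenMP syntax proposed in the talk; the function names and the `identity` parameter are hypothetical.

```python
from itertools import accumulate


def inclusive_scan(values):
    """Element i of the result is the sum of values[0..i], including values[i]."""
    return list(accumulate(values))


def exclusive_scan(values, identity=0):
    """Element i of the result is the sum of values[0..i-1], excluding values[i].

    `identity` is the neutral element of the operation (0 for addition)
    and becomes the first output element.
    """
    # accumulate(..., initial=identity) yields one extra trailing element
    # (the total), which an exclusive scan does not include.
    return list(accumulate(values, initial=identity))[:-1]


if __name__ == "__main__":
    data = [3, 1, 4, 1, 5]
    print(inclusive_scan(data))  # [3, 4, 8, 9, 14]
    print(exclusive_scan(data))  # [0, 3, 4, 8, 9]
```

Each output element depends on the running total of the preceding elements, which is why vectorizing or parallelizing such loops needs dedicated compiler support rather than plain independent-iteration SIMD.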

For more information about previous meetings, please refer to the minutes.