Speakers & Descriptions

Balint Joo

More Lattice QCD on Xeon Phi Co-Processors

I will discuss some recent progress in lattice QCD calculations on the Xeon Phi architecture, including production running on Stampede, and consider prospects for the future.


CJ Newburn

HPC Programming for the Future

How should we program for the future? How do we keep programming tractable for scientists, while giving tuners freedom to innovate and tune performance for target systems? How can we give developers the freedom to lay out data as they wish, but still get good performance? Do we need to abandon standards to achieve this, or should we be stretching and influencing them? This talk will offer some insights on future directions in these areas.

Once developers tune performance within a node, it's time to scale performance out across a cluster, whether it's homogeneous or heterogeneous. What should libraries look like to best support that? What kind of underlying plumbing is needed? We invite you to engage with us in exploring the requirements and implementation issues with your applications.


DK Panda

MVAPICH2 and MVAPICH2-MIC (Tuesday)

We will present the latest status (features and performance results) of MVAPICH2 and MVAPICH2-MIC libraries.

Supporting PGAS Programming Models (OpenSHMEM and UPC) on MIC (Wednesday)
We will present the challenges in supporting emerging PGAS programming models (OpenSHMEM and UPC) on MIC.


Eric Stotzer

OpenMP 4.0 Acceleration
Hardware and software advances in GPU, MIC, ARM and FPGA technologies have accelerated the need for a common many-threaded model for these accelerators. The OpenMP Language Committee has also accelerated its pace and is finalizing features for the 4.1 release that will provide a common threading model for many-core technologies.

Insights into some of the design decisions that went into the OpenMP accelerator model will be presented. Also, a preview of the OpenMP accelerator sub-committee's future releases for the OpenMP specification will be outlined and discussed.


James Reinders

Tools, Standards and Books, oh my! Compilers, Libraries and VTune, oh my!
Intel supports the Intel Xeon Phi Coprocessor development community with more tools, standards and books. James will talk about supporting Xeon Phi development with Intel's latest book projects (you can contribute!), standards additions and direction, compiler innovations and the case for explicit vectorization.


Jeff Hammond

NWChem: The Next Generation
NWChem was designed to be the computational chemistry package of the future, confronting the challenges of massively parallel distributed systems coincident with their appearance. Its parallel programming technology - Global Arrays - predates MPI 1.0, yet has influenced the development of both PGAS languages and the MPI standard to the present day, in particular MPI-3. This talk considers the ongoing evolution of NWChem towards post-petascale, massively many-core supercomputing systems, especially Intel MIC systems. The essential role of portable HPC standards such as OpenMP 4 and MPI-3 is defended with bona fide scientific results.


Jim Phillips

NAMD and Charm++ on Xeon Phi
The parallel molecular dynamics code NAMD has been adapted to use the Xeon Phi co-processors on the TACC Stampede cluster. This was done with great assistance from Intel engineer David Kunzman, and is based on a clone of the offload infrastructure developed in NAMD for CUDA acceleration. This talk will present performance results and current challenges, as well as plans for NAMD on next-generation self-hosting Knights Landing processors.


Martin Berzins

Large Scale Engineering Simulations on Multicore and Heterogeneous Architectures Using the Uintah Framework
The Uintah software framework is being used to compute solutions to a number of large-scale applications by making use of an asynchronous, task-based approach. Such simulations may be run on machines such as Mira and Blue Waters. The challenge of the next generation of large-scale parallel machines requires the ability to adapt quickly to new heterogeneous architectures. The approach adopted here has three components: a runtime system that can change the scheduling of tasks; domain-specific languages that help in the development of new applications; and the possibility of transforming existing legacy code by using approaches such as Kokkos. We will survey developments in these areas and consider their advantages and disadvantages.


Ravi Murty

MPSS: Current Enhancements, Plans and KNL
This presentation will cover the functional, performance, and other enhancements being made in MPSS, the software stack for the Intel Xeon Phi coprocessor. We will also highlight some interesting investigations being conducted on the Linux kernel for KNC that will help with the future KNL processor. Finally, we will look at what is planned for KNL in terms of hardware and software.


Srinath Vadlamani

Efforts in Using Xeon Phi for CESM

We will present our experiences in porting CESM to Stampede's Xeon Phis. Besides the algorithm and code mapping, build system considerations and method of correctness verification will be covered. We will also present current performance relative to Xeon for specific CESM configurations and strategies for performance engineering.


Thanh Phung

Characterizing the communication profile of a large HPC workload targeted for Xeon Phi

Scaling a large HPC workload well from Xeon to Xeon Phi requires a good understanding not only of floating-point performance but also of data parallelism. This talk will discuss the Intel Trace Analyzer and Collector (ITAC) and Intel MPI 5.0. The ITAC tool is used to capture a detailed communication profile of electronic manufacturing workloads such as the Intel Tape-IN OMEN. Results show that OMEN is a unique MPI-based workload with asynchronous communication that successfully overlaps communication with computation throughout the execution, reducing the cost of sending and receiving a very large number of 20-40 MB messages. Intel MPI 5.0, which implements the new MPI 3.0 API with optimizations for scalable hardware with unified memory and RDMA, will also be briefly discussed, as it will be required to port OMEN to Xeon Phi.


Vince Betro

Applications Experiences and Training on Beacon, a Cray CS300-AC Equipped with Intel Xeon Phi Co-processors
Given the growing popularity of accelerator-based supercomputing systems, it is beneficial for applications software programmers to have cognizance of the underlying platform and its workings while writing or porting their codes to a new architecture. In this work, the authors highlight experiences and knowledge gained from porting such codes as GROMACS, BLAST, VisIt, ENZO, H3D, GYRO, a BGK Boltzmann solver, HOMME-CAM, PSC, AWP-ODC, TRANSIMS, and ASCAPE to the Intel Xeon Phi architecture running on a Cray CS300-AC™ Cluster Supercomputer named Beacon. Areas of optimization that bore the most performance gain are highlighted, and a set of metrics for comparison and lessons learned by the team at the National Institute for Computational Sciences Application Acceleration Center of Excellence is presented, with the intention that it can give new developers a head start in porting as well as a baseline for comparison of their own code's exploitation of fine- and medium-grained parallelism. Additionally, topics regarding best practices in Xeon Phi training will be discussed.


Dhananjay Brahme

Application Performance Enhancement on Xeon Phi

While applications that run on Xeon are portable to Xeon Phi, the application performance engineers at TCS have found that porting performance from the Xeon to the Xeon Phi architecture is not straightforward. The lower frequency and relatively simpler architecture of the Xeon Phi CPU cores force the programmer to stretch their thinking to achieve better scaling. The TCS team is working on optimizing scientific and engineering applications on the Xeon Phi architecture, specifically molecular dynamics, CFD, and ocean modeling. All of the applications started from a baseline of below-par performance on the Xeon Phi compared to their run time on a two-socket Ivy Bridge server. While the work on scaling is still in progress, all of the applications have so far achieved at least a 100% speedup over the initial baseline by making correct use of the hardware features, software tools, and APIs available on the Xeon Phi platform. This session is intended to share these experiences with the larger Xeon Phi user community. Also discussed will be desirable features in the tools and software accompanying the Xeon Phi that would be of great help to application performance engineers.