

# EARTHQUAKE SIMULATIONS ON THE INTEL XEON PHI PROCESSOR

*Alexander Heinecke* (Intel), Josh Tobin (UCSD), Alexander Breuer (UCSD), Charles Yount (Intel), Yifeng Cui (UCSD)

Parallel Computing Lab, Intel Labs, USA

November 14th 2017

#### Legal Disclaimer & Optimization Notice

INFORMATION IN THIS DOCUMENT IS PROVIDED "AS IS". NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. INTEL ASSUMES NO LIABILITY WHATSOEVER AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO THIS INFORMATION INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT.

Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products.

Copyright © 2015, Intel Corporation. All rights reserved. Intel, Pentium, Xeon, Xeon Phi, Core, VTune, Cilk, and the Intel logo are trademarks of Intel Corporation in the U.S. and other countries.

#### **Optimization Notice**

Intel's compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice.

Notice revision #20110804

## AWP-ODC-OS

## What is AWP-ODC-OS?

AWP-ODC-OS (Anelastic Wave Propagation, Olsen, Day, Cui): Simulates seismic wave propagation after a fault rupture

Used extensively by the Southern California Earthquake Center community (SCEC)

#### License: BSD 2-Clause



<u>Combined hazard map</u> of CyberShake Study 15.4 (LA, CVM-S4.26) and CyberShake Study 17.4 (Central California, CCA-06). AWP-ODC simulations are used to generate hazard maps. Colors show the 2-second period spectral acceleration (SA) for a 2% exceedance probability in 50 years.



## What is EDGE?

Extreme-scale Discontinuous Galerkin Environment (EDGE): Seismic wave propagation through DG-FEM

Focus: Problem settings with high geometric complexity, e.g., mountain topography

#### License: BSD 3-Clause (software), CC0 for supporting files (e.g., user guide)

http://dial3343.org



Example of hypothetical seismic wave propagation with mountain topography using EDGE. Shown is the surface of the computational domain covering the San Jacinto fault zone between Anza and Borrego Springs in California. Colors denote the amplitude of the particle velocity, where warmer colors correspond to higher amplitudes.

#### **Two Representative Codes**

#### AWP-ODC-OS

Finite difference scheme: 4th order in space, 2nd order in time

Staggered-grid, velocity/stress formulation of elastodynamic equations with frequency-dependent attenuation

Memory bandwidth bound

#### EDGE

Discontinuous Galerkin Finite Element Method (DG-FEM)

Unstructured tetrahedral meshes

Small matrix kernels in inner loops

Compute bound for higher orders
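To make the "4th order in space" property of the AWP-ODC-OS finite difference scheme concrete, here is a minimal one-dimensional sketch of a staggered-grid first derivative. It assumes the standard 4th-order staggered weights 9/8 and -1/24; the slides do not list the coefficients, and the function names are hypothetical. It is not AWP-ODC-OS source code, and it covers only the spatial stencil, not the 2nd-order time stepping or attenuation.

```python
import math

# Assumption: standard 4th-order staggered-grid weights (not quoted in the slides).
C1, C2 = 9.0 / 8.0, -1.0 / 24.0


def staggered_derivative(f, h):
    """Approximate df/dx at the half points x[i] + h/2 for i = 1 .. len(f) - 3."""
    return [
        (C1 * (f[i + 1] - f[i]) + C2 * (f[i + 2] - f[i - 1])) / h
        for i in range(1, len(f) - 2)
    ]


def max_error(n):
    """Maximum stencil error for f(x) = sin(x) on [0, 1] with n grid intervals."""
    h = 1.0 / n
    f = [math.sin(i * h) for i in range(n + 1)]
    approx = staggered_derivative(f, h)
    exact = [math.cos((i + 0.5) * h) for i in range(1, n - 1)]
    return max(abs(a - e) for a, e in zip(approx, exact))


# Halving the grid spacing should reduce the error by about 2**4.
coarse, fine = max_error(20), max_error(40)
observed_order = math.log2(coarse / fine)
print(f"observed spatial order: {observed_order:.2f}")
```

Each output value needs only four neighboring inputs and a handful of floating-point operations, which is consistent with the scheme being limited by memory bandwidth rather than by compute.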









# AWP-ODC-OS

#### **Boosting Single-Node Performance: Vector Folding**





#### **Architecture Comparison**

Xeon Phi KNL 7290: 2x speedup over NVIDIA K20X; 97% of NVIDIA Tesla P100 performance

Memory bandwidth accurately predicts performance of architectures (as measured by STREAM and HPCG-SpMv)



Single-node performance comparison of AWP-ODC-OS on a variety of architectures. Also displayed is the bandwidth of each architecture, as measured by STREAM and HPCG-SpMv [ISC17\_2].


#### **Outperforming 20K GPUs**

Weak scaling studies on NERSC Cori Phase II and TACC Stampede Extension

Parallel efficiency of over 91% from 1 to 9000 nodes (9000 nodes = 612,000 cores)

Problem size of 512x512x512 per node (14 GB per node)

Performance on 9000 nodes of Cori equivalent to performance of over 20,000 K20X GPUs at 100% scaling

[Figure: parallel efficiency vs. number of nodes]

AWP-ODC-OS weak scaling on Cori Phase II and TACC Stampede. We attain 91% scaling from 1 to 9000 nodes. The problem size required 14 GB on each node [ISC17\_2].
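The weak-scaling figures quoted for AWP-ODC-OS can be cross-checked with a few lines of arithmetic. The sketch below assumes 68 cores per Cori-II node (the Knights Landing core count stated elsewhere in this deck) and reads "14 GB" as 14 × 10^9 bytes. The efficiency helper uses the usual weak-scaling definition, and the timings passed to it are hypothetical placeholders, not measurements.

```python
NODES = 9000
CORES_PER_NODE = 68              # assumption: 68-core Knights Landing nodes (Cori-II)
GRID_POINTS_PER_NODE = 512 ** 3  # 512x512x512 grid points per node
BYTES_PER_NODE = 14e9            # assumption: "14 GB" means 14 * 10**9 bytes


def weak_scaling_efficiency(t_single_node, t_n_nodes):
    """Weak-scaling efficiency: per-node work is fixed, so ideal runtime is constant."""
    return t_single_node / t_n_nodes


total_cores = NODES * CORES_PER_NODE
total_grid_points = NODES * GRID_POINTS_PER_NODE
bytes_per_grid_point = BYTES_PER_NODE / GRID_POINTS_PER_NODE

print(f"total cores:          {total_cores:,}")
print(f"total grid points:    {total_grid_points:.3e}")
print(f"bytes per grid point: {bytes_per_grid_point:.1f}")

# Hypothetical timings, chosen only to illustrate the formula.
print(f"example efficiency:   {weak_scaling_efficiency(10.0, 11.0):.2f}")
```

Under these assumptions the core count reproduces the 612,000 stated above, and the memory footprint comes out at roughly 104 bytes per grid point.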



# EDGE

## **Fused Simulations**

Exploits inter-simulation parallelism:

- Full vector operations, even for sparse matrix operators
- Automatic memory alignment
- Read-only data shared among all runs
- Lower sensitivity to latency (memory & network)
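The points above can be illustrated with a small sketch of a fused data layout, in which the index of the fused simulation is the fastest-running dimension. This is a simplified illustration under stated assumptions, not EDGE's implementation: the function names, the coordinate-list operator format, and the 3x3 operator are hypothetical.

```python
def apply_sparse_fused(nonzeros, dofs, n_rows):
    """Apply a sparse operator to fused degrees of freedom.

    nonzeros: list of (row, col, value) entries, shared read-only by all runs.
    dofs:     dofs[col][run]; the run index is the fastest-running dimension.
    """
    n_runs = len(dofs[0])
    out = [[0.0] * n_runs for _ in range(n_rows)]
    for row, col, value in nonzeros:
        src, dst = dofs[col], out[row]
        # One nonzero drives a dense, contiguous loop over all fused runs.
        # This is the loop that maps onto full vector instructions.
        for run in range(n_runs):
            dst[run] += value * src[run]
    return out


def apply_sparse_single(nonzeros, vec, n_rows):
    """Reference: the same sparse operator applied to one simulation."""
    out = [0.0] * n_rows
    for row, col, value in nonzeros:
        out[row] += value * vec[col]
    return out


# Hypothetical operator with three nonzeros; three modes, four fused runs.
operator = [(0, 0, 2.0), (1, 2, -1.0), (2, 1, 0.5)]
fused = [
    [1.0, 2.0, 3.0, 4.0],
    [5.0, 6.0, 7.0, 8.0],
    [9.0, 10.0, 11.0, 12.0],
]
result = apply_sparse_fused(operator, fused, 3)
print(result)
```

Because the inner loop is dense regardless of the operator's sparsity pattern, only the nonzero entries cost any work, and each one is amortized over all fused runs.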



Illustration of the memory layout for fused simulations in EDGE. Shown is a third-order configuration for line elements and the advection equation. Left: single forward simulation, right: 4 fused simulations.



Illustration of fused simulations in EDGE for the advection equation using line elements. Top: Single forward simulation, bottom: 4 fused simulations.

## Fused Simulations: Performance

Orders: 2-6 (non-fused), 2-4 (fused)

Unstructured tetrahedral mesh: 350,264 elements

Single node of Cori-II (68 core Intel Xeon Phi x200, code-named Knights Landing)

EDGE vs. SeisSol (GTS, git-tag 201511)

Speedup: <u>2-5x</u>



[Figure: speedup of EDGE over SeisSol by configuration (order, #fused simulations): O2C1, O2C8, O3C1, O3C8, O4C1, O4C8, O5C1, O6C1]

LOH.1 Benchmark: Example mesh and material regions [ISC16\_1]


Speedup of EDGE over SeisSol (GTS, git-tag 201511). Convergence rates O2 – O6: single non-fused forward simulations (O2C1-O6C1). Additionally, per-simulation speedups for orders O2–O4 when using EDGE's full capabilities by fusing eight simulations (O2C8-O4C8). [ISC17\_1]

#### Reaching 10+ PFLOPS

Regular cubic mesh, 5 Tets per Cube, 4th order (O4) and 6th order (O6)

Imitates convergence benchmark

276K elements per node

1-9000 nodes of Cori-II (9000 nodes = 612,000 cores)

O6C1 @ 9K nodes: 10.4 PFLOPS (38% of peak)

O4C8 @ 9K nodes: 5.0 PFLOPS (18% of peak)

O4C8 vs. O4C1 @ 9K nodes: 2.0x speedup
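As a consistency check on these numbers, the sketch below derives the machine peak implied by each sustained-performance figure and its stated fraction of peak. It assumes both percentages refer to the same hardware peak and treats "276K" as exactly 276,000; the helper name is hypothetical.

```python
NODES = 9000
ELEMENTS_PER_NODE = 276_000  # assumption: "276K" taken as exactly 276,000


def implied_peak(sustained_pflops, fraction_of_peak):
    """Peak performance implied by a sustained rate and its share of peak."""
    return sustained_pflops / fraction_of_peak


peak_from_o6c1 = implied_peak(10.4, 0.38)  # O6C1: 10.4 PFLOPS at 38% of peak
peak_from_o4c8 = implied_peak(5.0, 0.18)   # O4C8: 5.0 PFLOPS at 18% of peak
per_node_tflops = peak_from_o6c1 * 1000.0 / NODES
total_elements = ELEMENTS_PER_NODE * NODES

print(f"implied peak (O6C1): {peak_from_o6c1:.1f} PFLOPS")
print(f"implied peak (O4C8): {peak_from_o4c8:.1f} PFLOPS")
print(f"implied per-node peak: {per_node_tflops:.2f} TFLOPS")
print(f"total elements at 9000 nodes: {total_elements:,}")
```

Both results imply a peak of roughly 27 to 28 PFLOPS for 9000 nodes, or about 3 TFLOPS per node, so the two percentages agree with each other to within rounding.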



Weak scaling study on Cori-II. Shown are hardware and non-zero peak efficiencies in flat mode. O denotes the order and C the number of fused simulations [ISC17\_1].


#### Strong at the Limit: 50x and 100x



32-3200 nodes of Theta (64 core Intel Xeon Phi x200, code-named Knights Landing)

3200 nodes = 204,800 cores

O6C1 @ 3.2K nodes: 3.4 PFLOPS (40% of peak)

O4C8 vs. O4C1 @ 3.2K nodes: 2.0x speedup



Strong scaling study on Theta. Shown are hardware and non-zero peak efficiencies in flat mode. O denotes the order and C the number of fused simulations [ISC17\_1].


#### **Outlook: AI Revolution**

- EDGE is a prime candidate for merging traditional HPC and AI
- Work in progress: LIBXSMM for AVX512\_4FMAPS (Knights Mill)
- Future work: AVX512\_4VNNIW for seismic simulations (Knights Mill)
- Future work: Fused simulations to address high-dimensional parameter spaces ("crunching data"):
  - EDGElearn: (Deep) Learning from seismic simulations
- Future work: LIBXSMM in TensorFlow




#### References

- [ISC17\_1] <u>A. Breuer</u>, A. Heinecke, Y. Cui: EDGE: Extreme Scale Fused Seismic Simulations with the Discontinuous Galerkin Method. Proceedings of International Super Computing (ISC) High Performance 2017.
- [ISC17\_2] J. Tobin, <u>A. Breuer</u>, C. Yount, A. Heinecke, Y. Cui: Accelerating Seismic Simulations Using the Intel Xeon Phi Knights Landing Processor. Proceedings of International Super Computing (ISC) High Performance 2017.
- [ISC16\_1] A. Heinecke, <u>A. Breuer</u>, M. Bader: High Order Seismic Simulations on the Intel Xeon Phi Processor (Knights Landing). High Performance Computing: 31st International Conference, ISC High Performance 2016, Frankfurt, Germany, June 19-23, 2016, Proceedings. <u>http://dx.doi.org/10.1007/978-3-319-41321-1\_18</u>
- [ISC16\_2] A. Heinecke, <u>A. Breuer</u>, M. Bader: Chapter 21 High Performance Earthquake Simulations. In Intel Xeon Phi Processor High Performance Programming Knights Landing Edition.
- [IPDPS16] <u>A. Breuer</u>, A. Heinecke, M. Bader: Petascale Local Time Stepping for the ADER-DG Finite Element Method. In Parallel and Distributed Processing Symposium, 2016 IEEE International. <u>http://dx.doi.org/10.1109/IPDPS.2016.109</u>
- [ISC15] <u>A. Breuer</u>, A. Heinecke, L. Rannabauer, M. Bader: High-Order ADER-DG Minimizes Energy- and Time-to-Solution of SeisSol. In 30th International Conference, ISC High Performance 2015, Frankfurt, Germany, July 12-16, 2015.
- [SC14] A. Heinecke, <u>A. Breuer</u>, S. Rettenberger, M. Bader, A.-A. Gabriel, C. Pelties, A. Bode, W. Barth, X.-K. Liao, K. Vaidyanathan, M. Smelyanskiy and P. Dubey: Petascale High Order Dynamic Rupture Earthquake Simulations on Heterogeneous Supercomputers. In Supercomputing 2014, The International Conference for High Performance Computing, Networking, Storage and Analysis. IEEE, New Orleans, LA, USA, November 2014. Gordon Bell Finalist.
- [ISC14] <u>A. Breuer</u>, A. Heinecke, S. Rettenberger, M. Bader, A.-A. Gabriel and C. Pelties: Sustained Petascale Performance of Seismic Simulations with SeisSol on SuperMUC. In J.M. Kunkel, T. T. Ludwig and H.W. Meuer (ed.), Supercomputing 29th International Conference, ISC 2014, Volume 8488 of Lecture Notes in Computer Science. Springer, Heidelberg, June 2014. 2014 PRACE ISC Award.
- [PARCO13] <u>A. Breuer</u>, A. Heinecke, M. Bader and C. Pelties: Accelerating SeisSol by Generating Vectorized Code for Sparse Matrix Operators. In Parallel Computing — Accelerating Computational Science and Engineering (CSE), Volume 25 of Advances in Parallel Computing. IOS Press, April 2014.

