# Performance Variability on Xeon Phi





Brandon Cook, Thorsten Kurth, Brian Austin, Samuel Williams, Jack Deslippe

June 22, 2017









- Application developers
  - Understand performance
  - Reason effectively about optimizations
  - Give sound advice to application users
- Users
  - Efficient use of CPU allocations
  - Wasted cycles on terminated jobs
  - Correct estimates of campaign costs
- Facilities
  - System health
  - Advice for users
  - Utilization and scheduler efficiency





### **Cori at NERSC**



- 2388 Haswell nodes
  - 2x 16 cores @ 2.3 GHz
  - 40 MB shared L3
  - 128 GB DDR
- Cray Aries interconnect
  - dragonfly topology

- 9688 Xeon Phi (KNL) nodes
  - 68 cores @ 1.4 GHz
  - 34 MB distributed L2
  - 96 GB DDR
  - 16 GB MCDRAM (on-package)












## MCDRAM

















# **KNL is highly configurable**





### **Cluster modes**

- all-to-all
- quadrant
- SNC2/4

### **Memory modes**

- flat
- cache
- hybrid






# **MCDRAM cache mode**

- 16 GB MCDRAM cache
- Single NUMA domain
- No code modification
- No NUMA programming or affinity issues (e.g. numactl)
- But?










### Variability in cache mode









## **Brief introduction to caches**





#### KNL's MCDRAM cache is direct-mapped.





# Direct-Mapped Caches: Thrashing



http://sc.tamu.edu/help/power/powerlearn/html/ScalarOptnw/sld015.htm
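To make the thrashing condition concrete, here is a minimal sketch of how a direct-mapped cache assigns each address to exactly one slot. The 64-byte line size and the simple modulo indexing are assumptions for illustration, not a description of the actual KNL hardware; only the 16 GB capacity comes from the slides.

```python
# Toy model of a direct-mapped cache. Line size and indexing are assumptions.
LINE_BYTES = 64                      # assumed cache-line size
CACHE_BYTES = 16 * 2**30             # 16 GiB, the MCDRAM capacity quoted above
NUM_LINES = CACHE_BYTES // LINE_BYTES


def cache_slot(phys_addr):
    """Return the single slot a physical address can occupy."""
    return (phys_addr // LINE_BYTES) % NUM_LINES


def conflicts(addr_a, addr_b):
    """Two distinct lines conflict when they map to the same slot."""
    different_lines = addr_a // LINE_BYTES != addr_b // LINE_BYTES
    return different_lines and cache_slot(addr_a) == cache_slot(addr_b)


if __name__ == "__main__":
    a = 0x1000
    b = a + CACHE_BYTES              # exactly one cache size away
    c = a + 4096                     # a neighbouring page
    print(conflicts(a, b))           # True: the two lines evict each other
    print(conflicts(a, c))           # False: they occupy different slots
```

In this model, a code that alternates between `a` and `b` misses on every access even though the cache is almost empty, which is the thrashing pattern named in the slide title.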





# Misses depend on free page list



- OS stores a list of free memory pages.
- Allocations are made from the top of the list.
- The free page list gets scrambled if memory is not freed in the order it was allocated.
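The three points above can be illustrated with a toy free list. This is only a sketch: the real kernel allocator is far more elaborate, and the assumption that freed pages are appended to the end of the list is made purely so that freeing in allocation order restores the original ordering. The function name `simulate_free_list` is hypothetical.

```python
from collections import deque
import random


def simulate_free_list(num_pages, free_in_order, seed=0):
    """Allocate every page from the top of the list, then free them all.

    Returns the free list after all pages have been handed back.
    """
    free_list = deque(range(num_pages))          # starts sorted by page number
    allocated = [free_list.popleft() for _ in range(num_pages)]
    if not free_in_order:
        # Frees arrive in a different order than the allocations did.
        random.Random(seed).shuffle(allocated)
    free_list.extend(allocated)                  # assumption: freed pages go to the end
    return list(free_list)


if __name__ == "__main__":
    print(simulate_free_list(8, free_in_order=True))   # [0, 1, 2, 3, 4, 5, 6, 7]
    print(simulate_free_list(8, free_in_order=False))  # same pages, scrambled order
```

After an out-of-order free, the next job that allocates from the top of the list receives physical pages scattered across memory rather than one contiguous range.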









Solution: sort the free page list

- zonesort: kernel module provided by Intel
- At NERSC
  - called immediately before application launch
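Why sorting helps can be sketched with a toy direct-mapped cache that holds page-sized slots. This is not how zonesort is implemented; it only models the effect of allocating from a sorted versus a scrambled free list. The slot count and working-set size are arbitrary assumptions, while the 6:1 memory-to-cache ratio follows the 96 GB DDR and 16 GB MCDRAM figures given earlier.

```python
import random

CACHE_SLOTS = 256                    # page-sized slots in the toy cache (assumption)
MEMORY_PAGES = 6 * CACHE_SLOTS       # 96 GB DDR / 16 GB MCDRAM = 6


def conflict_count(free_list, working_set_pages):
    """Count working-set pages that share a cache slot with an earlier page."""
    pages = free_list[:working_set_pages]        # allocations come from the top
    slots = {page % CACHE_SLOTS for page in pages}
    return len(pages) - len(slots)


def scrambled_free_list(seed=0):
    """Return all page numbers in a pseudo-random order."""
    pages = list(range(MEMORY_PAGES))
    random.Random(seed).shuffle(pages)
    return pages


if __name__ == "__main__":
    scrambled = scrambled_free_list()
    working_set = 200                            # smaller than the cache
    print("scrambled:", conflict_count(scrambled, working_set))
    print("sorted:   ", conflict_count(sorted(scrambled), working_set))  # 0
```

With a sorted list, a working set smaller than the cache occupies consecutive pages and therefore distinct slots. With a scrambled list, pages collide even though the working set would fit.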







#### zonesort off

#### zonesort on







# Effect of zonesort on HPGMG





- HPGMG: High Performance Geometric Multi-Grid
- Highly instrumented
- Perfectly load-balanced problem
- "smooth time"
- 256^3 grid per rank



# Job placement



























|                          | Haswell    | Xeon Phi  |
|--------------------------|------------|-----------|
| Flops                    | 1.2 TFlops | 3 TFlops  |
| Memory bandwidth         | ~100 GB/s  | ~400 GB/s |
| Memory capacity          | 128 GB     | 96 GB     |
| Capacity / bandwidth (s) | 1.28       | 0.24      |

More flops & lower memory capacity / bandwidth + same network =

**more pressure on the network!**
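The capacity / bandwidth row is simply the time needed to stream through all of a node's memory once. A small sketch of that arithmetic, using the approximate figures from the table (the dictionary layout and function name are illustrative):

```python
# Approximate per-node figures taken from the table above.
NODES = {
    "Haswell": {"capacity_gb": 128, "bandwidth_gb_s": 100},
    "Xeon Phi": {"capacity_gb": 96, "bandwidth_gb_s": 400},
}


def sweep_time_s(capacity_gb, bandwidth_gb_s):
    """Seconds needed to read the whole memory once at full bandwidth."""
    return capacity_gb / bandwidth_gb_s


if __name__ == "__main__":
    for name, spec in NODES.items():
        print(f"{name}: {sweep_time_s(**spec):.2f} s")
    ratio = sweep_time_s(**NODES["Haswell"]) / sweep_time_s(**NODES["Xeon Phi"])
    print(f"Xeon Phi sweeps its memory about {ratio:.1f}x faster")
```

A node that can work through its entire memory several times faster reaches the communication phase of each step sooner, which is one way to read the "more pressure on network" conclusion.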





### **Aries topology**





~386 nodes per group







## Impact of # of groups











### `sbatch --switches=<count>[@<max-time>]`

`<count>` = # of groups

`<max-time>` = time to wait for constraint
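As a rough sketch of how one might choose `<count>`, the helper below computes the smallest number of groups that could hold a job and formats the option string. The helper names are hypothetical, the 386 nodes-per-group figure is the approximate value quoted in this talk, and on a shared system fully empty groups may not be available, so the minimum is a lower bound rather than a guarantee.

```python
import math

NODES_PER_GROUP = 386        # approximate figure quoted in this talk


def min_groups(num_nodes, nodes_per_group=NODES_PER_GROUP):
    """Smallest number of groups that could hold num_nodes nodes."""
    return math.ceil(num_nodes / nodes_per_group)


def switches_option(num_nodes, max_wait=None):
    """Format the --switches option; max_wait is a Slurm time string."""
    option = f"--switches={min_groups(num_nodes)}"
    if max_wait is not None:
        option += f"@{max_wait}"
    return option


if __name__ == "__main__":
    print(switches_option(256))               # --switches=1
    print(switches_option(256, "02:00:00"))   # --switches=1@02:00:00
    print(switches_option(1000, "30"))        # --switches=3@30
```

The optional `@<max-time>` part bounds how long the scheduler holds the job back waiting for the requested placement before starting it anyway.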





## Impact of # of groups









### **Job Placement**



*Figure: one Aries group; filled squares (■) mark the nodes allocated to the job.*





### Chroma HMC – 256 nodes










- MCDRAM cache
  - Direct-mapped cache
  - Leads to cache conflicts
  - Intel zonesort reduces them
- Job placement
  - Bad placement introduces extra hops for data
  - Bad placement increases potential for interference
  - SLURM topology control helps (for jobs with < 350 nodes)
- Not covered in this talk
  - IO! (burst buffer on compute fabric)
  - Identification of network "aggressors"
  - Frequency scaling (DVFS)







### National Energy Research Scientific Computing Center



