

# A Unified Network for HPC and AI

#### Uri Elzur, Intel

September 21st, 2023





Intel® Data Center GPU Max Series



## Notices and Disclaimers

Intel technologies may require enabled hardware, software or service activation. // No product or component can be absolutely secure. // Your cost and results may vary. // Performance varies by use, configuration and other factors. // See our complete legal Notices and Disclaimers.

Intel is committed to respecting human rights and avoiding complicity in human rights abuses. See Intel's Global Rights Principles. Intel's products and software are intended only to be used in applications that do not cause or contribute to a violation of internationally recognized human rights.

© Intel Corporation. Intel, the Intel logo, and other Intel marks are trademarks of Intel Corporation or its subsidiaries. Other names and brands may be claimed as the property of others.



Ultra Ethernet Consortium overview

Summary

- Workload large or small? Model vs Tokens
- Large Language Model and RecSys
- Cloud approach vs dedicated Training cluster
- HPC embraces AI or AI using HPC? Convergence of sorts?
- GPU, TPU, Wafer Scale or other???
- Optics: NPO, CPO, Direct drive, Intra ASIC?
- Network: Lossy vs Lossless religion or technology?








#### THE CONVERGENCE OF HPC \* AI

Integrating the Third and Fourth Pillars of Scientific Discovery






















# At a Crossroads or maybe a Perfect Storm...?





PyTorch 2.0 Preview Announced!





176B params 59 languages Open-access






From IEEE DCB tutorial




# Networks of Interest: Basic Characteristics



#### Network #1 – CSP or big lab - proprietary

Network #2 - Ultra Ethernet

Network #3 – Vendor specific? Network or Memory? ASIC/Node/Package/Optics Technology

#### Primary DC network

- Used by all 3 deployment models
- Main network for some HPC At Scale
- Very large scale: up to 100K-1M Endpoints
- Distance: >150 m; RTT ~100 µs+; BW/GPU ~10 GB/s
- Storage attached e.g., over RoCE RDMA
- Network semantics

#### GPU/TPU Scale-Out Network

- DL/Inference cluster: ~10k nodes and up
- Distance: <100 m; RTT <10 µs+; BW ~50 GB/s
- Main network for some HPC At Scale
- Network semantics

#### GPU/TPU Scale-Up Network

- Within a node; small scale e.g., 256 XPU?
- Distance: ~1 m; RTT ~1 µs+; BW ~1000 GB/s
- Direct connect and/or switched
- Memory and Network semantics
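
As a rough worked example of what these figures imply, the sketch below computes the bandwidth-delay product (RTT × bandwidth) for each network, i.e. the bytes an endpoint must keep in flight to fill the pipe. The RTT and bandwidth values are the approximate figures from the lists above, treated as point values; the dictionary keys and function name are illustrative only.

```python
# Approximate figures from the three network descriptions; GB/s taken as 1e9 bytes/s.
NETWORKS = {
    "primary_dc": {"rtt_us": 100, "bw_gb_per_s": 10},
    "scale_out": {"rtt_us": 10, "bw_gb_per_s": 50},
    "scale_up": {"rtt_us": 1, "bw_gb_per_s": 1000},
}


def bytes_in_flight(rtt_us: int, bw_gb_per_s: int) -> int:
    """Bandwidth-delay product in bytes.

    1 microsecond at 1 GB/s carries 1e-6 * 1e9 = 1000 bytes,
    so the product stays in integer arithmetic.
    """
    return rtt_us * bw_gb_per_s * 1000


for name, params in NETWORKS.items():
    bdp = bytes_in_flight(params["rtt_us"], params["bw_gb_per_s"])
    print(f"{name}: {bdp // 1000} KB in flight per endpoint")
```

Under these assumptions all three networks land within a factor of two of each other (roughly 0.5 to 1 MB in flight), even though their RTTs differ by two orders of magnitude.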

## The 3 Key Deployment Models



## Common Requirements





Transport primitives for:

- Large Scale
- Optimized RDMA
- Multi pathing
- Performance – BW, latency, tail latency, Packets/S
- Relaxed ordering
- High network utilization
- Modernized Congestion Control
- Stability and Reliability

# The Network – direct workload performance influence!

#### AI



- Framework coordinated systolic
- High Bandwidth
- Large messages
- In Network Compute 2x potential
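
One common way to see where a 2x potential could come from (the slide does not give a derivation, so this is an assumption) is to compare the bytes each node sends in a ring allreduce against an idealized in-network reduction where the switch does the summing. The function names below are illustrative.

```python
def ring_allreduce_bytes_sent(message_bytes: int, nodes: int) -> float:
    """Bytes each node sends in a ring allreduce.

    Reduce-scatter and all-gather each move (nodes - 1) / nodes of the message.
    """
    return 2 * (nodes - 1) / nodes * message_bytes


def in_network_bytes_sent(message_bytes: int) -> float:
    """Bytes each node sends when the network reduces: one copy goes up."""
    return float(message_bytes)


def potential_speedup(message_bytes: int, nodes: int) -> float:
    """Ratio of bytes sent per node, ring vs. in-network (bandwidth-bound case)."""
    return ring_allreduce_bytes_sent(message_bytes, nodes) / in_network_bytes_sent(message_bytes)


for n in (2, 8, 1024):
    print(n, "nodes:", round(potential_speedup(1 << 30, n), 3))
```

The ratio is 2(N-1)/N, which approaches 2 as the node count grows. It only applies to large, bandwidth-bound messages and ignores switch processing limits.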

#### https://youtu.be/miv5PExXTmc?t=782

#### HPC

#### Performance Evaluation – Micro-benchmarks

Experimental results from Dell Bluebonnet:

- Up to 20% reduction in small message point-to-point latency
- From 0.1x to 2x increase in bandwidth
- Up to 12.4x lower MPI\_Allreduce latency
- Up to 5x lower MPI\_Scatter latency

[Figure: osu\_bw for large messages; Alltoall, Allreduce and Scatter latency at 64 nodes, 128 PPN; x-axis: Message Size (Bytes)]

- MPI
- Small messages – Latency sensitive

https://mvapich.cse.ohio-state.edu/static/media/talks/slide/kawthar-slingshot-osu-booth-sc22\_2.pdf

Existing application support - required



An Ethernet-based, open, interoperable, high performance, full-communications stack architecture to meet the growing network demands of AI & HPC at scale

Uri Elzur Technical Advisory Committee Chair, Ultra Ethernet Consortium

## **INTRODUCING: THE PROMISE OF ULTRA ETHERNET**

https://ultraethernet.org/



## **Steering Committee Members**

Arista, Cisco, Eviden (an Atos business), Hewlett Packard Enterprise, Intel, Meta, Microsoft, Oracle

Ultra*Ethernet* 

## **TARGET DEPLOYMENT MODELS / USE CASES**



Profiles defined for AI and HPC use cases

Copyright ©2023 Ultra Ethernet Consortium. All Rights Reserved

## APPROACH


The founding companies are seeding the consortium with highly valuable contributions in four working groups: **Physical Layer, Link Layer, Transport Layer and Software Layer**.



UEC will follow a systematic approach with modular, compatible, interoperable layers. Tight integration of these layers is paramount to provide a holistic improvement for demanding workloads.

The consortium will work on **minimizing communication stack changes** while maintaining and **promoting Ethernet interoperability**.

Project under the Joint Development Foundation (JDF) of the Linux Foundation


## **TECHNICAL GOALS**

**Open** specifications, APIs, source code for optimal performance of AI and HPC workloads at scale.



## **UEC TRANSPORT ADDRESSES GRAND CHALLENGES**

- Future proof system scale with up to 1M endpoints
- Improved network utilization with multi-path routing
- Lower tail latency with flexible packet ordering
- Faster congestion control response times
- Modernized & optimized RDMA operations and APIs
- Security built-in from the beginning
- End-To-End telemetry provides improved network visibility



## **FUTURE PROOF SYSTEM SCALE & NETWORK UTILIZATION**

- Determinism and predictability become more difficult as systems grow
  - Network Stability, Fairness, re-convergence times, deadlock avoidance are part of the design
- With "packet spraying", every flow simultaneously uses all paths to the destination, vs. a flow using a single path
  - Achieves more balanced use of entire network
- From Rigid to Flexible Ordering
  - Rigid packet and message ordering uses "go-back-n" for loss recovery, but restricts network utilization and increases tail latencies
  - Flexible ordering enables packet spraying in bandwidth-intensive large collective operations, without reorder buffers
  - Supports modernized RDMA operations and APIs, relaxing packet ordering while enabling maintenance of message ordering
  - Minimizes state and complexity at the initiator and target
  - Critical to curtail tail latencies
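
The difference between per-flow path selection and packet spraying can be sketched with a toy load model. This is a simplified illustration, not the UEC mechanism: flows are pinned to a random path (standing in for a hash-based choice), sprayed packets are spread round-robin, and all names and sizes are hypothetical.

```python
import random


def per_flow_load(flows: int, packets_per_flow: int, paths: int, seed: int = 0) -> list:
    """Packets per path when every flow is pinned to one path."""
    rng = random.Random(seed)  # stand-in for a per-flow hash
    load = [0] * paths
    for _ in range(flows):
        load[rng.randrange(paths)] += packets_per_flow
    return load


def sprayed_load(flows: int, packets_per_flow: int, paths: int) -> list:
    """Packets per path when each packet of each flow takes the next path."""
    load = [0] * paths
    for flow in range(flows):
        for seq in range(packets_per_flow):
            load[(flow + seq) % paths] += 1
    return load


def imbalance(load: list) -> float:
    """Busiest path relative to the mean; 1.0 means perfectly balanced."""
    return max(load) / (sum(load) / len(load))


# A few large flows over many paths, as in a bandwidth-heavy collective.
pinned = per_flow_load(flows=4, packets_per_flow=1000, paths=8)
sprayed = sprayed_load(flows=4, packets_per_flow=1000, paths=8)
print("per-flow:", pinned, "imbalance", imbalance(pinned))
print("sprayed: ", sprayed, "imbalance", imbalance(sprayed))
```

With four flows over eight paths, pinning leaves at least four paths idle, so the busiest path carries at least twice the mean load; spraying puts the same number of packets on every path. The price is that packets of one message may arrive out of order, which is what flexible ordering addresses.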




## **ADVANCED SECURITY, CONGESTION CONTROL & TELEMETRY**

#### Congestion

- Optimized response time while maintaining high utilization
- Support packet spraying
- Address incast (e.g., as a result of All-to-All)
- Telemetry
  - Address wire and end-point congestion
  - Leverage shortened congestion signaling path, with more information to the endpoints to allow a more responsive congestion control
  - Information = location and cause of the congestion
- Advanced Security
  - Encryption support that doesn't balloon the session state in hosts and network interfaces
  - Similar conditions in AI and HPC
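
To illustrate why response time matters for incast, the sketch below estimates how long a switch buffer lasts when several line-rate senders converge on one egress port. The buffer size and sender count are hypothetical values chosen for the arithmetic, not figures from UEC or any product.

```python
import math


def time_to_overflow_us(buffer_bytes: int, senders: int, link_gb_per_s: int) -> float:
    """Microseconds until the buffer fills.

    Assumes `senders` flows arrive at line rate and drain through one
    egress port of the same rate, so the excess is (senders - 1) links.
    """
    excess_gb_per_s = (senders - 1) * link_gb_per_s
    if excess_gb_per_s <= 0:
        return math.inf  # a single sender never overruns the port
    # 1 GB/s equals 1000 bytes per microsecond.
    return buffer_bytes / (excess_gb_per_s * 1000)


# Hypothetical: 1 MB of buffer, 10 GB/s links.
for n in (2, 11, 101):
    print(n, "senders:", time_to_overflow_us(1_000_000, n, 10), "us")
```

Under these assumptions an 11-to-1 incast fills the buffer in 10 µs, about one round trip on a network with a ~10 µs RTT, so a congestion signal that needs an additional round trip arrives after the buffer is already full.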



## Modern Transport and RDMA Services Needs for AI and HPC

| Requirement                                | UEC Transport                                                             | Legacy RDMA                                                                            | UEC Advantage                                        |
|--------------------------------------------|---------------------------------------------------------------------------|----------------------------------------------------------------------------------------|------------------------------------------------------|
| Multi-Pathing                              | Packet spraying                                                           | Flow-level multi-pathing                                                               | Higher network utilization                           |
| Flexible Ordering                          | Out-of-order packet delivery with in-order message delivery               | N/A                                                                                    | Matches application requirements, lower tail latency |
| AI and HPC Congestion Control              | Workload-optimized,<br>configuration free, lower latency,<br>programmable | DCQCN: configuration required,<br>brittle, signaling requires<br>additional round trip | Incast reduction, faster response, future-proofing   |
| E2E Telemetry                              | Sender or Receiver                                                        | ECN                                                                                    | Faster congestion resolution, better visibility      |
| Simplified RDMA                            | Streamlined API, native workload interaction, minimal endpoint state      | Based on IBTA Verbs                                                                    | App-level performance, lower cost implementation     |
| Security                                   | Scalable, first-class citizen                                             | Not addressed, external to spec                                                        | High scale, modern security                          |
| Large Scale with Stability and Reliability | Targeting 1M endpoints                                                    | Typically, a few thousand simultaneous end points                                      | Current and future-proof scale                       |
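
The "Flexible Ordering" row (out-of-order packet delivery with in-order message delivery) can be illustrated with a minimal receiver sketch. This is not the UEC wire format or API: the packet fields, class names and the no-duplicates assumption are all hypothetical. Each payload is written straight to its offset in the message buffer, so no reorder buffer is needed, and completed messages are handed up in message order.

```python
from dataclasses import dataclass, field


@dataclass
class Packet:
    msg_id: int     # message sequence number (hypothetical field)
    offset: int     # byte offset of this payload within the message
    total_len: int  # full message length in bytes
    payload: bytes


def packetize(msg_id: int, data: bytes, mtu: int) -> list:
    """Split one message into packets of at most `mtu` payload bytes."""
    return [Packet(msg_id, off, len(data), data[off:off + mtu])
            for off in range(0, len(data), mtu)]


@dataclass
class Receiver:
    next_msg: int = 0                              # next message to deliver
    buffers: dict = field(default_factory=dict)    # msg_id -> bytearray
    received: dict = field(default_factory=dict)   # msg_id -> bytes placed
    delivered: list = field(default_factory=list)  # completed, in order

    def on_packet(self, pkt: Packet) -> None:
        # Place the payload directly at its final offset (assumes no duplicates).
        buf = self.buffers.setdefault(pkt.msg_id, bytearray(pkt.total_len))
        buf[pkt.offset:pkt.offset + len(pkt.payload)] = pkt.payload
        self.received[pkt.msg_id] = self.received.get(pkt.msg_id, 0) + len(pkt.payload)
        # Deliver every complete message that is next in message order.
        while (self.next_msg in self.buffers
               and self.received[self.next_msg] == len(self.buffers[self.next_msg])):
            self.delivered.append(bytes(self.buffers.pop(self.next_msg)))
            del self.received[self.next_msg]
            self.next_msg += 1


rx = Receiver()
first, second = packetize(0, b"hello world", 4), packetize(1, b"ultra", 4)
# Packets arrive interleaved and out of order; message 1 completes first.
for pkt in (second[1], first[2], second[0], first[0], first[1]):
    rx.on_packet(pkt)
print(rx.delivered)
```

Message 1 is fully received before message 0, but it is held until message 0 completes, so the application still sees messages in order.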

## Summary

- The Network as an island of stability amidst the storm
- Collaborate with us to move Ethernet to the next level
  - Join UEC

#### www.ultraethernet.org

- Industry benefits
  - Standard, high-volume, Ethernet-based AI/HPC network products
  - AI/HPC convergence support/acceleration

#### Ethernet Interconnect Family Performance Share





