

# **COMPUTE OFFLOAD ACCELERATION WITH FPGA**

IXPUG 2019 Tutorial, CERN, September 27<sup>th</sup>, 2019

francisco.perez@intel.com

#### Agenda

Introduction to FPGA Acceleration Stack

Software and board installation

- Walkthrough of the bring-up test on the server

What is an Accelerator Functional Unit (AFU)

Application Development using the Acceleration Stack

AFU development using High Level Synthesis (C/C++)

- Introduction to HLS tools
- HLS Interfaces
- HLS AFU development flow

# INTRODUCTION TO INTEL<sup>®</sup> PROGRAMMABLE ACCELERATION STACK

## The Big Data Problem



We are generating data faster than we can analyze, understand, transmit, secure, and reconstruct it in real time

There is not enough compute power, storage, or infrastructure to process it in real time at a reasonable total cost of ownership (TCO)

This creates immense demand for compute architectures that can scale up and out exponentially



#### Focused investments to accelerate HPC & AI



INTEL IS BUILDING THE HARDWARE, SOFTWARE, INTERCONNECT, MEMORY AND SECURITY ARCHITECTURES NEEDED TO ENABLE TOMORROW'S APPLICATIONS



#### **Acceleration Choices**





# **The Intel Vision**

#### Heterogeneous Systems:

 Span from CPU to GPU to FPGA to dedicated devices with consistent programming models, languages, and tools



#### FPGAs are the focus of today



# What is an FPGA?

- Field Programmable Gate Array (FPGA)
  - Millions of logic elements
  - Thousands of embedded memory blocks
  - Thousands of DSP blocks
  - Programmable routing
  - High speed transceivers
  - Various built-in hardened IP
- Programmable interconnect
- Used to create Custom Hardware!



# How Do Intel<sup>®</sup> FPGAs Help to Solve the Problem?

#### Workload Optimization:

Ensure Xeon cores serve their highest-value processing while the FPGA focuses on the intensive tasks

#### **Efficient Performance:**

Improve performance per watt with custom hardware tailored to the workload

#### **Real-Time:**

High-bandwidth connectivity and low-latency parallel processing for in-line data streaming












### Separation of concerns

Two groups of developers:

- Domain experts concerned with getting a result
  - Host application developers leveraging optimized libraries
- Tuning experts concerned with performance
  - Typical FPGA developers that create optimized libraries

The Intel<sup>®</sup> Math Kernel Library is a simple example of raising the level of abstraction to math operations

- Domain experts focus on formulating their problems
- Tuning experts focus on vectorization and parallelization


# Traditional FPGA Design and Use is *"Difficult"*

#### Low level hardware design requires complicated, long, time-consuming efforts





#### Software Developers are the New FPGA Developers

"I don't speak FPGA!

What is the programming model, and where are the compilers, libraries and tools I am used to?"

New use case of FPGAs as software-defined hardware and the benefits as accelerators

Opens up the usage for a much larger developer base





Programmable Solutions Group

OpenCL and the OpenCL logo are trademarks of Apple Inc. used by permission by Khronos


#### **Components of Acceleration Stack: Overview**




# The Challenge: Enabling the Performance & Capabilities of FPGA for Everyone



Board Design & Qualification



Software Development



FPGA Accelerator Development

#### Intel<sup>®</sup> Investment in All These Areas Democratizes FPGA Acceleration



**FPGA Acceleration Cards for datacenters** 

#### Intel<sup>®</sup> FPGA Programmable Acceleration Cards for Application Acceleration

#### Intel<sup>®</sup> FPGA PAC with Arria<sup>®</sup> 10 GX

**Broad deployment at low power**: 1/2 length, 1/2 height, single PCIe slot card, 70 W TDP

#### Intel<sup>®</sup> FPGA PAC with Stratix<sup>®</sup> 10 GX



**Enabling high throughput**: ¾ length, full height, dual PCIe slot card, 225 W TDP



#### Intel® FPGA Programmable Acceleration Card with Intel® Arria® 10 GX FPGA





Low power programmable acceleration platform with data center-grade software stack enabling in-line processing and memory intensive applications.

#### Features

- 1.15 million logic elements
- DDR4 memory, 2 banks of 4 GB at 2133 Mbps
- 53Mbit embedded memory
- 4x10G / 1x40G QSFP
- PCIe\* Gen 3 x8 (x16 mechanical)
- BMC for monitoring and control (PLDM)
- 1/2 length, 1/2 height, single-slot PCIe\* card
- 70W TDP, 45W FPGA
- Acceleration Stack for Intel<sup>®</sup> Xeon<sup>®</sup> CPU with FPGAs
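
As a rough sanity check on the PCIe figures in these feature lists, the theoretical link bandwidth can be estimated from the lane rate and the link width. The sketch below assumes the standard PCIe Gen3 rate of 8 GT/s per lane with 128b/130b encoding and ignores packet and protocol overhead, so measured throughput will be lower. The function name is illustrative and not part of any Intel tool.

```python
def pcie_gen3_bandwidth_gbytes(lanes: int, gt_per_s: float = 8.0) -> float:
    """Theoretical one-direction bandwidth in GB/s for a PCIe Gen3 link.

    Assumes 128b/130b line encoding (standard for Gen3) and no protocol overhead.
    """
    encoding_efficiency = 128 / 130
    bits_per_second = gt_per_s * 1e9 * lanes * encoding_efficiency
    return bits_per_second / 8 / 1e9  # bits -> bytes, then bytes -> GB


x8 = pcie_gen3_bandwidth_gbytes(8)    # Arria 10 GX PAC link width
x16 = pcie_gen3_bandwidth_gbytes(16)  # Stratix 10 PAC link width
print(f"Gen3 x8:  {x8:.2f} GB/s")
print(f"Gen3 x16: {x16:.2f} GB/s")
```

These numbers are an upper bound and are useful mainly for judging whether a measured DMA bandwidth is plausible for the link.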

#### Intel® FPGA Programmable Acceleration Card with Intel® Stratix® 10 GX FPGA



#### **Features**

- 2.8 million logic elements
- 32 GB DDR4 DIMM memory (4x8GB, 2133Mbps)
- 229 Mbit embedded memory
- 2x 100G (4x25Gb) QSFP
- PCIe\* Gen3 x16
- BMC for monitoring and control (PLDM)
- 3/4 length, full height, dual slot card
- 225W TDP, 150W FPGA
- Acceleration Stack for Intel<sup>®</sup> Xeon<sup>®</sup> CPU with FPGAs



High-bandwidth programmable acceleration platform with data center-grade software stack enabling in-line processing and memory-intensive applications.

\* Other names and brands may be claimed as the property of others. Specifications are preliminary and subject to change.



# **Open Programmable Acceleration Engine (OPAE)**

**Consistent API across product generations and platforms**: abstraction for hardware-specific FPGA resource details

**Designed for minimal software overhead and latency**: lightweight user-space library (*libfpga*)

**Open ecosystem for industry and developer community**: FPGA driver being upstreamed into the Linux kernel

Supports both virtual machines and bare metal platforms

Faster development and debugging of Accelerator Functions with the included AFU Simulation Environment (ASE)

Includes guides, command-line utilities and sample code

#### Simplified FPGA Programming Model for Application Developers



Start developing for Intel FPGAs with OPAE today: http://01.org/OPAE


# INTEL<sup>®</sup> PROGRAMMABLE ACCELERATION CARD AND STACK INSTALLATION Step Guide



The runtime package includes only the OPAE drivers and a sample AFU; the developer package includes Quartus, the IP licenses, and the drivers






#### Select Supported Server

| OEM                  | Dell                                     | Fujitsu          | HPE                        | Inspur     | Quanta                                                      | Kontron            | Supermicro                                       |
|----------------------|------------------------------------------|------------------|----------------------------|------------|-------------------------------------------------------------|--------------------|--------------------------------------------------|
| Status               | Qualified                                | Qualified        | Qualified                  | Qualified* | Qualified                                                   | Ongoing            | Qualified*                                       |
| Servers<br>Supported | R640<br>R740<br>R740xd<br>R840<br>R940xa | RX2540<br>TX2550 | ProLiant<br>DL360<br>DL380 | 5280M5     | QuantaGrid<br>D52BQ-1U<br>D52BQ-2U<br>QuantaVault<br>JG4080 | Symkloud<br>MS2900 | Sys-1029U<br>Sys-2029U<br>Sys-6019U<br>Sys-6029U |

Customers can deploy on their servers of choice by following the Intel Programmable Acceleration Card Platform Qualification Guidelines\*

\*Available on request



# Install PAC – Arria10 PAC in HPE DL360



#### Front





# Install Supported OS

Acceleration Stack v1.2 validated OS

- RHEL kernel 3.10 (v7.4 & 7.6)
- CentOS kernel 3.10 (v7.4 & 7.6)
- Ubuntu kernel 4.4 (v16.04)



#### **Download Acceleration Stack**

https://www.intel.com/content/www/us/en/programmable/solutions/acceleration-hub/overview.html



In today's world, the number of connected devices and the amount of data continues to increase every day. The rate at which data arrives from these devices into data centers also continues to increase. By leveraging Intel<sup>®</sup> FPGAs as accelerators, a wide range of workloads can be enhanced to accommodate this increased data, and new demands for analyzing it.

FPGAs are silicon devices that can be dynamically reprogrammed with a datapath that exactly matches your workloads, such as data analytics, image inference, encryption, and compression. This versatility enables the provisioning of a faster processing, more power efficient, and lower latency service – lowering your total cost of ownership, and maximizing compute capacity within the power, space, and cooling constraints of your data centers.

Traditionally, FPGAs require deep domain expertise to program, but the Intel® Acceleration Stack for Intel Xeon® CPU with FPGAs simplifies the development flow and enables rapid deployment across the data center. Intel is partnering with FPGA intellectual property (IP) developers, server original equipment manufacturers (OEMs), virtualization platform providers, operating system (OS) vendors, and system integrators to enable customers to efficiently develop and operationalize their infrastructure.

#### **Benefits of Intel FPGAs**

- Ease of deployment: The Intel FPGA Programmable Acceleration Card (Intel® FPGA PAC) provides an Intel FPGA in a PCIe-based card that is available on validated servers from several leading OEMs, while the Intel® Acceleration Stack for Intel Xeon® CPU with FPGAs abstracts away much of the complexity of programming FPGAs.
- Standardization: The Intel® Acceleration Stack for Intel Xeon® CPU with FPGAs defines standardized interfaces that FPGA developers and development and operations teams can use to hot-swap accelerators and enable application portability.
- Accelerator solutions: We offer a portfolio of accelerator solutions developed by Intel and third-party technologists to expedite application development and deployment. Application classes that stand to benefit from FPGA acceleration range from streaming analytics and image inference to financial workloads and beyond.





# Download Intel<sup>®</sup> Acceleration Stack

Download Intel® Acceleration Stack Version 1.2

| Components                                  | Acceleration Stack for Runtime                                                                                                  | Acceleration Stack for Development                                                                                                                                                             |  |  |
|---------------------------------------------|---------------------------------------------------------------------------------------------------------------------------------|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|--|--|
| Purpose                                     | Smaller footprint package for software development of runtime host application. Intel®<br>Quartus® Prime Software not included. | Accelerator function development using the Intel Quartus Prime Pro Edition Software, Intel FPGA Software Development Kit (SDK) for OpenCL™ and Acceleration Stack                              |  |  |
| Intel Acceleration Stack Version            | Intel Acceleration Stack Version 1.2                                                                                            | Intel Acceleration Stack Version 1.2                                                                                                                                                           |  |  |
| Intel Quartus Prime software and interfaces | Not required. Pre-compiled binaries and FPGA images provided in the release                                                     | Requires Intel Quartus Prime Pro Edition software version 17.1.1. Software and related<br>Interfaces (SR-IOV, Low Latency 10 Gbps and 40 Gbps Ethernet MAC/PHY) are provided in the<br>release |  |  |
| OpenCL software                             | Intel FPGA Runtime Environment (RTE) for OpenCL                                                                                 | Intel FPGA SDK for OpenCL                                                                                                                                                                      |  |  |
| Validated Servers                           | View server models                                                                                                              | View server models                                                                                                                                                                             |  |  |
| Validated operating system                  | RHEL 7.4, CentOS 7.4, Ubuntu 16.04                                                                                              | RHEL 7.4, CentOS 7.4, Ubuntu 16.04                                                                                                                                                             |  |  |
| Release notes                               | Intel® Acceleration Stack for Intel® Xeon® CPU with FPGAs Version 1.2 Release Notes                                             | Intel® Acceleration Stack for Intel® Xeon® CPU with FPGAs Version 1.2 Release Notes                                                                                                            |  |  |
| Quick Start Guide*                          | Intel Acceleration Stack Quick Start Guide for Intel Programmable Acceleration Card with Intel<br>Arria 10 GX FPGA              | Intel Acceleration Stack Quick Start Guide for Intel Programmable Acceleration Card with Intel<br>Arria 10 GX FPGA                                                                             |  |  |
| Download size                               | ~200 MB                                                                                                                         | ~18 GB                                                                                                                                                                                         |  |  |
| Intel Acceleration Stack download           | Download Now                                                                                                                    | Download Now                                                                                                                                                                                   |  |  |
| md5sum                                      | 393061340C31717C7C31C09E29F18FD2                                                                                                | 28A5BEF88AF2435D08EDC2F78F1AEA99                                                                                                                                                               |  |  |
| Board Management Controller (BMC) version   | Firmware 26889<br>Firmware Bootloader 26879                                                                                     | Firmware 26889<br>Firmware Bootloader 26879                                                                                                                                                    |  |  |
| BMC firmware and tools download             | Register at Intel PAC Firmware and Tools and select Intel PAC                                                                   | Register at Intel PAC Firmware and Tools and select Intel PAC                                                                                                                                  |  |  |
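
The md5sum row in the table can be used to check a downloaded archive before installing it. The following is a minimal sketch using Python's standard `hashlib`; the helper names are illustrative, and the comparison is case-insensitive because the table lists the checksums in upper case while most tools print lower case.

```python
import hashlib

# Checksums as published in the download table above.
PUBLISHED_MD5 = {
    "runtime": "393061340C31717C7C31C09E29F18FD2",
    "development": "28A5BEF88AF2435D08EDC2F78F1AEA99",
}


def md5_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published(actual_hex: str, published_hex: str) -> bool:
    """Compare two hex digests, ignoring case and surrounding whitespace."""
    return actual_hex.strip().lower() == published_hex.strip().lower()
```

A typical use would be `matches_published(md5_of_file(archive_path), PUBLISHED_MD5["runtime"])`, where `archive_path` is wherever the installer archive was saved.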



# AFU Development Software Requirements

Acceleration Stack SDK (all licenses included in development package)

- Quartus Prime Pro Software 17.1.1 for v1.2 Acceleration Stack, 18.0.1 for v2.0
- IP-PCIE/SRIOV License
- Low Latency 10Gbps Ethernet MAC(6AF7-0119) license
- Low Latency 40Gbps Ethernet MAC and PHY(6AF7-011B) license

python2-jsonschema package from the epel repository (version 2.7 or higher)

GCC – C compiler version 4.7 or greater

**RTL Simulator** 

- Synopsys VCS-MX version 2016.06-SP2-1
- 64-bit ModelSim SE or QuestaSim version 10.5c or higher



# Installing the Intel<sup>®</sup> Acceleration Stack

1. Extract the archive file:

`tar xvf *rte_installer.tar.gz` or `tar xvf *dev_installer.tar.gz`

2. Change to the installation directory:

`cd *rte_installer` or `cd *dev_installer`

3. Install Extra Packages for Enterprise Linux (EPEL), for RHEL 7.4 only:

`sudo yum install https://dl.fedoraproject.org/pub/epel/epel-release-latest-7.noarch.rpm`

`sudo subscription-manager repos --enable "rhel-*-optional-rpms" --enable "rhel-*-extras-rpms"`

4. Run the setup script:

`./setup.sh`

5. Source the initialization script from the installation directory to set up the environment variables:

`source /home/<username>/intelrtestack/init_env.sh` or `source /home/<username>/inteldevstack/init_env.sh`
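
After sourcing the initialization script, it can be worth confirming that the environment looks as expected before moving on to bring-up. The sketch below is only an illustration: it assumes that `init_env.sh` exports `OPAE_PLATFORM_ROOT` (the variable used by the diagnostic commands later in this tutorial) and that the OPAE tools `fpgainfo`, `fpgaconf`, and `fpgabist` end up on `PATH`. Adjust both lists to match your installation.

```python
import os
import shutil

# Assumption: init_env.sh exports this variable; it is used later by fpgabist.
REQUIRED_ENV = ("OPAE_PLATFORM_ROOT",)
# Assumption: these OPAE command-line tools are installed on PATH.
REQUIRED_TOOLS = ("fpgainfo", "fpgaconf", "fpgabist")


def missing_env_vars(env, required=REQUIRED_ENV):
    """Return the required variables that are unset or empty in env."""
    return [name for name in required if not env.get(name)]


def missing_tools(tools=REQUIRED_TOOLS, which=shutil.which):
    """Return the tools that cannot be found on PATH."""
    return [tool for tool in tools if which(tool) is None]


if __name__ == "__main__":
    print("Missing environment variables:", missing_env_vars(os.environ))
    print("Missing tools on PATH:", missing_tools())
```

Both functions take their inputs as parameters so the check can be exercised without a PAC or an installed stack.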

# Intel<sup>®</sup> Acceleration Stack Directory Structure








# **BOARD BRING-UP**

# **Out-of-Box User Flow for Acceleration Stack**






#### **Board Bring up steps**





# Locating PAC in Multi-Card System: SYSFS Entry

#### To list all SYSFS entries in a multi-PAC system

`$ ls -l /sys/class/fpga/intel-fpga-dev.?/device`

lrwxrwxrwx. 1 root root 0 Oct 25 12:34 /sys/class/fpga/intel-fpga-dev.0/device -> ../../../0000:3b:00.0
lrwxrwxrwx. 1 root root 0 Oct 25 12:34 /sys/class/fpga/intel-fpga-dev.1/device -> ../../../0000:86:00.0
lrwxrwxrwx. 1 root root 0 Oct 25 12:34 /sys/class/fpga/intel-fpga-dev.2/device -> ../../../0000:87:00.0
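
In a multi-card system it is handy to turn this listing into a map from device index to PCIe address. The sketch below parses the `ls -l` output shown above; the sample text is copied from this slide, and the function name is illustrative. On a live system the same mapping could be obtained by resolving the symlinks directly, which is not shown here.

```python
import re

SAMPLE_LS_OUTPUT = """\
lrwxrwxrwx. 1 root root 0 Oct 25 12:34 /sys/class/fpga/intel-fpga-dev.0/device -> ../../../0000:3b:00.0
lrwxrwxrwx. 1 root root 0 Oct 25 12:34 /sys/class/fpga/intel-fpga-dev.1/device -> ../../../0000:86:00.0
lrwxrwxrwx. 1 root root 0 Oct 25 12:34 /sys/class/fpga/intel-fpga-dev.2/device -> ../../../0000:87:00.0
"""

# Captures the device index and the PCIe address (domain:bus:device.function).
LINK_RE = re.compile(
    r"/sys/class/fpga/intel-fpga-dev\.(\d+)/device -> "
    r".*?([0-9a-fA-F]{4}:[0-9a-fA-F]{2}:[0-9a-fA-F]{2}\.[0-7])\s*$"
)


def map_devices(ls_output: str) -> dict:
    """Map each intel-fpga-dev index to the PCIe address its symlink points at."""
    mapping = {}
    for line in ls_output.splitlines():
        match = LINK_RE.search(line)
        if match:
            mapping[int(match.group(1))] = match.group(2)
    return mapping


devices = map_devices(SAMPLE_LS_OUTPUT)
for index, address in sorted(devices.items()):
    print(f"intel-fpga-dev.{index} -> {address}")
```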




#### Finding Board Serial Number

#### To view serial number for a particular SYSFS entry





# Find serial number on front bottom of Arria<sup>®</sup> 10 GX PAC

`$ hexdump -C /sys/class/fpga/intel-fpga-dev.2/intel-fpga-fme.2/intel-pac-hssi.?.auto/hssi_mgmt/eeprom`

| Offset   | Bytes 0-3   | Bytes 4-7   | Bytes 8-15              | ASCII            |
|----------|-------------|-------------|-------------------------|------------------|
| 00000000 | 4d 41 43 3d | 30 30 3a 30 | 62 3a 33 65 3a 30 31 3a | MAC=00:0b:3e:01: |
| 00000010 | 65 65 3a 66 | 38 0a 53 4e | 3d 32 30 33 32 31 36 0a | ee:f8.SN=203216. |
| 00000020 | 50 43 3d 41 | 31 30 53 41 | 34 2d 30 55 2d 42 31 31 | PC=A10SA4-0U-B11 |
| 00000030 | 35 58 32 45 | 32 51 2d 32 | 32 2d 49 34 30 31 34 30 | 5X2E2Q-22-I40140 |
| 00000040 | 54 2d 36 0a | 52 45 56 3d | 31 2e 31 32 2e 30 2e 30 | T-6.REV=1.12.0.0 |
| 00000050 | 2e 30 0a 0a | ff ff ff ff | ff ff ff ff ff ff ff ff | .0.............. |
| \*       |             |             |                         |                  |
| 00000200 |             |             |                         |                  |
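
The EEPROM content in the dump is a set of newline-separated `KEY=VALUE` records (MAC, SN, PC, REV) followed by `0xff` padding. The sketch below decodes that layout; the sample bytes are reconstructed from the hex dump above, and the function name is illustrative. Reading the real sysfs file usually requires appropriate permissions and is left out.

```python
def parse_eeprom(raw: bytes) -> dict:
    """Decode KEY=VALUE records from the PAC EEPROM image.

    Everything from the first 0xff padding byte onward is ignored.
    """
    text = raw.split(b"\xff", 1)[0].decode("ascii", errors="replace")
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition("=")
        if sep:  # skip blank lines and anything without '='
            fields[key] = value
    return fields


# Bytes reconstructed from the hexdump shown on this slide.
SAMPLE_EEPROM = (
    b"MAC=00:0b:3e:01:ee:f8\n"
    b"SN=203216\n"
    b"PC=A10SA4-0U-B115X2E2Q-22-I40140T-6\n"
    b"REV=1.12.0.0.0\n\n" + b"\xff" * 12
)

board = parse_eeprom(SAMPLE_EEPROM)
print("Serial number:", board["SN"])
print("MAC address:  ", board["MAC"])
```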



#### Check PCIe Speed and Width

`$ sudo lspci -d 8086:09c4 -vvv`

05:00.0 Processing accelerators: Intel Corporation Device 09c4
Subsystem: Intel Corporation Device 0000
Physical Slot: 2
Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR+ FastB2B- DisINTx+
Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
Latency: 0, Cache Line Size: 64 bytes
Interrupt: pin A routed to IRQ 25
Region 0: Memory at eab00000 (64-bit, prefetchable) [size=512K]
Region 2: Memory at eaa00000 (64-bit, prefetchable) [size=1M]
Capabilities: [68] MSI-X: Enable+ Count=7 Masked-
Vector table: BAR=0 offset=00009000
PBA: BAR=0 offset=0000a000
Capabilities: [78] Power Management version 3
Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
Capabilities: [80] Express (v2) Endpoint, MSI 00
DevCap: MaxPayload 256 bytes, PhantFunc 0, Latency L0s <64ns, L1 <1us
ExtTag+ AttnBtn- AttnInd- PwrInd- RBE+ FLReset+
DevCtl: Report errors: Correctable+ Non-Fatal+ Fatal+ Unsupported+
RlxdOrd- ExtTag+ PhantFunc- AuxPwr- NoSnoop+ FLReset-
MaxPayload 256 bytes, MaxReadReq 1024 bytes
DevSta: CorrErr- UncorrErr- FatalErr- UnsuppReq- AuxPwr- TransPend-
LnkCap: Port #0, Speed 8GT/s, Width x8, ASPM not supported, Exit Latency L0s <4us, L1 <1us
ClockPM- Surprise- LLActRep- BwNot-
LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- CommClk+
ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
LnkSta: Speed 8GT/s, Width x8, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-

(truncated)
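
The two lines that matter for this check are `LnkCap` (what the card supports) and `LnkSta` (what was actually negotiated). The sketch below extracts both from `lspci -vvv` text and flags a link that trained at a lower speed or width than the card supports. The sample text is taken from the output above; the function names are illustrative.

```python
import re

SAMPLE_LSPCI = """\
LnkCap: Port #0, Speed 8GT/s, Width x8, ASPM not supported, Exit Latency L0s <4us, L1 <1us
LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- CommClk+
LnkSta: Speed 8GT/s, Width x8, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
"""

# Matches only the LnkCap and LnkSta lines (not LnkCtl, LnkCap2 or LnkSta2).
LINK_RE = re.compile(r"(LnkCap|LnkSta):.*?Speed\s+([\d.]+)GT/s,\s+Width\s+x(\d+)")


def parse_link(lspci_text: str) -> dict:
    """Return {'LnkCap': (speed_gt_s, width), 'LnkSta': (speed_gt_s, width)}."""
    result = {}
    for kind, speed, width in LINK_RE.findall(lspci_text):
        result[kind] = (float(speed), int(width))
    return result


def link_is_degraded(lspci_text: str) -> bool:
    """True if the negotiated speed or width is below the card's capability."""
    link = parse_link(lspci_text)
    cap_speed, cap_width = link["LnkCap"]
    sta_speed, sta_width = link["LnkSta"]
    return sta_speed < cap_speed or sta_width < cap_width


link = parse_link(SAMPLE_LSPCI)
print("Capability:", link["LnkCap"], "Negotiated:", link["LnkSta"])
print("Degraded link:", link_is_degraded(SAMPLE_LSPCI))
```

A degraded result often points at the slot rather than the card, for example an x8 card seated in a slot that is electrically x4.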

# Useful OPAE Command-line Utilities For Board Management

| Commands  | Description                                                                                             |
|-----------|---------------------------------------------------------------------------------------------------------|
| fpgainfo  | Reads board telemetry data, for example temperatures or voltages.                                       |
| fpgabist  | Performs a self-diagnostic test: measures bandwidth between local DDR4 memory and system memory         |
| fpgaconf  | Configures an Accelerator Functional Unit (AFU) into the FPGA;<br>checks compatibility with the targeted FPGA and FIM |
| fpgaflash | Updates the FPGA Interface Manager (FIM) image (.rpd file) stored in flash;<br>updates the BMC firmware. |



# **Checking FIM and BMC Version**

- The board needs an active PCIe\* link to check the FIM version
- Use the OPAE tool fpgainfo to check the PAC's FIM and BMC versions

`$ sudo fpgainfo fme`

Sample output








| Acceleration Stack<br>Version | FIM Version (PR Interface ID)        | OPAE<br>Version | BMC Version |
|-------------------------------|--------------------------------------|-----------------|-------------|
| 1.2 Production                | 69528db6-eb31-577a-8c36-68f9faa081f6 | 1.1.2-1         | 26889       |
| 1.2 Alpha                     | 93abeb6a-30c8-5f77-8172-d828c3a699ca | 1.1.1-1         | 26889       |
| 1.1 Production                | 9926ab6d-6c92-5a68-aabc-a7d84c545738 | 1.0.2           | 26822       |
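
The PR Interface ID reported by `fpgainfo fme` can be matched against this table to tell which Acceleration Stack release a FIM belongs to. The sketch below encodes the three rows of the table as a lookup; the class and function names are illustrative, and any ID not listed on this slide simply returns `None`.

```python
from typing import NamedTuple, Optional


class StackRelease(NamedTuple):
    stack_version: str
    opae_version: str
    bmc_version: str


# Rows copied from the version table above, keyed by PR Interface ID.
KNOWN_RELEASES = {
    "69528db6-eb31-577a-8c36-68f9faa081f6": StackRelease("1.2 Production", "1.1.2-1", "26889"),
    "93abeb6a-30c8-5f77-8172-d828c3a699ca": StackRelease("1.2 Alpha", "1.1.1-1", "26889"),
    "9926ab6d-6c92-5a68-aabc-a7d84c545738": StackRelease("1.1 Production", "1.0.2", "26822"),
}


def identify_release(pr_interface_id: str) -> Optional[StackRelease]:
    """Look up the release for a PR Interface ID; None if it is not in the table."""
    return KNOWN_RELEASES.get(pr_interface_id.strip().lower())


release = identify_release("93abeb6a-30c8-5f77-8172-d828c3a699ca")
print(release)
```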



# Ensure PAC Is Visible In-System

If the FIM is loaded correctly, the PAC shows up as a PCIe\* endpoint and can be seen with the Linux command `lspci`.

`$ lspci | grep 09c4`


The OS on the host CPU discovers PAC cards as PCIe device 8086:09c4

04:00.0 Processing accelerators [1200]: Intel Corporation Device [8086:09c4]



### Checking OPAE Software Version

Follow the Quick Start Guide to check the OPAE version

For example, on CentOS/RHEL, run the following to check the OPAE version:

`$ rpm -qa | grep opae`

Sample output


```
opae-tools-1.1.2-1.x86_64
opae-devel-1.1.2-1.x86_64
opae-libs-1.1.2-1.x86_64
opae-1.1.2-1.x86_64
```
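
A common source of trouble is a mix of OPAE package versions left over from an earlier install. The sketch below parses `rpm -qa` style names of the form `name-version-release.arch` and checks that all OPAE packages report the same version. The sample text is the output shown above; the function names are illustrative.

```python
import re

SAMPLE_RPM_OUTPUT = """\
opae-tools-1.1.2-1.x86_64
opae-devel-1.1.2-1.x86_64
opae-libs-1.1.2-1.x86_64
opae-1.1.2-1.x86_64
"""

# rpm package names follow name-version-release.arch.
RPM_RE = re.compile(
    r"^(?P<name>.+)-(?P<version>[^-]+)-(?P<release>[^-]+)\.(?P<arch>[^.]+)$"
)


def parse_rpm_names(output: str) -> dict:
    """Map package name to 'version-release' for each line of rpm -qa output."""
    packages = {}
    for line in output.split():
        match = RPM_RE.match(line)
        if match:
            packages[match["name"]] = f'{match["version"]}-{match["release"]}'
    return packages


def versions_consistent(packages: dict) -> bool:
    """True if every package reports the same version-release string."""
    return len(set(packages.values())) == 1


opae_packages = parse_rpm_names(SAMPLE_RPM_OUTPUT)
print(opae_packages)
print("Consistent:", versions_consistent(opae_packages))
```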



#### What Is the Diagnostic Test?

- The fpgabist tool performs self-diagnostic tests on supported FPGA platforms.
- It tests PCIe, DMA from CPU DDR to device DDR, and memory access bandwidth.
- Currently, fpgabist accepts the following accelerator functions (AFs):
  1. **nlb\_mode\_3:** The native loopback (NLB) test implements a loopback from TX to RX. Use it to verify basic functionality and to measure bandwidth.
  2. **dma\_afu:** The direct memory access (DMA) AFU test transfers data from host memory to FPGA-attached local memory.



# Run FPGA Diagnostics

Configure the number of system hugepages the fpgadiag utility requires:

`sudo sh -c "echo 20 > /sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages"`

Configure and run diagnostics with the NLB\_3 AFU image:

`sudo fpgabist $OPAE_PLATFORM_ROOT/hw/samples/nlb_mode_3/bin/nlb_mode_3.gbs`

Configure and run diagnostics with the DMA AFU image:

`sudo fpgabist $OPAE_PLATFORM_ROOT/hw/samples/dma_afu/bin/dma_afu.gbs`
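
When the diagnostics are scripted, the two fpgabist invocations differ only in the sample name. The sketch below builds the command line for a given sample from the directory layout used in the commands above and works out how much memory the hugepage reservation represents. It only constructs the argument list and does not run anything; the function names and the `/opt/stack` root in the example are hypothetical.

```python
import posixpath

HUGEPAGE_SIZE_KB = 2048  # matches the hugepages-2048kB sysfs directory


def hugepage_reservation_mib(pages: int) -> float:
    """Memory reserved by the given number of 2048 kB hugepages, in MiB."""
    return pages * HUGEPAGE_SIZE_KB / 1024


def fpgabist_command(sample: str, platform_root: str) -> list:
    """Build the fpgabist argument list for a sample AFU image.

    Assumes the layout <root>/hw/samples/<sample>/bin/<sample>.gbs shown above.
    """
    gbs = posixpath.join(platform_root, "hw", "samples", sample, "bin", f"{sample}.gbs")
    return ["sudo", "fpgabist", gbs]


print(f"20 hugepages reserve {hugepage_reservation_mib(20):.0f} MiB")
for sample_name in ("nlb_mode_3", "dma_afu"):
    print(" ".join(fpgabist_command(sample_name, "/opt/stack")))
```

The returned list is in the form accepted by `subprocess.run`, with `platform_root` taken from the `OPAE_PLATFORM_ROOT` environment variable on a real system.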

# DMA AFU: Built-In Self Test (fpgabist)





# DMA Built-in Self Test Output

fpgainfo Tool output (FME, TEMP, POWER, PORT)

FME and PORT error status registers (for AFU developer and user to debug)

Partial Reconfiguration messages (loading AFU)

DMA bandwidth report



### DMA BIST Output 1/5


#### Beginning FPGA Built-In Self-Test

| Field                       | Value                                |
|-----------------------------|--------------------------------------|
| Device                      | bus = 04, device = 00, func = 0      |
| Board Management Controller | microcontroller FW version 26889     |
| Last Power Down Cause       | POK_CORE                             |
| Last Reset Cause            | None                                 |
| `//***** FME *****//`       |                                      |
| Object Id                   | 0xF300000                            |
| PCIe s:b:d:f                | 0000:04:00:0                         |
| Device Id                   | 0x09C4                               |
| Socket Id                   | 0x00                                 |
| Ports Num                   | 01                                   |
| Bitstream Id                | 0x121000200000161                    |
| Bitstream Version           | 0x10201                              |
| Pr Interface Id             | 93abeb6a-30c8-5f77-8172-d828c3a699ca |
| Board Management Controller | microcontroller FW version 26889     |
| Last Power Down Cause       | POK_CORE                             |
| Last Reset Cause            | None                                 |
| `//****** PORT ******//`    |                                      |
| Object Id                   | 0xF200000                            |
| PCIe s:b:d:f                | 0000:04:00:0                         |
| Device Id                   | 0x09C4                               |
| Socket Id                   | 0x00                                 |
| Ports Num                   | 01                                   |
| Bitstream Id                | 0x121000200000161                    |
| Bitstream Version           | 0x10201                              |
| Pr Interface Id             | 93abeb6a-30c8-5f77-8172-d828c3a699ca |
| Accelerator Id              | 331db30c-9885-41ea-9081-f88b8f655caa |
| Board Management Controller | microcontroller FW version 26889     |
| Last Power Down Cause       | POK_CORE                             |
| Last Reset Cause            | None                                 |

#### Output of "fpgainfo fme" command

Output of "fpgainfo port" command



### DMA BIST Output 2/5

| Field                       | Value                                  |
|-----------------------------|----------------------------------------|
| `//****** TEMP *****//`     |                                        |
| Object Id                   | 0xF300000                              |
| PCIe s:b:d:f                | 0000:04:00:0                           |
| Device Id                   | 0x09C4                                 |
| Socket Id                   | 0x00                                   |
| Ports Num                   | 01                                     |
| Bitstream Id                | 0x121000200000161                      |
| Bitstream Version           | 0x10201                                |
| Pr Interface Id             | 93abeb6a-30c8-5f77-8172-d828c3a699ca   |
| (11) FPGA Core TEMP         | 73.00 °C                               |
| (12) Board TEMP             | 47.00 °C                               |
| (14) QSFP TEMP              | No reading (reading state unavailable) |
| (15) Core Supply Temp       | 75.96 °C                               |
| Board Management Controller | microcontroller FW version 26889       |
| Last Power Down Cause       | POK_CORE                               |
| Last Reset Cause            | None                                   |
| `//***** POWER *****//`     |                                        |
| Object Id                   | 0xF300000                              |
| PCIe s:b:d:f                | 0000:04:00:0                           |
| Device Id                   | 0x09C4                                 |
| Socket Id                   | 0x00                                   |
| Ports Num                   | 01                                     |
| Bitstream Id                | 0x121000200000161                      |
| Bitstream Version           | 0x10201                                |
| Pr Interface Id             | 93abeb6a-30c8-5f77-8172-d828c3a699ca   |
| ( 0) Total Input Power      | 23.50 Watts                            |
| ( 1) PCIe 12V Current       | 1.96 Amps                              |
| ( 2) PCIe 12V Voltage       | 11.60 Volts                            |
| ( 3) 1.2V Voltage           | 1.22 Volts                             |
| ( 4) 1.2V Current           | 2.66 Amps                              |
| ( 5) 1.8V Voltage           | 1.83 Volts                             |
| ( 6) 1.8V Current           | 2.91 Amps                              |
| ( 7) 3.3V Mgmt Voltage      | 3.36 Volts                             |
| ( 8) 3.3V Current           | 0.72 Amps                              |
| ( 9) FPGA Core Voltage      | 0.90 Volts                             |
| (10) FPGA Core Current      | 8.02 Amps                              |

#### Output of "fpgainfo temp" command

#### Output of "fpgainfo power" command
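
The numbered sensor readings in the temp and power output follow a regular `(index) name : value unit` pattern, which makes them easy to collect for logging or threshold checks. The sketch below parses a few readings copied from this slide and, as a cross-check, multiplies the PCIe 12 V current by the 12 V voltage. Sensors that report "No reading" are skipped. The names used here are illustrative, and the sample is laid out one reading per line as an assumption about the tool's real output.

```python
import re
from typing import NamedTuple


class Reading(NamedTuple):
    name: str
    value: float
    unit: str


SAMPLE_SENSORS = """\
(11) FPGA Core TEMP      : 73.00 °C
(12) Board TEMP          : 47.00 °C
(14) QSFP TEMP           : No reading (reading state unavailable)
( 0) Total Input Power   : 23.50 Watts
( 1) PCIe 12V Current    : 1.96 Amps
( 2) PCIe 12V Voltage    : 11.60 Volts
"""

SENSOR_RE = re.compile(
    r"\(\s*(\d+)\)\s*([^:()]+?)\s*:\s*([\d.]+)\s*(°C|Watts|Amps|Volts)"
)


def parse_sensors(text: str) -> dict:
    """Map sensor index to a Reading; sensors without a numeric value are skipped."""
    return {
        int(index): Reading(name, float(value), unit)
        for index, name, value, unit in SENSOR_RE.findall(text)
    }


def pcie_12v_power(sensors: dict) -> float:
    """Power drawn from the PCIe 12 V rail: current (index 1) times voltage (index 2)."""
    return sensors[1].value * sensors[2].value


sensors = parse_sensors(SAMPLE_SENSORS)
print(f"FPGA core: {sensors[11].value} {sensors[11].unit}")
print(f"12 V rail: {pcie_12v_power(sensors):.2f} W of {sensors[0].value} W total input")
```

The 12 V rail figure comes out slightly below the reported total input power, which is consistent with the card also drawing from other supply rails.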



### DMA BIST Output 3/5

| //initial FME ERRORS (MARKY/)         Object Id       : 0xF300000         PCIe s:b:d:f       : 0000:04:00:0         Device Id       : 0x09C4         Socket Id       : 0x00         Ports Num       : 01         Bitstream Id       : 0x121000200000161         Bitstream Version       : 0x7FFD0010201         Pr Interface Id       : 93abeb6a-30c8-5f77-8172-d828c3a699ca         First Error       : 0x0         Next Error       : 0x0         PCIe1 Errors       : 0x0         Nonfatal Errors       : 0x0         Inject Error       : 0x0         Catfatal Errors       : 0x0         PCIe0 Errors       : 0x0                             | //****** FMF FDDODC ******// |   |                                      |
|----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|------------------------------|---|--------------------------------------|
| | //****** FME ERRORS ******// | | |
| | Object Id | : | 0xF300000 |
| | PCIe s:b:d:f | : | 0000:04:00:0 |
| | Device Id | : | 0x09C4 |
| | Socket Id | : | 0x00 |
| | Ports Num | : | 01 |
| | Bitstream Id | : | 0x121000200000161 |
| | Bitstream Version | : | 0x7FFD00010201 |
| | Pr Interface Id | : | 93abeb6a-30c8-5f77-8172-d828c3a699ca |
| | First Error | : | 0x0 |
| | Next Error | : | 0x0 |
| | Errors | : | 0x0 |
| | PCIe1 Errors | : | 0x0 |
| | Nonfatal Errors | : | 0x0 |
| | Inject Error | : | 0x0 |
| | Catfatal Errors | : | 0x0 |
| | PCIe0 Errors | : | 0x0 |

#### Output of "fpgainfo error" command



#### DMA BIST Output 4/5

Loading DMA AFU (FPGA partial reconfiguration)

Partial reconfiguration reports "Found slot" only if the AFU version matches the FIM version

Running mode: dma\_afu

Attempting Partial Reconfiguration:

Reading bitstream

Looking for slot

Found slot

Programming bitstream

Writing bitstream

Done



#### DMA BIST Output 5/5

Running fpga\_dma\_test test...

Running test in HW mode

Buffer Verification Success!

Buffer Verification Success!

Running DDR sweep test

Buffer pointer = 0x7f1b68982000, size = 0x100000000 (0x7f1b68982000 through 0x7f1c68982000)

Allocated test buffer

Fill test buffer

DDR Sweep Host to FPGA

Measured bandwidth = 6810.764668 Megabytes/sec

Clear buffer

DDR Sweep FPGA to Host

Measured bandwidth = 6917.527127 Megabytes/sec

Verifying buffer..

Buffer Verification Success!

DDR sweep with unaligned pointer and size

Buffer pointer = 0x7f1b6938303d, size = 0xffffffbe (0x7f1b6938303d through 0x7f1c69382ffb)

....

Buffer pointer = 0x7f1b69383000, size = 0xfffffff9 (0x7f1b69383000 through 0x7f1c69382ff9)

Allocated test buffer

Fill test buffer

DDR Sweep Host to FPGA

Measured bandwidth = 6813.543883 Megabytes/sec

Clear buffer

DDR Sweep FPGA to Host

Measured bandwidth = 6926.264906 Megabytes/sec

Verifying buffer..

Buffer Verification Success!

Finished Executing DMA Tests

Measured bandwidth for each direction

Built-in Self-Test Completed.



# **ACCELERATOR FUNCTIONAL UNIT (AFU)**

### FPGA Interface Manager (FIM) + AFU



Programmable Solutions Group


#### How Can FPGA Accelerators Be Created?

#### **Self-Developed**

#### **Externally-Sourced**





#### **Accelerator Function Development**







### FPGA INTERFACE MANAGER (FIM): Under the hood



### **Overview of OPAE Platform for AFUs**

The Platform Interface Manager (PIM) defines a generic OPAE platform for which AFU top levels should be designed

- The AFU requests the device interfaces and properties it needs from the PIM using a platform configuration file (.json)
- The PIM generates a shim that translates the hardware platform's specific device interfaces to the generic OPAE platform device interfaces used by the AFU
- The shim is inserted between the platform's PR region and the AFU and provides the top-level module interface for the AFU



# **OPAE Platform Device Classes**

[Figure: OPAE platform device classes, including clocks, power, error and CCI-P, with legend]

## Core Cache Interface: Overview

- CCI abstracts the AFU from the lower-level PCIe protocol
- Enables the AFU to access host memory and respond to MMIO requests
- Composed of 3 command and response channels



- Supports a bidirectional 512-bit data path in the 400 MHz pClk domain
- Host memory accesses are made on a 64-byte cache line (CL) basis
  - Supports multi-CL bursts of 2 or 4 cache lines
  - Supports write fence mechanism to support synchronizing shared host memory accesses between AFU and Host SW application
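
To put these figures in perspective, the sketch below works through the arithmetic, assuming one full-width transfer per pClk cycle. The helper names are hypothetical and are not part of CCI-P or OPAE. With a 512-bit path at 400 MHz the raw per-direction figure comes out at 25.6 GB/s; this is only the width of the interface, and the DMA BIST earlier in this tutorial measured about 6.8 GB/s end to end.

```c
#include <assert.h>
#include <stdint.h>

#define CL_BYTES 64u /* cache line size stated on this slide */

/* Raw per-direction peak, assuming one bus_bits-wide transfer per clock. */
uint64_t cci_peak_bytes_per_sec(uint32_t bus_bits, uint64_t clock_hz)
{
    assert(bus_bits % 8 == 0);
    return (uint64_t)(bus_bits / 8) * clock_hz;
}

/* Number of 64-byte cache lines needed to cover `bytes`, rounded up. */
uint64_t cache_lines_for(uint64_t bytes)
{
    return (bytes + CL_BYTES - 1) / CL_BYTES;
}
```

For example, `cache_lines_for(0x100000000)` gives the number of cache-line transfers behind the 4 GiB DDR sweep buffer used by the BIST.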



### Intel FPGA Basic Building Blocks (BBB)

Suite of RTL shims for transforming the CCI interface

Memory Properties Factory (MPF)

- Adds features to the base CCI memory interface

CCI Async-shim

- Clock crossing shim for slower-running accelerators

CCI Multiplexer

- Allows multiple agents to share a single CCI-P interface

\$ git clone https://github.com/OPAE/intel-fpga-bbb





### **Example Designs to Get Started**



| Example           | Description                                                                                                      |
|-------------------|------------------------------------------------------------------------------------------------------------------|
| Hello AFU         | Simple AFU with direct CCI connection for MMIO access                                                            |
| Hello Intr AFU    | Example use of user interrupts                                                                                   |
| Hello Mem AFU     | Example showing how to use the user clock (USR Clock) to close timing automatically in the AFU                   |
| DMA AFU           | Example DMA AFU to move data between host memory<br>and local FPGA memory. Uses BBB and bridges Avalon to<br>CCI |
| Streaming DMA AFU | Example DMA AFU to move data between host memory and the AFU directly as a streaming packet                      |
| Eth e2e e10       | 10Gb Ethernet loopback design                                                                                    |
| Eth e2e e40       | 40Gb Ethernet loopback design                                                                                    |
| NLB mode 0        | Native LoopBack adaptor (rd/wr) with more features                                                               |
| NLB mode 0 stp    | Native LoopBack adaptor with SignalTap remote debug                                                              |
| NLB mode 3        | Native LoopBack adaptor (rd/wr)                                                                                  |



#### AF Project Structure: Overview of the hello\_afu Example AFU



Start with existing design and modify for your needs

- The ./hw directory provides an example file structure for the AFU's design source and build structure
- The host OPAE software application source is in the ./sw directory
  - Also used in the co-simulation environment

Project directory typically contains :

- AFU's Quartus settings file (./hw/afu.qsf)
- AFU's RTL
- AFU's Quartus PR build directory (./build) with project files and compiled AF image (.gbs)
- Platform configuration file (.json)
- Build configuration file (.txt)



# AFU RTL Source

#### Mandatory Source Files and Hierarchical Structure



afu.sv

- AFU top-level RTL source file describing accelerator
- Can have any name, but the top-level module within must be named "afu"

#### ccip\_std\_afu.sv

- Mandatory top level wrapper RTL file that instantiates the AFU module described in afu.sv
- Instantiates mandatory ccip\_interface\_reg module described in the mandatory ccip\_interface\_reg.sv source file

The .json file is the platform configuration file describing the device classes required by the AFU

The filelist.txt file specifies the build configuration (including source files and .json file)



# Platform Configuration File (.json)

Specify the AFU's UUID

- Use uuidgen to generate one

Request a top-level interface

- ccip\_std\_afu, ccip\_std\_afu\_avalon\_mm and optional HSSI device interfaces

Request pipelining on device interfaces

- Adds a user-defined number of pipeline register stages to the CCI or local memory interfaces

Request clock crossing on device interfaces

- Inserts a clock crossing bridge to synchronize the CCI and local memory interfaces to a clock

Specify a requested device interface as optional

Specify AFU user clock timing

- Timing is closed using the user clock frequency range defined here

#### AFU RTL Source ccip\_std\_afu.sv Source File (1/2)











AFU Simulation Environment (ASE) enables seamless portability to real HW

- Allows fast verification of OPAE software together with AF RTL without HW
  - SW Application loads ASE library and connects to RTL simulation
- For execution on HW, application loads Runtime library and RTL is compiled by Intel<sup>®</sup> Quartus into FPGA bitstream



### AFU Development Flow Using OPAE SDK

AFU requests the ccip\_std\_afu top-level interface class

- \$OPAE\_PLATFORM\_ROOT/hw/samples/hello\_afu/hw/rtl/hello\_afu.json

AFU RTL files implementing the accelerated function

- \$OPAE\_PLATFORM\_ROOT/hw/samples/hello\_afu/hw/rtl/afu.sv

List all source files and the platform configuration file

- \$OPAE\_PLATFORM\_ROOT/hw/samples/hello\_afu/hw/rtl/filelist.txt

In a terminal window, enter these commands:

- cd \$OPAE\_PLATFORM\_ROOT/hw/samples/hello\_afu
- afu\_sim\_setup --source hw/rtl/filelist.txt build\_sim





## AFU Development Flow Using OPAE SDK

Compile the AFU and platform simulation models and start the simulation server process

- cd build\_sim
- make
- make sim

In a 2<sup>nd</sup> terminal window, compile the host application and start the client process

- export ASE\_WORKDIR=\$OPAE\_PLATFORM\_ROOT/hw/samples/hello\_afu/build\_sim/work
- cd \$OPAE\_PLATFORM\_ROOT/hw/samples/hello\_afu/sw
- make USE\_ASE=1
- ./hello\_afu





## AFU Simulation Environment (ASE)

Hardware/software co-simulation environment

Uses simulator Direct Programming Interface (DPI) for HW/SW connectivity

- Not cycle accurate (used for functional correctness)
- Converts SW API to CCI transactions

Provides transactional model for the Core Cache Interface (CCI-P) protocol and memory model for the FPGA-attached local memory

Validates compliance with

- CCI-P protocol specification
- Avalon<sup>®</sup> Memory Mapped (Avalon-MM) Interface Specification
- Open Programmable Acceleration Engine



#### **Simulation Complete**

| [APP] | # [SIM] 1 ADDED /umas.187070359034322 |
|---|---|
| [APP] Issuing Soft Reset | # [SIM] Request to deallocate "/umas.187070359034322" |
| [APP] MMIO Read : tid = 0x002, offset = 0x0 | # [SIM] 1 REMOVED /umas.187070359034322 |
| [APP] MMIO Read Resp : tid = 0x002, data = 1000010000000000 | # [SIM] Request to deallocate "/mmio.187070359034322" |
| AFU DFH REG = 1000010000000000 | # [SIM] 0 REMOVED /mmio.187070359034322 |
| [APP] MMIO Read : tid = 0x003, offset = 0x8 | # [SIM] ASE recognized a SW simkill (see ase.cfg) Simulator will EXIT |
| [APP] MMIO Read Resp : tid = 0x003, data = 9722d43375b61c66 | # [SIM] SIM-C : Exiting event socket server@/tmp/ase_event_server_187070359034322 |
| AFU ID L0 = 9722d43375b61c66 | # [SIM] Closing message queue and unlinking |
| [APP] MMIO Read : tid = 0x004, offset = 0x10 | # [SIM] Unlinking Shared memory regions |
| [APP] MMIO Read Resp : tid = 0x004, data = 850adcc26ceb4b22 | # [SIM] Session code file removed |
| AFU ID HI = 850adcc26ceb4b22 | # [SIM] Removing message queues and buffer handles |
| [APP] MMIO Read : tid = 0x005, offset = 0x18 | # [SIM] Cleaning session files |
| [APP] MMIO Read Resp : tid = 0x005, data = 0 | # [SIM] Simulation generated log files |
| AFU NEXT = 00000000 | # [SIM] Transactions file \$ASE_WORKDIR/ccip_transactions.tsv |
| [APP] MMIO Read : tid = 0x006, offset = 0x20 | # [SIM] Workspaces info \$ASE_WORKDIR/workspace_info.log |
| [APP] MMIO Read Resp : tid = 0x006, data = 0 | # [SIM] ASE seed \$ASE_WORKDIR/ase_seed.txt |
| AFU RESERVED = 00000000 | # [SIM] |
| [APP] MMIO Read : tid = 0x007, offset = 0x80 | # [SIM] Tests run => 1 |
| [APP] MMIO Read Resp : tid = 0x007, data = 0 | # [SIM] |
| Reading Scratch Register (Byte Offset=00000080) = 00000000 | # [SIM] Sending kill command |
| MMIO Write to Scratch Register (Byte Offset=00000080) = 123456789abcdef | # [SIM] Simulation kill command received |
| [APP] MMIO Write : tid = 0x008, offset = 0x80, data = 0x123456789abcdef | # |
| [APP] MMIO Read : tid = 0x009, offset = 0x80 | # Transaction count \| VA VL0 VH0 VH1 \| MCL-1 MCL-2 MCL-4 |
| [APP] MMIO Read Resp : tid = 0x009, data = 123456789abcdef | # |
| Reading Scratch Register (Byte Offset=00000080) = 123456789abcdef | # MMIOWrReq 2 |
| Setting Scratch Register (Byte Offset=00000080) = 00000000 | # MMIORdReq 10 |
| [APP] MMIO Write : tid = 0x00a, offset = 0x80, data = 0x0 | # MMIORdRsp 10 |
| [APP] MMIO Read : tid = 0x00b, offset = 0x80 | # IntrReq 0 |
| [APP] MMIO Read Resp : tid = 0x00b, data = 0 | # IntrResp 0 |
| Reading Scratch Register (Byte Offset=00000080) = 00000000 | # RdReq 0 0 0 0 0 0 0 0 |
| Done Running Test | # RdResp 0 0 0 0 0 0 |
| [APP] Deinitializing simulation session | # WrReq 0 0 0 0 0 0 0 0 |
| [APP] Closing Watcher threads | # WrResp 0 0 0 0 0 0 0 0 |
| [APP] Deallocating UMAS | # WrFence 0 0 0 0 0 0 |
| [APP] Deallocating memory /umas.187070359034322 | # WrFenRsp 0 0 0 0 0 0 |
| [APP] SUCCESS | # |
| [APP] Deallocating MMIO map | # ** Note: \$finish : /home/student/fpga_trn/AccelStack_Workshop/hello_afu/build_sim/rtl/cci |
| [APP] Deallocating memory /mmio.187070359034322 | 4) |
| [APP] SUCCESS | # Time: 21620047500 ps Iteration: 2 Instance: /ase_top/ccip_emulator |
| [APP] Deallocate all buffers | # End time: 12:34:40 on Aug 21,2018, Elapsed time: 0:28:57 |
| [APP] Took 565,251,090 nsec | # Errors: 0, Warnings: 3 |

#### Application SW Window (client)

#### AFU Simulator Window (server)
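
The register values in the client window can be read back into something more familiar. The sketch below is an illustration rather than code from the sample: the function names are hypothetical, and the DFH bit positions (feature type in bits 63:60, end-of-list flag in bit 40) are an assumption based on the usual device feature header layout, so check them against the CCI-P documentation. It decodes the DFH value and joins AFU ID HI and AFU ID L0 into the textual UUID form used in the platform configuration file.

```c
#include <assert.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Feature type, assumed to sit in DFH bits [63:60] (1 would mark an AFU). */
unsigned dfh_feature_type(uint64_t dfh)
{
    return (unsigned)(dfh >> 60) & 0xFu;
}

/* End-of-list flag, assumed to sit in DFH bit 40. */
int dfh_is_end_of_list(uint64_t dfh)
{
    return (int)((dfh >> 40) & 1u);
}

/* Format the two 64-bit ID registers as a lowercase 8-4-4-4-12 UUID string. */
void afu_id_to_uuid(uint64_t id_hi, uint64_t id_lo, char out[37])
{
    assert(out != NULL);
    snprintf(out, 37, "%08" PRIx32 "-%04x-%04x-%04x-%012" PRIx64,
             (uint32_t)(id_hi >> 32),
             (unsigned)((id_hi >> 16) & 0xFFFFu),
             (unsigned)(id_hi & 0xFFFFu),
             (unsigned)((id_lo >> 48) & 0xFFFFu),
             id_lo & 0xFFFFFFFFFFFFULL);
}

/* Returns 1 when the ID registers match a lowercase UUID string, else 0. */
int afu_id_matches(uint64_t id_hi, uint64_t id_lo, const char *expected_uuid)
{
    char buf[37];
    afu_id_to_uuid(id_hi, id_lo, buf);
    return strcmp(buf, expected_uuid) == 0;
}
```

With the values from the log, AFU ID HI `850adcc26ceb4b22` and AFU ID L0 `9722d43375b61c66` combine to `850adcc2-6ceb-4b22-9722-d43375b61c66`.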



# AFU Development Flow Using OPAE SDK

#### Generate the AF build environment:

- cd \$OPAE\_PLATFORM\_ROOT/hw/samples/hello\_afu
- afu\_synth\_setup --source hw/rtl/filelist.txt build\_synth

#### Generate the AF

- cd build\_synth
- \$OPAE\_PLATFORM\_ROOT/bin/run.sh





### Using the Quartus GUI

Compiling the AFU uses a command-line-driven PR compilation flow

- Builds the PR region AF as a .gbs file to be loaded into the OPAE hardware platform

The Quartus GUI can be used for the following types of work:

- Viewing compilation reports
- Interactive timing analysis
- Adding SignalTap instances and nodes
  - For on-board debugging


# AFU Debug with Remote SignalTap

Remote SignalTap enables in-system debug of AFUs on PAC installations with limited physical access

Remote debug capability in OPAE supports the following in-system debug tools included with Quartus Prime Pro:

- In-system sources and probes
- In-system memory content editor
- Signal Probe
- System Console





### AFU Design Using High Level Synthesis (HLS)

Leverage a GNU-compatible HLS compiler to produce verified RTL

Designing at a higher level of abstraction increases productivity

- Debugging software is much faster than debugging hardware
- Functions are easier to specify in software
- RTL simulation takes thousands of times longer than running software
- C/C++ source is easier to modify than RTL








https://www.intel.com/content/www/us/en/programmable/documentation/div1537518568620.html



# APPLICATION DEVELOPMENT ON THE ACCELERATION STACK

#### **Components of Acceleration Stack: Overview**



### Co-Design for HW and SW




#### **OpenCL<sup>™</sup> Programming**



## **Open Programmable Acceleration Engine (OPAE)**

#### **Consistent API across product generations and platforms**

- Abstraction for hardware-specific FPGA resource details

#### Designed for minimal software overhead and latency

Lightweight user-space library (libfpga)

#### Open ecosystem for industry and developer community

- License: FPGA API (BSD), FPGA driver (GPLv2)
- FPGA driver being upstreamed into Linux kernel
- Supports both virtual machines and bare metal platforms

Faster development and debugging of Accelerator Functions with the included AFU Simulation Environment (ASE)

Includes guides, command-line utilities and sample code

Simplified FPGA Programming Model for Application Developers



Start developing for Intel FPGAs with OPAE today: http://01.org/OPAE

#### The OPAE Library at a Glance

Enumerate, access, and manage FPGA resources through API objects

A common interface across different FPGA form factors

C API designed for extensibility

AFU Simulation Environment (ASE) allows developing and debugging accelerator functions and software applications without an FPGA

Tools for partial reconfiguration, FPGA hardware information, error reporting, etc.



## Useful OPAE Command-line Utilities For Board Management

| Command   | Description |
|-----------|-------------|
| fpgainfo  | Reads board telemetry data, for example temperature or voltages |
| fpgabist  | Performs a self-diagnostic test: measures bandwidth between local DDR4 memory and system memory |
| fpgaconf  | Configures an Accelerator Functional Unit (AFU) into the FPGA;<br>checks compatibility with the targeted FPGA and FIM |
| fpgaflash | Updates the FPGA Interface Manager (FIM) image (.rpd file) stored in flash;<br>updates BMC firmware |
| fpgad     | Daemon that monitors the FPGA drivers' error status and reports errors as events to OPAE |



#### **Application Development with OPAE**





### The OPAE Library Programming Model





#### **Enumeration and Discovery**



| fpga_properties <b>prop</b>                 | fpgaEnumerate() | fpga_token <b>token</b>                                |
|---------------------------------------------|-----------------|--------------------------------------------------------|
| objtype: FPGA_ACCELERATOR<br>guid: 0xabcdef |                 | &lt;internal reference to<br>accelerator resource&gt; |

```
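#include <opae/fpga.h>

fpga_properties prop = NULL;
fpga_token token;
fpga_guid myguid;   /* AFU ID to search for, shown as 0xabcdef in the diagram */
uint32_t n = 0;     /* number of accelerators that matched the filter */

fpgaGetProperties(NULL, &prop);                       /* create an empty filter */
fpgaPropertiesSetObjectType(prop, FPGA_ACCELERATOR);  /* match accelerators only */
fpgaPropertiesSetGUID(prop, myguid);                  /* match this AFU ID */
fpgaEnumerate(&prop, 1, &token, 1, &n);               /* return at most one token */
fpgaDestroyProperties(&prop);                         /* filter no longer needed */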
```
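
The snippet leaves `myguid` as a placeholder. In practice the GUID usually starts life as the UUID string from the AFU's platform configuration file and has to be turned into the 16 raw bytes that `fpga_guid` is assumed to hold here. Host code commonly relies on a UUID library for that step; the sketch below is a dependency-free stand-in, and `parse_guid` is a hypothetical helper, not an OPAE function.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Convert one hex digit to its value, or -1 if it is not a hex digit. */
static int hex_value(char c)
{
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    return -1;
}

/*
 * Parse "xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx" into 16 bytes, most
 * significant byte first. Returns 0 on success, -1 on malformed input.
 */
int parse_guid(const char *text, uint8_t guid[16])
{
    assert(text != NULL && guid != NULL);
    if (strlen(text) != 36)
        return -1;

    size_t byte = 0;
    for (size_t i = 0; i < 36; ) {
        if (i == 8 || i == 13 || i == 18 || i == 23) {
            if (text[i] != '-')
                return -1;          /* dash expected at these positions */
            i++;
            continue;
        }
        int hi = hex_value(text[i]);
        int lo = hex_value(text[i + 1]);
        if (hi < 0 || lo < 0)
            return -1;
        guid[byte++] = (uint8_t)((hi << 4) | lo);
        i += 2;
    }
    return 0;
}
```

The resulting byte array could then be passed where `myguid` appears above, for instance after parsing the hello\_afu UUID `850adcc2-6ceb-4b22-9722-d43375b61c66`.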



#### Acquire and Release Accelerator Resource







#### Software Developer Needs AFU Specification

Memory mapped register space

- Software uses to discover, control and communicate with FPGA accelerator
  - Report status flags
  - Configure AFU settings
  - Start/Stop control of acceleration workload
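
As an illustration of what such a specification gives the software developer, the sketch below captures the hello\_afu register offsets seen in the ASE log earlier in this tutorial and exercises them against a mock register file. Everything named `mock_*` is hypothetical and stands in for real hardware; on an actual device or under ASE the accesses would go through the OPAE MMIO calls (`fpgaReadMMIO64` and `fpgaWriteMMIO64`) instead.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Byte offsets of the hello_afu registers, as they appear in the ASE log. */
enum {
    AFU_DFH      = 0x00,
    AFU_ID_L     = 0x08,
    AFU_ID_H     = 0x10,
    AFU_NEXT     = 0x18,
    AFU_RESERVED = 0x20,
    AFU_SCRATCH  = 0x80
};

#define MOCK_MMIO_BYTES 0x100u

/* Hypothetical stand-in for the device: a plain array, not real MMIO. */
typedef struct {
    uint64_t regs[MOCK_MMIO_BYTES / 8];
} mock_afu;

/* Load the read-only identification values seen in the simulation log. */
void mock_afu_reset(mock_afu *afu)
{
    memset(afu, 0, sizeof *afu);
    afu->regs[AFU_DFH / 8]  = 0x1000010000000000ULL;
    afu->regs[AFU_ID_L / 8] = 0x9722d43375b61c66ULL;
    afu->regs[AFU_ID_H / 8] = 0x850adcc26ceb4b22ULL;
}

uint64_t mock_mmio_read64(const mock_afu *afu, uint32_t offset)
{
    assert(offset % 8 == 0 && offset < MOCK_MMIO_BYTES);
    return afu->regs[offset / 8];
}

void mock_mmio_write64(mock_afu *afu, uint32_t offset, uint64_t value)
{
    assert(offset % 8 == 0 && offset < MOCK_MMIO_BYTES);
    /* Only the scratch register is writable in this mock. */
    if (offset == AFU_SCRATCH)
        afu->regs[offset / 8] = value;
}
```

Writing `0x123456789abcdef` to `AFU_SCRATCH` and reading it back mirrors the scratch register test in the log, while writes to the ID registers are ignored by the mock.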



#### **Management and Reconfiguration**





### A Code Example - Put Everything Together

The hello\_afu.c code in the \$OPAE\_PLATFORM\_ROOT/hw/samples directory of the OPAE library

- Demonstrates all OPAE API functions discussed in this presentation
- Writes and reads configuration registers from the host to show how basic register accesses to the FPGA are done
- The same flow can be used to access and exercise any other AFUs

To compile the source code, run the appropriate gcc/make commands





#### Include OPAE header files

// State from the AFU's JSON file, extracted using OPAE's afu\_json\_mgr script
#include "afu\_json\_info.h"

int usleep(unsigned);



Define constants that will be used when communicating with FPGA accelerator

static int s\_error\_count = 0;





| Code | Description |
|------|-------------|
| <pre>int main(int argc, char *argv[])<br/>{<br/>    fpga_properties filter = NULL;<br/>    fpga_token afc_token;<br/>    fpga_handle afc_handle;<br/>    fpga_guid guid;<br/>    uint32_t num_matches;<br/>    fpga_result res = FPGA_OK;</pre> | Create variables and objects that will be used when communicating with the FPGA accelerator |
| <pre>if (uuid_parse(HELLO_AFU_ID, guid) &lt; 0) {<br/>    fprintf(stderr, "Error parsing guid '%s'\n", HELLO_AFU_ID);<br/>    goto out_exit;<br/>}</pre> | Parse the AFU ID string into a GUID |
| <pre>/* Look for AFC with MY_AFC_ID */<br/>res = fpgaGetProperties(NULL, &amp;filter);<br/>ON_ERR_GOTO(res, out_exit, "creating properties object");</pre> | Create an empty FPGA properties object |
| <pre>res = fpgaPropertiesSetObjectType(filter, FPGA_ACCELERATOR);<br/>ON_ERR_GOTO(res, out_destroy_prop, "setting object type");<br/>res = fpgaPropertiesSetGUID(filter, guid);<br/>ON_ERR_GOTO(res, out_destroy_prop, "setting GUID");</pre> | Populate the opaque FPGA properties object with the desired search parameters |
| <pre>/* TODO: Add selection via BDF / device ID */<br/>res = fpgaEnumerate(&amp;filter, 1, &amp;afc_token, 1, &amp;num_matches);<br/>ON_ERR_GOTO(res, out_destroy_prop, "enumerating AFCs");</pre> | Search for matching FPGA resources using <i>fpgaEnumerate()</i>, which returns the matches in the fpga_token <i>afc_token</i> |
| <pre>if (num_matches &lt; 1) {<br/>    fprintf(stderr, "AFC not found.\n");<br/>    res = fpgaDestroyProperties(&amp;filter);<br/>    return FPGA_INVALID_PARAM;<br/>}</pre> | Report an error and destroy the properties object if no match is found |



/\* Open AFC and map MMIO \*/
res = fpgaOpen(afc\_token, &afc\_handle, 0);
ON\_ERR\_GOTO(res, out\_destroy\_tok, "opening AFC");

res = fpgaMapMMIO(afc\_handle, 0, NULL);
ON\_ERR\_GOTO(res, out\_close, "mapping MMIO space");

printf("Running Test\n");

/\* Reset AFC \*/
res = fpgaReset(afc\_handle);
ON\_ERR\_GOTO(res, out\_close, "resetting AFC");

// Access mandatory AFU registers
uint64\_t data = 0;
res = fpgaReadMMIO64(afc\_handle, 0, AFU\_DFH\_REG, &data);
ON\_ERR\_GOTO(res, out\_close, "reading from MMIO");
printf("AFU\_DFH\_REG = %08lx\n", data);

res = fpgaReadMMIO64(afc\_handle, 0, AFU\_ID\_LO, &data);
ON\_ERR\_GOTO(res, out\_close, "reading from MMIO");
printf("AFU\_ID\_LO = %08lx\n", data);

res = fpgaReadMMIO64(afc\_handle, 0, AFU\_ID\_HI, &data);
ON\_ERR\_GOTO(res, out\_close, "reading from MMIO");
printf("AFU\_ID\_HI = %08lx\n", data);

res = fpgaReadMMIO64(afc\_handle, 0, AFU\_NEXT, &data);
ON\_ERR\_GOTO(res, out\_close, "reading from MMIO");
printf("AFU\_NEXT = %08lx\n", data);

res = fpgaReadMMIO64(afc\_handle, 0, AFU\_RESERVED, &data);
ON\_ERR\_GOTO(res, out\_close, "reading from MMIO");
printf("AFU\_RESERVED = %08lx\n", data);

Acquire ownership of the resource pointed to by *afc\_token* using fpgaOpen(), receiving the fpga\_handle *afc\_handle*

Map accelerator register space to user space

Reset the Accelerator Function using the *fpgaReset* API

Read the Device Feature Header registers from the AFU and print them to screen







/\* Unmap MMIO space \*/
res = fpgaUnmapMMIO(afc\_handle, 0);
ON\_ERR\_GOTO(res, out\_close, "unmapping MMIO space");

/\* Release accelerator \*/

out\_close:
res = fpgaClose(afc\_handle);
ON\_ERR\_GOTO(res, out\_destroy\_tok, "closing AFC");

/\* Destroy token \*/
out\_destroy\_tok:
#ifndef USE\_ASE
res = fpgaDestroyToken(&afc\_token);
ON\_ERR\_GOTO(res, out\_destroy\_prop, "destroying token");
#endif

/\* Destroy properties object \*/
out\_destroy\_prop:
res = fpgaDestroyProperties(&filter);
ON\_ERR\_GOTO(res, out\_exit, "destroying properties object");

out\_exit:
if (s\_error\_count > 0)

```
printf("Test FAILED!\n");
```

return s\_error\_count;

Unmap register space

Release the accelerator for others to use

Destroy the token

Destroy the property object

If any errors occur during configuration register access, increase error count and print failure
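
The `ON_ERR_GOTO` macro used throughout the listing is not shown on these slides. A minimal sketch of how such a macro and the label-based cleanup chain could work is given below. The `result_t` enum, the `step()` helper, the `run_sequence()` function and the macro body are stand-ins written for illustration, not the OPAE definitions.

```cpp
#include <cstdio>

// Stand-in for fpga_result; only the success value matters here.
enum result_t { RES_OK = 0, RES_FAIL = 1 };

static int s_error_count = 0;

// On failure: report, count the error and jump to the given cleanup label.
#define ON_ERR_GOTO(res, label, desc)                       \
    do {                                                    \
        if ((res) != RES_OK) {                              \
            std::fprintf(stderr, "Error %s\n", (desc));     \
            s_error_count += 1;                             \
            goto label;                                     \
        }                                                   \
    } while (0)

// Hypothetical operation that succeeds or fails on request.
static result_t step(bool ok) { return ok ? RES_OK : RES_FAIL; }

// Acquire two resources in order and release them in reverse order.
// Returns the number of errors seen; *released counts the release steps that ran.
int run_sequence(bool open_ok, bool map_ok, int* released) {
    result_t res = RES_OK;  // declared up front: C++ forbids jumping over initializations
    s_error_count = 0;
    *released = 0;

    res = step(open_ok);                    // e.g. open the accelerator
    ON_ERR_GOTO(res, out_exit, "opening");

    res = step(map_ok);                     // e.g. map the register space
    ON_ERR_GOTO(res, out_close, "mapping");

    *released += 1;                         // unmap: only reached if mapping worked

out_close:
    *released += 1;                         // close: runs whenever the open succeeded
out_exit:
    if (s_error_count > 0) std::printf("Test FAILED!\n");
    return s_error_count;
}
```

Each label undoes exactly one acquisition, and a failure jumps to the label that releases only what has been acquired so far. That is why the labels in the listing appear in the reverse order of the setup calls.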

# **AFU DEVELOPMENT USING HLS**

#### AFU development using HLS - Agenda

- Introduction to High Level Synthesis
- HLS interfaces
- HLS AFU development flow



## Introduction to High-Level Synthesis

### Introduction to HLS - Agenda

- Introduction
- x86 Emulation
- Cosimulation
- Intel<sup>®</sup> Quartus<sup>®</sup> Software Integration



### **High Level Synthesis**

Synthesize a C/C++ function into an RTL implementation

- Develop the component in a software environment
- Verify the functionality of the component within a software environment
- Integrate it seamlessly with hardware simulation environment
- Optimize design using software-centric tools and reports
- Integrate generated IP easily within traditional FPGA design tools



#### **Traditional FPGA Design Process**

#### Potentially Time-Consuming Effort



#### **Behavioral Simulation**





Designing at a higher level of abstraction increases productivity

- Debugging software is much faster than debugging hardware
- It is easier to specify functions in software
- RTL simulation takes thousands of times longer than running the software











## Intel<sup>®</sup> HLS Compiler

- Targets Intel<sup>®</sup> FPGAs
- Command-line executable: i++
- Builds an IP block
  - To be integrated into a traditional FPGA design using FPGA tools



- Leverages standard C/C++ development environment
- Goal: Same performance as hand-coded RTL with 10-15% more resources



#### **HLS Procedure**







## Intel<sup>®</sup> HLS Compiler Usage and Output



a is the default output name; use the -o option to specify a different output name





#### HLS Procedure: x86 Emulation





### g++ Compatibility

Intel<sup>®</sup> HLS Compiler is command line compatible with g++

- Similar command-line flags, x86 behavior, and compilation flow
- Changing "g++" to "i++" should just work
  - g++ <flags> <src>
  - i++ <flags> <src>
- x86 behavior should match g++
- No source modifications required (for x86 mode)
- Support for GNU Makefiles


## x86 Debugging Tools

- printf/cout
- gdb
- Valgrind





## Using printf()

- Requires "HLS/stdio.h"
  - Maps to <stdio.h> when appropriate
- Can be included in the testbench or the component
  - Used with no limitations in the x86 emulation flow
- printf statements inside the component are ignored for HDL generation
  - Ignored in the cosimulation flow with an HDL simulator

### Using printf(): Example

#### Example Program

```
// test.cpp
#include "HLS/stdio.h"

// Component: the printf below only prints in the x86 emulation flow
void say_hello() {
  printf("Hello from the component\n");
}

// Testbench
int main() {
  printf("Hello from the testbench\n");
  say_hello();
  return 0;
}
```

#### Terminal Commands and output

```
$ i++ test.cpp
$ ./a.out
Hello from the testbench
Hello from the component
$
```

```
$ i++ test.cpp -march=Arria10 \
      --component say_hello
$ ./a.out
Hello from the testbench
$
```



### Debugging Using gdb

- i++ integrates well with GNU gdb
  - Debug data is generated by default
    - Unlike g++, -g is enabled by default; use -g0 to turn off debug data
- -march=x86-64 flow:
  - Can step through any part of the code (including the component)
- -march=<fpga family> flow:
  - Can step through testbench code
  - gdb does not see the component side execution (that runs in an HDL simulator)



### Debugging with Valgrind

The Valgrind tool suite provides a number of debugging and profiling tools that help you make your programs faster and more correct

#### Valgrind tools can detect:

- Memory leaks
- Invalid pointer uses
- Use of uninitialized values
- Mismatched use of malloc/new vs free/delete
- Doubly freed memory

Use it to debug the component and the testbench in the x86 emulation flow



http://www.valgrind.org/





#### **HLS Procedure: Cosimulation**





#### Example Component/Testbench Source





#### Translation from C function API to HDL module

- All component functions are synthesized to HDL
  - Each synthesized component is an independent HDL module
- Component functions can be declared:
  - Using component keyword in source
  - Specifying "--component <component\_name>" on the command line
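
The component/testbench split can be sketched as follows. The function `mymult`, the helper `check_mymult` and the fallback macro are illustrative assumptions. When the HLS headers are not on the include path (for example when building with plain g++), `component` is defined here as an empty macro so that the same file still compiles as ordinary C++.

```cpp
// Use the HLS header when it is available; otherwise make 'component' a no-op
// so the file builds with an ordinary C++ compiler (assumption of this sketch).
#if defined(__has_include)
#  if __has_include("HLS/hls.h")
#    include "HLS/hls.h"
#    define HAVE_HLS_HEADER 1
#  endif
#endif
#ifndef HAVE_HLS_HEADER
#  define component
#endif

// Component: the function that would be synthesized to an HDL module.
component int mymult(int a, int b) {
    return a * b;
}

// Testbench helper: everything outside the component runs on x86.
// Compares the component against a reference built from repeated addition
// and returns the number of mismatches for 0 <= a, b < n.
int check_mymult(int n) {
    int mismatches = 0;
    for (int a = 0; a < n; ++a) {
        for (int b = 0; b < n; ++b) {
            int expected = 0;
            for (int k = 0; k < b; ++k) expected += a;
            if (mymult(a, b) != expected) ++mismatches;
        }
    }
    return mismatches;
}
```

A testbench `main()` would typically call a check like this and return a non-zero exit code on any mismatch. Alternatively, the function can be left unmarked in the source and named with --component on the command line.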



#### Cosimulation

Cosimulation: combines x86 testbench with RTL simulation

- HDL code for the component runs in an RTL Simulator
  - Verilog
  - RTL testbench automatically created from software
- main() and everything else called from main runs on x86 as the testbench
- Communication using SystemVerilog Direct Programming Interface (DPI)
  - Allows C/C++ to interface with SystemVerilog
  - An inter-process communication (IPC) library passes testbench input data to the RTL simulator and returns the results to the x86 testbench



### **Cosimulation Verifying HLS IP**

The Intel<sup>®</sup> HLS Compiler automatically compiles and links the C++ testbench with an instance of the component running in an RTL simulator

- To verify the RTL behavior of the IP, just run the executable generated by the HLS compiler targeting the FPGA architecture
  - Any call to the component function becomes a call to the simulator through DPI





#### Viewing Component Waveforms

- Compile design with i++ -ghdl flag
  - Enable full visibility and logging of all HDL signals in simulation
- After cosimulation execution, waveform available at a.prj/verification/vsim.wlf
- Examine with the ModelSim\* Simulator GUI:
  - vsim a.prj/verification/vsim.wlf

### Viewing Waveforms in the ModelSim\* Simulator

[Screenshot: ModelSim - Intel FPGA Edition, with the instance hierarchy under tb and a Wave window showing the mymult\_inst signals clock, resetn, start, busy, a and b]



#### **Need for Cosimulation**

- x86-emulation sufficient to functionally debug vast majority of issues
- Cosimulation used to test latency and performance of component
- Cosimulation used to catch hardware generation issues
  - Improper use of HLS compiler directives
    - e.g. #pragmas
  - Improper use of HLS compiler attributes
  - Improper use of HLS-specific constructs
  - Test component reset behavior
- Cosimulation should be done before integrating component with FPGA



### C/C++ Functions to Dataflow Circuits

Each component function is converted into custom dataflow hardware

- Gain the benefits of Intel<sup>®</sup> FPGAs without the lengthy design process
- Implement C/C++ operators as circuits
  - HDL code located in <HLS Installation>\ip
  - Load Store units to read/write memory
  - Arithmetic units to perform calculations
  - Flow control units
  - Connect circuits according to data flow in the function

[Screenshot: contents of the ip directory, a mix of Verilog (.v), SystemVerilog (.sv), VHDL (.vhd) and \_hw.tcl files such as acl\_staging\_reg.v, barrier\_fifo.v, cra\_ring\_node.sv and dotp\_core.vhd]



#### **Compilation Example**

Software compiled into dataflow circuit with flow control
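
As an illustration of the kind of function that is turned into such a circuit, consider the sketch below. The function name `dot8`, the fixed vector length and the fallback `component` macro are assumptions made for this example. The comments indicate which kind of hardware unit each operation would roughly correspond to, following the list of units given earlier rather than any particular compiler report.

```cpp
// Use the HLS header when it is available; otherwise make 'component' a no-op
// so the file builds with an ordinary C++ compiler (assumption of this sketch).
#if defined(__has_include)
#  if __has_include("HLS/hls.h")
#    include "HLS/hls.h"
#    define HAVE_HLS_HEADER 1
#  endif
#endif
#ifndef HAVE_HLS_HEADER
#  define component
#endif

constexpr int N = 8;  // vector length, fixed for the example

// Dot product of two N-element vectors held in memory.
component int dot8(const int* a, const int* b) {
    int acc = 0;
    for (int i = 0; i < N; ++i) {  // loop test and increment: flow control
        int x = a[i];              // memory read: load unit
        int y = b[i];              // memory read: load unit
        acc += x * y;              // multiply and add: arithmetic units
    }
    return acc;                    // result handed back to the caller
}
```

The data dependencies in the source (two loads feed one multiply, which feeds the accumulator) are what determine how the units are connected in the generated circuit.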



#### Main HTML Report

The Intel<sup>®</sup> HLS Compiler automatically generates an HTML report that analyzes various aspects of your function, including area, loop structure, memory usage, and system data flow

Located at a.prj/reports/report.html





### **HTML Report: Summary**

#### Overall compile statistics

- FPGA Resource Utilization
- Compile Warnings
- Intel<sup>®</sup> Quartus<sup>®</sup> Software fitter results
  - Available after compilation

etc.

| Report field          | Value                                                               |
|-----------------------|---------------------------------------------------------------------|
| Project Name          | ./fpga/add_ex                                                       |
| Target Family, Device | Arria 10, 10AX115U1F45I1SG                                          |
| i++ Version           | 17.1.0 Build 240                                                    |
| Quartus Version       | 17.1.0 Build 240                                                    |
| Command               | i++ -march=Arria10 --component add add_ex.cpp -o ./fpga/add_ex.out |
| Reports Generated At  | Tue Oct 31 10:18:13 2017                                            |

Quartus Fit Clock Summary

|                 | 1x clock fmax |
|-----------------|---------------|
| Frequency (MHz) | 612.75        |

Quartus Fit Resource Utilization Summary

|     | ALMs | FFs | RAMs | DSPs |
|-----|------|-----|------|------|
| add | 18   | 3   | 0    | 0    |

Estimated Resource Usage

| Component Name | ALUTs   | FFs    | RAMs   | DSPs |
|----------------|---------|--------|--------|------|
| add            | 38      | 2      | 0      | 0    |
|                | 38 (0%) | 2 (0%) | 0 (0%) | 0    |

#### **HTML Report: Loops**

Serial loop execution hinders function dataflow circuit performance

- Use Loop Analysis report to see if and how each loop is optimized
  - Helps identify component pipeline bottlenecks





#### HTML Report: Loop Analysis

#### Loop analysis shows how loops are implemented

- Ability to correlate with source code





#### HTML Report: Area Analysis

View detailed estimated resource consumption by system or source line

- Analyze data control overhead
- View memory implementation
- Shows resource usage
  - ALUTs
  - FFs
  - RAMs
  - DSPs
- Identifies inefficient uses

Area report (source view). Area utilization values are estimated. Notation *file:X > file:Y* indicates a function call on line X was inlined using code on line Y.

|                                   | ALUTs | FFs  | RAMs | DSPs | Details      |
|-----------------------------------|-------|------|------|------|--------------|
| Variable: '1' (example.cpp:11)    | 25    | 133  | 0    | 0    | Implemented… |
| example.cpp:12 (a_out)            | 33    | 1152 | 16   |      | Memory sys…  |
| example.cpp:13 (b_buf)            | 0     | 0    | 64   | 0    | Memory sys…  |
| No Source Line                    | 553   | 1168 | 0    | 0    |              |
| example.cpp:14                    | 37    | 51   | 1    | 0    |              |
| example.cpp:15                    | 94    | 111  | 0    | 0    |              |
| &nbsp;&nbsp;State                 | 68    | 87   | 0    | 0    |              |
| &nbsp;&nbsp;Store                 | 34    | 24   | 0    | 0    |              |
| example.cpp:16                    | 94    | 111  | 0    | 0    |              |
| example.cpp:22                    | 1038  | 784  | 0    | 0    |              |
| example.cpp:26                    | 14    | 28   |      | 0    |              |

| example | .cpp hls.h hls_internal.h                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                           |      |
|---------|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|------|
| 1       | #include "HLS/hls.h"                                  |      |
| 2       | #include "stdio.h"                                    |      |
| 3       | #include "stdlib.h"                                   |      |
| 4       |                                                       |      |
| 5       | typedef altera::stream_in&lt;int&gt; my_operand;      |      |
| 6       | typedef altera::stream_out&lt;int&gt; my_result;      |      |
| 7       |                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                     |      |
| 8       | <pre>component void vec_add_kernel(my_operand &amp;a, my_operand &amp;b, my_re<br/>&amp;c)</pre>                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    | sult |
| 9+      | (                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                   |      |
| 10      | int i;                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              |      |
| 11      | int j;                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              |      |
| 12      | int a_buf[32][32];                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  |      |
| 13      | int b_buf[32][32];                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  |      |
| 14 *    | for (i = 0; i < 32 * 32; i++) {                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                     |      |
| 15      | <pre>a_buf[i / 32][i % 32] = a.read();</pre>                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        |      |
| 16      | <pre>b_buf[i / 32][i % 32] = b.read();</pre>                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        |      |
| 17      | }                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                   |      |
| 18 *    | for (j = 0; j < 1024 * 32; j++) {                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                   |      |
| 19      |                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                     |      |
| 28      | #pragma unroll                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      |      |
| 21 *    | for (1 = 0; 1 < 32; 1++) {                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          |      |
| 22      | <pre>b_buf[j % 32][1] += a_buf[1][j % 32];</pre>                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    |      |
| 23      |                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                     |      |
| 24      |                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                     |      |
| 25.*    | for $(1 = 0; 1 < 32 - 32; 1++)$ {                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                   |      |
| 20      | c.write(b_bu*[1 / 32][1 % 32]);                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                     |      |
| 20      | , /                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                 |      |
| 20      | I. Contraction of the second se |      |
| 29      | ist said () (                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                       |      |
| 30 +    | THE MATHIA I                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        |      |
| 22      | ny_operand a, o,                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    |      |
| 32      | whitehore of al                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                     |      |
| 3.4     | unrised loss loss start - alters ble set sin time();                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                |      |
|         | cuprement roug roug acoust - erceueTuraTectore();                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                   |      |




#### **HTML Report: Component Viewer**

Displays abstracted netlist of the HW implementation

- View data flow pipeline
  - See loads and stores
  - Interfaces including stream reads and writes
  - Memory structure
  - Loop structure
  - Possible performance bottlenecks
    - Unpipelined loops are colored light red
    - Stallable points are red



Figure: Component Viewer graph for the component vec\_add\_kernel.

#### **HTML Report: Memory Viewer**

Displays local memory implementation and accesses

- Visualize memory architecture
  - Banks, widths, replication, etc
- Visualize load-store units (LSUs)
  - Stall-free?
  - Arbitration
  - Red indicates stalled



Correlates with source code.



#### HTML Report: Verification Statistics

Reports execution statistics from the testbench run. Available after the component has been simulated (that is, after the testbench executable has run).

- Number and type of component invocations
- Latency of the component
- Dynamic initiation interval (II) of the component
- Data rates of streams

| Component | Invocations | Latency (min, max, avg) | II (min, max, avg) |
|---|---|---|---|
| dut (Unknown location) | 101 | 4, 4, 4 | |
| Explicit component invocations (Unknown location) | 1 | 4, 4, 4 | n/a, n/a, n/a |
| Enqueued component invocations (Unknown location) | 100 | 4, 4, 4 | 1, 1, 1 |

Measurements based on latest execution of testbench
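The split between explicit and enqueued invocations follows from how the testbench calls the component. The sketch below assumes a hypothetical two-input adder named `dut` (the slides do not show the component body). A direct call counts as an explicit invocation, while calls queued with `ihc_hls_enqueue` and launched with `ihc_hls_component_run_all` count as enqueued invocations. The definitions under `#ifndef HAVE_INTEL_HLS` are stand-ins written for this example so that it also builds with an ordinary C++ compiler; they are not the Intel implementation.

```cpp
#include <cassert>

#if defined(__has_include)
#if __has_include("HLS/hls.h")
#include "HLS/hls.h"  // real Intel HLS header, picked up when building with i++
#define HAVE_INTEL_HLS 1
#endif
#endif

#ifndef HAVE_INTEL_HLS
// Stand-ins (assumptions for this sketch): run each enqueued call immediately.
#define component
template <typename R, typename F, typename... Args>
void ihc_hls_enqueue(R *ret, F *func, Args... args) { *ret = (*func)(args...); }
template <typename F>
void ihc_hls_component_run_all(F) {}
#endif

// Hypothetical component under test.
component int dut(int a, int b) { return a + b; }

// Testbench body: 1 explicit invocation followed by 100 enqueued invocations.
// Returns the number of results that differ from the expected sum.
int run_testbench() {
  int mismatches = 0;

  int explicit_result = dut(1, 2);  // explicit invocation
  if (explicit_result != 3) ++mismatches;

  int results[100];
  for (int i = 0; i < 100; ++i) {
    ihc_hls_enqueue(&results[i], &dut, i, 1);  // queued, not yet simulated
  }
  ihc_hls_component_run_all(dut);  // simulate all queued calls back to back

  for (int i = 0; i < 100; ++i) {
    if (results[i] != i + 1) ++mismatches;
  }
  return mismatches;
}
```

Enqueued calls are what allow the simulator to measure a dynamic initiation interval, since consecutive invocations can overlap in the pipeline.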



#### Introduction to HLS Agenda

- Introduction
- x86 emulation
- Cosimulation
- Intel<sup>®</sup> Quartus<sup>®</sup> Software Integration



#### **HLS Procedure: Integration**






#### Intel Quartus<sup>®</sup> Software QoR Metrics for IP

Use Intel<sup>®</sup> Quartus<sup>®</sup> Prime software to generate quality-of-result reports

- i++ creates the Quartus project in a.prj/quartus
- To generate QoR data (final resource utilization, fmax)
  - Run quartus\_sh --flow compile quartus\_compile
  - Or use i++ --quartus-compile option
- Results are included in the HTML report
  - a.prj/reports/report.html
  - Summary page



### Intel<sup>®</sup> Quartus<sup>®</sup> Software Integration

- The a.prj/components directory contains all the files needed for integration
- One subdirectory for each component
  - Portable, can be moved to a different location if desired
- Two use scenarios
  - 1. Instantiate the component in HDL
  - 2. Add the IP to a Platform Designer system



#### **HDL** Instantiation

- Add the component to the Intel<sup>®</sup> Quartus<sup>®</sup> software project
  - <component>.qsys for Standard Edition
  - <component>.ip for Pro Edition
- Instantiate the component module in your design
  - Use the generated template: a.prj/components/<component>/<component>\_inst.v

```
add add_inst (
  // Interface: clock (clock end)
  .clock      ( ), // 1-bit clk input
  // Interface: reset (reset end)
  .resetn     ( ), // 1-bit reset_n input
  // Interface: call (conduit sink)
  .start      ( ), // 1-bit valid input
  .busy       ( ), // 1-bit stall output
  // Interface: return (conduit source)
  .done       ( ), // 1-bit valid output
  .stall      ( ), // 1-bit stall input
  // Interface: returndata (conduit source)
  .returndata ( ), // 32-bit data output
  // Interface: a (conduit sink)
  .a          ( ), // 32-bit data input
  // Interface: b (conduit sink)
  .b          ( )  // 32-bit data input
);
```





#### **Platform Designer System Integration Tool**



**Catalog of available IP**

- Interface protocols
- Memory
- DSP
- Embedded
- Bridges
- Custom Components
- Custom Systems

Accelerate development





Simplify integration

Automate integration tasks



# **HLS** Interfaces

How to integrate your component with the rest of the system

#### **HLS Interfaces Section - Agenda**

- Avalon<sup>®</sup> Interfaces
- Default HLS Interfaces
- Memory Master Interfaces
- Explicit Streaming Interfaces
- Register Interfaces
- Memory Slave Interfaces



#### Avalon<sup>®</sup> Interfaces

Easily connects components in an Intel® FPGA to simplify system design

- Standard interfaces designed for interoperability
- HLS compiler generates Avalon<sup>®</sup> interfaces around HLS components
- Avalon Streaming Interface (Avalon-ST)
  - Unidirectional flow of data, simple flexible interface
- Avalon Memory Mapped Interface (Avalon-MM)
  - Address-based read/write interface typical of master-slave connections
- Other Interfaces
  - Conduit, Tri-State Conduit, Interrupt, Clock, Reset



#### Avalon<sup>®</sup>-ST Interfaces

- Standard, flexible, and modular protocol for transfer of data
  - Unidirectional
  - Point-to-point connections
  - Fully synchronous
  - Supports simple and complex interface requirements





#### Avalon<sup>®</sup>-MM Interfaces

- Address-based (memory-mapped) protocol that allows components to communicate using read/write requests
- Master interface
  - Initiates read/write transfers targeting specific address
- Slave interface
  - Accepts and responds to transfer requests
- Interconnect handles decoding of master address request to actual slave interface, backpressure, clocking differences, etc.
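To make the address-decoding role of the interconnect concrete, here is a small behavioral model in plain C++. It is only an illustration: the `SlaveRange` and `Decoded` types and the `decode` function are hypothetical names, and the real Platform Designer interconnect is generated hardware that also handles backpressure and clock crossing, which this sketch ignores.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical description of one slave's window in the master's address space.
struct SlaveRange {
  std::uint64_t base;  // first byte address served by the slave
  std::uint64_t span;  // size of the window in bytes
};

// Result of decoding a master address.
struct Decoded {
  int slave;             // index into the address map, or -1 if unmapped
  std::uint64_t offset;  // byte offset inside the selected slave
};

// Finds the slave whose window contains 'addr' and converts the master
// address into a slave-local offset.
Decoded decode(const std::vector<SlaveRange> &map, std::uint64_t addr) {
  for (std::size_t i = 0; i < map.size(); ++i) {
    if (addr >= map[i].base && addr - map[i].base < map[i].span) {
      return Decoded{static_cast<int>(i), addr - map[i].base};
    }
  }
  return Decoded{-1, 0};
}
```

With a map such as `{{0x0000, 0x1000}, {0x4000, 0x100}}`, a master access to `0x4010` would select the second slave at offset `0x10`, and an access to `0x2000` would hit no slave.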





## Avalon<sup>®</sup> Interface Specification

- Defines the entire Avalon interface standard, including all variations
- Provides reference information on additional transfer types
  - Use cases
  - Waveform diagrams
- https://www.intel.com/content/dam/www/programmable/us/en/pdfs/literature/manual/mnl\_avalon\_spec.pdf



Reference document: Avalon<sup>®</sup> Interface Specifications, MNL-AVABUSREF, 2018.09.26, updated for Intel<sup>®</sup> Quartus<sup>®</sup> Prime Design Suite 18.1.


### **Default Interfaces for Scalars**

Each scalar argument results in an input conduit associated with the start and busy signals







#### Pointers: Implicit Memory-Mapped Interface

- Every pointer or reference argument becomes an address input associated with the start and busy signals
- A memory-mapped master interface is created automatically
- The address space is 64 bits wide by default
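As a rough illustration of these defaults, the sketch below shows a hypothetical component, `sum_array`, with one pointer argument and one scalar argument. Under the default rules described above, `data` would become an address input backed by an implicit Avalon-MM master, and `n` would become a conduit input. When the Intel HLS header is not available, `component` is defined as an empty macro so that the same source also compiles as ordinary C++ for functional testing.

```cpp
#include <cassert>

#if defined(__has_include)
#if __has_include("HLS/hls.h")
#include "HLS/hls.h"  // real Intel HLS header, picked up when building with i++
#define HAVE_INTEL_HLS 1
#endif
#endif

#ifndef HAVE_INTEL_HLS
#define component  // stand-in: the keyword has no effect in plain C++
#endif

// Hypothetical component.
//   data: pointer argument, accessed through the implicit memory-mapped master
//   n:    scalar argument, delivered on an input conduit with start/busy
component int sum_array(int *data, int n) {
  int total = 0;
  for (int i = 0; i < n; ++i) {
    total += data[i];  // each access is a read over the memory-mapped master
  }
  return total;
}

// Testbench helper (hypothetical): calls the component on a small buffer.
int run_sum_array() {
  int buffer[5] = {1, 2, 3, 4, 5};
  return sum_array(buffer, 5);
}
```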









#### **Explicit MM Master Interface**

- Explicitly declare Avalon-MM Master interfaces using the ihc::mm\_master<> class
  - Greater control over interface
  - Specify attributes through parameters





#### ihc::mm\_master Class Parameters

Usage: ihc::mm\_master<datatype, /\*template arguments\*/>

| Feature       | Valid Values | Default | Description                                                                                  |
|---------------|--------------|---------|----------------------------------------------------------------------------------------------|
| ihc::dwidth   | 8,16,32,1024 | 64      | Width of data bus                                                                            |
| ihc::awidth   | 1-64         | 64      | Width of address bus (byte addressing)                                                       |
| ihc::aspace   | >0           | 1       | Address space #, masters with the same address space are arbitrated                          |
| ihc::align    | >default     | type    | Byte alignment of pointer address                                                            |
| ihc::latency  | >=0          | 1       | Guaranteed latency from read to valid data,<br>0=variable latency                            |
| ihc::maxburst | 1-1024       | 1       | Max transfers associated with a read/write.<br>For fixed latency interfaces, value must be 1 |

Other attributes, including readwrite\_mode and waitrequest, are described in the HLS Compiler Reference Manual
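A minimal sketch of how these parameters might be combined, assuming a hypothetical component `copy_scaled` that reads from one master and writes to another. The two masters use different `ihc::aspace` values, so they would not be arbitrated against each other. The parameter values are illustrative only. The fallback `ihc` namespace below is a stand-in written for this example (it ignores the parameters and simply wraps a pointer) so that the sketch builds without the Intel header.

```cpp
#include <cassert>

#if defined(__has_include)
#if __has_include("HLS/hls.h")
#include "HLS/hls.h"  // real Intel HLS header, picked up when building with i++
#define HAVE_INTEL_HLS 1
#endif
#endif

#ifndef HAVE_INTEL_HLS
// Stand-ins (assumptions for this sketch), not the Intel implementation.
#define component
namespace ihc {
template <int N> struct aspace {};
template <int N> struct awidth {};
template <int N> struct dwidth {};
template <typename T, typename... Params>
class mm_master {
 public:
  explicit mm_master(T *data, int size = 0) : base_(data) { (void)size; }
  T &operator[](int i) { return base_[i]; }
 private:
  T *base_;
};
}  // namespace ihc
#endif

// Two masters in separate address spaces, each with a 32-bit address and data bus.
typedef ihc::mm_master<int, ihc::aspace<1>, ihc::awidth<32>, ihc::dwidth<32> > src_master;
typedef ihc::mm_master<int, ihc::aspace<2>, ihc::awidth<32>, ihc::dwidth<32> > dst_master;

// Hypothetical component: dst[i] = src[i] * factor for n elements.
component void copy_scaled(src_master &src, dst_master &dst, int n, int factor) {
  for (int i = 0; i < n; ++i) {
    dst[i] = src[i] * factor;
  }
}

// Testbench helper (hypothetical): wraps host buffers in mm_master objects.
int run_copy_scaled() {
  int in[4] = {1, 2, 3, 4};
  int out[4] = {0, 0, 0, 0};
  src_master src(in, static_cast<int>(sizeof(in)));
  dst_master dst(out, static_cast<int>(sizeof(out)));
  copy_scaled(src, dst, 4, 3);
  return out[0] + out[1] + out[2] + out[3];
}
```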

#### **MM Master Address Spaces**



- Having multiple address spaces creates multiple MM Masters
  - Allows simultaneous multimastering over Platform Designer interconnect



#### **Streaming Interfaces**

- Scalar function arguments become pipelined input ports on the HDL module
  - Avalon Streaming interface associated with start and busy inputs
  - Implicit
- Explicit Streaming Interfaces
  - Use ihc::stream\_in<> and ihc::stream\_out<> template classes
    - Pass by reference
  - Creates Avalon Streaming interface with valid and ready signals
  - Explicit control over interface



#### Explicit Streaming Interface Example
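The original slide content for this example is not available in the text, so the following is a minimal sketch rather than the presenter's code. It assumes a hypothetical component `stream_add` that reads one value from each of two `ihc::stream_in<int>` interfaces and writes their sum to an `ihc::stream_out<int>` interface, all passed by reference. The fallback `ihc::stream` class is a queue-based stand-in written for this example so that it builds without the Intel header.

```cpp
#include <cassert>
#include <queue>

#if defined(__has_include)
#if __has_include("HLS/hls.h")
#include "HLS/hls.h"  // real Intel HLS header, picked up when building with i++
#define HAVE_INTEL_HLS 1
#endif
#endif

#ifndef HAVE_INTEL_HLS
// Stand-ins (assumptions for this sketch): an unbounded software FIFO.
#define component
namespace ihc {
template <typename T>
class stream {
 public:
  void write(const T &value) { fifo_.push(value); }
  T read() {
    T value = fifo_.front();
    fifo_.pop();
    return value;
  }
 private:
  std::queue<T> fifo_;
};
template <typename T> using stream_in = stream<T>;
template <typename T> using stream_out = stream<T>;
}  // namespace ihc
#endif

typedef ihc::stream_in<int> operand_stream;
typedef ihc::stream_out<int> result_stream;

// Hypothetical component: each invocation consumes one element from each
// input stream and produces one element on the output stream.
component void stream_add(operand_stream &a, operand_stream &b, result_stream &c) {
  c.write(a.read() + b.read());
}

// Testbench helper (hypothetical): feeds n pairs (i, 10*i), invokes the
// component n times, and returns the sum of all results.
int run_stream_add(int n) {
  operand_stream a, b;
  result_stream c;
  for (int i = 0; i < n; ++i) {
    a.write(i);
    b.write(10 * i);
  }
  for (int i = 0; i < n; ++i) {
    stream_add(a, b, c);
  }
  int total = 0;
  for (int i = 0; i < n; ++i) {
    total += c.read();
  }
  return total;
}
```

In the testbench, the roles are mirrored: the testbench writes to the component's input streams and reads from its output stream.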





#### **Explicit Streaming Interface Customizations**

Usage: ihc::stream\_in<datatype, /\*template arguments\*/>

| Feature          | Valid Values  | Description                                     |
|------------------|---------------|-------------------------------------------------|
| ihc::buffer      | Positive int  | FIFO buffer capacity in words (for inputs)      |
| ihc::usesPackets | true or false | Exposes startofpacket and endofpacket signals   |
| ihc::usesValid   | true or false | Whether a valid signal is present (for inputs)  |
| ihc::usesReady   | true or false | Whether a ready signal is present (for outputs) |

Other attributes, including bitsPerSymbol and readylatency, are described in the HLS Compiler Reference Manual

#### **Slave Interfaces**

- Component control and status register
  - In lieu of start/busy/done/stall signals
- Slave data registers
  - Ideal for smaller inputs
- Slave memories
  - For larger arrays



#### **MM Slave Component**

- Component can have 1 CSR slave interface for function call and return
  - Shared with slave arguments
  - Address map described in generated <component\_name>\_csr.h
- irq\_done signifies that the component has finished
- Used in place of the default streaming call and return interfaces
- Declared with the hls\_avalon\_slave\_component attribute: hls\_avalon\_slave\_component component int dut(...) { return result; }

### **MM Slave Register Argument**

- Can be used independent of slave component
- Used in lieu of default conduit argument
- Ideal for smaller inputs
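A minimal sketch of a slave component with slave register arguments, assuming a hypothetical body that returns the sum of `a` and `b` (the slides only show `dut(...)`). With the Intel HLS compiler, `hls_avalon_slave_component` places call and return control in the CSR slave, and `hls_avalon_slave_register_argument` places each argument in a slave register. Without the Intel header, the attributes are defined as empty macros so that the code still compiles for functional testing.

```cpp
#include <cassert>

#if defined(__has_include)
#if __has_include("HLS/hls.h")
#include "HLS/hls.h"  // real Intel HLS header, picked up when building with i++
#define HAVE_INTEL_HLS 1
#endif
#endif

#ifndef HAVE_INTEL_HLS
// Stand-ins (assumptions for this sketch): the attributes have no effect
// when the code is compiled as plain C++.
#define component
#define hls_avalon_slave_component
#define hls_avalon_slave_register_argument
#endif

// Hypothetical slave component: start, busy, done, and the return value are
// accessed through the CSR slave; a and b are slave registers.
hls_avalon_slave_component
component int dut(hls_avalon_slave_register_argument int a,
                  hls_avalon_slave_register_argument int b) {
  return a + b;
}
```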




### Slave Component and Register Address Map

| /*<br>Register<br>Address | Access | Register Contents<br>(64-bits)                     | Description                                                                                                                                       | <component>_csr.h contains</component>                                                                                                                                                |
|---------------------------|--------|----------------------------------------------------|---------------------------------------------------------------------------------------------------------------------------------------------------|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| 0x0 | R | <pre>{reserved[62:0],     busy[0:0]}</pre> | Read the busy status of<br>the component<br>0 - the component is ready<br>to accept a new start<br>1 - the component cannot<br>accept a new start | <ul> <li>Address Map</li> <li>Macros created for register byte<br/>addresses and bit masks</li> </ul> |
| 0x8 | W | <pre>{reserved[62:0],     start[0:0]}</pre> | Write 1 to signal start to the component | |
| 0x10 | R/W | <pre>{reserved[62:0], interrupt_enable[0:0]}</pre> | 0 - Disable interrupt,<br>1 - Enable interrupt | /* Byte Addresses */<br>#define MYCOMP_CSR_BUSY_REG (0x0)<br>#define MYCOMP_CSR_START_REG (0x8) |
| 0x18 | R/Wclr | <pre>{reserved[61:0],</pre> | Signals component completion<br>done is read-only and<br>interrupt_status is write 1<br>to clear | <pre>#define MYCOMP_CSR_INTERRUPT_ENABLE_REG (0x10) #define MYCOMP_CSR_INTERRUPT_STATUS_REG (0x18) #define MYCOMP_CSR_RETURNDATA_REG (0x20) #define MYCOMP_CSR_ARG_A_REG (0x28)</pre> |
| 0x20 | R | <pre>{reserved[31:0], returndata[31:0]}</pre> | Return data | #define MYCOMP_CSR_ARG_B_REG (0x30) |
| 0x28 | R/W | {reserved[31:0],<br>a[31:0]} | Argument a | /* Argument Sizes (bytes) */<br>#define MYCOMP_CSR_RETURNDATA_SIZE (4) |
| 0x30 | R/W | {reserved[31:0],<br>b[31:0]} | Argument b | #define MYCOMP_CSR_ARG_A_SIZE (4)<br>#define MYCOMP_CSR_ARG_B_SIZE (4) |
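
The register map above implies a simple host-side call sequence: check busy, write the arguments, write 1 to start, then read the return data. The following is a minimal sketch of that sequence, not the tutorial's host code. `MockCsr` and `run_component` are hypothetical names, and the mock's "return a + b" behaviour is an assumption made only so the sketch has something to compute. On real hardware the reads and writes would go through the platform's MMIO calls instead.

```cpp
#include <cassert>
#include <cstdint>
#include <map>

// Byte addresses copied from the generated CSR macros in the table above.
#define MYCOMP_CSR_BUSY_REG (0x0)
#define MYCOMP_CSR_START_REG (0x8)
#define MYCOMP_CSR_RETURNDATA_REG (0x20)
#define MYCOMP_CSR_ARG_A_REG (0x28)
#define MYCOMP_CSR_ARG_B_REG (0x30)

// Hypothetical software stand-in for the component's CSR slave.
// Assumption: the component returns a + b and finishes instantly.
struct MockCsr {
  std::map<uint64_t, uint64_t> regs;

  void write64(uint64_t offset, uint64_t value) {
    assert(offset % 8 == 0 && "CSR registers sit on 64-bit boundaries");
    regs[offset] = value;
    if (offset == MYCOMP_CSR_START_REG && (value & 1u)) {
      uint32_t a = static_cast<uint32_t>(regs[MYCOMP_CSR_ARG_A_REG]);
      uint32_t b = static_cast<uint32_t>(regs[MYCOMP_CSR_ARG_B_REG]);
      regs[MYCOMP_CSR_RETURNDATA_REG] = static_cast<uint32_t>(a + b);
    }
  }

  uint64_t read64(uint64_t offset) { return regs[offset]; }
};

// Host-side sequence: wait until ready, load arguments, start, read result.
uint32_t run_component(MockCsr &csr, uint32_t a, uint32_t b) {
  while (csr.read64(MYCOMP_CSR_BUSY_REG) & 1u) {
    // busy == 1: the component cannot accept a new start yet
  }
  csr.write64(MYCOMP_CSR_ARG_A_REG, a);
  csr.write64(MYCOMP_CSR_ARG_B_REG, b);
  csr.write64(MYCOMP_CSR_START_REG, 1);
  // A real host would wait here for completion (interrupt status register
  // at 0x18) before reading the result; the mock completes immediately.
  return static_cast<uint32_t>(csr.read64(MYCOMP_CSR_RETURNDATA_REG));
}
```

Calling `run_component(csr, 2, 3)` on a fresh `MockCsr` exercises the whole sequence without any hardware attached.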



#### Streaming HLS Component in a System





#### Memory-Mapped HLS Component in a System









#### Interface Synthesis Tutorials

Located in <hls\_install\_folder>/examples/tutorials/interfaces

- explicit\_streams\_buffer
- explicit\_streams\_packets\_ready\_valid
- mm\_master\_testbench\_operators
- mm\_slaves
- multiple\_stream\_call\_sites
- pointer\_mm\_master
- stable\_arguments



## HLS AFU development flow

#### HLS development flow

Intel<sup>®</sup> High Level Synthesis Accelerator Functional Unit Design Example User Guide

https://www.intel.com/content/dam/www/programmable/us/en/pdfs/literature/ug/ug-hls-afu.pdf

The Intel High Level Synthesis (HLS) Accelerator Functional Unit (AFU) design example shows how to create AFUs for the Intel<sup>®</sup> Acceleration Stack for Intel Xeon<sup>®</sup> CPU with FPGAs using the Intel HLS Compiler.

The package includes all the source code, scripts and makefile needed.

You can use this code as a model to create your own HLS AFUs if your AFUs use the same interfaces as the example design. Also, you might be able to convert your HLS application into an AFU by adding the required interfaces to the hardware design.



#### HLS on Acceleration Stack





#### **HLS AFU Container block diagram**





#### HLS on Acceleration Stack (basic vector reduce)

```
#include "HLS/hls.h"  // provides the `component` keyword

// Basic version: sums the input vector and writes each element plus one
// to the output vector.
component
float floatingPointVectorReduce_basic(float *masterRead,
                                      float *masterWrite,
                                      int size)
{
  float sum = 0.0f;
  for (int idx = 0; idx < size; idx++)
  {
    float readVal = masterRead[idx];
    sum += readVal;

    masterWrite[idx] = readVal + 1.0f;
  }

  return sum;
}
```
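
To see what the basic component computes before involving any FPGA tooling, it can be exercised as ordinary C++. The sketch below assumes that, outside the HLS toolchain, `component` can simply be defined away. `run_basic` is a hypothetical testbench helper, not part of the design example.

```cpp
#include <cassert>
#include <vector>

// Assumption: when "HLS/hls.h" is not available, treat `component` as a no-op
// so the function compiles as plain C++.
#ifndef component
#define component
#endif

component
float floatingPointVectorReduce_basic(float *masterRead,
                                      float *masterWrite,
                                      int size)
{
  float sum = 0.0f;
  for (int idx = 0; idx < size; idx++)
  {
    float readVal = masterRead[idx];
    sum += readVal;
    masterWrite[idx] = readVal + 1.0f;
  }
  return sum;
}

// Hypothetical helper: runs the component on `input`, fills `output`
// with the written values, and returns the sum.
float run_basic(const std::vector<float> &input, std::vector<float> &output) {
  std::vector<float> in = input;  // the component takes non-const pointers
  output.assign(input.size(), 0.0f);
  assert(output.size() == in.size());
  return floatingPointVectorReduce_basic(in.data(), output.data(),
                                         static_cast<int>(in.size()));
}
```

For the input {1, 2, 3} the helper returns 6 and leaves {2, 3, 4} in the output vector.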



#### HLS on Acceleration Stack (HLS Signature)

```
#include <stdint.h>
#include "HLS/hls.h"

// Read-only Avalon-MM master: 512-bit data bus, 48-bit addresses
typedef ihc::mm_master<float, ihc::dwidth<512>,
                       ihc::awidth<48>, ihc::latency<0>,
                       ihc::aspace<1>, ihc::readwrite_mode<readonly>,
                       ihc::waitrequest<true>, ihc::align<64>,
                       ihc::maxburst<4> > MasterReadFloat;

// Write-only Avalon-MM master in a separate address space
typedef ihc::mm_master<float, ihc::dwidth<512>,
                       ihc::awidth<48>, ihc::latency<0>,
                       ihc::aspace<2>, ihc::readwrite_mode<writeonly>,
                       ihc::waitrequest<true>, ihc::align<64>,
                       ihc::maxburst<4> > MasterWriteFloat;

component
hls_avalon_slave_component
float floatingPointVectorReduce_float(
    hls_avalon_slave_register_argument MasterReadFloat &masterRead,
    hls_avalon_slave_register_argument MasterWriteFloat &masterWrite,
    hls_avalon_slave_register_argument uint64_t size)
```

Figure: the floatingPointVectorReduce HLS component with its three interfaces: Control/Status Register slave, Avalon-MM master (read), and Avalon-MM master (write).
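
The template parameters in the signature fix a few numbers that the rest of the design relies on. The sketch below only derives them; the constant and function names are hypothetical, and it assumes a 4-byte `float` and byte addressing on the Avalon-MM masters.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

constexpr std::size_t kDataWidthBits = 512;  // ihc::dwidth<512>
constexpr std::size_t kAddrWidthBits = 48;   // ihc::awidth<48>
constexpr std::size_t kAlignBytes = 64;      // ihc::align<64>
constexpr std::size_t kMaxBurst = 4;         // ihc::maxburst<4>

static_assert(sizeof(float) == 4, "sketch assumes 32-bit floats");

constexpr std::size_t kBytesPerWord = kDataWidthBits / 8;              // 64
constexpr std::size_t kFloatsPerWord = kBytesPerWord / sizeof(float);  // 16
constexpr std::size_t kMaxBurstBytes = kMaxBurst * kBytesPerWord;      // 256
constexpr std::uint64_t kAddressableBytes = std::uint64_t{1} << kAddrWidthBits;

// The declared alignment equals one full bus word.
static_assert(kAlignBytes == kBytesPerWord, "align<64> matches dwidth<512>");

// True if a buffer address meets the alignment the masters are declared with.
inline bool is_word_aligned(std::uint64_t byte_address) {
  return byte_address % kAlignBytes == 0;
}

// Debug-build guard a host could call before handing a buffer to the AFU.
inline void require_word_aligned(std::uint64_t byte_address) {
  assert(is_word_aligned(byte_address) && "buffer must be 64-byte aligned");
}
```

The value 16 for `kFloatsPerWord` is the same number the component body uses as its unroll factor.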

#### HLS on Acceleration Stack (Code Body)

The 512-bit data bus can transfer up to sixteen 32-bit floats per cycle, so we unroll the inner loop by 16 to process them in parallel.

```
#define UNROLL_FACTOR 16  // 16 32-bit floats in one 512-bit dword
#define FLOAT_BITS 32     // 32 bits in one float

// Body of floatingPointVectorReduce_float (signature on the previous slide)
{
  float sum = 0.0f;
  int iterations = 1 + ((size - 1) / UNROLL_FACTOR);
  for (int loop_idx = 0; loop_idx < iterations; loop_idx++)
  {
    float readSum = 0.0f;
#pragma unroll UNROLL_FACTOR
    for (int itr = 0; itr < UNROLL_FACTOR; itr++)
    {
      int idx = itr + (loop_idx * UNROLL_FACTOR);
      if (idx < size)
      {
        float readVal = masterRead[idx];
        readSum += readVal;
        masterWrite[idx] = readVal + 1.0f;
      }
    }
    sum += readSum;
  }
  return sum;
}
```
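
The iteration count `1 + ((size - 1) / UNROLL_FACTOR)` is a ceiling division: it gives the number of 16-float blocks needed to cover `size` elements, and the `idx < size` guard skips the unused lanes of the last block. The plain C++ model below illustrates this. `block_iterations` and `reduce_blocked` are hypothetical names, and the explicit handling of `size == 0` is an addition of this sketch, since `size - 1` wraps around for an unsigned zero.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

#define UNROLL_FACTOR 16

// Number of 16-float blocks covering `size` elements, as in the component.
int block_iterations(uint64_t size) {
  assert(size > 0);  // size - 1 would wrap around for an unsigned zero
  return static_cast<int>(1 + ((size - 1) / UNROLL_FACTOR));
}

// Software model of the blocked loop nest (no HLS pragmas or interfaces).
float reduce_blocked(const std::vector<float> &in, std::vector<float> &out) {
  const uint64_t size = in.size();
  out.assign(in.size(), 0.0f);
  if (size == 0) return 0.0f;

  float sum = 0.0f;
  const int iterations = block_iterations(size);
  for (int loop_idx = 0; loop_idx < iterations; loop_idx++) {
    float readSum = 0.0f;  // partial sum of one 512-bit word
    for (int itr = 0; itr < UNROLL_FACTOR; itr++) {
      const uint64_t idx =
          itr + static_cast<uint64_t>(loop_idx) * UNROLL_FACTOR;
      if (idx < size) {  // skip unused lanes in the final block
        readSum += in[idx];
        out[idx] = in[idx] + 1.0f;
      }
    }
    sum += readSum;
  }
  return sum;
}
```

Because the blocked version adds per-block partial sums, the order of floating-point additions differs from the basic loop, so for general data the two results can differ in the last bits.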

#### **HLS AFU Flow Overview**

1. Build/Verify HLS code
2. Insert into Platform Designer
3. Build with Acceleration Stack tools
   - Either ASE, or AF Bitstream
4. Build and run host



#### Compiling and Simulating the HLS Component with the i++ Command

- We compile this example design using the included makefile.
- To build and emulate the design using x86 instructions, run these commands:
  - \$ make test-x86-64
  - \$ ./test-x86-64
- To generate RTL and simulate it with the ModelSim simulator, run these commands:
  - \$ make test-fpga
  - \$ ./test-fpga

<u>Confirm that the outputs from the test-x86-64 and test-fpga commands match.</u> The test-x86-64 command runs the C++ code on the processor, while the test-fpga command compiles the C++ source to Verilog RTL and then simulates the generated RTL using the testbench defined in the code.



#### Viewing waves in simulator (optional)

Because the component was built with the -ghdl flag, the generated ModelSim testbench logs all HDL signals to a .wlf file.

\$ vsim fpga\_ghdl.prj/verification/vsim.wlf

Add the desired signals to waveform viewer in the selected simulator





#### Generating a Platform Designer container for the HLS component

Use Platform Designer to integrate the HLS component into an AFU with the predesigned hardware interfaces available in the Acceleration Stack, and verify that all sources are linked correctly.

\$ qsys-edit hls\_afu\_container.qsys





After integrating the HLS component into an AFU, you might want to cosimulate the AFU in the Intel AFU Simulation Environment (ASE), to quickly confirm the functionality of your HLS component within the AFU.

To simulate using ASE, navigate to the root of your project (the hls\_afu directory) and run:

\$ afu\_sim\_setup --source hw/rtl/filelist.txt build\_ase\_dir/

\$ make

\$ make sim

Screenshot: the ASE terminal in build\_ase\_dir after running make sim. The simulator reports that the protocol checker is initialized, writes the .ase\_ready.pid lock file to the work directory, prints the ASE\_WORKDIR setting to copy into the terminal where the application will run, and ends with "Ready for simulation" and "Press CTRL-C to close simulator".

Open a new terminal window to compile the host application

Export the ASE\_WORKDIR environment variable using the export command from the output of the make sim command in the ASE terminal window.

\$ export ASE\_WORKDIR=<path to work folder>

Build the host application with simulation support and run

- \$ make USE\_ASE=1
- \$ ./hls\_afu\_host
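
The host application's output notes that the FPGA always writes full 512-bit words (64 bytes), so a test vector whose size in bytes is not a multiple of 64 causes the FPGA to overwrite some space past the end of the output data. The sketch below only reproduces that arithmetic. The function names are hypothetical and it assumes 4-byte floats; it does not model what values the FPGA writes into the extra space.

```cpp
#include <cassert>
#include <cstddef>

constexpr std::size_t kWordBytes = 64;  // one 512-bit write to host memory

// Bytes the FPGA touches when it writes `vector_size` floats in whole words.
inline std::size_t bytes_written(std::size_t vector_size) {
  const std::size_t payload = vector_size * sizeof(float);
  return ((payload + kWordBytes - 1) / kWordBytes) * kWordBytes;
}

// A marker placed directly after the output vector stays intact only if the
// payload ends exactly on a 64-byte boundary.
inline bool sentinel_survives(std::size_t vector_size) {
  return (vector_size * sizeof(float)) % kWordBytes == 0;
}

// Small self-check showing the two cases described in the host output.
inline void self_check() {
  assert(bytes_written(16) == 64);   // exactly one word, nothing extra
  assert(bytes_written(17) == 128);  // spills into a second word
  assert(sentinel_survives(32));     // 128 bytes, a multiple of 64
  assert(!sentinel_survives(20));    // 80 bytes, not a multiple of 64
}
```

With 4-byte floats the condition reduces to vector\_size being a multiple of 16.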



#### Host terminal window with all transactions

Screenshot: two host terminal windows (in the sw directory) showing the [APP] log of the session start and every MMIO read and write issued by the host application.

Note printed by the host application: The FPGA writes a full 512-bit word (64 bytes) to host memory, so if the size of your test vector (in bytes) is not a multiple of 64, the FPGA will overwrite some space at the end of output memory. fpgaPrepareBuffer() allocates your host memory in a buffer that is a multiple of 64 bytes, so the FPGA behavior will not affect your application. You should expect to see a single 0xdeadbeef at the end of the output memory if and only if the size of your test vector (determined by vector\_size and the datatype) is a multiple of 64 bytes (that is, if vector\_size is a multiple of 16).

| [APP]         PHID Red         : Lid = 0x402, offset = 0x4           [APP]         PHID Red         : Lid = 0x402, offset = 0x4           [APP]         PHID Red         : Lid = 0x402, offset = 0x4           [APP]         PHID Red         : Lid = 0x402, offset = 0x4           [APP]         PHID Red         : Lid = 0x402, offset = 0x4           [APP]         PHID Red         : Lid = 0x402, offset = 0x4           [APP]         PHID Red         : Lid = 0x402, offset = 0x10           [APP]         PHID Red         : Lid = 0x402, offset = 0x10           [APP]         PHID Red         : Lid = 0x402, offset = 0x10           [APP]         PHID Red         : Lid = 0x402, offset = 0x10           [APP]         PHID Red         : Lid = 0x402, offset = 0x10           [APP]         PHID Red         : Lid = 0x402, offset = 0x10           [APP]         PHID Red         : Lid = 0x402, offset = 0x10           [APP]         PHID Red         : Lid = 0x402, offset = 0x10           [APP]         PHID Red         : Lid = 0x402, offset = 0x10           [APP]         PHID Red         : Lid = 0x402, offset = 0x10           [APP]         PHID Red         : Lid = 0x402, offset = 0x10           [APP]         Stoffset = 0x10, offset = 0x10           [                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      
                                                                                                                                                                                                                                                                                                                                                                                                                 | end of output memory after executing kernel:<br>[6] - 22.33334 (eVelDanab)<br>[6] - 22.66666 (eVelDanab)<br>[6] - 22.6666 (eVelDanab)<br>[6] - 22.6666 (eVelDanab)<br>[6] - 22.6666 (eVelDanab)<br>[6] - 22.6676 (eVelDanab)<br>[6] - 22.6776 (eVe |
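
The 64-byte write granularity described in the note above can be checked with a short calculation. The following Python sketch is only an illustration: the helper names `bytes_written` and `spill_bytes` are hypothetical, and the 4-byte element size assumes single-precision floats, which is consistent with vector\_size needing to be a multiple of 16 to fill whole 64-byte words. It is not part of the tutorial's host code or of OPAE.

```python
CACHE_LINE_BYTES = 64  # one 512-bit word, the granularity of FPGA writes to host memory


def bytes_written(vector_size: int, elem_bytes: int = 4) -> int:
    """Bytes the FPGA writes for a vector: its size rounded up to whole 64-byte words."""
    payload = vector_size * elem_bytes
    full_lines = -(-payload // CACHE_LINE_BYTES)  # ceiling division
    return full_lines * CACHE_LINE_BYTES


def spill_bytes(vector_size: int, elem_bytes: int = 4) -> int:
    """Bytes past the end of the vector that the last 64-byte write also touches."""
    return bytes_written(vector_size, elem_bytes) - vector_size * elem_bytes


if __name__ == "__main__":
    # Assumed example sizes: 16 and 128 fill whole words, 100 does not.
    for n in (16, 100, 128):
        print(n, bytes_written(n), spill_bytes(n))
```

A spill of zero corresponds to the case where the vector ends exactly on a 64-byte boundary, so nothing past the vector is overwritten. With 4-byte elements this happens only when vector\_size is a multiple of 16.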



The waveform, CCI-P transactions, and simulation log files are stored in the simulation work directory. To view the waveform database, type:

\$ make wave

[Figure: waveform viewer showing the /ase\_top signals of the simulation, including the clocks (pClkDiv4, pClkDiv2, uClk\_usr, uClk\_usrDiv2), pck\_cp2af\_softReset, the CCI-P transmit and receive structures (pck\_af2cp\_sTx, pck\_cp2af\_sRx), pck\_cp2af\_pwrState and pck\_cp2af\_error.]




#### Synthesizing the AFU

Generate the AF build environment and create the AF (.gbs) image.

- \$ afu\_synth\_setup --source hw/rtl/filelist.txt build\_synth
- \$ cd build\_synth
- \$ \$OPAE\_PLATFORM\_ROOT/bin/run.sh

When the AFU is created successfully, you get the following message:

<pre>
Info (23030): Evaluation of Tcl script ./a10_partial_reconfig/report_timing.tcl was successful
Info: Quartus Prime TimeQuest Timing Analyzer was successful. 0 errors, 189 warnings
Info: Peak virtual memory: 5885 megabytes
Info: Processing ended: Fri Sep 20 10:13:41 2019
Info: Elapsed time: 00:00:55
Info: Total CPU time (on all processors): 00:01:23
Info (19538): Reading SDC files took 00:00:11 cumulatively in this process.
Wrote hls_afu.gbs
PR AFU compilation complete
AFU gbs file is 'hls_afu.gbs'
Design meets timing
</pre>




The run.sh script reports the status of timing closure. Make sure the generated AF has no hardware timing violations.

#### Optional:

Open the dcp.qpf Quartus project file in the Quartus Prime Pro GUI, using the synthesis build project's afu\_fit revision, to view the details of the timing report and perform interactive timing analysis.

|                                       |                   |                  |            | -          |              |                            |             | -                 | -                |             |                           | _          |
|---------------------------------------|-------------------|------------------|------------|------------|--------------|----------------------------|-------------|-------------------|------------------|-------------|---------------------------|------------|
| File View Metlist Constraints Reports | Script Tools      | s <u>W</u> indow | Help       |            |              |                            |             |                   |                  |             | Search Intel FPGA         |            |
| Set Operating Conditions 🛛 🚇 🖻 🕷 📔    |                   |                  |            |            |              | Slow 900                   | mV 100C Mos |                   |                  |             |                           |            |
|                                       | Command Info      | Summary o        | f Paths    |            |              |                            |             |                   |                  |             |                           |            |
| Snapshot: final                       | Slack             |                  |            |            | From N       | ode                        |             |                   |                  | Tol         | lode                      |            |
| Slow 900mV 100C Model                 | 0.507 fpg         | ra toplinst f    | iu toplin: | st ccip fa | bric tuto    | generatedIfifo ramIram blo | ck1a2~reg1  | foga toplinst f   | iu toplinst ccir | fabric t a  | eneratedIdpfifolFIFOramIl | utrama     |
| Slow 900mV 0C Model                   | 2 0.589 fpg       | a_top inst_f     | iu_top in: | st_ccip_fa | bric_top[c1  | 6ui_cvl2xy_TxPort_T1a.C1D  | ata[401]    | fpga_top inst_f   | lu_top inst_ccip | a fabric_to | generated fifo_ram ram_bl | lock1a     |
| Fast 900mV 100C Model                 | 3 0.598 fpg       | a_top inst_f     | iu_top in: | st_ccip_fa | bric_tto_g   | enerated fifo_ram ram_bloo | :k1a15~reg1 | fpga_top inst_f   | iu_top inst_ccip | _fabricme   | _top inst_fme_csr ReqAdd  | -[2]p_rt   |
| Cost 000m3/ 0C Madel                  | 0.616 fpg         | a_top inst_f     | iu_top in: | st_ccip_fa | bric_top ins | t_fme_top inst_fme_csr Re  | Addr_q[10]  | fpga_top inst_f   | iu_top inst_ccip | _fabric_top | inst_fme_top inst_fme_csr | TxCfgR     |
| O Past sound of House                 |                   |                  |            |            |              |                            |             |                   |                  |             |                           |            |
| Report P @                            | ath #1: Setup sl  | lack is 0.507    |            |            |              |                            | Path #1:    | Setup slack is 0. | .507             |             |                           |            |
| TimeQuest Timing Analyzer Sumr        | Path Summary      | Statistics       | Data Pati  | Wavef      | orm Extra    | Fitter Information         | Path Sur    | nmary Statistic   | cs Data Path     | Waveform    | Extra Fitter Information  |            |
| 📰 Timing Delays: Final Snapshot 🛛     | Data Arrival Path | 1                |            |            |              |                            |             |                   |                  |             |                           |            |
| Advanced I/O Timing                   | Total             | Incr             | RF         | Type       | Fanout       | Location                   |             |                   |                  |             |                           |            |
| 5DC File List                         | 0.000             | 0.000            |            |            |              |                            |             |                   |                  |             | 13.7                      | 782 ns     |
| Summary (Setup)                       | 2 - 9.992         | 9.992            |            |            |              |                            | 1           | Launch            |                  |             |                           | 1          |
| Setup: u0[dcp_iopll]dcp_iopll]clk1    | 0.000             | 0.000            |            |            |              |                            | Launch C    | LOCK LINGHCH      |                  |             |                           |            |
| Slow 900mV 100C Model                 | 2 0.000           | 0.000            |            |            | 1            | PIN_AP18                   | Setup Re    | Lationship        | 5.0 ns           |             |                           |            |
| Turka 1                               | 3 0.000           | 0.000            | RR         | IC         | 1            | IOIBUF_X78_Y6_N47          | Stoop M.    | TOTO A CONSTRAINT |                  |             |                           |            |

TimeQuest Timing Analyzer - /home/perezfra/HLS/HLS\_DCP\_1.x/hls\_afu\_2019-04-30/build\_synth/build/dcp - afu\_fit

report\_timing -to\_clock { u0|dcp\_iopll|dcp\_iopll|clk1x } -setup -npaths 10 -detail full\_path -panel\_name {Setup: u0|dcp\_iopll|dcp\_iopll|clk1x}

Report Timing: Found 10 setup paths (0 violated). Worst case slack is 0.507

### Running the AFU

To run the bitstream, ensure that your host system contains an Intel FPGA PAC and that you have the Acceleration Stack (including OPAE) installed and configured.

Load the AF into the FPGA

\$ fpgaconf hls\_afu.gbs

Navigate to the hls\_afu/sw directory. Build and run the host application (<u>do not specify USE\_ASE=1</u>):

- \$ make
- \$ ./hls\_afu\_host




Terminal output (perezfra@localhost:~/HLS/HLS\_DCP\_1.x/hls\_afu\_2019-04-30/sw):

check output memory:<br>output memory OK!<br>sum: Expected 715.000000, calculated 715.000000.

The FPGA writes a full 512-bit word (64 bytes) to host memory, so if the size of your test vector (in bytes) is not a multiple of 64, the FPGA will overwrite some space at the end of output memory. fpgaPrepareBuffer() allocates your host memory in a buffer that is a multiple of 64 bytes, so the FPGA behavior will not affect your application. You should expect to see a single 0xdeadbeef at the end of the output memory if and only if the size of your test vector (determined by vector\_size, and the datatype) is a multiple of 64 bytes (that is, if vector\_size is a multiple of 16).

end of output memory after executing kernel:<br>[62] - 22.333334 (0x41b2aaab)<br>[63] - 22.666666 (0x41b55555)<br>[64] - 6259853398707798016.000000 (0xdeadbeef)<br>[65] - 0.000000 (0x0)<br>Vector size is 64 (256 bytes), so expect memory output at [64] = 0xdeadbeef<br>Finished Running Test.<br>Test PASSED




What is the Acceleration Stack for Intel<sup>®</sup> Xeon<sup>®</sup> CPU with FPGAs

- Robust collection of software, firmware, and tools
- Makes it easy to develop for and deploy Intel FPGAs in the data center
- Supports both RTL and HLS development flows
- See the Intel FPGA Acceleration Hub for more information

How to develop an AFU using HLS

- Introduction to <u>HLS</u>
- Integration of HLS component, simulation & synthesis flows
- Developing a host application and running your accelerator



