

# Targeting FPGAs with an LLVM compiler

Dmitry Denisenko

Intel Programmable Solutions Group

November 13, 2016, LLVM-HPC3 SC'16, Salt Lake City, UT



# **FPGA** Overview

# **FPGAs are Everywhere!**



Broadcasting



# Spectrum of approaches to high performance





# What's in my FPGA?

#### **DSPs**

Dedicated single-precision floating-point multipliers and accumulators

#### **Block RAMs**

 Small embedded memories that can be stitched to form an arbitrary memory system

#### Arithmetic Logic Modules

Implement arbitrary logic functions

#### Programmable Interconnect

 Programmable routing that can build arbitrary topologies







### **FPGA Hardware Design**



# Hardware Design Entry Complexity

Traditionally, these circuits are described in **Hardware Description Languages** such as VHDL or Verilog.

An incredibly detailed design must be done before a first working version is possible:

- Cycle-by-cycle behavior must be specified for every register in the design
- The complete flexibility of the FPGA means that the designer needs to specify all aspects of the hardware circuit
  - Buffering, arbitration, IP core interfacing, etc.



# Why OpenCL on FPGAs

Intel FPGA SDK for OpenCL is an LLVM-based compiler that raises the level of abstraction for FPGA design to make it accessible to more people.





# FPGAs vs CPUs

#### **FPGAs are dramatically different from CPUs**

- Massive fine-grained parallelism
- Complete configurability
- Huge internal bandwidth
- No callstack
- No dynamic memory allocation
- Very different instruction costs
- No fixed number of program registers
- No fixed memory system
- Much more flexibility with data types



# **Targeting an Architecture**

In a CPU, the program is mapped to a fixed architecture

In an FPGA, there is NO fixed architecture

The program defines the architecture





# 1. Computation in Space

# A simple 3-address CPU





# Load memory value into register





## Add two registers, store result in register





# A simple program

Mem[100] += 42 \* Mem[101]

#### CPU instructions:

R0 ← Load Mem[100]

R1 ← Load Mem[101]

R2 ← Load #42

R2 ← Mul R1, R2

R0 ← Add R2, R0

Store R0 → Mem[100]
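To make the instruction sequence concrete, here is a minimal C sketch of a toy 3-address machine executing these six instructions. The opcode names, the `Instr` layout, and the helpers `run_program` and `simple_program` are illustrative assumptions, not something taken from the slides.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical opcode set for the toy 3-address CPU. */
typedef enum { OP_LOAD_MEM, OP_LOAD_IMM, OP_MUL, OP_ADD, OP_STORE } Opcode;

/* dst/src1/src2 are register numbers; imm is an address or a constant. */
typedef struct { Opcode op; int dst, src1, src2, imm; } Instr;

enum { NUM_REGS = 4, MEM_SIZE = 128 };

/* Fetch and execute each instruction in order.
 * `mem` must hold at least MEM_SIZE words. */
void run_program(const Instr *prog, size_t n, int *mem) {
    int reg[NUM_REGS] = {0};
    for (size_t pc = 0; pc < n; pc++) {
        const Instr *in = &prog[pc];            /* the "Fetch" step */
        assert(in->dst >= 0 && in->dst < NUM_REGS);
        switch (in->op) {
        case OP_LOAD_MEM:
            assert(in->imm >= 0 && in->imm < MEM_SIZE);
            reg[in->dst] = mem[in->imm];
            break;
        case OP_LOAD_IMM:
            reg[in->dst] = in->imm;
            break;
        case OP_MUL:
            reg[in->dst] = reg[in->src1] * reg[in->src2];
            break;
        case OP_ADD:
            reg[in->dst] = reg[in->src1] + reg[in->src2];
            break;
        case OP_STORE:
            assert(in->imm >= 0 && in->imm < MEM_SIZE);
            mem[in->imm] = reg[in->src1];
            break;
        }
    }
}

/* Mem[100] += 42 * Mem[101], encoded as the six instructions above. */
void simple_program(int *mem) {
    static const Instr prog[] = {
        { OP_LOAD_MEM, 0, 0, 0, 100 },  /* R0 <- Load Mem[100] */
        { OP_LOAD_MEM, 1, 0, 0, 101 },  /* R1 <- Load Mem[101] */
        { OP_LOAD_IMM, 2, 0, 0, 42  },  /* R2 <- Load #42      */
        { OP_MUL,      2, 1, 2, 0   },  /* R2 <- Mul R1, R2    */
        { OP_ADD,      0, 2, 0, 0   },  /* R0 <- Add R2, R0    */
        { OP_STORE,    0, 0, 0, 100 },  /* Store R0 -> Mem[100] */
    };
    run_program(prog, sizeof prog / sizeof prog[0], mem);
}
```

Every instruction pays for the generic fetch and decode in `run_program`, even though the program never changes. That overhead is what the following slides remove.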












# ... and specialize by position

R0 ← Load Mem[100]



R1 ← Load Mem[101]

R2 ← Load #42

R2 ← Mul R1, R2

R0 ← Add R2, R0

Store R0 → Mem[100]

1. Instructions are fixed. Remove "Fetch"



### ... and specialize




1. Instructions are fixed. Remove "Fetch"
2. Remove unused ALU ops



### ... and specialize








1. Instructions are fixed. Remove "Fetch"
2. Remove unused ALU ops
3. Remove unused Load / Store






1. Instructions are fixed. Remove "Fetch"
2. Remove unused ALU ops
3. Remove unused Load / Store
4. Wire up registers properly! And propagate state.
5. Remove dead data.

# Optimize the Datapath



R2 ← Mul R1, R2

R0 ← Add R2, R0

Store R0 → Mem[100]

1. Instructions are fixed. Remove "Fetch"
2. Remove unused ALU ops
3. Remove unused Load / Store
4. Wire up registers properly! And propagate state.
5. Remove dead data.
6. Reschedule!
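As a rough illustration of where the specialization steps lead, the C sketch below writes the same computation as a fixed datapath: no instruction fetch, no general register file, one dedicated unit per operation. The function name `datapath` and the stage grouping in the comments are assumptions made for this sketch. It shows one plausible end result, not the compiler's actual output.

```c
#include <assert.h>

/* One pass through a datapath specialized for Mem[100] += 42 * Mem[101].
 * `mem` must hold at least 102 words. */
void datapath(int *mem) {
    assert(mem != 0);

    /* Two dedicated load units. The loads are independent, so after
     * rescheduling they can be issued in the same cycle. */
    int a = mem[100];
    int b = mem[101];

    /* Dedicated multiplier with the constant 42 wired in; the separate
     * "Load #42" step and the R2 register are no longer needed. */
    int product = 42 * b;

    /* Dedicated adder fed directly by the multiplier and the first load. */
    int sum = a + product;

    /* Dedicated store unit. */
    mem[100] = sum;
}
```

Each local variable here stands for a wire or pipeline register between two units, which is why there is no fixed number of program registers on the FPGA side.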



### Data parallel kernel













On each cycle, the portions of the datapath are processing different threads: while thread 2 is being loaded, thread 1 is being added, and thread 0 is being stored.

Figure: 8 work items, labeled by thread ID, for the vector add example.
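The occupancy pattern described here can be sketched in a few lines of C. The model assumes a three-stage datapath (load, add, store) that accepts one new thread per cycle. The stage names and the helpers `thread_in_stage` and `total_cycles` are hypothetical, and real stage latencies will differ.

```c
#include <assert.h>

enum { STAGE_LOAD = 0, STAGE_ADD = 1, STAGE_STORE = 2, NUM_STAGES = 3 };

/* ID of the thread occupying `stage` on `cycle`, or -1 if the stage is idle.
 * Thread t enters the load stage on cycle t and advances one stage per cycle. */
int thread_in_stage(int cycle, int stage, int num_threads) {
    assert(stage >= 0 && stage < NUM_STAGES);
    int tid = cycle - stage;
    return (tid >= 0 && tid < num_threads) ? tid : -1;
}

/* Cycles needed for all threads to pass through every stage. */
int total_cycles(int num_threads) {
    assert(num_threads > 0);
    return num_threads + NUM_STAGES - 1;
}
```

On cycle 2 this model places thread 2 in the load stage, thread 1 in the add stage, and thread 0 in the store stage. With 8 work items the pipeline drains after 10 cycles, rather than the 24 it would take if each thread ran start to finish on its own.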









# **Compiler Flow**






# Example Compiler Optimizations

# **Branch Conversion**

Control flow is expensive.

Instead, <u>execute both sides</u> of a branch, pick the result for the "true" path, and <u>predicate</u> commands that have side-effects.

If a function has no loops, the whole function loses all branches.

Loops lose all internal branches.

X\_temp = X + 2; X = cond ? X\_temp : W;

The ?: operator is a mux (selector) in hardware.

array[z] = Y only if cond

A single IR instruction stores only if the condition is true: "cond" is a predicate on the store unit. This requires the store IR instruction to accept a predicate.
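The C sketch below contrasts a branchy function with a branch-converted equivalent. The original `if`/`else` is a reconstruction inferred from the converted form on the slide, so treat `with_branch` as an assumption. `predicated_store` is a hypothetical stand-in for a store unit with a predicate input.

```c
#include <assert.h>
#include <stdbool.h>

/* Assumed original: control flow with a side effect on one path. */
int with_branch(bool cond, int x, int w, int y, int *array, int z) {
    assert(array != 0);
    if (cond) {
        x = x + 2;
        array[z] = y;
    } else {
        x = w;
    }
    return x;
}

/* Stand-in for a store unit that takes a predicate. In hardware the
 * predicate is an enable signal on the unit, not a branch. */
void predicated_store(bool pred, int *addr, int value) {
    if (pred) {
        *addr = value;
    }
}

/* Branch-converted form: both sides are computed, a mux picks the result,
 * and the side effect is predicated. */
int branch_converted(bool cond, int x, int w, int y, int *array, int z) {
    assert(array != 0);
    int x_temp = x + 2;                     /* executed unconditionally */
    x = cond ? x_temp : w;                  /* ?: becomes a mux */
    predicated_store(cond, &array[z], y);   /* store only if cond */
    return x;
}
```

Both functions return the same value and leave `array` in the same state for either value of `cond`. The converted one has a single straight-line body that maps directly onto a pipeline.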

# Local memory address space splitting

FPGAs lack the seamless memory hierarchy that CPUs have.

We use memory <u>address spaces</u> to distinguish different memory locations: on-chip, off-chip, and special types of off-chip memory (e.g. constant, QDR, HMC).

On-chip (aka local memory) is further split into multiple address spaces based on access patterns for much better implementation efficiency:





The split is possible only if the compiler can prove that pointers to a[] and b[] never mix.



# **Optimizing Bit Swizzling**

Bit swizzling with a pattern known at compile time (e.g. bit reversal) is free on an FPGA, because it is only a rewiring of signals.

Without optimization, the IR for such a swizzle is a very expensive tree of ORs and ANDs.

The compiler detects such an IR tree and turns it into a single shufflevector instruction.
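To illustrate the idea, the C sketch below writes an 8-bit reversal two ways: as the usual mask, shift, and OR sequence, and as a fixed permutation table. The table plays a role loosely analogous to a shuffle mask. Both functions and the table layout are assumptions for illustration and do not show the compiler's actual pattern matcher.

```c
#include <assert.h>
#include <stdint.h>

/* Bit reversal as software usually writes it: a tree of ANDs, shifts, ORs. */
uint8_t reverse_bits_logic(uint8_t x) {
    x = (uint8_t)(((x & 0xF0u) >> 4) | ((x & 0x0Fu) << 4)); /* swap nibbles */
    x = (uint8_t)(((x & 0xCCu) >> 2) | ((x & 0x33u) << 2)); /* swap pairs   */
    x = (uint8_t)(((x & 0xAAu) >> 1) | ((x & 0x55u) << 1)); /* swap bits    */
    return x;
}

/* The same kind of operation as a fixed permutation:
 * output bit i is taken from input bit perm[i].
 * With a compile-time perm[], this is pure wiring on an FPGA. */
uint8_t permute_bits(uint8_t x, const int perm[8]) {
    uint8_t out = 0;
    for (int i = 0; i < 8; i++) {
        assert(perm[i] >= 0 && perm[i] < 8);
        out = (uint8_t)(out | (((x >> perm[i]) & 1u) << i));
    }
    return out;
}
```

Calling `permute_bits` with the table `{7, 6, 5, 4, 3, 2, 1, 0}` gives the same result as `reverse_bits_logic` for every input. That equivalence is what allows a recognized AND/OR tree to be replaced by a single permutation.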





# 3. Loop Pipelining

### **Data-Parallel Execution**

On the FPGA, we use pipeline parallelism to achieve acceleration





Threads execute in an embarrassingly parallel manner.

Ideally, all parts of the pipeline are active at the same time.




### **Data-Parallel Execution - drawbacks**

Difficult to express programs which have partial dependencies during execution





Expressing this would require complicated hardware and new language semantics to describe the desired behavior.



### Loop-pipelining

Allow users to express programs as a single thread:

```
for (int i=1; i < n; i++) {
    c[i] = c[i-1] + b[i];
}
```

Pipeline parallelism still leveraged to efficiently execute loops via **loop pipelining** – multiple loop iterations are executed concurrently.



## Loop Pipelining Example

### No Loop Pipelining

No overlap of iterations!

### With Loop Pipelining

Finishes faster because iterations are overlapped. Looks almost like multithreaded execution!

Loop Pipelining enables Pipeline Parallelism AND the communication of state information between iterations.




### **Loop-Carried Dependencies**

Loop-carried dependencies are dependencies where one iteration of the loop depends upon the results of another iteration of the loop

```
kernel void state_machine(ulong n)
{
  t_state_vector state = initial_state();
  for (ulong i=0; i<n; i++) {
    // Loop-carried: reads the state written by the previous iteration.
    state = next_state( state );
    uint y = process( state );
    write_output(y);
  }
}
```

The variable state in iteration 1 depends on the value from iteration 0.


### **Loop-Carried Dependencies**

To achieve acceleration, we pipeline each iteration of a loop with loop-carried dependencies

- Analyze any dependencies between iterations
- Schedule these operations
- Launch the next iteration as soon as the critical dependency is calculated
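A deliberately simplified model of the last step: if the loop-carried value is first read `needed_cycle` cycles into an iteration and becomes available `ready_cycle` cycles into it, the next iteration can be launched once the gap between the two has passed. The function names and this two-number model are assumptions for illustration. A real scheduler considers every dependency and resource constraint.

```c
#include <assert.h>

/* Smallest launch spacing (in cycles) between consecutive iterations so that
 * the carried value produced at offset `ready_cycle` in iteration i is
 * available when iteration i+1 reads it at offset `needed_cycle`. */
int initiation_interval(int needed_cycle, int ready_cycle) {
    assert(needed_cycle >= 0 && ready_cycle > needed_cycle);
    return ready_cycle - needed_cycle;
}

/* Cycle on which iteration `iter` is launched, with iteration 0 at cycle 0. */
long launch_cycle(long iter, int ii) {
    assert(iter >= 0 && ii >= 1);
    return iter * ii;
}
```

For example, a value read at offset 2 and ready at offset 3 allows a new iteration every cycle. A value read at offset 0 that is only ready at offset 70 forces 70 cycles between launches.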





### Trouble with Loop-Carried Dependencies

Many things can go wrong with loop pipelining:

- Loop-carried dependency takes too long to compute.
- Loop iterations may get out of order.

Consequences of having a loop-carried dependency are severe:

- If a dependency on a global memory location is introduced, the loop initiation interval can go from 1 to ~70.
  - That's a 70x drop in performance!
- The compiler has to be good at analyzing and reporting these dependencies!
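The effect of the initiation interval on run time can be estimated with a standard back-of-the-envelope formula: the first iteration takes the full pipeline latency, and each later iteration completes one initiation interval after the previous one. The function names and the latency of 100 cycles used in the note below are made-up illustration values.

```c
#include <assert.h>

/* Approximate cycles for n pipelined iterations:
 * latency of one iteration plus (n - 1) launches spaced `ii` cycles apart. */
long pipelined_cycles(long n, long latency, long ii) {
    assert(n > 0 && latency > 0 && ii > 0);
    return latency + (n - 1) * ii;
}

/* Without pipelining, each iteration waits for the previous one to finish. */
long unpipelined_cycles(long n, long latency) {
    assert(n > 0 && latency > 0);
    return n * latency;
}
```

With 1000 iterations and a latency of 100 cycles, an initiation interval of 1 gives 1099 cycles and an initiation interval of 70 gives 70030 cycles. The ratio approaches 70x as the iteration count grows.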





## LLVM: Benefits & Challenges



### LLVM is awesome!




## Challenges

Our instruction costs are wildly different from CPUs.

Well-formed loops are extremely important to us, but our ideal loop form is not the same as for a CPU:

- We never want loops replicated or put inside a condition.
- There is often no point in hoisting if there is no dependency.

Need many custom intrinsics to better model our hardware:

- Load/store units with additional arguments: predicates, byte-enables, dependencies.
- Channels to express communication between parallel tasks.

Have to use debug data to carry additional information:

- Have "styles" of load/store units and even multipliers (high throughput, low area).

Can't express instruction-level, block-level, and task-level parallelism:

- We only decide on this in the backend, where it is late or very expensive to do optimizations.







# Thank You