PGI Fortran, C and C++ compilers have been widely used on Linux/x86 HPC systems and clusters for nearly 20 years, and more recently on heterogeneous GPU-enabled platforms and OpenPOWER CPUs. Over the past 3 years, these compilers have been redesigned to integrate LLVM for low-level optimization and code generation on both GPUs and multicore CPUs, retaining the PGI optimizer for high-level optimizations, vectorization, parallelization and GPU offloading. This work has evolved to include a joint project with the US DOE to create an open source Fortran language front-end designed for integration with LLVM. This talk will cover the current status and capabilities of the PGI+LLVM compilers, our experiences integrating LLVM into a production HPC compiler infrastructure, and plans and priorities for future development.
Programming today's increasingly complex heterogeneous hardware is difficult, as it commonly requires the use of data-parallel languages, pragma annotations, specialized libraries, or DSL compilers. Adding explicit accelerator support to a larger code base is not only costly, but also introduces additional complexity that hinders long-term maintenance. We propose a new heterogeneous compiler that brings us closer to the dream of automatic accelerator mapping. Starting from a sequential compiler IR, we automatically generate a hybrid executable that, in combination with a new data management system, transparently offloads suitable code regions. Our approach is almost regression-free for a wide range of applications while improving a range of compute kernels as well as two full SPEC CPU applications. We expect our work to reduce the initial cost of accelerator usage and to free developer time to investigate algorithmic changes.
An overview of the ways that LLVM is used in ARM's products. Covering:
Hardware high-level synthesis targeting Field Programmable Gate Arrays (FPGAs) with an LLVM compiler presents many unique challenges. In this talk we will give a brief introduction to FPGAs and how their compute model differs from the classical von-Neumann architecture of CPUs and GPUs. Then we will describe how LLVM IR instructions are mapped to hardware and scheduled to clock cycles, give examples of some FPGA-specific compiler optimizations, and present some of the challenges in adapting the LLVM compiler to our needs.
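To make "scheduled to clock cycles" concrete, the following is a minimal sketch of as-soon-as-possible (ASAP) scheduling over a small dependence graph. ASAP is a standard textbook technique used here only for illustration; it is not claimed to be the scheduler presented in the talk. The instruction names and latencies are hypothetical.

```python
from typing import Dict, List


def asap_schedule(
    deps: Dict[str, List[str]],
    latency: Dict[str, int],
) -> Dict[str, int]:
    """Return the earliest start cycle of every instruction.

    deps maps an instruction to the instructions whose results it consumes.
    latency maps an instruction to the cycles it needs to produce a result.
    The dependence graph is assumed to be acyclic.
    """
    start: Dict[str, int] = {}

    def visit(instr: str) -> int:
        if instr not in start:
            # An instruction may start once all of its operands are ready.
            start[instr] = max(
                (visit(op) + latency[op] for op in deps.get(instr, [])),
                default=0,
            )
        return start[instr]

    for instr in latency:
        visit(instr)
    return start


# Hypothetical fragment: sum = (a * b) + (c * d)
deps = {
    "mul1": ["load_a", "load_b"],
    "mul2": ["load_c", "load_d"],
    "add": ["mul1", "mul2"],
}
latency = {
    "load_a": 2, "load_b": 2, "load_c": 2, "load_d": 2,
    "mul1": 3, "mul2": 3, "add": 1,
}
schedule = asap_schedule(deps, latency)
```

In this sketch the four loads and the two multiplies each start in the same cycle, which reflects the spatial parallelism that distinguishes an FPGA datapath from a sequential instruction stream.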
There is increasing interest in using FPGA-based accelerators in high-performance computing, as FPGA-based accelerators have shown better performance and energy efficiency than general-purpose processors like CPUs and GPGPUs, without sacrificing flexibility.
However, users with an HPC background still face difficulty when they try to program FPGAs as compute devices in HPC, even with the help of existing high-level synthesis solutions.
In this talk, I will discuss how Xilinx, a leading FPGA company, plans to reduce the difficulty of programming FPGAs for HPC users with its OpenCL compiler and enable FPGA programming for the masses.
Unified Parallel C (UPC) is an extension of the C programming language designed for high-performance computing on large-scale parallel machines, including those with a common global address space (SMP and NUMA) and those with distributed memory (e.g. clusters). The UPC language expresses parallelism by extending ISO C 99 with constructs to explicitly specify a parallel execution model, shared address space, synchronization primitives and memory consistency model, memory management, and explicit communication primitives.
The Clang UPC2C tool is a Clang-based translator that converts source code written in UPC into C source code. UPC2C leverages the previously developed Clang UPC front end and uses Clang's TreeTransform capability to convert UPC-specific constructs into plain C, targeting the Berkeley UPC runtime. While this transformation works well, a few challenges were encountered during the course of development.
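As a small illustration of the shared address space mentioned above: UPC distributes a shared array declared with block size B round-robin across threads in blocks of B elements, so element i has affinity to thread (i / B) mod THREADS. The sketch below models only that index arithmetic; the function names are hypothetical and this is not code produced by UPC2C.

```python
from typing import List


def upc_affinity(index: int, block_size: int, threads: int) -> int:
    """Thread that owns element `index` of a blocked shared array."""
    return (index // block_size) % threads


def upc_layout(n: int, block_size: int, threads: int) -> List[List[int]]:
    """For each thread, list the array indices it has affinity to."""
    owned: List[List[int]] = [[] for _ in range(threads)]
    for i in range(n):
        owned[upc_affinity(i, block_size, threads)].append(i)
    return owned


# Models `shared [2] int a[8]` on 2 threads.
layout = upc_layout(n=8, block_size=2, threads=2)
```

A translator to plain C has to turn every access to such an array into this kind of owner and offset computation, followed by a local access or a runtime communication call.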
OpenMP 4.5 allows performance portability by enabling users to write a single application code and run it on multiple types of accelerators. Our goal is to deliver a high-performance implementation of OpenMP in the Clang/LLVM project. This paper describes our initial work to fully support code generation for OpenMP device offloading constructs. We describe a new driver implementation to handle compilation for multiple host and device types, which generalizes the current Clang CUDA implementation and supports OpenMP. It can also be extended to any offloading-based language, including OpenCL and OpenACC. We describe an implementation of the OpenMP offloading constructs in the runtime library, giving details on two critical aspects: first, how data mapping is implemented; second, how different device code sections in the binaries are handled to enable application execution on different devices without recompilation. We report initial performance on a prototype that extends current LLVM trunk repositories with all our proposed patches plus future ones, showing near-CUDA performance of our solution.
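The data-mapping aspect can be pictured with a toy model of a device data environment. The sketch below assumes the reference-counting behaviour of OpenMP 4.5 `map` clauses: data is copied to the device only when first mapped, and copied back only when the last enclosing region exits. The class and method names are hypothetical and do not reflect the data structures of the runtime library described in the paper.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Entry:
    device_copy: List[float]
    ref_count: int


class DeviceDataEnv:
    """Toy host-to-device mapping table with reference counts."""

    def __init__(self) -> None:
        self.table: Dict[int, Entry] = {}
        self.transfers: List[str] = []  # log of simulated copies

    def map_enter(self, host: List[float], to: bool) -> List[float]:
        key = id(host)  # stands in for the host base address
        entry = self.table.get(key)
        if entry is None:
            # First mapping: allocate device storage, copy if `to`.
            entry = Entry(device_copy=[0.0] * len(host), ref_count=0)
            if to:
                entry.device_copy[:] = host
                self.transfers.append("host->device")
            self.table[key] = entry
        entry.ref_count += 1
        return entry.device_copy

    def map_exit(self, host: List[float], from_: bool) -> None:
        entry = self.table[id(host)]
        entry.ref_count -= 1
        if entry.ref_count == 0:
            # Last region exited: copy back if `from`, then release.
            if from_:
                host[:] = entry.device_copy
                self.transfers.append("device->host")
            del self.table[id(host)]


env = DeviceDataEnv()
x = [1.0, 2.0, 3.0]
outer = env.map_enter(x, to=True)  # like an enclosing map(tofrom: x)
inner = env.map_enter(x, to=True)  # nested region: already present
for i in range(len(inner)):
    inner[i] *= 2.0                # simulated device computation
env.map_exit(x, from_=True)        # count 2 -> 1, no copy back yet
env.map_exit(x, from_=True)        # count 1 -> 0, copy back and free
```

Only two transfers are logged even though the array is mapped twice, which is the property that makes nested data regions cheap.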
In this paper, we introduce a new LLVM analysis, called Bandwidth-Critical Data Analysis (BCDA), to decide when it is beneficial to allocate data in High-Bandwidth Memory (HBM) and then transform allocation calls into specific HBM allocation calls, for increased performance in parallel systems. HBM is a new memory technology that features stacked 3D chips on processor dies.
LLVM, the well-known SSA-based compilation infrastructure for sequential and parallel languages, is used to detect frequently used data and patterns of memory accesses in order to decide at which level to allocate the data: HBM or DDR. The core BCDA analysis counts the number of data uses and detects irregular and simultaneous accesses, generating a priority value for every variable. Using this priority value, LLVM generates memkind_alloc function calls that transform mallocs into HBM allocations if HBM is present and enough HBM capacity is available.
As a use case for validating our approach, we show how the Conjugate Gradient (CG) benchmark from the NAS Parallel suite can be optimized for the use of MCDRAM, as the HBM on the Knights Landing Xeon Phi processors is called. We implement BCDA in the LLVM compiler and apply it to CG to detect when it is beneficial to allocate data in the HBM. Then, we allocate the data in the MCDRAM using hbwmalloc library calls. Using the priority generated by BCDA, we achieved a 2.29x performance improvement using the LLVM compiler and 2.33x using Intel's compiler compared to the DDR version of CG.
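The priority-driven placement described above can be sketched as follows. The abstract states only that BCDA counts uses and detects irregular and simultaneous accesses; the scoring formula, the weights, the greedy placement, and the variable names and sizes below are all assumptions made for illustration, not the published analysis.

```python
from dataclasses import dataclass
from typing import List, Set


@dataclass
class VarInfo:
    name: str
    size_bytes: int
    uses: int            # number of uses counted by the analysis
    irregular: bool      # accessed with an irregular pattern
    simultaneous: bool   # accessed together with other data


def priority(
    v: VarInfo,
    irregular_weight: float = 2.0,     # hypothetical weight
    simultaneous_weight: float = 1.5,  # hypothetical weight
) -> float:
    """Combine the three signals into one priority value."""
    score = float(v.uses)
    if v.irregular:
        score *= irregular_weight
    if v.simultaneous:
        score *= simultaneous_weight
    return score


def choose_hbm(variables: List[VarInfo], hbm_capacity: int) -> Set[str]:
    """Greedily place the highest-priority variables that still fit."""
    chosen: Set[str] = set()
    remaining = hbm_capacity
    for v in sorted(variables, key=priority, reverse=True):
        if v.size_bytes <= remaining:
            chosen.add(v.name)
            remaining -= v.size_bytes
    return chosen


candidates = [
    VarInfo("matrix", 600, uses=100, irregular=False, simultaneous=True),
    VarInfo("index", 300, uses=100, irregular=True, simultaneous=True),
    VarInfo("vec", 200, uses=50, irregular=True, simultaneous=False),
    VarInfo("tmp", 100, uses=10, irregular=False, simultaneous=False),
]
in_hbm = choose_hbm(candidates, hbm_capacity=600)
```

In a real compiler pass, the allocation calls for the selected variables would then be rewritten to HBM allocation calls, with the rest left in DDR.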
LLVM has become an integral part of the software-development ecosystem for developing advanced compilers, high-performance computing software and tools. This paper presents a small set of LLVM IR extensions for explicitly parallel, vector, and offloading program constructs. The proposed LLVM IR extensions enable lowering and transformation in the LLVM middle-end for the OpenMP C/C++ and Fortran API, and for any other explicitly parallel/simd constructs in high-level source languages. This paper discusses the rationale of the LLVM IR extensions to support OpenMP constructs and clauses, and presents the LLVM intrinsic functions, the framework for parallelization, vectorization, and offloading, and the sandwich scheme to model the OpenMP parallel, simd, offloading and data-attribute semantics under the SSA form. Examples are given to show our implementation in the LLVM middle-end passes, which paves the way to better interaction with scalar optimizations, vectorization, and loop optimizations, and thus to higher performance.
The LLVM intermediate representation (IR) lacks semantic constructs for denoting common high-performance operations such as parallel and concurrent execution, communication, and synchronization. Currently, representing such semantics in LLVM requires either extending the intermediate form (a significant undertaking) or the use of ad hoc indirect means such as encoding them as intrinsics and/or the use of metadata constructs. In this paper we discuss a work in progress to explore the design and implementation of a new compilation stage and associated high-level intermediate form that is situated between the abstract syntax tree and LLVM's IR. This high-level representation is a superset of LLVM IR and supports the direct representation of these common parallel computing constructs, together with the infrastructure for supporting analysis and transformation passes on this representation.