#HSA Support for Chapel#

The goal of this project is to integrate GPU support as a language feature in Chapel, so that Chapel programmers can offload some of the parallel constructs already provided by the language to the GPU with minimal extra effort. GPU support is integrated into Chapel through the HSA (Heterogeneous System Architecture) framework.
##Related Work##

###GPU Support in Chapel###

While there has been previous work investigating adding GPU support to Chapel, that work was CUDA-only and not elegantly integrated into the language (Sidelnik, IPDPS 2012). In fact, that work is not currently part of the mainline Chapel branch and, from discussions with the Cray team, their previous efforts to merge it have proven too difficult because the codebase has diverged too much in the meantime. In addition, that effort relied on creating a new GPU domain map, which the Cray team felt was a somewhat unwieldy way to integrate GPUs. The previous CUDA work has nevertheless been helpful, as we have used it as the basis for some of this effort.
Our original approach was to use Chapel's ability to call externally defined C functions in object files via the "extern" keyword. This allowed us to launch GPU kernels from Chapel by calling functions defined in external object files, which were built from OpenCL kernel code using our SNACK (Simple No API Compiled Kernels) framework. This method is not very programmer friendly and does not integrate well with the language, as the sketch below suggests.
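A hedged sketch of that pattern (the entry-point name and signature are hypothetical stand-ins for a SNACK-generated function):

```
// Declare the C entry point exported by the SNACK-built object file.
extern proc snack_sum_reduce(A: [] int, n: int): int;

const n = 1024;
var A: [1..n] int = 1;
// Calling the extern proc launches the precompiled OpenCL kernel.
var total = snack_sum_reduce(A, n);
```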
Our collaborations with Cray have led us to believe that using a hierarchical locale would be the cleanest method for HSA integration.

##Current Approach##

Our current approach is to expose GPU execution natively in Chapel through GPU sublocales within hierarchical Chapel locales. Locales are Chapel's concept for describing the underlying system architecture on which a Chapel program runs. Originally, Chapel's locale model was a simple collection of one or more homogeneous processors connected to memory; these locales were then connected by a communication network to describe the cluster executing the application. While this model was easy to understand and develop for, it did not necessarily map well to real hardware architectures. With the Chapel 1.8 release in October 2013, the Chapel team introduced "hierarchical locales", which greatly increased the ability to describe underlying architectures. Thus far, the concept has mainly been used to describe a non-uniform memory access (NUMA) model, where sublocales are NUMA domains with different memory access times. Locales are used in conjunction with Chapel's "on" syntax to direct tasks to a particular part of the system; originally, this meant initiating work on a specific node within the cluster:
```
on Locales[1] do {
  ...;
}
```
With the new hierarchical NUMA locale, a sublocale (referenced through childDomain) specifies which NUMA domain within the node should be used:
```
on Locales[1].childDomain[2] do {
  ...;
}
```
We have added a new hsa hierarchical locale with two sublocales, CPU and GPU. The GPU sublocale can then be specified to execute code on the GPU:
```
on (Locales[0]:LocaleModel).GPU do {
  ...;
}
```
This modification of the “on” syntax and creation of the HSA-aware hierarchical locale module with automatic device-discovery form the majority of the runtime component for Chapel-HSA integration.
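As a minimal sketch of what this enables (assuming the CPU sublocale is addressed analogously to the GPU sublocale shown above):

```
// Steer a task to each sublocale of the hsa locale model.
on (Locales[0]:LocaleModel).CPU do writeln("runs on the host cores");
on (Locales[0]:LocaleModel).GPU do writeln("runs on the GPU");
```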
The next step is to drive the creation of the code that will actually execute on the GPU. Ideally, the Chapel compiler should be able to take unmodified Chapel code snippets and emit binaries suitable for GPU execution. Since translating general Chapel code to run on the GPU is difficult, we have decided to start by providing automatic GPU acceleration for a single Chapel construct -- the reduction -- so that we can learn from the process and determine how best to generalize it.
Chapel provides a number of predefined reduction operators that reduce aggregate expressions (such as arrays, sequences, or any iterable expression) to a single result value. Chapel also provides a mechanism for user-defined reductions, where the programmer implements custom operators. Reductions are widely used in data-parallel applications and, because of the associative and commutative nature of their operators, can often be computed using parallel reduction trees. This makes reductions an interesting candidate for GPU acceleration, and researchers have developed optimized reduction kernels for GPUs.
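For instance, a few of the predefined reductions in ordinary (CPU-side) Chapel code (the values here are arbitrary):

```
var A: [1..5] int = (3, 1, 4, 1, 5);

var total   = + reduce A;    // 14
var largest = max reduce A;  // 5
var product = * reduce A;    // 60
```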
As part of our effort to integrate HSA support within Chapel, we have modified the Chapel compiler and runtime to execute the predefined reductions over 1D Chapel arrays on the GPU whenever the reduction is executed on a GPU sublocale. For example, the following code snippet will be executed on a GPU:
```
on Locales[0] do {
  var A: [1..3] int = (1,2,3);
  on (Locales[0]:LocaleModel).GPU do {
    var sum = + reduce A;
  }
}
```
Currently, the predefined reductions are implemented through Chapel classes defined in the internal module ChapelReduce. Each predefined operator is backed by a separate class that implements a specific interface. The Chapel parser replaces a reduction expression with a call to a new function that returns the result of the reduction; it also introduces this new function, which implements the reduction by repeatedly calling the methods of the backing operator class in a loop (a conceptual sketch of this mechanism appears at the end of this section).

For GPU acceleration of reductions, we store a pre-compiled OpenCL kernel for every combination of reduction operator and data type that Chapel supports. We implement a C function inside the Chapel runtime that constructs the AQL packets and enqueues the appropriate kernel for the operator and data type of the reduction; the function also waits for the kernel to complete and returns the final result. We then modify the Chapel compiler so that the parser emits a call to this C function whenever the reduction executes on a GPU sublocale. Note that the C function has access to the GPU device, the HSA command queue, and the symbols for the predefined kernels, since GPU device discovery and HSA initialization happen when the GPU sublocale is initialized, before any reduction executes.

The HSA framework provides a shared address-space abstraction to the application, so data allocated on the CPU can be accessed from the GPU without explicit data-movement operations. The reduction executing on the GPU sublocale can therefore directly refer to and access the array A declared in the parent locale.

With Chapel reductions correctly executing on the GPU, we will now move on to integrating a more general framework for GPU code within Chapel. We expect to use the same hierarchical locale concept, specifying GPUs through the on-syntax.
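To make the operator-class mechanism above concrete, here is a conceptual sketch of how a predefined sum reduction might be backed by a class and a compiler-introduced function. All names are illustrative; the actual ChapelReduce classes and the compiler-generated symbols differ:

```
// Illustrative stand-in for a ChapelReduce operator class.
class SumOp {
  type eltType;
  var value: eltType;               // running accumulation state

  proc accumulate(x: eltType) { value += x; }
  proc generate(): eltType    { return value; }
}

// Roughly what the parser introduces for `var sum = + reduce A;`.
proc chpl__reduceSum(A: [] int): int {
  var op = new SumOp(int);
  for a in A do
    op.accumulate(a);               // repeated calls into the operator class
  const result = op.generate();
  delete op;
  return result;
}
```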