OpenCL Device Vector Performance Parameters

Jay edited this page Mar 5, 2014 · 1 revision

This page discusses the performance of bolt::cl::device_vector in a specific scenario: when the location of the buffer does not match the device that executes the algorithm.

Device Vector is a container designed to encapsulate an OpenCL buffer on the GPU. The performance of device_vector varies depending on the type and location of the buffer. For instance, it is advisable not to use a CL_MEM_READ_WRITE buffer (which resides in device memory) if the program ultimately executes on the host. In the examples below, we assume that the system consists of a host CPU and a discrete GPU device with OpenCL.

#include <bolt/cl/transform.h>
#include <bolt/cl/device_vector.h>
#include <bolt/cl/control.h>
#include <bolt/cl/functional.h>  // for bolt::cl::plus

...
// Calls device_vector constructor with default flag: CL_MEM_READ_WRITE
bolt::cl::device_vector<int> dv_input( hv_input.begin(),   
                                       hv_input.end() );

bolt::cl::device_vector<int> dv_output( hv_output.begin(),   
                                        hv_output.end() );

// Create a control structure
bolt::cl::control ctl;

// Force to run on GPU device
ctl.setForceRunMode(bolt::cl::control::OpenCL);

// plus<int> is a binary functor, so the binary transform overload is
// used: the second input range (dv_input again) adds the vector to itself
bolt::cl::transform( ctl,
                     dv_input.begin(),
                     dv_input.end(),
                     dv_input.begin(),
                     dv_output.begin(),
                     bolt::cl::plus<int>() );
...

The code snippet above demonstrates the intended usage of device_vector: the buffer resides on the GPU and the algorithm runs on the GPU.

#include <bolt/cl/transform.h>
#include <bolt/cl/device_vector.h>
#include <bolt/cl/control.h>
#include <bolt/cl/functional.h>  // for bolt::cl::plus

...

// Calls device_vector constructor with default flag: CL_MEM_READ_WRITE
bolt::cl::device_vector<int> dv_input( hv_input.begin(),
                                       hv_input.end() );


bolt::cl::device_vector<int> dv_output( hv_output.begin(),
                                        hv_output.end() );

// Create a control structure
bolt::cl::control ctl;

// Force to run on the multicore host CPU (TBB path)
ctl.setForceRunMode(bolt::cl::control::MultiCoreCpu);

// plus<int> is a binary functor, so the binary transform overload
// (two input ranges) is used
bolt::cl::transform( ctl,
                     dv_input.begin(),
                     dv_input.end(),
                     dv_input.begin(),
                     dv_output.begin(),
                     bolt::cl::plus<int>() );
...

In the code snippet above, an OpenCL buffer with the CL_MEM_READ_WRITE flag is created in GPU memory, but transform takes the TBB (multicore CPU) path as directed by ctl. This imposes an additional job on the system: copying the buffer from GPU memory back to the host. To avoid this performance hit, it is recommended to either use a host container such as std::vector, or construct the device_vector with the CL_MEM_USE_HOST_PTR flag so that the buffer resides in host memory.

// Creating device_vector with CL_MEM_USE_HOST_PTR flag
bolt::cl::device_vector<int> dv_input( hv_input.begin(),
                                       hv_input.end(),
                                       CL_MEM_USE_HOST_PTR );

Note that if a host vector such as std::vector is passed directly to a Bolt algorithm, a corresponding device_vector is created internally with the CL_MEM_USE_HOST_PTR flag.
