OpenCL Device Vector Performance Parameters
This page discusses the performance of bolt::cl::device_vector in one special case: the buffer lives in GPU memory while the algorithm runs on the host.
Device Vector is a container designed to encapsulate an OpenCL buffer on the GPU. The performance of device_vector depends on the buffer's memory flags and on where the buffer resides. For instance, avoid a default CL_MEM_READ_WRITE buffer if the algorithm ends up executing on the host. The examples assume a system with a host CPU and a discrete GPU that supports OpenCL.
#include <bolt/cl/transform.h>
#include <bolt/cl/device_vector.h>
#include <bolt/cl/control.h>
...
// Calls the device_vector constructor with the default flag: CL_MEM_READ_WRITE
bolt::cl::device_vector<int> dv_input( hv_input.begin(),
                                       hv_input.end() );
bolt::cl::device_vector<int> dv_output( hv_output.begin(),
                                        hv_output.end() );
// Create a control structure
bolt::cl::control ctl;
// Force the algorithm to run on the OpenCL (GPU) device
ctl.setForceRunMode(bolt::cl::control::OpenCL);
// A single-input transform takes a unary functor, so negate is used here
// (plus is a binary functor and needs a second input range)
bolt::cl::transform( ctl,
                     dv_input.begin(),
                     dv_input.end(),
                     dv_output.begin(),
                     bolt::cl::negate<int>() );
...
The code snippet above demonstrates the correct usage of device_vector: the buffer resides on the GPU and the code runs on the GPU.
#include <bolt/cl/transform.h>
#include <bolt/cl/device_vector.h>
#include <bolt/cl/control.h>
...
// Calls the device_vector constructor with the default flag: CL_MEM_READ_WRITE
bolt::cl::device_vector<int> dv_input( hv_input.begin(),
                                       hv_input.end() );
bolt::cl::device_vector<int> dv_output( hv_output.begin(),
                                        hv_output.end() );
// Create a control structure
bolt::cl::control ctl;
// Force the algorithm to run on the multicore host CPU
ctl.setForceRunMode(bolt::cl::control::MultiCoreCpu);
// A single-input transform takes a unary functor, so negate is used here
// (plus is a binary functor and needs a second input range)
bolt::cl::transform( ctl,
                     dv_input.begin(),
                     dv_input.end(),
                     dv_output.begin(),
                     bolt::cl::negate<int>() );
...
In the code snippet above, an OpenCL buffer with the CL_MEM_READ_WRITE flag is created on the GPU. Because ctl forces the multicore run mode, transform takes the TBB path on the host, which gives the system an extra job: copying the buffer from GPU memory back to the host. To avoid this performance hit, either use a host vector such as std::vector, or create the device_vector with the CL_MEM_USE_HOST_PTR flag so that the buffer resides in host memory.
// Create a device_vector with the CL_MEM_USE_HOST_PTR flag
bolt::cl::device_vector<int> dv_input( hv_input.begin(),
                                       hv_input.end(),
                                       CL_MEM_USE_HOST_PTR );
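The cost of a mismatch between buffer location and run mode can be illustrated with a small toy model. The sketch below does not use Bolt or OpenCL: Residency, ToyBuffer, TransferLog, host_transform and simulated_traffic are hypothetical names introduced only for illustration. It assumes that a device-resident buffer is copied in full once so the host can read it, and once more to publish the result; real transfer behavior depends on the driver and hardware.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Toy stand-ins (NOT Bolt or OpenCL types); they only count simulated bus traffic.
enum class Residency { DeviceMemory, HostMemory };

struct ToyBuffer {
    Residency residency;       // where the buffer is assumed to live
    std::vector<int> storage;  // stands in for the buffer contents
};

struct TransferLog {
    std::size_t elements_moved = 0;  // elements that crossed the host/device bus
};

// Models an algorithm that is forced to run on the host CPU.
template <typename UnaryOp>
void host_transform(const ToyBuffer& in, ToyBuffer& out, TransferLog& log, UnaryOp op) {
    if (in.residency == Residency::DeviceMemory)
        log.elements_moved += in.storage.size();   // stage input: device -> host
    out.storage.resize(in.storage.size());
    std::transform(in.storage.begin(), in.storage.end(), out.storage.begin(), op);
    if (out.residency == Residency::DeviceMemory)
        log.elements_moved += out.storage.size();  // publish result: host -> device
}

// Simulated traffic for one host-side transform over n elements.
std::size_t simulated_traffic(Residency where, std::size_t n) {
    ToyBuffer in{where, std::vector<int>(n, 1)};
    ToyBuffer out{where, std::vector<int>(n, 0)};
    TransferLog log;
    host_transform(in, out, log, [](int x) { return -x; });
    return log.elements_moved;
}
```

Under these assumptions, simulated_traffic(Residency::DeviceMemory, n) reports 2n elements moved, while simulated_traffic(Residency::HostMemory, n) reports none. That difference is the overhead the recommendation above is meant to avoid.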
Note that if a host vector such as std::vector is passed to any algorithm, a corresponding device_vector is created with the CL_MEM_USE_HOST_PTR flag.
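The idea behind that note can be sketched in plain C++: wrapping host storage in a non-owning view means the algorithm works on the caller's memory rather than on a separately allocated copy. HostBackedView and toy_transform below are hypothetical names, not part of Bolt, and the sketch only models the "no separate copy" aspect; how Bolt wraps the vector internally is not shown on this page.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical non-owning wrapper over host memory (NOT Bolt's device_vector).
class HostBackedView {
public:
    explicit HostBackedView(std::vector<int>& host)
        : data_(host.data()), size_(host.size()) {}
    int* begin() { return data_; }
    int* end() { return data_ + size_; }
    std::size_t size() const { return size_; }
private:
    int* data_;         // points at the caller's storage, nothing is copied
    std::size_t size_;
};

// Toy algorithm entry point that accepts host vectors and wraps them internally.
template <typename UnaryOp>
void toy_transform(std::vector<int>& in, std::vector<int>& out, UnaryOp op) {
    out.resize(in.size());             // make sure the output can hold every result
    HostBackedView vin(in), vout(out);
    int* dst = vout.begin();
    for (int* src = vin.begin(); src != vin.end(); ++src, ++dst)
        *dst = op(*src);               // writes land directly in the caller's vector
}
```

After a call such as toy_transform(in, out, op), the results are already in out, because the view aliases the vector's own storage.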