
Algorithms execute incorrectly when used with cross-device memory #854

Open
BenBrock opened this issue Mar 22, 2023 · 2 comments
BenBrock commented Mar 22, 2023

If I execute a oneDPL algorithm on one device while one of the outputs is located on another device, the algorithm does not execute correctly. I've reproduced this with inclusive_scan and for_each on 2- and 4-GPU machines with Xe Link on ORTCE.

I have written a minimal example with inclusive_scan, summarized below.

  • v points to a USM device memory allocation on GPU 0.

  • ptrs[1] points to a USM device memory allocation on GPU 1.

  • Execute inclusive_scan on the input [v, v + n), writing results to [ptrs[1], ptrs[1] + n).

  • The output still contains zeros after the algorithm is executed, instead of the correct result.

// Allocate USM device buffers on each device, to be used as outputs
std::vector<int*> ptrs;
for (auto&& device : devices) {
  int* ptr = sycl::malloc_device<int>(n, device, context);
  ptrs.push_back(ptr);
}

// Allocate USM device buffer on device 0, to be used as input
int* v = sycl::malloc_device<int>(n, devices[0], context);

// Copy data to input buffer
sycl::queue q(context, devices[0]);
q.memcpy(v, lv.data(), lv.size()*sizeof(int)).wait();

for (auto&& ptr : ptrs) {
  sycl::queue q(context, devices[0]);
  oneapi::dpl::execution::device_policy policy(q);

  // Perform `inclusive_scan` on device 0.
  // The input buffer is on device 0; the output buffer cycles through each device.
  oneapi::dpl::inclusive_scan(policy, v, v + n, ptr);
}
. . .

// Memcpy buffers pointed to by `ptrs` and print them out.
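The elided verification step might look something like the sketch below (assuming the host buffer `lv` from above has size `n`, and reusing the queue `q` on device 0; needs `<iostream>` and `<vector>`):

```cpp
// Sketch: copy each device's output back to the host and print it.
std::vector<int> host(n);
for (std::size_t i = 0; i < ptrs.size(); ++i) {
  q.memcpy(host.data(), ptrs[i], n * sizeof(int)).wait();
  std::cout << "Result for GPU " << i << ": ";
  for (int x : host) std::cout << x << ' ';
  std::cout << '\n';
}
```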

Output:

(base) bbrock@sdp125072:~/src/distributed-ranges/examples/shp$ ./write_test-dosxx 
Reference: [0, 1, 3, 6, 10, 15, 21, 28, 36, 45]
Result for GPU 0: [0, 1, 3, 6, 10, 15, 21, 28, 36, 45]
Result for GPU 1: [0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
Result for GPU 2: [0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
Result for GPU 3: [0, 0, 0, 0, 0, 0, 0, 0, 0, 0]

I encountered this using the most recent commit of oneDPL, compiled with both icpx 2023.0.0.20221201 and the most recent commit of intel/llvm. All buffers are in USM device memory, and these multi-GPU machines have Xe Link with full peer-to-peer support, so as far as I know this should work. (Replacing the oneDPL for_each or inclusive_scan with a plain q.parallel_for writing to the same regions does produce visible changes.)

I encountered this in the context of our distributed ranges inclusive_scan implementation, where the distributed ranges given for the input and output may not line up perfectly, meaning that the input and output can be on different devices.

Oddly, algorithms seem to execute correctly if I pass in iterators (e.g. GCC's __normal_iterator) instead of raw pointers. This is precisely the opposite of what I would expect, as there is currently a Level Zero runtime bug preventing iterators from working across GPUs inside SYCL kernels. Is oneDPL somehow handling memory differently?

@BenBrock BenBrock changed the title Wrong executing algorithms with cross-device memory Algorithms execute incorrectly when used with cross-device memory Mar 22, 2023

MikeDvorskiy commented Mar 24, 2023

Hello @BenBrock,

  1. "sycl::malloc_device" allocates device memory; it is not shared memory (in particular, it is not accessible from the host).
  2. I think oneDPL doesn't support cross-device execution, if only because the API provides no way to pass more than one device (queue): a oneDPL algorithm accepts a single policy, which in turn contains a single queue (device). So inside a kernel we access the data directly through the pointer, provided it is a USM pointer (device or shared).
  3. For a "normal" iterator type (not a host pointer), oneDPL creates a temporary SYCL buffer, copies the data into it, and accesses the data through SYCL accessors. A SYCL buffer has no option for specifying the device on which it should be allocated; I believe it is the device associated with the passed queue. So in that case, too, oneDPL algorithms don't support cross-device execution. The fact that you observed it working is accidental; in other words, it is UB.
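Given that limitation, one possible workaround (a sketch under the constraints described above, not an API oneDPL provides) is to run the algorithm entirely on the compute device, staging the output in memory allocated on that same device, and then copy the staged result to the remote device's allocation with an explicit memcpy. The names `n`, `context`, `devices`, `v`, and `ptrs` follow the reproducer above:

```cpp
// Sketch: stage the scan output on device 0, then explicitly copy it
// to device 1's allocation. Assumes the variables from the reproducer.
sycl::queue q(context, devices[0]);
oneapi::dpl::execution::device_policy policy(q);

// Temporary output on the same device that runs the algorithm.
int* staging = sycl::malloc_device<int>(n, devices[0], context);
oneapi::dpl::inclusive_scan(policy, v, v + n, staging);

// Peer-to-peer (or host-mediated) copy to the cross-device output.
q.memcpy(ptrs[1], staging, n * sizeof(int)).wait();
sycl::free(staging, context);
```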


BenBrock commented Mar 27, 2023

#861 partially addresses this issue, allowing oneDPL algorithms to be invoked as below.

auto first = oneapi::dpl::make_direct_iterator(v);
auto last = oneapi::dpl::make_direct_iterator(v + n);
auto d_first = oneapi::dpl::make_direct_iterator(ptr);
oneapi::dpl::inclusive_scan(policy, first, last, d_first);

(This does not yet completely fix the issue on Intel multi-GPU systems due to an issue with the Level Zero runtime.)
