
Algorithms execute incorrectly when used with cross-device memory #854

Open
BenBrock opened this issue Mar 22, 2023 · 2 comments
BenBrock commented Mar 22, 2023

If I execute a oneDPL algorithm on one device while one of the outputs is located on another device, the algorithm does not execute correctly. I've reproduced this with inclusive_scan and for_each on 2- and 4-GPU machines with Xe Link on ORTCE.

I have written a minimal example with inclusive_scan, summarized below.

  • v points to a USM device memory allocation on GPU 0.

  • ptrs[1] points to a USM device memory allocation on GPU 1.

  • Execute inclusive_scan on the input [v, v + n), writing results to [ptrs[1], ptrs[1] + n).

  • The output still contains zeros after the algorithm is executed, instead of the correct result.

// Allocate USM device buffers on each device, to be used as outputs
std::vector<int*> ptrs;
for (auto&& device : devices) {
  int* ptr = sycl::malloc_device<int>(n, device, context);
  ptrs.push_back(ptr);
}

// Allocate USM device buffer on device 0, to be used as input
int* v = sycl::malloc_device<int>(n, devices[0], context);

// Copy data to input buffer
sycl::queue q(context, devices[0]);
q.memcpy(v, lv.data(), lv.size()*sizeof(int)).wait();

for (auto&& ptr : ptrs) {
  sycl::queue q(context, devices[0]);
  oneapi::dpl::execution::device_policy policy(q);

  // Perform `inclusive_scan` on device 0.
  // The input buffer is on device 0; the output buffer cycles through each device.
  oneapi::dpl::inclusive_scan(policy, v, v + n, ptr);
}
. . .

// Memcpy buffers pointed to by `ptrs` and print them out.
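The elided verification step might look something like the sketch below (assuming the host buffer `lv` from above has size `n`, and reusing the queue `q` on device 0; needs `<iostream>` and `<vector>`):

```cpp
// Sketch: copy each device's output back to the host and print it.
std::vector<int> host(n);
for (std::size_t i = 0; i < ptrs.size(); ++i) {
  q.memcpy(host.data(), ptrs[i], n * sizeof(int)).wait();
  std::cout << "Result for GPU " << i << ": ";
  for (int x : host) std::cout << x << ' ';
  std::cout << '\n';
}
```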

Output:

(base) bbrock@sdp125072:~/src/distributed-ranges/examples/shp$ ./write_test-dosxx 
Reference: [0, 1, 3, 6, 10, 15, 21, 28, 36, 45]
Result for GPU 0: [0, 1, 3, 6, 10, 15, 21, 28, 36, 45]
Result for GPU 1: [0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
Result for GPU 2: [0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
Result for GPU 3: [0, 0, 0, 0, 0, 0, 0, 0, 0, 0]

I encountered this using the most recent commit of oneDPL, compiled with both icpx 2023.0.0.20221201 and the most recent commit of intel/llvm. All buffers are in USM device memory, and these multi-GPU machines have Xe Link with full peer-to-peer support, so as far as I know this should work. (Replacing the oneDPL for_each or inclusive_scan with a plain q.parallel_for writing to the same regions does produce visible changes.)

I encountered this in the context of our distributed ranges inclusive_scan implementation, where the distributed ranges given for the input and output may not line up perfectly, meaning that the input and output can be on different devices.

Oddly, algorithms seem to execute correctly if I pass in iterators (e.g. GCC's __normal_iterator) instead of raw pointers. This is precisely the opposite of what I would expect, as there is currently a Level Zero runtime bug preventing iterators from working across GPUs inside SYCL kernels. Is oneDPL somehow handling memory differently?

@BenBrock BenBrock changed the title Wrong executing algorithms with cross-device memory Algorithms execute incorrectly when used with cross-device memory Mar 22, 2023

MikeDvorskiy commented Mar 24, 2023

Hello @BenBrock,

  1. "sycl::malloc_device" allocates device memory; it is not shared memory (in particular, it is not accessible from the host).
  2. I think oneDPL doesn't support cross-device execution, if only because the API provides no way to pass more than one device (queue): a oneDPL algorithm accepts a single policy, which in turn contains a single queue (device). So inside a kernel we access the data directly through the pointer, provided it is a USM pointer (device or shared).
  3. For a "normal" iterator type (not a host pointer), oneDPL creates a temporary SYCL buffer, copies the data into it, and accesses the data through SYCL accessors. A SYCL buffer has no option for specifying the device on which it should be allocated; I believe it is the device associated with the passed queue. So in that case, too, oneDPL algorithms don't support cross-device execution. The fact that you observed it working is accidental; in other words, it is UB.
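Given that limitation, one possible workaround (a sketch under the constraints described above, not an API oneDPL provides) is to run the algorithm entirely on the compute device, staging the output in memory allocated on that same device, and then copy the staged result to the remote device's allocation with an explicit memcpy. The names `n`, `context`, `devices`, `v`, and `ptrs` follow the reproducer above:

```cpp
// Sketch: stage the scan output on device 0, then explicitly copy it
// to device 1's allocation. Assumes the variables from the reproducer.
sycl::queue q(context, devices[0]);
oneapi::dpl::execution::device_policy policy(q);

// Temporary output on the same device that runs the algorithm.
int* staging = sycl::malloc_device<int>(n, devices[0], context);
oneapi::dpl::inclusive_scan(policy, v, v + n, staging);

// Peer-to-peer (or host-mediated) copy to the cross-device output.
q.memcpy(ptrs[1], staging, n * sizeof(int)).wait();
sycl::free(staging, context);
```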


BenBrock commented Mar 27, 2023

#861 partially addresses this issue, allowing oneDPL algorithms to be invoked as below.

auto first = oneapi::dpl::make_direct_iterator(v);
auto last = oneapi::dpl::make_direct_iterator(v + n);
auto d_first = oneapi::dpl::make_direct_iterator(ptr);
oneapi::dpl::inclusive_scan(policy, first, last, d_first);

(This does not yet completely fix the issue on Intel multi-GPU systems due to an issue with the Level Zero runtime.)
