Algorithms execute incorrectly when used with cross-device memory #854
BenBrock changed the title from "Wrong executing algorithms with cross-device memory" to "Algorithms execute incorrectly when used with cross-device memory" on Mar 22, 2023.
Hello @BenBrock,
This was referenced Mar 24, 2023
#861 partially addresses this issue, allowing oneDPL algorithms to be invoked as below:

```cpp
auto first = oneapi::dpl::make_direct_iterator(v);
auto last = oneapi::dpl::make_direct_iterator(v + n);
auto d_first = oneapi::dpl::make_direct_iterator(ptr);
oneapi::dpl::inclusive_scan(policy, first, last, d_first);
```

(This does not yet completely fix the issue on Intel multi-GPU systems due to an issue with the Level Zero runtime.)
If I execute a oneDPL algorithm on one device while one of the outputs is located on another device, the algorithm does not execute successfully. I've reproduced this on 2- and 4-GPU machines with Xe Link on ORTCE with `inclusive_scan` and `for_each`.

I have written a minimal example with `inclusive_scan`, summarized below:

- `v` points to a USM device memory allocation on GPU 0.
- `ptrs[1]` points to a USM device memory allocation on GPU 1.
- Execute `inclusive_scan` on the input `[v, v + n)`, writing results to `[ptrs[1], ptrs[1] + n)`.
- The output still contains zeros after the algorithm is executed, instead of the correct result.
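The steps above can be sketched as follows. This is a minimal reconstruction under stated assumptions, not the reporter's exact reproducer: the queue/context setup, device ordering, and fill values are assumptions, and it needs a multi-GPU SYCL system to exhibit the bug.

```cpp
// Sketch of the cross-device reproducer described above (assumed setup).
#include <sycl/sycl.hpp>
#include <oneapi/dpl/execution>
#include <oneapi/dpl/algorithm>
#include <cstdio>
#include <vector>

int main() {
  std::vector<sycl::device> gpus =
      sycl::device::get_devices(sycl::info::device_type::gpu);
  if (gpus.size() < 2) return 0;  // requires a multi-GPU system

  sycl::context ctx(gpus);       // shared context so USM is visible to both GPUs
  sycl::queue q0(ctx, gpus[0]);  // executes the algorithm
  sycl::queue q1(ctx, gpus[1]);  // owns the output allocation

  constexpr std::size_t n = 1024;
  int* v = sycl::malloc_device<int>(n, q0);    // input on GPU 0
  int* out = sycl::malloc_device<int>(n, q1);  // output on GPU 1

  q0.fill(v, 1, n).wait();
  q1.fill(out, 0, n).wait();

  // Run on GPU 0, writing across Xe Link into GPU 1's memory.
  auto policy = oneapi::dpl::execution::make_device_policy(q0);
  oneapi::dpl::inclusive_scan(policy, v, v + n, out);
  q0.wait();

  // Copy back and check: with the reported bug, `host` stays all zeros.
  std::vector<int> host(n);
  q1.copy(out, host.data(), n).wait();
  std::printf("out[n-1] = %d (expected %zu)\n", host.back(), n);

  sycl::free(v, ctx);
  sycl::free(out, ctx);
}
```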
Output:
I encountered this using the most recent commit of oneDPL, compiled using both `icpx 2023.0.0.20221201` and the most recent commit of intel/llvm. All buffers are in USM device memory, and these multi-GPU machines have Xe Link with full peer-to-peer support, so as far as I know this should work. (Replacing the oneDPL `for_each` or `inclusive_scan` with a `q.parallel_for` writing to the same regions results in visible changes.)

I encountered this in the context of our distributed ranges `inclusive_scan` implementation, where the distributed ranges given for the input and output may not line up perfectly, meaning that the input and output can be on different devices.

Oddly, algorithms seem to execute correctly if I pass in iterators (e.g. GCC's `__normal_iterator`) instead of raw pointers. This is precisely the opposite of what I would expect, as there is currently a Level Zero runtime bug preventing iterators from working across GPUs inside SYCL kernels. Is oneDPL somehow handling memory differently?