-
Notifications
You must be signed in to change notification settings - Fork 114
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Implement direct_iterator
and make_direct_iterator
#861
base: main
Are you sure you want to change the base?
Conversation
direct use of device iterators in oneDPL algorithms.
|
||
#if __cpp_lib_span >= 202002L | ||
|
||
std::span<T> x(p, n); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems span x
is not used...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oops, just fixed that typo to actually initialize s_first
and s_last
using x.begin()
and x.end()
.
|
||
auto v_ref = std::reduce(v.begin(), v.end(), 0); | ||
|
||
dpl::make_direct_iterator d_first(p); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
p
is a pointer here.. A pointer is passed directly by oneDPL design. A pointer doesn't require a wrapper...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Happy to remove this part of the test if you prefer. My idea was that this could serve as a temporary workaround for #854.
(Although as I mention in the issue, there is unfortunately a bug in the level zero that keeps this workaround from working on Intel multi-GPU systems.)
|
||
std::span<T> x(p, n); | ||
|
||
dpl::make_direct_iterator s_first(x.begin()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Probably we can avoid this "identical" iterator-wrapper, by introducing just specialization for the trait oneapi::dpl::__internal::is_passed_directly<_Iter>
?
like oneapi::dpl::__internal::is_passed_directly<std::span::iterator>
, with returns std::true_type
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You don't always want to pass in std::span
iterators directly, since they might not be accessible on the device. Suppose a user wrote the following:
std::vector<int> v(...);
std::span s(v);
// Runtime error, since `s.begin()` is a host iterator and cannot
// be used directly on the device.
dpl::reduce(policy, s.begin(), s.end());
You can't in general know whether a span is accessible on the device, and this holds for most ranges you might encounter. There are a lot of iterator types that users might want to pass into oneDPL directly, and I don't think we can automatically most of them. I will add a better motivating example below.
I wonder if providing a wrapper iterator for the purpose of only passing something as-is to oneDPL, is the right approach. Instead, should we maybe follow the approach we use for SYCL buffers, i.e. provide "wrapper" functions that return some object suitable to pass the original class to oneDPL algorithms, without attempting to make it a correct functioning iterator? Or in other words, should we extend the applicability of |
Here's a better example illustrating why I think this is needed. There are potentially many complicated iterator types users will want to pass directly into oneDPL algorithms, and it's not always possible to identify which ones can and can't be passed in directly. Suppose you wanted to implement a ranges-style dot product using oneDPL, like below. template <std::ranges::forward_range X, std::ranges::forward_range Y>
auto dot_product_onedpl(sycl::queue q, X &&x, Y &&y) {
auto z = std::ranges::views::zip(x, y)
| std::ranges::views::transform(
[](auto &&elem) {
auto &&[a, b] = elem;
return a * b;
});
oneapi::dpl::execution::device_policy policy(q);
shp::__detail::direct_iterator d_first(z.begin());
shp::__detail::direct_iterator d_last(z.end());
return oneapi::dpl::experimental::reduce_async(
policy, d_first, d_last, std::ranges::range_value_t<X>(0), std::plus())
.get();
} The iterator type passed into oneDPL is rather complicated. It's not in general possible to know whether a transform view or a zip view is directly accessible on the device. As a user, I happen to know that This gets more complicated when you also have user-defined data structures and views. We will definitely need to keep something like this in our own codebase for distributed ranges. I'll leave it up to you guys to decide whether something like this is more broadly applicable to users. My intention is basically to give users the option of forcing oneDPL algorithms to use device iterators directly on the device. This can be used in cases where either it's not possible to determine whether an iterator can be passed directly ( |
return *this; | ||
} | ||
|
||
reference operator*() const noexcept { return *__iter; } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To tell the truth, I really don't understand an essence of that wrapper over _Iter
.
That wrapper repeats the all standard RA iterator functionality, including dereferencing. If _Iter
is not accessible on a device, direct_iterator
also is not accessible on a device... So, what's an essence here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@MikeDvorskiy Sorry for being late getting back to you; this slipped past my inbox.
The idea here is that you have a range/iterator accessible on the device. Let's say a std::span<int>
to device memory. Then, you create a view based on that range. For example:
template <typename T>
auto sum_times_two(std::span<T> x) {
auto z = x
| std::ranges::views::transform(
[](auto &&elem) {
return elem*2;
});
oneapi::dpl::execution::device_policy policy(q);
return oneapi::dpl::experimental::reduce_async(
policy, z.begin(), z.end(), T(0), std::plus())
.get();
}
This code works, but has terrible performance. The reason is that oneDPL does not know transform_view<...>::iterator
is device accessible, so it copies all the elements one-by-one from the device to the host, then uses a buffer to copy it back to the device. We use direct_iterator
to force oneDPL to use the iterator directly, since we know that it can be used directly on the device.
template <typename T>
auto sum_times_two(std::span<T> x) {
auto z = x
| std::ranges::views::transform(
[](auto &&elem) {
return elem*2;
});
oneapi::dpl::execution::device_policy policy(q);
shp::__detail::direct_iterator d_first(z.begin());
shp::__detail::direct_iterator d_last(z.end());
return oneapi::dpl::experimental::reduce_async(
policy, d_first, d_last, T(0), std::plus())
.get();
}
This example is a bit simplified. In the use case in distributed ranges, we have an actual device_ptr
as the underlying iterator type, so we do know that the data lives on the device. It might be worth thinking about how we could integrate distributed range's concepts of device vs. host memory with distributed ranges, but I think there will always be some cases where a user wants to explicitly "promote" a range to being directly accessible on the device. Using a standard library view is a prime example of this, as we're unlikely to be able to hardwire locality information into a view without modifying the standard. (Or providing our own implementation of all views.)
What's left for driving this pull request to completion? Not being able to use custom device iterators is one of the limitations for us in https://github.com/kokkos/kokkos compared with thrust. |
I think the primary blocker is just resources on the oneDPL team. This was going to be merged as part of #1479, but that's been delayed. Maybe @akukanov, @rarutyun, or @MikeDvorskiy can comment on the possibility of accepting this PR individually to enable libraries like Kokkos? |
It would probably be already sufficient if |
Making I suggest to open a RFC discussion at https://github.com/oneapi-src/oneDPL/discussions and/or a design proposal following the process here https://github.com/oneapi-src/oneDPL/tree/main/rfcs. The goal is to have a dedicated design discussion of this idea. The only thing really needed to start is the motivating use cases, but if you have ideas/preferences for how you would customize this trait, that would be useful for the design. I hope it is not too much to ask you for :) The eventual outcome should be a patch to the oneDPL specification that describes the new functionality, and a patch to this repo that implements it. But after the design is accepted in principle, we will take care of these unless you will want to stay involved. |
Implement
direct_iterator
andmake_direct_iterator
, which allow users to wrap device iterators that should be used directly inside SYCL kernels by oneDPL. This PR addresses #855 and #854.This will likely require some work before being accepted, but I just wanted articulate my proposed fix for these issues.