Implement `direct_iterator` and `make_direct_iterator` #861

BenBrock · 2023-03-25T00:23:50Z

Implement direct_iterator and make_direct_iterator, which allow users to wrap device iterators that should be used directly inside SYCL kernels by oneDPL. This PR addresses #855 and #854.

This will likely require some work before being accepted, but I wanted to articulate my proposed fix for these issues.

MikeDvorskiy · 2023-03-27T16:22:40Z

test/general/direct_iterator.pass.cpp

+
+#if __cpp_lib_span >= 202002L
+
+    std::span<T> x(p, n);


It seems span x is not used...

BenBrock · 2023-03-27

Oops, just fixed that typo so the test actually initializes s_first and s_last using x.begin() and x.end().

MikeDvorskiy · 2023-03-27T16:24:05Z

test/general/direct_iterator.pass.cpp

+
+    auto v_ref = std::reduce(v.begin(), v.end(), 0);
+
+    dpl::make_direct_iterator d_first(p);


p is a pointer here. By design, oneDPL passes pointers directly, so a pointer doesn't require a wrapper.

BenBrock · 2023-03-27

Happy to remove this part of the test if you prefer. My idea was that this could serve as a temporary workaround for #854.

(Although, as I mention in the issue, there is unfortunately a bug in Level Zero that keeps this workaround from working on Intel multi-GPU systems.)

MikeDvorskiy · 2023-03-28T10:06:02Z

test/general/direct_iterator.pass.cpp

+
+    std::span<T> x(p, n);
+
+    dpl::make_direct_iterator s_first(x.begin());


We can probably avoid this "identical" iterator wrapper by just introducing a specialization of the trait oneapi::dpl::__internal::is_passed_directly<_Iter>?

Something like oneapi::dpl::__internal::is_passed_directly<std::span::iterator>, which returns std::true_type?

BenBrock · 2023-03-28

You don't always want to pass in std::span iterators directly, since they might not be accessible on the device. Suppose a user wrote the following:

std::vector<int> v(...);
std::span s(v);
// Runtime error, since `s.begin()` is a host iterator and cannot
// be used directly on the device.
dpl::reduce(policy, s.begin(), s.end());

In general, you can't know whether a span is accessible on the device, and this holds for most ranges you might encounter. There are a lot of iterator types that users might want to pass into oneDPL directly, and I don't think we can automatically detect most of them. I will add a better motivating example below.

akukanov · 2023-03-28T14:39:11Z

I wonder if providing a wrapper iterator for the purpose of only passing something as-is to oneDPL, is the right approach. Instead, should we maybe follow the approach we use for SYCL buffers, i.e. provide "wrapper" functions that return some object suitable to pass the original class to oneDPL algorithms, without attempting to make it a correct functioning iterator? Or in other words, should we extend the applicability of dpl::begin()/end() beyond SYCL buffers to other containers of interest?

BenBrock · 2023-03-28T15:26:50Z

Here's a better example illustrating why I think this is needed. There are potentially many complicated iterator types users will want to pass directly into oneDPL algorithms, and it's not always possible to identify which ones can and can't be passed in directly.

Suppose you wanted to implement a ranges-style dot product using oneDPL, like below.

template <std::ranges::forward_range X, std::ranges::forward_range Y>
auto dot_product_onedpl(sycl::queue q, X &&x, Y &&y) {
  auto z = std::ranges::views::zip(x, y)
         | std::ranges::views::transform(
           [](auto &&elem) {
             auto &&[a, b] = elem;
             return a * b;
           });

  oneapi::dpl::execution::device_policy policy(q);

  shp::__detail::direct_iterator d_first(z.begin());
  shp::__detail::direct_iterator d_last(z.end());
  return oneapi::dpl::experimental::reduce_async(
             policy, d_first, d_last, std::ranges::range_value_t<X>(0), std::plus())
      .get();
}

The iterator type passed into oneDPL is rather complicated. It's not in general possible to know whether a transform view or a zip view is directly accessible on the device. As a user, I happen to know that X and Y are device ranges, and so I know that the resulting view z can safely be used on the device.

This gets more complicated when you also have user-defined data structures and views.

We will definitely need to keep something like this in our own codebase for distributed ranges. I'll leave it up to you guys to decide whether something like this is more broadly applicable to users. My intention is basically to give users the option of forcing oneDPL algorithms to use device iterators directly on the device. This can be used both when it's not possible to determine whether an iterator can be passed directly (span and a few other views) and when complex iterator types make using is_passed_directly a bit unwieldy (most views and some user-defined types).
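For readers who want to see the shape of such a wrapper, the following is a minimal host-only sketch, not the code from this PR (which lives in include/oneapi/dpl/pstl/iterator_impl.h). The `sketch` namespace, the trait, and the member names are assumptions made for illustration. The wrapper forwards every random-access operation to the wrapped iterator; the only thing it adds is a type that a trait can recognize as "use as-is".

```cpp
#include <cassert>
#include <iterator>
#include <numeric>
#include <type_traits>
#include <vector>

namespace sketch {

// Hypothetical trait: by default an iterator is not known to be device-usable.
template <typename Iter>
struct is_passed_directly : std::false_type {};

// Thin wrapper that forwards all operations to the wrapped iterator.
template <typename Iter>
class direct_iterator {
    using traits = std::iterator_traits<Iter>;

public:
    using iterator_category = std::random_access_iterator_tag;
    using value_type = typename traits::value_type;
    using difference_type = typename traits::difference_type;
    using pointer = typename traits::pointer;
    using reference = typename traits::reference;

    direct_iterator() = default;
    explicit direct_iterator(Iter it) : iter_(it) {}

    reference operator*() const { return *iter_; }
    reference operator[](difference_type n) const { return iter_[n]; }

    direct_iterator& operator++() { ++iter_; return *this; }
    direct_iterator operator++(int) { auto tmp = *this; ++iter_; return tmp; }
    direct_iterator& operator--() { --iter_; return *this; }
    direct_iterator operator--(int) { auto tmp = *this; --iter_; return tmp; }
    direct_iterator& operator+=(difference_type n) { iter_ += n; return *this; }
    direct_iterator& operator-=(difference_type n) { iter_ -= n; return *this; }

    friend direct_iterator operator+(direct_iterator it, difference_type n) { return it += n; }
    friend direct_iterator operator+(difference_type n, direct_iterator it) { return it += n; }
    friend direct_iterator operator-(direct_iterator it, difference_type n) { return it -= n; }
    friend difference_type operator-(const direct_iterator& a, const direct_iterator& b) {
        return a.iter_ - b.iter_;
    }

    friend bool operator==(const direct_iterator& a, const direct_iterator& b) { return a.iter_ == b.iter_; }
    friend bool operator!=(const direct_iterator& a, const direct_iterator& b) { return !(a == b); }
    friend bool operator<(const direct_iterator& a, const direct_iterator& b) { return a.iter_ < b.iter_; }
    friend bool operator>(const direct_iterator& a, const direct_iterator& b) { return b < a; }
    friend bool operator<=(const direct_iterator& a, const direct_iterator& b) { return !(b < a); }
    friend bool operator>=(const direct_iterator& a, const direct_iterator& b) { return !(a < b); }

private:
    Iter iter_{};
};

// Wrapping is the opt-in: any direct_iterator<Iter> is marked as direct.
template <typename Iter>
struct is_passed_directly<direct_iterator<Iter>> : std::true_type {};

} // namespace sketch

// Host-side check that the wrapper behaves like the iterator it wraps.
inline int sum_through_wrapper(const std::vector<int>& v) {
    sketch::direct_iterator first(v.begin());
    sketch::direct_iterator last(v.end());
    return std::accumulate(first, last, 0);
}
```

The design choice here is that the promise is made per object at the call site rather than per type, so the same underlying iterator type can be wrapped when the user knows the data is device-accessible and left unwrapped otherwise.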

MikeDvorskiy · 2024-02-22T10:50:53Z

include/oneapi/dpl/pstl/iterator_impl.h

+        return *this;
+    }
+
+    reference operator*() const noexcept { return *__iter; }


To tell the truth, I really don't understand the point of this wrapper over _Iter.
The wrapper repeats all the standard random-access iterator functionality, including dereferencing. If _Iter is not accessible on a device, direct_iterator is not accessible on a device either. So what is the point here?

BenBrock · 2024-03-04

@MikeDvorskiy Sorry for being late getting back to you; this slipped past my inbox.

The idea here is that you have a range/iterator accessible on the device. Let's say a std::span<int> to device memory. Then, you create a view based on that range. For example:

template <typename T>
auto sum_times_two(sycl::queue q, std::span<T> x) {
  auto z = x | std::ranges::views::transform(
                   [](auto &&elem) { return elem * 2; });

  oneapi::dpl::execution::device_policy policy(q);

  return oneapi::dpl::experimental::reduce_async(
             policy, z.begin(), z.end(), T(0), std::plus())
      .get();
}

This code works, but has terrible performance. The reason is that oneDPL does not know that transform_view<...>::iterator is device accessible, so it copies all the elements one by one from the device to the host, then uses a buffer to copy them back to the device. We use direct_iterator to force oneDPL to use the iterator directly, since we know that it can be used directly on the device.

template <typename T>
auto sum_times_two(sycl::queue q, std::span<T> x) {
  auto z = x | std::ranges::views::transform(
                   [](auto &&elem) { return elem * 2; });

  oneapi::dpl::execution::device_policy policy(q);

  // Wrap the view's iterators so oneDPL uses them directly on the device.
  shp::__detail::direct_iterator d_first(z.begin());
  shp::__detail::direct_iterator d_last(z.end());

  return oneapi::dpl::experimental::reduce_async(
             policy, d_first, d_last, T(0), std::plus())
      .get();
}

This example is a bit simplified. In the distributed ranges use case, we have an actual device_ptr as the underlying iterator type, so we do know that the data lives on the device. It might be worth thinking about how we could integrate distributed ranges' concepts of device vs. host memory with oneDPL, but I think there will always be some cases where a user wants to explicitly "promote" a range to being directly accessible on the device. Using a standard library view is a prime example of this, as we're unlikely to be able to hardwire locality information into a view without modifying the standard (or providing our own implementation of all views).

masterleinad · 2024-10-22T20:07:11Z

What's left to drive this pull request to completion? Not being able to use custom device iterators is one of the limitations for us in https://github.com/kokkos/kokkos compared with Thrust.

BenBrock · 2024-10-28T21:45:57Z

I think the primary blocker is just resources on the oneDPL team. This was going to be merged as part of #1479, but that's been delayed.

Maybe @akukanov, @rarutyun, or @MikeDvorskiy can comment on the possibility of accepting this PR individually to enable libraries like Kokkos?

masterleinad · 2024-10-29T13:43:21Z

> Maybe @akukanov, @rarutyun, or @MikeDvorskiy can comment on the possibility of accepting this PR individually to enable libraries like Kokkos?

It would probably already be sufficient if is_passed_directly were officially supported, although a customization point in the oneapi::dpl namespace would be preferable.

akukanov · 2024-11-13T20:15:15Z

Making is_passed_directly a public customization point in the oneDPL namespace sounds good to me; better than the proposed iterator wrapper.

I suggest opening an RFC discussion at https://github.com/oneapi-src/oneDPL/discussions and/or a design proposal following the process at https://github.com/oneapi-src/oneDPL/tree/main/rfcs. The goal is to have a dedicated design discussion of this idea. The only thing really needed to start is the motivating use cases, but if you have ideas or preferences for how you would customize this trait, that would be useful for the design. I hope that is not too much to ask :)

The eventual outcome should be a patch to the oneDPL specification that describes the new functionality, and a patch to this repo that implements it. Once the design is accepted in principle, we will take care of these unless you want to stay involved.
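As one possible starting point for such a design discussion, here is a small host-only sketch of a trait-based customization point. Everything in it is hypothetical: the `sketch` namespace, the trait name, the `transfer_strategy` helper, and the `user` iterator types are invented for illustration and do not describe an existing or planned oneDPL interface. It shows a library-side dispatch that takes the direct path only for iterator types whose author has opted in.

```cpp
#include <cassert>
#include <cstddef>
#include <string_view>
#include <type_traits>

namespace sketch {

// Hypothetical public customization point: false unless a type opts in.
template <typename Iter>
struct is_passed_directly : std::false_type {};

// Pointers are treated as direct, matching the review remark that
// pointers are passed directly by design.
template <typename T>
struct is_passed_directly<T*> : std::true_type {};

} // namespace sketch

namespace user {

// A user-defined iterator over memory the user knows is device-accessible.
struct device_view_iterator {
    int* ptr = nullptr;
    std::ptrdiff_t stride = 1;
};

// A user-defined iterator that makes no such promise.
struct host_only_iterator {
    const int* ptr = nullptr;
};

} // namespace user

namespace sketch {

// The user's opt-in: a specialization for their own iterator type.
template <>
struct is_passed_directly<user::device_view_iterator> : std::true_type {};

// Stand-in for the library's dispatch: choose a path at compile time.
template <typename Iter>
constexpr std::string_view transfer_strategy(Iter) {
    if constexpr (is_passed_directly<Iter>::value)
        return "direct";
    else
        return "copy through buffer";
}

} // namespace sketch
```

With these definitions, `sketch::transfer_strategy(user::device_view_iterator{})` selects the direct path, while `user::host_only_iterator` falls back to the buffer path because nothing opted it in.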

BenBrock added 4 commits March 24, 2023 17:17

Implement direct_iterator and make_direct_iterator to support direct use of device iterators in oneDPL algorithms.

b6ce547

Fix typos in device_iterator

2f955d4

Typos fix

9728854

Last typo

e48ae04

This was referenced Mar 25, 2023

Cannot use device iterators in oneDPL algorithms #855

Open

Algorithms execute incorrectly when used with cross-device memory #854

Open

MikeDvorskiy reviewed Mar 27, 2023

BenBrock added 2 commits March 27, 2023 09:29

Fix typo in test to actually use span

9a7a99e

Merge branch 'main' into implement-direct-iterator

da75e3a

MikeDvorskiy reviewed Mar 28, 2023

MikeDvorskiy reviewed Feb 22, 2024

BenBrock marked this pull request as ready for review October 28, 2024 21:46

masterleinad mentioned this pull request Nov 1, 2024

oneDPL: Sort on device using Kokkos::RandomAccessIterator kokkos/kokkos#7502

Draft
