Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement direct_iterator and make_direct_iterator #861

Open
wants to merge 6 commits into
base: main
Choose a base branch
from
Open
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
140 changes: 140 additions & 0 deletions include/oneapi/dpl/pstl/iterator_impl.h
Original file line number Diff line number Diff line change
Expand Up @@ -887,6 +887,146 @@ class discard_iterator
difference_type __my_position_;
};

template <typename _Iter>
class direct_iterator
{
public:
typedef typename ::std::iterator_traits<_Iter>::value_type value_type;
typedef typename ::std::iterator_traits<_Iter>::difference_type difference_type;
typedef typename ::std::iterator_traits<_Iter>::reference reference;

using iterator_category = std::random_access_iterator_tag;

using iterator = direct_iterator<_Iter>;
using pointer = iterator;

using is_passed_directly = ::std::true_type;

direct_iterator(_Iter iter) noexcept : __iter(iter) {}
direct_iterator() noexcept = default;
~direct_iterator() noexcept = default;
direct_iterator(const direct_iterator&) noexcept = default;
direct_iterator&
operator=(const direct_iterator&) noexcept = default;

bool
operator==(const direct_iterator& other) const noexcept
{
return __iter == other.__iter;
}
bool
operator!=(const direct_iterator& other) const noexcept
{
return __iter != other.__iter;
}

iterator
operator+(difference_type offset) const noexcept
{
return iterator(__iter + offset);
}

iterator
operator-(difference_type offset) const noexcept
{
return iterator(__iter - offset);
}

difference_type
operator-(iterator other) const noexcept
{
return __iter - other.__iter;
}

bool
operator<(iterator other) const noexcept
{
return __iter < other.__iter;
}

bool
operator>(iterator other) const noexcept
{
return __iter > __iter;
}

bool
operator<=(iterator other) const noexcept
{
return __iter <= other.__iter;
}

bool
operator>=(iterator other) const noexcept
{
return __iter >= other.__iter;
}

iterator&
operator++() noexcept
{
++__iter;
return *this;
}

iterator
operator++(int) noexcept
{
iterator other = *this;
++(*this);
return other;
}

iterator&
operator--() noexcept
{
--__iter;
return *this;
}

iterator
operator--(int) noexcept
{
iterator other = *this;
--(*this);
return other;
}

iterator&
operator+=(difference_type offset) noexcept
{
__iter += offset;
return *this;
}

iterator&
operator-=(difference_type offset) noexcept
{
__iter -= offset;
return *this;
}

reference operator*() const noexcept { return *__iter; }
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To tell the truth, I really don't understand an essence of that wrapper over _Iter.
That wrapper repeats the all standard RA iterator functionality, including dereferencing. If _Iter is not accessible on a device, direct_iterator also is not accessible on a device... So, what's an essence here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@MikeDvorskiy Sorry for being late getting back to you; this slipped past my inbox.

The idea here is that you have a range/iterator accessible on the device. Let's say a std::span<int> to device memory. Then, you create a view based on that range. For example:

template <typename T>
auto sum_times_two(std::span<T> x) {
  auto z = x
         | std::ranges::views::transform(
           [](auto &&elem) {
             return elem*2;
           });

  oneapi::dpl::execution::device_policy policy(q);

  return oneapi::dpl::experimental::reduce_async(
             policy, z.begin(), z.end(), T(0), std::plus())
      .get();
}

This code works, but has terrible performance. The reason is that oneDPL does not know transform_view<...>::iterator is device accessible, so it copies all the elements one-by-one from the device to the host, then uses a buffer to copy it back to the device. We use direct_iterator to force oneDPL to use the iterator directly, since we know that it can be used directly on the device.

template <typename T>
auto sum_times_two(std::span<T> x) {
  auto z = x
         | std::ranges::views::transform(
           [](auto &&elem) {
             return elem*2;
           });

  oneapi::dpl::execution::device_policy policy(q);

  shp::__detail::direct_iterator d_first(z.begin());
  shp::__detail::direct_iterator d_last(z.end());

  return oneapi::dpl::experimental::reduce_async(
             policy, d_first, d_last, T(0), std::plus())
      .get();
}

This example is a bit simplified. In the use case in distributed ranges, we have an actual device_ptr as the underlying iterator type, so we do know that the data lives on the device. It might be worth thinking about how we could integrate distributed range's concepts of device vs. host memory with distributed ranges, but I think there will always be some cases where a user wants to explicitly "promote" a range to being directly accessible on the device. Using a standard library view is a prime example of this, as we're unlikely to be able to hardwire locality information into a view without modifying the standard. (Or providing our own implementation of all views.)


reference operator[](difference_type offset) const noexcept { return reference(*(*this + offset)); }

friend iterator
operator+(difference_type n, iterator iter)
{
return iter.__iter + n;
}

private:
_Iter __iter;
};

template <typename Iterator>
direct_iterator<Iterator>
make_direct_iterator(Iterator iterator)
{
return direct_iterator<Iterator>(iterator);
}

} // namespace dpl
} // namespace oneapi

Expand Down
71 changes: 71 additions & 0 deletions test/general/direct_iterator.pass.cpp
Original file line number Diff line number Diff line change
@@ -0,0 +1,71 @@
// -*- C++ -*-
//===-- direct_iterator.pass.cpp ------------------------------------------===//
//
// Copyright (C) Intel Corporation
//
// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
//
// This file incorporates work covered by the following copyright and permission
// notice:
//
// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
// See https://llvm.org/LICENSE.txt for license information.
//
//===----------------------------------------------------------------------===//

#include <oneapi/dpl/execution>
#include <oneapi/dpl/algorithm>
#include <oneapi/dpl/iterator>
#include <oneapi/dpl/functional>

#include "support/utils.h"

#include <iostream>
#include <vector>
#include <numeric>

#if __cpp_lib_span >= 202002L
#include <span>
#endif

int
main()
{
#if TEST_DPCPP_BACKEND_PRESENT
using T = int;

const int n = 1000;

std::vector<T> v(n);
std::iota(v.begin(), v.end(), 0);

sycl::queue q(sycl::default_selector_v);

T* p = sycl::malloc_device<T>(n, q);

q.memcpy(p, v.data(), n * sizeof(T)).wait();

auto v_ref = std::reduce(v.begin(), v.end(), 0);

dpl::make_direct_iterator d_first(p);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

p is a pointer here.. A pointer is passed directly by oneDPL design. A pointer doesn't require a wrapper...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Happy to remove this part of the test if you prefer. My idea was that this could serve as a temporary workaround for #854.

(Although as I mention in the issue, there is unfortunately a bug in the level zero that keeps this workaround from working on Intel multi-GPU systems.)

dpl::make_direct_iterator d_last(p + n);

auto v_dev = dpl::reduce(d_first, d_last, 0);

EXPECT_EQ(v_ref, v_dev);

#if __cpp_lib_span >= 202002L

std::span<T> x(p, n);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems span x is not used...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oops, just fixed that typo to actually initialize s_first and s_last using x.begin() and x.end().


dpl::make_direct_iterator s_first(x.begin());
Copy link
Contributor

@MikeDvorskiy MikeDvorskiy Mar 28, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably we can avoid this "identical" iterator-wrapper, by introducing just specialization for the trait oneapi::dpl::__internal::is_passed_directly<_Iter> ?

like oneapi::dpl::__internal::is_passed_directly<std::span::iterator>, with returns std::true_type ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You don't always want to pass in std::span iterators directly, since they might not be accessible on the device. Suppose a user wrote the following:

std::vector<int> v(...);
std::span s(v);

// Runtime error, since `s.begin()` is a host iterator and cannot
// be used directly on the device.
dpl::reduce(policy, s.begin(), s.end());

You can't in general know whether a span is accessible on the device, and this holds for most ranges you might encounter. There are a lot of iterator types that users might want to pass into oneDPL directly, and I don't think we can automatically most of them. I will add a better motivating example below.

dpl::make_direct_iterator s_last(x.end());

auto s_dev = dpl::reduce(s_first, s_last, 0);

EXPECT_EQ(v_ref, s_dev);
#endif
#endif

return TestUtils::done(TEST_DPCPP_BACKEND_PRESENT);
}