Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Replace SYCL backend reduce_by_segment implementation with reduce-then-scan call #1915

Merged
merged 34 commits into from
Dec 19, 2024

Conversation

mmichel11
Copy link
Contributor

Summary

This PR implements a SYCL backend reduce_by_segment by using higher level calls to reduce-then-scan along with new specialty functors to achieve a segmented reduction. This PR is an initial step of porting the implementation to reduce-then-scan with optimization likely to follow. Future efforts may include additional modification to reduce-then-scan kernels.

Performance improves for all input sizes. For small inputs, we see 3-5x improvements and for very large sizes ~1.25x on GPU Series Max 1550. Please contact me if you would like to see performance data.

Description of changes

  • The SYCL reduce_by_segment implementation that was previously handwritten is replaced by a higher level call to our reduce-then-scan kernels. Several new callback functors for the reduce-then-scan kernel have been made to achieve this operation.
  • reduce_by_segment.pass was encountering linker crashes due to the large number of test cases being compiled growing past the maximum size of the binary's data region. SYCL testing has been trimmed down with regards to USM device and shared testing which resolves this issue. Instead of running each test with a device and shared USM allocation, every other test switches the USM type.
  • ONEDPL_WORKAROUND_FOR_IGPU_64BIT_REDUCTION has been removed as the SYCL implementation has been replaced, and we are no longer impacted by this issue.
  • The legacy reduce_by_segment implementation is used as a fallback for when the sub-group size, device, and trivial copyability constraints cannot be satisfied.

Future work

Future efforts on reduce_by_segment may built on top of this implementation and the reduce-then-scan kernels to better handle first and last element cases.

@mmichel11 mmichel11 force-pushed the dev/mmichel11/rts_reduce_by_segment branch from 66ead80 to 53adeb8 Compare October 22, 2024 19:33
@mmichel11 mmichel11 marked this pull request as ready for review October 23, 2024 18:42
@mmichel11
Copy link
Contributor Author

I have made some design changes based on offline discussion. There is quite a bit of code movement that has happened, so here is a summary of the recently made changes:

  • An iterator-based __pattern_reduce_by_segment has been added to algorithm_impl_hetero.h. Previously, we just had a range-based version. This resolves the issue of calling range-based patterns from iterator-based algorithms.
  • The fallback reduce_by_segment implementation based on high-level copy_if and parallel_for calls has been moved down a level from algorithm_ranges_impl_hetero.h to dpcpp/parallel_backend_sycl.h so that we can fallback on this implementation when reduce-then-scan cannot be used in __parallel_reduce_by_segment. This pattern has been implemented synchronously as parallel pattern calls cannot currently depend on each other.
  • Due to observed performance issues for compilers prior to icpx 2025.0, the reduce-then-scan path must be disabled in this case. The known-identity based implementation has been added back to avoid introducing a performance regression for icpx 2024.2.1 and prior. It has been moved into dpcpp/parallel_backend_sycl_reduce_by_segment.h so the implementations are not split across several directories.

@mmichel11 mmichel11 force-pushed the dev/mmichel11/rts_reduce_by_segment branch from 5643e6c to fdf6a39 Compare November 6, 2024 15:53
Copy link
Contributor

@danhoeflinger danhoeflinger left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like this implementation and thing it is basically good to go.
I think more comments are necessary, and there is some potential for future gains.

I probably want to look a bit further in to minor details before approving with another pass but at a high level I think this is in good shape.

namespace __par_backend_hetero
{

template <typename... Name>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I confirmed in an outside editor that the only changes to this from the previous location are only cosmetic. I'm not looking deeply into this files changes otherwise since it has already been reviewed and it was just moved.

Comment on lines 849 to 860
{
const _KeyType& __next_key = __in_keys[__id + 1];
return oneapi::dpl::__internal::make_tuple(
oneapi::dpl::__internal::make_tuple(std::size_t{0}, _ValueType{__in_vals[__id]}),
!__binary_pred(__current_key, __next_key), __next_key, __current_key);
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It may be possible to avoid the __id == 0 case, in a similar way to unique. It is a little more complicated because we would need to set up the carry-in appropriately, but I think its possible and could provide some branch avoiding (and tuple shrinking) gains in the helpers.
If you think its possible to do this, lets leave it as an issue to be explored in a follow up.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree and have opened #1958.

Copy link
Contributor

@adamfidel adamfidel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

First pass of comments. I have looked at primarily the fallback algorithms and intend to focus on the reduce-then-scan implementation next.

sycl::nd_item<1> __item) {
auto __group = __item.get_group();
std::size_t __group_id = __item.get_group(0);
std::size_t __local_id = __item.get_local_id(0);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This could probably safely be a uint32_t. Or, you can remove this variable and replace its only usage with if (__group.leader().

Copy link
Contributor Author

@mmichel11 mmichel11 Nov 22, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I changed this type to be std::uint32_t

Copy link
Contributor

@danhoeflinger danhoeflinger left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

At a high level I agree with the changes in this PR, but there are still a few remaining nit picks outstanding.

I have run out of time before my time off to get into the small details like sizes of types and forwarding of references, things like that. The clang format suggestions can be ignored as of now.

So, I wont hit approve officially but I think this is very close and trust @adamfidel / others to be able to get it across the finish line and have no objections to merging with another approval.

@mmichel11 mmichel11 requested a review from adamfidel December 4, 2024 14:36
Copy link
Contributor

@adamfidel adamfidel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few more comments.

I think overall the approach is good and I don't mind approving once these few small comments have been addressed.

adamfidel
adamfidel previously approved these changes Dec 6, 2024
Copy link
Contributor

@adamfidel adamfidel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM overall. The clang-format changes can be ignored in my opinion.

@mmichel11
Copy link
Contributor Author

I have fixed a compilation issue identified in CI. With some configurations, specifying the full namespace oneapi::dpl::__ranges when calling __get_sycl_range is needed to avoid compilation errors.

I am rerunning CI to make sure there are no more issues.

mmichel11 and others added 24 commits December 12, 2024 08:18
Signed-off-by: Matthew Michel <[email protected]>
Signed-off-by: Matthew Michel <[email protected]>
Signed-off-by: Matthew Michel <[email protected]>
* An iterator based __pattern_reduce_by_segment is added
* Due to compiler issues prior to icpx 2025.0, the reduce-then-scan path
  is disabled and the previous handcrafted SYCL implementation is restored to
  prevent performance regressions with older compilers
* The previous range-based fallback implementation has been moved to the
  SYCL backend along with the handcrafted SYCL version

Signed-off-by: Matthew Michel <[email protected]>
Signed-off-by: Matthew Michel <[email protected]>
Signed-off-by: Matthew Michel <[email protected]>
@mmichel11 mmichel11 force-pushed the dev/mmichel11/rts_reduce_by_segment branch from ef5e927 to b290747 Compare December 12, 2024 16:20
@mmichel11 mmichel11 added this to the 2022.8.0 milestone Dec 17, 2024
Copy link
Contributor

@adamfidel adamfidel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM after the recent changes to fix issues discovered in CI.

Copy link
Contributor

@danhoeflinger danhoeflinger left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM after minor changes. Thanks and sorry for the delay

@mmichel11 mmichel11 merged commit 7150d5c into main Dec 19, 2024
21 of 22 checks passed
@mmichel11 mmichel11 deleted the dev/mmichel11/rts_reduce_by_segment branch December 19, 2024 14:23
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants