Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Enable vectorized global loads for the reduction algorithms #1470

Merged
merged 30 commits into from
May 22, 2024
Merged
Changes from 1 commit
Commits
Show all changes
30 commits
Select commit Hold shift + click to select a range
1f1ee5d
Dynamic number of items per work-item
julianmi Feb 6, 2024
eeb4120
Enable vectorization
julianmi Feb 27, 2024
41fd20e
Code cleanup
julianmi Feb 27, 2024
3630da8
Fix WG adjustment
julianmi Feb 27, 2024
c612f61
Restructure vectorized reduction
julianmi Feb 28, 2024
e029fd7
Add single item process path
julianmi Feb 29, 2024
f5e8271
Template vector width
julianmi Feb 29, 2024
530e13d
Reduce branch divergence
julianmi Mar 19, 2024
d36acbc
Enable 32-bit addressing
julianmi Mar 22, 2024
42c9aab
Fix merge issues
julianmi Mar 22, 2024
556c184
Centralize tuning parameters
julianmi Mar 25, 2024
b8d067a
Cleanup diff
julianmi Mar 27, 2024
5c9a43e
Fix CPU backend issue
julianmi Apr 3, 2024
14c44df
Address review feedback
julianmi Apr 12, 2024
6fc2581
Fix merge issue
julianmi Apr 15, 2024
1db434d
Address review comments
julianmi Apr 15, 2024
ebd0da3
Remove another inline statement
julianmi Apr 15, 2024
f7ee7f1
Remove unintentional formatting changes
julianmi Apr 17, 2024
2ef7d5b
Remove move statement and ::std
julianmi Apr 18, 2024
17d5bce
Address review feedback
julianmi Apr 22, 2024
291b1c6
Address review feedback
julianmi Apr 24, 2024
fdaf78d
Update is_device_copyable trait
julianmi Apr 25, 2024
27a01f4
Update transform_reduce signature also in test
julianmi Apr 25, 2024
70939d6
Add missing out-of-bounds check
julianmi Apr 26, 2024
0245bbd
Improve bounds check based on review comments
julianmi Apr 26, 2024
48bf347
Further bounds check improvements
julianmi Apr 26, 2024
90c308b
Add check for shorter addressing support
julianmi May 21, 2024
3d93a31
Use static assert instead
julianmi May 21, 2024
3ebc344
Address review comments
julianmi May 21, 2024
27fd437
Rename union storeage based on review discussion
julianmi May 21, 2024
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
30 changes: 13 additions & 17 deletions include/oneapi/dpl/pstl/hetero/dpcpp/parallel_backend_sycl_reduce.h
Original file line number Diff line number Diff line change
Expand Up @@ -47,11 +47,19 @@ class __reduce_mid_work_group_kernel;
template <typename... _Name>
class __reduce_kernel;

// Storage helper since _Tp may not have a default constructor.
template <typename _Tp>
union __storage
Copy link
Contributor

@danhoeflinger danhoeflinger May 21, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we are going to lift this type definition out, we probably need to rename it as well. (trying to think of a good name...)

Copy link
Contributor

@SergeyKopienko SergeyKopienko May 21, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Additionally if we are going to lift this type definition out, we may cover the case when we have array of elements too.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure about the array of elements, perhaps that reaches too far beyond the scope of this PR, but maybe something like __delayed_ctor_storage?

I think we need something which describes its purpose.

Copy link
Contributor

@SergeyKopienko SergeyKopienko May 21, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

__optional_ctor_storage ?
__lazy_ctor_storage ?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't know how far we want to go in the context of this PR, but this trick is also used
https://github.com/oneapi-src/oneDPL/blob/a9aabb2b94020634f8f9961471f8af0a2aeb60ea/include/oneapi/dpl/pstl/hetero/dpcpp/parallel_backend_sycl_radix_sort_one_wg.h#L169 and https://github.com/oneapi-src/oneDPL/blob/a9aabb2b94020634f8f9961471f8af0a2aeb60ea/include/oneapi/dpl/pstl/hetero/dpcpp/parallel_backend_sycl_radix_sort.h#L597

If we are lifting this, it would be great to unify all the use to a single type. Then future improvements can be had by all, and it will improve readability.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suppose the first is the array case Sergey was referring to, I'd be fine with leaving that one out for now to limit the scope of the PR if it makes it significantly more complicated.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I propose to make additional changes with it in some separate PR.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, for this PR, lets just rename it, we can unify, etc. in a separate PR.
My vote is for __lazy_ctor_storage because I think optional advertises more functionality than is provided here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for this discussion. I agree that larger changes are outside the scope of this PR and change the naming to __lazy_ctor_storage.

{
_Tp __v;
__storage() {}
};

// Adjust number of sequential operations per work-item based on the vector size. Single elements are kept to
// improve performance of small arrays or remainder loops.
template <std::uint8_t _VecSize, typename _Size>
auto
__adjust_iters_per_work_item(_Size __iters_per_work_item) -> _Size
_Size
__adjust_iters_per_work_item(_Size __iters_per_work_item)
{
if (__iters_per_work_item > 1)
return oneapi::dpl::__internal::__dpl_ceiling_div(__iters_per_work_item, _VecSize) * _VecSize;
Expand All @@ -68,11 +76,7 @@ __work_group_reduce_kernel(const _NDItemId __item_id, const _Size __n, const _Si
{
auto __local_idx = __item_id.get_local_id(0);
const _Size __group_size = __item_id.get_local_range().size();
union __storage
{
_Tp __v;
__storage() {}
} __result;
__storage<_Tp> __result;
// 1. Initialization (transform part). Fill local memory
__transform_pattern(__item_id, __n, __iters_per_work_item, /*global_offset*/ (_Size)0, __is_full,
/*__n_groups*/ (_Size)1, __result, __acc...);
Expand Down Expand Up @@ -100,11 +104,7 @@ __device_reduce_kernel(const _NDItemId __item_id, const _Size __n, const _Size _
auto __local_idx = __item_id.get_local_id(0);
auto __group_idx = __item_id.get_group(0);
const _Size __group_size = __item_id.get_local_range().size();
union __storage
{
_Tp __v;
__storage() {}
} __result;
__storage<_Tp> __result;
// 1. Initialization (transform part). Fill local memory
__transform_pattern(__item_id, __n, __iters_per_work_item, /*global_offset*/ (_Size)0, __is_full, __n_groups,
__result, __acc...);
Expand Down Expand Up @@ -394,11 +394,7 @@ struct __parallel_transform_reduce_impl
// 1. Initialization (transform part). Fill local memory
_Size __n_items;
const bool __is_full = __n == __size_per_work_group * __n_groups;
union __storage
{
_Tp __v;
__storage() {}
} __result;
__storage<_Tp> __result;
if (__is_first)
{
__transform_pattern1(__item_id, __n, __iters_per_work_item, /*global_offset*/ (_Size)0,
Expand Down
Loading