Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Enable vectorized global loads for the reduction algorithms #1470

Merged
merged 30 commits into from
May 22, 2024

Conversation

julianmi
Copy link
Contributor

Vectorization is performance critical on SIMD architectures. This patch enables vectorization by unrolling vector size wide loop iterations on both coalesced (commutative algorithms) and consecutive (non-commutative algorithms) loads. Coalesced loads will then load vectors of consecutive elements. This change improves the coalesced loads on Intel SIMD GPUs without decreasing the throughput on SIMT GPUs. Coalesced loads are therefore enabled on SPIR-V backends as well. min_element and max_element continue using consecutive loads on SPIR-V backends due to the performance penalty of the required index check when using coalesced global loads.

Secondly, the vectorization enables dynamic number of elements to be processed per work-item. Launch parameter tuning with compile time constants is therefore not needed anymore. This reduces the number of template instantiations from 13 to 3, which improves the compile times significantly (e.g., half the time for sycl_iterator_reduce.pass).

Thirdly, branch divergence is minimized by adding a flag showing whether the work-group can process full sequences of the input array. If so, branching withing the inner kernel can be removed. If not, all work-items in a group follow the same boundary-checked implementation.

@julianmi julianmi force-pushed the dev/julianmi/reduce_vectorization branch 2 times, most recently from 2adb25a to 76097ed Compare April 4, 2024 13:38
@julianmi julianmi marked this pull request as ready for review April 4, 2024 15:09
@julianmi julianmi added this to the 2022.6.0 milestone Apr 5, 2024

// Empirically found tuning parameters for typical devices.
constexpr _Size __max_iters_per_work_item = 32;
constexpr ::std::size_t __max_work_group_size = 256;
Copy link
Contributor

@MikeDvorskiy MikeDvorskiy Apr 8, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably, is better to make a request to a device, like oneapi::dpl::__internal::__max_work_group_size(...) and oneapi::dpl::__internal::__max_sub_group_size(...) ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These values are empirically found to achieve the highest throughput. The device-specific work-group limits are checked a couple of lines down.

Copy link
Contributor

@mmichel11 mmichel11 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The comments I have are mostly stylistic to better understand the flow of the code.

@julianmi julianmi force-pushed the dev/julianmi/reduce_vectorization branch from 76097ed to cd2dba5 Compare April 15, 2024 14:28
@julianmi julianmi force-pushed the dev/julianmi/reduce_vectorization branch from 111fc65 to e39f642 Compare April 16, 2024 14:23
@julianmi julianmi force-pushed the dev/julianmi/reduce_vectorization branch from 8e37910 to e6983fc Compare April 18, 2024 13:45
scalar_reduction_remainder(const _Size __start_idx, const _Size __adjusted_n, const _Size __max_iters, _Res& __res,
const _Acc&... __acc) const
{
const _Size __no_iters = std::min(static_cast<_Size>(__adjusted_n - __start_idx), __max_iters);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

static_cast<_Size> here and everywhere looks very suspicious and bulky...
Probably, we can pass a "right" integer type as _Size? Or/and use auto when it is applicable and doesn't break correctness?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree. The _Size type is provided by SYCL and I don't think we should change it. Instead, I've used auto and added some temporaries to overcome this.

const _Size __global_idx = __item_id.get_global_id(0);
if (__iters_per_work_item == 1)
{
new (&__res.__v) _Tp(__unary_op(__global_idx, __acc...));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Due to __local_idx is not used within this scope, the definition of __local_idx (line 247) may be moved down after if operator.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

Copy link
Contributor

@mmichel11 mmichel11 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have a few small comments and am ready to approve once these are considered.

include/oneapi/dpl/pstl/hetero/dpcpp/unseq_backend_sycl.h Outdated Show resolved Hide resolved
include/oneapi/dpl/pstl/hetero/dpcpp/unseq_backend_sycl.h Outdated Show resolved Hide resolved
__adjust_iters_per_work_item(_Size __iters_per_work_item) -> _Size
{
if (__iters_per_work_item > 1)
return ((__iters_per_work_item + _VecSize - 1) / _VecSize) * _VecSize;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this can be written with __dpl_ceiling_div like:

return __dpl_ceiling_div(__iters_per_work_item, _VecSize) * _VecSize;

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

Copy link
Contributor

@danhoeflinger danhoeflinger left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

apologies for this late review. Mostly minor things.
I'm still wrapping my head around things fully so I probably shouldn't be the approver. However hopefully some of these comments can help. Continuing to look as time permits as well.

@julianmi julianmi force-pushed the dev/julianmi/reduce_vectorization branch from afd0d90 to 4d1cbf2 Compare April 24, 2024 10:46
@julianmi julianmi force-pushed the dev/julianmi/reduce_vectorization branch from 933790f to 48bf347 Compare May 13, 2024 10:55
@julianmi julianmi requested a review from danhoeflinger May 13, 2024 10:55
Copy link
Contributor

@danhoeflinger danhoeflinger left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Only a minor comment. Otherwise, I think this looks good, especially now since we have some time to react to any issues which may arise before a release.

I went through the PR with fresh eyes and couldn't find any real issues.

danhoeflinger
danhoeflinger previously approved these changes May 21, 2024
Copy link
Contributor

@danhoeflinger danhoeflinger left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Probably good to at least check for objections from others who have reviewed this PR before merging.
(and wait for green CI)

@SergeyKopienko
Copy link
Contributor

SergeyKopienko commented May 21, 2024

@julianmi, how do you think, should we introduce some type for the union

    union __storage
    {
        _Tp __v;
        __storage() {}
    };

?

@julianmi
Copy link
Contributor Author

@julianmi, how do you think, should we introduce some type for the union

    union __storage
    {
        _Tp __v;
        __storage() {}
    };

?

I've added a union type to reduce the code duplication.

SergeyKopienko
SergeyKopienko previously approved these changes May 21, 2024
Copy link
Contributor

@SergeyKopienko SergeyKopienko left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

mmichel11
mmichel11 previously approved these changes May 21, 2024
Copy link
Contributor

@mmichel11 mmichel11 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@@ -47,11 +47,19 @@ class __reduce_mid_work_group_kernel;
template <typename... _Name>
class __reduce_kernel;

// Storage helper since _Tp may not have a default constructor.
template <typename _Tp>
union __storage
Copy link
Contributor

@danhoeflinger danhoeflinger May 21, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we are going to lift this type definition out, we probably need to rename it as well. (trying to think of a good name...)

Copy link
Contributor

@SergeyKopienko SergeyKopienko May 21, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Additionally if we are going to lift this type definition out, we may cover the case when we have array of elements too.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure about the array of elements, perhaps that reaches too far beyond the scope of this PR, but maybe something like __delayed_ctor_storage?

I think we need something which describes its purpose.

Copy link
Contributor

@SergeyKopienko SergeyKopienko May 21, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

__optional_ctor_storage ?
__lazy_ctor_storage ?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't know how far we want to go in the context of this PR, but this trick is also used
https://github.com/oneapi-src/oneDPL/blob/a9aabb2b94020634f8f9961471f8af0a2aeb60ea/include/oneapi/dpl/pstl/hetero/dpcpp/parallel_backend_sycl_radix_sort_one_wg.h#L169 and https://github.com/oneapi-src/oneDPL/blob/a9aabb2b94020634f8f9961471f8af0a2aeb60ea/include/oneapi/dpl/pstl/hetero/dpcpp/parallel_backend_sycl_radix_sort.h#L597

If we are lifting this, it would be great to unify all the use to a single type. Then future improvements can be had by all, and it will improve readability.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suppose the first is the array case Sergey was referring to, I'd be fine with leaving that one out for now to limit the scope of the PR if it makes it significantly more complicated.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I propose to make additional changes with it in some separate PR.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, for this PR, lets just rename it, we can unify, etc. in a separate PR.
My vote is for __lazy_ctor_storage because I think optional advertises more functionality than is provided here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for this discussion. I agree that larger changes are outside the scope of this PR and change the naming to __lazy_ctor_storage.

@julianmi julianmi dismissed stale reviews from mmichel11 and SergeyKopienko via 27fd437 May 21, 2024 16:05
Copy link
Contributor

@SergeyKopienko SergeyKopienko left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Copy link
Contributor

@danhoeflinger danhoeflinger left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@julianmi julianmi merged commit b694665 into main May 22, 2024
20 checks passed
@julianmi julianmi deleted the dev/julianmi/reduce_vectorization branch May 22, 2024 13:15
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants