Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Out of Place Esimd Radix Sort and Tests #1439

Merged
merged 20 commits into from
Apr 23, 2024

Conversation

danhoeflinger
Copy link
Contributor

@danhoeflinger danhoeflinger commented Mar 11, 2024

  • Adding out of place support to ESIMD radix sort [by key] (high level refactoring).
  • Adding tests to test out of place sort [by key].
  • Removing some unnecessary extra kernel names (Odd vs Even should be the same kernel, just with different runtime params)

APIs are disambiguated between iterator and range interfaces using enable_if to check if a type is an iterator. Documentation updates will be made in a separate PR.

@danhoeflinger danhoeflinger marked this pull request as ready for review March 12, 2024 13:01
radix_sort_out_of_place(sycl::queue __q, _KeysRng1&& __keys_rng, _KeysRng2&& __keys_rng_out, _KernelParam __param = {})
{
__impl::__check_esimd_sort_params<__radix_bits, _KernelParam::data_per_workitem, _KernelParam::workgroup_size>();
if (__keys_rng.size() <= 0)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why you wrote here <= ?
Usually we use the code like

    if (__n == 0)
        return;

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed, thanks.


__event_chain = __radix_sort_onesweep_scan_submitter<__stage_count, __bin_count, _EsimdRadixSortScan>()(
__q, __mem_holder.__global_hist_ptr(), __n, __event_chain);

for (::std::uint32_t __stage = 0; __stage < __stage_count; __stage++)
auto __submit_iteration = [&](auto __input_, auto __output_, auto __p_global_hist_, auto __p_group_hists_,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably the constructions like this may decrease performance, I think.
Sometimes they are required, of course. But in this place it's sooks like for me as the simple code style.
Of course, it's only my opinion.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My intention is to reduce repetitive code to make more apparent what is actually different between these invocations of __radix_sort_onesweep_submitter, but I don't think it is required technically.

Also, my changes unroll the first iteration of the submission before the loop, as its inputs / outputs are special cased so there are three redundant calls, rather than just two.

I can go back to something more similar to the previous way of doing it without the lambda. I don't have a strong preference here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've removed the lambda __submit_iteration.

template <::std::uint8_t __radix_bits, ::std::uint16_t __data_per_work_item, ::std::uint16_t __work_group_size,
typename _RngPack, typename _MemHolder>
void
__allocate_temp(_MemHolder& __mem_holder, ::std::size_t __n)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

May be better to return _MemHolder instance from this method?
In this case we are able to not declare some their methods as const...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I originally abstracted these two functions out to be separate functions to be able to reuse from multiple places, but then later unified out_of_place and in_place versions together anyway pulling that decision up the stack, so they don't even need to be separate functions. I'll return them both to be within __onesweep

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've restored this to be within the __onesweep function.

auto expected_first = oneapi::dpl::make_zip_iterator(std::begin(expected_keys), std::begin(expected_values));
std::stable_sort(expected_first, expected_first + size, CompareKey<isAscending>{});

std::string parameters_msg = ", n: " + std::to_string(size) + ", sizeof(key): " + std::to_string(sizeof(KeyT)) +
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you use std::strstream you may to build this message without a lot of std::to_string calls.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed, thanks.

using _EsimRadixSortKernel =
oneapi::dpl::__par_backend_hetero::__internal::__kernel_name_provider<__esimd_radix_sort_one_wg<_KernelName>>;

return __radix_sort_one_wg_submitter<__is_ascending, __radix_bits, __data_per_work_item, __work_group_size, _KeyT,
_EsimRadixSortKernel>()(__q, ::std::forward<_RngPack>(__pack), __n);
_EsimRadixSortKernel>()(__q, __pack_in, __pack_out, __n);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why you not use std::forward for pass __pack_in and __pack_out ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed, thanks.

Copy link
Contributor

@dmitriy-sobolev dmitriy-sobolev left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The PR looks good to me besides _out_of_place function name part, which is mentioned in the description.

Comment on lines 263 to 276
if (__stage % 2 != 0)
{
__event_chain = __radix_sort_onesweep_submitter<__is_ascending, __radix_bits, __data_per_work_item,
__work_group_size, _EsimdRadixSortSweep>()(
__q, __virt_pack1, __virt_pack2, __p_global_hist, __p_group_hists, __sweep_work_group_count, __n,
__stage, __event_chain);
}
else
{
__event_chain = __radix_sort_onesweep_submitter<__is_ascending, __radix_bits, __data_per_work_item,
__work_group_size, _EsimdRadixSortSweep>()(
__q, __virt_pack2, __virt_pack1, __p_global_hist, __p_group_hists, __sweep_work_group_count, __n,
__stage, __event_chain);
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unless I miss something obvious, these two kernel submissions will not generate identical kernels, and so should really have distinct names (like with the Even/Odd suffixes before the patch).

Copy link
Contributor Author

@danhoeflinger danhoeflinger Apr 17, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now in hindsight, I think you are correct, because the types of __input_pack, __virt_pack1 and __virt_pack2 may not match.

However, if they do match, do you agree that these would generate an identical kernel?

My proposal would be to merely include the types of the input & output in the kernel name, that way they aren't required to generate two identical kernels with different names when the types match.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm doing some testing with no unnamed lambda, and seeing the issue. I'll provide an update when it is fixed.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've fixed the issue as proposed by including the decayed RngPack typenames in the kernel name.
I've also fixed kernel name clashing in the tests for this and the in-place version of esimd radix sort, and added a test utility to generate a new kernel_params struct with a new kernelname.

Thanks for catching this.

Signed-off-by: Dan Hoeflinger <[email protected]>
Signed-off-by: Dan Hoeflinger <[email protected]>
Signed-off-by: Dan Hoeflinger <[email protected]>
Signed-off-by: Dan Hoeflinger <[email protected]>
@danhoeflinger
Copy link
Contributor Author

danhoeflinger commented Apr 19, 2024

The PR looks good to me besides _out_of_place function name part, which is mentioned in the description.

@dmitriy-sobolev
I've updated the APIs to use enable_if to disambiguate the range and iterator versions. If we decide to remove the range interfaces, it will be an easy change from the current state of the PR. Please take a look at the __is_iterator_type_v check.

Copy link
Contributor

@dmitriy-sobolev dmitriy-sobolev left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The approach to get rid of out_of_place suffix looks good to me. I've left some minor comments, though.

include/oneapi/dpl/pstl/utils.h Outdated Show resolved Hide resolved
Comment on lines 219 to 221
using _EsimdRadixSortSweepInitial = oneapi::dpl::__par_backend_hetero::__internal::__kernel_name_provider<
::std::conditional_t<__has_values, __esimd_radix_sort_onesweep_by_key<std::decay_t<_RngPack1>, std::decay_t<_RngPack2>, _KernelName>,
__esimd_radix_sort_onesweep<std::decay_t<_RngPack1>, std::decay_t<_RngPack2>, _KernelName>>>;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not strictly opposed of that approach, but there is a drawback: kernel names will become longer (I guess, much longer).

The motivation of the change is to separate kernel names for the potential arrangements of _RngPack1, _RngPack2, _RngPack3, isn't it? If so, what do you think about using some constant instead, e.g.:

  • __esimd_radix_sort_onesweep<0, ...> instead of __esimd_radix_sort_onesweep<std::decay_t<_RngPack1>, std::decay_t<_RngPack2>, ...>,
  • __esimd_radix_sort_onesweep<1, ...> instead of __esimd_radix_sort_onesweep<std::decay_t<_RngPack3>, std::decay_t<_RngPack2>, ...>?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The goal is to have a unique kernel name if and only if we have a unique kernel. We want to avoid same mangled name error, but also we don't want to add extra kernel compilation only to support a differently named kernel when the types match exactly.

Lets say _RngPack1 is the exact same as _RngPack3, we then want __esimd_radix_sort_onesweep<std::decay_t<_RngPack1>, std::decay_t<_RngPack2>, ...>
and __esimd_radix_sort_onesweep<std::decay_t<_RngPack3>, std::decay_t<_RngPack2>, ...> to be the exact same as well. If not, I think we require an extra kernel compilation only to satisfy having a different name.

If all three exactly match, we only want one kernel and one kernel name.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's reasonable, thanks for describing.

Probably, it is still possible to use shorter kernel names. I will think about it separately as it is rather a "nice to have" thing.

Currently, I see the following options:

Rng1 type Rng2 type Rng3 type Kernels Constants
A B C A->B, B->C, C->B 0,1, 2
A A A A->A, A->A, A->A 3, 3, 3
A A B A->A, A->B, B->A 3, 0, 4
A B A A->B, B->A, A->B 0, 4, 0
B A A B->A, A->A, A->A 4, 3, 3

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm interested in the idea, but to create such constants, you would need to basically check is_same<decay_t<_RngPack1>, decay_t<_RngPack2>> for all three combos and assign constants appropriately.
The code is ugly either way. Perhaps the shorter kernel name is worth it.

Copy link
Contributor

@dmitriy-sobolev dmitriy-sobolev left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've checked the code once again. I'm ready to approve once minor issues noted above have been solved.

Signed-off-by: Dan Hoeflinger <[email protected]>
Copy link
Contributor

@dmitriy-sobolev dmitriy-sobolev left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@danhoeflinger danhoeflinger merged commit 6e7b290 into main Apr 23, 2024
20 checks passed
@danhoeflinger danhoeflinger deleted the dev/dhoeflin/out_of_place_esimd_radixsort branch April 23, 2024 12:10
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants