Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Performance improvements of __parallel_find_or + __device_backend_tag #1617

Closed
wants to merge 28 commits into from

Conversation

SergeyKopienko
Copy link
Contributor

@SergeyKopienko SergeyKopienko commented Jun 7, 2024

In this PR we made some performance improvements:

  • some performance improvements for __parallel_find_or + __device_backend_tag for the usage with the __parallel_or_tag.

The changes for

@SergeyKopienko SergeyKopienko added this to the 2022.7.0 milestone Jun 7, 2024
@SergeyKopienko SergeyKopienko changed the title Performance improvements Performance improvements of __pattern_any_of, __pattern_find_if and __parallel_find_or Jun 7, 2024
@SergeyKopienko SergeyKopienko force-pushed the dev/skopienko/parallel_find_or_to_DEV branch from 3c8a24f to 5c64105 Compare June 7, 2024 09:24
// Set local atomic value to global atomic
if (__local_idx == 0 && __comp(__found_local.load(), __found.load()))
if (__local_idx == 0)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe we are able to remove this extra __comp(__found_local.load(), __found.load()) call
due it's already implemented in for-loop below.

@SergeyKopienko SergeyKopienko force-pushed the dev/skopienko/parallel_find_or_to_DEV branch from 302dd52 to 6309293 Compare June 10, 2024 08:56
@SergeyKopienko SergeyKopienko marked this pull request as draft June 10, 2024 09:12
@SergeyKopienko SergeyKopienko force-pushed the dev/skopienko/parallel_find_or_to_DEV branch from 6309293 to 58e2415 Compare June 10, 2024 10:27
@SergeyKopienko SergeyKopienko marked this pull request as ready for review June 10, 2024 10:27
@SergeyKopienko SergeyKopienko force-pushed the dev/skopienko/parallel_find_or_to_DEV branch from 58e2415 to 14c178f Compare June 10, 2024 10:42
@SergeyKopienko
Copy link
Contributor Author

SergeyKopienko commented Jun 10, 2024

I am going to remove all comments like // Point # after receive some reviews.
It may help to compare differences between some specializations for the __parallel_or_tag and for the __parallel_find_forward_tag, __parallel_find_backward_tag

@SergeyKopienko
Copy link
Contributor Author

SergeyKopienko commented Jun 10, 2024

@MikeDvorskiy, how do you think, does it make sense to apply this changes for range's implementations too?

Actually, we should no duplicate the changes. The iterator based __pattern_any_of and range based __pattern_any_of recall a range based corresponding hetero backend pattern: oneapi::dpl::__par_backend_hetero::__parallel_find_or.

So, with right code design perspective you should modify oneapi::dpl::__par_backend_hetero::__parallel_find_or or another hetero backend new find pattern (reduction based f.e).

@SergeyKopienko SergeyKopienko changed the title Performance improvements of __pattern_any_of, __pattern_find_if and __parallel_find_or Performance improvements of __parallel_find_or Jun 11, 2024
@SergeyKopienko SergeyKopienko changed the title Performance improvements of __parallel_find_or Performance improvements of __parallel_find_or + __device_backend_tag Jun 11, 2024
// Specialization for __parallel_or_tag
template <typename _ExecutionPolicy, typename _Brick, typename... _Ranges>
bool
__parallel_find_or(oneapi::dpl::__internal::__device_backend_tag, _ExecutionPolicy&& __exec, _Brick __f,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Essentially, did you "split" __parallel_find_or pattern into two with "OR search logic" and "first/last search logic".
In that cause I would propose to have different names for patterns: __parallel_find_or and __parallel_find_first.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@MikeDvorskiy, how about __parallel_find_or and __parallel_find_entry ?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the standard and cppreference uses the "first" word.
As variant
__parallel_find_first
and
__parallel_find_any

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed: __parallel_find_or has been renamed to __parallel_find_first and __parallel_find_any
And some staff too.

// Point #A1 - not required

// Point #A2 - rewritten
_AtomicType __found_local = __init_value;
Copy link
Contributor

@MikeDvorskiy MikeDvorskiy Jun 12, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As far as I see __found_local variable is created and accessible just for the current "thread" (work item), it is not placed in SLM or global memory and is not shared between work items. So, there is no concurrent access here and we don't need use any atomic, a simple bool type is enough.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In global atomic type bool is absent.
So really it's one of integer types here.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Assuming we keep this strategy, I'd be in favor of using a different name for the type _AtomicType is a very confusing typename for something which is not being used as an atomic.
I'd prefer to define two aliases for the same type for clarity here if you wanted to use _AtomicType for the global atomic, and have it share a type with the _LocalStatusType.


// Point #A3 - rewritten
constexpr auto __comp = typename _BrickTag::_Compare{};
__pred(__item_id, __n_iter, __wgroup_size, __comp, __found_local, __brick_tag, __rngs...);
Copy link
Contributor

@MikeDvorskiy MikeDvorskiy Jun 12, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if a solution is found, the current work item may don't compute __pred any more with "find_or" algorithm perspective. Did you try to skip __pred computation and estimate performance change? I understand that in case of simple predicate such checks might lead to overheads... But in any case, can we get a performance gain here?

Comment on lines 1146 to 1069
#if _ONEDPL_COMPILE_KERNEL
auto __kernel = __internal::__kernel_compiler<_FindOrKernel>::__compile(__exec);
__wgroup_size = ::std::min(__wgroup_size, oneapi::dpl::__internal::__kernel_work_group_size(__exec, __kernel));
#endif
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that this workaround _ONEDPL_COMPILE_KERNEL is unnecessary here. My understanding of this kernel compilation bundle step is that it is a workaround to prevent very large allocations from occurring on CPU targets which may return very large __max_work_group_size if left unchecked. Since we are not allocating based upon workgroup size, I don't think this is needed in the first place.

If we want some reasonable limit on workgroup size, this can be replaced by a simple "reasonable" maximum workgroup size like 1024 as is done in histogram.

    ::std::size_t __max_wgroup_size = oneapi::dpl::__internal::__max_work_group_size(__exec);
    ::std::uint16_t __work_group_size = ::std::min(::std::size_t(1024), __max_wgroup_size);

I intend to investigate if we can move away from this workaround altogether in oneDPL, but I have not done so yet. For the time being though I'd prefer not to propagate it unless we have some specific motivation for doing so which I am unaware of.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@MikeDvorskiy, what do you think about _ONEDPL_COMPILE_KERNEL and etc.?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My understanding was (is) that we want to know "right" maximus work group size, so the all processed values by WG will be fit into SLM memory. It leads to conclusion that it needs just for those kernels where we use SLM.

bool
operator()(const _LocalAtomic& __found_local, const _GlobalAtomic& __found) const
operator()(const _AtomicType __found_local, const _AtomicType __found) const
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Am I correct that this is actually never called in the current PR?

the __parallel_or_tag overload takes __comp as a parameter but discards it without use.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You are correct, fixed.

// Set found state result to global atomic
if (__found_local != __init_value)
{
__found.fetch_or(__found_local);
Copy link
Contributor

@MikeDvorskiy MikeDvorskiy Jun 13, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As far as I can see, here __found_local is not atomic, just a bool/int type and each work item computes predicate and "touches" (updates) the global atomic only if the current predicate is true. I guess such approach may be effective if an input sequence has "few" solutions.
I think if an input sequence has many solutions, each time a work item will "touch" global atomic to update. An access to global atomic is very "expensive". I guess we might have a big performance penalty in that case.

…on of __pattern_any_of on __parallel_transform_reduce

Signed-off-by: Sergey Kopienko <[email protected]>
…on of __pattern_find_if on __parallel_transform_reduce

Signed-off-by: Sergey Kopienko <[email protected]>
… as unused anymoreinclude/oneapi/dpl/pstl/hetero/dpcpp/parallel_backend_sycl.h - performance optimization of __parallel_find_or + __device_backend_tag for the usage with __parallel_or_tag
… extra call of __comp(__found_local.load(), __found.load())

Signed-off-by: Sergey Kopienko <[email protected]>
Signed-off-by: Sergey Kopienko <[email protected]>
…lf review comment: let's use __brick_tag instead of __parallel_or_tag{}

Signed-off-by: Sergey Kopienko <[email protected]>
… in comments: __typle_type -> __tuple_type

Signed-off-by: Sergey Kopienko <[email protected]>
…view comment: remove local variable _IterSize __current_iter

Signed-off-by: Sergey Kopienko <[email protected]>
…ementation of __pattern_find_if on __parallel_transform_reduce"

This reverts commit f97df31.
…ementation of __pattern_any_of on __parallel_transform_reduce"

This reverts commit d659866.
…predicates __find_if_unary_transform_op, __find_if_binary_reduce_op"

This reverts commit 2d89714.
…ix error in comments: __typle_type -> __tuple_type"

This reverts commit 67eedc1.
…view comment: the __parallel_or_tag overload takes __comp as a parameter but discards it without use.

Signed-off-by: Sergey Kopienko <[email protected]>
…t and __parallel_find_any

Signed-off-by: Sergey Kopienko <[email protected]>
…t and __parallel_find_any

Signed-off-by: Sergey Kopienko <[email protected]>
… extra comments "Point #..."

Signed-off-by: Sergey Kopienko <[email protected]>
…e store call instead of fetch_or

Signed-off-by: Sergey Kopienko <[email protected]>
…view comment: do not use type name "_AtomicType" for local state variable

Signed-off-by: Sergey Kopienko <[email protected]>
…fy __parallel_or_tag

Signed-off-by: Sergey Kopienko <[email protected]>
(cherry picked from commit f9948b3)
… some local variables inside __parallel_find_any

Signed-off-by: Sergey Kopienko <[email protected]>
(cherry picked from commit 315b091)
…e __parallel_find_any on parallel_for_work_group + parallel_for_work_item

Signed-off-by: Sergey Kopienko <[email protected]>
(cherry picked from commit a7c749f)
…ed only when __found_in_any_item_inside_group is false

Signed-off-by: Sergey Kopienko <[email protected]>
@SergeyKopienko SergeyKopienko force-pushed the dev/skopienko/parallel_find_or_to_DEV branch from 34fa73a to d2581b2 Compare June 19, 2024 13:13
@SergeyKopienko SergeyKopienko deleted the dev/skopienko/parallel_find_or_to_DEV branch July 9, 2024 09:02
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants