Performance improvements of `__parallel_find_or` + `__device_backend_tag` #1617
Conversation
`__pattern_any_of`, `__pattern_find_if` and `__parallel_find_or`
Force-pushed from 3c8a24f to 5c64105.
```diff
 // Set local atomic value to global atomic
-if (__local_idx == 0 && __comp(__found_local.load(), __found.load()))
+if (__local_idx == 0)
```
I believe we can remove this extra `__comp(__found_local.load(), __found.load())` call, because the same comparison is already performed in the `for` loop below.
Force-pushed from 302dd52 to 6309293.
Force-pushed from 6309293 to 58e2415.
Force-pushed from 58e2415 to 14c178f.
I am going to remove all comments like "Point #...".

@MikeDvorskiy, what do you think: does it make sense to apply these changes to the ranges implementations too?

Actually, we should not duplicate the changes. The iterator-based [...] So, from a proper code design perspective, you should modify [...]
```cpp
// Specialization for __parallel_or_tag
template <typename _ExecutionPolicy, typename _Brick, typename... _Ranges>
bool
__parallel_find_or(oneapi::dpl::__internal::__device_backend_tag, _ExecutionPolicy&& __exec, _Brick __f,
```
Essentially, you "split" the `__parallel_find_or` pattern into two, one with "OR search logic" and one with "first/last search logic". In that case I would propose having different names for the patterns: `__parallel_find_or` and `__parallel_find_first`.
@MikeDvorskiy, how about `__parallel_find_or` and `__parallel_find_entry`?
The standard and cppreference use the word "first". As a variant: `__parallel_find_first` and `__parallel_find_any`.
Fixed: `__parallel_find_or` has been renamed to `__parallel_find_first` and `__parallel_find_any`. And some other stuff too.
```cpp
// Point #A1 - not required

// Point #A2 - rewritten
_AtomicType __found_local = __init_value;
```
As far as I can see, the `__found_local` variable is created for and accessible only by the current "thread" (work item); it is not placed in SLM or global memory and is not shared between work items. So there is no concurrent access here, and we don't need to use any atomic: a simple bool type is enough.
The `bool` type is absent among the global atomic value types, so it really has to be one of the integer types here.
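To make the point concrete, here is a minimal sketch of the pattern under discussion, assuming plain SYCL with hypothetical names, predicate, and USM pointers (this is not the oneDPL implementation): a plain, non-atomic per-work-item flag, published once through a global atomic whose value type is `int` because `bool` is not available.

```cpp
// Minimal sketch, not the oneDPL code. Assumes `data` and `result` are
// USM device-accessible pointers; the predicate (== 42) is hypothetical.
#include <sycl/sycl.hpp>
#include <cstddef>

void find_any_sketch(sycl::queue& q, const int* data, std::size_t n, int* result)
{
    q.submit([&](sycl::handler& cgh) {
        cgh.parallel_for(sycl::range<1>{n}, [=](sycl::id<1> idx) {
            // Private to this work item: no concurrent access, so a plain
            // integer is enough ("bool" is not a valid global atomic type).
            int found_local = 0;
            if (data[idx] == 42) // hypothetical predicate
                found_local = 1;

            // Touch the global atomic only when something was found.
            if (found_local)
            {
                sycl::atomic_ref<int, sycl::memory_order::relaxed,
                                 sycl::memory_scope::device,
                                 sycl::access::address_space::global_space>
                    found_global(*result);
                found_global.fetch_or(found_local);
            }
        });
    });
}
```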
Assuming we keep this strategy, I'd be in favor of using a different name for the type: `_AtomicType` is a very confusing typename for something which is not being used as an atomic.

If you wanted to keep `_AtomicType` for the global atomic, I'd prefer to define two aliases for the same type for clarity, and have it share a type with the `_LocalStatusType`.
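A possible spelling of that suggestion (a sketch only; the exact aliases are hypothetical):

```cpp
// Two aliases over the same underlying type, so readers can distinguish the
// global atomic's value type from the plain per-work-item status value.
using _AtomicType = int;              // value type of the global atomic
using _LocalStatusType = _AtomicType; // plain local status, never used atomically
```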
```cpp
// Point #A3 - rewritten
constexpr auto __comp = typename _BrickTag::_Compare{};
__pred(__item_id, __n_iter, __wgroup_size, __comp, __found_local, __brick_tag, __rngs...);
```
From the "find_or" algorithm's perspective, once a solution is found the current work item does not need to compute `__pred` any more. Did you try to skip the `__pred` computation and estimate the performance change? I understand that in the case of a simple predicate such checks might lead to overhead... But in any case, can we get a performance gain here?
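For illustration, a minimal sketch of the suggested skip, with a hypothetical loop shape and predicate:

```cpp
#include <cstddef>

// Hypothetical shape of the per-work-item loop: once a match is found, skip
// the remaining predicate evaluations ("or"/"any" semantics need one hit).
template <typename _Pred>
int evaluate_with_early_exit(_Pred __pred, std::size_t __n_iter)
{
    int __found_local = 0;
    // The extra "!__found_local" test per iteration is the overhead the
    // comment above worries about for cheap predicates.
    for (std::size_t __i = 0; __i < __n_iter && !__found_local; ++__i)
    {
        if (__pred(__i)) // hypothetical per-iteration predicate check
            __found_local = 1;
    }
    return __found_local;
}
```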
```cpp
#if _ONEDPL_COMPILE_KERNEL
auto __kernel = __internal::__kernel_compiler<_FindOrKernel>::__compile(__exec);
__wgroup_size = ::std::min(__wgroup_size, oneapi::dpl::__internal::__kernel_work_group_size(__exec, __kernel));
#endif
```
I think this `_ONEDPL_COMPILE_KERNEL` workaround is unnecessary here. My understanding of this kernel compilation bundle step is that it is a workaround to prevent very large allocations on CPU targets, which may report a very large `__max_work_group_size` if left unchecked. Since we are not allocating based upon workgroup size, I don't think this is needed in the first place.

If we want some reasonable limit on workgroup size, this can be replaced by a simple "reasonable" maximum workgroup size like 1024, as is done in histogram:

```cpp
::std::size_t __max_wgroup_size = oneapi::dpl::__internal::__max_work_group_size(__exec);
::std::uint16_t __work_group_size = ::std::min(::std::size_t(1024), __max_wgroup_size);
```

I intend to investigate whether we can move away from this workaround altogether in oneDPL, but I have not done so yet. For the time being, though, I'd prefer not to propagate it unless we have some specific motivation for doing so that I am unaware of.
@MikeDvorskiy, what do you think about `_ONEDPL_COMPILE_KERNEL` and the rest?
My understanding was (and is) that we want to know the "right" maximum work group size so that all the values processed by a work group fit into SLM. This leads to the conclusion that it is needed only for those kernels where we use SLM.
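For context, a sketch of the two queries involved, using plain SYCL 2020 APIs rather than the oneDPL wrappers; the kernel name is hypothetical, and the kernel must actually be defined in the application for `get_kernel_id` to resolve:

```cpp
#include <sycl/sycl.hpp>
#include <algorithm>
#include <cstddef>

class MyFindOrKernel; // hypothetical kernel name

std::size_t choose_wg_size(sycl::queue& q)
{
    sycl::device dev = q.get_device();

    // Generic device-wide maximum: a safe upper bound, but it ignores what
    // this particular kernel consumes (registers, SLM).
    std::size_t max_wg = dev.get_info<sycl::info::device::max_work_group_size>();

    // Kernel-specific maximum: requires the kernel bundle to be built first,
    // which is the cost the _ONEDPL_COMPILE_KERNEL path pays.
    auto bundle =
        sycl::get_kernel_bundle<sycl::bundle_state::executable>(q.get_context());
    sycl::kernel k = bundle.get_kernel(sycl::get_kernel_id<MyFindOrKernel>());
    std::size_t kernel_wg =
        k.get_info<sycl::info::kernel_device_specific::work_group_size>(dev);

    return std::min(max_wg, kernel_wg);
}
```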
```diff
 bool
-operator()(const _LocalAtomic& __found_local, const _GlobalAtomic& __found) const
+operator()(const _AtomicType __found_local, const _AtomicType __found) const
```
Am I correct that this is actually never called in the current PR? The `__parallel_or_tag` overload takes `__comp` as a parameter but discards it without use.
You are correct, fixed.
```cpp
// Set found state result to global atomic
if (__found_local != __init_value)
{
    __found.fetch_or(__found_local);
```
As far as I can see, `__found_local` here is not atomic, just a bool/int type, and each work item computes the predicate and "touches" (updates) the global atomic only if the current predicate is true. I guess such an approach may be effective if an input sequence has "few" solutions.

I think that if an input sequence has many solutions, each work item will "touch" the global atomic to update it. Access to a global atomic is very "expensive", so I guess we might have a big performance penalty in that case.
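One common mitigation, shown here only as a hypothetical sketch (not what this PR does), is to read the global flag first and write only if it is still unset; since every writer stores the same value, the load/store race is benign for an OR-style flag, and the read is much cheaper than a read-modify-write update:

```cpp
#include <sycl/sycl.hpp>

// Hypothetical helper: publish a per-work-item "found" flag while touching
// the global atomic as little as possible. Assumes __result points to
// device-accessible (e.g. USM) memory.
inline void publish_found(int __found_local, int* __result)
{
    if (!__found_local)
        return; // nothing to publish

    sycl::atomic_ref<int, sycl::memory_order::relaxed,
                     sycl::memory_scope::device,
                     sycl::access::address_space::global_space>
        __found_global(*__result);

    // Cheap read first; skip the expensive store if the flag is already set.
    // The race between load and store is harmless: all writers store 1.
    if (!__found_global.load())
        __found_global.store(1);
}
```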
…on of __pattern_any_of on __parallel_transform_reduce Signed-off-by: Sergey Kopienko <[email protected]>
…on of __pattern_find_if on __parallel_transform_reduce Signed-off-by: Sergey Kopienko <[email protected]>
… as unused anymore. include/oneapi/dpl/pstl/hetero/dpcpp/parallel_backend_sycl.h - performance optimization of __parallel_find_or + __device_backend_tag for the usage with __parallel_or_tag
… extra call of __comp(__found_local.load(), __found.load()) Signed-off-by: Sergey Kopienko <[email protected]>
… extra auto keyword Signed-off-by: Sergey Kopienko <[email protected]>
Signed-off-by: Sergey Kopienko <[email protected]>
…lf review comment: let's use __brick_tag instead of __parallel_or_tag{} Signed-off-by: Sergey Kopienko <[email protected]>
…omment Signed-off-by: Sergey Kopienko <[email protected]>
… in comments: __typle_type -> __tuple_type Signed-off-by: Sergey Kopienko <[email protected]>
…view comment: remove local variable _IterSize __current_iter Signed-off-by: Sergey Kopienko <[email protected]>
… extra auto keyword Signed-off-by: Sergey Kopienko <[email protected]>
…review comment" This reverts commit b2e73df.
This reverts commit 8c6e80f.
…ementation of __pattern_find_if on __parallel_transform_reduce" This reverts commit f97df31.
…ementation of __pattern_any_of on __parallel_transform_reduce" This reverts commit d659866.
…predicates __find_if_unary_transform_op, __find_if_binary_reduce_op" This reverts commit 2d89714.
…ix error in comments: __typle_type -> __tuple_type" This reverts commit 67eedc1.
…view comment: the __parallel_or_tag overload takes __comp as a parameter but discards it without use. Signed-off-by: Sergey Kopienko <[email protected]>
…t and __parallel_find_any Signed-off-by: Sergey Kopienko <[email protected]>
…t and __parallel_find_any Signed-off-by: Sergey Kopienko <[email protected]>
… extra comments "Point #..." Signed-off-by: Sergey Kopienko <[email protected]>
…e store call instead of fetch_or Signed-off-by: Sergey Kopienko <[email protected]>
…view comment: do not use type name "_AtomicType" for local state variable Signed-off-by: Sergey Kopienko <[email protected]>
…fy __parallel_or_tag Signed-off-by: Sergey Kopienko <[email protected]> (cherry picked from commit f9948b3)
… some local variables inside __parallel_find_any Signed-off-by: Sergey Kopienko <[email protected]> (cherry picked from commit 315b091)
…e __parallel_find_any on parallel_for_work_group + parallel_for_work_item Signed-off-by: Sergey Kopienko <[email protected]> (cherry picked from commit a7c749f)
…ed only when __found_in_any_item_inside_group is false Signed-off-by: Sergey Kopienko <[email protected]>
Force-pushed from 34fa73a to d2581b2.
In this PR we made some performance improvements:

- `__parallel_find_or` + `__device_backend_tag` for the usage with the `__parallel_or_tag`.

The changes for

- `__pattern_any_of` (`__hetero_tag`), which has been implemented on `__parallel_transform_reduce`;
- `__pattern_find_if` (`__hetero_tag`), which has been implemented on `__parallel_transform_reduce`;

have been extracted to the separate PR "Performance improvements of `__pattern_any_of`, `__pattern_find_if`" #1622 and rolled back in this PR.