Performance improvements of `__parallel_find_or` + `__device_backend_tag` #1617
Conversation
`__pattern_any_of`, `__pattern_find_if` and `__parallel_find_or`
Force-pushed from 3c8a24f to 5c64105.
```diff
 // Set local atomic value to global atomic
-if (__local_idx == 0 && __comp(__found_local.load(), __found.load()))
+if (__local_idx == 0)
```
I believe we can remove this extra `__comp(__found_local.load(), __found.load())` call, because the same comparison is already performed in the `for` loop below.
Force-pushed from 302dd52 to 6309293.
Force-pushed from 6309293 to 58e2415.
Force-pushed from 58e2415 to 14c178f.
I am going to remove all comments like "Point #...".

@MikeDvorskiy, what do you think: does it make sense to apply these changes to the ranges implementations too?

Actually, we should not duplicate the changes. The iterator-based [...] So, from a proper code design perspective, you should modify [...]
```cpp
// Specialization for __parallel_or_tag
template <typename _ExecutionPolicy, typename _Brick, typename... _Ranges>
bool
__parallel_find_or(oneapi::dpl::__internal::__device_backend_tag, _ExecutionPolicy&& __exec, _Brick __f,
```
Essentially, you "split" the `__parallel_find_or` pattern into two, one with "OR search logic" and one with "first/last search logic". In that case I would propose having different names for the patterns: `__parallel_find_or` and `__parallel_find_first`.
@MikeDvorskiy, how about `__parallel_find_or` and `__parallel_find_entry`?
The standard and cppreference use the word "first". As a variant: `__parallel_find_first` and `__parallel_find_any`.
Fixed: `__parallel_find_or` has been renamed to `__parallel_find_first` and `__parallel_find_any`. And some other stuff too.
```cpp
// Point #A1 - not required

// Point #A2 - rewritten
_AtomicType __found_local = __init_value;
```
As far as I can see, the `__found_local` variable is created for and accessible only by the current "thread" (work item); it is not placed in SLM or global memory and is not shared between work items. So there is no concurrent access here, and we don't need to use any atomic: a simple bool type is enough.
The `bool` type is absent among the global atomic value types, so it really has to be one of the integer types here.
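To make the point concrete, here is a minimal sketch of the pattern under discussion, assuming plain SYCL with hypothetical names, predicate, and USM pointers (this is not the oneDPL implementation): a plain, non-atomic per-work-item flag, published once through a global atomic whose value type is `int` because `bool` is not available.

```cpp
// Minimal sketch, not the oneDPL code. Assumes `data` and `result` are
// USM device-accessible pointers; the predicate (== 42) is hypothetical.
#include <sycl/sycl.hpp>
#include <cstddef>

void find_any_sketch(sycl::queue& q, const int* data, std::size_t n, int* result)
{
    q.submit([&](sycl::handler& cgh) {
        cgh.parallel_for(sycl::range<1>{n}, [=](sycl::id<1> idx) {
            // Private to this work item: no concurrent access, so a plain
            // integer is enough ("bool" is not a valid global atomic type).
            int found_local = 0;
            if (data[idx] == 42) // hypothetical predicate
                found_local = 1;

            // Touch the global atomic only when something was found.
            if (found_local)
            {
                sycl::atomic_ref<int, sycl::memory_order::relaxed,
                                 sycl::memory_scope::device,
                                 sycl::access::address_space::global_space>
                    found_global(*result);
                found_global.fetch_or(found_local);
            }
        });
    });
}
```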
Assuming we keep this strategy, I'd be in favor of using a different name for the type: `_AtomicType` is a very confusing typename for something which is not being used as an atomic.

If you wanted to keep `_AtomicType` for the global atomic, I'd prefer to define two aliases for the same type for clarity, and have it share a type with the `_LocalStatusType`.
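A possible spelling of that suggestion (a sketch only; the exact aliases are hypothetical):

```cpp
// Two aliases over the same underlying type, so readers can distinguish the
// global atomic's value type from the plain per-work-item status value.
using _AtomicType = int;              // value type of the global atomic
using _LocalStatusType = _AtomicType; // plain local status, never used atomically
```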
```cpp
// Point #A3 - rewritten
constexpr auto __comp = typename _BrickTag::_Compare{};
__pred(__item_id, __n_iter, __wgroup_size, __comp, __found_local, __brick_tag, __rngs...);
```
From the "find_or" algorithm's perspective, once a solution is found the current work item does not need to compute `__pred` any more. Did you try to skip the `__pred` computation and estimate the performance change? I understand that in the case of a simple predicate such checks might lead to overhead... But in any case, can we get a performance gain here?
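For illustration, a minimal sketch of the suggested skip, with a hypothetical loop shape and predicate:

```cpp
#include <cstddef>

// Hypothetical shape of the per-work-item loop: once a match is found, skip
// the remaining predicate evaluations ("or"/"any" semantics need one hit).
template <typename _Pred>
int evaluate_with_early_exit(_Pred __pred, std::size_t __n_iter)
{
    int __found_local = 0;
    // The extra "!__found_local" test per iteration is the overhead the
    // comment above worries about for cheap predicates.
    for (std::size_t __i = 0; __i < __n_iter && !__found_local; ++__i)
    {
        if (__pred(__i)) // hypothetical per-iteration predicate check
            __found_local = 1;
    }
    return __found_local;
}
```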
```cpp
#if _ONEDPL_COMPILE_KERNEL
auto __kernel = __internal::__kernel_compiler<_FindOrKernel>::__compile(__exec);
__wgroup_size = ::std::min(__wgroup_size, oneapi::dpl::__internal::__kernel_work_group_size(__exec, __kernel));
#endif
```
I think this `_ONEDPL_COMPILE_KERNEL` workaround is unnecessary here. My understanding of this kernel compilation bundle step is that it is a workaround to prevent very large allocations on CPU targets, which may report a very large `__max_work_group_size` if left unchecked. Since we are not allocating based upon workgroup size, I don't think this is needed in the first place.

If we want some reasonable limit on workgroup size, this can be replaced by a simple "reasonable" maximum workgroup size like 1024, as is done in histogram:

```cpp
::std::size_t __max_wgroup_size = oneapi::dpl::__internal::__max_work_group_size(__exec);
::std::uint16_t __work_group_size = ::std::min(::std::size_t(1024), __max_wgroup_size);
```

I intend to investigate whether we can move away from this workaround altogether in oneDPL, but I have not done so yet. For the time being, though, I'd prefer not to propagate it unless we have some specific motivation for doing so that I am unaware of.
@MikeDvorskiy, what do you think about `_ONEDPL_COMPILE_KERNEL` and the rest?
My understanding was (and is) that we want to know the "right" maximum work group size so that all the values processed by a work group fit into SLM. This leads to the conclusion that it is needed only for those kernels where we use SLM.
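For context, a sketch of the two queries involved, using plain SYCL 2020 APIs rather than the oneDPL wrappers; the kernel name is hypothetical, and the kernel must actually be defined in the application for `get_kernel_id` to resolve:

```cpp
#include <sycl/sycl.hpp>
#include <algorithm>
#include <cstddef>

class MyFindOrKernel; // hypothetical kernel name

std::size_t choose_wg_size(sycl::queue& q)
{
    sycl::device dev = q.get_device();

    // Generic device-wide maximum: a safe upper bound, but it ignores what
    // this particular kernel consumes (registers, SLM).
    std::size_t max_wg = dev.get_info<sycl::info::device::max_work_group_size>();

    // Kernel-specific maximum: requires the kernel bundle to be built first,
    // which is the cost the _ONEDPL_COMPILE_KERNEL path pays.
    auto bundle =
        sycl::get_kernel_bundle<sycl::bundle_state::executable>(q.get_context());
    sycl::kernel k = bundle.get_kernel(sycl::get_kernel_id<MyFindOrKernel>());
    std::size_t kernel_wg =
        k.get_info<sycl::info::kernel_device_specific::work_group_size>(dev);

    return std::min(max_wg, kernel_wg);
}
```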
```diff
 bool
-operator()(const _LocalAtomic& __found_local, const _GlobalAtomic& __found) const
+operator()(const _AtomicType __found_local, const _AtomicType __found) const
```
Am I correct that this is actually never called in the current PR? The `__parallel_or_tag` overload takes `__comp` as a parameter but discards it without use.
You are correct, fixed.
```cpp
// Set found state result to global atomic
if (__found_local != __init_value)
{
    __found.fetch_or(__found_local);
```
As far as I can see, `__found_local` here is not atomic, just a bool/int type, and each work item computes the predicate and "touches" (updates) the global atomic only if the current predicate is true. I guess such an approach may be effective if an input sequence has "few" solutions.

I think that if an input sequence has many solutions, each work item will "touch" the global atomic to update it. Access to a global atomic is very "expensive", so I guess we might have a big performance penalty in that case.
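One common mitigation, shown here only as a hypothetical sketch (not what this PR does), is to read the global flag first and write only if it is still unset; since every writer stores the same value, the load/store race is benign for an OR-style flag, and the read is much cheaper than a read-modify-write update:

```cpp
#include <sycl/sycl.hpp>

// Hypothetical helper: publish a per-work-item "found" flag while touching
// the global atomic as little as possible. Assumes __result points to
// device-accessible (e.g. USM) memory.
inline void publish_found(int __found_local, int* __result)
{
    if (!__found_local)
        return; // nothing to publish

    sycl::atomic_ref<int, sycl::memory_order::relaxed,
                     sycl::memory_scope::device,
                     sycl::access::address_space::global_space>
        __found_global(*__result);

    // Cheap read first; skip the expensive store if the flag is already set.
    // The race between load and store is harmless: all writers store 1.
    if (!__found_global.load())
        __found_global.store(1);
}
```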
…on of __pattern_any_of on __parallel_transform_reduce Signed-off-by: Sergey Kopienko <[email protected]>
…on of __pattern_find_if on __parallel_transform_reduce Signed-off-by: Sergey Kopienko <[email protected]>
… as unused anymore. include/oneapi/dpl/pstl/hetero/dpcpp/parallel_backend_sycl.h - performance optimization of __parallel_find_or + __device_backend_tag for the usage with __parallel_or_tag
… extra call of __comp(__found_local.load(), __found.load()) Signed-off-by: Sergey Kopienko <[email protected]>
… extra auto keyword Signed-off-by: Sergey Kopienko <[email protected]>
Signed-off-by: Sergey Kopienko <[email protected]>
…lf review comment: let's use __brick_tag instead of __parallel_or_tag{} Signed-off-by: Sergey Kopienko <[email protected]>
…omment Signed-off-by: Sergey Kopienko <[email protected]>
… in comments: __typle_type -> __tuple_type Signed-off-by: Sergey Kopienko <[email protected]>
…view comment: remove local variable _IterSize __current_iter Signed-off-by: Sergey Kopienko <[email protected]>
… extra auto keyword Signed-off-by: Sergey Kopienko <[email protected]>
…review comment" This reverts commit b2e73df.
This reverts commit 8c6e80f.
…ementation of __pattern_find_if on __parallel_transform_reduce" This reverts commit f97df31.
…ementation of __pattern_any_of on __parallel_transform_reduce" This reverts commit d659866.
…predicates __find_if_unary_transform_op, __find_if_binary_reduce_op" This reverts commit 2d89714.
…ix error in comments: __typle_type -> __tuple_type" This reverts commit 67eedc1.
…view comment: the __parallel_or_tag overload takes __comp as a parameter but discards it without use. Signed-off-by: Sergey Kopienko <[email protected]>
…t and __parallel_find_any Signed-off-by: Sergey Kopienko <[email protected]>
…t and __parallel_find_any Signed-off-by: Sergey Kopienko <[email protected]>
… extra comments "Point #..." Signed-off-by: Sergey Kopienko <[email protected]>
…e store call instead of fetch_or Signed-off-by: Sergey Kopienko <[email protected]>
…view comment: do not use type name "_AtomicType" for local state variable Signed-off-by: Sergey Kopienko <[email protected]>
…fy __parallel_or_tag Signed-off-by: Sergey Kopienko <[email protected]> (cherry picked from commit f9948b3)
… some local variables inside __parallel_find_any Signed-off-by: Sergey Kopienko <[email protected]> (cherry picked from commit 315b091)
…e __parallel_find_any on parallel_for_work_group + parallel_for_work_item Signed-off-by: Sergey Kopienko <[email protected]> (cherry picked from commit a7c749f)
…ed only when __found_in_any_item_inside_group is false Signed-off-by: Sergey Kopienko <[email protected]>
Force-pushed from 34fa73a to d2581b2.
In this PR we made some performance improvements:

- `__parallel_find_or` + `__device_backend_tag` for the usage with the `__parallel_or_tag`.

The changes for

- `__pattern_any_of` (`__hetero_tag`), which has been implemented on `__parallel_transform_reduce`;
- `__pattern_find_if` (`__hetero_tag`), which has been implemented on `__parallel_transform_reduce`;

have been extracted to the separate PR "Performance improvements of `__pattern_any_of`, `__pattern_find_if`" #1622 and rolled back in this PR.