Tune the amount of groups in __parallel_find_or pattern #1723
A few questions:
- It is stated that the performance is better for larger input sizes. Does this have any effect on smaller input sizes?
- For which devices do we see a performance benefit?
I think a simpler tuning approach you might want to try is to fix __iters_per_work_item to powers of two based on the input sizes. This might generate more optimized code and would remove the need for the complex approach of calculating __n_groups.
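To make the suggestion concrete, here is a minimal sketch of one way it could look. It is not the oneDPL implementation: `nd_range_params`, `ceil_div`, `tune_find_or`, and the `max_groups` cap are hypothetical names introduced for illustration, and the doubling rule is an assumption about how "powers of two based on the input sizes" might be chosen.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical result type (not part of oneDPL): the launch parameters
// that a tuner for a find-or style pattern would have to produce.
struct nd_range_params
{
    std::size_t n_groups;
    std::size_t wgroup_size;
    std::size_t iters_per_work_item;
};

// Integer division rounded up.
inline std::size_t ceil_div(std::size_t a, std::size_t b)
{
    return (a + b - 1) / b;
}

// Hypothetical tuner: keep iters_per_work_item a power of two and double it
// until the number of groups needed to cover n elements fits under max_groups.
// The group count then follows directly from n, with no separate tuning step.
inline nd_range_params tune_find_or(std::size_t n, std::size_t wgroup_size, std::size_t max_groups)
{
    assert(n > 0 && wgroup_size > 0 && max_groups > 0);

    std::size_t iters = 1;
    while (ceil_div(n, wgroup_size * iters) > max_groups)
        iters *= 2;

    return {ceil_div(n, wgroup_size * iters), wgroup_size, iters};
}
```

With a work-group size of 256 and a cap of 64 groups, an input of 1000 elements keeps one iteration per work-item, while an input of 2^20 elements ends up with 64 iterations per work-item. Because the iteration count only takes power-of-two values, a real implementation could dispatch to kernels with a compile-time iteration count, which is presumably where the "more optimized code" would come from.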
b1544cd to 50e4789
@julianmi, @danhoeflinger, @adamfidel the implementation has been updated.
We still see a good performance gain for many sizes.
6130b17 to 44bcf79
@danhoeflinger, @julianmi, @adamfidel Could you please take another look?
…iters_per_work_item > 1 Signed-off-by: Sergey Kopienko <[email protected]>
aedac45 to e8f59e8
Signed-off-by: Sergey Kopienko <[email protected]>
Co-authored-by: Alexey Kukanov <[email protected]>
…dpl::__internal::__device_backend_tag>::operator() Signed-off-by: Sergey Kopienko <[email protected]>
…ve loop in __parallel_find_or_nd_range_tuner<oneapi::dpl::__internal::__device_backend_tag>::operator()
5d266b3 to 4922f46
LGTM
In this PR we tune the number of groups in the __parallel_find_or pattern. This approach gives us a performance boost on larger data sizes.