Improve SYCL backend __parallel_for performance for large input sizes #1870
Maybe it would be better to specify __stride_recommender as a template parameter type, with the ability to change it from the caller side if required?
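For illustration, a hedged sketch of what this suggestion could look like. The names __default_stride and __pick_stride are hypothetical, not part of oneDPL: the stride recommender becomes a template parameter with a default, so a caller can substitute a different policy when required.

```cpp
#include <cstddef>

// Hypothetical sketch of the suggestion above; these names are illustrative
// and not part of oneDPL. The default policy recommends a work-group stride.
struct __default_stride
{
    std::size_t
    operator()(std::size_t __group_size) const
    {
        return __group_size; // stride by the whole work-group
    }
};

// The stride policy is a template parameter with a default, so callers can
// override it from their side if required.
template <typename _StridePolicy = __default_stride>
std::size_t
__pick_stride(std::size_t __group_size, _StridePolicy __policy = {})
{
    return __policy(__group_size);
}
```

A caller could then pass, for example, a sub-group-stride policy without touching the default path.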
In my opinion, leaving it as is is best, although maybe I should rename the function, since "stride" is a loaded term.
We want to enforce a good access pattern. Work-group strides that enable coalescing are a good general choice for devices, and sub-group strides are used in the oneAPI GPU optimization guide and show slightly better performance for SPIR-V compiled targets. I do not see a need to change this, as we should always use the best-performing stride. I could see future improvements modifying __stride_recommender itself, but I do not see a need to accept a templated functor at this point. Do you see a potential use case for this?
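To make the access-pattern argument concrete, here is a minimal host-side sketch (plain C++, not the actual kernel code; function names are illustrative) of why a work-group stride coalesces: on every iteration, consecutive work-items touch consecutive addresses, whereas giving each item its own contiguous chunk spreads neighboring items far apart in memory.

```cpp
#include <cstddef>

// Index a work-item reads on a given iteration with a work-group stride:
// the whole group advances together, so items 0..N-1 hit adjacent elements
// on each iteration, which hardware can coalesce into few transactions.
std::size_t
strided_index(std::size_t item_id, std::size_t iteration, std::size_t group_size)
{
    return iteration * group_size + item_id;
}

// Contrast: each work-item walks its own contiguous chunk. Neighboring
// items are now iters_per_item elements apart, defeating coalescing.
std::size_t
chunked_index(std::size_t item_id, std::size_t iteration, std::size_t iters_per_item)
{
    return item_id * iters_per_item + iteration;
}
```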
Same comment as above.
Answered in the other comment regarding uint8_t.
A similar change has been committed.
I am not sure that __large_submitter is able to calculate the best size in all cases, and the check does not depend on the kernel code or its logic at all. So maybe we should have the ability to customize this condition check too?
It's an interesting point. I'm trying to think about how different workloads / algorithms would impact this number...
I think the decision is largely dependent on overhead vs. memory bandwidth optimization. It's possible that more computation would make this less important, because we would be less reliant on memory bandwidth. However, depending on user-provided callables makes it very difficult for the library to make good decisions, unless there are some APIs which we know are always computationally heavy that use parallel_for internally (I don't know of any).
Another aspect to consider: the larger the minimum type size in the input ranges, the fewer iterations would be run by the large submitter. At the limit, I imagine there is no advantage to the large submitter when we would run only a single iteration. This is knowable at compile time; perhaps we should detect this case and always choose the small submitter when a single iteration would be used.
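A hedged sketch of that compile-time check (illustrative names and tuning constant, not the oneDPL implementation): derive the per-item iteration count from the smallest value type in the input ranges, and fall back to the small submitter when it collapses to one.

```cpp
#include <algorithm>
#include <cstddef>

// Illustrative only: bytes each work-item aims to process in the large
// submitter (the real tuning constant may differ).
inline constexpr std::size_t __bytes_per_item = 16;

// Smallest value type across the input ranges.
template <typename... _Ts>
inline constexpr std::size_t __min_type_size = std::min({sizeof(_Ts)...});

// Iterations each work-item would run: larger minimum types mean fewer
// iterations, bottoming out at 1.
template <typename... _Ts>
inline constexpr std::size_t __iters_per_item =
    std::max<std::size_t>(std::size_t{1}, __bytes_per_item / __min_type_size<_Ts...>);

// Knowable at compile time: with a single iteration the large submitter has
// no advantage, so the small submitter could be chosen unconditionally.
template <typename... _Ts>
inline constexpr bool __use_small_submitter = (__iters_per_item<_Ts...> == 1);
```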
One thing to note: the performance difference at this "estimated point" is very small between the two versions, so computing an optimal switch point for each algorithm would not bring any significant benefit and would likely be overtuned for a specific case / architecture, so I am in favor of leaving it as it currently is. The improvements are really observed once we scale to large sizes.
With regard to @danhoeflinger's comment, I agree that a user providing a heavily compute-intensive operator might minimize the observed benefit if we become compute bound. I do not think we have encountered such a case yet, and I do not think there is a high risk of performance loss in this scenario, although there may be small differences around this estimated point.
Good point on the case where we load one element per item. I will look into adding this as a compile-time decision.
This makes sense to me. Thanks for the reply.