[oneDPL] A fix: added missing synchronization between two SYCL patterns #1261
Conversation
I agree with you that the long-term fix is to use explicit dependencies between these pattern calls, and that sort of refactor is necessary.
However, I think that we may be able to improve the stopgap for the __pattern_stable_partition case, as described in my comment.
__pattern_walk2(
    __par_backend_hetero::make_wrapped_policy<copy_back_wrapper>(__exec),
Here it seems that these two __pattern_walk2 calls are independent of each other (they don't require a specific ordering), but both do need to be waited on before returning from __pattern_stable_partition.
Instead of enforcing ordering here, could we make them both async calls and then wait on the policy's queue here?
I agree here...
Yes, we have to wait for two events here, but we still don't have such a oneDPL API, and using the SYCL API directly is not allowed at this oneDPL code layer. (As an alternative to queue::wait, there is the possibility of calling event::wait(list of events).)
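For illustration, a minimal sketch of that alternative in plain SYCL, assuming the two pattern calls could hand back their events (the function name here is invented):

```cpp
#include <sycl/sycl.hpp>

// SYCL 2020 provides a static wait over a list of events, so the two
// independent patterns could run concurrently and be joined here instead
// of draining the policy's whole queue.
void join_two_patterns(sycl::event e1, sycl::event e2)
{
    sycl::event::wait({e1, e2});
}
```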
Yes, you are correct. I agree that, similar to the others, improving this requires a rewrite / refactor so that __pattern_walk2 returns some event rather than the simple true/false async flag we have.
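For illustration, a minimal sketch of what an event-returning walk2 could look like. This is not the current oneDPL API: the name __pattern_walk2_event, the queue() accessor on the policy, and the raw-pointer form of the brick application are all assumptions.

```cpp
#include <sycl/sycl.hpp>
#include <cstddef>

// Hypothetical sketch: instead of the compile-time true/false _IsSync flag,
// the pattern returns the event of the kernel it submits, so callers can
// join or chain dependencies explicitly (e.g. sycl::event::wait({e1, e2})).
template <typename _Policy, typename _Tp, typename _Brick>
sycl::event
__pattern_walk2_event(_Policy&& __policy, const _Tp* __in, _Tp* __out,
                      std::size_t __n, _Brick __brick)
{
    return __policy.queue().parallel_for(
        sycl::range<1>(__n),
        [=](sycl::id<1> __i) { __out[__i] = __brick(__in[__i]); });
}
```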
- auto __out_end = __pattern_walk2</*_IsSync=*/::std::false_type>(
+ auto __out_end = __pattern_walk2(
      __par_backend_hetero::make_wrapped_policy<__initial_copy_1>(__exec), __first, __last, __out_first,
This __pattern_walk2 and the subsequent __pattern_sort do have a dependency through __out_first, where __pattern_walk2 needs to finish before __pattern_sort can begin.
My first thought is that the accessors involved here provide the implicit data dependency to order these kernels appropriately and make this work as a pipeline (as described in the _IsSync comment in the definition of __pattern_walk2).
However, with USM pointers as input data, I believe we lose these implicit dependencies and the ordering of kernels. Is that also your understanding (and the reason this extra synchronization is required)?
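To illustrate the distinction, a sketch in plain SYCL rather than oneDPL internals (all names invented): the buffer case is ordered by the accessor dependency graph, while the USM case must state its ordering explicitly.

```cpp
#include <sycl/sycl.hpp>
#include <cstddef>

// Buffers: the runtime serializes the two kernels via their accessors.
void buffer_case(sycl::queue& q, sycl::buffer<int>& buf)
{
    const sycl::range<1> n = buf.get_range();
    q.submit([&](sycl::handler& cgh) {          // stands in for __pattern_walk2
        sycl::accessor out(buf, cgh, sycl::write_only);
        cgh.parallel_for(n, [=](sycl::id<1> i) { out[i] = int(i[0]); });
    });
    q.submit([&](sycl::handler& cgh) {          // stands in for __pattern_sort
        sycl::accessor io(buf, cgh, sycl::read_write);
        cgh.parallel_for(n, [=](sycl::id<1> i) { io[i] += 1; });
    }); // implicitly ordered after the first kernel; no wait() needed
}

// USM: no accessors, so the ordering must be made explicit.
void usm_case(sycl::queue& q, int* data, std::size_t n)
{
    sycl::event e = q.parallel_for(sycl::range<1>(n),
                                   [=](sycl::id<1> i) { data[i] = int(i[0]); });
    q.submit([&](sycl::handler& cgh) {
        cgh.depends_on(e); // or e.wait(); otherwise the kernels may race
        cgh.parallel_for(sycl::range<1>(n), [=](sycl::id<1> i) { data[i] += 1; });
    });
}
```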
Actually, the case is more complicated than what we see here...

1) USM: the ideal solution is via depends_on(first_event) (preferable) or wait() on the first event (a future).
2) hetero iterators: you are right, we have an implicit SYCL data-access dependency (the data dependency graph) and we don't need explicit synchronization.
3) host iterators: because the first pattern contains a temporary sycl::buffer in its own scope, we cannot rely on the SYCL data dependency graph. On the second pattern call we get another temporary sycl::buffer, and the SYCL runtime will not be aware that it is the same data, I think (see the sketch below).

Regarding (3): I think that to fix it we have to rewrite __pattern_partial_sort_copy by calling SYCL backend patterns directly.

And there is
4) In the case of host iterators we don't need to call wait() even on the last pattern call, because we have an implicit wait (and a copy of the data back to the host) in the sycl::buffer destructor. (I "love" SYCL, a kind of sarcasm. :-)
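A plain-SYCL sketch of (3) and (4), with invented names: each scope stands in for one pattern call that wraps the same host data in its own temporary buffer.

```cpp
#include <sycl/sycl.hpp>
#include <vector>

void two_patterns(sycl::queue& q, std::vector<int>& host_data)
{
    const sycl::range<1> n(host_data.size());
    {   // first "pattern": its temporary buffer lives only in this scope
        sycl::buffer<int> buf(host_data.data(), n);
        q.submit([&](sycl::handler& cgh) {
            sycl::accessor acc(buf, cgh, sycl::read_write);
            cgh.parallel_for(n, [=](sycl::id<1> i) { acc[i] += 1; });
        });
    }   // (4): ~buffer() blocks until the kernel finishes and copies back

    {   // second "pattern": a *new* buffer over the same host memory; the
        // runtime does not connect it to the first kernel (3), so without
        // the destructor's implicit wait above this could race
        sycl::buffer<int> buf(host_data.data(), n);
        q.submit([&](sycl::handler& cgh) {
            sycl::accessor acc(buf, cgh, sycl::read_write);
            cgh.parallel_for(n, [=](sycl::id<1> i) { acc[i] *= 2; });
        });
    }
}
```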
That makes sense. For us to rely on implicit dependencies in case (3), we would need a single call to __get_sycl_range whose result is reused. However, this is important anyway to avoid extra copies to and from the device with host data.
Perhaps that rewrite does not belong in this PR, but it seems like something we need to audit. If we are using multiple high-level patterns with host iterators, we may be copying data to and from the device for each pattern unnecessarily.
oneapi::dpl::__ranges::all_view<_Tp, __par_backend_hetero::access_mode::write>(__temp_buf.get_buffer());

const auto __shift = __new_first - __first;
oneapi::dpl::__par_backend_hetero::__parallel_for(
    oneapi::dpl::__par_backend_hetero::make_wrapped_policy<__rotate_wrapper>(__exec),
    unseq_backend::__rotate_copy<typename ::std::iterator_traits<_Iterator>::difference_type>{__n, __shift}, __n,
    __buf.all_view(), __temp_rng);
This falls into a similar category as the partial_sort_copy case above.
For buffers, I believe the implicit data dependencies of the accessor to __temp_buf should order these appropriately, and the wait on the last __parallel_for should remove any race with the return.
However, considering USM pointer inputs, I think we lose those, and this can be a bug that requires the wait().
Yes, the wait is redundant here, because __temp_rng is a temporary sycl::buffer.
Moreover, I guess we don't need the wait below, on line 1638, because the destructor of the temporary buffer is blocking.
(To tell the truth, I don't like the situation where we have to know the data types (USM/buffer/host/etc.), the SYCL sync rules, and other details.)
Yes, it is not a great situation, but you are correct: because __temp_buf is a temporary sycl::buffer, these waits are unnecessary. Assuming we remove both wait() calls, we should add comments for each to describe the implicit synchronization, as it is not clear. (I missed it, even with this topic in my mind, when looking back at this PR.)
After taking another look through this, I agree that this should be applied as is, since we cannot improve it further without significant refactoring to provide some method of pipelining patterns via events. It may result in a slowdown in some usages, but it will fix bugs in others, which is the preferred option between the two.
We should create an issue to properly describe the intricacies of the requirements for a refactor, specifically from this topic:
https://github.com/oneapi-src/oneDPL/pull/1261/files/2a9d0f0af2f64d582f9b0d770d94fe3085ea4112#r1383348739
@@ -1621,20 +1624,23 @@ __pattern_rotate(__hetero_tag<_BackendTag>, _ExecutionPolicy&& __exec, _Iterator
  auto __buf = __keep(__first, __last);
  auto __temp_buf = oneapi::dpl::__par_backend_hetero::__buffer<_ExecutionPolicy, _Tp>(__exec, __n);

- auto __temp_rng =
+ auto __temp_rng_w =
Currently, I don't see any reason to change the variable name here. I would suggest undoing the change (lines 1627 and 1634).
        __brick_copy<__hetero_tag<_BackendTag>, _ExecutionPolicy>{});
auto __out_end =
    __pattern_walk2(__tag, __par_backend_hetero::make_wrapped_policy<__initial_copy_1>(__exec), __first, __last,
                    __out_first, __brick_copy<__hetero_tag<_BackendTag>, _ExecutionPolicy>{});

// Use regular sort as partial_sort isn't required to be stable
__pattern_sort(
In the case of a USM pointer, we have to wait for the result of __pattern_walk2 on line 1489...
But if we do that, we get a performance degradation with the sycl_iterator type (sycl::buffer).
So, since we are not wanting to do some sort of large refactor to use event dependencies to pipeline patterns at this point...
What about something like this to avoid the perf degradation:
Create something like __type_requires_exterior_synchronization<__out_first>, which would resolve to true_type for any non-SYCL backend; for the SYCL backend it could return bool_constant<!is_sycl_iterator_v<__out_first>>.
We can do this without mentioning SYCL at this level.
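A minimal sketch of that trait, assuming a detection helper like the is_sycl_iterator mentioned above (both names are hypothetical here, not existing oneDPL internals):

```cpp
#include <type_traits>

// Hypothetical stand-in: real detection of oneDPL's sycl_iterator would
// live in the SYCL backend headers.
template <typename _Iterator>
struct __is_sycl_iterator : std::false_type
{
};

// Proposed trait: by default (including non-SYCL backends) an exterior wait
// is required; a sycl_iterator opts out, because its buffer accessors
// already order the kernels through the SYCL dependency graph.
template <typename _Iterator>
struct __type_requires_exterior_synchronization
    : std::bool_constant<!__is_sycl_iterator<_Iterator>::value>
{
};
```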
This still leaves a lot to be desired vs the refactor (for host_iterators especially), but it at least removes the degradation.
Regarding is_sycl_iterator_v...
What if we have a "fancy pipe" over a sycl_iterator? Or what should we do in the case of a mix: zip(zip(sycl_iter, usm_ptr), usm_ptr), and so on?
Yes, it would miss these; it's not a perfect solution. It provides some improvement, though, for a raw sycl_iterator or the legacy hetero iterators. The better solution is using events, which can be passed as dependencies to patterns and then into SYCL kernels.
The solution described could be made better by adding a __contains_sycl_iterator trait, which has specializations for our fancy iterators that pass through and check whether any child iterator __contains_sycl_iterator and would therefore provide the synchronization.
zip and permutation would return the OR of their children's results; transform would return its base's result. If it is not a fancy iterator, it would just be defined as described above.
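A sketch of that composite trait, building on the hypothetical __is_sycl_iterator stand-in from the earlier sketch; the fancy iterator templates are forward-declared here only to keep the example self-contained:

```cpp
#include <type_traits>

// Forward declarations standing in for oneDPL's fancy iterators.
template <typename... _Its>
class zip_iterator;
template <typename _It, typename _F>
class transform_iterator;

// Base case: a plain iterator "contains" a sycl_iterator only if it is one.
template <typename _It>
struct __contains_sycl_iterator : __is_sycl_iterator<_It>
{
};

// zip: the OR of all child iterators' results.
template <typename... _Its>
struct __contains_sycl_iterator<zip_iterator<_Its...>>
    : std::bool_constant<(__contains_sycl_iterator<_Its>::value || ...)>
{
};

// transform: defer to the base iterator's result.
template <typename _It, typename _F>
struct __contains_sycl_iterator<transform_iterator<_It, _F>>
    : __contains_sycl_iterator<_It>
{
};
```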
Actually, it's a bit more complicated, because we need the access pattern to actually dictate a dependency. Probably the permutation iterator should only pass through to its base, not its map, because the map will be read-only.
It is, of course, the responsibility of the user of this structure to make sure the usage of the sequence will generate the synchronization, but a permutation iterator won't require write access for its map even if it is used in write mode.
Read-only followed by read-only wouldn't require an ordering. We need to be most careful to avoid believing there is a synchronization where there isn't one.
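Extending the sketch above with that refinement (again, the forward declaration is only for self-containment):

```cpp
// Forward declaration standing in for oneDPL's permutation_iterator.
template <typename _Base, typename _Map>
class permutation_iterator;

// permutation: pass through only to the base. The map is read-only, so it
// cannot be the side of the access pattern that creates a dependency, even
// when the permutation iterator itself is used in write mode.
template <typename _Base, typename _Map>
struct __contains_sycl_iterator<permutation_iterator<_Base, _Map>>
    : __contains_sycl_iterator<_Base>
{
};
```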
I believe I agree with your removal of these synchronizations.
From https://registry.khronos.org/SYCL/specs/sycl-2020/html/sycl-2020.html#sec:buf-sync-rules :
"When the buffer is destroyed, the destructor will block until all work in queues on the buffer have completed, then copy the contents of the buffer back to the host memory (if required) and then return."
It doesn't matter what mode the accessor is in; it will always block in the destructor.
Other than the performance degradation for that one case, this looks good. We can discuss in that thread whether we want to try to fix that in this PR or leave it as an issue / TODO.
I've left a proper //TODO.
LGTM
…ns, removed unnecessary sync, comments #1261
In some algorithm patterns there are calls to more than one SYCL backend pattern. Since SYCL backend patterns are asynchronous, we have to synchronize between them, either by setting SYCL dependencies or, at least, by waiting on the first pattern.
"Setting SYCL dependencies" is the preferable and correct approach. But currently the SYCL backend API does not provide a way to pass any dependencies; it should be redesigned in the future... For now we can apply a simple fix: wait on the first SYCL backend pattern call.