
[oneDPL] A fix - added missed synchronization between two SYCL patterns. #1261

Merged (1 commit into main) Apr 5, 2024

Conversation

@MikeDvorskiy (Contributor) commented Nov 3, 2023:

Some algorithm patterns call more than one SYCL backend pattern. Since SYCL backend patterns are asynchronous, we have to synchronize between them, either by setting SYCL dependencies or, at the least, by a "wait" on the first pattern.
Setting SYCL dependencies is the preferable and correct approach, but currently the SYCL backend API has no way to pass dependencies; it should be redesigned in the future. For now we can apply a simple fix: wait on the first SYCL backend pattern call.

@MikeDvorskiy changed the title from "[oneDPL] A fix - added missed synchronization between two async SYCL patterns." to "[oneDPL] A fix - added missed synchronization between two SYCL patterns." Nov 3, 2023
@danhoeflinger (Contributor) left a comment:

I agree with you that the long term fix is to use explicit dependencies between these pattern calls, and that sort of refactor is necessary.

However, I think that we may be able to improve the stopgap for the __pattern_stable_partition case as described in my comment.

Comment on lines 1242 to 1243
__pattern_walk2(
__par_backend_hetero::make_wrapped_policy<copy_back_wrapper>(__exec),
Contributor:

Here it seems that these two __pattern_walk2 calls are independent of each other (don't require a specific ordering), but both do need to be waited on before returning from __pattern_stable_partition.

Instead of enforcing ordering here, could we make them both async calls and then wait on the policy's queue here?

@MikeDvorskiy (Contributor Author) commented Nov 6, 2023:

I agree here...
Yes, we have to wait on two events here, but we still don't have such a oneDPL API, and using the SYCL API directly is not allowed at this oneDPL code layer. (As an alternative to queue::wait, there is the possibility to call event::wait with a list of events.)

Contributor:

Yes, you are correct. I agree that similar to the others, this requires a rewrite / refactor to improve by returning some event from __pattern_walk2 rather than the simple true / false async we have.

Comment on lines 1414 to 1415
auto __out_end = __pattern_walk2</*_IsSync=*/::std::false_type>(
auto __out_end = __pattern_walk2(
__par_backend_hetero::make_wrapped_policy<__initial_copy_1>(__exec), __first, __last, __out_first,
Contributor:

This __pattern_walk2 and the subsequent __pattern_sort do have a dependency with __out_first, where __pattern_walk2 needs to be finished before __pattern_sort can begin.

My first thought is that the accessors involved here provide the implicit data dependency to order these kernels appropriately and make this work as a pipeline (as described in the _IsSync comment in the definition of __pattern_walk2).

However, with USM pointers as input data, I believe we lose these implicit dependencies and the ordering of kernels. Is that also your understanding (and the reason this extra synchronization is required)?

@MikeDvorskiy (Contributor Author) commented Nov 6, 2023:

Actually, the case is more complicated than what we see...

  1. USM: the ideal solution is via depends_on(first_event) (preferable) or wait() on the first event (future).
  2. Hetero iterators: you are right, we have an implicit SYCL data-access dependency (the data dependency graph) and we don't need explicit synchronization.
  3. Host iterators: because the first pattern contains a temporary sycl::buffer in its own scope, we cannot rely on the SYCL data dependency graph. On the second pattern call we get another temporary sycl::buffer, and the SYCL runtime will not be aware that it is the same data, I think.

Regarding (3): I think that to fix it we have to rewrite __pattern_partial_sort_copy to call the SYCL backend patterns directly.

And there is
  4. In the case of host iterators we don't need to call wait() even on the last pattern call, because we have an implicit wait (and copying of data back to the host) in the sycl::buffer destructor. (I "love" SYCL - a kind of sarcasm. :-)

@danhoeflinger (Contributor) commented Nov 6, 2023:

That makes sense. For us to rely on implicit dependencies for (3), we would need a single call to __get_sycl_range whose result is reused. That is something that should be important anyway, to avoid extra copies to and from the device with host data.

Perhaps that rewrite does not belong in this PR, but it seems like something we need to audit: if we are using multiple high-level patterns with host iterators, we may be copying data to and from the device for each pattern unnecessarily.

oneapi::dpl::__ranges::all_view<_Tp, __par_backend_hetero::access_mode::write>(__temp_buf.get_buffer());

const auto __shift = __new_first - __first;
oneapi::dpl::__par_backend_hetero::__parallel_for(
oneapi::dpl::__par_backend_hetero::make_wrapped_policy<__rotate_wrapper>(__exec),
unseq_backend::__rotate_copy<typename ::std::iterator_traits<_Iterator>::difference_type>{__n, __shift}, __n,
__buf.all_view(), __temp_rng);
Contributor:

This falls in a similar category as the partial_sort_copy case above.

For buffers, I believe the implicit data dependencies of the accessor to __temp_buf should order these appropriately, and the wait on the last __parallel_for should remove any race with the return.

However, considering USM pointer inputs, I think we lose those, and this can be a bug that requires the wait().

Contributor Author:

Yes, the wait is redundant here, because __temp_rng is a temporary sycl::buffer.
Moreover, I guess we don't need the wait below, on line 1638, because the destructor of the temporary buffer is blocking.

(To tell the truth, I don't like the situation where we have to know the data types (USM/buffer/host/etc.), SYCL sync rules, and other details.)

Contributor:

Yes, it is not a great situation, but you are correct: because __temp_buf is a temporary sycl::buffer, these waits are unnecessary. Assuming we remove both wait() calls, we should add a comment for each to describe the implicit synchronization, as it is not obvious. (I missed it even with this topic in mind when looking back at this PR.)

@akukanov added the "follow through" label (PRs/issues that should be completed/resolved) Nov 30, 2023
@ValentinaKats added this to the 2022.6.0 milestone Feb 5, 2024
@SergeyKopienko force-pushed the dev/mdvorski/fix_missed_synch_alg_patterns branch from b4c25d3 to 3030539 on March 22, 2024
danhoeflinger previously approved these changes Mar 25, 2024

@danhoeflinger (Contributor) left a comment:

After taking another look through this, I agree that this should be applied as is (since we cannot improve it further without significant refactoring to provide some method of pipelining patterns via events). It may result in a slowdown in some usages, but will fix bugs in others, which is the preferred option of the two.

We should create an issue to properly describe the intricacies of the requirements for a refactor, specifically from this topic:
https://github.com/oneapi-src/oneDPL/pull/1261/files/2a9d0f0af2f64d582f9b0d770d94fe3085ea4112#r1383348739

@@ -1621,20 +1624,23 @@ __pattern_rotate(__hetero_tag<_BackendTag>, _ExecutionPolicy&& __exec, _Iterator
auto __buf = __keep(__first, __last);
auto __temp_buf = oneapi::dpl::__par_backend_hetero::__buffer<_ExecutionPolicy, _Tp>(__exec, __n);

auto __temp_rng =
auto __temp_rng_w =
Contributor Author:

Currently, I don't see any reason to change the variable name here. I would suggest undoing the change (lines 1627 and 1634).

__brick_copy<__hetero_tag<_BackendTag>, _ExecutionPolicy>{});
auto __out_end =
__pattern_walk2(__tag, __par_backend_hetero::make_wrapped_policy<__initial_copy_1>(__exec), __first, __last,
__out_first, __brick_copy<__hetero_tag<_BackendTag>, _ExecutionPolicy>{});

// Use regular sort as partial_sort isn't required to be stable
__pattern_sort(
Contributor Author:

In the case of a USM pointer we have to wait for the result of the __pattern_walk2 on line 1489...
But if we do that, we get a perf degradation with the sycl_iterator type (sycl::buffer).

Contributor:

So, since we do not want to do a large refactor to use event dependencies to pipeline patterns at this point...

What about something like this to avoid the perf degradation: create something like __type_requires_exterior_synchronization<__out_first>, which would resolve to true_type for any non-SYCL backend, and for the SYCL backend could return bool_constant<!is_sycl_iterator_v<__out_first>>.

We can do this without mentioning SYCL at this level.

Contributor:

This still leaves a lot to be desired vs. the refactor (especially for host_iterators), but it at least removes the degradation.

Contributor Author:

is_sycl_iterator_v...
What if we have a "fancy pipe" over a sycl_iterator? Or what should we do in the case of a mix: zip(zip(sycl_iter, usm_ptr), usm_ptr), and so on?

Contributor:

Yes, it would miss these; it's not a perfect solution. It provides some improvement, though, for a raw sycl_iterator or the legacy hetero iterators. The better solution is using events which can be passed as dependencies to patterns and then into SYCL kernels.

The solution described could be made better by adding a __contains_sycl_iterator trait which has specializations for our fancy iterators to pass through and check whether any child iterator __contains_sycl_iterator and would therefore provide the synchronization.
zip and permutation would return the OR of their children iterators; transform would return its base's result. If it is not a fancy iterator, it would just be defined as described above.

Contributor:

Actually, it's a bit more complicated, because we need the access pattern to actually dictate a dependency. Permutation iterator should probably only pass through to its base, not its map, because the map will be read-only.

It is of course the responsibility of the user of this structure to make sure the usage of the sequence will generate the synchronization, but a permutation iterator won't require write access for its map even if it is used in write mode.

Read-only access followed by read-only access wouldn't require an ordering. We need to be most careful to avoid believing there is a synchronization where there isn't one.


@MikeDvorskiy force-pushed the dev/mdvorski/fix_missed_synch_alg_patterns branch 4 times, most recently from 4f12be6 to 4b7fde2 on March 26, 2024
@danhoeflinger (Contributor) left a comment:

I believe I agree with your removal of these synchronizations. From
https://registry.khronos.org/SYCL/specs/sycl-2020/html/sycl-2020.html#sec:buf-sync-rules

"When the buffer is destroyed, the destructor will block until all work in queues on the buffer have completed, then copy the contents of the buffer back to the host memory (if required) and then return."

It doesn't matter what mode the accessor is in; it will always block on the destructor.

Other than the performance degradation for that one case, this looks good. We can discuss in that thread whether we want to try to fix it in this PR or leave it as an issue / TODO.

include/oneapi/dpl/pstl/hetero/algorithm_impl_hetero.h (outdated thread, resolved)
@MikeDvorskiy (Contributor Author) replied to "leave as an issue / TODO":

I've left a proper //TODO.

@danhoeflinger (Contributor) left a comment:

LGTM

@MikeDvorskiy force-pushed the dev/mdvorski/fix_missed_synch_alg_patterns branch from 48fe95d to 513dcd8 on April 5, 2024
@MikeDvorskiy merged commit 2f84319 into main Apr 5, 2024
19 of 20 checks passed
@MikeDvorskiy deleted the dev/mdvorski/fix_missed_synch_alg_patterns branch April 5, 2024
MikeDvorskiy added a commit that referenced this pull request Apr 8, 2024
SergeyKopienko pushed a commit that referenced this pull request Apr 9, 2024
…L patterns, removed unnecessary sync, comments #1261"

This reverts commit 7115fb4.