Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use __result_and_scratch_storage within scan kernels #1770

Merged
merged 6 commits into from
Aug 8, 2024

Conversation

danhoeflinger
Copy link
Contributor

@danhoeflinger danhoeflinger commented Aug 6, 2024

This PR adjust existing scan kernels to use __result_and_scratch_storage rather than a temporary buffer for both scratch space and for the return within the resulting future.

This is important because:

  1. It should be faster (as evidenced from previous work with reduce)
  2. It serves as plumbing in preparation for the upcoming reduce_then_scan kernels which utilize this structure for temporary and return storage. Since we have runtime branches selecting which algorithm to use (existing scan, single wg scan, single wg copy_if, and soon reduce_then_scan), these all need to have the same return future type. This means they should all use __result_and_scratch_storage for their return (and therefore also any temporary storage).

Single WG scan does not have a return or temporary storage, so it simply creates a dummy __result_and_scratch_storage to make a consistent type for the return, and no changes are needed in the kernel.

For copy_if, unique and partition scan kernels, we need to write to the output result storage. Previous to to PR, we simply used the last element of the temporary storage buffer in the returned future.

This PR also avoids double dereferences, by passing pointers around and using them as pointers, rather than as const lvalue references to "accessors". (as mentioned in #1751) Until this PR, these actually were accessors, but now they are pointers.


This PR is targeted to #1769, to allow for a clean diff, and is a part of the following sequence of PRs meant to be merged in order:

#1769 [MERGED] Relocate __lazy_ctor_storage to utils header
#1770 Use __result_and_scratch_storage within scan kernels (This PR)
#1762 Add reduce_then_scan algorithm for transform scan family
#1763 Make Copy_if family of APIs use reduce_then_scan algorithm
#1764 Make Partition family of APIs use reduce_then_scan algorithm
#1765 Make Unique family of APIs use reduce_then_scan algorithm

This work is a collaboration between @mmichel11 @adamfidel and @danhoeflinger, and based upon an original prototype by Ted Painter.

@SergeyKopienko
Copy link
Contributor

SergeyKopienko commented Aug 7, 2024

I believe we should pass __scratch_container param by reference at
https://github.com/oneapi-src/oneDPL/blob/aec12a4c55ac5de94753f8e3259b50cd6befb0c3/include/oneapi/dpl/pstl/hetero/dpcpp/parallel_backend_sycl_reduce.h#L202
Due it's deep copy is heavy (smart-pointers inside and so on).

And may be the same in __parallel_transform_reduce_work_group_kernel_submitter::operator - pass __scratch_container by reference or moreover as r-value due it's never used after this call on caller side.

@danhoeflinger
Copy link
Contributor Author

danhoeflinger commented Aug 7, 2024

I believe we should pass __scratch_container param by reference at

https://github.com/oneapi-src/oneDPL/blob/aec12a4c55ac5de94753f8e3259b50cd6befb0c3/include/oneapi/dpl/pstl/hetero/dpcpp/parallel_backend_sycl_reduce.h#L202

Due it's deep copy is heavy (smart-pointers inside and so on).
And may be the same in __parallel_transform_reduce_work_group_kernel_submitter::operator - pass __scratch_container by reference or moreover as r-value due it's never used after this call on caller side.

It is a deep copy in the location you mention, although I don't think its that expensive since it is only smart pointers, not a copy of the data they point to. In that case, I think we do use it after the fact (to send it to the __future), so an rvalue isn't appropriate until the __future itself.
I'm not sure its in the scope of this PR to change that to a reference, but I'm happy to add it if we have consensus on it (@julianmi, etc.).

The only deep copies we have in this PR are into __future, and we could possible std::move into the future, but changes would be needed in __future to accommodate that.
@julianmi, what do you think of that change? (edit: I'm not sure if __future changes are required... investigating)

@SergeyKopienko
Copy link
Contributor

SergeyKopienko commented Aug 7, 2024

I believe we should pass __scratch_container param by reference at
https://github.com/oneapi-src/oneDPL/blob/aec12a4c55ac5de94753f8e3259b50cd6befb0c3/include/oneapi/dpl/pstl/hetero/dpcpp/parallel_backend_sycl_reduce.h#L202

Due it's deep copy is heavy (smart-pointers inside and so on).
And may be the same in __parallel_transform_reduce_work_group_kernel_submitter::operator - pass __scratch_container by reference or moreover as r-value due it's never used after this call on caller side.

It is a deep copy in the location you mention, although I don't think its that expensive since it is only smart pointers, not a copy of the data they point to. In that case, I think we do use it after the fact (to send it to the __future), so an rvalue isn't appropriate until the __future itself. I'm not sure its in the scope of this PR to change that to a reference, but I'm happy to add it if we have consensus on it (@julianmi, etc.).

The only deep copies we have in this PR are into __future, and we could possible std::move into the future, but changes would be needed in __future to accommodate that. @julianmi, what do you think of that change? (edit: I'm not sure if __future changes are required... investigating)

Exactly, under deep copy I mean copy of smart-pointers here with all atomic counter operations in their copy implementations. But we are able to avoid these at all.
Agree that "deep" probably a little bit incorrect word there but you understood my idea, I believe.

@julianmi
Copy link
Contributor

julianmi commented Aug 7, 2024

I believe we should pass __scratch_container param by reference at
https://github.com/oneapi-src/oneDPL/blob/aec12a4c55ac5de94753f8e3259b50cd6befb0c3/include/oneapi/dpl/pstl/hetero/dpcpp/parallel_backend_sycl_reduce.h#L202

Due it's deep copy is heavy (smart-pointers inside and so on).
And may be the same in __parallel_transform_reduce_work_group_kernel_submitter::operator - pass __scratch_container by reference or moreover as r-value due it's never used after this call on caller side.

It is a deep copy in the location you mention, although I don't think its that expensive since it is only smart pointers, not a copy of the data they point to. In that case, I think we do use it after the fact (to send it to the __future), so an rvalue isn't appropriate until the __future itself. I'm not sure its in the scope of this PR to change that to a reference, but I'm happy to add it if we have consensus on it (@julianmi, etc.).

The only deep copies we have in this PR are into __future, and we could possible std::move into the future, but changes would be needed in __future to accommodate that. @julianmi, what do you think of that change? (edit: I'm not sure if __future changes are required... investigating)

Deep copy: I agree that it is probably not that expensive but still unneeded. I will change it in a separate PR. Let's focus on getting the reduce-then-scan implementation in.

std::move: I think this might be useful. This is probably also out of scope for this PR. I will investigate and prepare a patch.

SergeyKopienko
SergeyKopienko previously approved these changes Aug 7, 2024
Copy link
Contributor

@SergeyKopienko SergeyKopienko left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@danhoeflinger
Copy link
Contributor Author

danhoeflinger commented Aug 7, 2024

Deep copy: I agree that it is probably not that expensive but still unneeded. I will change it in a separate PR. Let's focus on getting the reduce-then-scan implementation in.

std::move: I think this might be useful. This is probably also out of scope for this PR. I will investigate and prepare a patch.

Thanks for taking a look at these.
Here is a godbolt starting to look at the future changes which could be helpful. https://godbolt.org/z/WvqK7G37z
I think we can std::move right now, and it wont cause errors, but as implemented it wont help avoid a copy. I think we can avoid a copy with something like what is in the godbolt, but it requires explicit template parameter specification of the __future type. This is not necessarily a bad thing...

julianmi
julianmi previously approved these changes Aug 7, 2024
Copy link
Contributor

@julianmi julianmi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@danhoeflinger
Copy link
Contributor Author

danhoeflinger commented Aug 7, 2024

Waiting to merge this until #1769 is merged to avoid rebase complexity, unless that PR gets stuck for a long time.

Base automatically changed from dev/shared/reduce_then_scan_plumbing to main August 8, 2024 14:02
@adamfidel adamfidel dismissed stale reviews from julianmi and SergeyKopienko August 8, 2024 14:02

The base branch was changed.

@danhoeflinger danhoeflinger force-pushed the dev/shared/scan_result_and_scratch_storage branch from 47a227c to c471ed5 Compare August 8, 2024 14:06
Signed-off-by: Dan Hoeflinger <[email protected]>
Copy link
Contributor

@SergeyKopienko SergeyKopienko left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@danhoeflinger danhoeflinger merged commit 0dee46a into main Aug 8, 2024
21 checks passed
@danhoeflinger danhoeflinger deleted the dev/shared/scan_result_and_scratch_storage branch August 8, 2024 15:50
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants