Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[KT] Single Pass Copy_if Kernel Template #1616

Draft
wants to merge 136 commits into
base: main
Choose a base branch
from

Conversation

danhoeflinger
Copy link
Contributor

@danhoeflinger danhoeflinger commented Jun 6, 2024

This PR adds a pair of APIs (iterator and range variants) for :
oneapi::dpl::experimental::kt::gpu::copy_if
which take an input and output sequence, as well as a sequence representing a single element to store the number of elements copied (which is left on the device) and a predicate. It copies each element from the input to the output which satisfies the predicate and records the number of elements copied, preserving the relative order of the elements.

Other additions within this PR

  • Tests for these new APIs

Additional notable details:

  • Refactor of oneDPL mainline copy_if single workgroup implementation to lift out the "copy to host" of the num copied return value one level and enable use by the new kernel template
  • Refactor of scan kernel template to share lookback phase, allocation manager with copy_if
  • Adjust lookback phase to rely upon the last subgroup / last work-item rather than the first subgroup / first work-item to do operations which we want only a single subgroup or work-item to do. This enables propagation of "running" scan values without extra intra-workgroup communication for copy_if. I don't believe this change negatively impacts scan KT.

Adapted from previous work by AidanBeltonS, Alcpz, joeatodd, adamfidel

joeatodd and others added 30 commits November 29, 2023 12:35
Atomic flags and the values used in Scan_kt separated to avoid truncating the range to 30bit values, and prepare for a more general scan implementation.
* Improved Scan_kt: templated parameters, ballot, wgsize calculation.

- Changed number of workgroups calculation from next power of two to
next multiple of wgsize
- Improved group_ballot by using the class member functions
- Using kernel_param struct to determine wgsize and elems per work item.
…ree of the device memory (#18)

* Single memory allocation for device_memory

* async free of device memory

---------

Co-authored-by: Joe Todd <[email protected]>
Co-authored-by: AidanBeltonS <[email protected]>
* Refactored cooperative_loopback and memory implementation detail

* renamed load_counter to fetch_add_counter

* Removed dynamic tile counter from the scan memory struct

* scratch memory Reordering

* Fixed wrong values returned in LoopbackScanMemory.get_value

* Improved Class and variable naming
* Implemented atomic64 version of the scan_kt pass

* Removed repeated offset calculation for tile id atomic flag

* Loopback -> Lookback. Removed unused var.
TODO: we still allocate & initialize the memory for the counter
Also use #pragma unroll for now
Signed-off-by: Dan Hoeflinger <[email protected]>
Signed-off-by: Dan Hoeflinger <[email protected]>
Signed-off-by: Dan Hoeflinger <[email protected]>
Signed-off-by: Dan Hoeflinger <[email protected]>
@danhoeflinger
Copy link
Contributor Author

Interestingly, there originally was a regression (~10%) in scan performance by using the last subgroup, last workitem of the subgroup and originating a broadcast from the last workitem of the workgroup, rather than the zeroth of each to perform the "solo" actions in the lookback.

I do not have an understanding of why this might be. copy_if needs to use the last here to take advantage of the location of the data which needs to be communicated. I've adjusted the shared helper function for lookback to allow the individual algorithm to dictate the active subgroup, workitem and source for the broadcast, and this repaired the performance regression for scan.

@danhoeflinger danhoeflinger added this to the 2022.7.0 milestone Jun 10, 2024
Copy link
Contributor

@dmitriy-sobolev dmitriy-sobolev left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've briefly checked the API and the tests. I am going to dive deeper.

include/oneapi/dpl/experimental/kt/single_pass_scan.h Outdated Show resolved Hide resolved
include/oneapi/dpl/experimental/kt/kernel_param.h Outdated Show resolved Hide resolved
include/oneapi/dpl/experimental/kt/single_pass_scan.h Outdated Show resolved Hide resolved
Comment on lines 62 to 69
// Integer numbers are generated even for floating point types in order to avoid rounding errors,
// and simplify the final check
using substitute_t = std::conditional_t<std::is_signed_v<T>, std::int64_t, std::uint64_t>;

std::default_random_engine gen{seed};
substitute_t start = std::is_signed_v<T> ? -10 : 0;
substitute_t end = std::is_signed_v<T> ? 10 : 20;
std::uniform_int_distribution<substitute_t> dist(start, end);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Data generation can be simpler and cover the whole range of numbers, because there will be no overflows due to copy_if nature. There is no need to use substitute_t logic.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I simplified this, thanks.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you also remove the comment?

    // Integer numbers are generated even for floating point types in order to avoid rounding errors,
    // and simplify the final check

test/kt/single_pass_copy_if.cpp Outdated Show resolved Hide resolved
return __prev_event;
}
}

} // namespace __impl

template <typename _InRng, typename _OutRng, typename _NumCopiedRng, typename _UnaryPredicate, typename _KernelParam>
sycl::event
copy_if(sycl::queue __queue, _InRng&& __in_rng, _OutRng&& __out_rng, _NumCopiedRng&& __num_rng, _UnaryPredicate __pred,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is natural to see __num_rng passed as a pointer rather than a range, because it is expected to hold one element. Will that API be able to digest __num_rng as a pointer?

This is definitely not a blocker, because it can be work-arounded on a caller side (e.g. wrapping it with views::subrange).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My intention here is to allow a buffer of a single element to be easily used here. However, I agree that an iterator / pointer may suit it better, given that it should be a single element. I don't think this current API works with just a pointer as it is written.

I don't think we should force it to be a pointer, because I don't think we make using buffers difficult, but I think an iterator could be fine (which includes USM pointers, of course). It would just require users to call oneapi::dpl::begin(num_copied_buf) first if they want to use a buffer.

I think thre may also be utility in creating a single_device_element type which could wrap a buffer, iterator, USM pointer to represent an single element which can be made available on the device. Its something I've considered also for the init value of scan, where currently we only allow a host-side ValueType value. If we also had an specialized overload for a single_device_element which would allow users to skip the host to device copy. The other place such a type is interesting is for the return of reduce for some KT version.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Having taken a fresher look, I think that the existing approach below is fine:

sycl::buffer<T> buf(input.data(), input.size());
sycl::buffer<T> buf_out(size);
sycl::buffer<std::size_t> buf_num_copied(&num_copied, 1);
oneapi::dpl::experimental::kt::gpu::copy_if(q, buf, buf_out, buf_num_copied, pred, param).wait();

I previously thought about this one as an alternative:

sycl::buffer<T> buf(input.data(), input.size());
sycl::buffer<T> buf_out(size);
auto buf_num_copied = sycl::malloc_device<T>(1, q);
oneapi::dpl::experimental::kt::gpu::copy_if(q, buf, buf_out, buf_num_copied, pred, param).wait();

Now, it does not look like a good idea to me.
Users are more likely to stick to either buffer or malloc_device for all the arguments. I know examples when an external pool of memory of memory is used for all the data, and only pointers are passed. So, it does not justify the additional complexity. Moreover, there are workarounds.

Discussing API of a new experimental feature involves much speculation, sometimes with no strong ground like this one. I hope it is useful at least.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think thre may also be utility in creating a single_device_element type...

Yes, it will be useful if it simplifies the usage of the features and helps to avoid performance-punishing pitfalls. It would be good to see these cases, though.
What I can add is that introducing extra logic for passing init values may not be beneficial, because the latency of init value copying may be hidden by the kernel launch. It is a kernel argument, in contrast to return values of reduce or scan KT algorithms.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What I can add is that introducing extra logic for passing init values may not be beneficial

Actually, it may allow passing non-trivially-copyable type of init value in case the algorithm handles data, which is constructed on a device. I do not know such use-cases, though.

test/kt/single_pass_copy_if.cpp Outdated Show resolved Hide resolved
Comment on lines 66 to 75
if constexpr (std::is_integral_v<T>)
{
std::uniform_int_distribution<T> dist(std::numeric_limits<T>::min(), std::numeric_limits<T>::max());
std::generate(input, input + size, [&] { return dist(gen); });
}
else
{
std::uniform_real_distribution<T> dist(std::numeric_limits<T>::min(), std::numeric_limits<T>::max());
std::generate(input, input + size, [&] { return dist(gen); });
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Two things to improve:

  • Use lowest instead of min (min is ~0 for float types).
  • Make sure that duplicates are generated.

You can take a look at test\kt\esimd_radix_sort_utils.h:generate_data(), it is done there.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, good catch with lowest(), I know that but don't always remember that :) .

I don't understand why it is interesting to have duplicates for copy_if with this "less than cutoff" predicate, for radix sort it makes sense, but I don't see the need here.

I dont mind just including this utility header and call this generation function, no need to duplicate this functionality. While I don't understand the need for duplicates, they don't harm anything here.

Since we may use it here, can you explain the motivation for generating real numbers as it is done in test\kt\esimd_radix_sort_utils.h:generate_data()? Specifically:

  1. Why limit to 1e12 rather than numerical_limits<T>::max()?
  2. Why separate the sign into a separate distribution rather than std::uniform_real_distribution<T> dist_real(-log2(1e12), log2(1e12));?
  3. Why use log2 followed by exp2 ?

If they perhaps have limited interest to things outside of radix sort, we may be able to make them tunable parameters.

Copy link
Contributor

@dmitriy-sobolev dmitriy-sobolev Jul 2, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand why it is interesting to have duplicates for copy_if with this "less than cutoff" predicate, for radix sort it makes sense, but I don't see the need here.

Yeah, you are right, it's not relevant for copy_if. Sorry to bother you with that.

Why limit to 1e12 rather than numerical_limits::max()?

The distribution is uniform and floating point types can hold very large values, so the values generated with max will gravitate towards edges. This is what I got with min and max: 9.43047e+307 4.73343e+307 3.52862e+307 9.20991e+307 4.6219e+307 1.466e+308 8.12609e+307 4.44764e+307.

The trick with exponent allows making a more fair distribution in terms of underlying bit representation (again, it is useful for radix sort), and it appears to provide more real-world values. This is what I got with log2(1e12) (almost 40, basically): -1.97263e+06 32710.5 -1.40546e+06 -19.2773 265671 181.673 -4.88076e+09 6.98599e+08.

Why separate the sign into a separate distribution rather than std::uniform_real_distribution dist_real(-log2(1e12), log2(1e12));

That's because of exp2 usage below. It provides only positive values.

Why use log2 followed by exp2

Imaging that with the exponent transformation, each green tick on the picture below has equal probability:

image

This picture is from https://en.wikipedia.org/wiki/Floating-point_arithmetic. It mimics the distribution of representable values of a floating point type.

It would be better to use log2(numeric_limits::max) instead of log2(1e12) and leave a comment, though. 1e12 is the maximum value after log2 and exp2 transformations. This pair is complementary and that is why it is used together.

Meanwhile, I also checked std::numeric_limits<T>::lowest(), std::numeric_limits<T>::max() and got a sequence of inf values: https://godbolt.org/z/YzzrzzY6e.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK I think I understand, I suppose for floating point it makes sense to "center" the values on some smaller range, rather than being very likely to end up with gigantic positive or negative values. I can buy the argument that this is more likely to be similar to a real case.

If I understand correctly, the log2 exp2 is to provide a distribution which is more uniform in bit representation for radix sort. Its actually an interesting question if this is the "correct" thing to do when testing or timing radix sort for floating point numbers. This will make it act most like a uniformly distributed integral type, but one could also argue that real world data "doesn't care" about bit distribution and that just making a uniform distribution of real numbers is more appropriate, for instance uniform(-1e12,1e12). On the other hand, I have no real evidence that one distribution is more likely in user code than another for floating point data.

Very interesting that you get inf numbers when using lowest() and max(). Seems like a bug, but I can also understand why using the type's limits in the distribution would make for difficult calculations that may involve the "size" of the range.

I will try to unify the data generation, and perhaps make optional some of the radix sort specific stuff.

Copy link
Contributor

@dmitriy-sobolev dmitriy-sobolev Jul 2, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, Dan. I agree with your suggestions.

I guess it is fine to have such a distribution for radix sort tests to cover as many bits combinations as possible, assuming the generated data does not look too artificial. This is because radix sort does many bit-related manipulations, and it should help to catch cases like this one: https://github.com/oneapi-src/oneDPL/blob/b47ff69d059e72384b01df040423c916f21ca835/include/oneapi/dpl/experimental/kt/internal/esimd_radix_sort_utils.h#L100-L104

The benchmarks is a different story and uniform(-1e12,1e12) looks better. As for copy_if, uniform(-1e12,1e12) looks better too.

Signed-off-by: Dan Hoeflinger <[email protected]>
__group_copy_if_fits_in_slm(const sycl::queue& __queue, _Size __n, std::size_t __n_uniform)
{
using _SingleGroupInvoker = __invoke_single_group_copy_if<_Size>;
::std::size_t __max_wg_size = oneapi::dpl::__internal::__max_work_group_size(__queue);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
::std::size_t __max_wg_size = oneapi::dpl::__internal::__max_work_group_size(__queue);
const std::size_t __max_wg_size = oneapi::dpl::__internal::__max_work_group_size(__queue);

@danhoeflinger
Copy link
Contributor Author

danhoeflinger commented Aug 8, 2024

I suggest prioritizing #1762 and #1763 over this PR for now. If those go through, the performance of the oneDPL main copy_if API will supersede this KT. If we see significant risk that the above mentioned PRs will miss the release, we should pivot to merge this PR.
Once those merge, I will remove this from the release milestone, until further improvements can be incorporated into this KT which enable it to have value on its own.

@danhoeflinger
Copy link
Contributor Author

At this point, I think this PR is more difficult to land in 2022.7.0 than the first two reduce_then_scan PRs, and provides worse performance, so I'm pulling this from the milestone, and converting to draft.
This may resurface with concepts from reduce_then_scan combined with the lookback to provide enhanced performance over mainline oneDPL, but for now we shouldn't be prioritizing this.

@danhoeflinger danhoeflinger marked this pull request as draft August 26, 2024 19:05
@danhoeflinger danhoeflinger removed this from the 2022.7.0 milestone Aug 26, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants