[KT] Single Pass Copy_if Kernel Template #1616

danhoeflinger · 2024-06-06T14:06:30Z

This PR adds a pair of APIs (iterator and range variants) for :
oneapi::dpl::experimental::kt::gpu::copy_if
which take an input and output sequence, as well as a sequence representing a single element to store the number of elements copied (which is left on the device) and a predicate. It copies each element from the input to the output which satisfies the predicate and records the number of elements copied, preserving the relative order of the elements.

Other additions within this PR

Tests for these new APIs

Additional notable details:

Refactor of oneDPL mainline copy_if single workgroup implementation to lift out the "copy to host" of the num copied return value one level and enable use by the new kernel template
Refactor of scan kernel template to share lookback phase, allocation manager with copy_if
Adjust lookback phase to rely upon the last subgroup / last work-item rather than the first subgroup / first work-item to do operations which we want only a single subgroup or work-item to do. This enables propagation of "running" scan values without extra intra-workgroup communication for copy_if. I don't believe this change negatively impacts scan KT.

Adapted from previous work by AidanBeltonS, Alcpz, joeatodd, adamfidel

Atomic flags and the values used in Scan_kt separated to avoid truncating the range to 30bit values, and prepare for a more general scan implementation.

* Improved Scan_kt: templated parameters, ballot, wgsize calculation. - Changed number of workgroups calculation from next power of two to next multiple of wgsize - Improved group_ballot by using the class member functions - Using kernel_param struct to determine wgsize and elems per work item.

…ree of the device memory (#18) * Single memory allocation for device_memory * async free of device memory --------- Co-authored-by: Joe Todd <[email protected]> Co-authored-by: AidanBeltonS <[email protected]>

* Refactored cooperative_loopback and memory implementation detail * renamed load_counter to fetch_add_counter * Removed dynamic tile counter from the scan memory struct * scratch memory Reordering * Fixed wrong values returned in LoopbackScanMemory.get_value * Improved Class and variable naming

* Implemented atomic64 version of the scan_kt pass * Removed repeated offset calculation for tile id atomic flag * Loopback -> Lookback. Removed unused var.

TODO: we still allocate & initialize the memory for the counter

Also use #pragma unroll for now

Co-authored-by: Alberto Cabrera Pérez <[email protected]>

Signed-off-by: Dan Hoeflinger <[email protected]>

danhoeflinger · 2024-06-07T12:00:54Z

Interestingly, there originally was a regression (~10%) in scan performance by using the last subgroup, last workitem of the subgroup and originating a broadcast from the last workitem of the workgroup, rather than the zeroth of each to perform the "solo" actions in the lookback.

I do not have an understanding of why this might be. copy_if needs to use the last here to take advantage of the location of the data which needs to be communicated. I've adjusted the shared helper function for lookback to allow the individual algorithm to dictate the active subgroup, workitem and source for the broadcast, and this repaired the performance regression for scan.

dmitriy-sobolev

I've briefly checked the API and the tests. I am going to dive deeper.

include/oneapi/dpl/experimental/kt/single_pass_scan.h

include/oneapi/dpl/experimental/kt/kernel_param.h

include/oneapi/dpl/experimental/kt/single_pass_scan.h

dmitriy-sobolev · 2024-06-14T14:50:07Z

test/kt/single_pass_copy_if.cpp

+    // Integer numbers are generated even for floating point types in order to avoid rounding errors,
+    // and simplify the final check
+    using substitute_t = std::conditional_t<std::is_signed_v<T>, std::int64_t, std::uint64_t>;
+
+    std::default_random_engine gen{seed};
+    substitute_t start = std::is_signed_v<T> ? -10 : 0;
+    substitute_t end = std::is_signed_v<T> ? 10 : 20;
+    std::uniform_int_distribution<substitute_t> dist(start, end);


Data generation can be simpler and cover the whole range of numbers, because there will be no overflows due to copy_if nature. There is no need to use substitute_t logic.

I simplified this, thanks.

Could you also remove the comment?

// Integer numbers are generated even for floating point types in order to avoid rounding errors, // and simplify the final check

test/kt/single_pass_copy_if.cpp

dmitriy-sobolev · 2024-06-14T15:19:25Z

include/oneapi/dpl/experimental/kt/single_pass_scan.h

        return __prev_event;
    }
 }

 } // namespace __impl

+template <typename _InRng, typename _OutRng, typename _NumCopiedRng, typename _UnaryPredicate, typename _KernelParam>
+sycl::event
+copy_if(sycl::queue __queue, _InRng&& __in_rng, _OutRng&& __out_rng, _NumCopiedRng&& __num_rng, _UnaryPredicate __pred,


It is natural to see __num_rng passed as a pointer rather than a range, because it is expected to hold one element. Will that API be able to digest __num_rng as a pointer?

This is definitely not a blocker, because it can be work-arounded on a caller side (e.g. wrapping it with views::subrange).

My intention here is to allow a buffer of a single element to be easily used here. However, I agree that an iterator / pointer may suit it better, given that it should be a single element. I don't think this current API works with just a pointer as it is written.

I don't think we should force it to be a pointer, because I don't think we make using buffers difficult, but I think an iterator could be fine (which includes USM pointers, of course). It would just require users to call oneapi::dpl::begin(num_copied_buf) first if they want to use a buffer.

I think thre may also be utility in creating a single_device_element type which could wrap a buffer, iterator, USM pointer to represent an single element which can be made available on the device. Its something I've considered also for the init value of scan, where currently we only allow a host-side ValueType value. If we also had an specialized overload for a single_device_element which would allow users to skip the host to device copy. The other place such a type is interesting is for the return of reduce for some KT version.

Having taken a fresher look, I think that the existing approach below is fine:

sycl::buffer<T> buf(input.data(), input.size()); sycl::buffer<T> buf_out(size); sycl::buffer<std::size_t> buf_num_copied(&num_copied, 1); oneapi::dpl::experimental::kt::gpu::copy_if(q, buf, buf_out, buf_num_copied, pred, param).wait();

I previously thought about this one as an alternative:

sycl::buffer<T> buf(input.data(), input.size()); sycl::buffer<T> buf_out(size); auto buf_num_copied = sycl::malloc_device<T>(1, q); oneapi::dpl::experimental::kt::gpu::copy_if(q, buf, buf_out, buf_num_copied, pred, param).wait();

Now, it does not look like a good idea to me.
Users are more likely to stick to either buffer or malloc_device for all the arguments. I know examples when an external pool of memory of memory is used for all the data, and only pointers are passed. So, it does not justify the additional complexity. Moreover, there are workarounds.

Discussing API of a new experimental feature involves much speculation, sometimes with no strong ground like this one. I hope it is useful at least.

I think thre may also be utility in creating a single_device_element type...

Yes, it will be useful if it simplifies the usage of the features and helps to avoid performance-punishing pitfalls. It would be good to see these cases, though.
What I can add is that introducing extra logic for passing init values may not be beneficial, because the latency of init value copying may be hidden by the kernel launch. It is a kernel argument, in contrast to return values of reduce or scan KT algorithms.

What I can add is that introducing extra logic for passing init values may not be beneficial

Actually, it may allow passing non-trivially-copyable type of init value in case the algorithm handles data, which is constructed on a device. I do not know such use-cases, though.

test/kt/single_pass_copy_if.cpp

Signed-off-by: Dan Hoeflinger <[email protected]>

dmitriy-sobolev · 2024-06-18T17:07:45Z

test/kt/single_pass_copy_if.cpp

+    if constexpr (std::is_integral_v<T>)
+    {
+        std::uniform_int_distribution<T> dist(std::numeric_limits<T>::min(), std::numeric_limits<T>::max());
+        std::generate(input, input + size, [&] { return dist(gen); });
+    }
+    else
+    {
+        std::uniform_real_distribution<T> dist(std::numeric_limits<T>::min(), std::numeric_limits<T>::max());
+        std::generate(input, input + size, [&] { return dist(gen); });
+    }


Two things to improve:

Use lowest instead of min (min is ~0 for float types).

Make sure that duplicates are generated.

You can take a look at test\kt\esimd_radix_sort_utils.h:generate_data(), it is done there.

Yes, good catch with lowest(), I know that but don't always remember that :) .

I don't understand why it is interesting to have duplicates for copy_if with this "less than cutoff" predicate, for radix sort it makes sense, but I don't see the need here.

I dont mind just including this utility header and call this generation function, no need to duplicate this functionality. While I don't understand the need for duplicates, they don't harm anything here.

Since we may use it here, can you explain the motivation for generating real numbers as it is done in test\kt\esimd_radix_sort_utils.h:generate_data()? Specifically:

Why limit to 1e12 rather than numerical_limits<T>::max()?

Why separate the sign into a separate distribution rather than std::uniform_real_distribution<T> dist_real(-log2(1e12), log2(1e12));?

Why use log2 followed by exp2 ?

If they perhaps have limited interest to things outside of radix sort, we may be able to make them tunable parameters.

I don't understand why it is interesting to have duplicates for copy_if with this "less than cutoff" predicate, for radix sort it makes sense, but I don't see the need here.

Yeah, you are right, it's not relevant for copy_if. Sorry to bother you with that.

Why limit to 1e12 rather than numerical_limits::max()?

The distribution is uniform and floating point types can hold very large values, so the values generated with max will gravitate towards edges. This is what I got with min and max: 9.43047e+307 4.73343e+307 3.52862e+307 9.20991e+307 4.6219e+307 1.466e+308 8.12609e+307 4.44764e+307.

The trick with exponent allows making a more fair distribution in terms of underlying bit representation (again, it is useful for radix sort), and it appears to provide more real-world values. This is what I got with log2(1e12) (almost 40, basically): -1.97263e+06 32710.5 -1.40546e+06 -19.2773 265671 181.673 -4.88076e+09 6.98599e+08.

Why separate the sign into a separate distribution rather than std::uniform_real_distribution dist_real(-log2(1e12), log2(1e12));

That's because of exp2 usage below. It provides only positive values.

Why use log2 followed by exp2

Imaging that with the exponent transformation, each green tick on the picture below has equal probability:

This picture is from https://en.wikipedia.org/wiki/Floating-point_arithmetic. It mimics the distribution of representable values of a floating point type.

It would be better to use log2(numeric_limits::max) instead of log2(1e12) and leave a comment, though. 1e12 is the maximum value after log2 and exp2 transformations. This pair is complementary and that is why it is used together.

Meanwhile, I also checked std::numeric_limits<T>::lowest(), std::numeric_limits<T>::max() and got a sequence of inf values: https://godbolt.org/z/YzzrzzY6e.

OK I think I understand, I suppose for floating point it makes sense to "center" the values on some smaller range, rather than being very likely to end up with gigantic positive or negative values. I can buy the argument that this is more likely to be similar to a real case.

If I understand correctly, the log2 exp2 is to provide a distribution which is more uniform in bit representation for radix sort. Its actually an interesting question if this is the "correct" thing to do when testing or timing radix sort for floating point numbers. This will make it act most like a uniformly distributed integral type, but one could also argue that real world data "doesn't care" about bit distribution and that just making a uniform distribution of real numbers is more appropriate, for instance uniform(-1e12,1e12). On the other hand, I have no real evidence that one distribution is more likely in user code than another for floating point data.

Very interesting that you get inf numbers when using lowest() and max(). Seems like a bug, but I can also understand why using the type's limits in the distribution would make for difficult calculations that may involve the "size" of the range.

I will try to unify the data generation, and perhaps make optional some of the radix sort specific stuff.

Thanks, Dan. I agree with your suggestions.

I guess it is fine to have such a distribution for radix sort tests to cover as many bits combinations as possible, assuming the generated data does not look too artificial. This is because radix sort does many bit-related manipulations, and it should help to catch cases like this one:

oneDPL/include/oneapi/dpl/experimental/kt/internal/esimd_radix_sort_utils.h

Lines 100 to 104 in b47ff69

// std::numeric_limits<_T>::max and std::numeric_limits<_T>::lowest cannot be used as an idenentity for

// performing radix sort of floating point numbers.

// They do not set the smallest exponent bit (i.e. the max is 7F7FFFFF for 32bit float),

// thus such an identity is not guaranteed to be put at the end of the sorted sequence after each radix sort stage,

// e.g. 00FF0000 numbers will be pushed out by 7F7FFFFF identities when sorting 16-23 bits.

The benchmarks is a different story and uniform(-1e12,1e12) looks better. As for copy_if, uniform(-1e12,1e12) looks better too.

Signed-off-by: Dan Hoeflinger <[email protected]>

SergeyKopienko · 2024-08-02T08:35:33Z

include/oneapi/dpl/pstl/hetero/dpcpp/parallel_backend_sycl.h

+__group_copy_if_fits_in_slm(const sycl::queue& __queue, _Size __n, std::size_t __n_uniform)
+{
+    using _SingleGroupInvoker = __invoke_single_group_copy_if<_Size>;
+    ::std::size_t __max_wg_size = oneapi::dpl::__internal::__max_work_group_size(__queue);


Suggested change

::std::size_t __max_wg_size = oneapi::dpl::__internal::__max_work_group_size(__queue);

const std::size_t __max_wg_size = oneapi::dpl::__internal::__max_work_group_size(__queue);

danhoeflinger · 2024-08-08T16:39:20Z

I suggest prioritizing #1762 and #1763 over this PR for now. If those go through, the performance of the oneDPL main copy_if API will supersede this KT. If we see significant risk that the above mentioned PRs will miss the release, we should pivot to merge this PR.
Once those merge, I will remove this from the release milestone, until further improvements can be incorporated into this KT which enable it to have value on its own.

danhoeflinger · 2024-08-26T19:05:19Z

At this point, I think this PR is more difficult to land in 2022.7.0 than the first two reduce_then_scan PRs, and provides worse performance, so I'm pulling this from the milestone, and converting to draft.
This may resurface with concepts from reduce_then_scan combined with the lookback to provide enhanced performance over mainline oneDPL, but for now we shouldn't be prioritizing this.

joeatodd and others added 30 commits November 29, 2023 12:35

Enable pragma unroll for open-source DPC++

e5c121b

clang-format

7c1cb0f

Start of single-pass scan kernel template

154161f

Fix hang in inclusive scan

16ec5ad

Debug statements for scan kernel template

bd89601

Update scan kernel template test

60a69fc

Only have a single work-item per group query for previous tile status

d526f04

First attempt at parallel lookback

09e9bbf

Working cooperative lookback

30e0da7

Fix correctness issue with non-power-of-2 sizes

2311929

Scan_kt Flags and Values separated (#15)

0f58c07

Atomic flags and the values used in Scan_kt separated to avoid truncating the range to 30bit values, and prepare for a more general scan implementation.

Scan_kt: Single memory allocation for device_memory (#17) and async f…

3de596e

…ree of the device memory (#18) * Single memory allocation for device_memory * async free of device memory --------- Co-authored-by: Joe Todd <[email protected]> Co-authored-by: AidanBeltonS <[email protected]>

Replace sycl::range with sycl::nd_range for fill

2d6ff78

Bug fix

124a912

Global to local then perform op

d716bbd

Update based on feedback

6a474c7

[Scan_kt] Atomic64 flags + value implementation (#25)

69cc2fa

* Implemented atomic64 version of the scan_kt pass * Removed repeated offset calculation for tile id atomic flag * Loopback -> Lookback. Removed unused var.

constexpr, types and remove an unneeded check

b5851ce

Correct static_cast ?

c9736c1

Defer group comms in lookback

0e450f7

Disable dynamic tile ID by default

95b5552

TODO: we still allocate & initialize the memory for the counter

Reduce from register sums instead of local mem

3f30ec8

Also use #pragma unroll for now

Unrolled version of joint_inclusive_scan

c147f05

Update include/oneapi/dpl/pstl/hetero/dpcpp/parallel_backend_sycl_scan.h

ab69568

Co-authored-by: Alberto Cabrera Pérez <[email protected]>

Add TODO

b992b84

Changing fill kernel for a memset

37726be

Single wg implementation

d7c3c78

Add phase 1

e42e68d

danhoeflinger added 6 commits June 5, 2024 14:05

fix range to check

f15c759

adjust data generation

b195e4d

Signed-off-by: Dan Hoeflinger <[email protected]>

better fix for non-full case

4528cbb

Signed-off-by: Dan Hoeflinger <[email protected]>

removing old test

831f9c9

Signed-off-by: Dan Hoeflinger <[email protected]>

undo change to unroll version check

7be4594

Signed-off-by: Dan Hoeflinger <[email protected]>

formatting

68c258d

Signed-off-by: Dan Hoeflinger <[email protected]>

danhoeflinger requested review from joeatodd, reble, timmiesmith, akukanov, dmitriy-sobolev, AidanBeltonS, mmichel11 and adamfidel June 6, 2024 14:06

danhoeflinger added 3 commits June 6, 2024 16:48

allowing alg to dictate active threads

347bcf5

Signed-off-by: Dan Hoeflinger <[email protected]>

bugfix for indexes

62f58e2

clang format

5c4bd74

Signed-off-by: Dan Hoeflinger <[email protected]>

danhoeflinger added this to the 2022.7.0 milestone Jun 10, 2024

dmitriy-sobolev reviewed Jun 14, 2024

View reviewed changes

danhoeflinger added 3 commits June 18, 2024 10:53

address reviewer comments

98324ed

Signed-off-by: Dan Hoeflinger <[email protected]>

simplify data generation and cutoff calculation.

4377c2d

Signed-off-by: Dan Hoeflinger <[email protected]>

strip out single workgroup opt out

496ed4d

Signed-off-by: Dan Hoeflinger <[email protected]>

dmitriy-sobolev reviewed Jun 18, 2024

View reviewed changes

minimal data generation changes

2d9e8a7

Signed-off-by: Dan Hoeflinger <[email protected]>

SergeyKopienko reviewed Aug 2, 2024

View reviewed changes

danhoeflinger marked this pull request as draft August 26, 2024 19:05

danhoeflinger removed this from the 2022.7.0 milestone Aug 26, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[KT] Single Pass Copy_if Kernel Template #1616

[KT] Single Pass Copy_if Kernel Template #1616

danhoeflinger commented Jun 6, 2024 •

edited

Loading

danhoeflinger commented Jun 7, 2024

dmitriy-sobolev left a comment

dmitriy-sobolev Jun 14, 2024

danhoeflinger Jun 18, 2024

dmitriy-sobolev Jun 18, 2024

dmitriy-sobolev Jun 14, 2024

danhoeflinger Jun 18, 2024

dmitriy-sobolev Jul 2, 2024

dmitriy-sobolev Jul 2, 2024

dmitriy-sobolev Jul 2, 2024

dmitriy-sobolev Jun 18, 2024

danhoeflinger Jun 18, 2024

dmitriy-sobolev Jul 2, 2024 •

edited

Loading

danhoeflinger Jul 2, 2024

dmitriy-sobolev Jul 2, 2024 •

edited

Loading

SergeyKopienko Aug 2, 2024

danhoeflinger commented Aug 8, 2024 •

edited

Loading

danhoeflinger commented Aug 26, 2024

	// std::numeric_limits<_T>::max and std::numeric_limits<_T>::lowest cannot be used as an idenentity for
	// performing radix sort of floating point numbers.
	// They do not set the smallest exponent bit (i.e. the max is 7F7FFFFF for 32bit float),
	// thus such an identity is not guaranteed to be put at the end of the sorted sequence after each radix sort stage,
	// e.g. 00FF0000 numbers will be pushed out by 7F7FFFFF identities when sorting 16-23 bits.

	::std::size_t __max_wg_size = oneapi::dpl::__internal::__max_work_group_size(__queue);
	const std::size_t __max_wg_size = oneapi::dpl::__internal::__max_work_group_size(__queue);

[KT] Single Pass Copy_if Kernel Template #1616

Are you sure you want to change the base?

[KT] Single Pass Copy_if Kernel Template #1616

Conversation

danhoeflinger commented Jun 6, 2024 • edited Loading

danhoeflinger commented Jun 7, 2024

dmitriy-sobolev left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

dmitriy-sobolev Jul 2, 2024 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

dmitriy-sobolev Jul 2, 2024 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

danhoeflinger commented Aug 8, 2024 • edited Loading

danhoeflinger commented Aug 26, 2024

danhoeflinger commented Jun 6, 2024 •

edited

Loading

dmitriy-sobolev Jul 2, 2024 •

edited

Loading

dmitriy-sobolev Jul 2, 2024 •

edited

Loading

danhoeflinger commented Aug 8, 2024 •

edited

Loading