-
Notifications
You must be signed in to change notification settings - Fork 113
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[KT] Single Pass Copy_if Kernel Template #1616
base: main
Are you sure you want to change the base?
Conversation
Atomic flags and the values used in Scan_kt separated to avoid truncating the range to 30bit values, and prepare for a more general scan implementation.
* Improved Scan_kt: templated parameters, ballot, wgsize calculation. - Changed number of workgroups calculation from next power of two to next multiple of wgsize - Improved group_ballot by using the class member functions - Using kernel_param struct to determine wgsize and elems per work item.
…ree of the device memory (#18) * Single memory allocation for device_memory * async free of device memory --------- Co-authored-by: Joe Todd <[email protected]> Co-authored-by: AidanBeltonS <[email protected]>
* Refactored cooperative_loopback and memory implementation detail * renamed load_counter to fetch_add_counter * Removed dynamic tile counter from the scan memory struct * scratch memory Reordering * Fixed wrong values returned in LoopbackScanMemory.get_value * Improved Class and variable naming
* Implemented atomic64 version of the scan_kt pass * Removed repeated offset calculation for tile id atomic flag * Loopback -> Lookback. Removed unused var.
TODO: we still allocate & initialize the memory for the counter
Also use #pragma unroll for now
Co-authored-by: Alberto Cabrera Pérez <[email protected]>
Signed-off-by: Dan Hoeflinger <[email protected]>
Signed-off-by: Dan Hoeflinger <[email protected]>
Signed-off-by: Dan Hoeflinger <[email protected]>
Signed-off-by: Dan Hoeflinger <[email protected]>
Signed-off-by: Dan Hoeflinger <[email protected]>
Signed-off-by: Dan Hoeflinger <[email protected]>
Signed-off-by: Dan Hoeflinger <[email protected]>
Interestingly, there originally was a regression (~10%) in scan performance by using the last subgroup, last workitem of the subgroup and originating a broadcast from the last workitem of the workgroup, rather than the zeroth of each to perform the "solo" actions in the lookback. I do not have an understanding of why this might be. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've briefly checked the API and the tests. I am going to dive deeper.
test/kt/single_pass_copy_if.cpp
Outdated
// Integer numbers are generated even for floating point types in order to avoid rounding errors, | ||
// and simplify the final check | ||
using substitute_t = std::conditional_t<std::is_signed_v<T>, std::int64_t, std::uint64_t>; | ||
|
||
std::default_random_engine gen{seed}; | ||
substitute_t start = std::is_signed_v<T> ? -10 : 0; | ||
substitute_t end = std::is_signed_v<T> ? 10 : 20; | ||
std::uniform_int_distribution<substitute_t> dist(start, end); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Data generation can be simpler and cover the whole range of numbers, because there will be no overflows due to copy_if
nature. There is no need to use substitute_t
logic.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I simplified this, thanks.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you also remove the comment?
// Integer numbers are generated even for floating point types in order to avoid rounding errors,
// and simplify the final check
return __prev_event; | ||
} | ||
} | ||
|
||
} // namespace __impl | ||
|
||
template <typename _InRng, typename _OutRng, typename _NumCopiedRng, typename _UnaryPredicate, typename _KernelParam> | ||
sycl::event | ||
copy_if(sycl::queue __queue, _InRng&& __in_rng, _OutRng&& __out_rng, _NumCopiedRng&& __num_rng, _UnaryPredicate __pred, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is natural to see __num_rng
passed as a pointer rather than a range, because it is expected to hold one element. Will that API be able to digest __num_rng
as a pointer?
This is definitely not a blocker, because it can be work-arounded on a caller side (e.g. wrapping it with views::subrange
).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My intention here is to allow a buffer of a single element to be easily used here. However, I agree that an iterator / pointer may suit it better, given that it should be a single element. I don't think this current API works with just a pointer as it is written.
I don't think we should force it to be a pointer, because I don't think we make using buffers difficult, but I think an iterator could be fine (which includes USM pointers, of course). It would just require users to call oneapi::dpl::begin(num_copied_buf)
first if they want to use a buffer.
I think thre may also be utility in creating a single_device_element
type which could wrap a buffer, iterator, USM pointer to represent an single element which can be made available on the device. Its something I've considered also for the init
value of scan, where currently we only allow a host-side ValueType value. If we also had an specialized overload for a single_device_element
which would allow users to skip the host to device copy. The other place such a type is interesting is for the return of reduce
for some KT version.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Having taken a fresher look, I think that the existing approach below is fine:
sycl::buffer<T> buf(input.data(), input.size());
sycl::buffer<T> buf_out(size);
sycl::buffer<std::size_t> buf_num_copied(&num_copied, 1);
oneapi::dpl::experimental::kt::gpu::copy_if(q, buf, buf_out, buf_num_copied, pred, param).wait();
I previously thought about this one as an alternative:
sycl::buffer<T> buf(input.data(), input.size());
sycl::buffer<T> buf_out(size);
auto buf_num_copied = sycl::malloc_device<T>(1, q);
oneapi::dpl::experimental::kt::gpu::copy_if(q, buf, buf_out, buf_num_copied, pred, param).wait();
Now, it does not look like a good idea to me.
Users are more likely to stick to either buffer
or malloc_device
for all the arguments. I know examples when an external pool of memory of memory is used for all the data, and only pointers are passed. So, it does not justify the additional complexity. Moreover, there are workarounds.
Discussing API of a new experimental feature involves much speculation, sometimes with no strong ground like this one. I hope it is useful at least.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think thre may also be utility in creating a single_device_element type...
Yes, it will be useful if it simplifies the usage of the features and helps to avoid performance-punishing pitfalls. It would be good to see these cases, though.
What I can add is that introducing extra logic for passing init
values may not be beneficial, because the latency of init
value copying may be hidden by the kernel launch. It is a kernel argument, in contrast to return values of reduce
or scan
KT algorithms.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What I can add is that introducing extra logic for passing init values may not be beneficial
Actually, it may allow passing non-trivially-copyable type of init
value in case the algorithm handles data, which is constructed on a device. I do not know such use-cases, though.
Signed-off-by: Dan Hoeflinger <[email protected]>
Signed-off-by: Dan Hoeflinger <[email protected]>
Signed-off-by: Dan Hoeflinger <[email protected]>
test/kt/single_pass_copy_if.cpp
Outdated
if constexpr (std::is_integral_v<T>) | ||
{ | ||
std::uniform_int_distribution<T> dist(std::numeric_limits<T>::min(), std::numeric_limits<T>::max()); | ||
std::generate(input, input + size, [&] { return dist(gen); }); | ||
} | ||
else | ||
{ | ||
std::uniform_real_distribution<T> dist(std::numeric_limits<T>::min(), std::numeric_limits<T>::max()); | ||
std::generate(input, input + size, [&] { return dist(gen); }); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Two things to improve:
- Use
lowest
instead ofmin
(min
is ~0 for float types). - Make sure that duplicates are generated.
You can take a look at test\kt\esimd_radix_sort_utils.h:generate_data()
, it is done there.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, good catch with lowest()
, I know that but don't always remember that :) .
I don't understand why it is interesting to have duplicates for copy_if
with this "less than cutoff" predicate, for radix sort it makes sense, but I don't see the need here.
I dont mind just including this utility header and call this generation function, no need to duplicate this functionality. While I don't understand the need for duplicates, they don't harm anything here.
Since we may use it here, can you explain the motivation for generating real numbers as it is done in test\kt\esimd_radix_sort_utils.h:generate_data()
? Specifically:
- Why limit to
1e12
rather thannumerical_limits<T>::max()
? - Why separate the sign into a separate distribution rather than
std::uniform_real_distribution<T> dist_real(-log2(1e12), log2(1e12));
? - Why use
log2
followed byexp2
?
If they perhaps have limited interest to things outside of radix sort, we may be able to make them tunable parameters.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't understand why it is interesting to have duplicates for copy_if with this "less than cutoff" predicate, for radix sort it makes sense, but I don't see the need here.
Yeah, you are right, it's not relevant for copy_if
. Sorry to bother you with that.
Why limit to 1e12 rather than numerical_limits::max()?
The distribution is uniform and floating point types can hold very large values, so the values generated with max
will gravitate towards edges. This is what I got with min
and max
: 9.43047e+307 4.73343e+307 3.52862e+307 9.20991e+307 4.6219e+307 1.466e+308 8.12609e+307 4.44764e+307
.
The trick with exponent allows making a more fair distribution in terms of underlying bit representation (again, it is useful for radix sort), and it appears to provide more real-world values. This is what I got with log2(1e12)
(almost 40, basically): -1.97263e+06 32710.5 -1.40546e+06 -19.2773 265671 181.673 -4.88076e+09 6.98599e+08
.
Why separate the sign into a separate distribution rather than std::uniform_real_distribution dist_real(-log2(1e12), log2(1e12));
That's because of exp2
usage below. It provides only positive values.
Why use log2 followed by exp2
Imaging that with the exponent transformation, each green tick on the picture below has equal probability:
This picture is from https://en.wikipedia.org/wiki/Floating-point_arithmetic. It mimics the distribution of representable values of a floating point type.
It would be better to use log2(numeric_limits::max)
instead of log2(1e12)
and leave a comment, though. 1e12
is the maximum value after log2 and exp2 transformations. This pair is complementary and that is why it is used together.
Meanwhile, I also checked std::numeric_limits<T>::lowest(), std::numeric_limits<T>::max()
and got a sequence of inf
values: https://godbolt.org/z/YzzrzzY6e.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK I think I understand, I suppose for floating point it makes sense to "center" the values on some smaller range, rather than being very likely to end up with gigantic positive or negative values. I can buy the argument that this is more likely to be similar to a real case.
If I understand correctly, the log2
exp2
is to provide a distribution which is more uniform in bit representation for radix sort. Its actually an interesting question if this is the "correct" thing to do when testing or timing radix sort for floating point numbers. This will make it act most like a uniformly distributed integral type, but one could also argue that real world data "doesn't care" about bit distribution and that just making a uniform distribution of real numbers is more appropriate, for instance uniform(-1e12,1e12). On the other hand, I have no real evidence that one distribution is more likely in user code than another for floating point data.
Very interesting that you get inf
numbers when using lowest()
and max()
. Seems like a bug, but I can also understand why using the type's limits in the distribution would make for difficult calculations that may involve the "size" of the range.
I will try to unify the data generation, and perhaps make optional some of the radix sort specific stuff.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, Dan. I agree with your suggestions.
I guess it is fine to have such a distribution for radix sort tests to cover as many bits combinations as possible, assuming the generated data does not look too artificial. This is because radix sort does many bit-related manipulations, and it should help to catch cases like this one:
oneDPL/include/oneapi/dpl/experimental/kt/internal/esimd_radix_sort_utils.h
Lines 100 to 104 in b47ff69
// std::numeric_limits<_T>::max and std::numeric_limits<_T>::lowest cannot be used as an idenentity for | |
// performing radix sort of floating point numbers. | |
// They do not set the smallest exponent bit (i.e. the max is 7F7FFFFF for 32bit float), | |
// thus such an identity is not guaranteed to be put at the end of the sorted sequence after each radix sort stage, | |
// e.g. 00FF0000 numbers will be pushed out by 7F7FFFFF identities when sorting 16-23 bits. |
The benchmarks is a different story and uniform(-1e12,1e12)
looks better. As for copy_if, uniform(-1e12,1e12)
looks better too.
Signed-off-by: Dan Hoeflinger <[email protected]>
__group_copy_if_fits_in_slm(const sycl::queue& __queue, _Size __n, std::size_t __n_uniform) | ||
{ | ||
using _SingleGroupInvoker = __invoke_single_group_copy_if<_Size>; | ||
::std::size_t __max_wg_size = oneapi::dpl::__internal::__max_work_group_size(__queue); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
::std::size_t __max_wg_size = oneapi::dpl::__internal::__max_work_group_size(__queue); | |
const std::size_t __max_wg_size = oneapi::dpl::__internal::__max_work_group_size(__queue); |
I suggest prioritizing #1762 and #1763 over this PR for now. If those go through, the performance of the oneDPL main copy_if API will supersede this KT. If we see significant risk that the above mentioned PRs will miss the release, we should pivot to merge this PR. |
At this point, I think this PR is more difficult to land in 2022.7.0 than the first two |
This PR adds a pair of APIs (iterator and range variants) for :
oneapi::dpl::experimental::kt::gpu::copy_if
which take an input and output sequence, as well as a sequence representing a single element to store the number of elements copied (which is left on the device) and a predicate. It copies each element from the input to the output which satisfies the predicate and records the number of elements copied, preserving the relative order of the elements.
Other additions within this PR
Additional notable details:
copy_if
single workgroup implementation to lift out the "copy to host" of the num copied return value one level and enable use by the new kernel templatescan
kernel template to share lookback phase, allocation manager withcopy_if
copy_if
. I don't believe this change negatively impactsscan
KT.Adapted from previous work by AidanBeltonS, Alcpz, joeatodd, adamfidel