Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[KT] Single Pass Copy_if Kernel Template #1616

Draft
wants to merge 136 commits into
base: main
Choose a base branch
from
Draft
Changes from 1 commit
Commits
Show all changes
136 commits
Select commit Hold shift + click to select a range
e5c121b
Enable pragma unroll for open-source DPC++
joeatodd Nov 29, 2023
7c1cb0f
clang-format
joeatodd Nov 29, 2023
154161f
Start of single-pass scan kernel template
adamfidel Aug 18, 2023
16ec5ad
Fix hang in inclusive scan
adamfidel Aug 24, 2023
bd89601
Debug statements for scan kernel template
adamfidel Aug 31, 2023
60a69fc
Update scan kernel template test
adamfidel Sep 6, 2023
d526f04
Only have a single work-item per group query for previous tile status
adamfidel Sep 14, 2023
09e9bbf
First attempt at parallel lookback
adamfidel Sep 18, 2023
30e0da7
Working cooperative lookback
adamfidel Sep 22, 2023
2311929
Fix correctness issue with non-power-of-2 sizes
adamfidel Oct 25, 2023
0f58c07
Scan_kt Flags and Values separated (#15)
Alcpz Nov 7, 2023
8af98d6
Refactored Scan_kt code (#16)
Alcpz Nov 7, 2023
3de596e
Scan_kt: Single memory allocation for device_memory (#17) and async f…
Alcpz Nov 8, 2023
2d6ff78
Replace sycl::range with sycl::nd_range for fill
joeatodd Nov 8, 2023
124a912
Bug fix
joeatodd Nov 8, 2023
d716bbd
Global to local then perform op
Nov 8, 2023
6a474c7
Update based on feedback
Nov 10, 2023
ba7be34
Refactored cooperative_loopback and memory implementation (#24)
Alcpz Nov 21, 2023
69cc2fa
[Scan_kt] Atomic64 flags + value implementation (#25)
Alcpz Nov 23, 2023
b5851ce
constexpr, types and remove an unneeded check
joeatodd Nov 23, 2023
c9736c1
Correct static_cast ?
joeatodd Nov 23, 2023
0e450f7
Defer group comms in lookback
joeatodd Nov 23, 2023
95b5552
Disable dynamic tile ID by default
joeatodd Nov 23, 2023
3f30ec8
Reduce from register sums instead of local mem
joeatodd Nov 23, 2023
c147f05
Unrolled version of joint_inclusive_scan
joeatodd Nov 23, 2023
ab69568
Update include/oneapi/dpl/pstl/hetero/dpcpp/parallel_backend_sycl_scan.h
joeatodd Nov 23, 2023
b992b84
Add TODO
joeatodd Nov 23, 2023
37726be
Changing fill kernel for a memset
Alcpz Nov 28, 2023
d7c3c78
Single wg implementation
Alcpz Nov 29, 2023
e42e68d
Add phase 1
Nov 21, 2023
54c0ae9
Add phase 2
Nov 27, 2023
ba543ed
Add phase 3
Nov 28, 2023
cdf74d0
Add count datatype _SizeT
Dec 5, 2023
c5670d8
Move away from atomics
Dec 5, 2023
8918b42
Sort out test logic
Dec 5, 2023
625f315
Remove unnecessary load and store functions
Dec 5, 2023
ca7a830
Release scratch mem
Dec 6, 2023
25238eb
Add single wg copy if
Dec 8, 2023
2f2ccb2
Fix unrolls and use memset
Dec 8, 2023
021fb9a
apply changes to single wg
Dec 8, 2023
c4b05a4
Remove unused variables
Dec 8, 2023
5d1ed8e
Clang-format copy_if_kt commits
Dec 8, 2023
6382acd
Merge pull request #28 from AidanBeltonS/dev/aidan/copy_if
AidanBeltonS Dec 8, 2023
70e751a
Enable pragma unroll for open-source DPC++
joeatodd Nov 29, 2023
a9fdaa3
Start of single-pass scan kernel template
adamfidel Aug 18, 2023
dfef06f
Fix hang in inclusive scan
adamfidel Aug 24, 2023
555f6f9
Debug statements for scan kernel template
adamfidel Aug 31, 2023
10cfc68
Update scan kernel template test
adamfidel Sep 6, 2023
53faf10
Only have a single work-item per group query for previous tile status
adamfidel Sep 14, 2023
dc63d16
First attempt at parallel lookback
adamfidel Sep 18, 2023
f8c3f2b
Working cooperative lookback
adamfidel Sep 22, 2023
1d72d3f
Fix correctness issue with non-power-of-2 sizes
adamfidel Oct 25, 2023
567a50e
Scan_kt Flags and Values separated (#15)
Alcpz Nov 7, 2023
0c91640
Refactored Scan_kt code (#16)
Alcpz Nov 7, 2023
78d2d7d
Scan_kt: Single memory allocation for device_memory (#17) and async f…
Alcpz Nov 8, 2023
55dc287
Replace sycl::range with sycl::nd_range for fill
joeatodd Nov 8, 2023
37bfd1d
Bug fix
joeatodd Nov 8, 2023
21038df
Global to local then perform op
Nov 8, 2023
bdcc9c9
Update based on feedback
Nov 10, 2023
9717e09
Refactored cooperative_loopback and memory implementation (#24)
Alcpz Nov 21, 2023
8d23836
[Scan_kt] Atomic64 flags + value implementation (#25)
Alcpz Nov 23, 2023
c3c3218
constexpr, types and remove an unneeded check
joeatodd Nov 23, 2023
d257702
Correct static_cast ?
joeatodd Nov 23, 2023
43e17ba
Defer group comms in lookback
joeatodd Nov 23, 2023
e5b3ca4
Disable dynamic tile ID by default
joeatodd Nov 23, 2023
ab346da
Reduce from register sums instead of local mem
joeatodd Nov 23, 2023
f87573c
Unrolled version of joint_inclusive_scan
joeatodd Nov 23, 2023
621adf7
Update include/oneapi/dpl/pstl/hetero/dpcpp/parallel_backend_sycl_scan.h
joeatodd Nov 23, 2023
b8c837f
Add TODO
joeatodd Nov 23, 2023
8367be7
Changing fill kernel for a memset
Alcpz Nov 28, 2023
02ff9f3
Single wg implementation
Alcpz Nov 29, 2023
25a93ff
Add phase 1
Nov 21, 2023
aea6009
Add phase 2
Nov 27, 2023
d5c2cb5
Add phase 3
Nov 28, 2023
1f574b8
Add count datatype _SizeT
Dec 5, 2023
16ef9c2
Move away from atomics
Dec 5, 2023
45a1fb7
Sort out test logic
Dec 5, 2023
cec32d7
Remove unnecessary load and store functions
Dec 5, 2023
b7d659c
Release scratch mem
Dec 6, 2023
fdb1824
Add single wg copy if
Dec 8, 2023
1df5fbb
Fix unrolls and use memset
Dec 8, 2023
d8b77fe
apply changes to single wg
Dec 8, 2023
5b53de6
Remove unused variables
Dec 8, 2023
acc4f9b
Clang-format copy_if_kt commits
Dec 8, 2023
faae196
partial changes (fix and amend)
danhoeflinger May 13, 2024
9d39fc6
refactor to share lookback and memory mgr
danhoeflinger May 24, 2024
1b10214
formatting
danhoeflinger May 24, 2024
593c218
remove launder in favor of lazy ctor union
danhoeflinger May 24, 2024
5a1752c
distinguishing kernel names
danhoeflinger May 24, 2024
a0576c3
uglify
danhoeflinger May 24, 2024
c5065a7
format
danhoeflinger May 24, 2024
6a7291a
change single wg scan to submitter and kernel operator
danhoeflinger May 24, 2024
1b78dcd
change scan to submitter and kernel operator
danhoeflinger May 29, 2024
246dbf3
formatting
danhoeflinger May 29, 2024
3023fec
remove unnecessary variable
danhoeflinger May 29, 2024
52f6d82
renaming public APIs
danhoeflinger May 29, 2024
8b504d3
sync with scan for asychronicity
danhoeflinger May 29, 2024
6f60f10
sycl::event returns
danhoeflinger May 29, 2024
0dfcd15
naming and minor fixes
danhoeflinger May 29, 2024
12db722
removing single_wg public api
danhoeflinger May 29, 2024
4850fcf
temporarily disable single wg version
danhoeflinger May 29, 2024
a8ccbd1
wait after call for async algs
danhoeflinger May 30, 2024
70ad489
reenable single wg
danhoeflinger May 30, 2024
e78b72c
only need single phase for single wg
danhoeflinger May 30, 2024
2ca4287
reusing single workgroup copy_if from oneDPL main
danhoeflinger May 31, 2024
7538ed6
add option to opt out of compiling single wg
danhoeflinger May 31, 2024
293d724
adding opt out for single wg inclusive scan
danhoeflinger May 31, 2024
565ba3b
remove single_wg kt, in favor of main oneDPL version
danhoeflinger May 31, 2024
b3fbfe2
trying scalar version of copy_if
danhoeflinger Jun 4, 2024
e49bc9f
fix
danhoeflinger Jun 4, 2024
92438ee
fix
danhoeflinger Jun 4, 2024
5495da2
full sum
danhoeflinger Jun 4, 2024
a7ca1b5
switching arg to const ref
danhoeflinger Jun 4, 2024
2d53ead
branch by tile, not by workitem
danhoeflinger Jun 4, 2024
f26aff0
removing unused block_strided version
danhoeflinger Jun 4, 2024
c0ab651
range API and formatting
danhoeflinger Jun 4, 2024
517c341
removing unnecessary stuff
danhoeflinger Jun 4, 2024
427a5f4
naming consistency
danhoeflinger Jun 4, 2024
e9091e1
formatting
danhoeflinger Jun 4, 2024
7315d2b
reverting overreach
danhoeflinger Jun 4, 2024
0983043
upgrading tests to match scan, cmake
danhoeflinger Jun 5, 2024
e09ccae
test bugfix
danhoeflinger Jun 5, 2024
96acd30
bugfix for non-full case
danhoeflinger Jun 5, 2024
f15c759
fix range to check
danhoeflinger Jun 5, 2024
b195e4d
adjust data generation
danhoeflinger Jun 5, 2024
4528cbb
better fix for non-full case
danhoeflinger Jun 5, 2024
831f9c9
removing old test
danhoeflinger Jun 5, 2024
7be4594
undo change to unroll version check
danhoeflinger Jun 6, 2024
68c258d
formatting
danhoeflinger Jun 6, 2024
347bcf5
allowing alg to dictate active threads
danhoeflinger Jun 6, 2024
62f58e2
bugfix for indexes
danhoeflinger Jun 6, 2024
5c4bd74
clang format
danhoeflinger Jun 6, 2024
98324ed
address reviewer comments
danhoeflinger Jun 18, 2024
4377c2d
simplify data generation and cutoff calculation.
danhoeflinger Jun 18, 2024
496ed4d
strip out single workgroup opt out
danhoeflinger Jun 18, 2024
2d9e8a7
minimal data generation changes
danhoeflinger Jun 18, 2024
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
33 changes: 18 additions & 15 deletions test/kt/single_pass_copy_if.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -61,13 +61,18 @@ generate_copy_if_data(T* input, std::size_t size, std::uint32_t seed)
{
// Integer numbers are generated even for floating point types in order to avoid rounding errors,
// and simplify the final check
using substitute_t = std::conditional_t<std::is_signed_v<T>, std::int64_t, std::uint64_t>;

std::default_random_engine gen{seed};
substitute_t start = std::is_signed_v<T> ? -10 : 0;
substitute_t end = std::is_signed_v<T> ? 10 : 20;
std::uniform_int_distribution<substitute_t> dist(start, end);
std::generate(input, input + size, [&] { return dist(gen); });

if constexpr (std::is_integral_v<T>)
{
std::uniform_int_distribution<T> dist(std::numeric_limits<T>::min(), std::numeric_limits<T>::max());
std::generate(input, input + size, [&] { return dist(gen); });
}
else
{
std::uniform_real_distribution<T> dist(std::numeric_limits<T>::min(), std::numeric_limits<T>::max());
std::generate(input, input + size, [&] { return dist(gen); });
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Two things to improve:

  • Use lowest instead of min (min is ~0 for float types).
  • Make sure that duplicates are generated.

You can take a look at test\kt\esimd_radix_sort_utils.h:generate_data(), it is done there.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, good catch with lowest(), I know that but don't always remember that :) .

I don't understand why it is interesting to have duplicates for copy_if with this "less than cutoff" predicate, for radix sort it makes sense, but I don't see the need here.

I dont mind just including this utility header and call this generation function, no need to duplicate this functionality. While I don't understand the need for duplicates, they don't harm anything here.

Since we may use it here, can you explain the motivation for generating real numbers as it is done in test\kt\esimd_radix_sort_utils.h:generate_data()? Specifically:

  1. Why limit to 1e12 rather than numerical_limits<T>::max()?
  2. Why separate the sign into a separate distribution rather than std::uniform_real_distribution<T> dist_real(-log2(1e12), log2(1e12));?
  3. Why use log2 followed by exp2 ?

If they perhaps have limited interest to things outside of radix sort, we may be able to make them tunable parameters.

Copy link
Contributor

@dmitriy-sobolev dmitriy-sobolev Jul 2, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand why it is interesting to have duplicates for copy_if with this "less than cutoff" predicate, for radix sort it makes sense, but I don't see the need here.

Yeah, you are right, it's not relevant for copy_if. Sorry to bother you with that.

Why limit to 1e12 rather than numerical_limits::max()?

The distribution is uniform and floating point types can hold very large values, so the values generated with max will gravitate towards edges. This is what I got with min and max: 9.43047e+307 4.73343e+307 3.52862e+307 9.20991e+307 4.6219e+307 1.466e+308 8.12609e+307 4.44764e+307.

The trick with exponent allows making a more fair distribution in terms of underlying bit representation (again, it is useful for radix sort), and it appears to provide more real-world values. This is what I got with log2(1e12) (almost 40, basically): -1.97263e+06 32710.5 -1.40546e+06 -19.2773 265671 181.673 -4.88076e+09 6.98599e+08.

Why separate the sign into a separate distribution rather than std::uniform_real_distribution dist_real(-log2(1e12), log2(1e12));

That's because of exp2 usage below. It provides only positive values.

Why use log2 followed by exp2

Imaging that with the exponent transformation, each green tick on the picture below has equal probability:

image

This picture is from https://en.wikipedia.org/wiki/Floating-point_arithmetic. It mimics the distribution of representable values of a floating point type.

It would be better to use log2(numeric_limits::max) instead of log2(1e12) and leave a comment, though. 1e12 is the maximum value after log2 and exp2 transformations. This pair is complementary and that is why it is used together.

Meanwhile, I also checked std::numeric_limits<T>::lowest(), std::numeric_limits<T>::max() and got a sequence of inf values: https://godbolt.org/z/YzzrzzY6e.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK I think I understand, I suppose for floating point it makes sense to "center" the values on some smaller range, rather than being very likely to end up with gigantic positive or negative values. I can buy the argument that this is more likely to be similar to a real case.

If I understand correctly, the log2 exp2 is to provide a distribution which is more uniform in bit representation for radix sort. Its actually an interesting question if this is the "correct" thing to do when testing or timing radix sort for floating point numbers. This will make it act most like a uniformly distributed integral type, but one could also argue that real world data "doesn't care" about bit distribution and that just making a uniform distribution of real numbers is more appropriate, for instance uniform(-1e12,1e12). On the other hand, I have no real evidence that one distribution is more likely in user code than another for floating point data.

Very interesting that you get inf numbers when using lowest() and max(). Seems like a bug, but I can also understand why using the type's limits in the distribution would make for difficult calculations that may involve the "size" of the range.

I will try to unify the data generation, and perhaps make optional some of the radix sort specific stuff.

Copy link
Contributor

@dmitriy-sobolev dmitriy-sobolev Jul 2, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, Dan. I agree with your suggestions.

I guess it is fine to have such a distribution for radix sort tests to cover as many bits combinations as possible, assuming the generated data does not look too artificial. This is because radix sort does many bit-related manipulations, and it should help to catch cases like this one:

// std::numeric_limits<_T>::max and std::numeric_limits<_T>::lowest cannot be used as an idenentity for
// performing radix sort of floating point numbers.
// They do not set the smallest exponent bit (i.e. the max is 7F7FFFFF for 32bit float),
// thus such an identity is not guaranteed to be put at the end of the sorted sequence after each radix sort stage,
// e.g. 00FF0000 numbers will be pushed out by 7F7FFFFF identities when sorting 16-23 bits.

The benchmarks is a different story and uniform(-1e12,1e12) looks better. As for copy_if, uniform(-1e12,1e12) looks better too.

}

#if _ENABLE_RANGES_TESTING
Expand All @@ -87,15 +92,12 @@ test_all_view(sycl::queue q, std::size_t size, Predicate pred, KernelParam param
sycl::buffer<std::size_t> buf_num_copied(&num_copied, 1);
auto out_end = std::copy_if(std::begin(ref), std::end(ref), std::begin(out_ref), pred);
std::size_t num_copied_ref = out_end - std::begin(out_ref);
{
sycl::buffer<T> buf(input.data(), input.size());
sycl::buffer<T> buf(input.data(), input.size());

oneapi::dpl::experimental::ranges::all_view<T, sycl::access::mode::read> view(buf);
oneapi::dpl::experimental::ranges::all_view<T, sycl::access::mode::read_write> view_out(buf_out);
oneapi::dpl::experimental::ranges::all_view<std::size_t, sycl::access::mode::write> view_num_copied(
buf_num_copied);
oneapi::dpl::experimental::kt::gpu::copy_if(q, view, view_out, view_num_copied, pred, param).wait();
}
oneapi::dpl::experimental::ranges::all_view<T, sycl::access::mode::read> view(buf);
oneapi::dpl::experimental::ranges::all_view<T, sycl::access::mode::write> view_out(buf_out);
oneapi::dpl::experimental::ranges::all_view<std::size_t, sycl::access::mode::write> view_num_copied(buf_num_copied);
oneapi::dpl::experimental::kt::gpu::copy_if(q, view, view_out, view_num_copied, pred, param).wait();

auto acc = buf_out.get_host_access();
auto num_copied_acc = buf_num_copied.get_host_access();
Expand Down Expand Up @@ -241,7 +243,8 @@ main()
auto q = TestUtils::get_test_queue();
bool run_test = can_run_test<decltype(params), TEST_TYPE>(q, params);

auto __predicate = __less_than_val<TEST_TYPE>{std::is_signed_v<TEST_TYPE> ? TEST_TYPE{0} : TEST_TYPE{10}};
TEST_TYPE cutoff = std::is_signed_v<TEST_TYPE> ? TEST_TYPE{0} : std::numeric_limits<TEST_TYPE>::max() / 2;
auto __predicate = __less_than_val<TEST_TYPE>{cutoff};
if (run_test)
{

Expand Down
Loading