-
Notifications
You must be signed in to change notification settings - Fork 114
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[KT] Single Pass Copy_if Kernel Template #1616
Draft
danhoeflinger
wants to merge
136
commits into
main
Choose a base branch
from
dev/dhoeflin/copy_if_kt
base: main
Could not load branches
Branch not found: {{ refName }}
Loading
Could not load tags
Nothing to show
Loading
Are you sure you want to change the base?
Some commits from the old base branch may be removed from the timeline,
and old review comments may become outdated.
Draft
Changes from 1 commit
Commits
Show all changes
136 commits
Select commit
Hold shift + click to select a range
e5c121b
Enable pragma unroll for open-source DPC++
joeatodd 7c1cb0f
clang-format
joeatodd 154161f
Start of single-pass scan kernel template
adamfidel 16ec5ad
Fix hang in inclusive scan
adamfidel bd89601
Debug statements for scan kernel template
adamfidel 60a69fc
Update scan kernel template test
adamfidel d526f04
Only have a single work-item per group query for previous tile status
adamfidel 09e9bbf
First attempt at parallel lookback
adamfidel 30e0da7
Working cooperative lookback
adamfidel 2311929
Fix correctness issue with non-power-of-2 sizes
adamfidel 0f58c07
Scan_kt Flags and Values separated (#15)
Alcpz 8af98d6
Refactored Scan_kt code (#16)
Alcpz 3de596e
Scan_kt: Single memory allocation for device_memory (#17) and async f…
Alcpz 2d6ff78
Replace sycl::range with sycl::nd_range for fill
joeatodd 124a912
Bug fix
joeatodd d716bbd
Global to local then perform op
6a474c7
Update based on feedback
ba7be34
Refactored cooperative_loopback and memory implementation (#24)
Alcpz 69cc2fa
[Scan_kt] Atomic64 flags + value implementation (#25)
Alcpz b5851ce
constexpr, types and remove an unneeded check
joeatodd c9736c1
Correct static_cast ?
joeatodd 0e450f7
Defer group comms in lookback
joeatodd 95b5552
Disable dynamic tile ID by default
joeatodd 3f30ec8
Reduce from register sums instead of local mem
joeatodd c147f05
Unrolled version of joint_inclusive_scan
joeatodd ab69568
Update include/oneapi/dpl/pstl/hetero/dpcpp/parallel_backend_sycl_scan.h
joeatodd b992b84
Add TODO
joeatodd 37726be
Changing fill kernel for a memset
Alcpz d7c3c78
Single wg implementation
Alcpz e42e68d
Add phase 1
54c0ae9
Add phase 2
ba543ed
Add phase 3
cdf74d0
Add count datatype _SizeT
c5670d8
Move away from atomics
8918b42
Sort out test logic
625f315
Remove unnecessary load and store functions
ca7a830
Release scratch mem
25238eb
Add single wg copy if
2f2ccb2
Fix unrolls and use memset
021fb9a
apply changes to single wg
c4b05a4
Remove unused variables
5d1ed8e
Clang-format copy_if_kt commits
6382acd
Merge pull request #28 from AidanBeltonS/dev/aidan/copy_if
AidanBeltonS 70e751a
Enable pragma unroll for open-source DPC++
joeatodd a9fdaa3
Start of single-pass scan kernel template
adamfidel dfef06f
Fix hang in inclusive scan
adamfidel 555f6f9
Debug statements for scan kernel template
adamfidel 10cfc68
Update scan kernel template test
adamfidel 53faf10
Only have a single work-item per group query for previous tile status
adamfidel dc63d16
First attempt at parallel lookback
adamfidel f8c3f2b
Working cooperative lookback
adamfidel 1d72d3f
Fix correctness issue with non-power-of-2 sizes
adamfidel 567a50e
Scan_kt Flags and Values separated (#15)
Alcpz 0c91640
Refactored Scan_kt code (#16)
Alcpz 78d2d7d
Scan_kt: Single memory allocation for device_memory (#17) and async f…
Alcpz 55dc287
Replace sycl::range with sycl::nd_range for fill
joeatodd 37bfd1d
Bug fix
joeatodd 21038df
Global to local then perform op
bdcc9c9
Update based on feedback
9717e09
Refactored cooperative_loopback and memory implementation (#24)
Alcpz 8d23836
[Scan_kt] Atomic64 flags + value implementation (#25)
Alcpz c3c3218
constexpr, types and remove an unneeded check
joeatodd d257702
Correct static_cast ?
joeatodd 43e17ba
Defer group comms in lookback
joeatodd e5b3ca4
Disable dynamic tile ID by default
joeatodd ab346da
Reduce from register sums instead of local mem
joeatodd f87573c
Unrolled version of joint_inclusive_scan
joeatodd 621adf7
Update include/oneapi/dpl/pstl/hetero/dpcpp/parallel_backend_sycl_scan.h
joeatodd b8c837f
Add TODO
joeatodd 8367be7
Changing fill kernel for a memset
Alcpz 02ff9f3
Single wg implementation
Alcpz 25a93ff
Add phase 1
aea6009
Add phase 2
d5c2cb5
Add phase 3
1f574b8
Add count datatype _SizeT
16ef9c2
Move away from atomics
45a1fb7
Sort out test logic
cec32d7
Remove unnecessary load and store functions
b7d659c
Release scratch mem
fdb1824
Add single wg copy if
1df5fbb
Fix unrolls and use memset
d8b77fe
apply changes to single wg
5b53de6
Remove unused variables
acc4f9b
Clang-format copy_if_kt commits
faae196
partial changes (fix and amend)
danhoeflinger 9d39fc6
refactor to share lookback and memory mgr
danhoeflinger 1b10214
formatting
danhoeflinger 593c218
remove launder in favor of lazy ctor union
danhoeflinger 5a1752c
distinguishing kernel names
danhoeflinger a0576c3
uglify
danhoeflinger c5065a7
format
danhoeflinger 6a7291a
change single wg scan to submitter and kernel operator
danhoeflinger 1b78dcd
change scan to submitter and kernel operator
danhoeflinger 246dbf3
formatting
danhoeflinger 3023fec
remove unnecessary variable
danhoeflinger 52f6d82
renaming public APIs
danhoeflinger 8b504d3
sync with scan for asychronicity
danhoeflinger 6f60f10
sycl::event returns
danhoeflinger 0dfcd15
naming and minor fixes
danhoeflinger 12db722
removing single_wg public api
danhoeflinger 4850fcf
temporarily disable single wg version
danhoeflinger a8ccbd1
wait after call for async algs
danhoeflinger 70ad489
reenable single wg
danhoeflinger e78b72c
only need single phase for single wg
danhoeflinger 2ca4287
reusing single workgroup copy_if from oneDPL main
danhoeflinger 7538ed6
add option to opt out of compiling single wg
danhoeflinger 293d724
adding opt out for single wg inclusive scan
danhoeflinger 565ba3b
remove single_wg kt, in favor of main oneDPL version
danhoeflinger b3fbfe2
trying scalar version of copy_if
danhoeflinger e49bc9f
fix
danhoeflinger 92438ee
fix
danhoeflinger 5495da2
full sum
danhoeflinger a7ca1b5
switching arg to const ref
danhoeflinger 2d53ead
branch by tile, not by workitem
danhoeflinger f26aff0
removing unused block_strided version
danhoeflinger c0ab651
range API and formatting
danhoeflinger 517c341
removing unnecessary stuff
danhoeflinger 427a5f4
naming consistency
danhoeflinger e9091e1
formatting
danhoeflinger 7315d2b
reverting overreach
danhoeflinger 0983043
upgrading tests to match scan, cmake
danhoeflinger e09ccae
test bugfix
danhoeflinger 96acd30
bugfix for non-full case
danhoeflinger f15c759
fix range to check
danhoeflinger b195e4d
adjust data generation
danhoeflinger 4528cbb
better fix for non-full case
danhoeflinger 831f9c9
removing old test
danhoeflinger 7be4594
undo change to unroll version check
danhoeflinger 68c258d
formatting
danhoeflinger 347bcf5
allowing alg to dictate active threads
danhoeflinger 62f58e2
bugfix for indexes
danhoeflinger 5c4bd74
clang format
danhoeflinger 98324ed
address reviewer comments
danhoeflinger 4377c2d
simplify data generation and cutoff calculation.
danhoeflinger 496ed4d
strip out single workgroup opt out
danhoeflinger 2d9e8a7
minimal data generation changes
danhoeflinger File filter
Filter by extension
Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
There are no files selected for viewing
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Oops, something went wrong.
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Two things to improve:
lowest
instead ofmin
(min
is ~0 for float types).You can take a look at
test\kt\esimd_radix_sort_utils.h:generate_data()
, it is done there.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, good catch with
lowest()
, I know that but don't always remember that :) .I don't understand why it is interesting to have duplicates for
copy_if
with this "less than cutoff" predicate, for radix sort it makes sense, but I don't see the need here.I dont mind just including this utility header and call this generation function, no need to duplicate this functionality. While I don't understand the need for duplicates, they don't harm anything here.
Since we may use it here, can you explain the motivation for generating real numbers as it is done in
test\kt\esimd_radix_sort_utils.h:generate_data()
? Specifically:1e12
rather thannumerical_limits<T>::max()
?std::uniform_real_distribution<T> dist_real(-log2(1e12), log2(1e12));
?log2
followed byexp2
?If they perhaps have limited interest to things outside of radix sort, we may be able to make them tunable parameters.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, you are right, it's not relevant for
copy_if
. Sorry to bother you with that.The distribution is uniform and floating point types can hold very large values, so the values generated with
max
will gravitate towards edges. This is what I got withmin
andmax
:9.43047e+307 4.73343e+307 3.52862e+307 9.20991e+307 4.6219e+307 1.466e+308 8.12609e+307 4.44764e+307
.The trick with exponent allows making a more fair distribution in terms of underlying bit representation (again, it is useful for radix sort), and it appears to provide more real-world values. This is what I got with
log2(1e12)
(almost 40, basically):-1.97263e+06 32710.5 -1.40546e+06 -19.2773 265671 181.673 -4.88076e+09 6.98599e+08
.That's because of
exp2
usage below. It provides only positive values.Imaging that with the exponent transformation, each green tick on the picture below has equal probability:
This picture is from https://en.wikipedia.org/wiki/Floating-point_arithmetic. It mimics the distribution of representable values of a floating point type.
It would be better to use
log2(numeric_limits::max)
instead oflog2(1e12)
and leave a comment, though.1e12
is the maximum value after log2 and exp2 transformations. This pair is complementary and that is why it is used together.Meanwhile, I also checked
std::numeric_limits<T>::lowest(), std::numeric_limits<T>::max()
and got a sequence ofinf
values: https://godbolt.org/z/YzzrzzY6e.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK I think I understand, I suppose for floating point it makes sense to "center" the values on some smaller range, rather than being very likely to end up with gigantic positive or negative values. I can buy the argument that this is more likely to be similar to a real case.
If I understand correctly, the
log2
exp2
is to provide a distribution which is more uniform in bit representation for radix sort. Its actually an interesting question if this is the "correct" thing to do when testing or timing radix sort for floating point numbers. This will make it act most like a uniformly distributed integral type, but one could also argue that real world data "doesn't care" about bit distribution and that just making a uniform distribution of real numbers is more appropriate, for instance uniform(-1e12,1e12). On the other hand, I have no real evidence that one distribution is more likely in user code than another for floating point data.Very interesting that you get
inf
numbers when usinglowest()
andmax()
. Seems like a bug, but I can also understand why using the type's limits in the distribution would make for difficult calculations that may involve the "size" of the range.I will try to unify the data generation, and perhaps make optional some of the radix sort specific stuff.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, Dan. I agree with your suggestions.
I guess it is fine to have such a distribution for radix sort tests to cover as many bits combinations as possible, assuming the generated data does not look too artificial. This is because radix sort does many bit-related manipulations, and it should help to catch cases like this one: https://github.com/oneapi-src/oneDPL/blob/b47ff69d059e72384b01df040423c916f21ca835/include/oneapi/dpl/experimental/kt/internal/esimd_radix_sort_utils.h#L100-L104
The benchmarks is a different story and
uniform(-1e12,1e12)
looks better. As for copy_if,uniform(-1e12,1e12)
looks better too.