Implement seed finding Thunderdome in CUDA #410
base: main
Conversation
It might not look like it from my comments, but I like this proposal. I believe it will prove very useful once we get to making CKF as fast as we can.
I'm just a stickler for some of the technicalities...
public:
    std::size_t _size;
    std::array<link_type, N> _sps;
If client code is meant to access these variables now, they should really not be prefixed with _.
Absolutely correct, I plan on renaming these and making some more changes to the nseed class during the development of this PR. Well spotted. 👍
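For illustration, the renamed members might end up looking something like this; this is only a sketch, the final names are up to the PR, and link_type is stubbed here just to make it self-contained.

#include <array>
#include <cstddef>

// Hypothetical stub so the sketch compiles on its own.
using link_type = unsigned int;

template <std::size_t N>
class nseed {
public:
    std::size_t size;              // number of spacepoints currently in the seed
    std::array<link_type, N> sps;  // links to at most N spacepoints
};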
namespace traccc::cuda {

class seed_merging : public algorithm<std::pair<vecmem::unique_alloc_ptr<nseed<20>[]>, uint32_t>(const seed_collection_types::buffer&)> {
Why nseed<20>? I thought we were only going to merge 2 seeds together in this algorithm. At least for now. So why not nseed<4>? 😕

At the same time, the return type should really be vecmem::data::vector_buffer<nseed<4>> instead, since that type is literally what you wrote here. (An array with a known size.)
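For reference, the suggested return type would make the base class specialization look roughly like this; it is only a sketch, and whether the element type should really be nseed<4> is exactly what the reply below addresses.

class seed_merging
    : public algorithm<vecmem::data::vector_buffer<nseed<4>>(
          const seed_collection_types::buffer&)> { /* ... */ };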
Ah okay, sorry if I did not explain properly. The five-iteration loop you commented on below means that seeds can become at most five spacepoints longer than they are now. For example, seed ABC can be merged with BCD to form ABCD in step 1, then that can merge with BCDE in step 2, etc. So in principle the seed capacity needs to be 3 + n, where n is the number of Thunderdome rounds (merging steps).
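A tiny illustration of that arithmetic, with the round count as a parameter; the helper name is made up for this sketch.

#include <cstddef>

// Each merging round can extend a seed by at most one spacepoint, so starting
// from triplets the required capacity is 3 + rounds.
constexpr std::size_t seed_capacity(std::size_t thunderdome_rounds) {
    return 3 + thunderdome_rounds;
}

static_assert(seed_capacity(5) == 8, "five rounds never need more than 8 spacepoints");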
__global__ void convert_to_nseeds(seed_collection_types::view vf, nseed<N> * out, unsigned long long * out_n) {
    seed_collection_types::device vfd(vf);

    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < vf.size(); i += gridDim.x * blockDim.x) {
        out[i] = vfd[i];
        atomicAdd(out_n, 1ULL);
    }
}
I don't understand this kernel... 😕

What's up with the atomicAdd(...)? We know exactly how many (3-spacepoint) seeds we start with. As long as the kernel code doesn't have a bug in it, that's exactly how large out_n is supposed to become.

At the same time, you'll need to put some amount of explanation here about that for-loop. I assume you're going for optimal cache line usage with it. Though on first look I don't understand why this access would yield better caching than just processing a single seed in every thread. 😕
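For comparison, a counter-free version of the kernel might look like this; it is a sketch reusing the types from the snippet above, and it assumes the output array is pre-sized to the input size.

template <std::size_t N>
__global__ void convert_to_nseeds(seed_collection_types::view vf, nseed<N> * out) {
    seed_collection_types::device vfd(vf);

    // Grid-stride copy; the output count is simply the input size, so no atomic
    // bookkeeping is needed.
    for (unsigned int i = blockIdx.x * blockDim.x + threadIdx.x; i < vf.size();
         i += gridDim.x * blockDim.x) {
        out[i] = vfd[i];
    }
}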
You're absolutely correct that this doesn't make sense right now. What I want to do in the future is turn this kernel into a seed binning kernel that stores seeds into φ-bins to reduce the amount of work done. When I get that sorted I will need to do atomic space reservations. For now, with only one bin, this is indeed not really necessary.
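A very rough sketch of that future binning direction, with atomic slot reservation per φ-bin; the bin layout, the fixed per-bin capacity and the phi_bin() helper are hypothetical placeholders, not anything in this PR.

template <std::size_t N>
__global__ void bin_nseeds(const nseed<N> * in, unsigned int n_in, nseed<N> * out,
                           unsigned int * bin_sizes, unsigned int n_bins,
                           unsigned int bin_capacity) {
    for (unsigned int i = blockIdx.x * blockDim.x + threadIdx.x; i < n_in;
         i += gridDim.x * blockDim.x) {
        const unsigned int bin = phi_bin(in[i], n_bins);           // hypothetical helper
        const unsigned int slot = atomicAdd(&bin_sizes[bin], 1u);  // reserve a slot in this bin
        if (slot < bin_capacity) {
            out[bin * bin_capacity + slot] = in[i];
        }
    }
}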
seed_merging(const traccc::memory_resource& mr, stream& str);

output_type operator()(
    const seed_collection_types::buffer&) const override;
The input parameter needs to be seed_collection_types::const_view. The algorithm doesn't need to know whether it's a buffer or something else.
👍
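The signature change being agreed on here would look roughly like this; the exact const_view typedef name is an assumption taken from the comment above.

output_type operator()(
    const seed_collection_types::const_view& seeds) const override;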
vecmem::unique_alloc_ptr<nseed<20>[]>
    arr1 = vecmem::make_unique_alloc<nseed<20>[]>(m_mr.main, 1000000),
    arr2 = vecmem::make_unique_alloc<nseed<20>[]>(m_mr.main, 1000000);
What's with the 1M fixed numbers? 😕 Just ask the input view how many elements it has. I.e.

const unsigned int nseeds = m_copy.get_size(input_view);

Assuming that you switch to using a view, and that you ask the user to provide a vecmem::copy object to the algorithm.
So the thing with this is that it is possible for the number of seeds to increase, at least in the early steps of the algorithm. So I need a little bit of buffer space for that. But you're right, I could use two times the number of initial seeds, for example.
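Putting the two comments together, the sizing might end up looking something like this. It is only a sketch: the factor of two is the ballpark mentioned above, and m_copy is assumed to be a vecmem::copy member that the algorithm receives from the user.

// Query the input size instead of hard-coding 1M, then over-allocate to leave
// room for the seed count growing in the early merging steps.
const unsigned int n_seeds = m_copy.get_size(input_view);
const unsigned int capacity = 2 * n_seeds;

vecmem::unique_alloc_ptr<nseed<20>[]>
    arr1 = vecmem::make_unique_alloc<nseed<20>[]>(m_mr.main, capacity),
    arr2 = vecmem::make_unique_alloc<nseed<20>[]>(m_mr.main, capacity);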
siz1 = vecmem::make_unique_alloc<unsigned long long>(m_mr.main),
siz2 = vecmem::make_unique_alloc<unsigned long long>(m_mr.main);

kernels::convert_to_nseeds<20><<<2048, 256>>>(i, arr1.get(), siz1.get());
We should try to write new code with asynchronicity in mind. But this is just a note; I'm fine with this PR not caring about asynchronicity yet.
I'll make sure to add the appropriate synchronization points later. This algorithm is also a prime candidate for Dynamic Parallelism 2, which might reduce the number of synchronization points. 🙂
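A minimal sketch of the asynchronous style being discussed; how the raw cudaStream_t is obtained from traccc's stream wrapper is left out here as an assumption.

// Launch on the algorithm's stream rather than the default stream, and delay the
// synchronization until the host actually needs the seed count.
kernels::convert_to_nseeds<20><<<2048, 256, 0, stream>>>(i, arr1.get(), siz1.get());
CUDA_ERROR_CHECK(cudaGetLastError());

unsigned long long rc = 0;
CUDA_ERROR_CHECK(cudaMemcpyAsync(&rc, siz1.get(), sizeof(unsigned long long),
                                 cudaMemcpyDeviceToHost, stream));
CUDA_ERROR_CHECK(cudaStreamSynchronize(stream));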
std::cout << "Step 0 has " << rc << " seeds." << std::endl;

for (std::size_t i = 0; i < 5; ++i) {
I imagine 5 is a number that you came up with after some testing. That is fine, but it has to be made a configurable property of the algorithm...
Absolutely.
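One way the round count could be exposed, sketched here with hypothetical names; the actual configuration mechanism is up to the PR.

// Hypothetical configuration struct for the algorithm; only the round count is shown.
struct seed_merging_config {
    std::size_t merging_rounds = 5;
};

// The hard-coded loop bound would then become:
for (std::size_t i = 0; i < m_config.merging_rounds; ++i) {
    // ...
}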
for (std::size_t i = 0; i < 5; ++i) {
    CUDA_ERROR_CHECK(cudaMemset(siz2.get(), 0, sizeof(unsigned long long)));
    kernels::merge_nseeds<20, 20><<<rc, 256>>>(arr1.get(), siz1.get(), arr2.get(), siz2.get());
Curious. I thought the compiler would be able to deduce those template parameters automatically... 🤔
It probably can, actually! I added them explicitly for no real good reason, honestly.
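If deduction does work out, the explicit arguments could simply be dropped; whether it succeeds here depends on the kernel's signature, so treat this as an untested sketch.

kernels::merge_nseeds<<<rc, 256>>>(arr1.get(), siz1.get(), arr2.get(), siz2.get());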
std::swap(arr1, arr2);
std::swap(siz1, siz2);
I like this. 😄 You should've mentioned that you're taking cues from the discussion we had with Beomki about AdePT's memory handling. 😛
Not to toot my own horn but I've been on this train for a while. 👼
Thanks a lot for taking a look, glad you like it! It's still very early in development, but you've raised some valid points.
This draft pull request implements a Thunderdome-like fight to the death for seeds: "Two seeds enter, one seed leaves". The idea is to allow seeds to "eat" other seeds, growing bigger themselves and eliminating the eaten seeds from the seed list. This is designed to significantly reduce the combinatorics that the CKF has to deal with, thereby improving performance.
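To make the idea concrete, here is a toy, host-side illustration of the merging rule. This is only conceptual, not the CUDA kernel in this PR, and all names in it are made up.

#include <array>
#include <cstddef>
#include <optional>

using link_type = unsigned int;

template <std::size_t N>
struct toy_seed {
    std::size_t size = 0;
    std::array<link_type, N> sps{};
};

// Merge b onto the end of a if the last `overlap` spacepoints of a equal the
// first `overlap` spacepoints of b, and the result still fits into N slots.
template <std::size_t N>
std::optional<toy_seed<N>> merge(const toy_seed<N>& a, const toy_seed<N>& b,
                                 std::size_t overlap) {
    if (a.size < overlap || b.size < overlap || a.size + b.size - overlap > N) {
        return std::nullopt;
    }
    for (std::size_t i = 0; i < overlap; ++i) {
        if (a.sps[a.size - overlap + i] != b.sps[i]) {
            return std::nullopt;
        }
    }
    toy_seed<N> out = a;
    for (std::size_t i = overlap; i < b.size; ++i) {
        out.sps[out.size++] = b.sps[i];
    }
    return out;
}

With overlap = 2, merging {A,B,C} with {B,C,D} yields {A,B,C,D}, matching the ABC + BCD example discussed in the review thread above.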
To give you an idea of what this does, here is the output of the seeding example without Thunderdoming, on ten ⟨µ⟩ = 300 events:
A relatively conservative set of Thunderdome rules reduces the number of seeds for the CKF to deal with significantly:
If we turn up the Mad Max dial, we can reduce the number of seeds far more:
Hopefully this reduction in the number of seeds will reduce CKF runtime. Of course, we will need to tweak the rules to limit the loss in physics performance, but I think this will be possible.