
use unsigned int instead of size_t for seed finding #82

Merged · 3 commits into acts-project:main · Sep 6, 2021

Conversation

@beomki-yeo (Contributor)

size_t in CUDA device code is pretty expensive, and I don't think it is necessary to use it here: the number of spacepoints, doublets, and triplets, and the bin sizes, never exceed the maximum value of unsigned int.

There is a meaningful speedup when we use unsigned int instead.

@krasznaa (Member) commented Sep 2, 2021

I didn't know that size_t would affect performance so much even when you're not using atomic operations on it. 😕 Because yes, unsigned int is definitely a much safer type if you want to use any CUDA-provided functions.

There is a bit of "controversy" around this in vecmem as well (acts-project/vecmem#96), so in general I'm on your side: unsigned int is generally a good choice in device code. I just didn't know that in these cases there would be an actual performance difference. But I'll trust you on that one...
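
To make the point about CUDA-provided functions concrete, a minimal sketch (not code from this PR): atomicAdd has overloads for unsigned int and unsigned long long, but none for unsigned long, which is what std::size_t usually aliases on 64-bit Linux, so size_t counters need a cast.

```cuda
#include <cstddef>

__global__ void bump_u32(unsigned int* counter) {
    atomicAdd(counter, 1u);  // matches the unsigned int overload directly
}

__global__ void bump_size_t(std::size_t* counter) {
    // No unsigned long overload of atomicAdd exists, so a cast is needed.
    static_assert(sizeof(std::size_t) == sizeof(unsigned long long), "");
    atomicAdd(reinterpret_cast<unsigned long long*>(counter), 1ull);
}
```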

@stephenswat (Member)

CUDA devices don't have any 64-bit integer hardware as far as I know, so they emulate it using their 32-bit integer silicon. The amount of time that takes is very dependent on what operations you are trying to run. Addition and subtraction are trivial (just perform two 32-bit operations and carry), multiplication is a little more expensive, stuff like division can be quite pricey.

So how much slower 64-bit integers are compared to 32-bit ones depends on your workload. I would normally say that for the usual floating-point-heavy workloads it doesn't matter at all, but for irregular workloads like seed finding it could feasibly make a difference.
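
One way to see this for yourself (an illustrative sketch, not code from this PR): compile two otherwise identical kernels and diff the output of `nvcc --ptx`; the 64-bit version emits extra instruction pairs for the emulated index arithmetic.

```cuda
#include <cstddef>

// Same body, different index type.
__global__ void scale_u32(float* out, const float* in, unsigned int n) {
    unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = 2.f * in[i];
}

__global__ void scale_u64(float* out, const float* in, std::size_t n) {
    std::size_t i =
        static_cast<std::size_t>(blockIdx.x) * blockDim.x + threadIdx.x;
    if (i < n) out[i] = 2.f * in[i];
}
```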

@beomki-yeo (Contributor, Author)

I think size_t is slower than unsigned int when it comes to global memory writes, due to its larger size.
As you can see, the changes I made in this PR are only for the indices, which are written into global memory during the seed finding. It is also true that global memory write time is significant in CUDA seed finding. I don't think the changes affect the computation meaningfully.

@stephenswat (Member)

Are you sure? Every change I see in this PR looks like it would affect the register file, not the global memory. But of course I am not intimately familiar with your seeding implementation. So if you say this is faster then I'll take your word for it. 🙂

@beomki-yeo (Contributor, Author)

sp_location, which holds the indices and is a member of doublet and triplet, is written into global memory; that's why.
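
For illustration, a hypothetical sketch of the shapes involved (the member names here are made up; the real definitions live in the traccc EDM headers):

```cuda
#include <cstddef>

struct sp_location {
    unsigned int bin_idx;  // was std::size_t: 8 bytes, now 4
    unsigned int sp_idx;   // was std::size_t: 8 bytes, now 4
};

struct doublet {
    sp_location sp1;  // the two spacepoints forming the doublet
    sp_location sp2;
};

// 16 bytes per doublet instead of 32: every doublet the seed finding
// writes out now moves half as many bytes through global memory.
static_assert(sizeof(doublet) == 16, "four 32-bit indices");
```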

@cgleggett

Interesting question: what determines the size of size_t on a CUDA device? What happens if the host is 32-bit? Is it the size of size_t on the host that sets it, or is it something defined by the architecture of the device itself?

@krasznaa (Member) commented Sep 2, 2021

> Interesting question: what determines the size of size_t on a CUDA device? What happens if the host is 32-bit? Is it the size of size_t on the host that sets it, or is it something defined by the architecture of the device itself?

It is nvcc that needs to decide what std::size_t is. (Though it does need to be compatible with the host compiler along the way...) I think CUDA stopped supporting 32-bit hosts a while ago. (Didn't it...?) All in all, I'm pretty sure that std::size_t is 64 bits wide in any reasonable situation these days.
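
If one wanted to make that assumption explicit (a sketch, not something this PR adds), a compile-time check is evaluated by both the host and device passes of nvcc:

```cuda
#include <cstddef>
#include <cstdint>

static_assert(sizeof(std::size_t) == sizeof(std::uint64_t),
              "this code assumes a 64-bit std::size_t throughout");
```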

@stephenswat (Member)

I'm pretty sure that nvcc still supports 32-bit architectures (both on the host side and the device side). As far as I know, both compilers will import their own implementation of the standard library headers, and then decide on the size depending on whether the compilation is 32-bit or 64-bit. I strongly suspect that this is decided by the host side, although that allows you to come up with some very esoteric scenarios where your host is 32-bit and your device is 64-bit. Since nvcc is a compiler driver and not a compiler itself, it may in that case decide to use 64-bit integers for its host code and its device code, while the actual host compiler might default to 32 bits.

In reality 64-bit computing is so ubiquitous now that I don't think this is ever a problem, though.
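
An easy empirical check (illustrative; assumes any recent 64-bit CUDA toolchain) is to print the width as each compilation side sees it:

```cuda
#include <cuda_runtime.h>
#include <cstddef>
#include <cstdio>

__global__ void device_width() {
    printf("device sizeof(std::size_t) = %u\n",
           static_cast<unsigned>(sizeof(std::size_t)));
}

int main() {
    std::printf("host   sizeof(std::size_t) = %u\n",
                static_cast<unsigned>(sizeof(std::size_t)));
    device_width<<<1, 1>>>();  // expect 8 on both sides with a 64-bit build
    cudaDeviceSynchronize();
    return 0;
}
```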

@krasznaa (Member) commented Sep 2, 2021

I'm 99.99% sure that 32-bit OS support from CUDA was dropped a long-long time ago...

[screenshot attached]

I'd have to double-check, but probably around CUDA 6 or 7...

@stephenswat (Member) left a comment

As interesting as this discussion has been, I think we're almost good to go ahead. I have one quick question about one of the container changes. Once that's been resolved I am happy to merge.

Review thread on core/include/edm/container.hpp (resolved)
@stephenswat added the "refactor" label on Sep 6, 2021
@stephenswat merged commit 27d66b1 into acts-project:main on Sep 6, 2021