[Proof of Concept] Multi-GPU prototype (single node) #89
This is a quick and dirty prototype to run code on multiple GPUs of a single node.
I just made everything use unified memory and split the `ndrange` by the number of GPUs. The particles are ordered, so this should partition the domain into contiguous blocks. Each GPU works on one of these blocks, with only limited communication between them. Nvidia's unified memory system should take care of optimizing where the memory lives.
As you can see here, it is indeed using all 4 GPUs.
Here are some results using the WCSPH benchmark with 65M particles.
As you can see, there is no difference between device and unified memory on a single GPU.
Device memory with 4 GPUs is unsurprisingly very slow. Unified memory with 4 GPUs is slightly faster than a single GPU, but the difference is a bit underwhelming.
On 2 GPUs, things get interesting. Most of the time, the runtime is very similar to 1 GPU, but in about 1 out of 20 runs it's almost twice as fast:
So it seems that the two GPUs only sometimes work in parallel, while most of the time only one is active.
I have no idea why this is happening.
CC @vchuravy