[Proof of Concept] Multi-GPU prototype (single node) #89

Draft

wants to merge 3 commits into base: ef/localmem-kernel
Conversation

efaulhaber
Member

This is a quick and dirty prototype to run code on multiple GPUs of a single node.
I just made everything use unified memory and split the ndrange by the number of GPUs.
The particles are ordered, so this should partition the domain into blocks. Each GPU works on one of these blocks, with only limited communication between them. Nvidia should take care of optimizing where the memory lives.
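
A minimal sketch of this splitting strategy, using plain CUDA.jl instead of the actual TrixiParticles.jl kernels (the kernel scale_range! and all names here are illustrative stand-ins, not code from this PR):

using CUDA

# Toy stand-in for a particle kernel: each thread updates one entry
# in the index range [lo, hi].
function scale_range!(u, lo, hi, factor)
    i = lo + (blockIdx().x - 1) * blockDim().x + threadIdx().x - 1
    if i <= hi
        @inbounds u[i] *= factor
    end
    return nothing
end

n = 1_000_000
# One unified-memory allocation that all GPUs can access; the driver
# migrates pages to whichever device touches them.
u = cu(rand(Float32, n); unified = true)

gpus = collect(CUDA.devices())
chunk = cld(n, length(gpus))

# One task per GPU: each task binds its device and launches the kernel
# on its contiguous chunk of the ndrange. Since the particles are
# ordered, each chunk corresponds to a block of the domain.
@sync for (d, dev) in enumerate(gpus)
    @async begin
        CUDA.device!(dev)
        lo = (d - 1) * chunk + 1
        hi = min(d * chunk, n)
        @cuda threads=256 blocks=cld(hi - lo + 1, 256) scale_range!(u, lo, hi, 2.0f0)
    end
end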

As you can see in the nvidia-smi output below, the prototype is indeed using all 4 GPUs.

+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.129.03             Driver Version: 535.129.03   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA H100                    On  | 00000000:26:00.0 Off |                    0 |
| N/A   44C    P0             244W / 700W |  24350MiB / 95830MiB |     44%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   1  NVIDIA H100                    On  | 00000000:46:00.0 Off |                    0 |
| N/A   40C    P0             173W / 700W |   2660MiB / 95830MiB |      1%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   2  NVIDIA H100                    On  | 00000000:A6:00.0 Off |                    0 |
| N/A   45C    P0             247W / 700W |   2658MiB / 95830MiB |     43%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   3  NVIDIA H100                    On  | 00000000:C6:00.0 Off |                    0 |
| N/A   47C    P0             222W / 700W |   2654MiB / 95830MiB |     25%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+

Here are some results using the WCSPH benchmark with 65M particles.

|      | 1 GPU device memory (main) | 1 GPU unified memory | 4 GPUs device memory | 4 GPUs unified memory | 2 GPUs unified memory   |
|------|---------------------------:|---------------------:|---------------------:|----------------------:|------------------------:|
| FP64 | 827.574 ms                 | 827.062 ms           | 6.665 s              | 642.866 ms            | 417.144 ms ~ 836.871 ms |
| FP32 | 421.291 ms                 | 420.933 ms           | 4.191 s              | 328.196 ms            | 426.163 ms              |

As you can see, there is no difference between device and unified memory on a single GPU.
Device memory with 4 GPUs is unsurprisingly very slow, presumably because every kernel constantly accesses arrays that live on another GPU through peer-to-peer transfers. Unified memory with 4 GPUs is faster than a single GPU, but a speedup of only about 1.3x on 4 GPUs is a bit underwhelming.

On 2 GPUs, things get interesting. Most of the time, the runtime is very similar to 1 GPU, but in about 1 out of 20 runs, it's almost twice as fast:

BenchmarkTools.Trial: 30 samples with 1 evaluation.
 Range (min … max):  416.990 ms … 851.266 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     838.658 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   811.535 ms ± 107.264 ms  ┊ GC (mean ± σ):  0.00% ± 0.00%

                                                             █   
  ▃▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▃█▄ ▁
  417 ms           Histogram: frequency by time          851 ms <

Memory estimate: 68.70 KiB, allocs estimate: 543.

So it seems that the two GPUs only sometimes work in parallel, while most of the time only one of them is active.
I have no idea why this is happening.
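
One way to probe this (a hedged sketch, not code from this PR) is to time each GPU's task separately with an explicit per-device synchronize, so the timer measures the kernels rather than just the launches. Here, work! is a hypothetical placeholder for one GPU's share of the WCSPH step:

using CUDA

# Hedged diagnostic: time each GPU's task individually. If the two GPUs
# really work in parallel, both tasks should report similar wall times;
# if one task serializes behind the other, its time will be roughly 2x.
# `work!` is a hypothetical stand-in for this GPU's share of the step.
function timed_multi_gpu_step!(work!)
    @sync for dev in collect(CUDA.devices())
        @async begin
            CUDA.device!(dev)
            t = @elapsed begin
                work!(dev)
                CUDA.synchronize()  # wait for this device's kernels, not just the launch
            end
            @info "device finished" dev time = t
        end
    end
end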

CC @vchuravy

@efaulhaber efaulhaber added the gpu label Dec 23, 2024
@efaulhaber efaulhaber self-assigned this Dec 24, 2024