[Proof of Concept] Multi-GPU prototype (single node) #89

Draft

wants to merge 3 commits into base: ef/localmem-kernel
Conversation

efaulhaber
Member

This is a quick and dirty prototype to run code on multiple GPUs of a single node.
I just made everything use unified memory and split the ndrange by the number of GPUs.
The particles are ordered, so this should partition the domain into blocks. Each GPU works on one of these blocks, with only limited communication between them. Nvidia should take care of optimizing where the memory lives.
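
A minimal sketch of this splitting strategy, using plain CUDA.jl instead of the actual TrixiParticles.jl kernels (the kernel scale_range! and all names here are illustrative stand-ins, not code from this PR):

using CUDA

# Toy stand-in for a particle kernel: each thread updates one entry
# in the index range [lo, hi].
function scale_range!(u, lo, hi, factor)
    i = lo + (blockIdx().x - 1) * blockDim().x + threadIdx().x - 1
    if i <= hi
        @inbounds u[i] *= factor
    end
    return nothing
end

n = 1_000_000
# One unified-memory allocation that all GPUs can access; the driver
# migrates pages to whichever device touches them.
u = cu(rand(Float32, n); unified = true)

gpus = collect(CUDA.devices())
chunk = cld(n, length(gpus))

# One task per GPU: each task binds its device and launches the kernel
# on its contiguous chunk of the ndrange. Since the particles are
# ordered, each chunk corresponds to a block of the domain.
@sync for (d, dev) in enumerate(gpus)
    @async begin
        CUDA.device!(dev)
        lo = (d - 1) * chunk + 1
        hi = min(d * chunk, n)
        @cuda threads=256 blocks=cld(hi - lo + 1, 256) scale_range!(u, lo, hi, 2.0f0)
    end
end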

As you can see in the nvidia-smi output below, the prototype is indeed using all 4 GPUs.

+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.129.03             Driver Version: 535.129.03   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA H100                    On  | 00000000:26:00.0 Off |                    0 |
| N/A   44C    P0             244W / 700W |  24350MiB / 95830MiB |     44%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   1  NVIDIA H100                    On  | 00000000:46:00.0 Off |                    0 |
| N/A   40C    P0             173W / 700W |   2660MiB / 95830MiB |      1%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   2  NVIDIA H100                    On  | 00000000:A6:00.0 Off |                    0 |
| N/A   45C    P0             247W / 700W |   2658MiB / 95830MiB |     43%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   3  NVIDIA H100                    On  | 00000000:C6:00.0 Off |                    0 |
| N/A   47C    P0             222W / 700W |   2654MiB / 95830MiB |     25%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+

Here are some results using the WCSPH benchmark with 65M particles.

|      | 1 GPU device memory (main) | 1 GPU unified memory | 4 GPUs device memory | 4 GPUs unified memory | 2 GPUs unified memory   |
|------|---------------------------:|---------------------:|---------------------:|----------------------:|------------------------:|
| FP64 | 827.574 ms                 | 827.062 ms           | 6.665 s              | 642.866 ms            | 417.144 ms ~ 836.871 ms |
| FP32 | 421.291 ms                 | 420.933 ms           | 4.191 s              | 328.196 ms            | 426.163 ms              |

As you can see, there is no difference between device and unified memory on a single GPU.
Device memory with 4 GPUs is unsurprisingly very slow, presumably because every kernel constantly accesses arrays that live on another GPU through peer-to-peer transfers. Unified memory with 4 GPUs is faster than a single GPU, but a speedup of only about 1.3x on 4 GPUs is a bit underwhelming.

On 2 GPUs, things get interesting. Most of the time, the runtime is very similar to 1 GPU, but in about 1 out of 20 runs, it's almost twice as fast:

BenchmarkTools.Trial: 30 samples with 1 evaluation.
 Range (min … max):  416.990 ms … 851.266 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     838.658 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   811.535 ms ± 107.264 ms  ┊ GC (mean ± σ):  0.00% ± 0.00%

                                                             █   
  ▃▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▃█▄ ▁
  417 ms           Histogram: frequency by time          851 ms <

Memory estimate: 68.70 KiB, allocs estimate: 543.

So it seems that the two GPUs only sometimes work in parallel, while most of the time only one of them is active.
I have no idea why this is happening.
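
One way to probe this (a hedged sketch, not code from this PR) is to time each GPU's task separately with an explicit per-device synchronize, so the timer measures the kernels rather than just the launches. Here, work! is a hypothetical placeholder for one GPU's share of the WCSPH step:

using CUDA

# Hedged diagnostic: time each GPU's task individually. If the two GPUs
# really work in parallel, both tasks should report similar wall times;
# if one task serializes behind the other, its time will be roughly 2x.
# `work!` is a hypothetical stand-in for this GPU's share of the step.
function timed_multi_gpu_step!(work!)
    @sync for dev in collect(CUDA.devices())
        @async begin
            CUDA.device!(dev)
            t = @elapsed begin
                work!(dev)
                CUDA.synchronize()  # wait for this device's kernels, not just the launch
            end
            @info "device finished" dev time = t
        end
    end
end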

CC @vchuravy

@efaulhaber efaulhaber added the gpu label Dec 23, 2024
@efaulhaber efaulhaber self-assigned this Dec 24, 2024