Enable H100 in CMake (with A100 parameters) #804
Conversation
LGTM aka best effort.
(Let's pretend B100/200 has a limited audience ;-)
Sorry to repeat myself (see #656), but with these changes we are ruining the entire idea of autotuning... Then @dev-zero and @mkrack did it for A100, with about the same number of kernels. @mtaillefumier did it for Mi100, but with far fewer kernels (412), and I repeated the same kernel optimization for Mi250. Now, unless something prevents us from running the autotuning on H100 (or any new generation), I would not consider the option of re-using previous kernels. It is fine if you do it in CP2K (just like @hfp did, see cp2k/cp2k#3368), but this is not an option for DBCSR. My current idea is to introduce a "General" kernel and show a (*) in the output for the kernels that are not using it. Still, people should run the autotuning and contribute kernels to get the "best" performance. Of course, there is also the possibility of dropping the autotuning entirely and keeping the A100 kernels for any new GPU generation (including the default kernel). Do we have any measurements showing that the A100 kernels are good enough for H100?
The main issue with tuning for H100 is the following. Running the auto-tuning framework based on the A100 setup is trivial: it works, and we have done so as part of testing CP2K on Alps. The problem arises with the ML prediction framework, where I was not able to finish the procedure. As a result, we either have a handful of H100-tuned kernels, or a much more complete set of A100 (tuned + predicted) kernels. I like the latter option better.
The prediction code is rapidly aging, and this issue has been hanging for quite some time.
You don't need the ML prediction; that is a "fast" solution. Personally, I have never tried it. Note that @mkrack did not use it for the A100 kernels.
I totally agree with this, and that's why I marked it as a
As mentioned at PASC24, we tried to run the auto-tuning for GH200 at CSCS, but it is not clear to us who is actually responsible for contributing the kernels and checking that everything is in order. I was under the impression that you were about to get access to H100. If I recall our discussion at PASC24 correctly, you mentioned the following:
Therefore, I was under the impression that the whole auto-tuning pipeline needs attention, which is why I opened this PR as a temporary workaround (it might still be beneficial to target the proper architecture in the meantime).
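For illustration, here is a minimal CMake sketch of what such a workaround amounts to; the option and file names (`WITH_GPU`, `LIBSMM_ACC_PARAM_FILE`, `parameters_A100.json`) are placeholders chosen for this sketch, not necessarily the actual DBCSR build variables:

```cmake
# Sketch only: accept H100 as a GPU target, but fall back to the A100-tuned
# parameter set while still compiling for the Hopper architecture.
if (WITH_GPU STREQUAL "H100")
  set(LIBSMM_ACC_PARAM_FILE "parameters_A100.json")  # reuse A100 kernels for now
  set(CMAKE_CUDA_ARCHITECTURES 90)                   # sm_90 (Hopper)
elseif (WITH_GPU STREQUAL "A100")
  set(LIBSMM_ACC_PARAM_FILE "parameters_A100.json")
  set(CMAKE_CUDA_ARCHITECTURES 80)                   # sm_80 (Ampere)
endif()
```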
This is more in line with what we discussed, if I recall correctly, which is why I opened this PR. However, at the moment we don't have numbers directly comparing the two parameter sets. BTW, see lines 85 to 88 in 0f47720.
Let's turn this PR into an issue; I'm open to discussion. For the record, what I said at PASC is that the autotuning is old and needs a refresh (2018 was the last major refresh, by Shoshana, plus some minor updates by me and @dev-zero), but this is what we have and we are supposed to use it (or drop it). I didn't propose any workaround. Regarding lines 85 to 88 in 0f47720:
The entire machinery relies on users to provide optimized kernels, as described in the documentation. I likely need to add "user" to the documentation to make that clear; good point.
That's not correct: I used the ML prediction to create the 71048 predicted A100 kernels in addition to the 3043 autotuned ones. The scripts required some fixes at that time (summer 2023), but worked on JURECA, as they did for the P100. From my experience, I can offer the following comments:
Thanks @mkrack, this is very nice feedback, and thanks for the clarification (it turns out I did a grep for "predicted" in the wrong file! You are definitely right). OK, then I think we are coming to the conclusion that we can drop the ML prediction part and likely autotuning altogether (we will keep it for adding new kernels). I think @RMeli and @abussy came to the same conclusion. The strategy will then be to rename the files/parameters to "AMD" and "NVIDIA" and drop the specific GPU version.
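A hedged sketch of how such vendor-level parameter files could be wired up in CMake; `parameters_NVIDIA.json`, `parameters_AMD.json`, and the variable names are hypothetical, introduced only to illustrate the proposed renaming:

```cmake
# Sketch: one parameter file per vendor, with only the target architecture
# remaining GPU-specific (all names below are placeholders).
if (USE_ACCEL STREQUAL "cuda")
  set(LIBSMM_ACC_PARAM_FILE "parameters_NVIDIA.json")
  set(CMAKE_CUDA_ARCHITECTURES ${WITH_GPU_ARCH})     # e.g. 80 for A100, 90 for H100
elseif (USE_ACCEL STREQUAL "hip")
  set(LIBSMM_ACC_PARAM_FILE "parameters_AMD.json")
  set(CMAKE_HIP_ARCHITECTURES ${WITH_GPU_ARCH})      # e.g. gfx90a for Mi250
endif()
```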
Yes, apologies for the confusion. The workaround was my interpretation, also based on what is done for CP2K and on what I saw in the repository here (out of context).
Thank you everyone for the input. Let's move the discussion to #805.
Tested with Spack.