
Enable H100 in CMake (with A100 parameters) #804

Closed
wants to merge 2 commits

Conversation

@RMeli
Member

RMeli commented Jun 14, 2024

Tested with Spack.


dbcsr@develop +cuda cuda_arch=90
-- GPU target architecture: H100
-- Kernel parameters: A100
-- GPU architecture number: 90
-- GPU profiling enabled: OFF
-- GPU aware MPI enabled: OFF

dbcsr@develop +cuda cuda_arch=80
-- GPU target architecture: A100
-- Kernel parameters: A100
-- GPU architecture number: 80
-- GPU profiling enabled: OFF
-- GPU aware MPI enabled: OFF
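
For context, a minimal sketch of the CMake-side fallback this PR introduces (the variable names below are hypothetical, not DBCSR's actual CMakeLists.txt): the H100 target reuses the A100 kernel parameters until tuned ones exist, mirroring the src/acc/cuda/Makefile snippet quoted further down in this thread.

# Hypothetical sketch of the H100 -> A100 parameter fallback (not the actual DBCSR CMake)
if (WITH_GPU STREQUAL "H100")
  # TODO: switch to H100 parameters once autotuned kernels are contributed
  set(WITH_GPU_PARAMS "A100")
  set(ACC_ARCH_NUMBER 90) # compute capability 9.0 (sm_90)
else ()
  # other targets keep their own parameter set; ACC_ARCH_NUMBER set per target elsewhere
  set(WITH_GPU_PARAMS "${WITH_GPU}")
endif ()
message(STATUS "GPU target architecture: ${WITH_GPU}")
message(STATUS "Kernel parameters: ${WITH_GPU_PARAMS}")
message(STATUS "GPU architecture number: ${ACC_ARCH_NUMBER}")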

@hfp
Member

hfp commented Jun 14, 2024

LGTM aka best effort.

@hfp
Member

hfp commented Jun 14, 2024

(Let's pretend B100/200 has a limited audience ;-)

@alazzaro
Member

Sorry to repeat myself (see #656), but with these changes we are ruining the entire idea of autotuning...
The last "extensive" autotuning was done for P100 (Daint) and V100. That amounts to about 74100 kernels.

Then @dev-zero and @mkrack did the same for A100, with about the same number of kernels.

@mtaillefumier did it for Mi100, but with far fewer kernels (412), and I repeated the same kernel optimization for Mi250.

Now, unless there is something that prevents us from running the autotuning on H100 (or any new generation), I would not consider the option of re-using previous kernels. It is fine if you are doing it in CP2K (just like @hfp did, see cp2k/cp2k#3368), but this is not an option for DBCSR.

My current idea is to introduce a "General" kernel and to mark with a (*) in the output the kernels that fall back to it. Still, people should use autotuning and contribute kernels to get the "best" performance.
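
For illustration only, a hypothetical CMake sketch of how a fallback to such a general parameter set could be flagged at configure time (the actual proposal concerns marking kernels in the library's own output, and all names below are made up):

# Hypothetical sketch: mark a fallback to generic kernel parameters with (*)
set(TUNED_GPUS P100 V100 A100 Mi100 Mi250)
if (WITH_GPU IN_LIST TUNED_GPUS)
  message(STATUS "Kernel parameters: ${WITH_GPU}")
else ()
  # no autotuned parameter set available for this target
  message(STATUS "Kernel parameters: General (*)")
  message(STATUS "(*) no autotuned parameters for ${WITH_GPU}; please consider contributing them")
endif ()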

Of course, there is also the possibility of dropping autotuning entirely and keeping the A100 kernels for any new GPU generation (including the default kernel). Do we have any measurements showing that the A100 kernels are good enough for H100?

@abussy
Contributor

abussy commented Jun 14, 2024

The main issue with tuning for H100 is the following: running the auto-tuning framework based on the A100 is trivial; it works, and we have done so as part of testing CP2K on Alps. The problem arises with the ML prediction framework, where I was not able to finish the procedure.

As a result, we either have a handful of H100 tuned kernels, or a much more complete set of A100 (tuned + predicted) kernels. I like the latter option better.

@hfp
Member

hfp commented Jun 14, 2024

The prediction code is rapidly aging, and this issue has been hanging over us for quite some time.

@alazzaro
Member

> The main issue with tuning for H100 is the following: running the auto-tuning framework based on the A100 is trivial; it works, and we have done so as part of testing CP2K on Alps. The problem arises with the ML prediction framework, where I was not able to finish the procedure.
>
> As a result, we either have a handful of H100 tuned kernels, or a much more complete set of A100 (tuned + predicted) kernels. I like the latter option better.

You don't need the ML prediction; that is a "fast" solution. Personally, I have never tried it. Note that @mkrack did not use it for the A100 kernels.

@RMeli
Member Author

RMeli commented Jun 14, 2024

> Sorry to repeat myself (see #656), but with these changes we are ruining the entire idea of autotuning...

I totally agree with this, and that's why I marked it as a TODO. However, I must have misunderstood our discussion at PASC24.

> Now, unless there is something that prevents us from running the autotuning on H100 (or any new generation), I would not consider the option of re-using previous kernels.

As mentioned at PASC24, we tried to run the auto-tuning for GH200 at CSCS, but it is not clear to us who is actually responsible for contributing the kernels and checking that everything is in order. I was under the impression that you were about to get access to H100.

If I recall our discussion at PASC24 correctly, you mentioned the following:

  • When autotuning fails for one reason or another, the kernel is considered slow, and there is no check
  • Predictive tuning does not work anymore

Therefore, I was under the impression that the whole auto-tuning pipeline needs attention, and that's why I opened this PR as a temporary workaround (it might still be beneficial to target the proper architecture in the meantime).

> Of course, there is also the possibility of dropping autotuning entirely and keeping the A100 kernels for any new GPU generation (including the default kernel). Do we have any measurements showing that the A100 kernels are good enough for H100?

This is more in line with what we discussed, if I recall correctly, which is why I opened this PR. However, at the moment we don't have numbers directly comparing the two parameter sets.


BTW, src/acc/cuda/Makefile contains the following:

else ifeq ($(WITH_GPU),H100)
# TODO: update for H100 tuned parameters
override WITH_GPU := A100
ARCH_NUMBER = 90

RMeli closed this Jun 14, 2024
@alazzaro
Member

Let's turn this PR into an issue; I'm open to discussion.

For the record, what I said at PASC is that the autotuning is old and needs a refresh (2018 was the last major refresh, by Shoshana, plus some minor updates by me and @dev-zero), but this is what we have and we are supposed to use it (or drop it). I didn't propose any workaround.
The file you are mentioning (the src/acc/cuda/Makefile snippet quoted above) is used for internal testing by @hfp; it is not related to the library itself.

The entire machinery relies on users to provide optimized kernels, as described in the documentation. I likely need to add "user" to the documentation to make this clear; good point.

@mkrack
Member

mkrack commented Jun 14, 2024

> You don't need the ML prediction; that is a "fast" solution. Personally, I have never tried it. Note that @mkrack did not use it for the A100 kernels.

That's not correct: I used the ML prediction to create the 71048 predicted A100 kernels in addition to the 3043 autotuned ones. The scripts required some fixes at that time (summer 2023), but they worked on JURECA as they did for the P100. From my experience, I can offer the following comments:

  • The performance gain with the tuned A100 kernels is minor compared to using the P100 kernels, just as the tuned P100 kernels work reasonably well for V100.
  • The tuning consumes a significant amount of computational (and human) resources.
  • The ML requires a sufficient number of autotuned kernels to make sense.
  • It is better to use the full set of autotuned and predicted kernels from the previous GPU generation than to use only a relatively small set of autotuned kernels.
  • Once enough autotuned kernels have been accumulated, an ML step could be attempted.

@alazzaro
Member

Thanks @mkrack, this is very nice feedback, and thanks for the clarification (it turns out I grepped for "predicted" in the wrong file! You are definitely right).

OK, then I think we are coming to the conclusion that we can drop the ML prediction part and likely autotuning altogether (we will keep it for adding new kernels). I think @RMeli and @abussy reached the same conclusion.

Then, the strategy will be to rename the files/parameters to "AMD" and "NVIDIA" and drop the specific GPU version.
As I said, I will add a generic kernel which will be good enough for all the cases we don't cover with autotuning.

@RMeli
Member Author

RMeli commented Jun 14, 2024

> I didn't propose any workaround.

Yes, apologies for the confusion. The workaround was my interpretation, based also on what is done for CP2K and on what I saw in this repository (out of context).

@RMeli
Member Author

RMeli commented Jun 14, 2024

Thank you everyone for the input. Let's move the discussion to #805.
