DBCSR performs very poorly on GH200, when there are large blocks #795
Comments
Hi Augustin, I am interested to see whether the OpenCL-based acceleration in DBCSR can be of use. For some access/dev-time on Alps, you could help me get this permitted (perhaps via private message/email). In the past (on Daint), OpenCL was not well supported because the GPU mode was set to "exclusive".
Same experience. Bumping the number of MMs per stack can help a bit, but it can also induce imbalance due to unfavorable remainder work.
Can you elaborate on how to achieve this (other than for work going through TAS/DBM directly)? Perhaps this could become a more regular choice rather than requiring code changes.
This is entirely possible with contemporary higher-end CPUs. My experience is that, if the system contains multiple GPUs anyway, one can harvest them "for free" and get beyond a contemporary high-end CPU in the same system. If the CPU was deliberately chosen to be weaker (due to the emphasis on the GPU), the picture can turn in favor of the GPU(s). This is of course more pronounced if the workload has a high portion of DBT/DBM; otherwise it is an uphill battle against Amdahl's law.
ACK. You can at least compile the A100 kernels with the compute capability corresponding to the H100. In any case, I would not expect a big impact. Also, consider contributing your tuned parameters.
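A minimal sketch of what that could look like with DBCSR's CMake build, assuming the `USE_ACCEL`/`WITH_GPU` options and that the target architecture can be overridden via the standard `CMAKE_CUDA_ARCHITECTURES` variable (sm_90 corresponds to H100); whether a given DBCSR version instead derives the architecture from `WITH_GPU` may vary:

```sh
# Sketch: reuse the tuned A100 parameter set, but generate device code
# for Hopper (compute capability 9.0). Option names are assumptions.
cmake -S . -B build \
  -DUSE_ACCEL=cuda \
  -DWITH_GPU=A100 \
  -DCMAKE_CUDA_ARCHITECTURES=90
cmake --build build -j
```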
That would be welcome.
I had this for CP2K/DBM recently as well: one of the MPI-enabled functions appeared high in the profile (even intra-node) in one of our labs but not in the other (same kind of CPU). I attributed this to Fortran's ALLOCATE being much slower, due either to the compiler or, more likely, to the OS flavor. One resolution was to LD_PRELOAD an alternative, more scalable malloc implementation, e.g., TBB's malloc proxy. Btw, I have not found time to fix this particular issue at the code level, let alone upstream a change (my plan was to take a look at OpenMP's memory allocation, as this is an established programming model in CP2K).
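For reference, a minimal sketch of the LD_PRELOAD approach; the install path and the launch line are assumptions and depend on how oneTBB and CP2K are set up on your system:

```sh
# Preload TBB's malloc proxy so that malloc/free (and thus Fortran ALLOCATE)
# go through the scalable allocator instead of the system malloc.
export TBBROOT=/opt/intel/oneapi/tbb/latest                      # assumed location
export LD_PRELOAD=${TBBROOT}/lib/intel64/gcc4.8/libtbbmalloc_proxy.so.2
srun cp2k.psmp H2O-32-RPA-TZ.inp
```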
Hi Hans, thanks a lot for all these insights! I tried building DBCSR with OpenCL, but it seems that CUDA does not provide OpenCL on Alps.

I have a branch where I experimented with offloading DBCSR calls to DBM (see cp_dbcsr_multiplication.F). As things stand, it is not ideal, because each …

I've tuned the H100 kernels based on the A100 options. However, the A100 parameters are still far more complete, as they also include predicted kernels. I have not been able to run the prediction framework, I think because of filesystem limitations. So at this point, the A100 kernels are still better.

I'll see if I can try your malloc solution, that's an interesting one!
Update: @abussy shared (in private) the CP2K logs with me and I gave them a quick look. BTW, @hfp, any libxsmm for ARM to be included in CP2K?
I will work on it. I have a few PRs pending for LIBXSMM; ideally, this should happen asap.
That's super interesting! I didn't think an incremental migration would be feasible. I'll look into this.
Sub-matrix support should be fairly easy to add, and complex matrices are only used by CP2K in ~3 places, which can be refactored.
I'll continue the discussion I had with @alazzaro here, so that everybody who is interested can follow. I was asked to test running with more MPI ranks and fewer OpenMP threads per rank.

Going from 1 thread to 8 makes … I am not sure that running with many MPI ranks and a small number of OMP threads is always a good solution, though. There are 72 cores per GPU on GH200, and oversubscribing the GPU too much can be detrimental too. Also, if we go to multiple nodes, we might run into scaling issues due to the large number of ranks.

@hfp I also tried TBB's malloc proxy. I only got marginal gains for this benchmark though.
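As an illustration of the rank/thread balance being discussed, a hedged sketch of two launch configurations on a quad-GPU GH200 node; the SLURM flags and core counts are assumptions and depend on the site setup:

```sh
# Configuration from the original report (assumed): 8 ranks per GPU, 8 threads per rank.
export OMP_NUM_THREADS=8
srun --ntasks-per-node=32 --cpus-per-task=8 cp2k.psmp H2O-32-RPA-TZ.inp

# More ranks, fewer threads per rank (same total core usage).
export OMP_NUM_THREADS=2
srun --ntasks-per-node=128 --cpus-per-task=2 cp2k.psmp H2O-32-RPA-TZ.inp
```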
This case can be solved by setting the environment variable …
On x86, NVidia's implementation of OpenCL is simply part of every CUDA installation (which in turn can be part of an NVHPC installation). However, I had an issue like yours on a Jetson-AGX system (…).
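A quick way to check whether an OpenCL loader and NVIDIA's ICD are actually present on a given system; the CUDA path is an assumption:

```sh
# Look for the OpenCL loader shipped with the CUDA toolkit and for registered ICDs.
ls ${CUDA_HOME:-/usr/local/cuda}/lib64/libOpenCL.so* 2>/dev/null
ls /etc/OpenCL/vendors/*.icd 2>/dev/null
# If clinfo is installed, it lists all visible OpenCL platforms and devices.
clinfo | head -n 20
```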
I can confirm that OpenCL is not distributed with CUDA on Alps. I'll get the word out, and we'll see if somebody comes up with something.
PR #801 solves this issue. While it is not an automatic fix, it allows the user to run efficiently when encountering this problem (by setting an environment variable).
Let's keep it open for future improvements...
To measure the execution time of the dbcsr_multiply_generic routine in CP2K, what settings do I need to configure?
Just look at the CP2K output timing report and search for dbcsr_multiply_generic.
The last two columns are the inclusive time (average across ranks, max for all ranks).
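For example, assuming the run's output was redirected to a log file (the filename is an assumption):

```sh
# Pull the relevant line from CP2K's final timing report.
grep dbcsr_multiply_generic cp2k_output.log
```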
I am currently testing CP2K on the new CSCS machines with GH200 chips. In most cases, DBCSR behaves well (e.g. with the benchmarks/QS/H2O-XXX.inp tests). However, when large block sizes are involved, DBCSR becomes extremely costly. This seems to be linked to the GPU acceleration.

The following data was obtained with the benchmarks/QS_low_scaling_postHF/32-H2O/H2O-32-RPA-TZ.inp input file, on a single node (4 GPUs, 8 ranks per GPU, 8 threads per rank). In turn, CP2K was compiled with and without the -D__DBCSR_ACC flag. Timings are in seconds, as per the CP2K output file:
[Table: DBCSR timings with and without -D__DBCSR_ACC]
With GPU acceleration enabled, the time spent in DBCSR is increased by more than 15x. Profiling revealed that MPI communication is the main culprit.
I would appreciate any suggestions on how to solve this issue. What I have tried so far:
- Changing settings in the &GLOBAL%DBCSR input section of CP2K: no noticeable difference.
- Comparing against the benchmarks/QS/H2O-XXX.inp tests.

Building DBCSR without GPU support is not a satisfactory solution, as many other use cases are indeed accelerated. One possible way to address this would be the option of disabling DBCSR acceleration at run time, via a keyword in the input file.