Understanding how parallelization of the mHM calibration works? #104
-
Hi! We compiled the model with OpenMP enabled on an HPC as well as on a local computer (the problem persists on both, so the HPC does not seem to be the cause). We also specify the number of CPUs we want to use via the "OMP_NUM_THREADS" environment variable, and we use the SCE calibration method. On both systems the correct number of CPUs is used (utilization close to 100%), but for the same number of iterations the run takes about the same amount of time regardless of the number of CPUs used. For example, 5 iterations took about 9 min with both 1 and 5 CPUs (we also tried other settings). Do you know what could be the cause? Have we missed something? Thanks a lot!
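For reference, a minimal sketch of how we set the thread count and launch a run (the setup path is a placeholder for our actual directory):

```bash
# Sketch of our invocation; the setup path is a placeholder for the real directory.
export OMP_NUM_THREADS=5    # number of OpenMP threads mHM may use
cd /path/to/mhm_setup       # directory containing mhm.nml (SCE optimization configured there)
./mhm                       # OpenMP-enabled mHM executable
```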
-
Hi Malve, to clarify: OpenMP is used to parallelize the execution of mHM itself, whether that happens within a calibration or in a forward run. However, not all parts of mHM make use of OpenMP. Could you share the standard output so we can further investigate which parts of mHM take how much time? Best,
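(For capturing the standard output, something along these lines should work; the log file name is arbitrary:)

```bash
# Keep a copy of everything mHM prints to the terminal (log name is arbitrary).
./mhm 2>&1 | tee mhm_stdout.log
```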
-
Hi Stephan, thank you. I hope this is what you meant by standard output. I have now attached the ConfigFile, the job output file, and the outputs from the optimization; do you need anything else? This run with 10 iterations should have used 100 CPUs and took 23 min to finish (with CPU utilized for 1 day, 14:09:31): FinalParam.txt Best,
-
Hi Malve, 100 threads is a lot! You have 710 cells, so roughly 7 cells per core, which is much too few. This creates contention among the threads because they have to wait for each other before the results can be written back. Have you tried with far fewer threads, say fewer than 10? When I use parallelization for large domains, there are still a few thousand cells per core! Can you make a forward run and share the output file here? It will contain more information about timing. Another important point for short run times is the compiler: typically the Intel Fortran compiler gives shorter run times than gfortran. Also, you should activate code optimization (i.e., switch on the release flag during compilation, as in this script: https://git.ufz.de/mhm/mhm/-/blob/develop/CI-scripts/compile_OpenMP ). One more thing: did you validate the routing network? The config file prints missing values for the routing cell IDs, which might be an indication that the river network is not set up correctly. Best,
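For the optimized build, a hedged sketch could look like this (the exact CMake option names follow the linked compile_OpenMP script and may differ between mHM versions, so please treat that script as authoritative):

```bash
# Hedged sketch of an optimized (release) OpenMP build.
# Option names follow the linked compile_OpenMP script and may differ
# between mHM versions -- check that script before copying this.
mkdir -p build && cd build
FC=ifort cmake -DCMAKE_BUILD_TYPE=Release -DCMAKE_WITH_OpenMP=ON ..
make -j 4
```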
-
Hi Stephan, thanks for your suggestions.
Please find the runtimes for a simple forward run with the different compilers and numbers of CPUs (run on the HPC, so the absolute runtime might differ from one node to another):
--> the runtime for gfortran with 1 CPU = the runtime for Intel with 10 CPUs = the runtime of my local setup, which does not parallelize at all. Logfiles for one simple forward run: gfortran with 1 CPU, gfortran with 10 CPUs. Do you think the MPI implementation could be faster than the OpenMP one? Best,
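In case it helps, this is roughly how the timings were taken (a sketch; paths and log names are placeholders):

```bash
# Sketch of the timing comparison; paths and log names are placeholders.
# /usr/bin/time writes its report to stderr, so it ends up at the end of each log.
for nt in 1 10; do
  export OMP_NUM_THREADS=$nt
  /usr/bin/time -p ./mhm > "forward_${nt}threads.log" 2>&1
done
```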
-
Hi Malve, the numbers you report are unexpected. The issue here is really that your domain is very small, and parallelization will not help (this includes MPI parallelization). The reason is that mHM parallelizes over space (i.e., different grid cells run on different compute cores), but this is only effective if the data does not fit into the L1 cache (see the explanation of CPU caches below). If all the data for one timestep fits into the L1 cache, then sequential execution is the fastest option: you essentially run each time step as fast as the CPU allows. I wanted to know when the parallelization becomes effective, so I ran the test domain at spatial resolutions of 1, 2, 4, 8, and 12 km with 1, 2, 5, and 10 OpenMP threads. I deactivated all output writing, which can contribute significantly to the run time (up to 30%). These are the run times in seconds:
So at about 1000 grid cells the parallelization becomes effective, reducing run times by 30 to 50% when going from 1 to 10 threads. For domains with fewer grid cells, the parallelization is not effective. However, this is system-dependent and might be different on your HPC, but I would be surprised if it deviated a lot. I did this experiment with the gfortran compiler. Please find attached a small tarball that contains the latlon file for domain 1 and a script to make these runs. You can extract the tarball somewhere next to your mHM installation (NOT in the same directory). You then need to set the mhm_para_dir and mhm_dir variables in the run_par.sh script: mhm_para_dir should be the path to where you extracted the tarball, and mhm_dir the path to where you cloned the mHM repository. If you execute the script, it will overwrite the mhm.nml in the mhm_dir!!! I would be curious what numbers you get on your system. ChatGPT description of caches:
Tarball
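Before you open the tarball: the script does roughly the following (a sketch only; the per-resolution namelist file names below are hypothetical, and the actual run_par.sh in the tarball is authoritative):

```bash
#!/bin/bash
# Rough sketch of what run_par.sh does; file names such as mhm_${res}km.nml
# are hypothetical placeholders -- the script in the tarball is authoritative.
mhm_para_dir=/path/to/extracted/tarball   # where the tarball was extracted
mhm_dir=/path/to/mhm                      # where the mHM repository is cloned

for res in 1 2 4 8 12; do                 # spatial resolution in km
  for nt in 1 2 5 10; do                  # number of OpenMP threads
    export OMP_NUM_THREADS=$nt
    # overwrites mhm.nml in mhm_dir with the settings for this resolution
    cp "${mhm_para_dir}/mhm_${res}km.nml" "${mhm_dir}/mhm.nml"
    ( cd "${mhm_dir}" && /usr/bin/time -p ./mhm ) > "run_${res}km_${nt}threads.log" 2>&1
  done
done
```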
-
Hi Stephan, thank you for the explanation, and sorry for the late reply. I ran the tests as well with OpenMP and all outputs deactivated. I always used 1 CPU only (I am not sure what that number was in your settings), but varied the number of threads. It seems like in our case the parallelization did not become more efficient with an increased number of threads, even at 1000 grid cells. Best,
-
Hi there, just a small update on the computing time. The following numbers are based on computations using a single node. Computation time can change from one node to another, so the absolute values are not important, but the different runs here are consistent with each other.
Best regards, Pascal
-
Hi Pascal, Hi Malve,
thanks for sharing your run times. To summarize: for the setup in which you are running mHM, parallelization does not decrease run times substantially. If you want to further reduce run times within the calibration, you could choose a shorter simulation period; I am not sure how long your calibration period is. Section 2.4 in Mai (2023) provides some good background on data to use for model calibration.
Best,
Stephan
References:
Juliane Mai, "Ten strategies towards successful calibration of environmental models", Journal of Hydrology, Volume 620, Part A, 2023, 129414, ISSN 0022-1694, https://doi.org/10.1016/j.jhydrol.2023.129414.