Understanding how parallelization of the mHM calibration works ? #104

MalveHeinz · 2024-10-02T14:51:37Z

MalveHeinz
Oct 2, 2024

Hi!

We did compile the model with OpenMP enabled on a HPC, as well as local computer (the problem persist on both, so the HPC does not seem to be the cause). We also specify the CPUs we want to use, using the the "OMP_NUM_THREADS" environment variable. We use the SCE calibration method. On both systems the correct number of CPUs is used (close to 100%), but for the same number of Iterations, it takes about the same amount of time, independent of the number of CPUs used. E.g. 5 Iterations took about 9 mins with both using 1 or 5 CPUs (we also tried different settings).

Do you know what could be the cause, have we missed something?

Thanks a lot!

Answered by StephanThober

Oct 30, 2024

Hi Pascal, Hi Malve,

thanks for sharing your run times. To summarize: for the setup you are running mHM, the parallelization does not decrease run times substantially. If you want to further reduce the run times within the calibration, you could choose a shorter simulation period. I am not sure how long your calibration period is. Section 2.4 in Mai (2023) provides some good background on data to use for model calibration.

Best,
Stephan

References:
Juliane Mai, "Ten strategies towards successful calibration of environmental models", Journal of Hydrology, Volume 620, Part A, 2023, 129414, ISSN 0022-1694, https://doi.org/10.1016/j.jhydrol.2023.129414.

View full answer

StephanThober · 2024-10-07T09:21:10Z

StephanThober
Oct 7, 2024
Maintainer

Hi Malve,

to clarify, OpenMP is used to parallelize the execution of mHM, whether this is within a calibration or a forward run. However, not all parts of mHM are making use of OpenMP. Could you share the standard output to further investigate which parts of mHM take how much time?

Best,
Stephan

0 replies

MalveHeinz · 2024-10-07T09:57:56Z

MalveHeinz
Oct 7, 2024
Author

Hi Stephan,

Thank you. I hope this is what you meant with standard output. I now attached the ConfigFile, the job ouput file and the outputs from the optimization, do you need anything else? This run with 10 iterations should have used 100 CPUs and took 23min to finish (with CPU utilized for 1 day, 14:09:31):

FinalParam.txt
outfile_job8.txt
sce_populations.txt
sce_results.txt
ConfigFile.log

Best,
Malve

0 replies

StephanThober · 2024-10-08T07:33:17Z

StephanThober
Oct 8, 2024
Maintainer

Hi Malve,

100 Threads is a lot! You have 710 cells, so roughly 7 cells per core, which is much too few. This will create data races among threads because they need to wait for each other until the results can be written back. Have you tried with much fewer nodes, say less than 10? If I use parallelization for large domains, then there are still a few thousand cells per core left!

Can you make a forward run and share the output file here? This will contain more information about timing.

Another important point for small run times is the usage of compiler. Typically intel fortran compiler will result in smaller run times compared to gfortran. Also, you should activate code optimization (i.e., release flag switched on while compilation like in this script https://git.ufz.de/mhm/mhm/-/blob/develop/CI-scripts/compile_OpenMP ).

One more thing: Did you validated the routing network? The config file prints missing values for routing cell ids, which might be an indication that the river network is not set up correctly.

Best,
Stephan

0 replies

MalveHeinz · 2024-10-09T12:54:21Z

MalveHeinz
Oct 9, 2024
Author

Hi Stephan,

Thanks for your suggestions.

We only ever used one single node and we also tried with 10 and less CPUs, but still then runtime did not decrease.
We also checked and we had activated the code optimization (release version) while compiling the model (using your script).
We now also compared the intel compiler and the gfortran compiler, which unfortunately did not decrease the runtime.
Regarding the routing network, thank you for pointing that out, I changed the resolution back to 1000m, so now I don't get missing values anymore.

Please find the runtimes for a simple forward run with the different compilers and number of CPUs (ran on the HPC, so the absolute runtime might differ from one node to another...)):

gfortran, 1 CPU = 64 sec.
gfortran 5 CPUs = 71 sec.
fortran 10 CPUs = 106 sec.
intel 1 CPU = 110 sec.
intel 5 CPUs = 73 sec.
intel 10 CPUs = 66 sec.

--> runtime for gfortran with one CPU = runtime intel with 10 CPUs = runtime of the my local setup that does not parallelize at all.

Logfile for one simple forward run with gfortran and 1 cpu
run_gfortran_1cpu.txt

gfortran and 10 cpus
run_gfortran_10cpus.txt

Do you think the MPI implementation could be faster than the OpenMP ?

Best,
Malve

0 replies

StephanThober · 2024-10-15T08:55:52Z

StephanThober
Oct 15, 2024
Maintainer

Hi Malve,

the numbers you report are unexpected. The issue here is really that your domain is very small and parallelization will not help (this includes MPI parallelization). The reason is that the parallelization in mHM parallelizes over space (i.e., different grid cells run on different compute cores). But this is only effective if the data does not fit into the L1 cache (see explanation of CPU caches below). If all the data for one timestep fits into the L1 cache, then sequential execution is the fastest that can be done. You essentially run each time step as fast as it can be done with the CPU speed.

I wanted to know when the parallelization becomes effective and ran the test domain on spatial resolutions of 1, 2, 4, 8, and 12 km and openmp threads of 1, 2, 5, and 10. I deactivated all output writing, which can contribute significantly to the run time (up to 30%).

These are the run times in Seconds:

Spatial resolution [km]	Ncells	1 Thread	2 Threads	5 Threads	10 Threads
12	109	2.4	1.5	2.3	2.9
8	228	2.9	2.3	2.8	3.3
4	823	9.9	7.2	6.9	6.6
2	3070	45.2	29.8	24.2	21.7
1	11851	211.7	155.4	133.5	123.3

So at about a 1000 grid cells, the parallelization becomes effective reducing run times by 30 to 50% from 1 to 10 Threads. For domains with less grid cells, the parallelization is not effective. However, this is System dependent and might be different on your HPC. But I would be surprised if it would deviate a lot. I did this experiment with gfortran compiler.

Please find attached a small tarball that contains the latlon file for domain 1 and a script to make these runs. You can extract the tarball somewhere next to your mhm installation (NOT in the same directory). You then need to set the mhm_para_dir and mhm_dir variable in the run_par.sh script. The mhm_para_dir should be the path to where you extracted the tar ball and mhm_dir the path to where you clone the mhm repository. If you execute the script, then it will overwrite the mhm.nml in the mhm_dir !!! I would be curious what numbers you get on your system.

ChatGPT description of Caches:
Here's a brief description of the cache hierarchy:

L0 Cache (Optional): Fastest but extremely small, mainly for specialized operations.
L1 Cache: Small and very fast, used for critical data and instructions the CPU will need soon. It's split into L1 Data Cache (L1d) and L1 Instruction Cache (L1i).
L2 Cache: Larger and slower than L1, but still close to the CPU, storing data that might be needed next.
L3 Cache: Shared among all CPU cores, larger but slower than L2. It serves as a last resort before accessing the main memory (RAM).

Tar ball
run_test_domain_1_parallel.tar.gz

0 replies

MalveHeinz · 2024-10-28T07:49:06Z

MalveHeinz
Oct 28, 2024
Author

Hi Stephan,

Thank you for the explanation!

Sorry, for the late reply, I ran the tests as well with openmp and deactivated all outputs. I always used 1 CPU only (I am not sure what the number was in your settings), but varied the number of threads.

It seems like in our case even at 1000 gridcells the parallelization did not become more efficient by an increased number of threads.

Best,
Malve

0 replies

pascalhorton · 2024-10-28T15:21:22Z

pascalhorton
Oct 28, 2024

Hi there,

Just a small update on the computing time. The following numbers are based on computations using a single node. Computation time can change from one node to another, so absolute values are not important, but here, the different runs are consistent between each others.

Spatial resolution [km]	Ncells	1 Thread	2 Threads	5 Threads	10 Threads
12	109	2.1	2.6	2.6	5
8	228	5.4	3.4	3.8	4.4
4	823	14.1	11.7	10	11.1
2	3070	58.5	51.9	49.1	39.2
1	11851	432.3	333.1	317.8	275.8

Best regards,

Pascal

0 replies

StephanThober · 2024-10-30T07:01:23Z

StephanThober
Oct 30, 2024
Maintainer

Hi Pascal, Hi Malve,

thanks for sharing your run times. To summarize: for the setup you are running mHM, the parallelization does not decrease run times substantially. If you want to further reduce the run times within the calibration, you could choose a shorter simulation period. I am not sure how long your calibration period is. Section 2.4 in Mai (2023) provides some good background on data to use for model calibration.

Best,
Stephan

References:
Juliane Mai, "Ten strategies towards successful calibration of environmental models", Journal of Hydrology, Volume 620, Part A, 2023, 129414, ISSN 0022-1694, https://doi.org/10.1016/j.jhydrol.2023.129414.

1 reply

MalveHeinz Oct 30, 2024
Author

Hi Stephan,

Thanks for taking the time to investigate! Our calibration period is (2016:2022) relatively short i think, but we will definetely check our setup for room for improvement, thanks for the reference!

Best,
Malve

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Understanding how parallelization of the mHM calibration works ? #104

{{title}}

Replies: 8 comments 1 reply

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

{{title}}

{{title}}

{{title}}

Select a reply

Understanding how parallelization of the mHM calibration works ? #104

MalveHeinz Oct 2, 2024

Replies: 8 comments · 1 reply

StephanThober Oct 7, 2024 Maintainer

MalveHeinz Oct 7, 2024 Author

StephanThober Oct 8, 2024 Maintainer

MalveHeinz Oct 9, 2024 Author

StephanThober Oct 15, 2024 Maintainer

MalveHeinz Oct 28, 2024 Author

pascalhorton Oct 28, 2024

StephanThober Oct 30, 2024 Maintainer

MalveHeinz Oct 30, 2024 Author

MalveHeinz
Oct 2, 2024

Replies: 8 comments 1 reply

StephanThober
Oct 7, 2024
Maintainer

MalveHeinz
Oct 7, 2024
Author

StephanThober
Oct 8, 2024
Maintainer

MalveHeinz
Oct 9, 2024
Author

StephanThober
Oct 15, 2024
Maintainer

MalveHeinz
Oct 28, 2024
Author

pascalhorton
Oct 28, 2024

StephanThober
Oct 30, 2024
Maintainer

MalveHeinz Oct 30, 2024
Author