- To my knowledge (@amartinhuertas), at present, the only way of "debugging" `MPI.jl` parallel programs is "print statement debugging". We have observed that messages printed to stdout with `println` by the different Julia REPLs running at the different MPI tasks are not atomic, but broken/intermixed stochastically. However, if you do `print("something\n")` you are more likely to get the output on a single line than with `println("something")` (thanks to @symonbyrne for this trick, it is very useful). More serious/definitive solutions are being discussed in this issue of `MPI.jl`. A minimal sketch of the trick is shown right after this item.
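  A minimal sketch of this pattern, assuming `MPI.jl` is installed (the script name and the message text are mine; only the `print`-with-embedded-newline trick comes from the note above):

  ```julia
  # debug_print.jl -- hypothetical example of per-task print-statement debugging.
  using MPI

  MPI.Init()
  comm = MPI.COMM_WORLD
  rank = MPI.Comm_rank(comm)

  # Build the whole line, including the trailing newline, and emit it with a
  # single `print` call; this makes it more likely that each task's message
  # lands on its own line instead of being interleaved with other tasks.
  print("[rank $(rank)] some debug message\n")

  MPI.Barrier(comm)
  ```

  Run it with, e.g., `mpirun -np 4 julia --project=. debug_print.jl`.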
- Some people have used `tmpi` (https://github.com/Azrael3000/tmpi) for running multiple interactive sessions, and we could also try the `@mpi_do` macro in `MPIClusterManagers` (I have not explored either of them). If I am not wrong, the first alternative may involve multiple `gdb` debuggers running in different terminal windows, and a deep knowledge of the low-level C code generated by Julia (see https://docs.julialang.org/en/v1/devdocs/debuggingtips/ for more details). I wonder whether, e.g., https://github.com/JuliaDebug/Debugger.jl could be combined with `tmpi`. A common pattern for attaching one debugger per task is sketched right after this item.
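  A common low-tech pattern for attaching one `gdb` per MPI task (a generic sketch, not taken from the references above; the sentinel file name is an arbitrary choice):

  ```julia
  # attach_and_wait.jl -- hypothetical helper: each task announces its PID/host
  # and then blocks until a sentinel file appears, leaving time to attach
  # `gdb -p <pid>` from other terminals (e.g., one per tmpi pane).
  using MPI

  MPI.Init()
  comm = MPI.COMM_WORLD
  rank = MPI.Comm_rank(comm)

  print("[rank $(rank)] PID $(getpid()) on host $(gethostname())\n")

  # Run `touch /tmp/attach_done` from another terminal once all debuggers are attached.
  while !isfile("/tmp/attach_done")
      sleep(2)
  end

  MPI.Barrier(comm)
  # ... the code to be debugged goes here ...
  ```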
- For reducing JIT lag it becomes absolutely mandatory to build a custom system image of (some of) the `GridapDistributed.jl` dependencies, e.g., `Gridap.jl`. See the following link for more details: https://github.com/gridap/Gridap.jl/tree/julia_script_creation_system_custom_images/compile. TO BE UPDATED WHEN BRANCH `julia_script_creation_system_custom_images` is merged into `master`. Assuming that the `Gridap.jl` image is called `Gridapv0.10.4.so`, one may then call the parallel `MPI.jl` program as `mpirun -np 4 julia -J ./Gridapv0.10.4.so --project=. test/MPIPETScDistributedPoissonTests.jl`. The general idea behind building such an image is sketched right after this item.
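  A sketch with PackageCompiler.jl (this is not the script from the branch linked above; `precompile_gridap.jl` is a hypothetical warm-up script that exercises the Gridap API you want compiled into the image):

  ```julia
  # build_sysimage.jl -- hypothetical system-image build for Gridap.
  using PackageCompiler

  create_sysimage(
    [:Gridap];
    sysimage_path = "Gridapv0.10.4.so",
    precompile_execution_file = "precompile_gridap.jl",
  )
  ```

  The resulting `Gridapv0.10.4.so` is then passed to `julia -J` as in the `mpirun` command above.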
- Precompilation issues of `MPI.jl` in parallel runs. See here for more details.
- In NCI@Gadi (I do not know about other systems), I am getting per-task core dump files on crashes (e.g., SEGFAULT). This is bad, since the file system in Gadi is limited, and such core dump files are not particularly lightweight. I wrote to Gadi support and got the following answer (I have not explored it yet):
  ```
  Hi,
  I am really not sure what can be done here. You are running an mpi julia program that in turn
  calls petsc. It looks like both petsc and julia have its own signal handlers so potentially may overwrite
  core dump settings. I am not sure how julia is calling PetscInitialize() but if does call it directly,
  you may try adding -no_signal_handler to it. It appears you can also put it into ~username/.petscrc (
  https://www.mcs.anl.gov/petsc/petsc-current/docs/manualpages/Sys/PetscInitialize.html)
  Alternatively, it could be that ulimit -c setting is not propagating to all nodes ... and yes,
  it doesn't. I guess you can try using a wrapper to set it on all nodes (wrapper.csh):

  #!/bin/tcsh
  limit coredumpsize 0
  ./a.out

  Replace a.out with the name of your program and then run mpirun ./wrapper.csh and see if this helps.
  It is better to use csh script as you will get

  vsetenv BASH_FUNC_module%% failed
  vsetenv BASH_FUNC_switchml%% failed

  errors from bash ... and it is too late for me to try to figured out where do they come from.
  Best wishes
  Andrey
  ```
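  If one wants to try the `-no_signal_handler` route suggested above without modifying the code, a minimal (hypothetical, not yet tested on Gadi) `~/.petscrc` would contain just that option, one option per line:

  ```
  -no_signal_handler
  ```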