Gadi (NCI): useful links, commands, and workflows
- https://my.nci.org.au/mancini/login (to check accounting, project, remaining budget, etc.)
- https://opus.nci.org.au/pages/viewpage.action?pageId=90308777 (Gadi user's manual).
- https://opus.nci.org.au/display/Help/Queue+Limits. We can use at most 20736 Gadi cores (432 nodes). I will speak with @santiagobadia to see how we can raise this limit.
- NCI help (https://help.nci.org.au/)
- NCI terms and conditions: https://nci.org.au/users/nci-terms-and-conditions-access
- NEW Intel Sapphire Rapids nodes (Mar 2023): https://opus.nci.org.au/display/Help/Sapphire+Rapids+Compute+Nodes
Project's budget
Use the `nci_account` command to check the amount of KSUs used, granted, etc.
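For example, a quick check from a login node might look as follows. The -P option for selecting a specific project is an assumption here (run nci_account --help to see the options actually available on Gadi), and abc123 is a hypothetical project ID:
# overall summary of granted and used KSUs for your default project
nci_account
# summary for a specific project (assumed -P flag; abc123 is a hypothetical project ID)
nci_account -P abc123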
Disk quota
You can check whether the number of files and the total space used by the project members are close to the disk quota. The commands to analyse the usage of the `/scratch` and `/g/data` filesystems are:
nci-files-report
lquota
For the `/home` filesystem, the following command should work:
quota -s
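Put together, a minimal disk-usage check from a Gadi login node might look like the sketch below; the commands are exactly the ones listed above, and the comments indicate which filesystem each one covers:
# per-user file counts and sizes on the /scratch and /g/data filesystems
nci-files-report
# project-level quota and current usage on /scratch and /g/data
lquota
# personal quota and usage on /home
quota -s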
Julia module
To use the Julia module provided on Gadi, simply run the following commands:
module unload intel-mkl
module load julia
After executing these, you should be able to run, e.g., the `julia --version` command successfully:
[am6349@gadi-login-01 ~]$ julia --version
julia version 1.6.1
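If you want these modules available in every new session without retyping the commands, one option is to append them to your shell start-up file. This is a minimal sketch assuming the default bash login shell on Gadi:
# load the Julia module automatically in future sessions
echo 'module unload intel-mkl' >> ~/.bashrc
echo 'module load julia' >> ~/.bashrc
# apply the change to the current session
source ~/.bashrc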
Installing Julia manually
To install Julia in your own directory on the cluster, first log in to Gadi. Open the terminal and run (replacing a99 with your NCI user ID):
ssh a99@gadi.nci.org.au
Change to the directory where you would like the Julia installation (it is recommended that it is installed in the `/home` directory) and execute the command:
wget https://julialang-s3.julialang.org/bin/linux/x64/1.4/julia-1.4.2-linux-x86_64.tar.gz
This link is obtained from https://julialang.org/downloads/ (look here if a different Julia version is required).
Finally, untar it:
tar xvzf julia-1.4.2-linux-x86_64.tar.gz
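After untarring, the julia binary sits in the bin subdirectory of the extracted julia-1.4.2 folder. A minimal sketch for making it callable simply as julia, assuming the tarball was extracted in your /home directory and a bash login shell:
# add the manually installed Julia to the search path for future sessions
echo 'export PATH="$HOME/julia-1.4.2/bin:$PATH"' >> ~/.bashrc
source ~/.bashrc
# check that the right binary is picked up
julia --version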
Running a serial Julia job
This section describes the steps for running a serial Julia process non-interactively on a single Gadi node.
Job script
To batch any job to the cluster, whether it be serial or parallel, we must write a shell script (which we will refer to as the job script) detailing the important specifications of our job. This is mainly so that the cluster's management system can appropriately allocate the required resources. A template for this job script, `job_script.sh`, is shown below:
#!/bin/bash
#PBS -P abc123
#PBS -q normal
#PBS -l walltime=00:30:00
#PBS -l ncpus=1
#PBS -l mem=4gb
#PBS -N my_test.jl
#PBS -l software=Gridap.jl
#PBS -o /scratch/abc123/a99/stdout.txt
#PBS -e /scratch/abc123/a99/stderr.txt
#PBS -l wd
dir=/scratch/abc123/a99/<PATH_WHERE_YOU_WANT_TO_KEEP_DATA>
cd $dir
<PATH_TO_YOUR_INSTALLATION_OF_JULIA> <PATH_TO_YOUR_JULIA_SCRIPT>
The script is divided into two parts: the header and the body.
The header lines are prepended with `#PBS` and include the specifications for configuring the job request:
- `#PBS -P abc123`: with -P we specify the project ID, abc123 in this example.
- `#PBS -q normal`: with -q we specify the queue we would like to enter. The option `normal` is used here for a regular-priority job.
- `#PBS -l walltime=00:30:00`: with this option we specify the maximum time the job may spend on the node. Only the time actually used will be charged to the project. However, if the job exceeds the time specified here, it will be terminated.
- `#PBS -l ncpus=1` and `#PBS -l mem=4gb`: with these options we specify the number of CPUs and the memory required. For a serial job, we only require 1 CPU. Each CPU has 4 GB of memory, which is enough for most serial jobs. However, if we require more memory, we can specify it as `#PBS -l ncpus=1` and `#PBS -l mem=8gb`. Note that, in this case, we will be charged for the use of 2 CPUs (8 GB) even if we use less than 4 GB of memory. In contrast to the walltime specification, these resources are charged based on the header information and not on what we actually use. A guide on how resources are charged is given here: https://opus.nci.org.au/display/Help/Preparing+for+Gadi#PreparingforGadi-JobCharging-Examples. A worked example is given right after this list.
- `#PBS -N my_test.jl`: with -N we give the job a name.
- `#PBS -l software=Gridap.jl`: with this option we specify the software used by the job, so that Gadi can track software usage.
- `#PBS -o /scratch/abc123/a99/stdout.txt` and `#PBS -e /scratch/abc123/a99/stderr.txt`: the output of the program (messages printed to screen) is written into batch files. With -o and -e we set the location of the standard output of our code and of any error messages. a99 should be replaced with your user ID.
- `#PBS -l wd`: this option sets the working directory to that from which the job was submitted.
On the other hand, the body of the job script is a regular Unix shell script. In this particular example:
- `dir=/scratch/abc123/a99/<PATH_WHERE_YOU_WANT_TO_KEEP_DATA>` and `cd $dir`: here we use standard Unix shell commands to change the working directory again, if desired. abc123 and a99 should be replaced with the project ID and user ID, respectively.
- `<PATH_TO_YOUR_INSTALLATION_OF_JULIA> <PATH_TO_YOUR_JULIA_SCRIPT>`: finally, we execute the Julia script. This line should look something like `/home/565/a99/julia-1.4.2/bin/julia /scratch/abc123/a99/my_test.jl`.
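As a worked example of the charging rule above (assuming the normal queue rate of 2 SU per CPU-hour at the time of writing; see the NCI job-charging guide linked above), consider requesting 1 CPU with 8 GB for 30 minutes:
#PBS -q normal
#PBS -l walltime=00:30:00
#PBS -l ncpus=1
#PBS -l mem=8gb
# On the normal queue each CPU comes with 4 GB, so an 8 GB request counts as
# max(ncpus, mem/4GB) = max(1, 2) = 2 CPUs.
# If the job runs for the full 30 minutes:
#   charge = 2 CPUs * 0.5 h * 2 SU per CPU-hour = 2 SU
# If it finishes earlier, only the elapsed walltime is charged.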
Job script submission
After writing the job script, we submit it on Gadi.
To log in, open the terminal and run:
ssh a99@gadi.nci.org.au
Once logged into the cluster, submit the freshly written job script using the command:
qsub job_script.sh
To check on the progress of the job, use the command:
qstat
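`qstat` accepts the usual PBS options; for example, to list only your own jobs or to inspect one job in detail (here a99 is the example user ID from above and 12345678 stands for the job ID printed by qsub on submission):
# list only the jobs belonging to user a99
qstat -u a99
# show the full status record of a particular job
qstat -f 12345678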
Work in progress ...
The (currently) newest version of OpenMPI (`4.x.x`) experiences issues when calling the `HCOLL` library, which is used for collective communications. This problem will probably be fixed in OpenMPI `5.0.0+`, but until then some workarounds can be useful to avoid problems.
As a first option, one can try to set the following variables in the running environment:
export HCOLL_ML_DISABLE_SCATTERV=1
export HCOLL_ML_DISABLE_BCAST=1
or, as a last resort, disable the library completely by running `mpiexec` with the flag `-mca coll_hcoll_enable 0`.
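For reference, inside the body of a PBS job script the two workarounds would look something like the sketch below; the paths and the my_mpi_test.jl driver are placeholders following the serial template above, $PBS_NCPUS is the number of CPUs requested in the header, and only one of the two options should be used:
# Option 1: keep HCOLL but disable the problematic collectives
export HCOLL_ML_DISABLE_SCATTERV=1
export HCOLL_ML_DISABLE_BCAST=1
mpiexec -n $PBS_NCPUS <PATH_TO_YOUR_INSTALLATION_OF_JULIA> /scratch/abc123/a99/my_mpi_test.jl

# Option 2 (last resort): disable the HCOLL library completely
mpiexec -mca coll_hcoll_enable 0 -n $PBS_NCPUS <PATH_TO_YOUR_INSTALLATION_OF_JULIA> /scratch/abc123/a99/my_mpi_test.jl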