This repository contains instructions and scripts for setting up a Slurm cluster on Nimbus (https://nimbus.pawsey.org.au/).
Additional useful documentation:
Timo Lassmann's Nimbus Cluster Instructions
Slurm database installation Wiki
Disclaimer: This is my first time setting up a cluster with Slurm. If you have more experience and wouldn't mind making any suggestions or pointing out something I missed, please let me know. I'm contactable on [email protected] or [email protected]. Thank you.
Prepare clusters according to: Pawsey Managing Instances.
Create a set of instances named “node”: OpenStack can create multiple instances at once and will name them node-1, node-2, etc. according to how many instances you request. Then create one additional instance, node-0, to act as the master node.
On your desktop, set up ssh-agent and enable ForwardAgent so that your SSH credentials are forwarded to the nodes. Also make sure your Nimbus SSH key is added to ssh-agent from your bash_profile or bashrc, like so: ssh-add ~/.ssh/Nimbus.pem
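For reference, a minimal sketch of the two pieces, assuming the key is ~/.ssh/Nimbus.pem and using a placeholder for node-0's floating IP:
# ~/.ssh/config on your desktop (hypothetical host entry)
Host node-0
    HostName <node-0 floating IP>
    User ubuntu
    IdentityFile ~/.ssh/Nimbus.pem
    ForwardAgent yes
# ~/.bashrc or ~/.bash_profile
eval "$(ssh-agent -s)" > /dev/null   # start the agent if one isn't running
ssh-add ~/.ssh/Nimbus.pem            # register the Nimbus key with the agent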
After logging into node-0, ensure that all nodes are listed in /etc/hosts with their associated 192.168.xx.xx IPs:
Example:
cat /etc/hosts
127.0.0.1 localhost
192.168.0.109 node-0
192.168.0.112 node-1
192.168.0.143 node-2
192.168.0.63 node-3
# The following lines are desirable for IPv6 capable hosts
::1 ip6-localhost ip6-loopback
fe00::0 ip6-localnet
ff00::0 ip6-mcastprefix
ff02::1 ip6-allnodes
ff02::2 ip6-allrouters
ff02::3 ip6-allhosts
Note: You may need to update and upgrade first if pdsh isn’t in the package list:
sudo apt-get -yq update
sudo apt-get -yq upgrade
sudo apt-get install pdsh -y
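Note that pdsh -a needs a list of target hosts; this setup relies on a /etc/genders file (it is updated again later when the cluster is expanded). A minimal sketch, assuming the genders file lists only the worker nodes and SSH is used as the remote command layer:
# /etc/genders (illustrative)
node-[1-3] compute
# tell pdsh to use ssh rather than rsh (e.g. in ~/.bashrc)
export PDSH_RCMD_TYPE=ssh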
pdsh -a sudo apt-get -yq update
pdsh -a sudo apt-get -yq upgrade
pdsh -a sudo apt-get install pdsh -y
sudo apt-get -yq update
sudo apt-get -yq upgrade
pdsh -a sudo apt-get -yq update
pdsh -a sudo apt-get -yq upgrade
sudo apt-get -yq install libmunge-dev libmunge2 munge
pdsh -a sudo apt-get -yq install libmunge-dev libmunge2 munge
Copy the scripts munge_per_node.sh and distribute_munge.sh to your home directory.
Run:
bash ./distribute_munge.sh
systemctl status munge
pdsh -a systemctl status munge
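As an extra check (not part of the scripts), you can confirm that a credential encoded on the master decodes on a worker, which proves the munge key was distributed correctly:
munge -n | ssh node-1 unmunge
# look for "STATUS: Success (0)"; a decode error usually means munge.key differs between nodes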
Generate a Slurm config file using the Slurm Configurator. Ensure that the Slurm version you’re going to use matches the version of the public configurator (check with “sudo apt search slurm”). Otherwise, install the documentation package on your master node and download the local configurator:
sudo apt install slurm-wlm-doc
Download the configurator from the master node to your desktop:
scp ubuntu@X:/usr/share/doc/slurm-wlm-doc/html/configurator.html ./
(where X is the IP address of the master node holding the documentation)
Note: Depending on the Slurm version, the configurator may be in /usr/share/doc/slurm-wlm/html/.
Open the configurator in your browser.
It is critical to correctly name the master node and the worker nodes. These names have to match exactly what is shown in the Nimbus OpenStack dashboard.
Make sure the number of CPUs, main memory, etc. are set correctly for each worker node. When setting the memory for each node, it is often better to round down slightly from the exact value to avoid memory allocation issues. If you are unsure of these values, install the Slurm daemon on a node to check:
ssh into one of the worker nodes (e.g. ssh node-1) and then:
sudo apt-get install -yq slurmd
and then run:
slurmd -C
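This prints the hardware line to feed into the configurator; the output looks something like the following (values here are purely illustrative):
NodeName=node-1 CPUs=8 Boards=1 SocketsPerBoard=1 CoresPerSocket=4 ThreadsPerCore=2 RealMemory=32116
UpTime=0-01:23:45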
- Set ControlMachine to node-0 (if the master node is named as such)
- Set NodeName to node-[1-3] (if named and numbered as such)
- Set SlurmUser to slurm
- Set StateSaveLocation to /var/spool/slurm-llnl
- Set ProctrackType to Cgroup
- Set TaskPlugin to Cgroup
- Set AccountingStorageType to FileTxt
- Set JobAcctGatherType to None
- Set SlurmctldLogFile to /var/log/slurm-llnl/slurmctld.log
- Set SlurmdLogFile to /var/log/slurm-llnl/slurmd.log
- Set SlurmctldPidFile to /var/run/slurmctld.pid
- Set SlurmdPidFile to /var/run/slurmd.pid
The settings below configure SlurmDBD and are optional (they can be left out if SlurmDBD is not needed, or for debugging purposes, and added in later):
- Set JobCompType to MySQL (set to none if leaving out SlurmDBD)
- Set JobCompLoc to slurm_acct_db
- Set JobCompHost to localhost
- Set JobCompUser to slurm
- Set JobCompPass to password (or whatever password was chosen for MariaDB)
- All other settings can be left as default.
Finally copy the output of the configurator webpage into a file called slurm.conf and send a copy to the home directory of your master node (simply copying and pasting into the file works best to preserve formatting).
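For reference, the relevant lines of the generated slurm.conf should end up looking something like this (a sketch only: the CPU and memory figures are illustrative and must match your slurmd -C output, and ClusterName is whatever you chose in the configurator):
ClusterName=cluster
ControlMachine=node-0
SlurmUser=slurm
StateSaveLocation=/var/spool/slurm-llnl
ProctrackType=proctrack/cgroup
TaskPlugin=task/cgroup
AccountingStorageType=accounting_storage/filetxt
JobAcctGatherType=jobacct_gather/none
SlurmctldLogFile=/var/log/slurm-llnl/slurmctld.log
SlurmdLogFile=/var/log/slurm-llnl/slurmd.log
SlurmctldPidFile=/var/run/slurmctld.pid
SlurmdPidFile=/var/run/slurmd.pid
NodeName=node-[1-3] CPUs=8 RealMemory=32116 Sockets=1 CoresPerSocket=4 ThreadsPerCore=2 State=UNKNOWN
PartitionName=debug Nodes=node-[1-3] Default=YES MaxTime=INFINITE State=UP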
Distribute the updated hosts file to all nodes:
pdcp -a /etc/hosts ~/
pdsh -a sudo mv ~/hosts /etc/hosts
Copy the scripts setup_host_for_slurm.sh and install_slurm.sh to your home directory. The install_slurm.sh script configures the master node, copies the setup_host_for_slurm.sh script to all nodes and configures them. It expects the slurm.conf file to be in the home directory of the ubuntu user.
Run:
bash ./install_slurm.sh
sudo systemctl enable slurmdbd
sudo apt-get install mariadb-server -yq
sudo systemctl start mariadb
sudo systemctl enable mariadb
sudo systemctl status mariadb
sudo /usr/bin/mysql_secure_installation
Set the password to “password” (as selected in the slurm.conf file) and select Y for the remaining questions.
sudo mysql -p (enter chosen password)
MariaDB> grant all on slurm_acct_db.* TO 'slurm'@'localhost' identified by 'password' with grant option;
MariaDB> SHOW VARIABLES LIKE 'have_innodb';
MariaDB> create database slurm_acct_db;
Check that all grants to slurm user have been accepted:
MariaDB> show grants;
MariaDB> quit;
sudo vim /etc/mysql/my.cnf
Append the following to the file:
[mysqld]
innodb_buffer_pool_size=16G
innodb_log_file_size=64M
innodb_lock_wait_timeout=900
Note: it is recommended to set the innodb_buffer_pool_size to half the RAM on the slurmdbd server. In this case it is 16GB (half of 32G).
sudo systemctl stop mariadb
sudo mv /var/lib/mysql/ib_logfile? /tmp/
sudo systemctl start mariadb
sudo mysql -p (enter chosen password)
MariaDB> SHOW VARIABLES LIKE 'innodb_buffer_pool_size';
zcat /usr/share/doc/slurmdbd/examples/slurmdbd.conf.simple.gz > slurmdbd.conf
Edit the slurmdbd.conf file (sudo vim slurmdbd.conf) and set the following (an illustrative snippet follows this list):
- Set DbdHost to localhost
- Set StorageHost to localhost
- Set StorageLoc to slurm_acct_db
- Set StoragePass to password (or whatever password was chosen to access MariaDB)
- Set StorageType to accounting_storage/mysql
- Set StorageUser to slurm
- Set LogFile to /var/log/slurm-llnl/slurmdbd.log
- Set PidFile to /var/run/slurmdbd.pid
- Set SlurmUser to slurm
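An illustrative slurmdbd.conf built from these settings (keep AuthType=auth/munge from the example template, and replace the password with your own):
AuthType=auth/munge
DbdHost=localhost
SlurmUser=slurm
LogFile=/var/log/slurm-llnl/slurmdbd.log
PidFile=/var/run/slurmdbd.pid
StorageType=accounting_storage/mysql
StorageHost=localhost
StorageUser=slurm
StoragePass=password
StorageLoc=slurm_acct_db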
sudo mv slurmdbd.conf /etc/slurm-llnl/
sudo chown slurm:slurm /etc/slurm-llnl/slurmdbd.conf
sudo chmod 664 /etc/slurm-llnl/slurmdbd.conf
Run SlurmDBD interactively with debug options to check for any errors (this will also test whether the MariaDB “slurm_acct_db” database can be populated):
sudo slurmdbd -D -vvv
End with ctrl-C
mysql -p -u slurm slurm_acct_db
MariaDB> show tables;
MariaDB> quit;
sudo systemctl enable slurmdbd
sudo systemctl start slurmdbd
sudo systemctl status slurmdbd
Check SlurmDBD for errors:
sudo slurmdbd -D -vvv
Check SlurmCTLD for errors:
sudo slurmctld -Dcvvv
Check SlurmD for errors (start up the services first):
sudo systemctl enable slurmdbd
sudo systemctl start slurmdbd
sudo systemctl status slurmdbd
sudo systemctl stop slurmctld
sudo systemctl start slurmctld
sudo systemctl status slurmctld
pdsh -a sudo systemctl stop slurmd
pdsh -a sudo systemctl enable slurmd
pdsh -a sudo systemctl status slurmd
pdsh -a sudo slurmd
pdsh -a sudo slurmd -Dcvvv
Errors can also be viewed in the log files:
sudo cat /var/log/slurm-llnl/slurmctld.log | less -S
sudo cat /var/log/slurm-llnl/slurmdbd.log | less -S
pdsh -a sudo cat /var/log/slurm-llnl/slurmd.log | less -S
Once the checks pass, start (or restart) all services:
sudo systemctl enable slurmdbd
sudo systemctl start slurmdbd
sudo systemctl stop slurmctld
sudo systemctl start slurmctld
pdsh -a sudo systemctl stop slurmd
pdsh -a sudo systemctl enable slurmd
pdsh -a sudo slurmd
Note: You should now be able to run the following commands with no problems:
sinfo (displays node information)
sudo sacct (requires SlurmDBD and shows previous or running jobs)
scontrol show jobs (shows details of currently running jobs)
scontrol ping (pings slurmctld and shows its status)
If, after setting up all services, some services appear to be inactive (dead) or failed when running the status commands, try the following commands to restart them (I found this to be the case for slurmd with my second Slurm cluster setup):
sudo service slurmdbd restart
sudo service slurmctld restart
pdsh -a sudo service slurmd restart
sudo systemctl status slurmdbd
sudo systemctl status slurmctld
pdsh -a sudo systemctl status slurmd
Note 1: I successfully ran Slurm with Ubuntu 20.04 LTS and Slurm version 19.05.5 (for some reason the Slurm controller appears to fail only when using the Pawsey custom-built Ubuntu images).
Note 2: Ubuntu 20.04 LTS images can be obtained from http://cloud-images.ubuntu.com/ and uploaded as RAW format, through the Nimbus OpenStack dashboard, under "Compute" -> "Images" -> "Create Image", as LTS images are no longer available through Nimbus.
1. It is advised to first mount a large volume on your master node, as outlined here: Attaching a storage volume. For example, for a 20-node cluster with 1 master node, first create the instances with a 3TB root volume each (or a much smaller root volume, depending on your needs/workload), then add an additional large separate volume and mount it on the master node. If resizing the volume before reattaching, follow the steps in the link provided; if unmounting the drive first fails with drive-busy warnings, follow the steps in the troubleshooting section.
2. Check the internal IP addresses of all nodes:
grep 192.168 /etc/hosts
192.168.0.109 node-0
192.168.0.112 node-1
192.168.0.143 node-2
192.168.0.63 node-3
3. Replace the IP addresses in the setup_NFS.sh script with your own. The final mounted disk is on the master node (in this case node-0 with IP 192.168.0.109).
Note: If mounting the folders goes wrong and you end up with stale file handles, just soft reboot the instances and then you can remove the folders.
Run:
bash ./setup_NFS.sh
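For orientation, sharing /data from node-0 to the workers over NFS boils down to steps along these lines (a rough sketch using the IPs above, not the repository script itself):
# on the master (node-0): install the NFS server and export /data
sudo apt-get install -yq nfs-kernel-server
echo '/data 192.168.0.0/24(rw,sync,no_subtree_check)' | sudo tee -a /etc/exports
sudo exportfs -ra
# on the workers: install the NFS client and mount the export from node-0
pdsh -w node-[1-3] sudo apt-get install -yq nfs-common
pdsh -w node-[1-3] sudo mkdir -p /data
pdsh -w node-[1-3] sudo mount 192.168.0.109:/data /data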
If you intend to run memory-intensive tasks, it's advisable to set up a swap file. I've followed the instructions from Linuxize: How to add swap space on Ubuntu 20.04.
Copy the scripts distribute_swap_file.sh and swap_file_per_node.sh to your home directory. If needed, change the swap file sizes for the master node and the worker nodes in the distribute_swap_file.sh and swap_file_per_node.sh scripts (I had 35GB set for the master node and 2000GB for the worker nodes; this is highly dependent on the expected workloads).
Run:
bash ./distribute_swap_file.sh
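For reference, creating a swap file by hand (following the Linuxize guide) looks roughly like this on a single node; the size is an example:
sudo fallocate -l 35G /swapfile       # allocate the swap file (example size)
sudo chmod 600 /swapfile              # restrict permissions
sudo mkswap /swapfile                 # format it as swap
sudo swapon /swapfile                 # enable it immediately
echo '/swapfile swap swap defaults 0 0' | sudo tee -a /etc/fstab   # persist across reboots
sudo swapon --show                    # verify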
It's important to set up a local password on each node, so that if all else fails you can log into each instance through the console in the Nimbus OpenStack dashboard.
Put the following into a bash script (update_local_password.sh) and edit the "password" to whatever you choose:
yes password | sudo passwd ubuntu
Place the bash script in your home directory (~/) and copy it across to all nodes:
pdcp -a update_local_password.sh ~/
Then run it on every node:
pdsh -a bash ./update_local_password.sh
This will ensure you can always log into your instance from the console, if access from SSH is blocked or the instance becomes unresponsive.
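For completeness, a minimal update_local_password.sh might be nothing more than:
#!/bin/bash
# set a local console password for the ubuntu user (replace "password" with your own choice)
yes password | sudo passwd ubuntu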
sudo apt-get install docker.io -y
pdsh -a sudo apt-get install docker.io -y
sudo systemctl enable --now docker
pdsh -a sudo systemctl enable --now docker
sudo usermod -aG docker ubuntu
pdsh -a sudo usermod -aG docker ubuntu
Then soft-reboot all the nodes so the docker group membership takes effect.
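A quick sanity check after the reboot (not part of the original steps) is to confirm the ubuntu user can talk to the Docker daemon on the master and on every node:
docker run --rm hello-world
pdsh -a docker run --rm hello-world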
sudo apt-get install golang -y
pdsh -a sudo apt-get install golang -y
The Go version needed is >1.13. If apt-get on your system only installs a version <1.13, then try the following:
First uninstall the incorrect version of GO:
sudo apt-get remove golang-go
sudo apt-get remove --auto-remove golang-go
pdsh -a sudo apt-get remove golang-go
pdsh -a sudo apt-get remove --auto-remove golang-go
Second, remove the GO folder:
sudo rm -dr /usr/local/go
pdsh -a sudo rm -dr /usr/local/go
Third, run the following:
sudo apt-get install software-properties-common
sudo add-apt-repository ppa:longsleep/golang-backports
sudo apt-get update -y
sudo apt-get -yq upgrade
sudo apt-get install golang-go -y
pdsh -a sudo apt-get install software-properties-common
pdsh -a sudo add-apt-repository ppa:longsleep/golang-backports
pdsh -a sudo apt-get update -y
pdsh -a sudo apt-get -yq upgrade
pdsh -a sudo apt-get install golang-go -y
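Confirm that the installed version meets the >1.13 requirement on the master and on every node:
go version
pdsh -a go version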
Install dependencies on all nodes:
Copy the scripts singdep_per_node.sh and singdep.sh to your home directory.
Run:
bash ./singdep.sh
Checking out Singularity from GitHub on all nodes:
Note: If updating to a different version of Singularity, first delete all executables with:
sudo rm -rf /usr/local/libexec/singularity
pdsh -a sudo rm -rf /usr/local/libexec/singularity
Decide on a release version from https://github.com/hpcng/singularity/releases. In this case we chose version 3.7.0.
Copy the scripts singularity_per_node.sh and install_singularity.sh to your home directory and edit them as necessary with your chosen Singularity version.
Run:
bash ./install_singularity.sh
Copy the scripts compile_singularity_per_node.sh and compile_singularity.sh to your home directory.
Run:
bash ./compile_singularity.sh
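To confirm the build succeeded on the master and the workers (the version reported should match the release chosen above):
singularity --version
pdsh -a singularity --version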
Sbatch example (submit.sh):
#!/bin/bash
#SBATCH --nodes=3
#SBATCH --ntasks-per-node=8
# allow use of all the memory on the node
#SBATCH --mem=0
# request all CPU cores on the node
#SBATCH --exclusive
# Customize --partition as appropriate
#SBATCH --partition=debug
srun -n 24 ./my_program.py
Running the job:
sbatch submit.sh
In this example we are running the job across 3 nodes, each with 8 cores, totalling 24 processors.
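Since Singularity is installed across the cluster, the same submission script can equally launch a containerised workload; the image name my_image.sif and the Python call below are placeholders:
srun -n 24 singularity exec my_image.sif python3 my_program.py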
When expanding the size of your cluster, you will need to revisit step 1 onwards: set up the mounted shared drive /data with a modified setup_NFS.sh script that includes the additional nodes' IPs, and install munge, Slurm and all the other dependent software. You will also need to update your /etc/genders, /etc/hosts and /etc/slurm-llnl/slurm.conf files. It is easier just to re-run most of the scripts from the start with small changes to the /etc/hosts and slurm.conf files, as these will need to be redistributed across the nodes, along with the munge key.
Updating the slurm.conf file can be as simple as adding an identical node with the exact same hardware (just change node-[1-3] to node-[1-4], etc.). Alternatively, the additional node can have a different hardware configuration (extra CPUs and/or RAM), in which case a new partition needs to be added. This can be done by adding new lines to your slurm.conf file, similar to the following:
NodeName=node-[1-3] CPUs=8 RealMemory=32116 Sockets=1 CoresPerSocket=4 ThreadsPerCore=2 State=UNKNOWN
NodeName=node-[4] CPUs=16 RealMemory=64323 Sockets=1 CoresPerSocket=8 ThreadsPerCore=2 State=UNKNOWN
PartitionName=debug Nodes=node-[1-3] Default=YES MaxTime=INFINITE State=UP
PartitionName=batch Nodes=node-[4] MaxTime=INFINITE State=UP
If previously configured nodes have changed IPs (for example if you delete the node-1 instance and replace node-1 with a new instance and IP) then you will need to first update the known hosts. Use the following command (amend to the number of worker nodes):
for i in {1..4}; do ssh-keygen -R node-$i; done
Since Slurm is already up and running, all the services will need to be restarted for the changes to take effect. Simply follow the troubleshooting steps in the next section for restarting all the nodes and services.
When submitting jobs to the new nodes in a different partition (in this case batch, on nodes with 16 cores), the following will need to be modified in all submission scripts:
#SBATCH --ntasks-per-node=16
#SBATCH --partition=batch
- If at any time job output stops appearing in the /data directory, it may be that the shared /data folders are no longer mounted. Therefore, rerun the setup_NFS.sh script again.
- If, after a cluster configuration change (altering slurm.conf), a reboot or a power outage, your worker nodes remain down and slurmd inactive (which can be seen by running sinfo and pdsh -a sudo systemctl status slurmd), try resetting and restarting all nodes and services by running the following commands (amend to the number of worker nodes):
for i in {1..3}; do sudo scontrol update NodeName=node-$i state=idle; done
sudo service slurmctld restart
sudo service slurmdbd restart
pdsh -a sudo service slurmd restart
- If for whatever reason Munge or other Slurm services go inactive, and no other methods seem to work, try rerunning all the scripts, and then if all else fails, try hard-rebooting all nodes and then re-initialising all the nodes and services.
- If the /data drive becomes unmounted, simply run:
sudo mount /dev/vdc /data
- If slurmd becomes unresponsive and the process becomes stuck (you see “Unable to bind listen port (*:6818): Address already in use” when running sudo slurmd -Dvvv), then kill the slurmd processes on all nodes by running the following:
pdsh -a sudo killall -9 slurmd
- If for whatever reason you want to start from the beginning again, you can either destroy the instances and start again, or uninstall Slurm, Munge and SlurmDBD by running the following:
sudo apt-get remove --auto-remove slurmctld -y
sudo apt-get remove --auto-remove slurmdbd -y
sudo apt-get remove --auto-remove munge -y
pdsh -a sudo apt-get remove --auto-remove slurmd -y
pdsh -a sudo apt-get remove --auto-remove munge -y
- If you're trying to unmount a drive to resize the drive before reattaching and the drive appears busy, use the fuser -m command. For example:
ubuntu@node-0:/data/$ fuser -m /data
/data: 31346c
ubuntu@node-0:/data/$ ps -f 31346
UID PID PPID C STIME TTY STAT TIME CMD
ubuntu 31346 31345 0 08:56 pts/0 Ss 0:00 -bash
- Use the man fuser command for more information.
- Use the kill command on any running process, edit any auto-mounts in /etc/fstab, and then shut down the instance before attempting to detach the volume in the Nimbus dashboard.
Make full use of the sbatch command, for example by using the --begin and --dependency parameters to schedule and string together a workflow.
Example using the begin parameter:
sbatch --begin=now submit_1.sh
sbatch --begin=now+12hours submit_2.sh
Example using the dependency parameter:
sbatch submit_1.sh
Submitted batch job 200
sbatch --dependency=afterok:200 submit_2.sh
Example using the dependency parameter with multiple jobs in a bash script:
#!/bin/bash
# first job with no dependencies
job1=$(sbatch submit_1.sh | cut -d ' ' -f 4)
# second job depends on the first completing with an exit code of zero (no errors)
job2=$(sbatch --dependency=afterok:$job1 submit_2.sh | cut -d ' ' -f 4)
# multiple jobs can depend on the second job completing
job3=$(sbatch --dependency=afterok:$job2 submit_3.sh | cut -d ' ' -f 4)
job4=$(sbatch --dependency=afterok:$job2 submit_4.sh | cut -d ' ' -f 4)
# the final job can depend on multiple jobs completing
job5=$(sbatch --dependency=afterok:$job3:$job4 submit_5.sh | cut -d ' ' -f 4)
See the sbatch documentation for more information, and don't be afraid to look around online for other ideas to become familiar with the usage of the different Slurm commands.