NOTE 2021-09-30: Changes have been made to the slurm role since this doc was written, so it may no longer be accurate. By default, slurm now comes from OHPC.
In general, be careful with slurmdbd: run it in the foreground during an upgrade so you can monitor progress. See http://slurm.schedmd.com/quickstart_admin.html#upgrade
To switch from FGCI to OHPC slurm packages (steps are referenced by number below):

0. A service break is required. Make sure no jobs are running.
1. Stop slurmdbd: systemctl stop slurmdbd
2. Remove all slurm and munge packages: yum remove 'slurm' 'munge*'
3. Install the slurm server packages: yum install ohpc-slurm-server
4. Start munge: systemctl restart munge
5. Start slurmdbd in the foreground: /sbin/slurmdbd -D -v
6. Wait until the DB upgrade completes and "slurmdbd version XX.XX.X started" is printed (this can take 45 minutes)
7. Stop the slurmdbd running in the foreground: Ctrl-C
8. Start slurmdbd via systemd: systemctl start slurmdbd
9. In ansible group_vars/all/all.yml, set slurm_repo: "ohpc"
10. Run: ansible-playbook install.yml --tags=fgci-install
11. systemctl stop slurmctld
12. Remove the slurm and munge packages if not done already: yum -y remove 'slurm' 'munge*'
13. Install the slurm server packages if not done already: yum -y install ohpc-slurm-server
14. If this is the install node: systemctl restart slurmdbd; sleep 3; systemctl restart munge; sleep 3; systemctl restart slurmctld
15. If it is only a controller node: systemctl restart munge; sleep 3; systemctl restart slurmctld
16. On each compute node: systemctl stop slurmd
17. yum -y remove 'slurm' 'munge*'
18. yum -y install ohpc-slurm-client
19. systemctl start munge; sleep 3; systemctl start slurmd
20. Lock the slurm version on the service node: ansible-playbook install.yml -t slurm
21. Lock the slurm version on the compute nodes: ansible-playbook compute.yml -t slurm
22. On the grid node: yum remove 'slurm' 'munge*'
23. Run ansible: ansible-playbook grid.yml -t slurm
24. On the login node: yum -y remove 'slurm' 'munge*'
25. yum -y install ohpc-slurm-client
26. Run ansible: ansible-playbook login.yml -t slurm
27. Switching from FGCI to OHPC slurm packages is now complete.
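The compute-node part of the switch above (stop slurmd, swap packages, restart) can be batched over pdsh, which these notes already use for the compute group. This is only a sketch: the `run` wrapper and the `DRY_RUN` flag are illustrative additions, not part of the playbooks, and `DRY_RUN` defaults to 1 so the script only prints what it would do.

```shell
#!/bin/bash
# Sketch of the compute-node package switch, assuming a pdsh "compute"
# host group (as used elsewhere in these notes). DRY_RUN=1 (the default)
# prints each command instead of executing it; set DRY_RUN=0 to run.
run() {
    if [ "${DRY_RUN:-1}" = 1 ]; then
        echo "would run: $*"
    else
        "$@"
    fi
}

run pdsh -g compute -l root "systemctl stop slurmd"
run pdsh -g compute -l root "yum -y remove 'slurm' 'munge*'"
run pdsh -g compute -l root "yum -y install ohpc-slurm-client"
run pdsh -g compute -l root "systemctl start munge; sleep 3; systemctl start slurmd"
```

The dry-run default makes it safe to run the script once to review the exact command sequence before committing to it.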
To upgrade to a newer OHPC slurm version:
- Do steps 0-1 from the list above.
- Delete all slurm and munge versionlock entries from /etc/yum/pluginconf.d/versionlock.list
- In group_vars set slurm_ohpc_versionlock to False
- ansible-playbook install.yml --tags=fgci-install
- yum update
- Do steps 4-7 from the above list.
- Upgrade all the nodes per steps 9-16, except run "yum update" instead of remove+install.
- In group_vars set slurm_ohpc_versionlock to True and run this role to lock the version again.
The following older notes cover upgrading slurm from the FGCI yum repo. Mostly the steps are done outside ansible, but it helps a bit.
Slurm 16.05 details: http://slurm.schedmd.com/SLUG16/V16.05.pdf
Official upgrade documentation: http://slurm.schedmd.com/quickstart_admin.html#upgrade
This guide is not a replacement for the official instructions.
It assumes you are using the https://github.com/fgci-org/fgci-ansible playbooks, where install.yml targets the service node and compute.yml the compute nodes. It also assumes you are using the FGCI yum repo to fetch slurm packages.
- On the install node, stop slurmdbd: systemctl stop slurmdbd
- Take a backup with something like: /usr/local/sbin/dump-all-databases.sh -o /outdir -z
- Set the ansible variable fgci_slurmrepo_version to "fgcislurm1605" in group_vars/all (this will point yum.repos/fgislurm.repo to the 1605 repo)
- Run the slurm role until the task "Add FGI slurm repo" on the install node:
  - ansible-playbook install.yml -t slurm --step
  - Answer Y to the setup task and the "Add FGI slurm repo" task only, N to the rest. Ctrl-C to quit after the "Add FGI slurm repo" task is done.
- yum update
- Run the schema upgrade in the foreground: slurmdbd -D
- Ctrl-C when it says "started"
- systemctl daemon-reload
- systemctl start slurmdbd
- systemctl restart slurmctld
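The "Ctrl-C when it says started" step can be made less manual by watching the foreground slurmdbd output for the "slurmdbd version ... started" line. This is a sketch; wait_for_started is a hypothetical helper name, not a slurm tool.

```shell
#!/bin/bash
# Hypothetical helper: read log lines on stdin, echo them through, and
# return success once the "slurmdbd version ... started" line appears.
wait_for_started() {
    while IFS= read -r line; do
        echo "$line"
        case "$line" in
            *"slurmdbd version "*" started"*) return 0 ;;
        esac
    done
    return 1  # EOF without seeing the "started" line
}

# Usage (not run here): /sbin/slurmdbd -D -v 2>&1 | wait_for_started
```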
- bash tools/pullReqs.sh (to rsync group_vars to the internal web server for ansible-pull)
- Run the slurm role on the compute nodes: ansible-playbook compute.yml -t slurm
- Run yum update to update slurm on the compute nodes: pdsh -g compute -l root yum -y update
- Optionally: systemctl daemon-reload
- systemctl restart slurmd
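The compute-node upgrade above can likewise be batched over pdsh. As before, the `run` wrapper and `DRY_RUN` flag are illustrative additions; the default only prints the commands.

```shell
#!/bin/bash
# Sketch of the compute-node slurm upgrade, assuming a pdsh "compute"
# host group. DRY_RUN=1 (the default) prints each command instead of
# executing it; set DRY_RUN=0 to run for real.
run() {
    if [ "${DRY_RUN:-1}" = 1 ]; then
        echo "would run: $*"
    else
        "$@"
    fi
}

run pdsh -g compute -l root "yum -y update"
run pdsh -g compute -l root "systemctl daemon-reload"   # optional
run pdsh -g compute -l root "systemctl restart slurmd"
```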
- Run the slurm role on the submit hosts (grid and login): ansible-playbook site.yml -t slurm -l login,grid
- Run yum update on the login and grid nodes