For Google Cloud issues, please direct support requests to Google Cloud Support. For Slurm and slurm-gcp issues, please direct support requests to SchedMD Support.
There are numerous things that can cause strange or undesired behavior. They
can originate from Google Cloud or from the Slurm scripts provided by
slurm-gcp. You should always check the logs to locate errors and warnings
reported by the cluster instances and by Google Cloud.
Google Cloud Logging makes it easy to
monitor project activity, including all Slurm logs from each instance, by
collating all logging within the project into one place.
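For example, recent controller entries can be pulled with the Logging CLI. This is a minimal sketch: the project ID is a placeholder, and the exact log name depends on how the logging agent on your instances is configured.

```sh
# Read recent slurmctld entries collected by Cloud Logging.
# "my-project" and the "slurmctld" log name are assumptions; adjust the
# filter to match your project and agent configuration.
$ gcloud logging read \
    'resource.type="gce_instance" AND logName:"slurmctld"' \
    --project=my-project --freshness=1d --limit=20
```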
Optionally, you can check the syslog, Slurm logs, and Slurm script logs directly on each instance (see the example after the list below).
- syslog (HINT: `grep "startup-script" $LOG`)
  - `/var/log/messages`
  - `/var/log/syslog`
- Slurm
  - `/var/log/slurm/slurmctld.log`
  - `/var/log/slurm/slurmdbd.log`
  - `/var/log/slurm/slurmrestd.log`
  - `/var/log/slurm/slurmd-%n.log`
- Slurm scripts
  - `/var/log/slurm/resume.log`
  - `/var/log/slurm/suspend.log`
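As a sketch of checking these directly, you can SSH to an instance and tail the relevant file. The instance names and zone below are placeholders; substitute your own.

```sh
# Hypothetical controller and compute node names; adjust to your deployment.
$ gcloud compute ssh my-controller --zone=us-central1-a \
    --command="sudo tail -n 50 /var/log/slurm/slurmctld.log"
$ gcloud compute ssh my-node-0 --zone=us-central1-a \
    --command="sudo grep 'startup-script' /var/log/messages"
```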
Additionally, increasing the Slurm log verbosity level and/or adding DebugFlags may be useful for tracing errors or warnings.

```sh
$ scontrol setdebug debug2
$ scontrol setdebugflags +power
```
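Once finished debugging, the defaults can be restored the same way (assuming the stock SlurmctldDebug level of `info`):

```sh
# Return to the default log level and clear the previously added flag.
$ scontrol setdebug info
$ scontrol setdebugflags -power
```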
Upon startup-script failure, all users should be notified via `wall` and
`motd`. Check `/slurm/scripts/setup.log` for details about the failure.
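For example, on the affected instance (the `journalctl` unit name assumes a standard GCE guest environment):

```sh
# Review the end of the setup log for the failing step.
$ sudo tail -n 100 /slurm/scripts/setup.log
# Startup-script output is also available from the guest environment service.
$ sudo journalctl -u google-startup-scripts
```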
Google Cloud enforces quota limits. Instance deployments can fail because they would exceed your CPU quota, and they can be throttled by quota limits placed on API requests. If you are hitting these limits with your cluster, consider requesting a quota increase to better meet your cluster's demands.
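One way to check current regional quota usage before requesting an increase is with `gcloud`; the region below is a placeholder.

```sh
# Show quota limits and current usage for a region (hypothetical region).
$ gcloud compute regions describe us-central1 \
    --flatten="quotas[]" \
    --format="table(quotas.metric, quotas.limit, quotas.usage)"
```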