This guide focuses on setting up a federation of Slurm clusters and Slurm multi-cluster.
Federation is a superset of multi-cluster. By setting up federation, you are also setting up multi-cluster.
If using the slurm_cluster terraform module, please refer to the multiple-slurmdbd section.
NOTE: In this guide, slurmdbd refers to the slurmdbd daemon together with its database (e.g. mariadb, mysql, etc.).
Slurm includes support for creating a federation of clusters and scheduling jobs in a peer-to-peer fashion between them. Jobs submitted to a federation receive a job ID that is unique among all clusters in the federation. A job is submitted to the local cluster (the cluster defined in the slurm.conf) and is then replicated across the clusters in the federation. Each cluster then independently attempts to schedule the job based on its own scheduling policies. The clusters coordinate with the "origin" cluster (the cluster the job was submitted to) to schedule the job.
Each cluster in the federation independently attempts to schedule each job, with the exception of coordinating with the origin cluster (the cluster the job was submitted to) to allocate resources to a federated job. When a cluster determines it can attempt to allocate resources for a job, it communicates with the origin cluster to verify that no other cluster is attempting to allocate resources at the same time.
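For illustration, once a federation exists, its state can be inspected and exercised from any member cluster; a minimal sketch (the job script name is hypothetical):

```sh
# Show the federation and its member clusters as seen by the local slurmctld
scontrol show federation

# Submit to the local cluster; the job is replicated across the federation
sbatch job.sh

# List pending and running jobs across all federation members
squeue --federation
```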
Slurm offers the ability to target commands to other clusters instead of, or in addition to, the local cluster on which the command is invoked. When this behavior is enabled, users can submit jobs to one or many clusters and receive status from those remote clusters.
When sbatch, salloc or srun is invoked with a cluster list, Slurm will immediately submit the job to the cluster that offers the earliest start time, subject to its queue of pending and running jobs. Slurm will make no subsequent effort to migrate the job to a different cluster (from the list) whose resources become available when running jobs finish before their scheduled end times.
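A short sketch of targeting commands at other clusters (cluster names are hypothetical):

```sh
# Submit to whichever listed cluster offers the earliest start time
sbatch --clusters=cluster1,cluster2 job.sh

# Query jobs on the named clusters, or on all clusters known to slurmdbd
squeue --clusters=cluster1,cluster2
squeue --clusters=all

# Cancel the job on the cluster it was routed to
scancel --clusters=cluster2 <job_id>
```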
- Use Slurmdbd.
- All clusters must be able to communicate with each slurmdbd and slurmctld.
- slurmdbd to database forms a one-to-one relationship.
- Each cluster must be able to communicate with slurmdbd.
- Either all clusters and slurmdbd use the same MUNGE key (a key-distribution sketch follows this list).
- Or, all clusters have different MUNGE keys and an alternative authentication method is used for slurmdbd.
- (Optional) Login nodes must be able to directly communicate with compute nodes (otherwise srun and salloc will fail).
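For the shared-key option above, the same munge.key must be present on every node of every cluster and on the slurmdbd host. A minimal distribution sketch, assuming root SSH access and hypothetical hostnames:

```sh
# Copy the local MUNGE key to a node of another cluster, then restart munged
scp /etc/munge/munge.key root@login1.cluster2:/etc/munge/munge.key
ssh root@login1.cluster2 \
  'chown munge:munge /etc/munge/munge.key && chmod 400 /etc/munge/munge.key \
   && systemctl restart munge'
```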
To set up a federation with a single slurmdbd:

- Deploy slurmdbd and database (e.g. mariadb, mysql, etc.).
- Deploy Slurm clusters by any chosen method (e.g. cloud, hybrid, etc.).
  WARNING: This type of configuration is not supported by the slurm_cluster terraform module; see the multiple-slurmdbd section instead.
- Update slurm.conf with accounting storage options:

  ```conf
  # slurm.conf
  AccountingStorageHost=<HOSTNAME/IP>
  AccountingStoragePort=<HOST_PORT>
  AccountingStorageUser=<USERNAME>
  AccountingStoragePass=<PASSWORD>
  ```
- Add clusters into federation (a worked sketch follows these steps):

  ```sh
  sacctmgr add federation <federation_name> [clusters=<list_of_clusters>]
  ```
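As a worked sketch of the last step (federation and cluster names are hypothetical; clusters must be registered in slurmdbd before they can join a federation):

```sh
# Register each cluster with slurmdbd (clusters also self-register when
# their slurmctld first contacts slurmdbd)
sacctmgr add cluster cluster1
sacctmgr add cluster cluster2

# Create the federation with both clusters as members
sacctmgr add federation fed1 clusters=cluster1,cluster2

# Verify federation membership
sacctmgr show federation
```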
For a multiple-slurmdbd setup (each cluster paired with its own slurmdbd):

- User UID and GID must be consistent across all federated clusters (a spot-check sketch follows this list).
- All clusters must know where each slurmdbd is.
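A quick spot-check of UID/GID consistency, assuming SSH access to a login node of each cluster (hostnames and user are hypothetical):

```sh
# The same user should resolve to identical uid/gid values on every cluster
for host in login1.cluster1 login1.cluster2; do
  ssh "$host" id alice
done
```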
- Deploy slurmdbds and databases (e.g. mariadb, mysql, etc.).
  NOTE: The slurm_cluster terraform module conflates the controller instance and the database instance.
- Deploy Slurm clusters by any chosen method (e.g. cloud, hybrid, etc.).
  WARNING: If using the slurm_cluster terraform module, do not use the `cloudsql` input, as it does not work with a federation setup.
- Update each slurm.conf with the following (a concrete two-cluster sketch follows these steps):

  ```conf
  # slurm.conf
  AccountingStorageExternalHost=<host/ip>[:port][,<host/ip>[:port]]
  ```
- Add clusters into federation:

  ```sh
  sacctmgr add federation <federation_name> [clusters=<list_of_clusters>]
  ```
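As a concrete sketch of the slurm.conf entries for two clusters, each with its own slurmdbd (hostnames are hypothetical; 6819 is the default slurmdbd port):

```conf
# slurm.conf on cluster1
ClusterName=cluster1
AccountingStorageType=accounting_storage/slurmdbd
AccountingStorageHost=dbd1.example.com
# List every other cluster's slurmdbd so this cluster can find them
AccountingStorageExternalHost=dbd2.example.com:6819
```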