pytorch-ignite/aws-parallel-cluster-slurm


AWS Parallel Cluster (SLURM) setup and info

🚧 WIP project, scope is limited to our own use-cases 🚧

AWS ParallelCluster is an AWS supported open source cluster management tool that helps you to deploy and manage high performance computing (HPC) clusters in the AWS Cloud. Built on the open source CfnCluster project, AWS ParallelCluster enables you to quickly build an HPC compute environment in AWS. It automatically sets up the required compute resources and shared filesystem. You can use AWS ParallelCluster with batch schedulers, such as AWS Batch and Slurm. AWS ParallelCluster facilitates quick start proof of concept deployments and production deployments. You can also build higher level workflows, such as a genomics portal that automates an entire DNA sequencing workflow, on top of AWS ParallelCluster.

Cluster Setup

Requirements

  1. Install the aws-parallelcluster Python package:
pip install aws-parallelcluster
pcluster version
> 2.10.4
  2. Install and configure the AWS CLI:
pip install awscli

which aws
aws --version
> aws-cli/1.19.88 Python/3.8.10 Darwin/18.7.0 botocore/1.20.88
aws configure
  3. Create a PEM identity file on the AWS console

In the command line:

aws ec2 create-key-pair --key-name playground-cluster --query 'KeyMaterial' --output text > ~/.ssh/aws-playground-cluster.pem
chmod 600 ~/.ssh/aws-playground-cluster.pem
  4. Create a VPC with "Head node in a public subnet and compute fleet in a private subnet"

In the command line, run pcluster configure to create the VPC:

pcluster configure
>
INFO: Configuration file /Users/user/.parallelcluster/config will be written.
Press CTRL-C to interrupt the procedure.


Allowed values for AWS Region ID:
...
AWS Region ID [us-east-2]: 14
Allowed values for EC2 Key Pair Name:
...
EC2 Key Pair Name [playground-cluster]: 1
Allowed values for Scheduler:
1. sge
2. torque
3. slurm
4. awsbatch
Scheduler [slurm]: 3
Allowed values for Operating System:
1. alinux2
2. centos7
3. centos8
4. ubuntu1804
5. ubuntu2004
Operating System [alinux2]: 4
Minimum cluster size (instances) [0]: 0
Maximum cluster size (instances) [10]: 2
Head node instance type [t2.micro]:
Compute instance type [t2.micro]:
Automate VPC creation? (y/n) [n]: y
Allowed values for Network Configuration:
1. Head node in a public subnet and compute fleet in a private subnet
2. Head node and compute fleet in the same public subnet
Network Configuration [Head node in a public subnet and compute fleet in a private subnet]: 1
Beginning VPC creation. Please do not leave the terminal until the creation is finalized
Creating CloudFormation stack...
Do not leave the terminal until the process has finished
Stack Name: parallelclusternetworking-pubpriv-20210718212635
Status: NatRoutePrivate - CREATE_IN_PROGRESS
The stack has been created
Configuration file written to /Users/user/.parallelcluster/config

Get the public and private subnet IDs from the created config file:

grep subnet /Users/user/.parallelcluster/config
>
master_subnet_id = subnet-055a4a2a3d57187d3
compute_subnet_id = subnet-0624435f202eb2e11
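To reuse the subnet IDs later, for example in a hand-written cluster configuration, they can be read into shell variables before the generated config is removed. The following is a minimal sketch, not part of this repository: `get_config_value` is a hypothetical helper, and it assumes the `key = value` line format shown in the output above.

```shell
# Hypothetical helper: print the value of a "key = value" line from a
# ParallelCluster 2.x style config file.
get_config_value() {
  config_file="$1"
  key="$2"
  # Split each line on "=", match the key exactly, and trim whitespace.
  awk -F '=' -v key="$key" '
    { k = $1; gsub(/^[ \t]+|[ \t]+$/, "", k) }
    k == key { v = $2; gsub(/^[ \t]+|[ \t]+$/, "", v); print v; exit }
  ' "$config_file"
}

# Sample file that mirrors the two lines shown above.
sample_config="$(mktemp)"
cat > "$sample_config" <<'EOF'
master_subnet_id = subnet-055a4a2a3d57187d3
compute_subnet_id = subnet-0624435f202eb2e11
EOF

master_subnet_id="$(get_config_value "$sample_config" master_subnet_id)"
compute_subnet_id="$(get_config_value "$sample_config" compute_subnet_id)"
echo "head node subnet: $master_subnet_id"
echo "compute subnet:   $compute_subnet_id"
rm -f "$sample_config"
```

Against the real file, the first argument would be /Users/user/.parallelcluster/config instead of the sample file.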

Remove the created configuration:

rm -R /Users/user/.parallelcluster
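Before moving on, the requirements above can be checked with a small preflight script. This is a minimal sketch rather than something shipped in this repository: `require_cmd` and `check_key_file` are hypothetical helper names, and the key check only looks at the PEM header line and the file mode.

```shell
# Hypothetical helper: succeed only if the given command is on PATH.
require_cmd() {
  if command -v "$1" >/dev/null 2>&1; then
    echo "found: $1"
  else
    echo "missing: $1" >&2
    return 1
  fi
}

# Hypothetical helper: succeed only if the file starts with a PEM private
# key header and its mode is exactly 600.
check_key_file() {
  key_file="$1"
  [ -f "$key_file" ] || { echo "no such file: $key_file" >&2; return 1; }
  head -n 1 "$key_file" | grep -q '^-----BEGIN .*PRIVATE KEY-----' \
    || { echo "not a PEM private key: $key_file" >&2; return 1; }
  [ -n "$(find "$key_file" -maxdepth 0 -perm 600)" ] \
    || { echo "permissions are not 600: $key_file" >&2; return 1; }
  echo "ok: $key_file"
}

# Example usage on the workstation:
#   require_cmd pcluster && require_cmd aws \
#     && check_key_file ~/.ssh/aws-playground-cluster.pem
```

A key file that fails the header check usually means the create-key-pair output was saved with more than the key material in it.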

Configure AWS ParallelCluster

Create a cluster

User management

  1. Connect to the cluster as the admin user (ubuntu by default)
  2. Clone the aws-parallel-cluster-slurm repository:

2.1 To enable access to the repository, add id_rsa.pub to the project's deploy keys: https://github.com/pytorch-ignite/aws-parallel-cluster-slurm/settings/keys

2.2 Get the repository:

git clone git@github.com:pytorch-ignite/aws-parallel-cluster-slurm.git
cd aws-parallel-cluster-slurm

Add new user

  1. Get the SSH public key from the user
  2. Run the following command to create the user, e.g. alice:
bash setup/users/add_new_user.bash alice
>
[INFO][2021-07-17 21:19:26] Please enter the public SSH key for the user:
ssh-rsa AAAAB....
[INFO][2021-07-17 21:19:34] Create new user: alice
[INFO][2021-07-17 21:19:34] Updated users list: alice 1001
[INFO][2021-07-17 21:19:34] Added public key to /home/alice/.ssh/authorized_keys
  3. Verify:
id alice
>
uid=1001(alice) gid=1001(alice) groups=1001(alice)

The user should now be able to connect to the cluster over SSH:

ssh -i /path/to/ssh/private/id_rsa alice@<cluster-ip>
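Because add_new_user.bash prompts for the public key, it can help to sanity-check the string received from the user before pasting it. The sketch below is an assumption-laden illustration, not part of the repository scripts: `looks_like_ssh_public_key` is a hypothetical helper that checks only the `<type> <base64> [comment]` shape for a few common key types and does not validate the key itself.

```shell
# Hypothetical helper: succeed if the argument has the shape of a single-line
# OpenSSH public key (type, base64 body, optional comment).
looks_like_ssh_public_key() {
  printf '%s\n' "$1" | grep -Eq \
    '^(ssh-rsa|ssh-ed25519|ecdsa-sha2-nistp(256|384|521)) [A-Za-z0-9+/]+=*( .*)?$'
}

# Example usage with a placeholder key body:
if looks_like_ssh_public_key "ssh-rsa AAAAB3NzaC1yc2E= alice@laptop"; then
  echo "key format looks fine"
else
  echo "this does not look like an SSH public key" >&2
fi
```

A private key pasted by mistake (a line starting with `-----BEGIN`) fails this check, which is the most common mix-up worth catching early.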

Remove existing user

  1. Run the following command to remove the user, e.g. alice:
bash setup/users/remove_user.bash alice
>
[INFO][2021-07-17 21:31:53] Please, confirm to remove user: alice [Y/n]: Y
[INFO][2021-07-17 21:31:55] Removed alice from users list
Looking for files to backup/remove ...
Removing files ...
Removing user `alice' ...
Warning: group `alice' has no more members.
Done.
[INFO][2021-07-17 21:31:55] User alice was deleted
  2. Verify:
id alice
>
id: ‘alice’: no such user

Cluster usage

srun, sbatch, sinfo, squeue, scancel
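As an illustration of how these commands fit together, the sketch below writes a minimal batch script and lists the typical calls around it. The job name, file names, and resource values are illustrative assumptions; two nodes matches the maximum cluster size chosen during pcluster configure above.

```shell
# Write a minimal batch script. Names and resource values are assumptions.
cat > hello.sbatch <<'EOF'
#!/bin/bash
#SBATCH --job-name=hello
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=1
#SBATCH --output=hello-%j.out

# Run one task per allocated node and print each node's hostname.
srun hostname
EOF

# Check the script for shell syntax errors before submitting it.
bash -n hello.sbatch && echo "hello.sbatch is ready"

# On the cluster's head node:
#   sinfo                      # list partitions and node states
#   sbatch hello.sbatch        # submit the job and print its job ID
#   squeue -u "$USER"          # list your pending and running jobs
#   scancel <job-id>           # cancel a job by ID
#   srun --nodes=1 hostname    # run a single command on one compute node
```

With a minimum cluster size of 0, compute nodes are started on demand, so the first job may wait in the queue while instances boot.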

References
