+In order to run jobs on Trixie, users need to specify which SLURM account code should be used for billing. This is handled by adding a line to the SLURM submission script that identifies the account:
+#SBATCH --account=<account_code>
+
+Users must be authorized to charge an account before they can use it. The approved project accounts are listed below; a way to check which accounts you may charge is sketched after the list.
+AI for Drug Design, NRC-PI:Tchagang, Alain
+Precision Discovery in Bio Systems, NRC-PI:Shao, Xiaojian
+Multi-Targeted Therapeutics, NRC-PI:Fauteux, François
+Protein Design Drugs & Gene, NRC-PI:Paquet, Eric
+AI Simulation of Bio Systems, NRC-PI:Cuperlovic-Culf, Miroslav
+Digital-Twining of Bioreactor, NRC-PI:Belacel, Nabil
+AI-based Shape Optimization, NRC-PI:Shu, Chang
+Design of Superconductive Tapes, NRC-PI:Valdes, Julio
+Intelligent Design, NRC-PI:Guo, Hong Yu
+Automated Material Synthesis using Deep Reinforcement Learning, NRC-PI:Tamblyn, Isaac
+Simulation & Design of Materials, NRC-PI:Tchagang, Alain
+Spectroscopic Signatures, NRC-PI:Tamblyn, Isaac
+Miniaturization HP Components, NRC-PI:Grinberg, Yuri
+AI-assisted Inverse Design, NRC-PI:Grinberg, Yuri
+NRC-PI:Ebadi, Ashkan; Xi, Pengcheng
+Data Analytics Centre / Données Analytiques
+Data Science for Complex Systems / Science des Données pour les Systèmes Complexes
+Multilingual Text Processing / Traitement Multilingue de Texte
+Text Analytics / Analyse de textes
+Computational Laboratory for Energy And Nanoscience
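+If you are unsure which accounts you are authorized to charge, Slurm's accounting database can list your associations. A minimal sketch, assuming the standard sacctmgr client is available on the login node:
+# List the accounts this user may charge jobs to
+sacctmgr show associations user=$USER format=Account%30,Partition%20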
+Here’s a skeleton of what our jobs look like. Please check your job once it is running and dial down the number of CPUs and the amount of memory requested. If a job doesn’t use a node’s full resources, the leftover capacity can be used to schedule CPU-only (i.e., non-GPU) jobs on those nodes.
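+One way to check what a running job is actually using is Slurm's sstat (a sketch; <jobid> is a placeholder, and the fields require job steps started with srun):
+# Average CPU time and peak memory of the job's steps
+sstat --jobs=<jobid> --format=JobID,AveCPU,MaxRSS,MaxVMSize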
+Important steps in order to get automatic requeueing working:
+* Ask Slurm to send you a signal 30 seconds before the end of your time limit: --signal=B:USR1@30
+* Set a trap to listen for the requested signal: trap _requeue USR1
+* Send your MAIN process to the background and wait for it; otherwise your _requeue function will NEVER get a chance to run.
#SBATCH --job-name=WMT21.training
+#SBATCH --comment="Thanks Samuel Larkin for showing me how to work with Slurm"
+
+#SBATCH --partition=TrixieMain
+#SBATCH --account=dt-mtp
+#SBATCH --gres=gpu:4
+#SBATCH --time=12:00:00
+#SBATCH --exclude=cn125
+#SBATCH --nodes=1
+#SBATCH --ntasks-per-node=1
+#SBATCH --cpus-per-task=24
+#SBATCH --mem=40G
+# To reserve a whole node for yourself
+####SBATCH --exclusive
+#SBATCH --open-mode=append
+#SBATCH --requeue
+#SBATCH --signal=B:USR1@30
+#SBATCH --output=%x-%j.out
+
+# Requeueing on Trixie
+# [source](https://www.sherlock.stanford.edu/docs/user-guide/running-jobs/)
+# [source](https://hpc-uit.readthedocs.io/en/latest/jobs/examples.html#how-to-recover-files-before-a-job-times-out)
+function _requeue {
+ echo "BASH - trapping signal 10 - requeueing $SLURM_JOBID"
+ date
+ # This allows us to requeue any job generically; since we are using XLM,
+ # which is Slurm aware, XLM can save its model before requeueing.
+ scontrol requeue $SLURM_JOBID
+}
+
+if [[ -n "$SLURM_JOBID" ]]; then
+ trap _requeue USR1
+fi
+
+
+time python -m sockeye.train ... &
+wait
+
+where time python -m sockeye.train ... is the process you want to run.
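+To confirm that a submitted job has requeueing enabled, you can inspect it with scontrol (<jobid> is a placeholder):
+# Requeue=1 means the job may be requeued after the USR1 trap fires
+scontrol show job <jobid> | grep -o 'Requeue=[0-9]'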
The most up-to-date way to see which software has been preinstalled on Trixie is by using the module command. When in doubt, it is the definitive list.
+Software on Trixie is organized using the module service. Users can load, unload, and swap libraries within their environment and job submission scripts via modules.
If there is a piece of software you would like to use but it is not available from the list (and you can't figure out how to build it yourself in your home directory), you may create a request using the issues tab (https://github.com/ai4d-iasc/trixie/issues). Please do not create duplicate requests for software, but feel free to comment on an existing thread to ''upvote'' it so that the priority is clear.
+To ease usage of the bastion host when connecting to Trixie, there are some steps which can be taken, in particular making use of the SSH ProxyJump and ControlMaster parameters. Basically, you configure SSH to connect to the Trixie server automatically, using the bastion host as a relay between your local computer and the Trixie server.
+Important Note: Before proceeding with this configuration, please ensure that you have performed the External Access Setup procedure.
+To configure SSH to automatically connect to the Trixie server, open your .ssh/config file on your local machine – not the servers – with your preferred text editor and add the following lines, substituting your given usernames in the User directives. You will also need to create the folder .ssh/sockets to complete the configuration (see the commands after the configuration block).
Host trixie-bastion
+ HostName trixie.nrc-cnrc.gc.ca
+ User <firstname>.<lastname>@pub.nrc-cnrc.gc.ca
+ ControlMaster auto
+ ControlPath ~/.ssh/sockets/%r@%h-%p
+
+Host trixie
+ HostName trixie.res.nrc.gc.ca
+ User admin.<firstname>.<lastname>
+ ProxyJump trixie-bastion
+
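+The sockets directory referenced by ControlPath is not created automatically. On your local machine, the following should set it up:
+mkdir -p ~/.ssh/sockets
+chmod 700 ~/.ssh/sockets    # keep the control sockets private to your user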
+Once your settings are configured, you will be able to log in to the Trixie server with the following command:
+ssh trixie
+Please note that you will be prompted to authenticate with LoginTC and to enter both your PUB and Trixie passwords.
+To configure SSH to automatically connect to the Trixie server, please set the following settings in your Putty application, substituting your username where applicable.
+1. Under Connection -> SSH, set the Remote command to: ssh -A -Y admin.<firstname>.<lastname>@trixie.res.nrc.gc.ca
+2. Under Connection -> SSH -> X11, enable X11 forwarding
+3. Under Session, set the Host Name to <firstname>.<lastname>@pub.nrc-cnrc.gc.ca@trixie.nrc-cnrc.gc.ca and enter a name for the session under Saved Sessions
+4. Click Save
+Once the settings have been saved, you can double click on the name in the list of Saved Sessions to open a session to the Trixie server. Please note that you will be prompted to authenticate with LoginTC and to enter your passwords.
+As an external NRC collaborator, you can access the AI for Design (Trixie) Cluster using the Bastion Host. External collaborators include non-NRC researchers, industrial partners, and vendors.
+You can access only those folders on Trixie that are required for your project. Requests for access to Trixie and specific projects must be made by your NRC research contact; you cannot request access to a system yourself.
+Once granted access, you will have two sets of credentials issued to access the cluster:
+| Account | Purpose | User name format (example: John Doe) |
+| --- | --- | --- |
+| PUB | Provides access to the external bastion host and is used for the LoginTC second-factor authentication | A combination of your first and last name, e.g. john.doe@pub.nrc-cnrc.gc.ca |
+| Trixie System | Provides access to Trixie | admin.firstname.lastname, e.g. admin.john.doe |
Your NRC contact, or an NRC system administrator, will provide you with the PUB and Admin user names and passwords that you require to access the NRC systems. Note that on first login, you will be required to change your password. Please note: during the password change, the first prompt asks for a confirmation of your existing password prior to requesting a new one.
+Before you attempt your first login, the following initial installation and configuration of LoginTC must be implemented.
+In order to access Trixie, you will need to use an SSH client. Please note that you cannot access Trixie using a web browser. On Mac OSX and Linux, SSH is installed by default. On Windows you will need to install Putty if it is not installed already. You can download Putty from the following website:
+https://www.putty.org/
+For Mac OSX and Linux, you can open a new terminal and connect to trixie.nrc-cnrc.gc.ca via ssh using your PUB account and the following command:
+ssh -l <firstname.lastname>@pub.nrc-cnrc.gc.ca trixie.nrc-cnrc.gc.ca
+For Windows, you can create a Putty profile to SSH into the bastion server:
+1. Under Session, set the Host Name to <firstname>.<lastname>@pub.nrc-cnrc.gc.ca@trixie.nrc-cnrc.gc.ca
+2. Enter a name for the session under Saved Sessions
+3. Click Save
+Once the settings have been saved, you can double click on the name in the list of Saved Sessions to open a session to the bastion server.
+When you login for the first time you will be forced to change your password for both your Pub account and your Trixie admin account. Please note that when you do this, you will be prompted for your original (or current) password first and then you will be prompted to enter your new password twice.
+In the following procedure, the information printed in the images may not be the same as what you will see when you login. However the steps will be the same.
+Please perform the following steps to access Trixie.
+1. Connect to the bastion host with your PUB account, using the ssh command above (or your saved Putty session)
+2. Press 1 followed by the Enter key, and then check your LoginTC device as set up above to approve the login request
+3. If a message similar to the one below appears, simply type yes at the prompt as shown below
+4. After you complete the two-factor authentication process in LoginTC you will be prompted to enter your PUB account password and then you will be forced to change your password. You should see a message similar to the one below – remember to enter your original password first and then enter your new password twice.
+5. The system will automatically log you out, so you will need to log in again using your new password
+6. Once you have successfully logged in, you will be on the bastion server – your screen should look similar to the following
+
+7. If you already have your credentials for the trixie.res.nrc.gc.ca server, you can skip this step. Otherwise, you will now need to contact the administrator who provided you with your credentials for the bastion server to obtain your credentials for the Trixie server
+8. You will need to log in to Trixie next. From the bash prompt, use SSH to log in to trixie.res.nrc.gc.ca with your Trixie admin account:
+ssh admin.<firstname.lastname>@trixie.res.nrc.gc.ca
+9. If a message similar to the one below appears, simply type yes at the prompt as shown below
+10. You will be prompted to enter your Trixie admin account password and then you will be forced to change your password. You should see a message similar to the one below – remember to enter your original password first and then enter your new password twice.
+![login3](images/login3.png)
+
+Once you have successfully logged in, you will be logged into Trixie – your screen should look similar to the following
+After successful authentication, you should see the Trixie cluster login banner with terms of use and be placed in a shell in your home directory on the cluster, similar to the image above.
+Note that you will be placed in your home directory which only you have access to. For more information on the cluster and its usage, please see the:
+ +Passwords on the PUB and RES accounts expire after 90 days and must be changed. If you do not change your password, you will be locked out of the system.
+Watch for the pop-up message notifying you to change your password, or set yourself a reminder to change your password before the 90-day expiry.
+If you get locked out of your account due to an expired password for any account, notify your NRC contact who can have the password reset.
+You can change your PUB password by logging into the following website. The site allows you to manage your PUB account. Please use the following format for your username john.doe@pub
Please note that the Reset Password feature will not work if you do not fill in the security questions on the website. Therefore it is strongly recommended that you fill in the security questions so that you can reset your password if necessary.
+4. The system will automatically log you out, so you will need to log in again using your new password
+There may be instances where researchers require connectivity to external HPC systems from Trixie. However, network access to and from Trixie is restricted to maintain a high level of security. Therefore, connections to external systems need to be approved before the connection can be opened.
+This page provides instructions for requesting a connection to an external system, as well as a list of approved systems that already have an open connection.
+In order to submit a request to open a network flow between Trixie and an external HPC system, please post your request in the issues section of this site.
+| Institution | System URL |
+| --- | --- |
+| Compute Canada - Cedar | cedar.computecanada.ca |
+| Compute Canada - Beluga | beluga.computecanada.ca |
+| Compute Canada - Niagara | niagara.computecanada.ca |
+| Compute Canada - Graham | graham.computecanada.ca |
+| Vector Institute | v.vectorinstitute.ca |
+| NERSC.gov - Cori | cori.nersc.gov |
This document will describe various procedures for transferring files to and from Trixie.
+Important Note: For external users, before proceeding with this configuration, please ensure that you have performed the external access setup and advanced configuration procedures.
+The following sections detail how to transfer files between your local computer and Trixie. They basically rely on advanced SSH configurations to bridge the network between your local computer and Trixie.
+To copy a file to the Trixie server, please use the scp command on your local machine.
+Please note that the use of this method requires that your system be configured as detailed in the advanced configuration in order to provide a direct link between your local machine and the Trixie server.
+The following command will copy the file test.txt
from John Doe’s local machine to his admin.john.doe account on Trixie. Please note that using trixie as the hostname will only work if you have configured SSH to use ProxyJump as detailed in the advanced configuration.
scp test.txt trixie:/home/admin.john.doe
To copy a file from Trixie to your local machine, you basically reverse the arguments to the scp command.
+scp trixie:/home/admin.john.doe/test.txt test.txt
To copy an entire directory instead of just a file, please use the -r option (for recursive) with the scp command.
+scp -r myWorkFilesDir trixie:/home/admin.john.doe
The following command will copy the file test.txt
from John Doe’s local machine to his account on Trixie. Please note that the example assumes the username on Trixie is different than the username on the local machine.
scp test.txt doej@trixie.res.nrc.gc.ca:/home/doej
To copy a file from Trixie to your local machine, you basically reverse the arguments to the scp command.
+scp doej@trixie.res.nrc.gc.ca:/home/doej/test.txt test.txt
To copy an entire directory instead of just a file, please use the -r option (for recursive) with the scp command.
+scp -r myWorkFilesDir doej@trixie.res.nrc.gc.ca:/home/doej
To copy a file to the Trixie server, please use the WinSCP command on your local machine.
+If you need to install WinSCP then please download and install it from this site
+First you will need to configure WinSCP to connect to Trixie through an SSH tunnel. Open WinSCP and follow the procedure below.
+1. Click New Site
+2. In the window that pops up, perform the following:
+    1. Set the Host name: trixie.res.nrc.gc.ca
+    2. Set the User name: admin.<firstname>.<lastname>
+       The window should now look similar to the following
+    3. Click the Advanced button
+3. In the window that pops up, perform the following:
+    1. Click the Tunnel item in the left pane
+    2. Set the Host name: trixie.nrc-cnrc.gc.ca (the bastion host)
+    3. Set the User name: <firstname>.<lastname>@pub.nrc-cnrc.gc.ca
+       The window should now look similar to the following
+    4. Click the OK button
+4. Click the Save button in the previous popup window
+5. In the window that pops up, perform the following:
+    1. Type in a Site name - perhaps Trixie
+       The window should now look similar to the following
+    2. Click the OK button
+6. Click the Login button in the previous popup window. You will be prompted to authenticate with LoginTC (you will need to type 1) and both your Pub and Trixie passwords
+7. Once you are logged into your session, you can drag and drop the files you need to transfer between the two file listings
If you need to install WinSCP then please install it from the NRC Software Portal on your desktop.
+First you will need to configure WinSCP to connect to Trixie. Open WinSCP and follow the procedure below to configure it to access Trixie.
+1. Click New Site
+2. In the window that pops up, perform the following:
+    1. Set the Host name: trixie.res.nrc.gc.ca
+    2. Set the User name: <username>
+       The window should now look similar to the following
+    3. Click the Save button
+3. In the window that pops up, perform the following:
+    1. Type in a Site name - perhaps Trixie
+       The window should now look similar to the following
+    2. Click the OK button
+4. Click the Login button in the previous popup window. You will be prompted to authenticate with your Trixie password
+5. Once you are logged into your session, you can drag and drop the files you need to transfer between the two file listings
To copy a file to the Trixie server, please use the pscp command on your local machine.
+Please note that the use of this method requires that you have two Putty profiles defined.
+The bastion server profile was likely created during the setup configuration for your external access to Trixie. If not, then please see the initialize SSH connection section for detailed instructions on creating a profile for the bastion server.
+Follow the procedure below to create the Trixie server profile.
+1. Under Session, set the Host Name to admin.<firstname>.<lastname>@trixie.res.nrc.gc.ca
+2. Enter Trixie-pscp as the Saved Sessions name
+3. Click Save
+Once you have the profiles created and saved, please follow the procedure below to run the pscp command.
+Use the pscp command in the Command Prompt window to copy files to or from the Trixie server using the Trixie-pscp Putty profile:
+1. Copy the file test.txt from John Doe’s local machine to his admin.john.doe account on Trixie:
+pscp test.txt Trixie-pscp:/home/admin.john.doe
+2. To copy a file from Trixie to your local machine, you basically reverse the arguments to the pscp command:
+pscp Trixie-pscp:/home/admin.john.doe/test.txt test.txt
+3. To copy an entire directory instead of just a file, please use the -r option (for recursive) with the pscp command:
+pscp -r myWorkFilesDir Trixie-pscp:/home/admin.john.doe
Please note that the use of this method requires that you have a Putty profile defined to access the Trixie server. Follow the procedure below to create the Trixie server profile.
+1. Under Session, set the Host Name to <username>@trixie.res.nrc.gc.ca
+2. Enter Trixie-pscp as the Saved Sessions name
+3. Click Save
+Once you have the profile created and saved, please follow the procedure below to run the pscp command.
+Use the pscp command in the Command Prompt window to copy files to or from the Trixie server using the Trixie-pscp Putty profile:
+1. Copy the file test.txt from John Doe’s local machine to his doej account on Trixie:
+pscp test.txt Trixie-pscp:/home/doej
+2. To copy a file from Trixie to your local machine, you basically reverse the arguments to the pscp command:
+pscp Trixie-pscp:/home/doej/test.txt test.txt
+3. To copy an entire directory instead of just a file, please use the -r option (for recursive) with the pscp command:
+pscp -r myWorkFilesDir Trixie-pscp:/home/doej
The procedures in this section assume that the advanced SSH configurations discussed above have been implemented. There are three options for copying files between Trixie and another HPC cluster
+This procedure requires that there is an approved network flow open between Trixie and the second HPC cluster. Please see the external HPC systems page for a list of approved external HPC systems. If there is an approved network flow, then files can be directly copied between Trixie and the second HPC cluster. This is the ideal situation and should be the fastest option in terms of overall network speed between the two systems.
+To copy a file from the second HPC cluster to Trixie, use the following scp command on the Trixie server.
+scp username@cluster.domain:/home/username/test.txt test.txt
To copy a file from Trixie to the second HPC cluster, you basically reverse the arguments to the scp command.
+scp test.txt username@cluster.domain:/home/username/test.txt
To copy an entire directory instead of just a file, please use the -r option (for recursive) with the scp command.
+scp -r myWorkFilesDir username@cluster.domain:/home/username/folder
This procedure requires that you have an external account setup to access Trixie. If this is the case, then files can be copied between Trixie and the second HPC cluster via the Bastion Host, but without flowing through your local computer. To use this approach, you will need to login to the second HPC cluster first, and then from the second HPC cluster computer, login to Trixie through the Bastion host.
+To copy a file from the second HPC cluster to Trixie, use the following scp command on the Trixie server.
+scp username@cluster.domain:/home/username/test.txt test.txt
To copy a file from Trixie to the second HPC cluster, you basically reverse the arguments to the scp command.
+scp test.txt username@cluster.domain:/home/username/test.txt
To copy an entire directory instead of just a file, please use the -r option (for recursive) with the scp command.
+scp -r myWorkFilesDir username@cluster.domain:/home/username/folder
This procedure requires that you copy files between the two clusters using your local computer as a bridge. The commands below should be executed on your local computer and not either of the cluster servers.
+To copy a file from the second HPC cluster to Trixie, use the following scp command on your local computer.
+scp username@cluster.domain:/home/username/test.txt trixie:/home/admin.john.doe/test.txt
To copy a file from Trixie to the second HPC cluster, you basically reverse the arguments to the scp command.
+scp trixie:/home/admin.john.doe/test.txt username@cluster.domain:/home/username/test.txt
To copy an entire directory instead of just a file, please use the -r option (for recursive) with the scp command.
+scp -r trixie:/home/admin.john.doe/myWorkFilesDir username@cluster.domain:/home/username/folder
Project folders have been created for users to use for a couple of purposes:
+Please note that users should be diligent and remove any files and folders (in both the project folder and your home folder) once they are no longer required. This helps to optimize disk usage and avoid disk space issues for all users, not just your own usage.
+The project folder can be found under the following folder hierarchy
+/gpfs/projects/<project-group>/<project>
Where project-group is the name of your project group – for example, AI4D or COVID - and project is the name of your project – for example, core-01 or bio-01.
+To copy files to a project folder you should create a personal folder under the project directory and then copy files from your home directory to the new folder. In the example below user John Doe will copy two dataset files to the AI4D/bio-01 project folder.
+1. Change to the project folder:
+cd /gpfs/projects/AI4D/bio-01
+2. Create the new folder using a unique name, perhaps your last name and first initial:
+mkdir doej
+3. Change back to your home directory:
+cd
+4. Copy the files to your new project directory:
+cp dataset1.dat dataset2.dat /gpfs/projects/AI4D/bio-01/doej
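+If other members of the project need to read the files you copied in, you may also have to loosen the group permissions. A sketch, assuming the folder's group ownership is inherited from the project tree:
+# Give the group read access to files and traversal access to directories
+chmod -R g+rX /gpfs/projects/AI4D/bio-01/doej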
+Trixie is a GPU cluster consisting of 36 nodes, each with 4 NVIDIA V100 GPUs, a fast InfiniBand interconnect, and a large 1 PB global filesystem.
+Runs RHEL 9
+https://slurm.schedmd.com
+slurm 22.05.9
+(For example run scripts on Trixie, see Running-jobs.)
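+To see the available partitions (such as TrixieMain) and the state of the nodes, the standard Slurm query commands work as usual:
+sinfo         # partitions, node counts, and node states
+sinfo -N -l   # one line per node with CPU, memory, and state details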
This example will show you how to set up and prepare an environment for PyTorch jobs using conda on Trixie:
+Either run from the command line or create pytorchconda-environment.sh and run it:
+#!/bin/bash
+# load the miniconda module
+module load miniconda3-4.8.2-gcc-9.2.0-sbqd2xu
+# create a conda environment with python 3.7 named pytorch
+conda create --name pytorch python=3.7
+source activate pytorch
+# install pytorch dependencies via conda
+conda install pytorch==1.7.1 torchvision==0.8.2 cudatoolkit=10.1 -c pytorch
+Next, create a test file testtorch.py (referenced in the job script below) with the following contents:
+import torch
+print('GPU available:', torch.cuda.is_available())
+Then create the job submission script testpytorch.sh:
+#!/bin/bash
+
+# Specify the partition of the cluster to run on (Typically TrixieMain)
+#SBATCH --partition=TrixieMain
+# Add your project account code using -A or --account
+#SBATCH --account ai4d
+# Specify the time allocated to the job. Max 12 hours on TrixieMain queue.
+#SBATCH --time=12:00:00
+# Request GPUs for the job. In this case 4 GPUs
+#SBATCH --gres=gpu:4
+# Print out the hostname that the jobs is running on
+hostname
+# Run nvidia-smi to ensure that the job sees the GPUs
+/usr/bin/nvidia-smi
+
+# Load the miniconda module on the compute node
+module load miniconda3-4.8.2-gcc-9.2.0-sbqd2xu
+# Activate the conda pytorch environment created in step 1
+source activate pytorch
+# Launch our test pytorch python file
+python testtorch.py
+
+sbatch testpytorch.sh
+
+Output will be 'Submitted batch job XXXXX'
+Local directory will contain a file 'slurm-XXXXX.out' which is the output of the job (stdout).
+Output should be:
+cnXXX - <nodename>
+<Date>
++--------
+| NVIDIA-SMI XXXX...
+....
+(4 listed V100 GPUs number 0 to 3)
+
+GPU available: True
+
+ The Trixie head node can be accessed via ssh on NRC Black & NRC Orange
+The Trixie head node has outbound ssh access to a limited number of external sites (e.g. some Canadian Universities). If you require access to an additional site which is not currently available, create a request via https://github.com/ai4d-iasc/trixie/issues
+Trixie uses the Slurm scheduler to manage jobs. Compute Canada has a very good guide for using Slurm to submit jobs to a cluster, most of which is applicable to Trixie: https://docs.computecanada.ca/wiki/Running_jobs
+Here is a simple job which runs the python code hello.py
+Contents of hello.py
+print('Hello world')
+
+Contents of hello-job.sh
+#!/bin/bash
+#SBATCH -J helloworld
+
+module load miniconda3-4.8.2-gcc-9.2.0-sbqd2xu
+srun python ~/hello.py
+
+Submit job:
+sbatch ./hello-job.sh
+
+Output will be located in slurm-<jobid>.out
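+After submitting, you can monitor or cancel jobs with the usual Slurm commands (the job ID below is a placeholder):
+squeue -u $USER    # list your pending and running jobs
+scancel <jobid>    # cancel a job if needed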
In order for a job to run on Trixie, it must be "billed" against an approved project. Users are able to charge different projects depending on what their activity is for.
+See here for the Account Codes
+The key thing to know is that srun is like a super-ssh: when you run srun cmd, it effectively does something like ssh node cmd on each allocated node.
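+A quick way to see this behaviour is to run a trivial command in a small interactive allocation; each task prints its own node's hostname, much as looping ssh node hostname would (a sketch, assuming interactive allocations are permitted):
+salloc --nodes=2 --ntasks-per-node=1    # request an interactive two-node allocation
+srun hostname                           # runs hostname once per task, one per node
+exit                                    # release the allocation
+The top-level submission script for the multi-node PyTorch job follows.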
#!/bin/bash
+
+#SBATCH --partition=TrixieMain
+#SBATCH --account=dt-mtp
+#SBATCH --time=00:20:00
+#SBATCH --job-name=pytorch.distributed
+#SBATCH --comment="Helping Harry with pytorch distributed on multiple nodes."
+#SBATCH --gres=gpu:4
+##SBATCH --ntasks=2
+
+#SBATCH --wait-all-nodes=1
+#SBATCH --nodes=2
+#SBATCH --ntasks-per-node=1
+#SBATCH --cpus-per-task=6
+#SBATCH --exclusive
+#SBATCH --output=%x-%j.out
+
+
+# USEFUL Bookmarks
+# [Run PyTorch Data Parallel training on ParallelCluster](https://www.hpcworkshops.com/08-ml-on-parallelcluster/03-distributed-data-parallel.html)
+# [slurm SBATCH - Multiple Nodes, Same SLURMD_NODENAME](https://stackoverflow.com/a/51356947)
+
+readonly MASTER_ADDR_JOB=$SLURMD_NODENAME
+readonly MASTER_PORT_JOB="12234"
+export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
+
+readonly srun='srun --output=%x-%j.%t.out'
+
+env
+
+$srun bash \
+ task.sh \
+ $MASTER_ADDR_JOB \
+ $MASTER_PORT_JOB &
+
+wait
+
+This script will be executed on each node. Note that we are activating the conda environment in this script so that each node/worker has the proper environment.
#!/bin/bash
+
+# USEFUL Bookmarks
+# [Run PyTorch Data Parallel training on ParallelCluster](https://www.hpcworkshops.com/08-ml-on-parallelcluster/03-distributed-data-parallel.html)
+# [slurm SBATCH - Multiple Nodes, Same SLURMD_NODENAME](https://stackoverflow.com/a/51356947)
+
+#module load miniconda3-4.7.12.1-gcc-9.2.0-j2idqxp
+#source activate molecule
+
+source /gpfs/projects/DT/mtp/WMT20/opt/miniconda3/bin/activate
+conda activate pytorch-1.7.1
+
+readonly MASTER_ADDR_JOB=$1
+readonly MASTER_PORT_JOB=$2
+
+export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
+
+env
+
+python \
+ -m torch.distributed.launch \
+ --nproc_per_node=4 \
+ --nnodes=$SLURM_NTASKS \
+ --node_rank=$SLURM_NODEID \
+ --master_addr=$MASTER_ADDR_JOB \
+ --master_port=$MASTER_PORT_JOB \
+ main.py \
+ --batch_size 128 \
+ --learning_rate 5e-5 &
+
+wait
+ It is possible to retrieve data from automatically generated filesystem backups. The IBM Spectrum Scale (GPFS) system includes the ability to create filesystem snapshots which create temporary backups of data stored in the filesystem. These snapshots can be accessed to retrieve files that may have been accidentally removed.
+Please see the GPFS Snapshot document for more details on how to retrieve files from the backups.
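+GPFS typically exposes snapshots through a hidden .snapshots directory at the root of the fileset. The exact path on Trixie is an assumption here, so treat the following as a sketch and consult the GPFS Snapshot document for the authoritative location:
+ls /gpfs/projects/.snapshots    # list available snapshots (path is an assumption)
+cp /gpfs/projects/.snapshots/<snapshot>/AI4D/bio-01/doej/dataset1.dat ~    # restore one file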
+Thursday, October 17, 2024 - The Trixie cluster will be shut down because of a planned electrical outage that will allow RPPM to commission the new emergency power generator.
+Start date: Thursday, October 17, 6:00 AM EDT
+If you have any further questions, do not hesitate to contact us at your earliest convenience (rps-spr@nrc-cnrc.gc.ca).
+Tuesday, October 22nd, 2024 - As a reminder, LoginTC is used as a second authentication service for:
+Bastion host for external access to Trixie High Performance Computing Clusters
+A maintenance period is required to perform system upgrades. Therefore, the LoginTC service will be unavailable on Tuesday, October 22nd, from 3 PM to 5 PM EDT.
Consequently, you will not be able to access the service for which LoginTC provides authentication.
+Internal access to Trixie will still be available during this time.
+If you have any questions regarding this maintenance, do not hesitate to communicate with us (rps-spr@nrc-cnrc.gc.ca)