
Commit

richelbilderbeek committed May 6, 2024
2 parents f333faa + 1a1674d commit 5cf9bc8
Showing 5 changed files with 120 additions and 35 deletions.
79 changes: 79 additions & 0 deletions docs/cluster_guides/screen.md
@@ -0,0 +1,79 @@
# Running a detachable screen process in a job

When you run the `interactive` command, you get a command prompt in the _screen_ program.
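For reference, a typical invocation looks something like this (the project name is a placeholder):
```bash
interactive -A project_ID -n 1 -t 1:00:00
```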

!!! warning
When running the screen program in other environments, you can detach from your screen and later reattach to it. Within the environment of the `interactive` command, you lose this ability: Your job is terminated when you detach. (This is a design decision and not a bug.)

If you want the best of both worlds, i.e. to be able to detach from and reattach to your screen session within a job, you need to start the job in some other way and start your screen session from a separate ssh login. Here is an example of how to do this:
```bash
$ salloc -A project_ID -t 15:00 -n 1 --qos=short --bell --no-shell
salloc: Pending job allocation 46964140
salloc: job 46964140 queued and waiting for resources
salloc: job 46964140 has been allocated resources
salloc: Granted job allocation 46964140
salloc: Waiting for resource configuration
salloc: Nodes r174 are ready for job
```
Check the queue manager for the allocated node. In the example below, one core was allocated on the `r174` compute node.
```bash
$ squeue -j 46964140
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
46964140 core no-shell user R 0:44 1 r174
```
You can start an `xterm` terminal on the allocated node like this:
```bash
$ xterm -e ssh -AX r174 &
```
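If you do not need a separate terminal window, you can also log in to the allocated node directly from your current shell:
```bash
ssh -AX r174
```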

The `salloc` command gives you a job allocation, in this example one core for 15 minutes (the `--no-shell` option is important here). Alternatively, you can log in to any node of any of your running jobs, e.g. one started with the `sbatch` command.
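For example, a minimal `sbatch` job that does nothing but keep its allocation alive, so that you can ssh to its node and run screen there, could look something like this (project name, time and file name are placeholders):
```bash
#!/bin/bash -l
#SBATCH -A project_ID
#SBATCH -n 1
#SBATCH -t 15:00
# Keep the allocation alive so you can ssh to the node and run screen there.
sleep infinity
```
Submit it with `sbatch keep_alive.sh` and proceed as below.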

In both cases you get a job number, and from that you can find out the node name, in this example `r174`.

When you log in to the node with the `ssh` command, start the screen program:

```bash
$ screen
```
When you detach from the screen program, e.g. with the detach command (press `Ctrl-a` followed by `d`), you can later reattach to your screen session, either in the same ssh session or in another one:
```bash
$ screen -r
```
When your job has terminated, you can neither reattach to your screen session nor log in to the node.

The screen session of the `interactive` command is integrated into your job, so e.g. all environment variables for the job are correctly set. For a separate ssh session, as in this example, that is not the case.
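If something you run in such a separate ssh session needs the job's Slurm environment variables, you can set the most important ones by hand; a minimal sketch, using the job ID and node from the example above:
```bash
# These are set automatically inside the job, but not in a plain ssh session on the node.
export SLURM_JOB_ID=46964140
export SLURM_JOB_NODELIST=r174
```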

Please note that it is the job allocation that determines your core hour usage and not your ssh or screen sessions.

## Tips

- Start a new screen session with a command:
```
screen -dm your_command
```
This starts a new, already detached screen session and runs the command inside it.

- If you want to run multiple commands, you can do so like this:
```
screen -dm bash -c "command1; command2"
```
This will run `command1` and `command2` in order.

- To reattach to the screen session, use:
```
screen -r
```
If you have multiple sessions, you'll need to specify the session ID, as shown in the example after this list.

- To list your current screen sessions, use:
```
screen -ls
```
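If `screen -ls` lists more than one session, pass the session ID from that listing to `screen -r` (the ID below is just an example):
```bash
# The listing from screen -ls shows IDs like 12345.pts-0.r174
screen -r 12345.pts-0.r174
```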

Please note that when a program terminates, `screen` (by default) kills the window that contained it. If you don't want your session to get killed after the script is finished, add `exec sh` at the end. For example:
```
screen -dm bash -c 'your_command; exec sh'
```
This will keep the screen session alive after `your_command` has finished executing.

YouTube: [How to use GNU SCREEN - the Terminal Multiplexer](https://www.youtube.com/watch?v=I4xVn6Io5Nw)
@@ -1,44 +1,46 @@
# GAMESS-US user guide

<https://www.uppmax.uu.se/support/user-guides/gamess-us-user-guide/>

GAMESS-US version 20170930 is installed on Rackham. Newer versions can be installed on request to UPPMAX support. Snowy currently lacks GAMESS-US.

## Running GAMESS

Load the module using
```bash
module load gamess/20170930
```
Below is an example submit script for Rackham, running on 40 cores (2 nodes with 20 cores each). It is essential to specify the project name:

```slurm
#!/bin/bash -l
#SBATCH -J jobname
#SBATCH -p node -n 40
#SBATCH -A PROJECT
#SBATCH -t 03:00:00
module load gamess/20170930
rungms gms >gms.out
```
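Submit the script with `sbatch`; the file name below is just an example:
```bash
sbatch gamess_job.sh
```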

## Memory specification
GAMESS uses two kinds of memory: replicated memory and distributed memory. Both kinds of memory should be given in the $SYSTEM specification. Replicated memory is specified using the MWORDS keyword and distributed memory with the MEMDDI keyword. It is very important that you understand the uses of these keywords. Check the GAMESS documentation for further information.

If your job requires 16 MW (mega-words) of replicated memory and 800 MW of distributed memory, as in the worked example below, the memory requirement per CPU core varies as 16 + 800/N, where N is the number of cores. Each word is 8 bytes of memory, so the amount of memory per core, in megabytes, is (16 + 800/N) * 8. The amount of memory per node depends on the number of cores per node: Rackham has 20 cores per node; most nodes have 128 GB of memory, but 30 nodes have 512 GB and 4 nodes have 1 TB.
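As a rough worked example with the numbers above on 40 cores: in the `$SYSTEM` group this would be requested with something like `MWORDS=16 MEMDDI=800` (check the GAMESS documentation for the exact input format), and the per-core and per-node memory come out as:
```bash
# (MWORDS + MEMDDI/N) mega-words * 8 bytes per word, for N = 40 cores
echo "$(( (16 + 800/40) * 8 )) MB per core"               # 288 MB per core
echo "$(( 20 * (16 + 800/40) * 8 )) MB per 20-core node"  # 5760 MB per node
```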


## Communication
For intra-node communication, shared memory is used. For inter-node communication, MPI is used over the InfiniBand interconnect.

## Citing GAMESS papers
It is essential that you read the GAMESS manual thoroughly to properly reference the papers specified in the instructions. All publications using GAMESS should cite at least the following paper:

```bibtex
@Article{GAMESS,
author={M.W.Schmidt and K.K.Baldridge and J.A.Boatz and S.T.Elbert and
M.S.Gordon and J.J.Jensen and S.Koseki and N.Matsunaga and
K.A.Nguyen and S.Su and T.L.Windus and M.Dupuis and J.A.Montgomery},
journal={J.~Comput.~Chem.},
volume=14,
pages={1347},
year=1993,
comment={The GAMESS program}}
```
If you need to obtain GAMESS yourself, please visit the GAMESS website for further instructions.
File renamed without changes
@@ -1,16 +1,15 @@
# NVIDIA Deep Learning Frameworks
https://www.uppmax.uu.se/support/user-guides/nvidia-deep-learning-frameworks/

Here is how easily one can use an NVIDIA [environment](https://docs.nvidia.com/deeplearning/frameworks/pytorch-release-notes/rel_22-03.html) for deep learning, with all the following tools preset. A screenshot of that page is shown below.

![web screenshot](./img/pytorch-nvidia.png)

First, pull the container (6.5 GB).
```bash
singularity pull docker://nvcr.io/nvidia/pytorch:22.03-py3
```
Get an interactive shell.

```bash
singularity shell --nv ~/external_1TB/tmp/pytorch_22.03-py3.sif

Singularity> python3
...
True
# test torch
>>> torch.zeros(1).to('cuda')
tensor([0.], device='cuda:0')

```
From the container shell, check what else is available...

```bash
Singularity> nvcc -V
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2022 NVIDIA Corporation
...
Singularity> jupyter-lab
[I 13:35:46.616 LabApp] http://hostname:8888/?token=d6e865a937e527ff5bbccfb3f150480b76566f47eb3808b1
[I 13:35:46.616 LabApp] Use Control-C to stop this server and shut down all kernels (twice to skip confirmation).
...

```
You can use this container as a base to add more packages.

```singularity
Bootstrap: docker
From: nvcr.io/nvidia/pytorch:22.03-py3
...
```
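Then build an image from that definition file, for example like this (file names are placeholders; depending on your setup you may need `--fakeroot`, a remote build, or a machine where you have root):
```bash
singularity build --fakeroot my_pytorch.sif my_pytorch.def
```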

Just keep in mind that "upgrading" the built-in torch package might install a version that is compatible with fewer GPU architectures, and it might no longer work on your hardware.

```bash
Singularity> python3 -c "import torch; print(torch.__version__); print(torch.cuda.is_available()); print(torch.cuda.get_arch_list()); torch.zeros(1).to('cuda')"

1.10.0+cu102
True
['sm_37', 'sm_50', 'sm_60', 'sm_70']
NVIDIA A100-PCIE-40GB with CUDA capability sm_80 is not compatible with the current PyTorch installation.
The current PyTorch install supports CUDA capabilities sm_37 sm_50 sm_60 sm_70.
```
3 changes: 3 additions & 0 deletions mkdocs.yml
@@ -33,6 +33,7 @@ nav:
- Checking and optimizing jobs: cluster_guides/optimizing_jobs.md
- Runtime tips: cluster_guides/running_jobs/runtime_tips.md
- Storage and compression: cluster_guides/running_jobs/storage_compression.md
- Screen: cluster_guides/screen.md
- Data transfer:
- Transfer to/from Bianca: cluster_guides/transfer_bianca.md
- Migrate to Dardel: cluster_guides/dardel_migration.md
@@ -49,6 +50,7 @@
- Software-specific documentation:
- Whisper: software/whisper.md
- Chemistry/physics:
- GAMESS_US: software/games_us.md
- Gaussian: software/gaussian.md
- GROMACS: software/gromacs.md
- Molcas: software/openmolcas.md
@@ -65,6 +67,7 @@
- MetONTIIME: software/metontiime.md
- Tracer: software/tracer.md
- Machine Learning:
- NVIDIA DLF: software/nvidia-deep-learning-frameworks.md
- TensorFlow: software/tensorflow.md
- Programming languages:
- Julia: software/julia.md
