Fix markdown style issues
richelbilderbeek committed May 15, 2024
1 parent 89f98dd commit 9901571
Showing 6 changed files with 42 additions and 19 deletions.
5 changes: 3 additions & 2 deletions docs/cluster_guides/old/disk_quota_more.md
@@ -6,12 +6,13 @@ You can display your current usage with the command 'uquota'.

When you exceed your quota, the system will not let you write any more data, and you have to either remove some files or request more quota. The `uquota` command will also show the date when, and the limit to which, your quota will change, if you have been granted a larger quota.
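
For example (a minimal sketch; the exact output format varies between clusters):

```bash
# Show current usage and quota limits, including any scheduled quota change
uquota
```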

-To get more quota, send a mail to support ([email protected]) and state how much, for how long time, and why you need it. See the storage project application page for more information on how we handle and prioritise storage requests.
+To get more quota, send a mail to support (`[email protected]`) and state how much, for how long, and why you need it. See the storage project application page for more information on how we handle and prioritise storage requests.

Here are two commands. The first produces a list of subdirectories, ordered by size and proportion of the total size. The second produces a list of the files in the current directory that take up the most space. These may take a long time to complete; press `Ctrl-C` to stop the process if you change your mind.

-```
+```bash
# List subdirectories by size, with each entry's share of the total
du -b $PWD | sort -rn | awk 'NR==1 {ALL=$1} {print int($1*100/ALL) "% " $0}'
# List regular files by size (note: -type f must precede -print0 so only regular files are printed)
find $PWD -type f -print0 | xargs -0 stat -c "%s %n" | sort -rn
```

You should also read the Disk Storage Guide. **LINK**
1 change: 1 addition & 0 deletions docs/cluster_guides/old/webexport.md
@@ -3,6 +3,7 @@
In IG: Support User guides Webexport guide

## Rackham

You can enable webexport by creating a publicly readable folder called **webexport** in your project directory (`/proj/[project id]`). The contents of that folder will be accessible through `https://export.uppmax.uu.se/[project id]/`.

A publicly readable folder has the execute permission set for "other" users. Run the command `chmod o+x webexport` to ensure that the webexport directory has the correct permissions.
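
A minimal sketch, assuming a hypothetical project id `snic2023-x-001`:

```bash
# Create the webexport folder in the project directory and
# give "other" users execute permission on it
mkdir /proj/snic2023-x-001/webexport
chmod o+x /proj/snic2023-x-001/webexport
# Files placed in it become reachable under
# https://export.uppmax.uu.se/snic2023-x-001/
```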
11 changes: 3 additions & 8 deletions docs/cluster_guides/running_jobs/interactive_more.md
@@ -1,9 +1,5 @@
# How to run interactively on a compute node?

-???- info "For UPPMAX staff"
-
-    TODO: InfoGlue link: `https://www.uppmax.uu.se/support/faq/running-jobs-faq/how-can-i-run-interactively-on-a-compute-node/`
-
You may want to run an interactive application on one or several compute nodes, or use one or several compute nodes interactively as a development workbench. How can this be arranged?
The program `interactive` may be what you are looking for.

@@ -13,13 +9,13 @@ The one parameter you must always specify is the project name. Let's assume for

To get one compute core with the proportional amount of RAM, we recommend the simplest form of the command, run on the login node of the cluster you want to use:

-```
+```bash
interactive -A p2010099
```

If you need more than one core, or special features on your node, you can specify that to the `interactive` command, e.g. on milou:

-```
+```bash
interactive -A p2010099 -n 16 -C fat
```

@@ -36,11 +32,10 @@ In the last example ("interactive -A p2010099 -n 16 -C fat"), the interactive co

If you also want to run for 15 hours, you can say so with the command

-```
+```bash
interactive -A p2010099 -n 16 -C fat -t 15:00:00
```

but then no "priority lane" can be used: you get your normal queue priority, and you might have to wait for a very long time for your session to start. Please note that you need to keep watch on when the job starts, because you are charged for all the time from job start, even if you are sleeping, and because an allocated but unused node is a waste of expensive resources.


N.B. You cannot launch an interactive job from another cluster with the flag `-M`, which otherwise is a common flag for other SLURM commands. You must launch it from a login node of the cluster you want to use.
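
For example (hypothetical user name; Rackham as the target cluster):

```bash
# Log in to the login node of the cluster you want to use ...
ssh username@rackham.uppmax.uu.se
# ... and start the interactive session from there
interactive -A p2010099

# This will NOT work: interactive does not accept -M
# interactive -M rackham -A p2010099
```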
27 changes: 25 additions & 2 deletions docs/cluster_guides/running_jobs/nodes_own_disk.md
@@ -10,7 +10,24 @@ Parallel network file systems are very fast when accessed from many nodes, but c

For this reason, jobs that perform a lot of file accesses, especially on temporary files, should use the compute node's local hard drive. If you do, then any slow-down due to file I/O is limited to the node(s) on which these jobs are running.

-The hard drive of the node is located at /scratch, and each job that runs on a node gets a folder created automatically with the same name as the jobid, /scratch/<jobid>. This folder name is also stored in the environment variable $SNIC_TMP for ease of use. The idea is that you copy files that you will be reading randomly, such as indices and databases but not files of reads, to $SNIC_TMP first thing in the job. Files that you read as a stream from beginning to end, like files of reads, should remain in project storage and read from there. You then run your analysis and have all the output files written to $SNIC_TMP as well. After the analysis is done, you copy back all the output files you want to keep to your project storage folder. Everything in /scratch/<jobid> will be deleted as soon as the job is finished, and you have no hope of recovering it after the job is completed.
+The hard drive of the node is located at `/scratch`,
+and each job that runs on a node gets a folder created automatically
+with the same name as the jobid, `/scratch/<jobid>`.
+This folder name is also stored in the environment variable `$SNIC_TMP`
+for ease of use.
+The idea is that you copy files that you will be reading randomly,
+such as indices and databases but not files of reads,
+to `$SNIC_TMP` first thing in the job.
+Files that you read as a stream from beginning to end, like files of reads,
+should remain in project storage and be read from there.
+You then run your analysis and
+have all the output files written to `$SNIC_TMP` as well.
+After the analysis is done,
+you copy back all the output files you want to keep
+to your project storage folder.
+Everything in `/scratch/<jobid>` will be deleted
+as soon as the job is finished,
+and you have no hope of recovering it after the job is completed.

An example would be a script that runs bwa to align reads. Usually it looks something like this:
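
The script itself is collapsed in this diff view. As a rough, hypothetical sketch of the pattern described above (directory layout and file names invented; the project id placeholder is taken from the `cp` line visible in the next hunk):

```bash
#!/bin/bash
#SBATCH -A snic2022-X-YYY      # hypothetical project id
#SBATCH -p core -n 8
#SBATCH -t 08:00:00

# Load the tools (module names as on UPPMAX)
module load bioinfo-tools bwa samtools

# Copy the randomly accessed index files to node-local scratch
cp /proj/snic2022-X-YYY/nobackup/ref/genome.fa* $SNIC_TMP/

# Stream the reads from project storage; write output to node-local scratch
bwa mem -t 8 $SNIC_TMP/genome.fa \
    /proj/snic2022-X-YYY/raw/sample_R1.fastq.gz \
    /proj/snic2022-X-YYY/raw/sample_R2.fastq.gz |
    samtools sort -o $SNIC_TMP/sample.bam -

# Copy the results you want to keep back to project storage
cp $SNIC_TMP/sample.bam /proj/snic2022-X-YYY/nobackup/results/
```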

@@ -62,4 +79,10 @@ cp $SNIC_TMP/sample.bam /proj/snic2022-X-YYY/nobackup/results/

It's not harder than that. This way, the index files are copied to `$SNIC_TMP` in a single operation, which is much less taxing on the file system than many small random reads and writes. The network file system is used when gathering reads for alignment, and streaming reads are easy for that file system. When the alignment is finished, the results are copied back to the project directory so that they can be used in other analyses.

-One problem that can happen is if your files and the results are too large for the node's hard drive. The drive is 2TiB on Rackham and 4TiB on Bianca, so it is unusual for the hard drive to be too small for the results of such analyses. If you run into this problem, please email UPPMAX at [email protected] and we can look into the problem.
+One problem that can happen is if your files and the results are too large
+for the node's hard drive.
+The drive is 2TiB on Rackham and 4TiB on Bianca,
+so it is unusual for the hard drive to be too small
+for the results of such analyses.
+If you run into this problem,
+please email UPPMAX at `[email protected]` and we can look into the problem.
2 changes: 1 addition & 1 deletion docs/cluster_guides/running_jobs/storage_compression.md
@@ -21,8 +21,8 @@

- You might also be interested in our [disk storage guide](../storage/disk_storage_guide.md).


## Compression

???- question "File compression guide"

[Compression guide](../storage/compress_guide.md)
15 changes: 9 additions & 6 deletions docs/cluster_guides/storage/compress_fastQ.md
@@ -3,9 +3,10 @@
**Short answer: The best compression using a widely available format is provided by bzip2 and its parallel equivalent pbzip2. The best compression ratio for FastQ is provided by fqz_comp in the fqzcomp/4.6 module. However, this tool is experimental and is not recommended for general, everyday use.**

## Long answer

We conducted an informal examination of two specialty FastQ compression tools by recompressing an existing fastq.gz file. The first tool, fqzcomp (available in the module fqzcomp/4.6), uses a compiled executable (fqz_comp) that works much like e.g. gzip, while the second tool (LFQC, in the module lfqc/1.1) uses separate Ruby scripts for compression (lfqc.rb) and decompression (lfqcd.rb). The LFQC scripts do not appear to accept piping, but the documentation is limited.

-```
+```bash
module load bioinfo-tools
module load fqzcomp/4.6
module load lfqc/1.1
@@ -17,7 +18,7 @@ One thing changed from the 'standard' implementation of LFQC was to make the scr

Since piping is not available with LFQC, it is preferable to avoid creating a large intermediate decompressed FastQ file. So, create a named pipe using `mkfifo` that is named like a FastQ file.

-```
+```bash
# Create a named pipe that looks like an uncompressed FastQ file
mkfifo UME_081102_P05_WF03.se.fastq
# Feed the decompressed stream into the pipe in the background
zcat UME_081102_P05_WF03.se.fastq.gz > UME_081102_P05_WF03.se.fastq &
# Compress by reading from the named pipe
lfqc.rb UME_081102_P05_WF03.se.fastq
@@ -28,13 +29,13 @@ This took a long time, 310 wall seconds.

Next, fqz_comp from fqzcomp/4.6. Since it works like gzip, just use it in a pipe.

-```
+```bash
zcat UME_081102_P05_WF03.se.fastq.gz | fqz_comp > UME_081102_P05_WF03.se.fastq.fqz
```

This used a little multithreading (up to about 150% CPU) and was much faster than LFQC, just 2-3 seconds. There are other compression options (we tried `-s1` and `-s9+`) but these did not outperform the default (equivalent to `-s3`). This is not necessarily a surprise; stronger compression means attempting to make better guesses, and sometimes these guesses are not correct. No speedup/slowdown was noticed with the other settings, but the input file was relatively small.

-```
+```console
-rw-rw-r-- 1 28635466 Mar 10 12:53 UME_081102_P05_WF03.se.fastq.fqz1
-rw-rw-r-- 1 29271063 Mar 10 12:52 UME_081102_P05_WF03.se.fastq.fqz9+
-rw-rw-r-- 1 46156932 Jun 6 2015 UME_081102_P05_WF03.se.fastq.gz
@@ -44,7 +45,7 @@ This used a little multithreading (up to about 150% CPU) and was much faster tha

We also compared against bzip2 and xz, which are general-use compressors. These both function like gzip (and thus like fqz_comp) and both outperform gzip, as expected. xz is becoming a more widely used general compressor like bzip2, but for this file it required perhaps 20x as much time as bzip2 and did worse.

-```
+```console
-rw-rw-r-- 1 35664555 Mar 10 13:10 UME_081102_P05_WF03.se.fastq.bz2
-rw-rw-r-- 1 36315260 Mar 10 13:10 UME_081102_P05_WF03.se.fastq.xz
```
@@ -53,16 +54,18 @@


### Which is the best method in this trial?

From the results of this trial, the tool that provides the best compression ratio in a reasonable amount of time is fqz_comp in the fqzcomp/4.6 module. It is as fast as bzip2, which itself is much better than gzip, but does a much better job of compressing FastQ. However, fqz_comp is experimental, so we do not recommend it for everyday use. We recommend bzip2 or its parallel equivalent, pbzip2.
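
For example, a minimal sketch of the recommended route (file name hypothetical; `-p` sets the number of pbzip2 threads):

```bash
# Compress in place with bzip2 ...
bzip2 sample.fastq
# ... or in parallel with pbzip2, here using 8 threads
pbzip2 -p8 sample.fastq
```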

The fqz_comp executable could be used to decompress FastQ within a named pipe if FastQ is required for input:

-```
+```text
... <(fqz_comp -d < file.fastq.fqz) ...
```
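
For instance, a hypothetical concrete use: counting lines without writing an intermediate FastQ file (bash process substitution; file name invented):

```bash
wc -l <(fqz_comp -d < file.fastq.fqz)
```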

Note that fqz_comp is designed to compress FastQ files alone, and neither method here provides the blocked compression format suitable for random access that bgzip does; see [which-compression-format-should-i-use-for-ngs-related-files](../storage/compress_format.md) for more on that subject.


### Why not LFQC?

Though LFQC has the best compression of FastQ, there are some strong disadvantages. First, it takes quite a long time, perhaps 50x longer than fqz_comp. Second, it apparently cannot be used easily within a pipe, as many other compressors can. Third, it consists of multiple scripts with multiple auxiliary programs, rather than a single executable. Fourth, it is quite verbose during operation, which can be helpful but cannot be turned off. Finally, it was difficult to track down for installation: two different links were provided in the publications and neither worked. It was finally found in a GitHub repository, the location of which is provided in the module help.
