From 990157152147d4720de722831b923a95a56dffb7 Mon Sep 17 00:00:00 2001
From: richelbilderbeek
Date: Wed, 15 May 2024 14:20:17 +0200
Subject: [PATCH] Fix markdown style issues

---
 docs/cluster_guides/old/disk_quota_more.md    |  5 ++--
 docs/cluster_guides/old/webexport.md          |  1 +
 .../running_jobs/interactive_more.md          | 11 +++-----
 .../running_jobs/nodes_own_disk.md            | 27 +++++++++++++++++--
 .../running_jobs/storage_compression.md       |  2 +-
 docs/cluster_guides/storage/compress_fastQ.md | 15 ++++++-----
 6 files changed, 42 insertions(+), 19 deletions(-)

diff --git a/docs/cluster_guides/old/disk_quota_more.md b/docs/cluster_guides/old/disk_quota_more.md
index 7a7f60297..3a19a876d 100644
--- a/docs/cluster_guides/old/disk_quota_more.md
+++ b/docs/cluster_guides/old/disk_quota_more.md
@@ -6,12 +6,13 @@ You can display your current usage with the command 'uquota'.

 When you exceed your quota, the system will not let you write any more data and you have to either remove some files or request more quota. The 'uquota' command will also show the date and to what limit your quota will change to, if you have been given a larger quota.

-To get more quota, send a mail to support (support@uppmax.uu.se) and state how much, for how long time, and why you need it. See the storage project application page for more information on how we handle and prioritise storage requests.
+To get more quota, send a mail to support (`support@uppmax.uu.se`) and state how much, for how long time, and why you need it. See the storage project application page for more information on how we handle and prioritise storage requests.

 Here are two commands. The first results in a list of subdirectories ordered by size and proportion of total size. The second produces a list of the files in the current directory that take up most space. These may take a long time to complete, use 'ctrl-c' to stop the process if you change your mind.

-```
+```bash
 du -b $PWD | sort -rn | awk 'NR==1 {ALL=$1} {print int($1*100/ALL) "% " $0}'
 find $PWD -print0 -type f | xargs -0 stat -c "%s %n" | sort -rn
 ```
+
 You should also read the Disk Storage Guide. **LINK**
diff --git a/docs/cluster_guides/old/webexport.md b/docs/cluster_guides/old/webexport.md
index daa20055b..35c13f2db 100644
--- a/docs/cluster_guides/old/webexport.md
+++ b/docs/cluster_guides/old/webexport.md
@@ -3,6 +3,7 @@
 In IG: Support User guides Webexport guide

 ## Rackham
+
 You can enable webexport by creating a publicly readable folder called **webexport** in your project directory (``/proj/[project id]``). The contents of that folder will be accessible through `https://export.uppmax.uu.se/[project id]/`.

 A publicly readable folder has the execute permission set for "other" users. Run the command chmod o+x webexport to ensure that the webexport directory has the correct permissions.
diff --git a/docs/cluster_guides/running_jobs/interactive_more.md b/docs/cluster_guides/running_jobs/interactive_more.md
index b0bf3e475..a63910138 100644
--- a/docs/cluster_guides/running_jobs/interactive_more.md
+++ b/docs/cluster_guides/running_jobs/interactive_more.md
@@ -1,9 +1,5 @@
 # How to run interactively on a compute node?

-???- info "For UPPMAX staff"
-
-    TODO: InfoGlue link: `https://www.uppmax.uu.se/support/faq/running-jobs-faq/how-can-i-run-interactively-on-a-compute-node/`
-
 You may want to run an interactive application on one or several compute nodes. You may want to use one or several compute nodes as a development workbench, interactively. How can this be arranged?

 The program interactive may be what you are looking for.
@@ -13,13 +9,13 @@ The one parameter you must always specify is the project name. Let's assume for

 To get one compute core with the proportional amount of RAM, we recommend you to use the most simple command on the login node for the cluster you want to use:

-```
+```bash
 interactive -A p2010099
 ```

 If you need more than one core, or special features on your node, you can specify that to the interactive command, e.g. on milou:

-```
+```bash
 interactive -A p2010099 -n 16 -C fat
 ```

@@ -36,11 +32,10 @@ In the last example ("interactive -A p2010099 -n 16 -C fat"), the interactive co

 If you also want to run for 15 hours, you may say so, with the command

-```
+```bash
 interactive -A p2010099 -n 16 -C fat -t 15:00:00
 ```

 but no "priority lane" can be used, you get your normal queue priority, and you might have to wait for a very long time for your session to start. Please note that you need to keep watch over when the job starts, because you are accounted for all the time from job start even if you are sleeping, and because an allocated and unused node is a waste of expensive resources.

-
 NB. You can not launch an interactive job from an other cluster with the flag -M, which otherwise is a common flag to other SLURM commands. You must launch it from a login node to the cluster you want to use.
diff --git a/docs/cluster_guides/running_jobs/nodes_own_disk.md b/docs/cluster_guides/running_jobs/nodes_own_disk.md
index e9d801350..f104c8c04 100644
--- a/docs/cluster_guides/running_jobs/nodes_own_disk.md
+++ b/docs/cluster_guides/running_jobs/nodes_own_disk.md
@@ -10,7 +10,24 @@ Parallel network file systems are very fast when accessed from many nodes, but c

 For this reason, jobs that perform a lot of file accesses, especially on temporary files, should use the compute node's local hard drive. If you do, then any slow-down due to file I/O is limited to the node(s) on which these jobs are running.

-The hard drive of the node is located at /scratch, and each job that runs on a node gets a folder created automatically with the same name as the jobid, /scratch/. This folder name is also stored in the environment variable $SNIC_TMP for ease of use. The idea is that you copy files that you will be reading randomly, such as indices and databases but not files of reads, to $SNIC_TMP first thing in the job. Files that you read as a stream from beginning to end, like files of reads, should remain in project storage and read from there. You then run your analysis and have all the output files written to $SNIC_TMP as well. After the analysis is done, you copy back all the output files you want to keep to your project storage folder. Everything in /scratch/ will be deleted as soon as the job is finished, and you have no hope of recovering it after the job is completed.
+The hard drive of the node is located at `/scratch`,
+and each job that runs on a node gets a folder created automatically
+with the same name as the jobid, `/scratch/`.
+This folder name is also stored in the environment variable `$SNIC_TMP`
+for ease of use.
+The idea is that you copy files that you will be reading randomly,
+such as indices and databases but not files of reads,
+to `$SNIC_TMP` first thing in the job.
+Files that you read as a stream from beginning to end, like files of reads,
+should remain in project storage and read from there.
+You then run your analysis and
+have all the output files written to `$SNIC_TMP` as well.
+After the analysis is done,
+you copy back all the output files you want to keep
+to your project storage folder.
+Everything in `/scratch/` will be deleted
+as soon as the job is finished,
+and you have no hope of recovering it after the job is completed.

 An example would be a script that runs bwa to align read. Usually they look something like this:

@@ -62,4 +79,10 @@ cp $SNIC_TMP/sample.bam /proj/snic2022-X-YYY/nobackup/results/

 It's not harder than that. This way, the index files are copied to $SNIC_TMP in a single operation, which is much less straining for the file system than small random read/writes. The network filesystem is used when gathering reads for alignment, and streaming reads are easy for that filesystem. When the alignment is finished the results is copied back to project directory so that it can be used in other analysis.

-One problem that can happen is if your files and the results are too large for the node's hard drive. The drive is 2TiB on Rackham and 4TiB on Bianca, so it is unusual for the hard drive to be too small for the results of such analyses. If you run into this problem, please email UPPMAX at support@uppmax.uu.se and we can look into the problem.
+One problem that can happen is if your files and the results are too large
+for the node's hard drive.
+The drive is 2TiB on Rackham and 4TiB on Bianca,
+so it is unusual for the hard drive to be too small
+for the results of such analyses.
+If you run into this problem,
+please email UPPMAX at `support@uppmax.uu.se` and we can look into the problem.
diff --git a/docs/cluster_guides/running_jobs/storage_compression.md b/docs/cluster_guides/running_jobs/storage_compression.md
index 3c6cef8ac..6bdd2831b 100644
--- a/docs/cluster_guides/running_jobs/storage_compression.md
+++ b/docs/cluster_guides/running_jobs/storage_compression.md
@@ -21,8 +21,8 @@

 - You might also be interested in our [disk storage guide](../storage/disk_storage_guide.md).

-
 ## Compression
+
 ???- question "File compression guide"

     [Compression guide](../storage/compress_guide.md)
diff --git a/docs/cluster_guides/storage/compress_fastQ.md b/docs/cluster_guides/storage/compress_fastQ.md
index f7c9733e3..62f951547 100644
--- a/docs/cluster_guides/storage/compress_fastQ.md
+++ b/docs/cluster_guides/storage/compress_fastQ.md
@@ -3,9 +3,10 @@
 **Short answer: The best compression using a widely available format is provided by bzip2 and its parallel equivalent pbzip2. The best compression ratio for FastQ is provided by fqz_comp in the fqzcomp/4.6 module. However, this tool is experimental and is not recommended for general, everyday use.**

 ## Long answer
+
 We conducted an informal examination of two specialty FastQ compression tools by recompressing an existing fastq.gz file. The first tool fqzcomp (available in the module fqzcomp/4.6) uses a compiled executable (fqz_comp) that works similar to e.g., gzip, while the second tool (LFQC in the module lfqc/1.1) uses separate ruby-language scripts for compression (lfqc.rb) and decompression (lfqcd.rb). It does not appear the LFQC scripts accept piping but the documentation is limited.

-```
+```bash
 module load bioinfo-tools
 module load fqzcomp/4.6
 module load lfqc/1.1
@@ -17,7 +18,7 @@ One thing changed from the 'standard' implementation of LFQC was to make the scr

 Since piping is not available with LFQC, it is preferable to avoid creating a large intermediate decompressed FastQ file. So, create a named pipe using mkfifo that is named like a fastq file.

-```
+```bash
 mkfifo UME_081102_P05_WF03.se.fastq
 zcat UME_081102_P05_WF03.se.fastq.gz > UME_081102_P05_WF03.se.fastq &
 lfqc.rb UME_081102_P05_WF03.se.fastq
@@ -28,13 +29,13 @@ This took a long time, 310 wall seconds.

 Next,fqz_comp from fqzcomp/4.6. Since this works like gzip, just use it in a pipe.

-```
+```bash
 zcat UME_081102_P05_WF03.se.fastq.gz | fqz_comp > UME_081102_P05_WF03.se.fastq.fqz
 ```

 This used a little multithreading (up to about 150% CPU) and was much faster than LFQC, just 2-3 seconds. There are other compression options (we tried -s1 and -s9+) but these did not outperform the default (equivalent to -s3). This is not necessarily a surprise; stronger compression means attempting to make better guesses and sometimes these guesses are not correct. No speedup/slowdown was noticed with other settings but the input file was relatively small.

-```
+```console
 -rw-rw-r-- 1 28635466 Mar 10 12:53 UME_081102_P05_WF03.se.fastq.fqz1
 -rw-rw-r-- 1 29271063 Mar 10 12:52 UME_081102_P05_WF03.se.fastq.fqz9+
 -rw-rw-r-- 1 46156932 Jun 6 2015 UME_081102_P05_WF03.se.fastq.gz
@@ -44,7 +45,7 @@ This used a little multithreading (up to about 150% CPU) and was much faster tha

 We also compared against bzip2 and xz, which are general-use compressors. These both function like gzip (and thus like fqz_comp) and both outperform gzip, as expected. xz is becoming a more widely-used general compressor like bzip2, but for this file it required perhaps 20x as much time as bzip2 and did worse.

-```
+```console
 -rw-rw-r-- 1 35664555 Mar 10 13:10 UME_081102_P05_WF03.se.fastq.bz2
 -rw-rw-r-- 1 36315260 Mar 10 13:10 UME_081102_P05_WF03.se.fastq.xz
 ```
@@ -53,11 +54,12 @@ Neither of these improved general-use compressors did as well with FastQ as the


 ### Which is the best method in this trial?
+
 From the results of this trial, the tool that provides the best compression ratio in a reasonable amount of time is fqz_comp in the fqzcomp/4.6 module. It is as fast as bzip2 which is also much better than gzip but does a much better job of compressing FastQ. However, fqz_comp is experimental so we do not recommend using fqz_comp for everyday use. We recommend using bzip2 or its parallel equivalent, pbzip2.

 The fqz_comp executable could be used to decompress FastQ within a named pipe if FastQ is required for input:

-```
+```text
 ... <(fqz_comp -d < file.fastq.fqz) ...
 ```

@@ -65,4 +67,5 @@ Note that fqz_comp is designed to compress FastQ files alone, and neither method


 ### Why not LFQC?
+
 Though LFQC has the best compression of FastQ, there are some strong disadvantages. First, it takes quite a long time, perhaps 50x longer than fqz_comp. Second, it apparently cannot be used easily within a pipe like many other compressors. Third, it contains multiple scripts with multiple auxiliary programs, rather than a single executable. Fourth, it is quite verbose during operation, which can be helpful but cannot be turned off. Finally, it was difficult to track down for installation; two different links were provided in the publications and neither worked. It was finally found in a github repository, the location of which is provided in the module help.