Fix markdown style issues
richelbilderbeek committed May 15, 2024
1 parent 5a8b31c commit 89f98dd
Showing 2 changed files with 21 additions and 10 deletions.
9 changes: 8 additions & 1 deletion docs/cluster_guides/storage/compress_format.md
@@ -2,7 +2,14 @@

How well things compress will vary a great deal with the input data. An additional consideration is how useful the compressed format will be to you later. Some tools can handle only one compressed format (almost always gzip) and some can handle two (almost always gzip and bzip2). The help information for the tool should be explicit about which formats it understands. You can also use named pipes or the bash <() syntax to uncompress files 'on the fly' if the tool you are using cannot handle that compressed format.
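
For example, a minimal sketch, assuming a hypothetical tool `count_reads` that cannot read bzip2 itself (the tool and file names are placeholders):

```bash
# decompress on the fly with bash process substitution
count_reads <(bzcat reads.fq.bz2)

# or with an explicit named pipe
mkfifo reads.fq
bzcat reads.fq.bz2 > reads.fq &
count_reads reads.fq
rm reads.fq
```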

Another consideration for usefulness is the structure
of the specific compressed format.
By default gzip is not 'blocked';
the compression is applied continually across the entire file,
and to uncompress something in the middle
it is necessary to uncompress everything up to that point.
Tools that understand compressed VCF and GFF files
require these to be compressed with **bgzip** (available as part of the **htslib** module), which applies blocked gzip compression, so that it is possible to uncompress interior chunks of the files efficiently. This is useful when viewing compressed VCF/GFF files in a viewer such as IGV, for example. For viewing, such files also need an index created, which is accomplished using **tabix** (also part of the **htslib** module), which understands bgzip-compressed files. BAM files also use a type of gzip compression that is blocked. Files compressed with bgzip can be uncompressed with gzip.
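
A minimal sketch, assuming the **htslib** module is loaded and a placeholder file `variants.vcf`:

```bash
# blocked gzip compression, as required by tools such as IGV
bgzip variants.vcf             # produces variants.vcf.gz
# create the index used by viewers
tabix -p vcf variants.vcf.gz   # produces variants.vcf.gz.tbi
# bgzip output is still valid gzip, so ordinary tools can read it
zcat variants.vcf.gz | head
```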

Bzip2 is inherently blocked. Bzip2 is a more efficient compression method than gzip, but takes perhaps twice as long or longer to compress the same file. Fortunately, another advantage of blocked compression is that multiple parts of the file can be compressed at once. UPPMAX has **pbzip2** available as a system tool, which can perform parallel compression and decompression of bzip2-format files using multiple threads. This is quite fast. Run 'pbzip2 -h' for help. An UPPMAX user has provided a helpful SBATCH script.
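
As an illustration only, a minimal sketch of what such a job script could look like (this is not the user-contributed script; the project name, run time, and file name are placeholders):

```bash
#!/bin/bash
#SBATCH -A your_project_id
#SBATCH -p core
#SBATCH -n 4
#SBATCH -t 01:00:00

# compress with as many threads as cores reserved
pbzip2 -p4 large_file.txt
```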

22 changes: 13 additions & 9 deletions docs/cluster_guides/storage/compress_guide.md
@@ -3,13 +3,14 @@
To avoid filling up the storage at UPPMAX, we ask all users to do their part and store their files in a good way. The best way to store files is of course to delete everything you don't need anymore, such as temporary and intermediate files. For everything else you need to keep, here are some useful commands to know (see the section about biological data below).

## General files

We have several compression programs installed and you are free to choose whichever you want (any is better than none). Examples:

### gzip (fast, good compression)

gzip also has a parallel version (pigz) that will let the program use multiple cores, making it much faster. If you want to run multithreaded, you should make a reservation in the queue system, as the login nodes will throttle your programs if they use too many resources.

```bash
# compress a file
$ gzip file.txt # single threaded
$ pigz -p 4 file.txt # using 4 threads
@@ -22,7 +23,7 @@ $ unpigz -p 4 file.txt # using 4 threads (4 is max)

bzip2 also has a parallel version (pbzip2) that will let the program use multiple cores, making it much faster. If you want to run multithreaded, you should make a reservation in the queue system, as the login nodes will throttle your programs if they use too many resources.

```bash
# compress a file
$ bzip2 file.txt # single threaded
$ pbzip2 -p4 file.txt # using 4 threads
@@ -35,7 +36,7 @@ $ pbunzip2 -p4 file.txt.gz # using 4 threads

zstd has built-in support for using multiple threads (only when compressing data), making it much faster. If you want to run multithreaded, you should make a reservation in the queue system, as the login nodes will throttle your programs if they use too many resources.

```bash
# compress a file
$ zstd --rm file.txt # single threaded
$ zstd --rm -T4 file.txt # using 4 threads
@@ -44,9 +45,10 @@ $ unzstd --rm file.txt.zst
```

## Compressing lots of files

The commands above work on a single file at a time, and if you have thousands of files it is quite tedious to go through them manually. If you want to combine all the files into a single compressed archive, you can use a program named tar.

```bash
# to compress a folder (folder/)
# and all files/folders inside it,
# creating an archive file named files.tar.gz
@@ -57,7 +59,7 @@ $ tar -xzvf files.tar.gz

If you don't want to combine them into a single file, and instead want to compress them one by one, you can use the find command.

```bash
# to find all files with a name ending
# with .fq and compress them
$ find /path/to/search/in -iname '*.fq' -print -exec gzip "{}" \;
@@ -71,19 +73,20 @@ $ find . \( -iname '*.fq' -o -iname '*.fastq' \) -print0 | xargs -0 -P 4 gzip
```

## Biological data

There are some compression algorithms that have become standard practice to use in the realm of biological data. Most programs can read the compressed versions of files as long as they are compressed with the correct program. We leave out the decompression commands, mostly because they are already described in the General files section above, but also because there is little reason to ever decompress biological data.

### fastq files

```bash
# compress sample.fq
$ gzip sample.fq # single threaded
$ pigz -p 4 sample.fq # using 4 threads
```

### sam files

```bash
# load samtools
$ module load bioinfo-tools samtools
# compress sample.sam, but remember to delete
@@ -97,7 +100,7 @@ $ samtools view -@ 4 -b -o sample.bam sample.sam

### vcf / g.vcf files

```bash
# load htslib
$ module load bioinfo-tools htslib
# compress sample.vcf / sample.g.vcf
@@ -108,11 +111,12 @@ $ tabix sample.vcf.gz
```

## Programs that don't read compressed files

There are clever ways to get around programs that don't support reading compressed files. Let's say you have a program that only reads plain-text files. You can use something called process substitution (also known as anonymous named pipes) to decompress the data on the fly while feeding it to the program.

### How you normally would run the program

```bash
# run the program with uncompressed file
$ the_program uncompressed.txt
# now, let's compress the file first and run
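# a minimal sketch of the idea, assuming gzip compression
# (`the_program` is the placeholder name used above)
$ gzip uncompressed.txt
# decompress on the fly with process substitution
$ the_program <(zcat uncompressed.txt.gz)
```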
