
fix: switch mem_gb to mem_mb #262

Merged
merged 6 commits on Aug 24, 2023

Conversation

dlaehnemann
Contributor

A `mem_gb` specification in a rule, combined with a `default-resources:` specification of `mem_mb` for a cluster system, leads to multiple distinct memory resource definitions that can get confused -- so we should just stick to the standard `mem_mb` here.

Contributor

@FelixMoelder left a comment


The memory should probably also be passed as a number instead of a string. For me, the latest version of the wrapper failed, as snakemake-wrapper-utils expects a number instead of a string.

And just as a sidenote: I had some more issues running gatk with wrapper v2.3.2. I did not find the time to debug this, as it seems to be related to the spark master, and downgraded to v1.25.0, which was the last version that ran for me.
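The number-vs-string point above can be sketched like this. This is not the actual snakemake-wrapper-utils implementation, just an assumed minimal stand-in for how a numeric `mem_mb` resource is accepted while a string is rejected:

```python
# Sketch (assumption): mimics a wrapper-utils-style check that the
# mem_mb resource is numeric, not a string like "4096".
def get_mem_mb(resources):
    mem = resources.get("mem_mb")
    if not isinstance(mem, (int, float)):
        raise TypeError(f"mem_mb must be a number, got {type(mem).__name__}")
    return int(mem)

print(get_mem_mb({"mem_mb": 4096}))   # number: accepted
try:
    get_mem_mb({"mem_mb": "4096"})    # string: rejected
except TypeError as err:
    print("rejected:", err)
```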

@dlaehnemann
Contributor Author

Good point, done.

@dlaehnemann
Contributor Author

Yes, the specification should definitely be dynamic. I was already planning this but hadn't gotten around to looking into a useful calculation. Here's my reasoning for what I now suggest:

The [documentation of `fgbio AnnotateBamWithUmis`](https://fulcrumgenomics.github.io/fgbio/tools/latest/AnnotateBamWithUmis.html) states that this tool will read the entire input UMI fastq files into memory in an uncompressed format. As we work with gzipped fastq files, I would expect this to take about 4x the size of the input `fastq.gz` files according to [Table 2](https://academic.oup.com/view-large/394488195) of this paper:
Marius Nicolae and others, LFQC: a lossless compression algorithm for FASTQ files, Bioinformatics, Volume 31, Issue 20, October 2015, Pages 3276–3281, https://doi.org/10.1093/bioinformatics/btv384

As we should plan for some extra headroom, but also have the `bam` file as another input, I think that `4*input.size_mb` should be a good estimate.

This can be rather heavy on the memory requirements, but this should be fine on modern servers and cluster systems -- and I think this workflow should usually be run on bigger compute infrastructure. So I think this is acceptable, but as an alternative we could sort the `fastq.gz` files beforehand and then use the `fgbio AnnotateBamWithUmis` flag `--sorted`.
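In a Snakemake rule this estimate would be a resource callable. A minimal sketch (the function name and the `SimpleNamespace` stand-in for Snakemake's `input` object are illustrative only; the factor of 4 is the one reasoned out above):

```python
from types import SimpleNamespace

def annotate_umis_mem_mb(wildcards, input):
    # fgbio AnnotateBamWithUmis loads the uncompressed UMI FASTQs into
    # memory; ~4x the total gzipped input size leaves some headroom
    return int(4 * input.size_mb)

# illustration with a stand-in for Snakemake's input object;
# in a rule this would be used as: resources: mem_mb=annotate_umis_mem_mb
print(annotate_umis_mem_mb(None, SimpleNamespace(size_mb=1000)))  # → 4000
```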

@dlaehnemann
Contributor Author

The other option would be to switch to something that does the UMI annotation earlier. I was already wondering why this only happens after mapping, with tools like fgbio FastqToBam that should be able to do the UMI extraction directly from fastq files. As the author of AnnotateBamWithUmis puts it, this tool should be the last resort for UMI annotation. Or are there important reasons that I missed for doing the UMI annotation so late?

@FelixMoelder
Contributor

FelixMoelder commented Aug 23, 2023

Should we really go for 4*input.size_mb? In my specific case that would be (21+18)*4 = 156gb. Depending on the hardware, this could easily exceed the available memory and impact system stability. For my scenario I tried some memory settings and 80gb were sufficient (probably a bit less, as I first tried 40gb and then 80gb).
So something like 4*fastq + bam should be sufficient. But I do not see that there is an option for getting the sizes of separate input files. Can we alternatively set some upper limit? E.g. lambda wc, input: min(4*input.size_mb, 0.9*max_memory)? Not sure if snakemake has any information about the total system memory.
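The suggested cap could be sketched as below. Two assumptions are baked in: the cap needs `min` (not `max`) so the estimate never exceeds the limit, and since Snakemake does not report total system memory, the sketch queries the OS directly via a POSIX-only `os.sysconf` call:

```python
import os

def total_mem_mb():
    # POSIX-only sketch for total physical memory in MB; Snakemake
    # itself does not expose this, so we ask the OS (assumption)
    return os.sysconf("SC_PAGE_SIZE") * os.sysconf("SC_PHYS_PAGES") // 2**20

def capped_mem_mb(wildcards, input, cap_fraction=0.9):
    # min() caps the 4x estimate at 90% of physical memory, so a huge
    # input cannot request more memory than the machine has
    return min(4 * input.size_mb, int(cap_fraction * total_mem_mb()))
```

On a cluster, the scheduler's per-node memory limit would be the more reliable cap than the submit host's physical memory.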

@FelixMoelder
Contributor

FelixMoelder commented Aug 23, 2023

> The other option would be to switch to something that does the UMI annotation earlier. I was already wondering why this only happens after mapping, with tools like fgbio FastqToBam that should be able to do the UMI extraction directly from fastq files. As the author of AnnotateBamWithUmis puts it, this tool should be the last resort for UMI annotation. Or are there important reasons that I missed for doing the UMI annotation so late?

I think transforming the fastq into a bam file makes things more complicated. In general, reads are mapped after adapter trimming, while the UMIs are derived from the input fastqs, as this information is lost after trimming. If we want to annotate at an early stage, we would need to run FastqToBam first and then transform the annotated bam files into fastqs again in order to run Cutadapt. Maybe I missed something here, but fastq -> bam -> fastq -> trimming -> mapping does not appear reasonable to me.

@dlaehnemann
Contributor Author

But cutadapt doesn't have any UMI functionality and should only remove sequencing adapters, which usually remain at the end of a read if the read is longer than the insert size / fragment length. But the UMIs, which are usually at the start of a read, should remain after cutadapt. So we should be able to run:

  1. cutadapt
  2. fgbio FastqToBam
  3. bwa mem

The latter does seem to imply samtools fastq to bring the BAM file back to FASTQ format, but this can then be piped directly on to bwa mem. A detailed description is here:

https://github.com/fulcrumgenomics/fgbio/blob/main/docs/best-practice-consensus-pipeline.md#step-12-ubam---mapped-bam

This would also avoid having to again merge the untrimmed reads and reading all of them into memory for fgbio AnnotateBamWithUmis.
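Following the linked fgbio best-practice document, the three steps above could look roughly like this. This is a command sketch only: all file names are hypothetical, the `--read-structures` value depends entirely on the library design, and the `ZipperBams` step is taken from the linked guide:

```sh
# 1. trim sequencing adapters (UMIs at the read start survive trimming)
cutadapt -a ADAPTER_R1 -A ADAPTER_R2 \
    -o trimmed.R1.fastq.gz -p trimmed.R2.fastq.gz \
    sample.R1.fastq.gz sample.R2.fastq.gz

# 2. extract UMIs into an unmapped BAM
#    (8M+T = 8bp UMI then template on R1 -- an assumed read structure)
fgbio FastqToBam --input trimmed.R1.fastq.gz trimmed.R2.fastq.gz \
    --read-structures 8M+T +T --sample sampleA --library libA \
    --output unmapped.bam

# 3. stream back to FASTQ, pipe into bwa mem, re-attach the UMI tags
samtools fastq unmapped.bam \
    | bwa mem -t 8 -p ref.fa - \
    | fgbio ZipperBams --unmapped unmapped.bam --ref ref.fa --output mapped.bam
```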

@FelixMoelder
Contributor

FelixMoelder commented Aug 23, 2023

This sounds like a good option to me. We should probably implement this in a separate PR and go along with that memory consuming option for now to get the workflow running again.

@dlaehnemann
Contributor Author

> Should we really go for 4*input.size_mb? In my specific case that would be (21+18)*4 = 156gb. Depending on the hardware, this could easily exceed the available memory and impact system stability. For my scenario I tried some memory settings and 80gb were sufficient (probably a bit less, as I first tried 40gb and then 80gb). So something like 4*fastq + bam should be sufficient. But I do not see that there is an option for getting the sizes of separate input files. Can we alternatively set some upper limit? E.g. lambda wc, input: min(4*input.size_mb, 0.9*max_memory)? Not sure if snakemake has any information about the total system memory.

OK, for such a case the 4* does seem a bit excessive. We probably expect the size of the BAM file and (all of the) FASTQ file(s) to be similar, right? So going for 2* or maybe 2.5* might be a bit better for the quick fix?
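As a sketch of that quick fix, with 2.5 picked from the range discussed (the function name and the `SimpleNamespace` stand-in for Snakemake's `input` object are illustrative only):

```python
from types import SimpleNamespace

def quick_fix_mem_mb(wildcards, input):
    # quick fix: FASTQ and BAM inputs are of similar size, so ~2.5x the
    # total input size should cover the uncompressed FASTQs plus headroom
    return int(2.5 * input.size_mb)

# illustration with a stand-in for Snakemake's input object
print(quick_fix_mem_mb(None, SimpleNamespace(size_mb=1000)))  # → 2500
```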

@FelixMoelder
Contributor

I would be happy with that, especially as this would only be a temporary solution until we replace the UMI annotation as discussed.

@johanneskoester merged commit 6bda9b5 into master Aug 24, 2023
4 checks passed
@johanneskoester deleted the fix-mem-gb-to-mem-mb-resources-spec branch August 24, 2023 12:08