
fix: switch mem_gb to mem_mb #262

Merged
merged 6 commits on Aug 24, 2023

Conversation

dlaehnemann
Contributor

A `mem_gb` specification in a rule, combined with a `default-resources:` specification of `mem_mb` for a cluster system, leads to multiple distinct memory resource definitions that can get confused -- so we should just stick to the standard `mem_mb` here.

Contributor

@FelixMoelder left a comment


The memory should probably also be passed as a number instead of a string. For me, the latest version of the wrapper failed, as snakemake-wrapper-utils expects a number instead of a string.

And just as a sidenote: I had some more issues running gatk with wrapper v2.3.2. I did not find the time to debug this, as it seems to be related to the spark master, and downgraded to v1.25.0, which was the last version that ran for me.
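The number-vs-string point above can be sketched like this. This is not the actual snakemake-wrapper-utils implementation, just an assumed minimal stand-in for how a numeric `mem_mb` resource is accepted while a string is rejected:

```python
# Sketch (assumption): mimics a wrapper-utils-style check that the
# mem_mb resource is numeric, not a string like "4096".
def get_mem_mb(resources):
    mem = resources.get("mem_mb")
    if not isinstance(mem, (int, float)):
        raise TypeError(f"mem_mb must be a number, got {type(mem).__name__}")
    return int(mem)

print(get_mem_mb({"mem_mb": 4096}))   # number: accepted
try:
    get_mem_mb({"mem_mb": "4096"})    # string: rejected
except TypeError as err:
    print("rejected:", err)
```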

@dlaehnemann
Contributor Author

Good point, done.

@dlaehnemann
Contributor Author

Yes, the specification should definitely be dynamic. I was already planning this but hadn't gotten around to looking into a useful calculation. Here's my reasoning for what I now suggest:

The [documentation of `fgbio AnnotateBamWithUmis`](https://fulcrumgenomics.github.io/fgbio/tools/latest/AnnotateBamWithUmis.html) states that this tool will read the entire input UMI fastq files into memory in an uncompressed format. As we work with gzipped fastq files, I would expect this to take about 4x the size of the input `fastq.gz` files according to [Table 2](https://academic.oup.com/view-large/394488195) of this paper:
Marius Nicolae and others, LFQC: a lossless compression algorithm for FASTQ files, Bioinformatics, Volume 31, Issue 20, October 2015, Pages 3276–3281, https://doi.org/10.1093/bioinformatics/btv384

As we should plan for some extra headroom, but also have the `bam` file as another input, I think that `4*input.size_mb` should be a good estimate.

This can be rather heavy on the memory requirements, but this should be fine on modern servers and cluster systems -- and I think this workflow should usually be run on bigger compute infrastructure. So I think this is acceptable, but as an alternative we could sort the `fastq.gz` files beforehand and then use the `fgbio AnnotateBamWithUmis` flag `--sorted`.
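In a Snakemake rule this estimate would be a resource callable. A minimal sketch (the function name and the `SimpleNamespace` stand-in for Snakemake's `input` object are illustrative only; the factor of 4 is the one reasoned out above):

```python
from types import SimpleNamespace

def annotate_umis_mem_mb(wildcards, input):
    # fgbio AnnotateBamWithUmis loads the uncompressed UMI FASTQs into
    # memory; ~4x the total gzipped input size leaves some headroom
    return int(4 * input.size_mb)

# illustration with a stand-in for Snakemake's input object;
# in a rule this would be used as: resources: mem_mb=annotate_umis_mem_mb
print(annotate_umis_mem_mb(None, SimpleNamespace(size_mb=1000)))  # → 4000
```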

@dlaehnemann
Contributor Author

The other option would be to switch to something that does the UMI annotation earlier. I was already wondering why this only happens after mapping, with tools like fgbio FastqToBam that should be able to do the UMI extraction directly from fastq files. As the author of AnnotateBamWithUmis puts it, this tool should be the last resort for UMI annotation. Or are there important reasons that I missed for doing the UMI annotation so late?

@FelixMoelder
Contributor

FelixMoelder commented Aug 23, 2023

Should we really go for 4*input.size_mb? In my specific case that would be (21+18)*4 = 156gb. Depending on the hardware, this could easily exceed the available memory and impact system stability. For my scenario I tried some memory settings and 80gb were sufficient (probably a bit less, as I first tried 40gb and then 80gb).
So something like 4*fastq + bam should be sufficient. But I do not see that there is an option for getting the sizes of separate input files. Can we alternatively set some upper limit? E.g. lambda wc, input: min(4*input.size_mb, 0.9*max_memory)? Not sure if snakemake has any information about the total system memory.
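The suggested cap could be sketched as below. Two assumptions are baked in: the cap needs `min` (not `max`) so the estimate never exceeds the limit, and since Snakemake does not report total system memory, the sketch queries the OS directly via a POSIX-only `os.sysconf` call:

```python
import os

def total_mem_mb():
    # POSIX-only sketch for total physical memory in MB; Snakemake
    # itself does not expose this, so we ask the OS (assumption)
    return os.sysconf("SC_PAGE_SIZE") * os.sysconf("SC_PHYS_PAGES") // 2**20

def capped_mem_mb(wildcards, input, cap_fraction=0.9):
    # min() caps the 4x estimate at 90% of physical memory, so a huge
    # input cannot request more memory than the machine has
    return min(4 * input.size_mb, int(cap_fraction * total_mem_mb()))
```

On a cluster, the scheduler's per-node memory limit would be the more reliable cap than the submit host's physical memory.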

@FelixMoelder
Contributor

FelixMoelder commented Aug 23, 2023

> The other option would be to switch to something that does the UMI annotation earlier. I was already wondering why this only happens after mapping, with tools like fgbio FastqToBam that should be able to do the UMI extraction directly from fastq files. As the author of AnnotateBamWithUmis puts it, this tool should be the last resort for UMI annotation. Or are there important reasons that I missed for doing the UMI annotation so late?

I think transforming the fastq into a bam file makes things more complicated. In general, reads are mapped after adapter trimming, while the UMIs are derived from the input fastqs, as this information is lost after trimming. If we want to annotate at an early stage, we would need to run FastqToBam first and then transform the annotated bam files into fastqs again in order to run Cutadapt. Maybe I missed something here, but fastq -> bam -> fastq -> trimming -> mapping does not appear reasonable to me.

@dlaehnemann
Contributor Author

But cutadapt doesn't have any UMI functionality and should only remove sequencing adapters, which usually remain at the end of a read if the read is longer than the insert size / fragment length. But the UMIs, which are usually at the start of a read, should remain after cutadapt. So we should be able to run:

  1. cutadapt
  2. fgbio FastqToBam
  3. bwa mem

The latter does seem to imply samtools fastq to bring the BAM file back to FASTQ format, but this can then be piped directly on to bwa mem. A detailed description is here:

https://github.com/fulcrumgenomics/fgbio/blob/main/docs/best-practice-consensus-pipeline.md#step-12-ubam---mapped-bam

This would also avoid having to again merge the untrimmed reads and reading all of them into memory for fgbio AnnotateBamWithUmis.
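Following the linked fgbio best-practice document, the three steps above could look roughly like this. This is a command sketch only: all file names are hypothetical, the `--read-structures` value depends entirely on the library design, and the `ZipperBams` step is taken from the linked guide:

```sh
# 1. trim sequencing adapters (UMIs at the read start survive trimming)
cutadapt -a ADAPTER_R1 -A ADAPTER_R2 \
    -o trimmed.R1.fastq.gz -p trimmed.R2.fastq.gz \
    sample.R1.fastq.gz sample.R2.fastq.gz

# 2. extract UMIs into an unmapped BAM
#    (8M+T = 8bp UMI then template on R1 -- an assumed read structure)
fgbio FastqToBam --input trimmed.R1.fastq.gz trimmed.R2.fastq.gz \
    --read-structures 8M+T +T --sample sampleA --library libA \
    --output unmapped.bam

# 3. stream back to FASTQ, pipe into bwa mem, re-attach the UMI tags
samtools fastq unmapped.bam \
    | bwa mem -t 8 -p ref.fa - \
    | fgbio ZipperBams --unmapped unmapped.bam --ref ref.fa --output mapped.bam
```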

@FelixMoelder
Contributor

FelixMoelder commented Aug 23, 2023

This sounds like a good option to me. We should probably implement this in a separate PR and go along with that memory consuming option for now to get the workflow running again.

@dlaehnemann
Contributor Author

> Should we really go for 4*input.size_mb? In my specific case that would be (21+18)*4 = 156gb. Depending on the hardware, this could easily exceed the available memory and impact system stability. For my scenario I tried some memory settings and 80gb were sufficient (probably a bit less, as I first tried 40gb and then 80gb). So something like 4*fastq + bam should be sufficient. But I do not see that there is an option for getting the sizes of separate input files. Can we alternatively set some upper limit? E.g. lambda wc, input: min(4*input.size_mb, 0.9*max_memory)? Not sure if snakemake has any information about the total system memory.

OK, for such a case the 4* does seem a bit excessive. We probably expect the size of the BAM file and (all of the) FASTQ file(s) to be similar, right? So going for 2* or maybe 2.5* might be a bit better for the quick fix?
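As a sketch of that quick fix, with 2.5 picked from the range discussed (the function name and the `SimpleNamespace` stand-in for Snakemake's `input` object are illustrative only):

```python
from types import SimpleNamespace

def quick_fix_mem_mb(wildcards, input):
    # quick fix: FASTQ and BAM inputs are of similar size, so ~2.5x the
    # total input size should cover the uncompressed FASTQs plus headroom
    return int(2.5 * input.size_mb)

# illustration with a stand-in for Snakemake's input object
print(quick_fix_mem_mb(None, SimpleNamespace(size_mb=1000)))  # → 2500
```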

@FelixMoelder
Contributor

I would be happy with that, especially as this would only be a temporary solution until we replace the UMI annotation as discussed.

@johanneskoester merged commit 6bda9b5 into master Aug 24, 2023
4 checks passed
@johanneskoester deleted the fix-mem-gb-to-mem-mb-resources-spec branch August 24, 2023 12:08