fix: switch `mem_gb` to `mem_mb` #262
Conversation
A `mem_gb` specification plus a `default-resources:` specification of `mem_mb` for a cluster system leads to multiple distinct resource definitions that can get confused -- so we should just stick to the standard `mem_mb` here.
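As a hypothetical illustration of the conflict (the rule name, paths, and the 8000 MB default are made up), a rule declaring `mem_gb` alongside a profile that only sets a `default-resources:` value for `mem_mb` leaves the scheduler tracking two unrelated resources:

```python
# A cluster profile's config.yaml might contain:
#   default-resources:
#     - mem_mb=8000
# A rule that declares mem_gb instead would be tracked as a separate,
# non-standard resource and bypass that budget entirely, so we stick
# to the standard mem_mb:
rule annotate_umis:
    input:
        "results/{sample}.bam",
    output:
        "results/{sample}.annotated.bam",
    resources:
        mem_mb=16000,  # one unambiguous definition, instead of mem_gb=16
    shell:
        "echo annotate {input} > {output}"  # placeholder command
```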
The memory should probably also be passed as a number instead of a string. For me, the latest version of the wrapper failed because snakemake-wrapper-utils expects a number instead of a string.
And just as a side note: I had some more issues running gatk with wrapper v2.3.2. I did not find the time to debug this, as it seems to be related to the Spark master, so I downgraded to v1.25.0, which was the last version that ran.
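For illustration, a minimal sketch (hypothetical rule and values) of the number-versus-string difference, assuming the wrapper utilities do arithmetic on `resources.mem_mb`:

```python
rule call_variants:
    input:
        "results/{sample}.bam",
    output:
        "results/{sample}.vcf",
    resources:
        mem_mb=4096,      # number: snakemake-wrapper-utils can compute with it
        # mem_mb="4096",  # string: arithmetic inside the wrapper fails
    shell:
        "echo call {input} > {output}"  # placeholder command
```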
Good point, done.
Yes, the specification should definitely be dynamic. I was already planning this but hadn't gotten around to looking into a useful calculation. Here's my reasoning for what I now suggest:

The [documentation of `fgbio AnnotateBamWithUmis`](https://fulcrumgenomics.github.io/fgbio/tools/latest/AnnotateBamWithUmis.html) states that this tool reads the entire input UMI fastq files into memory in an uncompressed format. As we work with gzipped fastq files, I would expect this to take about 4x the size of the input `fastq.gz` files, according to [Table 2](https://academic.oup.com/view-large/394488195) of this paper: Marius Nicolae and others, LFQC: a lossless compression algorithm for FASTQ files, Bioinformatics, Volume 31, Issue 20, October 2015, Pages 3276–3281, https://doi.org/10.1093/bioinformatics/btv384

As we should plan for some extra headroom, but also have the `bam` file as another input, I think that `4*input.size_mb` should be a good estimate.

This can be rather heavy on the memory requirements, but it should be fine on modern servers and cluster systems -- and I think this workflow should usually be run on bigger compute infrastructure. So I think this is acceptable, but as an alternative we could sort the `fastq.gz` files beforehand and then use the `fgbio AnnotateBamWithUmis` flag `--sorted`.
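A minimal sketch of such a dynamic specification (the rule name, paths, and the 8 GB floor are hypothetical; `input.size_mb` is Snakemake's built-in total input size in megabytes):

```python
rule annotate_bam_with_umis:
    input:
        bam="results/mapped/{sample}.bam",
        umi="results/umis/{sample}.fastq.gz",
    output:
        "results/annotated/{sample}.bam",
    resources:
        # ~4x the compressed inputs, to hold the uncompressed UMI fastq in
        # memory plus headroom for the bam; int() because the resource must
        # be a number, not a string (see above).
        mem_mb=lambda wildcards, input: int(max(4 * input.size_mb, 8000)),
    shell:
        # placeholder for the fgbio AnnotateBamWithUmis invocation
        "echo annotate {input.bam} {input.umi} > {output}"
```

Pre-sorting the `fastq.gz` files and passing `--sorted` would be the lower-memory alternative mentioned above.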
The other option would be to switch to something that does the UMI annotation earlier. I was already wondering why this only happens after mapping, with tools like […].
Should we really go for […]?
I think transforming the fastq into a bam file makes things more complicated. In general, reads are mapped after adapter trimming, while the UMIs are derived from the input fastqs, as this information is lost after trimming. If we want to annotate at an early stage, we would need to run […].
But […]. The latter does seem to imply […]. This would also avoid having to merge the untrimmed reads again and read all of them into memory for […].
This sounds like a good option to me. We should probably implement this in a separate PR and go along with the memory-consuming option for now to get the workflow running again.
OK, for such a case the […].
I would be happy with that, especially as this would only be a temporary solution until we replace the UMI annotation as discussed.