c_assign_taxonomy2 constant sized blocks of working memory #1366
This change reduces the amount of memory used by assignTaxonomy by allocating constant-sized working arrays before dispatching parallel tasks.
Rationale
I was reviewing dada2 wrappers for MicrobiomeDB, with the hope of moving on to a better way of running dada2 at scale. I ended up with code that runs assignTaxonomy and addSpecies in chunks (https://github.com/VEuPathDB/DJob/blob/4b0a3a512e/DistribJobTasks/bin/dada2/mergeAsvsAndAssignToOtus.R#32), which was necessary to process ASVs from a few thousand samples on a memory budget of 6GB.
I thought I might be able to do this in dada2 itself, investigated what the limiting step was, and noticed that most memory was spent in C_assign_taxonomy2. I've modified it to use constant-sized blocks of working memory before parallelisation dispatch, which seems to work well enough to make the above workaround unnecessary.
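For context, here is a minimal C++/OpenMP sketch of the block-wise idea. It is not the actual dada2 code: classify_one, the buffer layout, and the block size are hypothetical stand-ins for the real Rcpp machinery. The point it illustrates is that working buffers are sized per block and reused, so peak memory depends on the block size rather than on the number of query sequences.

```cpp
// Minimal sketch only, not the dada2 implementation: classify_one and
// the buffer layout are placeholders for the real classification work.
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical per-sequence kernel standing in for the work done per
// query sequence inside C_assign_taxonomy2.
static void classify_one(std::size_t seq_index, int *kmer_buffer,
                         std::size_t kmers_per_seq) {
  for (std::size_t k = 0; k < kmers_per_seq; ++k)
    kmer_buffer[k] = static_cast<int>(seq_index + k); // toy work
}

// Process nseq sequences with a constant amount of working memory:
// buffers are sized for one block and reused across blocks, so peak
// memory no longer scales with the total number of query sequences.
static void classify_in_blocks(std::size_t nseq, std::size_t block_size,
                               std::size_t kmers_per_seq) {
  std::vector<int> buffers(block_size * kmers_per_seq); // allocated once

  for (std::size_t start = 0; start < nseq; start += block_size) {
    const std::size_t end = std::min(nseq, start + block_size);

    // Re-zero the reused buffer; with one buffer per sequence this
    // could have been a single calloc instead.
    std::fill(buffers.begin(), buffers.end(), 0);

    // Dispatch one block of parallel tasks; the implicit barrier at the
    // end of the loop is the extra per-block sync point.
    #pragma omp parallel for
    for (long i = static_cast<long>(start); i < static_cast<long>(end); ++i) {
      int *buf = buffers.data() +
                 (static_cast<std::size_t>(i) - start) * kmers_per_seq;
      classify_one(static_cast<std::size_t>(i), buf, kmers_per_seq);
    }
  }
}

int main() {
  classify_in_blocks(/*nseq=*/100000, /*block_size=*/4096,
                     /*kmers_per_seq=*/64);
  return 0;
}
```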
Testing results
I've profiled the method with a repeated sequence of 100 E. coli bases and a small reference fasta, submitted as a job with 6GB of memory. Without the change I can do 300k sequences; with the change, about 5M, at which point something else becomes the bottleneck.
currently (dada2 master branch): [profiling output]
with this change: [profiling output]
The above run times are from two different jobs on a cluster, so they're not directly comparable, but I don't think the run time changed much either way. I don't know if there are any benefits to making NSEQ smaller, or whether something about this change could help run time. As for potential slowdowns: I had to zero out one array in a for loop instead of using calloc, and there's an extra sync point every NSEQ / INTERRUPT_BLOCK_SIZE parallel tasks, so NSEQ was chosen large enough that this doesn't matter.
I've run some basic checks on the outputs of the changed function in my one test case, and I think it's still correct.
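As a rough sanity check on the zeroing cost (not part of the PR; the sizes below are made up), a standalone micro-benchmark comparing a fresh calloc per block against reusing one buffer and re-zeroing it could look like this. The idea is that either cost is small relative to the per-sequence classification work when the block is large enough.

```cpp
// Standalone micro-benchmark, not part of the PR: compares a fresh
// calloc per block against reusing one buffer and re-zeroing it.
#include <chrono>
#include <cstdio>
#include <cstdlib>
#include <cstring>
#include <vector>

int main() {
  const std::size_t blocks = 1000;
  const std::size_t ints_per_block = 1 << 18; // ~1 MiB of working memory

  using clock = std::chrono::steady_clock;

  // (a) fresh zeroed allocation per block
  auto t0 = clock::now();
  for (std::size_t b = 0; b < blocks; ++b) {
    int *buf = static_cast<int *>(std::calloc(ints_per_block, sizeof(int)));
    buf[b % ints_per_block] = 1; // touch it so the allocation isn't elided
    std::free(buf);
  }
  auto t1 = clock::now();

  // (b) one allocation, re-zeroed before each block
  std::vector<int> reused(ints_per_block);
  for (std::size_t b = 0; b < blocks; ++b) {
    std::memset(reused.data(), 0, ints_per_block * sizeof(int));
    reused[b % ints_per_block] = 1;
  }
  auto t2 = clock::now();

  auto ms = [](clock::duration d) {
    return std::chrono::duration_cast<std::chrono::milliseconds>(d).count();
  };
  std::printf("calloc per block: %lld ms\n", static_cast<long long>(ms(t1 - t0)));
  std::printf("reuse + re-zero:  %lld ms\n", static_cast<long long>(ms(t2 - t1)));
  return 0;
}
```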
Would you like to add this change to the code? Let me know if you'd like more testing, style changes, etc.