Expected memory usage #80

What's the expected memory usage per genome for Dashing2? I'm trying to run it on 500,000 viral isolates, and am running out of memory even with 500GB.
May I have the full command you're using? Any additional information you
could provide would be helpful as well.
In the default all-pairs mode, memory is allocated in a few key areas:
1. Sketches.
This will be around (num_entities * sketch_size * 8) bytes for the default modes. For 500,000 entities with the default sketch size, I would expect ~4GB. (A rough calculation is sketched in the example after this list.)
2. K-mers, if you're saving them as well.
This is again (num_entities * sketch_size * 8) bytes, but 0 otherwise.
3. Parsing buffers.
Each thread reuses a buffer when parsing files. Its size is the length of the longest sequence it has encountered so far, rounded up to the nearest power of 2.
For assembled eukaryotic genomes, (3) can be big, especially if highly multithreaded, but I doubt it's the problem for viral assemblies.
4. Temporary data prepared for I/O.
When writing out the distances, chunks of data are computed in parallel,
and they are each added to a queue of results to be written to disk.
It's possible that the I/O was slow enough that the program stored an
excessive amount of distance data.
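
To put rough numbers on (1)-(3), here is a back-of-the-envelope estimator. This is only an illustrative sketch of the arithmetic above, not Dashing2's actual accounting; the 1024-register default sketch size and the thread/sequence-length figures are assumptions.

```python
# Back-of-the-envelope memory estimate following the breakdown above.
# Assumptions: 8 bytes per register, a 1024-register default sketch size,
# and illustrative values for thread count and longest sequence length.

def next_pow2(n: int) -> int:
    """Round n up to the nearest power of two (how the parse buffers grow)."""
    p = 1
    while p < n:
        p *= 2
    return p

def estimate_bytes(num_entities: int,
                   sketch_size: int = 1024,
                   save_kmers: bool = False,
                   longest_seq_len: int = 50_000,  # e.g. a large viral genome
                   num_threads: int = 16) -> int:
    sketches = num_entities * sketch_size * 8            # (1) sketch registers
    kmers = sketches if save_kmers else 0                # (2) only if k-mers are saved
    buffers = num_threads * next_pow2(longest_seq_len)   # (3) one parse buffer per thread
    return sketches + kmers + buffers

print(estimate_bytes(500_000) / 1e9, "GB")  # ~4.1 GB, dominated by the sketches
```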
(1) and (2) can be reduced by using `-o` to specify an output location for
a sketch database. Then the data is mmap'd instead of stored in RAM, which
can reduce your memory usage.
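
For intuition on why a file-backed sketch database helps (a generic memory-mapping illustration, not Dashing2's code; the file name and shape are made up), a memory-mapped array lives on disk and is paged in by the OS on demand rather than being held on the heap:

```python
import numpy as np

# Hypothetical example: 500,000 sketches of 1024 8-byte registers (~4GB),
# backed by a file instead of RAM. This creates a ~4GB file (sparse on most
# filesystems); only the pages you actually touch become resident.
sketches = np.memmap("sketches.db", dtype=np.uint64, mode="w+",
                     shape=(500_000, 1024))
sketches[0, :] = 0   # touching a row faults in only the pages it needs
sketches.flush()     # write dirty pages back to the file
```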
And if you're doing top-k, threshold-filtered, or the greedy clustering
modes, there is additional memory allocated to build LSH tables over the
data to pull up near neighbors, which can be rather significant.
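
For a sense of where that extra memory goes (a generic MinHash-style banding sketch, not Dashing2's implementation), each entity's sketch is split into bands, each band is hashed into a bucket table, and candidate near neighbors are the entities sharing a bucket; the tables add roughly num_entities * num_bands entries on top of the sketches themselves:

```python
from collections import defaultdict

def build_lsh_tables(sketches, band_size=4):
    """Bucket entities whose sketches agree on a contiguous band of registers.

    sketches: list of equal-length register tuples, one per entity."""
    num_bands = len(sketches[0]) // band_size
    tables = [defaultdict(list) for _ in range(num_bands)]
    for entity_id, sk in enumerate(sketches):
        for b in range(num_bands):
            band = tuple(sk[b * band_size:(b + 1) * band_size])
            tables[b][band].append(entity_id)  # one entry per (entity, band)
    return tables

def candidates(tables, sketch, band_size=4):
    """Near-neighbor candidates: every entity sharing at least one bucket."""
    out = set()
    for b, table in enumerate(tables):
        band = tuple(sketch[b * band_size:(b + 1) * band_size])
        out.update(table.get(band, ()))
    return out
```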
Best,
Daniel
This is the command I'm running:
Thank you! This is a big help. There's one other place memory is used, and I have to say this isn't desirable behavior for most cases. If edit distance is chosen as an output distance or the program is running in greedy clustering mode, then the program needs to hold on to the sequences for later use, but otherwise it doesn't need to keep them.

I need to do a bit of work to reorder this to avoid the problem; I think I have a path to do it, but it will take a bit of reorganization. I'll update you when there's a fix for this.

Thanks again,
Daniel
Checking back in - this is improved with #81. I'm rebuilding the v2.1.18 binaries currently and will update you when they're ready. Memory usage should be lower for --parse-by-seq mode; it won't hold onto sequences it doesn't need. Would you give it another try?

Thanks!
Daniel