Expected memory usage #80

What's the expected memory usage per genome for Dashing2? I'm trying to run it on 500,000 viral isolates, and am running out of memory even with 500GB.
May I have the full command you're using? Any additional information you
could provide would be helpful as well.
In the default all-pairs mode, memory is allocated in a few key areas:
1. Sketches.
This will be around (num_entities * sketch_size * 8) bytes for the default modes. For 500,000 entities with the default sketch size, I would expect ~4GB. (A rough calculation is sketched in the example after this list.)
2. K-mers, if you're saving them as well.
This is again (num_entities * sketch_size * 8) bytes, but 0 otherwise.
3. Parsing buffers.
Each thread reuses a buffer when parsing files. Its size is the length of the longest sequence it has encountered so far, rounded up to the nearest power of 2.
For assembled eukaryotic genomes, (3) can be big, especially if highly multithreaded, but I doubt it's the problem for viral assemblies.
4. Temporary data prepared for I/O.
When writing out the distances, chunks of data are computed in parallel,
and they are each added to a queue of results to be written to disk.
It's possible that the I/O was slow enough that the program stored an
excessive amount of distance data.
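
To put rough numbers on (1)-(3), here is a back-of-the-envelope estimator. This is only an illustrative sketch of the arithmetic above, not Dashing2's actual accounting; the 1024-register default sketch size and the thread/sequence-length figures are assumptions.

```python
# Back-of-the-envelope memory estimate following the breakdown above.
# Assumptions: 8 bytes per register, a 1024-register default sketch size,
# and illustrative values for thread count and longest sequence length.

def next_pow2(n: int) -> int:
    """Round n up to the nearest power of two (how the parse buffers grow)."""
    p = 1
    while p < n:
        p *= 2
    return p

def estimate_bytes(num_entities: int,
                   sketch_size: int = 1024,
                   save_kmers: bool = False,
                   longest_seq_len: int = 50_000,  # e.g. a large viral genome
                   num_threads: int = 16) -> int:
    sketches = num_entities * sketch_size * 8            # (1) sketch registers
    kmers = sketches if save_kmers else 0                # (2) only if k-mers are saved
    buffers = num_threads * next_pow2(longest_seq_len)   # (3) one parse buffer per thread
    return sketches + kmers + buffers

print(estimate_bytes(500_000) / 1e9, "GB")  # ~4.1 GB, dominated by the sketches
```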
(1) and (2) can be reduced by using `-o` to specify an output location for
a sketch database. Then the data is mmap'd instead of stored in RAM, which
can reduce your memory usage.
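
For intuition on why a file-backed sketch database helps (a generic memory-mapping illustration, not Dashing2's code; the file name and shape are made up), a memory-mapped array lives on disk and is paged in by the OS on demand rather than being held on the heap:

```python
import numpy as np

# Hypothetical example: 500,000 sketches of 1024 8-byte registers (~4GB),
# backed by a file instead of RAM. This creates a ~4GB file (sparse on most
# filesystems); only the pages you actually touch become resident.
sketches = np.memmap("sketches.db", dtype=np.uint64, mode="w+",
                     shape=(500_000, 1024))
sketches[0, :] = 0   # touching a row faults in only the pages it needs
sketches.flush()     # write dirty pages back to the file
```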
And if you're doing top-k, threshold-filtered, or the greedy clustering
modes, there is additional memory allocated to build LSH tables over the
data to pull up near neighbors, which can be rather significant.
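
For a sense of where that extra memory goes (a generic MinHash-style banding sketch, not Dashing2's implementation), each entity's sketch is split into bands, each band is hashed into a bucket table, and candidate near neighbors are the entities sharing a bucket; the tables add roughly num_entities * num_bands entries on top of the sketches themselves:

```python
from collections import defaultdict

def build_lsh_tables(sketches, band_size=4):
    """Bucket entities whose sketches agree on a contiguous band of registers.

    sketches: list of equal-length register tuples, one per entity."""
    num_bands = len(sketches[0]) // band_size
    tables = [defaultdict(list) for _ in range(num_bands)]
    for entity_id, sk in enumerate(sketches):
        for b in range(num_bands):
            band = tuple(sk[b * band_size:(b + 1) * band_size])
            tables[b][band].append(entity_id)  # one entry per (entity, band)
    return tables

def candidates(tables, sketch, band_size=4):
    """Near-neighbor candidates: every entity sharing at least one bucket."""
    out = set()
    for b, table in enumerate(tables):
        band = tuple(sketch[b * band_size:(b + 1) * band_size])
        out.update(table.get(band, ()))
    return out
```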
Best,
Daniel
This is the command I'm running:
Thank you! This is a big help. There's one other place memory is used, and I have to say this isn't desirable behavior for most cases. If edit distance is chosen as an output distance or the program is running in greedy clustering mode, then the program needs to hold on to the sequences for later use, but otherwise it doesn't need to keep them.

I need to do a bit of work to reorder this to avoid the problem; I think I have a path to do it, but it will take a bit of reorganization. I'll update you when there's a fix for this.

Thanks again,
Daniel
Checking back in - this is improved with #81. I'm rebuilding the v2.1.18 binaries currently and will update you when they're ready. Memory usage should be lower for --parse-by-seq mode; it won't hold onto sequences it doesn't need. Would you give it another try?

Thanks!
Daniel