MRG: add generic support for any type of sketch collection as query or database #430

Open. Wants to merge 109 commits into base: main.

@ctb ctb commented Aug 19, 2024

This PR adds MultiCollection, a wrapper around multiple Collection structs, which implements several important features, including loading standalone manifests and zip files from pathlists. As part of this, it also adds direct sketch loading from RocksDB/RevIndex.
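The wrapper idea described above can be illustrated with a small conceptual sketch. It is written in Python for brevity and is not the PR's code: the class names mirror the description, but every field, method, and file name below (`from_sources`, `iter_sketches`, `genomes.zip`, and so on) is hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterator, List, Tuple


@dataclass
class Collection:
    """Hypothetical stand-in for one sketch source (zip, manifest, RocksDB, ...)."""
    source: str
    sketch_names: List[str] = field(default_factory=list)

    def __len__(self) -> int:
        return len(self.sketch_names)


@dataclass
class MultiCollection:
    """Hypothetical wrapper that presents several collections as one."""
    collections: List[Collection] = field(default_factory=list)

    @classmethod
    def from_sources(cls, sources: Dict[str, List[str]]) -> "MultiCollection":
        # A real loader would open and parse each path; here the contents are
        # passed in directly so the sketch stays self-contained.
        return cls([Collection(path, list(names)) for path, names in sources.items()])

    def __len__(self) -> int:
        # Total number of sketches across all wrapped collections.
        return sum(len(c) for c in self.collections)

    def iter_sketches(self) -> Iterator[Tuple[str, str]]:
        # Walk every wrapped collection in order, tagging each sketch with its source.
        for coll in self.collections:
            for name in coll.sketch_names:
                yield coll.source, name


# Made-up file and sketch names, standing in for a pathlist with mixed sources.
pathlist = {
    "genomes.zip": ["genomeA", "genomeB"],
    "standalone.manifest.csv": ["genomeC"],
}
multi = MultiCollection.from_sources(pathlist)
print(len(multi))
for source, name in multi.iter_sketches():
    print(source, name)
```

The point of the sketch is only that callers see a single length and a single iterator, regardless of how many underlying sources were listed.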

This PR:

It does, however, break some functionality in index, because RevIndexes with external storage cannot be created from multiple CollectionSets, so we had to use specialized loading code.

Note also that there is a bug around loading multisketch files from pathlists in (see #445); this PR does not adjust the code to deal with this.

TODO:

From #436 -

  • add tests for standalone manifests containing zip files
  • add tests for pathlists containing zip files
  • add tests for fastgather loading a query sketch from RocksDB
  • add tests for fastgather loading against sketches from RocksDB
  • add tests for multisearch loading query & against sketches from RocksDB
  • add tests for pairwise loading sketches from RocksDB
  • add check for warning about loading all sketches from a RocksDB

From #437 -

  • regularize the code for multisearch error exit/better reporting, add tests, etc.

Punting to issues, to be created before merge:

@ctb ctb changed the base branch from main to ctb_misc_cleanup August 19, 2024 23:00
ctb added 10 commits August 20, 2024 10:32
#434)

* preliminary victory

* compiles and mostly runs

* cleanup, split to new module

* cleanup and comment

* more cleanup of diff

* cargo fmt

* fix fmt

* restore n_failed

* comment failing test

* cleanup and de-vec

* create module/submodule structure

* comment for later

* get rid of vec

* beg for help

* cleanup and doc

ctb commented Oct 6, 2024

The benchmarks in #463 show that, with this PR, loading is not significantly slower and does not use more memory. If anything, it uses less memory and loads faster, which could make sense given the changes. Here is a screenshot of the benchmarks as of 5380325, which includes the fixes for #463 in #464.

[Screenshot of benchmarks, 2024-10-06 at 8:47:10 AM]

The only remaining big blocker for merging this PR is the memory performance of fastmultigather, seen above in #430 (comment). I'm going to rerun this to verify the bad behavior before tackling a fix ;).

ctb commented Oct 12, 2024

Updated benchmarks as of 2563b0b, via code in sourmash-bio/sourmash#3232. (Previous benchmarks for this PR here: #430 (comment))

Indexing:

GTDB rs214 took 3 hr 10 min and 14.3 GB of RAM. The index is 19 GB in size.

  • This is a decrease in time from the last release, by more than an hour - wow!
  • This is a large increase in size, presumably due to using internal storage.

For SRR1976948, we see:

| prefix | s | max_rss |
| --- | --- | --- |
| fastmultigather_rocksdb | 151.352 | 590.32 |
| fastgather | 166.86 | 8287.86 |
| fastmultigather | 467.033 | 25236.7 |
| pygather | 2708.01 | 13832.3 |

The times for this PR are slower by a bit from , but not a huge amount.

fastmultigather still requires 25 GB. Sigh. I think that's the last remaining blocker for merge.
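To put the table above in perspective, the rows can be compared against pygather. This is a small sketch and not part of the PR. It assumes the `s` column is wall-clock seconds and that `max_rss` is in MB (consistent with the 25 GB quoted for fastmultigather). `summarize` is a hypothetical helper.

```python
# Rows copied from the benchmark table: (command, s, max_rss).
# Assumption: s is wall-clock seconds and max_rss is in MB.
BENCHMARKS = [
    ("fastmultigather_rocksdb", 151.352, 590.32),
    ("fastgather", 166.86, 8287.86),
    ("fastmultigather", 467.033, 25236.7),
    ("pygather", 2708.01, 13832.3),
]


def summarize(rows, baseline="pygather"):
    """Return per-command memory in GB plus time and memory relative to a baseline."""
    by_name = {name: (seconds, rss) for name, seconds, rss in rows}
    base_seconds, base_rss = by_name[baseline]
    summary = {}
    for name, (seconds, rss) in by_name.items():
        summary[name] = {
            "rss_gb": rss / 1000.0,              # MB -> GB (decimal)
            "speedup": base_seconds / seconds,   # >1 means faster than baseline
            "rss_ratio": rss / base_rss,         # >1 means more memory than baseline
        }
    return summary


summary = summarize(BENCHMARKS)
for name, stats in summary.items():
    print(
        f"{name:24s} {stats['rss_gb']:6.1f} GB  "
        f"{stats['speedup']:5.1f}x faster  {stats['rss_ratio']:.2f}x memory"
    )
```

Under those assumptions, fastmultigather is the only command that uses more memory than pygather, which matches the blocker described above.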

but! fastgather memory has decreased by about 1/3, probably due to sourmash-bio/sourmash#3342!

@ctb ctb changed the title WIP: add generic support for any type of sketch collection as query or database MRG: add generic support for any type of sketch collection as query or database Oct 13, 2024