MRG: add generic support for any type of sketch collection as query or database #430
#434)
* preliminary victory
* compiles and mostly runs
* cleanup, split to new module
* cleanup and comment
* more cleanup of diff
* cargo fmt
* fix fmt
* restore n_failed
* comment failing test
* cleanup and de-vec
* create module/submodule structure
* comment for later
* get rid of vec
* beg for help
* cleanup and doc
The benchmarks in #463 show that, with this PR, loading is not significantly slower and does not use more memory. If anything, it uses less memory and loads faster, which could make sense given the changes. Here is a screenshot of the benchmarks as of 5380325, which includes the fixes for #463 in #464. The only remaining big blocker for merging this PR is the memory performance of fastmultigather, seen above in #430 (comment). I'm going to rerun this to verify the bad behavior before tackling a fix ;).
Updated benchmarks as of 2563b0b, via code in sourmash-bio/sourmash#3232. (Previous benchmarks for this PR here: #430 (comment)) Indexing: GTDB rs214 took 3 hr 10 min and 14.3 GB of RAM. The index is 19 GB in size.
For SRR1976948, we see:
The times for this PR are slower by a bit from , but not a huge amount. fastmultigather still requires 25 GB. Sigh. I think that's the last remaining blocker for merge. But! fastgather memory has decreased by about 1/3, probably due to sourmash-bio/sourmash#3342!
This PR adds `MultiCollection`, a wrapper around multiple `Collection` structs, as an implementation for many important features, including loading of standalone manifests and zip files from pathlists. As part of this it also adds direct sketch loading from RocksDB/RevIndex.

This PR:
- `multisearch` continues past "no query signatures loaded, exiting" message #280

It does, however, break some functionality in `index`, because `RevIndex`es with external storage cannot be created from multiple `CollectionSet`s, so we had to use specialized loading code.

Note also that there is a bug around loading multisketch files from pathlists in (see #445); this PR does not adjust the code to deal with this.
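To make the wrapper idea concrete, here is a minimal sketch of a struct that holds several collections and exposes them as one. It is illustrative only: `SketchRecord`, and the simplified `Collection` and `MultiCollection` shown here, are hypothetical stand-ins written for this example and are not the types or method signatures from sourmash or from this PR.

```rust
// Hypothetical stand-in types for illustration; NOT the sourmash/branchwater API.

/// A simplified record describing one sketch (assumed fields).
#[derive(Debug, Clone, PartialEq)]
pub struct SketchRecord {
    pub name: String,
    pub ksize: u32,
}

/// A simplified collection: one source (e.g. one zip file or manifest).
#[derive(Debug, Default)]
pub struct Collection {
    records: Vec<SketchRecord>,
}

impl Collection {
    pub fn new(records: Vec<SketchRecord>) -> Self {
        Self { records }
    }

    pub fn len(&self) -> usize {
        self.records.len()
    }

    pub fn iter(&self) -> impl Iterator<Item = &SketchRecord> + '_ {
        self.records.iter()
    }
}

/// A wrapper around several collections, presented as a single unit.
#[derive(Debug, Default)]
pub struct MultiCollection {
    collections: Vec<Collection>,
}

impl MultiCollection {
    pub fn new(collections: Vec<Collection>) -> Self {
        Self { collections }
    }

    /// Total number of sketches across all wrapped collections.
    pub fn len(&self) -> usize {
        self.collections.iter().map(Collection::len).sum()
    }

    pub fn is_empty(&self) -> bool {
        self.len() == 0
    }

    /// Iterate over every record, collection by collection, in order.
    pub fn iter(&self) -> impl Iterator<Item = &SketchRecord> + '_ {
        self.collections.iter().flat_map(|c| c.iter())
    }
}
```

In this sketch, callers that previously took a single collection can take the wrapper instead and iterate over all sources uniformly, which is the kind of flexibility a pathlist containing several zip files or manifests would need.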
TODO:
- `manysearch` performance slowdown between v0.9.5 and v0.9.6 #463, 'unreleased with 430 and 464')
- `Manifest::intersect_manifest` to Rust core sourmash#3305)
- `test_fastgather.py::test_indexed_against`
- `MultiCollection`/`SmallSignature` struct to not clone `Collection`s each time
- `index` to take more/better inputs

From #436 -
From #437 -
- `multisearch` error exit/better reporting, add tests, etc.

Punting to issues, to be created before merge:
- `test_fastgather.py::test_against_multisigfile` (see WIP: debug multisigfile test #445)
- `env_logger`, perhaps via features?
- `Collection` and `Storage`: consider how to support more flexible `Collection` in `RevIndex` for external storage sourmash#3321