Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

dartminhash implenmentation #67

Open
jianshu93 opened this issue Mar 1, 2023 · 3 comments
Open

dartminhash implenmentation #67

jianshu93 opened this issue Mar 1, 2023 · 3 comments

Comments

@jianshu93
Copy link

Hello Daniel,

Just want to ask whether there are plans for also implement dartminhash (https://arxiv.org/abs/2005.11547), which is very fast for sparse dataset (genomes, metagenomes), and much faster than bagminhsh according to my reading for very sparse dataset and large k (minhashes).

Thanks,

Jianshu

@dnbaker
Copy link
Owner

dnbaker commented Mar 27, 2023

Hi Jianshu -

Thanks for asking! Sorry it's been so long. That's something that's definitely possible. I'll see if I can work on it in the next week.

Take care,

Daniel

@jianshu93
Copy link
Author

Hello Daniel,

A follow up question, is the one permutation MinHash/SetSketch in Dashing2 the one used in BinDash (I noticed that you also contribute to it), which is the b-bit one permutation MinHash with optimal densification (http://proceedings.mlr.press/v70/shrivastava17a/shrivastava17a.pdf). I feel like BinDash is faster than Dashing 1 (HLL) and also one permutation SetSketch.

Thanks,

Jianshu

@dnbaker
Copy link
Owner

dnbaker commented Mar 30, 2023

It varies by the task at hand. The raw comparison speed of bindash's packed sketches is very fast because they can terminate their loop early, and bindash is an excellent software application. I'm not sure on dashing2 vs bindash runtime in all-pairs dense comparisons. But both Dashing1 and Dashing2 sketch substantially faster than bindash, and we've measured dashing2's comparisons to be 20x as fast as dashing1's, so I expect it to be sufficiently performant for pairwise comparisons.

More important for large problems, I think, Dashing2's K-nearest neighbor mode, distance thresholded sparse distance matrix, and nearly-linear clustering via --greedy allow it to scale in both space and time without being limited to performing all-pair comparisons.

Apart from that, our code is quite general, supporting weighted sketching with either prob or bagminhash, can perform complete k-mer set and weighted k-mer set comparisons, consumes a variety of alphabets, and can do minimizer set comparisons.

Happy to discuss further!

And I'm definitely behind on a faster weighted sketching algorithm, but I'm planning to get to it soon. I'm encouraged by the fact that I've been able to dedicate some attention.

Best,

Daniel

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants