-
Notifications
You must be signed in to change notification settings - Fork 7
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
dartminhash implenmentation #67
Comments
Hi Jianshu - Thanks for asking! Sorry it's been so long. That's something that's definitely possible. I'll see if I can work on it in the next week. Take care, Daniel |
Hello Daniel, A follow up question, is the one permutation MinHash/SetSketch in Dashing2 the one used in BinDash (I noticed that you also contribute to it), which is the b-bit one permutation MinHash with optimal densification (http://proceedings.mlr.press/v70/shrivastava17a/shrivastava17a.pdf). I feel like BinDash is faster than Dashing 1 (HLL) and also one permutation SetSketch. Thanks, Jianshu |
It varies by the task at hand. The raw comparison speed of bindash's packed sketches is very fast because they can terminate their loop early, and bindash is an excellent software application. I'm not sure on dashing2 vs bindash runtime in all-pairs dense comparisons. But both Dashing1 and Dashing2 sketch substantially faster than bindash, and we've measured dashing2's comparisons to be 20x as fast as dashing1's, so I expect it to be sufficiently performant for pairwise comparisons. More important for large problems, I think, Dashing2's K-nearest neighbor mode, distance thresholded sparse distance matrix, and nearly-linear clustering via Apart from that, our code is quite general, supporting weighted sketching with either prob or bagminhash, can perform complete k-mer set and weighted k-mer set comparisons, consumes a variety of alphabets, and can do minimizer set comparisons. Happy to discuss further! And I'm definitely behind on a faster weighted sketching algorithm, but I'm planning to get to it soon. I'm encouraged by the fact that I've been able to dedicate some attention. Best, Daniel |
Hello Daniel,
Just want to ask whether there are plans for also implement dartminhash (https://arxiv.org/abs/2005.11547), which is very fast for sparse dataset (genomes, metagenomes), and much faster than bagminhsh according to my reading for very sparse dataset and large k (minhashes).
Thanks,
Jianshu
The text was updated successfully, but these errors were encountered: