dartminhash implenmentation #67

jianshu93 · 2023-03-01T18:31:57Z

Hello Daniel,

Just want to ask whether there are plans for also implement dartminhash (https://arxiv.org/abs/2005.11547), which is very fast for sparse dataset (genomes, metagenomes), and much faster than bagminhsh according to my reading for very sparse dataset and large k (minhashes).

Thanks,

Jianshu

dnbaker · 2023-03-27T02:30:30Z

Hi Jianshu -

Thanks for asking! Sorry it's been so long. That's something that's definitely possible. I'll see if I can work on it in the next week.

Take care,

Daniel

jianshu93 · 2023-03-27T18:38:41Z

Hello Daniel,

A follow up question, is the one permutation MinHash/SetSketch in Dashing2 the one used in BinDash (I noticed that you also contribute to it), which is the b-bit one permutation MinHash with optimal densification (http://proceedings.mlr.press/v70/shrivastava17a/shrivastava17a.pdf). I feel like BinDash is faster than Dashing 1 (HLL) and also one permutation SetSketch.

Thanks,

Jianshu

dnbaker · 2023-03-30T05:01:06Z

It varies by the task at hand. The raw comparison speed of bindash's packed sketches is very fast because they can terminate their loop early, and bindash is an excellent software application. I'm not sure on dashing2 vs bindash runtime in all-pairs dense comparisons. But both Dashing1 and Dashing2 sketch substantially faster than bindash, and we've measured dashing2's comparisons to be 20x as fast as dashing1's, so I expect it to be sufficiently performant for pairwise comparisons.

More important for large problems, I think, Dashing2's K-nearest neighbor mode, distance thresholded sparse distance matrix, and nearly-linear clustering via --greedy allow it to scale in both space and time without being limited to performing all-pair comparisons.

Apart from that, our code is quite general, supporting weighted sketching with either prob or bagminhash, can perform complete k-mer set and weighted k-mer set comparisons, consumes a variety of alphabets, and can do minimizer set comparisons.

Happy to discuss further!

And I'm definitely behind on a faster weighted sketching algorithm, but I'm planning to get to it soon. I'm encouraged by the fact that I've been able to dedicate some attention.

Best,

Daniel

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

dartminhash implenmentation #67

dartminhash implenmentation #67

jianshu93 commented Mar 1, 2023

dnbaker commented Mar 27, 2023

jianshu93 commented Mar 27, 2023

dnbaker commented Mar 30, 2023 •

edited

Loading

dartminhash implenmentation #67

dartminhash implenmentation #67

Comments

jianshu93 commented Mar 1, 2023

dnbaker commented Mar 27, 2023

jianshu93 commented Mar 27, 2023

dnbaker commented Mar 30, 2023 • edited Loading

dnbaker commented Mar 30, 2023 •

edited

Loading