Emit Intersection Size #38
Comments
Fixed as of 22e340a. (Emits intersection size instead of union size.) |
Maybe it's still WIP, but with
Using just EDIT: working with the v0.4.4 release |
Hi Mihkel, Thanks for reporting. I made a mistake making this change. It's been patched, both in master at this commit and v0.4.5. Would you give it another try for me? Sample output (the files with cp are copies, so the expected intersection is complete):
|
TL;DR: Using another k-mer program with the same k I found some differences that should not be there. Also, a minor detail (but probably a bit tricky to fix): |
Thanks for the find, I'll investigate this soon. You're right that it'd be a bit tricky to have -T report intersections along the diagonal, but checking the off-diagonal entries is important. |
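For anyone who wants to sanity-check off-diagonal entries against exact counts, here is a minimal sketch of a brute-force baseline. It is not part of dashing: the function names are hypothetical, and as a simplifying assumption it treats k-mers as plain forward-strand substrings (no canonicalization) and skips any k-mer containing a character other than A, C, G, or T.

```python
VALID_BASES = set("ACGT")


def kmers(seq, k):
    """Return the set of distinct forward-strand k-mers in seq.

    Assumption: k-mers with ambiguous bases (anything outside ACGT) are dropped.
    """
    seq = seq.upper()
    result = set()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if set(kmer) <= VALID_BASES:
            result.add(kmer)
    return result


def exact_intersection_size(seq_a, seq_b, k):
    """Exact number of distinct k-mers shared by the two sequences."""
    return len(kmers(seq_a, k) & kmers(seq_b, k))
```

This only scales to small inputs, but it gives a ground truth to compare sketch-based estimates against when the same k is used.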
In practice, the zero info on the diagonal is more or less irrelevant so it can be pushed to the bottom of the TODO list. Comparing the results of different data structures vs full khash sets I found that Some testing results, might be interesting:
The values are found on a small set of full genomes where only smaller intersection sizes are observed with |
I'm closing this for now, but feel free to open if you have any further issues. |
I managed to hide the comment previously so well I had trouble myself finding it 😄 : Any plans on implementing |
I see. I can add this sometime relatively soon. I would guess that the bloom filter would overestimate. I'm not sure if there's any way to arrive at cardinality estimates using SuperMinHash (and thereby intersection sizes), but I'll see what I can do. It uses random number generators and sampling to get values with given probabilities. I'll probably have time in a couple of weeks, and thanks for the reminder. |
Currently, dashing emits union sizes with --sizes, but not intersection size. You can get to that by subtracting the union size from the sum of the two estimated set cardinalities, but it would be preferable to emit it directly.
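As a workaround until the value is emitted directly, the subtraction can be sketched with inclusion-exclusion: |A ∩ B| = |A| + |B| - |A ∪ B|. The helper below is hypothetical, not dashing's API. It assumes you already have the two estimated cardinalities and the estimated union size as numbers, and it clamps the result because estimation error can push the raw difference below zero or above the smaller set's size.

```python
def intersection_size(card_a, card_b, union_size):
    """Estimate |A ∩ B| from |A|, |B| and |A ∪ B| by inclusion-exclusion.

    All three inputs are assumed to be estimates, so the raw difference is
    clamped to the feasible range [0, min(|A|, |B|)].
    """
    estimate = card_a + card_b - union_size
    upper = min(card_a, card_b)
    return max(0.0, min(estimate, upper))
```

For example, with cardinalities 100 and 80 and a union size of 150, the estimated intersection is 30. Note that errors in all three inputs add up, so small intersections between large sets are the least reliable.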