Automatically determine which hash to use #1
Hi @rob-p,
Thanks! Any suggestions on when to check, or how to best genericize the code for this? I suspect we have to record the hash size in the serialized output too.
If we just default to 128-bit hashes, it's just a matter of changing one template, here: https://github.com/jermp/sshash/blob/master/include/util.hpp#L26. No need to save anything to the index file.
But when we load the index, don't we need to know which hash it was built with? Are you suggesting to always use the 128-bit hash? I think many common indices (e.g. the human genome and transcriptome) are well within the safe 64-bit hash space. I agree about unifying with upstream. We should see what the full set of necessary changes is.
Yes, I am. But if we still want to make a choice, then the choice must be made statically, before building, like this
Always using 128 bits is, of course, cleaner and simpler ;P. Any idea what the effect of this is on construction and queries?
Yeah, exactly :D
I totally agree. I just wonder if we want to take a bowtie2-like approach (dynamically switch based on what we are indexing) or always use 128-bit hashes, even for small references. The effect isn't nearly as big for us as for bowtie2, because our size is unaffected. If the random-lookup differences are very small, then it may be worth taking the hit to keep things simpler, but if the effect is non-trivial, we may want to make it dynamic. I could try building a small (transcriptome/genome) index with both and testing it out later, but I'm afk right now.
Yes. I will run some experiments too and post them here in this thread.
OK, I've tried on human chromosome 13. To test, I've created two folders,
and obtained identical results. 64-bit hashes:
128-bit hashes:
Can you confirm this behavior on your end as well?
On the whole human genome:
64-bit hashes:
128-bit hashes:
I'll test chromosome 5 now on my end, and then maybe see what I get with xxhash ;P
Great! Thank you @rob-p.
So it seems my machine is slower than yours (uniformly; expected, since it's quite old now). These are with the default hasher.
build
64-bit hasher
128-bit hasher
Pretty similar results for 64 vs. 128 bits, no?
Yeah, pretty similar. It looks like my submodule-fu is not awesome, so pulling in the xxhash fork you listed earlier is non-trivial for me. I'll try to work on that later and report back (have to get the little guy up now ;P).
So I got xxhash to work, and the results don't seem appreciably different (basically within measurement noise), both from what is above and between the two modes:
64-bit hasher
128-bit hasher
Also, 128-bit
Excellent! Thank you for these results. So, I would say:
Done, as of jermp/sshash@3bc3c16. |
Hi @jermp,
It would be great if, in piscem (and perhaps generally upstream in sshash?) the size of the hash function used could be automatically adjusted based on the estimated collision probability. Specifically, if there is a large number of keys, then this function (https://github.com/COMBINE-lab/piscem-cpp/blob/5231b57776422c5fcd77b48de3b39a21161726c2/include/util.hpp#L355) triggers the program to terminate with the relevant message.
I wonder if, instead, a more robust behavior might be to check the collision probability and, if it's too high, just use the larger hash values instead. Alternatively (not sure if it's a good idea though), one could just always try with the 64-bit hash and then, if a collision occurs, try again with the 128-bit hash.
I don't think this will be super common for "normal"-sized indices, but there's already been some interest expressed in trying to build quite large indices with piscem (e.g. https://genomic.social/web/@raw937/109334346658954871), and I imagine that in such a case, this issue would come up rather quickly.