-
-
Notifications
You must be signed in to change notification settings - Fork 73
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add seqhash v2 #398
Add seqhash v2 #398
Conversation
Don't merge this yet: I am unsatisfied with the fragment algorithm, and want to fix it up. |
Ready for check @TimothyStiles |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Only general comment is how do you feel about renaming Hash2 to HashV2 @Koeng101?
@TimothyStiles Feel good about it. Changed. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@Koeng101 I'm a little on the fence on wanting to support multiple versions of seqhash.
Least Rotation is super applicable and allows people to hash plasmids reliably.
Seqhash V1 and V2 don't feel as generally applicable or stable. At the time seqhash V1 came out I wasn't very familiar with sketching, bwt, and other non-cryptographic, non-alignment based methods of searching and comparing sequences. Now I'm wondering if V2 should have a little more consideration in what goes into the ID besides the metadata itself. Should the non-metadata string component be non-cryptographic and more similar to bwt for speed and better comparison? Should V2 just be the metadata and we let users choose their own hashing method?
Another thought. How many different hash algorithms should we intend to support and how should they be named? What's the difference between blake 1, 2, & 3? Is it just a speed thing? Same could be said for the seqhash family but in this case there are some true semantic difference in both input and output. Should seqhashV2 be called FragHash?
seqhashV1 felt more like a reference implementation and a good short hand than a standard but with the advent of seqhashV2 it starts bringing up the questions, "When should I use V1 rather than V2?", and, "how many versions will there be and how will I know which one I should be using if there's, V3, V4, etc?". V2 gives the assumption that you should always use the most up to date hash version. Is there any case where that's not true and it would be better to use V1 instead of V2?
// First, run checks and get the determistic sequence of the hash | ||
for _, char := range sequence { | ||
if !strings.Contains("ATUGCYRSWKMBDHVNZ", string(char)) { | ||
return result, errors.New("Only letters ATUGCYRSWKMBDHVNZ are allowed for DNA/RNA. Got letter: " + string(char)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No sequence hash for amino acids?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can't fragment proteins because they aren't double stranded. (ie, if you fragment them, proteins just become 2 proteins)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If that's the case then maybe this should be called something like FragHash rather than V2?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But it's not the main V2 function? It's called HashV2Fragment
to complement HashV2
, which is an entirely different function
fmt.Println(sequenceSeqhash) | ||
// Output: v1_DLD_f4028f93e08c5c23cbb8daa189b0a9802b378f1a1c919dcbcf1608a615f46350 | ||
// Output: C_JPQCj5PgjFwjy7jaoYmwqQ== |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
API usage changes between V2 examples and V2 <-> V1? Should be unified and expect similar behavior.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
API usage changes between V2 examples and V2 <-> V1? Should be unified and expect similar behavior.
Do you mean the change to [16]byte
?
It doesn't though - you still have to do least comparison in order to get reverse complements.
I don't think they should be used for comparison at least, not many opinions on cryptography but it's hard to fix non cyptographic later if someone maliciously breaks things. Seqhash is used as an identifier - the kmer table or mash or blast index or minimap index or whatever can be on the other side of the index. Let's you change what your comparison algorithm is, which is much more important that your hashing algorithm.
No, I don't think this is a good idea when trying to make relatively stable identifiers.
However many are necessary! And I'm not really sure about naming - but might be good to think about.
I think I explain the use case for seqhashV2 pretty clearly in the package documentation. It shouldn't be called FragHash because it isn't just for fragments, it's for DNA/RNA/Proteins/Fragments. Fragment hashing is just a feature of v2 that v1 did not have.
This is a good question that I should add to the top of the package docs, but I do think it is answered if you read the v2 documentation. ie, V2 is much smaller but sacrifices length from 256 to 120 bits, increasing collision chance. But more people will probably have that question and not bother to read the full docs.
"How many versions will there be" I think is literally unanswerable because you can't fully predict the future, and if it was predictable, that would be bad, because then it would be inflexible to the changing needs of users. The point of having versions is so that you can add more later, otherwise just do away with versioning at all.
Well, presumably we can explain that to the user.
Less collision chance. But generally speaking, I think you should probably always use V2 over V1. TODO for me:
|
Changes in this PR
Fixes #397 . This PR makes seqhash v2, which is both compressed (~3x smaller than seqhash v1) and has compatibility for encoding fragments as seqhashes.
This PR has 100% test coverage.
Why are you making these changes?
Seqhash v1 produces hashes that are too large, and cannot be used to encode genetic parts / fragments.
Are any changes breaking? (IMPORTANT)
No
Pre-merge checklist
All of these must be satisfied before this PR is considered
ready for merging. Mergeable PRs will be prioritized for review.
primers/primers_test.go
for what this might look like.CHANGELOG.md
in the[Unreleased]
section.