Compressed Seqhash #397
Labels: easy, enhancement, good first issue, low priority, wontfix
What I want
I would like a more compressed Seqhash. Here is a current seqhash:
v1_DLD_f4028f93e08c5c23cbb8daa189b0a9802b378f1a1c919dcbcf1608a615f46350
Here is the latter portion encoded in base58:
HRWk6jLXJ3uvuKBnjyAhinEUsuzKbgpphDkrEcStX4AT
This is much shorter: 44 characters instead of 64. If we truncate to 16 bytes instead of 32 bytes, we get:
X8d1qRxANHFkdQM4kqKYWb
Much shorter! (base58 is nicer for encoding into various applications since it doesn't have any special characters.)
With 8 bits in a byte, the flags can take up less space:
This would result in seqhashes that are 16 bytes long and take up 22 characters of text rather than the current 71. That is far better than, say, 1000 characters when it comes to encoding a full protein.
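To make the length comparison concrete, here is a minimal sketch of base58 encoding applied to the hash portion of the seqhash above. It assumes the Bitcoin base58 alphabet and that truncation keeps the leading bytes of the digest; the function name `base58_encode` is illustrative and not part of any existing seqhash API.

```python
# Assumption: Bitcoin-style base58 alphabet (no 0, O, I, or l).
B58_ALPHABET = "123456789ABCDEFGHJKLMNPQRSTUVWXYZabcdefghijkmnopqrstuvwxyz"


def base58_encode(data: bytes) -> str:
    """Encode bytes as base58, mapping each leading zero byte to '1'."""
    n = int.from_bytes(data, "big")
    digits = []
    while n > 0:
        n, rem = divmod(n, 58)
        digits.append(B58_ALPHABET[rem])
    # Leading zero bytes carry no numeric value, so encode them explicitly.
    pad = len(data) - len(data.lstrip(b"\x00"))
    return "1" * pad + "".join(reversed(digits))


# Hash portion of the example seqhash from this issue (32 bytes, 64 hex characters).
hex_digest = "f4028f93e08c5c23cbb8daa189b0a9802b378f1a1c919dcbcf1608a615f46350"
digest = bytes.fromhex(hex_digest)

full = base58_encode(digest)        # all 32 bytes
short = base58_encode(digest[:16])  # truncated to the first 16 bytes (assumed)

print(len(hex_digest), len(full), len(short))  # 64 44 22
```

Note that base58 output length varies slightly with the value being encoded, so 16 bytes will usually, but not always, produce exactly 22 characters.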
I ask for 15-byte truncation rather than 16-byte truncation (120 bits vs. 128 bits) so that the flags can fit in the remaining byte while the total length stays a multiple of 2. Collisions remain unlikely: to have a 50% chance of a collision in a 120-bit hash, you would need to generate approximately 1.357×10^18 hashes.
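The collision figure can be checked with the usual birthday-bound approximation, n ≈ sqrt(2 · ln(1 / (1 − p)) · 2^bits). The sketch below assumes an ideal, uniformly distributed hash; the function name is illustrative.

```python
import math


def hashes_for_collision(bits: int, p: float = 0.5) -> float:
    """Approximate number of hashes needed for collision probability p.

    Uses the birthday-bound approximation and assumes a uniformly
    distributed hash with 2**bits possible values.
    """
    return math.sqrt(2 * math.log(1 / (1 - p)) * 2.0 ** bits)


n_120 = hashes_for_collision(120)
n_128 = hashes_for_collision(128)

print(f"120-bit: {n_120:.3e}")
print(f"128-bit: {n_128:.3e}")
```

Dropping from 128 to 120 bits reduces the 50% threshold by a factor of 2^4 = 16, since the threshold scales with the square root of the hash space.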
Why this is important
I am looking at building seqhashes into applications that interact with LLMs. LLMs are very bad at handling sequences, yet they usually want to introspect the inputs and outputs of APIs and interactive code. This introspection is very useful for most users, since the AI can automatically improve if it can look at the full input and output of its code. However, the LLM then tries to look at the sequences themselves and gets confused. This feature would greatly reduce the amount of data needed to refer to genetic sequences.