Compressed Seqhash #397

Koeng101 · 2023-11-08T07:07:00Z

What I want

I would like a more compressed Seqhash. Here is a current seqhash: v1_DLD_f4028f93e08c5c23cbb8daa189b0a9802b378f1a1c919dcbcf1608a615f46350

Here is the hash portion (the 64 hex characters after the prefix) encoded in base58: HRWk6jLXJ3uvuKBnjyAhinEUsuzKbgpphDkrEcStX4AT. That is 44 characters instead of 64. If we truncate the hash to 16 bytes instead of 32 bytes, we get X8d1qRxANHFkdQM4kqKYWb, which is only 22 characters. Much shorter! (base58 is also nicer for encoding into various applications, since it doesn't have any special characters)
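As a rough sketch of the encoding step described above: the snippet below assumes the Bitcoin-style base58 alphabet and truncation to the first 16 bytes of the digest, neither of which the issue states explicitly. The helper names `b58encode` and `b58decode` are illustrative and not part of any seqhash API.

```python
# Bitcoin-style base58 alphabet (no 0, O, I, or l). This is an assumption:
# the issue does not say which base58 alphabet was used.
ALPHABET = "123456789ABCDEFGHJKLMNPQRSTUVWXYZabcdefghijkmnopqrstuvwxyz"


def b58encode(data: bytes) -> str:
    """Encode bytes as base58, keeping leading zero bytes as '1' characters."""
    num = int.from_bytes(data, "big")
    chars = []
    while num > 0:
        num, rem = divmod(num, 58)
        chars.append(ALPHABET[rem])
    pad = len(data) - len(data.lstrip(b"\x00"))
    return ALPHABET[0] * pad + "".join(reversed(chars))


def b58decode(text: str) -> bytes:
    """Inverse of b58encode."""
    num = 0
    for ch in text:
        num = num * 58 + ALPHABET.index(ch)
    pad = len(text) - len(text.lstrip(ALPHABET[0]))
    body = num.to_bytes((num.bit_length() + 7) // 8, "big")
    return b"\x00" * pad + body


seqhash = "v1_DLD_f4028f93e08c5c23cbb8daa189b0a9802b378f1a1c919dcbcf1608a615f46350"
digest = bytes.fromhex(seqhash.split("_")[2])  # 32-byte hash portion

full = b58encode(digest)        # all 32 bytes
short = b58encode(digest[:16])  # first 16 bytes only (assumed truncation)
print(len(full), len(short))    # 44 22
```

If the same alphabet and truncation were used in the issue, `full` and `short` should correspond to the two base58 strings quoted above.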

Since there are 8 bits in a byte, the metadata prefix (currently v1_DLD_) can be packed into a single flag byte and take up less space:

3 bits: seqhash version
1 bit: 15-byte hash (vs the default 32-byte hash). 15 bytes should be good enough for the majority of purposes while being about half the size, and with the flag byte the full seqhash fits nicely in 16 bytes.
1 bit: circularity
1 bit: double-strandedness
2 bits: DNA/RNA/PROTEIN (the fourth value is left unspecified)
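The bit layout above could be packed roughly as follows. This is only a sketch: the bit order (version in the top three bits, sequence type in the bottom two), the numeric values for DNA/RNA/PROTEIN, and the function names are all assumptions made for illustration, and the example reads the DLD prefix as DNA, linear, double-stranded.

```python
from enum import IntEnum


class SeqType(IntEnum):
    # Hypothetical numbering; the proposal only says 2 bits are reserved.
    DNA = 0
    RNA = 1
    PROTEIN = 2  # value 3 is left unspecified


def pack_flags(version, truncated, circular, double_stranded, seq_type):
    """Pack the five fields into one byte (assumed order, high bits first)."""
    if not 0 <= version < 8:
        raise ValueError("version must fit in 3 bits")
    if not 0 <= int(seq_type) < 4:
        raise ValueError("sequence type must fit in 2 bits")
    return (
        (version << 5)
        | (int(truncated) << 4)
        | (int(circular) << 3)
        | (int(double_stranded) << 2)
        | int(seq_type)
    )


def unpack_flags(flag):
    """Return (version, truncated, circular, double_stranded, seq_type)."""
    return (
        (flag >> 5) & 0b111,
        bool((flag >> 4) & 1),
        bool((flag >> 3) & 1),
        bool((flag >> 2) & 1),
        flag & 0b11,
    )


# Version 1, 15-byte hash, linear, double-stranded DNA.
flag = pack_flags(1, True, False, True, SeqType.DNA)
digest = bytes.fromhex(
    "f4028f93e08c5c23cbb8daa189b0a9802b378f1a1c919dcbcf1608a615f46350"
)
compact = bytes([flag]) + digest[:15]  # 1 flag byte + 15 hash bytes
print(f"{flag:08b}", len(compact))     # 00110100 16
```

The resulting 16 bytes could then be base58-encoded to get the short text form.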

This would result in seqhashes that are 16 bytes and take up 22 characters of text rather than the current 71. That is far better than, say, the 1000 characters it can take to write out a full protein sequence.

I ask for 15-byte truncation rather than 16-byte truncation (120 bits vs 128 bits) so that the flag byte can fit while the total stays at 16 bytes, still a multiple of 2. Collisions remain very unlikely: to have a 50% chance of a collision in a 120-bit hash, you would need to generate approximately 1.357×10^18 hashes.
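The collision estimate can be checked with the standard birthday-bound approximation, n ≈ sqrt(2 · N · ln(1 / (1 − p))), where N is the number of possible hash values. The function name below is just for illustration.

```python
import math


def hashes_for_collision(bits: int, probability: float = 0.5) -> float:
    """Approximate number of random hashes needed to reach the given
    collision probability, using the birthday-bound approximation."""
    space = 2.0 ** bits
    return math.sqrt(2.0 * space * math.log(1.0 / (1.0 - probability)))


n = hashes_for_collision(120)
print(f"{n:.3e}")  # 1.357e+18
```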

Why this is important

I am looking at building seqhashes into applications that interact with LLMs. LLMs are very bad at handling sequences, since they usually want to introspect inputs and outputs from APIs / interactive code. This introspection is also very useful for most users, since the AI can automatically improve if it can look at the full input / output of its code. However, the LLM then tries to read the sequences themselves, and it gets confused. This feature would greatly reduce the amount of data needed to refer to genetic sequences.


TimothyStiles added the wontfix label Dec 8, 2023

TimothyStiles closed this as completed Dec 8, 2023
