Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Optimise initial bounds for maching peptides in Suffix Array #25

Draft
wants to merge 26 commits into
base: develop
Choose a base branch
from

Conversation

tibvdm
Copy link
Contributor

@tibvdm tibvdm commented Aug 27, 2024

This PR introduces the "K-mer bounds table". This table is a small lookup that allows us to very easily match a short peptide sequence and provides us already with a better initial value for the minimum and maximum bounds when looking up a peptide in the suffix array.

See this detailed explanation for how this works exactly:

When searching for a peptide within the suffix array, we employ a binary search strategy across all the suffixes of the proteins in UniProtKB. In the 2024_04 release of UniProtKB, which contains more than 87 billion amino acids, we need to perform log2(87 * 10^9) ≈ 36,35 = 37 suffix comparisons, on average, per bound. Since we are identifying a range of suffixes that match with our peptide within the suffix array, two binary searches are required to determine the minimum and maximum bounds, resulting in a total of 74 comparisons per peptide. These comparisons are computationally costly and should be minimized. To reduce the number of searches, we precompute and store the initial bounds for all possible peptides of length k and shorter. This allows for a constant-time lookup, followed by a much smaller binary search step for each query.

For a k-mer table with factor k, there exist a total of 20k different k-mers. This can be seen as dividing the space of 87 billion amino acids into 20k compartments, where, on average, each compartment has size 2^37 / 20^k. By taking the logarithm log2(2^37 / 20^k) = 37 - k * log2(20), we can calculate how many text comparisons we still need to perform after consulting the k-mer bounds table. For a k of size 5, we only need 15 comparisons per bound instead of 37, resulting in significantly less work per search.

@pverscha pverscha changed the title Feature/sa bounds optimization Optimise initial bounds for maching peptides in Suffix Array Sep 19, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant