-
Notifications
You must be signed in to change notification settings - Fork 82
Requirements for Binning Directories
Given a text T with |T| = n and a k-mer size k, the set of consecutive k-mers S = [ [T_0, T_1, ..., T_k-1], [T_1, T_2, ..., T_k], ..., [T_n-k, T_n-k+1, ..., T_n-1] ].
k = 5
T = GACTACGTAGC
S = [ GACTA,
ACTAC,
CTACG,
TACGT,
ACGTA,
CGTAG,
GTAGC ]
To determine whether a pattern P with |P| = m occurs in the text T with up to e errors, the well known k-mer lemma can be applied: A text T of length n contains n-k+1 k-mers and a occurrence of pattern P with |P|=m shares at least m-(e+1)*k-1 k-mers with T.
Given a text T with |T| = n, a k-mer size k and an offset o, the set of offset k-mers S = [ [T_0, T_1, ..., T_k-1], [T_o, T_o+1, ..., T_o+k-1], [T_2o, T_2o+1, ..., T_2o+k-1], ... [T_ceil((n-k+1)/o)-o, ..., T_ceil(ceil((n-k+1)/o)-o+k].
Given a text T of length n, a k-mer size k and an offset o, T contains ok = ceil((n-k+1)/o) many k-mers and a occurrence of pattern |P| with |P|=m shares at least ok - e*ceil(k/o) many k-mers.
k = 5
o = 3
T = GACTACGTAGC
S = [ GACTA,
TACGT,
GTAGC ]
When looking for a pattern P in text T, all k-mers of T must be compared to the offset k-mers of P, since we don't know if P aligns to T with a multiple of offset o. In theory, it would also work the other way around (storing the offset k-mers of T and searching all k-mers of T). Storing the offset k-mers of T would reduce the memory needed to store the k-mers of T, while it (depending on how you look the k-mers up, this is the case for Binning Directories) wouldn't speed up the querying. For Binning Directories we usually store all k-mers of T and look up the offset k-mers of P to speed up the querying (usually by about a factor of o).
k = 3
o = 2
T = GACTACGTAGC
S_T = [ GAC,
CTA,
ACG,
GTA,
AGC ]
P = ACTACG
S_P = [ ACT,
TAC ]
Even though P is a substring of T, no k-mers match.
k = 3
o = 2
T = GACTACGTAGC
S_T = [ GAC,
ACT,
CTA,
TAC,
ACG,
CGT,
GTA,
TAG,
AGC ]
P = ACTACG
S_P = [ ACT,
TAC ]
Now S_P is contained in S_T.
Given a text T with |T| = n, a k-mer size k and a window size w, the set of minimizing k-mers over T is the set of the lexicographically smallest k-mers in consecutive substrings of length w of T and its reverse complement.
Heuristically we can look at the minimizer coverage of a pattern P and infer some consequences for the amount of minimizers destroyed by an error.
[...]
We have a mask.
NP-hard. There are heuristics.
[...]