Skip to content

Requirements for Binning Directories

Enrico Seiler edited this page Jan 25, 2019 · 8 revisions

Shapes

Consecutive Shapes

Definition

Given a text T with |T| = n and a k-mer size k, the set of consecutive k-mers S = [ [T_0, T_1, ..., T_k-1], [T_1, T_2, ..., T_k], ..., [T_n-k, T_n-k+1, ..., T_n-1] ].

Example

k = 5
T =   GACTACGTAGC
S = [ GACTA,
       ACTAC,
        CTACG,
         TACGT,
          ACGTA,
           CGTAG,
            GTAGC ]

Thresholding

To determine whether a pattern P with |P| = m occurs in the text T with up to e errors, the well known k-mer lemma can be applied: A text T of length n contains n-k+1 k-mers and a occurrence of pattern P with |P|=m shares at least m-(e+1)*k-1 k-mers with T.


Offset Shapes

Definition

Given a text T with |T| = n, a k-mer size k and an offset o, the set of offset k-mers S = [ [T_0, T_1, ..., T_k-1], [T_o, T_o+1, ..., T_o+k-1], [T_2o, T_2o+1, ..., T_2o+k-1], ... [T_ceil((n-k+1)/o)-o, ..., T_ceil(ceil((n-k+1)/o)-o+k].

Thresholding

Given a text T of length n, a k-mer size k and an offset o, T contains ok = ceil((n-k+1)/o) many k-mers and a occurrence of pattern |P| with |P|=m shares at least ok - e*ceil(k/o) many k-mers.

Example

k = 5
o = 3
T =   GACTACGTAGC
S = [ GACTA,
         TACGT,
            GTAGC ]

Details

When looking for a pattern P in text T, all k-mers of T must be compared to the offset k-mers of P, since we don't know if P aligns to T with a multiple of offset o. In theory, it would also work the other way around (storing the offset k-mers of T and searching all k-mers of T). Storing the offset k-mers of T would reduce the memory needed to store the k-mers of T, while it (depending on how you look the k-mers up, this is the case for Binning Directories) wouldn't speed up the querying. For Binning Directories we usually store all k-mers of T and look up the offset k-mers of P to speed up the querying (usually by about a factor of o).

k = 3
o = 2
T =     GACTACGTAGC
S_T = [ GAC,
          CTA,
            ACG,
              GTA,
                AGC ]
P =      ACTACG
S_P = [  ACT,
           TAC ]

Even though P is a substring of T, no k-mers match.

k = 3
o = 2
T =     GACTACGTAGC
S_T = [ GAC,
         ACT,
          CTA,
           TAC,
            ACG,
             CGT,
              GTA,
               TAG,
                AGC ]
P =      ACTACG
S_P = [  ACT,
           TAC ]

Now S_P is contained in S_T.


Minimizer Shapes

Definition

Given a text T with |T| = n, a k-mer size k and a window size w, the set of minimizing k-mers over T is the set of the lexicographically smallest k-mers in consecutive substrings of length w of T and its reverse complement.

Thresholding

Heuristically we can look at the minimizer coverage of a pattern P and infer some consequences for the amount of minimizers destroyed by an error.

Example

[...]

Gapped Shapes

Definition

We have a mask.

Thresholding

NP-hard. There are heuristics.

Example

[...]


Clone this wiki locally