Requirements for Binning Directories

Shapes

Consecutive Shapes

Definition

Given a text T with |T| = n and a k-mer size k, the set of consecutive k-mers S = [ [T_0, T_1, ..., T_k-1], [T_1, T_2, ..., T_k], ..., [T_n-k, T_n-k+1, ..., T_n-1] ].

Example

k = 5
T =   GACTACGTAGC
S = [ GACTA,
       ACTAC,
        CTACG,
         TACGT,
          ACGTA,
           CGTAG,
            GTAGC ]

Thresholding

To determine whether a pattern P with |P| = m occurs in the text T with up to e errors, the well known k-mer lemma can be applied: A text T of length n contains n-k+1 k-mers and a occurrence of pattern P with |P|=m shares at least m-(e+1)*k-1 k-mers with T.

Offset Shapes

Definition

Given a text T with |T| = n, a k-mer size k and an offset o, the set of offset k-mers S = [ [T_0, T_1, ..., T_k-1], [T_o, T_o+1, ..., T_o+k-1], [T_2o, T_2o+1, ..., T_2o+k-1], ... [T_ceil((n-k+1)/o)-o, ..., T_ceil(ceil((n-k+1)/o)-o+k].

Thresholding

Given a text T of length n, a k-mer size k and an offset o, T contains ok = ceil((n-k+1)/o) many k-mers and a occurrence of pattern |P| with |P|=m shares at least ok - e*ceil(k/o) many k-mers.

Example

k = 5
o = 3
T =   GACTACGTAGC
S = [ GACTA,
         TACGT,
            GTAGC ]

Details

When looking for a pattern P in text T, all k-mers of T must be compared to the offset k-mers of P, since we don't know if P aligns to T with a multiple of offset o. In theory, it would also work the other way around (storing the offset k-mers of T and searching all k-mers of T). Storing the offset k-mers of T would reduce the memory needed to store the k-mers of T, while it (depending on how you look the k-mers up, this is the case for Binning Directories) wouldn't speed up the querying. For Binning Directories we usually store all k-mers of T and look up the offset k-mers of P to speed up the querying (usually by about a factor of o).

k = 3
o = 2
T =     GACTACGTAGC
S_T = [ GAC,
          CTA,
            ACG,
              GTA,
                AGC ]
P =      ACTACG
S_P = [  ACT,
           TAC ]

Even though P is a substring of T, no k-mers match.

k = 3
o = 2
T =     GACTACGTAGC
S_T = [ GAC,
         ACT,
          CTA,
           TAC,
            ACG,
             CGT,
              GTA,
               TAG,
                AGC ]
P =      ACTACG
S_P = [  ACT,
           TAC ]

Now S_P is contained in S_T.

Minimizer Shapes

Definition

Given a text T with |T| = n, a k-mer size k and a window size w, the set of minimizing k-mers over T is the set of the lexicographically smallest k-mers in consecutive substrings of length w of T and its reverse complement.

Thresholding

Heuristically we can look at the minimizer coverage of a pattern P and infer some consequences for the amount of minimizers destroyed by an error.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Requirements for Binning Directories

Shapes

Consecutive Shapes

Definition

Example

Thresholding

Offset Shapes

Definition

Thresholding

Example

Details

Minimizer Shapes

Definition

Thresholding

Example

Gapped Shapes

Definition

Thresholding

Example

Getting started

Guidelines

General Reading

Modern C++ and design patterns

Others

Other links

Clone this wiki locally