diff --git a/dev/.documenter-siteinfo.json b/dev/.documenter-siteinfo.json index dc83ff4c..6635fbe2 100644 --- a/dev/.documenter-siteinfo.json +++ b/dev/.documenter-siteinfo.json @@ -1 +1 @@ -{"documenter":{"julia_version":"1.10.0","generation_timestamp":"2024-01-02T20:07:58","documenter_version":"1.2.1"}} \ No newline at end of file +{"documenter":{"julia_version":"1.10.0","generation_timestamp":"2024-01-15T11:20:39","documenter_version":"1.2.1"}} \ No newline at end of file diff --git a/dev/construction/index.html b/dev/construction/index.html index eb825668..3d9b0665 100644 --- a/dev/construction/index.html +++ b/dev/construction/index.html @@ -1,5 +1,5 @@ -Constructing sequences · BioSequences.jl

Construction & conversion

Here we will showcase the various ways you can construct the various sequence types in BioSequences.

Constructing sequences

From strings

Sequences can be constructed from strings using their constructors:

julia> LongDNA{4}("TTANC")
+Constructing sequences · BioSequences.jl

Construction & conversion

Here we will showcase the various ways you can construct the various sequence types in BioSequences.

Constructing sequences

From strings

Sequences can be constructed from strings using their constructors:

julia> LongDNA{4}("TTANC")
 5nt DNA Sequence:
 TTANC
 
@@ -135,14 +135,14 @@
 "TAGA"
 
 julia> string(push!(f(), DNA_A))
-"TAGA"
source

Comparison to other sequence types

Following Base standards, BioSequences do not compare equal to other containers even if they have the same elements. To e.g. compare a BioSequence with a vector of DNA, compare the elements themselves:

julia> seq = dna"GAGCTGA"; vec = collect(seq);
+PKLEQC
source

Comparison to other sequence types

Following Base standards, BioSequences do not compare equal to other containers even if they have the same elements. To e.g. compare a BioSequence with a vector of DNA, compare the elements themselves:

julia> seq = dna"GAGCTGA"; vec = collect(seq);
 
 julia> seq == vec, isequal(seq, vec)
 (false, false)
 
 julia> length(seq) == length(vec) && all(i == j for (i, j) in zip(seq, vec))
-true 
+true diff --git a/dev/counting/index.html b/dev/counting/index.html index e1d576a7..2989b489 100644 --- a/dev/counting/index.html +++ b/dev/counting/index.html @@ -1,9 +1,9 @@ -Counting · BioSequences.jl

Counting

BioSequences extends the Base.count method to provide some useful utilities for counting the number of sites in biological sequences.

Most generically you can count the number of sites that satisfy some condition i.e. cause some function to return true:

julia> count(isambiguous, dna"ATCGM")
+Counting · BioSequences.jl

Counting

BioSequences extends the Base.count method to provide some useful utilities for counting the number of sites in biological sequences.

Most generically you can count the number of sites that satisfy some condition i.e. cause some function to return true:

julia> count(isambiguous, dna"ATCGM")
 1
 

You can also use two sequences, for example to compute the number of matching or mismatching symbols:

julia> count(!=, dna"ATCGM", dna"GCCGM")
 2
 
 julia> count(==, dna"ATCGM", dna"GCCGM")
 3
-

Alias functions

A number of functions which are aliases for various invocations of Base.count are provided.

Alias function    Base.count call(s)
n_ambiguous       count(isambiguous, seq), count(isambiguous, seqa, seqb)
n_certain         count(iscertain, seq), count(iscertain, seqa, seqb)
n_gap             count(isgap, seq), count(isgap, seqa, seqb)
matches           count(==, seqa, seqb)
mismatches        count(!=, seqa, seqb)

Bit-parallel optimisations

For the vast majority of Base.count(f, seq) and Base.count(f, seqa, seqb) methods, a naive counting is done: the internal count_naive function is called, which simply loops over each position, applies f, and accumulates the result.

However, for some functions, it is possible to implement highly efficient methods that use bit-parallelism to check many elements at one time. This is made possible by the succinct encoding of BioSequences. Usually f is one of the functions provided by BioSymbols.jl or by BioSequences.jl.

For such sequence and function combinations, Base.count(f, seq) is overloaded to call an internal BioSequences.count_*_bitpar function, which is passed the sequence(s). If you want to force BioSequences to use naive counting for the purposes of testing or debugging for example, then you can call BioSequences.count_naive directly.

+

Alias functions

A number of functions which are aliases for various invocations of Base.count are provided.

Alias function    Base.count call(s)
n_ambiguous       count(isambiguous, seq), count(isambiguous, seqa, seqb)
n_certain         count(iscertain, seq), count(iscertain, seqa, seqb)
n_gap             count(isgap, seq), count(isgap, seqa, seqb)
matches           count(==, seqa, seqb)
mismatches        count(!=, seqa, seqb)

Bit-parallel optimisations

For the vast majority of Base.count(f, seq) and Base.count(f, seqa, seqb) methods, a naive counting is done: the internal count_naive function is called, which simply loops over each position, applies f, and accumulates the result.

However, for some functions, it is possible to implement highly efficient methods that use bit-parallelism to check many elements at one time. This is made possible by the succinct encoding of BioSequences. Usually f is one of the functions provided by BioSymbols.jl or by BioSequences.jl.

For such sequence and function combinations, Base.count(f, seq) is overloaded to call an internal BioSequences.count_*_bitpar function, which is passed the sequence(s). If you want to force BioSequences to use naive counting for the purposes of testing or debugging for example, then you can call BioSequences.count_naive directly.
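
To illustrate the bit-parallel idea described above, here is a minimal sketch in plain Julia. It does not use BioSequences internals: the 2-bit packing (A=00, C=01, G=10, T=11) and the helper names pack2bit, mismatches_naive and mismatches_bitpar are assumptions made up for this illustration, and the actual count_*_bitpar functions may work differently.

```julia
# Hypothetical helpers for illustration only; not part of BioSequences.
# A UInt64 holds up to 32 symbols at 2 bits each (A=00, C=01, G=10, T=11).
function pack2bit(s::AbstractString)
    length(s) <= 32 || throw(ArgumentError("at most 32 symbols fit in a UInt64"))
    x = UInt64(0)
    for (i, c) in enumerate(s)
        code = c == 'A' ? 0x00 : c == 'C' ? 0x01 : c == 'G' ? 0x02 : c == 'T' ? 0x03 :
               throw(ArgumentError("unsupported symbol $c"))
        x |= UInt64(code) << (2 * (i - 1))
    end
    return x
end

# Naive counting: extract and compare one 2-bit field per iteration.
function mismatches_naive(a::UInt64, b::UInt64, n::Integer)
    n_mis = 0
    for i in 0:(n - 1)
        fa = (a >> (2i)) & 0x03
        fb = (b >> (2i)) & 0x03
        n_mis += fa != fb
    end
    return n_mis
end

# Bit-parallel counting: XOR marks differing bits, the shift-and-OR folds each
# field's two bits onto its low bit, the mask keeps only those low bits, and
# count_ones tallies all 32 fields in one go.
function mismatches_bitpar(a::UInt64, b::UInt64)
    x = a ⊻ b
    folded = (x | (x >> 1)) & 0x5555555555555555
    return count_ones(folded)
end

a = pack2bit("ATCGT")
b = pack2bit("GCCGT")
mismatches_naive(a, b, 5)   # 2
mismatches_bitpar(a, b)     # 2
```

Unused high fields are zero in both words, so they never contribute to the bit-parallel count. The naive version needs one iteration per symbol, while the bit-parallel version handles a whole word with a handful of instructions.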

diff --git a/dev/index.html b/dev/index.html index 5b521e69..afc9f74b 100644 --- a/dev/index.html +++ b/dev/index.html @@ -1,2 +1,2 @@ -Home · BioSequences.jl

BioSequences

Latest Release MIT license Documentation Pkg Status Chat

Description

BioSequences provides data types and methods for common operations with biological sequences, including DNA, RNA, and amino acid sequences.

Installation

You can install BioSequences from the Julia REPL. Press ] to enter pkg mode, then enter the following:

add BioSequences

If you are interested in the cutting edge of the development, please check out the master branch to try new features before release.

Testing

BioSequences is tested against Julia 1.X on Linux, OS X, and Windows.

Unit tests Documentation

Contributing

We appreciate contributions from users including reporting bugs, fixing issues, improving performance and adding new features.

Take a look at the contributing files for detailed contributor and maintainer guidelines, and the code of conduct.

Financial contributions

We also welcome financial contributions in full transparency on our open collective. Anyone can file an expense. If the expense makes sense for the development of the community, it will be "merged" in the ledger of our open collective by the core contributors and the person who filed the expense will be reimbursed.

Backers & Sponsors

Thank you to all our backers and sponsors!

Love our work and community? Become a backer.

backers

Does your company use BioJulia? Help keep BioJulia feature rich and healthy by sponsoring the project Your logo will show up here with a link to your website.

Questions?

If you have a question about contributing or using BioJulia software, come on over and chat to us on Gitter, or you can try the Bio category of the Julia discourse site.

+Home · BioSequences.jl

BioSequences

Latest Release MIT license Documentation Pkg Status

Description

BioSequences provides data types and methods for common operations with biological sequences, including DNA, RNA, and amino acid sequences.

Installation

You can install BioSequences from the Julia REPL. Press ] to enter pkg mode, then enter the following:

add BioSequences

If you are interested in the cutting edge of the development, please check out the master branch to try new features before release.

Testing

BioSequences is tested against Julia 1.X on Linux, OS X, and Windows.

Unit tests Documentation

Contributing

We appreciate contributions from users including reporting bugs, fixing issues, improving performance and adding new features.

Take a look at the contributing files for detailed contributor and maintainer guidelines, and the code of conduct.

Questions?

If you have a question about contributing or using BioJulia software, come on over and chat to us on the #biology channel on the Julia Slack, or you can try the Bio category of the Julia discourse site.

diff --git a/dev/interfaces/index.html b/dev/interfaces/index.html index 21142a48..c8f0f7a7 100644 --- a/dev/interfaces/index.html +++ b/dev/interfaces/index.html @@ -1,5 +1,5 @@ -Implementing custom types · BioSequences.jl

Custom BioSequences types

If you're developing your own bioinformatics package or method, you may find that the reference implementation of concrete LongSequence types provided in this package is not optimal for your purposes.

This page describes the interfaces for BioSequences' core types for developers or other packages implementing their own sequence types or extending BioSequences functionality.

Implementing custom Alphabets

Recall the required methods that define the Alphabet interface.

To create an example custom alphabet, we need to create a singleton type, that implements a few methods in order to conform to the interface as described in the Alphabet documentation.

Let's do that for a restricted Amino Acid alphabet. We can test that it conforms to the interface with the BioSequences.has_interface function.

julia> struct ReducedAAAlphabet <: Alphabet end
+Implementing custom types · BioSequences.jl

Custom BioSequences types

If you're developing your own bioinformatics package or method, you may find that the reference implementation of concrete LongSequence types provided in this package is not optimal for your purposes.

This page describes the interfaces for BioSequences' core types for developers or other packages implementing their own sequence types or extending BioSequences functionality.

Implementing custom Alphabets

Recall the required methods that define the Alphabet interface.

To create an example custom alphabet, we need to create a singleton type, that implements a few methods in order to conform to the interface as described in the Alphabet documentation.

Let's do that for a restricted Amino Acid alphabet. We can test that it conforms to the interface with the BioSequences.has_interface function.

julia> struct ReducedAAAlphabet <: Alphabet end
 
 julia> Base.eltype(::Type{ReducedAAAlphabet}) = AminoAcid
 
@@ -59,4 +59,4 @@
 julia> Base.copy(seq::Codon) = Codon(seq.x)
 
 julia> BioSequences.has_interface(BioSequence, Codon, [RNA_C, RNA_U, RNA_A], false)
-true

Interface checking functions

BioSequences.has_interface · Function
function has_interface(::Type{Alphabet}, A::Alphabet)

Returns whether A conforms to the Alphabet interface.

source
has_interface(::Type{BioSequence}, ::T, syms::Vector, mutable::Bool, compat::Bool=true)

Check if type T conforms to the BioSequence interface. A T is constructed from the vector of element types syms which must not be empty. If the mutable flag is set, also check the mutable interface. If the compat flag is set, check for compatibility with existing alphabets.

source
+true

Interface checking functions

BioSequences.has_interface · Function
function has_interface(::Type{Alphabet}, A::Alphabet)

Returns whether A conforms to the Alphabet interface.

source
has_interface(::Type{BioSequence}, ::T, syms::Vector, mutable::Bool, compat::Bool=true)

Check if type T conforms to the BioSequence interface. A T is constructed from the vector of element types syms which must not be empty. If the mutable flag is set, also check the mutable interface. If the compat flag is set, check for compatibility with existing alphabets.

source
diff --git a/dev/io/index.html b/dev/io/index.html index 053bd4a7..6586ad87 100644 --- a/dev/io/index.html +++ b/dev/io/index.html @@ -1,2 +1,2 @@ -I/O · BioSequences.jl

I/O for sequencing file formats

Versions of BioSequences prior to v2.0 provided a FASTA, FASTQ, and 2Bit submodule for working with formatted sequence files.

After version v2.0, in order to neatly separate concerns, these submodules were removed.

Instead there will now be dedicated BioJulia packages for each format. Each of these will be compatible with BioSequences.

A list of all of the different formats and packages is provided below to help you find them quickly.

Format    Package
FASTA     FASTX.jl
FASTQ     FASTX.jl
2Bit      TwoBit.jl
+I/O · BioSequences.jl

I/O for sequencing file formats

Versions of BioSequences prior to v2.0 provided a FASTA, FASTQ, and 2Bit submodule for working with formatted sequence files.

After version v2.0, in order to neatly separate concerns, these submodules were removed.

Instead there will now be dedicated BioJulia packages for each format. Each of these will be compatible with BioSequences.

A list of all of the different formats and packages is provided below to help you find them quickly.

Format    Package
FASTA     FASTX.jl
FASTQ     FASTX.jl
2Bit      TwoBit.jl
diff --git a/dev/iteration/index.html b/dev/iteration/index.html deleted file mode 100644 index 5d832b13..00000000 --- a/dev/iteration/index.html +++ /dev/null @@ -1,13 +0,0 @@ - -Iteration · BioSequences.jl

Iteration

As you might expect, sequence types are iterators over their elements:

julia> n = 0
-0
-
-julia> for nt in dna"ATNGNNT"
-           if nt == DNA_N
-               global n += 1
-           end
-       end
-
-julia> n
-3
-
diff --git a/dev/predicates/index.html b/dev/predicates/index.html index a7634f34..7f3be721 100644 --- a/dev/predicates/index.html +++ b/dev/predicates/index.html @@ -1,5 +1,5 @@ -Predicates · BioSequences.jl

Predicates

A number of predicate or query functions are supported for sequences, allowing you to check for certain properties of a sequence.

BioSequences.isrepetitive · Function
isrepetitive(seq::BioSequence, n::Integer = length(seq))

Return true if and only if seq contains a repetitive subsequence of length ≥ n.

source
BioSequences.iscanonical · Function
iscanonical(seq::NucleotideSeq)

Returns true if seq is canonical.

For any sequence, there is a reverse complement, which is the same sequence, but on the complementary strand of DNA:

------->
+Predicates · BioSequences.jl

Predicates

A number of predicate or query functions are supported for sequences, allowing you to check for certain properties of a sequence.

BioSequences.isrepetitive · Function
isrepetitive(seq::BioSequence, n::Integer = length(seq))

Return true if and only if seq contains a repetitive subsequence of length ≥ n.

source
BioSequences.iscanonical · Function
iscanonical(seq::NucleotideSeq)

Returns true if seq is canonical.

For any sequence, there is a reverse complement, which is the same sequence, but on the complementary strand of DNA:

------->
 ATCGATCG
 CGATCGAT
-<-------
Note

Using the reverse_complement of a DNA sequence will give this reverse complement.

Of the two sequences, the canonical one is the lesser of the two, i.e. canonical_seq < other_seq.

source
+<-------
Note

Using the reverse_complement of a DNA sequence will give this reverse complement.

Of the two sequences, the canonical one is the lesser of the two, i.e. canonical_seq < other_seq.

source
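
The canonical-sequence rule above can be sketched in plain Julia on ordinary strings. This is only an illustration under stated assumptions: the names COMPLEMENT, revcomp, canonical_str and is_canonical_str are hypothetical, only unambiguous A/C/G/T are handled, and string comparison stands in for the ordering of BioSequences, which may differ for ambiguous symbols.

```julia
# Hypothetical string-based helpers; not the BioSequences implementation.
const COMPLEMENT = Dict('A' => 'T', 'C' => 'G', 'G' => 'C', 'T' => 'A')

# Reverse the string, then complement every base.
revcomp(s::AbstractString) = join(COMPLEMENT[c] for c in reverse(s))

# The canonical form is the lesser of a sequence and its reverse complement.
canonical_str(s::AbstractString) = min(s, revcomp(s))

# A sequence equal to its own reverse complement is treated as canonical here.
is_canonical_str(s::AbstractString) = s <= revcomp(s)

revcomp("ATCGATCG")           # "CGATCGAT"
canonical_str("CGATCGAT")     # "ATCGATCG"
is_canonical_str("ATCGATCG")  # true
is_canonical_str("CGATCGAT")  # false
```

The example reuses the ATCGATCG / CGATCGAT pair from the diagram above.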
diff --git a/dev/random/index.html b/dev/random/index.html index 5819087e..3a796b3e 100644 --- a/dev/random/index.html +++ b/dev/random/index.html @@ -1,8 +1,8 @@ -Random sequences · BioSequences.jl

Generating random sequences

Long sequences

You can generate random long sequences using the randseq family of functions (randseq, randdnaseq, randrnaseq, randaaseq) and the Samplers implemented in BioSequences:

BioSequences.randseq · Function
randseq([rng::AbstractRNG], A::Alphabet, len::Integer)

Generate a LongSequence{A} of length len from the specified alphabet, drawn from the default distribution. User-defined alphabets should implement this method to implement random LongSequence generation.

For RNA and DNA alphabets, the default distribution is uniform across A, C, G, and T/U. For AminoAcidAlphabet, it is uniform across the 20 standard amino acids. For a user-defined alphabet A, default is uniform across all elements of symbols(A).

Example:

julia> seq = randseq(AminoAcidAlphabet(), 50)
+Random sequences · BioSequences.jl

Generating random sequences

Long sequences

You can generate random long sequences using the randseq family of functions (randseq, randdnaseq, randrnaseq, randaaseq) and the Samplers implemented in BioSequences:

BioSequences.randseq · Function
randseq([rng::AbstractRNG], A::Alphabet, len::Integer)

Generate a LongSequence{A} of length len from the specified alphabet, drawn from the default distribution. User-defined alphabets should implement this method to implement random LongSequence generation.

For RNA and DNA alphabets, the default distribution is uniform across A, C, G, and T/U. For AminoAcidAlphabet, it is uniform across the 20 standard amino acids. For a user-defined alphabet A, default is uniform across all elements of symbols(A).

Example:

julia> seq = randseq(AminoAcidAlphabet(), 50)
 50aa Amino Acid Sequence:
-VFMHSIRMIRLMVHRSWKMHSARHVNFIRCQDKKWKSADGIYTDICKYSM
source
randseq([rng::AbstractRNG], A::Alphabet, sp::Sampler, len::Integer)

Generate a LongSequence{A} of length len with elements drawn from the given sampler.

Example:

# Generate 50-length RNA with 4% chance of N, 24% for A, C, G, or U
+VFMHSIRMIRLMVHRSWKMHSARHVNFIRCQDKKWKSADGIYTDICKYSM
source
randseq([rng::AbstractRNG], A::Alphabet, sp::Sampler, len::Integer)

Generate a LongSequence{A} of length len with elements drawn from the given sampler.

Example:

# Generate 50-length RNA with 4% chance of N, 24% for A, C, G, or U
 julia> sp = SamplerWeighted(rna"ACGUN", fill(0.24, 4))
 julia> seq = randseq(RNAAlphabet{4}(), sp, 50)
 50nt RNA Sequence:
-CUNGGGCCCGGGNAAACGUGGUACACCCUGUUAAUAUCAACNNGCGCUNU
source
BioSequences.randdnaseq · Function
randdnaseq([rng::AbstractRNG], len::Integer)

Generate a random LongSequence{DNAAlphabet{4}} sequence of length len, with bases sampled uniformly from [A, C, G, T]

source
BioSequences.randrnaseq · Function
randrnaseq([rng::AbstractRNG], len::Integer)

Generate a random LongSequence{RNAAlphabet{4}} sequence of length len, with bases sampled uniformly from [A, C, G, U]

source
BioSequences.randaaseq · Function
randaaseq([rng::AbstractRNG], len::Integer)

Generate a random LongSequence{AminoAcidAlphabet} sequence of length len, with amino acids sampled uniformly from the 20 standard amino acids.

source
BioSequences.SamplerUniform · Type
SamplerUniform{T}

Uniform sampler of type T. Instantiate with a collection of eltype T containing the elements to sample.

Examples

julia> sp = SamplerUniform(rna"ACGU");
source
BioSequences.SamplerWeighted · Type
SamplerWeighted{T}

Weighted sampler of type T. Instantiate with a collection of eltype T containing the elements to sample, and an ordered collection of probabilities to sample each element except the last. The last probability is the remaining probability up to 1.

Examples

julia> sp = SamplerWeighted(rna"ACGUN", fill(0.2475, 4));
source
+CUNGGGCCCGGGNAAACGUGGUACACCCUGUUAAUAUCAACNNGCGCUNU
source
BioSequences.randdnaseq · Function
randdnaseq([rng::AbstractRNG], len::Integer)

Generate a random LongSequence{DNAAlphabet{4}} sequence of length len, with bases sampled uniformly from [A, C, G, T]

source
BioSequences.randrnaseq · Function
randrnaseq([rng::AbstractRNG], len::Integer)

Generate a random LongSequence{RNAAlphabet{4}} sequence of length len, with bases sampled uniformly from [A, C, G, U]

source
BioSequences.randaaseq · Function
randaaseq([rng::AbstractRNG], len::Integer)

Generate a random LongSequence{AminoAcidAlphabet} sequence of length len, with amino acids sampled uniformly from the 20 standard amino acids.

source
BioSequences.SamplerUniform · Type
SamplerUniform{T}

Uniform sampler of type T. Instantiate with a collection of eltype T containing the elements to sample.

Examples

julia> sp = SamplerUniform(rna"ACGU");
source
BioSequences.SamplerWeighted · Type
SamplerWeighted{T}

Weighted sampler of type T. Instantiate with a collection of eltype T containing the elements to sample, and an ordered collection of probabilities to sample each element except the last. The last probability is the remaining probability up to 1.

Examples

julia> sp = SamplerWeighted(rna"ACGUN", fill(0.2475, 4));
source
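
The "remaining probability" rule of SamplerWeighted can be checked with a small plain-Julia sketch. The helpers full_probabilities and sample_index are hypothetical names introduced for this illustration and are not part of BioSequences; the sketch shows one common way to draw from such a distribution (a cumulative-sum scan), which is not necessarily how the sampler is implemented.

```julia
# Hypothetical helpers for illustration; not the BioSequences implementation.

# Given probabilities for every element except the last, append the remainder.
function full_probabilities(probs::AbstractVector{<:Real})
    total = sum(probs)
    (all(>=(0), probs) && total <= 1) ||
        throw(ArgumentError("probabilities must be non-negative and sum to at most 1"))
    return vcat(float.(probs), 1 - total)
end

# Map a uniform draw u in [0, 1) to an element index by scanning cumulative sums.
function sample_index(p::AbstractVector{<:Real}, u::Real)
    acc = 0.0
    for (i, w) in enumerate(p)
        acc += w
        u < acc && return i
    end
    return length(p)  # guards against floating-point rounding at the top end
end

p = full_probabilities(fill(0.2475, 4))  # five entries; the last is ≈ 0.01
sample_index(p, 0.5)                     # 3
sample_index(p, 0.995)                   # 5
sample_index(p, rand())                  # a random index in 1:5
```

With rna"ACGUN" as the element collection, index 5 would correspond to N, so fill(0.2475, 4) leaves roughly a 1% chance of N.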
diff --git a/dev/recipes/index.html b/dev/recipes/index.html new file mode 100644 index 00000000..0aa39257 --- /dev/null +++ b/dev/recipes/index.html @@ -0,0 +1,32 @@ + +Recipes · BioSequences.jl

Recipes

This page provides tested example code to solve various common problems using BioSequences.

One-hot encoding biosequences

The types DNA, RNA and AminoAcid expose a binary representation through the exported function BioSymbols.compatbits, which returns a bitmask with one bit per compatible unambiguous symbol:

julia> using BioSymbols
+
+julia> compatbits(DNA_W)
+0x09
+
+julia> compatbits(AA_J)
+0x00000600

Each set bit in the encoding corresponds to a compatible unambiguous symbol. For example, for RNA, the four lower bits encode A, C, G, and U, in order. Hence, the symbol D, which is short for A, G or U, is encoded as 0x01 | 0x04 | 0x08 == 0x0d:

julia> compatbits(RNA_D)
+0x0d
+
+julia> compatbits(RNA_A) | compatbits(RNA_G) | compatbits(RNA_U)
+0x0d

Using this, we can construct a function to one-hot encode sequences - in this example, nucleic acid sequences:

function one_hot(s::NucSeq)
+    M = falses(4, length(s))
+    for (i, s) in enumerate(s)
+        bits = compatbits(s)
+        while !iszero(bits)
+            M[trailing_zeros(bits) + 1, i] = true
+            bits &= bits - one(bits) # clear lowest bit
+        end
+    end
+    M
+end
+
+one_hot(dna"TGNTKCTW-T")
+
+# output
+
+4×10 BitMatrix:
+ 0  0  1  0  0  0  0  1  0  0
+ 0  0  1  0  0  1  0  0  0  0
+ 0  1  1  0  1  0  0  0  0  0
+ 1  0  1  1  1  0  1  1  0  1
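
The inner loop of one_hot above relies on two bit tricks: trailing_zeros finds the lowest set bit, and bits &= bits - one(bits) clears it. A minimal stand-alone sketch in plain Julia follows; set_bit_positions is a hypothetical helper written for this illustration, and the literal inputs are the compatbits values shown earlier on this page.

```julia
# Hypothetical helper for illustration; mirrors the loop inside one_hot.
function set_bit_positions(bits::Unsigned)
    positions = Int[]
    while !iszero(bits)
        # 1-based position of the lowest set bit
        push!(positions, trailing_zeros(bits) + 1)
        # clear the lowest set bit
        bits &= bits - one(bits)
    end
    return positions
end

set_bit_positions(0x0d)  # [1, 3, 4]  (compatbits(RNA_D): A, G, U)
set_bit_positions(0x09)  # [1, 4]     (compatbits(DNA_W): A, T)
set_bit_positions(0x00)  # Int[]      (no compatible symbol, as for a gap column)
```

Each returned position corresponds to a row that one_hot sets to true in the column for that symbol.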
diff --git a/dev/search_index.js b/dev/search_index.js index d161821d..5b7a0d81 100644 --- a/dev/search_index.js +++ b/dev/search_index.js @@ -1,3 +1,3 @@ var documenterSearchIndex = {"docs": -[{"location":"symbols/","page":"Biological Symbols","title":"Biological Symbols","text":"CurrentModule = BioSequences\nDocTestSetup = quote\n using BioSequences\nend","category":"page"},{"location":"symbols/#Biological-symbols","page":"Biological Symbols","title":"Biological symbols","text":"","category":"section"},{"location":"symbols/","page":"Biological Symbols","title":"Biological Symbols","text":"The BioSequences module reexports the biological symbol (character) types that are provided by BioSymbols.jl:","category":"page"},{"location":"symbols/","page":"Biological Symbols","title":"Biological Symbols","text":"Type Meaning\nDNA DNA nucleotide\nRNA RNA nucleotide\nAminoAcid Amino acid","category":"page"},{"location":"symbols/","page":"Biological Symbols","title":"Biological Symbols","text":"These symbols are elements of biological sequence types, just as characters are elements of strings.","category":"page"},{"location":"symbols/#DNA-and-RNA-nucleotides","page":"Biological Symbols","title":"DNA and RNA nucleotides","text":"","category":"section"},{"location":"symbols/","page":"Biological Symbols","title":"Biological Symbols","text":"Set of nucleotide symbols in BioSequences covers IUPAC nucleotide base plus a gap symbol:","category":"page"},{"location":"symbols/","page":"Biological Symbols","title":"Biological Symbols","text":"Symbol Constant Meaning\n'A' DNA_A / RNA_A A; Adenine\n'C' DNA_C / RNA_C C; Cytosine\n'G' DNA_G / RNA_G G; Guanine\n'T' DNA_T T; Thymine (DNA only)\n'U' RNA_U U; Uracil (RNA only)\n'M' DNA_M / RNA_M A or C\n'R' DNA_R / RNA_R A or G\n'W' DNA_W / RNA_W A or T/U\n'S' DNA_S / RNA_S C or G\n'Y' DNA_Y / RNA_Y C or T/U\n'K' DNA_K / RNA_K G or T/U\n'V' DNA_V / RNA_V A or C or G; not T/U\n'H' DNA_H / RNA_H A or C or T; not G\n'D' DNA_D / RNA_D A or G or 
T/U; not C\n'B' DNA_B / RNA_B C or G or T/U; not A\n'N' DNA_N / RNA_N A or C or G or T/U\n'-' DNA_Gap / RNA_Gap Gap (none of the above)","category":"page"},{"location":"symbols/","page":"Biological Symbols","title":"Biological Symbols","text":"https://www.bioinformatics.org/sms/iupac.html","category":"page"},{"location":"symbols/","page":"Biological Symbols","title":"Biological Symbols","text":"Symbols are accessible as constants with DNA_ or RNA_ prefix:","category":"page"},{"location":"symbols/","page":"Biological Symbols","title":"Biological Symbols","text":"julia> DNA_A\nDNA_A\n\njulia> DNA_T\nDNA_T\n\njulia> RNA_U\nRNA_U\n\njulia> DNA_Gap\nDNA_Gap\n\njulia> typeof(DNA_A)\nDNA\n\njulia> typeof(RNA_A)\nRNA\n","category":"page"},{"location":"symbols/","page":"Biological Symbols","title":"Biological Symbols","text":"Symbols can be constructed by converting regular characters:","category":"page"},{"location":"symbols/","page":"Biological Symbols","title":"Biological Symbols","text":"julia> convert(DNA, 'C')\nDNA_C\n\njulia> convert(DNA, 'C') === DNA_C\ntrue\n","category":"page"},{"location":"symbols/#Amino-acids","page":"Biological Symbols","title":"Amino acids","text":"","category":"section"},{"location":"symbols/","page":"Biological Symbols","title":"Biological Symbols","text":"Set of amino acid symbols also covers IUPAC amino acid symbols plus a gap symbol:","category":"page"},{"location":"symbols/","page":"Biological Symbols","title":"Biological Symbols","text":"Symbol Constant Meaning\n'A' AA_A Alanine\n'R' AA_R Arginine\n'N' AA_N Asparagine\n'D' AA_D Aspartic acid (Aspartate)\n'C' AA_C Cysteine\n'Q' AA_Q Glutamine\n'E' AA_E Glutamic acid (Glutamate)\n'G' AA_G Glycine\n'H' AA_H Histidine\n'I' AA_I Isoleucine\n'L' AA_L Leucine\n'K' AA_K Lysine\n'M' AA_M Methionine\n'F' AA_F Phenylalanine\n'P' AA_P Proline\n'S' AA_S Serine\n'T' AA_T Threonine\n'W' AA_W Tryptophan\n'Y' AA_Y Tyrosine\n'V' AA_V Valine\n'O' AA_O Pyrrolysine\n'U' AA_U Selenocysteine\n'B' AA_B 
Aspartic acid or Asparagine\n'J' AA_J Leucine or Isoleucine\n'Z' AA_Z Glutamine or Glutamic acid\n'X' AA_X Any amino acid\n'*' AA_Term Termination codon\n'-' AA_Gap Gap (none of the above)","category":"page"},{"location":"symbols/","page":"Biological Symbols","title":"Biological Symbols","text":"https://www.bioinformatics.org/sms/iupac.html","category":"page"},{"location":"symbols/","page":"Biological Symbols","title":"Biological Symbols","text":"Symbols are accessible as constants with AA_ prefix:","category":"page"},{"location":"symbols/","page":"Biological Symbols","title":"Biological Symbols","text":"julia> AA_A\nAA_A\n\njulia> AA_Q\nAA_Q\n\njulia> AA_Term\nAA_Term\n\njulia> typeof(AA_A)\nAminoAcid\n","category":"page"},{"location":"symbols/","page":"Biological Symbols","title":"Biological Symbols","text":"Symbols can be constructed by converting regular characters:","category":"page"},{"location":"symbols/","page":"Biological Symbols","title":"Biological Symbols","text":"julia> convert(AminoAcid, 'A')\nAA_A\n\njulia> convert(AminoAcid, 'P') === AA_P\ntrue\n","category":"page"},{"location":"symbols/#Other-functions","page":"Biological Symbols","title":"Other functions","text":"","category":"section"},{"location":"symbols/","page":"Biological Symbols","title":"Biological Symbols","text":"alphabet\ngap\niscompatible\nisambiguous","category":"page"},{"location":"symbols/#BioSymbols.alphabet","page":"Biological Symbols","title":"BioSymbols.alphabet","text":"alphabet(DNA)\n\nGet all symbols of DNA in sorted order.\n\nExamples\n\njulia> alphabet(DNA)\n(DNA_Gap, DNA_A, DNA_C, DNA_M, DNA_G, DNA_R, DNA_S, DNA_V, DNA_T, DNA_W, DNA_Y, DNA_H, DNA_K, DNA_D, DNA_B, DNA_N)\n\njulia> issorted(alphabet(DNA))\ntrue\n\n\n\n\n\n\nalphabet(RNA)\n\nGet all symbols of RNA in sorted order.\n\nExamples\n\njulia> alphabet(RNA)\n(RNA_Gap, RNA_A, RNA_C, RNA_M, RNA_G, RNA_R, RNA_S, RNA_V, RNA_U, RNA_W, RNA_Y, RNA_H, RNA_K, RNA_D, RNA_B, RNA_N)\n\njulia> 
issorted(alphabet(RNA))\ntrue\n\n\n\n\n\n\nalphabet(AminoAcid)\n\nGet all symbols of AminoAcid in sorted order.\n\nExamples\n\njulia> alphabet(AminoAcid)\n(AA_A, AA_R, AA_N, AA_D, AA_C, AA_Q, AA_E, AA_G, AA_H, AA_I, AA_L, AA_K, AA_M, AA_F, AA_P, AA_S, AA_T, AA_W, AA_Y, AA_V, AA_O, AA_U, AA_B, AA_J, AA_Z, AA_X, AA_Term, AA_Gap)\n\njulia> issorted(alphabet(AminoAcid))\ntrue\n\n\n\n\n\n\n","category":"function"},{"location":"symbols/#BioSymbols.gap","page":"Biological Symbols","title":"BioSymbols.gap","text":"gap(::Type{T})::T\n\nReturn the gap (indel) representation of T. By default, gap is defined for DNA, RNA, AminoAcid and Char.\n\nExamples\n\njulia> gap(RNA)\nRNA_Gap\n\njulia> gap(Char)\n'-': ASCII/Unicode U+002D (category Pd: Punctuation, dash)\n\n\n\n\n\n","category":"function"},{"location":"symbols/#BioSymbols.iscompatible","page":"Biological Symbols","title":"BioSymbols.iscompatible","text":"iscompatible(x::S, y::S) where S <: BioSymbol\n\nTest if x and y are compatible with each other.\n\nExamples\n\njulia> iscompatible(AA_A, AA_R)\nfalse\n\njulia> iscompatible(AA_A, AA_X)\ntrue\n\njulia> iscompatible(DNA_A, DNA_A)\ntrue\n\njulia> iscompatible(DNA_C, DNA_N) # DNA_N can be DNA_C\ntrue\n\njulia> iscompatible(DNA_C, DNA_R) # DNA_R (A or G) cannot be DNA_C\nfalse\n\n\n\n\n\n\n","category":"function"},{"location":"symbols/#BioSymbols.isambiguous","page":"Biological Symbols","title":"BioSymbols.isambiguous","text":"isambiguous(nt::NucleicAcid)\n\nTest if nt is an ambiguous nucleotide.\n\n\n\n\n\nisambiguous(aa::AminoAcid)\n\nTest if aa is an ambiguous amino acid.\n\n\n\n\n\n","category":"function"},{"location":"io/#I/O-for-sequencing-file-formats","page":"I/O","title":"I/O for sequencing file formats","text":"","category":"section"},{"location":"io/","page":"I/O","title":"I/O","text":"Versions of BioSequences prior to v2.0 provided a FASTA, FASTQ, and 2Bit submodule for working with formatted sequence 
files.","category":"page"},{"location":"io/","page":"I/O","title":"I/O","text":"After version v2.0, in order to neatly separate concerns, these submodules were removed.","category":"page"},{"location":"io/","page":"I/O","title":"I/O","text":"Instead there will now be dedicated BioJulia packages for each format. Each of these will be compatible with BioSequences.","category":"page"},{"location":"io/","page":"I/O","title":"I/O","text":"A list of all of the different formats and packages is provided below to help you find them quickly.","category":"page"},{"location":"io/","page":"I/O","title":"I/O","text":"Format Package\nFASTA FASTX.jl\nFASTQ FASTX.jl\n2Bit TwoBit.jl","category":"page"},{"location":"counting/","page":"Counting","title":"Counting","text":"CurrentModule = BioSequences\nDocTestSetup = quote\n using BioSequences\nend","category":"page"},{"location":"counting/#Counting","page":"Counting","title":"Counting","text":"","category":"section"},{"location":"counting/","page":"Counting","title":"Counting","text":"BioSequences extends the Base.count method to provide some useful utilities for counting the number of sites in biological sequences.","category":"page"},{"location":"counting/","page":"Counting","title":"Counting","text":"Most generically you can count the number of sites that satisfy some condition i.e. 
cause some function to return true:","category":"page"},{"location":"counting/","page":"Counting","title":"Counting","text":"julia> count(isambiguous, dna\"ATCGM\")\n1\n","category":"page"},{"location":"counting/","page":"Counting","title":"Counting","text":"You can also use two sequences, for example to compute the number of matching or mismatching symbols:","category":"page"},{"location":"counting/","page":"Counting","title":"Counting","text":"julia> count(!=, dna\"ATCGM\", dna\"GCCGM\")\n2\n\njulia> count(==, dna\"ATCGM\", dna\"GCCGM\")\n3\n","category":"page"},{"location":"counting/#Alias-functions","page":"Counting","title":"Alias functions","text":"","category":"section"},{"location":"counting/","page":"Counting","title":"Counting","text":"A number of functions which are aliases for various invocations of Base.count are provided.","category":"page"},{"location":"counting/","page":"Counting","title":"Counting","text":"Alias function Base.count call(s)\nn_ambiguous count(isambiguous, seq), count(isambiguous, seqa, seqb)\nn_certain count(iscertain, seq), count(iscertain, seqa, seqb)\nn_gap count(isgap, seq), count(isgap, seqa, seqb)\nmatches count(==, seqa, seqb)\nmismatches count(!=, seqa, seqb)","category":"page"},{"location":"counting/#Bit-parallel-optimisations","page":"Counting","title":"Bit-parallel optimisations","text":"","category":"section"},{"location":"counting/","page":"Counting","title":"Counting","text":"For the vast majority of Base.count(f, seq) and Base.count(f, seqa, seqb) methods, a naive counting is done: the internal count_naive function is called, which simply loops over each position, applies f, and accumulates the result.","category":"page"},{"location":"counting/","page":"Counting","title":"Counting","text":"However, for some functions, it is possible to implement highly efficient methods that use bit-parallelism to check many elements at one time. This is made possible by the succinct encoding of BioSequences. 
Usually f is one of the functions provided by BioSymbols.jl or by BioSequences.jl.","category":"page"},{"location":"counting/","page":"Counting","title":"Counting","text":"For such sequence and function combinations, Base.count(f, seq) is overloaded to call an internal BioSequences.count_*_bitpar function, which is passed the sequence(s). If you want to force BioSequences to use naive counting, for example for testing or debugging, you can call BioSequences.count_naive directly.","category":"page"},{"location":"interfaces/","page":"Implementing custom types","title":"Implementing custom types","text":"CurrentModule = BioSequences\nDocTestSetup = quote\n using BioSequences\nend","category":"page"},{"location":"interfaces/#Custom-BioSequences-types","page":"Implementing custom types","title":"Custom BioSequences types","text":"","category":"section"},{"location":"interfaces/","page":"Implementing custom types","title":"Implementing custom types","text":"If you're developing your own bioinformatics package or method, you may find that the reference implementations of concrete LongSequence types provided in this package are not optimal for your purposes.","category":"page"},{"location":"interfaces/","page":"Implementing custom types","title":"Implementing custom types","text":"This page describes the interfaces for BioSequences' core types, for developers of other packages implementing their own sequence types or extending BioSequences functionality.","category":"page"},{"location":"interfaces/#Implementing-custom-Alphabets","page":"Implementing custom types","title":"Implementing custom Alphabets","text":"","category":"section"},{"location":"interfaces/","page":"Implementing custom types","title":"Implementing custom types","text":"Recall the required methods that define the Alphabet interface. 
","category":"page"},{"location":"interfaces/","page":"Implementing custom types","title":"Implementing custom types","text":"To create an example custom alphabet, we need to create a singleton type, that implements a few methods in order to conform to the interface as described in the Alphabet documentation.","category":"page"},{"location":"interfaces/","page":"Implementing custom types","title":"Implementing custom types","text":"Let's do that for a restricted Amino Acid alphabet. We can test that it conforms to the interface with the BioSequences.has_interface function.","category":"page"},{"location":"interfaces/","page":"Implementing custom types","title":"Implementing custom types","text":"julia> struct ReducedAAAlphabet <: Alphabet end\n\njulia> Base.eltype(::Type{ReducedAAAlphabet}) = AminoAcid\n\njulia> BioSequences.BitsPerSymbol(::ReducedAAAlphabet) = BioSequences.BitsPerSymbol{4}()\n\njulia> function BioSequences.symbols(::ReducedAAAlphabet)\n (AA_L, AA_C, AA_A, AA_G, AA_S, AA_T, AA_P, AA_F,\n AA_W, AA_E, AA_D, AA_N, AA_Q, AA_K, AA_H, AA_M)\n end\n\njulia> const (ENC_LUT, DEC_LUT) = let\n enc_lut = fill(0xff, length(alphabet(AminoAcid)))\n dec_lut = fill(AA_A, length(symbols(ReducedAAAlphabet())))\n for (i, aa) in enumerate(symbols(ReducedAAAlphabet()))\n enc_lut[reinterpret(UInt8, aa) + 0x01] = i - 1\n dec_lut[i] = aa\n end\n (Tuple(enc_lut), Tuple(dec_lut))\n end\n((0x02, 0xff, 0x0b, 0x0a, 0x01, 0x0c, 0x09, 0x03, 0x0e, 0xff, 0x00, 0x0d, 0x0f, 0x07, 0x06, 0x04, 0x05, 0x08, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff), (AA_L, AA_C, AA_A, AA_G, AA_S, AA_T, AA_P, AA_F, AA_W, AA_E, AA_D, AA_N, AA_Q, AA_K, AA_H, AA_M))\n\njulia> function BioSequences.encode(::ReducedAAAlphabet, aa::AminoAcid)\n i = reinterpret(UInt8, aa) + 0x01\n (i ≥ length(ENC_LUT) || @inbounds ENC_LUT[i] === 0xff) && throw(DomainError(aa))\n (@inbounds ENC_LUT[i]) % UInt\n end\n\njulia> function BioSequences.decode(::ReducedAAAlphabet, x::UInt)\n x ≥ length(DEC_LUT) && 
throw(DomainError(x))\n @inbounds DEC_LUT[x + UInt(1)]\n end\n\njulia> BioSequences.has_interface(Alphabet, ReducedAAAlphabet())\ntrue\n","category":"page"},{"location":"interfaces/#Implementing-custom-BioSequences","page":"Implementing custom types","title":"Implementing custom BioSequences","text":"","category":"section"},{"location":"interfaces/","page":"Implementing custom types","title":"Implementing custom types","text":"Recall the required methods that define the BioSequence interface. ","category":"page"},{"location":"interfaces/","page":"Implementing custom types","title":"Implementing custom types","text":"To create an example custom sequence type, we need to define a type that implements a few methods in order to conform to the interface as described in the BioSequence documentation.","category":"page"},{"location":"interfaces/","page":"Implementing custom types","title":"Implementing custom types","text":"Let's do that for a custom sequence type that is optimised to represent a small sequence: a Codon. 
We can test that it conforms to the interface with the BioSequences.has_interface function.","category":"page"},{"location":"interfaces/","page":"Implementing custom types","title":"Implementing custom types","text":"julia> struct Codon <: BioSequence{RNAAlphabet{2}}\n x::UInt8\n end\n\njulia> function Codon(iterable)\n length(iterable) == 3 || error(\"Must have length 3\")\n x = zero(UInt)\n for (i, nt) in enumerate(iterable)\n x |= BioSequences.encode(Alphabet(Codon), convert(RNA, nt)) << (6-2i)\n end\n Codon(x % UInt8)\n end\nCodon\n\njulia> Base.length(::Codon) = 3\n\njulia> BioSequences.encoded_data_eltype(::Type{Codon}) = UInt\n\njulia> function BioSequences.extract_encoded_element(x::Codon, i::Int)\n ((x.x >>> (6-2i)) & 3) % UInt\n end\n\njulia> Base.copy(seq::Codon) = Codon(seq.x)\n\njulia> BioSequences.has_interface(BioSequence, Codon, [RNA_C, RNA_U, RNA_A], false)\ntrue","category":"page"},{"location":"interfaces/#Interface-checking-functions","page":"Implementing custom types","title":"Interface checking functions","text":"","category":"section"},{"location":"interfaces/","page":"Implementing custom types","title":"Implementing custom types","text":"BioSequences.has_interface","category":"page"},{"location":"interfaces/#BioSequences.has_interface","page":"Implementing custom types","title":"BioSequences.has_interface","text":"function has_interface(::Type{Alphabet}, A::Alphabet)\n\nReturns whether A conforms to the Alphabet interface.\n\n\n\n\n\nhas_interface(::Type{BioSequence}, ::T, syms::Vector, mutable::Bool, compat::Bool=true)\n\nCheck if type T conforms to the BioSequence interface. A T is constructed from the vector of element types syms which must not be empty. If the mutable flag is set, also check the mutable interface. 
If the compat flag is set, check for compatibility with existing alphabets.\n\n\n\n\n\n","category":"function"},{"location":"iteration/","page":"Iteration","title":"Iteration","text":"CurrentModule = BioSequences\nDocTestSetup = quote\n using BioSequences\nend","category":"page"},{"location":"iteration/#Iteration","page":"Iteration","title":"Iteration","text":"","category":"section"},{"location":"iteration/","page":"Iteration","title":"Iteration","text":"As you might expect, sequence types are iterators over their elements:","category":"page"},{"location":"iteration/","page":"Iteration","title":"Iteration","text":"julia> n = 0\n0\n\njulia> for nt in dna\"ATNGNNT\"\n if nt == DNA_N\n global n += 1\n end\n end\n\njulia> n\n3\n","category":"page"},{"location":"random/","page":"Random sequences","title":"Random sequences","text":"CurrentModule = BioSequences\nDocTestSetup = quote\n using BioSequences\nend","category":"page"},{"location":"random/#Generating-random-sequences","page":"Random sequences","title":"Generating random sequences","text":"","category":"section"},{"location":"random/#Long-sequences","page":"Random sequences","title":"Long sequences","text":"","category":"section"},{"location":"random/","page":"Random sequences","title":"Random sequences","text":"You can generate random long sequences using the randdna function and the Sampler's implemented in BioSequences:","category":"page"},{"location":"random/","page":"Random sequences","title":"Random sequences","text":"randseq\nranddnaseq\nrandrnaseq\nrandaaseq\nSamplerUniform\nSamplerWeighted","category":"page"},{"location":"random/#BioSequences.randseq","page":"Random sequences","title":"BioSequences.randseq","text":"randseq([rng::AbstractRNG], A::Alphabet, len::Integer)\n\nGenerate a LongSequence{A} of length len from the specified alphabet, drawn from the default distribution. 
User-defined alphabets should implement this method to implement random LongSequence generation.\n\nFor RNA and DNA alphabets, the default distribution is uniform across A, C, G, and T/U. For AminoAcidAlphabet, it is uniform across the 20 standard amino acids. For a user-defined alphabet A, default is uniform across all elements of symbols(A).\n\nExample:\n\njulia> seq = randseq(AminoAcidAlphabet(), 50)\n50aa Amino Acid Sequence:\nVFMHSIRMIRLMVHRSWKMHSARHVNFIRCQDKKWKSADGIYTDICKYSM\n\n\n\n\n\nrandseq([rng::AbstractRNG], A::Alphabet, sp::Sampler, len::Integer)\n\nGenerate a LongSequence{A} of length len with elements drawn from the given sampler.\n\nExample:\n\n# Generate 1000-length RNA with 4% chance of N, 24% for A, C, G, or U\njulia> sp = SamplerWeighted(rna\"ACGUN\", fill(0.24, 4))\njulia> seq = randseq(RNAAlphabet{4}(), sp, 50)\n50nt RNA Sequence:\nCUNGGGCCCGGGNAAACGUGGUACACCCUGUUAAUAUCAACNNGCGCUNU\n\n\n\n\n\n","category":"function"},{"location":"random/#BioSequences.randdnaseq","page":"Random sequences","title":"BioSequences.randdnaseq","text":"randdnaseq([rng::AbstractRNG], len::Integer)\n\nGenerate a random LongSequence{DNAAlphabet{4}} sequence of length len, with bases sampled uniformly from [A, C, G, T]\n\n\n\n\n\n","category":"function"},{"location":"random/#BioSequences.randrnaseq","page":"Random sequences","title":"BioSequences.randrnaseq","text":"randrnaseq([rng::AbstractRNG], len::Integer)\n\nGenerate a random LongSequence{RNAAlphabet{4}} sequence of length len, with bases sampled uniformly from [A, C, G, U]\n\n\n\n\n\n","category":"function"},{"location":"random/#BioSequences.randaaseq","page":"Random sequences","title":"BioSequences.randaaseq","text":"randaaseq([rng::AbstractRNG], len::Integer)\n\nGenerate a random LongSequence{AminoAcidAlphabet} sequence of length len, with amino acids sampled uniformly from the 20 standard amino acids.\n\n\n\n\n\n","category":"function"},{"location":"random/#BioSequences.SamplerUniform","page":"Random 
sequences","title":"BioSequences.SamplerUniform","text":"SamplerUniform{T}\n\nUniform sampler of type T. Instantiate with a collection of eltype T containing the elements to sample.\n\nExamples\n\njulia> sp = SamplerUniform(rna\"ACGU\");\n\n\n\n\n\n","category":"type"},{"location":"random/#BioSequences.SamplerWeighted","page":"Random sequences","title":"BioSequences.SamplerWeighted","text":"SamplerWeighted{T}\n\nWeighted sampler of type T. Instantiate with a collection of eltype T containing the elements to sample, and an ordered collection of probabilities to sample each element except the last. The last probability is the remaining probability up to 1.\n\nExamples\n\njulia> sp = SamplerWeighted(rna\"ACGUN\", fill(0.2475, 4));\n\n\n\n\n\n","category":"type"},{"location":"transforms/","page":"Indexing & modifying sequences","title":"Indexing & modifying sequences","text":"CurrentModule = BioSequences\nDocTestSetup = quote\n using BioSequences\nend","category":"page"},{"location":"transforms/#Indexing-and-modifying-sequences","page":"Indexing & modifying sequences","title":"Indexing & modifying sequences","text":"","category":"section"},{"location":"transforms/#Indexing","page":"Indexing & modifying sequences","title":"Indexing","text":"","category":"section"},{"location":"transforms/","page":"Indexing & modifying sequences","title":"Indexing & modifying sequences","text":"Most concrete BioSequence subtypes behave like other vector or string types. 
They can be indexed using integers or ranges:","category":"page"},{"location":"transforms/","page":"Indexing & modifying sequences","title":"Indexing & modifying sequences","text":"For example, with LongSequences:","category":"page"},{"location":"transforms/","page":"Indexing & modifying sequences","title":"Indexing & modifying sequences","text":"julia> seq = dna\"ACGTTTANAGTNNAGTACC\"\n19nt DNA Sequence:\nACGTTTANAGTNNAGTACC\n\njulia> seq[5]\nDNA_T\n\njulia> seq[6:end]\n14nt DNA Sequence:\nTANAGTNNAGTACC\n","category":"page"},{"location":"transforms/","page":"Indexing & modifying sequences","title":"Indexing & modifying sequences","text":"The biological symbol at a given locus in a biological sequence can be set using setindex!:","category":"page"},{"location":"transforms/","page":"Indexing & modifying sequences","title":"Indexing & modifying sequences","text":"julia> seq = dna\"ACGTTTANAGTNNAGTACC\"\n19nt DNA Sequence:\nACGTTTANAGTNNAGTACC\n\njulia> seq[5] = DNA_A\nDNA_A\n","category":"page"},{"location":"transforms/","page":"Indexing & modifying sequences","title":"Indexing & modifying sequences","text":"note: Note\nSome types can be indexed using integers but not using ranges.","category":"page"},{"location":"transforms/","page":"Indexing & modifying sequences","title":"Indexing & modifying sequences","text":"For LongSequence types, indexing a sequence by range creates a copy of the original sequence, similar to Array in Julia's Base library. 
If you find yourself slowed down by the allocation of these subsequences, consider using a sequence view instead.","category":"page"},{"location":"transforms/#Modifying-sequences","page":"Indexing & modifying sequences","title":"Modifying sequences","text":"","category":"section"},{"location":"transforms/","page":"Indexing & modifying sequences","title":"Indexing & modifying sequences","text":"In addition to setindex, many other modifying operations are possible for biological sequences such as push!, pop!, and insert!, which should be familiar to anyone used to editing arrays.","category":"page"},{"location":"transforms/","page":"Indexing & modifying sequences","title":"Indexing & modifying sequences","text":"push!(::BioSequences.BioSequence, ::Any)\npop!(::BioSequences.BioSequence)\npushfirst!(::BioSequences.BioSequence, ::Any)\npopfirst!(::BioSequences.BioSequence)\ninsert!(::BioSequences.BioSequence, ::Integer, ::Any)\ndeleteat!(::BioSequences.BioSequence, ::Integer)\nappend!(::BioSequences.BioSequence, ::BioSequences.BioSequence)\nresize!(::BioSequences.LongSequence, ::Integer)\nempty!(::BioSequences.BioSequence)","category":"page"},{"location":"transforms/#Base.push!-Tuple{BioSequence, Any}","page":"Indexing & modifying sequences","title":"Base.push!","text":"push!(seq::BioSequence, x)\n\nAppend a biological symbol x to a biological sequence seq.\n\n\n\n\n\n","category":"method"},{"location":"transforms/#Base.pop!-Tuple{BioSequence}","page":"Indexing & modifying sequences","title":"Base.pop!","text":"pop!(seq::BioSequence)\n\nRemove the symbol from the end of a biological sequence seq and return it. 
Returns a variable of eltype(seq).\n\n\n\n\n\n","category":"method"},{"location":"transforms/#Base.pushfirst!-Tuple{BioSequence, Any}","page":"Indexing & modifying sequences","title":"Base.pushfirst!","text":"pushfirst!(seq, x)\n\nInsert a biological symbol x at the beginning of a biological sequence seq.\n\n\n\n\n\n","category":"method"},{"location":"transforms/#Base.popfirst!-Tuple{BioSequence}","page":"Indexing & modifying sequences","title":"Base.popfirst!","text":"popfirst!(seq)\n\nRemove the symbol from the beginning of a biological sequence seq and return it. Returns a variable of eltype(seq).\n\n\n\n\n\n","category":"method"},{"location":"transforms/#Base.insert!-Tuple{BioSequence, Integer, Any}","page":"Indexing & modifying sequences","title":"Base.insert!","text":"insert!(seq::BioSequence, i, x)\n\nInsert a biological symbol x into a biological sequence seq, at the given index i.\n\n\n\n\n\n","category":"method"},{"location":"transforms/#Base.deleteat!-Tuple{BioSequence, Integer}","page":"Indexing & modifying sequences","title":"Base.deleteat!","text":"deleteat!(seq::BioSequence, i::Integer)\n\nDelete a biological symbol at a single position i in a biological sequence seq.\n\nModifies the input sequence.\n\n\n\n\n\n","category":"method"},{"location":"transforms/#Base.append!-Tuple{BioSequence, BioSequence}","page":"Indexing & modifying sequences","title":"Base.append!","text":"append!(seq, other)\n\nAdd a biological sequence other onto the end of biological sequence seq. Modifies and returns seq.\n\n\n\n\n\n","category":"method"},{"location":"transforms/#Base.resize!-Tuple{LongSequence, Integer}","page":"Indexing & modifying sequences","title":"Base.resize!","text":"resize!(seq, size, [force::Bool])\n\nResize a biological sequence seq, to a given size. Does not resize the underlying data array unless the new size does not fit. 
If force, always resize underlying data array.\n\n\n\n\n\n","category":"method"},{"location":"transforms/#Base.empty!-Tuple{BioSequence}","page":"Indexing & modifying sequences","title":"Base.empty!","text":"empty!(seq::BioSequence)\n\nCompletely empty a biological sequence seq of nucleotides.\n\n\n\n\n\n","category":"method"},{"location":"transforms/","page":"Indexing & modifying sequences","title":"Indexing & modifying sequences","text":"Here are some examples:","category":"page"},{"location":"transforms/","page":"Indexing & modifying sequences","title":"Indexing & modifying sequences","text":"julia> seq = dna\"ACG\"\n3nt DNA Sequence:\nACG\n\njulia> push!(seq, DNA_T)\n4nt DNA Sequence:\nACGT\n\njulia> append!(seq, dna\"AT\")\n6nt DNA Sequence:\nACGTAT\n\njulia> deleteat!(seq, 2)\n5nt DNA Sequence:\nAGTAT\n\njulia> deleteat!(seq, 2:3)\n3nt DNA Sequence:\nAAT\n","category":"page"},{"location":"transforms/#Additional-transformations","page":"Indexing & modifying sequences","title":"Additional transformations","text":"","category":"section"},{"location":"transforms/","page":"Indexing & modifying sequences","title":"Indexing & modifying sequences","text":"In addition to these basic modifying functions, other sequence transformations that are common in bioinformatics are also provided.","category":"page"},{"location":"transforms/","page":"Indexing & modifying sequences","title":"Indexing & modifying sequences","text":"reverse!(::BioSequences.LongSequence)\nreverse(::BioSequences.LongSequence{<:NucleicAcidAlphabet})\ncomplement!\ncomplement\nreverse_complement!\nreverse_complement\nungap!\nungap\ncanonical!\ncanonical","category":"page"},{"location":"transforms/#Base.reverse!-Tuple{LongSequence}","page":"Indexing & modifying sequences","title":"Base.reverse!","text":"reverse!(seq::LongSequence)\n\nReverse a biological sequence seq in 
place.\n\n\n\n\n\n","category":"method"},{"location":"transforms/#Base.reverse-Tuple{LongSequence{<:NucleicAcidAlphabet}}","page":"Indexing & modifying sequences","title":"Base.reverse","text":"reverse(seq::BioSequence)\n\nCreate reversed copy of a biological sequence.\n\n\n\n\n\nreverse(seq::LongSequence)\n\nCreate reversed copy of a biological sequence.\n\n\n\n\n\n","category":"method"},{"location":"transforms/#BioSequences.complement!","page":"Indexing & modifying sequences","title":"BioSequences.complement!","text":"complement!(seq)\n\nMake a complement sequence of seq in place.\n\n\n\n\n\n","category":"function"},{"location":"transforms/#BioSymbols.complement","page":"Indexing & modifying sequences","title":"BioSymbols.complement","text":"complement(nt::NucleicAcid)\n\nReturn the complementary nucleotide of nt.\n\nThis function returns the union of all possible complementary nucleotides.\n\nExamples\n\njulia> complement(DNA_A)\nDNA_T\n\njulia> complement(DNA_N)\nDNA_N\n\njulia> complement(RNA_U)\nRNA_A\n\n\n\n\n\n\ncomplement(seq)\n\nMake a complement sequence of seq.\n\n\n\n\n\n","category":"function"},{"location":"transforms/#BioSequences.reverse_complement!","page":"Indexing & modifying sequences","title":"BioSequences.reverse_complement!","text":"reverse_complement!(seq)\n\nMake a reversed complement sequence of seq in place.\n\n\n\n\n\n","category":"function"},{"location":"transforms/#BioSequences.reverse_complement","page":"Indexing & modifying sequences","title":"BioSequences.reverse_complement","text":"reverse_complement(seq)\n\nMake a reversed complement sequence of seq.\n\n\n\n\n\n","category":"function"},{"location":"transforms/#BioSequences.ungap!","page":"Indexing & modifying sequences","title":"BioSequences.ungap!","text":"Remove gap characters from an input sequence.\n\n\n\n\n\n","category":"function"},{"location":"transforms/#BioSequences.ungap","page":"Indexing & modifying sequences","title":"BioSequences.ungap","text":"Create a copy of a 
sequence with gap characters removed.\n\n\n\n\n\n","category":"function"},{"location":"transforms/#BioSequences.canonical!","page":"Indexing & modifying sequences","title":"BioSequences.canonical!","text":"canonical!(seq::NucleotideSeq)\n\nTransforms the seq into its canonical form, if it is not already canonical. Modifies the input sequence in place.\n\nFor any sequence, there is a reverse complement, which is the same sequence, but on the complementary strand of DNA:\n\n------->\nATCGATCG\nCGATCGAT\n<-------\n\nnote: Note\nCalling reverse_complement on a DNA sequence will give this reverse complement.\n\nOf the two sequences, the canonical one is the lesser of the two, i.e. canonical_seq < other_seq.\n\nUsing this function on a seq will ensure it is the canonical version.\n\n\n\n\n\n","category":"function"},{"location":"transforms/#BioSequences.canonical","page":"Indexing & modifying sequences","title":"BioSequences.canonical","text":"canonical(seq::NucleotideSeq)\n\nCreate the canonical sequence of seq.\n\n\n\n\n\n","category":"function"},{"location":"transforms/","page":"Indexing & modifying sequences","title":"Indexing & modifying sequences","text":"Some examples:","category":"page"},{"location":"transforms/","page":"Indexing & modifying sequences","title":"Indexing & modifying sequences","text":"julia> seq = dna\"ACGTAT\"\n6nt DNA Sequence:\nACGTAT\n\njulia> reverse!(seq)\n6nt DNA Sequence:\nTATGCA\n\njulia> complement!(seq)\n6nt DNA Sequence:\nATACGT\n\njulia> reverse_complement!(seq)\n6nt DNA Sequence:\nACGTAT\n","category":"page"},{"location":"transforms/","page":"Indexing & modifying sequences","title":"Indexing & modifying sequences","text":"Many of these methods also have a version which makes a copy of the input sequence, so you get a modified copy, and don't alter the original sequence. Such methods are named the same, but without the exclamation mark. E.g. reverse instead of reverse!, and ungap instead of ungap!. 
","category":"page"},{"location":"transforms/#Translation","page":"Indexing & modifying sequences","title":"Translation","text":"","category":"section"},{"location":"transforms/","page":"Indexing & modifying sequences","title":"Indexing & modifying sequences","text":"Translation is a slightly more complex transformation for RNA Sequences and so we describe it here in more detail.","category":"page"},{"location":"transforms/","page":"Indexing & modifying sequences","title":"Indexing & modifying sequences","text":"The translate function translates a sequence of codons in a RNA sequence to a amino acid sequence based on a genetic code. The BioSequences package provides all NCBI defined genetic codes and they are registered in ncbi_trans_table.","category":"page"},{"location":"transforms/","page":"Indexing & modifying sequences","title":"Indexing & modifying sequences","text":"translate\nncbi_trans_table","category":"page"},{"location":"transforms/#BioSequences.translate","page":"Indexing & modifying sequences","title":"BioSequences.translate","text":"translate(seq, code=standard_genetic_code, allow_ambiguous_codons=true, alternative_start=false)\n\nTranslate an LongRNA or a LongDNA to an LongAA.\n\nTranslation uses genetic code code to map codons to amino acids. See ncbi_trans_table for available genetic codes. If codons in the given sequence cannot determine a unique amino acid, they will be translated to AA_X if allow_ambiguous_codons is true and otherwise result in an error. For organisms that utilize alternative start codons, one can set alternative_start=true, in which case the first codon will always be converted to a methionine.\n\n\n\n\n\n","category":"function"},{"location":"transforms/#BioSequences.ncbi_trans_table","page":"Indexing & modifying sequences","title":"BioSequences.ncbi_trans_table","text":"Genetic code list of NCBI.\n\nThe standard genetic code is ncbi_trans_table[1] and others can be shown by show(ncbi_trans_table). 
For more details, consult the next link: http://www.ncbi.nlm.nih.gov/Taxonomy/taxonomyhome.html/index.cgi?chapter=cgencodes.\n\n\n\n\n\n","category":"constant"},{"location":"transforms/","page":"Indexing & modifying sequences","title":"Indexing & modifying sequences","text":"julia> ncbi_trans_table\nTranslation Tables:\n 1. The Standard Code (standard_genetic_code)\n 2. The Vertebrate Mitochondrial Code (vertebrate_mitochondrial_genetic_code)\n 3. The Yeast Mitochondrial Code (yeast_mitochondrial_genetic_code)\n 4. The Mold, Protozoan, and Coelenterate Mitochondrial Code and the Mycoplasma/Spiroplasma Code (mold_mitochondrial_genetic_code)\n 5. The Invertebrate Mitochondrial Code (invertebrate_mitochondrial_genetic_code)\n 6. The Ciliate, Dasycladacean and Hexamita Nuclear Code (ciliate_nuclear_genetic_code)\n 9. The Echinoderm and Flatworm Mitochondrial Code (echinoderm_mitochondrial_genetic_code)\n 10. The Euplotid Nuclear Code (euplotid_nuclear_genetic_code)\n 11. The Bacterial, Archaeal and Plant Plastid Code (bacterial_plastid_genetic_code)\n 12. The Alternative Yeast Nuclear Code (alternative_yeast_nuclear_genetic_code)\n 13. The Ascidian Mitochondrial Code (ascidian_mitochondrial_genetic_code)\n 14. The Alternative Flatworm Mitochondrial Code (alternative_flatworm_mitochondrial_genetic_code)\n 16. Chlorophycean Mitochondrial Code (chlorophycean_mitochondrial_genetic_code)\n 21. Trematode Mitochondrial Code (trematode_mitochondrial_genetic_code)\n 22. Scenedesmus obliquus Mitochondrial Code (scenedesmus_obliquus_mitochondrial_genetic_code)\n 23. Thraustochytrium Mitochondrial Code (thraustochytrium_mitochondrial_genetic_code)\n 24. Pterobranchia Mitochondrial Code (pterobrachia_mitochondrial_genetic_code)\n 25. 
Candidate Division SR1 and Gracilibacteria Code (candidate_division_sr1_genetic_code)\n","category":"page"},{"location":"transforms/","page":"Indexing & modifying sequences","title":"Indexing & modifying sequences","text":"https://www.ncbi.nlm.nih.gov/Taxonomy/taxonomyhome.html/index.cgi?chapter=cgencodes","category":"page"},{"location":"construction/","page":"Constructing sequences","title":"Constructing sequences","text":"CurrentModule = BioSequences\nDocTestSetup = quote\n using BioSequences\nend","category":"page"},{"location":"construction/#Construction-and-conversion","page":"Constructing sequences","title":"Construction & conversion","text":"","category":"section"},{"location":"construction/","page":"Constructing sequences","title":"Constructing sequences","text":"Here we will showcase the ways you can construct the various sequence types in BioSequences.","category":"page"},{"location":"construction/#Constructing-sequences","page":"Constructing sequences","title":"Constructing sequences","text":"","category":"section"},{"location":"construction/#From-strings","page":"Constructing sequences","title":"From strings","text":"","category":"section"},{"location":"construction/","page":"Constructing sequences","title":"Constructing sequences","text":"Sequences can be constructed from strings using their constructors:","category":"page"},{"location":"construction/","page":"Constructing sequences","title":"Constructing sequences","text":"julia> LongDNA{4}(\"TTANC\")\n5nt DNA Sequence:\nTTANC\n\njulia> LongSequence{DNAAlphabet{2}}(\"TTAGC\")\n5nt DNA Sequence:\nTTAGC\n\njulia> LongRNA{4}(\"UUANC\")\n5nt RNA Sequence:\nUUANC\n\njulia> LongSequence{RNAAlphabet{2}}(\"UUAGC\")\n5nt RNA Sequence:\nUUAGC\n","category":"page"},{"location":"construction/","page":"Constructing sequences","title":"Constructing sequences","text":"Type aliases can also be used for brevity.","category":"page"},{"location":"construction/","page":"Constructing 
sequences","title":"Constructing sequences","text":"julia> LongDNA{4}(\"TTANC\")\n5nt DNA Sequence:\nTTANC\n\njulia> LongDNA{2}(\"TTAGC\")\n5nt DNA Sequence:\nTTAGC\n\njulia> LongRNA{4}(\"UUANC\")\n5nt RNA Sequence:\nUUANC\n\njulia> LongRNA{2}(\"UUAGC\")\n5nt RNA Sequence:\nUUAGC","category":"page"},{"location":"construction/#Constructing-sequences-from-arrays-of-BioSymbols","page":"Constructing sequences","title":"Constructing sequences from arrays of BioSymbols","text":"","category":"section"},{"location":"construction/","page":"Constructing sequences","title":"Constructing sequences","text":"Sequences can be constructed using vectors or arrays of a BioSymbol type:","category":"page"},{"location":"construction/","page":"Constructing sequences","title":"Constructing sequences","text":"julia> LongDNA{4}([DNA_T, DNA_T, DNA_A, DNA_N, DNA_C])\n5nt DNA Sequence:\nTTANC\n\njulia> LongSequence{DNAAlphabet{2}}([DNA_T, DNA_T, DNA_A, DNA_G, DNA_C])\n5nt DNA Sequence:\nTTAGC\n","category":"page"},{"location":"construction/#Constructing-sequences-from-other-sequences","page":"Constructing sequences","title":"Constructing sequences from other sequences","text":"","category":"section"},{"location":"construction/","page":"Constructing sequences","title":"Constructing sequences","text":"You can create sequences, by concatenating other sequences together:","category":"page"},{"location":"construction/","page":"Constructing sequences","title":"Constructing sequences","text":"julia> LongDNA{2}(\"ACGT\") * LongDNA{2}(\"TGCA\")\n8nt DNA Sequence:\nACGTTGCA\n\njulia> repeat(LongDNA{4}(\"TA\"), 10)\n20nt DNA Sequence:\nTATATATATATATATATATA\n\njulia> LongDNA{4}(\"TA\") ^ 10\n20nt DNA Sequence:\nTATATATATATATATATATA\n","category":"page"},{"location":"construction/","page":"Constructing sequences","title":"Constructing sequences","text":"Sequence views (LongSubSeqs) are special, in that they do not own their own data, and must be constructed from a LongSequence or another 
LongSubSeq:","category":"page"},{"location":"construction/","page":"Constructing sequences","title":"Constructing sequences","text":"julia> seq = LongDNA{4}(\"TACGGACATTA\")\n11nt DNA Sequence:\nTACGGACATTA\n\njulia> seqview = LongSubSeq(seq, 3:7)\n5nt DNA Sequence:\nCGGAC\n\njulia> seqview2 = @view seq[1:3]\n3nt DNA Sequence:\nTAC\n\njulia> typeof(seqview) == typeof(seqview2) && typeof(seqview) <: LongSubSeq\ntrue\n","category":"page"},{"location":"construction/#Conversion-of-sequence-types","page":"Constructing sequences","title":"Conversion of sequence types","text":"","category":"section"},{"location":"construction/","page":"Constructing sequences","title":"Constructing sequences","text":"You can convert between sequence types, if the sequences are compatible - that is, if the source sequence does not contain symbols that are un-encodable by the destination type.","category":"page"},{"location":"construction/","page":"Constructing sequences","title":"Constructing sequences","text":"julia> dna = dna\"TTACGTAGACCG\"\n12nt DNA Sequence:\nTTACGTAGACCG\n\njulia> dna2 = convert(LongDNA{2}, dna)\n12nt DNA Sequence:\nTTACGTAGACCG","category":"page"},{"location":"construction/","page":"Constructing sequences","title":"Constructing sequences","text":"DNA/RNA are special in that they can be converted to each other, despite containing distinct symbols. 
When doing so, DNA_T is converted to RNA_U and vice versa.","category":"page"},{"location":"construction/","page":"Constructing sequences","title":"Constructing sequences","text":"julia> convert(LongRNA{2}, dna\"TAGCTAGG\")\n8nt RNA Sequence:\nUAGCUAGG","category":"page"},{"location":"construction/#String-literals","page":"Constructing sequences","title":"String literals","text":"","category":"section"},{"location":"construction/","page":"Constructing sequences","title":"Constructing sequences","text":"BioSequences provides several string literal macros for creating sequences.","category":"page"},{"location":"construction/","page":"Constructing sequences","title":"Constructing sequences","text":"note: Note\nWhen you use literals you may mix the case of characters.","category":"page"},{"location":"construction/#Long-sequence-literals","page":"Constructing sequences","title":"Long sequence literals","text":"","category":"section"},{"location":"construction/","page":"Constructing sequences","title":"Constructing sequences","text":"julia> dna\"TACGTANNATC\"\n11nt DNA Sequence:\nTACGTANNATC\n\njulia> rna\"AUUUGNCCANU\"\n11nt RNA Sequence:\nAUUUGNCCANU\n\njulia> aa\"ARNDCQEGHILKMFPSTWYVX\"\n21aa Amino Acid Sequence:\nARNDCQEGHILKMFPSTWYVX","category":"page"},{"location":"construction/","page":"Constructing sequences","title":"Constructing sequences","text":"However, it should be noted that by default these sequence literals allocate the LongSequence object before the code containing the sequence literal is run. This means there may be occasions where your program does not behave as you first expect. 
For example consider the following code:","category":"page"},{"location":"construction/","page":"Constructing sequences","title":"Constructing sequences","text":"julia> function foo()\n s = dna\"CTT\"\n push!(s, DNA_A)\n end\nfoo (generic function with 1 method)\n","category":"page"},{"location":"construction/","page":"Constructing sequences","title":"Constructing sequences","text":"DocTestSetup = quote\n using BioSequences\n function foo()\n s = dna\"CTT\"d\n push!(s, DNA_A)\n end\nend","category":"page"},{"location":"construction/","page":"Constructing sequences","title":"Constructing sequences","text":"You might expect that every time you call foo, that a DNA sequence CTTA would be returned. You might expect that this is because every time foo is called, a new DNA sequence variable CTT is created, and the A nucleotide is pushed to it, and the result, CTTA is returned. In other words you might expect the following output:","category":"page"},{"location":"construction/","page":"Constructing sequences","title":"Constructing sequences","text":"julia> foo()\n4nt DNA Sequence:\nCTTA\n\njulia> foo()\n4nt DNA Sequence:\nCTTA\n\njulia> foo()\n4nt DNA Sequence:\nCTTA\n","category":"page"},{"location":"construction/","page":"Constructing sequences","title":"Constructing sequences","text":"However, this is not what happens, instead the following happens:","category":"page"},{"location":"construction/","page":"Constructing sequences","title":"Constructing sequences","text":"DocTestSetup = quote\n using BioSequences\n function foo()\n s = dna\"CTT\"s\n push!(s, DNA_A)\n end\nend","category":"page"},{"location":"construction/","page":"Constructing sequences","title":"Constructing sequences","text":"julia> foo()\n4nt DNA Sequence:\nCTTA\n\njulia> foo()\n5nt DNA Sequence:\nCTTAA\n\njulia> foo()\n6nt DNA Sequence:\nCTTAAA\n","category":"page"},{"location":"construction/","page":"Constructing sequences","title":"Constructing sequences","text":"The reason for this is because the 
sequence literal is allocated only once before the first time the function foo is called and run. Therefore, s in foo is always a reference to that one sequence that was allocated. So one sequence is created before foo is called, and then it is pushed to every time foo is called. Thus, that one allocated sequence grows with every call of foo.","category":"page"},{"location":"construction/","page":"Constructing sequences","title":"Constructing sequences","text":"If you wanted foo to create a new sequence each time it is called, then you can add a flag to the end of the sequence literal to dictate behaviour: A flag of 's' means 'static': the sequence will be allocated before code is run, as is the default behaviour described above. However providing 'd' flag changes the behaviour: 'd' means 'dynamic': the sequence will be allocated whilst the code is running, and not before. So to change foo so as it creates a new sequence each time it is called, simply add the 'd' flag to the sequence literal:","category":"page"},{"location":"construction/","page":"Constructing sequences","title":"Constructing sequences","text":"DocTestSetup = quote\n using BioSequences\nend","category":"page"},{"location":"construction/","page":"Constructing sequences","title":"Constructing sequences","text":"julia> function foo()\n s = dna\"CTT\"d # 'd' flag appended to the string literal.\n push!(s, DNA_A)\n end\nfoo (generic function with 1 method)\n","category":"page"},{"location":"construction/","page":"Constructing sequences","title":"Constructing sequences","text":"Now every time foo is called, a new sequence CTT is created, and an A nucleotide is pushed to it:","category":"page"},{"location":"construction/","page":"Constructing sequences","title":"Constructing sequences","text":"DocTestSetup = quote\n using BioSequences\n function foo()\n s = dna\"CTT\"d\n push!(s, DNA_A)\n end\nend","category":"page"},{"location":"construction/","page":"Constructing sequences","title":"Constructing 
sequences","text":"julia> foo()\n4nt DNA Sequence:\nCTTA\n\njulia> foo()\n4nt DNA Sequence:\nCTTA\n\njulia> foo()\n4nt DNA Sequence:\nCTTA\n","category":"page"},{"location":"construction/","page":"Constructing sequences","title":"Constructing sequences","text":"DocTestSetup = quote\n using BioSequences\nend","category":"page"},{"location":"construction/","page":"Constructing sequences","title":"Constructing sequences","text":"So the take home message of sequence literals is this:","category":"page"},{"location":"construction/","page":"Constructing sequences","title":"Constructing sequences","text":"Be careful when you are using sequence literals inside of functions, and inside the bodies of things like for loops. And if you use them and are unsure, use the 's' and 'd' flags to ensure the behaviour you get is the behaviour you intend.","category":"page"},{"location":"construction/","page":"Constructing sequences","title":"Constructing sequences","text":"@dna_str\n@rna_str\n@aa_str","category":"page"},{"location":"construction/#BioSequences.@dna_str","page":"Constructing sequences","title":"BioSequences.@dna_str","text":"@dna_str(seq, flag=\"s\") -> LongDNA{4}\n\nCreate a LongDNA{4} sequence at parse time from string seq. If flag is \"s\" ('static', the default), the sequence is created at parse time, and inserted directly into the returned expression. A static string ought not to be mutated. Alternatively, if flag is \"d\" (dynamic), a new sequence is parsed and created whenever the code where the macro is placed is run.\n\nSee also: @aa_str, @rna_str\n\nExamples\n\nIn the example below, the static sequence is created once, at parse time, NOT when the function f is run. 
This means it is the same sequence that is pushed to repeatedly.\n\njulia> f() = dna\"TAG\";\n\njulia> string(push!(f(), DNA_A)) # NB: Mutates static string!\n\"TAGA\"\n\njulia> string(push!(f(), DNA_A))\n\"TAGAA\"\n\njulia> f() = dna\"TAG\"d; # dynamically make seq\n\njulia> string(push!(f(), DNA_A))\n\"TAGA\"\n\njulia> string(push!(f(), DNA_A))\n\"TAGA\"\n\n\n\n\n\n","category":"macro"},{"location":"construction/#BioSequences.@rna_str","page":"Constructing sequences","title":"BioSequences.@rna_str","text":"The LongRNA{4} equivalent to @dna_str\n\nSee also: @dna_str, @aa_str\n\nExamples\n\njulia> rna\"UCGUGAUGC\"\n9nt RNA Sequence:\nUCGUGAUGC\n\n\n\n\n\n","category":"macro"},{"location":"construction/#BioSequences.@aa_str","page":"Constructing sequences","title":"BioSequences.@aa_str","text":"The AminoAcidAlphabet equivalent to @dna_str\n\nSee also: @dna_str, @rna_str\n\nExamples\n\njulia> aa\"PKLEQC\"\n6aa Amino Acid Sequence:\nPKLEQC\n\n\n\n\n\n","category":"macro"},{"location":"construction/#Comparison-to-other-sequence-types","page":"Constructing sequences","title":"Comparison to other sequence types","text":"","category":"section"},{"location":"construction/","page":"Constructing sequences","title":"Constructing sequences","text":"Following Base standards, BioSequences do not compare equal to other containers even if they have the same elements. To e.g. 
compare a BioSequence with a vector of DNA, compare the elements themselves:","category":"page"},{"location":"construction/","page":"Constructing sequences","title":"Constructing sequences","text":"julia> seq = dna\"GAGCTGA\"; vec = collect(seq);\n\njulia> seq == vec, isequal(seq, vec)\n(false, false)\n\njulia> length(seq) == length(vec) && all(i == j for (i, j) in zip(seq, vec))\ntrue ","category":"page"},{"location":"sequence_search/","page":"Pattern matching and searching","title":"Pattern matching and searching","text":"CurrentModule = BioSequences\nDocTestSetup = quote\n using BioSequences\nend","category":"page"},{"location":"sequence_search/#Searching-for-sequence-motifs","page":"Pattern matching and searching","title":"Searching for sequence motifs","text":"","category":"section"},{"location":"sequence_search/","page":"Pattern matching and searching","title":"Pattern matching and searching","text":"There are many ways to search for particular motifs in biological sequences:","category":"page"},{"location":"sequence_search/","page":"Pattern matching and searching","title":"Pattern matching and searching","text":"Exact searches, where you are looking for exact matches of a particular character or substring.\nApproximate searches, where you are looking for sequences that are sufficiently similar to a given sequence or family of sequences.\nSearches where you are looking for sequences that conform to some sort of pattern.","category":"page"},{"location":"sequence_search/","page":"Pattern matching and searching","title":"Pattern matching and searching","text":"Like other Julia sequences such as Vector, you can search a BioSequence with the findfirst(predicate, collection) method pattern.","category":"page"},{"location":"sequence_search/","page":"Pattern matching and searching","title":"Pattern matching and searching","text":"All these kinds of searches are provided in BioSequences.jl, and they all conform to the findnext, findprev, and occursin patterns 
established in Base for String and collections like Vector.","category":"page"},{"location":"sequence_search/","page":"Pattern matching and searching","title":"Pattern matching and searching","text":"The exception is searching using the specialised regex provided in this package, which, as you shall see, conforms to the match pattern established in Base for PCRE and Strings.","category":"page"},{"location":"sequence_search/#Symbol-search","page":"Pattern matching and searching","title":"Symbol search","text":"","category":"section"},{"location":"sequence_search/","page":"Pattern matching and searching","title":"Pattern matching and searching","text":"julia> seq = dna\"ACAGCGTAGCT\";\n\njulia> findfirst(DNA_A, seq)\n1\n\njulia> findlast(DNA_A, seq)\n8\n\njulia> findnext(DNA_A, seq, 2)\n3\n\njulia> findprev(DNA_A, seq, 7)\n3\n\njulia> findall(DNA_A, seq)\n3-element Vector{Int64}:\n 1\n 3\n 8","category":"page"},{"location":"sequence_search/#Exact-search","page":"Pattern matching and searching","title":"Exact search","text":"","category":"section"},{"location":"sequence_search/","page":"Pattern matching and searching","title":"Pattern matching and searching","text":"ExactSearchQuery","category":"page"},{"location":"sequence_search/#BioSequences.ExactSearchQuery","page":"Pattern matching and searching","title":"BioSequences.ExactSearchQuery","text":"ExactSearchQuery{F<:Function,S<:BioSequence}\n\nQuery type for exact sequence search.\n\nAn exact search is one where you are looking in some given sequence for exact instances of some given substring.\n\nThese queries are used as a predicate for the Base.findnext, Base.findprev, Base.occursin, Base.findfirst, and Base.findlast functions.\n\nExamples\n\njulia> seq = dna\"ACAGCGTAGCT\";\n\njulia> query = ExactSearchQuery(dna\"AGC\");\n\njulia> findfirst(query, seq)\n3:5\n\njulia> findlast(query, seq)\n8:10\n\njulia> findnext(query, seq, 6)\n8:10\n\njulia> findprev(query, seq, 7)\n3:5\n\njulia> findall(query, 
seq)\n2-element Vector{UnitRange{Int64}}:\n 3:5\n 8:10\n\njulia> occursin(query, seq)\ntrue\n\n\nYou can pass a comparator function such as isequal or iscompatible to its constructor to modify the search behaviour.\n\nThe default is isequal, however, in biology, sometimes we want a more flexible comparison to find subsequences of compatible symbols.\n\njulia> query = ExactSearchQuery(dna\"CGT\", iscompatible);\n\njulia> findfirst(query, dna\"ACNT\") # 'N' matches 'G'\n2:4\n\njulia> findfirst(query, dna\"ACGT\") # 'G' matches 'N'\n2:4\n\njulia> occursin(ExactSearchQuery(dna\"CNT\", iscompatible), dna\"ACNT\")\ntrue\n\n\n\n\n\n\n","category":"type"},{"location":"sequence_search/#Allowing-mismatches","page":"Pattern matching and searching","title":"Allowing mismatches","text":"","category":"section"},{"location":"sequence_search/","page":"Pattern matching and searching","title":"Pattern matching and searching","text":"ApproximateSearchQuery","category":"page"},{"location":"sequence_search/#BioSequences.ApproximateSearchQuery","page":"Pattern matching and searching","title":"BioSequences.ApproximateSearchQuery","text":"ApproximateSearchQuery{F<:Function,S<:BioSequence}\n\nQuery type for approximate sequence search.\n\nThese queries are used as a predicate for the Base.findnext, Base.findprev, Base.occursin, Base.findfirst, and Base.findlast functions.\n\nUsing these functions with these queries allows you to search a given sequence for a sub-sequence, whilst allowing a specific number of errors.\n\nIn other words they find a subsequence of the target sequence within a specific Levenshtein distance of the query sequence.\n\nExamples\n\njulia> seq = dna\"ACAGCGTAGCT\";\n\njulia> query = ApproximateSearchQuery(dna\"AGGG\");\n\njulia> findfirst(query, 0, seq) == nothing # nothing matches with no errors\ntrue\n\njulia> findfirst(query, 1, seq) # seq[3:6] matches with one error\n3:6\n\njulia> findfirst(query, 2, seq) # seq[1:4] matches with two errors\n1:4\n\n\nYou can pass 
a comparator function such as isequal or iscompatible to its constructor to modify the search behaviour.\n\nThe default is isequal, however, in biology, sometimes we want a more flexible comparison to find subsequences of compatible symbols.\n\njulia> query = ApproximateSearchQuery(dna\"AGGG\", iscompatible);\n\njulia> occursin(query, 1, dna\"AAGNGG\") # 1 mismatch permitted (A vs G) & matched N\ntrue\n\njulia> findnext(query, 1, dna\"AAGNGG\", 1) # 1 mismatch permitted (A vs G) & matched N\n1:4\n\n\nnote: Note\nThis method of searching for motifs was implemented with smaller query motifs in mind. If you are looking to search for imperfect matches of longer sequences in this manner, you are likely better off using some kind of local-alignment algorithm or one of the BLAST variants.\n\n\n\n\n\n","category":"type"},{"location":"sequence_search/#Searching-according-to-a-pattern","page":"Pattern matching and searching","title":"Searching according to a pattern","text":"","category":"section"},{"location":"sequence_search/#Regular-expression-search","page":"Pattern matching and searching","title":"Regular expression search","text":"","category":"section"},{"location":"sequence_search/","page":"Pattern matching and searching","title":"Pattern matching and searching","text":"Query patterns can be described in regular expressions. The syntax supports a subset of Perl and PROSITE's notation.","category":"page"},{"location":"sequence_search/","page":"Pattern matching and searching","title":"Pattern matching and searching","text":"Biological regexes can be constructed using the BioRegex constructor, for example by doing BioRegex{AminoAcid}(\"MV+\"). 
For bioregex literals, it is instead recommended to use the @biore_str macro:","category":"page"},{"location":"sequence_search/","page":"Pattern matching and searching","title":"Pattern matching and searching","text":"The Perl-like syntax starts with biore (BIOlogical REgular expression) and ends with a symbol option: \"dna\", \"rna\" or \"aa\". For example, biore\"A+\"dna is a regular expression for DNA sequences and biore\"A+\"aa is for amino acid sequences. Each symbol option can be abbreviated to its first character: \"d\", \"r\" or \"a\", respectively.","category":"page"},{"location":"sequence_search/","page":"Pattern matching and searching","title":"Pattern matching and searching","text":"Here are examples of using regular expressions with BioSequences:","category":"page"},{"location":"sequence_search/","page":"Pattern matching and searching","title":"Pattern matching and searching","text":"julia> match(biore\"A+C*\"dna, dna\"AAAACC\")\nRegexMatch(\"AAAACC\")\n\njulia> match(biore\"A+C*\"d, dna\"AAAACC\")\nRegexMatch(\"AAAACC\")\n\njulia> occursin(biore\"A+C*\"dna, dna\"AAC\")\ntrue\n\njulia> occursin(biore\"A+C*\"dna, dna\"C\")\nfalse\n","category":"page"},{"location":"sequence_search/","page":"Pattern matching and searching","title":"Pattern matching and searching","text":"match returns a RegexMatch if a match is found; otherwise it returns nothing.","category":"page"},{"location":"sequence_search/","page":"Pattern matching and searching","title":"Pattern matching and searching","text":"The table below summarizes available syntax elements.","category":"page"},{"location":"sequence_search/","page":"Pattern matching and searching","title":"Pattern matching and searching","text":"Syntax Description Example\n| alternation \"A|T\" matches \"A\" and \"T\"\n* zero or more times repeat \"TA*\" matches \"T\", \"TA\" and \"TAA\"\n+ one or more times repeat \"TA+\" matches \"TA\" and \"TAA\"\n? 
zero or one time \"TA?\" matches \"T\" and \"TA\"\n{n,} n or more times repeat \"A{3,}\" matches \"AAA\" and \"AAAA\"\n{n,m} n-m times repeat \"A{3,5}\" matches \"AAA\", \"AAAA\" and \"AAAAA\"\n^ the start of the sequence \"^TAN*\" matches \"TATGT\"\n$ the end of the sequence \"N*TA$\" matches \"GCTA\"\n(...) pattern grouping \"(TA)+\" matches \"TA\" and \"TATA\"\n[...] one of symbols \"[ACG]+\" matches \"AGGC\"","category":"page"},{"location":"sequence_search/","page":"Pattern matching and searching","title":"Pattern matching and searching","text":"eachmatch and findfirst are also defined, just like usual regex and strings found in Base.","category":"page"},{"location":"sequence_search/","page":"Pattern matching and searching","title":"Pattern matching and searching","text":"julia> collect(matched(x) for x in eachmatch(biore\"TATA*?\"d, dna\"TATTATAATTA\")) # overlap\n4-element Vector{LongSequence{DNAAlphabet{4}}}:\n TAT \n TAT\n TATA\n TATAA\n\njulia> collect(matched(x) for x in eachmatch(biore\"TATA*\"d, dna\"TATTATAATTA\", false)) # no overlap\n2-element Vector{LongSequence{DNAAlphabet{4}}}:\n TAT \n TATAA\n\njulia> findfirst(biore\"TATA*\"d, dna\"TATTATAATTA\")\n1:3\n\njulia> findfirst(biore\"TATA*\"d, dna\"TATTATAATTA\", 2)\n4:8\n","category":"page"},{"location":"sequence_search/","page":"Pattern matching and searching","title":"Pattern matching and searching","text":"Noteworthy differences from strings are:","category":"page"},{"location":"sequence_search/","page":"Pattern matching and searching","title":"Pattern matching and searching","text":"Ambiguous characters match any compatible characters (e.g. biore\"N\"d is equivalent to biore\"[ACGT]\"d).\nWhitespaces are ignored (e.g. biore\"A C G\"d is equivalent to biore\"ACG\"d).","category":"page"},{"location":"sequence_search/","page":"Pattern matching and searching","title":"Pattern matching and searching","text":"The PROSITE notation is described in ScanProsite - user manual. 
The syntax supports almost all notations including the extended syntax. The PROSITE notation starts with the prosite prefix, and no symbol option is needed because it always describes patterns of amino acid sequences:","category":"page"},{"location":"sequence_search/","page":"Pattern matching and searching","title":"Pattern matching and searching","text":"julia> match(prosite\"[AC]-x-V-x(4)-{ED}\", aa\"CPVPQARG\")\nRegexMatch(\"CPVPQARG\")\n\njulia> match(prosite\"[AC]xVx(4){ED}\", aa\"CPVPQARG\")\nRegexMatch(\"CPVPQARG\")\n","category":"page"},{"location":"sequence_search/#Position-weight-matrix-search","page":"Pattern matching and searching","title":"Position weight matrix search","text":"","category":"section"},{"location":"sequence_search/","page":"Pattern matching and searching","title":"Pattern matching and searching","text":"A motif can be specified using a position weight matrix (PWM) in a probabilistic way. This method searches for the first position in the sequence where a score calculated using a PWM is greater than or equal to a threshold t. More formally, denoting the sequence as S and the PWM value of symbol s at position j as M_{s,j}, the score starting from a position p is defined as","category":"page"},{"location":"sequence_search/","page":"Pattern matching and searching","title":"Pattern matching and searching","text":"\\operatorname{score}(S, p) = \\sum_{i=1}^{L} M_{S[p+i-1],i}","category":"page"},{"location":"sequence_search/","page":"Pattern matching and searching","title":"Pattern matching and searching","text":"and the search returns the smallest p that satisfies \\operatorname{score}(S, p) \\ge t.","category":"page"},{"location":"sequence_search/","page":"Pattern matching and searching","title":"Pattern matching and searching","text":"There are two kinds of matrices in this package: PFM and PWM. The PFM type is a position frequency matrix and stores symbol frequencies for each position. The PWM is a position weight matrix and stores symbol scores for each position. 
You can create a PFM from a set of sequences with the same length and then create a PWM from the PFM object.","category":"page"},{"location":"sequence_search/","page":"Pattern matching and searching","title":"Pattern matching and searching","text":"julia> motifs = [dna\"TTA\", dna\"CTA\", dna\"ACA\", dna\"TCA\", dna\"GTA\"]\n5-element Vector{LongSequence{DNAAlphabet{4}}}:\n TTA\n CTA\n ACA\n TCA\n GTA\n\njulia> pfm = PFM(motifs) # sequence set => PFM\n4×3 PFM{DNA, Int64}:\n A 1 0 5\n C 1 2 0\n G 1 0 0\n T 2 3 0\n\njulia> pwm = PWM(pfm) # PFM => PWM\n4×3 PWM{DNA, Float64}:\n A -0.321928 -Inf 2.0\n C -0.321928 0.678072 -Inf\n G -0.321928 -Inf -Inf\n T 0.678072 1.26303 -Inf\n\njulia> pwm = PWM(pfm .+ 0.01) # add pseudo counts to avoid infinite values\n4×3 PWM{DNA, Float64}:\n A -0.319068 -6.97728 1.99139\n C -0.319068 0.673772 -6.97728\n G -0.319068 -6.97728 -6.97728\n T 0.673772 1.25634 -6.97728\n\njulia> pwm = PWM(pfm .+ 0.01, prior=[0.2, 0.3, 0.3, 0.2]) # GC-rich prior\n4×3 PWM{DNA, Float64}:\n A 0.00285965 -6.65535 2.31331\n C -0.582103 0.410737 -7.24031\n G -0.582103 -7.24031 -7.24031\n T 0.9957 1.57827 -6.65535\n","category":"page"},{"location":"sequence_search/","page":"Pattern matching and searching","title":"Pattern matching and searching","text":"The PWM_{s,j} matrix is computed from PFM_{s,j} and the prior probability p(s) as follows ([Wasserman2004]):","category":"page"},{"location":"sequence_search/","page":"Pattern matching and searching","title":"Pattern matching and searching","text":"\\begin{align}\n PWM_{s,j} &= \\log_2 \\frac{p(s,j)}{p(s)} \\\\\n p(s,j) &= \\frac{PFM_{s,j}}{\\sum_{s} PFM_{s,j}}\n\\end{align}","category":"page"},{"location":"sequence_search/","page":"Pattern matching and searching","title":"Pattern matching and searching","text":"However, if you just want to quickly conduct a search, constructing the PFM and PWM is done for you as a convenience if you build a PWMSearchQuery, using a collection of sequences:","category":"page"},{"location":"sequence_search/","page":"Pattern matching 
and searching","title":"Pattern matching and searching","text":"julia> motifs = [dna\"TTA\", dna\"CTA\", dna\"ACA\", dna\"TCA\", dna\"GTA\"]\n5-element Vector{LongSequence{DNAAlphabet{4}}}:\n TTA\n CTA\n ACA\n TCA\n GTA\n\njulia> subject = dna\"TATTATAATTA\";\n\njulia> qa = PWMSearchQuery(motifs, 1.0);\n\njulia> findfirst(qa, subject)\n3\n\njulia> findall(qa, subject)\n3-element Vector{Int64}:\n 3\n 5\n 9","category":"page"},{"location":"sequence_search/","page":"Pattern matching and searching","title":"Pattern matching and searching","text":"[Wasserman2004]: https://doi.org/10.1038/nrg1315","category":"page"},{"location":"predicates/","page":"Predicates","title":"Predicates","text":"CurrentModule = BioSequences\nDocTestSetup = quote\n using BioSequences\nend","category":"page"},{"location":"predicates/#Predicates","page":"Predicates","title":"Predicates","text":"","category":"section"},{"location":"predicates/","page":"Predicates","title":"Predicates","text":"A number of predicate or query functions are supported for sequences, allowing you to check for certain properties of a sequence.","category":"page"},{"location":"predicates/","page":"Predicates","title":"Predicates","text":"isrepetitive\nispalindromic\nhasambiguity\niscanonical","category":"page"},{"location":"predicates/#BioSequences.isrepetitive","page":"Predicates","title":"BioSequences.isrepetitive","text":"isrepetitive(seq::BioSequence, n::Integer = length(seq))\n\nReturn true if and only if seq contains a repetitive subsequence of length ≥ n.\n\n\n\n\n\n","category":"function"},{"location":"predicates/#BioSequences.ispalindromic","page":"Predicates","title":"BioSequences.ispalindromic","text":"ispalindromic(seq::BioSequence)\n\nReturn true if seq is a palindromic sequence; otherwise return false.\n\n\n\n\n\n","category":"function"},{"location":"predicates/#BioSequences.hasambiguity","page":"Predicates","title":"BioSequences.hasambiguity","text":"hasambiguity(seq::BioSequence)\n\nReturns true if seq has 
an ambiguous symbol; otherwise return false.\n\n\n\n\n\n","category":"function"},{"location":"predicates/#BioSequences.iscanonical","page":"Predicates","title":"BioSequences.iscanonical","text":"iscanonical(seq::NucleotideSeq)\n\nReturns true if seq is canonical.\n\nFor any sequence, there is a reverse complement, which is the same sequence, but on the complementary strand of DNA:\n\n------->\nATCGATCG\nCGATCGAT\n<-------\n\nnote: Note\nUsing the reverse_complement of a DNA sequence will give this reverse complement.\n\nOf the two sequences, the canonical one is the lesser of the two, i.e. canonical_seq < other_seq.\n\n\n\n\n\n","category":"function"},{"location":"#BioSequences","page":"Home","title":"BioSequences","text":"","category":"section"},{"location":"","page":"Home","title":"Home","text":"(Image: Latest Release) (Image: MIT license) (Image: Documentation) (Image: Pkg Status) (Image: Chat)","category":"page"},{"location":"#Description","page":"Home","title":"Description","text":"","category":"section"},{"location":"","page":"Home","title":"Home","text":"BioSequences provides data types and methods for common operations with biological sequences, including DNA, RNA, and amino acid sequences.","category":"page"},{"location":"#Installation","page":"Home","title":"Installation","text":"","category":"section"},{"location":"","page":"Home","title":"Home","text":"You can install BioSequences from the julia REPL. 
Press ] to enter pkg mode again, and enter the following:","category":"page"},{"location":"","page":"Home","title":"Home","text":"add BioSequences","category":"page"},{"location":"","page":"Home","title":"Home","text":"If you are interested in the cutting edge of the development, please check out the master branch to try new features before release.","category":"page"},{"location":"#Testing","page":"Home","title":"Testing","text":"","category":"section"},{"location":"","page":"Home","title":"Home","text":"BioSequences is tested against Julia 1.X on Linux, OS X, and Windows.","category":"page"},{"location":"","page":"Home","title":"Home","text":"(Image: Unit tests) (Image: Documentation) (Image: )","category":"page"},{"location":"#Contributing","page":"Home","title":"Contributing","text":"","category":"section"},{"location":"","page":"Home","title":"Home","text":"We appreciate contributions from users including reporting bugs, fixing issues, improving performance and adding new features.","category":"page"},{"location":"","page":"Home","title":"Home","text":"Take a look at the contributing files detailed contributor and maintainer guidelines, and code of conduct.","category":"page"},{"location":"#Financial-contributions","page":"Home","title":"Financial contributions","text":"","category":"section"},{"location":"","page":"Home","title":"Home","text":"We also welcome financial contributions in full transparency on our open collective. Anyone can file an expense. 
If the expense makes sense for the development of the community, it will be \"merged\" in the ledger of our open collective by the core contributors and the person who filed the expense will be reimbursed.","category":"page"},{"location":"#Backers-and-Sponsors","page":"Home","title":"Backers & Sponsors","text":"","category":"section"},{"location":"","page":"Home","title":"Home","text":"Thank you to all our backers and sponsors!","category":"page"},{"location":"","page":"Home","title":"Home","text":"Love our work and community? Become a backer.","category":"page"},{"location":"","page":"Home","title":"Home","text":"(Image: backers)","category":"page"},{"location":"","page":"Home","title":"Home","text":"Does your company use BioJulia? Help keep BioJulia feature rich and healthy by sponsoring the project Your logo will show up here with a link to your website.","category":"page"},{"location":"","page":"Home","title":"Home","text":"(Image: ) (Image: ) (Image: ) (Image: ) (Image: ) (Image: ) (Image: ) (Image: ) (Image: ) (Image: )","category":"page"},{"location":"#Questions?","page":"Home","title":"Questions?","text":"","category":"section"},{"location":"","page":"Home","title":"Home","text":"If you have a question about contributing or using BioJulia software, come on over and chat to us on Gitter, or you can try the Bio category of the Julia discourse site.","category":"page"},{"location":"types/","page":"BioSequences Types","title":"BioSequences Types","text":"CurrentModule = BioSequences\nDocTestSetup = quote\n using BioSequences\nend","category":"page"},{"location":"types/#Abstract-Types","page":"BioSequences Types","title":"Abstract Types","text":"","category":"section"},{"location":"types/","page":"BioSequences Types","title":"BioSequences Types","text":"BioSequences exports an abstract BioSequence type, and several concrete sequence types which inherit from it.","category":"page"},{"location":"types/#The-abstract-BioSequence","page":"BioSequences 
Types","title":"The abstract BioSequence","text":"","category":"section"},{"location":"types/","page":"BioSequences Types","title":"BioSequences Types","text":"BioSequences provides an abstract type called a BioSequence{A<:Alphabet}. This abstract type, and the methods and traits it supports, allows for many algorithms in BioSequences to be written as generically as possible, thus reducing the amount of code to read and understand, whilst maintaining high performance when such code is compiled for a concrete BioSequence subtype. Additionally, it allows new types to be implemented that are fully compatible with the rest of BioSequences, provided that key methods or traits are defined.","category":"page"},{"location":"types/","page":"BioSequences Types","title":"BioSequences Types","text":"BioSequence","category":"page"},{"location":"types/#BioSequences.BioSequence","page":"BioSequences Types","title":"BioSequences.BioSequence","text":"BioSequence{A <: Alphabet}\n\nBioSequence is the main abstract type of BioSequences. It abstracts over the internal representation of different biological sequences, and is parameterized by an Alphabet, which controls the element type.\n\nExtended help\n\nIts subtypes are characterized by:\n\nBeing a linear container type with random access and indices Base.OneTo(length(x)).\nContaining zero or more internal data elements of type encoded_data_eltype(typeof(x)).\nBeing associated with an Alphabet, A by being a subtype of BioSequence{A}.\n\nA BioSequence{A} is indexed by an integer. The biosequence subtype, the index and the alphabet A determine how to extract the internal encoded data. The alphabet decides how to decode the data to the element type of the biosequence. 
Hence, the element type and container type of a BioSequence are separated.\n\nSubtypes T of BioSequence must implement the following, with E being an encoded data type:\n\nBase.length(::T)::Int\nencoded_data_eltype(::Type{T})::Type{E}\nextract_encoded_element(::T, ::Integer)::E\ncopy(::T)\nT must be able to be constructed from any iterable with length defined and with a known, compatible element type.\n\nFurthermore, mutable sequences should implement\n\nencoded_setindex!(::T, ::E, ::Integer)\nT(undef, ::Int)\nresize!(::T, ::Int)\n\nFor compatibility with existing Alphabets, the encoded data eltype must be UInt.\n\n\n\n\n\n","category":"type"},{"location":"types/","page":"BioSequences Types","title":"BioSequences Types","text":"Some aliases for BioSequence are also provided for your convenience:","category":"page"},{"location":"types/","page":"BioSequences Types","title":"BioSequences Types","text":"NucSeq\nAASeq","category":"page"},{"location":"types/#BioSequences.NucSeq","page":"BioSequences Types","title":"BioSequences.NucSeq","text":"An alias for BioSequence{<:NucleicAcidAlphabet}\n\n\n\n\n\n","category":"type"},{"location":"types/#BioSequences.AASeq","page":"BioSequences Types","title":"BioSequences.AASeq","text":"An alias for BioSequence{AminoAcidAlphabet}\n\n\n\n\n\n","category":"type"},{"location":"types/","page":"BioSequences Types","title":"BioSequences Types","text":"Let's have a closer look at some of those methods that a subtype of BioSequence must implement. Check out the Julia Base library docs for length, copy and resize!.","category":"page"},{"location":"types/","page":"BioSequences Types","title":"BioSequences Types","text":"encoded_data_eltype\nextract_encoded_element\nencoded_setindex!","category":"page"},{"location":"types/#BioSequences.encoded_data_eltype","page":"BioSequences Types","title":"BioSequences.encoded_data_eltype","text":"encoded_data_eltype(::Type{<:BioSequence})\n\nReturns the element type of the encoded data of the BioSequence. 
This is the return type of extract_encoded_element, i.e. the data type that stores the biological symbols in the biosequence.\n\nSee also: BioSequence \n\n\n\n\n\n","category":"function"},{"location":"types/#BioSequences.extract_encoded_element","page":"BioSequences Types","title":"BioSequences.extract_encoded_element","text":"extract_encoded_element(::BioSequence{A}, i::Integer)\n\nReturns the encoded element at position i. This data can be decoded using decode(A(), data) to yield the element type of the biosequence.\n\nSee also: BioSequence \n\n\n\n\n\n","category":"function"},{"location":"types/#BioSequences.encoded_setindex!","page":"BioSequences Types","title":"BioSequences.encoded_setindex!","text":"encoded_setindex!(seq::BioSequence, x::E, i::Integer)\n\nGiven encoded data x of type encoded_data_eltype(typeof(seq)), sets the internal sequence data at the given index.\n\nSee also: BioSequence \n\n\n\n\n\n","category":"function"},{"location":"types/","page":"BioSequences Types","title":"BioSequences Types","text":"For a correctly defined subtype of BioSequence that satisfies the interface, the vast majority of methods described in the rest of this manual should work out of the box. They can always be overloaded if needed. Indeed, some of the generic BioSequence methods are overloaded for LongSequence, for example for transformation and counting operations, where efficiency gains can be made due to the specific internal representation of that type.","category":"page"},{"location":"types/#The-abstract-Alphabet","page":"BioSequences Types","title":"The abstract Alphabet","text":"","category":"section"},{"location":"types/","page":"BioSequences Types","title":"BioSequences Types","text":"Alphabets control how biological symbols are encoded and decoded. 
They also confer many of the automatic traits and methods that any subtype of T<:BioSequence{A<:Alphabet} will get.","category":"page"},{"location":"types/","page":"BioSequences Types","title":"BioSequences Types","text":"BioSequences.Alphabet\nBioSequences.AsciiAlphabet","category":"page"},{"location":"types/#BioSequences.Alphabet","page":"BioSequences Types","title":"BioSequences.Alphabet","text":"Alphabet\n\nAlphabet is the most important type trait for BioSequence. An Alphabet represents a set of biological symbols encoded by a sequence, e.g. A, C, G and T for a DNA Alphabet that requires only 2 bits to represent each symbol.\n\nExtended help\n\nSubtypes of Alphabet are singleton structs that may or may not be parameterized.\nAlphabets span over a finite set of biological symbols.\nThe alphabet controls the encoding from some internal \"encoded data\" to a BioSymbol of the alphabet's element type, as well as the decoding, the inverse process.\nAn Alphabet's encode method must not produce invalid data. \n\nEvery subtype A of Alphabet must implement:\n\nBase.eltype(::Type{A})::Type{S} for some eltype S, which must be a BioSymbol.\nsymbols(::A)::Tuple{Vararg{S}}. 
This gives tuples of all symbols in the set of A.\nencode(::A, ::S)::E encodes a symbol to an internal data eltype E.\ndecode(::A, ::E)::S decodes an internal data eltype E to a symbol S.\nExcept for eltype which must follow Base conventions, all functions operating on Alphabet should operate on instances of the alphabet, not the type.\n\nIf you want interoperation with existing subtypes of BioSequence, the encoded representation E must be of type UInt, and you must also implement:\n\nBitsPerSymbol(::A)::BitsPerSymbol{N}, where the N must be zero or a power of two in [1, 2, 4, 8, 16, 32, [64 for 64-bit systems]].\n\nFor increased performance, see BioSequences.AsciiAlphabet\n\n\n\n\n\n","category":"type"},{"location":"types/#BioSequences.AsciiAlphabet","page":"BioSequences Types","title":"BioSequences.AsciiAlphabet","text":"AsciiAlphabet\n\nTrait for alphabet using ASCII characters as String representation. Define codetype(A) = AsciiAlphabet() for a user-defined Alphabet A to gain speed. Methods needed: BioSymbols.stringbyte(::eltype(A)) and ascii_encode(A, ::UInt8).\n\n\n\n\n\n","category":"type"},{"location":"types/#Concrete-types","page":"BioSequences Types","title":"Concrete types","text":"","category":"section"},{"location":"types/#Implemented-alphabets","page":"BioSequences Types","title":"Implemented alphabets","text":"","category":"section"},{"location":"types/","page":"BioSequences Types","title":"BioSequences Types","text":"DNAAlphabet\nRNAAlphabet\nAminoAcidAlphabet","category":"page"},{"location":"types/#BioSequences.DNAAlphabet","page":"BioSequences Types","title":"BioSequences.DNAAlphabet","text":"DNA nucleotide alphabet.\n\nDNAAlphabet has a parameter N which is a number that determines the BitsPerSymbol trait. 
Currently supported values of N are 2 and 4.\n\n\n\n\n\n","category":"type"},{"location":"types/#BioSequences.RNAAlphabet","page":"BioSequences Types","title":"BioSequences.RNAAlphabet","text":"RNA nucleotide alphabet.\n\nRNAAlphabet has a parameter N which is a number that determines the BitsPerSymbol trait. Currently supported values of N are 2 and 4.\n\n\n\n\n\n","category":"type"},{"location":"types/#BioSequences.AminoAcidAlphabet","page":"BioSequences Types","title":"BioSequences.AminoAcidAlphabet","text":"Amino acid alphabet.\n\n\n\n\n\n","category":"type"},{"location":"types/#Long-Sequences","page":"BioSequences Types","title":"Long Sequences","text":"","category":"section"},{"location":"types/","page":"BioSequences Types","title":"BioSequences Types","text":"LongSequence","category":"page"},{"location":"types/#BioSequences.LongSequence","page":"BioSequences Types","title":"BioSequences.LongSequence","text":"LongSequence{A <: Alphabet}\n\nGeneral-purpose BioSequence. This type is mutable and variable-length, and should be preferred for most use cases.\n\nExtended help\n\nLongSequence{A<:Alphabet} <: BioSequence{A} is parameterized by a concrete Alphabet type A that defines the domain (or set) of biological symbols permitted.\n\nAs the BioSequence interface definition implies, LongSequences store the biological symbol elements that they contain in a succinct encoded form that permits many operations to be done in an efficient bit-parallel manner. 
As per the interface of BioSequence, the Alphabet determines how an element is encoded or decoded when it is inserted or extracted from the sequence.\n\nFor example, AminoAcidAlphabet is associated with AminoAcid and hence an object of the LongSequence{AminoAcidAlphabet} type represents a sequence of amino acids.\n\nSymbols from multiple alphabets can't be intermixed in one sequence type.\n\nThe following table summarizes common LongSequence types that have been given aliases for convenience.\n\nType Symbol type Type alias\nLongSequence{DNAAlphabet{N}} DNA LongDNA{N}\nLongSequence{RNAAlphabet{N}} RNA LongRNA{N}\nLongSequence{AminoAcidAlphabet} AminoAcid LongAA\n\nThe LongDNA and LongRNA aliases use a DNAAlphabet{4}.\n\nDNAAlphabet{4} permits ambiguous nucleotides, and a sequence must use at least 4 bits to internally store each element (and indeed LongSequence does).\n\nIf you are sure that you are working with sequences with no ambiguous nucleotides, you can use LongSequences parameterised with DNAAlphabet{2} instead.\n\nDNAAlphabet{2} is an alphabet that uses two bits per base and limits to only unambiguous nucleotide symbols (A,C,G,T).\n\nChanging this single parameter, is all you need to do in order to benefit from memory savings. 
Some computations that use bitwise operations will also be dramatically faster.\n\nThe same applies with LongSequence{RNAAlphabet{4}}, simply replace the alphabet parameter with RNAAlphabet{2} in order to benefit.\n\n\n\n\n\n","category":"type"},{"location":"types/#Sequence-views","page":"BioSequences Types","title":"Sequence views","text":"","category":"section"},{"location":"types/","page":"BioSequences Types","title":"BioSequences Types","text":"Similar to how Base Julia offers views of array objects, BioSequences offers views of LongSequences: the LongSubSeq{A<:Alphabet}.","category":"page"},{"location":"types/","page":"BioSequences Types","title":"BioSequences Types","text":"Conceptually, a LongSubSeq{A} is similar to a LongSequence{A}, but instead of storing its own data, it refers to the data of a LongSequence. Modifying the LongSequence will be reflected in the view, and vice versa. If the underlying LongSequence is truncated, the behaviour of a view is undefined. For the same reason, some operations are not supported for views, such as resizing.","category":"page"},{"location":"types/","page":"BioSequences Types","title":"BioSequences Types","text":"The purpose of LongSubSeq is that, since they only contain a pointer to the underlying array, an offset and a length, they are much lighter than LongSequences, and will be stack allocated on Julia 1.5 and newer. 
Thus, the user may construct millions of views without major performance implications.","category":"page"}] +[{"location":"symbols/","page":"Biological Symbols","title":"Biological Symbols","text":"CurrentModule = BioSequences\nDocTestSetup = quote\n using BioSequences\nend","category":"page"},{"location":"symbols/#Biological-symbols","page":"Biological Symbols","title":"Biological symbols","text":"","category":"section"},{"location":"symbols/","page":"Biological Symbols","title":"Biological Symbols","text":"The BioSequences module reexports the biological symbol (character) types that are provided by BioSymbols.jl:","category":"page"},{"location":"symbols/","page":"Biological Symbols","title":"Biological Symbols","text":"Type Meaning\nDNA DNA nucleotide\nRNA RNA nucleotide\nAminoAcid Amino acid","category":"page"},{"location":"symbols/","page":"Biological Symbols","title":"Biological Symbols","text":"These symbols are elements of biological sequence types, just as characters are elements of strings.","category":"page"},{"location":"symbols/#DNA-and-RNA-nucleotides","page":"Biological Symbols","title":"DNA and RNA nucleotides","text":"","category":"section"},{"location":"symbols/","page":"Biological Symbols","title":"Biological Symbols","text":"The set of nucleotide symbols in BioSequences covers the IUPAC nucleotide bases plus a gap symbol:","category":"page"},{"location":"symbols/","page":"Biological Symbols","title":"Biological Symbols","text":"Symbol Constant Meaning\n'A' DNA_A / RNA_A A; Adenine\n'C' DNA_C / RNA_C C; Cytosine\n'G' DNA_G / RNA_G G; Guanine\n'T' DNA_T T; Thymine (DNA only)\n'U' RNA_U U; Uracil (RNA only)\n'M' DNA_M / RNA_M A or C\n'R' DNA_R / RNA_R A or G\n'W' DNA_W / RNA_W A or T/U\n'S' DNA_S / RNA_S C or G\n'Y' DNA_Y / RNA_Y C or T/U\n'K' DNA_K / RNA_K G or T/U\n'V' DNA_V / RNA_V A or C or G; not T/U\n'H' DNA_H / RNA_H A or C or T/U; not G\n'D' DNA_D / RNA_D A or G or T/U; not C\n'B' DNA_B / RNA_B C or G or T/U; not A\n'N' DNA_N / RNA_N A or C or G or 
T/U\n'-' DNA_Gap / RNA_Gap Gap (none of the above)","category":"page"},{"location":"symbols/","page":"Biological Symbols","title":"Biological Symbols","text":"https://www.bioinformatics.org/sms/iupac.html","category":"page"},{"location":"symbols/","page":"Biological Symbols","title":"Biological Symbols","text":"Symbols are accessible as constants with DNA_ or RNA_ prefix:","category":"page"},{"location":"symbols/","page":"Biological Symbols","title":"Biological Symbols","text":"julia> DNA_A\nDNA_A\n\njulia> DNA_T\nDNA_T\n\njulia> RNA_U\nRNA_U\n\njulia> DNA_Gap\nDNA_Gap\n\njulia> typeof(DNA_A)\nDNA\n\njulia> typeof(RNA_A)\nRNA\n","category":"page"},{"location":"symbols/","page":"Biological Symbols","title":"Biological Symbols","text":"Symbols can be constructed by converting regular characters:","category":"page"},{"location":"symbols/","page":"Biological Symbols","title":"Biological Symbols","text":"julia> convert(DNA, 'C')\nDNA_C\n\njulia> convert(DNA, 'C') === DNA_C\ntrue\n","category":"page"},{"location":"symbols/#Amino-acids","page":"Biological Symbols","title":"Amino acids","text":"","category":"section"},{"location":"symbols/","page":"Biological Symbols","title":"Biological Symbols","text":"Set of amino acid symbols also covers IUPAC amino acid symbols plus a gap symbol:","category":"page"},{"location":"symbols/","page":"Biological Symbols","title":"Biological Symbols","text":"Symbol Constant Meaning\n'A' AA_A Alanine\n'R' AA_R Arginine\n'N' AA_N Asparagine\n'D' AA_D Aspartic acid (Aspartate)\n'C' AA_C Cysteine\n'Q' AA_Q Glutamine\n'E' AA_E Glutamic acid (Glutamate)\n'G' AA_G Glycine\n'H' AA_H Histidine\n'I' AA_I Isoleucine\n'L' AA_L Leucine\n'K' AA_K Lysine\n'M' AA_M Methionine\n'F' AA_F Phenylalanine\n'P' AA_P Proline\n'S' AA_S Serine\n'T' AA_T Threonine\n'W' AA_W Tryptophan\n'Y' AA_Y Tyrosine\n'V' AA_V Valine\n'O' AA_O Pyrrolysine\n'U' AA_U Selenocysteine\n'B' AA_B Aspartic acid or Asparagine\n'J' AA_J Leucine or Isoleucine\n'Z' AA_Z Glutamine or Glutamic 
acid\n'X' AA_X Any amino acid\n'*' AA_Term Termination codon\n'-' AA_Gap Gap (none of the above)","category":"page"},{"location":"symbols/","page":"Biological Symbols","title":"Biological Symbols","text":"https://www.bioinformatics.org/sms/iupac.html","category":"page"},{"location":"symbols/","page":"Biological Symbols","title":"Biological Symbols","text":"Symbols are accessible as constants with AA_ prefix:","category":"page"},{"location":"symbols/","page":"Biological Symbols","title":"Biological Symbols","text":"julia> AA_A\nAA_A\n\njulia> AA_Q\nAA_Q\n\njulia> AA_Term\nAA_Term\n\njulia> typeof(AA_A)\nAminoAcid\n","category":"page"},{"location":"symbols/","page":"Biological Symbols","title":"Biological Symbols","text":"Symbols can be constructed by converting regular characters:","category":"page"},{"location":"symbols/","page":"Biological Symbols","title":"Biological Symbols","text":"julia> convert(AminoAcid, 'A')\nAA_A\n\njulia> convert(AminoAcid, 'P') === AA_P\ntrue\n","category":"page"},{"location":"symbols/#Other-functions","page":"Biological Symbols","title":"Other functions","text":"","category":"section"},{"location":"symbols/","page":"Biological Symbols","title":"Biological Symbols","text":"alphabet\ngap\niscompatible\nisambiguous","category":"page"},{"location":"symbols/#BioSymbols.alphabet","page":"Biological Symbols","title":"BioSymbols.alphabet","text":"alphabet(DNA)\n\nGet all symbols of DNA in sorted order.\n\nExamples\n\njulia> alphabet(DNA)\n(DNA_Gap, DNA_A, DNA_C, DNA_M, DNA_G, DNA_R, DNA_S, DNA_V, DNA_T, DNA_W, DNA_Y, DNA_H, DNA_K, DNA_D, DNA_B, DNA_N)\n\njulia> issorted(alphabet(DNA))\ntrue\n\n\n\n\n\n\nalphabet(RNA)\n\nGet all symbols of RNA in sorted order.\n\nExamples\n\njulia> alphabet(RNA)\n(RNA_Gap, RNA_A, RNA_C, RNA_M, RNA_G, RNA_R, RNA_S, RNA_V, RNA_U, RNA_W, RNA_Y, RNA_H, RNA_K, RNA_D, RNA_B, RNA_N)\n\njulia> issorted(alphabet(RNA))\ntrue\n\n\n\n\n\n\nalphabet(AminoAcid)\n\nGet all symbols of AminoAcid in sorted 
order.\n\nExamples\n\njulia> alphabet(AminoAcid)\n(AA_A, AA_R, AA_N, AA_D, AA_C, AA_Q, AA_E, AA_G, AA_H, AA_I, AA_L, AA_K, AA_M, AA_F, AA_P, AA_S, AA_T, AA_W, AA_Y, AA_V, AA_O, AA_U, AA_B, AA_J, AA_Z, AA_X, AA_Term, AA_Gap)\n\njulia> issorted(alphabet(AminoAcid))\ntrue\n\n\n\n\n\n\n","category":"function"},{"location":"symbols/#BioSymbols.gap","page":"Biological Symbols","title":"BioSymbols.gap","text":"gap(::Type{T})::T\n\nReturn the gap (indel) representation of T. By default, gap is defined for DNA, RNA, AminoAcid and Char.\n\nExamples\n\njulia> gap(RNA)\nRNA_Gap\n\njulia> gap(Char)\n'-': ASCII/Unicode U+002D (category Pd: Punctuation, dash)\n\n\n\n\n\n","category":"function"},{"location":"symbols/#BioSymbols.iscompatible","page":"Biological Symbols","title":"BioSymbols.iscompatible","text":"iscompatible(x::S, y::S) where S <: BioSymbol\n\nTest if x and y are compatible with each other.\n\nExamples\n\njulia> iscompatible(AA_A, AA_R)\nfalse\n\njulia> iscompatible(AA_A, AA_X)\ntrue\n\njulia> iscompatible(DNA_A, DNA_A)\ntrue\n\njulia> iscompatible(DNA_C, DNA_N) # DNA_N can be DNA_C\ntrue\n\njulia> iscompatible(DNA_C, DNA_R) # DNA_R (A or G) cannot be DNA_C\nfalse\n\n\n\n\n\n\n","category":"function"},{"location":"symbols/#BioSymbols.isambiguous","page":"Biological Symbols","title":"BioSymbols.isambiguous","text":"isambiguous(nt::NucleicAcid)\n\nTest if nt is an ambiguous nucleotide.\n\n\n\n\n\nisambiguous(aa::AminoAcid)\n\nTest if aa is an ambiguous amino acid.\n\n\n\n\n\n","category":"function"},{"location":"io/#I/O-for-sequencing-file-formats","page":"I/O","title":"I/O for sequencing file formats","text":"","category":"section"},{"location":"io/","page":"I/O","title":"I/O","text":"Versions of BioSequences prior to v2.0 provided a FASTA, FASTQ, and 2Bit submodule for working with formatted sequence files.","category":"page"},{"location":"io/","page":"I/O","title":"I/O","text":"After version v2.0, in order to neatly separate concerns, these submodules were 
removed.","category":"page"},{"location":"io/","page":"I/O","title":"I/O","text":"Instead there will now be dedicated BioJulia packages for each format. Each of these will be compatible with BioSequences.","category":"page"},{"location":"io/","page":"I/O","title":"I/O","text":"A list of all of the different formats and packages is provided below to help you find them quickly.","category":"page"},{"location":"io/","page":"I/O","title":"I/O","text":"Format Package\nFASTA FASTX.jl\nFASTQ FASTX.jl\n2Bit TwoBit.jl","category":"page"},{"location":"counting/","page":"Counting","title":"Counting","text":"CurrentModule = BioSequences\nDocTestSetup = quote\n using BioSequences\nend","category":"page"},{"location":"counting/#Counting","page":"Counting","title":"Counting","text":"","category":"section"},{"location":"counting/","page":"Counting","title":"Counting","text":"BioSequences extends the Base.count method to provide some useful utilities for counting the number of sites in biological sequences.","category":"page"},{"location":"counting/","page":"Counting","title":"Counting","text":"Most generically you can count the number of sites that satisfy some condition i.e. 
cause some function to return true:","category":"page"},{"location":"counting/","page":"Counting","title":"Counting","text":"julia> count(isambiguous, dna\"ATCGM\")\n1\n","category":"page"},{"location":"counting/","page":"Counting","title":"Counting","text":"You can also use two sequences, for example to compute the number of matching or mismatching symbols:","category":"page"},{"location":"counting/","page":"Counting","title":"Counting","text":"julia> count(!=, dna\"ATCGM\", dna\"GCCGM\")\n2\n\njulia> count(==, dna\"ATCGM\", dna\"GCCGM\")\n3\n","category":"page"},{"location":"counting/#Alias-functions","page":"Counting","title":"Alias functions","text":"","category":"section"},{"location":"counting/","page":"Counting","title":"Counting","text":"A number of functions which are aliases for various invocations of Base.count are provided.","category":"page"},{"location":"counting/","page":"Counting","title":"Counting","text":"Alias function Base.count call(s)\nn_ambiguous count(isambiguous, seq), count(isambiguous, seqa, seqb)\nn_certain count(iscertain, seq), count(iscertain, seqa, seqb)\nn_gap count(isgap, seq), count(isgap, seqa, seqb)\nmatches count(==, seqa, seqb)\nmismatches count(!=, seqa, seqb)","category":"page"},{"location":"counting/#Bit-parallel-optimisations","page":"Counting","title":"Bit-parallel optimisations","text":"","category":"section"},{"location":"counting/","page":"Counting","title":"Counting","text":"For the vast majority of Base.count(f, seq) and Base.count(f, seqa, seqb) methods, a naive counting is done: the internal count_naive function is called, which simply loops over each position, applies f, and accumulates the result.","category":"page"},{"location":"counting/","page":"Counting","title":"Counting","text":"However, for some functions, it is possible to implement highly efficient methods that use bit-parallelism to check many elements at one time. This is made possible by the succinct encoding of BioSequences. 
Usually f is one of the functions provided by BioSymbols.jl or by BioSequences.jl.","category":"page"},{"location":"counting/","page":"Counting","title":"Counting","text":"For such sequence and function combinations, Base.count(f, seq) is overloaded to call an internal BioSequences.count_*_bitpar function, which is passed the sequence(s). If you want to force BioSequences to use naive counting for the purposes of testing or debugging for example, then you can call BioSequences.count_naive directly.","category":"page"},{"location":"interfaces/","page":"Implementing custom types","title":"Implementing custom types","text":"CurrentModule = BioSequences\nDocTestSetup = quote\n using BioSequences\nend","category":"page"},{"location":"interfaces/#Custom-BioSequences-types","page":"Implementing custom types","title":"Custom BioSequences types","text":"","category":"section"},{"location":"interfaces/","page":"Implementing custom types","title":"Implementing custom types","text":"If you're developing your own bioinformatics package or method, you may find that the reference implementation of concrete LongSequence types provided in this package is not optimal for your purposes.","category":"page"},{"location":"interfaces/","page":"Implementing custom types","title":"Implementing custom types","text":"This page describes the interfaces for BioSequences' core types for developers or other packages implementing their own sequence types or extending BioSequences functionality.","category":"page"},{"location":"interfaces/#Implementing-custom-Alphabets","page":"Implementing custom types","title":"Implementing custom Alphabets","text":"","category":"section"},{"location":"interfaces/","page":"Implementing custom types","title":"Implementing custom types","text":"Recall the required methods that define the Alphabet interface. 
","category":"page"},{"location":"interfaces/","page":"Implementing custom types","title":"Implementing custom types","text":"To create an example custom alphabet, we need to create a singleton type, that implements a few methods in order to conform to the interface as described in the Alphabet documentation.","category":"page"},{"location":"interfaces/","page":"Implementing custom types","title":"Implementing custom types","text":"Let's do that for a restricted Amino Acid alphabet. We can test that it conforms to the interface with the BioSequences.has_interface function.","category":"page"},{"location":"interfaces/","page":"Implementing custom types","title":"Implementing custom types","text":"julia> struct ReducedAAAlphabet <: Alphabet end\n\njulia> Base.eltype(::Type{ReducedAAAlphabet}) = AminoAcid\n\njulia> BioSequences.BitsPerSymbol(::ReducedAAAlphabet) = BioSequences.BitsPerSymbol{4}()\n\njulia> function BioSequences.symbols(::ReducedAAAlphabet)\n (AA_L, AA_C, AA_A, AA_G, AA_S, AA_T, AA_P, AA_F,\n AA_W, AA_E, AA_D, AA_N, AA_Q, AA_K, AA_H, AA_M)\n end\n\njulia> const (ENC_LUT, DEC_LUT) = let\n enc_lut = fill(0xff, length(alphabet(AminoAcid)))\n dec_lut = fill(AA_A, length(symbols(ReducedAAAlphabet())))\n for (i, aa) in enumerate(symbols(ReducedAAAlphabet()))\n enc_lut[reinterpret(UInt8, aa) + 0x01] = i - 1\n dec_lut[i] = aa\n end\n (Tuple(enc_lut), Tuple(dec_lut))\n end\n((0x02, 0xff, 0x0b, 0x0a, 0x01, 0x0c, 0x09, 0x03, 0x0e, 0xff, 0x00, 0x0d, 0x0f, 0x07, 0x06, 0x04, 0x05, 0x08, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff), (AA_L, AA_C, AA_A, AA_G, AA_S, AA_T, AA_P, AA_F, AA_W, AA_E, AA_D, AA_N, AA_Q, AA_K, AA_H, AA_M))\n\njulia> function BioSequences.encode(::ReducedAAAlphabet, aa::AminoAcid)\n i = reinterpret(UInt8, aa) + 0x01\n (i ≥ length(ENC_LUT) || @inbounds ENC_LUT[i] === 0xff) && throw(DomainError(aa))\n (@inbounds ENC_LUT[i]) % UInt\n end\n\njulia> function BioSequences.decode(::ReducedAAAlphabet, x::UInt)\n x ≥ length(DEC_LUT) && 
throw(DomainError(x))\n @inbounds DEC_LUT[x + UInt(1)]\n end\n\njulia> BioSequences.has_interface(Alphabet, ReducedAAAlphabet())\ntrue\n","category":"page"},{"location":"interfaces/#Implementing-custom-BioSequences","page":"Implementing custom types","title":"Implementing custom BioSequences","text":"","category":"section"},{"location":"interfaces/","page":"Implementing custom types","title":"Implementing custom types","text":"Recall the required methods that define the BioSequence interface. ","category":"page"},{"location":"interfaces/","page":"Implementing custom types","title":"Implementing custom types","text":"To create an example custom sequence type, we need to create a type that implements a few methods in order to conform to the interface as described in the BioSequence documentation.","category":"page"},{"location":"interfaces/","page":"Implementing custom types","title":"Implementing custom types","text":"Let's do that for a custom sequence type that is optimised to represent a small sequence: A Codon. 
We can test that it conforms to the interface with the BioSequences.has_interface function.","category":"page"},{"location":"interfaces/","page":"Implementing custom types","title":"Implementing custom types","text":"julia> struct Codon <: BioSequence{RNAAlphabet{2}}\n x::UInt8\n end\n\njulia> function Codon(iterable)\n length(iterable) == 3 || error(\"Must have length 3\")\n x = zero(UInt)\n for (i, nt) in enumerate(iterable)\n x |= BioSequences.encode(Alphabet(Codon), convert(RNA, nt)) << (6-2i)\n end\n Codon(x % UInt8)\n end\nCodon\n\njulia> Base.length(::Codon) = 3\n\njulia> BioSequences.encoded_data_eltype(::Type{Codon}) = UInt\n\njulia> function BioSequences.extract_encoded_element(x::Codon, i::Int)\n ((x.x >>> (6-2i)) & 3) % UInt\n end\n\njulia> Base.copy(seq::Codon) = Codon(seq.x)\n\njulia> BioSequences.has_interface(BioSequence, Codon, [RNA_C, RNA_U, RNA_A], false)\ntrue","category":"page"},{"location":"interfaces/#Interface-checking-functions","page":"Implementing custom types","title":"Interface checking functions","text":"","category":"section"},{"location":"interfaces/","page":"Implementing custom types","title":"Implementing custom types","text":"BioSequences.has_interface","category":"page"},{"location":"interfaces/#BioSequences.has_interface","page":"Implementing custom types","title":"BioSequences.has_interface","text":"function has_interface(::Type{Alphabet}, A::Alphabet)\n\nReturns whether A conforms to the Alphabet interface.\n\n\n\n\n\nhas_interface(::Type{BioSequence}, ::T, syms::Vector, mutable::Bool, compat::Bool=true)\n\nCheck if type T conforms to the BioSequence interface. A T is constructed from the vector of element types syms which must not be empty. If the mutable flag is set, also check the mutable interface. 
If the compat flag is set, check for compatibility with existing alphabets.\n\n\n\n\n\n","category":"function"},{"location":"random/","page":"Random sequences","title":"Random sequences","text":"CurrentModule = BioSequences\nDocTestSetup = quote\n using BioSequences\nend","category":"page"},{"location":"random/#Generating-random-sequences","page":"Random sequences","title":"Generating random sequences","text":"","category":"section"},{"location":"random/#Long-sequences","page":"Random sequences","title":"Long sequences","text":"","category":"section"},{"location":"random/","page":"Random sequences","title":"Random sequences","text":"You can generate random long sequences using the randseq family of functions and the Samplers implemented in BioSequences:","category":"page"},{"location":"random/","page":"Random sequences","title":"Random sequences","text":"randseq\nranddnaseq\nrandrnaseq\nrandaaseq\nSamplerUniform\nSamplerWeighted","category":"page"},{"location":"random/#BioSequences.randseq","page":"Random sequences","title":"BioSequences.randseq","text":"randseq([rng::AbstractRNG], A::Alphabet, len::Integer)\n\nGenerate a LongSequence{A} of length len from the specified alphabet, drawn from the default distribution. User-defined alphabets should implement this method to implement random LongSequence generation.\n\nFor RNA and DNA alphabets, the default distribution is uniform across A, C, G, and T/U. For AminoAcidAlphabet, it is uniform across the 20 standard amino acids. 
For a user-defined alphabet A, the default is uniform across all elements of symbols(A).\n\nExample:\n\njulia> seq = randseq(AminoAcidAlphabet(), 50)\n50aa Amino Acid Sequence:\nVFMHSIRMIRLMVHRSWKMHSARHVNFIRCQDKKWKSADGIYTDICKYSM\n\n\n\n\n\nrandseq([rng::AbstractRNG], A::Alphabet, sp::Sampler, len::Integer)\n\nGenerate a LongSequence{A} of length len with elements drawn from the given sampler.\n\nExample:\n\n# Generate 50-length RNA with 4% chance of N, 24% for A, C, G, or U\njulia> sp = SamplerWeighted(rna\"ACGUN\", fill(0.24, 4))\njulia> seq = randseq(RNAAlphabet{4}(), sp, 50)\n50nt RNA Sequence:\nCUNGGGCCCGGGNAAACGUGGUACACCCUGUUAAUAUCAACNNGCGCUNU\n\n\n\n\n\n","category":"function"},{"location":"random/#BioSequences.randdnaseq","page":"Random sequences","title":"BioSequences.randdnaseq","text":"randdnaseq([rng::AbstractRNG], len::Integer)\n\nGenerate a random LongSequence{DNAAlphabet{4}} sequence of length len, with bases sampled uniformly from [A, C, G, T]\n\n\n\n\n\n","category":"function"},{"location":"random/#BioSequences.randrnaseq","page":"Random sequences","title":"BioSequences.randrnaseq","text":"randrnaseq([rng::AbstractRNG], len::Integer)\n\nGenerate a random LongSequence{RNAAlphabet{4}} sequence of length len, with bases sampled uniformly from [A, C, G, U]\n\n\n\n\n\n","category":"function"},{"location":"random/#BioSequences.randaaseq","page":"Random sequences","title":"BioSequences.randaaseq","text":"randaaseq([rng::AbstractRNG], len::Integer)\n\nGenerate a random LongSequence{AminoAcidAlphabet} sequence of length len, with amino acids sampled uniformly from the 20 standard amino acids.\n\n\n\n\n\n","category":"function"},{"location":"random/#BioSequences.SamplerUniform","page":"Random sequences","title":"BioSequences.SamplerUniform","text":"SamplerUniform{T}\n\nUniform sampler of type T. 
Instantiate with a collection of eltype T containing the elements to sample.\n\nExamples\n\njulia> sp = SamplerUniform(rna\"ACGU\");\n\n\n\n\n\n","category":"type"},{"location":"random/#BioSequences.SamplerWeighted","page":"Random sequences","title":"BioSequences.SamplerWeighted","text":"SamplerWeighted{T}\n\nWeighted sampler of type T. Instantiate with a collection of eltype T containing the elements to sample, and an ordered collection of probabilities to sample each element except the last. The last probability is the remaining probability up to 1.\n\nExamples\n\njulia> sp = SamplerWeighted(rna\"ACGUN\", fill(0.2475, 4));\n\n\n\n\n\n","category":"type"},{"location":"transforms/","page":"Indexing & modifying sequences","title":"Indexing & modifying sequences","text":"CurrentModule = BioSequences\nDocTestSetup = quote\n using BioSequences\nend","category":"page"},{"location":"transforms/#Indexing-and-modifying-sequences","page":"Indexing & modifying sequences","title":"Indexing & modifying sequences","text":"","category":"section"},{"location":"transforms/#Indexing","page":"Indexing & modifying sequences","title":"Indexing","text":"","category":"section"},{"location":"transforms/","page":"Indexing & modifying sequences","title":"Indexing & modifying sequences","text":"Most concrete BioSequence subtypes behave like other vector or string types. 
They can be indexed using integers or ranges:","category":"page"},{"location":"transforms/","page":"Indexing & modifying sequences","title":"Indexing & modifying sequences","text":"For example, with LongSequences:","category":"page"},{"location":"transforms/","page":"Indexing & modifying sequences","title":"Indexing & modifying sequences","text":"julia> seq = dna\"ACGTTTANAGTNNAGTACC\"\n19nt DNA Sequence:\nACGTTTANAGTNNAGTACC\n\njulia> seq[5]\nDNA_T\n\njulia> seq[6:end]\n14nt DNA Sequence:\nTANAGTNNAGTACC\n","category":"page"},{"location":"transforms/","page":"Indexing & modifying sequences","title":"Indexing & modifying sequences","text":"The biological symbol at a given locus in a biological sequence can be set using setindex:","category":"page"},{"location":"transforms/","page":"Indexing & modifying sequences","title":"Indexing & modifying sequences","text":"julia> seq = dna\"ACGTTTANAGTNNAGTACC\"\n19nt DNA Sequence:\nACGTTTANAGTNNAGTACC\n\njulia> seq[5] = DNA_A\nDNA_A\n","category":"page"},{"location":"transforms/","page":"Indexing & modifying sequences","title":"Indexing & modifying sequences","text":"note: Note\nSome types can be indexed using integers but not using ranges.","category":"page"},{"location":"transforms/","page":"Indexing & modifying sequences","title":"Indexing & modifying sequences","text":"For LongSequence types, indexing a sequence by range creates a copy of the original sequence, similar to Array in Julia's Base library. 
If you find yourself slowed down by the allocation of these subsequences, consider using a sequence view instead.","category":"page"},{"location":"transforms/#Modifying-sequences","page":"Indexing & modifying sequences","title":"Modifying sequences","text":"","category":"section"},{"location":"transforms/","page":"Indexing & modifying sequences","title":"Indexing & modifying sequences","text":"In addition to setindex, many other modifying operations are possible for biological sequences such as push!, pop!, and insert!, which should be familiar to anyone used to editing arrays.","category":"page"},{"location":"transforms/","page":"Indexing & modifying sequences","title":"Indexing & modifying sequences","text":"push!(::BioSequences.BioSequence, ::Any)\npop!(::BioSequences.BioSequence)\npushfirst!(::BioSequences.BioSequence, ::Any)\npopfirst!(::BioSequences.BioSequence)\ninsert!(::BioSequences.BioSequence, ::Integer, ::Any)\ndeleteat!(::BioSequences.BioSequence, ::Integer)\nappend!(::BioSequences.BioSequence, ::BioSequences.BioSequence)\nresize!(::BioSequences.LongSequence, ::Integer)\nempty!(::BioSequences.BioSequence)","category":"page"},{"location":"transforms/#Base.push!-Tuple{BioSequence, Any}","page":"Indexing & modifying sequences","title":"Base.push!","text":"push!(seq::BioSequence, x)\n\nAppend a biological symbol x to a biological sequence seq.\n\n\n\n\n\n","category":"method"},{"location":"transforms/#Base.pop!-Tuple{BioSequence}","page":"Indexing & modifying sequences","title":"Base.pop!","text":"pop!(seq::BioSequence)\n\nRemove the symbol from the end of a biological sequence seq and return it. 
Returns a variable of eltype(seq).\n\n\n\n\n\n","category":"method"},{"location":"transforms/#Base.pushfirst!-Tuple{BioSequence, Any}","page":"Indexing & modifying sequences","title":"Base.pushfirst!","text":"pushfirst!(seq, x)\n\nInsert a biological symbol x at the beginning of a biological sequence seq.\n\n\n\n\n\n","category":"method"},{"location":"transforms/#Base.popfirst!-Tuple{BioSequence}","page":"Indexing & modifying sequences","title":"Base.popfirst!","text":"popfirst!(seq)\n\nRemove the symbol from the beginning of a biological sequence seq and return it. Returns a variable of eltype(seq).\n\n\n\n\n\n","category":"method"},{"location":"transforms/#Base.insert!-Tuple{BioSequence, Integer, Any}","page":"Indexing & modifying sequences","title":"Base.insert!","text":"insert!(seq::BioSequence, i, x)\n\nInsert a biological symbol x into a biological sequence seq, at the given index i.\n\n\n\n\n\n","category":"method"},{"location":"transforms/#Base.deleteat!-Tuple{BioSequence, Integer}","page":"Indexing & modifying sequences","title":"Base.deleteat!","text":"deleteat!(seq::BioSequence, i::Integer)\n\nDelete a biological symbol at a single position i in a biological sequence seq.\n\nModifies the input sequence.\n\n\n\n\n\n","category":"method"},{"location":"transforms/#Base.append!-Tuple{BioSequence, BioSequence}","page":"Indexing & modifying sequences","title":"Base.append!","text":"append!(seq, other)\n\nAdd a biological sequence other onto the end of biological sequence seq. Modifies and returns seq.\n\n\n\n\n\n","category":"method"},{"location":"transforms/#Base.resize!-Tuple{LongSequence, Integer}","page":"Indexing & modifying sequences","title":"Base.resize!","text":"resize!(seq, size, [force::Bool])\n\nResize a biological sequence seq, to a given size. Does not resize the underlying data array unless the new size does not fit. 
If force, always resize underlying data array.\n\n\n\n\n\n","category":"method"},{"location":"transforms/#Base.empty!-Tuple{BioSequence}","page":"Indexing & modifying sequences","title":"Base.empty!","text":"empty!(seq::BioSequence)\n\nCompletely empty a biological sequence seq of nucleotides.\n\n\n\n\n\n","category":"method"},{"location":"transforms/","page":"Indexing & modifying sequences","title":"Indexing & modifying sequences","text":"Here are some examples:","category":"page"},{"location":"transforms/","page":"Indexing & modifying sequences","title":"Indexing & modifying sequences","text":"julia> seq = dna\"ACG\"\n3nt DNA Sequence:\nACG\n\njulia> push!(seq, DNA_T)\n4nt DNA Sequence:\nACGT\n\njulia> append!(seq, dna\"AT\")\n6nt DNA Sequence:\nACGTAT\n\njulia> deleteat!(seq, 2)\n5nt DNA Sequence:\nAGTAT\n\njulia> deleteat!(seq, 2:3)\n3nt DNA Sequence:\nAAT\n","category":"page"},{"location":"transforms/#Additional-transformations","page":"Indexing & modifying sequences","title":"Additional transformations","text":"","category":"section"},{"location":"transforms/","page":"Indexing & modifying sequences","title":"Indexing & modifying sequences","text":"In addition to these basic modifying functions, other sequence transformations that are common in bioinformatics are also provided.","category":"page"},{"location":"transforms/","page":"Indexing & modifying sequences","title":"Indexing & modifying sequences","text":"reverse!(::BioSequences.LongSequence)\nreverse(::BioSequences.LongSequence{<:NucleicAcidAlphabet})\ncomplement!\ncomplement\nreverse_complement!\nreverse_complement\nungap!\nungap\ncanonical!\ncanonical","category":"page"},{"location":"transforms/#Base.reverse!-Tuple{LongSequence}","page":"Indexing & modifying sequences","title":"Base.reverse!","text":"reverse!(seq::LongSequence)\n\nReverse a biological sequence seq in 
place.\n\n\n\n\n\n","category":"method"},{"location":"transforms/#Base.reverse-Tuple{LongSequence{<:NucleicAcidAlphabet}}","page":"Indexing & modifying sequences","title":"Base.reverse","text":"reverse(seq::BioSequence)\n\nCreate reversed copy of a biological sequence.\n\n\n\n\n\nreverse(seq::LongSequence)\n\nCreate reversed copy of a biological sequence.\n\n\n\n\n\n","category":"method"},{"location":"transforms/#BioSequences.complement!","page":"Indexing & modifying sequences","title":"BioSequences.complement!","text":"complement!(seq)\n\nMake a complement sequence of seq in place.\n\n\n\n\n\n","category":"function"},{"location":"transforms/#BioSymbols.complement","page":"Indexing & modifying sequences","title":"BioSymbols.complement","text":"complement(nt::NucleicAcid)\n\nReturn the complementary nucleotide of nt.\n\nThis function returns the union of all possible complementary nucleotides.\n\nExamples\n\njulia> complement(DNA_A)\nDNA_T\n\njulia> complement(DNA_N)\nDNA_N\n\njulia> complement(RNA_U)\nRNA_A\n\n\n\n\n\n\ncomplement(seq)\n\nMake a complement sequence of seq.\n\n\n\n\n\n","category":"function"},{"location":"transforms/#BioSequences.reverse_complement!","page":"Indexing & modifying sequences","title":"BioSequences.reverse_complement!","text":"reverse_complement!(seq)\n\nMake a reversed complement sequence of seq in place.\n\n\n\n\n\n","category":"function"},{"location":"transforms/#BioSequences.reverse_complement","page":"Indexing & modifying sequences","title":"BioSequences.reverse_complement","text":"reverse_complement(seq)\n\nMake a reversed complement sequence of seq.\n\n\n\n\n\n","category":"function"},{"location":"transforms/#BioSequences.ungap!","page":"Indexing & modifying sequences","title":"BioSequences.ungap!","text":"Remove gap characters from an input sequence.\n\n\n\n\n\n","category":"function"},{"location":"transforms/#BioSequences.ungap","page":"Indexing & modifying sequences","title":"BioSequences.ungap","text":"Create a copy of a 
sequence with gap characters removed.\n\n\n\n\n\n","category":"function"},{"location":"transforms/#BioSequences.canonical!","page":"Indexing & modifying sequences","title":"BioSequences.canonical!","text":"canonical!(seq::NucleotideSeq)\n\nTransforms the seq into its canonical form, if it is not already canonical. Modifies the input sequence in place.\n\nFor any sequence, there is a reverse complement, which is the same sequence, but on the complementary strand of DNA:\n\n------->\nATCGATCG\nCGATCGAT\n<-------\n\nnote: Note\nUsing the reverse_complement of a DNA sequence will give this reverse complement.\n\nOf the two sequences, the canonical one is the lesser of the two, i.e. canonical_seq < other_seq.\n\nUsing this function on a seq will ensure it is the canonical version.\n\n\n\n\n\n","category":"function"},{"location":"transforms/#BioSequences.canonical","page":"Indexing & modifying sequences","title":"BioSequences.canonical","text":"canonical(seq::NucleotideSeq)\n\nCreate the canonical sequence of seq.\n\n\n\n\n\n","category":"function"},{"location":"transforms/","page":"Indexing & modifying sequences","title":"Indexing & modifying sequences","text":"Some examples:","category":"page"},{"location":"transforms/","page":"Indexing & modifying sequences","title":"Indexing & modifying sequences","text":"julia> seq = dna\"ACGTAT\"\n6nt DNA Sequence:\nACGTAT\n\njulia> reverse!(seq)\n6nt DNA Sequence:\nTATGCA\n\njulia> complement!(seq)\n6nt DNA Sequence:\nATACGT\n\njulia> reverse_complement!(seq)\n6nt DNA Sequence:\nACGTAT\n","category":"page"},{"location":"transforms/","page":"Indexing & modifying sequences","title":"Indexing & modifying sequences","text":"Many of these methods also have a version which makes a copy of the input sequence, so you get a modified copy, and don't alter the original sequence. Such methods are named the same, but without the exclamation mark. E.g. reverse instead of reverse!, and ungap instead of ungap!. 
","category":"page"},{"location":"transforms/#Translation","page":"Indexing & modifying sequences","title":"Translation","text":"","category":"section"},{"location":"transforms/","page":"Indexing & modifying sequences","title":"Indexing & modifying sequences","text":"Translation is a slightly more complex transformation for RNA Sequences and so we describe it here in more detail.","category":"page"},{"location":"transforms/","page":"Indexing & modifying sequences","title":"Indexing & modifying sequences","text":"The translate function translates a sequence of codons in an RNA sequence to an amino acid sequence based on a genetic code. The BioSequences package provides all NCBI defined genetic codes and they are registered in ncbi_trans_table.","category":"page"},{"location":"transforms/","page":"Indexing & modifying sequences","title":"Indexing & modifying sequences","text":"translate\nncbi_trans_table","category":"page"},{"location":"transforms/#BioSequences.translate","page":"Indexing & modifying sequences","title":"BioSequences.translate","text":"translate(seq, code=standard_genetic_code, allow_ambiguous_codons=true, alternative_start=false)\n\nTranslate a LongRNA or a LongDNA to a LongAA.\n\nTranslation uses genetic code code to map codons to amino acids. See ncbi_trans_table for available genetic codes. If codons in the given sequence cannot determine a unique amino acid, they will be translated to AA_X if allow_ambiguous_codons is true and otherwise result in an error. For organisms that utilize alternative start codons, one can set alternative_start=true, in which case the first codon will always be converted to a methionine.\n\n\n\n\n\n","category":"function"},{"location":"transforms/#BioSequences.ncbi_trans_table","page":"Indexing & modifying sequences","title":"BioSequences.ncbi_trans_table","text":"Genetic code list of NCBI.\n\nThe standard genetic code is ncbi_trans_table[1] and others can be shown by show(ncbi_trans_table). 
For more details, consult the next link: http://www.ncbi.nlm.nih.gov/Taxonomy/taxonomyhome.html/index.cgi?chapter=cgencodes.\n\n\n\n\n\n","category":"constant"},{"location":"transforms/","page":"Indexing & modifying sequences","title":"Indexing & modifying sequences","text":"julia> ncbi_trans_table\nTranslation Tables:\n 1. The Standard Code (standard_genetic_code)\n 2. The Vertebrate Mitochondrial Code (vertebrate_mitochondrial_genetic_code)\n 3. The Yeast Mitochondrial Code (yeast_mitochondrial_genetic_code)\n 4. The Mold, Protozoan, and Coelenterate Mitochondrial Code and the Mycoplasma/Spiroplasma Code (mold_mitochondrial_genetic_code)\n 5. The Invertebrate Mitochondrial Code (invertebrate_mitochondrial_genetic_code)\n 6. The Ciliate, Dasycladacean and Hexamita Nuclear Code (ciliate_nuclear_genetic_code)\n 9. The Echinoderm and Flatworm Mitochondrial Code (echinoderm_mitochondrial_genetic_code)\n 10. The Euplotid Nuclear Code (euplotid_nuclear_genetic_code)\n 11. The Bacterial, Archaeal and Plant Plastid Code (bacterial_plastid_genetic_code)\n 12. The Alternative Yeast Nuclear Code (alternative_yeast_nuclear_genetic_code)\n 13. The Ascidian Mitochondrial Code (ascidian_mitochondrial_genetic_code)\n 14. The Alternative Flatworm Mitochondrial Code (alternative_flatworm_mitochondrial_genetic_code)\n 16. Chlorophycean Mitochondrial Code (chlorophycean_mitochondrial_genetic_code)\n 21. Trematode Mitochondrial Code (trematode_mitochondrial_genetic_code)\n 22. Scenedesmus obliquus Mitochondrial Code (scenedesmus_obliquus_mitochondrial_genetic_code)\n 23. Thraustochytrium Mitochondrial Code (thraustochytrium_mitochondrial_genetic_code)\n 24. Pterobranchia Mitochondrial Code (pterobrachia_mitochondrial_genetic_code)\n 25. 
Candidate Division SR1 and Gracilibacteria Code (candidate_division_sr1_genetic_code)\n","category":"page"},{"location":"transforms/","page":"Indexing & modifying sequences","title":"Indexing & modifying sequences","text":"https://www.ncbi.nlm.nih.gov/Taxonomy/taxonomyhome.html/index.cgi?chapter=cgencodes","category":"page"},{"location":"construction/","page":"Constructing sequences","title":"Constructing sequences","text":"CurrentModule = BioSequences\nDocTestSetup = quote\n using BioSequences\nend","category":"page"},{"location":"construction/#Construction-and-conversion","page":"Constructing sequences","title":"Construction & conversion","text":"","category":"section"},{"location":"construction/","page":"Constructing sequences","title":"Constructing sequences","text":"Here we will showcase the various ways you can construct the various sequence types in BioSequences.","category":"page"},{"location":"construction/#Constructing-sequences","page":"Constructing sequences","title":"Constructing sequences","text":"","category":"section"},{"location":"construction/#From-strings","page":"Constructing sequences","title":"From strings","text":"","category":"section"},{"location":"construction/","page":"Constructing sequences","title":"Constructing sequences","text":"Sequences can be constructed from strings using their constructors:","category":"page"},{"location":"construction/","page":"Constructing sequences","title":"Constructing sequences","text":"julia> LongDNA{4}(\"TTANC\")\n5nt DNA Sequence:\nTTANC\n\njulia> LongSequence{DNAAlphabet{2}}(\"TTAGC\")\n5nt DNA Sequence:\nTTAGC\n\njulia> LongRNA{4}(\"UUANC\")\n5nt RNA Sequence:\nUUANC\n\njulia> LongSequence{RNAAlphabet{2}}(\"UUAGC\")\n5nt RNA Sequence:\nUUAGC\n","category":"page"},{"location":"construction/","page":"Constructing sequences","title":"Constructing sequences","text":"Type aliases can also be used for brevity.","category":"page"},{"location":"construction/","page":"Constructing 
sequences","title":"Constructing sequences","text":"julia> LongDNA{4}(\"TTANC\")\n5nt DNA Sequence:\nTTANC\n\njulia> LongDNA{2}(\"TTAGC\")\n5nt DNA Sequence:\nTTAGC\n\njulia> LongRNA{4}(\"UUANC\")\n5nt RNA Sequence:\nUUANC\n\njulia> LongRNA{2}(\"UUAGC\")\n5nt RNA Sequence:\nUUAGC","category":"page"},{"location":"construction/#Constructing-sequences-from-arrays-of-BioSymbols","page":"Constructing sequences","title":"Constructing sequences from arrays of BioSymbols","text":"","category":"section"},{"location":"construction/","page":"Constructing sequences","title":"Constructing sequences","text":"Sequences can be constructed using vectors or arrays of a BioSymbol type:","category":"page"},{"location":"construction/","page":"Constructing sequences","title":"Constructing sequences","text":"julia> LongDNA{4}([DNA_T, DNA_T, DNA_A, DNA_N, DNA_C])\n5nt DNA Sequence:\nTTANC\n\njulia> LongSequence{DNAAlphabet{2}}([DNA_T, DNA_T, DNA_A, DNA_G, DNA_C])\n5nt DNA Sequence:\nTTAGC\n","category":"page"},{"location":"construction/#Constructing-sequences-from-other-sequences","page":"Constructing sequences","title":"Constructing sequences from other sequences","text":"","category":"section"},{"location":"construction/","page":"Constructing sequences","title":"Constructing sequences","text":"You can create sequences, by concatenating other sequences together:","category":"page"},{"location":"construction/","page":"Constructing sequences","title":"Constructing sequences","text":"julia> LongDNA{2}(\"ACGT\") * LongDNA{2}(\"TGCA\")\n8nt DNA Sequence:\nACGTTGCA\n\njulia> repeat(LongDNA{4}(\"TA\"), 10)\n20nt DNA Sequence:\nTATATATATATATATATATA\n\njulia> LongDNA{4}(\"TA\") ^ 10\n20nt DNA Sequence:\nTATATATATATATATATATA\n","category":"page"},{"location":"construction/","page":"Constructing sequences","title":"Constructing sequences","text":"Sequence views (LongSubSeqs) are special, in that they do not own their own data, and must be constructed from a LongSequence or another 
LongSubSeq:","category":"page"},{"location":"construction/","page":"Constructing sequences","title":"Constructing sequences","text":"julia> seq = LongDNA{4}(\"TACGGACATTA\")\n11nt DNA Sequence:\nTACGGACATTA\n\njulia> seqview = LongSubSeq(seq, 3:7)\n5nt DNA Sequence:\nCGGAC\n\njulia> seqview2 = @view seq[1:3]\n3nt DNA Sequence:\nTAC\n\njulia> typeof(seqview) == typeof(seqview2) && typeof(seqview) <: LongSubSeq\ntrue\n","category":"page"},{"location":"construction/#Conversion-of-sequence-types","page":"Constructing sequences","title":"Conversion of sequence types","text":"","category":"section"},{"location":"construction/","page":"Constructing sequences","title":"Constructing sequences","text":"You can convert between sequence types, if the sequences are compatible - that is, if the source sequence does not contain symbols that are un-encodable by the destination type.","category":"page"},{"location":"construction/","page":"Constructing sequences","title":"Constructing sequences","text":"julia> dna = dna\"TTACGTAGACCG\"\n12nt DNA Sequence:\nTTACGTAGACCG\n\njulia> dna2 = convert(LongDNA{2}, dna)\n12nt DNA Sequence:\nTTACGTAGACCG","category":"page"},{"location":"construction/","page":"Constructing sequences","title":"Constructing sequences","text":"DNA/RNA are special in that they can be converted to each other, despite containing distinct symbols. 
When doing so, DNA_T is converted to RNA_U and vice versa.","category":"page"},{"location":"construction/","page":"Constructing sequences","title":"Constructing sequences","text":"julia> convert(LongRNA{2}, dna\"TAGCTAGG\")\n8nt RNA Sequence:\nUAGCUAGG","category":"page"},{"location":"construction/#String-literals","page":"Constructing sequences","title":"String literals","text":"","category":"section"},{"location":"construction/","page":"Constructing sequences","title":"Constructing sequences","text":"BioSequences provides several string literal macros for creating sequences.","category":"page"},{"location":"construction/","page":"Constructing sequences","title":"Constructing sequences","text":"note: Note\nWhen you use literals you may mix the case of characters.","category":"page"},{"location":"construction/#Long-sequence-literals","page":"Constructing sequences","title":"Long sequence literals","text":"","category":"section"},{"location":"construction/","page":"Constructing sequences","title":"Constructing sequences","text":"julia> dna\"TACGTANNATC\"\n11nt DNA Sequence:\nTACGTANNATC\n\njulia> rna\"AUUUGNCCANU\"\n11nt RNA Sequence:\nAUUUGNCCANU\n\njulia> aa\"ARNDCQEGHILKMFPSTWYVX\"\n21aa Amino Acid Sequence:\nARNDCQEGHILKMFPSTWYVX","category":"page"},{"location":"construction/","page":"Constructing sequences","title":"Constructing sequences","text":"However, it should be noted that by default these sequence literals allocate the LongSequence object before the code containing the sequence literal is run. This means there may be occasions where your program does not behave as you first expect. 
For example consider the following code:","category":"page"},{"location":"construction/","page":"Constructing sequences","title":"Constructing sequences","text":"julia> function foo()\n s = dna\"CTT\"\n push!(s, DNA_A)\n end\nfoo (generic function with 1 method)\n","category":"page"},{"location":"construction/","page":"Constructing sequences","title":"Constructing sequences","text":"DocTestSetup = quote\n using BioSequences\n function foo()\n s = dna\"CTT\"d\n push!(s, DNA_A)\n end\nend","category":"page"},{"location":"construction/","page":"Constructing sequences","title":"Constructing sequences","text":"You might expect that every time you call foo, that a DNA sequence CTTA would be returned. You might expect that this is because every time foo is called, a new DNA sequence variable CTT is created, and the A nucleotide is pushed to it, and the result, CTTA is returned. In other words you might expect the following output:","category":"page"},{"location":"construction/","page":"Constructing sequences","title":"Constructing sequences","text":"julia> foo()\n4nt DNA Sequence:\nCTTA\n\njulia> foo()\n4nt DNA Sequence:\nCTTA\n\njulia> foo()\n4nt DNA Sequence:\nCTTA\n","category":"page"},{"location":"construction/","page":"Constructing sequences","title":"Constructing sequences","text":"However, this is not what happens, instead the following happens:","category":"page"},{"location":"construction/","page":"Constructing sequences","title":"Constructing sequences","text":"DocTestSetup = quote\n using BioSequences\n function foo()\n s = dna\"CTT\"s\n push!(s, DNA_A)\n end\nend","category":"page"},{"location":"construction/","page":"Constructing sequences","title":"Constructing sequences","text":"julia> foo()\n4nt DNA Sequence:\nCTTA\n\njulia> foo()\n5nt DNA Sequence:\nCTTAA\n\njulia> foo()\n6nt DNA Sequence:\nCTTAAA\n","category":"page"},{"location":"construction/","page":"Constructing sequences","title":"Constructing sequences","text":"The reason for this is because the 
sequence literal is allocated only once before the first time the function foo is called and run. Therefore, s in foo is always a reference to that one sequence that was allocated. So one sequence is created before foo is called, and then it is pushed to every time foo is called. Thus, that one allocated sequence grows with every call of foo.","category":"page"},{"location":"construction/","page":"Constructing sequences","title":"Constructing sequences","text":"If you wanted foo to create a new sequence each time it is called, then you can add a flag to the end of the sequence literal to dictate behaviour: A flag of 's' means 'static': the sequence will be allocated before code is run, as is the default behaviour described above. However providing 'd' flag changes the behaviour: 'd' means 'dynamic': the sequence will be allocated whilst the code is running, and not before. So to change foo so as it creates a new sequence each time it is called, simply add the 'd' flag to the sequence literal:","category":"page"},{"location":"construction/","page":"Constructing sequences","title":"Constructing sequences","text":"DocTestSetup = quote\n using BioSequences\nend","category":"page"},{"location":"construction/","page":"Constructing sequences","title":"Constructing sequences","text":"julia> function foo()\n s = dna\"CTT\"d # 'd' flag appended to the string literal.\n push!(s, DNA_A)\n end\nfoo (generic function with 1 method)\n","category":"page"},{"location":"construction/","page":"Constructing sequences","title":"Constructing sequences","text":"Now every time foo is called, a new sequence CTT is created, and an A nucleotide is pushed to it:","category":"page"},{"location":"construction/","page":"Constructing sequences","title":"Constructing sequences","text":"DocTestSetup = quote\n using BioSequences\n function foo()\n s = dna\"CTT\"d\n push!(s, DNA_A)\n end\nend","category":"page"},{"location":"construction/","page":"Constructing sequences","title":"Constructing 
sequences","text":"julia> foo()\n4nt DNA Sequence:\nCTTA\n\njulia> foo()\n4nt DNA Sequence:\nCTTA\n\njulia> foo()\n4nt DNA Sequence:\nCTTA\n","category":"page"},{"location":"construction/","page":"Constructing sequences","title":"Constructing sequences","text":"DocTestSetup = quote\n using BioSequences\nend","category":"page"},{"location":"construction/","page":"Constructing sequences","title":"Constructing sequences","text":"So the take-home message of sequence literals is this:","category":"page"},{"location":"construction/","page":"Constructing sequences","title":"Constructing sequences","text":"Be careful when you are using sequence literals inside of functions, and inside the bodies of things like for loops. And if you use them and are unsure, use the 's' and 'd' flags to ensure the behaviour you get is the behaviour you intend.","category":"page"},{"location":"construction/","page":"Constructing sequences","title":"Constructing sequences","text":"@dna_str\n@rna_str\n@aa_str","category":"page"},{"location":"construction/#BioSequences.@dna_str","page":"Constructing sequences","title":"BioSequences.@dna_str","text":"@dna_str(seq, flag=\"s\") -> LongDNA{4}\n\nCreate a LongDNA{4} sequence at parse time from string seq. If flag is \"s\" ('static', the default), the sequence is created at parse time, and inserted directly into the returned expression. A static string ought not to be mutated. Alternatively, if flag is \"d\" (dynamic), a new sequence is parsed and created whenever the code where the macro is placed is run.\n\nSee also: @aa_str, @rna_str\n\nExamples\n\nIn the example below, the static sequence is created once, at parse time, NOT when the function f is run. 
This means it is the same sequence that is pushed to repeatedly.\n\njulia> f() = dna\"TAG\";\n\njulia> string(push!(f(), DNA_A)) # NB: Mutates static string!\n\"TAGA\"\n\njulia> string(push!(f(), DNA_A))\n\"TAGAA\"\n\njulia> f() = dna\"TAG\"d; # dynamically make seq\n\njulia> string(push!(f(), DNA_A))\n\"TAGA\"\n\njulia> string(push!(f(), DNA_A))\n\"TAGA\"\n\n\n\n\n\n","category":"macro"},{"location":"construction/#BioSequences.@rna_str","page":"Constructing sequences","title":"BioSequences.@rna_str","text":"The LongRNA{4} equivalent to @dna_str\n\nSee also: @dna_str, @aa_str\n\nExamples\n\njulia> rna\"UCGUGAUGC\"\n9nt RNA Sequence:\nUCGUGAUGC\n\n\n\n\n\n","category":"macro"},{"location":"construction/#BioSequences.@aa_str","page":"Constructing sequences","title":"BioSequences.@aa_str","text":"The AminoAcidAlphabet equivalent to @dna_str\n\nSee also: @dna_str, @rna_str\n\nExamples\n\njulia> aa\"PKLEQC\"\n6aa Amino Acid Sequence:\nPKLEQC\n\n\n\n\n\n","category":"macro"},{"location":"construction/#Comparison-to-other-sequence-types","page":"Constructing sequences","title":"Comparison to other sequence types","text":"","category":"section"},{"location":"construction/","page":"Constructing sequences","title":"Constructing sequences","text":"Following Base standards, BioSequences do not compare equal to other containers even if they have the same elements. To e.g. 
compare a BioSequence with a vector of DNA, compare the elements themselves:","category":"page"},{"location":"construction/","page":"Constructing sequences","title":"Constructing sequences","text":"julia> seq = dna\"GAGCTGA\"; vec = collect(seq);\n\njulia> seq == vec, isequal(seq, vec)\n(false, false)\n\njulia> length(seq) == length(vec) && all(i == j for (i, j) in zip(seq, vec))\ntrue ","category":"page"},{"location":"sequence_search/","page":"Pattern matching and searching","title":"Pattern matching and searching","text":"CurrentModule = BioSequences\nDocTestSetup = quote\n using BioSequences\nend","category":"page"},{"location":"sequence_search/#Searching-for-sequence-motifs","page":"Pattern matching and searching","title":"Searching for sequence motifs","text":"","category":"section"},{"location":"sequence_search/","page":"Pattern matching and searching","title":"Pattern matching and searching","text":"There are many ways to search for particular motifs in biological sequences:","category":"page"},{"location":"sequence_search/","page":"Pattern matching and searching","title":"Pattern matching and searching","text":"Exact searches, where you are looking for exact matches of a particular character or substring.\nApproximate searches, where you are looking for sequences that are sufficiently similar to a given sequence or family of sequences.\nSearches where you are looking for sequences that conform to some sort of pattern.","category":"page"},{"location":"sequence_search/","page":"Pattern matching and searching","title":"Pattern matching and searching","text":"Like other Julia sequences such as Vector, you can search a BioSequence with the findfirst(predicate, collection) method pattern.","category":"page"},{"location":"sequence_search/","page":"Pattern matching and searching","title":"Pattern matching and searching","text":"All these kinds of searches are provided in BioSequences.jl, and they all conform to the findnext, findprev, and occursin patterns 
established in Base for String and collections like Vector.","category":"page"},{"location":"sequence_search/","page":"Pattern matching and searching","title":"Pattern matching and searching","text":"The exception is searching using the specialised regex provided in this package, which, as you shall see, conforms to the match pattern established in Base for pcre and Strings.","category":"page"},{"location":"sequence_search/#Symbol-search","page":"Pattern matching and searching","title":"Symbol search","text":"","category":"section"},{"location":"sequence_search/","page":"Pattern matching and searching","title":"Pattern matching and searching","text":"julia> seq = dna\"ACAGCGTAGCT\";\n\njulia> findfirst(DNA_A, seq)\n1\n\njulia> findlast(DNA_A, seq)\n8\n\njulia> findnext(DNA_A, seq, 2)\n3\n\njulia> findprev(DNA_A, seq, 7)\n3\n\njulia> findall(DNA_A, seq)\n3-element Vector{Int64}:\n 1\n 3\n 8","category":"page"},{"location":"sequence_search/#Exact-search","page":"Pattern matching and searching","title":"Exact search","text":"","category":"section"},{"location":"sequence_search/","page":"Pattern matching and searching","title":"Pattern matching and searching","text":"ExactSearchQuery","category":"page"},{"location":"sequence_search/#BioSequences.ExactSearchQuery","page":"Pattern matching and searching","title":"BioSequences.ExactSearchQuery","text":"ExactSearchQuery{F<:Function,S<:BioSequence}\n\nQuery type for exact sequence search.\n\nAn exact search is one where you are looking in some given sequence for exact instances of some given substring.\n\nThese queries are used as a predicate for the Base.findnext, Base.findprev, Base.occursin, Base.findfirst, and Base.findlast functions.\n\nExamples\n\njulia> seq = dna\"ACAGCGTAGCT\";\n\njulia> query = ExactSearchQuery(dna\"AGC\");\n\njulia> findfirst(query, seq)\n3:5\n\njulia> findlast(query, seq)\n8:10\n\njulia> findnext(query, seq, 6)\n8:10\n\njulia> findprev(query, seq, 7)\n3:5\n\njulia> findall(query, 
seq)\n2-element Vector{UnitRange{Int64}}:\n 3:5\n 8:10\n\njulia> occursin(query, seq)\ntrue\n\n\nYou can pass a comparator function such as isequal or iscompatible to its constructor to modify the search behaviour.\n\nThe default is isequal, however, in biology, sometimes we want a more flexible comparison to find subsequences of compatible symbols.\n\njulia> query = ExactSearchQuery(dna\"CGT\", iscompatible);\n\njulia> findfirst(query, dna\"ACNT\") # 'N' matches 'G'\n2:4\n\njulia> findfirst(query, dna\"ACGT\") # 'G' matches 'N'\n2:4\n\njulia> occursin(ExactSearchQuery(dna\"CNT\", iscompatible), dna\"ACNT\")\ntrue\n\n\n\n\n\n\n","category":"type"},{"location":"sequence_search/#Allowing-mismatches","page":"Pattern matching and searching","title":"Allowing mismatches","text":"","category":"section"},{"location":"sequence_search/","page":"Pattern matching and searching","title":"Pattern matching and searching","text":"ApproximateSearchQuery","category":"page"},{"location":"sequence_search/#BioSequences.ApproximateSearchQuery","page":"Pattern matching and searching","title":"BioSequences.ApproximateSearchQuery","text":"ApproximateSearchQuery{F<:Function,S<:BioSequence}\n\nQuery type for approximate sequence search.\n\nThese queries are used as a predicate for the Base.findnext, Base.findprev, Base.occursin, Base.findfirst, and Base.findlast functions.\n\nUsing these functions with these queries allows you to search a given sequence for a sub-sequence, whilst allowing a specific number of errors.\n\nIn other words they find a subsequence of the target sequence within a specific Levenshtein distance of the query sequence.\n\nExamples\n\njulia> seq = dna\"ACAGCGTAGCT\";\n\njulia> query = ApproximateSearchQuery(dna\"AGGG\");\n\njulia> findfirst(query, 0, seq) == nothing # nothing matches with no errors\ntrue\n\njulia> findfirst(query, 1, seq) # seq[3:6] matches with one error\n3:6\n\njulia> findfirst(query, 2, seq) # seq[1:4] matches with two errors\n1:4\n\n\nYou can pass 
a comparator function such as isequal or iscompatible to its constructor to modify the search behaviour.\n\nThe default is isequal, however, in biology, sometimes we want a more flexible comparison to find subsequences of compatible symbols.\n\njulia> query = ApproximateSearchQuery(dna\"AGGG\", iscompatible);\n\njulia> occursin(query, 1, dna\"AAGNGG\") # 1 mismatch permitted (A vs G) & matched N\ntrue\n\njulia> findnext(query, 1, dna\"AAGNGG\", 1) # 1 mismatch permitted (A vs G) & matched N\n1:4\n\n\nnote: Note\nThis method of searching for motifs was implemented with smaller query motifs in mind. If you are looking to search for imperfect matches of longer sequences in this manner, you are likely better off using some kind of local-alignment algorithm or one of the BLAST variants.\n\n\n\n\n\n","category":"type"},{"location":"sequence_search/#Searching-according-to-a-pattern","page":"Pattern matching and searching","title":"Searching according to a pattern","text":"","category":"section"},{"location":"sequence_search/#Regular-expression-search","page":"Pattern matching and searching","title":"Regular expression search","text":"","category":"section"},{"location":"sequence_search/","page":"Pattern matching and searching","title":"Pattern matching and searching","text":"Query patterns can be described in regular expressions. The syntax supports a subset of Perl and PROSITE's notation.","category":"page"},{"location":"sequence_search/","page":"Pattern matching and searching","title":"Pattern matching and searching","text":"Biological regexes can be constructed using the BioRegex constructor, for example by doing BioRegex{AminoAcid}(\"MV+\"). 
For bioregex literals, it is instead recommended using the @biore_str macro:","category":"page"},{"location":"sequence_search/","page":"Pattern matching and searching","title":"Pattern matching and searching","text":"The Perl-like syntax starts with biore (BIOlogical REgular expression) and ends with a symbol option: \"dna\", \"rna\" or \"aa\". For example, biore\"A+\"dna is a regular expression for DNA sequences and biore\"A+\"aa is for amino acid sequences. The symbol options can be abbreviated to its first character: \"d\", \"r\" or \"a\", respectively.","category":"page"},{"location":"sequence_search/","page":"Pattern matching and searching","title":"Pattern matching and searching","text":"Here are examples of using the regular expression for BioSequences:","category":"page"},{"location":"sequence_search/","page":"Pattern matching and searching","title":"Pattern matching and searching","text":"julia> match(biore\"A+C*\"dna, dna\"AAAACC\")\nRegexMatch(\"AAAACC\")\n\njulia> match(biore\"A+C*\"d, dna\"AAAACC\")\nRegexMatch(\"AAAACC\")\n\njulia> occursin(biore\"A+C*\"dna, dna\"AAC\")\ntrue\n\njulia> occursin(biore\"A+C*\"dna, dna\"C\")\nfalse\n","category":"page"},{"location":"sequence_search/","page":"Pattern matching and searching","title":"Pattern matching and searching","text":"match will return a RegexMatch if a match is found, otherwise it will return nothing if no match is found.","category":"page"},{"location":"sequence_search/","page":"Pattern matching and searching","title":"Pattern matching and searching","text":"The table below summarizes available syntax elements.","category":"page"},{"location":"sequence_search/","page":"Pattern matching and searching","title":"Pattern matching and searching","text":"Syntax Description Example\n| alternation \"A|T\" matches \"A\" and \"T\"\n* zero or more times repeat \"TA*\" matches \"T\", \"TA\" and \"TAA\"\n+ one or more times repeat \"TA+\" matches \"TA\" and \"TAA\"\n? 
zero or one time \"TA?\" matches \"T\" and \"TA\"\n{n,} n or more times repeat \"A{3,}\" matches \"AAA\" and \"AAAA\"\n{n,m} n-m times repeat \"A{3,5}\" matches \"AAA\", \"AAAA\" and \"AAAAA\"\n^ the start of the sequence \"^TAN*\" matches \"TATGT\"\n$ the end of the sequence \"N*TA$\" matches \"GCTA\"\n(...) pattern grouping \"(TA)+\" matches \"TA\" and \"TATA\"\n[...] one of symbols \"[ACG]+\" matches \"AGGC\"","category":"page"},{"location":"sequence_search/","page":"Pattern matching and searching","title":"Pattern matching and searching","text":"eachmatch and findfirst are also defined, just like usual regex and strings found in Base.","category":"page"},{"location":"sequence_search/","page":"Pattern matching and searching","title":"Pattern matching and searching","text":"julia> collect(matched(x) for x in eachmatch(biore\"TATA*?\"d, dna\"TATTATAATTA\")) # overlap\n4-element Vector{LongSequence{DNAAlphabet{4}}}:\n TAT \n TAT\n TATA\n TATAA\n\njulia> collect(matched(x) for x in eachmatch(biore\"TATA*\"d, dna\"TATTATAATTA\", false)) # no overlap\n2-element Vector{LongSequence{DNAAlphabet{4}}}:\n TAT \n TATAA\n\njulia> findfirst(biore\"TATA*\"d, dna\"TATTATAATTA\")\n1:3\n\njulia> findfirst(biore\"TATA*\"d, dna\"TATTATAATTA\", 2)\n4:8\n","category":"page"},{"location":"sequence_search/","page":"Pattern matching and searching","title":"Pattern matching and searching","text":"Noteworthy differences from strings are:","category":"page"},{"location":"sequence_search/","page":"Pattern matching and searching","title":"Pattern matching and searching","text":"Ambiguous characters match any compatible characters (e.g. biore\"N\"d is equivalent to biore\"[ACGT]\"d).\nWhitespaces are ignored (e.g. biore\"A C G\"d is equivalent to biore\"ACG\"d).","category":"page"},{"location":"sequence_search/","page":"Pattern matching and searching","title":"Pattern matching and searching","text":"The PROSITE notation is described in ScanProsite - user manual. 
The syntax supports almost all notations including the extended syntax. The PROSITE notation starts with prosite prefix and no symbol option is needed because it always describes patterns of amino acid sequences:","category":"page"},{"location":"sequence_search/","page":"Pattern matching and searching","title":"Pattern matching and searching","text":"julia> match(prosite\"[AC]-x-V-x(4)-{ED}\", aa\"CPVPQARG\")\nRegexMatch(\"CPVPQARG\")\n\njulia> match(prosite\"[AC]xVx(4){ED}\", aa\"CPVPQARG\")\nRegexMatch(\"CPVPQARG\")\n","category":"page"},{"location":"sequence_search/#Position-weight-matrix-search","page":"Pattern matching and searching","title":"Position weight matrix search","text":"","category":"section"},{"location":"sequence_search/","page":"Pattern matching and searching","title":"Pattern matching and searching","text":"A motif can be specified using position weight matrix (PWM) in a probabilistic way. This method searches for the first position in the sequence where a score calculated using a PWM is greater than or equal to a threshold. More formally, denoting the sequence as S and the PWM value of symbol s at position j as M_sj, the score starting from a position p is defined as","category":"page"},{"location":"sequence_search/","page":"Pattern matching and searching","title":"Pattern matching and searching","text":"operatornamescore(S p) = sum_i=1^L M_Sp+i-1i","category":"page"},{"location":"sequence_search/","page":"Pattern matching and searching","title":"Pattern matching and searching","text":"and the search returns the smallest p that satisfies operatornamescore(S p) ge t.","category":"page"},{"location":"sequence_search/","page":"Pattern matching and searching","title":"Pattern matching and searching","text":"There are two kinds of matrices in this package: PFM and PWM. The PFM type is a position frequency matrix and stores symbol frequencies for each position. The PWM is a position weight matrix and stores symbol scores for each position. 
You can create a PFM from a set of sequences with the same length and then create a PWM from the PFM object.","category":"page"},{"location":"sequence_search/","page":"Pattern matching and searching","title":"Pattern matching and searching","text":"julia> motifs = [dna\"TTA\", dna\"CTA\", dna\"ACA\", dna\"TCA\", dna\"GTA\"]\n5-element Vector{LongSequence{DNAAlphabet{4}}}:\n TTA\n CTA\n ACA\n TCA\n GTA\n\njulia> pfm = PFM(motifs) # sequence set => PFM\n4×3 PFM{DNA, Int64}:\n A 1 0 5\n C 1 2 0\n G 1 0 0\n T 2 3 0\n\njulia> pwm = PWM(pfm) # PFM => PWM\n4×3 PWM{DNA, Float64}:\n A -0.321928 -Inf 2.0\n C -0.321928 0.678072 -Inf\n G -0.321928 -Inf -Inf\n T 0.678072 1.26303 -Inf\n\njulia> pwm = PWM(pfm .+ 0.01) # add pseudo counts to avoid infinite values\n4×3 PWM{DNA, Float64}:\n A -0.319068 -6.97728 1.99139\n C -0.319068 0.673772 -6.97728\n G -0.319068 -6.97728 -6.97728\n T 0.673772 1.25634 -6.97728\n\njulia> pwm = PWM(pfm .+ 0.01, prior=[0.2, 0.3, 0.3, 0.2]) # GC-rich prior\n4×3 PWM{DNA, Float64}:\n A 0.00285965 -6.65535 2.31331\n C -0.582103 0.410737 -7.24031\n G -0.582103 -7.24031 -7.24031\n T 0.9957 1.57827 -6.65535\n","category":"page"},{"location":"sequence_search/","page":"Pattern matching and searching","title":"Pattern matching and searching","text":"The PWM_sj matrix is computed from PFM_sj and the prior probability p(s) as follows ([Wasserman2004]):","category":"page"},{"location":"sequence_search/","page":"Pattern matching and searching","title":"Pattern matching and searching","text":"beginalign\n PWM_sj = log_2 fracp(sj)p(s) \n p(sj) = fracPFM_sjsum_s PFM_sj\nendalign","category":"page"},{"location":"sequence_search/","page":"Pattern matching and searching","title":"Pattern matching and searching","text":"However, if you just want to quickly conduct a search, constructing the PFM and PWM is done for you as a convenience if you build a PWMSearchQuery, using a collection of sequences:","category":"page"},{"location":"sequence_search/","page":"Pattern matching 
and searching","title":"Pattern matching and searching","text":"julia> motifs = [dna\"TTA\", dna\"CTA\", dna\"ACA\", dna\"TCA\", dna\"GTA\"]\n5-element Vector{LongSequence{DNAAlphabet{4}}}:\n TTA\n CTA\n ACA\n TCA\n GTA\n\njulia> subject = dna\"TATTATAATTA\";\n\njulia> qa = PWMSearchQuery(motifs, 1.0);\n\njulia> findfirst(qa, subject)\n3\n\njulia> findall(qa, subject)\n3-element Vector{Int64}:\n 3\n 5\n 9","category":"page"},{"location":"sequence_search/","page":"Pattern matching and searching","title":"Pattern matching and searching","text":"[Wasserman2004]: https://doi.org/10.1038/nrg1315","category":"page"},{"location":"predicates/","page":"Predicates","title":"Predicates","text":"CurrentModule = BioSequences\nDocTestSetup = quote\n using BioSequences\nend","category":"page"},{"location":"predicates/#Predicates","page":"Predicates","title":"Predicates","text":"","category":"section"},{"location":"predicates/","page":"Predicates","title":"Predicates","text":"A number of predicate or query functions are supported for sequences, allowing you to check for certain properties of a sequence.","category":"page"},{"location":"predicates/","page":"Predicates","title":"Predicates","text":"isrepetitive\nispalindromic\nhasambiguity\niscanonical","category":"page"},{"location":"predicates/#BioSequences.isrepetitive","page":"Predicates","title":"BioSequences.isrepetitive","text":"isrepetitive(seq::BioSequence, n::Integer = length(seq))\n\nReturn true if and only if seq contains a repetitive subsequence of length ≥ n.\n\n\n\n\n\n","category":"function"},{"location":"predicates/#BioSequences.ispalindromic","page":"Predicates","title":"BioSequences.ispalindromic","text":"ispalindromic(seq::BioSequence)\n\nReturn true if seq is a palindromic sequence; otherwise return false.\n\n\n\n\n\n","category":"function"},{"location":"predicates/#BioSequences.hasambiguity","page":"Predicates","title":"BioSequences.hasambiguity","text":"hasambiguity(seq::BioSequence)\n\nReturns true if seq has 
an ambiguous symbol; otherwise return false.\n\n\n\n\n\n","category":"function"},{"location":"predicates/#BioSequences.iscanonical","page":"Predicates","title":"BioSequences.iscanonical","text":"iscanonical(seq::NucleotideSeq)\n\nReturns true if seq is canonical.\n\nFor any sequence, there is a reverse complement, which is the same sequence, but on the complementary strand of DNA:\n\n------->\nATCGATCG\nCGATCGAT\n<-------\n\nnote: Note\nUsing the reverse_complement of a DNA sequence will give this reverse complement.\n\nOf the two sequences, the canonical one is the lesser of the two, i.e. canonical_seq < other_seq.\n\n\n\n\n\n","category":"function"},{"location":"recipes/","page":"Recipes","title":"Recipes","text":"CurrentModule = BioSequences\nDocTestSetup = quote\n using BioSequences\n using BioSymbols\nend","category":"page"},{"location":"recipes/#Recipes","page":"Recipes","title":"Recipes","text":"","category":"section"},{"location":"recipes/","page":"Recipes","title":"Recipes","text":"This page provides tested example code to solve various common problems using BioSequences.","category":"page"},{"location":"recipes/#One-hot-encoding-biosequences","page":"Recipes","title":"One-hot encoding biosequences","text":"","category":"section"},{"location":"recipes/","page":"Recipes","title":"Recipes","text":"The types DNA, RNA and AminoAcid expose a binary representation through the exported function BioSymbols.compatbits, which is a one-hot encoding of:","category":"page"},{"location":"recipes/","page":"Recipes","title":"Recipes","text":"julia> using BioSymbols\n\njulia> compatbits(DNA_W)\n0x09\n\njulia> compatbits(AA_J)\n0x00000600","category":"page"},{"location":"recipes/","page":"Recipes","title":"Recipes","text":"Each set bit in the encoding corresponds to a compatible unambiguous symbol. For example, for RNA, the four lower bits encode A, C, G, and U, in order. 
Hence, the symbol D, which is short for A, G or U, is encoded as 0x01 | 0x04 | 0x08 == 0x0d:","category":"page"},{"location":"recipes/","page":"Recipes","title":"Recipes","text":"julia> compatbits(RNA_D)\n0x0d\n\njulia> compatbits(RNA_A) | compatbits(DNA_G) | compatbits(RNA_U)\n0x0d","category":"page"},{"location":"recipes/","page":"Recipes","title":"Recipes","text":"Using this, we can construct a function to one-hot encode sequences - in this example, nucleic acid sequences:","category":"page"},{"location":"recipes/","page":"Recipes","title":"Recipes","text":"function one_hot(s::NucSeq)\n M = falses(4, length(s))\n for (i, s) in enumerate(s)\n bits = compatbits(s)\n while !iszero(bits)\n M[trailing_zeros(bits) + 1, i] = true\n bits &= bits - one(bits) # clear lowest bit\n end\n end\n M\nend\n\none_hot(dna\"TGNTKCTW-T\")\n\n# output\n\n4×10 BitMatrix:\n 0 0 1 0 0 0 0 1 0 0\n 0 0 1 0 0 1 0 0 0 0\n 0 1 1 0 1 0 0 0 0 0\n 1 0 1 1 1 0 1 1 0 1","category":"page"},{"location":"#BioSequences","page":"Home","title":"BioSequences","text":"","category":"section"},{"location":"","page":"Home","title":"Home","text":"(Image: Latest Release) (Image: MIT license) (Image: Documentation) (Image: Pkg Status)","category":"page"},{"location":"#Description","page":"Home","title":"Description","text":"","category":"section"},{"location":"","page":"Home","title":"Home","text":"BioSequences provides data types and methods for common operations with biological sequences, including DNA, RNA, and amino acid sequences.","category":"page"},{"location":"#Installation","page":"Home","title":"Installation","text":"","category":"section"},{"location":"","page":"Home","title":"Home","text":"You can install BioSequences from the julia REPL. 
Press ] to enter pkg mode, and enter the following:","category":"page"},{"location":"","page":"Home","title":"Home","text":"add BioSequences","category":"page"},{"location":"","page":"Home","title":"Home","text":"If you are interested in the cutting edge of the development, please check out the master branch to try new features before release.","category":"page"},{"location":"#Testing","page":"Home","title":"Testing","text":"","category":"section"},{"location":"","page":"Home","title":"Home","text":"BioSequences is tested against Julia 1.X on Linux, OS X, and Windows.","category":"page"},{"location":"","page":"Home","title":"Home","text":"(Image: Unit tests) (Image: Documentation) (Image: )","category":"page"},{"location":"#Contributing","page":"Home","title":"Contributing","text":"","category":"section"},{"location":"","page":"Home","title":"Home","text":"We appreciate contributions from users including reporting bugs, fixing issues, improving performance and adding new features.","category":"page"},{"location":"","page":"Home","title":"Home","text":"Take a look at the contributing files for detailed contributor and maintainer guidelines, and the code of conduct.","category":"page"},{"location":"#Questions?","page":"Home","title":"Questions?","text":"","category":"section"},{"location":"","page":"Home","title":"Home","text":"If you have a question about contributing or using BioJulia software, come on over and chat to us on the #biology channel on the Julia Slack, or you can try the Bio category of the Julia Discourse site.","category":"page"},{"location":"types/","page":"BioSequences Types","title":"BioSequences Types","text":"CurrentModule = BioSequences\nDocTestSetup = quote\n using BioSequences\nend","category":"page"},{"location":"types/#Abstract-Types","page":"BioSequences Types","title":"Abstract Types","text":"","category":"section"},{"location":"types/","page":"BioSequences Types","title":"BioSequences Types","text":"BioSequences exports an abstract 
BioSequence type, and several concrete sequence types which inherit from it.","category":"page"},{"location":"types/#The-abstract-BioSequence","page":"BioSequences Types","title":"The abstract BioSequence","text":"","category":"section"},{"location":"types/","page":"BioSequences Types","title":"BioSequences Types","text":"BioSequences provides an abstract type called a BioSequence{A<:Alphabet}. This abstract type, and the methods and traits it supports, allows for many algorithms in BioSequences to be written as generically as possible, thus reducing the amount of code to read and understand, whilst maintaining high performance when such code is compiled for a concrete BioSequence subtype. Additionally, it allows new types to be implemented that are fully compatible with the rest of BioSequences, provided that key methods or traits are defined.","category":"page"},{"location":"types/","page":"BioSequences Types","title":"BioSequences Types","text":"BioSequence","category":"page"},{"location":"types/#BioSequences.BioSequence","page":"BioSequences Types","title":"BioSequences.BioSequence","text":"BioSequence{A <: Alphabet}\n\nBioSequence is the main abstract type of BioSequences. It abstracts over the internal representation of different biological sequences, and is parameterized by an Alphabet, which controls the element type.\n\nExtended help\n\nIts subtypes are characterized by:\n\nBeing a linear container type with random access and indices Base.OneTo(length(x)).\nContaining zero or more internal data elements of type encoded_data_eltype(typeof(x)).\nBeing associated with an Alphabet, A by being a subtype of BioSequence{A}.\n\nA BioSequence{A} is indexed by an integer. The biosequence subtype, the index and the alphabet A determine how to extract the internal encoded data. The alphabet decides how to decode the data to the element type of the biosequence. 
Hence, the element type and container type of a BioSequence are separated.\n\nSubtypes T of BioSequence must implement the following, with E being an encoded data type:\n\nBase.length(::T)::Int\nencoded_data_eltype(::Type{T})::Type{E}\nextract_encoded_element(::T, ::Integer)::E\ncopy(::T)\nT must be able to be constructed from any iterable with length defined and with a known, compatible element type.\n\nFurthermore, mutable sequences should implement\n\nencoded_setindex!(::T, ::E, ::Integer)\nT(undef, ::Int)\nresize!(::T, ::Int)\n\nFor compatibility with existing Alphabets, the encoded data eltype must be UInt.\n\n\n\n\n\n","category":"type"},{"location":"types/","page":"BioSequences Types","title":"BioSequences Types","text":"Some aliases for BioSequence are also provided for your convenience:","category":"page"},{"location":"types/","page":"BioSequences Types","title":"BioSequences Types","text":"NucSeq\nAASeq","category":"page"},{"location":"types/#BioSequences.NucSeq","page":"BioSequences Types","title":"BioSequences.NucSeq","text":"An alias for BioSequence{<:NucleicAcidAlphabet}\n\n\n\n\n\n","category":"type"},{"location":"types/#BioSequences.AASeq","page":"BioSequences Types","title":"BioSequences.AASeq","text":"An alias for BioSequence{AminoAcidAlphabet}\n\n\n\n\n\n","category":"type"},{"location":"types/","page":"BioSequences Types","title":"BioSequences Types","text":"Let's have a closer look at some of those methods that a subtype of BioSequence must implement. Check out the Julia Base library docs for length, copy and resize!.","category":"page"},{"location":"types/","page":"BioSequences Types","title":"BioSequences Types","text":"encoded_data_eltype\nextract_encoded_element\nencoded_setindex!","category":"page"},{"location":"types/#BioSequences.encoded_data_eltype","page":"BioSequences Types","title":"BioSequences.encoded_data_eltype","text":"encoded_data_eltype(::Type{<:BioSequence})\n\nReturns the element type of the encoded data of the BioSequence. 
This is the return type of extract_encoded_element, i.e. the data type that stores the biological symbols in the biosequence.\n\nSee also: BioSequence \n\n\n\n\n\n","category":"function"},{"location":"types/#BioSequences.extract_encoded_element","page":"BioSequences Types","title":"BioSequences.extract_encoded_element","text":"extract_encoded_element(::BioSequence{A}, i::Integer)\n\nReturns the encoded element at position i. This data can be decoded using decode(A(), data) to yield the element type of the biosequence.\n\nSee also: BioSequence \n\n\n\n\n\n","category":"function"},{"location":"types/#BioSequences.encoded_setindex!","page":"BioSequences Types","title":"BioSequences.encoded_setindex!","text":"encoded_setindex!(seq::BioSequence, x::E, i::Integer)\n\nGiven encoded data x of type encoded_data_eltype(typeof(seq)), sets the internal sequence data at the given index.\n\nSee also: BioSequence \n\n\n\n\n\n","category":"function"},{"location":"types/","page":"BioSequences Types","title":"BioSequences Types","text":"For a correctly defined subtype of BioSequence that satisfies the interface, the vast majority of methods described in the rest of this manual should work out of the box. But they can always be overloaded if needed. Indeed, some of the generic BioSequence methods are overloaded for LongSequence, for example for transformation and counting operations where efficiency gains can be made due to the specific internal representation of a specific type.","category":"page"},{"location":"types/#The-abstract-Alphabet","page":"BioSequences Types","title":"The abstract Alphabet","text":"","category":"section"},{"location":"types/","page":"BioSequences Types","title":"BioSequences Types","text":"Alphabets control how biological symbols are encoded and decoded. 
They also confer many of the automatic traits and methods that any subtype of T<:BioSequence{A<:Alphabet} will get.","category":"page"},{"location":"types/","page":"BioSequences Types","title":"BioSequences Types","text":"BioSequences.Alphabet\nBioSequences.AsciiAlphabet","category":"page"},{"location":"types/#BioSequences.Alphabet","page":"BioSequences Types","title":"BioSequences.Alphabet","text":"Alphabet\n\nAlphabet is the most important type trait for BioSequence. An Alphabet represents a set of biological symbols encoded by a sequence, e.g. A, C, G and T for a DNA Alphabet that requires only 2 bits to represent each symbol.\n\nExtended help\n\nSubtypes of Alphabet are singleton structs that may or may not be parameterized.\nAlphabets span over a finite set of biological symbols.\nThe alphabet controls the encoding from some internal \"encoded data\" to a BioSymbol of the alphabet's element type, as well as the decoding, the inverse process.\nAn Alphabet's encode method must not produce invalid data. \n\nEvery subtype A of Alphabet must implement:\n\nBase.eltype(::Type{A})::Type{S} for some eltype S, which must be a BioSymbol.\nsymbols(::A)::Tuple{Vararg{S}}. 
This gives tuples of all symbols in the set of A.\nencode(::A, ::S)::E encodes a symbol to an internal data eltype E.\ndecode(::A, ::E)::S decodes an internal data eltype E to a symbol S.\nExcept for eltype which must follow Base conventions, all functions operating on Alphabet should operate on instances of the alphabet, not the type.\n\nIf you want interoperation with existing subtypes of BioSequence, the encoded representation E must be of type UInt, and you must also implement:\n\nBitsPerSymbol(::A)::BitsPerSymbol{N}, where the N must be zero or a power of two in [1, 2, 4, 8, 16, 32, [64 for 64-bit systems]].\n\nFor increased performance, see BioSequences.AsciiAlphabet\n\n\n\n\n\n","category":"type"},{"location":"types/#BioSequences.AsciiAlphabet","page":"BioSequences Types","title":"BioSequences.AsciiAlphabet","text":"AsciiAlphabet\n\nTrait for alphabet using ASCII characters as String representation. Define codetype(A) = AsciiAlphabet() for a user-defined Alphabet A to gain speed. Methods needed: BioSymbols.stringbyte(::eltype(A)) and ascii_encode(A, ::UInt8).\n\n\n\n\n\n","category":"type"},{"location":"types/#Concrete-types","page":"BioSequences Types","title":"Concrete types","text":"","category":"section"},{"location":"types/#Implemented-alphabets","page":"BioSequences Types","title":"Implemented alphabets","text":"","category":"section"},{"location":"types/","page":"BioSequences Types","title":"BioSequences Types","text":"DNAAlphabet\nRNAAlphabet\nAminoAcidAlphabet","category":"page"},{"location":"types/#BioSequences.DNAAlphabet","page":"BioSequences Types","title":"BioSequences.DNAAlphabet","text":"DNA nucleotide alphabet.\n\nDNAAlphabet has a parameter N which is a number that determines the BitsPerSymbol trait. 
Currently supported values of N are 2 and 4.\n\n\n\n\n\n","category":"type"},{"location":"types/#BioSequences.RNAAlphabet","page":"BioSequences Types","title":"BioSequences.RNAAlphabet","text":"RNA nucleotide alphabet.\n\nRNAAlphabet has a parameter N which is a number that determines the BitsPerSymbol trait. Currently supported values of N are 2 and 4.\n\n\n\n\n\n","category":"type"},{"location":"types/#BioSequences.AminoAcidAlphabet","page":"BioSequences Types","title":"BioSequences.AminoAcidAlphabet","text":"Amino acid alphabet.\n\n\n\n\n\n","category":"type"},{"location":"types/#Long-Sequences","page":"BioSequences Types","title":"Long Sequences","text":"","category":"section"},{"location":"types/","page":"BioSequences Types","title":"BioSequences Types","text":"LongSequence","category":"page"},{"location":"types/#BioSequences.LongSequence","page":"BioSequences Types","title":"BioSequences.LongSequence","text":"LongSequence{A <: Alphabet}\n\nGeneral-purpose BioSequence. This type is mutable and variable-length, and should be preferred for most use cases.\n\nExtended help\n\nLongSequence{A<:Alphabet} <: BioSequence{A} is parameterized by a concrete Alphabet type A that defines the domain (or set) of biological symbols permitted.\n\nAs the BioSequence interface definition implies, LongSequences store the biological symbol elements that they contain in a succinct encoded form that permits many operations to be done in an efficient bit-parallel manner. 
As per the interface of BioSequence, the Alphabet determines how an element is encoded or decoded when it is inserted or extracted from the sequence.\n\nFor example, AminoAcidAlphabet is associated with AminoAcid and hence an object of the LongSequence{AminoAcidAlphabet} type represents a sequence of amino acids.\n\nSymbols from multiple alphabets can't be intermixed in one sequence type.\n\nThe following table summarizes common LongSequence types that have been given aliases for convenience.\n\nType Symbol type Type alias\nLongSequence{DNAAlphabet{N}} DNA LongDNA{N}\nLongSequence{RNAAlphabet{N}} RNA LongRNA{N}\nLongSequence{AminoAcidAlphabet} AminoAcid LongAA\n\nThe LongDNA and LongRNA aliases use a DNAAlphabet{4}.\n\nDNAAlphabet{4} permits ambiguous nucleotides, and a sequence must use at least 4 bits to internally store each element (and indeed LongSequence does).\n\nIf you are sure that you are working with sequences with no ambiguous nucleotides, you can use LongSequences parameterised with DNAAlphabet{2} instead.\n\nDNAAlphabet{2} is an alphabet that uses two bits per base and limits to only unambiguous nucleotide symbols (A,C,G,T).\n\nChanging this single parameter, is all you need to do in order to benefit from memory savings. 
Some computations that use bitwise operations will also be dramatically faster.\n\nThe same applies to LongSequence{RNAAlphabet{4}}: simply replace the alphabet parameter with RNAAlphabet{2} in order to benefit.\n\n\n\n\n\n","category":"type"},{"location":"types/#Sequence-views","page":"BioSequences Types","title":"Sequence views","text":"","category":"section"},{"location":"types/","page":"BioSequences Types","title":"BioSequences Types","text":"Similar to how Base Julia offers views of array objects, BioSequences offers views of LongSequences - the LongSubSeq{A<:Alphabet}.","category":"page"},{"location":"types/","page":"BioSequences Types","title":"BioSequences Types","text":"Conceptually, a LongSubSeq{A} is similar to a LongSequence{A}, but instead of storing its own data, it refers to the data of a LongSequence. Modifying the LongSequence will be reflected in the view, and vice versa. If the underlying LongSequence is truncated, the behaviour of a view is undefined. For the same reason, some operations are not supported for views, such as resizing.","category":"page"},{"location":"types/","page":"BioSequences Types","title":"BioSequences Types","text":"The purpose of LongSubSeq is that, since they only contain a pointer to the underlying array, an offset and a length, they are much lighter than LongSequences, and will be stack allocated on Julia 1.5 and newer. Thus, the user may construct millions of views without major performance implications.","category":"page"}] } diff --git a/dev/sequence_search/index.html b/dev/sequence_search/index.html index fa68e90e..db17e091 100644 --- a/dev/sequence_search/index.html +++ b/dev/sequence_search/index.html @@ -1,5 +1,5 @@ -Pattern matching and searching · BioSequences.jl

Searching for sequence motifs

There are many ways to search for particular motifs in biological sequences:

  1. Exact searches, where you are looking for exact matches of a particular character or substring.
  2. Approximate searches, where you are looking for sequences that are sufficiently similar to a given sequence or family of sequences.
  3. Searches where you are looking for sequences that conform to some sort of pattern.

Like other Julia sequences such as Vector, you can search a BioSequence with the findfirst(predicate, collection) method pattern.

All these kinds of searches are provided in BioSequences.jl, and they all conform to the findnext, findprev, and occursin patterns established in Base for String and collections like Vector.

The exception is searching with the specialised regex provided in this package, which, as you shall see, conforms to the match pattern established in Base for PCRE regexes and Strings.

julia> seq = dna"ACAGCGTAGCT";
+Pattern matching and searching · BioSequences.jl

Searching for sequence motifs

There are many ways to search for particular motifs in biological sequences:

  1. Exact searches, where you are looking for exact matches of a particular character or substring.
  2. Approximate searches, where you are looking for sequences that are sufficiently similar to a given sequence or family of sequences.
  3. Searches where you are looking for sequences that conform to some sort of pattern.

Like other Julia sequences such as Vector, you can search a BioSequence with the findfirst(predicate, collection) method pattern.

All these kinds of searches are provided in BioSequences.jl, and they all conform to the findnext, findprev, and occursin patterns established in Base for String and collections like Vector.

The exception is searching using the specialised regex provided in this package, which as you shall see, conforms to the match pattern established in Base for pcre and Strings.

julia> seq = dna"ACAGCGTAGCT";
 
 julia> findfirst(DNA_A, seq)
 1
@@ -50,7 +50,7 @@
 
 julia> occursin(ExactSearchQuery(dna"CNT", iscompatible), dna"ACNT")
 true
-
source

Allowing mismatches

BioSequences.ApproximateSearchQueryType
ApproximateSearchQuery{F<:Function,S<:BioSequence}

Query type for approximate sequence search.

These queries are used as a predicate for the Base.findnext, Base.findprev, Base.occursin, Base.findfirst, and Base.findlast functions.

Using these functions with these queries allows you to search a given sequence for a sub-sequence, whilst allowing a specific number of errors.

In other words they find a subsequence of the target sequence within a specific Levenshtein distance of the query sequence.

Examples

julia> seq = dna"ACAGCGTAGCT";
+
source

Allowing mismatches

BioSequences.ApproximateSearchQueryType
ApproximateSearchQuery{F<:Function,S<:BioSequence}

Query type for approximate sequence search.

These queries are used as a predicate for the Base.findnext, Base.findprev, Base.occursin, Base.findfirst, and Base.findlast functions.

Using these functions with these queries allows you to search a given sequence for a sub-sequence, whilst allowing a specific number of errors.

In other words they find a subsequence of the target sequence within a specific Levenshtein distance of the query sequence.

Examples

julia> seq = dna"ACAGCGTAGCT";
 
 julia> query = ApproximateSearchQuery(dna"AGGG");
 
@@ -69,7 +69,7 @@
 
 julia> findnext(query, 1, dna"AAGNGG", 1) # 1 mismatch permitted (A vs G) & matched N
 1:4
-
Note

This method of searching for motifs was implemented with smaller query motifs in mind.

If you are looking to search for imperfect matches of longer sequences in this manner, you are likely better off using some kind of local-alignment algorithm or one of the BLAST variants.

source

Searching according to a pattern

Query patterns can be described in regular expressions. The syntax supports a subset of Perl and PROSITE's notation.

Biological regexes can be constructed using the BioRegex constructor, for example by doing BioRegex{AminoAcid}("MV+"). For bioregex literals, it is instead recommended to use the @biore_str macro:

The Perl-like syntax starts with biore (BIOlogical REgular expression) and ends with a symbol option: "dna", "rna" or "aa". For example, biore"A+"dna is a regular expression for DNA sequences and biore"A+"aa is for amino acid sequences. Each symbol option can be abbreviated to its first character: "d", "r" or "a", respectively.

Here are examples of using the regular expression for BioSequences:

julia> match(biore"A+C*"dna, dna"AAAACC")
+
Note

This method of searching for motifs was implemented with smaller query motifs in mind.

If you are looking to search for imperfect matches of longer sequences in this manner, you are likely better off using some kind of local-alignment algorithm or one of the BLAST variants.

source

Searching according to a pattern

Query patterns can be described in regular expressions. The syntax supports a subset of Perl and PROSITE's notation.

Biological regexes can be constructed using the BioRegex constructor, for example by doing BioRegex{AminoAcid}("MV+"). For bioregex literals, it is instead recommended to use the @biore_str macro:

The Perl-like syntax starts with biore (BIOlogical REgular expression) and ends with a symbol option: "dna", "rna" or "aa". For example, biore"A+"dna is a regular expression for DNA sequences and biore"A+"aa is for amino acid sequences. Each symbol option can be abbreviated to its first character: "d", "r" or "a", respectively.

Here are examples of using the regular expression for BioSequences:

julia> match(biore"A+C*"dna, dna"AAAACC")
 RegexMatch("AAAACC")
 
 julia> match(biore"A+C*"d, dna"AAAACC")
@@ -159,4 +159,4 @@
 3-element Vector{Int64}:
  3
  5
- 9

[Wasserman2004]: https://doi.org/10.1038/nrg1315

+ 9

[Wasserman2004]: https://doi.org/10.1038/nrg1315

diff --git a/dev/symbols/index.html b/dev/symbols/index.html index 5e99ed53..7d35eae9 100644 --- a/dev/symbols/index.html +++ b/dev/symbols/index.html @@ -1,5 +1,5 @@ -Biological Symbols · BioSequences.jl

Biological symbols

The BioSequences module reexports the biological symbol (character) types that are provided by BioSymbols.jl:

Type        Meaning
DNA         DNA nucleotide
RNA         RNA nucleotide
AminoAcid   Amino acid

These symbols are elements of biological sequence types, just as characters are elements of strings.

DNA and RNA nucleotides

The set of nucleotide symbols in BioSequences covers the IUPAC nucleotide bases plus a gap symbol:

Symbol  Constant            Meaning
'A'     DNA_A / RNA_A       A; Adenine
'C'     DNA_C / RNA_C       C; Cytosine
'G'     DNA_G / RNA_G       G; Guanine
'T'     DNA_T               T; Thymine (DNA only)
'U'     RNA_U               U; Uracil (RNA only)
'M'     DNA_M / RNA_M       A or C
'R'     DNA_R / RNA_R       A or G
'W'     DNA_W / RNA_W       A or T/U
'S'     DNA_S / RNA_S       C or G
'Y'     DNA_Y / RNA_Y       C or T/U
'K'     DNA_K / RNA_K       G or T/U
'V'     DNA_V / RNA_V       A or C or G; not T/U
'H'     DNA_H / RNA_H       A or C or T/U; not G
'D'     DNA_D / RNA_D       A or G or T/U; not C
'B'     DNA_B / RNA_B       C or G or T/U; not A
'N'     DNA_N / RNA_N       A or C or G or T/U
'-'     DNA_Gap / RNA_Gap   Gap (none of the above)

https://www.bioinformatics.org/sms/iupac.html

Symbols are accessible as constants with DNA_ or RNA_ prefix:

julia> DNA_A
+Biological Symbols · BioSequences.jl

Biological symbols

The BioSequences module reexports the biological symbol (character) types that are provided by BioSymbols.jl:

Type        Meaning
DNA         DNA nucleotide
RNA         RNA nucleotide
AminoAcid   Amino acid

These symbols are elements of biological sequence types, just as characters are elements of strings.

DNA and RNA nucleotides

The set of nucleotide symbols in BioSequences covers the IUPAC nucleotide bases plus a gap symbol:

Symbol  Constant            Meaning
'A'     DNA_A / RNA_A       A; Adenine
'C'     DNA_C / RNA_C       C; Cytosine
'G'     DNA_G / RNA_G       G; Guanine
'T'     DNA_T               T; Thymine (DNA only)
'U'     RNA_U               U; Uracil (RNA only)
'M'     DNA_M / RNA_M       A or C
'R'     DNA_R / RNA_R       A or G
'W'     DNA_W / RNA_W       A or T/U
'S'     DNA_S / RNA_S       C or G
'Y'     DNA_Y / RNA_Y       C or T/U
'K'     DNA_K / RNA_K       G or T/U
'V'     DNA_V / RNA_V       A or C or G; not T/U
'H'     DNA_H / RNA_H       A or C or T/U; not G
'D'     DNA_D / RNA_D       A or G or T/U; not C
'B'     DNA_B / RNA_B       C or G or T/U; not A
'N'     DNA_N / RNA_N       A or C or G or T/U
'-'     DNA_Gap / RNA_Gap   Gap (none of the above)

https://www.bioinformatics.org/sms/iupac.html

Symbols are accessible as constants with DNA_ or RNA_ prefix:

julia> DNA_A
 DNA_A
 
 julia> DNA_T
@@ -70,4 +70,4 @@
 
 julia> iscompatible(DNA_C, DNA_R)  # DNA_R (A or G) cannot be DNA_C
 false
-
source
BioSymbols.isambiguousFunction
isambiguous(nt::NucleicAcid)

Test if nt is an ambiguous nucleotide.

source
isambiguous(aa::AminoAcid)

Test if aa is an ambiguous amino acid.

source
+source
BioSymbols.isambiguousFunction
isambiguous(nt::NucleicAcid)

Test if nt is an ambiguous nucleotide.

source
isambiguous(aa::AminoAcid)

Test if aa is an ambiguous amino acid.

source
diff --git a/dev/transforms/index.html b/dev/transforms/index.html index 8038630b..dc4f663e 100644 --- a/dev/transforms/index.html +++ b/dev/transforms/index.html @@ -1,5 +1,5 @@ -Indexing & modifying sequences · BioSequences.jl

Indexing & modifying sequences

Indexing

Most concrete BioSequence subtypes behave, for the most part, like other vector or string types. They can be indexed using integers or ranges:

For example, with LongSequences:

julia> seq = dna"ACGTTTANAGTNNAGTACC"
+Indexing & modifying sequences · BioSequences.jl

Indexing & modifying sequences

Indexing

Most concrete BioSequence subtypes behave, for the most part, like other vector or string types. They can be indexed using integers or ranges:

For example, with LongSequences:

julia> seq = dna"ACGTTTANAGTNNAGTACC"
 19nt DNA Sequence:
 ACGTTTANAGTNNAGTACC
 
@@ -15,7 +15,7 @@
 
 julia> seq[5] = DNA_A
 DNA_A
-
Note

Some types can be indexed using integers but not using ranges.

For LongSequence types, indexing a sequence by range creates a copy of the original sequence, similar to Array in Julia's Base library. If you find yourself slowed down by the allocation of these subsequences, consider using a sequence view instead.

Modifying sequences

In addition to setindex!, many other modifying operations are possible for biological sequences such as push!, pop!, and insert!, which should be familiar to anyone used to editing arrays.

Base.push!Method
push!(seq::BioSequence, x)

Append a biological symbol x to a biological sequence seq.

source
Base.pop!Method
pop!(seq::BioSequence)

Remove the symbol from the end of a biological sequence seq and return it. Returns a variable of eltype(seq).

source
Base.pushfirst!Method
pushfirst!(seq, x)

Insert a biological symbol x at the beginning of a biological sequence seq.

source
Base.popfirst!Method
popfirst!(seq)

Remove the symbol from the beginning of a biological sequence seq and return it. Returns a variable of eltype(seq).

source
Base.insert!Method
insert!(seq::BioSequence, i, x)

Insert a biological symbol x into a biological sequence seq, at the given index i.

source
Base.deleteat!Method
deleteat!(seq::BioSequence, i::Integer)

Delete a biological symbol at a single position i in a biological sequence seq.

Modifies the input sequence.

source
Base.append!Method
append!(seq, other)

Add a biological sequence other onto the end of biological sequence seq. Modifies and returns seq.

source
Base.resize!Method
resize!(seq, size, [force::Bool])

Resize a biological sequence seq, to a given size. Does not resize the underlying data array unless the new size does not fit. If force, always resize underlying data array.

source
Base.empty!Method
empty!(seq::BioSequence)

Completely empty a biological sequence seq of nucleotides.

source

Here are some examples:

julia> seq = dna"ACG"
+
Note

Some types can be indexed using integers but not using ranges.

For LongSequence types, indexing a sequence by range creates a copy of the original sequence, similar to Array in Julia's Base library. If you find yourself slowed down by the allocation of these subsequences, consider using a sequence view instead.
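The difference between a range copy and a view can be sketched as follows. This is a minimal illustration, assuming a BioSequences version in which `view(seq, range)` on a LongSequence returns a LongSubSeq; the variable names are only for illustration.

```julia
using BioSequences

seq = dna"ACGTTTANAGTNNAGTACC"

# Indexing by range allocates a new, independent LongSequence.
sub = seq[2:5]

# A view refers to the parent's data instead of copying it
# (assumption: `view` on a LongSequence yields a LongSubSeq).
v = view(seq, 2:5)

# Writing through the view changes the parent; the earlier copy is unaffected.
v[1] = DNA_A
```

After the assignment, `seq[2]` is `DNA_A` while `sub[1]` still holds the original `DNA_C`.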

Modifying sequences

In addition to setindex!, many other modifying operations are possible for biological sequences such as push!, pop!, and insert!, which should be familiar to anyone used to editing arrays.

Base.push!Method
push!(seq::BioSequence, x)

Append a biological symbol x to a biological sequence seq.

source
Base.pop!Method
pop!(seq::BioSequence)

Remove the symbol from the end of a biological sequence seq and return it. Returns a variable of eltype(seq).

source
Base.pushfirst!Method
pushfirst!(seq, x)

Insert a biological symbol x at the beginning of a biological sequence seq.

source
Base.popfirst!Method
popfirst!(seq)

Remove the symbol from the beginning of a biological sequence seq and return it. Returns a variable of eltype(seq).

source
Base.insert!Method
insert!(seq::BioSequence, i, x)

Insert a biological symbol x into a biological sequence seq, at the given index i.

source
Base.deleteat!Method
deleteat!(seq::BioSequence, i::Integer)

Delete a biological symbol at a single position i in a biological sequence seq.

Modifies the input sequence.

source
Base.append!Method
append!(seq, other)

Add a biological sequence other onto the end of biological sequence seq. Modifies and returns seq.

source
Base.resize!Method
resize!(seq, size, [force::Bool])

Resize a biological sequence seq, to a given size. Does not resize the underlying data array unless the new size does not fit. If force, always resize underlying data array.

source
Base.empty!Method
empty!(seq::BioSequence)

Completely empty a biological sequence seq of nucleotides.

source

Here are some examples:

julia> seq = dna"ACG"
 3nt DNA Sequence:
 ACG
 
@@ -34,7 +34,7 @@
 julia> deleteat!(seq, 2:3)
 3nt DNA Sequence:
 AAT
-

Additional transformations

In addition to these basic modifying functions, other sequence transformations that are common in bioinformatics are also provided.

Base.reverse!Method
reverse!(seq::LongSequence)

Reverse a biological sequence seq in place.

source
Base.reverseMethod
reverse(seq::BioSequence)

Create reversed copy of a biological sequence.

source
reverse(seq::LongSequence)

Create reversed copy of a biological sequence.

source
BioSymbols.complementFunction
complement(nt::NucleicAcid)

Return the complementary nucleotide of nt.

This function returns the union of all possible complementary nucleotides.

Examples

julia> complement(DNA_A)
+

Additional transformations

In addition to these basic modifying functions, other sequence transformations that are common in bioinformatics are also provided.

Base.reverse!Method
reverse!(seq::LongSequence)

Reverse a biological sequence seq in place.

source
Base.reverseMethod
reverse(seq::BioSequence)

Create reversed copy of a biological sequence.

source
reverse(seq::LongSequence)

Create reversed copy of a biological sequence.

source
BioSymbols.complementFunction
complement(nt::NucleicAcid)

Return the complementary nucleotide of nt.

This function returns the union of all possible complementary nucleotides.

Examples

julia> complement(DNA_A)
 DNA_T
 
 julia> complement(DNA_N)
@@ -42,10 +42,10 @@
 
 julia> complement(RNA_U)
 RNA_A
-
source
complement(seq)

Make a complement sequence of seq.

source
BioSequences.canonical!Function
canonical!(seq::NucleotideSeq)

Transforms the seq into its canonical form, if it is not already canonical. Modifies the input sequence in place.

For any sequence, there is a reverse complement, which is the same sequence, but on the complementary strand of DNA:

------->
+
source
complement(seq)

Make a complement sequence of seq.

source
BioSequences.canonical!Function
canonical!(seq::NucleotideSeq)

Transforms the seq into its canonical form, if it is not already canonical. Modifies the input sequence in place.

For any sequence, there is a reverse complement, which is the same sequence, but on the complementary strand of DNA:

------->
 ATCGATCG
 CGATCGAT
-<-------
Note

Using the reverse_complement of a DNA sequence will give this reverse complement.

Of the two sequences, the canonical one is the lesser of the two, i.e. canonical_seq < other_seq.

Using this function on a seq will ensure it is the canonical version.

source

Some examples:

julia> seq = dna"ACGTAT"
+<-------
Note

Using the reverse_complement of a DNA sequence will give this reverse complement.

Of the two sequences, the canonical one is the lesser of the two, i.e. canonical_seq < other_seq.

Using this function on a seq will ensure it is the canonical version.
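As a small sketch of the rule above, assuming the non-mutating `canonical` counterpart of `canonical!` is available alongside `reverse_complement`:

```julia
using BioSequences

seq = dna"TTTT"

# The reverse complement of TTTT is AAAA.
rc = reverse_complement(seq)

# canonical returns a copy of whichever of the two sequences is the lesser;
# seq itself is left unchanged (canonical! would modify it instead).
c = canonical(seq)
```

Here `c` equals `rc`, because AAAA sorts before TTTT.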

source

Some examples:

julia> seq = dna"ACGTAT"
 6nt DNA Sequence:
 ACGTAT
 
@@ -60,7 +60,7 @@
 julia> reverse_complement!(seq)
 6nt DNA Sequence:
 ACGTAT
-

Many of these methods also have a version which makes a copy of the input sequence, so you get a modified copy, and don't alter the original sequence. Such methods are named the same, but without the exclamation mark. E.g. reverse instead of reverse!, and ungap instead of ungap!.

Translation

Translation is a slightly more complex transformation for RNA Sequences and so we describe it here in more detail.

The translate function translates a sequence of codons in an RNA sequence to an amino acid sequence based on a genetic code. The BioSequences package provides all NCBI-defined genetic codes, and they are registered in ncbi_trans_table.

BioSequences.translateFunction
translate(seq, code=standard_genetic_code, allow_ambiguous_codons=true, alternative_start=false)

Translate a LongRNA or a LongDNA to a LongAA.

Translation uses genetic code code to map codons to amino acids. See ncbi_trans_table for available genetic codes. If codons in the given sequence cannot determine a unique amino acid, they will be translated to AA_X if allow_ambiguous_codons is true and otherwise result in an error. For organisms that utilize alternative start codons, one can set alternative_start=true, in which case the first codon will always be converted to a methionine.

source
BioSequences.ncbi_trans_tableConstant

Genetic code list of NCBI.

The standard genetic code is ncbi_trans_table[1] and others can be shown by show(ncbi_trans_table). For more details, consult the next link: http://www.ncbi.nlm.nih.gov/Taxonomy/taxonomyhome.html/index.cgi?chapter=cgencodes.

source
julia> ncbi_trans_table
+

Many of these methods also have a version which makes a copy of the input sequence, so you get a modified copy, and don't alter the original sequence. Such methods are named the same, but without the exclamation mark. E.g. reverse instead of reverse!, and ungap instead of ungap!.
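The naming convention can be illustrated with `reverse` and `reverse!`. This is a minimal sketch; the variable names are only for illustration.

```julia
using BioSequences

seq = dna"ACGTAT"

# reverse returns a modified copy and leaves seq untouched.
rev = reverse(seq)
unchanged = copy(seq)

# reverse! rewrites seq itself.
reverse!(seq)
```

After these calls `rev` and `seq` hold the same reversed sequence, while `unchanged` preserves the original order.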

Translation

Translation is a slightly more complex transformation for RNA Sequences and so we describe it here in more detail.

The translate function translates a sequence of codons in an RNA sequence to an amino acid sequence based on a genetic code. The BioSequences package provides all NCBI-defined genetic codes, and they are registered in ncbi_trans_table.
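A brief sketch of translation with two different genetic codes follows. It assumes that `code` is passed as a keyword argument and that `ncbi_trans_table` can be indexed by table number, as the documentation in this section suggests.

```julia
using BioSequences

# Three codons: AUG (Met), GCC (Ala), UGA (stop in the standard code).
rna = rna"AUGGCCUGA"

# Translate with the default standard genetic code.
protein = translate(rna)

# Table 2 is the vertebrate mitochondrial code, in which UGA encodes tryptophan.
# Assumption: `code` is accepted as a keyword argument.
mito = translate(rna; code = ncbi_trans_table[2])
```

The two results differ only in the last residue, which shows how the choice of genetic code changes the meaning of the same codon.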

BioSequences.translateFunction
translate(seq, code=standard_genetic_code, allow_ambiguous_codons=true, alternative_start=false)

Translate a LongRNA or a LongDNA to a LongAA.

Translation uses genetic code code to map codons to amino acids. See ncbi_trans_table for available genetic codes. If codons in the given sequence cannot determine a unique amino acid, they will be translated to AA_X if allow_ambiguous_codons is true and otherwise result in an error. For organisms that utilize alternative start codons, one can set alternative_start=true, in which case the first codon will always be converted to a methionine.

source
BioSequences.ncbi_trans_tableConstant

Genetic code list of NCBI.

The standard genetic code is ncbi_trans_table[1] and others can be shown by show(ncbi_trans_table). For more details, consult the next link: http://www.ncbi.nlm.nih.gov/Taxonomy/taxonomyhome.html/index.cgi?chapter=cgencodes.

source
julia> ncbi_trans_table
 Translation Tables:
   1. The Standard Code (standard_genetic_code)
   2. The Vertebrate Mitochondrial Code (vertebrate_mitochondrial_genetic_code)
@@ -80,4 +80,4 @@
  23. Thraustochytrium Mitochondrial Code (thraustochytrium_mitochondrial_genetic_code)
  24. Pterobranchia Mitochondrial Code (pterobrachia_mitochondrial_genetic_code)
  25. Candidate Division SR1 and Gracilibacteria Code (candidate_division_sr1_genetic_code)
-

https://www.ncbi.nlm.nih.gov/Taxonomy/taxonomyhome.html/index.cgi?chapter=cgencodes

+

https://www.ncbi.nlm.nih.gov/Taxonomy/taxonomyhome.html/index.cgi?chapter=cgencodes

diff --git a/dev/types/index.html b/dev/types/index.html index fcc6ed86..6b3ae1a8 100644 --- a/dev/types/index.html +++ b/dev/types/index.html @@ -1,2 +1,2 @@ -BioSequences Types · BioSequences.jl

Abstract Types

BioSequences exports an abstract BioSequence type, and several concrete sequence types which inherit from it.

The abstract BioSequence

BioSequences provides an abstract type called a BioSequence{A<:Alphabet}. This abstract type, and the methods and traits it supports, allows many algorithms in BioSequences to be written as generically as possible, thus reducing the amount of code to read and understand, whilst maintaining high performance when such code is compiled for a concrete BioSequence subtype. Additionally, it allows new types to be implemented that are fully compatible with the rest of BioSequences, provided that key methods or traits are defined.

BioSequences.BioSequenceType
BioSequence{A <: Alphabet}

BioSequence is the main abstract type of BioSequences. It abstracts over the internal representation of different biological sequences, and is parameterized by an Alphabet, which controls the element type.

Extended help

Its subtypes are characterized by:

  • Being a linear container type with random access and indices Base.OneTo(length(x)).
  • Containing zero or more internal data elements of type encoded_data_eltype(typeof(x)).
  • Being associated with an Alphabet, A by being a subtype of BioSequence{A}.

A BioSequence{A} is indexed by an integer. The biosequence subtype, the index and the alphabet A determine how to extract the internal encoded data. The alphabet decides how to decode the data to the element type of the biosequence. Hence, the element type and container type of a BioSequence are separated.

Subtypes T of BioSequence must implement the following, with E being an encoded data type:

  • Base.length(::T)::Int
  • encoded_data_eltype(::Type{T})::Type{E}
  • extract_encoded_element(::T, ::Integer)::E
  • copy(::T)
  • T must be able to be constructed from any iterable with length defined and with a known, compatible element type.

Furthermore, mutable sequences should implement

  • encoded_setindex!(::T, ::E, ::Integer)
  • T(undef, ::Int)
  • resize!(::T, ::Int)

For compatibility with existing Alphabets, the encoded data eltype must be UInt.

source

Some aliases for BioSequence are also provided for your convenience:

Let's have a closer look at some of those methods that a subtype of BioSequence must implement. Check out julia base library docs for length, copy and resize!.

BioSequences.encoded_data_eltypeFunction
encoded_data_eltype(::Type{<:BioSequence})

Returns the element type of the encoded data of the BioSequence. This is the return type of extract_encoded_element, i.e. the data type that stores the biological symbols in the biosequence.

See also: BioSequence

source
BioSequences.extract_encoded_elementFunction
extract_encoded_element(::BioSequence{A}, i::Integer)

Returns the encoded element at position i. This data can be decoded using decode(A(), data) to yield the element type of the biosequence.

See also: BioSequence

source
BioSequences.encoded_setindex!Function
encoded_setindex!(seq::BioSequence, x::E, i::Integer)

Given encoded data x of type encoded_data_eltype(typeof(seq)), sets the internal sequence data at the given index.

See also: BioSequence

source

For a correctly defined subtype of BioSequence that satisfies the interface, the vast majority of methods described in the rest of this manual should work out of the box. They can always be overloaded if needed. Indeed, some of the generic BioSequence methods are overloaded for LongSequence, for example for transformation and counting operations, where efficiency gains can be made due to the specific internal representation of that type.

The abstract Alphabet

Alphabets control how biological symbols are encoded and decoded. They also confer many of the automatic traits and methods that any subtype of T<:BioSequence{A<:Alphabet} will get.

BioSequences.AlphabetType
Alphabet

Alphabet is the most important type trait for BioSequence. An Alphabet represents a set of biological symbols encoded by a sequence, e.g. A, C, G and T for a DNA Alphabet that requires only 2 bits to represent each symbol.

Extended help

  • Subtypes of Alphabet are singleton structs that may or may not be parameterized.
  • Alphabets span over a finite set of biological symbols.
  • The alphabet controls the encoding from some internal "encoded data" to a BioSymbol of the alphabet's element type, as well as the decoding, the inverse process.
  • An Alphabet's encode method must not produce invalid data.

Every subtype A of Alphabet must implement:

  • Base.eltype(::Type{A})::Type{S} for some eltype S, which must be a BioSymbol.
  • symbols(::A)::Tuple{Vararg{S}}. This gives tuples of all symbols in the set of A.
  • encode(::A, ::S)::E encodes a symbol to an internal data eltype E.
  • decode(::A, ::E)::S decodes an internal data eltype E to a symbol S.
  • Except for eltype which must follow Base conventions, all functions operating on Alphabet should operate on instances of the alphabet, not the type.

If you want interoperation with existing subtypes of BioSequence, the encoded representation E must be of type UInt, and you must also implement:

  • BitsPerSymbol(::A)::BitsPerSymbol{N}, where the N must be zero or a power of two in [1, 2, 4, 8, 16, 32, [64 for 64-bit systems]].

For increased performance, see BioSequences.AsciiAlphabet

source
BioSequences.AsciiAlphabetType
AsciiAlphabet

Trait for alphabet using ASCII characters as String representation. Define codetype(A) = AsciiAlphabet() for a user-defined Alphabet A to gain speed. Methods needed: BioSymbols.stringbyte(::eltype(A)) and ascii_encode(A, ::UInt8).

source

Concrete types

Implemented alphabets

BioSequences.DNAAlphabetType

DNA nucleotide alphabet.

DNAAlphabet has a parameter N which is a number that determines the BitsPerSymbol trait. Currently supported values of N are 2 and 4.

source
BioSequences.RNAAlphabetType

RNA nucleotide alphabet.

RNAAlphabet has a parameter N which is a number that determines the BitsPerSymbol trait. Currently supported values of N are 2 and 4.

source

Long Sequences

BioSequences.LongSequenceType
LongSequence{A <: Alphabet}

General-purpose BioSequence. This type is mutable and variable-length, and should be preferred for most use cases.

Extended help

LongSequence{A<:Alphabet} <: BioSequence{A} is parameterized by a concrete Alphabet type A that defines the domain (or set) of biological symbols permitted.

As the BioSequence interface definition implies, LongSequences store the biological symbol elements that they contain in a succinct encoded form that permits many operations to be done in an efficient bit-parallel manner. As per the interface of BioSequence, the Alphabet determines how an element is encoded or decoded when it is inserted or extracted from the sequence.

For example, AminoAcidAlphabet is associated with AminoAcid and hence an object of the LongSequence{AminoAcidAlphabet} type represents a sequence of amino acids.

Symbols from multiple alphabets can't be intermixed in one sequence type.

The following table summarizes common LongSequence types that have been given aliases for convenience.

Type                              Symbol type   Type alias
LongSequence{DNAAlphabet{N}}      DNA           LongDNA{N}
LongSequence{RNAAlphabet{N}}      RNA           LongRNA{N}
LongSequence{AminoAcidAlphabet}   AminoAcid     LongAA

The LongDNA and LongRNA aliases use a DNAAlphabet{4}.

DNAAlphabet{4} permits ambiguous nucleotides, and a sequence must use at least 4 bits to internally store each element (and indeed LongSequence does).

If you are sure that you are working with sequences with no ambiguous nucleotides, you can use LongSequences parameterised with DNAAlphabet{2} instead.

DNAAlphabet{2} is an alphabet that uses two bits per base and limits to only unambiguous nucleotide symbols (A,C,G,T).

Changing this single parameter, is all you need to do in order to benefit from memory savings. Some computations that use bitwise operations will also be dramatically faster.

The same applies with LongSequence{RNAAlphabet{4}}, simply replace the alphabet parameter with RNAAlphabet{2} in order to benefit.

source

Sequence views

Similar to how Base Julia offers views of array objects, BioSequences offers views of LongSequences - the LongSubSeq{A<:Alphabet}.

Conceptually, a LongSubSeq{A} is similar to a LongSequence{A}, but instead of storing its own data, it refers to the data of a LongSequence. Modifying the LongSequence will be reflected in the view, and vice versa. If the underlying LongSequence is truncated, the behaviour of a view is undefined. For the same reason, some operations are not supported for views, such as resizing.

The purpose of LongSubSeq is that, since they only contain a pointer to the underlying array, an offset and a length, they are much lighter than LongSequences, and will be stack allocated on Julia 1.5 and newer. Thus, the user may construct millions of views without major performance implications.

+BioSequences Types · BioSequences.jl

Abstract Types

BioSequences exports an abstract BioSequence type, and several concrete sequence types which inherit from it.

The abstract BioSequence

BioSequences provides an abstract type called a BioSequence{A<:Alphabet}. This abstract type, and the methods and traits it supports, allows many algorithms in BioSequences to be written as generically as possible, thus reducing the amount of code to read and understand, whilst maintaining high performance when such code is compiled for a concrete BioSequence subtype. Additionally, it allows new types to be implemented that are fully compatible with the rest of BioSequences, provided that key methods or traits are defined.

BioSequences.BioSequenceType
BioSequence{A <: Alphabet}

BioSequence is the main abstract type of BioSequences. It abstracts over the internal representation of different biological sequences, and is parameterized by an Alphabet, which controls the element type.

Extended help

Its subtypes are characterized by:

  • Being a linear container type with random access and indices Base.OneTo(length(x)).
  • Containing zero or more internal data elements of type encoded_data_eltype(typeof(x)).
  • Being associated with an Alphabet, A by being a subtype of BioSequence{A}.

A BioSequence{A} is indexed by an integer. The biosequence subtype, the index and the alphabet A determine how to extract the internal encoded data. The alphabet decides how to decode the data to the element type of the biosequence. Hence, the element type and container type of a BioSequence are separated.

Subtypes T of BioSequence must implement the following, with E being an encoded data type:

  • Base.length(::T)::Int
  • encoded_data_eltype(::Type{T})::Type{E}
  • extract_encoded_element(::T, ::Integer)::E
  • copy(::T)
  • T must be able to be constructed from any iterable with length defined and with a known, compatible element type.

Furthermore, mutable sequences should implement

  • encoded_setindex!(::T, ::E, ::Integer)
  • T(undef, ::Int)
  • resize!(::T, ::Int)

For compatibility with existing Alphabets, the encoded data eltype must be UInt.

source

Some aliases for BioSequence are also provided for your convenience:

Let's have a closer look at some of those methods that a subtype of BioSequence must implement. Check out julia base library docs for length, copy and resize!.

BioSequences.encoded_data_eltypeFunction
encoded_data_eltype(::Type{<:BioSequence})

Returns the element type of the encoded data of the BioSequence. This is the return type of extract_encoded_element, i.e. the data type that stores the biological symbols in the biosequence.

See also: BioSequence

source
BioSequences.extract_encoded_elementFunction
extract_encoded_element(::BioSequence{A}, i::Integer)

Returns the encoded element at position i. This data can be decoded using decode(A(), data) to yield the element type of the biosequence.

See also: BioSequence

source
BioSequences.encoded_setindex!Function
encoded_setindex!(seq::BioSequence, x::E, i::Integer)

Given encoded data x of type encoded_data_eltype(typeof(seq)), sets the internal sequence data at the given index.

See also: BioSequence

source

For a correctly defined subtype of BioSequence that satisfies the interface, the vast majority of methods described in the rest of this manual should work out of the box. They can always be overloaded if needed. Indeed, some of the generic BioSequence methods are overloaded for LongSequence, for example for transformation and counting operations, where efficiency gains can be made due to the specific internal representation of that type.

The abstract Alphabet

Alphabets control how biological symbols are encoded and decoded. They also confer many of the automatic traits and methods that any subtype of T<:BioSequence{A<:Alphabet} will get.
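To make the encode/decode round trip concrete, here is a minimal sketch using the built-in two-bit DNA alphabet. It assumes that `encode`, `decode` and `symbols` are reachable as unexported functions under the `BioSequences.` prefix, and it deliberately avoids depending on the specific bit pattern chosen for each symbol.

```julia
using BioSequences

# Alphabets are singleton structs, so we work with an instance.
alphabet = DNAAlphabet{2}()

# All symbols the alphabet can represent
# (assumption: `symbols` is accessible via the module prefix).
syms = BioSequences.symbols(alphabet)

# Encode a symbol to the internal representation, then decode it again.
encoded = BioSequences.encode(alphabet, DNA_G)
decoded = BioSequences.decode(alphabet, encoded)
```

Decoding the encoded value returns the original symbol, which is the property any user-defined Alphabet is expected to preserve.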

BioSequences.AlphabetType
Alphabet

Alphabet is the most important type trait for BioSequence. An Alphabet represents a set of biological symbols encoded by a sequence, e.g. A, C, G and T for a DNA Alphabet that requires only 2 bits to represent each symbol.

Extended help

  • Subtypes of Alphabet are singleton structs that may or may not be parameterized.
  • Alphabets span over a finite set of biological symbols.
  • The alphabet controls the encoding of a BioSymbol of the alphabet's element type to some internal "encoded data", as well as the decoding, the inverse process.
  • An Alphabet's encode method must not produce invalid data.

Every subtype A of Alphabet must implement:

  • Base.eltype(::Type{A})::Type{S} for some eltype S, which must be a BioSymbol.
  • symbols(::A)::Tuple{Vararg{S}}. This returns a tuple of all symbols in the set of A.
  • encode(::A, ::S)::E encodes a symbol to an internal data eltype E.
  • decode(::A, ::E)::S decodes an internal data eltype E to a symbol S.
  • Except for eltype which must follow Base conventions, all functions operating on Alphabet should operate on instances of the alphabet, not the type.

If you want interoperation with existing subtypes of BioSequence, the encoded representation E must be of type UInt, and you must also implement:

  • BitsPerSymbol(::A)::BitsPerSymbol{N}, where N must be zero or one of the powers of two 1, 2, 4, 8, 16, 32 (or 64 on 64-bit systems).

For increased performance, see BioSequences.AsciiAlphabet
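The requirements listed above can be sketched with a small user-defined alphabet. RYAlphabet below is a hypothetical two-symbol alphabet (purine R and pyrimidine Y) invented for illustration; it is not part of BioSequences. The sketch defines only the required methods and checks the encode/decode round trip. It does not show that every generic sequence method works with this alphabet.

```julia
using BioSequences

# Hypothetical alphabet: a singleton struct subtyping Alphabet.
struct RYAlphabet <: Alphabet end

# Element type, following Base conventions (defined on the type).
Base.eltype(::Type{RYAlphabet}) = DNA

# All symbols in the alphabet, as a tuple.
BioSequences.symbols(::RYAlphabet) = (DNA_R, DNA_Y)

# Two symbols fit in one bit.
BioSequences.BitsPerSymbol(::RYAlphabet) = BioSequences.BitsPerSymbol{1}()

# encode must never produce invalid data, so reject other symbols.
function BioSequences.encode(::RYAlphabet, x::DNA)
    x == DNA_R && return UInt(0)
    x == DNA_Y && return UInt(1)
    throw(DomainError(x, "symbol is not part of RYAlphabet"))
end

# decode is the inverse of encode.
BioSequences.decode(::RYAlphabet, x::UInt) = iszero(x) ? DNA_R : DNA_Y

alph = RYAlphabet()
roundtrip = map(s -> BioSequences.decode(alph, BioSequences.encode(alph, s)),
                BioSequences.symbols(alph))
println(roundtrip)
```

Throwing from encode for symbols outside the set is one way to satisfy the rule that encode must not produce invalid data.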

BioSequences.AsciiAlphabet - Type
AsciiAlphabet

Trait for alphabet using ASCII characters as String representation. Define codetype(A) = AsciiAlphabet() for a user-defined Alphabet A to gain speed. Methods needed: BioSymbols.stringbyte(::eltype(A)) and ascii_encode(A, ::UInt8).


Concrete types

Implemented alphabets

BioSequences.DNAAlphabet - Type

DNA nucleotide alphabet.

DNAAlphabet has a parameter N which is a number that determines the BitsPerSymbol trait. Currently supported values of N are 2 and 4.

BioSequences.RNAAlphabet - Type

RNA nucleotide alphabet.

RNAAlphabet has a parameter N which is a number that determines the BitsPerSymbol trait. Currently supported values of N are 2 and 4.


Long Sequences

BioSequences.LongSequence - Type
LongSequence{A <: Alphabet}

General-purpose BioSequence. This type is mutable and variable-length, and should be preferred for most use cases.

Extended help

LongSequence{A<:Alphabet} <: BioSequence{A} is parameterized by a concrete Alphabet type A that defines the domain (or set) of biological symbols permitted.

As the BioSequence interface definition implies, LongSequences store the biological symbol elements that they contain in a succinct encoded form that permits many operations to be done in an efficient bit-parallel manner. As per the interface of BioSequence, the Alphabet determines how an element is encoded or decoded when it is inserted or extracted from the sequence.

For example, AminoAcidAlphabet is associated with AminoAcid and hence an object of the LongSequence{AminoAcidAlphabet} type represents a sequence of amino acids.

Symbols from multiple alphabets can't be intermixed in one sequence type.

The following table summarizes common LongSequence types that have been given aliases for convenience.

Type                             Symbol type  Type alias
LongSequence{DNAAlphabet{N}}     DNA          LongDNA{N}
LongSequence{RNAAlphabet{N}}     RNA          LongRNA{N}
LongSequence{AminoAcidAlphabet}  AminoAcid    LongAA

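The parameter N in LongDNA{N} and LongRNA{N} is the number of bits per symbol of the underlying alphabet: LongDNA{4} uses DNAAlphabet{4} and LongRNA{4} uses RNAAlphabet{4}.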
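A short sketch of how the aliases relate to the full type names, and of the mutable, variable-length behaviour described earlier. It uses only the string constructors and push!; the variable names are illustrative.

```julia
using BioSequences

# An alias and the full parameterised type are the same type.
same_alias = LongDNA{4} === LongSequence{DNAAlphabet{4}}

dna_seq = LongDNA{4}("TTANC")
rna_seq = LongRNA{4}("UUANC")
aa_seq  = LongAA("MKV")

# The alphabet determines the element type of the sequence.
println(eltype(dna_seq), " ", eltype(rna_seq), " ", eltype(aa_seq))

# LongSequence is mutable and variable-length.
push!(dna_seq, DNA_G)
println(dna_seq)
```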

DNAAlphabet{4} permits ambiguous nucleotides, and a sequence must use at least 4 bits to internally store each element (and indeed LongSequence does).

If you are sure that you are working with sequences with no ambiguous nucleotides, you can use LongSequences parameterised with DNAAlphabet{2} instead.

DNAAlphabet{2} is an alphabet that uses two bits per base and is limited to the unambiguous nucleotide symbols (A, C, G, T).

Changing this single parameter is all you need to do in order to benefit from memory savings. Some computations that use bitwise operations will also be dramatically faster.

The same applies to LongSequence{RNAAlphabet{4}}: simply replace the alphabet parameter with RNAAlphabet{2} in order to benefit.
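The trade-off between the 2-bit and 4-bit alphabets can be sketched as follows. Base.summarysize serves here as a rough proxy for memory use; the exact byte counts depend on the Julia and BioSequences versions, so only the relative sizes are compared.

```julia
using BioSequences

text = "ACGT"^250                    # 1000 unambiguous nucleotides

seq4 = LongDNA{4}(text)              # 4 bits per symbol
seq2 = LongDNA{2}(text)              # 2 bits per symbol

# Same symbols, smaller footprint for the 2-bit sequence.
println(string(seq2) == string(seq4))
println(Base.summarysize(seq2) < Base.summarysize(seq4))

# The 2-bit alphabet cannot represent ambiguous symbols such as N.
rejected = try
    LongDNA{2}("ACGN")
    false
catch
    true
end
println(rejected)
```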


Sequence views

Similar to how Base Julia offers views of array objects, BioSequences offers views of LongSequences: the LongSubSeq{A<:Alphabet} type.

Conceptually, a LongSubSeq{A} is similar to a LongSequence{A}, but instead of storing its own data, it refers to the data of a LongSequence. Modifying the LongSequence will be reflected in the view, and vice versa. If the underlying LongSequence is truncated, the behaviour of a view is undefined. For the same reason, some operations, such as resizing, are not supported for views.

The purpose of LongSubSeq is that, since they only contain a pointer to the underlying array, an offset and a length, they are much lighter than LongSequences, and will be stack allocated on Julia 1.5 and newer. Thus, the user may construct millions of views without major performance implications.
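A minimal sketch of the shared-data behaviour described above. It assumes that calling view on a LongSequence with a unit range returns a LongSubSeq, as in BioSequences v3.

```julia
using BioSequences

seq = LongDNA{4}("GAGCTGA")
v = view(seq, 2:5)                   # view of positions 2 to 5: AGCT

# Writing through the view changes the parent sequence...
v[1] = DNA_T

# ...and writing to the parent is visible through the view.
seq[5] = DNA_A

println(seq)
println(v)
```

Because the view shares the parent's storage, copy it (for example with LongDNA{4}(v)) if an independent sequence is needed.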