-
-------------------------------------------------------------------------
-
-## Overview
-
-> This is a species-agnostic, algorithm extensible, sequence-anonymous
-> (genome, metagenomes) *gene finder* library for the Julia Language.
-
-The main goal is to create a versatile module that enables apply
-different implemented algorithm to DNA sequences. See, for instance,
-BioAlignment implementations of different sequence alignment algorithms
-(local, global, edit-distance).
-
-## Installation
-
-You can install GeneFinder from the julia REPL. Press `]` to enter pkg
-mode, and enter the following:
-
- add GeneFinder
-
-If you are interested in the cutting edge of the development, please
-check out the master branch to try new features before release.
-
-## Algorithms
-
-### Coding genes (CDS - ORFs)
-
-- ☒ [Simple
- finder](https://camilogarciabotero.github.io/GeneFinder.jl/dev/simplefinder/)
-- ☐ EasyGene
-- ☐ GLIMMER
-- ☐ Prodigal - Pyrodigal
-- ☐ PHANOTATE
-- ☐ k-mer based gene finders (?)
-- ☐ Augustus (?)
-
-### Non-coding genes (RNA)
-
-- ☐ Infernal
-- ☐ tRNAscan
-
-## Other features
-
-- ☐ parallelism SIMD ?
-- ☐ memory management (?)
-- ☐ specialized types
- - ☒ Gene
- - ☒ ORF
- - ☒ CDS
- - ☐ EukaryoticGene (?)
- - ☐ ProkaryoticGene (?)
- - ☐ Codon
- - ☐ Intron
- - ☐ Exon
- - ☐ GFF –\> See other packages
- - ☐ FASTX –\> See I/O in other packages
-
-## Compatibilities
-
-Must interact with or extend:
-
-- GenomicAnnotations.jl
-- BioSequences.jl
-- SequenceVariation.jl
-- GenomicFeatures.jl
-- FASTX.jl
-- Kmers.jl
-
-## Contributing
-
-## Citing
-
diff --git a/docs/src/index.md b/docs/src/index.md
index aa26b0e..8d1d39e 100644
--- a/docs/src/index.md
+++ b/docs/src/index.md
@@ -32,6 +32,9 @@
+
+
+
@@ -63,7 +66,7 @@ add GeneFinder
author = {Camilo García},
title = {GeneFinder.jl},
url = {https://github.com/camilogarciabotero/GeneFinder.jl},
- version = {v0.0.23},
+ version = {v0.2.0},
year = {2022},
month = {11}
}
diff --git a/docs/src/iodocs.md b/docs/src/iodocs.md
new file mode 100644
index 0000000..f698e35
--- /dev/null
+++ b/docs/src/iodocs.md
@@ -0,0 +1,96 @@
+## Writting ORF information into bioinformatic formats
+
+This package facilitates the creation of `FASTA`, `BED`, and `GFF` files, specifically extracting Open Reading Frame (ORF) information from `BioSequence` instances, particularly those of type `NucleicSeqOrView{A} where A`, and then writing the information into the desired format.
+
+Functionality:
+
+The package provides four distinct functions for writing files in different formats:
+
+| Function | Description |
+|-------------------|--------------------------------------------------------|
+| `write_orfs_fna` | Writes nucleotide sequences in FASTA format. |
+| `write_orfs_faa` | Writes amino acid sequences in FASTA format. |
+| `write_orfs_bed` | Outputs information in BED format. |
+| `write_orfs_gff` | Generates files in GFF format. |
+
+
+All these functions support processing both `BioSequence` instances and external `FASTA` files. In the case of a `BioSequence` instace into external files, simply provide the path to the `FASTA` file using a `String` to the path. To demonstrate the use of the `write_*` methods with a `BioSequence`, consider the following example:
+
+```julia
+using BioSequences, GeneFinder
+
+# > 180195.SAMN03785337.LFLS01000089 -> finds only 1 gene in Prodigal (from Pyrodigal tests)
+seq = dna"AACCAGGGCAATATCAGTACCGCGGGCAATGCAACCCTGACTGCCGGCGGTAACCTGAACAGCACTGGCAATCTGACTGTGGGCGGTGTTACCAACGGCACTGCTACTACTGGCAACATCGCACTGACCGGTAACAATGCGCTGAGCGGTCCGGTCAATCTGAATGCGTCGAATGGCACGGTGACCTTGAACACGACCGGCAATACCACGCTCGGTAACGTGACGGCACAAGGCAATGTGACGACCAATGTGTCCAACGGCAGTCTGACGGTTACCGGCAATACGACAGGTGCCAACACCAACCTCAGTGCCAGCGGCAACCTGACCGTGGGTAACCAGGGCAATATCAGTACCGCAGGCAATGCAACCCTGACGGCCGGCGACAACCTGACGAGCACTGGCAATCTGACTGTGGGCGGCGTCACCAACGGCACGGCCACCACCGGCAACATCGCGCTGACCGGTAACAATGCACTGGCTGGTCCTGTCAATCTGAACGCGCCGAACGGCACCGTGACCCTGAACACAACCGGCAATACCACGCTGGGTAATGTCACCGCACAAGGCAATGTGACGACTAATGTGTCCAACGGCAGCCTGACAGTCGCTGGCAATACCACAGGTGCCAACACCAACCTGAGTGCCAGCGGCAATCTGACCGTGGGCAACCAGGGCAATATCAGTACCGCGGGCAATGCAACCCTGACTGCCGGCGGTAACCTGAGC"
+```
+
+Once a `BioSequence` object has been instantiated, the `write_orfs_fna` function proves useful for generating a `FASTA` file containing the nucleotide sequences of the ORFs. Notably, the `write_orfs*` methods support either an `IOStream` or an `IOBuffer` as an output argument, allowing flexibility in directing the output either to a file or a buffer. In the following example, we demonstrate writing the output directly to a file.
+
+```julia
+outfile = "LFLS01000089.fna"
+
+open(outfile, "w") do io
+ write_orfs_fna(seq, io)
+end
+```
+
+```bash
+cat LFLS01000089.fna
+
+>ORF01 id=01 start=29 stop=40 strand=+ frame=2
+ATGCAACCCTGA
+>ORF02 id=02 start=137 stop=145 strand=+ frame=2
+ATGCGCTGA
+>ORF03 id=03 start=164 stop=184 strand=+ frame=2
+ATGCGTCGAATGGCACGGTGA
+>ORF04 id=04 start=173 stop=184 strand=+ frame=2
+ATGGCACGGTGA
+>ORF05 id=05 start=236 stop=241 strand=+ frame=2
+ATGTGA
+>ORF06 id=06 start=248 stop=268 strand=+ frame=2
+ATGTGTCCAACGGCAGTCTGA
+>ORF07 id=07 start=362 stop=373 strand=+ frame=2
+ATGCAACCCTGA
+>ORF08 id=08 start=470 stop=496 strand=+ frame=2
+ATGCACTGGCTGGTCCTGTCAATCTGA
+>ORF09 id=09 start=551 stop=574 strand=+ frame=2
+ATGTCACCGCACAAGGCAATGTGA
+>ORF10 id=10 start=569 stop=574 strand=+ frame=2
+ATGTGA
+>ORF11 id=11 start=581 stop=601 strand=+ frame=2
+ATGTGTCCAACGGCAGCCTGA
+>ORF12 id=12 start=695 stop=706 strand=+ frame=2
+ATGCAACCCTGA
+```
+
+### Combining `FASTX` for reading and writing fastas
+
+We can now combine the `FASTX` package with the function `write_orfs_faa` to write a `FASTA` file with the protein sequences of the translated ORFs obtained from an external `FASTA` file.
+
+```julia
+infile = "test/data/NC_001884.fasta"
+outfile = "test/data/NC_001884-orfs.faa"
+
+open(inputfile) do io
+ write_orfs_faa(infile, outfile)
+end
+```
+
+```bash
+head test/data/NC_001884-orfs.faa
+
+>ORF0001 id=0001 start=41 stop=145 strand=- frame=2
+MTQKRKGPIPAQFEITPILRFNFIFDLTATNSFH*
+>ORF0002 id=0002 start=41 stop=172 strand=- frame=2
+MVLKDVIVNMTQKRKGPIPAQFEITPILRFNFIFDLTATNSFH*
+>ORF0003 id=0003 start=41 stop=454 strand=- frame=2
+MSEHLSQKEKELKNKENFIFDKYESGIYSDELFLKRKAALDEEFKELQNAKNELNGLQDTQSEIDSNTVRNNINKIIDQYHIESSSEKKNELLRMVLKDVIVNMTQKRKGPIPAQFEITPILRFNFIFDL
+TATNSFH*
+>ORF0004 id=0004 start=41 stop=472 strand=- frame=2
+MKTKKQMSEHLSQKEKELKNKENFIFDKYESGIYSDELFLKRKAALDEEFKELQNAKNELNGLQDTQSEIDSNTVRNNINKIIDQYHIESSSEKKNELLRMVLKDVIVNMTQKRKGPIPAQFEITPILRF
+NFIFDLTATNSFH*
+>ORF0005 id=0005 start=41 stop=505 strand=- frame=2
+MLSKYEDDNSNMKTKKQMSEHLSQKEKELKNKENFIFDKYESGIYSDELFLKRKAALDEEFKELQNAKNELNGLQDTQSEIDSNTVRNNINKIIDQYHIESSSEKKNELLRMVLKDVIVNMTQKRKGPIP
+AQFEITPILRFNFIFDLTATNSFH*
+```
+
+This could also be done to writting a `FASTA` file with the nucleotide sequences of the ORFs using the `write_orfs_fna` function. Similarly for the `BED` and `GFF` files using the `write_orfs_bed` and `write_orfs_gff` functions respectively.
\ No newline at end of file
diff --git a/docs/src/markovchains.html b/docs/src/markovchains.html
deleted file mode 100644
index b451516..0000000
--- a/docs/src/markovchains.html
+++ /dev/null
@@ -1,445 +0,0 @@
-
-
-
-
-
-
-
-
-
-markovchains
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
Representing DNA sequences as a Markov chain
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
One important step towards gene finding algorithms is to represent a DNA sequence as a Markov chain Figure 1. In this representation a DNA sequence of a reduced alphabet \(\scrA = {A, C, G, T}\) is draw as a four-vertex graph, where each letter of the \(\scrA\) is a state and the edges of the graph represent transitions from one nucleotide to another in a sequence. This is also consdired more specifically as a Discrete Markov chain @axelson-fisk_comparative_2015.
-
-
-
-
-
\ No newline at end of file
diff --git a/docs/src/simplecodingrule.md b/docs/src/simplecodingrule.md
new file mode 100644
index 0000000..6175547
--- /dev/null
+++ b/docs/src/simplecodingrule.md
@@ -0,0 +1,29 @@
+## The *log-odds ratio* decision rule
+
+The sequence probability given a transition probability model (eq. 2) could be used as the source of a sequence classification based on a decision rule to classify whether a sequence correspond to a model or another. Now, imagine we got two DNA sequence transition models, a CDS model and a No-CDS model. The *log-odds ratio* decision rule could be establish as:
+
+``` math
+\begin{align}
+S(X) = \log \frac{{P_C(X_1=i_1, \ldots, X_T=i_T)}}{{P_N(X_1=i_1, \ldots, X_T=i_T)}} \begin{cases} > \eta & \Rightarrow \text{coding} \\ < \eta & \Rightarrow \text{noncoding} \end{cases}
+\end{align}
+```
+
+Where the ``P_{C}`` is the probability of the sequence given a
+CDS model, ``P_{N}`` is the probability of the sequence given a
+No-CDS model, the decision rule is finally based on whether the ratio is
+greater or lesser than a given threshold *η* of significance level.
+
+In the GeneFinder we have implemented this rule and a couple of basic
+transition probability models of CDS and No-CDS of *E. coli* from
+Axelson-Fisk (2015) work. To check whether a random sequence could be
+coding based on these decision we use the predicate `iscoding` with the
+`ECOLICDS` and `ECOLINOCDS` models:
+
+``` julia
+using GeneFinder, BioSequences
+randseq = get_orfs_dna(randdnaseq(99))[1] # this will retrieved a random coding ORF
+
+iscoding(randseq, ECOLICDS, ECOLINOCDS)
+```
+
+ true
\ No newline at end of file
diff --git a/docs/src/simplefinder.html b/docs/src/simplefinder.html
deleted file mode 100644
index 9ff4243..0000000
--- a/docs/src/simplefinder.html
+++ /dev/null
@@ -1,555 +0,0 @@
-
-
-
-
-
-
-
-
-
-simplefinder
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
A simple algorithm
-
The first implemented function is simplefinder a very non-restrictive ORF finder function that will catch all ORFs in a dedicated structure. Note that this will catch random ORFs not necesarily genes since it has no ORFs size or overlapping condition contraints. Thus it might consider aa"M*" a posible encoding protein from the resulting ORFs.
-
-
usingBioSequences, GeneFinder
-
-# > 180195.SAMN03785337.LFLS01000089 -> finds only 1 gene in Prodigal (from Pyrodigal tests)
-seq = dna"AACCAGGGCAATATCAGTACCGCGGGCAATGCAACCCTGACTGCCGGCGGTAACCTGAACAGCACTGGCAATCTGACTGTGGGCGGTGTTACCAACGGCACTGCTACTACTGGCAACATCGCACTGACCGGTAACAATGCGCTGAGCGGTCCGGTCAATCTGAATGCGTCGAATGGCACGGTGACCTTGAACACGACCGGCAATACCACGCTCGGTAACGTGACGGCACAAGGCAATGTGACGACCAATGTGTCCAACGGCAGTCTGACGGTTACCGGCAATACGACAGGTGCCAACACCAACCTCAGTGCCAGCGGCAACCTGACCGTGGGTAACCAGGGCAATATCAGTACCGCAGGCAATGCAACCCTGACGGCCGGCGACAACCTGACGAGCACTGGCAATCTGACTGTGGGCGGCGTCACCAACGGCACGGCCACCACCGGCAACATCGCGCTGACCGGTAACAATGCACTGGCTGGTCCTGTCAATCTGAACGCGCCGAACGGCACCGTGACCCTGAACACAACCGGCAATACCACGCTGGGTAATGTCACCGCACAAGGCAATGTGACGACTAATGTGTCCAACGGCAGCCTGACAGTCGCTGGCAATACCACAGGTGCCAACACCAACCTGAGTGCCAGCGGCAATCTGACCGTGGGCAACCAGGGCAATATCAGTACCGCGGGCAATGCAACCCTGACTGCCGGCGGTAACCTGAGC"
-
-
726nt DNA Sequence:
-AACCAGGGCAATATCAGTACCGCGGGCAATGCAACCCTG…GCGGGCAATGCAACCCTGACTGCCGGCGGTAACCTGAGC
Two other functions (cdsgenerator and proteingenerator) pass the sequence to simplefinder take the ORFs and act as generators of the sequence, so this way the can be collected in the REPL as an standard output or writeen into a file more conviniently using the FASTX IO system:
-
-
-
-
\ No newline at end of file
diff --git a/docs/src/simplefinder.md b/docs/src/simplefinder.md
index 3a69ab3..faac22a 100644
--- a/docs/src/simplefinder.md
+++ b/docs/src/simplefinder.md
@@ -1,5 +1,5 @@
-## Finding complete and internal (overlapped) ORFs
+## Finding complete and overlapped ORFs
The first implemented function is `findorfs` a very non-restrictive ORF finder function that will catch all ORFs in a dedicated structure. Note that this will catch random ORFs not necesarily genes since it has no ORFs size or overlapping condition contraints. Thus it might consider `aa"M*"` a posible encoding protein from the resulting ORFs.
@@ -29,10 +29,10 @@ findorfs(seq)
ORF(695:706, '+', 2)
```
-Two other functions (`getorfdna` and `getorfaa`) pass the sequence to `findorfs` take the ORFs and act as generators of the sequence, so this way the can be `collect`ed in the REPL as an standard output or writteen into a file more conviniently using the `FASTX` IO system:
+Two other functions (`get_orfs_dna` and `get_orfs_aa`) are implemented to get the ORFs in DNA and amino acid sequences, respectively. They use the `findorfs` function to first get the ORFs and then get the correspondance array of `BioSequence` objects.
```julia
-getorfdna(seq)
+get_orfs_dna(seq)
12-element Vector{LongSubSeq{DNAAlphabet{4}}}:
ATGCAACCCTGA
@@ -50,7 +50,7 @@ getorfdna(seq)
```
```julia
-getorfaa(seq)
+get_orfs_aa(seq)
12-element Vector{LongSubSeq{AminoAcidAlphabet}}:
MQP*
@@ -67,94 +67,30 @@ getorfaa(seq)
MQP*
```
-### Writting cds, proteins fastas, bed and gffs whether from a `LongSeq` or from a external fasta file.
+## The ORF type
-```julia
-write_cds("cds.fasta", seq)
-```
-
-```bash
-cat cds.fasta
-
->location=29:40 strand=+ frame=2
-ATGCAACCCTGA
->location=137:145 strand=+ frame=2
-ATGCGCTGA
->location=164:184 strand=+ frame=2
-ATGCGTCGAATGGCACGGTGA
->location=173:184 strand=+ frame=2
-ATGGCACGGTGA
->location=236:241 strand=+ frame=2
-ATGTGA
->location=248:268 strand=+ frame=2
-ATGTGTCCAACGGCAGTCTGA
->location=362:373 strand=+ frame=2
-ATGCAACCCTGA
->location=470:496 strand=+ frame=2
-ATGCACTGGCTGGTCCTGTCAATCTGA
->location=551:574 strand=+ frame=2
-ATGTCACCGCACAAGGCAATGTGA
->location=569:574 strand=+ frame=2
-ATGTGA
->location=581:601 strand=+ frame=2
-ATGTGTCCAACGGCAGCCTGA
->location=695:706 strand=+ frame=2
-ATGCAACCCTGA
-```
-
-### Combining `FASTX` for reading and writing fastas
+For convenience, the `ORF` type is more stringent in preventing the creation of incompatible instances. As a result, attempting to create an instance with incompatible parameters will result in an error. For instance, the following code snippet will trigger an error:
```julia
-using FASTX
-
-write_proteins("test/data/NC_001884.fasta", "proteins.fasta")
-```
-
-```bash
-head proteins.fasta
-
->location=41:145 strand=- frame=2
-MTQKRKGPIPAQFEITPILRFNFIFDLTATNSFH*
->location=41:172 strand=- frame=2
-MVLKDVIVNMTQKRKGPIPAQFEITPILRFNFIFDLTATNSFH*
->location=41:454 strand=- frame=2
-MSEHLSQKEKELKNKENFIFDKYESGIYSDELFLKRKAALDEEFKELQNAKNELNGLQDTQSEIDSNTVRNNINKIIDQYHIESSSEKKNELLRMVLKDVIVNMTQKRKGPIPAQFEITPILRFNFIFDLTATNSFH*
->location=41:472 strand=- frame=2
-MKTKKQMSEHLSQKEKELKNKENFIFDKYESGIYSDELFLKRKAALDEEFKELQNAKNELNGLQDTQSEIDSNTVRNNINKIIDQYHIESSSEKKNELLRMVLKDVIVNMTQKRKGPIPAQFEITPILRFNFIFDLTATNSFH*
->location=41:505 strand=- frame=2
-MLSKYEDDNSNMKTKKQMSEHLSQKEKELKNKENFIFDKYESGIYSDELFLKRKAALDEEFKELQNAKNELNGLQDTQSEIDSNTVRNNINKIIDQYHIESSSEKKNELLRMVLKDVIVNMTQKRKGPIPAQFEITPILRFNFIFDLTATNSFH*
-```
-
-## The *log-odds ratio* decision rule
-
-The sequence probability given a transition probability model (eq. 2)
-could be used as the source of a sequence classification based on a
-decision rule to classify whether a sequence correspond to a model or
-another. Now, imagine we got two DNA sequence transition models, a CDS
-model and a No-CDS model. The *log-odds ratio* decision rule could be
-establish as:
-
-``` math
-\begin{align}
-S(X) = \log \frac{{P_C(X_1=i_1, \ldots, X_T=i_T)}}{{P_N(X_1=i_1, \ldots, X_T=i_T)}} \begin{cases} > \eta & \Rightarrow \text{coding} \\ < \eta & \Rightarrow \text{noncoding} \end{cases}
-\end{align}
+ORF(1:10, '+', 4)
+
+ERROR: AssertionError: Invalid frame value. Frame must be 1, 2, or 3.
+Stacktrace:
+ [1] ORF(location::UnitRange{Int64}, strand::Char, frame::Int64)
+ @ GeneFinder ~/.julia/dev/GeneFinder/src/types.jl:47
+ [2] top-level scope
+ @ REPL[25]:1
```
+
+Similar behavior will be encountered when the strand is neither `+` nor `-`. This precautionary measure helps prevent the creation of invalid ORFs, ensuring greater stability and enabling the extension of its interface. For example, after creating a specific `ORF`, users can seamlessly iterate over a sequence of interest and verify whether the ORF is contained within the sequence.
-Where the ``P_{C}`` is the probability of the sequence given a
-CDS model, ``P_{N}`` is the probability of the sequence given a
-No-CDS model, the decision rule is finally based on whether the ratio is
-greater or lesser than a given threshold *η* of significance level.
-
-In the GeneFinder we have implemented this rule and a couple of basic
-transition probability models of CDS and No-CDS of *E. coli* from
-Axelson-Fisk (2015) work. To check whether a random sequence could be
-coding based on these decision we use the predicate `iscoding` with the
-`ECOLICDS` and `ECOLINOCDS` models:
-
-``` julia
-randseq = getorfdna(randdnaseq(99))[1] # this will retrieved a random coding ORF
+```julia
+orf = ORF(137:145, '+', 2)
+seq[orf]
-iscoding(randseq, ECOLICDS, ECOLINOCDS)
+9nt DNA Sequence:
+ATGCGCTGA
```
- true
\ No newline at end of file
+!!! warning
+ It is still possible to create an `ORF` and pass it to a sequence that does not necessarily contain an actual open reading frame. This will be addressed in future versions of the package. But the benefit of having it is that it will retrieve the corresponding subsequence of the sequence in a convinient way (5' to 3') regardless of the strand.
\ No newline at end of file
diff --git a/docs/src/simplefinder.qmd b/docs/src/simplefinder.qmd
index 32d9da7..44feb26 100644
--- a/docs/src/simplefinder.qmd
+++ b/docs/src/simplefinder.qmd
@@ -47,7 +47,7 @@ getproteins(seq)
#| eval: false
using FASTX
-write_proteins("../../test/data/NC_001884.fasta", "proteins.fasta")
+write_orfs_faa("../../test/data/NC_001884.fasta", "proteins.fasta")
```
```bash
@@ -101,7 +101,7 @@ ATGCAACCCTGA
```
```julia
-write_proteins("proteins.fasta", seq)
+write_orfs_faa("proteins.fasta", seq)
```
```bash
diff --git a/references.bib b/references.bib
deleted file mode 100644
index ab7c59b..0000000
--- a/references.bib
+++ /dev/null
@@ -1,15 +0,0 @@
-@book{axelson-fisk_comparative_2015,
- address = {London},
- series = {Computational {Biology}},
- title = {Comparative {Gene} {Finding}},
- volume = {20},
- isbn = {978-1-4471-6692-4 978-1-4471-6693-1},
- url = {http://link.springer.com/10.1007/978-1-4471-6693-1},
- language = {en},
- urldate = {2022-12-18},
- publisher = {Springer London},
- author = {Axelson-Fisk, Marina},
- year = {2015},
- doi = {10.1007/978-1-4471-6693-1},
- keywords = {bioinformatics, programming, gene finding, software}
-}
\ No newline at end of file
diff --git a/src/GeneFinder.jl b/src/GeneFinder.jl
index 828e0b5..c9f0868 100644
--- a/src/GeneFinder.jl
+++ b/src/GeneFinder.jl
@@ -6,35 +6,36 @@ using BioSequences:
NucleicAcidAlphabet,
DNAAlphabet,
+ RNAAlphabet,
AminoAcidAlphabet,
LongDNA,
LongAA,
LongSequence,
LongSubSeq,
+ NucleicSeqOrView,
@biore_str,
@dna_str,
GeneticCode,
reverse_complement,
- randdnaseq
+ randdnaseq,
+ ncbi_trans_table,
+ translate
-
-using BioMarkovChains: BioMarkovChain
-using FASTX: FASTA, sequence, FASTAReader
+using FASTX: FASTAReader, FASTARecord, description, sequence
using IterTools: takewhile, iterated
using PrecompileTools: @setup_workload, @compile_workload
-using StatsBase: countmap
include("types.jl")
export ORF
include("algorithms/findorfs.jl")
-export locationiterator, findorfs, getorfdna, getorfaa
+export locationiterator, findorfs, get_orfs_dna, get_orfs_aa, record_orfs_fna, record_orfs_faa
include("io.jl")
-export write_cds, write_proteins, write_bed, write_gff
+export write_orfs_fna, write_orfs_faa, write_orfs_bed, write_orfs_gff
include("utils.jl")
-export fasta_to_dna, nucleotidefreqs, hasprematurestop, dnaseqprobability, iscoding
+export fasta_to_dna, hasprematurestop
include("extended.jl")
@@ -46,9 +47,9 @@ include("extended.jl")
# all calls in this block will be precompiled, regardless of whether
# they belong to your package or not (on Julia 1.8 and higher)
findorfs(seq)
- getorfdna(seq)
- # getorfaa(seq)
+ get_orfs_dna(seq)
+ # get_orfs_aa(seq)
end
end
-end
+end
\ No newline at end of file
diff --git a/src/algorithms/findorfs.jl b/src/algorithms/findorfs.jl
index c9e06f1..2ef8e8f 100644
--- a/src/algorithms/findorfs.jl
+++ b/src/algorithms/findorfs.jl
@@ -1,5 +1,6 @@
"""
- locationiterator(sequence::LongSequence{DNAAlphabet{4}}; alternative_start::Bool=false)
+ locationiterator(sequence::NucleicSeqOrView{DNAAlphabet{N}}; alternative_start::Bool=false) where {N}
+ locationiterator(sequence::NucleicSeqOrView{RNAAlphabet{N}}; alternative_start::Bool=false) where {N}
This is an iterator function that uses regular expressions to search the entire ORF (instead of start and stop codons) in a `LongSequence{DNAAlphabet{4}}` sequence.
It uses an anonymous function that will find the first regularly expressed ORF. Then using this anonymous function it creates an iterator that will apply it until there is no other CDS.
@@ -14,7 +15,10 @@ This is an iterator function that uses regular expressions to search the entire
See more about the discussion [here](https://discourse.julialang.org/t/how-to-improve-a-generator-to-be-more-memory-efficient-when-it-is-collected/92932/8?u=camilogarciabotero)
"""
-function locationiterator(sequence::LongNucOrView{N}; alternative_start::Bool = false) where N
+function locationiterator(
+ sequence::NucleicSeqOrView{DNAAlphabet{N}};
+ alternative_start::Bool = false
+) where {N}
regorf = alternative_start ? biore"DTG(?:[N]{3})*?T(AG|AA|GA)"dna : biore"ATG(?:[N]{3})*?T(AG|AA|GA)"dna
# regorf = alternative_start ? biore"DTG(?:[N]{3})*?T(AG|AA|GA)"dna : biore"ATG([N]{3})*T(AG|AA|GA)?"dna # an attempt to make it non PCRE non-determinsitic
finder(x) = findfirst(regorf, sequence, first(x) + 1) # + 3
@@ -22,9 +26,10 @@ function locationiterator(sequence::LongNucOrView{N}; alternative_start::Bool =
return itr
end
+const FRAMEDICT = Dict(0 => 3, 1 => 1, 2 => 2) # sets the int conversion for frame assignment later
+
"""
- findorfs(sequence::LongSequence{DNAAlphabet{4}}; kwargs...)::Vector{ORF}
- findorfs(sequence::String; kwargs...)::Vector{ORF}
+ findorfs(sequence::NucleicAlphabet{DNAAlphabet{N}}; alternative_start::Bool=false, min_len::Int64=6)::Vector{ORF} where {N}
A simple implementation that finds ORFs in a DNA sequence.
@@ -40,19 +45,21 @@ The `findorfs` function takes a LongSequence{DNAAlphabet{4}} sequence and return
- `alternative_start::Bool=false`: If true will pass the extended start codons to search. This will increase 3x the exec. time.
- `min_len::Int64=6`: Length of the allowed ORF. Default value allow `aa"M*"` a posible encoding protein from the resulting ORFs.
"""
-function findorfs(sequence::LongNucOrView{N}; alternative_start::Bool = false, min_len::Int64 = 6) where N
- orfs = Vector{ORF}()
+function findorfs(
+ sequence::NucleicSeqOrView{DNAAlphabet{N}};
+ alternative_start::Bool = false,
+ min_len::Int64 = 6
+) where {N}
+ orfs = Vector{ORF}(undef, 0)
reversedseq = reverse_complement(sequence)
seqlen = length(sequence)
- frames = Dict(0 => 3, 1 => 1, 2 => 2)
-
for strand in ('+', '-')
seq = strand == '-' ? reversedseq : sequence
- @inbounds for location in locationiterator(seq; alternative_start)
+ @inbounds for location in @views locationiterator(seq; alternative_start)
if length(location) >= min_len
- frame = strand == '+' ? frames[location.start % 3] : frames[(seqlen - location.stop + 1) % 3]
+ frame = strand == '+' ? FRAMEDICT[location.start % 3] : FRAMEDICT[(seqlen - location.stop + 1) % 3]
push!(orfs, ORF(strand == '+' ? location : (seqlen - location.stop + 1):(seqlen - location.start + 1), strand, frame))
end
end
@@ -60,36 +67,16 @@ function findorfs(sequence::LongNucOrView{N}; alternative_start::Bool = false, m
return sort(orfs)
end
-function findorfs(sequence::String; alternative_start::Bool = false, min_len::Int64 = 6)
- sequence = LongSequence{DNAAlphabet{4}}(sequence)
- orfs = Vector{ORF}()
- reversedseq = reverse_complement(sequence)
- seqlen = length(sequence)
-
- frames = Dict(0 => 3, 1 => 1, 2 => 2)
-
- for strand in ('+', '-')
- seq = strand == '-' ? reversedseq : sequence
-
- @inbounds for location in locationiterator(seq; alternative_start)
- if length(location) >= min_len
- frame = strand == '+' ? frames[location.start % 3] : frames[(seqlen - location.stop + 1) % 3]
- push!(orfs, ORF(strand == '+' ? location : (seqlen - location.stop + 1):(seqlen - location.start + 1), strand, frame))
- end
- end
- end
- return sort(orfs)
-end
+#### get_orfs_* methods ####
"""
- getorfdna(input::LongSequence{DNAAlphabet{4}}, output::String; kwargs...)
- getorfdna(input::String; kwargs...) ## for strings per se
+ get_orfs_dna(sequence::NucleicSeqOrView{DNAAlphabet{N}}, output::String; kwargs...) where {N}
-This function takes a `LongSequence{DNAAlphabet{4}}` or `String` sequence and identifies the open reading frames (ORFs) using the `findorfs()` function. The function then extracts the DNA sequence of each ORF and stores it in a `Vector{LongSubSeq{DNAAlphabet{4}}}`.
+This function takes a `NucleicSeqOrView{DNAAlphabet{N}}` sequence and identifies the open reading frames (ORFs) using the `findorfs()` function. The function then extracts the DNA sequence of each ORF and stores it in a `Vector{LongSubSeq{DNAAlphabet{4}}}`.
# Arguments
-- `input`: The input sequence as a `LongSequence{DNAAlphabet{4}}` or `String`.
+- `sequence`: The input sequence as a `NucleicSeqOrView{DNAAlphabet{N}}`
# Keyword Arguments
@@ -97,35 +84,27 @@ This function takes a `LongSequence{DNAAlphabet{4}}` or `String` sequence and id
- `min_len::Int64=6`: The minimum length of the allowed ORF. By default, it allows ORFs that can encode at least one amino acid (e.g., `aa"M*"`).
"""
-function getorfdna(
- sequence::LongNucOrView{N};
+function get_orfs_dna(
+ sequence::NucleicSeqOrView{DNAAlphabet{N}};
alternative_start::Bool = false,
min_len::Int64 = 6
-) where N
+) where {N}
orfs = findorfs(sequence; alternative_start, min_len)
- seqs = Vector{LongSubSeq{DNAAlphabet{4}}}()
- @inbounds for i in orfs
- if i.strand == '+'
- push!(seqs, @view sequence[i.location])
- else
- newseq = reverse_complement(@view sequence[i.location])
- push!(seqs, newseq)
- end
+ seqs = Vector{LongSubSeq{DNAAlphabet{N}}}(undef, length(orfs)) #Vector{}(undef, length(orfs)) # todo correct the output type
+ @inbounds for (i, orf) in enumerate(orfs)
+ seqs[i] = sequence[orf]
end
return seqs
end
-import BioSequences.standard_genetic_code
-
"""
- getorfaa(input::LongSequence{DNAAlphabet{4}}; kwargs...)
- getorfaa(input::String; kwargs...) ## for strings per se
+ get_orfs_aa(sequence::NucleicSeqOrView{DNAAlphabet{N}}; kwargs...) where {N}
-This function takes a `LongSequence{DNAAlphabet{4}}` or `String` sequence and identifies the open reading frames (ORFs) using the `findorfs()` function. The function then translates each ORF into an amino acid sequence and stores it in a `Vector{LongSubSeq{AminoAcidAlphabet}}`.
+This function takes a `NucleicSeqOrView{DNAAlphabet{N}}` sequence and identifies the open reading frames (ORFs) using the `findorfs()` function. The function then translates each ORF into an amino acid sequence and stores it in a `Vector{LongSubSeq{AminoAcidAlphabet}}`.
# Arguments
-- `input`: The input sequence as a `LongSequence{DNAAlphabet{4}}` or `String`.
+- `sequence`: The input sequence as a `NucleicSeqOrView{DNAAlphabet{N}}`
# Keyword Arguments
@@ -133,21 +112,87 @@ This function takes a `LongSequence{DNAAlphabet{4}}` or `String` sequence and id
- `min_len::Int64=6`: The minimum length of the allowed ORF. By default, it allows ORFs that can encode at least one amino acid (e.g., `aa"M*"`).
"""
-function getorfaa(
- sequence::LongNucOrView{N};
+function get_orfs_aa(
+ sequence::NucleicSeqOrView{DNAAlphabet{N}};
alternative_start::Bool = false,
- code::GeneticCode = standard_genetic_code,
+ code::GeneticCode = ncbi_trans_table[1],
min_len::Int64 = 6
-) where N
+) where {N}
orfs = findorfs(sequence; alternative_start, min_len)
- aas = Vector{LongSubSeq{AminoAcidAlphabet}}()
- @inbounds for i in orfs
- if i.strand == '+'
- push!(aas, translate(@view sequence[i.location]))
- else
- newaa = translate(reverse_complement(@view sequence[i.location]); code)
- push!(aas, newaa)
- end
+ aas = Vector{LongSubSeq{AminoAcidAlphabet}}(undef, length(orfs))
+ @inbounds for (i, orf) in enumerate(orfs)
+ aas[i] = translate(sequence[orf]; code)
end
return aas
+end
+
+#### record_orfs_* methods ####
+
+"""
+ record_orfs_fna(sequence::NucleicSeqOrView{DNAAlphabet{N}}; kwargs...) where {N}
+
+Record Open Reading Frames (ORFs) in a nucleic acid sequence.
+
+# Arguments
+- `sequence::NucleicSeqOrView{DNAAlphabet{N}}`: The nucleic acid sequence to search for ORFs.
+- `alternative_start::Bool=false`: If `true`, consider alternative start codons for ORFs.
+- `min_len::Int=6`: The minimum length of an ORF to be recorded.
+
+# Returns
+An array of `FASTARecord` objects representing the identified ORFs.
+
+# Description
+This function searches for Open Reading Frames (ORFs) in a given nucleic acid sequence. An ORF is a sequence of DNA that starts with a start codon and ends with a stop codon, without any other stop codons in between. By default, only the standard start codon (ATG) is considered, but if `alternative_start` is set to `true`, alternative start codons are also considered. The minimum length of an ORF to be recorded can be specified using the `min_len` argument.
+"""
+function record_orfs_fna(
+ sequence::NucleicSeqOrView{DNAAlphabet{N}};
+ alternative_start::Bool = false,
+ min_len::Int64 = 6
+) where {N}
+ orfs = findorfs(sequence; alternative_start, min_len)
+ norfs = length(orfs)
+ padding = norfs < 10 ? length(string(norfs)) + 1 : length(string(norfs))
+ records = FASTARecord[]
+ @inbounds for (index, orf) in enumerate(orfs)
+ id = string(lpad(string(index), padding, "0"))
+ header = "ORF$(id) id=$(id) start=$(orf.location.start) stop=$(orf.location.stop) strand=$(orf.strand) frame=$(orf.frame)"
+ record = FASTARecord(header, sequence[orf])
+ push!(records, record)
+ end
+ return records
+end
+
+"""
+ record_orfs_faa(sequence::NucleicSeqOrView{DNAAlphabet{N}}; kwargs...) where {N}
+
+This function takes a nucleic acid sequence and finds all open reading frames (ORFs) in the sequence.
+An ORF is a sequence of DNA that starts with a start codon and ends with a stop codon.
+The function returns a list of FASTA records, where each record represents an ORF in the sequence.
+
+# Arguments
+- `sequence`: The nucleic acid sequence to search for ORFs.
+- `alternative_start`: A boolean indicating whether to consider alternative start codons. Default is `false`.
+- `code`: The genetic code to use for translation. Default is the NCBI translation table 1.
+- `min_len`: The minimum length of an ORF. ORFs shorter than this length will be ignored. Default is 6.
+
+# Returns
+- A list of FASTA records representing the ORFs found in the sequence.
+"""
+function record_orfs_faa(
+ sequence::NucleicSeqOrView{DNAAlphabet{N}};
+ alternative_start::Bool = false,
+ code::GeneticCode = ncbi_trans_table[1],
+ min_len::Int64 = 6
+) where {N}
+ orfs = findorfs(sequence; alternative_start, min_len)
+ norfs = length(orfs)
+ padding = norfs < 10 ? length(string(norfs)) + 1 : length(string(norfs))
+ records = FASTARecord[]
+ @inbounds for (index, orf) in enumerate(orfs)
+ id = string(lpad(string(index), padding, "0"))
+ header = "ORF$(id) id=$(id) start=$(orf.location.start) stop=$(orf.location.stop) strand=$(orf.strand) frame=$(orf.frame)"
+ record = FASTARecord(header, translate(sequence[orf]; code))
+ push!(records, record)
+ end
+ return records
end
\ No newline at end of file
diff --git a/src/extended.jl b/src/extended.jl
index de38b2c..8fb8854 100644
--- a/src/extended.jl
+++ b/src/extended.jl
@@ -1,7 +1,8 @@
# Methods from main packages that expand their fuctions to this package structs
-import Base: show, length, iterate, sort
+import Base: length, iterate, sort, getindex
Base.sort(v::Vector{<:ORF}) = sort(v, by = _orf_sort_key)
+Base.getindex(sequence::NucleicSeqOrView{A}, orf::ORF) where {A} = orf.strand == '+' ? (@view sequence[orf.location]) : reverse_complement(@view sequence[orf.location])
function _orf_sort_key(orf::ORF)
return (orf.location, orf.strand)
diff --git a/src/io.jl b/src/io.jl
index 57b9bdf..1b94522 100644
--- a/src/io.jl
+++ b/src/io.jl
@@ -1,65 +1,166 @@
"""
- write_bed(input::LongSequence{DNAAlphabet{4}}, output::String; kwargs...)
- write_bed(input::String, output::String; kwargs...)
+ write_orfs_bed(input::NucleicAcidAlphabet{DNAAlphabet{N}}, output::Union{IOStream, IOBuffer}; kwargs...) where {N}
+ write_orfs_bed(input::NucleicAcidAlphabet{DNAAlphabet{N}}, output::String; kwargs...) where {N}
+ write_orfs_bed(input::String, output::Union{IOStream, IOBuffer}; kwargs...)
+ write_orfs_bed(input::String, output::String; kwargs...)
Write BED data to a file.
-# Keywords
-
-- `alternative_start::Bool=false`: If true will pass the extended start codons to search. This will increase 3x the exec. time.
-- `min_len::Int64=6`: Length of the allowed ORF. Default value allow `aa"M*"` a posible encoding protein from the resulting ORFs.
+# Arguments
+- `input::NucleicAcidAlphabet{DNAAlphabet{N}}`: The input DNA sequence.
+- `output::String`: The output file path.
+- `alternative_start::Bool=false`: If true, alternative start codons will be used when identifying CDSs. Default is `false`.
+- `min_len::Int64=6`: The minimum length that a CDS must have in order to be included in the output file. Default is `6`.
"""
-function write_bed(input::LongSequence{DNAAlphabet{4}}, output::String; alternative_start = false, min_len = 6)
+function write_orfs_bed(
+ input::NucleicSeqOrView{DNAAlphabet{N}},
+ output::Union{IOStream, IOBuffer};
+ alternative_start::Bool = false,
+ min_len::Int64 = 6
+) where {N}
+ orfs = findorfs(input; alternative_start, min_len)
+ @inbounds for orf in orfs
+ println(output, "$(orf.location.start)\t$(orf.location.stop)\t$(orf.strand)\t$(orf.frame)")
+ end
+end
+
+function write_orfs_bed(
+ input::NucleicSeqOrView{DNAAlphabet{N}},
+ output::String;
+ alternative_start::Bool = false,
+ min_len::Int64 = 6
+) where {N}
+ orfs = findorfs(input; alternative_start, min_len)
open(output, "w") do f
- @simd for i in findorfs(input; alternative_start, min_len)
- write(f, "$(i.location.start)\t$(i.location.stop)\t$(i.strand)\t$(i.frame)\n")
+ @inbounds for orf in orfs
+ write(f, "$(orf.location.start)\t$(orf.location.stop)\t$(orf.strand)\t$(orf.frame)\n")
end
end
end
-function write_bed(input::String, output::String; alternative_start = false, min_len = 6)
- dnaseq = fasta_to_dna(input)[1]
+function write_orfs_bed(
+ input::String,
+ output::Union{IOStream, IOBuffer};
+ alternative_start::Bool = false,
+ min_len::Int64 = 6
+)
+ input = fasta_to_dna(input)[1] # rewrite input to be a DNA sequence
+ orfs = findorfs(input; alternative_start, min_len)
+ @inbounds for orf in orfs
+ println(output, "$(orf.location.start)\t$(orf.location.stop)\t$(orf.strand)\t$(orf.frame)")
+ end
+end
+function write_orfs_bed(
+ input::String,
+ output::String;
+ alternative_start::Bool = false,
+ min_len::Int64 = 6
+)
+ input = fasta_to_dna(input)[1]
+ orfs = findorfs(input; alternative_start, min_len)
open(output, "w") do f
- @simd for i in findorfs(dnaseq; alternative_start, min_len)
- write(f, "$(i.location.start)\t$(i.location.stop)\t$(i.strand)\t$(i.frame)\n")
+ @inbounds for orf in orfs
+ write(f, "$(orf.location.start)\t$(orf.location.stop)\t$(orf.strand)\t$(orf.frame)\n")
end
end
end
"""
- write_cds(input::LongSequence{DNAAlphabet{4}}, output::String; kwargs...)
- write_cds(input::String, output::String; kwargs...)
+ write_orfs_fna(input::NucleicSeqOrView{DNAAlphabet{N}}, output::Union{IOStream, IOBuffer}; kwargs...) where {N}
+ write_orfs_fna(input::NucleicSeqOrView{DNAAlphabet{N}}, output::String; kwargs...) where {N}
+ write_orfs_fna(input::String, output::Union{IOStream, IOBuffer}; kwargs...)
+ write_orfs_fna(input::String, output::String; kwargs...)
Write a file containing the coding sequences (CDSs) of a given DNA sequence to the specified file.
# Keywords
-- `alternative_start`: A boolean value indicating whether alternative start codons should be used when identifying CDSs. Default is `false`.
-- `min_len`: An integer representing the minimum length that a CDS must have in order to be included in the output file. Default is `6`.
+- `alternative_start::Bool=false`: If true, alternative start codons will be used when identifying CDSs. Default is `false`.
+- `min_len::Int64=6`: The minimum length that a CDS must have in order to be included in the output file. Default is `6`.
+# Examples
+
+```julia
+filename = "output.fna"
+
+seq = dna"ATGATGCATGCATGCATGCTAGTAACTAGCTAGCTAGCTAGTAA"
+
+open(filename, "w") do file
+ write_orfs_fna(seq, file)
+end
+```
"""
-function write_cds(input::LongSequence{DNAAlphabet{4}}, output::String; alternative_start = false, min_len = 6)
+function write_orfs_fna(
+ input::NucleicSeqOrView{DNAAlphabet{N}},
+ output::Union{IOStream, IOBuffer};
+ alternative_start::Bool = false,
+ min_len::Int64 = 6
+) where {N}
+ orfs = findorfs(input; alternative_start, min_len)
+ norfs = length(orfs)
+ padding = norfs < 10 ? length(string(norfs)) + 1 : length(string(norfs))
+ @inbounds for (i, orf) in enumerate(orfs)
+ id = string(lpad(string(i), padding, "0"))
+ println(output, ">ORF$(id) id=$(id) start=$(orf.location.start) stop=$(orf.location.stop) strand=$(orf.strand) frame=$(orf.frame)\n$(input[orf])") # i.strand == '+' ? input[i.location] : reverse_complement(@view input[i.location])
+ end
+end
+
+function write_orfs_fna(
+ input::NucleicSeqOrView{DNAAlphabet{N}},
+ output::String;
+ alternative_start::Bool = false,
+ min_len::Int64 = 6
+) where {N}
+ orfs = findorfs(input; alternative_start, min_len)
+ norfs = length(orfs)
+ padding = norfs < 10 ? length(string(norfs)) + 1 : length(string(norfs))
open(output, "w") do f
- for i in findorfs(input; alternative_start, min_len)
- # sequence = i.strand == '+' ? input[i.location] : reverse_complement(@view input[i.location])
- write(f, ">location=$(i.location) strand=$(i.strand) frame=$(i.frame)\n$(i.strand == '+' ? dnaseq[i.location] : reverse_complement(@view dnaseq[i.location]))\n")
+ for (i, orf) in enumerate(orfs)
+ id = string(lpad(string(i), padding, "0"))
+ write(f, ">ORF$(id) id=$(id) start=$(orf.location.start) stop=$(orf.location.stop) strand=$(orf.strand) frame=$(orf.frame)\n$(input[orf])\n") # i.strand == '+' ? input[i.location] : reverse_complement(@view input[i.location])
end
end
end
-function write_cds(input::String, output::String; alternative_start = false, min_len = 6)
- dnaseq = fasta_to_dna(input)[1]
+function write_orfs_fna(
+ input::String,
+ output::Union{IOStream, IOBuffer};
+ alternative_start::Bool = false,
+ min_len::Int64 = 6
+)
+ input = fasta_to_dna(input)[1]
+ orfs = findorfs(input; alternative_start, min_len)
+ norfs = length(orfs)
+ padding = norfs < 10 ? length(string(norfs)) + 1 : length(string(norfs))
+ @inbounds for (i, orf) in enumerate(orfs)
+ id = string(lpad(string(i), padding, "0"))
+ println(output, ">ORF$(id) id=$(id) start=$(orf.location.start) stop=$(orf.location.stop) strand=$(orf.strand) frame=$(orf.frame)\n$(input[orf])") # i.strand == '+' ? input[i.location] : reverse_complement(@view input[i.location])
+ end
+end
+function write_orfs_fna(
+ input::String,
+ output::String;
+ alternative_start::Bool = false,
+ min_len::Int64 = 6
+)
+ input = fasta_to_dna(input)[1]
+ orfs = findorfs(input; alternative_start, min_len)
+ norfs = length(orfs)
+ padding = norfs < 10 ? length(string(norfs)) + 1 : length(string(norfs))
open(output, "w") do f
- for i in findorfs(dnaseq; alternative_start, min_len)
- # sequence = i.strand == '+' ? dnaseq[i.location] : reverse_complement(@view dnaseq[i.location])
- write(f, ">location=$(i.location) strand=$(i.strand) frame=$(i.frame)\n$(i.strand == '+' ? dnaseq[i.location] : reverse_complement(@view dnaseq[i.location]))\n")
+ for (i, orf) in enumerate(orfs)
+ id = string(lpad(string(i), padding, "0"))
+ write(f, ">ORF$(id) id=$(id) start=$(orf.location.start) stop=$(orf.location.stop) strand=$(orf.strand) frame=$(orf.frame)\n$(input[orf])\n") # i.strand == '+' ? input[i.location] : reverse_complement(@view input[i.location])
end
end
end
"""
- write_proteins(input::LongSequence{DNAAlphabet{4}}, output::String; kwargs...)
+ write_orfs_faa(input::LongSequence{DNAAlphabet{4}}, output::Union{IOStream, IOBuffer}; kwargs...) where {N}
+ write_orfs_faa(input::LongSequence{DNAAlphabet{4}}, output::String; kwargs...) where {N}
+ write_orfs_faa(input::String, output::Union{IOStream, IOBuffer}; kwargs...)
+ write_orfs_faa(input::String, output::String; kwargs...)
Write the protein sequences encoded by the coding sequences (CDSs) of a given DNA sequence to the specified file.
@@ -68,75 +169,183 @@ Write the protein sequences encoded by the coding sequences (CDSs) of a given DN
- `code::GeneticCode=BioSequences.standard_genetic_code`: The genetic code by which codons will be translated. See `BioSequences.ncbi_trans_table` for more info.
- `alternative_start::Bool=false`: If true will pass the extended start codons to search. This will increase 3x the exec. time.
- `min_len::Int64=6`: Length of the allowed ORF. Default value allow `aa"M*"` a posible encoding protein from the resulting ORFs.
+# Examples
+
+```julia
+filename = "output.faa"
+
+seq = dna"ATGATGCATGCATGCATGCTAGTAACTAGCTAGCTAGCTAGTAA"
+
+open(filename, "w") do file
+ write_orfs_faa(seq, file)
+end
+```
"""
-function write_proteins(
- input::LongSequence{DNAAlphabet{4}},
+function write_orfs_faa(
+ input::NucleicSeqOrView{DNAAlphabet{N}},
+ output::Union{IOStream, IOBuffer};
+ alternative_start::Bool = false,
+ code::GeneticCode = ncbi_trans_table[1],
+ min_len::Int64 = 6
+) where {N}
+ orfs = findorfs(input; alternative_start, min_len)
+ norfs = length(orfs)
+ padding = norfs < 10 ? length(string(norfs)) + 1 : length(string(norfs))
+ @inbounds for (i, orf) in enumerate(orfs)
+ id = string(lpad(string(i), padding, "0"))
+ println(output, ">ORF$(id) id=$(id) start=$(orf.location.start) stop=$(orf.location.stop) strand=$(orf.strand) frame=$(orf.frame)\n$(translate(input[orf]; code))")
+ end
+end
+
+function write_orfs_faa(
+ input::NucleicSeqOrView{DNAAlphabet{N}},
output::String;
- alternative_start = false,
- code::GeneticCode = BioSequences.standard_genetic_code,
- min_len = 6,
-)
+ alternative_start::Bool = false,
+ code::GeneticCode = ncbi_trans_table[1],
+ min_len::Int64 = 6,
+) where {N}
+ orfs = findorfs(input; alternative_start, min_len)
+ norfs = length(orfs)
+ padding = norfs < 10 ? length(string(norfs)) + 1 : length(string(norfs))
open(output, "w") do f
- for i in findorfs(input; alternative_start, code, min_len)
- translation = i.strand == '+' ? translate(dnaseq[i.location]) : translate(reverse_complement(@view dnaseq[i.location]))
- write(f, ">location=$(i.location) strand=$(i.strand) frame=$(i.frame)\n$(translation)\n")
+ for (i, orf) in enumerate(orfs)
+ id = string(lpad(string(i), padding, "0"))
+ write(f, ">ORF$(id) id=$(id) start=$(orf.location.start) stop=$(orf.location.stop) strand=$(orf.strand) frame=$(orf.frame)\n$(translate(input[orf]; code))\n")
end
end
end
-function write_proteins(
+function write_orfs_faa(
+ input::String,
+ output::Union{IOStream, IOBuffer};
+ alternative_start::Bool = false,
+ code::GeneticCode = ncbi_trans_table[1],
+ min_len::Int64 = 6
+)
+ input = fasta_to_dna(input)[1]
+ orfs = findorfs(input; alternative_start, min_len)
+ norfs = length(orfs)
+ padding = norfs < 10 ? length(string(norfs)) + 1 : length(string(norfs))
+ @inbounds for (i, orf) in enumerate(orfs)
+ id = string(lpad(string(i), padding, "0"))
+ println(output, ">ORF$(id) id=$(id) start=$(orf.location.start) stop=$(orf.location.stop) strand=$(orf.strand) frame=$(orf.frame)\n$(translate(input[orf]; code))")
+ end
+end
+
+function write_orfs_faa(
input::String,
output::String;
- alternative_start = false,
- min_len = 6,
+ alternative_start::Bool = false,
+ code::GeneticCode = ncbi_trans_table[1],
+ min_len::Int64 = 6,
)
- dnaseq = fasta_to_dna(input)[1]
-
+ input = fasta_to_dna(input)[1]
+ orfs = findorfs(input; alternative_start, min_len)
+ norfs = length(orfs)
+ padding = norfs < 10 ? length(string(norfs)) + 1 : length(string(norfs))
open(output, "w") do f
- for i in findorfs(dnaseq; alternative_start, min_len)
- translation = i.strand == '+' ? translate(dnaseq[i.location]) : translate(reverse_complement(@view dnaseq[i.location]))
- write(f, ">location=$(i.location) strand=$(i.strand) frame=$(i.frame)\n$(translation)\n")
+ for (i, orf) in enumerate(orfs)
+ id = string(lpad(string(i), padding, "0"))
+ write(f, ">ORF$(id) id=$(id) start=$(orf.location.start) stop=$(orf.location.stop) strand=$(orf.strand) frame=$(orf.frame)\n$(translate(input[orf]; code))\n")
end
end
end
"""
- write_gff(input::LongSequence{DNAAlphabet{4}}, output::String; kwargs...)
- write_gff(input::String, output::String; kwargs...)
+ write_orfs_gff(input::NucleicSeqOrView{DNAAlphabet{N}}, output::Union{IOStream, IOBuffer}; kwargs...) where {N}
+ write_orfs_gff(input::NucleicSeqOrView{DNAAlphabet{N}}, output::String; kwargs...) where {N}
+ write_orfs_gff(input::String, output::Union{IOStream, IOBuffer}; kwargs...)
+ write_orfs_gff(input::String, output::String; kwargs...)
Write GFF data to a file.
-# Keywords
+# Arguments
+- `input`: The input DNA sequence.
+- `output`: The output file to write the GFF data to.
+- `alternative_start::Bool=false`: If true, extended start codons will be considered during the search, increasing the execution time.
+- `min_len::Int64=6`: The minimum length of the allowed ORF. The default value allows for possible encoding proteins with the `aa"M*"` sequence.
-- `alternative_start::Bool=false`: If true will pass the extended start codons to search. This will increase 3x the exec. time.
-- `min_len::Int64=6`: Length of the allowed ORF. Default value allow `aa"M*"` a posible encoding protein from the resulting ORFs.
"""
-function write_gff(input::String, output::String; alternative_start = false, min_len = 6)
- dnaseq = fasta_to_dna(input)[1]
+function write_orfs_gff(
+ input::NucleicSeqOrView{A},
+ output::Union{IOStream, IOBuffer};
+ alternative_start::Bool = false,
+ min_len::Int64 = 6
+) where {A}
+ orfs = findorfs(input; alternative_start, min_len)
+ norfs = length(orfs)
+ padding = norfs < 10 ? length(string(norfs)) + 1 : length(string(norfs))
+ println(output, "##gff-version 3\n##sequence-region Chr 1 $(length(input))")
+ for (i, orf) in enumerate(orfs)
+ id = string("ORF", lpad(string(i), padding, "0"))
+ println(
+ output,
+ "Chr\t.\tORF\t$(orf.location.start)\t$(orf.location.stop)\t.\t$(orf.strand)\t.\tID=$(id);Name=$(id);Frame=$(orf.frame)",
+ )
+ end
+end
+function write_orfs_gff(
+ input::NucleicSeqOrView{A},
+ output::String;
+ alternative_start::Bool = false,
+ min_len::Int64 = 6
+) where {A}
+ orfs = findorfs(input; alternative_start, min_len)
+ norfs = length(orfs)
+ padding = norfs < 10 ? length(string(norfs)) + 1 : length(string(norfs))
open(output, "w") do f
- write(f, "##gff-version 3\n##sequence-region Chr 1 $(length(dnaseq))\n")
- for (index, i) in enumerate(findorfs(dnaseq; alternative_start, min_len))
- id = string("ORF", lpad(string(index), 5, "0"))
+ write(f, "##gff-version 3\n##sequence-region Chr 1 $(length(input))\n")
+ for (i, orf) in enumerate(orfs)
+ id = string("ORF", lpad(string(i), padding, "0"))
write(
f,
- "Chr\t.\tORF\t$(i.location.start)\t$(i.location.stop)\t.\t$(i.strand)\t.\tID=$(id);Name=$(id);Frame=$(i.frame)\n",
+ "Chr\t.\tORF\t$(orf.location.start)\t$(orf.location.stop)\t.\t$(orf.strand)\t.\tID=$(id);Name=$(id);Frame=$(orf.frame)\n",
)
end
end
end
-function write_gff(input::LongSequence{DNAAlphabet{4}}, output::String; alternative_start = false, min_len = 6)
+function write_orfs_gff(
+ input::String,
+ output::Union{IOStream, IOBuffer};
+ alternative_start::Bool = false,
+ min_len::Int64 = 6
+)
+ input = fasta_to_dna(input)[1]
+ orfs = findorfs(input; alternative_start, min_len)
+ norfs = length(orfs)
+ padding = norfs < 10 ? length(string(norfs)) + 1 : length(string(norfs))
+ println(output, "##gff-version 3\n##sequence-region Chr 1 $(length(input))")
+ for (i, orf) in enumerate(orfs)
+ id = string("ORF", lpad(string(i), padding, "0"))
+ println(
+ output,
+ "Chr\t.\tORF\t$(orf.location.start)\t$(orf.location.stop)\t.\t$(orf.strand)\t.\tID=$(id);Name=$(id);Frame=$(orf.frame)",
+ )
+ end
+end
+
+function write_orfs_gff(
+ input::String,
+ output::String;
+ alternative_start::Bool = false,
+ min_len::Int64 = 6
+)
+ input = fasta_to_dna(input)[1]
+ orfs = findorfs(input; alternative_start, min_len)
+ norfs = length(orfs)
+ padding = norfs < 10 ? length(string(norfs)) + 1 : length(string(norfs))
open(output, "w") do f
write(f, "##gff-version 3\n##sequence-region Chr 1 $(length(input))\n")
- for (index, i) in enumerate(findorfs(input; alternative_start, min_len))
- id = string("ORF", lpad(string(index), 5, "0"))
+ for (i, orf) in enumerate(orfs)
+ id = string("ORF", lpad(string(i), padding, "0"))
write(
f,
- "Chr\t.\tORF\t$(i.location.start)\t$(i.location.stop)\t.\t$(i.strand)\t.\tID=$(id);Name=$(id);Frame=$(i.frame)\n",
+ "Chr\t.\tORF\t$(orf.location.start)\t$(orf.location.stop)\t.\t$(orf.strand)\t.\tID=$(id);Name=$(id);Frame=$(orf.frame)\n",
)
end
end
end
-# Chr needs to be changed to fit well on IGV
\ No newline at end of file
+# Chr needs to be changed for fasta (?) or the name of seq (?) to fit well on IGV
\ No newline at end of file
diff --git a/src/types.jl b/src/types.jl
index 7e00ac9..503af3a 100644
--- a/src/types.jl
+++ b/src/types.jl
@@ -1,8 +1,5 @@
# Structs associated with gene models
-
-abstract type Gene end
-
-const LongNucOrView{N} = Union{LongSequence{<:NucleicAcidAlphabet{N}},LongSubSeq{<:NucleicAcidAlphabet{N}}}
+abstract type AbstractGene end
"""
struct GeneFeatures
@@ -19,21 +16,21 @@ This is the main Gene struct, based on the fields that could be found in a GFF3,
The idea is correct the frame and attributes that will have something like a possible list (id=Char;name=;locus_tag).
The `write` and `get` functions should have a dedicated method for this struct.
"""
-struct GeneFeatures <: Gene
- seqname::String
- start::Int64
- stop::Int64
- score::Float64
- strand::Char
- frame::Int8 # But maybe a Union to allow empty when reading a GFF?
- attribute::String # Should be a Dict perhaps
-end
+# struct GeneFeatures <: AbstractGene
+# seqname::String
+# start::Int64
+# stop::Int64
+# score::Float64
+# strand::Char
+# frame::Int8 # But maybe a Union to allow empty when reading a GFF?
+# attribute::Dict # Should be a Dict perhaps
+# end
"""
struct ORF
location::UnitRange{Int64}
strand::Char
- frame::Integer
+ frame::Int
end
The ORF struct represents an open reading frame in a DNA sequence. It has two fields:
@@ -41,58 +38,14 @@ The ORF struct represents an open reading frame in a DNA sequence. It has two fi
- `location`: which is a UnitRange{Int64} indicating the start and end locations of the ORF in the sequence
- `strand`: is a Char type indicating whether the ORF is on the forward ('+') or reverse ('-') strand of the sequence.
"""
-struct ORF <: Gene
- location::UnitRange{Int64} # Note that it is also called position for gene struct in GenomicAnotations
+struct ORF <: AbstractGene
+ location::UnitRange{Int64} # Note that it is also called position for gene struct in GenomicAnotations
strand::Char
- frame::Integer # 1, 2, or 3
-end
-
-##### The following implementation is from https://biojulia.dev/BioSequences.jl/stable/interfaces/ #####
-# """
-# Codon <: BioSequence{DNAAlphabet{2}}
-
-# A `Struct` representing a codon, which is a subtype of `BioSequence` with
-# an `Alphabet` of type `DNAAlphabet{2}`. It has a single field `x` of type
-# `UInt8`. This was implemente in The following implementation is from https://biojulia.dev/BioSequences.jl/stable/interfaces/
-# """
-# struct Codon <: BioSequence{DNAAlphabet{2}}
-# x::UInt8
-
-# function Codon(iterable)
-# length(iterable) == 3 || error("Must have length 3")
-# x = zero(UInt)
-# for (i, nt) in enumerate(iterable)
-# x |= BioSequences.encode(Alphabet(Codon), convert(DNA, nt)) << (6 - 2i)
-# end
-# new(x % UInt8)
-# end
-# end
-
-# Base.copy(seq::Codon) = Codon(seq.x)
-# Base.count(codon::Codon, sequence::LongSequence{DNAAlphabet{4}}) = count(codon, sequence)
-# Base.length(::Codon) = 3
+ frame::Int # Use Int64 instead of Int
-# function Base.count(codons::Vector{Codon}, sequence::LongSequence{DNAAlphabet{4}})
-# a = 0
-# @inbounds for i in codons
-# a += count(i, sequence)
-# end
-# return a
-# end
-
-# # has_interface(BioSequence, Codon, [DNA_C, DNA_T, DNA_G], false)
-
-# function extract_encoded_element(x::Codon, i::Int)
-# ((x.x >>> (6 - 2i)) & 3) % UInt
-# end
-
-# function translate(
-# ntseq::Codon;
-# code::GeneticCode = standard_genetic_code,
-# allow_ambiguous_codons = true,
-# alternative_start = false,
-# )
-# translate(ntseq; code, allow_ambiguous_codons, alternative_start)
-# end
-
-##### ---------------------------------------- #####
+ function ORF(location::UnitRange{Int64}, strand::Char, frame::Int)
+ @assert frame in (1, 2, 3) "Invalid frame value. Frame must be 1, 2, or 3."
+ @assert strand in ('+', '-') "Invalid strand value. Strand must be '+' or '-'."
+ new(location, strand, frame)
+ end
+end
\ No newline at end of file
diff --git a/src/utils.jl b/src/utils.jl
index b2caa66..71ba839 100644
--- a/src/utils.jl
+++ b/src/utils.jl
@@ -16,37 +16,6 @@ end
# end
# end
-"""
- nucleotidefreqs(sequence::LongSequence{DNAAlphabet{4}}) -> Dict{DNA, Float64}
-
-Calculate the frequency of each nucleotide in a DNA sequence.
-
-# Arguments
-- `sequence::LongSequence{DNAAlphabet{4}}`: A `LongSequence{DNAAlphabet{4}}` sequence.
-
-# Returns
-A dictionary with each nucleotide in the sequence as a key, and its frequency as a value.
-
-# Example
-```
-seq = dna"CCTCCCGGACCCTGGGCTCGGGAC"
-
-nucleotidefreqs(seq)
-
-Dict{DNA, Float64} with 4 entries:
-DNA_T => 0.125
-DNA_A => 0.0833333
-DNA_G => 0.333333
-DNA_C => 0.458333
-```
-"""
-function nucleotidefreqs(sequence::LongNucOrView{4})::Dict{DNA,Float64}
- counts = countmap(sequence)
- T = length(sequence)
- F = Dict(i => counts[i] / T for i in keys(counts))
- return F
-end
-
"""
hasprematurestop(sequence::LongNucOrView{4})::Bool
@@ -54,7 +23,7 @@ Determine whether the `sequence` of type `LongSequence{DNAAlphabet{4}}` contains
Returns a boolean indicating whether the `sequence` has more than one stop codon.
"""
-function hasprematurestop(sequence::LongNucOrView{N})::Bool where N
+function hasprematurestop(sequence::NucleicSeqOrView{A})::Bool where {A}
stopcodons = [LongDNA{4}("TAA"), LongDNA{4}("TAG"), LongDNA{4}("TGA")] # Create a set of stop codons
diff --git a/test/findorfstest.jl b/test/findorfstest.jl
index 57c7607..6ff0bcc 100644
--- a/test/findorfstest.jl
+++ b/test/findorfstest.jl
@@ -39,20 +39,21 @@
@test length(NC_001416_orfs) == 885
end
-@testset "getorfdna" begin
+@testset "get_orfs_dna" begin
seq01 = dna"ATGATGCATGCATGCATGCTAGTAACTAGCTAGCTAGCTAGTAA"
- orfseqs = getorfdna(seq01)
+ sseq01 = "ATGATGCATGCATGCATGCTAGTAACTAGCTAGCTAGCTAGTAA"
+ orfseqs = get_orfs_dna(seq01)
@test length(orfseqs) == 5
@test orfseqs[1] == dna"ATGATGCATGCATGCATGCTAGTAACTAGCTAG"
end
-# @testset "getorfaa" begin
+@testset "get_orfs_aa" begin
-# seq01 = dna"ATGATGCATGCATGCATGCTAGTAACTAGCTAGCTAGCTAGTAA"
-# aas = getorfaa(seq01)
+ seq01 = dna"ATGATGCATGCATGCATGCTAGTAACTAGCTAGCTAGCTAGTAA"
+ aas = get_orfs_aa(seq01)
-# @test length(aas) == 5
-# @test aas[1] == aa"MMHACMLVTS*"
-# end
\ No newline at end of file
+ @test length(aas) == 5
+ @test aas[1] == aa"MMHACMLVTS*"
+end
\ No newline at end of file
diff --git a/test/runtests.jl b/test/runtests.jl
index ffb199c..e9211a3 100644
--- a/test/runtests.jl
+++ b/test/runtests.jl
@@ -7,6 +7,21 @@ using GeneFinder
include("findorfstest.jl")
+using Aqua:
+ test_ambiguities,
+ test_persistent_tasks,
+ test_piracies,
+ test_stale_deps,
+ test_unbound_args,
+ test_undefined_exports
+
+test_ambiguities(GeneFinder)
+test_persistent_tasks(GeneFinder)
+test_piracies(GeneFinder)
+test_stale_deps(GeneFinder)
+test_unbound_args(GeneFinder)
+test_undefined_exports(GeneFinder)
+
end
# @testset "GeneFinder.jl" begin
# # Write your tests here.