diff --git a/README.md b/README.md index 7d9d162..dc922ac 100644 --- a/README.md +++ b/README.md @@ -37,7 +37,7 @@ add GeneFinder If you are interested in the cutting edge of the development, please check out the master branch to try new features before release. -## Finding complete and internal (overlapped) ORFs +## Finding complete and overlapped ORFs The first implemented function is `findorfs` a very non-restrictive ORF finder function that will catch all ORFs in a dedicated structure. Note that this will catch random ORFs not necesarily genes since it has no ORFs size or overlapping condition contraints. Thus it might consider `aa"M*"` a posible encoding protein from the resulting ORFs. @@ -67,7 +67,7 @@ findorfs(seq) ORF(695:706, '+', 2) ``` -Two other functions (`get_orfs_dna` and `get_orfs_aa`) pass the sequence to `findorfs` take the ORFs and act as generators of the sequence, so this way the can be `collect`ed in the REPL as an standard output or writteen into a file more conviniently using the `FASTX` IO system: +Two other functions (`get_orfs_dna` and `get_orfs_aa`) return the ORFs as DNA and amino acid sequences, respectively. They first call `findorfs` to locate the ORFs and then return the corresponding array of `BioSequence` objects. ```julia get_orfs_dna(seq) @@ -87,78 +87,37 @@ get_orfs_dna(seq) ATGCAACCCTGA ``` -```julia -get_orfs_aa(seq) - -12-element Vector{LongSubSeq{AminoAcidAlphabet}}: - MQP* - MR* - MRRMAR* - MAR* - M* - MCPTAV* - MQP* - MHWLVLSI* - MSPHKAM* - M* - MCPTAA* - MQP* -``` +## Writing ORF information into bioinformatic formats -### Writting cds, proteins fastas, bed and gffs whether from a `LongSeq` or from a external fasta file. 
+This package now facilitates the creation of `FASTA`, `BED`, and `GFF` files, specifically extracting Open Reading Frame (ORF) information from `BioSequence` instances, particularly those of type `NucleicSeqOrView{A} where A`, and then writing the information into the desired format. -```julia -write_cds("cds.fasta", seq) -``` +Functionality: + +The package provides four distinct functions for writing files in different formats: + +| Function | Description | +|-------------------|--------------------------------------------------------| +| `write_orfs_fna` | Writes nucleotide sequences in FASTA format. | +| `write_orfs_faa` | Writes amino acid sequences in FASTA format. | +| `write_orfs_bed` | Outputs information in BED format. | +| `write_orfs_gff` | Generates files in GFF format. | -```bash -cat cds.fasta - ->location=29:40 strand=+ frame=2 -ATGCAACCCTGA ->location=137:145 strand=+ frame=2 -ATGCGCTGA ->location=164:184 strand=+ frame=2 -ATGCGTCGAATGGCACGGTGA ->location=173:184 strand=+ frame=2 -ATGGCACGGTGA ->location=236:241 strand=+ frame=2 -ATGTGA ->location=248:268 strand=+ frame=2 -ATGTGTCCAACGGCAGTCTGA ->location=362:373 strand=+ frame=2 -ATGCAACCCTGA ->location=470:496 strand=+ frame=2 -ATGCACTGGCTGGTCCTGTCAATCTGA ->location=551:574 strand=+ frame=2 -ATGTCACCGCACAAGGCAATGTGA ->location=569:574 strand=+ frame=2 -ATGTGA ->location=581:601 strand=+ frame=2 -ATGTGTCCAACGGCAGCCTGA ->location=695:706 strand=+ frame=2 -ATGCAACCCTGA -``` -### Combining `FASTX` for reading and writing fastas +All these functions support processing both `BioSequence` instances and external `FASTA` files. To process an external `FASTA` file, simply provide its path as a `String`. 
To demonstrate the use of the `write_*` methods with a `BioSequence`, consider the following example: ```julia -using FASTX +using BioSequences, GeneFinder -write_orfs_faa("test/data/NC_001884.fasta", "proteins.fasta") +# > 180195.SAMN03785337.LFLS01000089 -> finds only 1 gene in Prodigal (from Pyrodigal tests) +seq = dna"AACCAGGGCAATATCAGTACCGCGGGCAATGCAACCCTGACTGCCGGCGGTAACCTGAACAGCACTGGCAATCTGACTGTGGGCGGTGTTACCAACGGCACTGCTACTACTGGCAACATCGCACTGACCGGTAACAATGCGCTGAGCGGTCCGGTCAATCTGAATGCGTCGAATGGCACGGTGACCTTGAACACGACCGGCAATACCACGCTCGGTAACGTGACGGCACAAGGCAATGTGACGACCAATGTGTCCAACGGCAGTCTGACGGTTACCGGCAATACGACAGGTGCCAACACCAACCTCAGTGCCAGCGGCAACCTGACCGTGGGTAACCAGGGCAATATCAGTACCGCAGGCAATGCAACCCTGACGGCCGGCGACAACCTGACGAGCACTGGCAATCTGACTGTGGGCGGCGTCACCAACGGCACGGCCACCACCGGCAACATCGCGCTGACCGGTAACAATGCACTGGCTGGTCCTGTCAATCTGAACGCGCCGAACGGCACCGTGACCCTGAACACAACCGGCAATACCACGCTGGGTAATGTCACCGCACAAGGCAATGTGACGACTAATGTGTCCAACGGCAGCCTGACAGTCGCTGGCAATACCACAGGTGCCAACACCAACCTGAGTGCCAGCGGCAATCTGACCGTGGGCAACCAGGGCAATATCAGTACCGCGGGCAATGCAACCCTGACTGCCGGCGGTAACCTGAGC" ``` -```bash -head proteins.fasta - ->location=41:145 strand=- frame=2 -MTQKRKGPIPAQFEITPILRFNFIFDLTATNSFH* ->location=41:172 strand=- frame=2 -MVLKDVIVNMTQKRKGPIPAQFEITPILRFNFIFDLTATNSFH* ->location=41:454 strand=- frame=2 -MSEHLSQKEKELKNKENFIFDKYESGIYSDELFLKRKAALDEEFKELQNAKNELNGLQDTQSEIDSNTVRNNINKIIDQYHIESSSEKKNELLRMVLKDVIVNMTQKRKGPIPAQFEITPILRFNFIFDLTATNSFH* ->location=41:472 strand=- frame=2 -MKTKKQMSEHLSQKEKELKNKENFIFDKYESGIYSDELFLKRKAALDEEFKELQNAKNELNGLQDTQSEIDSNTVRNNINKIIDQYHIESSSEKKNELLRMVLKDVIVNMTQKRKGPIPAQFEITPILRFNFIFDLTATNSFH* ->location=41:505 strand=- frame=2 -MLSKYEDDNSNMKTKKQMSEHLSQKEKELKNKENFIFDKYESGIYSDELFLKRKAALDEEFKELQNAKNELNGLQDTQSEIDSNTVRNNINKIIDQYHIESSSEKKNELLRMVLKDVIVNMTQKRKGPIPAQFEITPILRFNFIFDLTATNSFH* +Once a `BioSequence` object has been instantiated, the `write_orfs_fna` function proves useful for generating a `FASTA` file containing the nucleotide sequences of the ORFs. 
Notably, the `write_orfs*` methods support either an `IOStream` or an `IOBuffer` as an output argument, allowing flexibility in directing the output either to a file or a buffer. In the following example, we demonstrate writing the output directly to a file. + +```julia +outfile = "LFLS01000089.fna" + +open(outfile, "w") do io + write_orfs_fna(seq, io) +end ``` \ No newline at end of file diff --git a/docs/src/iodocs.md b/docs/src/iodocs.md index e24eb30..f698e35 100644 --- a/docs/src/iodocs.md +++ b/docs/src/iodocs.md @@ -4,7 +4,7 @@ This package facilitates the creation of `FASTA`, `BED`, and `GFF` files, specif Functionality: -The package provides three distinct functions for writing files in different formats: +The package provides four distinct functions for writing files in different formats: | Function | Description | |-------------------|--------------------------------------------------------| @@ -14,7 +14,7 @@ The package provides three distinct functions for writing files in different for | `write_orfs_gff` | Generates files in GFF format. | -All these functions support processing both `BioSequence` instances and external `FASTA` files. In the case of a `BioSequence` instace into external files, simply provide the path to the `FASTA` file using a `String` to the path. To demonstrate the use of the `write_*` methods with a BioSequence object, consider the following example: +All these functions support processing both `BioSequence` instances and external `FASTA` files. To process an external `FASTA` file, simply provide its path as a `String`. To demonstrate the use of the `write_*` methods with a `BioSequence`, consider the following example: ```julia using BioSequences, GeneFinder