Update ORF finding and writing functions in README.md

camilogarciabotero · Jan 10, 2024 · 3a220cc
1 parent 0c98b38
commit 3a220cc
Showing 2 changed files with 28 additions and 69 deletions.
diff --git a/README.md b/README.md
@@ -37,7 +37,7 @@ add GeneFinder
 If you are interested in the cutting edge of the development, please check out
 the master branch to try new features before release.
 
-## Finding complete and internal (overlapped) ORFs
+## Finding complete and overlapped ORFs
 
 The first implemented function is `findorfs`, a very non-restrictive ORF finder that collects all ORFs in a dedicated structure. Note that this will catch random ORFs, not necessarily genes, since it places no constraints on ORF size or overlap. Thus it might consider `aa"M*"` a possible encoded protein from the resulting ORFs.
 
@@ -67,7 +67,7 @@ findorfs(seq)
  ORF(695:706, '+', 2)
 ```
 
-Two other functions (`get_orfs_dna` and `get_orfs_aa`) pass the sequence to `findorfs` take the ORFs and act as generators of the sequence, so this way the can be `collect`ed in the REPL as an standard output or writteen into a file more conviniently using the `FASTX` IO system:
+Two other functions (`get_orfs_dna` and `get_orfs_aa`) return the ORFs as DNA and amino acid sequences, respectively. They first call `findorfs` to locate the ORFs and then return the corresponding array of `BioSequence` objects.
 
 ```julia
 get_orfs_dna(seq)
@@ -87,78 +87,37 @@ get_orfs_dna(seq)
  ATGCAACCCTGA
 ```
 
-```julia
-get_orfs_aa(seq)
-
-12-element Vector{LongSubSeq{AminoAcidAlphabet}}:
- MQP*
- MR*
- MRRMAR*
- MAR*
- M*
- MCPTAV*
- MQP*
- MHWLVLSI*
- MSPHKAM*
- M*
- MCPTAA*
- MQP*
-```
+## Writing ORF information into bioinformatic formats
 
-### Writting cds, proteins fastas, bed and gffs whether from a `LongSeq` or from a external fasta file.
+This package now facilitates the creation of `FASTA`, `BED`, and `GFF` files by extracting Open Reading Frame (ORF) information from `BioSequence` instances, particularly those of type `NucleicSeqOrView{A} where A`, and writing it in the desired format.
 
-```julia
-write_cds("cds.fasta", seq)
-```
+Functionality:
+
+The package provides four distinct functions for writing files in different formats:
+
+| Function         | Description                                  |
+|------------------|----------------------------------------------|
+| `write_orfs_fna` | Writes nucleotide sequences in FASTA format. |
+| `write_orfs_faa` | Writes amino acid sequences in FASTA format. |
+| `write_orfs_bed` | Outputs information in BED format.           |
+| `write_orfs_gff` | Generates files in GFF format.               |
 
-```bash
-cat cds.fasta
-
->location=29:40 strand=+ frame=2
-ATGCAACCCTGA
->location=137:145 strand=+ frame=2
-ATGCGCTGA
->location=164:184 strand=+ frame=2
-ATGCGTCGAATGGCACGGTGA
->location=173:184 strand=+ frame=2
-ATGGCACGGTGA
->location=236:241 strand=+ frame=2
-ATGTGA
->location=248:268 strand=+ frame=2
-ATGTGTCCAACGGCAGTCTGA
->location=362:373 strand=+ frame=2
-ATGCAACCCTGA
->location=470:496 strand=+ frame=2
-ATGCACTGGCTGGTCCTGTCAATCTGA
->location=551:574 strand=+ frame=2
-ATGTCACCGCACAAGGCAATGTGA
->location=569:574 strand=+ frame=2
-ATGTGA
->location=581:601 strand=+ frame=2
-ATGTGTCCAACGGCAGCCTGA
->location=695:706 strand=+ frame=2
-ATGCAACCCTGA
-```
 
-### Combining `FASTX` for reading and writing fastas
+All these functions accept either a `BioSequence` instance or an external `FASTA` file. For an external file, simply provide the path to the `FASTA` file as a `String`. To demonstrate the use of the `write_*` methods with a `BioSequence`, consider the following example:
 
 ```julia
-using FASTX
+using BioSequences, GeneFinder
 
-write_orfs_faa("test/data/NC_001884.fasta", "proteins.fasta")
+# > 180195.SAMN03785337.LFLS01000089 -> finds only 1 gene in Prodigal (from Pyrodigal tests)
+seq = dna"AACCAGGGCAATATCAGTACCGCGGGCAATGCAACCCTGACTGCCGGCGGTAACCTGAACAGCACTGGCAATCTGACTGTGGGCGGTGTTACCAACGGCACTGCTACTACTGGCAACATCGCACTGACCGGTAACAATGCGCTGAGCGGTCCGGTCAATCTGAATGCGTCGAATGGCACGGTGACCTTGAACACGACCGGCAATACCACGCTCGGTAACGTGACGGCACAAGGCAATGTGACGACCAATGTGTCCAACGGCAGTCTGACGGTTACCGGCAATACGACAGGTGCCAACACCAACCTCAGTGCCAGCGGCAACCTGACCGTGGGTAACCAGGGCAATATCAGTACCGCAGGCAATGCAACCCTGACGGCCGGCGACAACCTGACGAGCACTGGCAATCTGACTGTGGGCGGCGTCACCAACGGCACGGCCACCACCGGCAACATCGCGCTGACCGGTAACAATGCACTGGCTGGTCCTGTCAATCTGAACGCGCCGAACGGCACCGTGACCCTGAACACAACCGGCAATACCACGCTGGGTAATGTCACCGCACAAGGCAATGTGACGACTAATGTGTCCAACGGCAGCCTGACAGTCGCTGGCAATACCACAGGTGCCAACACCAACCTGAGTGCCAGCGGCAATCTGACCGTGGGCAACCAGGGCAATATCAGTACCGCGGGCAATGCAACCCTGACTGCCGGCGGTAACCTGAGC"
 ```
 
-```bash
-head proteins.fasta
-
->location=41:145 strand=- frame=2
-MTQKRKGPIPAQFEITPILRFNFIFDLTATNSFH*
->location=41:172 strand=- frame=2
-MVLKDVIVNMTQKRKGPIPAQFEITPILRFNFIFDLTATNSFH*
->location=41:454 strand=- frame=2
-MSEHLSQKEKELKNKENFIFDKYESGIYSDELFLKRKAALDEEFKELQNAKNELNGLQDTQSEIDSNTVRNNINKIIDQYHIESSSEKKNELLRMVLKDVIVNMTQKRKGPIPAQFEITPILRFNFIFDLTATNSFH*
->location=41:472 strand=- frame=2
-MKTKKQMSEHLSQKEKELKNKENFIFDKYESGIYSDELFLKRKAALDEEFKELQNAKNELNGLQDTQSEIDSNTVRNNINKIIDQYHIESSSEKKNELLRMVLKDVIVNMTQKRKGPIPAQFEITPILRFNFIFDLTATNSFH*
->location=41:505 strand=- frame=2
-MLSKYEDDNSNMKTKKQMSEHLSQKEKELKNKENFIFDKYESGIYSDELFLKRKAALDEEFKELQNAKNELNGLQDTQSEIDSNTVRNNINKIIDQYHIESSSEKKNELLRMVLKDVIVNMTQKRKGPIPAQFEITPILRFNFIFDLTATNSFH*
+Once a `BioSequence` object has been instantiated, the `write_orfs_fna` function proves useful for generating a `FASTA` file containing the nucleotide sequences of the ORFs. Notably, the `write_orfs*` methods support either an `IOStream` or an `IOBuffer` as an output argument, allowing flexibility in directing the output either to a file or a buffer. In the following example, we demonstrate writing the output directly to a file.
+
+```julia
+outfile = "LFLS01000089.fna"
+
+open(outfile, "w") do io
+    write_orfs_fna(seq, io)
+end
 ```
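
The added README text states that the `write_orfs*` methods accept an `IOBuffer` as well as an `IOStream`, but only the file case is shown. The following is a minimal sketch of the buffer case, assuming the two-argument form `write_orfs_fna(seq, io)` from the example above also works with an `IOBuffer`. The shortened sequence (the first 60 bases of the README sequence) and the variable name `fasta_text` are illustrative choices, not part of the package.

```julia
using BioSequences, GeneFinder

# First 60 bases of the README sequence; positions 29:40 hold the ORF ATGCAACCCTGA.
seq = dna"AACCAGGGCAATATCAGTACCGCGGGCAATGCAACCCTGACTGCCGGCGGTAACCTGAAC"

# Assumption: write_orfs_fna accepts an IOBuffer in place of a file stream,
# as the README text says.
io = IOBuffer()
write_orfs_fna(seq, io)

# Pull the FASTA text out of the buffer as a String and display it.
fasta_text = String(take!(io))
print(fasta_text)
```

Writing to a buffer can be handy in tests or in the REPL, where inspecting the records as a `String` avoids creating a temporary file.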
diff --git a/docs/src/iodocs.md b/docs/src/iodocs.md
@@ -4,7 +4,7 @@ This package facilitates the creation of `FASTA`, `BED`, and `GFF` files, specif
 
 Functionality:
 
-The package provides three distinct functions for writing files in different formats:
+The package provides four distinct functions for writing files in different formats:
 
 | Function          | Description                                            |
 |-------------------|--------------------------------------------------------|
@@ -14,7 +14,7 @@ The package provides three distinct functions for writing files in different for
 | `write_orfs_gff`    | Generates files in GFF format.                              |
 
 
-All these functions support processing both `BioSequence` instances and external `FASTA` files. In the case of a `BioSequence` instace into external files, simply provide the path to the `FASTA` file using a `String` to the path. To demonstrate the use of the `write_*` methods with a BioSequence object, consider the following example:
+All these functions accept either a `BioSequence` instance or an external `FASTA` file. For an external file, simply provide the path to the `FASTA` file as a `String`. To demonstrate the use of the `write_*` methods with a `BioSequence`, consider the following example:
 
 ```julia
 using BioSequences, GeneFinder