Refactor io (#59)

* removed dead code in io.go.
* merged Sequence structure into AnnotatedSequence and renamed AnnotatedSequence Sequence.
* added new mvp fasta io.
* rewrote FASTA IO to be sturdier.
* renamed feature.ParentAnnotatedSequence to feature.ParentSequence.
* removed whitespace constants and replaced with whitespace function.
* made .AddFeature() method public.
* added comment to .AddFeature() method.
* fixed comments in FASTA IO.
bebop · Oct 30, 2020 · a54050d
1 parent 7304bf6

Showing 12 changed files with 334 additions and 374 deletions.
diff --git a/docs/library-hashing.md b/docs/library-hashing.md
@@ -12,7 +12,7 @@ Hashes make incredibly powerful unique identifiers and with a wide array of hash
 The golang team is currently figuring out the best way to implement blake3 into the standard library but in the meantime `poly` provides this special function and method wrapper to hash sequences using blake3. This will eventually be deprecated in favor of only using the `GenericSequenceHash()` function and `.Hash()` method wrapper.
 
 ```go
-  // getting our example AnnotatedSequence struct
+  // getting our example Sequence struct
   puc19AnnotatedSequence := ReadJSON("data/puc19static.json")
 
   // there are two ways to use the blake3 Least Rotation hasher.
@@ -21,7 +21,7 @@ The golang team is currently figuring out the best way to implement blake3 into
   puc19Blake3Hash := puc19AnnotatedSequence.Blake3Hash()
   fmt.Println(puc19Blake3Hash)
 
-  // the second is with the Blake3SequenceHash(annotatedSequence AnnotatedSequence) function.
+  // the second is with the Blake3SequenceHash(sequence Sequence) function.
   puc19Blake3Hash = puc19AnnotatedSequence.Blake3Hash()
   fmt.Println(puc19Blake3Hash)
 ```
@@ -33,7 +33,7 @@ Again, this will be deprecated in favor of using generic hashing with blake3 in
 `poly` also provides a generic hashing function and method wrapper for hashing sequences with arbitrary hashing functions that use the golang standard library's hash function interface. Check out this switch statement in the [hash command source code](https://github.com/TimothyStiles/poly/blob/f51ec1c08820394d7cab89a5a4af92d9b803f0a4/commands.go#L261) to see all that `poly` provides in the command line utility alone.
 
 ```go
-  // getting our example AnnotatedSequence struct
+  // getting our example Sequence struct
   puc19AnnotatedSequence := ReadJSON("data/puc19static.json")
 
   // there are two ways to use the Least Rotation generic hasher.
@@ -42,7 +42,7 @@ Again, this will be deprecated in favor of using generic hashing with blake3 in
   puc19Sha1Hash := puc19AnnotatedSequence.Hash(crypto.SHA1)
   fmt.Println(puc19Sha1Hash)
 
-  // the second is with the GenericSequenceHash() function where you pass an AnnotatedSequence along with a hash function as arguments.
+  // the second is with the GenericSequenceHash() function where you pass an Sequence along with a hash function as arguments.
   puc19Sha1Hash = GenericSequenceHash(puc19AnnotatedSequence, crypto.SHA1)
   fmt.Println(puc19Sha1Hash)
 ```

diff --git a/docs/library-io.md b/docs/library-io.md
@@ -3,37 +3,37 @@ id: library-io
 title: Sequence Input Output
 ---
 
-At the center of `poly`'s annotated sequence support is the `AnnotatedSequence` struct. Structs are kind of Go's answer to objects in other languages. They provide a way of making custom datatypes and methods for developers to use. More on that [here](https://tour.golang.org/moretypes/2), [here](https://gobyexample.com/methods), and [here](https://www.golang-book.com/books/intro/9).
+At the center of `poly`'s annotated sequence support is the `Sequence` struct. Structs are kind of Go's answer to objects in other languages. They provide a way of making custom datatypes and methods for developers to use. More on that [here](https://tour.golang.org/moretypes/2), [here](https://gobyexample.com/methods), and [here](https://www.golang-book.com/books/intro/9).
 
-Anywho. `poly` centers around reading in various annotated sequence formats like genbank, or gff and parsing them into an `AnnotatedSequence` to do stuff with them. Whether that's being written out to JSON or being used by `poly` itself. Here are some examples.
+Anywho. `poly` centers around reading in various annotated sequence formats like genbank, or gff and parsing them into an `Sequence` to do stuff with them. Whether that's being written out to JSON or being used by `poly` itself. Here are some examples.
 
 ## Readers
 
-For all supported file formats `poly` supports a reader. A reader is a function literally named `ReadJSON(path)`, `ReadGbk(path)`, or `ReadGff(path)` that takes one argument - a filepath where your file is located, and returns an `AnnotatedSequence` struct.
+For all supported file formats `poly` supports a reader. A reader is a function literally named `ReadJSON(path)`, `ReadGbk(path)`, or `ReadGff(path)` that takes one argument - a filepath where your file is located, and returns an `Sequence` struct.
 
 ```go
   bsubAnnotatedSequence := ReadGbk("data/bsub.gbk")
   ecoliAnnotatedSequence := ReadGff("data/ecoli-mg1655.gff")
   puc19AnnotatedSequence := ReadJSON("data/puc19static.json")
 ```
 
-These AnnotatedSequence structs contain all sorts of goodies but can be broken down into three sub main structs. `AnnotatedSequence.Meta`, `AnnotatedSequence.Features`, and `AnnotatedSequence.Sequence`.
+These Sequence structs contain all sorts of goodies but can be broken down into three sub main structs. `Sequence.Meta`, `Sequence.Features`, and `Sequence.Sequence`.
 
 > Before we move on with the rest of IO I think it'd be good to go over these sub structs in the next section but of course you can skip to [writers](#writers) if you'd like.
 
-## AnnotatedSequence structs
+## Sequence structs
 
-Like I just said these AnnotatedSequence structs contain all sorts of goodies but can be broken down into three main sub structs:
+Like I just said these Sequence structs contain all sorts of goodies but can be broken down into three main sub structs:
 
-  * [AnnotatedSequence.Meta](#annotatedsequencemeta)
-  * [AnnotatedSequence.Features](#annotatedsequencefeatures)
-  * [AnnotatedSequence.Sequence](#annotatedsequencesequence)
+  * [Sequence.Meta](#annotatedsequencemeta)
+  * [Sequence.Features](#annotatedsequencefeatures)
+  * [Sequence.Sequence](#annotatedsequencesequence)
 
-Here's how the AnnotatedSequence struct is actually implemented as of [commit c4fc7e](https://github.com/TimothyStiles/poly/blob/c4fc7e6f6cdbd9e5ed2d8ffdbeb206d1d5a8d720/io.go#L108).
+Here's how the Sequence struct is actually implemented as of [commit c4fc7e](https://github.com/TimothyStiles/poly/blob/c4fc7e6f6cdbd9e5ed2d8ffdbeb206d1d5a8d720/io.go#L108).
 
 ```go
-  // AnnotatedSequence holds all sequence information in a single struct.
-  type AnnotatedSequence struct {
+  // Sequence holds all sequence information in a single struct.
+  type Sequence struct {
     Meta     Meta
     Features []Feature
     Sequence Sequence
@@ -42,11 +42,11 @@ Here's how the AnnotatedSequence struct is actually implemented as of [commit c4
 
 > You can check out the original implementation [here](https://github.com/TimothyStiles/poly/blob/c4fc7e6f6cdbd9e5ed2d8ffdbeb206d1d5a8d720/io.go#L108) but I warn you that this is a snapshot and likely has been updated since last writing.
 
-### AnnotatedSequence.Meta
+### Sequence.Meta
 
 The Meta substruct contains various meta information about whatever record was parsed. Things like name, version, genbank references, etc.
 
-So if I wanted to get something like the Genbank Accession number for a AnnotatedSequence I'd get it like this:
+So if I wanted to get something like the Genbank Accession number for a Sequence I'd get it like this:
 
 ```go
   bsubAnnotatedSequence := ReadGbk("data/bsub.gbk")
@@ -68,7 +68,7 @@ Same goes for a lot of other stuff:
 Here's how the Meta struct is actually implemented in [commit c4fc7e](https://github.com/TimothyStiles/poly/blob/c4fc7e6f6cdbd9e5ed2d8ffdbeb206d1d5a8d720/io.go#L34) which is the latest as of writing.
 
 ```go
-  // Meta Holds all the meta information of an AnnotatedSequence struct.
+  // Meta Holds all the meta information of an Sequence struct.
   type Meta struct {
     Name            string
     GffVersion      string
@@ -93,9 +93,9 @@ Here's how the Meta struct is actually implemented in [commit c4fc7e](https://gi
 
 You'll notice that there are actually three more substructs towards the bottom. They hold extra genbank specific information that's handy to have grouped together. More about how genbank files are structered can be found [here](https://www.ncbi.nlm.nih.gov/Sitemap/samplerecord.html).
 
-### AnnotatedSequence.Features
+### Sequence.Features
 
-The `Features` substruct is actually a slice (golang term for what is essentially a dynamic length list) of `Feature` structs that can be iterated through. For example if you wanted to iterate through an `AnnotatedSequence`'s features and get their name (i.e GFP) and type (i.e CDS) you'd do it like this.
+The `Features` substruct is actually a slice (golang term for what is essentially a dynamic length list) of `Feature` structs that can be iterated through. For example if you wanted to iterate through an `Sequence`'s features and get their name (i.e GFP) and type (i.e CDS) you'd do it like this.
 
 ```go
   bsubAnnotatedSequence := ReadGbk("data/bsub.gbk")
@@ -106,12 +106,12 @@ The `Features` substruct is actually a slice (golang term for what is essentiall
 
 The `Feature` struct has about 10 or so fields which you can learn more about from this section in [commit c4fc7e](https://github.com/TimothyStiles/poly/blob/c4fc7e6f6cdbd9e5ed2d8ffdbeb206d1d5a8d720/io.go#L80).
 
-### AnnotatedSequence.Sequence
+### Sequence.Sequence
 
-The AnnotatedSequence Sequence substruct is by far the most basic and critical. Without it well, you ain't go no DNA. The substruct itself has 4 simple fields.
+The Sequence Sequence substruct is by far the most basic and critical. Without it well, you ain't go no DNA. The substruct itself has 4 simple fields.
 
 ```go
-  // Sequence holds raw sequence information in an AnnotatedSequence struct.
+  // Sequence holds raw sequence information in an Sequence struct.
   type Sequence struct {
     Description  string
     Hash         string
@@ -122,7 +122,7 @@ The AnnotatedSequence Sequence substruct is by far the most basic and critical.
 
 The `Description`, `Hash`, and `HashFunction` are at all identifying fields of the Sequence string. The `Description` is the same kind of short description you'd find in a `fasta` or `fastq` file. The `Hash` and `HashFunction` are used to create a unique identifier specify to the sequence string which you'll learn more about in the next chapter on sequence hashing.
 
-To get an AnnotatedSequence sequence you can address it like so:
+To get an Sequence sequence you can address it like so:
 
 ```go
   bsubAnnotatedSequence := ReadGbk("data/bsub.gbk")
@@ -133,10 +133,10 @@ To get an AnnotatedSequence sequence you can address it like so:
 
  `poly` tries to supply a writer for all supported file formats that have a reader.
 
- Writers take two arguments. The first is an AnnotatedSequence struct, the second is a path to write out to.
+ Writers take two arguments. The first is an Sequence struct, the second is a path to write out to.
 
  ```go
-  // getting AnnotatedSequence(s) to write out again.
+  // getting Sequence(s) to write out again.
   bsubAnnotatedSequence := ReadGbk("data/bsub.gbk")
 
   // writing out gbk file input as json.
@@ -154,7 +154,7 @@ To get an AnnotatedSequence sequence you can address it like so:
 
 ## Parsers
 
-`poly` parsers are what actually parse input files from a string without any of the system IO. This is particularly useful if you're like me and have an old database holding genbank files as strings. You can take those strings from a database or whatever and just pass them to `ParseGbk()`, or `ParseGff()` and they'll convert them into AnnotatedSequence structs.
+`poly` parsers are what actually parse input files from a string without any of the system IO. This is particularly useful if you're like me and have an old database holding genbank files as strings. You can take those strings from a database or whatever and just pass them to `ParseGbk()`, or `ParseGff()` and they'll convert them into Sequence structs.
 
 ```go
   puc19AnnotatedSequence := ParseGbk("imagine this is actually a gbk in string format.")
@@ -164,10 +164,10 @@ That's it. The reason we don't have a `ParseJSON()` is that golang, like almost
 
 ## Builders
 
-`poly` builders take AnnotatedSequence structs and use them to build strings for different file formats. 
+`poly` builders take Sequence structs and use them to build strings for different file formats. 
 
 ```go
-  // generating an AnnotatedSequence struct from a gff file.
+  // generating an Sequence struct from a gff file.
   ecoliAnnotatedSequence := ReadGff("data/ecoli-mg1655.gff")
 
   // generating a gff string that then can be piped to stdout or written to a database.

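Note on the docs diff above: its struct snippet still lists a `Sequence Sequence` field, whereas the `hash.go` change in this commit reads `sequence.Sequence` as a plain string. Here is a minimal sketch of what the merged type might look like under that reading. The reduced `Meta`, `Locus`, and `Feature` types and the body of `AddFeature` are assumptions for illustration only, not the actual `poly` implementation.

```go
package main

import "fmt"

// Locus and Meta are trimmed-down, hypothetical stand-ins for poly's metadata types.
type Locus struct {
	Circular bool
}

type Meta struct {
	Name  string
	Locus Locus
}

// Feature is a reduced, hypothetical feature record. ParentSequence mirrors the
// field name introduced by this commit; its pointer type is an assumption.
type Feature struct {
	Name           string
	Type           string
	ParentSequence *Sequence
}

// Sequence sketches the merged struct: the raw sequence is a plain string field
// rather than a nested struct.
type Sequence struct {
	Meta     Meta
	Features []Feature
	Sequence string
}

// AddFeature links the feature back to its parent and appends it.
// This body is an illustrative guess at what the public method does.
func (sequence *Sequence) AddFeature(feature Feature) {
	feature.ParentSequence = sequence
	sequence.Features = append(sequence.Features, feature)
}

func main() {
	plasmid := Sequence{Meta: Meta{Name: "example"}, Sequence: "ATGC"}
	plasmid.AddFeature(Feature{Name: "GFP", Type: "CDS"})

	for _, feature := range plasmid.Features {
		fmt.Println(feature.Name, feature.Type, feature.ParentSequence.Meta.Name)
	}
}
```

A pointer receiver is used here so the appended feature is visible to the caller and the parent link points at the caller's struct rather than a copy.
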
diff --git a/hash.go b/hash.go
@@ -95,12 +95,12 @@ func RotateSequence(sequence string) string {
 	return sequence
 }
 
-// Hash is a method wrapper for hashing AnnotatedSequence structs.
-func (annotatedSequence AnnotatedSequence) Hash(hash hash.Hash) string {
-	if annotatedSequence.Meta.Locus.Circular {
-		annotatedSequence.Sequence.Sequence = RotateSequence(annotatedSequence.Sequence.Sequence)
+// Hash is a method wrapper for hashing Sequence structs.
+func (sequence Sequence) Hash(hash hash.Hash) string {
+	if sequence.Meta.Locus.Circular {
+		sequence.Sequence = RotateSequence(sequence.Sequence)
 	}
-	seqHash, _ := hashSequence(annotatedSequence.Sequence.Sequence, hash)
+	seqHash, _ := hashSequence(sequence.Sequence, hash)
 	return seqHash
 }
 

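Two properties of the `Hash` method above may be worth spelling out: it has a value receiver, so the rotation is applied to a copy and the caller's sequence is left alone, and rotating circular sequences first makes the hash independent of where the sequence happens to start. The sketch below illustrates both under simplifying assumptions: `leastRotation` is a brute-force stand-in for `RotateSequence` (whose body is not shown in this diff), the `Circular` flag is flattened onto the struct, and SHA-256 is hard-coded instead of passing a hash in.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// Sequence is a reduced, hypothetical version of the struct in this commit.
type Sequence struct {
	Circular bool // stands in for Meta.Locus.Circular
	Sequence string
}

// leastRotation returns the lexicographically smallest rotation of s.
// Brute force (quadratic) for clarity; not the algorithm poly uses.
func leastRotation(s string) string {
	best := s
	for i := 1; i < len(s); i++ {
		if candidate := s[i:] + s[:i]; candidate < best {
			best = candidate
		}
	}
	return best
}

// Hash has a value receiver, so the assignment below only changes a local copy.
func (sequence Sequence) Hash() string {
	if sequence.Circular {
		sequence.Sequence = leastRotation(sequence.Sequence)
	}
	sum := sha256.Sum256([]byte(sequence.Sequence))
	return hex.EncodeToString(sum[:])
}

func main() {
	a := Sequence{Circular: true, Sequence: "GATTACA"}
	b := Sequence{Circular: true, Sequence: "TACAGAT"} // same circle, different start

	fmt.Println(a.Hash() == b.Hash()) // true
	fmt.Println(a.Sequence)           // GATTACA (unchanged by Hash)
}
```
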
diff --git a/hash_test.go b/hash_test.go
@@ -29,10 +29,10 @@ func TestHashRegression(t *testing.T) {
 }
 
 func TestLeastRotation(t *testing.T) {
-	annotatedSequence := ReadGbk("data/puc19.gbk")
+	sequence := ReadGbk("data/puc19.gbk")
 	var sequenceBuffer bytes.Buffer
 
-	sequenceBuffer.WriteString(annotatedSequence.Sequence.Sequence)
+	sequenceBuffer.WriteString(sequence.Sequence)
 	bufferLength := sequenceBuffer.Len()
 
 	var rotatedSequence string