Refactor io (#59)
* removed dead code in io.go.

* merged Sequence structure into AnnotatedSequence and renamed AnnotatedSequence Sequence.

* added new mvp fasta io.

* rewrote FASTA IO to be sturdier.

* renamed feature.ParentAnnotatedSequence to feature.ParentSequence.

* removed whitespace constants and replaced with whitespace function.

* made .AddFeature() method public.

* added comment to .AddFeature() method.

* fixed comments in FASTA IO.
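
As a rough illustration of the merge described above, here is a minimal sketch of what the combined struct might look like. The `Meta.Locus.Circular` path and the plain-string `Sequence` field are inferred from the `hash.go` hunk in this commit. The `Feature` fields, the pointer type of `ParentSequence`, and the `AddFeature` signature are assumptions made for illustration, not poly's actual definitions.

```go
package main

import "fmt"

// Locus and Meta are trimmed down to the fields this commit's hunks reference.
type Locus struct {
	Circular bool
}

type Meta struct {
	Name  string
	Locus Locus
}

// Feature is a hypothetical stand-in. ParentSequence follows the rename in the
// commit message; its pointer type is an assumption.
type Feature struct {
	Name           string
	Type           string
	ParentSequence *Sequence
}

// Sequence sketches the merged struct: metadata, features, and the raw
// sequence string all live on one type.
type Sequence struct {
	Meta     Meta
	Features []Feature
	Sequence string
}

// AddFeature appends a feature and points it back at its parent sequence.
// The signature is assumed, not copied from poly.
func (s *Sequence) AddFeature(f Feature) {
	f.ParentSequence = s
	s.Features = append(s.Features, f)
}

func main() {
	plasmid := Sequence{
		Meta:     Meta{Name: "example", Locus: Locus{Circular: true}},
		Sequence: "GATTACA",
	}
	plasmid.AddFeature(Feature{Name: "GFP", Type: "CDS"})
	fmt.Println(plasmid.Sequence, len(plasmid.Features))
}
```

With this shape the raw string is reached as `sequence.Sequence` rather than `annotatedSequence.Sequence.Sequence`, which matches the simplification visible in the `hash.go` and `hash_test.go` hunks.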
TimothyStiles authored Oct 30, 2020
1 parent 7304bf6 commit a54050d
Showing 12 changed files with 334 additions and 374 deletions.
8 changes: 4 additions & 4 deletions docs/library-hashing.md
@@ -12,7 +12,7 @@ Hashes make incredibly powerful unique identifiers and with a wide array of hash
The golang team is currently figuring out the best way to implement blake3 into the standard library but in the meantime `poly` provides this special function and method wrapper to hash sequences using blake3. This will eventually be deprecated in favor of only using the `GenericSequenceHash()` function and `.Hash()` method wrapper.

```go
-// getting our example AnnotatedSequence struct
+// getting our example Sequence struct
puc19AnnotatedSequence := ReadJSON("data/puc19static.json")

// there are two ways to use the blake3 Least Rotation hasher.
@@ -21,7 +21,7 @@ The golang team is currently figuring out the best way to implement blake3 into
puc19Blake3Hash := puc19AnnotatedSequence.Blake3Hash()
fmt.Println(puc19Blake3Hash)

-// the second is with the Blake3SequenceHash(annotatedSequence AnnotatedSequence) function.
+// the second is with the Blake3SequenceHash(sequence Sequence) function.
puc19Blake3Hash = puc19AnnotatedSequence.Blake3Hash()
fmt.Println(puc19Blake3Hash)
```
@@ -33,7 +33,7 @@ Again, this will be deprecated in favor of using generic hashing with blake3 in
`poly` also provides a generic hashing function and method wrapper for hashing sequences with arbitrary hashing functions that use the golang standard library's hash function interface. Check out this switch statement in the [hash command source code](https://github.com/TimothyStiles/poly/blob/f51ec1c08820394d7cab89a5a4af92d9b803f0a4/commands.go#L261) to see all that `poly` provides in the command line utility alone.

```go
-// getting our example AnnotatedSequence struct
+// getting our example Sequence struct
puc19AnnotatedSequence := ReadJSON("data/puc19static.json")

// there are two ways to use the Least Rotation generic hasher.
@@ -42,7 +42,7 @@ Again, this will be deprecated in favor of using generic hashing with blake3 in
puc19Sha1Hash := puc19AnnotatedSequence.Hash(crypto.SHA1)
fmt.Println(puc19Sha1Hash)

-// the second is with the GenericSequenceHash() function where you pass an AnnotatedSequence along with a hash function as arguments.
+// the second is with the GenericSequenceHash() function where you pass an Sequence along with a hash function as arguments.
puc19Sha1Hash = GenericSequenceHash(puc19AnnotatedSequence, crypto.SHA1)
fmt.Println(puc19Sha1Hash)
```
52 changes: 26 additions & 26 deletions docs/library-io.md
@@ -3,37 +3,37 @@ id: library-io
title: Sequence Input Output
---

-At the center of `poly`'s annotated sequence support is the `AnnotatedSequence` struct. Structs are kind of Go's answer to objects in other languages. They provide a way of making custom datatypes and methods for developers to use. More on that [here](https://tour.golang.org/moretypes/2), [here](https://gobyexample.com/methods), and [here](https://www.golang-book.com/books/intro/9).
+At the center of `poly`'s annotated sequence support is the `Sequence` struct. Structs are kind of Go's answer to objects in other languages. They provide a way of making custom datatypes and methods for developers to use. More on that [here](https://tour.golang.org/moretypes/2), [here](https://gobyexample.com/methods), and [here](https://www.golang-book.com/books/intro/9).

-Anywho. `poly` centers around reading in various annotated sequence formats like genbank, or gff and parsing them into an `AnnotatedSequence` to do stuff with them. Whether that's being written out to JSON or being used by `poly` itself. Here are some examples.
+Anywho. `poly` centers around reading in various annotated sequence formats like genbank, or gff and parsing them into an `Sequence` to do stuff with them. Whether that's being written out to JSON or being used by `poly` itself. Here are some examples.

## Readers

-For all supported file formats `poly` supports a reader. A reader is a function literally named `ReadJSON(path)`, `ReadGbk(path)`, or `ReadGff(path)` that takes one argument - a filepath where your file is located, and returns an `AnnotatedSequence` struct.
+For all supported file formats `poly` supports a reader. A reader is a function literally named `ReadJSON(path)`, `ReadGbk(path)`, or `ReadGff(path)` that takes one argument - a filepath where your file is located, and returns an `Sequence` struct.

```go
bsubAnnotatedSequence := ReadGbk("data/bsub.gbk")
ecoliAnnotatedSequence := ReadGff("data/ecoli-mg1655.gff")
puc19AnnotatedSequence := ReadJSON("data/puc19static.json")
```

-These AnnotatedSequence structs contain all sorts of goodies but can be broken down into three sub main structs. `AnnotatedSequence.Meta`, `AnnotatedSequence.Features`, and `AnnotatedSequence.Sequence`.
+These Sequence structs contain all sorts of goodies but can be broken down into three sub main structs. `Sequence.Meta`, `Sequence.Features`, and `Sequence.Sequence`.

> Before we move on with the rest of IO I think it'd be good to go over these sub structs in the next section but of course you can skip to [writers](#writers) if you'd like.
-## AnnotatedSequence structs
+## Sequence structs

-Like I just said these AnnotatedSequence structs contain all sorts of goodies but can be broken down into three main sub structs:
+Like I just said these Sequence structs contain all sorts of goodies but can be broken down into three main sub structs:

-* [AnnotatedSequence.Meta](#annotatedsequencemeta)
-* [AnnotatedSequence.Features](#annotatedsequencefeatures)
-* [AnnotatedSequence.Sequence](#annotatedsequencesequence)
+* [Sequence.Meta](#annotatedsequencemeta)
+* [Sequence.Features](#annotatedsequencefeatures)
+* [Sequence.Sequence](#annotatedsequencesequence)

-Here's how the AnnotatedSequence struct is actually implemented as of [commit c4fc7e](https://github.com/TimothyStiles/poly/blob/c4fc7e6f6cdbd9e5ed2d8ffdbeb206d1d5a8d720/io.go#L108).
+Here's how the Sequence struct is actually implemented as of [commit c4fc7e](https://github.com/TimothyStiles/poly/blob/c4fc7e6f6cdbd9e5ed2d8ffdbeb206d1d5a8d720/io.go#L108).

```go
-// AnnotatedSequence holds all sequence information in a single struct.
-type AnnotatedSequence struct {
+// Sequence holds all sequence information in a single struct.
+type Sequence struct {
Meta Meta
Features []Feature
Sequence Sequence
@@ -42,11 +42,11 @@ Here's how the AnnotatedSequence struct is actually implemented as of [commit c4

> You can check out the original implementation [here](https://github.com/TimothyStiles/poly/blob/c4fc7e6f6cdbd9e5ed2d8ffdbeb206d1d5a8d720/io.go#L108) but I warn you that this is a snapshot and likely has been updated since last writing.
-### AnnotatedSequence.Meta
+### Sequence.Meta

The Meta substruct contains various meta information about whatever record was parsed. Things like name, version, genbank references, etc.

-So if I wanted to get something like the Genbank Accession number for a AnnotatedSequence I'd get it like this:
+So if I wanted to get something like the Genbank Accession number for a Sequence I'd get it like this:

```go
bsubAnnotatedSequence := ReadGbk("data/bsub.gbk")
@@ -68,7 +68,7 @@ Same goes for a lot of other stuff:
Here's how the Meta struct is actually implemented in [commit c4fc7e](https://github.com/TimothyStiles/poly/blob/c4fc7e6f6cdbd9e5ed2d8ffdbeb206d1d5a8d720/io.go#L34) which is the latest as of writing.

```go
-// Meta Holds all the meta information of an AnnotatedSequence struct.
+// Meta Holds all the meta information of an Sequence struct.
type Meta struct {
Name string
GffVersion string
@@ -93,9 +93,9 @@ Here's how the Meta struct is actually implemented in [commit c4fc7e](https://gi

You'll notice that there are actually three more substructs towards the bottom. They hold extra genbank specific information that's handy to have grouped together. More about how genbank files are structered can be found [here](https://www.ncbi.nlm.nih.gov/Sitemap/samplerecord.html).

-### AnnotatedSequence.Features
+### Sequence.Features

-The `Features` substruct is actually a slice (golang term for what is essentially a dynamic length list) of `Feature` structs that can be iterated through. For example if you wanted to iterate through an `AnnotatedSequence`'s features and get their name (i.e GFP) and type (i.e CDS) you'd do it like this.
+The `Features` substruct is actually a slice (golang term for what is essentially a dynamic length list) of `Feature` structs that can be iterated through. For example if you wanted to iterate through an `Sequence`'s features and get their name (i.e GFP) and type (i.e CDS) you'd do it like this.

```go
bsubAnnotatedSequence := ReadGbk("data/bsub.gbk")
@@ -106,12 +106,12 @@ The `Features` substruct is actually a slice (golang term for what is essentiall

The `Feature` struct has about 10 or so fields which you can learn more about from this section in [commit c4fc7e](https://github.com/TimothyStiles/poly/blob/c4fc7e6f6cdbd9e5ed2d8ffdbeb206d1d5a8d720/io.go#L80).

-### AnnotatedSequence.Sequence
+### Sequence.Sequence

-The AnnotatedSequence Sequence substruct is by far the most basic and critical. Without it well, you ain't go no DNA. The substruct itself has 4 simple fields.
+The Sequence Sequence substruct is by far the most basic and critical. Without it well, you ain't go no DNA. The substruct itself has 4 simple fields.

```go
-// Sequence holds raw sequence information in an AnnotatedSequence struct.
+// Sequence holds raw sequence information in an Sequence struct.
type Sequence struct {
Description string
Hash string
@@ -122,7 +122,7 @@ The AnnotatedSequence Sequence substruct is by far the most basic and critical.

The `Description`, `Hash`, and `HashFunction` are at all identifying fields of the Sequence string. The `Description` is the same kind of short description you'd find in a `fasta` or `fastq` file. The `Hash` and `HashFunction` are used to create a unique identifier specify to the sequence string which you'll learn more about in the next chapter on sequence hashing.

-To get an AnnotatedSequence sequence you can address it like so:
+To get an Sequence sequence you can address it like so:

```go
bsubAnnotatedSequence := ReadGbk("data/bsub.gbk")
@@ -133,10 +133,10 @@ To get an AnnotatedSequence sequence you can address it like so:

`poly` tries to supply a writer for all supported file formats that have a reader.

-Writers take two arguments. The first is an AnnotatedSequence struct, the second is a path to write out to.
+Writers take two arguments. The first is an Sequence struct, the second is a path to write out to.

```go
-// getting AnnotatedSequence(s) to write out again.
+// getting Sequence(s) to write out again.
bsubAnnotatedSequence := ReadGbk("data/bsub.gbk")

// writing out gbk file input as json.
@@ -154,7 +154,7 @@ To get an AnnotatedSequence sequence you can address it like so:

## Parsers

-`poly` parsers are what actually parse input files from a string without any of the system IO. This is particularly useful if you're like me and have an old database holding genbank files as strings. You can take those strings from a database or whatever and just pass them to `ParseGbk()`, or `ParseGff()` and they'll convert them into AnnotatedSequence structs.
+`poly` parsers are what actually parse input files from a string without any of the system IO. This is particularly useful if you're like me and have an old database holding genbank files as strings. You can take those strings from a database or whatever and just pass them to `ParseGbk()`, or `ParseGff()` and they'll convert them into Sequence structs.

```go
puc19AnnotatedSequence := ParseGbk("imagine this is actually a gbk in string format.")
@@ -164,10 +164,10 @@ That's it. The reason we don't have a `ParseJSON()` is that golang, like almost

## Builders

-`poly` builders take AnnotatedSequence structs and use them to build strings for different file formats.
+`poly` builders take Sequence structs and use them to build strings for different file formats.

```go
-// generating an AnnotatedSequence struct from a gff file.
+// generating an Sequence struct from a gff file.
ecoliAnnotatedSequence := ReadGff("data/ecoli-mg1655.gff")

// generating a gff string that then can be piped to stdout or written to a database.
10 changes: 5 additions & 5 deletions hash.go
@@ -95,12 +95,12 @@ func RotateSequence(sequence string) string {
return sequence
}

-// Hash is a method wrapper for hashing AnnotatedSequence structs.
-func (annotatedSequence AnnotatedSequence) Hash(hash hash.Hash) string {
-if annotatedSequence.Meta.Locus.Circular {
-annotatedSequence.Sequence.Sequence = RotateSequence(annotatedSequence.Sequence.Sequence)
+// Hash is a method wrapper for hashing Sequence structs.
+func (sequence Sequence) Hash(hash hash.Hash) string {
+if sequence.Meta.Locus.Circular {
+sequence.Sequence = RotateSequence(sequence.Sequence)
}
-seqHash, _ := hashSequence(annotatedSequence.Sequence.Sequence, hash)
+seqHash, _ := hashSequence(sequence.Sequence, hash)
return seqHash
}

4 changes: 2 additions & 2 deletions hash_test.go
@@ -29,10 +29,10 @@ func TestHashRegression(t *testing.T) {
}

func TestLeastRotation(t *testing.T) {
-annotatedSequence := ReadGbk("data/puc19.gbk")
+sequence := ReadGbk("data/puc19.gbk")
var sequenceBuffer bytes.Buffer

-sequenceBuffer.WriteString(annotatedSequence.Sequence.Sequence)
+sequenceBuffer.WriteString(sequence.Sequence)
bufferLength := sequenceBuffer.Len()

var rotatedSequence string