-
-
Notifications
You must be signed in to change notification settings - Fork 73
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
created a super minimal mash function for sketching sequences. #344
Changes from 2 commits
53bd8df
b66bb6b
d85424d
ca6c697
f73adda
0698669
3367e08
a47f96c
ba3d892
bc48aa2
528129d
56abb66
40323db
f5982b6
396a4db
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,88 @@ | ||
/* | ||
Package mash is for sketching sequence data to make it easier to search for and against. | ||
|
||
The package is named mash after the mash sketching algorithm, which is based on the MinHash algorithm. | ||
|
||
Mash: fast genome and metagenome distance estimation using MinHash. | ||
Ondov, B.D., Treangen, T.J., Melsted, P. et al. | ||
Genome Biol 17, 132 (2016). | ||
https://doi.org/10.1186/s13059-016-0997-x | ||
|
||
Mash Screen: high-throughput sequence containment estimation for genome discovery. | ||
Ondov, B., Starrett, G., Sappington, A. et al. | ||
Genome Biol 20, 232 (2019). | ||
https://doi.org/10.1186/s13059-019-1841-x | ||
|
||
The idea is that we can sketch a sequence of data, and then compare the sketch to other sketches to see how similar they are. | ||
This saves a bunch of computation time and memory, because we don't have to compare the entire sequence to another sequence. | ||
|
||
TTFN, | ||
Tim | ||
*/ | ||
package mash | ||
|
||
import "github.com/spaolacci/murmur3" | ||
TimothyStiles marked this conversation as resolved.
Show resolved
Hide resolved
|
||
|
||
// sketch algorithm | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. It'd be nice to have more context for this. Where is the sketch algorithm? Comments should go up to 80char For this long of multiline, especially if just context, should have a multiline comment style There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I think it's fine to have multiline comments that start each line with |
||
// slide a window of size k along the sequence | ||
// for each kmer in the window, hash it to a 32 or 64 bit number | ||
// keep the minimum hash value of all the kmers in the window up to a given sketch size | ||
// the sketch is a vector of the minimum hash values | ||
|
||
// what are mash's inputs and outputs? | ||
// inputs: a | ||
|
||
type Mash struct { | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I'd like a description now what a mash is There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. @Koeng101 does the top level package comment resolve this? |
||
KmerSize uint | ||
SketchSize uint | ||
Sketches []uint32 | ||
} | ||
|
||
func NewMash(kmerSize uint, sketchSize uint) *Mash { | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Seems to me that this API could be simplified quite a lot and be made more direct and performant: // WriteSketches writes `n` sketches into dstSketches by splitting sequence
// into chunks of `kmerSize`.
func WriteSketches(dstSketches []uint32, sequence string, kmerSize int) (n int, _ error) {
} This would also give us leeway by not exposing leaky abstractions in case in the future we realize we want to be able to configure how we write our Sketches with a There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Also I'd avoid sketchSize := uint(len(sketchPool) - totalProcessed) In the code above the user calculates the sketch size from a pair of Compared to negative size panics the OOM panic can affect other goroutines running in the same program. |
||
return &Mash{ | ||
KmerSize: kmerSize, | ||
SketchSize: sketchSize, | ||
Sketches: make([]uint32, sketchSize), | ||
} | ||
} | ||
|
||
func (m *Mash) Sketch(sequence string) { | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Is this the NewMash function? If so, a function description string would be nice |
||
// slide a window of size k along the sequence | ||
for kmerStart := 0; kmerStart < len(sequence)-int(m.KmerSize); kmerStart++ { | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. for standardization sequence should be capitalized or lowercased There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
preference? |
||
kmer := sequence[kmerStart : kmerStart+int(m.KmerSize)] | ||
// hash the kmer to a 32 bit number | ||
hash := murmur3.Sum32([]byte(kmer)) | ||
// keep the minimum hash value of all the kmers in the window up to a given sketch size | ||
// the sketch is a vector of the minimum hash values | ||
var biggestHashIndex int | ||
|
||
// find the biggest hash value in the sketch | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. OH THIS is how it works. Basically, you hash each kmer window, and then keep a list of hashes, small to large. The shared hashes are then compared. There should be a comment somewhere generally explaining the concept behind this in a bit more detail than just the sketches. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Yeah that's more or less how it works. Keep it mind that sketches generated via different kmer window sizes are non-comparable. That's why I thought to store kmer size in the Mash struct @soypat because it could be relevant later. Open to better ways/ more idiomatic ways of doing this. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. If Kmer size is important then changes I proposed are probably not the best way to go. Knowing this I'm thinking the following usage would be nice: m1, err :=mash.NewMash(seq1, MashConfig{Sketches: 23, KmerSize:k1})
// handle your errors...
m2, err := mash.NewMash(seq2, MashConfig{Sketches: 23, KmerSize:k2})
// Don't let users modify kmerSize field since it "invalidates" calculated data, do expose it via a method.
if m1.KmerSize() != m2.KmerSize() {
panic("oh no!")
}
if mash.Equal(m1, m2) {
print("mashes equal!")
} There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. @soypat think I'm going to try writing it like this today. Mashes are measured by their Jaccard Similarity (total intersection of hashes / actual hashes so instead will likely make it One other thing I can probably handle by myself is keeping the hashes organized by lexical least to lexically greatest during creation. Any advice on the fastest/most idiomatic way of doing this in Go? |
||
for i := 0; i < len(m.Sketches); i++ { | ||
TimothyStiles marked this conversation as resolved.
Show resolved
Hide resolved
|
||
if m.Sketches[i] == 0 { | ||
biggestHashIndex = i | ||
break | ||
} else if m.Sketches[i] > m.Sketches[biggestHashIndex] { | ||
biggestHashIndex = i | ||
} | ||
} | ||
m.Sketches[biggestHashIndex] = hash | ||
} | ||
} | ||
|
||
// distance algorithm | ||
TimothyStiles marked this conversation as resolved.
Show resolved
Hide resolved
|
||
// compare the sketches of two sequences | ||
TimothyStiles marked this conversation as resolved.
Show resolved
Hide resolved
|
||
// count the number of hashes that are the same | ||
// divide the number of hashes that are the same by the total number of hashes | ||
// the result is the distance between the two sequences | ||
func (m *Mash) Distance(other *Mash) float64 { | ||
Koeng101 marked this conversation as resolved.
Show resolved
Hide resolved
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. With the above change func Distance(a, b []uint32) float64 |
||
var sameHashes int | ||
for i := 0; i < len(m.Sketches); i++ { | ||
for j := 0; j < len(other.Sketches); j++ { | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. This nested loop could be changed for two separate loops, one where the shorter sketch slice is stored into a |
||
if m.Sketches[i] == other.Sketches[j] { | ||
sameHashes++ | ||
break | ||
} | ||
} | ||
} | ||
return 1 - (float64(sameHashes) / float64(len(m.Sketches))) | ||
} |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,20 @@ | ||
package mash_test | ||
|
||
import ( | ||
"testing" | ||
|
||
"github.com/TimothyStiles/poly/mash" | ||
) | ||
|
||
func TestMash(t *testing.T) { | ||
fingerprint1 := mash.NewMash(17, 10) | ||
fingerprint1.Sketch("ATGCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGA") | ||
|
||
fingerprint2 := mash.NewMash(17, 10) | ||
fingerprint2.Sketch("ATGCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGA") | ||
|
||
distance := fingerprint1.Distance(fingerprint2) | ||
if distance != 0 { | ||
t.Errorf("Expected distance to be 0, got %f", distance) | ||
} | ||
} | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. No tests for affirming the algorithm actually works on more complicated sequences or in the null case There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Good point. Any ideas for a good test case @Koeng101? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It'd be nice to have some context on how this differs from algorithms like minimap2. Computation upfront?