Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

created a super minimal mash function for sketching sequences. #344

Merged
merged 15 commits into from
Oct 5, 2023
Merged
Show file tree
Hide file tree
Changes from 2 commits
Commits
File filter

Filter by extension

Filter by extension


Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
1 change: 1 addition & 0 deletions go.mod
Original file line number Diff line number Diff line change
Expand Up @@ -16,6 +16,7 @@ require (
require (
github.com/davecgh/go-spew v1.1.1 // indirect
github.com/mattn/go-sqlite3 v1.14.13 // indirect
github.com/spaolacci/murmur3 v1.1.0 // indirect
)

require (
Expand Down
2 changes: 2 additions & 0 deletions go.sum
Original file line number Diff line number Diff line change
Expand Up @@ -22,6 +22,8 @@ github.com/pmezard/go-difflib v1.0.0 h1:4DBwDE0NGyQoBHbLQYPwSUPoCMWR5BEzIk/f1lZb
github.com/pmezard/go-difflib v1.0.0/go.mod h1:iKH77koFhYxTK1pcRnkKkqfTogsbg7gZNVY4sRDYZ/4=
github.com/sergi/go-diff v1.2.0 h1:XU+rvMAioB0UC3q1MFrIQy4Vo5/4VsRDQQXHsEya6xQ=
github.com/sergi/go-diff v1.2.0/go.mod h1:STckp+ISIX8hZLjrqAeVduY0gWCT9IjLuqbuNXdaHfM=
github.com/spaolacci/murmur3 v1.1.0 h1:7c1g84S4BPRrfL5Xrdp6fOJ206sU9y293DDHaoy0bLI=
github.com/spaolacci/murmur3 v1.1.0/go.mod h1:JwIasOWyU6f++ZhiEuf87xNszmSA2myDM2Kzu9HwQUA=
github.com/stretchr/objx v0.1.0/go.mod h1:HFkY916IF+rwdDfMAkV7OtwuqBVzrE8GR6GFx+wExME=
github.com/stretchr/testify v1.4.0 h1:2E4SXV/wtOkTonXsotYi4li6zVWxYlZuYNCXe9XRJyk=
github.com/stretchr/testify v1.4.0/go.mod h1:j7eGeouHqKxXV5pUuKE4zz7dFj8WfuZ+81PSLYec5m4=
Expand Down
88 changes: 88 additions & 0 deletions mash/mash.go
Original file line number Diff line number Diff line change
@@ -0,0 +1,88 @@
/*
Package mash is for sketching sequence data to make it easier to search for and against.

The package is named mash after the mash sketching algorithm, which is based on the MinHash algorithm.

Mash: fast genome and metagenome distance estimation using MinHash.
Ondov, B.D., Treangen, T.J., Melsted, P. et al.
Genome Biol 17, 132 (2016).
https://doi.org/10.1186/s13059-016-0997-x

Mash Screen: high-throughput sequence containment estimation for genome discovery.
Ondov, B., Starrett, G., Sappington, A. et al.
Genome Biol 20, 232 (2019).
https://doi.org/10.1186/s13059-019-1841-x

The idea is that we can sketch a sequence of data, and then compare the sketch to other sketches to see how similar they are.
This saves a bunch of computation time and memory, because we don't have to compare the entire sequence to another sequence.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It'd be nice to have some context on how this differs from algorithms like minimap2. Computation upfront?


TTFN,
Tim
*/
package mash

import "github.com/spaolacci/murmur3"
TimothyStiles marked this conversation as resolved.
Show resolved Hide resolved

// sketch algorithm
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It'd be nice to have more context for this. Where is the sketch algorithm?

Comments should go up to 80char

For this long of multiline, especially if just context, should have a multiline comment style

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's fine to have multiline comments that start each line with //, and IMO we should standardize this across the repo to be one way or the other - not a huge deal though.

// slide a window of size k along the sequence
// for each kmer in the window, hash it to a 32 or 64 bit number
// keep the minimum hash value of all the kmers in the window up to a given sketch size
// the sketch is a vector of the minimum hash values

// what are mash's inputs and outputs?
// inputs: a

type Mash struct {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd like a description now what a mash is

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Koeng101 does the top level package comment resolve this?

KmerSize uint
SketchSize uint
Sketches []uint32
}

func NewMash(kmerSize uint, sketchSize uint) *Mash {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems to me that this API could be simplified quite a lot and be made more direct and performant:

// WriteSketches writes `n` sketches into dstSketches by splitting sequence
// into chunks of `kmerSize`.
func WriteSketches(dstSketches []uint32, sequence string, kmerSize int) (n int, _ error) {
 
} 

This would also give us leeway by not exposing leaky abstractions in case in the future we realize we want to be able to configure how we write our Sketches with a Masher/Mash type which takes several configuration parameters which may not be apparent to us today.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also I'd avoid uint in size definitions to catch programmer errors early. Having sizes as uint can do more harm than good i.e:

sketchSize := uint(len(sketchPool) - totalProcessed)

In the code above the user calculates the sketch size from a pair of ints. If the user is mistaken and the result of the calculation is negative then uint converts the negative number into a massively huge number which would cause a out of memory panic.

Compared to negative size panics the OOM panic can affect other goroutines running in the same program.

return &Mash{
KmerSize: kmerSize,
SketchSize: sketchSize,
Sketches: make([]uint32, sketchSize),
}
}

func (m *Mash) Sketch(sequence string) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this the NewMash function? If so, a function description string would be nice

// slide a window of size k along the sequence
for kmerStart := 0; kmerStart < len(sequence)-int(m.KmerSize); kmerStart++ {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

for standardization sequence should be capitalized or lowercased

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

capitalized or lowercased

preference?

kmer := sequence[kmerStart : kmerStart+int(m.KmerSize)]
// hash the kmer to a 32 bit number
hash := murmur3.Sum32([]byte(kmer))
// keep the minimum hash value of all the kmers in the window up to a given sketch size
// the sketch is a vector of the minimum hash values
var biggestHashIndex int

// find the biggest hash value in the sketch
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OH THIS is how it works. Basically, you hash each kmer window, and then keep a list of hashes, small to large. The shared hashes are then compared. There should be a comment somewhere generally explaining the concept behind this in a bit more detail than just the sketches.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah that's more or less how it works.

Keep it mind that sketches generated via different kmer window sizes are non-comparable. That's why I thought to store kmer size in the Mash struct @soypat because it could be relevant later. Open to better ways/ more idiomatic ways of doing this.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If Kmer size is important then changes I proposed are probably not the best way to go. Knowing this I'm thinking the following usage would be nice:

m1, err :=mash.NewMash(seq1, MashConfig{Sketches: 23, KmerSize:k1})
// handle your errors...
m2, err := mash.NewMash(seq2,  MashConfig{Sketches: 23, KmerSize:k2})

 // Don't let users modify kmerSize field since it "invalidates" calculated data, do expose it via a method.
if m1.KmerSize() != m2.KmerSize() {
    panic("oh no!")
}

if mash.Equal(m1, m2) {
   print("mashes equal!")
}

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@soypat think I'm going to try writing it like this today. Mashes are measured by their Jaccard Similarity (total intersection of hashes / actual hashes so instead will likely make it mash.Simularity and have it return a floating point between 0-1.

One other thing I can probably handle by myself is keeping the hashes organized by lexical least to lexically greatest during creation. Any advice on the fastest/most idiomatic way of doing this in Go?

for i := 0; i < len(m.Sketches); i++ {
TimothyStiles marked this conversation as resolved.
Show resolved Hide resolved
if m.Sketches[i] == 0 {
biggestHashIndex = i
break
} else if m.Sketches[i] > m.Sketches[biggestHashIndex] {
biggestHashIndex = i
}
}
m.Sketches[biggestHashIndex] = hash
}
}

// distance algorithm
TimothyStiles marked this conversation as resolved.
Show resolved Hide resolved
// compare the sketches of two sequences
TimothyStiles marked this conversation as resolved.
Show resolved Hide resolved
// count the number of hashes that are the same
// divide the number of hashes that are the same by the total number of hashes
// the result is the distance between the two sequences
func (m *Mash) Distance(other *Mash) float64 {
Koeng101 marked this conversation as resolved.
Show resolved Hide resolved
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With the above change Distance could be rewritten as

func Distance(a, b []uint32) float64

var sameHashes int
for i := 0; i < len(m.Sketches); i++ {
for j := 0; j < len(other.Sketches); j++ {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This nested loop could be changed for two separate loops, one where the shorter sketch slice is stored into a map[uint32]struct{} type and the other where they are checked to see if other long uint32s slice contains entries which are in the map. Would speed up for large sketches.

if m.Sketches[i] == other.Sketches[j] {
sameHashes++
break
}
}
}
return 1 - (float64(sameHashes) / float64(len(m.Sketches)))
}
20 changes: 20 additions & 0 deletions mash/mash_test.go
Original file line number Diff line number Diff line change
@@ -0,0 +1,20 @@
package mash_test

import (
"testing"

"github.com/TimothyStiles/poly/mash"
)

func TestMash(t *testing.T) {
fingerprint1 := mash.NewMash(17, 10)
fingerprint1.Sketch("ATGCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGA")

fingerprint2 := mash.NewMash(17, 10)
fingerprint2.Sketch("ATGCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGA")

distance := fingerprint1.Distance(fingerprint2)
if distance != 0 {
t.Errorf("Expected distance to be 0, got %f", distance)
}
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No tests for affirming the algorithm actually works on more complicated sequences or in the null case

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point. Any ideas for a good test case @Koeng101?

Loading