-
-
Notifications
You must be signed in to change notification settings - Fork 73
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
#360 Bwt #411
#360 Bwt #411
Conversation
Very cool! Tell me when you need a code review Agree that the first version should be naive. Can improve from there! |
Hey @TwFlem thanks for submitting this draft! For naive I meant really, really naive. I'm not sure what everything here is doing yet but here's a simple 30 line BWT implementation that works: func burrowsWheelerTransform(input string) string {
// Add a sentinel character to the end of the input
input += "$"
// Create a matrix to store rotations
rotations := make([]string, len(input))
for i := 0; i < len(input); i++ {
rotations[i] = input[i:] + input[:i]
}
// Sort the rotations lexicographically
sort.Strings(rotations)
// Extract the last characters of each rotation to get the transformed string
var result string
for _, rotation := range rotations {
result += string(rotation[len(rotation)-1])
}
return result
} Obviously yours is more complex and very likely faster but I'm not sure what's going on yet. What's the best way to compare this^^ toy version with your own in this PR? What references did you use to create your draft? Any links? -Tim |
Sure! So I read through many materials, but the most salient of them all came from Ben Langmead. He has a whole youtube playlist about the subject. I wish it wasn't so complex! So the BWT by itself is actually a small piece of the puzzle- there are auxiliary components that go along with the BWT to make it possible to both compress the size from the original string while also making it usable as a search index to perform other kinds of analysis like sequence alignment. The equivalent to what you wrote above is here in the pr (ignoring the skiplist and the part about wavelet tree). The only difference with what I have here is that it saves some memory because we don't have to store NxN matrix of the string before we sort it. The getBWTIndex is what allows us to get that original character from the provided sequence that would show up in the L column for that given rotation. Everything else in this PR is what is necessary to enable search and alignment. To sum up what I have in the PR and why it is there at a high level:
Also, I mentioned the RLFM and r-index. The only reason why I didn't continue with doing these is because this has gotten complex enough... that and I figured it'd be good to make sure you want all of this in the first place. With that being said, I highly recommend we pursue them. In a very similar manner to how the skipList in this PR is constructed, we can compress the L column substantially- especially for sequences that are very repetitive. He has some very compelling numbers in this video towards the end. Also, once you see the compression, it makes sense how much of an improvement this could be. Btw, I will fix the names and do some refactoring before taking this out of draft :). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's late and I've realized that the linter has probably caught the typos that I did but way faster 😅
I like where this PR is heading . I don't understand everything yet and I definitely need to check out Ben Langmead's videos.
It'd be cool to do a little talk and demo for this. It's a great example for teaching and maybe we should share it back with Ben? Just some late night thoughts.
bwt/bwt.go
Outdated
|
||
const nullChar = "0" | ||
|
||
// BWT Burrow Wheeler Transform |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I love this opening explanation. Makes it very clear why this is necessary.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You've probably put this somewhere but I think the naive implementation as pseudocode may be good to explain the very simple core of the algo and help demystify it.
The stuff from Langmead should also be cited somewhere in the code itself.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For sure! I'll do something similar with what I did with the wavelet tree to illustrate things like:
- the BWT itself
- how the first and last column have characters of the same rank in the same order that enables all the magic
- LF mapping
- Compressing "runny" sequences like the first column.
I'd be happy to cite him. Are there any examples of citation that exist in Poly? If not, is there a preferred way to do that?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed that this is a great explanation!
I think that we don't have a standard way of citing things, we generally just provide a link to the original source. Might be worth compiling a list of properly formatted citations in the README, especially if we are planning on publishing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds good @carreter . For now, I Cited him at the end of the BWT's comment at the top.
bwt/bwt.go
Outdated
"golang.org/x/exp/slices" | ||
) | ||
|
||
const nullChar = "0" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is the null char similar to the "$" I've seen appended to the input string in other implementations?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes exactly. It was "$" at first. I made it 0 for now because lexicographically, that's the character with the lowest value. This is just so if someone were to try and use the BWT with a string with spaces in it just to mess around and understand how it works, then it would work for them. Some follow up work that should be done:
- modify the sorting so if we encounter the
nullChar
we always make sure that is the least. Sort everything else lexographically. I can give this a shot in this PR if you want though. I didn't want to add more noise than necessary. - Make that nullChar configurable. Neither "$" or "0" should show up in the protein, dna, or rna alphabets but... you never know how users might want to use this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It can be whatever we want now! Replied to your other comment.
bwt/bwt.go
Outdated
return i | ||
} | ||
} | ||
panic("Unable to find the corresponding orginal positiong for a character in the original sequence in the suffix array. This should not be possible and indicates a malformed BWT.") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is a little meta but what are your thoughts on panics and when to use them? I almost always try to return an error and have it be handled upstream for fear of killing a user's long run.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It depends. I'm usually very hesitant to put them in anywhere as well. However in these cases in the PR, the panics are analogous to having an index out of bounds error for an array. If something wrong is happening like this in the internals of any of these data structures, then something has gone horribly wrong and the program can't recover since the BWT is likely borked.
That or if whoever is using is upstream is using it incorrectly, they can't recover from that incorrectness. Like in the case of the panic on the out of bounds on Extract. Imagine having to handle errors from index out of bounds.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thought about this some more... Also, sounds like the Go team has some opinions on this: https://go.dev/blog/defer-panic-and-recover... Also read some opinions in the Chat :)
It does seem weird to bubble up errors like this because it is an internal problem, but Poly is also the first open source project I've contributed to. In my day job it's easier to force certain expectations and constraints everywhere since it is a controlled environment.
The more I think about it, the more is seems appropriate to give the users of Poly the choice of how they want to handle potential errors from bugs and or inappropriate API usage.
Maybe the example of the json package they mentioned in the blog post would also be appropriate here? We could panic internally and recover at the API boundary. That or we could just bubble errors all the way up for safety if for some reason we ever wanted to reuse the auxiliary components in here so we don't have any hidden panics that bubble up in the future if things get moved around. Maybe a combination of the three 🤔
Happy to change it up in any way!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Still happy to change it up. With that said, I think the panics in bitvector are appropriate.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changed it up so callers of BWT will not get any unexpected panics. Changes in f5b459b. LMK What you think.
bwt/bwt.go
Outdated
prefixArray[i] = sequence[len(sequence)-i-1:] | ||
} | ||
|
||
// TODO: at the time of writing, the nullChar is 0, this is to ensure correctness in most cases. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this may be a good idea? Is this why I've been seeing "$" in other implementations?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think I'll just give the custom sorting function a shot on the string to ensure that the nullChar is always the least no matter what so we can put back the $.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done, LMK what you think. nullChar could be configurable as a functional option now if users need to have $ as a part of their alphabet for the index they are throwing in to the BWT for whatever reason.
expected int | ||
} | ||
|
||
func TestBWT_Count(t *testing.T) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this case making a substring histogram?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm actually not sure what Count could be used for in the real world tbh.
I'm thinking that it could be a quick way to check for the existence of an exact match? That or for statistics like you mentioned?
Might be a good question for the community- is this useful functionality to expose?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@Koeng101 is the Count of sub-sequences useful to Expose?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, it is! There are instances where you are looking for multiple counts of sub-sequences (like restriction enzyme sites). Generally speaking, useful to have counts.
bwt/wavelet.go
Outdated
// ## RSA Intuition | ||
// | ||
// From here you may be able to build some intuition as to how we can take RSA queries given | ||
// a characters path encoding and which character we'd like to Rank, Select, and Access. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
RSA is spelled out here! I was beginning to think I wouldn't find it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry about that. Yeah, I was thinking that was more appropriate to spell out in the RSABitvector and just telling readers to look there to learn more about RSA. Happy to make that redundant though so maintainers don't have to jump around to Grok all of these concepts.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I put Definitions of Acronyms towards the top of explanations like this. LMK if that works out.
I'd be happy to do that. Right now, there's not too much to demo or talk about maybe. We can talk about saving memory, indexing strings, the magic of the BWT and how rank allows us to take advantage of that magic. Maybe it'd be even more valuable to have a demo once we have the fuzzy matching that Koeng101 mentioned in #360 along with having the RLFM and r-index? At that point, the BWT seems like it would be feature complete enough to reach a critical mass of usefulness for Poly's users.
I'd definitely like to at least thank him! It's amazing that he is giving that knowledge to the community like that. Perhaps his channel deserves a spot in https://github.com/TimothyStiles/how-to-synbio. I'm happy to share anything else that he might appreciate as well! |
Typo Co-authored-by: Willow Carretero Chavez <[email protected]>
doc correction Co-authored-by: Willow Carretero Chavez <[email protected]>
typo Co-authored-by: Willow Carretero Chavez <[email protected]>
doc clarity Co-authored-by: Willow Carretero Chavez <[email protected]>
Link to additional explanation for Wavelet Trees. Co-authored-by: Willow Carretero Chavez <[email protected]>
doc improvement Co-authored-by: Willow Carretero Chavez <[email protected]>
Doc improvement. Co-authored-by: Willow Carretero Chavez <[email protected]>
Doc Improvement Co-authored-by: Willow Carretero Chavez <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@TwFlem this is looking really good!
Exposed test structs should be removed and example tests should be in their own separate example_test.go
file in a separate bwt_test package
.
It's not clear where a lot of the test cases are coming from and there should be more comments about where they're from, and what corner case their testing.
I'm with @carreter that it'd be good to wrap my head around the actual concepts but I'm also thinking we should find some outside folk more familiar with bwt to review this too if we can.
"testing" | ||
) | ||
|
||
type GetBitTestCase struct { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should this be exposed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It Shouldn't be. Definitions in _test aren't public.
package poly
import (
"fmt"
"github.com/bebop/poly/bwt"
)
func polyRoot() {
thing := bwt.GetBitTestCase{}
fmt.Println(thing)
}
^ That won't build
bwt/bwt_test.go
Outdated
"golang.org/x/exp/slices" | ||
) | ||
|
||
func ExampleBWT_Count() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Example tests should be in their own example_test.go file with the package name bwt_test
so that package functions are called like bwt.Count
to more closely emulate usage outside the package.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
// Output: 10 | ||
} | ||
|
||
type BWTCountTestCase struct { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should this be exposed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It shouldn't be. Explained Above.
bwt/bwt_test.go
Outdated
} | ||
} | ||
|
||
func ExampleBWT_Locate() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
in example_test.go file?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
bwt/bwt_test.go
Outdated
func ExampleBWT_Extract() { | ||
inputSequence := "AACCTGCCGTCGGGGCTGCCCGTCGCGGGACGTCGAAACGTGGGGCGAAACGTG" | ||
|
||
bwt, err := New(inputSequence) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
move to example_test.go
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
// Output: AACGTG | ||
} | ||
|
||
type BWTExtractTestCase struct { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should this be exposed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It shouldn't be. Explained above.
bwt/bwt.go
Outdated
return searchRange.end - searchRange.start, nil | ||
} | ||
|
||
// Locate returns a list of offsets at which the begging |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
typo: begging should be beginning?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
bwt/wavelet.go
Outdated
path bitvector | ||
} | ||
|
||
func NewWaveletTreeFromString(str string) waveletTree { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is this supposed to be public? Needs Doc string
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is not. ty. Made it private.
I added some comments about some edge cases that are being tested like with Jacobson's rank and the rsa_bitvector rank tests. I added a mention of how useful the the reconstruction test is for the WaveletTree. I also added a reconstruction test that pretty much guarantees that whatever we use for the suffix array in the future is well formed- otherwise we would never be able to reconstruct the string. I added some more tests and more validation. So now we throw errors for empty and other invalid inputs. I fixed an edge case where a user could supply a string with a single alphabet character. Otherwise, there aren't really too many edge cases to go around with the FM index BWT. It's all predicated on the fact that the LF search works. LF search is all predicated on the fact that the Wavelet Tree and other auxiliary DSAs work as expected. The other tests are just to meant the show that auxiliary pieces work as expected. The RSA and Bit vector test cases are verbose because of all the bit math involved. Those are there to give us peace of mind that it works as expected. There will be some more BWT edge cases with the other improvements that I'll follow up with like the RLFM index (done) and the r-index (in progress). Those I can definitely call out. Those changes will be pretty small btw. Nothing like getting the basic BWT off the ground 😭
Sounds good! It might be better to emphasize the correctness over performance improvements. Reasonable performance improvements are already on the way 👀 |
|
||
Lets use banana as an example. | ||
|
||
banana$ $banana |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can there be an example test that takes banana$
and returns the last column as a string?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@TimothyStiles updated
bwt/bwt.go
Outdated
return bwt.firstColumnSkipList[i] | ||
} | ||
} | ||
panic("figure out what to do here") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what should be done here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ty, updated this along with one hiding in the bitvector 1cc3755
"github.com/bebop/poly/bwt" | ||
"golang.org/x/exp/slices" | ||
) | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should have an Example_basic test here showing how to use the package's basic functionality end-to-end
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated 1cc3755
Changes in this PR
This is for #360. @TimothyStiles mentioned implementing a naive version of the BWT. At first I took his suggestion. As I read up on the applications of BWT for genomic data, the more it seemed important that we have as small as a memory footprint as possible. I then began to optimize Some things up front. And well... I ultimately followed Tim's suggestion about keeping it naive so we have something to start with :)
Also, this does not include the fuzzy matching that @Koeng101. LMK if we want that in here, but that might be appropriate as its own thing since there's already a fair amount going on in this PR.
There are a good chunk of auxiliary pieces to the BWT. I will talk about what was used in this PR along with some suggested ways we move forward to improve things. I will also point out the especially naive parts that most definitely need to be addressed.
The BWT and other data structures added to enable alignment are documented in comments.
Things to Consider for the future:
Why are you making these changes?
Tim hinted at a need for a way to compress sequences while still being able to analyze them in #360
Are any changes breaking? (IMPORTANT)
This is all net new!
Pre-merge checklist
primers/primers_test.go
for what this might look like.CHANGELOG.md
in the[Unreleased]
section.