Define public API #1
Comments
Accept ? :
In discussion:
|
Can we let the user choose the type used to store the compacted nucleotides? We could use u16, u32, u64, or u128 to store more nucleotides in one variable, if that's possible.
I think that, in addition to these functions, we should provide an iterator to make nuc2bit easier to use in some scenarios. The iterator would apply the function
|
Rust doesn't play well with generic numeric types, so we have to pick a single type, like I think it would be easy to have two features:
and bulk-processing (similar to my implementations):
These all encode/decode We can also have a check function to verify that a string contains only valid nucleotides, and another function to get the complement of a string of nucleotides. We can do this later. Also, let's not worry about implementing the triplet encoding for undetermined nucleotides that I talked about in my repo. |
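To make the bulk encode/decode idea concrete, here is a minimal Python sketch of packing nucleotides into u64-sized chunks at 2 bits each, so one chunk holds 32 nucleotides. The code table (A=00, C=01, G=10, T=11), the left-aligned layout, and the names `encode` and `decode` are assumptions for illustration only, not the API agreed in this thread.

```python
# Assumed 2-bit codes; the thread does not fix a mapping at this point.
CODES = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}
LETTERS = "ACGT"
NUCS_PER_CHUNK = 32  # a u64 holds 64 / 2 = 32 nucleotides


def encode(dna):
    """Pack a DNA string into u64-sized ints, first nucleotide in the high bits."""
    chunks = []
    for start in range(0, len(dna), NUCS_PER_CHUNK):
        block = dna[start:start + NUCS_PER_CHUNK]
        value = 0
        for nuc in block:
            value = (value << 2) | CODES[nuc]
        # Left-align a short final block so every chunk has the same layout.
        value <<= 2 * (NUCS_PER_CHUNK - len(block))
        chunks.append(value)
    return chunks


def decode(chunks, length):
    """Unpack `length` nucleotides from the chunks produced by encode()."""
    out = []
    for i in range(length):
        chunk = chunks[i // NUCS_PER_CHUNK]
        shift = 62 - 2 * (i % NUCS_PER_CHUNK)
        out.append(LETTERS[(chunk >> shift) & 0b11])
    return "".join(out)
```

The caller has to keep the sequence length alongside the chunks, because padding bits in the last chunk are indistinguishable from trailing A's under this mapping.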
OK, I agree. (I edited the first comment.)
Why do you want two features? For me, the iterator is just sugar on top of
This plan is perfect for me.
Maybe add a revcomp function too.
Hey, I see, but you didn't want to add this work in |
Using an iterator means that the nucleotides cannot be packed into The undetermined nucleotide encoding technique is a bit annoying to implement in a cross-platform way, so we will worry about it in the future. |
I disagree with you: we can pack into u64 chunks and unpack them on demand.

    def nuc2bit_iterator(dna):
        u64_vec = encode(dna)  # packed u64 chunks from the bulk encoder
        for u64_val in u64_vec:
            for _ in range(32):
                # the top two bits hold the next nucleotide
                yield (u64_val >> 62) & 0b11
                # move the following nucleotide into the top bits, staying within 64 bits
                u64_val = (u64_val << 2) & 0xFFFF_FFFF_FFFF_FFFF

(We can probably do better.) OK, we lose the advantage of compression, but we keep the speedup of encoding all sequences quickly and gain flexibility for the user. My main current usage of
OK (I removed it from the current public API). |
Ok, I think that encoding all the sequences using my SIMD methods and then unpacking it is slow. Most of the time spent is packing it into u64 chunks. If we don't have to pack it, then we save a lot of time. For the iterator implementation, it is better to just use a lookup table to convert byte -> byte, like what you already have. |
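The lookup-table approach for the iterator can be sketched as follows. This is an illustrative Python version, assuming the mapping A=0, C=1, G=2, T=3 and a hypothetical function name `iter_codes`; bytes outside ACGT fall through to 0 here, on the assumption that validation is done separately.

```python
# 256-entry table indexed by byte value: 'A', 'C', 'G', 'T' -> assumed 2-bit code.
# Every other byte maps to 0; checking validity is treated as a separate step.
LOOKUP = bytearray(256)
for code, letter in enumerate(b"ACGT"):
    LOOKUP[letter] = code


def iter_codes(dna):
    """Lazily yield one 2-bit code per input byte, with no packing step."""
    for byte in dna:
        yield LOOKUP[byte]
```

For example, `list(iter_codes(b"ACGT"))` gives `[0, 1, 2, 3]`. Each nucleotide costs one table read, and nothing is packed into chunks only to be unpacked again.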
OK, perfect. We have one interface that works at the chunk level and one that works at the nucleotide level, which seems good to me. Add a function to validate DNA and a function to compute the complement, and it looks like a good API to me.
|
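The validation, complement, and revcomp helpers mentioned above could look roughly like this at the byte level. This is a sketch only: the names `is_valid_dna`, `complement`, and `revcomp` are hypothetical, and only uppercase A, C, G, T are assumed to be valid.

```python
VALID = frozenset(b"ACGT")  # byte values accepted as nucleotides (assumption)
COMPLEMENT = bytes.maketrans(b"ACGT", b"TGCA")  # A<->T, C<->G


def is_valid_dna(seq):
    """Return True if every byte of `seq` is one of A, C, G, T."""
    return all(byte in VALID for byte in seq)


def complement(seq):
    """Complement each nucleotide, keeping the original order."""
    return seq.translate(COMPLEMENT)


def revcomp(seq):
    """Reverse complement: complement each nucleotide, then reverse."""
    return complement(seq)[::-1]
```

For instance, `revcomp(b"AACG")` returns `b"CGTT"`, and `is_valid_dna(b"ACGN")` returns `False`.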
Looks good, I'll start working on it when I have time. |
I've started to work on this, and I've realized that to have efficient complement, we have to encode |
Are you sure the XOR with And in fact I use the Maybe we can allow the user to choose the encoding? It's probably very hard to implement this functionality. |
Yeah, I realized right after I made that comment that XORing with |
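As an illustration of why the choice of encoding matters for complement, here is one mapping under which complement is a single XOR on a packed word. The mapping A=00, C=01, G=10, T=11 is an assumption for this sketch and is not necessarily the encoding settled on in this thread; `complement_chunk` is a hypothetical name.

```python
MASK64 = (1 << 64) - 1  # Python ints are unbounded, so emulate a u64 with a mask


def complement_chunk(chunk):
    """Complement all 32 nucleotides of a packed u64 at once.

    Assumes A=00, C=01, G=10, T=11, so each complementary pair
    (A/T and C/G) differs in both bits and flipping every bit swaps them.
    """
    return chunk ^ MASK64
```

With this mapping, the low byte `0b00011011` (ACGT) becomes `0b11100100` (TGCA). Padding slots in a partially filled chunk are flipped too, so the sequence length still has to be tracked separately.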
Do you want to implement a function for checking whether a sequence of bytes represents a valid nucleotide sequence? It shouldn't be too hard to implement, by using |
I added a popcount implementation, which should make Hamming distance calculations very fast. This isn't directly part of our API, but I think it would be very useful. |
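For context, this is roughly how a popcount turns into a Hamming distance on 2-bit packed nucleotides. The thread does not show the actual implementation, so this Python sketch is only an illustration; the name `hamming_distance` and the use of two equally laid-out u64 chunks are assumptions.

```python
EVEN_BITS = 0x5555_5555_5555_5555  # low bit of every 2-bit slot in a u64


def hamming_distance(a, b):
    """Count the nucleotide positions at which two packed u64 chunks differ."""
    diff = a ^ b
    # Fold each slot's high bit onto its low bit: after masking, a slot
    # contributes exactly one set bit if and only if the nucleotides differ.
    mismatches = (diff | (diff >> 1)) & EVEN_BITS
    # Population count of the mask gives the number of mismatching slots.
    return bin(mismatches).count("1")
```

The fold step is needed because a plain popcount of `a ^ b` would count a mismatch once or twice depending on how many bits of the slot differ.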
Yes, OK, no problem. I didn't see that you had done a lot of work!
I added a first version of I added a benchmark to check the effect of length on computation time and we have a with different GC%. For this, I needed to create I created some issues and a milestone to track our progress. |
Alright, that is great! I like how internal benchmarks are done; I haven't thought of doing it that way. |
I think we should consider changing the API to take |
We need to define a public API.
@Daniel-Liu-c0deb0t I gave you write access to this repo, so in theory you can edit any comment. I propose using the next comment as a summary of what we want in this API.