Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

A strange behavior in approximate searches #302

Closed
panxiaoguang opened this issue Feb 28, 2024 · 4 comments
Closed

A strange behavior in approximate searches #302

panxiaoguang opened this issue Feb 28, 2024 · 4 comments

Comments

@panxiaoguang
Copy link

seq = dna"ACAGCGTAGCT"
query = ApproximateSearchQuery(dna"AGGG")
findfirst(query, 1, seq)
3:6
findnext(query, 3, seq, 1)
1:1
findnext(query, 3, seq, 2)
3:3
findnext(query, 3, seq, 4)
4:4

In the official example, if only 1bp of mismatch is allowed, the result is completely correct. However, when I increase the number of mismatches to 3bp, strange results appear. In fact if 3bp mismatch is allowed, we can manually find similar correct positions, e.g.

ACAGCGTAGCT
-AGGG------

In the above case, the expected answer should be 2:5, not 1:1.

@panxiaoguang
Copy link
Author

Another example in mismatches number is 2:

findnext(query, 2, seq, 6)
8:9
ACAGCGTAGCT
-------AGGG

the correct answer should be 8:11,not 8:9

@jakobnissen
Copy link
Member

This is because insertions/deletions are also accepted. So, I the first example, 1:1 is correct because it matches A---

@panxiaoguang
Copy link
Author

This is because insertions/deletions are also accepted. So, I the first example, 1:1 is correct because it matches A---

But I tried running through all the indexes using the findnext function, but I couldn't find the answer I expected 2:5, how can I find it?

for i in 1:length(seq)
    println(findnext(query,3,seq,i))
end

1:1
3:3
3:3
4:4
5:6
6:6
8:8
8:8
9:9
nothing
nothing

@panxiaoguang
Copy link
Author

panxiaoguang commented Feb 28, 2024

Ok, I wrote a small script that can achieve what I want. It's a really stupid algorithm that slides through long sequences and identifies the presence of inconsistent nucleotides one after the other.

function AproximateMatch(longSeq::LongSequence{DNAAlphabet{4}},shortSeq::LongSequence{DNAAlphabet{4}},max_mismatches::Int64) :: Vector{Int64}
    matches = Vector{Int64}()
    len_long = length(longSeq)
    len_short = length(shortSeq)
    
    for i = 1: (len_long - len_short + 1)
        mismatches = 0
        for j in eachindex(shortSeq)
            if !(iscompatible(longSeq[i + j - 1],shortSeq[j]))
                mismatches += 1
                if mismatches > max_mismatches
                    break 
                end
            end
        end
        if mismatches == max_mismatches
            push!(matches,i) 
        end
    end
    matches
end

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants