FEATURE - Detect tandem duplications with cigar #148
Codecov Report
@@            Coverage Diff             @@
##           master     #148      +/-   ##
==========================================
- Coverage   98.40%   98.09%   -0.32%
==========================================
  Files          19       19
  Lines         878      944      +66
==========================================
+ Hits          864      926      +62
- Misses         14       18       +4
a0408ac to 7c8e74b
f8e8809 to 5cd7ed5
auto & res = *results.begin();
// TODO (irallia 17.8.21): The mismatches should give us the opportunity to allow a given amount of errors in the
// duplication.
size_t matches = res.score() % 100;
! Modulo works weirdly with negative values!
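To make the concern concrete, here is a minimal sketch of what the built-in `%` does with a negative left operand. The function name `matches_from_score` and the `std::int32_t` score type are assumptions for illustration; the actual type returned by `res.score()` depends on the alignment configuration.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical stand-in for the line `size_t matches = res.score() % 100;`.
// The score type std::int32_t is an assumption.
constexpr std::size_t matches_from_score(std::int32_t const score)
{
    // The result of % takes the sign of the dividend, so it is negative for a negative score.
    // Converting that negative value to size_t wraps around to a very large number.
    std::size_t const matches = score % 100;
    return matches;
}

static_assert(7 % 100 == 7);
static_assert(-95 % 100 == -95); // not 5
static_assert(matches_from_score(7) == 7);
static_assert(matches_from_score(-95) == static_cast<std::size_t>(-95));
```

With a score of -95 the variable would therefore hold a value close to the maximum of `size_t` rather than 5 or 95.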
* ref              AAAACCGCGTAGCGGG----------TACGTAACGGTACG
*                    ||||||||||||||          ||||||||  -> inserted sequence: GCGGGGCGGG
* read               AACCGCGTAGCGGGGCGGGGCGGGTACGTAAC
*
* suffix_sequence  AAAACCGCGTAGCGGG       -> free_end_gaps_sequence1_leading{true},
*                             |||||          free_end_gaps_sequence1_trailing{false}
* inserted_bases              GCGGGGCGGG  -> free_end_gaps_sequence2_leading{false},
*                                            free_end_gaps_sequence2_trailing{true}
* -> tandem_dup_count = 3, duplicated_bases = GCGGG
*
* Case 2: The duplication (insertion) comes before the matched sequence.
* ref              AAAACCGCGTA----------GCGGGTACGTAACGGTACG
*                    |||||||||          |||||||||||||  -> inserted sequence: GCGGGGCGGG
* read               AACCGCGTAGCGGGGCGGGGCGGGTACGTAAC
*
* prefix_sequence       GCGGGTACGTAACGGTACG  -> free_end_gaps_sequence1_leading{false},
*                       |||||                   free_end_gaps_sequence1_trailing{true}
* inserted_bases   GCGGGGCGGG                -> free_end_gaps_sequence2_leading{true},
*                                               free_end_gaps_sequence2_trailing{false}
* -> tandem_dup_count = 3, duplicated_bases = GCGGG
Other idea:
Build a suffix tree of the inserted sequence and search for the longest repeated substring without overlap (with errors), then map this repeated substring (without errors?).
Other input: Burrows-Wheeler transform, occurrence table, FM index; regular expression -> build a minimal automaton; ZIP, Huffman coding
*/
std::tuple<size_t, size_t> align_suffix_or_prefix(auto const & config,
int32_t const min_length,
std::span<const seqan3::dna5> & sequence,
- std::span<const seqan3::dna5> & sequence,
+ std::span<const seqan3::dna5> const sequence,
some comments :)
// TODO (irallia 17.8.21): The mismatches should give us the opportunity to allow a given amount of errors in the
// duplication.
size_t matches = res.score() % 100;
size_t mismatches = (res.score() - matches) * (-1);
Isn't this the same as:
mismatches = floor(res.score() / 100) * 100;
?
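One way to check this question is to evaluate both expressions on a sample value. The sketch below uses a signed `std::int32_t` for every intermediate (an assumption; the diff stores the results in `size_t`) so that the signs stay visible. The function names are hypothetical.

```cpp
#include <cmath>
#include <cstdint>

// Expression from the diff, evaluated with signed arithmetic.
constexpr std::int32_t mismatches_from_diff(std::int32_t const score)
{
    std::int32_t const matches = score % 100;
    return (score - matches) * (-1);
}

// Proposed expression with integer division: the division already truncates toward zero,
// so an additional floor would have no effect.
constexpr std::int32_t mismatches_integer_division(std::int32_t const score)
{
    return (score / 100) * 100;
}

// Proposed expression with floating-point division: floor rounds toward negative infinity.
inline std::int32_t mismatches_real_floor(std::int32_t const score)
{
    return static_cast<std::int32_t>(std::floor(score / 100.0)) * 100;
}

static_assert(mismatches_from_diff(-195) == 100);
static_assert(mismatches_integer_division(-195) == -100);
```

Under these assumptions the integer-division form yields the same magnitude as the diff but the opposite sign, while a true floor on a floating-point quotient additionally differs in magnitude for negative scores that are not multiples of 100.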
std::span<seqan3::dna5 const> & sequence,
std::span<seqan3::dna5 const> & inserted_bases,
const
auto & res = *results.begin();
// TODO (irallia 17.8.21): The mismatches should give us the opportunity to allow a given amount of errors in the
// duplication.
size_t matches = res.score() % 100;
Can score be negative?
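If the score can be negative, one possible sign-safe decoding is sketched below. It assumes a scoring scheme of +1 per match and -100 per mismatch with fewer than 100 matches, which is only inferred from the `% 100` in the diff and is not confirmed by the PR. The names `decoded_score` and `decode_score` are hypothetical.

```cpp
#include <cstdint>

// Hypothetical result type; both fields are plain counts.
struct decoded_score
{
    std::int32_t matches;
    std::int32_t mismatches;
};

// Assumes score == matches - 100 * mismatches with 0 <= matches < 100.
constexpr decoded_score decode_score(std::int32_t const score)
{
    // Non-negative remainder: always in [0, 100), unlike the built-in % for negative operands.
    std::int32_t const matches = ((score % 100) + 100) % 100;
    // matches - score is a non-negative multiple of 100 under the stated assumption.
    std::int32_t const mismatches = (matches - score) / 100;
    return {matches, mismatches};
}

static_assert(decode_score(7).matches == 7 && decode_score(7).mismatches == 0);
static_assert(decode_score(-95).matches == 5 && decode_score(-95).mismatches == 1);
```

Note that this sketch returns the number of mismatches, whereas the expression in the diff would yield the accumulated penalty (a multiple of 100).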
5cd7ed5 to abbf672
abbf672 to 5301059
Signed-off-by: Lydia Buntrock <[email protected]>
5301059 to 4a44217
Resolves #166
With this PR we can now detect tandem duplications in the CIGAR string. For now, we only collect tandem duplications without errors. A follow-up PR will allow errors as well, so I left some TODOs in the code.