-
Notifications
You must be signed in to change notification settings - Fork 8
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
FEATURE - Detect tandem duplications with cigar #148
base: master
Are you sure you want to change the base?
Changes from all commits
7713d8b
71d915e
18f9a90
4a44217
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -2,6 +2,96 @@ | |
|
||
#include "structures/junction.hpp" // for class Junction | ||
|
||
/*! \brief This function checks if the inserted bases are tandem duplicated. | ||
* | ||
* \param[in] config - configuration for a semi-gobal alignment | ||
* \param[in] min_length - minimum length of variants to detect (default 30 bp, | ||
* \param[in] sequence - suffix or prefix sequence | ||
* \param[in] inserted_bases - the inserted bases of the possible duplication | ||
* \param[in] is_suffix - true: suffix case, false: prefix case | ||
* param[out] match_score - if we found a matching duplication, this value represents the maximal | ||
* amount of matches with the reference and thus the length of the existing | ||
* duplicated part on the reference. If there is no duplication, this value | ||
* is 0. | ||
* param[out] length_of_single_dupl_sequence - greatest common divisor of length of inserted sequence and length of | ||
* maching part -> length of a single duplicated sequence | ||
* \returns std::tie(match_score, length_of_single_dupl_sequence) - a tuple of the resulting values | ||
* | ||
* \details In this function the inserted bases are recursively aligned segment by segment until it has been proven that | ||
* it is a real duplication. | ||
* For simplicity, we only consider the suffix case here, since the prefix case works the same way: | ||
* | ||
* suffix_sequence AAAACCGCGTAGCGGGGCGGG | ||
* |||||||||| | ||
* inserted_bases GCGGGGCGGGGCGGG -> unmatched_inserted_bases: GCGGG | ||
* -> match_score = 10 | ||
* -> length_of_single_dupl_sequence = gcd(15, 10) = 5 | ||
* | ||
* suffix_sequence AAAACCGCGTAGCGGGGCGGG | ||
* ||||| | ||
* unmatched_inserted_bases GCGGG | ||
* | ||
* -> match_score = 5, length_of_single_dupl_sequence = gcd(15, 5) = 5 | ||
*/ | ||
std::tuple<size_t, size_t> align_suffix_or_prefix(auto const & config, | ||
int32_t const min_length, | ||
std::span<const seqan3::dna5> & sequence, | ||
std::span<const seqan3::dna5> & inserted_bases, | ||
bool is_suffix); | ||
|
||
/*! \brief This function checks if the inserted bases are tandem duplicated. | ||
* | ||
* \param[in] query_sequence - SEQ field of the SAM/BAM file | ||
* \param[in] length - length of inserted part, given by the CIGAR | ||
* \param[in] pos_ref - position of the inserted part in the ref (current position) | ||
* \param[in] pos_read - position of the inserted part in the read (current position) | ||
* \param[in] inserted_bases - the inserted bases of the possible duplication | ||
* \param[in] min_length - minimum length of variants to detect (default 30 bp, | ||
* expected to be non-negative) | ||
* \param[in, out] pos_start_dup_seq - start position of the duplicated seq (excluding itself) | ||
* \param[in, out] pos_end_dup_seq - end position of the duplicated seq (including itself) | ||
* \param[in, out] tandem_dup_count - the number of tandem copies of the inserted sequence | ||
* \returns duplicated_bases - the duplicated bases of the duplication | ||
* | ||
* \details If the inserted bases include one or more copies of a duplicated sequence, which is suffix or prefix of | ||
* another copy, we have a Tandem Duplication. To check this, we have to do a semi-global alignment. | ||
* | ||
* Case 1: The duplication (insertion) comes after the matched sequence. Thus we need to check if the inserted | ||
* sequence matches (partly) the suffix sequence. The length of the matching part yields to the amount | ||
* of copies, thus we can calculate the tandem_dup_count and the inserted_bases. | ||
* | ||
* ref AAAACCGCGTAGCGGG----------TACGTAACGGTACG | ||
* |||||||||||||| |||||||| -> inserted sequence: GCGGGGCGGG | ||
* read AACCGCGTAGCGGGGCGGGGCGGGTACGTAAC | ||
* | ||
* suffix_sequence AAAACCGCGTAGCGGG -> free_end_gaps_sequence1_leading{true}, | ||
* ||||| free_end_gaps_sequence1_trailing{false} | ||
* inserted_bases GCGGGGCGGG -> free_end_gaps_sequence2_leading{false}, | ||
* free_end_gaps_sequence2_trailing{true} | ||
* -> tandem_dup_count = 3, duplicated_bases = GCGGG | ||
* | ||
* Case 2: The duplication (insertion) comes before the matched sequence. | ||
* ref AAAACCGCGTA----------GCGGGTACGTAACGGTACG | ||
* ||||||||| ||||||||||||| -> inserted sequence: GCGGGGCGGG | ||
* read AACCGCGTAGCGGGGCGGGGCGGGTACGTAAC | ||
* | ||
* prefix_sequence GCGGGTACGTAACGGTACG -> free_end_gaps_sequence1_leading{false}, | ||
* ||||| free_end_gaps_sequence1_trailing{true} | ||
* inserted_bases GCGGGGCGGG -> free_end_gaps_sequence2_leading{true}, | ||
* free_end_gaps_sequence2_trailing{false} | ||
* -> tandem_dup_count = 3, duplicated_bases = GCGGG | ||
Comment on lines
+63
to
+82
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Other Idea: Other input: Burrows Wheeler, occurence table, FM index; reg Expression -> build minimal automat; ZIP Hoffmann code |
||
* \see For some complex examples see detection_test.cpp. | ||
*/ | ||
std::span<seqan3::dna5 const> detect_tandem_duplication(seqan3::dna5_vector const & query_sequence, | ||
int32_t length, | ||
int32_t pos_ref, | ||
int32_t pos_read, | ||
std::span<seqan3::dna5 const> & inserted_bases, | ||
int32_t const min_length, | ||
int32_t & pos_start_dup_seq, | ||
int32_t & pos_end_dup_seq, | ||
size_t & tandem_dup_count); | ||
|
||
/*! \brief This function steps through the CIGAR string and stores junctions with their position in reference and read. | ||
* | ||
* \param[in] read_name - QNAME field of the SAM/BAM file | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.