[FEATURE] Add tandem_dup_count to Junction #147
Codecov Report
@@ Coverage Diff @@
## master #147 +/- ##
==========================================
+ Coverage 94.75% 94.93% +0.17%
==========================================
Files 18 18
Lines 706 731 +25
==========================================
+ Hits 669 694 +25
Misses 37 37
src/structures/junction.cpp
Outdated
@@ -27,11 +32,15 @@ bool operator<(Junction const & lhs, Junction const & rhs)
: lhs.get_mate2() != rhs.get_mate2()
? lhs.get_mate2() < rhs.get_mate2()
: lhs.get_inserted_sequence() < rhs.get_inserted_sequence();
// TODO (23.7.21, irallia): tandem_dup_count doesn't play a role here right?
@joshuak94 @joergi-w Whether the number of duplications is larger or smaller between junctions does not play a role in the size comparison, right? Only in equality.
That would be weird...
In general, the following must hold: If a < b and b < a are both false, then a = b.
This would be violated if you omit the duplication count.
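A minimal sketch of this rule, assuming a simplified stand-in type: `MiniJunction`, `less_without_count`, `less_with_count` and `equivalent` are hypothetical names for illustration, not part of the project. It contrasts an ordering that ignores the duplication count with one that uses it as a tie-breaker.

```cpp
#include <cstdint>
#include <tuple>

// Hypothetical, simplified stand-in for the real Junction class.
struct MiniJunction
{
    int32_t position{};
    int16_t tandem_dup_count{};
};

// Equality compares every field, including the duplication count.
bool operator==(MiniJunction const & lhs, MiniJunction const & rhs)
{
    return std::tie(lhs.position, lhs.tandem_dup_count) == std::tie(rhs.position, rhs.tandem_dup_count);
}

// Ordering that ignores tandem_dup_count (the variant being questioned).
bool less_without_count(MiniJunction const & lhs, MiniJunction const & rhs)
{
    return lhs.position < rhs.position;
}

// Ordering that uses tandem_dup_count as the lowest-priority tie-breaker.
bool less_with_count(MiniJunction const & lhs, MiniJunction const & rhs)
{
    return std::tie(lhs.position, lhs.tandem_dup_count) < std::tie(rhs.position, rhs.tandem_dup_count);
}

// True if neither element orders before the other under `less`.
template <typename less_t>
bool equivalent(MiniJunction const & a, MiniJunction const & b, less_t less)
{
    return !less(a, b) && !less(b, a);
}
```

With `a{100, 0}` and `b{100, 3}`, the first ordering reports the two as equivalent although `a == b` is false; the second ordering agrees with equality.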
Aaah, you mean a duplication is larger if it contains more duplications?
And thus, one junction is larger than another if this is the case.
Right, but we first check whether a junction is smaller from the position perspective (mate < mate). The question is whether it matters how high the tandem_dup_count is and, if so, where you would put it. Is it more relevant than the length of the inserted sequence?
I guess, I haven't yet fully understood what a junction is...
Can two junctions with equal positions and sequence lengths possibly have a different duplication count? If yes, I would include it with the lowest priority.
My point is, if any of the fields in class junction differ, there must be a clear order (so you can only omit field comparisons that are indirectly determined, or that are omitted in the equality comparison as well).
I have now thought about it again.
A junction describes a change in the read in comparison to the ref (e.g. insertion, duplication, deletion). It consists of two endpoints (breakends), a possibly inserted sequence and the name of the read and now also the number of duplications.
If we now compare two junctions, we want to know if they
- share approximately the same positions (breakend 1 and 2), then
- if they contain a similar number of duplications and
- the inserted sequence is of similar length.
Important to note: for a novel element insertion, the inserted sequence is the whole new sequence; for a duplication, it is the duplicated sequence, which can be shorter if there are multiple duplications. Therefore, I now compare first on the duplication count and then on the sequence.
The content of the sequence is not compared yet, but that is still planned.
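The priority order described above (positions first, then duplication count, then inserted sequence) could be sketched as follows. This is only an illustration under assumptions: `SketchJunction` is a hypothetical stand-in that uses plain integers instead of `Breakend` and `std::string` instead of `seqan3::dna5_vector`, and the sequence is compared with the container's lexicographic `<`, as in the diff hunk.

```cpp
#include <cstdint>
#include <string>
#include <tuple>

// Hypothetical stand-in; the real class uses Breakend and seqan3::dna5_vector.
struct SketchJunction
{
    int32_t mate1{};                  // stand-in for Breakend mate1
    int32_t mate2{};                  // stand-in for Breakend mate2
    std::string inserted_sequence{};  // stand-in for seqan3::dna5_vector
    int16_t tandem_dup_count{};
};

// Compare positions first, then the duplication count, then the sequence.
bool operator<(SketchJunction const & lhs, SketchJunction const & rhs)
{
    return std::tie(lhs.mate1, lhs.mate2, lhs.tandem_dup_count, lhs.inserted_sequence)
         < std::tie(rhs.mate1, rhs.mate2, rhs.tandem_dup_count, rhs.inserted_sequence);
}
```

Using `std::tie` keeps the priority order readable in one place; the nested ternary chain in the diff expresses the same idea field by field.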
👍
6867c64 to 48a7d05
Looks good so far! Two remarks:
include/structures/junction.hpp
Outdated
@@ -13,6 +13,7 @@ class Junction
Breakend mate1{};
Breakend mate2{};
seqan3::dna5_vector inserted_sequence{};
int16_t tandem_dup_count{};
Is there a reason not to use uint16_t?
Intuitively, a count is never negative.
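One thing to keep in mind with an unsigned count is subtraction, for example when checking whether two junctions have a similar number of duplications. The sketch below assumes a hypothetical helper `count_distance` (not part of the project) that avoids wrap-around by always subtracting the smaller value from the larger one.

```cpp
#include <cstdint>

// Hypothetical helper: absolute difference of two unsigned counts.
// The operands are promoted to int for the subtraction, so the result is
// cast back explicitly; ordering the operands keeps it non-negative.
constexpr uint16_t count_distance(uint16_t lhs, uint16_t rhs)
{
    return lhs > rhs ? static_cast<uint16_t>(lhs - rhs)
                     : static_cast<uint16_t>(rhs - lhs);
}
```

Storing a naive `lhs - rhs` back into a `uint16_t` would wrap to a large value whenever `rhs` is the larger count.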
test/api/clustering_test.cpp
Outdated
@@ -14,6 +14,7 @@ int32_t const chrom1_position2 = 94734377;
int32_t const chrom1_position3 = 112323345;
std::string const chrom2 = "chr2";
int32_t const chrom2_position1 = 234432;
int16_t tandem_dup_count = 0;
Use const?
However, it would be good to write at least one test with more than 0 tandem duplications...
Sorry, the test changes have grown a bit, because functionality for clustering was also missing.
some comments on data types again...
Thanks!
Signed-off-by: Lydia Buntrock <[email protected]>
…nt() function Signed-off-by: Lydia Buntrock <[email protected]>
a125b1c to 4df660e
Resolves first part of #143.
A characteristic of tandem duplications is the number of copies. I store this in the new variable int16_t tandem_dup_count. It is 0 if there is no duplication (e.g. insertion (novel element), deletion, inversion ...).