Rework single-end MAPQ computation #328
Conversation
Welcome back from vacation :) Great that you found a bug, and I like most of your solution! I think I like the linear projection too. For example, if a 150 nt read has a perfect match to one place and one SNP to another, then there is still a preference reflected in the MAPQ score for the perfect location.
Yes. For longer reads, though, the situation would become as described in my example (with the score 595 vs 600), so you would get MAPQ zero again. Hm, maybe dividing by the best score isn’t such a good idea after all? Consider the same situation for a 1000 nt read: If it has one location where it maps perfectly and one where it maps with one mismatch, then I would be quite confident that the location without the mismatch is the correct one. And my (intuitive) level of confidence would be quite similar even when the read is significantly shorter. But this also changes when there are more mismatches: If one location gives 9 and another 10 mismatches, I’d say the difference doesn’t matter that much and both are almost equally good. (Unless perhaps the 9 mismatches were shared between the two, but taking that into account goes too far here.)
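To make the 595-vs-600 concern concrete, here is a quick check (a sketch only, assuming the proposed 60 * (best - second_best) / best formula with rounding down):

```python
def mapq_floor(best, second_best):
    # proposed ratio-based MAPQ, rounded down
    return 60 * (best - second_best) // best

# A long read: best score 600, runner-up 595 (e.g. one extra mismatch)
print(mapq_floor(600, 595))  # 0 -> reported as a multimapper despite a clearly better hit
# The same absolute score gap on a much shorter read weighs more heavily
print(mapq_floor(150, 145))  # 2
```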
The secondary should have a much lower score, so the mapping quality would be high. That is the desired outcome, right?
I am not sure what would be a good MAPQ score calculation. I just found this biostars post but haven't studied it closely. Looks a bit like magic and claims to come from this paper.
Yes.
Getting back to this as I’m working on #359, where I need to touch the same code: Can this PR be merged? My feeling is that the PR is maybe not the best conceivable solution, but an improvement over the status quo.
Yes. However, as discussed over email, it is probably good to add the case:
First, there was a bug in MAPQ computation due to initializing best_score to -1000. In the first iteration, min_mapq_diff is updated thus:

    min_mapq_diff = std::max(0, alignment.score - best_score);

This is just alignment.score + 1000, so most of the time, min_mapq_diff ends up being something like 1600 or so for typical read lengths. (This is clamped to 60 in the output, but still.)

Here, we keep track of the highest and second-highest alignment score in best_score and second_best_score. Then, mapq is computed such that it is
* 60 for second_best_score=0 and
* 0 for second_best_score=best_score,
where anything in between is interpolated. See #25
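A minimal sketch of that interpolation (in Python for illustration; the actual code is C++):

```python
def interpolated_mapq(best_score, second_best_score):
    # Linear interpolation between the two extremes listed above:
    #   second_best_score == 0          -> 60
    #   second_best_score == best_score -> 0
    return 60 * (best_score - second_best_score) // best_score

assert interpolated_mapq(600, 0) == 60    # no competing hit: maximum confidence
assert interpolated_mapq(600, 600) == 0   # exact tie: multimapper
assert interpolated_mapq(600, 300) == 30  # halfway in between
```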
Keeping the mapping quality separate from the alignment itself appears to make more sense:
- Mapping quality for an alignment is not a property of the alignment per se, but is derived from its relationship to other alignments.
- get_alignment returns an Alignment but leaves mapq uninitialized, that is, we must remember to fill it in afterwards, which can easily be forgotten.
- Sam::add_pair is passed two mapq values and also two Alignment instances, which also contain mapq values. It is unclear to the caller which of the values is used.
- Adding a mapq parameter to Sam::add makes it mirror Sam::add_pair.
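A rough illustration of that separation, in Python with made-up names (not the project’s actual API):

```python
from dataclasses import dataclass

@dataclass
class Alignment:
    score: int  # placeholder; the Alignment itself carries no mapq field

def compute_mapq(best_score: int, second_best_score: int) -> int:
    # MAPQ is derived from how the best candidate relates to the runner-up,
    # not stored as a property of a single alignment
    if best_score <= 0:
        return 0
    return 60 * (best_score - second_best_score) // best_score

def sam_add(records: list, alignment: Alignment, mapq: int) -> None:
    # mapq is passed explicitly, mirroring a paired-end variant that takes
    # one mapq per mate, so there is no ambiguity about which value is written
    records.append((alignment, mapq))
```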
Is it OK if I use the following alternative (pushed in commit acac70b)? The idea is to round up instead of down. It has the same effect, that is, we get MAPQ=0 only if second_best == best:

    >>> def f(best, second_best):
    ...     return (60 * (best - second_best) + best - 1) // best
    ...
    >>> f(100, 0)
    60
    >>> f(100, 50)
    30
    >>> f(100, 100)
    0
    >>> f(10000, 9999)
    1

I checked the distribution of MAPQ values on the Drosophila test dataset (the one on which the CI baseline comparison script runs).
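For comparison, a small check of the round-down and round-up variants side by side (a sketch; the function names are made up), showing they only differ when the scores are nearly tied:

```python
def mapq_floor(best, second_best):
    return 60 * (best - second_best) // best               # round down

def mapq_ceil(best, second_best):
    return (60 * (best - second_best) + best - 1) // best  # round up

for best, second_best in [(100, 0), (100, 50), (100, 100), (10000, 9999), (600, 595)]:
    print(best, second_best, mapq_floor(best, second_best), mapq_ceil(best, second_best))
# (10000, 9999) and (600, 595): floor gives 0, ceil gives 1,
# so with the round-up variant MAPQ=0 is reserved for exact ties.
```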
Great, I agree with your formula, which is better than what I suggested. About the plot: it seems like a sound distribution, assuming there are repeats with 1, 2, 3, ... mutations (and indels) relative to the best location. Approved to merge.
This contains two changes: A bugfix and a change to the way MAPQ values are computed for single-end mapping.
First, the fix: There was a bug in MAPQ computation due to initializing best_score to -1000. In the first iteration, min_mapq_diff is updated in this way:

    min_mapq_diff = std::max(0, alignment.score - best_score);

With best_score=-1000, this is just alignment.score + 1000, so most of the time, min_mapq_diff ends up being something like 1600 or so for typical read lengths. This is then clamped to 60 in the output, which is probably the reason why we see MAPQ=60 most of the time.

Even with the fix, I didn’t understand what the code was doing. I think the intention is to use the second-best score and relate it to the best score to compute the mapping quality, but it appears this isn’t implemented that way. So the second change is to compute MAPQ values differently: We keep track of the highest and second-highest alignment score in best_score and second_best_score. Then, mapq is computed such that it is
* 60 for second_best_score=0 and
* 0 for second_best_score=best_score,
where anything in between is interpolated linearly:

    mapq = 60 * (best_score - second_best_score) / best_score

To me, the two extreme cases (MAPQ=0 and MAPQ=60) make sense. I’m not so sure about the linear interpolation, but it also seems to give the results one would want in some cases. For example, if best_score is 600 and second_best_score is 595, the formula gives 60 * (600-595)/600 = 0.5, which is rounded down and gives mapq=0, so if the scores are very close, the read is still considered a multimapper.

This is intended to address the single-end part of #25.
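For reference, a small end-to-end sketch of the scheme described above (Python with hypothetical names; the real implementation is in the C++ aligner):

```python
def single_end_mapq(alignment_scores):
    # Track the highest and second-highest alignment score among all candidates,
    # then interpolate linearly between MAPQ 60 (no competitor) and MAPQ 0 (tie).
    best = second_best = 0
    for score in alignment_scores:
        if score > best:
            best, second_best = score, best
        elif score > second_best:
            second_best = score
    if best == 0:
        return 0
    return 60 * (best - second_best) // best  # rounded down, stays within 0..60

print(single_end_mapq([600, 595]))       # 0  -> scores very close: treated as a multimapper
print(single_end_mapq([600, 300, 100]))  # 30 -> clearly better best hit
print(single_end_mapq([600]))            # 60 -> no competing alignment
```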