Rework single-end MAPQ computation #328
Conversation
Welcome back from vacation :) Great that you found a bug, and I like most of your solution! I think I like the linear projection too. For example, if a 150 nt read has a perfect match to one place and one SNP to another, then there is still a preference reflected in the MAPQ score for the perfect location.
Yes. For longer reads, though, the situation would become as described in my example (with the score 595 vs 600), so you would get MAPQ zero again. Hm, maybe dividing by the best score isn’t such a good idea after all? Consider the same situation for a 1000 nt read: If it has one location where it maps perfectly and one where it maps with one mismatch, then I would be quite confident that the location without the mismatch is the correct one. And my (intuitive) level of confidence would be quite similar even when the read is significantly shorter. But this also changes when there are more mismatches: If one location gives 9 and another 10 mismatches, I’d say the difference doesn’t matter that much and both are almost equally good. (Unless perhaps the 9 mismatches were shared between the two, but taking that into account goes too far here.)
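To make the 595-vs-600 concern concrete, here is a quick check (a sketch only, assuming the proposed 60 * (best - second_best) / best formula with rounding down):

```python
def mapq_floor(best, second_best):
    # proposed ratio-based MAPQ, rounded down
    return 60 * (best - second_best) // best

# A long read: best score 600, runner-up 595 (e.g. one extra mismatch)
print(mapq_floor(600, 595))  # 0 -> reported as a multimapper despite a clearly better hit
# The same absolute score gap on a much shorter read weighs more heavily
print(mapq_floor(150, 145))  # 2
```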
The secondary should have a much lower score, so the mapping quality would be high. That is the desired outcome, right?
I am not sure what would be a good MAPQ score calculation. I just found this biostars post but haven't studied it closely. Looks a bit like magic and claims to come from this paper.
Yes.
Getting back to this as I’m working on #359, where I need to touch the same code: Can this PR be merged? My feeling is that the PR is maybe not the best conceivable solution, but an improvement over the status quo.
Yes. However, as discussed over email, it is probably good to add the case:
First, there was a bug in MAPQ computation due to initializing best_score to -1000. In the first iteration, min_mapq_diff is updated thus:

    min_mapq_diff = std::max(0, alignment.score - best_score);

This is just alignment.score + 1000, so most of the time, min_mapq_diff ends up being something like 1600 or so for typical read lengths. (This is clamped to 60 in the output, but still.)

Here, we keep track of the highest and second-highest alignment score in best_score and second_best_score. Then, mapq is computed such that it is
* 60 for second_best_score=0 and
* 0 for second_best_score=best_score,
where anything in between is interpolated. See #25
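A minimal sketch of that interpolation (in Python for illustration; the actual code is C++):

```python
def interpolated_mapq(best_score, second_best_score):
    # Linear interpolation between the two extremes listed above:
    #   second_best_score == 0          -> 60
    #   second_best_score == best_score -> 0
    return 60 * (best_score - second_best_score) // best_score

assert interpolated_mapq(600, 0) == 60    # no competing hit: maximum confidence
assert interpolated_mapq(600, 600) == 0   # exact tie: multimapper
assert interpolated_mapq(600, 300) == 30  # halfway in between
```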
Keeping the mapping quality separate from the alignment itself appears to make more sense:
- Mapping quality for an alignment is not a property of the alignment per se, but is derived from its relationship to other alignments.
- get_alignment returns an Alignment but leaves mapq uninitialized, that is, we must remember to fill it in afterwards, which can easily be forgotten.
- Sam::add_pair is passed two mapq values and also two Alignment instances, which also contain mapq values. It is unclear to the caller which of the values is used.
- Adding a mapq parameter to Sam::add makes it mirror Sam::add_pair.
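A rough illustration of that separation, in Python with made-up names (not the project’s actual API):

```python
from dataclasses import dataclass

@dataclass
class Alignment:
    score: int  # placeholder; the Alignment itself carries no mapq field

def compute_mapq(best_score: int, second_best_score: int) -> int:
    # MAPQ is derived from how the best candidate relates to the runner-up,
    # not stored as a property of a single alignment
    if best_score <= 0:
        return 0
    return 60 * (best_score - second_best_score) // best_score

def sam_add(records: list, alignment: Alignment, mapq: int) -> None:
    # mapq is passed explicitly, mirroring a paired-end variant that takes
    # one mapq per mate, so there is no ambiguity about which value is written
    records.append((alignment, mapq))
```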
Is it OK if I use the following alternative (pushed in commit acac70b)? The idea is to round up instead of down. It has the same effect, that is, we get MAPQ=0 only if second_best == best:

    >>> def f(best, second_best):
    ...     return (60 * (best - second_best) + best - 1) // best
    ...
    >>> f(100, 0)
    60
    >>> f(100, 50)
    30
    >>> f(100, 100)
    0
    >>> f(10000, 9999)
    1

I checked the distribution of MAPQ values on the Drosophila test dataset (the one on which the CI baseline comparison script runs).
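For comparison, a small check of the round-down and round-up variants side by side (a sketch; the function names are made up), showing they only differ when the scores are nearly tied:

```python
def mapq_floor(best, second_best):
    return 60 * (best - second_best) // best               # round down

def mapq_ceil(best, second_best):
    return (60 * (best - second_best) + best - 1) // best  # round up

for best, second_best in [(100, 0), (100, 50), (100, 100), (10000, 9999), (600, 595)]:
    print(best, second_best, mapq_floor(best, second_best), mapq_ceil(best, second_best))
# (10000, 9999) and (600, 595): floor gives 0, ceil gives 1,
# so with the round-up variant MAPQ=0 is reserved for exact ties.
```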
Great, I agree with your formula, which is better than what I suggested. About the plot: it seems like a sound distribution, assuming there are repeats with 1, 2, 3, ... mutations (and indels) relative to the best location. Approved to merge.
This contains two changes: A bugfix and a change to the way MAPQ values are computed for single-end mapping.
First, the fix: There was a bug in MAPQ computation due to initializing best_score to -1000. In the first iteration, min_mapq_diff is updated in this way:

    min_mapq_diff = std::max(0, alignment.score - best_score);

With best_score=-1000, this is just alignment.score + 1000, so most of the time, min_mapq_diff ends up being something like 1600 or so for typical read lengths. This is then clamped to 60 in the output, which is probably the reason why we see MAPQ=60 most of the time.

Even with the fix, I didn’t understand what the code was doing. I think the intention is to use the second-best score and relate it to the best score to compute the mapping quality, but it appears this isn’t implemented that way. So the second change is to compute MAPQ values differently: We keep track of the highest and second-highest alignment score in best_score and second_best_score. Then, mapq is computed such that it is
* 60 for second_best_score=0 and
* 0 for second_best_score=best_score,
where anything in between is interpolated linearly:

    mapq = 60 * (best_score - second_best_score) / best_score

To me, the two extreme cases (MAPQ=0 and MAPQ=60) make sense. I’m not so sure about the linear interpolation, but it also seems to give the results one would want in some cases. For example, if best_score is 600 and second_best_score is 595, the formula gives 60 * (600-595)/600 = 0.5, which is rounded down and gives mapq=0, so if the scores are very close, the read is still considered a multimapper.

This is intended to address the single-end part of #25.
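For reference, a small end-to-end sketch of the scheme described above (Python with hypothetical names; the real implementation is in the C++ aligner):

```python
def single_end_mapq(alignment_scores):
    # Track the highest and second-highest alignment score among all candidates,
    # then interpolate linearly between MAPQ 60 (no competitor) and MAPQ 0 (tie).
    best = second_best = 0
    for score in alignment_scores:
        if score > best:
            best, second_best = score, best
        elif score > second_best:
            second_best = score
    if best == 0:
        return 0
    return 60 * (best - second_best) // best  # rounded down, stays within 0..60

print(single_end_mapq([600, 595]))       # 0  -> scores very close: treated as a multimapper
print(single_end_mapq([600, 300, 100]))  # 30 -> clearly better best hit
print(single_end_mapq([600]))            # 60 -> no competing alignment
```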