UMI-Tools does not remove all duplicate reads #650
While I can't tell exactly without studying the BAM files in question
closely, I don't think this necessarily indicates a problem. The most likely
explanation is to do with the way UMI-tools decides whether two UMIs are the
same or not.
The UMI-tools algorithm will merge two UMIs at the same mapping position if
the edit distance between them is 1 (by default), and the count of the more
common UMI is at least 2n-1, where n is the count of the less common UMI.
Consider the following situation:
UMI Count
AAAAA 10
AAAAT 10
When UMI-tools is run, these two UMIs would not be considered the same, as
the counts are too similar (2*10-1=19, so the more common one would need to
be seen at least 19 times). After UMI-tools deduplicates, each of these UMIs
would appear a single time in the output:
UMI Count
AAAAA 1
AAAAT 1
If UMI-tools is run again, these UMIs will be merged, as 2*1-1=1 (in fact,
the -1 factor is explicitly to ensure that chains of similar UMIs seen only
once are merged).
Thus we would not expect UMI-tools to leave the input unchanged if the
output of a previous run is used as input.
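For concreteness, here is a minimal sketch of that threshold rule in Python (my own toy illustration of the rule described above, not UMI-tools' actual implementation):

```python
def would_merge(count_a, count_b, hamming_dist, threshold_dist=1):
    """Toy version of the 'directional' merge test: two UMIs at the same
    mapping position are merged if they are within the edit-distance
    threshold and the more common UMI has a count of at least 2n - 1,
    where n is the count of the less common UMI."""
    hi, lo = max(count_a, count_b), min(count_a, count_b)
    return hamming_dist <= threshold_dist and hi >= 2 * lo - 1

# First pass: AAAAA and AAAAT both seen 10 times -> not merged,
# because 10 < 2*10 - 1 = 19.
print(would_merge(10, 10, hamming_dist=1))   # False

# Second pass over the deduplicated output: both now have count 1
# -> merged, because 1 >= 2*1 - 1 = 1.
print(would_merge(1, 1, hamming_dist=1))     # True
```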
This behavior also happens when you use the

On a conceptual level, shouldn't removing duplicates from a collection without duplicates remove nothing? Or put another way, if 10x AAAA is considered distinct from 10x AAAT, why is 1x AAAA considered identical to 1x AAAT? This has real implications for how many duplicates UMI-Tools identifies: if I have sequenced with high coverage, AAAA and AAAT will be seen as distinct reads, but if I have low coverage, AAAA and AAAT will be seen as identical.
'I understand how this behavior follows from the way UMI-Tools is made, but I do not understand why this was chosen.'

When I originally tested the 'directional' method, I started with thresholds which were scale invariant (e.g. 2n, 3n). However, I observed that there was still a higher than expected frequency of UMIs within a single edit distance, and this was due to the presence of chains of UMIs with very low counts. As @IanSudbery says, the -1 in the threshold is explicitly designed to collapse these chains. In all honesty, I don't think we ever expected to stop at 'directional' and assumed we or someone else would implement a more Bayesian method for collapsing UMIs which would prove more accurate. It would be relatively trivial to test alternative approaches with simulated or real data should anyone have time and motivation to do so.

'On a conceptual level, shouldn't removing duplicates from a collection without duplicates remove nothing?'

I think this is a misconception about what you're doing when you pass through UMI-tools multiple times. UMI-tools is designed to handle a BAM file with all the alignments presented to it. The algorithms have been shown to work well in that use case. If you remove duplicates and collapse down to a single read for each set of duplicated reads and then pass it to UMI-tools again, you're now working from a different input entirely, and one which has simplified the alignments significantly. In that case, why should the algorithm yield the exact same output as before? It could (as in the case of most of the other methods available), but doing so shouldn't be expected, and not doing so doesn't demonstrate that the original deduplication was sub-optimal.
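To illustrate the effect of the -1 on chains of low-count UMIs, here is a toy comparison of the two kinds of threshold (just an illustrative sketch using naive single-linkage grouping, not the actual directional network construction; the chain of singleton UMIs is made up):

```python
from itertools import combinations

def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))

def n_clusters(counts, accept):
    """Greedy single-linkage grouping: link two UMIs if they are one
    mismatch apart and the threshold function accepts their counts."""
    parent = {u: u for u in counts}
    def find(u):
        while parent[u] != u:
            u = parent[u]
        return u
    for a, b in combinations(counts, 2):
        if hamming(a, b) == 1 and accept(max(counts[a], counts[b]),
                                         min(counts[a], counts[b])):
            parent[find(a)] = find(b)
    return len({find(u) for u in counts})

chain = {"AAAA": 1, "AAAT": 1, "AATT": 1}   # a chain of singleton UMIs

# 2n - 1 threshold: each pair of adjacent singletons satisfies 1 >= 1,
# so the whole chain collapses to a single group.
print(n_clusters(chain, lambda hi, lo: hi >= 2 * lo - 1))  # 1

# Scale-invariant 2n threshold: 1 >= 2 fails, so all three survive.
print(n_clusters(chain, lambda hi, lo: hi >= 2 * lo))      # 3
```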
Thank you for the explanation; it's good to know that UMI-Tools is not supposed to produce the same results when run on its own output.
Would you expect these chains to be caused by PCR duplicates, and hence be invalid? Is it not extremely unlikely that PCR amplification would produce such a chain of UMIs? In effect, that would mean that every cycle introduced a mismatch. Or is there another source of bias that would produce these chains?
The exact source of these chains is unclear to me. PCR errors are probably low enough frequency to discount that being a common cause, while sequencing errors shouldn't create chains by themselves. In combination though, a PCR error plus sequencing error could occasionally create a chain of length 2. Combined with the random sampling of the sequencing, this may occur more than intuitively expected.

We took the edit distances from random sampling (with frequencies as observed in the data to account for bias in UMI usage) as the best available ground truth and optimised the deduplication algorithms by comparison of the edit distances in the deduplicated BAM. While the algorithms were designed with the possible UMI error sources in mind, they are agnostic as to the source of the error.
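A rough sketch of that kind of null comparison, assuming a hypothetical table of observed UMI frequencies (the real evaluation described above is more involved than this):

```python
import random

def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))

# Hypothetical observed UMI usage frequencies (placeholder values).
observed_counts = {"AAAAA": 120, "AAAAT": 15, "CGTCA": 80, "TTGCA": 60}
umis, weights = zip(*observed_counts.items())

random.seed(0)
# Null model: draw UMI pairs at random with the observed usage frequencies,
# so the expected edit-distance distribution reflects UMI-usage bias but
# contains no PCR or sequencing errors.
draws = random.choices(umis, weights=weights, k=1000)
null_dists = [hamming(a, b) for a, b in zip(draws[::2], draws[1::2])]

# This null distribution would then be compared against the pairwise edit
# distances between UMIs remaining at each position in the deduplicated BAM;
# an excess of distance-1 pairs suggests under-collapsed errors.
print(sum(d == 1 for d in null_dists) / len(null_dists))
```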
Right, if you take into account that UMI-Tools looks at UMIs for reads that are aligned at the same position, it is unlikely that these UMIs reflect true unique reads. Unless the coverage is very high, the odds of finding two or more UMIs that differ in only 1 base by chance should be low.
There is a two-dimensional process going on here: the higher the sequencing depth, the higher the chance of finding two UMIs that differ in only 1 base through some error process; the higher the original number of molecules, the higher the chance of finding two UMIs that differ in only one base that are genuinely unique. Our testing suggests that you should try to keep the amount of "UMI space" used by your original (unamplified) molecule set under ~1/3 of the total available space, i.e. a 10mer UMI is fine as long as you expect fewer than ~300k molecules in any one location.

The chances of PCR mutations are not as low as you might expect. If you run a PCR for 15 cycles, then each original molecule in your tube will have been amplified to ~32,000 molecules, each with two strands that will have been synthesised by the polymerase, giving 64,000 strands. If each UMI is 10nt long, that's 640,000 bases synthesised by the polymerase. If the polymerase has an error rate of 1e-6 (e.g. Q5 polymerase), then the chance of an error is 1-(1-1e-6)^640,000, i.e. a ~47% chance of at least one error at some point in the process.

A chain of 2 wouldn't require an error in every round (the rounds could be quite distant from each other); it would just require that sampling meant that both the earlier and later mutations were only sequenced once. While this is unlikely for any given molecule, when you do tens or hundreds of millions of reads, the chance of it happening becomes much larger.
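Writing out the arithmetic in that estimate (numbers taken from the comment above; the error rate is illustrative):

```python
cycles = 15
molecules = 32_000            # ~2**15 copies of each original molecule after 15 cycles
strands = 2 * molecules       # each copy has two polymerase-synthesised strands
umi_length = 10
bases = strands * umi_length  # 640,000 UMI bases synthesised by the polymerase
error_rate = 1e-6             # e.g. a high-fidelity polymerase

p_at_least_one_error = 1 - (1 - error_rate) ** bases
print(f"{p_at_least_one_error:.0%}")   # ~47% chance of at least one error
```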
I'm going to close this issue. Let me know if there is anything to add.
Describe the bug
UMI-Tools does not remove all duplicate reads. This can be verified by running UMI-Tools on its own output, which removes additional reads.
To Reproduce
Take any BAM file that has been deduplicated with UMI-Tools, and run UMI-Tools again on that output. In my test, running UMI-Tools a third time removes no more reads.
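One way to compare read counts between passes (a sketch only; the file names are placeholders for the outputs of successive `umi_tools dedup` runs, and it assumes pysam is available):

```python
import pysam

def n_reads(path):
    """Count all records in a BAM file by sequential iteration (no index needed)."""
    with pysam.AlignmentFile(path, "rb") as bam:
        return sum(1 for _ in bam)

# Placeholder file names for the first and second dedup pass.
print(n_reads("dedup_once.bam"), n_reads("dedup_twice.bam"))
```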
This issue occurs both with and without `--paired` mode.

Expected behavior
I would expect UMI-Tools to remove all duplicate reads in a single pass.
Environment
docker://quay.io/biocontainers/umi_tools:1.1.1--py38h0213d0e_1
Additional context
This issue appears not to be related to duplicate reads in the output when using `--paired` as described in #347, since all occurrences of reads are completely removed from the BAM file after running UMI-Tools a second time.

It does not appear that this issue is related to any specific BAM flags in the reads; in my tests there wasn't a single flag or combination of flags that was not affected by this bug. I've attached some flag counts I made when testing running UMI-Tools multiple times: `read paired` and `first in pair`.

once-twice.zip