UMI-Tools does not remove all duplicate reads #650

Closed · Redmar-van-den-Berg opened this issue Jun 26, 2024 · 8 comments

Redmar-van-den-Berg commented Jun 26, 2024

Describe the bug
UMI-Tools does not remove all duplicate reads. This can be verified by running UMI-Tools on its own output, which removes additional reads.

To Reproduce
Take any BAM file that has been deduplicated with UMI-Tools and run UMI-Tools on that output again. In my test, running UMI-Tools a third time removes no further reads.

umi_tools dedup \
    --method cluster \
    --paired \
    --stdin={input.bam} \
    --stdout={output.bam}

This issue occurs both with and without --paired mode.
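
A minimal sketch of one way to compare the two passes (the file names are placeholders and pysam is assumed to be installed; this is not part of UMI-Tools itself):

import pysam

def count_reads(path):
    # Count every record in the BAM without requiring an index.
    with pysam.AlignmentFile(path, "rb") as bam:
        return sum(1 for _ in bam)

once = count_reads("once-dedup.bam")    # output of the first dedup pass
twice = count_reads("twice-dedup.bam")  # output of a second pass over that output
print(once, twice, once - twice)        # a non-zero difference shows the effect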

Expected behavior
I would expect UMI-Tools to remove all duplicate reads in a single pass.

Environment
docker://quay.io/biocontainers/umi_tools:1.1.1--py38h0213d0e_1

Additional context
This issue does not appear to be related to the duplicate reads in the output when using --paired as described in #347, since the reads removed on the second pass are removed from the BAM file entirely (all occurrences).

This issue also does not appear to be related to any specific BAM flags: in my tests there was no single flag or combination of flags that was unaffected by this bug. I've attached some flag counts I made when running UMI-Tools multiple times (a sketch of how such counts can be generated follows below):

  • The files that end in -per-flag.csv count how often each BAM flag occurs across all reads. This means that a single read can contribute to multiple flags, e.g. read paired and first in pair.
  • The files that end in -all-flags.csv count how often each combination of flags occurs. Since each read is counted exactly once (under its full combination of flags), the total in this file is also the total number of reads in the BAM file.
  • The files that start with once- are generated from the output of UMI-Tools.
  • The files that start with twice- are generated by running UMI-Tools on the output of UMI-Tools.

once-twice.zip
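
A hedged sketch of how such per-flag and all-flags counts could be reproduced (the BAM path is a placeholder; pysam is assumed):

from collections import Counter
import pysam

per_flag = Counter()   # a read increments every flag bit it carries
all_flags = Counter()  # a read increments exactly one flag combination

with pysam.AlignmentFile("dedup.bam", "rb") as bam:
    for read in bam:
        all_flags[read.flag] += 1
        for bit in range(12):            # the 12 standard SAM flag bits
            if read.flag & (1 << bit):
                per_flag[1 << bit] += 1

# sum(all_flags.values()) equals the total number of reads in the file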

IanSudbery (Member) commented Jun 26, 2024 via email

Redmar-van-den-Berg (Author)

This behavior also happens when you use the --method directional setting. I understand how this behavior follows from the way UMI-Tools is made, but I do not understand why this was chosen. In fact, when I originally read the paper I misread the directional method to mean that UMIs will be merged if their counts are lower than 2n-1.

On a conceptual level, shouldn't removing duplicates from a collection without duplicates remove nothing? Or, put another way, if 10x AAAA is considered distinct from 10x AAAT, why is 1x AAAA considered identical to 1x AAAT?

This has real implications for how many duplicates UMI-Tools identifies: if I have sequenced at high coverage, AAAA and AAAT will be treated as distinct reads, but at low coverage, AAAA and AAAT will be treated as identical.
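
For context, a minimal sketch of the 'directional' criterion as described in the UMI-tools paper (UMI a absorbs a neighbouring UMI b at edit distance 1 when count(a) >= 2*count(b) - 1); the counts and the toy function below are illustrative, not the UMI-tools implementation:

def absorbs(count_a, count_b):
    # Directional criterion: a absorbs b if count(a) >= 2 * count(b) - 1.
    return count_a >= 2 * count_b - 1

print(absorbs(10, 10))  # False: 10x AAAA and 10x AAAT remain distinct
print(absorbs(1, 1))    # True:  1x AAAA absorbs 1x AAAT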

TomSmithCGAT (Member) commented Jul 1, 2024

'I understand how this behavior follows from the way UMI-Tools is made, but I do not understand why this was chosen.'

When I originally tested the 'directional' method, I started with thresholds which were scale invariant (e.g. 2n, 3n). However, I observed that there was still a higher than expected frequency of UMIs within a single edit distance and this was due to the presence of chains of UMIs with very low counts. As @IanSudbery says, the -1 in the threshold is explicitly designed to collapse these chains.
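
A toy illustration of that point, assuming a chain of three UMIs one edit apart, each observed once (the counts and helper below are illustrative only, not UMI-tools code):

counts = {"AAAA": 1, "AAAT": 1, "AATT": 1}       # a low-count chain
adjacent = [("AAAA", "AAAT"), ("AAAT", "AATT")]  # pairs at edit distance 1

def merged_pairs(threshold, counts, pairs):
    # Keep the pairs where the first UMI's count clears the threshold
    # computed from the second UMI's count.
    return [(a, b) for a, b in pairs if counts[a] >= threshold(counts[b])]

print(merged_pairs(lambda n: 2 * n, counts, adjacent))      # []: scale-invariant 2n keeps the chain
print(merged_pairs(lambda n: 2 * n - 1, counts, adjacent))  # both pairs merge: 2n - 1 collapses it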

In all honesty, I don't think we ever expected to stop at 'directional' and assumed we or someone else would implement a more Bayesian method for collapsing UMIs which would prove more accurate. It would be relatively trivial to test alternative approaches with simulated or real data should anyone have time and motivation to do so.

'On a conceptual level, shouldn't removing duplicates from a collection without duplicates remove nothing?'

I think this is a misconception about what you're doing when you pass through UMI-tools multiple times. UMI-tools is designed to handle a BAM file with all the alignments presented to it. The algorithms have been shown to work well in that use case. If you remove duplicates and collapse down to a single read for each set of duplicated reads and then pass it to UMI-tools again, you're now working from an entirely different input, one in which the alignments have been simplified significantly. In that case, why should the algorithm yield exactly the same output as before? It could (as is the case for most of the other methods available), but that shouldn't be expected, and not doing so doesn't demonstrate that the original deduplication was sub-optimal.

Redmar-van-den-Berg (Author)

Thank you for the explanation, it's good to know that UMI-Tools is not supposed to produce the same results when run on its own output.

'I observed that there was still a higher than expected frequency of UMIs within a single edit distance and this was due to the presence of chains of UMIs with very low counts'

Would you expect these chains to be caused by PCR duplicates, and hence be invalid? Is it not extremely unlikely that PCR amplification would produce such a chain of UMIs? In effect, that would mean that every cycle introduced a mismatch. Or is there another source of bias that would produce these chains?

TomSmithCGAT (Member)

The exact source of these chains is unclear to me. PCR errors probably occur at too low a frequency to be a common cause on their own, while sequencing errors shouldn't create chains by themselves. In combination, though, a PCR error plus a sequencing error could occasionally create a chain of length 2, and combined with the random sampling of the sequencing, this may occur more often than intuition suggests. We took the edit distances from random sampling (with frequencies as observed in the data, to account for bias in UMI usage) as the best available ground truth and optimised the deduplication algorithms by comparison with the edit distances in the deduplicated BAM. While the algorithms were designed with the possible sources of UMI error in mind, they are agnostic as to the source of the error.

Redmar-van-den-Berg (Author)

Right, if you take into account that UMI-Tools looks at the UMIs of reads aligned at the same position, it is unlikely that these UMIs reflect truly unique reads. Unless the coverage is very high, the odds of finding two or more UMIs that differ by only one base purely by chance should be low.

IanSudbery (Member)

There is a two-dimensional process going on here - the higher the sequencing depth, the higher the chance of finding two UMIs that differ by only one base through some error process. The higher the original number of molecules, the higher the chance of finding two UMIs that differ by only one base and are genuinely unique.

Our testing suggests that you should try to keep the amount of "UMI space" used by your original (unamplified) molecule set under ~1/3 of the total available space. I.e., a 10mer UMI is fine as long as you expect fewer than ~300k molecules at any one location.
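
A rough back-of-envelope sketch of those numbers, assuming uniform random UMIs (illustrative only, not how UMI-tools computes anything):

from math import comb

umi_length = 10
umi_space = 4 ** umi_length                  # 1,048,576 possible 10mers
print(umi_space // 3)                        # ~350k: the ~300k rule of thumb

# Expected pairs of genuinely distinct molecules whose UMIs land within one
# mismatch of each other, for n starting molecules at a single position:
n = 1_000
p_close = (3 * umi_length + 1) / umi_space   # identical or single-mismatch pair
print(comb(n, 2) * p_close)                  # ~15 such pairs already at n = 1000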

The chances of PCR mutations are not as low as you might expect. If you run a PCR for 15 cycles, then each original molecule in your tube will have been amplified to ~32,000 molecules, each with two strands that will have been synthesised by the polymerase = 64,000 strands. If each UMI is 10nt long, that's 640,000 bases synthesised by the polymerase. If the polymerase has an error rate of 1e-6 (e.g. Q5 polymerase), then the chance of at least one error at some point in the process is 1-(1-1e-6)^640,000 ≈ 47%.
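
The same calculation, spelled out with the rounded figures above:

molecules = 32_000            # ~2**15 copies of each starting molecule after 15 cycles
strands = 2 * molecules       # 64,000 strands synthesised by the polymerase
umi_bases = 10 * strands      # 640,000 UMI bases copied in total
error_rate = 1e-6             # e.g. a high-fidelity polymerase such as Q5

p_any_error = 1 - (1 - error_rate) ** umi_bases
print(f"{p_any_error:.0%}")   # ~47%: roughly a coin-flip chance of at least one error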

A chain of 2 wouldn't require an error in every round - the two errors could occur in rounds quite distant from each other - it would just require that sampling meant both the earlier and the later mutation were each sequenced only once. While this is unlikely for any given molecule, when you do 10s/100s of millions of reads, the chance of it happening somewhere becomes much larger.

IanSudbery (Member)

I'm going to close this issue. Let me know if there is anything to add.
