umi_tools dedup: Incorrect output with using --paired #347
Comments
First of all, thanks for taking the time to report this and for making the effort to offer a fix; your effort is appreciated. Thanks for the catch with outputting read1 twice if it is at the same location as read2. My guess is that because this happens on For the problem with outputting read2 twice if two read1s point to it.... This is a little less clear cut. We have had a long discussion in the past about what to do about mates in cases of supplementary reads etc. Some mappers will specifically output a read twice if it is the pair of two different mates. Our biggest problem here is that we are inconsistent: if you have two read1s on the same chr, then your read2 will only be output once. Unfortunately, if your read1s are on two different chrs, then the read2 will be output twice. There is definitely a strong argument for your solution. Unfortunately, it means storing the keys for the whole file in memory, which is going to be a problem once you start dealing with files that contain 200-300M reads. There was a bit of discussion on matters related to this when the |
…t with dedup and --paired as well as avoiding large memory footprint (CGATOxford#347).
Thanks for your feedback. I got your points and have now tried to keep the output consistent, regardless of the strategy of the aligner, by using more information when searching for read2 alignments, namely the mate information that links back to read1. Hence, all read2 alignments are output as long as they link back to read1s that were reported. This also makes the additional set unnecessary, so the memory issue is avoided. |
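To make the idea in this comment concrete, here is a minimal sketch of a second pass that keeps a read2 alignment only when its mate fields point back at a read1 that was reported. It is not the code from the patch: the `Aln` record, its field names, and `select_read2` are hypothetical stand-ins, and the `kept` set exists only to keep the example self-contained.

```python
from collections import namedtuple

# Simplified stand-in for an alignment record; all field names are hypothetical.
Aln = namedtuple("Aln", "name is_read1 ref pos mate_ref mate_pos")


def select_read2(alignments, kept_read1):
    """Keep read2 alignments whose mate fields point at a reported read1."""
    kept = {(a.name, a.ref, a.pos) for a in kept_read1}
    return [
        a for a in alignments
        if not a.is_read1 and (a.name, a.mate_ref, a.mate_pos) in kept
    ]


# Two read1 alignments on different chromosomes both name the same read2 as mate.
r1_a = Aln("q1", True, "chr1", 100, "chr1", 500)
r1_b = Aln("q1", True, "chr2", 900, "chr1", 500)
r2 = Aln("q1", False, "chr1", 500, "chr1", 100)

# read2 is visited once, so it is emitted once even though both read1s were kept.
out = select_read2([r1_a, r1_b, r2], kept_read1=[r1_a, r1_b])
```

Because the decision is made per read2 alignment rather than per reported read1, each read2 line can be emitted at most once in this sketch, whichever chromosomes the read1 alignments sit on.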
Thanks for this. Sorry it's taken me a long time to get around to this. I'm sorry to be a pain, but could you split this into two pull requests? The first problem definitely needs your fix. I'm still troubled by the second problem though. Your new solution certainly doesn't have the memory issues, but it assumes that if read2 is the mate of read1, then read1 will be the mate of read2, and this is not always the case. As you pointed out in your first post, BWA will have multiple read1s pointing to the same read2. Under this new scheme, if we don't output read1(p), but do output read1(s), then read2, which will point to read1(p) as its mate if it is mapped using BWA, will not be output, when it probably should be. I'm still not sure I have a good alternative that works in all situations though. |
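The asymmetry described in this comment can be shown with a toy example. This is only an illustration under assumed data: the `Aln` record, its field names, the coordinates, and `read2_linked_to_kept_read1` are all hypothetical and are not taken from umi_tools or from the test file in this issue.

```python
from collections import namedtuple

# Simplified stand-in for an alignment record; all field names are hypothetical.
Aln = namedtuple("Aln", "name is_read1 ref pos mate_ref mate_pos")


def read2_linked_to_kept_read1(read2, kept_read1):
    """True if read2's mate fields point at one of the kept read1 alignments."""
    return any(
        (read2.name, read2.mate_ref, read2.mate_pos) == (r1.name, r1.ref, r1.pos)
        for r1 in kept_read1
    )


read1_p = Aln("q1", True, "chr1", 100, "chr1", 500)  # read1(p)
read1_s = Aln("q1", True, "chr2", 900, "chr1", 500)  # read1(s), also names read2 as mate
read2 = Aln("q1", False, "chr1", 500, "chr1", 100)   # mate fields point at read1(p) only

# Only read1(s) survives deduplication: read2 has no link back to it and is dropped.
kept_only_s = read2_linked_to_kept_read1(read2, [read1_s])

# Only read1(p) survives: the link is mutual, so read2 is kept.
kept_only_p = read2_linked_to_kept_read1(read2, [read1_p])
```

Here `kept_only_s` is `False` while `kept_only_p` is `True`, which is the case raised above: read1(s) is output without its mate because the mate link is not symmetric.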
…uplicated_lines Fix for incorrect output of dedup with --paired (#347) in case of duplicated lines
Has this issue been fixed? I think I've just come across the bug, leading to people using CLC not being able to open the bam files that I'm producing. |
Yes. This should be fixed in 1.0.1 |
version: 1.1.1 has the same issue still |
Hello, we are having the same issue still in version 1.1.1.
We noticed that the issue does go away when running on separate contigs, for example when running only on chr19.
We are not sure whether running on separate contigs and then merging is the way to go. Might this affect the ability to detect inter-chr translocations? |
Sorry, the bug that was fixed in 1.0.1 was that read1 could be output twice if it was mapped to the same location as read2. The problem you have here is outputting read2 twice. Is this the total output? It seems to me that the only way this could happen would be if there was another alignment for |
Hello, That was just a subset, but there are more of these cases where it was outputting read2 twice.
We tried 1.1.1 and also 1.0.1 and indeed, it didn't help for read2. |
I agree with @IanSudbery that this shouldn't occur. Despite the lack of activity on this issue, I'm loath to close it in case we're missing something here. If someone can upload a reproducible example of this behaviour with the latest version of UMI-tools, that would be very helpful! |
Hej hej,
I noticed an issue with umi_tools dedup using the option --paired, which in some cases reports the read alignments multiple times in the output. By that, I mean that one gets literally duplicate lines in the SAM/BAM format.
I have traced it down to the output writer used with --paired, namely the TwoPassWriter in sam_methods.py. It is supposed to output the second-in-pair read alignments that correspond to reported first-in-pair read alignments. However, in case the first-in-pair and second-in-pair read alignments share the same reference name and reference start, it may report the first-in-pair read alignment twice. Also, it produces incorrect (or at least unexpected) output in case the same second-in-pair read alignment is referred to by multiple first-in-pair read alignments (which bwa mem does by default in case there are secondary read alignments).
I compiled a small test set of read alignments illustrating both cases and I am currently working on a fix that I will make available as a pull request. But for those who are interested, and also for checking the fix from the pull request later, please find the alignment file for download (here). It contains the following read alignments (for those who do not want to download it).
To reproduce the issue, simply run the following command with the newest version of umi_tools (I used the master branch).
This results in the following output (without SAM header).
One can see the unexpected result of two identical second-in-pair read alignments for each of the reads.
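For anyone who wants to check their own output for this symptom, a small sketch like the following counts alignment lines that appear more than once in SAM text. The function name `find_duplicate_alignments` and the example records are made up for illustration; they are not the alignments from the test set above.

```python
from collections import Counter


def find_duplicate_alignments(sam_lines):
    """Return {alignment line: count} for alignment lines seen more than once."""
    counts = Counter(
        line.rstrip("\n")
        for line in sam_lines
        if line.strip() and not line.startswith("@")  # skip blank and header lines
    )
    return {line: n for line, n in counts.items() if n > 1}


# Hypothetical records for illustration only.
example = [
    "@HD\tVN:1.6\tSO:coordinate",
    "readA_UMI\t99\tchr1\t100\t60\t10M\t=\t200\t110\tACGTACGTAC\tFFFFFFFFFF",
    "readA_UMI\t147\tchr1\t200\t60\t10M\t=\t100\t-110\tACGTACGTAC\tFFFFFFFFFF",
    "readA_UMI\t147\tchr1\t200\t60\t10M\t=\t100\t-110\tACGTACGTAC\tFFFFFFFFFF",
]
duplicates = find_duplicate_alignments(example)
```

The function accepts any iterable of lines, so it can also be fed an open SAM file or the text form of a BAM file read from standard input.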
Just for completeness, I am using Python 3.6.7 and umi_tools (e8c2b47) on Mac OSX.
Best regards,
Christian