-
Notifications
You must be signed in to change notification settings - Fork 86
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Less variant reads than expected with specified VAF #184
Comments
This looks like targeted or amplicion sequencing data, is that correct? If so, do you know what protocol? |
It's an artificial dataset created with NEAT and based on regions of a targeted kit. |
Hi Adam, |
I tested this behavior some more, and it seems to be prevalent around neighboring SNVs, i.e. cases where the input .bed file has entries like Even when running with --haplosize 300 --force --insane, these two variants make it into the output vcf file, but final bam is missing one of them. Here are the output vcf entries and the pileups on the two positions: 1 11182158 . A C 100 PASS SOMATIC;VAF=0.20588235294117646;DPR=33.5 GT 0/1 1 11182158 N 34 a$aAAcAaAaaaAAAAaaaAAaAaAAAAaAAaAAA ABGGEFCFCDCDEDEDDC>CCCECCDDFCCGDE? Notice how the first locus has only one C-alt while it was targeted at VAF = 0.22 and reported at VAF 0.20 So I went on to look at the intermediate bams, namely the predecessor .muts.bam, and saw that the depth of coverage at those positions is almost double that of the original: 1 11182158 N 60 a$aAAcAaAAcacaaaAAAAAAAAaaaaaaAAAAaaAAaaCACAAACAaAACAaaAAAAAA ABGGEFCFFCCDDCCDDEEDDEEDDDCDC>C>CCCCCEECCCCDDDDFCCCCGGDDEE?? I only see this depth explosion around loci with neighboring SNVs, which leads me to suspect that the reads are gathered redundantly (later on I did verify this). So, perhaps, during the merging of the .muts.bam into the final bam this redundancy is not handled correctly which causes some (or all) mutated copies to fall out? |
to follow up, there also seems to be a non-deterministic component at play here. Running the same command twice produced different mutants, both at much lower AF than reported in the vcf. Here's the diff on the pileups in the same two positions I'm exploring above.
As you can see, the depth is the same on the right and the left but different reads are mutated. |
Hi there, sorry I haven't been able to get to this quickly. What kind of data are you using e.g. WGS or targeted sequencing? What are the command-line arguments? Generally the best way for me to troubleshoot these sort of issues is to have a small .bam file and mutation inputs sufficient to replicate the problem, would you be able to provide this? Finally, you can make mutations deterministically by setting a seed (--seed from memory but see -h output if that's not right) |
Hello Adam,
Thank you for looking into this. Is there a good place to upload a small
data set (bam + bed) or should I just attach it here?
…On Mon, Apr 17, 2023 at 1:11 PM adamewing ***@***.***> wrote:
Hi there, sorry I haven't been able to get to this quickly. What kind of
data are you using e.g. WGS or targeted sequencing? What are the
command-line arguments? Generally the best way for me to troubleshoot these
sort of issues is to have a small .bam file and mutation inputs sufficient
to replicate the problem, would you be able to provide this? Finally, you
can make mutations deterministically by setting a seed (--seed from memory
but see -h output if that's not right)
—
Reply to this email directly, view it on GitHub
<#184 (comment)>,
or unsubscribe
<https://github.com/notifications/unsubscribe-auth/AY4NOR4TQT3ZSQJ43TCMRNLXBWPXBANCNFSM5AJFGV6Q>
.
You are receiving this because you commented.Message ID:
***@***.***>
--
This email and any files transmitted with it are confidential and intended
solely for the use of the individual or entity to whom they are addressed.
This message may contain privileged and / or confidential information. If
you are NOT the intended recipient of this message, copying, printing,
disseminating, forwarding or any other use or action derived from its
content is strictly prohibited. Please notify the sender immediately by
e-mail if you have received this e-mail by error and delete this e-mail
from your system. If you received the email by error and this message
contains patient information, please report the error by contacting the
Personalis Clinical Laboratory at ***@***.***
***@***.***>.
|
I should have noted - please only send data if it's based on a public data set (i.e. already openly downloadable somewhere) and is not based on patient data. If that's all good then yes you could drop it here. I would say set a window a few times larger than your library size around the mutation i.e. |
I could reproduce the problem but had to include multiple target loci. I appears that the sort order plays into it somehow because if I minimize the set to only two neighboring loci, the output is correct. Attached is the data set along with the intermediate and the final output bam. Here's the command I used. The three flags at the end were my attempt to relax any depth ration restrictions, but they really didn't make a difference in this case.
Also, if you notice, I'm using as the reference our indexed human build 37. I didn't include it in the tarball as it's pretty bulky, but you can probably use any hs37 you can get your hands on. Please let me know if you need anything else. |
p.s. As to why I'm using --force, that's the only option I found that results in both neighboring loci 1:11182158 and 1:11182179 getting accepted (i.e. both end up in the vcf file). |
Hi Adam, `$ ~/bin/samtools view my_starting.bam 15:22490135-22490135 | grep A00780:906:HLW7WDSX5:1:2323:12084:18818 | cut -f1,2,3,4,5 A00780:906:HLW7WDSX5:1:2323:12084:18818 99 15 22489996 60 $ ~/bin/samtools view addsnv.9b14cb49-a7ba-449c-8f1e-5c4091f8a631.muts.bam 15:22490135-22490135 | grep A00780:906:HLW7WDSX5:1:2323:12084:18818 | cut -f1,2,3,4,5 A00780:906:HLW7WDSX5:1:2323:12084:18818 97 15 22489996 32 |
Hi again, the flags all come from bwa, I suspect it's something related to a mismatch between the tool/parameters used to align the reads initially vs what bamsurgeon did (current defaults are painfully out of date). Haven't forgotten about the previous query either, still haven't gotten to it so apologies for that. I'll look into the specific reasons it's failing but in general though, if you have to use |
It seems like the small number of reads that bamSurgeon had to re-map in
this case let to a skewed insert size estimate. That's what probably led
to the pairs being marked as abnormal. From 'samtools stats' on the
muts.bam file:
SN insert size average: 391.4
SN insert size standard deviation: 1098.2
Doing the same on the* input *bam yields
SN insert size average: 217.7
SN insert size standard deviation: 55.7
As a future feature, perhaps bamSurgeon could infer the insert size from
the *input* bam and then apply it when re-mapping using this bwa mem flag
-I FLOAT[,FLOAT[,INT[,INT]]]
specify the mean, standard deviation (10% of the mean
if absent), max
(4 sigma from the mean if absent) and min of the
insert size distribution.
FR orientation only. [inferred]
Again, sorry for diverting from the original topic as this seems to be
unrelated.
…On Mon, May 1, 2023 at 10:55 PM adamewing ***@***.***> wrote:
Hi again, the flags all come from bwa, I suspect it's something related to
a mismatch between the tool/parameters used to align the reads initially vs
what bamsurgeon did (current defaults are painfully out of date). Haven't
forgotten about the previous query either, still haven't gotten to it so
apologies for that. I'll look into the specific reasons it's failing but in
general though, if you have to use --force chances are the resulting
mutations won't have been done properly.
—
Reply to this email directly, view it on GitHub
<#184 (comment)>,
or unsubscribe
<https://github.com/notifications/unsubscribe-auth/A7DZKNU4C4XAOAHEO4NRE53XECOTPANCNFSM5AJFGV6Q>
.
You are receiving this because you commented.Message ID:
***@***.***>
|
Before I forget, I just wanted to comment that I found a bug in Bamsurgeon that might account for this behavior. The issue occurs in replace_reads.py when loading reads from the "donorbam" (the aligned merged BAM created from reads with mutations added). When populating the rdict (see here) the read IDs are used as keys. However, because mutations can be introduced in the same reads in different places, the same read ID can appear more than once in the donorbam. When a read ID is processed that was previously added to the rdict, it replaces the previous read in the rdict. This causes some fraction of reads with added mutations to be "lost", causing the relevant allele frequencies to drop below the set value. Without a more in-depth review of the code, I can't think of a quick fix for this issue. At first, I thought that you could add a function to "merge reads" whenever the same read ID shows up in the donorbam, but after merging the reads you'd need to do another round of realignment to fix the CIGAR strings which would be a pain. I think that the fix would need to be made back when the mutations are first introduced into the reads. This might affect other open issues like #174. |
Hi Adam,
I'm trying to insert an SNP in a BAM file (30.000X coverage), using the following input:
chr17 7674230 7674230 1 T
With REF=C and hg=hg38.
As you can see from the log it takes the correct input VAF and it creates the correct number of mutated reads:
However, it looks like only half of the reads are actually being used for the variant locus (is it adding a heterozygous variant?):
These are the input parameters:
The text was updated successfully, but these errors were encountered: