-
Notifications
You must be signed in to change notification settings - Fork 5
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Problem with the logic for allelic imbalance qc step #134
Comments
Here is a potential modification to fix this problem: |
Hi @rdey-insitro , thanks for your issue. We've run QC on the UKB WES data without running into the issue of not having any genotypes/variants left, and we also did some dedicated testing of this particular bcftools command. To help us diagnose what is going wrong for you, could you please provide us one of the following?
In case this wasn't clear, also keep in mind as we try to understand the issue that the bcftools command is outputting a TSV of variants that should be filtered out (which is done at a later step in Python). Looking forward to your reply. |
Hi, Let's use the fake vcf files in the example directory. I am using the same config file This is the set of "filtered out" variants due to allelic imbalance on chr 21:
And here is the vcf readouts from chr 21:
Notice that, all the variants in this vcf file that have at least one heterozygous sample have been filtered out, even though for example |
After some more digging, found that this might be due to a change in bcftools version. This may explain why previous extensive tests have not found this issue. I am using bcftools v1.17 for the above output which is consistent with the |
Thanks very much for looking more deeply into this. We'll leave this issue open while we look into a more permanent fix, but it sounds like using v1.10 might be a temporary workaround. |
I was trying to QC the 200K UKB WES data using the preprocessing with QC pipeline and I am finding the final genotypes.h5 file to have no genotypes at all (the file size was 8Mb). I did some digging and it seems all the available variants in the bcf file has been filtered out at the allelic imbalance step. The
qc/allelic_imbalance/myfile.vcf.gz
surprisingly includes all the available variant IDs. I am suspecting the following logic issue:Affected file:
pipelines/preprocessing/qc.snakefile
Affected code:
bcftools query --format '%CHROM\t%POS\t%REF\t%ALT\n' --exclude 'COUNT(GT="het")=0 || (GT="het" & ((TYPE="snp" & (FORMAT/AD[*:1] / FORMAT/AD[*:0]) > 0.15) | (TYPE="indel" & (FORMAT/AD[*:1] / FORMAT/AD[*:0]) > 0.20)))' {{input}} | gzip > {{output}}
I think this is not doing the allelic imbalance QC properly. The above step in my understanding excludes variant-sample pairs from the bcf query where the above criteria is met. But, almost all variants have at least one sample where the above criteria is not met. Then, that variant would sneak through to the variant list in the allelic_imbalance directory and end up being excluded in the final genotype.h5 file. So, instead of filtering out variants for which all the heterozygous samples have allele imbalance, this step is filtering out variants if any heterozygous sample has allele imbalance.
I manually ran the same bcftools code on the normalized bcf files, and the output contains all the variant IDs. If the
--exclude
flag is replaced by--include
flag, still the output would contain all the variant IDs.Please let me know if it is an issue with the logical criteria above, or there is something I am doing wrong.
The text was updated successfully, but these errors were encountered: