Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make gnotate from ExAC downloaded from Gnomad #122

Open
Drdreammaerd opened this issue Feb 21, 2022 · 13 comments
Open

Make gnotate from ExAC downloaded from Gnomad #122

Drdreammaerd opened this issue Feb 21, 2022 · 13 comments

Comments

@Drdreammaerd
Copy link

Hi Brent,

I would like to make annotated file of ExAC downloaded from here
All chromosomes VCF 4.56 GiB, MD5: f2b57a6f0660a00e7550f62da2654948

The version of slivar is 0.2.7 71af7d12881ae0590c6d2a97ef2b282cc93fe7c6

I have fixed with few header problem and normalize with bcftools.

However, I still got an error message like this

> slivar version: 0.2.7 71af7d12881ae0590c6d2a97ef2b282cc93fe7c6
[slivar tsv] warning! didn't find ANN in header in ./ExAC.r1.sites.vep.fix.pass.sm.ln.hg38.fixh.vcf.gz trying other fields
[slivar] unable to find gene field in ANN

The command I use is

slivar make-gnotate --prefix ExACv1.0 --field AF:ExAC_af_v1     $dir/ExAC.r1.sites.vep.fix.pass.sm.ln.hg38.fixh.vcf.gz

Thanks for your time,

David

@brentp
Copy link
Owner

brentp commented Feb 21, 2022

Can you show what the header of your VCF has for ANN and CSQ?

@Drdreammaerd
Copy link
Author

Header of vcf
#CHROM POS ID REF ALT QUAL FILTER INFO
Annotated info contains in INFO column.

I didn't fine any info stated ANN in the annotated information as well as in the meta-information.

@brentp
Copy link
Owner

brentp commented Feb 21, 2022

are there other lines above that one that start with '#'? if not, this is not valid VCF and slivar wont' be able to use it.
If so, can you show those lines?

@Drdreammaerd
Copy link
Author

Did you mean this?

##fileformat=VCFv4.2
##hailversion=0.2.61-3c86d3ba497a
##FORMAT=<ID=AD,Number=.,Type=Integer,Description="">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="">
##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="">
##FORMAT=<ID=GT,Number=1,Type=String,Description="">
##FORMAT=<ID=MIN_DP,Number=1,Type=Integer,Description="">
##FORMAT=<ID=PL,Number=G,Type=Integer,Description="">
##FORMAT=<ID=SB,Number=.,Type=Integer,Description="">
##INFO=<ID=AC,Number=.,Type=Integer,Description="">
##INFO=<ID=AC_AFR,Number=.,Type=Integer,Description="">
##INFO=<ID=AC_AMR,Number=.,Type=Integer,Description="">
##INFO=<ID=AC_Adj,Number=.,Type=Integer,Description="">
##INFO=<ID=AC_CONSANGUINEOUS,Number=.,Type=String,Description="">
##INFO=<ID=AC_EAS,Number=.,Type=Integer,Description="">
##INFO=<ID=AC_FEMALE,Number=.,Type=String,Description="">
##INFO=<ID=AC_FIN,Number=.,Type=Integer,Description="">
##INFO=<ID=AC_Hemi,Number=.,Type=Integer,Description="">
##INFO=<ID=AC_Het,Number=.,Type=Integer,Description="">
##INFO=<ID=AC_Hom,Number=.,Type=Integer,Description="">
##INFO=<ID=AC_MALE,Number=.,Type=String,Description="">
##INFO=<ID=AC_NFE,Number=.,Type=Integer,Description="">
##INFO=<ID=AC_OTH,Number=.,Type=Integer,Description="">
##INFO=<ID=AC_POPMAX,Number=.,Type=String,Description="">
##INFO=<ID=AC_SAS,Number=.,Type=Integer,Description="">
##INFO=<ID=AF,Number=.,Type=Float,Description="">
##INFO=<ID=AGE_HISTOGRAM_HET,Number=.,Type=String,Description="">
##INFO=<ID=AGE_HISTOGRAM_HOM,Number=.,Type=String,Description="">
##INFO=<ID=AN,Number=1,Type=Integer,Description="">
##INFO=<ID=AN_AFR,Number=1,Type=Integer,Description="">
##INFO=<ID=AN_AMR,Number=1,Type=Integer,Description="">
##INFO=<ID=AN_Adj,Number=1,Type=Integer,Description="">
##INFO=<ID=AN_CONSANGUINEOUS,Number=1,Type=String,Description="">
##INFO=<ID=AN_EAS,Number=1,Type=Integer,Description="">
##INFO=<ID=AN_FEMALE,Number=1,Type=String,Description="">
##INFO=<ID=AN_FIN,Number=1,Type=Integer,Description="">
##INFO=<ID=AN_MALE,Number=1,Type=String,Description="">
##INFO=<ID=AN_NFE,Number=1,Type=Integer,Description="">
##INFO=<ID=AN_OTH,Number=1,Type=Integer,Description="">
##INFO=<ID=AN_POPMAX,Number=.,Type=String,Description="">
##INFO=<ID=AN_SAS,Number=1,Type=Integer,Description="">
##INFO=<ID=BaseQRankSum,Number=1,Type=Float,Description="">
##INFO=<ID=CCC,Number=1,Type=Integer,Description="">
##INFO=<ID=CSQ,Number=.,Type=String,Description="">
##INFO=<ID=ClippingRankSum,Number=1,Type=Float,Description="">
##INFO=<ID=DB,Number=0,Type=Flag,Description="">
##INFO=<ID=DOUBLETON_DIST,Number=.,Type=String,Description="">
##INFO=<ID=DP,Number=1,Type=Integer,Description="">
##INFO=<ID=DP_HIST,Number=.,Type=String,Description="">
##INFO=<ID=DS,Number=0,Type=Flag,Description="">
##INFO=<ID=END,Number=1,Type=Integer,Description="">
##INFO=<ID=ESP_AC,Number=.,Type=String,Description="">
##INFO=<ID=ESP_AF_GLOBAL,Number=.,Type=String,Description="">
##INFO=<ID=ESP_AF_POPMAX,Number=.,Type=String,Description="">
##INFO=<ID=FS,Number=1,Type=Float,Description="">
##INFO=<ID=GQ_HIST,Number=.,Type=String,Description="">
##INFO=<ID=GQ_MEAN,Number=1,Type=Float,Description="">
##INFO=<ID=GQ_STDDEV,Number=1,Type=Float,Description="">
##INFO=<ID=HOM_CONSANGUINEOUS,Number=.,Type=String,Description="">
##INFO=<ID=HWP,Number=1,Type=Float,Description="">
##INFO=<ID=HaplotypeScore,Number=1,Type=Float,Description="">
##INFO=<ID=Hemi_AFR,Number=.,Type=Integer,Description="">
##INFO=<ID=Hemi_AMR,Number=.,Type=Integer,Description="">
##INFO=<ID=Hemi_EAS,Number=.,Type=Integer,Description="">
##INFO=<ID=Hemi_FIN,Number=.,Type=Integer,Description="">
##INFO=<ID=Hemi_NFE,Number=.,Type=Integer,Description="">
##INFO=<ID=Hemi_OTH,Number=.,Type=Integer,Description="">
##INFO=<ID=Hemi_SAS,Number=.,Type=Integer,Description="">
##INFO=<ID=Het_AFR,Number=.,Type=Integer,Description="">
##INFO=<ID=Het_AMR,Number=.,Type=Integer,Description="">
##INFO=<ID=Het_EAS,Number=.,Type=Integer,Description="">
##INFO=<ID=Het_FIN,Number=.,Type=Integer,Description="">
##INFO=<ID=Het_NFE,Number=.,Type=Integer,Description="">
##INFO=<ID=Hom_OTH,Number=.,Type=Integer,Description="">
##INFO=<ID=Hom_SAS,Number=.,Type=Integer,Description="">
##INFO=<ID=InbreedingCoeff,Number=1,Type=Float,Description="">
##INFO=<ID=K1_RUN,Number=.,Type=String,Description="">
##INFO=<ID=K2_RUN,Number=.,Type=String,Description="">
##INFO=<ID=K3_RUN,Number=.,Type=String,Description="">
##INFO=<ID=KG_AC,Number=.,Type=String,Description="">
##INFO=<ID=KG_AF_GLOBAL,Number=.,Type=String,Description="">
##INFO=<ID=KG_AF_POPMAX,Number=.,Type=String,Description="">
##INFO=<ID=MLEAC,Number=.,Type=Integer,Description="">
##INFO=<ID=MLEAF,Number=.,Type=Float,Description="">
##INFO=<ID=MQ,Number=1,Type=Float,Description="">
##INFO=<ID=MQ0,Number=1,Type=Integer,Description="">
##INFO=<ID=MQRankSum,Number=1,Type=Float,Description="">
##INFO=<ID=NCC,Number=1,Type=Integer,Description="">
##INFO=<ID=NEGATIVE_TRAIN_SITE,Number=0,Type=Flag,Description="">
##INFO=<ID=POPMAX,Number=.,Type=String,Description="">
##INFO=<ID=POSITIVE_TRAIN_SITE,Number=0,Type=Flag,Description="">
##INFO=<ID=QD,Number=1,Type=Float,Description="">
##INFO=<ID=ReadPosRankSum,Number=1,Type=Float,Description="">
##INFO=<ID=VQSLOD,Number=1,Type=Float,Description="">
##INFO=<ID=clinvar_conflicted,Number=.,Type=String,Description="">
##INFO=<ID=clinvar_measureset_id,Number=.,Type=String,Description="">
##INFO=<ID=clinvar_mut,Number=.,Type=String,Description="">
##INFO=<ID=clinvar_pathogenic,Number=.,Type=String,Description="">
##INFO=<ID=culprit,Number=1,Type=String,Description="">

@brentp
Copy link
Owner

brentp commented Feb 21, 2022

OK. If you remove this line: ##INFO=<ID=CSQ,Number=.,Type=String,Description=""> from the header, then it should work. I assume you modified that? Usually, it would contain, in Description, how to parse the CSQ field. But that's not needed for your use-case here anyway, so you can remove it completely from the vcf.

@Drdreammaerd
Copy link
Author

Drdreammaerd commented Feb 21, 2022

Should I remove that line even if the variants has CSQ information like this

OUS=0;AC_EAS=0;AC_FEMALE=0;AC_FIN=0;AC_Het=0;AC_Hom=1;AC_MALE=2;AC_NFE=0;AC_OTH=0;AC_POPMAX=2;AC_SAS=2
;AF=6.99800e-05;AGE_HISTOGRAM_HET=0|0|0|0|0|0|1|0|0|0|0|0;AGE_HISTOGRAM_HOM=0|0|0|0|0|0|1|0|0|0|0|0;AN
=42870;AN_AFR=770;AN_AMR=134;AN_Adj=8432;AN_CONSANGUINEOUS=926;AN_EAS=254;AN_FEMALE=2914;AN_FIN=16;AN_
MALE=5518;AN_NFE=2116;AN_OTH=90;AN_POPMAX=5052;AN_SAS=5052;BaseQRankSum=0.727000;CSQ=C|downstream_gene
_variant|MODIFIER|WASH7P|ENSG00000227232|Transcript|ENST00000423562|unprocessed_pseudogene||||||||||rs
752859895|1|991|-1||SNV|1|HGNC|38034|||||||||||||||C:0.0003959||||||||C:0|C:6.998e-05|C:0|C:0.0002372|
C:0|C:0|C:0|C:0|||||||||||||GGA|.,C|downstream_gene_variant|MODIFIER|WASH7P|ENSG00000227232|Transcript
|ENST00000438504|unprocessed_pseudogene||||||||||rs752859895|1|991|-1||SNV|1|HGNC|38034|YES|||||||||||
|||C:0.0003959||||||||C:0|C:6.998e-05|C:0|C:0.0002372|C:0|C:0|C:0|C:0|||||||||||||GGA|.,C|splice_regio
n_variant&non_coding_transcript_exon_variant&non_coding_transcript_variant|LOW|DDX11L1|ENSG00000223972
|Transcript|ENST00000450305|transcribed_unprocessed_pseudogene|5/6||ENST00000450305.2:n.412G>C||412|||
||rs752859895|1||1||SNV|1|HGNC|37102|||||||||||||||C:0.0003959||||||||C:0|C:6.998e-05|C:0|C:0.0002372|
C:0|C:0|C:0|C:0|||||||||||||GGA|.,C|non_coding_transcript_exon_variant&non_coding_transcript_variant|M
ODIFIER|DDX11L1|ENSG00000223972|Transcript|ENST00000456328|processed_transcript|3/3||ENST00000456328.2
:n.620G>C||620|||||rs752859895|1||1||SNV|1|HGNC|37102|YES||||||||||||||C:0.0003959||||||||C:0|C:6.998e
-05|C:0|C:0.0002372|C:0|C:0|C:0|C:0|||||||||||||GGA|.,C|downstream_gene_variant|MODIFIER|WASH7P|ENSG00
000227232|Transcript|ENST00000488147|unprocessed_pseudogene||||||||||rs752859895|1|1032|-1||SNV|1|HGNC
|38034|||||||||||||||C:0.0003959||||||||C:0|C:6.998e-05|C:0|C:0.0002372|C:0|C:0|C:0|C:0|||||||||||||GG
A|.,C|non_coding_transcript_exon_variant&non_coding_transcript_variant|MODIFIER|DDX11L1|ENSG0000022397
2|Transcript|ENST00000515242|transcribed_unprocessed_pseudogene|3/3||ENST00000515242.2:n.613G>C||613||
|||rs752859895|1||1||SNV|1|HGNC|37102|||||||||||||||C:0.0003959||||||||C:0|C:6.998e-05|C:0|C:0.0002372|C:0|C:0|C:0|C:0|||||||||||||GGA|.,C|intron_variant&non_coding_transcript_variant|MODIFIER|DDX11L1|ENSG00000223972|Transcript|ENST00000518655|transcribed_unprocessed_pseudogene||2/3|ENST00000518655.2:n.482-31G>C|||||||rs752859895|1||1||SNV|1|HGNC|37102|||||||||||||||C:0.0003959||||||||C:0|C:6.998e-05|C:0|C:0.0002372|C:0|C:0|C:0|C:0|||||||||||||GGA|.,C|downstream_gene_variant|MODIFIER|WASH7P|ENSG00000227232|Transcript|ENST00000538476|unprocessed_pseudogene||||||||||rs752859895|1|1039|-1||SNV|1|HGNC|38034|||||||||||||||C:0.0003959||||||||C:0|C:6.998e-05|C:0|C:0.0002372|C:0|C:0|C:0|C:0|||||||||||||GGA|.,C|downstream_gene_variant|MODIFIER|WASH7P|ENSG00000227232|Transcript|ENST00000541675|unprocessed_pseudogene||||||||||rs752859895|1|991|-1||SNV|1|HGNC|38034|||||||||||||||C:0.0003959||||||||C:0|C:6.998e-05|C:0|C:0.0002372|C:0|C:0|C:0|C:0|||||||||||||GGA|.,C|regulatory_region_variant|MODIFIER|||RegulatoryFeature|ENSR00001576075|CTCF_binding_site||||||||||rs752859895|1||||SNV|1|||||||||||||||||C:0.0003959||||||||C:0|C:6.998e-05|C:0|C:0.0002372|C:0|C:0|C:0|C:0|||||||||||||GGA|.;ClippingRankSum=1.15000;DP=139843;DP_HIST=14728|2455|2120|518|121|499|534|314|111|21|10|2|2|0|0|0|0|0|0|0,1|0|0|0|1|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0;ESP_AC=0;ESP_AF_GLOBAL=0;ESP_AF_POPMAX=0;FS=0.00000;GQ_HIST=1012|14971|172|100|3161|259|127|30|8|9|5|16|1162|274|59|45|17|2|3|3,0|0|0|0|0|1|0|0|0|0|0|0|0|1|0|0|0|0|0|0;GQ_MEAN=12.4800;GQ_STDDEV=15.1800;HOM_CONSANGUINEOUS=0;Het_AFR=0;Het_AMR=0;Het_EAS=0;Het_FIN=0;Het_NFE=0;Het_OTH=0;Het_SAS=0;Hom_AFR=0;Hom_AMR=0;Hom_EAS=0;Hom_FIN=0;Hom_NFE=0;Hom_OTH=0;Hom_SAS=1;InbreedingCoeff=-0.0844000;K1_RUN=G:0;K2_RUN=GA:0;K3_RUN=GAA:0;KG_AC=0;KG_AF_GLOBAL=0;KG_AF_POPMAX=0;MQ=35.7200;MQ0=0;MQRankSum=0.727000;NCC=60853;POPMAX=SAS;QD=23.4200;ReadPosRankSum=0.727000;VQSLOD=-1.68700;culprit=MQ

@brentp
Copy link
Owner

brentp commented Feb 21, 2022

You can instead try this linux binary (just gunzip, chmod +x and then run as ./slivar_dev ...).
slivar_dev.gz

if this works for you, I'll incorporate this into the next release.

@Drdreammaerd
Copy link
Author

Drdreammaerd commented Feb 24, 2022

Hi Brent

I tried this one slivar_dev.gz

It did start to run but got cancelled when it run on Chr10

[slivar] warning: found ANN but it did not contain a description that indicated gene field. skipping
[slivar] warning: found CSQ but it did not contain a description that indicated gene field. skipping
[slivar tsv] warning! didn't find BCSQ in header in /storage1/fs1/jin810/Active/yung-chun/database/ExAC/ExAC.r1.sites.vep.fix.pass.sm.ln.hg38.fixh.vcf.gz trying other fields
[slivar] warning: found BCSQ but it did not contain a description that indicated gene field. skipping
[slivar] kvs.len for chr10: 349209 after /storage1/fs1/jin810/Active/yung-chun/database/ExAC/ExAC.r1.sites.vep.fix.pass.sm.ln.hg38.fixh.vcf.gz
[slivar] writing 349209 encoded and 388 long values for chromosome 10
[slivar] removed 10 duplicated positions by using the value and chromosome: 10
[slivar tsv] warning! didn't find ANN in header in /storage1/fs1/jin810/Active/yung-chun/database/ExAC/ExAC.r1.sites.vep.fix.pass.sm.ln.hg38.fixh.vcf.gz trying other fields
[slivar] warning: found ANN but it did not contain a description that indicated gene field. skipping
[slivar] warning: found CSQ but it did not contain a description that indicated gene field. skipping
[slivar tsv] warning! didn't find BCSQ in header in /storage1/fs1/jin810/Active/yung-chun/database/ExAC/ExAC.r1.sites.vep.fix.pass.sm.ln.hg38.fixh.vcf.gz trying other fields
SIGSEGV: Illegal storage access. (Attempt to read from nil?)

May I know what is the problem here?

Thanks,

David

@brentp
Copy link
Owner

brentp commented Feb 24, 2022

Hi, can you try this debug binary and share the error message?
slivar_dbg.gz

@Drdreammaerd
Copy link
Author

Yes, thanks for help.

here is the error I got

> slivar version: 0.2.8 6f116dfb8e416b28a55b3f46b6992ab930d46e8c
[slivar tsv] warning! didn't find ANN in header in /storage1/fs1/jin810/Active/yung-chun/database/ExAC/ExAC.r1.sites.vep.fix.pass.sm.ln.hg38.fixh.vcf.gz trying other fields
/home/brentp/src/slivar/src/slivar.nim(249) slivar
/home/brentp/src/slivar/src/slivar.nim(246) main
/home/brentp/src/slivar/src/slivarpkg/make_gnotate.nim(264) main
/home/brentp/src/slivar/src/slivarpkg/evaluator.nim(324) newEvaluator
/home/brentp/src/slivar/src/slivarpkg/tsv.nim(114) set_csq_fields
/nim/lib/system/fatal.nim(49) sysFatal
Error: unhandled exception: index 1 not in 0 .. 0 [IndexDefect]

@brentp
Copy link
Owner

brentp commented Feb 24, 2022

I see. The best way for you to proceed is to remove the CSQ, ANN, BCSQ fields from your VCF completely.
Use bcftools annotate -x $field to do this.

@brentp
Copy link
Owner

brentp commented Feb 25, 2022

This binary (just updated) fixes this problem for me on a similar VCF with no descriptioon of format in the CSQ field.
If it works for you I'll do more testing and make a new release.
slivar_dbg2.gz

@Drdreammaerd
Copy link
Author

slivar_dbg2.gz works on my side. Thanks a lot.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants