-
Notifications
You must be signed in to change notification settings - Fork 44
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
RefSeq Database Version #101
Comments
I think you can change them by editing the ftp address of your file in the following cs code. |
@naserelmi @malnirav did you guys figure this out? I dont think those FTP files located here are being used at all
the Downloader.dll command is just using a pre-compiled database chached thats put in S3 buckets and it just downloads those every time. @MichaelStromberg can you please explain how this script found here: https://github.com/Illumina/Nirvana/blob/main/CacheUtils/Commands/Download/ExternalFiles.cs |
@rajatshuvro please be advised ... your inputs appreciated! |
@GH-MStamboulian, You are right. The ExternalFiles.cs was not used. The Nirvana team changed the reference gene model to the newest version of RefGene and Ensemble Gene in their tool's last version, which needs an academic license. |
@Cimorgh-IT which version is this? is there something newer than 3.18.1? |
@Cimorgh-IT is correct. The current latest version is 3.22.0 and is available here: https://developer.illumina.com/illumina-connected-annotations. It is free for academic use. 3.22.0 uses updated gene models from RefSeq and Ensembl. Once you run the downloader, you will get all the files needed to run Nirvana. The new documentation is available here: https://illumina.github.io/IlluminaConnectedAnnotationsDocumentation/ Thanks |
have any of you installed this before? im not able to follow the instructions, I downloaded the zip file and then the shell script thats supposed to unzip it and build it but it keeps erroring out:
The zip file I got is IlluminaConnectedAnnotations-3.22.0-0-gc13dcb61-net6.0.zip and the shell script to unpack it is https://illumina.github.io/IlluminaConnectedAnnotationsDocumentation/assets/files/TestIlluminaConnectedAnnotations-e26785a7184802763e147e22e2a39eb6.sh @rajatshuvro any advice? |
I guess I need the IlluminaConnectedAnnotationsBuild.zip file instead ... can someone point me to this file please? |
Hi @GH-MStamboulian , unzip IlluminaConnectedAnnotations-3.22.0-0-gc13dcb61-net6.0.zip This will produce the necessary DLLs (Downloader.dll, Nirvana.dll, Annotator.dll). |
@GH-MStamboulian , it works if you provide the full path to the IlluminaConnectedAnnotations-3.22.0-0-gc13dcb61-net6.0.zip file. |
@rajatshuvro thank you for this I actually was able to unzip that file and use the Downloaded.dll to download the reference database files and then use the Nirvana.dll to annotate a vcf file to test it, however Illumina annotator (the new software) is still making the same mistakes as Nirvana regardless. I can see that you guys updated the transcripts, (the transcript minor versions are now more up to date) however lets take an example here to explain the issue: lets take the following variant: 1-98348885-G-A, bascally chromosome : 1, position : 98348885 and mut_nt G>A then based on Nirvana and Illumina Annotator we get the following refseq annotation:
as you notice it annotates this as as synonymous variant, and assigns it to codon Tgt to a Tgt (no transition) however in fact this should have been a missense mutation instead. I have used hg19 to call my variants and also RefSeq to interpret the transcript and in this case, which is the gene DPYD I also use the same RefSeq transcript as Nirvana for this gene which is 'NM_000110.4' You can double check that this is in fact the wrong annotation on UCSC genome browser (make sure you pick up hg19 as your reference) and enter the genomic location and you can see that this genomic location maps to the codon ACG which codes for Arginine (R), and then a G>A transition would result the codon Cgt>Tgt transition on the reverse transcript which in turn is a missense variant thats Arg29Cys on the protein level. Funny enough the annotation based on the ensembl transcripts are correct: {'transcript': 'ENST00000370192.3', @rajatshuvro can you please let me know what the issue is here and if infact theres a bug in the annotator that needs to be addressed? |
@GH-MStamboulian , @rajatshuvro |
This is most likely one of those cases where the GRCh37 genome does not have the most common allele of a polymorphism, while the MANE transcript uses the most common variant. You can check Gnomad (https://gnomad.broadinstitute.org/variant/1-98348885-G-A?dataset=gnomad_r2_1), this variant has an allele frequency > 70% and a GrpMax Filtering Allele Frequency > 80%. I have seen such examples quite often. This is one of the consequences that the GRCh37 genome sequence is based on a relatively small number of individuals, and there are differences between GRCh37 and GRCh38 in terms of which variant of a polymorphism is in the genomic backbone. By the way, one might suspect that the MANE transcripts always have the most common variant of a SNP, but that's also not always true. And even worse, I have seen examples where a previous version of a transcript database entry of a MANE transcript had the most common SNP, but the current version of the MANE transcript has not. So it's kind of a mess, and leads to such issues as you described. It is always a good idea to check the allele frequency of a variant, and unless you are interested in SNPs, to filter out SNPs. In your case, NM_000110.4 seems to contain the most common variant of this polymorphism, so no change is reported because the annotator compares against the transcript sequence and not against the genome sequence (it is still slightly confusing that it is reported as "synonymous" while it is really "no change", but I think there is no ontology term for this consequence), while ENST00000370192.3 has the minor variant of this polymorphism, which is why this is then reported as a missense variant. |
You are exactly right @heseber . Thanks. |
@rajatshuvro @naserelmi @heseber why is this going to be interpreted by allele frequencies? you can check that at chromosome 1 position 98348885 on GRCh37, the wildtype base is a G, where is the triplet TGT even coming from? It doesnt even make sense on the reverse transcript which is ACA, you can check this on the genome browser and that base is in fact a G as the reference and the surrounding triplet is acG , which would be Cgt on the reverse transcript. I didnt know that the reference sequence is up to interpretation by allele frequencies, it should be a fixed reference and what Nirvana is getting is the wrong reference base. Can you please elaborate? |
@rajatshuvro so you guys aren't using the standard reference sequence hg19? and you have your own version of it? |
Dear @GH-MStamboulian, I am sure as a bioinformatician you know what a polymorphism is. Just to recapitulate: there are locations in the genome where different individuals have a different sequence, there is variability of the genomic sequence across the human population. If at least two such different variants occur often enough in the population (a commonly used threshold is more than 1%), it is called a polymorphism, and if the change is only a single base (that is, no insertion or deletion), it is called a single nucleotide polymorphism (SNP). The GRCh37 genomic sequence does not reflect this variability because it defines a single unique nucleotide for each genomic location, whereas in reality the sequence at this location can have different nucleotides with different probabilities across the population. While GRCh37 always represents only a single variant of a polymorphism, which is not always the most common variant in the population, different transcripts from a transcript database can reflect the naturally occurring variability if they are derived from different individuals. So it may happen that, when sequencing transcripts or DNA, the sequence you measure corresponds to the most common variant in the population (which is even most likely), but it is different from the GRCh37 genomic sequence because the GRCh37 has a rare variant of this polymorphism in its sequence. This is the case for your example. If different transcripts from a transcript database represent different variants of a polymorphism, they have a slightly different sequence, for some of these transcripts a detected nucleotide actually means a change of sequence, while for some it means no change. It would be wrong to assign the same change (missense or synononymous or ...) to all transcripts, the correct solution is to compare the detected nucleotide with the sequence of each transcript, not with the sequence of the genome, because the genomic sequence of the GRCh37 genome is in disagreement with the sequence of a transcript if the genome sequence and the transcript sequence represent different variants of a polymorphism. The correct solution for annotating genomic variants is therefore to find all transcripts (isoforms, transcript database entries) that overlap with a particular genomic location, and then to compare the detected sequence (ALT from the VCF file) with the sequence of each of these transcripts, not with the sequence of the genome. In summary, the results provided by Illumina's annotator are correct. |
Thanks again @heseber , I couldn't have said it any better. Hope that clarifies it @GH-MStamboulian . |
@heseber thanks for the lesson :) but thats not quite what I was asking or looking for, @rajatshuvro I just think its misleading in Nirvana to name the reference database GRCh37, whereas in fact its not and deviates from the original reference thats published based on base frequencies obtained from a larger population. Can you please provide the reference sequence that Nirvana uses that overlays the suggested transcripts against and makes the annotation calls? |
@GH-MStamboulian, there is nothing that is misleading, I have the feeling you just don't get it, although I really tried hard to help you to understand. I give up now, I have no time to elaborate further. |
@heseber I never asked you to elaborate anything your just assigning that role to yourself, carry on with your day! @rajatshuvro I just want to understand how internally these annotations are passed? like do you have allele frequencies and compute what should be the representative on the fly or you have a pre-calculated custom reference that use it against? in any case Im wondering if you can provide more details concerning this and if theres a reference sequence and a list of transcripts used for each gene that you can share would be great because everything that gets downloaded are in binary format and I cannot unpack and see. |
What version of RefSeq database is downloaded/used in the annotation cache when executing:
dotnet Downloader.dll --ga GRCh37 -o Data
? Is there a way to control/leverage a specific transcript database version?The text was updated successfully, but these errors were encountered: