Historical GRCh38 refseq #51

Open
davmlaw opened this issue Jul 13, 2023 · 3 comments


davmlaw commented Jul 13, 2023

https://ncbiinsights.ncbi.nlm.nih.gov/2023/06/29/access-to-historical-human-transcript-alignments/

davmlaw added a commit that referenced this issue Aug 9, 2023

davmlaw commented Aug 9, 2023

Looks like this might be worth about 40k new transcripts (the historical file has 97k transcript versions that are not in the latest release, but we already had 57k of those from older files).

We can't just get rid of the old GFFs, as a few transcripts are still found only in them.

I might put the historical file at the very front, so that its transcripts are only used if nothing else provides them.

514794 transcript versions from:
https://ftp.ncbi.nlm.nih.gov/genomes/all/annotation_releases/9606/GCF_000001405.40-RS_2023_03/GCF_000001405.40_GRCh38.p14_genomic.gff.gz: 176343 (34.3%)
https://ftp.ncbi.nlm.nih.gov/genomes/all/annotation_releases/9606/GCF_000001405.40-RS_2023_03/RefSeq_historical_alignments/GCF_000001405.40-RS_2023_03_genomic.gff.gz: 97210 (18.9%)
https://ftp.ncbi.nlm.nih.gov/genomes/all/annotation_releases/9606/109.20211119/GCF_000001405.39_GRCh38.p13/GCF_000001405.39_GRCh38.p13_genomic.gff.gz: 75555 (14.7%)
http://ftp.ncbi.nlm.nih.gov/genomes/archive/old_refseq/Homo_sapiens/ARCHIVE/ANNOTATION_RELEASE.107/GFF/ref_GRCh38.p2_top_level.gff3.gz: 72137 (14.0%)
http://ftp.ncbi.nlm.nih.gov/genomes/archive/old_refseq/Homo_sapiens/ARCHIVE/ANNOTATION_RELEASE.108/GFF/ref_GRCh38.p7_top_level.gff3.gz: 43403 (8.4%)
http://ftp.ncbi.nlm.nih.gov/genomes/archive/old_refseq/Homo_sapiens/ARCHIVE/ANNOTATION_RELEASE.106/GFF/ref_GRCh38_top_level.gff3.gz: 34395 (6.7%)
https://ftp.ncbi.nlm.nih.gov/genomes/all/annotation_releases/9606/109/GCF_000001405.38_GRCh38.p12/GCF_000001405.38_GRCh38.p12_genomic.gff.gz: 5447 (1.1%)
https://ftp.ncbi.nlm.nih.gov/genomes/all/annotation_releases/9606/110/GCF_000001405.40_GRCh38.p14/GCF_000001405.40_GRCh38.p14_genomic.gff.gz: 3416 (0.7%)
https://ftp.ncbi.nlm.nih.gov/genomes/all/annotation_releases/9606/109.20210514/GCF_000001405.39_GRCh38.p13/GCF_000001405.39_GRCh38.p13_genomic.gff.gz: 1643 (0.3%)
https://ftp.ncbi.nlm.nih.gov/genomes/all/annotation_releases/9606/109.20190905/GCF_000001405.39_GRCh38.p13/GCF_000001405.39_GRCh38.p13_genomic.gff.gz: 886 (0.2%)
https://ftp.ncbi.nlm.nih.gov/genomes/all/annotation_releases/9606/109.20210226/GCF_000001405.39_GRCh38.p13/GCF_000001405.39_GRCh38.p13_genomic.gff.gz: 837 (0.2%)
https://ftp.ncbi.nlm.nih.gov/genomes/all/annotation_releases/9606/109.20191205/GCF_000001405.39_GRCh38.p13/GCF_000001405.39_GRCh38.p13_genomic.gff.gz: 732 (0.1%)
https://ftp.ncbi.nlm.nih.gov/genomes/all/annotation_releases/9606/109.20200522/GCF_000001405.39_GRCh38.p13/GCF_000001405.39_GRCh38.p13_genomic.gff.gz: 710 (0.1%)
https://ftp.ncbi.nlm.nih.gov/genomes/all/annotation_releases/9606/109.20200815/GCF_000001405.39_GRCh38.p13/GCF_000001405.39_GRCh38.p13_genomic.gff.gz: 681 (0.1%)
https://ftp.ncbi.nlm.nih.gov/genomes/all/annotation_releases/9606/109.20200228/GCF_000001405.39_GRCh38.p13/GCF_000001405.39_GRCh38.p13_genomic.gff.gz: 521 (0.1%)
https://ftp.ncbi.nlm.nih.gov/genomes/all/annotation_releases/9606/109.20201120/GCF_000001405.39_GRCh38.p13/GCF_000001405.39_GRCh38.p13_genomic.gff.gz: 477 (0.1%)
https://ftp.ncbi.nlm.nih.gov/genomes/all/annotation_releases/9606/109.20190607/GCF_000001405.39_GRCh38.p13/GCF_000001405.39_GRCh38.p13_genomic.gff.gz: 399 (0.1%)

Then, after moving the historical file to just after UTA:

514794 transcript versions from:
https://ftp.ncbi.nlm.nih.gov/genomes/all/annotation_releases/9606/GCF_000001405.40-RS_2023_03/GCF_000001405.40_GRCh38.p14_genomic.gff.gz: 176343 (34.3%)
https://ftp.ncbi.nlm.nih.gov/genomes/all/annotation_releases/9606/109.20211119/GCF_000001405.39_GRCh38.p13/GCF_000001405.39_GRCh38.p13_genomic.gff.gz: 75739 (14.7%)
http://ftp.ncbi.nlm.nih.gov/genomes/archive/old_refseq/Homo_sapiens/ARCHIVE/ANNOTATION_RELEASE.107/GFF/ref_GRCh38.p2_top_level.gff3.gz: 72137 (14.0%)
http://ftp.ncbi.nlm.nih.gov/genomes/archive/old_refseq/Homo_sapiens/ARCHIVE/ANNOTATION_RELEASE.108/GFF/ref_GRCh38.p7_top_level.gff3.gz: 45908 (8.9%)
https://ftp.ncbi.nlm.nih.gov/genomes/all/annotation_releases/9606/GCF_000001405.40-RS_2023_03/RefSeq_historical_alignments/GCF_000001405.40-RS_2023_03_genomic.gff.gz: 40537 (7.9%)
https://ftp.ncbi.nlm.nih.gov/genomes/all/annotation_releases/9606/109/GCF_000001405.38_GRCh38.p12/GCF_000001405.38_GRCh38.p12_genomic.gff.gz: 38043 (7.4%)
http://ftp.ncbi.nlm.nih.gov/genomes/archive/old_refseq/Homo_sapiens/ARCHIVE/ANNOTATION_RELEASE.106/GFF/ref_GRCh38_top_level.gff3.gz: 34396 (6.7%)
https://ftp.ncbi.nlm.nih.gov/genomes/all/annotation_releases/9606/109.20200522/GCF_000001405.39_GRCh38.p13/GCF_000001405.39_GRCh38.p13_genomic.gff.gz: 12841 (2.5%)
https://ftp.ncbi.nlm.nih.gov/genomes/all/annotation_releases/9606/109.20190607/GCF_000001405.39_GRCh38.p13/GCF_000001405.39_GRCh38.p13_genomic.gff.gz: 3876 (0.8%)
https://ftp.ncbi.nlm.nih.gov/genomes/all/annotation_releases/9606/110/GCF_000001405.40_GRCh38.p14/GCF_000001405.40_GRCh38.p14_genomic.gff.gz: 3641 (0.7%)
https://ftp.ncbi.nlm.nih.gov/genomes/all/annotation_releases/9606/109.20210514/GCF_000001405.39_GRCh38.p13/GCF_000001405.39_GRCh38.p13_genomic.gff.gz: 2131 (0.4%)
https://ftp.ncbi.nlm.nih.gov/genomes/all/annotation_releases/9606/109.20190905/GCF_000001405.39_GRCh38.p13/GCF_000001405.39_GRCh38.p13_genomic.gff.gz: 1943 (0.4%)
https://ftp.ncbi.nlm.nih.gov/genomes/all/annotation_releases/9606/109.20200815/GCF_000001405.39_GRCh38.p13/GCF_000001405.39_GRCh38.p13_genomic.gff.gz: 1667 (0.3%)
https://ftp.ncbi.nlm.nih.gov/genomes/all/annotation_releases/9606/109.20200228/GCF_000001405.39_GRCh38.p13/GCF_000001405.39_GRCh38.p13_genomic.gff.gz: 1645 (0.3%)
https://ftp.ncbi.nlm.nih.gov/genomes/all/annotation_releases/9606/109.20191205/GCF_000001405.39_GRCh38.p13/GCF_000001405.39_GRCh38.p13_genomic.gff.gz: 1581 (0.3%)
https://ftp.ncbi.nlm.nih.gov/genomes/all/annotation_releases/9606/109.20210226/GCF_000001405.39_GRCh38.p13/GCF_000001405.39_GRCh38.p13_genomic.gff.gz: 1425 (0.3%)
https://ftp.ncbi.nlm.nih.gov/genomes/all/annotation_releases/9606/109.20201120/GCF_000001405.39_GRCh38.p13/GCF_000001405.39_GRCh38.p13_genomic.gff.gz: 939 (0.2%)
postgresql://uta.biocommons.org/uta_20210129: 2 (0.0%)
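
To illustrate why the per-source counts shift when the historical file is moved, here is a minimal sketch of an ordered merge. It assumes (this is not taken from the cdot code) that sources are merged in list order and that a later source overwrites an earlier one, so a source near the front only contributes versions that nothing after it provides. `merge_sources`, `count_by_source`, the source names and the accessions are all hypothetical.

```python
from collections import Counter


def merge_sources(ordered_sources):
    """Merge transcripts from sources in order; later sources win.

    ordered_sources: list of (url, {transcript_version: data}) pairs.
    Assumption: a later source overwrites an earlier one, so the first
    source only contributes versions that no other source provides.
    """
    merged = {}
    for url, transcripts in ordered_sources:
        for version, data in transcripts.items():
            merged[version] = {**data, "url": url}
    return merged


def count_by_source(merged):
    """Count how many transcript versions were taken from each source."""
    return Counter(data["url"] for data in merged.values())


# Toy data: hypothetical accessions and source names, not real counts
historical = ("historical", {"NM_1.1": {}, "NM_1.2": {}, "NM_2.1": {}})
latest = ("latest", {"NM_1.2": {}, "NM_2.2": {}})

# Historical first: it acts as a fallback only
fallback_first = count_by_source(merge_sources([historical, latest]))
# Historical last: it takes over every version it contains
historical_last = count_by_source(merge_sources([latest, historical]))
print(fallback_first, historical_last)
```

Under that reading, the historical file's share dropping from 97210 to 40537 between the two listings means 56673 of its versions were already available from other sources, which is consistent with the 57k and 40k estimates above.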


davmlaw commented Aug 10, 2023

Looks like the exons were being read twice in the historical file. Comparing against the latest:

import gzip
import json

def load_cdot(path):
    # cdot JSON files are gzipped; open in text mode and close the handle afterwards
    with gzip.open(path, "rt") as f:
        return json.load(f)

data_historical = load_cdot("./cdot-0.2.21.GCF_000001405.40-RS_2023_03_combined_annotation_alignments.gff.json.gz")
data_latest = load_cdot("./cdot-0.2.21.GCF_000001405.40_GRCh38.p14_genomic.RS_2023_03.gff.json.gz")
# Transcript versions present in both files
common = set(data_historical["transcripts"]) & set(data_latest["transcripts"])
td_latest = data_latest["transcripts"]["NM_198076.6"]
td_historical = data_historical["transcripts"]["NM_198076.6"]
# From latest GRCh38

{'biotype': ['mRNA'],
 'gene_name': 'COX20',
 'gene_version': '116228',
 'genome_builds': {'GRCh38': {'cds_end': 244843176,
   'cds_start': 244835714,
   'contig': 'NC_000001.11',
   'exons': [[244835657, 244835756, 0, 1, 99, None],
    [244841943, 244842058, 1, 100, 214, None],
    [244842194, 244842258, 2, 215, 278, None],
    [244843040, 244845057, 3, 279, 2295, None]],
   'strand': '+',
   'tag': 'MANE Select',
   'url': 'https://ftp.ncbi.nlm.nih.gov/genomes/all/annotation_releases/9606/GCF_000001405.40-RS_2023_03/GCF_000001405.40_GRCh38.p14_genomic.gff.gz'}},
 'hgnc': '26970',
 'id': 'NM_198076.6',
 'start_codon': 57,
 'stop_codon': 414}

# From historical GRCh38

In [13]: td_historical
Out[13]: 
{'biotype': ['mRNA'],
 'gene_name': 'COX20',
 'gene_version': '116228',
 'genome_builds': {'GRCh38': {'cds_end': 244843176,
   'cds_start': 244835714,
   'contig': 'NC_000001.11',
   'exons': [[244835657, 244835756, 0, 1, 99, None],
    [244835657, 244835756, 1, 100, 198, None],
    [244841943, 244842058, 2, 199, 313, None],
    [244841943, 244842058, 3, 314, 428, None],
    [244842194, 244842258, 4, 429, 492, None],
    [244842194, 244842258, 5, 493, 556, None],
    [244843040, 244845057, 6, 557, 2573, None],
    [244843040, 244845057, 7, 2574, 4590, None]],
   'strand': '+',
   'tag': 'MANE Select',
   'url': 'https://ftp.ncbi.nlm.nih.gov/genomes/all/annotation_releases/9606/GCF_000001405.40-RS_2023_03/RefSeq_historical_alignments/GCF_000001405.40-RS_2023_03_genomic.gff.gz'}},
 'hgnc': '26970',
 'id': 'NM_198076.6',
 'start_codon': 57,
 'stop_codon': 692}

This was due to a copy/paste mistake of mine with the GFF3 in refseq_transcripts_grch38.sh.
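
The doubled exons above can be spotted mechanically. A minimal sketch, assuming the first two elements of each exon entry are the genomic start and end (as the outputs above suggest); `duplicated_exons` is a hypothetical helper, not part of cdot.

```python
from collections import Counter


def duplicated_exons(exons):
    """Return the (start, end) genomic pairs that occur more than once."""
    counts = Counter((exon[0], exon[1]) for exon in exons)
    return sorted(coords for coords, n in counts.items() if n > 1)


# Exons for NM_198076.6, copied from the two outputs above
latest_exons = [
    [244835657, 244835756, 0, 1, 99, None],
    [244841943, 244842058, 1, 100, 214, None],
    [244842194, 244842258, 2, 215, 278, None],
    [244843040, 244845057, 3, 279, 2295, None],
]
historical_exons = [
    [244835657, 244835756, 0, 1, 99, None],
    [244835657, 244835756, 1, 100, 198, None],
    [244841943, 244842058, 2, 199, 313, None],
    [244841943, 244842058, 3, 314, 428, None],
    [244842194, 244842258, 4, 429, 492, None],
    [244842194, 244842258, 5, 493, 556, None],
    [244843040, 244845057, 6, 557, 2573, None],
    [244843040, 244845057, 7, 2574, 4590, None],
]

print(duplicated_exons(latest_exons))      # no repeated coordinates
print(duplicated_exons(historical_exons))  # every exon appears twice
```

Running the same check over every ID in `common` would show whether all shared transcripts in the historical file were affected or only some.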


davmlaw commented Aug 10, 2023

Around 80% are valid according to the VG hgvs_ok code. Of the invalid ones, about 80% fail a check such as _validate_cdna_match(), which reports e.g. 'cDNA match starts at 3 not 1'.

NM_000016.3 has no alignment gaps, yet its exon lengths (end - start) add up to 2423 while the sequence length is 2454.

Need to compare vs latest to work out what's happening
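
Both symptoms can be checked directly from the exon list. The sketch below is only an illustration and is not VG's `_validate_cdna_match()`; `exon_length_sum` and `cdna_problems` are hypothetical helpers. It assumes each exon entry is `[start, end, exon_index, cdna_start, cdna_end, gap]`, which is inferred from the outputs earlier in this thread. For NM_000016.3 the shortfall is 2454 - 2423 = 31 bases.

```python
def exon_length_sum(exons):
    """Sum of genomic exon lengths (end - start)."""
    return sum(end - start for start, end, *_ in exons)


def cdna_problems(exons):
    """List places where cDNA coordinates do not start at 1 or are not contiguous.

    Hypothetical check for illustration only.
    """
    problems = []
    expected = 1
    # Sort by cDNA start so the check also works if exons are stored in genomic order
    for _start, _end, index, cdna_start, cdna_end, _gap in sorted(exons, key=lambda e: e[3]):
        if cdna_start != expected:
            problems.append(f"exon {index}: cDNA match starts at {cdna_start} not {expected}")
        expected = cdna_end + 1
    return problems


# NM_198076.6 exons from the latest GRCh38 output earlier in this thread
good_exons = [
    [244835657, 244835756, 0, 1, 99, None],
    [244841943, 244842058, 1, 100, 214, None],
    [244842194, 244842258, 2, 215, 278, None],
    [244843040, 244845057, 3, 279, 2295, None],
]
# Made-up single exon whose cDNA match starts at 3 (hypothetical coordinates)
shifted_exons = [[100, 200, 0, 3, 102, None]]

print(exon_length_sum(good_exons), cdna_problems(good_exons))
print(cdna_problems(shifted_exons))
```

For a gapless transcript the exon length sum should equal the last cDNA end, which holds for the NM_198076.6 data above; a transcript where it falls short of the sequence length, as with NM_000016.3, has bases that the alignment does not cover.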

davmlaw added a commit that referenced this issue Aug 14, 2023