Historical GRCh38 refseq #51

davmlaw · 2023-07-13T07:42:22Z

https://ncbiinsights.ncbi.nlm.nih.gov/2023/06/29/access-to-historical-human-transcript-alignments/

davmlaw · 2023-08-09T08:27:48Z

Looks like this might be worth about 40k new transcript versions (the historical file has 97k other than latest, but we already had 57k of those from the older GFFs).

Can't just get rid of the old GFFs, as a few transcripts are only in them.

I might put it at the very front so that its transcripts are only used if nothing else provides them.

514794 transcript versions from:
https://ftp.ncbi.nlm.nih.gov/genomes/all/annotation_releases/9606/GCF_000001405.40-RS_2023_03/GCF_000001405.40_GRCh38.p14_genomic.gff.gz: 176343 (34.3%)
https://ftp.ncbi.nlm.nih.gov/genomes/all/annotation_releases/9606/GCF_000001405.40-RS_2023_03/RefSeq_historical_alignments/GCF_000001405.40-RS_2023_03_genomic.gff.gz: 97210 (18.9%)
https://ftp.ncbi.nlm.nih.gov/genomes/all/annotation_releases/9606/109.20211119/GCF_000001405.39_GRCh38.p13/GCF_000001405.39_GRCh38.p13_genomic.gff.gz: 75555 (14.7%)
http://ftp.ncbi.nlm.nih.gov/genomes/archive/old_refseq/Homo_sapiens/ARCHIVE/ANNOTATION_RELEASE.107/GFF/ref_GRCh38.p2_top_level.gff3.gz: 72137 (14.0%)
http://ftp.ncbi.nlm.nih.gov/genomes/archive/old_refseq/Homo_sapiens/ARCHIVE/ANNOTATION_RELEASE.108/GFF/ref_GRCh38.p7_top_level.gff3.gz: 43403 (8.4%)
http://ftp.ncbi.nlm.nih.gov/genomes/archive/old_refseq/Homo_sapiens/ARCHIVE/ANNOTATION_RELEASE.106/GFF/ref_GRCh38_top_level.gff3.gz: 34395 (6.7%)
https://ftp.ncbi.nlm.nih.gov/genomes/all/annotation_releases/9606/109/GCF_000001405.38_GRCh38.p12/GCF_000001405.38_GRCh38.p12_genomic.gff.gz: 5447 (1.1%)
https://ftp.ncbi.nlm.nih.gov/genomes/all/annotation_releases/9606/110/GCF_000001405.40_GRCh38.p14/GCF_000001405.40_GRCh38.p14_genomic.gff.gz: 3416 (0.7%)
https://ftp.ncbi.nlm.nih.gov/genomes/all/annotation_releases/9606/109.20210514/GCF_000001405.39_GRCh38.p13/GCF_000001405.39_GRCh38.p13_genomic.gff.gz: 1643 (0.3%)
https://ftp.ncbi.nlm.nih.gov/genomes/all/annotation_releases/9606/109.20190905/GCF_000001405.39_GRCh38.p13/GCF_000001405.39_GRCh38.p13_genomic.gff.gz: 886 (0.2%)
https://ftp.ncbi.nlm.nih.gov/genomes/all/annotation_releases/9606/109.20210226/GCF_000001405.39_GRCh38.p13/GCF_000001405.39_GRCh38.p13_genomic.gff.gz: 837 (0.2%)
https://ftp.ncbi.nlm.nih.gov/genomes/all/annotation_releases/9606/109.20191205/GCF_000001405.39_GRCh38.p13/GCF_000001405.39_GRCh38.p13_genomic.gff.gz: 732 (0.1%)
https://ftp.ncbi.nlm.nih.gov/genomes/all/annotation_releases/9606/109.20200522/GCF_000001405.39_GRCh38.p13/GCF_000001405.39_GRCh38.p13_genomic.gff.gz: 710 (0.1%)
https://ftp.ncbi.nlm.nih.gov/genomes/all/annotation_releases/9606/109.20200815/GCF_000001405.39_GRCh38.p13/GCF_000001405.39_GRCh38.p13_genomic.gff.gz: 681 (0.1%)
https://ftp.ncbi.nlm.nih.gov/genomes/all/annotation_releases/9606/109.20200228/GCF_000001405.39_GRCh38.p13/GCF_000001405.39_GRCh38.p13_genomic.gff.gz: 521 (0.1%)
https://ftp.ncbi.nlm.nih.gov/genomes/all/annotation_releases/9606/109.20201120/GCF_000001405.39_GRCh38.p13/GCF_000001405.39_GRCh38.p13_genomic.gff.gz: 477 (0.1%)
https://ftp.ncbi.nlm.nih.gov/genomes/all/annotation_releases/9606/109.20190607/GCF_000001405.39_GRCh38.p13/GCF_000001405.39_GRCh38.p13_genomic.gff.gz: 399 (0.1%)

Then, after moving the historical alignments to after UTA:

514794 transcript versions from:
https://ftp.ncbi.nlm.nih.gov/genomes/all/annotation_releases/9606/GCF_000001405.40-RS_2023_03/GCF_000001405.40_GRCh38.p14_genomic.gff.gz: 176343 (34.3%)
https://ftp.ncbi.nlm.nih.gov/genomes/all/annotation_releases/9606/109.20211119/GCF_000001405.39_GRCh38.p13/GCF_000001405.39_GRCh38.p13_genomic.gff.gz: 75739 (14.7%)
http://ftp.ncbi.nlm.nih.gov/genomes/archive/old_refseq/Homo_sapiens/ARCHIVE/ANNOTATION_RELEASE.107/GFF/ref_GRCh38.p2_top_level.gff3.gz: 72137 (14.0%)
http://ftp.ncbi.nlm.nih.gov/genomes/archive/old_refseq/Homo_sapiens/ARCHIVE/ANNOTATION_RELEASE.108/GFF/ref_GRCh38.p7_top_level.gff3.gz: 45908 (8.9%)
https://ftp.ncbi.nlm.nih.gov/genomes/all/annotation_releases/9606/GCF_000001405.40-RS_2023_03/RefSeq_historical_alignments/GCF_000001405.40-RS_2023_03_genomic.gff.gz: 40537 (7.9%)
https://ftp.ncbi.nlm.nih.gov/genomes/all/annotation_releases/9606/109/GCF_000001405.38_GRCh38.p12/GCF_000001405.38_GRCh38.p12_genomic.gff.gz: 38043 (7.4%)
http://ftp.ncbi.nlm.nih.gov/genomes/archive/old_refseq/Homo_sapiens/ARCHIVE/ANNOTATION_RELEASE.106/GFF/ref_GRCh38_top_level.gff3.gz: 34396 (6.7%)
https://ftp.ncbi.nlm.nih.gov/genomes/all/annotation_releases/9606/109.20200522/GCF_000001405.39_GRCh38.p13/GCF_000001405.39_GRCh38.p13_genomic.gff.gz: 12841 (2.5%)
https://ftp.ncbi.nlm.nih.gov/genomes/all/annotation_releases/9606/109.20190607/GCF_000001405.39_GRCh38.p13/GCF_000001405.39_GRCh38.p13_genomic.gff.gz: 3876 (0.8%)
https://ftp.ncbi.nlm.nih.gov/genomes/all/annotation_releases/9606/110/GCF_000001405.40_GRCh38.p14/GCF_000001405.40_GRCh38.p14_genomic.gff.gz: 3641 (0.7%)
https://ftp.ncbi.nlm.nih.gov/genomes/all/annotation_releases/9606/109.20210514/GCF_000001405.39_GRCh38.p13/GCF_000001405.39_GRCh38.p13_genomic.gff.gz: 2131 (0.4%)
https://ftp.ncbi.nlm.nih.gov/genomes/all/annotation_releases/9606/109.20190905/GCF_000001405.39_GRCh38.p13/GCF_000001405.39_GRCh38.p13_genomic.gff.gz: 1943 (0.4%)
https://ftp.ncbi.nlm.nih.gov/genomes/all/annotation_releases/9606/109.20200815/GCF_000001405.39_GRCh38.p13/GCF_000001405.39_GRCh38.p13_genomic.gff.gz: 1667 (0.3%)
https://ftp.ncbi.nlm.nih.gov/genomes/all/annotation_releases/9606/109.20200228/GCF_000001405.39_GRCh38.p13/GCF_000001405.39_GRCh38.p13_genomic.gff.gz: 1645 (0.3%)
https://ftp.ncbi.nlm.nih.gov/genomes/all/annotation_releases/9606/109.20191205/GCF_000001405.39_GRCh38.p13/GCF_000001405.39_GRCh38.p13_genomic.gff.gz: 1581 (0.3%)
https://ftp.ncbi.nlm.nih.gov/genomes/all/annotation_releases/9606/109.20210226/GCF_000001405.39_GRCh38.p13/GCF_000001405.39_GRCh38.p13_genomic.gff.gz: 1425 (0.3%)
https://ftp.ncbi.nlm.nih.gov/genomes/all/annotation_releases/9606/109.20201120/GCF_000001405.39_GRCh38.p13/GCF_000001405.39_GRCh38.p13_genomic.gff.gz: 939 (0.2%)
postgresql://uta.biocommons.org/uta_20210129: 2 (0.0%)
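The shift in per-source counts between the two listings is what you would expect if sources are merged in order and later sources overwrite earlier ones. Below is a minimal sketch of that idea with toy data. `merge_sources`, `count_by_url` and the file names are hypothetical illustrations, not cdot's actual merge code.

```python
from collections import Counter


def merge_sources(sources):
    """Merge (url, {transcript_id: data}) pairs in order.

    Later sources overwrite earlier ones, so a source placed early
    only "wins" for transcripts that no later source contains.
    (Assumed behaviour, for illustration only.)
    """
    merged = {}
    for url, transcripts in sources:
        for transcript_id, data in transcripts.items():
            merged[transcript_id] = {**data, "url": url}
    return merged


def count_by_url(merged):
    """Count how many merged transcripts ended up coming from each source."""
    return Counter(td["url"] for td in merged.values())


# Toy data: the historical source goes first, so it is only kept for NM_1.1
historical = ("historical.gff", {"NM_1.1": {}, "NM_2.1": {}, "NM_3.1": {}})
older = ("older.gff", {"NM_2.1": {}, "NM_4.1": {}})
latest = ("latest.gff", {"NM_3.1": {}, "NM_5.1": {}})

merged = merge_sources([historical, older, latest])
counts = count_by_url(merged)
for url, count in counts.most_common():
    print(f"{url}: {count} ({100 * count / len(merged):.1f}%)")
```

With this ordering the historical file contributes only the transcript versions that appear nowhere else, which matches the drop from 97210 to 40537 in the second listing.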


davmlaw · 2023-08-10T07:29:19Z

Looks like the exons were being read twice in the historical ones. Comparing vs latest:

import gzip
import json

# Load the cdot JSON built from the historical alignments and from the latest annotation
with gzip.open("./cdot-0.2.21.GCF_000001405.40-RS_2023_03_combined_annotation_alignments.gff.json.gz") as f:
    data_historical = json.load(f)
with gzip.open("./cdot-0.2.21.GCF_000001405.40_GRCh38.p14_genomic.RS_2023_03.gff.json.gz") as f:
    data_latest = json.load(f)

# Transcript versions present in both files
common = set(data_historical["transcripts"]) & set(data_latest["transcripts"])
# Pick one shared transcript to compare
td_latest = data_latest["transcripts"]['NM_198076.6']
td_historical = data_historical["transcripts"]['NM_198076.6']

# From latest GRCh38

{'biotype': ['mRNA'],
 'gene_name': 'COX20',
 'gene_version': '116228',
 'genome_builds': {'GRCh38': {'cds_end': 244843176,
   'cds_start': 244835714,
   'contig': 'NC_000001.11',
   'exons': [[244835657, 244835756, 0, 1, 99, None],
    [244841943, 244842058, 1, 100, 214, None],
    [244842194, 244842258, 2, 215, 278, None],
    [244843040, 244845057, 3, 279, 2295, None]],
   'strand': '+',
   'tag': 'MANE Select',
   'url': 'https://ftp.ncbi.nlm.nih.gov/genomes/all/annotation_releases/9606/GCF_000001405.40-RS_2023_03/GCF_000001405.40_GRCh38.p14_genomic.gff.gz'}},
 'hgnc': '26970',
 'id': 'NM_198076.6',
 'start_codon': 57,
 'stop_codon': 414}

# From historical GRCh38

In [13]: td_historical
Out[13]: 
{'biotype': ['mRNA'],
 'gene_name': 'COX20',
 'gene_version': '116228',
 'genome_builds': {'GRCh38': {'cds_end': 244843176,
   'cds_start': 244835714,
   'contig': 'NC_000001.11',
   'exons': [[244835657, 244835756, 0, 1, 99, None],
    [244835657, 244835756, 1, 100, 198, None],
    [244841943, 244842058, 2, 199, 313, None],
    [244841943, 244842058, 3, 314, 428, None],
    [244842194, 244842258, 4, 429, 492, None],
    [244842194, 244842258, 5, 493, 556, None],
    [244843040, 244845057, 6, 557, 2573, None],
    [244843040, 244845057, 7, 2574, 4590, None]],
   'strand': '+',
   'tag': 'MANE Select',
   'url': 'https://ftp.ncbi.nlm.nih.gov/genomes/all/annotation_releases/9606/GCF_000001405.40-RS_2023_03/RefSeq_historical_alignments/GCF_000001405.40-RS_2023_03_genomic.gff.gz'}},
 'hgnc': '26970',
 'id': 'NM_198076.6',
 'start_codon': 57,
 'stop_codon': 692}

This was caused by my copy/pasting the GFF3 in refseq_transcripts_grch38.sh
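A quick way to spot this kind of double read is to look for repeated genomic coordinates in a transcript's exon list. The sketch below uses the NM_198076.6 exons shown above. `duplicated_exons` is a hypothetical helper, not part of cdot.

```python
from collections import Counter


def duplicated_exons(exons):
    """Return sorted (start, end) pairs that occur more than once.

    Each exon is a list whose first two items are genomic start and end,
    as in the dumps above.
    """
    coords = Counter((exon[0], exon[1]) for exon in exons)
    return sorted(c for c, n in coords.items() if n > 1)


latest_exons = [
    [244835657, 244835756, 0, 1, 99, None],
    [244841943, 244842058, 1, 100, 214, None],
    [244842194, 244842258, 2, 215, 278, None],
    [244843040, 244845057, 3, 279, 2295, None],
]
historical_exons = [
    [244835657, 244835756, 0, 1, 99, None],
    [244835657, 244835756, 1, 100, 198, None],
    [244841943, 244842058, 2, 199, 313, None],
    [244841943, 244842058, 3, 314, 428, None],
    [244842194, 244842258, 4, 429, 492, None],
    [244842194, 244842258, 5, 493, 556, None],
    [244843040, 244845057, 6, 557, 2573, None],
    [244843040, 244845057, 7, 2574, 4590, None],
]

print(duplicated_exons(latest_exons))      # no duplicates expected
print(duplicated_exons(historical_exons))  # every exon appears twice
```

The duplication also explains the shifted stop codon: the extra copies of the first three exons add 99 + 115 + 64 = 278 bases before the last exon, and 414 + 278 = 692.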

davmlaw · 2023-08-10T08:23:30Z

Around 80% are valid according to the VG hgvs_ok code. Of the invalid ones, about 80% are failing some check, e.g. _validate_cdna_match(), which points out 'cDNA match starts at 3 not 1'

NM_000016.3 has no alignment gaps and exon end-start adds up to 2423 while the sequence length is 2454

Need to compare vs latest to work out what's happening
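For NM_000016.3 the gap is 2454 - 2423 = 31 bases not covered by the exons. Below is a rough sketch of the two checks mentioned in this comment, assuming no alignment gaps so that genomic exon length equals transcript length. `exon_problems` is a hypothetical helper and not the real _validate_cdna_match(); the example reuses the latest NM_198076.6 exons, with 2295 (the last cDNA coordinate) standing in for the sequence length.

```python
def exon_problems(exons, sequence_length):
    """Return a list of problems found in an exon list.

    Each exon is [genomic_start, genomic_end, exon_id, cdna_start, cdna_end, gap].
    Assumes no alignment gaps (illustration only).
    """
    problems = []
    first_cdna_start = min(exon[3] for exon in exons)
    if first_cdna_start != 1:
        problems.append(f"cDNA match starts at {first_cdna_start} not 1")
    covered = sum(end - start for start, end, *_ in exons)
    if covered != sequence_length:
        problems.append(
            f"exons cover {covered} bases, sequence length is {sequence_length}"
        )
    return problems


latest_exons = [
    [244835657, 244835756, 0, 1, 99, None],
    [244841943, 244842058, 1, 100, 214, None],
    [244842194, 244842258, 2, 215, 278, None],
    [244843040, 244845057, 3, 279, 2295, None],
]
print(exon_problems(latest_exons, 2295))

# Made-up exon whose alignment skips the first 2 transcript bases
shifted_exons = [[100, 197, 0, 3, 99, None]]
print(exon_problems(shifted_exons, 99))
```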

davmlaw added a commit that referenced this issue Aug 9, 2023

#51 - Historical GRCh38 refseq

4759b96

davmlaw added a commit that referenced this issue Aug 9, 2023

#51 - historical - move to after UTA so only use if nothing else overwrites

2886dfc

davmlaw added a commit that referenced this issue Aug 10, 2023

#51 - historical GRCh38 - get right URL for alignments

d755146

davmlaw added a commit that referenced this issue Aug 14, 2023

#51 - disable by default for now

fe2c079
