From 6d1ab8831fd23ddf51f513f0252ee3f7d4bf7564 Mon Sep 17 00:00:00 2001 From: Vishal Koparde Date: Thu, 11 Feb 2021 15:05:37 -0500 Subject: [PATCH] adding reference.md --- docs/references.md | 413 +++++++++++++++++++++++++++++++++++++++++++++ mkdocs.yml | 1 + 2 files changed, 414 insertions(+) create mode 100644 docs/references.md diff --git a/docs/references.md b/docs/references.md new file mode 100644 index 0000000..e904632 --- /dev/null +++ b/docs/references.md @@ -0,0 +1,413 @@ +## circRNA DAQ Pipeline + +![img](https://img.shields.io/github/issues/kopardev/circRNA?style=for-the-badge)![img](https://img.shields.io/github/forks/kopardev/circRNA?style=for-the-badge)![img](https://img.shields.io/github/stars/kopardev/circRNA?style=for-the-badge)![img](https://img.shields.io/github/license/kopardev/circRNA?style=for-the-badge) + +The reference sequences comprises of the host genome and the viral genomes. + +### Fasta + +*hg38* genome build is chosen to represent human (host). Human ribosomal sequences (*45S, 5S*) are downloaded from NCBI. *hg38* was masked for rRNA sequence and *45S* and *5S* sequences from NCBI are appended as separate chromosomes. The following viral sequences were appended to the rRNA masked *hg38* reference: + +| RefSeq Sequence | RefSeq assembly accession | Notes | +| ---------------- | ------------------------- | ----------------------------------------------------- | +| NC_007605.1 | GCF_002402265.1 | Human gammaherpesvirus 4 (Epstein-Barr virus) | +| NC_000898.1 | GCF_000846365.1 | Human betaherpesvirus 6B | +| NC_001664.4 | GCF_000845685.2 | Human betaherpesvirus 6A | +| NC_001716.2 | GCF_000848125.1 | Human betaherpesvirus 7 | +| NC_006273.2 | GCF_000845245.1 | Human betaherpesvirus 5 | +| NC_009333.1 | GCF_000838265.1 | Human gammaherpesvirus 8 | +| NC_045512.2 | GCF_009858895.2 | Severe acute respiratory syndrome-related coronavirus | +| MN485971.1 | xx | HIV from Belgium ... GTF is hand curated | + +Location: The entire resource bundle is available at `/data/Ziegelbauer_lab/resources/hg38_rRNA_masked_plus_rRNA_plus_viruses_plus_ERCC` on biowulf. This location also have additional bash scritpts required for aggregating annotations and building indices required by different aligners. + +**Update** **(02/10/21)** + +The following viral sequence has also been appended to the reference: + +| RefSeq Sequence | RefSeq assembly accession | Notes | +| ---------------- | ------------------------- | ------------------------------------------------------------ | +| NC_001806.2 | GCF_000859985.2 | [Human alphaherpesvirus 1 (Herpes simplex virus type 1)](https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?mode=Info&id=10298&lvl=3&lin=f&keep=1&srchmode=1&unlock) (strain 17) | + +Location: The updated resource bundle is at `/data/Ziegelbauer_lab/resources/hg38_rRNA_masked_plus_rRNA_plus_viruses_plus_ERCC.v2` on biowulf + +### Annotations + +Gencode release 36 is used to annotate the human transcripts. Customized RefSeq annotations are used for annotating the viral sequences. + +#### Viruses + +To properly annotate viral sequences the following process was followed: + +**Downloading annotations from NCBI:** + +For example, to download gtf annotations for KSHV genome search "KSHV" at [NCBI](https://www.ncbi.nlm.nih.gov/assembly/) + +![image-20200807133058218](https://tva1.sinaimg.cn/large/007S8ZIlgy1ghirj71jv5j30qn01taa5.jpg) + +On the results page (assembly page), click on "FTP directory for RefSeq assembly" + +![image-20200807133320457](https://tva1.sinaimg.cn/large/007S8ZIlgy1ghirln4fc7j30s60sawj6.jpg) + +The FTP page has the relevant sequence and annotations files: + +![image-20200807133421026](https://tva1.sinaimg.cn/large/007S8ZIlgy1ghirmr2a4dj30t30ajq66.jpg) + +Download the file ending with *.gtf.gz*: + +```bash +> curl -L "https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/838/265/GCF_000838265.1_ViralProj14158/GCF_000838265.1_ViralProj14158_genomic.gtf.gz" > GCF_000838265.1_ViralProj14158_genomic.gtf.gz + +> gzip -d GCF_000838265.1_ViralProj14158_genomic.gtf.gz +``` + +**What needs to be fixed?** + +The GTF file downloaded from NBCI (RefSeq annotations) needs fixing in order to work seemlessly with STAR and CCBR_Pipeliner: + +1. GTF is missing lines for "transcript" in column 3. Transcript coordinates can be different from "gene" coordinates. I am going to pick all non-"gene" lines for a particular gene, filter them for a particular "transcript", ~~find the leftmost and rightmost coordinate and use them as coordinates for the new "transcript" line in the new GTF file~~. Using leftmost and rightmost script eliminates the UTR regions and hence if a gene has only one transcript (most common scenario), the transcript coordinates mimic the gene coordinates. + eg. for **HHV8GK18_gp01** gene these are the lines in the original file + + ```bash + NC_009333.1 RefSeq gene 4 1083 . + . gene_id "HHV8GK18_gp01"; db_xref "GeneID:4961511"; gbkey "Gene"; gene "K1"; gene_biotype "protein_coding"; locus_tag "HHV8GK18_gp01"; + NC_009333.1 RefSeq exon 29 1083 . + . gene_id "HHV8GK18_gp01"; transcript_id "HHV8GK18_gp01"; gbkey "mRNA"; gene "K1"; locus_tag "HHV8GK18_gp01"; exon_number "1"; + NC_009333.1 RefSeq CDS 105 941 . + 0 gene_id "HHV8GK18_gp01"; transcript_id "HHV8GK18_gp01"; gbkey "CDS"; gene "K1"; locus_tag "HHV8GK18_gp01"; note "contains an Ig domain"; product "K1"; protein_id "YP_001129350.1"; exon_number "1"; + NC_009333.1 RefSeq start_codon 105 107 . + 0 gene_id "HHV8GK18_gp01"; transcript_id "HHV8GK18_gp01"; gbkey "CDS"; gene "K1"; locus_tag "HHV8GK18_gp01"; note "contains an Ig domain"; product "K1"; protein_id "YP_001129350.1"; exon_number "1"; + NC_009333.1 RefSeq stop_codon 942 944 . + 0 gene_id "HHV8GK18_gp01"; transcript_id "HHV8GK18_gp01"; gbkey "CDS"; gene "K1"; locus_tag "HHV8GK18_gp01"; note "contains an Ig domain"; product "K1"; protein_id "YP_001129350.1"; exon_number "1"; + ``` + +2. Some transcripts have ids as "unknown_transcript_1". New name will be of the format "**gene_id**+_transcript_1". All child features for the transcript in question need the **transcript_id** to be changed to the new name. + + ```bash + NC_009333.1 RefSeq gene 3179 17026 . + . gene_id "HHV8GK18_gp03"; db_xref "GeneID:4961521"; gbkey "Gene"; gene "ORF6"; gene_biotype "protein_coding"; locus_tag "HHV8GK18_gp03"; + NC_009333.1 RefSeq CDS 3179 6574 . + 0 gene_id "HHV8GK18_gp03"; transcript_id "unknown_transcript_1"; gbkey "CDS"; gene "ORF6"; locus_tag "HHV8GK18_gp03"; note "herpesvirus core gene UL29 family"; product "ORF6"; protein_id "YP_001129352.1"; + NC_009333.1 RefSeq start_codon 3179 3181 . + 0 gene_id "HHV8GK18_gp03"; transcript_id "unknown_transcript_1"; gbkey "CDS"; gene "ORF6"; locus_tag "HHV8GK18_gp03"; note "herpesvirus core gene UL29 family"; product "ORF6"; protein_id "YP_001129352.1"; + NC_009333.1 RefSeq stop_codon 6575 6577 . + 0 gene_id "HHV8GK18_gp03"; transcript_id "unknown_transcript_1"; gbkey "CDS"; gene "ORF6"; locus_tag "HHV8GK18_gp03"; note "herpesvirus core gene UL29 family"; product "ORF6"; protein_id "YP_001129352.1"; + ``` + +3. Some **gene_id** 's are empty. I am planning on reporting these in a separate file ending with *.extralines*. These need to be edited manually and appended to the output GTF file. I will be using "chromosome_name:start_coordinate-end_coordinate" as the format for generating a unique **gene_id** for replacement GTF entries. Eg. + + ```bash + NC_009333.1 RefSeq exon 118075 118097 . - . gene_id ""; transcript_id "unknown_transcript_1"; gbkey "misc_RNA"; product "miR-K10"; exon_number "1"; + NC_009333.1 RefSeq exon 127997 129368 . + . gene_id ""; transcript_id "unknown_transcript_1"; gbkey "mRNA"; note "K14, ORF74; polyA_site not reported"; exon_number "1"; + NC_009333.1 RefSeq exon 129517 130671 . + . gene_id ""; transcript_id "unknown_transcript_1"; gbkey "mRNA"; note "K14, ORF74; polyA_site not reported"; exon_number "2"; + ``` + + will be changed to something like this: + + ```bash + NC_009333.1 RefSeq gene 118075 118097 . - . gene_id "NC_009333.1:118075-118097"; gene_name "NC_009333.1:118075-118097"; gene_biotype "miRNA"; gbkey "misc_RNA"; product "miR-K10"; + NC_009333.1 RefSeq transcript 118075 118097 . - . gene_id "NC_009333.1:118075-118097"; gene_name "NC_009333.1:118075-118097"; gene_biotype "miRNA"; transcript_id "NC_009333.1:118075-118097_ + transcript_1"; transcript_name "NC_009333.1:118075-118097_transcript_1"; transcript_type "miRNA"; gbkey "misc_RNA"; product "miR-K10"; + NC_009333.1 RefSeq exon 118075 118097 . - . gene_id "NC_009333.1:118075-118097"; transcript_id "NC_009333.1:118075-118097_transcript_1"; gbkey "misc_RNA"; product "miR-K10"; exon_number "1"; + NC_009333.1 RefSeq gene 127997 130671 . + . gene_id "NC_009333.1:127997_130671"; gene_name "NC_009333.1:127997_130671"; gene_biotype "mRNA"; gbkey "mRNA"; note "K14, ORF74; polyA_site not reporte + d"; + NC_009333.1 RefSeq transcript 127997 130671 . + . gene_id "NC_009333.1:127997_130671"; gene_name "NC_009333.1:127997_130671"; gene_biotype "mRNA"; transcript_id "NC_009333.1:127997_130671_transcr + ipt_1"; transcript_name "NC_009333.1:127997_130671_transcript_1"; transcript_type "NC_009333.1:127997_130671_transcript_1";gbkey "mRNA"; note "K14, ORF74; polyA_site not reported"; + NC_009333.1 RefSeq exon 127997 129368 . + . gene_id "NC_009333.1:127997_130671"; transcript_id "NC_009333.1:127997_130671_transcript_1"; gbkey "mRNA"; note "K14, ORF74; polyA_site not reported"; + exon_number "1"; + NC_009333.1 RefSeq exon 129517 130671 . + . gene_id "NC_009333.1:127997_130671"; transcript_id "NC_009333.1:127997_130671_transcript_1"; gbkey "mRNA"; note "K14, ORF74; polyA_site not reported"; + exon_number "2"; + ``` + +4. Only few genes have **exon** features. Most of them have **CDS** only. The **exon** line is added to have coordinates same as the **CDS**, if it is missing. + + eg. This + + ```bash + NC_009333.1 RefSeq gene 15756 17026 . + . gene_id "HHV8GK18_gp08"; gene_name "ORF11"; gene_biotype "protein_coding"; db_xref "GeneID:4961439"; gbkey "Gene"; locus_tag "HHV8GK18_gp08"; + NC_009333.1 RefSeq transcript 15756 16979 . + . gene_id "HHV8GK18_gp08"; gene_name "ORF11"; gene_biotype "protein_coding"; transcript_id "HHV8GK18_gp08_transcript_1"; transcript_name "HHV8GK18_gp08_transcript_1"; transcript_type "protein_coding"; + NC_009333.1 RefSeq CDS 15756 16976 . + 0 gene_id "HHV8GK18_gp08"; transcript_id "HHV8GK18_gp08_transcript_1"; gbkey "CDS"; gene "ORF11"; locus_tag "HHV8GK18_gp08"; note "derived from herpesvirus dUTPase; related to HHV-5 UL82, UL83 and UL84"; product "ORF11"; protein_id "YP_001129357.1"; + NC_009333.1 RefSeq start_codon 15756 15758 . + 0 gene_id "HHV8GK18_gp08"; transcript_id "HHV8GK18_gp08_transcript_1"; gbkey "CDS"; gene "ORF11"; locus_tag "HHV8GK18_gp08"; note "derived from herpesvirus dUTPase; related to HHV-5 UL82, UL83 and UL84"; product "ORF11"; protein_id "YP_001129357.1"; + NC_009333.1 RefSeq stop_codon 16977 16979 . + 0 gene_id "HHV8GK18_gp08"; transcript_id "HHV8GK18_gp08_transcript_1"; gbkey "CDS"; gene "ORF11"; locus_tag "HHV8GK18_gp08"; note "derived from herpesvirus dUTPase; related to HHV-5 UL82, UL83 and UL84"; product "ORF11"; protein_id "YP_001129357.1"; + ``` + + becomes + + ```bash + NC_009333.1 RefSeq gene 15756 17026 . + . gene_id "HHV8GK18_gp08"; gene_name "ORF11"; gene_biotype "protein_coding"; db_xref "GeneID:4961439"; gbkey "Gene"; locus_tag "HHV8GK18_gp08"; + NC_009333.1 RefSeq transcript 15756 16979 . + . gene_id "HHV8GK18_gp08"; gene_name "ORF11"; gene_biotype "protein_coding"; transcript_id "HHV8GK18_gp08_transcript_1"; transcript_name "HHV8GK18_gp08_transcript_1"; transcript_type "protein_coding"; + NC_009333.1 RefSeq exon 15756 16976 . + . gene_id "HHV8GK18_gp08"; transcript_id "HHV8GK18_gp08_transcript_1"; gbkey "CDS"; gene "ORF11"; locus_tag "HHV8GK18_gp08"; note "derived from herpesvirus dUTPase; related to HHV-5 UL82, UL83 and UL84"; product "ORF11"; protein_id "YP_001129357.1"; + NC_009333.1 RefSeq CDS 15756 16976 . + 0 gene_id "HHV8GK18_gp08"; transcript_id "HHV8GK18_gp08_transcript_1"; gbkey "CDS"; gene "ORF11"; locus_tag "HHV8GK18_gp08"; note "derived from herpesvirus dUTPase; related to HHV-5 UL82, UL83 and UL84"; product "ORF11"; protein_id "YP_001129357.1"; + NC_009333.1 RefSeq start_codon 15756 15758 . + 0 gene_id "HHV8GK18_gp08"; transcript_id "HHV8GK18_gp08_transcript_1"; gbkey "CDS"; gene "ORF11"; locus_tag "HHV8GK18_gp08"; note "derived from herpesvirus dUTPase; related to HHV-5 UL82, UL83 and UL84"; product "ORF11"; protein_id "YP_001129357.1"; + NC_009333.1 RefSeq stop_codon 16977 16979 . + 0 gene_id "HHV8GK18_gp08"; transcript_id "HHV8GK18_gp08_transcript_1"; gbkey "CDS"; gene "ORF11"; locus_tag "HHV8GK18_gp08"; note "derived from herpesvirus dUTPase; related to HHV-5 UL82, UL83 and UL84"; product "ORF11"; protein_id "YP_001129357.1"; + ``` + +**Python script to automate:** + +Simple python script take in RefSeq's GTF and spits out new GTF. This script (*fix_refseq_gtf.py*) is provided in the scripts folder of the repo.: + +```python +# Author: Vishal N. Koparde +# CCBR NCI +# Date: Aug, 2020 + + +import sys,copy,argparse + + +parser = argparse.ArgumentParser() +parser.add_argument('-i',dest='ingtf', required=True, type=str, help="Input RefSeq GTF ..downloaded from NCBI ftp server") +parser.add_argument('-o',dest='outgtf', required=True, type=str, help="Modified Output RefSeq GTF") +args = parser.parse_args() + +def get_gene_id(column9): + x=column9.strip().split() + for i,value in enumerate(x): + if value=="gene_id": + gene_id_index=i+1 + break + gene_id=x[gene_id_index] + return gene_id + +def get_gene_biotype(column9): + x=column9.strip().split() + found=0 + for i,value in enumerate(x): + if value=="gene_type" or value=="gene_biotype": + gene_biotype_index=i+1 + found=1 + break + if found==0: + return '"unknown";' + gene_biotype=x[gene_biotype_index] + return gene_biotype + +def get_gene_name(column9): + x=column9.strip().split() + found=0 + for i,value in enumerate(x): + if value=="gene" or value=="gene_name": + gene_index=i+1 + found=1 + break + if found==0: + return "" + gene_name=x[gene_index] + return gene_name + +def get_transcript_id(column9): + x=column9.strip().split() + found=0 + for i,value in enumerate(x): + if value=="transcript_id": + transcript_id_index=i+1 + found=1 + break + if found==0: + return '"transcript_id_unknown";' + transcript_id=x[transcript_id_index] + return transcript_id + +def fix_transcript_id(column9,g): + x=column9.strip().split() + found=0 + for i,value in enumerate(x): + if value=="transcript_id": + transcript_id_index=i+1 + found=1 + break + x[transcript_id_index]=g + if found==0: + x.append("transcript_id") + x.append(g) + x=" ".join(x) + return x + +def create_new_transript_id(g,i): + n=g.split('"') + n[-2]+="_transcript_"+str(i) + n='"'.join(n) + return n + +def are_exons_present(transcript_lines): + for l in transcript_lines: + l_split=l.strip().split("\t") + if l_split[2]=="exon": + return True + else: + return False + +#create genelist +genelist=[] +gene_coords=dict() +all_gtflines=list(filter(lambda x:not x.startswith("#"),open(args.ingtf).readlines())) +blank_gene_id_lines=[] +for f in all_gtflines: + its_a_gene=0 + if f.strip().split("\t")[2]=="gene": + its_a_gene=1 + gene_id=get_gene_id(f.strip().split("\t")[8]) + if gene_id=='"";': + blank_gene_id_lines.append(f) + continue + genelist.append(gene_id) + if its_a_gene==1 and not gene_id in gene_coords: + gene_coords[gene_id]=(int(f.strip().split("\t")[3]),int(f.strip().split("\t")[4])) +genelist=list(set(genelist)) +# print(genelist) +# print(len(blank_gene_id_lines)) + +#get genes2transcripts ... this is only for verifying that every gene has only 1 transript... this is the assumption +gene_id_2_transcript_ids=dict() +for g in genelist: + if not g in gene_id_2_transcript_ids: + gene_id_2_transcript_ids[g]=list() + lines_with_gene_id=list(filter(lambda x: g in x,all_gtflines)) + non_gene_lines=list(filter(lambda x:x.split("\t")[2]!="gene",lines_with_gene_id)) + for l in non_gene_lines: + t_id=get_transcript_id(l.strip().split("\t")[8]) + if t_id!='"transcript_id_unknown";': + gene_id_2_transcript_ids[g].append(t_id) + gene_id_2_transcript_ids[g]=list(set(gene_id_2_transcript_ids[g])) + +geneid2transcriptidfile=open(args.ingtf+".geneid2transcriptid",'w') +for k,v in gene_id_2_transcript_ids.items(): + geneid2transcriptidfile.write("%s\t%s\n"%(k,v)) +geneid2transcriptidfile.close() + +#get genenames +gene_id_2_gene_name=dict() +for g in genelist: + if not g in gene_id_2_gene_name: + gene_id_2_gene_name[g]=list() + lines_with_gene_id=list(filter(lambda x: g in x,all_gtflines)) + gene_line=list(filter(lambda x:x.split("\t")[2]=="gene",lines_with_gene_id)) + # if len(gene_line)==0: + # for l in lines_with_gene_id: + # print(l,) + gene_line=gene_line[0] + gene_name=get_gene_name(gene_line.split("\t")[8]) + if gene_name=="": + gene_name=g + gene_id_2_gene_name[g]=gene_name +# for k,v in gene_id_2_gene_name.items(): +# print(k,v) + +#get transcript coordinates +gene_id_2_transcript_coordinates=dict() +for g in genelist: + # print("gene=",g) + if not g in gene_id_2_transcript_coordinates: + gene_id_2_transcript_coordinates[g]=list() + if len(gene_id_2_transcript_ids[g])==1: + gene_id_2_transcript_coordinates[g].append(gene_coords[g]) + else: + lines_with_gene_id=list(filter(lambda x: g in x,all_gtflines)) + non_gene_lines=list(filter(lambda x:x.split("\t")[2]!="gene",lines_with_gene_id)) + for t in gene_id_2_transcript_ids[g]: + # print("transcript=",t) + transcript_lines=list(filter(lambda x:t in x,non_gene_lines)) + coords=[] + for l in transcript_lines: + # print(l.strip()) + l_split=l.split("\t") + coords.append(int(l_split[3])) + coords.append(int(l_split[4])) + # print() + gene_id_2_transcript_coordinates[g].append((min(coords),max(coords))) + # print(gene_id_2_transcript_coordinates[g]) +# for k,v in gene_id_2_transcript_coordinates.items(): + # print(k,v) +# exit() + +#get gene biotype\ +gene_id_2_gene_biotype=dict() +for g in genelist: + lines_with_gene_id=list(filter(lambda x: g in x,all_gtflines)) + gene_line=list(filter(lambda x:x.split("\t")[2]=="gene",lines_with_gene_id)) + gene_line=gene_line[0] + gene_biotype=get_gene_biotype(gene_line.split("\t")[8]) + gene_id_2_gene_biotype[g]=gene_biotype +# for k,v in gene_id_2_gene_biotype.items(): +# print(k,v) + +out=open(args.outgtf,'w') +for g in genelist: + lines_with_gene_id=list(filter(lambda x: g in x,all_gtflines)) + gene_line=list(filter(lambda x:x.split("\t")[2]=="gene",lines_with_gene_id)) + gene_line=gene_line[0] + gene_line=gene_line.split("\t") + others=gene_line.pop(-1) + gene_line_copy=copy.copy(gene_line) + # other key value pairs to add in the gene_line(col9) + others_to_add=[] + # print("others=",others) + for o in others.strip().split("; "): + # print("o=",o) + o2=o.split(" ") + # print("o2=",o2) + key=o2[0] + value=o2[1:] + value=" ".join(value) + # print("key=",key) + # print("value=",value) + if key in ["gene_id","gene","gene_name","gene_type","gene_biotype"]: + continue + else: + others_to_add.append(key) + if not ";" in value: + others_to_add.append(value+";") + else: + others_to_add.append(value) + + col9=[] + col9.append("gene_id") + col9.append(g) + col9.append("gene_name") + col9.append(gene_id_2_gene_name[g]) + col9.append("gene_biotype") + col9.append(gene_id_2_gene_biotype[g]) + col9plus=copy.copy(col9) + col9plus.extend(others_to_add) + gene_col9=" ".join(col9plus) + gene_line.append(gene_col9) + gene_line="\t".join(gene_line) + out.write("%s\n"%(gene_line)) + + non_gene_lines=list(filter(lambda x:x.split("\t")[2]!="gene",lines_with_gene_id)) + for i,t in enumerate(gene_id_2_transcript_ids[g]): + transcript_line=copy.copy(gene_line_copy) + transcript_line[2]="transcript" + transcript_line[3]=str(gene_id_2_transcript_coordinates[g][i][0]) + transcript_line[4]=str(gene_id_2_transcript_coordinates[g][i][1]) + new_trascript_id=create_new_transript_id(g,i+1) + transcript_col9=copy.copy(col9) + transcript_col9.append("transcript_id") + transcript_col9.append(new_trascript_id) + transcript_col9.append("transcript_name") + transcript_col9.append(new_trascript_id) + transcript_col9.append("transcript_type") + transcript_col9.append(gene_id_2_gene_biotype[g]) + transcript_col9=" ".join(transcript_col9) + transcript_line.append(transcript_col9) + transcript_line="\t".join(transcript_line) + out.write("%s\n"%(transcript_line)) + + transcript_lines=list(filter(lambda x:t in x,non_gene_lines)) + have_exons=are_exons_present(transcript_lines) + for l in transcript_lines: + # print(l) + l=l.strip().split("\t") + tofix=l.pop(-1) + l.append(fix_transcript_id(tofix,new_trascript_id)) + if l[2]=="CDS" and have_exons==False: + l2=copy.copy(l) + l2[7]="." + l2[2]="exon" + l2="\t".join(l2) + out.write("%s\n"%(l2)) + l="\t".join(l) + out.write("%s\n"%(l)) + # print(l) +out.close() + +out=open(args.ingtf+".extralines",'w') +for b in blank_gene_id_lines: + out.write(b) +out.close() +``` + +Final versions of the GTF as located here: + +`/data/Ziegelbauer_lab/circRNADetection/viral_db/GTF/final_versions` \ No newline at end of file diff --git a/mkdocs.yml b/mkdocs.yml index 9fbbc58..cc2b7e9 100644 --- a/mkdocs.yml +++ b/mkdocs.yml @@ -32,4 +32,5 @@ nav: - Home: index.md - Flowchart: flowchart.md - Tutorial: tutorial.md + - References: references.md - Versions: versions.md