This script was written by Tyron Chang. The goal is to use bioinformatics tools to mine exonic miRNAs and to understand how these miRNAs are processed and how they relate to their host genes.
The programming languages and tools used for this study are shown below:
- Python (data cleaning and processing)
- BEDTools (overlapping the exonic microRNAs)
- Shell (awk and basic command-line tools to convert TSV files into BED files)
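
As a rough illustration of the TSV-to-BED conversion step, here is a minimal Python sketch (the pipeline itself uses awk, so this is an alternative, not the author's script). It assumes the TSV is a genePred-style export with a header row containing `name`, `chrom`, `strand`, `exonStarts`, and `exonEnds`; the real column layout of the files in this repository may differ, and the function name `genepred_tsv_to_exon_bed` is hypothetical.

```python
import csv
import io


def genepred_tsv_to_exon_bed(tsv_text):
    """Return one BED6 tuple per exon from a genePred-style TSV.

    Assumed columns (hypothetical layout): name, chrom, strand,
    exonStarts, exonEnds. Coordinates are assumed to be 0-based,
    half-open already, so they are copied without shifting.
    """
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    bed_rows = []
    for rec in reader:
        # genePred stores exon boundaries as comma-separated lists,
        # usually with a trailing comma, hence the "if s" filter.
        starts = [int(s) for s in rec["exonStarts"].split(",") if s]
        ends = [int(e) for e in rec["exonEnds"].split(",") if e]
        # Exons are numbered in genomic order here, not transcript order.
        for i, (start, end) in enumerate(zip(starts, ends), start=1):
            label = f'{rec["name"]}_exon{i}'
            bed_rows.append((rec["chrom"], start, end, label, 0, rec["strand"]))
    return bed_rows


# Made-up record for illustration only.
sample_tsv = (
    "name\tchrom\tstrand\texonStarts\texonEnds\n"
    "NM_000001\tchr1\t+\t100,300,\t200,500,\n"
)
exon_rows = genepred_tsv_to_exon_bed(sample_tsv)
```

Each tuple can be written out tab-separated to produce a BED file that BEDTools can read.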
miRNA loci data were retrieved from miRBase.
The original NCBI RefSeq data were retrieved from the UCSC Table Browser.
There are a lot of files, so if you just want to see the final output, please go to the Excel file folder. Here are the files you will most likely be interested in:
- Human_miR_type_finalized_df_unique.xlsx
  tells you whether each miRNA is exonic, intronic, has no host gene, etc.
- human_exonic_miR(gene_type)_NCBI_unique.xlsx
  has more information. It tells you:
  - whether the host genes are protein-coding or non-coding.
  - the number of exons for each host gene.
  - the length of the mRNA for each host gene.
  - their genomic coordinates.
- human_exonic_miR_list(protein_coding_host_genes)_NCBI_unique.xlsx
  is a clean version of the exonic miRNAs residing in protein-coding genes. The file does not have genomic coordinates. It tells you:
  - whether each exonic miRNA lies in the 5'UTR, 3'UTR, or CDS.
  - the length of the mature host mRNAs.
- human_exonic_miR_list(all_host_genes)_NCBI_no_loc.xlsx
  is a clean version of the exonic miRNA file without coordinates. It tells you whether the host genes are protein-coding or non-coding. This file has all the isoforms.
Mapping of the miRNA coordinates was done with BEDTools plus additional shell scripts. Data cleaning was done in Python. Here I use OOP, importing a series of classes and methods. Subsequent characterization of exon-derived miRNAs (GO analysis, heatmaps, etc.) was carried out with the metadata generated by this pipeline.
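
To make the mapping idea concrete, here is a minimal pure-Python sketch of how a miRNA could be labelled exonic, intronic, or without a host gene from interval overlaps. It is not the BEDTools commands used in this repository. It assumes 0-based, half-open coordinates, treats any overlap with an exon as "exonic", and by default requires the miRNA and transcript to be on the same strand; all three are assumptions, and the names `overlaps` and `classify_mirna` are hypothetical.

```python
def overlaps(a_start, a_end, b_start, b_end):
    """True if two 0-based, half-open intervals share at least one base."""
    return a_start < b_end and b_start < a_end


def classify_mirna(mir, transcripts, same_strand=True):
    """Label a miRNA as 'exonic', 'intronic', or 'no host gene'.

    mir: (chrom, start, end, strand)
    transcripts: list of dicts with keys chrom, strand, tx_start,
                 tx_end, exons (list of (start, end) tuples).
    """
    chrom, start, end, strand = mir
    inside_transcript = False
    for tx in transcripts:
        if tx["chrom"] != chrom:
            continue
        if same_strand and tx["strand"] != strand:
            continue
        if not overlaps(start, end, tx["tx_start"], tx["tx_end"]):
            continue
        inside_transcript = True
        # Any overlap with an exon counts as exonic (an assumption).
        if any(overlaps(start, end, s, e) for s, e in tx["exons"]):
            return "exonic"
    return "intronic" if inside_transcript else "no host gene"


# Made-up transcript for illustration only.
example_transcripts = [
    {"chrom": "chr1", "strand": "+", "tx_start": 100, "tx_end": 500,
     "exons": [(100, 200), (300, 500)]},
]
label = classify_mirna(("chr1", 150, 180, "+"), example_transcripts)
```

Setting `same_strand=False` would also count transcripts on the opposite strand as potential hosts.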
Python files:
- Data.ipynb -> this file is used for data cleaning, and it contains a series of methods in a class.
- Gene_func.ipynb -> this file assigns a new column to the dataframe to indicate whether the host gene is protein-coding or non-coding.
- Mouse_miR_analysis.ipynb -> this file is the miR analysis in mouse.
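
One possible way to derive a protein-coding / non-coding label is from the RefSeq accession prefix (`NM_`/`XM_` for mRNAs, `NR_`/`XR_` for non-coding RNAs). The sketch below illustrates that idea only; it is not necessarily how Gene_func.ipynb does it, and the function names and the `accession` / `gene_type` column names are hypothetical.

```python
def gene_type_from_accession(accession):
    """Map a RefSeq transcript accession to a coarse gene type."""
    prefix = accession.split("_", 1)[0].upper()
    if prefix in {"NM", "XM"}:
        return "protein-coding"
    if prefix in {"NR", "XR"}:
        return "non-coding"
    return "unknown"


def add_gene_type_column(rows, accession_field="accession"):
    """Return copies of the row dicts with an added 'gene_type' key."""
    return [
        {**row, "gene_type": gene_type_from_accession(row[accession_field])}
        for row in rows
    ]


# Made-up rows for illustration only.
annotated = add_gene_type_column([
    {"miRNA": "miR-A", "accession": "NM_000001"},
    {"miRNA": "miR-B", "accession": "NR_000002"},
])
```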
Shell scripts:
Human
- map all exonic miRs => bedtools_human_exonic_miR.sh
- map all intronic miRs => bedtools_human_intronic_miR.sh
- map all intronic and no host gene miRs => bedtools_human_nonexonic_miR.sh
- move all csv files into a new csv folder => human_csv.sh
gff3 files:
Human
- all human miRs with their genomic coordinates => hsa.gff3
Mouse
- all mouse miRs with their genomic coordinates => mmu.gff3
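
For readers who want to turn a gff3 file into BED intervals, here is a minimal Python sketch. GFF3 coordinates are 1-based and inclusive, while BED is 0-based and half-open, so the start is shifted by one. The sketch assumes the standard nine tab-separated GFF3 columns and a `Name=` attribute; the function name `gff3_to_bed` and the choice to keep only `miRNA_primary_transcript` features are assumptions, not taken from this repository.

```python
def gff3_to_bed(gff3_text, feature_type="miRNA_primary_transcript"):
    """Convert GFF3 records of one feature type into BED6 tuples."""
    bed_rows = []
    for line in gff3_text.splitlines():
        # Skip blank lines and comment/header lines.
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        if len(fields) != 9 or fields[2] != feature_type:
            continue
        seqid, _, _, start, end, _, strand, _, attributes = fields
        attrs = dict(
            item.split("=", 1) for item in attributes.split(";") if "=" in item
        )
        name = attrs.get("Name", attrs.get("ID", "."))
        # GFF3 start is 1-based inclusive; BED start is 0-based.
        bed_rows.append((seqid, int(start) - 1, int(end), name, 0, strand))
    return bed_rows


# Example record in GFF3 layout, for illustration only.
example_gff3 = (
    "##gff-version 3\n"
    "chr1\t.\tmiRNA_primary_transcript\t17369\t17436\t.\t-\t.\t"
    "ID=MI0022705;Alias=MI0022705;Name=hsa-mir-6859-1\n"
    "chr1\t.\tmiRNA\t17409\t17431\t.\t-\t.\t"
    "ID=MIMAT0027618;Name=hsa-miR-6859-5p\n"
)
bed_rows = gff3_to_bed(example_gff3)
```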
Bed files:
Human
- all exonic miRs => human_exonic_miR_NCBI.bed
- all intronic and no host gene miRs => human_nonexonicmiR_NCBI.bed
- all intronic miRs => human_intronic_miR_NCBI.bed
- all no host gene miRs => human_miR_no_hostmRNA_NCBI.bed
tsv files:
Any file whose name contains _unique has had the isoforms of the miRNA host genes dropped.
Human
- all exonic miRs => human_exonic_miR_NCBI.tsv
- all intronic and no host gene miRs => human_nonexonicmiR_NCBI.tsv
- all intronic miRs => human_intronic_miR_NCBI.tsv
- all no host gene miRs => human_miR_no_hostmRNA_NCBI.tsv
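
To illustrate what dropping isoforms for the _unique files could look like, here is a minimal Python sketch that keeps one row per miRNA and host gene pair. Keeping the first isoform encountered is an assumption; the repository may use a different rule. The function name `drop_isoforms` and the `miRNA` / `host_gene` field names are hypothetical.

```python
def drop_isoforms(rows, key_fields=("miRNA", "host_gene")):
    """Keep the first row seen for each combination of key_fields."""
    seen = set()
    unique_rows = []
    for row in rows:
        key = tuple(row[field] for field in key_fields)
        if key not in seen:
            seen.add(key)
            unique_rows.append(row)
    return unique_rows


# Made-up rows for illustration only: two isoforms of the same host gene.
example_rows = [
    {"miRNA": "miR-A", "host_gene": "GENE1", "transcript": "NM_000001"},
    {"miRNA": "miR-A", "host_gene": "GENE1", "transcript": "NM_000002"},
    {"miRNA": "miR-B", "host_gene": "GENE2", "transcript": "NR_000003"},
]
unique_rows = drop_isoforms(example_rows)
```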