psiMNB

Active learning approach of Multinomial Naive Bayes for finding PUS-dependent pseudouridylation

psiMNB is a training workflow for building Multinomial Naive Bayes (MNB) models that predict PUS labels and probabilities using an active learning approach.

Pre-installation

psiMNB requires homer (findMotifsGenome.pl), bedtools, cd-hit, and several R packages to be installed in advance, and is mainly used on Unix-based operating systems. We therefore recommend running all tools and scripts in WSL2 (WSL2 installation guide: https://pureinfotech.com/install-windows-subsystem-linux-2-windows-10/) or on a Unix-based system with R and Python.

#use pacman to install packages in batch

install.packages("pacman")
library(pacman)

#load and install required R packages 
p_load('optparse','openxlsx','dplyr','ggplot2','RColorBrewer','motifStack','gridGraphics','stringr')

Input data

Test data: human_PUS_MNB_input_k-mer_overall.txt, human_PUS_MNB_input_k-mer_TRUB1.txt, human_PUS_MNB_input_k-mer_PUS3.txt, human_PUS_MNB_input_k-mer_PUS1.txt. These are the training datasets, organized from the results generated by findMotif.sh.

Usage

Generate training dataset

Run findMotif.sh, which automatically invokes motifFinding.pl and motifStack_vis.r to generate the motif clustering result.

Notice: findMotif.sh requires the settings in the motifConfigure_hg38.xml configuration file. Required files: hg38.fa and hg38.fa.fai. Required tools: bedtools, homer (findMotifsGenome.pl), and cd-hit.

Notice: findMotif.sh invokes homer (findMotifsGenome.pl) to generate the motif enrichment result (e.g. 'homerResults/motif10.motif'). To improve enrichment for human data, we recommend replacing the default homer background file (all.rna.motifs) with the new version we provide.

bash findMotif.sh ePSI_seq_total_polyA_Day0_mix.bed Day0_common_anno_group_redundance_mix.txt

Notice: each name in 'ePSI_seq_total_polyA_Day0_mix.bed' should correspond one-to-one to the name column of 'Day0_common_anno_group_redundance_mix.txt'. The former provides the genome location; the latter provides the annotation for each observation in the former and must include the 'Y_extendSeq_20nt' column.
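The one-to-one name requirement can be checked before running findMotif.sh. The following is a minimal sketch, not part of the repository. It assumes that the BED name is the standard fourth column, that the annotation file is tab-separated with a header, and that the name column is literally called `name`. The helper names and the toy records are hypothetical.

```python
import csv
import io


def bed_names(bed_text):
    """Collect the name field (4th column) of each BED record."""
    names = []
    for line in bed_text.splitlines():
        # Skip blank lines and optional BED header lines.
        if not line.strip() or line.startswith(("#", "track", "browser")):
            continue
        names.append(line.split("\t")[3])
    return names


def annotation_names(anno_text, column="name"):
    """Collect one column (assumed header: 'name') of a tab-separated table."""
    reader = csv.DictReader(io.StringIO(anno_text), delimiter="\t")
    return [row[column] for row in reader]


def unmatched_names(bed, anno):
    """Return names found only in the BED file and only in the annotation."""
    bed_set, anno_set = set(bed), set(anno)
    return sorted(bed_set - anno_set), sorted(anno_set - bed_set)


# Toy data, made up for illustration (not taken from the repository files).
toy_bed = "chr1\t100\t101\tsite_1\t0\t+\nchr2\t200\t201\tsite_2\t0\t-\n"
toy_anno = "name\tY_extendSeq_20nt\nsite_1\tACGT\nsite_3\tTTGA\n"

only_bed, only_anno = unmatched_names(bed_names(toy_bed), annotation_names(toy_anno))
print("only in BED:", only_bed)
print("only in annotation:", only_anno)
```

Both lists should be empty for a consistent pair of files. The set comparison ignores duplicated names, so a strict one-to-one check would also compare the lengths of the two name lists.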

Finally, based on the output file generated by findMotif.sh (with the 'findMotif_pssm_append_info.xlsx' suffix), manually organize the input dataset in the same format as the test data: a two-column txt file with the 20nt extended sequence in the first column and the label assigned from known tRNA Ψ-sites evidence in the second.
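A quick way to sanity-check a manually organized file is to parse it and count the labels. This is a minimal sketch under stated assumptions: the file is tab-separated and has no header line (the source does not say either way), and the function name, sequences, and labels below are illustrative only.

```python
from collections import Counter


def read_training_table(text):
    """Parse 'sequence<TAB>label' lines into (sequence, label) pairs."""
    rows = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        seq, label = line.split("\t")[:2]
        rows.append((seq.upper(), label))
    return rows


# Toy content, made up for illustration (real files hold 20nt extended sequences).
toy_table = "gttcgaatc\tTRUB1\nGTTCAAATC\tTRUB1\nACGCGCGCA\tPUS1\n"

rows = read_training_table(toy_table)
label_counts = Counter(label for _, label in rows)
print(label_counts)
```

A strongly unbalanced label count is worth noticing before training, since the class prior of a Naive Bayes model follows the label frequencies.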

Determine k-mer

python build_psi_MNB_overall_test_kmer.py # input data: human_PUS_MNB_input_k-mer_overall.txt is loaded
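For readers unfamiliar with k-mers, the sketch below shows how a sequence can be split into overlapping k-mers for several values of k. It is an illustration only and does not reproduce build_psi_MNB_overall_test_kmer.py. The function names and the toy sequence are assumptions.

```python
from collections import Counter


def kmers(seq, k):
    """Return all overlapping substrings of length k, left to right."""
    seq = seq.upper()
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def kmer_counts(seq, k):
    """Count how often each k-mer occurs in the sequence."""
    return Counter(kmers(seq, k))


# Toy sequence, made up for illustration.
toy_seq = "GTTCGAATC"
for k in (2, 3, 4):
    print(k, kmers(toy_seq, k))
```

A sequence of length n yields n - k + 1 k-mers, so larger k gives fewer but more specific features. That trade-off is what a k-mer selection step has to balance.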

Build MNB model and run prediction

python build_psi_MNB_overall.py # input data: human_PUS_MNB_input_k-mer_overall.txt is loaded
python build_psi_MNB_TRUB1.py # input data: human_PUS_MNB_input_k-mer_TRUB1.txt is loaded
python build_psi_MNB_PUS3.py # input data: human_PUS_MNB_input_k-mer_PUS3.txt is loaded
python build_psi_MNB_PUS1.py # input data: human_PUS_MNB_input_k-mer_PUS1.txt is loaded

Or run with script parameters:

python PUSscan_build.py -training_file human_PUS_MNB_input_k-mer_overall.txt -model_name overall -to_predict Day0_common_anno_group_redundance_mix.txt -output_dir /public/home/chenzr/PSI_Seq_brainCell/A1-A12-totalRNA-result/psiFinder_ANN_res/PUSscan_test 
python PUSscan_predict.py -model_file overall_multinomialnb_model.pkl -to_predict Day0_common_anno_group_redundance_mix.txt -output_dir /public/home/chenzr/PSI_Seq_brainCell/A1-A12-totalRNA-result/psiFinder_ANN_res/PUSscan_test
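To make the label/probability output concrete, here is a minimal pure-Python sketch of a Multinomial Naive Bayes classifier over k-mer counts with additive smoothing. It is not the code in PUSscan_build.py or PUSscan_predict.py, and it leaves out the active learning step, which the text above does not describe. The class name, the defaults for `k` and `alpha`, and the toy sequences and labels are assumptions.

```python
import math
from collections import Counter, defaultdict


def kmers(seq, k):
    """Return all overlapping substrings of length k."""
    seq = seq.upper()
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


class TinyMultinomialNB:
    """Multinomial Naive Bayes on k-mer counts (illustrative sketch)."""

    def __init__(self, k=3, alpha=1.0):
        self.k = k          # k-mer length (assumed default)
        self.alpha = alpha  # additive smoothing parameter

    def fit(self, seqs, labels):
        self.class_counts = Counter(labels)
        self.kmer_counts = defaultdict(Counter)
        for seq, label in zip(seqs, labels):
            self.kmer_counts[label].update(kmers(seq, self.k))
        self.vocab = set()
        for counts in self.kmer_counts.values():
            self.vocab.update(counts)
        return self

    def predict_proba(self, seq):
        n_total = sum(self.class_counts.values())
        log_scores = {}
        for label, n_label in self.class_counts.items():
            counts = self.kmer_counts[label]
            denom = sum(counts.values()) + self.alpha * len(self.vocab)
            score = math.log(n_label / n_total)  # log class prior
            for km in kmers(seq, self.k):
                if km in self.vocab:  # k-mers never seen in training are ignored
                    score += math.log((counts[km] + self.alpha) / denom)
            log_scores[label] = score
        # Normalize in log space to avoid underflow.
        top = max(log_scores.values())
        weights = {lab: math.exp(s - top) for lab, s in log_scores.items()}
        z = sum(weights.values())
        return {lab: w / z for lab, w in weights.items()}

    def predict(self, seq):
        proba = self.predict_proba(seq)
        return max(proba, key=proba.get)


# Toy training data, made up for illustration.
train_seqs = ["GTTCGAATC", "GTTCAAATC", "ACGCGCGCA", "CCGCGCGAA"]
train_labels = ["TRUB1", "TRUB1", "PUS1", "PUS1"]

model = TinyMultinomialNB(k=3, alpha=1.0).fit(train_seqs, train_labels)
proba = model.predict_proba("GTTCGAATC")
print(model.predict("GTTCGAATC"), proba)
```

The returned dictionary plays the role of the per-label probabilities, and the label with the highest probability is the predicted class.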

