SINE Annotation Tool for Plant Genomes
AnnoSINE is a SINE annotation tool for plant genomes. The program is designed to generate high-quality, non-redundant SINE libraries for genome annotation. Its annotation performance is benchmarked against the manually curated SINE library of the Oryza sativa genome.
To use AnnoSINE, you need to install the tools listed below.
- Python 3.7.4
- HMMER 3.3.1
- BLAST+ 2.10.1
- TRF 4.09
- IRF 3.05
- CD-HIT 4.8.1
- RepeatMasker 4.1.2
- Node 12.18.2
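Before running AnnoSINE, it can help to confirm that the external tools are reachable from your shell. The following is a minimal sketch, not part of AnnoSINE itself. The executable names in `ASSUMED_EXECUTABLES` are assumptions about how each tool is typically installed and may differ on your system.

```python
import shutil

# Assumed executable names (not taken from the AnnoSINE documentation);
# adjust them to match your installation.
ASSUMED_EXECUTABLES = [
    "hmmsearch",     # HMMER
    "blastn",        # BLAST+
    "trf",           # TRF
    "irf",           # IRF
    "cd-hit-est",    # CD-HIT
    "RepeatMasker",  # RepeatMasker
    "node",          # Node
]


def find_missing(executables, which=shutil.which):
    """Return the names in `executables` that cannot be found on PATH."""
    return [name for name in executables if which(name) is None]


if __name__ == "__main__":
    missing = find_missing(ASSUMED_EXECUTABLES)
    if missing:
        print("Not found on PATH:", ", ".join(missing))
    else:
        print("All assumed executables were found on PATH.")
```

The check only confirms that a program with that name exists on PATH; it does not verify the version numbers listed above.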
# pip
cd ./AnnoSINE/bin
pip3 install -r requirements.txt
# conda
conda env create -f AnnoSINE.conda.yaml
## download IRF
mv irf305.linux.exe irf
## add the directory containing irf ($IRF_PATH) to PATH
export PATH=$IRF_PATH:$PATH
conda activate AnnoSINE
python3 AnnoSINE.py [options] <mode> <input_filename> <output_filename>
positional arguments:
mode [1 | 2 | 3]
Choose the running mode of the program.
1--Homology-based method;
2--Structure-based method;
3--Hybrid of homology-based and structure-based method.
input_filename input genome assembly path
output_filename output files path
optional arguments:
-h, --help show this help message and exit
-l, --length_factor Threshold of the local alignment length relative to the BLAST query length (default: 0.3)
-c, --copy_number_factor Threshold of the copy number that determines the SINE boundary (default: 0.15)
-s, --shift Maximum threshold of the boundary shift (default: 80)
-g, --gap Maximum threshold of the truncated gap (default: 10)
-minc, --copy_number Minimum threshold of the copy number for each element (default: 20)
-b, --boundary Output SINE seed boundaries based on TSD or MSA (default: msa)
-f, --figure Output the SINE seed MSA figures and copy number profiles (y/n) (default: n)
-r, --non_redundant Annotate SINE in the whole genome based on the non-redundant library (y/n) (default: y)
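The options above can be combined with the positional arguments in the order given by the usage line. The sketch below assembles such a command from Python. The helper `build_annosine_command` is hypothetical and only illustrates the argument order; the flag names are the long options listed above, and the example values are arbitrary.

```python
import shlex

# Long option names taken from the option list above.
DOCUMENTED_OPTIONS = {
    "length_factor", "copy_number_factor", "shift", "gap",
    "copy_number", "boundary", "figure", "non_redundant",
}


def build_annosine_command(mode, genome, out_dir, **options):
    """Hypothetical helper: build an AnnoSINE argument list.

    Options are placed before the three positional arguments,
    following the documented usage line.
    """
    if mode not in (1, 2, 3):
        raise ValueError("mode must be 1, 2, or 3")
    unknown = set(options) - DOCUMENTED_OPTIONS
    if unknown:
        raise ValueError("undocumented options: " + ", ".join(sorted(unknown)))
    cmd = ["python3", "AnnoSINE.py"]
    for name, value in options.items():
        cmd += ["--" + name, str(value)]
    cmd += [str(mode), genome, out_dir]
    return cmd


if __name__ == "__main__":
    cmd = build_annosine_command(
        3, "../Testing/A.thaliana_Chr4.fasta", "../Output_Files",
        copy_number=10, figure="y",
    )
    # Print a shell-safe version of the command for inspection.
    print(" ".join(shlex.quote(part) for part in cmd))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` from the `AnnoSINE/bin` directory once the environment is activated.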
Genome sequence (FASTA format).
- Redundant SINE library: $ Step7_cluster_output.fasta
- Non-redundant SINE library with serial number: $Seed_SINE.fa.
- Whole-genome SINE annotation: $Input_genome.fasta.out. This file contains high-similarity SINE annotations.
- SINE candidate information predicted by homology search: $ ../Family_Seq/Family_Name/Family_Name.out. (requires mode 1 or 3)
- SINE candidate sequences predicted by structure search: $ ../Input_Files/Input_genome-matches.fasta. (requires mode 2 or 3)
- Extended candidate sequences for TSD search: $ Step1_extend_tsd_input.fa
- TSD identification outputs: $ Step2_tsd.txt
- MSA extended input sequences flanked with TSD: $ Step2_extend_blast_input.fa
- MSA output: $ Step3_blast_output.out
- Intermediate sequences with MSA quality examination: $ Step3_blast_process_output.fa
- SINE candidate sequences after MSA quality examination: $ Step4_rna_input.fasta
- BLAST output of SINE candidates against the RNA database: $ Step4_rna_output.out
- Classified SINE candidates after RNA examination: $ Step4_rna_output.fasta
- TRF output: $ Step4_rna_output.fasta.2.5.7.80.10.10.2000.dat
- SINE candidates after removing elements consisting of tandem repeats: $ Step5_trf_output.fasta
- SINE candidate sequences after extension: $ Step6_irf_input.fasta.
- IRF output: $ Step6_irf_input.fasta.2.3.5.80.10.20.500000.10000.dat
- SINE candidates after removing elements flanked with inverted repeats: $ Step6_irf_output.fasta
- CD-HIT output: $ Step7_cluster_output.fasta.clstr
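To get a quick overview of how the redundant candidates were grouped, the CD-HIT cluster file listed above can be summarized. This is a minimal sketch that assumes the standard CD-HIT `.clstr` layout, in which each cluster starts with a `>Cluster N` header followed by one line per member. The function name `cluster_sizes` and the sample records are illustrative only.

```python
def cluster_sizes(lines):
    """Count the members of each cluster in CD-HIT .clstr text.

    `lines` is any iterable of strings, for example an open file.
    Returns a dict mapping "Cluster N" to its number of members.
    """
    sizes = {}
    current = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">Cluster"):
            # A header line opens a new cluster.
            current = line[1:]
            sizes[current] = 0
        elif current is not None:
            # Every other non-empty line is one member sequence.
            sizes[current] += 1
    return sizes


if __name__ == "__main__":
    # Made-up records, used only to show the expected layout.
    sample = [
        ">Cluster 0",
        "0\t180nt, >candidate_a... *",
        "1\t175nt, >candidate_b... at +/98.29%",
        ">Cluster 1",
        "0\t210nt, >candidate_c... *",
    ]
    for name, count in cluster_sizes(sample).items():
        print(name, count)
```

To summarize a real run, open `Step7_cluster_output.fasta.clstr` from your output directory and pass the file object to `cluster_sizes`.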
You can test AnnoSINE with one chromosome of Arabidopsis thaliana (it takes about 6 minutes).
cd ./AnnoSINE/bin
python3 AnnoSINE.py 3 ../Testing/A.thaliana_Chr4.fasta ../Output_Files
The results of the AnnoSINE run on the test data are saved in Output_Files.