SINE Annotation Tool for Plant/Animal Genomes
AnnoSINE_v2 is a SINE annotation tool for plant/animal genomes. The program is designed to generate high-quality non-redundant SINE libraries for genome annotation. Its annotation performance is benchmarked against the manually curated SINE library of the Oryza sativa genome.
To use AnnoSINE_v2, you need to install the tools listed below.
- Python 3.7.4
- HMMER 3.3.1
- BLAST+ 2.10.1
- TRF 4.09
- IRF 3.0X
- CD-HIT 4.8.1
- RepeatMasker 4.1.2
- Node 12.18.2
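Before running the pipeline, it can help to confirm that the external tools are reachable. The following is a minimal sketch, not part of AnnoSINE_v2. The executable names in `ASSUMED_EXECUTABLES` are assumptions about how each tool is typically installed and may differ on your system. IRF is left out because it ships as a platform-specific binary (for example `irf308.linux.exe`) that is passed with `-irf`.

```python
import shutil

# Hypothetical mapping: tool name -> executable usually found on PATH.
# Adjust these names to match your installation.
ASSUMED_EXECUTABLES = {
    "HMMER": "hmmsearch",
    "BLAST+": "blastn",
    "TRF": "trf",
    "CD-HIT": "cd-hit-est",
    "RepeatMasker": "RepeatMasker",
    "Node": "node",
}


def find_missing(executables=ASSUMED_EXECUTABLES, which=shutil.which):
    """Return the sorted names of tools whose executable is not on PATH."""
    return sorted(name for name, exe in executables.items() if which(exe) is None)


if __name__ == "__main__":
    missing = find_missing()
    if missing:
        print("Not found on PATH:", ", ".join(missing))
    else:
        print("All listed tools were found on PATH.")
```

This only checks that an executable exists; it does not verify the versions listed above.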
# pip
cd ./AnnoSINE/bin
pip3 install -r requirements.txt
# conda
conda env create -f AnnoSINE.conda.yaml
## change the permission of IRF
chmod 755 irf308.linux.exe
conda activate AnnoSINE
python3 AnnoSINE_v2.py [options] <mode> <input_filename> <output_filename>
If the program stops at a certain step or produces no output, the default filtering cutoffs may be too strict. Try relaxing them with the command below:
python3 AnnoSINE_v2.py [options] <mode> -e 0.01 -minc 1 -s 150 <input_filename> <output_filename>
positional arguments:
mode [1 | 2 | 3]
Choose the running mode of the program.
1--Homology-based method;
2--Structure-based method;
3--Hybrid of homology-based and structure-based method.
input_filename input genome assembly path
output_filename output files path
optional arguments:
-h, --help show this help message and exit
-e, --hmmer_evalue Expectation value threshold for saving hits of homology search (default: 1e-10)
-v, --blast_evalue Expectation value threshold for sequence alignment search (default: 1e-10)
-l, --length_factor Threshold of the local alignment length relative to the BLAST query length (default: 0.3)
-c, --copy_number_factor Threshold of the copy number that determines the SINE boundary (default: 0.15)
-s, --shift Maximum threshold of the boundary shift (default: 80)
-g, --gap Maximum threshold of the truncated gap (default: 10)
-minc, --copy_number Minimum threshold of the copy number for each element (default: 20)
-a, --animal If set to 1, HMMER searches for SINEs using the animal HMM files from Dfam. (default: 0)
-b, --boundary Output SINE seed boundaries based on TSD or MSA (default: msa)
-f, --figure Output the SINE seed MSA figures and copy number profiles (y/n). Please note that this step may take a long time to process. (default: n)
-auto, --automatically_continue If set to 1, the program skips finished steps and continues unfinished steps for a previously processed output dir. (default: 0)
-r, --non_redundant Annotate SINEs in the whole genome based on the non-redundant library (y/n) (default: y)
-t, --threads Threads for each tool in AnnoSINE (default: 36)
-irf, --irf_path Path to the irf program (default: '')
-rpm, --RepeatMasker_enable If set to 0, RepeatMasker (Step 8 in the code) is not run. (default: 1)
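When several genomes or parameter sets need to be run, the command line can be assembled programmatically. The sketch below is only an illustration: `build_command` is a hypothetical helper, not part of AnnoSINE_v2, and it assumes the script is invoked as `AnnoSINE_v2.py` as in the usage line. It uses only the short flags documented above and places the options before the three positional arguments.

```python
def build_command(mode, genome, out_dir, script="AnnoSINE_v2.py", **options):
    """Assemble the argument list for one run.

    Keyword names are the short flags without the leading dash,
    e.g. minc=1 becomes "-minc 1". No flag names are validated here.
    """
    if mode not in (1, 2, 3):
        raise ValueError("mode must be 1, 2 or 3")
    cmd = ["python3", script]
    for flag, value in options.items():
        cmd += [f"-{flag}", str(value)]
    # Positional arguments follow the options: <mode> <input> <output>
    cmd += [str(mode), genome, out_dir]
    return cmd


if __name__ == "__main__":
    # Relaxed cutoffs, as suggested when a run stops early or yields no output.
    cmd = build_command(3, "genome.fasta", "./Output_Files", e=0.01, minc=1, s=150)
    print(" ".join(cmd))
```

The returned list can be passed to `subprocess.run(cmd, check=True)` to launch the run.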
Input: genome sequence (FASTA format).
Output files:
- Redundant SINE library: $ Step7_cluster_output.fasta
- Non-redundant SINE library with serial number: $Seed_SINE.fa.
- Whole-genome SINE annotation: $Input_genome.fasta.out. This file contains high-similarity SINE annotations.
- SINE candidate information predicted by homology search: $ ../Family_Seq/Family_Name/Family_Name.out. (requires mode 1 or 3)
- SINE candidate sequences predicted by structure search: $ ../Input_Files/Input_genome-matches.fasta. (requires mode 2 or 3)
- Extended candidate sequences for TSD search: $ Step1_extend_tsd_input.fa
- TSD identification outputs: $ Step2_tsd.txt
- MSA extended input sequences flanked with TSD: $ Step2_extend_blast_input.fa
- MSA output: $ Step3_blast_output.out
- Intermediate sequences with MSA quality examination: $ Step3_blast_process_output.fa
- SINE candidate sequences after MSA quality examination: $ Step4_rna_input.fasta
- BLAST output of SINE candidates against the RNA database: $ Step4_rna_output.out
- Classified SINE candidates after RNA examination: $ Step4_rna_output.fasta
- TRF output: $ Step4_rna_output.fasta.2.5.7.80.10.10.2000.dat
- SINE candidates after removing elements consisting of tandem repeats: $ Step5_trf_output.fasta
- SINE candidate sequences after extension: $ Step6_irf_input.fasta.
- IRF output: $ Step6_irf_input.fasta.2.3.5.80.10.20.500000.10000.dat
- SINE candidates after removing elements flanked with inverted repeats: $ Step6_irf_output.fasta
- CD-HIT output: $ Step7_cluster_output.fasta.clstr
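To get a quick overview of the clustering step, the CD-HIT `.clstr` file can be summarized with a few lines of Python. This is a minimal sketch under the assumption that the file follows the standard CD-HIT layout (a `>Cluster N` header line followed by one line per member). The function name and the sequence IDs in the sample are hypothetical.

```python
def cluster_sizes(lines):
    """Count member sequences per cluster from the lines of a .clstr file."""
    sizes = {}
    current = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">Cluster"):
            current = line[1:]      # e.g. "Cluster 0"
            sizes[current] = 0
        elif current is not None:
            sizes[current] += 1     # every other line is one member sequence
    return sizes


if __name__ == "__main__":
    # Made-up sample in .clstr layout; replace with the lines of
    # Step7_cluster_output.fasta.clstr, e.g. via open(path).
    sample = (
        ">Cluster 0\n"
        "0\t120nt, >seq1... *\n"
        "1\t118nt, >seq2... at +/98.31%\n"
        ">Cluster 1\n"
        "0\t95nt, >seq3... *\n"
    )
    sizes = cluster_sizes(sample.splitlines())
    print(len(sizes), "clusters,", sum(sizes.values()), "sequences")
```

Because the function accepts any iterable of lines, an open file handle can be passed directly.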
You can test AnnoSINE with one chromosome of Arabidopsis thaliana (it takes about 6 minutes).
cd ./AnnoSINE/Testing
python3 ../bin/AnnoSINE_v2.py -t 20 3 A.thaliana_Chr4.fasta ./Output_Files
The results of the test run are saved in Output_Files.
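After the test run, one way to confirm that the main libraries were produced is to look for them by name under the output directory. The sketch below is an assumption-laden illustration: `locate_outputs` is a hypothetical helper, and because the exact subdirectory layout is not described here, it searches recursively rather than assuming a fixed location.

```python
from pathlib import Path

# File names taken from the output list; their location inside the
# output directory is not assumed.
EXPECTED = ("Seed_SINE.fa", "Step7_cluster_output.fasta")


def locate_outputs(out_dir, names=EXPECTED):
    """Map each expected file name to its first match under out_dir, or None."""
    root = Path(out_dir)
    found = {}
    for name in names:
        matches = sorted(root.rglob(name))
        found[name] = matches[0] if matches else None
    return found


if __name__ == "__main__":
    for name, path in locate_outputs("./Output_Files").items():
        print(name, "->", path if path is not None else "not found")
```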