AnnoSINE_v2

SINE Annotation Tool for Plant/Animal Genomes

Introduction

AnnoSINE_v2 is a SINE annotation tool for plant and animal genomes. It is designed to generate high-quality, non-redundant SINE libraries for genome annotation. Its annotation performance is benchmarked against the manually curated SINE library of the Oryza sativa genome.

Prerequisites

To use AnnoSINE_v2, you need to install the tools listed below.

Installation

# pip
cd ./AnnoSINE/bin
pip3 install -r requirements.txt

# conda
conda env create -f AnnoSINE.conda.yaml

## change the permission of IRF
chmod 755 irf308.linux.exe

Usage

conda activate AnnoSINE
python3 AnnoSINE_v2.py [options] <mode> <input_filename> <output_filename>

If the program stops at a certain step or produces no output, the filtering cutoffs may be too strict. Try relaxing them with the command below:

python3 AnnoSINE.py [options] <mode> -e 0.01 -minc 1 -s 150 <input_filename> <output_filename>
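
As a rough illustration, the relaxed-cutoff command above could be assembled from Python before handing it to a job scheduler or `subprocess.run`. The helper name `build_annosine_command` and the example paths are hypothetical; only the flags `-e`, `-minc`, `-s` and the three positional arguments come from the usage text. The script name is left as a parameter because this README uses both `AnnoSINE_v2.py` and `AnnoSINE.py`.

```python
import shlex


def build_annosine_command(mode, genome, out_dir, script="AnnoSINE_v2.py",
                           hmmer_evalue=None, min_copy_number=None, shift=None):
    """Return the argv list for one AnnoSINE run (hypothetical helper)."""
    if mode not in (1, 2, 3):
        raise ValueError("mode must be 1, 2, or 3")
    cmd = ["python3", script]
    # Options are added only when set, so the tool's own defaults apply otherwise.
    if hmmer_evalue is not None:
        cmd += ["-e", str(hmmer_evalue)]
    if min_copy_number is not None:
        cmd += ["-minc", str(min_copy_number)]
    if shift is not None:
        cmd += ["-s", str(shift)]
    cmd += [str(mode), genome, out_dir]
    return cmd


# Relaxed cutoffs as suggested above; the file names are placeholders.
relaxed = build_annosine_command(3, "genome.fasta", "Output_Files",
                                 hmmer_evalue=0.01, min_copy_number=1, shift=150)
print(shlex.join(relaxed))
```

Building the command as a list rather than a single string avoids quoting problems when paths contain spaces.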

Argument

positional arguments:
  mode                  [1 | 2 | 3]
                        Choose the running mode of the program.
                                1--Homology-based method;
                                2--Structure-based method;
                                3--Hybrid of homology-based and structure-based method.
  input_filename        input genome assembly path
  output_filename       output files path

optional arguments:
  -h, --help                   show this help message and exit
  -e, --hmmer_evalue           Expectation value threshold for saving hits of homology search (default: 1e-10)
  -v, --blast_evalue           Expectation value threshold for sequence alignment search (default: 1e-10)
  -l, --length_factor          Threshold of the local alignment length relative to the BLAST query length (default: 0.3)
  -c, --copy_number_factor     Threshold of the copy number that determines the SINE boundary (default: 0.15)
  -s, --shift                  Maximum threshold of the boundary shift (default: 80)
  -g, --gap                    Maximum threshold of the truncated gap (default: 10)
  -minc, --copy_number         Minimum threshold of the copy number for each element (default: 20)
  -a, --animal                 If set to 1, HMMER searches for SINEs using the animal HMM files from Dfam (default: 0)
  -b, --boundary               Output SINE seed boundaries based on TSD or MSA (default: msa)
  -f, --figure                 Output the SINE seed MSA figures and copy number profiles (y/n). Note that this step may take a long time. (default: n)
  -auto, --automatically_continue
                               If set to 1, the program skips finished steps and continues unfinished steps for a previously processed output directory (default: 0)
  -r, --non_redundant          Annotate SINEs in the whole genome based on the non-redundant library (y/n) (default: y)
  -t, --threads                Threads for each tool in AnnoSINE (default: 36)
  -irf, --irf_path             Path to the irf program (default: '')
  -rpm, --RepeatMasker_enable  If set to 0, RepeatMasker is not run (Step 8 in the code) (default: 1)
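
To make the interplay of positional arguments and defaults concrete, here is a minimal `argparse` sketch that mirrors a subset of the options listed above. It is not the project's actual parser: the function name `make_parser` and the value types (`int`, `float`) are assumptions, while the option names and defaults are copied from the help text.

```python
import argparse


def make_parser():
    """Sketch of a parser for a subset of the documented options (types assumed)."""
    parser = argparse.ArgumentParser(prog="AnnoSINE_v2.py")
    parser.add_argument("mode", type=int, choices=[1, 2, 3],
                        help="1: homology-based, 2: structure-based, 3: hybrid")
    parser.add_argument("input_filename", help="input genome assembly path")
    parser.add_argument("output_filename", help="output files path")
    parser.add_argument("-e", "--hmmer_evalue", type=float, default=1e-10)
    parser.add_argument("-s", "--shift", type=int, default=80)
    parser.add_argument("-minc", "--copy_number", type=int, default=20)
    parser.add_argument("-t", "--threads", type=int, default=36)
    return parser


# Same shape as the test command in this README: only -t is overridden.
args = make_parser().parse_args(
    ["-t", "20", "3", "A.thaliana_Chr4.fasta", "./Output_Files"])
print(args.mode, args.threads, args.copy_number, args.hmmer_evalue)
```

Any option that is not passed keeps its documented default, which is why the short test command only needs `-t`.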

Inputs

Genome sequence (FASTA format).
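
Before starting a long run it can help to confirm that the input really is FASTA. The sketch below is not part of AnnoSINE; `summarize_fasta` is a hypothetical helper that counts records and bases and rejects files with sequence data before the first header. The demo file and its record names are made up.

```python
import os
import tempfile


def summarize_fasta(path):
    """Return (number of records, total bases); raise ValueError on malformed input."""
    n_records = 0
    total_bases = 0
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue  # ignore blank lines
            if line.startswith(">"):
                n_records += 1
            elif n_records == 0:
                raise ValueError("sequence data found before the first '>' header")
            else:
                total_bases += len(line)
    if n_records == 0:
        raise ValueError("no FASTA records found")
    return n_records, total_bases


# Demo on a tiny made-up file.
with tempfile.NamedTemporaryFile("w", suffix=".fasta", delete=False) as tmp:
    tmp.write(">chr_demo_1\nACGTACGT\nACGT\n>chr_demo_2\nGGCC\n")
demo_path = tmp.name
n_records, total_bases = summarize_fasta(demo_path)
os.remove(demo_path)
print(n_records, total_bases)
```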

Outputs

Redundant SINE library: $ Step7_cluster_output.fasta
Non-redundant SINE library with serial number: $Seed_SINE.fa.
Whole-genome SINE annotation: $Input_genome.fasta.out. This file contains high-similarity SINE annotations.
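
The whole-genome annotation can be summarized per SINE family with a few lines of Python. This sketch assumes the `.out` file follows the standard whitespace-separated RepeatMasker layout, with the score in the first column and the repeat name in the tenth; that layout is an assumption here, not something this README states. The function name and the demo rows are invented for illustration.

```python
from collections import Counter


def count_hits_per_family(lines):
    """Count annotation rows per repeat name in RepeatMasker-style .out lines."""
    counts = Counter()
    for line in lines:
        fields = line.split()
        # Header and blank lines do not start with an integer score.
        if len(fields) < 11 or not fields[0].isdigit():
            continue
        counts[fields[9]] += 1  # tenth column: name of the matching repeat
    return counts


# Made-up rows in the assumed layout; real files would be read with open().
demo_out = [
    "   SW   perc perc perc  query     position in query    matching  repeat        position in repeat",
    "score   div. del. ins.  sequence  begin end   (left)   repeat    class/family  begin end (left)  ID",
    "",
    "  463    1.3  0.6  1.7  Chr4      1000  1200  (18000)  +  SINE_demo_A  SINE/unknown  1  200  (0)  1",
    "  512    0.9  0.0  0.0  Chr4      5000  5210  (13990)  C  SINE_demo_A  SINE/unknown  (0)  210  1  2",
    "  300    4.2  1.0  0.5  Chr4      9000  9150  (10050)  +  SINE_demo_B  SINE/unknown  1  150  (20)  3",
]
family_counts = count_hits_per_family(demo_out)
print(dict(family_counts))
```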

Intermediate Files

SINE candidate information predicted by homology search: $ ../Family_Seq/Family_Name/Family_Name.out. (requires mode 1 or 3)
SINE candidate sequences predicted by structure search: $ ../Input_Files/Input_genome-matches.fasta. (requires mode 2 or 3)
Extended candidate sequences for TSD search: $ Step1_extend_tsd_input.fa
TSD identification outputs: $ Step2_tsd.txt
MSA extended input sequences flanked with TSD: $ Step2_extend_blast_input.fa
MSA output: $ Step3_blast_output.out
Intermediate sequences with MSA quality examination: $ Step3_blast_process_output.fa
SINE candidate sequences after MSA quality examination: $ Step4_rna_input.fasta
BLAST output of SINE candidates against the RNA database: $ Step4_rna_output.out
Classified SINE candidates after RNA examination: $ Step4_rna_output.fasta
TRF output: $ Step4_rna_output.fasta.2.5.7.80.10.10.2000.dat
SINE candidates after removing elements consisting of tandem repeats: $ Step5_trf_output.fasta
SINE candidate sequences after extension: $ Step6_irf_input.fasta.
IRF output: $ Step6_irf_input.fasta.2.3.5.80.10.20.500000.10000.dat
SINE candidates after removing elements flanked with inverted repeats: $ Step6_irf_output.fasta
CD-HIT output: $ Step7_cluster_output.fasta.clstr
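
When a run stops early, the intermediate files listed above give a quick way to see how far it got. The sketch below checks one marker file per step; it assumes these files sit directly in the output directory and is not a description of how the `-auto` option works internally. `STEP_MARKERS` and `last_completed_step` are hypothetical names.

```python
import os
import tempfile

# One marker file per step, taken from the list of intermediate files above.
STEP_MARKERS = [
    (1, "Step1_extend_tsd_input.fa"),
    (2, "Step2_tsd.txt"),
    (3, "Step3_blast_output.out"),
    (4, "Step4_rna_output.fasta"),
    (5, "Step5_trf_output.fasta"),
    (6, "Step6_irf_output.fasta"),
    (7, "Step7_cluster_output.fasta"),
]


def last_completed_step(out_dir):
    """Return the highest consecutive step with a non-empty marker file (0 if none)."""
    last = 0
    for step, name in STEP_MARKERS:
        path = os.path.join(out_dir, name)
        if os.path.isfile(path) and os.path.getsize(path) > 0:
            last = step
        else:
            break  # stop at the first missing or empty file
    return last


# Demo: steps 1-3 and 5 have files, so the run is treated as finished up to step 3.
with tempfile.TemporaryDirectory() as demo_dir:
    for name in ("Step1_extend_tsd_input.fa", "Step2_tsd.txt",
                 "Step3_blast_output.out", "Step5_trf_output.fasta"):
        with open(os.path.join(demo_dir, name), "w") as handle:
            handle.write(">placeholder\nACGT\n")
    completed = last_completed_step(demo_dir)
print(completed)
```

An empty marker file is treated as unfinished, which matches the case where strict cutoffs leave a step with no surviving candidates.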

Testing

You can test AnnoSINE on one chromosome of Arabidopsis thaliana (this takes about 6 minutes).

cd ./AnnoSINE/Testing
python3 ../bin/AnnoSINE.py -t 20 3 A.thaliana_Chr4.fasta ./Output_Files

The results of the test run are saved in Output_Files.
