Michelle Hwang, [email protected]
Regina Baucom, [email protected]
University of Michigan
https://sites.lsa.umich.edu/baucom-lab/
October 2016
- Python 2
- MCL
- BLAST+
- TE_annotate.sh (Bash script to run entire pipeline.)
- annotate-v2.py (Annotation script. Can be run standalone.)
- cnv_mcl2na.pl (Perl script to format MCL output.)
- sum_bp.py (Optional script in pipeline to sum base pairs.)
- get_fasta_seqs.py (Optional script to grab sequences from a multi-fasta file.)
This a pipeline used to annotate tranposable elements using a word pattern matching and majority rules method. Accuracy of this method is dependent on the divergence of the species being annotated and the species source of the provided transposable element database.
The annotation script will sort sequences into transposable element families, determined by the MCL clustering algorithm. Each family will be annotated according to class (I or II), order, and superfamily.
**** This pipeline assumes host elements are filtered out. ****
Classes | Orders | Superfamilies |
---|---|---|
1 | LTR, SINE, LINE, RETRO | GYPSY, COPIA, ANGELA, ATHILA, RETROLYC1 |
2 | TIR, HELITRON | ACDS, ENSPM, CACTA, PIF/HARBRINGER, HAT, MITE, HELITRON |
The bash script file includes a method to run the entire pipeline. You will need to provide the following files:
- File of your full-length or consensus transposable elements (TE sequences) needing annotation and classification.
- File of your transposable element database in FASTA format.
The pipeline will run through the following steps:
- BLAST
- All by All BLAST of your reads.
- BLAST of reads to NCBI database.
- BLAST of reads to provided transposable element database.
- Clustering
- Clustering of reads via MCL.
- Formatting of MCL output by "cnv_mcl2na.pl"
- Annotation of reads by "annoate.py"
- Optional output
- Generates a table of each sequence's annotation in a tab-delimited format
- "sum_by.py" reports the total base pairs in each family
An example command run with default parameters:
bash TE_annotate.sh sequenced-reads.fa te-db.fa arabidopsis 1 0 0 0 .2 0.5 5
The Python annotation script can be run without using the pipeline. One needs, at minimum:
- MCL formatted outfile
- BLAST outfile from custom transposable element database (outfmt=6 + stitle)
- Fasta file of your TE sequences
The default parameters:
Flag | Default | Description |
---|---|---|
-e | 1 | E-value threshold to be considered from BLAST hit |
-b | 0 | Bit threshold to be considered from BLAST hit |
-p | 0 | Perecent identity threshold to be considered from BLAST hit |
-t | 0.2 | Percent of family members annotated to be considered an annotated family |
-st | 0.5 | Percent of members in a small family annotated to be considered an annotated family |
-ft | 5 | Number of members or below needed to be considered a small family |
To run with default parameters:
python2.7 annotate.py out.mcl blast.out sequences.fa
To run with default parameters and also include BLAST to NCBI results:
python2.7 annotate.py out.mcl blast.out sequences.fa -ncbi ncbi-blast.out
For full list of help and parameters:
python2.7 annotate.py -h
The annotate.py script will output two files and a report to stdout.
This file will contain all your sequences with the transposable element annotations in the header. The header will follow the following format separated by "#":
Sequence ID, Class, Order, Superfamily, MCL family
Summarizes the annotation of each family, one per line, tab-delimited.
This report file will be printed to stdout. To save it, please pipe it into a newfile. It contains a summary of how the script is determining the annotation of each family.
Example Case 1: Typical Family
In this case, we have MCL family 1 with 10 sequences. It passes the annotation threshold, which means that the % of sequences in the family that were annotated passes our designated threshold. Each sequence name is printed out, along with its determined annotation in order of class, order, and superfamily. The last column is the e-value of the BLAST hit that was used to annotate the sequence. If a sequence is determined to be a host element, it is removed. The final line is what the program decides to annotate the family as.
>FAM=1, N=10
Passes annotation threshold.
sequence-1 1 LTR COPIA 1e-100
sequnece-2 1 LTR COPIA 0
sequence-3 1 LTR NA 0.001
sequence-4 1 LTR COPIA 1e-7
sequence-5 NA NA NA no BLAST hits
sequence-6 HOST NA NA =REMOVED
sequence-7 1 LTR NA 1e-56
sequence-8 NA NA NA 0.053
sequence-9 1 LTR COPIA 2e-37
sequence-10 1 LTR COPIA 7e-10
1 LTR COPIA
Example Case 2: Tie Family
In this case, we have a tie to annotate this family between two superfamilies, GYPSY and COPIA. Since GYPSY has the higher e-values for the BLAST hit, we choose GYPSY as the annotation.
>FAM=22, N=6
Passes annotation threshold.
sequence-1 NA NA NA 1e-12
sequence-2 1 LTR GYPSY 3e-72
sequence-3 1 LTR GYPSY 8e-68
sequence-4 1 LTR COPIA 1e-7
sequence-5 1 LTR COPIA 1e-10
sequence-6 1 LTR NA 0
1 LTR GYPSY
Example Case 3: Small Family
For families that are considered "small" based on our small family threshold parameter, they can follow a different annotation threshold. By default, a small family must have at least 50% of its elements annotated to be considered an annotated family. In this case, only 33% of the sequences are annotated, so this family is considered an NA family.
>FAM=100, N=3
Does not pass annotation threshold.
sequence-1 1 LTR GYPSY 1e-11
sequence-2 NA NA NA 6e-16
sequence-3 NA NA NA 5e-10
NA NA NA
Get original ID of all sequences that are COPIA (or other superfamily):
grep 'COPIA' ANNOTATE.fasta | awk -F '[>|#]' '{print $2}'
Get a FASTA file of all sequences that are COPIA (or other superfamily):
python2.7 get_fasta_seqs.py ANNOTATE_Dodder.fasta -k <(grep 'COPIA' ANNOTATE_Dodder.fasta | awk -F '[>|#]' '{print $2}')
Get a FASTA file of all sequences in a specific MCL family:
NUM=6
python2.7 get_fasta_seqs.py ANNOTATE_Dodder.fasta -k <(awk -F '[>|#]' '$6==$NUM {print $2}' ANNOTATE_Dodder.fasta)
Count how many sequences are in a specific MCL familY:
NUM=6
awk -F '[>|#]' '$6==$NUM {print $2}' ANNOTATE_Dodder.fasta | wc -l