To construct reliable and comprehensive repeat libraries is a challenging task due to the variation in repeat structure and the difficulty of assembling repeats in genome sequences. As many elements vary considerably in genetic structure and sequence, the only means of achieving reliable results when identifying and annotating TEs is to practice complementary approaches. A flowchart describing our overall approach to TE identification is given in Fig. below. The specific methods for each type are detailed below. We employ de novo signature-based detection programs that rely upon prior knowledge concerning the sharing between different TEs of standard architectural features necessary for the process of transposition.
Examples of classification according to similarity to known TEs include records in databases like Repbase and protein profiles retrieved from the Pfam database. Unfortunately, only well-described TEs that have a robust structural signature can be discovered by these methods. Some TEs do not have such characteristics and thus cannot be distinguished by this approach. In contrast to homology-based methods, signature-based methods are less biased by similarity to the set of known elements.
LTR retrotransposons represent the largest genomic mass among all repeats. As a result, it is very important to collect this type of element with high confidence. We annotated LTR-RTs using the PhyLTR pipeline.
Briefly, the elements are collected using LTRharvest and filtered by LTRdigest LTRdigest searches for homologs in the putative LTR-RTs using HMMER3 and a set of TE-related pHMMs we provided from Pfam and GyDB. and other custom programs. Both LTRharvest and LTRdigest are in a package called GenomeTools.
creating suffix array Commands:
gt suffixerator -db Input.fna -indexname Input.fna.index -dna -suf -lcp -des -ssp
- -dna: input is DNA sequence
- -suf: output suffix array (suftab) to file
- -lcp: output lcp table (lcptab) to file
- -des: output sequence descriptions to file
- -ssp: output sequence separator positions to file
Searches the input sequence for direct repeats (LTRs) that are separated by a given distance (default 1 kb) and outside of which are apparent target site duplications (TSDs). Candidates distinguished by LTRharvest .
gt ltrharvest -index Input.fna.index -gff3 Input.fna.ltrharvest.out.gff -seqids yes -v yes -mintsd 4 -maxtsd 20 -xdrop 5 -mat 2 -mis -2 -ins -3 -del -3 -minlenltr 100 -maxlenltr 1000 -mindistltr 100 -maxdistltr 15000 -vic 60
- -minlenltr: specify minimum length for each LTR.
- -maxlenltr: specify maximum length for each LTR.
- -mindistltr: specify minimum distance of LTR startpositions.
- -maxdistltr: specify maximum distance of LTR startpositions.
- -mintsd: specify minimum length for each TSD.
- -maxtsd: specify maximum length for each TSD.
- -vic: specify the number of nucleotides (to the left and to the right that will be searched for TSDs and/or motifs around 5' and 3' boundary of predicted LTR retrotransposons.
- -mat: specify matchscore for extension-alignment.
- -mis: specify mismatchscore for extension-alignment.
- -ins: specify insertionscore for extension-alignment.
- -del: specify deletionscore for extension-alignment.
- -v: verbose mode.
- -seqids: use sequence descriptions instead of sequence numbers in GFF3 output.
- -gff3: specify GFF3 outputfilename.
LTRdigest searches for homologs in the putative LTR-RTs using HMMER3 and a set of TE-related pHMMs we provided from Pfam and GyDB.
gt -j 20 ltrdigest -matchdescstart -outfileprefix Input.fna.LTRdigest -hmms Dfam.hmms -seqfile Input.fna
- -matchdescstart: exactly match the sequence descriptions from the input files for the desired sequence IDs (in GFF3) from the beginning to the first whitespace.
- -outfileprefix: prefix for output files (e.g. 'foo' will create files called 'foo_.csv' and 'foo_.fas').
- -hmms: profile HMM models for domain detection in HMMER3 format.
- -seqfile: set the sequence file from which to take the sequences Began extracting LTR_retrotransposon sequences from LTRharvest GFF.
Began extracting LTR_retrotransposon sequences from LTRharvest GFF
Usage: bedtools getfasta [OPTIONS] -fi <fasta> -bed <bed/gff/vcf>
bedtools getfasta -fi Input.fna -s -bed Input.LTRharvest_LTR_retrotransposons.gff > Input.LTRharvest_LTR_retrotransposons.fasta
- -fi Input FASTA file
- -fo Output file (opt., default is STDOUT
- -bed BED/GFF/VCF file of ranges to extract from -fi
- -s Force strandedness. If the feature occupies the antisense, strand, the sequence will be reverse complemented.
Usage: nhmmer [options] <query hmmfile|alignfile|seqfile> <target seqfile>
nhmmer --tblout DfamClassification/Input.fna.nhmmer_DfamHits.table --incE 1e-05 -E 10 --cpu 20 Dfam_ERV_LTR.hmm LTRharvest_LTR_retrotransposons.fasta
- --tblout : save parseable table of hits to file
- --incE : consider sequences <= this E-value threshold as significant [0.01] (x>0)
- -E : report sequences <= this E-value threshold in output [10.0] (x>0)
nhmmer_table2columns.py < Input.fna.nhmmer_DfamHits.table > Input.LTR_retrotransposon_DfamBestHits.tab
gffAddAttr.py -gff Input.fna.LTRdigest.withORFs_gt_300bp.gff -attr dfamClassification -map Input.fna.LTR_retrotransposon_DfamBestHits.tab -mapKey ID -restrictType LTR_retrotransposon -replaceIfNone > Input.fna.LTRdigest.withDfam.gff
Using BLAST (Basic Local Alignment Search Tool) RepBase data base.
tblastx -db Repbase_ERV_LTR.fasta -query Input.LTRharvest_LTR_retrotransposons.fasta -evalue 1e-05 -outfmt "7 qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore sstrand" -num_threads 20 -max_hsps 25 Output.fna.tblastx_Repbase.tab
- -db: BLAST database name
- -evalue: Expectation value (E) threshold for saving hits
- -outfmt: 7 can be additionally configured to produce a custom format specified by space delimited format specifiers, or by a token specified by the delim keyword.
- -max_hsps: Set maximum number of HSPs per subject sequence to save for each query
Extracting best hits from tblastx on Repbase results, using best_blast_hit.py.
best_blast_hit.py < Output.fna.tblastx_Repbase.tab > Output.fna.LTR_retrotransposon_RepbaseBestHits.tab
gffAddAttr.py -gff Input.LTRdigest.withDfam.gff -attr repbaseClassification -map Input.LTR_retrotransposon_RepbaseBestHits.tab -mapKey ID -restrictType LTR_retrotransposon -replaceIfNone > Input.LTRdigest.withDfam.withRepbase.gff
Note: The Full pipeline depsited in zenodo Further details regarding Class1 TEs Identification, classification, divergent, scripts. please install the script. [].
Here, we began with the recognized genomic coordinates of LTR-RTs identified in the previous step. These candidates were masked with maskfasta from BedTools to avoid conflicts or duplicate hits.
Usage: bedtools getfasta [OPTIONS] -fi <fasta> -bed <bed/gff/vcf>
bedtools maskfasta -fi InputGenome.fa -bed LTRdigestClassifiedNoFP.gff -fo InputGenome_Msked.fa
Next, open reading frame sequences were extracted from the masked genome by applying the getorf tool from EMBOSS v6.4.0.0. The minimum ORF size was set to 500 bp in anticipation of detecting the apyrimidinic endonuclease (APE) gene (which is 600--800 bp in 97% of inspected non-LTR elements)
getorf -sequence InputGenome_Msked.fa -outseq InputGenome_Msked.fa -minsize 500 -find 2
- -minsize: Minimum nucleotide size of ORF
- -find: 2 (Nucleic sequences between STOP codons)
the exploration of the genomic sequences with MGEScan-non-LTR , which identifies and classifies non-LTR TEs in genomic sequences using probabilistic models based on the structure of the 12 established non-LTR TE clades. More precisely, we used MGEScan-non-LTR and hmmsearch from HMMER with two separate hidden Markov model (HMM) profiles, one for the reverse transcriptase (RT) gene and one for the endonuclease (APE) gene, both of which are well conserved among non-LTR TEs.
virtualenv $MGESCAN_VENV
source $MGESCAN_VENV/bin/activate
git clone https://github.com/COL-IU/mgescan.git
cd mgescan
python setup.py install
mgescan nonltr <genome_dir> [--output=<data_dir>] [--mpi=<num>]
All eukaryotic DNA transposons reported so far belong to a single category of elements which use the so-called ``cut-and-paste'' transposition mechanism, except Helitrons, which transpose by rolling-circle replication. Here, we employed methodologies for the detection of DNA transposons in the studied genomes based on the initial identification of TIR, and non-autonomous elements such as miniature inverted-repeat elements (MITEs) and helitron.
MITEs are DNA-based elements that have TIRs but lack a transposase gene, and their well-defined structural features make them suitable for discovery by computational approaches. a valuable tool for detecting MITEs in eukaryotic genomes, MiteFinder. this tool is capable of detecting both perfect and imperfect inverted repeats through a string matching approach. It computes a new function to cluster MITE sequences into different MITE families in several steps.
- First, it builds a k-mer index and seeks inverted repeats. Then, all sequence candidates are distinguished by the presence of a TIR pair of default length and a TSD pair.
- the scaffolds are divided into multiple sequence fragments that overlap by 800 bp, which is the maximum length of MITEs, to guarantee that all inverted repeats are identified.
- pairs of TIRs having lengths in the range of 50-800 bp are retained, and the remainder used as seeds for MITE candidates in the next step.
- identified sequences are compared with MITEs in the Repbase database using blastn. Those with high similarity are considered valid positives, and those with low similarity as false positives. For each MITE cluster, the sequence with the highest blast score was selected via best_blast_hit.py as the representative family sequence. The tool was executed with default parameters, except for the use of a confidence-score threshold of 0.5 to exclude low-confidence candidates.
miteFinder -input Input.fna -output Output_Mite.fa -pattern_scoring pattern_scoring.txt -threshold 0.5
- -threshold Threshold of removing mite candidates with low-confidence score.
- -pattern_scoring The path of a scoring file.
Clustering (CD-Hit)
CD-Hit -i Output_Mite.fa -o Clustered-Output_Mite.fa -c 0.8
Classification (BLAST)
blastn -task blastn -query SpeciesMITEs.fa -subject Clustered-Output_Mite.fa -outfmt 6 > out.csv
Extract Best hits best_blast_hit.py
best_blast_hit.py < out.csv > bestHits
HelitronScanner relies on sequence matches between trained local combinational variables (LCVs) and genome . Specifically, it scores 5$'$ and 3$'$ termini based on a training set of published Helitrons, and then merges the coordinates and scores for putative Helitron-like sequences. Multiple stages are examined, using the HelitronScanner tool, First, scan the Head/Tail with a default threshold is 1, which allows any matches to LCVs. The location scores are further filtered in the pairends stage, considering a threshold of 6 for Helitron’s both ends.
java -jar HelitronScanner.jar scanHead -lcv_filepath -genome -buffer_size -output HeadOut
java -jar HelitronScanner.jar scanTail -lcv_filepath -genome -buffer_size -output TailOut
java -jar HelitronScanner.jar pairends -hs HeadOut -ts TailOut -o Pair.fa -ht 6 -tt 6
java -jar HelitronScanner.jar draw -pscore Pair.fa -genome genome.fa -output FinalHelitron.fa --pure
Clustering (CD-Hit)
CD-Hit -i FinalHelitron.fa -o Clustered_Helitron.fa -c 0.8
Classification (BLAST)
blastn -task blastn -query SpeciesHelitron.fa -subject Clustered_Helitron.fa -outfmt 6 > out.csv
Extract Best hits best_blast_hit.py
best_blast_hit.py < out.csv > bestHits
libraries generated using the programs mentioned above merged, pluse the TE repeat sequences present in study species can be extracte from Dfam Consensus-20170127 and RepBase-20181026 using the script queryRepeatDatabase shipped with RepeatMasker. The results of both steps above were combined and processed for masking and annotation using RepeatMasker.
Annotation, and mask the genome using RepeatMasker
RepeatMasker -nolow -norna -a -s -gff -cutoff 225 -no_is -pa 40 -lib Combined_Lib.fa Input_Genome.fa -dir Output
- -cutoff: Sets cutoff score for masking repeats when using -lib
- -no_is: Skips bacterial insertion element check
- -lib: Allows use of a custom library
- -norna: Does not mask small RNA (pseudo) genes
- -nolow: Does not mask low_complexity DNA or simple repeats
- -a:Writes alignments in .align output file
- -s: Slow search; 0-5% more sensitive, 2-3 times slower than default
- -gff:Creates an additional Gene Feature Finding format output
Additionally, the copy number of each TE and determined genome coverage from the RepeatMasker output files (.out), which correspond to the number of insertions identified in the masked genomes. The remaining unmasked portion of the genome is scanned using RepeatModeler with default settings to detect any unclassified TEs such as TIRs that were missed by structure-based TE identification.
Collecting repetitive sequences missed by previouse Structure-based tools using RepeatModeler
The remaining sequence is then used as the input for RepeatModeler:
BuildDatabase -name MaskedGenome Output.masked.fna
RepeatModeler -database MaskedGenome
Among the sequences generated by RepeatModeler, some are associated with identities and others are not. The script repeatmodeler_parse.pl separates elements without identities into repeatmodeler_unknowns and the others in repeatmodeler_identities.
- Finally the candidate duplication removal assisted using seqket rmdup.
- Filtered library used as an input for reannotation using RepeatMasker To estimate the final Transposable elements in the genome.