Extending MitoZ's database
For annotation, MitoZ's default database works well most of the time. If it does not (usually because the protein sequences in the default database are too distant from your samples), you may want to build a custom annotation database for MitoZ. Here is how to do it.
First, find the conda environment in which MitoZ is installed. Execute:
$ conda env list
# conda environments:
#
base * /home/guanliang/soft/miniconda3
mitozEnv /home/guanliang/soft/miniconda3/envs/mitozEnv
For me, the exact path of the MitoZ package is:
/home/guanliang/soft/miniconda3/envs/mitozEnv/lib/python3.7/site-packages/mitoz
The path for MitoZ's database:
$ ll /home/guanliang/soft/miniconda3/envs/mitozEnv/lib/python3.7/site-packages/mitoz/profiles
total 16K
-rw-rw-r-- 2 guanliang 0 May 12 06:47 __init__.py
drwxrwxr-x 2 guanliang 4.0K May 24 16:06 CDS_HMM
drwxrwxr-x 2 guanliang 4.0K May 24 16:06 rRNA_CM
drwxrwxr-x 2 guanliang 4.0K May 24 16:06 __pycache__
drwxrwxr-x 2 guanliang 4.0K May 24 17:36 MT_database
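If you would rather not read the path off the conda output by hand, the following is a minimal sketch that asks Python where a package is installed. It has to be run with the Python interpreter of the environment that contains MitoZ. The helper names package_dir and profiles_dir are made up for this example, and the "profiles" subdirectory is assumed from the listing above.

```python
import importlib.util
from pathlib import Path


def package_dir(package_name: str) -> Path:
    """Return the directory in which an importable package is installed."""
    spec = importlib.util.find_spec(package_name)
    if spec is None or spec.origin is None:
        raise ModuleNotFoundError(
            f"{package_name!r} is not importable from this Python interpreter"
        )
    # spec.origin points at the package's __init__.py; its parent is the package dir.
    return Path(spec.origin).resolve().parent


def profiles_dir(package_name: str = "mitoz") -> Path:
    """Assumption: the database lives in a 'profiles' subdirectory of the package."""
    return package_dir(package_name) / "profiles"
```

Calling profiles_dir() inside the mitozEnv environment should give the same directory as the one listed above; in any other environment it raises ModuleNotFoundError.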
To list all the database files for protein-coding gene (PCG) annotation:
$ ls /home/guanliang/soft/miniconda3/envs/mitozEnv/lib/python3.7/site-packages/mitoz/profiles/MT_database/*_CDS_protein.fa
/home/guanliang/soft/miniconda3/envs/mitozEnv/lib/python3.7/site-packages/mitoz/profiles/MT_database/Animal_CDS_protein.fa
/home/guanliang/soft/miniconda3/envs/mitozEnv/lib/python3.7/site-packages/mitoz/profiles/MT_database/Annelida-segmented-worms_CDS_protein.fa
/home/guanliang/soft/miniconda3/envs/mitozEnv/lib/python3.7/site-packages/mitoz/profiles/MT_database/Arthropoda_CDS_protein.fa
/home/guanliang/soft/miniconda3/envs/mitozEnv/lib/python3.7/site-packages/mitoz/profiles/MT_database/Bryozoa_CDS_protein.fa
/home/guanliang/soft/miniconda3/envs/mitozEnv/lib/python3.7/site-packages/mitoz/profiles/MT_database/Chaetognatha-arrow-worms_CDS_protein.fa
/home/guanliang/soft/miniconda3/envs/mitozEnv/lib/python3.7/site-packages/mitoz/profiles/MT_database/Chordata_CDS_protein.fa
/home/guanliang/soft/miniconda3/envs/mitozEnv/lib/python3.7/site-packages/mitoz/profiles/MT_database/Cnidaria_CDS_protein.fa
/home/guanliang/soft/miniconda3/envs/mitozEnv/lib/python3.7/site-packages/mitoz/profiles/MT_database/Echinodermata_CDS_protein.fa
/home/guanliang/soft/miniconda3/envs/mitozEnv/lib/python3.7/site-packages/mitoz/profiles/MT_database/Mollusca_CDS_protein.fa
/home/guanliang/soft/miniconda3/envs/mitozEnv/lib/python3.7/site-packages/mitoz/profiles/MT_database/Nematoda_CDS_protein.fa
/home/guanliang/soft/miniconda3/envs/mitozEnv/lib/python3.7/site-packages/mitoz/profiles/MT_database/Nemertea-ribbon-worms_CDS_protein.fa
/home/guanliang/soft/miniconda3/envs/mitozEnv/lib/python3.7/site-packages/mitoz/profiles/MT_database/Platyhelminthes-flatworms_CDS_protein.fa
/home/guanliang/soft/miniconda3/envs/mitozEnv/lib/python3.7/site-packages/mitoz/profiles/MT_database/Porifera-sponges_CDS_protein.fa
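The part of each file name in front of _CDS_protein.fa appears to be the clade name. As a small sketch under that assumption, the clade names can be collected from a profiles directory like this (available_clades is a hypothetical helper, not part of MitoZ):

```python
from pathlib import Path

SUFFIX = "_CDS_protein.fa"


def available_clades(profiles_dir):
    """List the clade names that have a protein database file in MT_database."""
    db_dir = Path(profiles_dir) / "MT_database"
    # Strip the common suffix so that only the clade name remains.
    return sorted(p.name[: -len(SUFFIX)] for p in db_dir.glob("*" + SUFFIX))
```

For the listing above this would return names such as Animal, Arthropoda and Chordata.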
I suggest that you do NOT touch the /home/guanliang/soft/miniconda3/envs/mitozEnv/lib/python3.7/site-packages/mitoz/profiles/ directory if you do not know what you are doing. Instead, copy this directory to a new location and edit the files there.
$ mkdir ~/mitoz_custom_db
$ cp -a /home/guanliang/soft/miniconda3/envs/mitozEnv/lib/python3.7/site-packages/mitoz/profiles ~/mitoz_custom_db
$ ls -lhrt ~/mitoz_custom_db/profiles/
total 16K
-rw-rw-r-- 1 guanliang guanliang 0 May 12 06:47 __init__.py
drwxrwxr-x 2 guanliang guanliang 4.0K May 24 16:06 CDS_HMM
drwxrwxr-x 2 guanliang guanliang 4.0K May 24 16:06 rRNA_CM
drwxrwxr-x 2 guanliang guanliang 4.0K May 24 16:06 __pycache__
drwxrwxr-x 2 guanliang guanliang 4.0K May 24 17:36 MT_database
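If you script this step, a rough Python equivalent of the copy could look like the sketch below. The function name copy_profiles is made up for this example. Unlike cp -a, it skips the __pycache__ directory and refuses to overwrite an existing copy, which is a design choice of this sketch and not something MitoZ requires.

```python
import shutil
from pathlib import Path


def copy_profiles(src, dest_parent):
    """Copy a profiles directory into dest_parent and return the new path."""
    src = Path(src)
    dest = Path(dest_parent).expanduser() / src.name
    if dest.exists():
        # Do not silently overwrite a database you may already have edited.
        raise FileExistsError(f"{dest} already exists")
    # copytree creates missing parent directories and keeps file metadata.
    shutil.copytree(src, dest, ignore=shutil.ignore_patterns("__pycache__"))
    return dest
```

For example, copy_profiles("/path/to/mitoz/profiles", "~/mitoz_custom_db") would create ~/mitoz_custom_db/profiles.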
After you update the files within ~/mitoz_custom_db/profiles/, use the --profiles_dir ~/mitoz_custom_db/profiles option when you run MitoZ to tell it that you want to use this custom database:
$ mitoz annotate --thread_number 8 --fastafiles YOUR_mito_genome.fasta --profiles_dir ~/mitoz_custom_db/profiles --genetic_code 5 --clade Arthropoda
What if you get errors with the --profiles_dir option? For example:
FileNotFoundError: [Errno 2] No such file or directory: '03_anno_Option_1_test.fasta_mitoscaf.fa.solar.genewise.gff.cds.position.cds'
Make sure the value of your --profiles_dir option is correct: directly under that path there should be the CDS_HMM, rRNA_CM, and MT_database directories.
And make sure your target clade has the three files in these directories:
CDS_HMM/Arthropoda_CDS.hmm
CDS_HMM/Arthropoda_CDS_length_list
MT_database/Arthropoda_CDS_protein.fa
You can create them yourself. "Arthropoda" here is the clade name.
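The two checks above can be automated. This is a minimal sketch that only tests for the directories and files named in this section; missing_profile_items is a hypothetical helper, and MitoZ may need further files that are not listed here.

```python
from pathlib import Path

REQUIRED_DIRS = ("CDS_HMM", "rRNA_CM", "MT_database")


def missing_profile_items(profiles_dir, clade):
    """Return the expected directories and clade files that are absent."""
    root = Path(profiles_dir).expanduser()
    expected_files = (
        f"CDS_HMM/{clade}_CDS.hmm",
        f"CDS_HMM/{clade}_CDS_length_list",
        f"MT_database/{clade}_CDS_protein.fa",
    )
    # Directories are reported with a trailing slash to tell them apart from files.
    missing = [d + "/" for d in REQUIRED_DIRS if not (root / d).is_dir()]
    missing += [f for f in expected_files if not (root / f).is_file()]
    return missing
```

An empty list means that everything this section asks for is in place for the given clade; otherwise the list names what to create or fix.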
- Please provide an absolute path to the --profiles_dir option!
Have a look at https://github.com/linzhi2013/MitoZ/issues/146.
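To avoid passing a relative path by accident, the path can be made absolute before the command is assembled. The sketch below only builds the argument list, using the options from the example command above; annotate_command is a made-up helper name and nothing is executed.

```python
from pathlib import Path


def annotate_command(fasta, profiles_dir, clade, genetic_code, threads=8):
    """Build a 'mitoz annotate' argument list with an absolute --profiles_dir."""
    # expanduser() handles "~"; resolve() turns the result into an absolute path.
    abs_profiles = Path(profiles_dir).expanduser().resolve()
    return [
        "mitoz", "annotate",
        "--thread_number", str(threads),
        "--fastafiles", str(fasta),
        "--profiles_dir", str(abs_profiles),
        "--genetic_code", str(genetic_code),
        "--clade", clade,
    ]
```

The returned list can be handed to subprocess.run(cmd, check=True) or joined with spaces and pasted into a shell.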
If your samples are arthropods, add the new protein sequences to this file:
~/mitoz_custom_db/profiles/MT_database/Arthropoda_CDS_protein.fa
For example, add the following sequences to this file:
>gi_NC_KX091860_ND1_Cerapanorpa_obtusa_319_aa
MMMIDFIMPLIGSLLLIICVLVGVAFLTLLERKVLGYIQIRKGPNKVGFMGIPQPFCDAIKLFTKEQTYP
ILSNYVSYYFSPIFSLFLSLTVWLVMPYFTNLYTFNLGLMFFLCCTSLGVYTVMIAGWSSNSNYALLGGL
RAVAQTISYEVSLALILLSFVFLIGNYSLMSFFYYQNYVWFIIITFPLALSWFASCLAETNRTPFDFAEG
ESELVSGFNVEYSSGGFALIFLAEYASILFMSMLFSVIFLGCDLMSFMFFIKLTFLSFLFIWVRGTLPRF
RYDKLMYLAWKSFLPLALNYLIFFLGLKVMLIYLY
The header line format must follow this style.
The rule is to find mitogenomes that are more closely related to your samples.
For example, you can use mitogenomes on NCBI that belong to the same genus or family as your sample. If you do not know which clade your samples belong to, you can BLAST your mitogenome sequences against NCBI's NT database (https://blast.ncbi.nlm.nih.gov/Blast.cgi?PROGRAM=blastn&PAGE_TYPE=BlastSearch&LINK_LOC=blasthome) and use the top-hit species.
Normally, adding one species closely related to your sample into MitoZ's database is good enough.
- Download the protein sequences of some more closely related species:
- Add these new protein sequences into the annotation database file:
~/mitoz_custom_db/profiles/MT_database/Arthropoda_CDS_protein.fa
The header line format must follow this style:
>gi_NC_XXX_YYY_Cerapanorpa_obtusa_319_aa
- Replace XXX with the GenBank accession number of the protein sequence; gi_NC_ must be kept in every case. For example, you must use KX091860 instead of KX091860.1, which means that the dot (.) is not allowed here.
- Replace YYY with the corresponding standard PCG name: ATP6, ATP8, COX1, COX2, COX3, CYTB, ND1, ND2, ND3, ND4, ND4L, ND5, ND6.
- Replace Cerapanorpa_obtusa with the new genus and species name. For unknown species, GenusName_sp. is also fine.
- Replace 319 with the length of the protein sequence.
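The header rules above are easy to get wrong by hand, so here is a small sketch that assembles a header from its parts. make_header is a hypothetical helper; it encodes only the rules stated in this section (fixed gi_NC_ prefix, no dot in the accession, one of the 13 standard PCG names, length in amino acids) and does not claim to reproduce how MitoZ itself parses headers.

```python
PCG_NAMES = {
    "ATP6", "ATP8", "COX1", "COX2", "COX3", "CYTB",
    "ND1", "ND2", "ND3", "ND4", "ND4L", "ND5", "ND6",
}


def make_header(accession, gene, species, sequence):
    """Build a FASTA header in the gi_NC_XXX_YYY_Genus_species_N_aa style."""
    # Drop the version suffix: KX091860.1 -> KX091860 (no dot allowed).
    accession = accession.strip().split(".")[0]
    if not accession:
        raise ValueError("empty accession number")
    if gene not in PCG_NAMES:
        raise ValueError(f"{gene!r} is not a standard PCG name")
    # "Cerapanorpa obtusa" -> "Cerapanorpa_obtusa"
    species = "_".join(species.split())
    # Count residues only, ignoring line breaks and spaces in the sequence.
    length = len("".join(sequence.split()))
    return f">gi_NC_{accession}_{gene}_{species}_{length}_aa"
```

For instance, make_header("KX091860.1", "ND1", "Cerapanorpa obtusa", seq) with a 319-residue seq gives the header shown in the style line above.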
Only the ND1 gene is shown here. You can do the same for the other PCGs, but you do NOT have to add all 13 PCGs. For example, the ATP8 gene is usually very divergent and therefore difficult for MitoZ to annotate; in that case, you can simply add a new ATP8 protein sequence to your custom MitoZ database.
Finally, if your samples belong to another clade, say Chordata, then you should edit ~/mitoz_custom_db/profiles/MT_database/Chordata_CDS_protein.fa instead.
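Adding a record to the right clade file can also be scripted. The following sketch appends one sequence to MT_database/<clade>_CDS_protein.fa inside a custom profiles directory. append_record is a made-up helper, the 70-character line width simply mirrors the example sequence on this page, and the duplicate-header check is a precaution of this sketch.

```python
from pathlib import Path


def append_record(profiles_dir, clade, header, sequence, width=70):
    """Append one protein record to the clade's database file."""
    db_file = (
        Path(profiles_dir).expanduser() / "MT_database" / f"{clade}_CDS_protein.fa"
    )
    if not db_file.is_file():
        raise FileNotFoundError(f"no database file for clade {clade!r}: {db_file}")
    if not header.startswith(">"):
        raise ValueError("a FASTA header must start with '>'")
    existing = db_file.read_text()
    if header in existing.splitlines():
        raise ValueError(f"header already present: {header}")
    # Remove whitespace, then wrap the sequence into fixed-width lines.
    seq = "".join(sequence.split())
    lines = [header] + [seq[i:i + width] for i in range(0, len(seq), width)]
    # Make sure the new record starts on its own line.
    prefix = "" if existing == "" or existing.endswith("\n") else "\n"
    with db_file.open("a") as fh:
        fh.write(prefix + "\n".join(lines) + "\n")
    return db_file
```

Point profiles_dir at your copy (for example ~/mitoz_custom_db/profiles), not at the directory inside the conda environment.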