-
Notifications
You must be signed in to change notification settings - Fork 39
Known issues
How to check:
Have a look at the file *.mitoAssemble.K*.overlap_information
, it may have something like this:
>C3252271 overlap between 5' and 3' are 52bp
TGAACGGAATAGTTGGTAATTAGTTTAATCAAAACAAATGATTTCGACTCA
and check the file *.mitoAssemble.K*.mitogenome.fa
:
$ head -1 *.mitoAssemble.K*.mitogenome.fa
>C3252271 topology=circular
Because this overlapping region is quite long and not simple repeats (say AAAAAAAA
), (in most cases) we can safely say that we have got a circular mitochondrial genome (there are methods to verify this, see XXX).
To do:
- Use another Circularity check script (I remember I have written one?) which can be aware of the simple repeats
Say we have this file *.mitoAssemble.K51.overlap_information
:
>C3892882 overlap between 5' and 3' are 9bp
TTTTTTTT
and we have this file *.mitoAssemble.K51.mitogenome.fa
:
>C3892882 topology=circular
In this case, it is dubious that the assembled mt genome is circular. In this case, you can just treat the sequence as linear, or try to assemble the mt genome with different kmers (larger or smaller), which might be able to overcome the simple repeat problem. This depends on the goal of your study, e.g., if you are going to use only the PCGs for subsequent analysis and you have got all PCGs already, then a circular (complete) mt genome does not help. The changing breakpoint method mentioned 3. Breakpoint and incomplete genes should also help.
The summary.txt
file:
#Seq_id Length(bp) Circularity Closely_related_species
C3252271 16597 yes Onychomys leucogaster
#Seq_id Start End Length(bp) Direction Type Gene_name Gene_prodcut Total_freq_occurred
--------------------------------------------------------------------------------------------------------------------------
C3252271 14 82 69 + tRNA trnR(ucg) tRNA-Arg 1
C3252271 84 381 298 + CDS ND4L NADH dehydrogenase subunit 4L 1
.
.
.
C3252271 14568 14772 205 + CDS ATP8 ATP synthase F0 subunit 8 1
C3252271 14729 15410 682 + CDS ATP6 ATP synthase F0 subunit 6 1
C3252271 16193 16262 70 + tRNA trnG(ucc) tRNA-Gly 1
C3252271 16262 >16598 337 + CDS ND3 NADH dehydrogenase subunit 3 1
Here, the ND3 cannot find its stop codon, but because this sequence is actually circular already, the stop codon of ND3 is at the 5' end of this sequence.
In this case, you can simply manually change the breakpoint of the mitochondrial genome. But be careful, you should find a breakpoint where there is no gene (especially overlapping regions of different genes), for example, here sites between 14772
and 14729
, or the site 16262
are not good positions. Instead, Any positions between 15410
and 16193
can be chosen.
After this, you can re-annotate the sequence using the mitoz annotate
command, by also providing the fastq files (--fq1
and/or --fq2
options), MitoZ will calculate the sequencing depth of all sites along the mitogenome, which in turn provides more evidence to show if the mitogenome is really complete or not (check the abundance track on the circos.svg
and cirvos.png
files) --- if the sequencing depth (abundance) around the original breakpoint is normally high like other sites of the mitogenome, not sudden dropping or increasing a lot (which indicates repeats), then the mitogenome is complete.
During testing, when I submitted the job to SGE for running, sometimes it generated a core dump file (e.g. core.81923
) in the annotation step, I do not know the exact reason yet. But when I locally re-ran the command, the problem resolved. Complicated Conda environments can also induce problems, e.g. I once used a conda
installed by another person to create the mitozEnv
environment, and I always got missing PCGs in the annotation step.
Do not know why sometimes even circos --modules
shows every required Perl module is installed, MitoZ still fails to run Circos (thus no circos.svg
and circos.png
files), which results in something like this in the stderr output:
2022-06-01 17:18:53,822 - mitoz.utility.utility - INFO -
cp -r /home/gmeng/test/sing/mt_annotation/tmp_ttt_ttt.megahit.mitogenome.fa_mitoscaf.fa/mt_visualization/circos.png /home/gmeng/test/sing/mt_annotation/tmp_ttt_ttt.megahit.mitogenome.fa_mitoscaf.fa/mt_visualization/circos.svg /home/gmeng/test/sing/mt_annotation/tmp_ttt_ttt.megahit.mitogenome.fa_mitoscaf.fa/mt_visualization/circos.depth.txt /home/gmeng/test/sing/mt_annotation/ttt.ttt.megahit.mitogenome.fa.result
cp: der Aufruf von stat für „/home/gmeng/test/sing/mt_annotation/tmp_ttt_ttt.megahit.mitogenome.fa_mitoscaf.fa/mt_visualization/circos.png“ ist nicht möglich: Datei oder Verzeichnis nicht gefunden
cp: der Aufruf von stat für „/home/gmeng/test/sing/mt_annotation/tmp_ttt_ttt.megahit.mitogenome.fa_mitoscaf.fa/mt_visualization/circos.svg“ ist nicht möglich: Datei oder Verzeichnis nicht gefunden
2022-06-01 17:18:53,842 - mitoz.utility.utility - ERROR -
Error occured when running command:
cp -r /home/gmeng/test/sing/mt_annotation/tmp_ttt_ttt.megahit.mitogenome.fa_mitoscaf.fa/mt_visualization/circos.png /home/gmeng/test/sing/mt_annotation/tmp_ttt_ttt.megahit.mitogenome.fa_mitoscaf.fa/mt_visualization/circos.svg /home/gmeng/test/sing/mt_annotation/tmp_ttt_ttt.megahit.mitogenome.fa_mitoscaf.fa/mt_visualization/circos.depth.txt /home/gmeng/test/sing/mt_annotation/ttt.ttt.megahit.mitogenome.fa.result
Solution:
For each specimen, find the mt_annotation
directory:
$ source activate mitozEnv # or use "mamba" or "conda" instead of "source" the command here.
$ ls mt_annotation/tmp_*_mitoscaf.fa/mt_visualization/ -d | while read f ; do cd $f ; circos ; cd ../../../ ; done
# to list the resulting files:
$ ls mt_annotation/tmp_*_mitoscaf.fa/mt_visualization/circos.{svg,png}
If you have multiple specimens within the same directory, for example, /my/project/SampleID_1
, /my/project/SampleID_2
,
$ cd /my/project/ # go to the project directory containing the assembly directory of each specimen
$ source activate mitozEnv # or use "mamba" or "conda" instead of "source" the command here.
$ ls */mt_annotation/tmp_*_mitoscaf.fa/mt_visualization/ -d | while read f ; do cd $f ; circos ; cd ../../../../ ; done
# the '*' here will match your sample IDs
# to list the resulting files:
$ ls */mt_annotation/tmp_*_mitoscaf.fa/mt_visualization/circos.{svg,png}
# to copy the SVG/PNG files to the resulting directories
# Warning: The below command assumes there is NO '_' in your sample ID
$ ls */mt_annotation/tmp_*_mitoscaf.fa/mt_visualization/circos.{png,svg} | perl -a -F'/' -ne 'chomp; $F[2]=~s/tmp\_//; $F[2]=~s/\_mitoscaf\.fa//; my $sample=(split(/\_/,$F[2]))[0]; $F[2]=~s/$sample\_$sample\.//; my $result_dir="$sample/$sample.result/$sample.$sample.$F[2].result"; `cp $_ $result_dir`; '
updates:.
-
The Singularity version (MitoZ version 3.3) seems good to me, indicating the above problem is probably because my environmental variables on the cluster are somehow complicated (and I do not know the exact reason), if you have the same problem, please try to install
mitozEnv
into a clean environment or try the Singularity version. -
empty depth file
If the XXX.result/XXX.XXX.megahit.mitogenome.fa.result/circos.depth.txt
file is empty, it is normal for Circos not to run properly.
To specify a specific assembler, use the --assembler
option.
Warning: --assembler spades
only accepts paired-end data, which means that you need to provide both --fq1
and --fq2
!
This can happen if:
- typo for the value of the
--requiring_taxa
option. - Your etetoolkit database is broken. Please re-install it by checking https://github.com/linzhi2013/MitoZ/wiki/Installation#3-the-etetoolkit-database.
While Megahit can generate circular mitogenomes, in rare cases, however, it could generate very long "mitogenomes", for example:
$ cat summary.txt
#Seq_id Length(bp) Circularity Closely_related_species
k141_58720 44502 no
#Seq_id Start End Length(bp) Direction Type Gene_name Gene_prodcut Total_freq_occurred
--------------------------------------------------------------------------------------------------------------------------
k141_58720 22299 22368 70 + tRNA trnI(gau) tRNA-Ile 2
k141_58720 22367 22439 73 - tRNA trnQ(uug) tRNA-Gln 2
k141_58720 22442 22510 69 + tRNA trnM(cau) tRNA-Met 2
k141_58720 22511 23520 1010 + CDS ND2 NADH dehydrogenase subunit 2 2
k141_58720 23523 23589 67 + tRNA trnW(uca) tRNA-Trp 2
k141_58720 23591 23659 69 - tRNA trnA(ugc) tRNA-Ala 2
k141_58720 23662 23731 70 - tRNA trnN(guu) tRNA-Asn 2
k141_58720 23763 23830 68 - tRNA trnC(gca) tRNA-Cys 1
k141_58720 23831 23897 67 - tRNA trnY(gua) tRNA-Tyr 1
k141_58720 23899 25443 1545 + CDS COX1 cytochrome c oxidase subunit I 2
k141_58720 25441 25509 69 - tRNA trnS(uga) tRNA-Ser 3
k141_58720 25513 25580 68 + tRNA trnD(guc) tRNA-Asp 2
k141_58720 25582 26265 684 + CDS COX2 cytochrome c oxidase subunit II 2
k141_58720 26269 26332 64 + tRNA trnK(uuu) tRNA-Lys 2
k141_58720 26333 26536 204 + CDS ATP8 ATP synthase F0 subunit 8 2
k141_58720 26494 27174 681 + CDS ATP6 ATP synthase F0 subunit 6 2
k141_58720 27174 27958 785 + CDS COX3 cytochrome c oxidase subunit III 2
k141_58720 27958 28025 68 + tRNA trnG(ucc) tRNA-Gly 2
k141_58720 28031 28372 342 + CDS ND3 NADH dehydrogenase subunit 3 1
k141_58720 28374 28440 67 + tRNA trnR(ucg) tRNA-Arg 1
k141_58720 28443 28739 297 + CDS ND4L NADH dehydrogenase subunit 4L 1
k141_58720 28733 30115 1383 + CDS ND4 NADH dehydrogenase subunit 4 1
k141_58720 30111 30177 67 + tRNA trnH(gug) tRNA-His 1
k141_58720 30178 30236 59 + tRNA trnS(gcu) tRNA-Ser 3
k141_58720 30236 30305 70 + tRNA trnL(uag) tRNA-Leu 2
k141_58720 30306 32117 1812 + CDS ND5 NADH dehydrogenase subunit 5 1
k141_58720 32114 32635 522 - CDS ND6 NADH dehydrogenase subunit 6 1
k141_58720 32636 32704 69 - tRNA trnE(uuc) tRNA-Glu 1
k141_58720 32710 33852 1143 + CDS CYTB cytochrome b 1
k141_58720 33855 33921 67 + tRNA trnT(ugu) tRNA-Thr 1
k141_58720 33922 33989 68 - tRNA trnP(ugg) tRNA-Pro 1
k141_58720 34957 35022 66 + tRNA trnF(gaa) tRNA-Phe 1
k141_58720 35025 35972 948 + rRNA s-rRNA 12S ribosomal RNA 1
k141_58720 35973 36042 70 + tRNA trnV(uac) tRNA-Val 1
k141_58720 36041 37605 1565 + rRNA l-rRNA 16S ribosomal RNA 1
k141_58720 37607 37681 75 + tRNA trnL(uaa) tRNA-Leu 2
k141_58720 37682 38638 957 + CDS ND1 NADH dehydrogenase subunit 1 1
k141_58720 38637 38704 68 + tRNA trnI(gau) tRNA-Ile 2
k141_58720 38702 38773 72 - tRNA trnQ(uug) tRNA-Gln 2
k141_58720 38775 38843 69 + tRNA trnM(cau) tRNA-Met 2
k141_58720 38844 39881 1038 + CDS ND2 NADH dehydrogenase subunit 2 2
k141_58720 39880 39946 67 + tRNA trnW(uca) tRNA-Trp 2
k141_58720 39948 40016 69 - tRNA trnA(ugc) tRNA-Ala 2
k141_58720 40019 40088 70 - tRNA trnN(guu) tRNA-Asn 2
k141_58720 40153 41685 1533 + CDS COX1 cytochrome c oxidase subunit I 2
k141_58720 41681 41749 69 - tRNA trnS(uga) tRNA-Ser 3
k141_58720 41753 41820 68 + tRNA trnD(guc) tRNA-Asp 2
k141_58720 41822 42505 684 + CDS COX2 cytochrome c oxidase subunit II 2
k141_58720 42509 42572 64 + tRNA trnK(uuu) tRNA-Lys 2
k141_58720 42573 42776 204 + CDS ATP8 ATP synthase F0 subunit 8 2
k141_58720 42734 43414 681 + CDS ATP6 ATP synthase F0 subunit 6 2
k141_58720 43414 44187 774 + CDS COX3 cytochrome c oxidase subunit III 2
k141_58720 44195 44262 68 + tRNA trnG(ucc) tRNA-Gly 2
Protein coding genes totally found: 19
tRNA genes totally found: 32
rRNA genes totally found: 2
--------------------------------------
Genes totally found: 53
And then I checked the duplicate genes, for example, the two COX1 genes and the two ATP8 genes, and found out they are different!
For the correct COX1, it has a sequencing depth of > 40,000 X along the whole gene. But for the wrong COX1, it has sequencing depths from 30X to 200X along the whole gene.
When I blast both COX1 to the NCBI NT database, they both mapped to the target genus.
Thus, be cautious when using the Megahit program for mitogenome assembly! We need to fine-tune the assembly parameters for it! For now, to determine if Megahit's mitogenome is reliable, (1) check the mitogenome length, too long and not circular is definitely problematic; (2) check the sequencing depth along the mitogenome! Be cautious with very high or low regions; (3) Compare the gene sequences to NCBI NT database (https://blast.ncbi.nlm.nih.gov/Blast.cgi?PROGRAM=blastn&PAGE_TYPE=BlastSearch&LINK_LOC=blasthome), check if it is your target clade; (4) Compare the results to the results generated by mitoAssemble which are more reliable. (5) adapting the used kmers (see below).
By adapting the used kmers.
One of the possible reasons for the ultra-long mitogenome (>40Kbp) is that, when using small kmers, the assembly graph is much more complex, and thus some sequences are "wrongly" connected. When larger kmers are used, the assembly graph is simplified, and these connections can be avoided.
By default, MitoZ uses --kmers_megahit 21 29 39 59 79 99 119 141
if you use Megahit for assembly (--assembler megahit
). When I set --kmers_megahit 39 59 79 99
(because my maximum read length is only 100 bp) or --kmers_megahit 59 79 99 119 141
, I both got a correct mitogenome (all genes are 100% similar to the ones assembled by --assembler mitoassemble
, see below summary). The sequencing depth along the whole mitogenome now is 1600X to 7199X.
You can also use even larger kmers, depending on your maximum read length. I may set larger default kmers in the further versions of MitoZ, and when the assembly is not good, the users can try smaller kmers by themselves.
#Seq_id Length(bp) Circularity Closely_related_species
k141_114419 16359 yes
#Seq_id Start End Length(bp) Direction Type Gene_name Gene_prodcut Total_freq_occurred
--------------------------------------------------------------------------------------------------------------------------
k141_114419 67 141 75 + tRNA trnL(uaa) tRNA-Leu 2
k141_114419 142 1098 957 + CDS ND1 NADH dehydrogenase subunit 1 1
k141_114419 1097 1164 68 + tRNA trnI(gau) tRNA-Ile 1
k141_114419 1162 1233 72 - tRNA trnQ(uug) tRNA-Gln 1
k141_114419 1235 1303 69 + tRNA trnM(cau) tRNA-Met 1
k141_114419 1304 2341 1038 + CDS ND2 NADH dehydrogenase subunit 2 1
k141_114419 2340 2406 67 + tRNA trnW(uca) tRNA-Trp 1
k141_114419 2408 2476 69 - tRNA trnA(ugc) tRNA-Ala 1
k141_114419 2479 2548 70 - tRNA trnN(guu) tRNA-Asn 1
k141_114419 2583 2649 67 - tRNA trnC(gca) tRNA-Cys 1
k141_114419 2650 2716 67 - tRNA trnY(gua) tRNA-Tyr 1
k141_114419 2718 4262 1545 + CDS COX1 cytochrome c oxidase subunit I 1
k141_114419 4260 4328 69 - tRNA trnS(uga) tRNA-Ser 2
k141_114419 4332 4399 68 + tRNA trnD(guc) tRNA-Asp 1
k141_114419 4401 5084 684 + CDS COX2 cytochrome c oxidase subunit II 1
k141_114419 5088 5151 64 + tRNA trnK(uuu) tRNA-Lys 1
k141_114419 5152 5355 204 + CDS ATP8 ATP synthase F0 subunit 8 1
k141_114419 5313 5993 681 + CDS ATP6 ATP synthase F0 subunit 6 1
k141_114419 5993 6777 785 + CDS COX3 cytochrome c oxidase subunit III 1
k141_114419 6777 6844 68 + tRNA trnG(ucc) tRNA-Gly 1
k141_114419 6850 7191 342 + CDS ND3 NADH dehydrogenase subunit 3 1
k141_114419 7193 7259 67 + tRNA trnR(ucg) tRNA-Arg 1
k141_114419 7262 7558 297 + CDS ND4L NADH dehydrogenase subunit 4L 1
k141_114419 7552 8934 1383 + CDS ND4 NADH dehydrogenase subunit 4 1
k141_114419 8930 8996 67 + tRNA trnH(gug) tRNA-His 1
k141_114419 8997 9055 59 + tRNA trnS(gcu) tRNA-Ser 2
k141_114419 9055 9124 70 + tRNA trnL(uag) tRNA-Leu 2
k141_114419 9125 10936 1812 + CDS ND5 NADH dehydrogenase subunit 5 1
k141_114419 10933 11454 522 - CDS ND6 NADH dehydrogenase subunit 6 1
k141_114419 11455 11523 69 - tRNA trnE(uuc) tRNA-Glu 1
k141_114419 11529 12671 1143 + CDS CYTB cytochrome b 1
k141_114419 12674 12740 67 + tRNA trnT(ugu) tRNA-Thr 1
k141_114419 12741 12808 68 - tRNA trnP(ugg) tRNA-Pro 1
k141_114419 13776 13841 66 + tRNA trnF(gaa) tRNA-Phe 1
k141_114419 13844 14791 948 + rRNA s-rRNA 12S ribosomal RNA 1
k141_114419 14792 14861 70 + tRNA trnV(uac) tRNA-Val 1
k141_114419 14860 16359 1500 + rRNA l-rRNA 16S ribosomal RNA 1
Protein coding genes totally found: 13
tRNA genes totally found: 22
rRNA genes totally found: 2
--------------------------------------
Genes totally found: 37
See https://github.com/linzhi2013/MitoZ/issues/176
Have a look at the --clade {Chordata,Arthropoda,Echinodermata,Annelida-segmented-worms,Bryozoa,Mollusca,Nematoda,Nemertea-ribbon-worms,Porifera-sponges}
option.
If you use Annelida-segmented-worms
or Nemertea-ribbon-worms
or Porifera-sponges
, you may get errors like this:
This is a small bug because MitoZ will try to find the taxonomy lineage for Nemertea-ribbon-worms
(instead of Nemertea).
Current solution:
Find the /app/anaconda/lib/python3.9/site-packages/mitoz/tools/taxonomy_ranks/get_taxonomy_rank_with_ete3_with_super_and_sub.py
, and modify part of its content (around line number 78):
def __init__(self, user_taxa):
NCBITaxa.__init__(self)
self.lineages = dict()
self.user_taxa = user_taxa
self.taxa_searched = user_taxa
if '-' in user_taxa: # add this line
self.taxa_searched = user_taxa.split('-')[0] # add this line
If you are using the Docker or Singularity image, you can copy this file to your host, modify the file, and then, bind it back to the image (so the modified file overwrites the one in the image). For example,
singularity exec --bind $PWD --bind /host/path/modified_get_taxonomy_rank_with_ete3_with_super_and_sub.py:/app/anaconda/lib/python3.9/site-packages/mitoz/tools/taxonomy_ranks/get_taxonomy_rank_with_ete3_with_super_and_sub.py ~/singularity/images/MitoZ_v3.4.sif
I will fix the problem in the next release.
MitoZ option: --resume_assembly
-
For megahit, it works when a run is accidentally stopped, e.g. the server was down. It will NOT work if you continue the run with some parameters changed.
-
Changing the kmers after a running was stopped does not work with Megahit so far. see https://github.com/voutcn/megahit/issues/110
-
When there is not a previous stopped run and you use the
--resume_assembly
option from the very beginning: It will not work either. Because this way, megahit only cares aboutmegahit --continue --out-dir ./megahit_out
options and omits the others, and it assumes all related parameters and commands have been written to./megahit_out
before using the--continue
option (--resume_assembly
for MitoZ).
Sometimes, for unknown reasons, your read 1 and read 2 fastq files may not be properly paired, e.g. the order of read 1 is not the same as read 2. This will lead to some errors and complaints in MitoZ. At least, I know Megahit and BWA do not like them.
Here are the tools to fix the problem:
-
https://bioinf.shenwei.me/seqkit/usage/#pair (already installed in the
mitoEnv
environment, you can use the command directly; however, it can consume a lot of RAM!) -
https://jgi.doe.gov/data-and-tools/software-tools/bbtools/bb-tools-user-guide/repair-guide/ (the
repair.sh
script). For example,repair.sh in=<fq1> in2=<fq2> out=<paired.R1.fq.gz> out2=<paired.R2.fq.gz> outs=<singleton output> -Xmx10g
In the next release of MitoZ, it will automatically detect such a problem and fix it on the fly.
The accuracy of gene ranges highly depends on our database. Therefore, if you find the CDS ranges of species can not be accurately determined by MitoZ, you may need to modify the database by yourself (see https://github.com/linzhi2013/MitoZ/wiki/Extending-MitoZ-s-database):
Manually correct the annotation results of MitoZ or NCBI's annotaion, and then add them to the MitoZ's database. See https://github.com/linzhi2013/MitoZ/wiki/Extending-MitoZ-s-database. You might need to remove other NCBI's sequences in the database, i.e., creating your own custom database.
About:
Commands:
- The -all- subcommand
- The -filter- subcommand
- The -assemble- subcommand
- The -findmitoscaf- subcommand
- The -annotate- subcommand
- The -visualize- subcommand
Usages:
- Installation
- Tutorial
- Extending MitoZ-s database
- Batch processing of many samples
- Known issues
- FAQ
- Some important intermediate files
- Upload to GenBank
MitoZ-tools:
- Overview: The -mitoz tools- command
- The -mitoz-tools--group_seq_by_gene- command
- The -mitoz tools bold_identification- command
- The -mitoz tools circle_check- command
- The -mitoz tools gbfiletool- command
- The -mitoz tools gbseqextractor- command
- The -mitoz tools msaconverter- command
- The -mitoz tools taxonomy_ranks- command