Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Another ZeroDivisionError: float division by zero #38

Closed
dbrowneup opened this issue Jul 19, 2016 · 10 comments
Closed

Another ZeroDivisionError: float division by zero #38

dbrowneup opened this issue Jul 19, 2016 · 10 comments

Comments

@dbrowneup
Copy link

Encountered a new error while running a scaffold job yesterday. Here's the setup:

#BSUB -J BbB_PB_BESST_v1 -L /bin/bash -W 40:00
#BSUB -n 5 -R "span[ptile=5] rusage[mem=2700] select[nxt]" -M 2700
#BSUB -o OUT_RUN_BESST_v1
##
ml BESST
ml mathstats
ml pysam
ml networkx
ml matplotlib/1.5.0-intel-2015B-Python-2.7.10
#
#BESST commit e2aa3abb6ef759f8c08ec60cfe7e0060cbaa0736
#
cd /scratch/user/dbrowne/2016.07_JUL/2016.07.15_BESST_Scaffolding_HA_PB_Assembly
#
/scratch/user/dbrowne/Software/BESST/runBESST -c polished_BOT6.NM2kb.fixed.full.fasta -f BWA_SXPX.bam BWA_NGNB.bam BWA_HOOW.bam -o ./BESST_OUTPUT_v1 --orientation fr rf rf --iter 1500000 -plots --separate_repeats --min_mapq 40 --dfs_traversal --no_score -max_contig_overlap 10000

And the output:

Number of initial contigs: 3275
Choosing mode: 843
mu_adjusted:758.764147704, sigma_adjusted:140.391113055, skewness_adjusted:-1.43886745031
Creating contig graph with library:  BWA_SXPX.bam
Traceback (most recent call last):
  File "/scratch/user/dbrowne/Software/BESST/runBESST", line 415, in <module>
    main(args)
  File "/scratch/user/dbrowne/Software/BESST/runBESST", line 177, in main
    (G, G_prime) = CG.PE(Contigs, Scaffolds, Information, C_dict, param, small_contigs, small_scaffolds, bam_file)      #Create graph, single out too short contigs/scaffolds an
d store them in F
  File "/scratch/user/dbrowne/Software/BESST/BESST/CreateGraph.py", line 256, in PE
    Contigs, Scaffolds, G = RepeatDetector(Contigs, Scaffolds, G, param, G_prime, small_contigs, small_scaffolds, Information)
  File "/scratch/user/dbrowne/Software/BESST/BESST/CreateGraph.py", line 986, in RepeatDetector
    GO.repeat_contigs_logger(Repeats, Contigs, param.output_directory, small_contigs, param)
  File "/scratch/user/dbrowne/Software/BESST/BESST/GenerateOutput.py", line 63, in repeat_contigs_logger
    print >> repeat_logger_file, "{0}\t{1}\t{2}\t{3}\t{4}\t{5}".format(cont_obj.name, cont_obj.length, round(cont_obj.coverage,1), round(cont_obj.coverage/param.mean_coverage,0
), round(param.mean_ins_size,0), placable)
ZeroDivisionError: float division by zero
@dbrowneup
Copy link
Author

Oh wait hold on... reading through old bugs and stuff, I realized that I'm operating on an old version of mathstats:

[dbrowne@ada4 2016.07.15_BESST_Scaffolding_HA_PB_Assembly]$ ml spider mathstats

----------------------------------------------------------------------------------------------------------------------------------------------------------------------------
  mathstats: mathstats/0.2.2-intel-2015B-Python-2.7.10
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------

    This module can be loaded directly: module load mathstats/0.2.2-intel-2015B-Python-2.7.10

    Help:
       Composable style cycles - Homepage: https://pypi.python.org/pypi/mathstats

Let me try and get that updated. Then hopefully I can close this issue!

@dbrowneup
Copy link
Author

dbrowneup commented Jul 19, 2016

Okay, unfortunately, updating to mathstats 0.2.5 and retrying the command did not work, resulting in exactly the same error.

Here's the problematic code: https://github.com/ksahlin/BESST/blob/master/BESST/GenerateOutput.py#L63

Specifically, this part of the code is where the problem is coming from:

round(cont_obj.coverage/param.mean_coverage,0)

Replicating the error in IPython:

In [10]: round(1.0/0,0)
---------------------------------------------------------------------------
ZeroDivisionError                         Traceback (most recent call last)
<ipython-input-10-87eba751f4c6> in <module>()
----> 1 round(1.0/0,0)

ZeroDivisionError: float division by zero

In [11]: round(1/0.0,0)
---------------------------------------------------------------------------
ZeroDivisionError                         Traceback (most recent call last)
<ipython-input-11-11b0b1965d8a> in <module>()
----> 1 round(1/0.0,0)

ZeroDivisionError: float division by zero

Therefore param.mean_coverage must be set to zero for some reason. I confirmed this by looking at the Statistics.txt file after the program exits:

Parsing BAM file...
L50:  320 N50:  175683 Initial contig assembly length:  201800654
Time initializing BESST objects:  0.015793800354
Total time elapsed for initializing Graph:  0.0181450843811
Reading bam file and creating scaffold graph...
ELAPSED reading file: 1827.91752005
NR OF FISHY READ LINKS:  0
Number of USEFUL READS (reads mapping to different contigs uniquly):  2378713
Number of non unique reads (at least one read non-unique in read pair) that maps to different contigs (filtered out from scaffolding):  18487313
Reads with too large insert size from "USEFUL READS" (filtered out):  2344045
Initial number of edges in G (the graph with large contigs):  0
Initial number of edges in G_prime (the full graph of all contigs before removal of repats):  5304
Number of duplicated reads indicated and removed:  24399
Mean coverage before filtering out extreme observations =  498.836133553
Std dev of coverage before filtering out extreme observations=  1202.33071831
Mean coverage after filtering =  0.0
Std coverage after filtering =  0.0
Length of longest contig in calc of coverage:  5642686
Length of shortest contig in calc of coverage:  1327
Detecting repeats..

This parameter is set here:

param.mean_coverage = mean_cov

Where we see param.mean_coverage = mean_cov. The mean_cov variable comes from the previous line of code:

mean_cov, std_dev_cov = CalculateMeanCoverage(Contigs, Information, param)

mean_cov, std_dev_cov = CalculateMeanCoverage(Contigs, Information, param)

This leads us to the CalculateMeanCoverage function:

def CalculateMeanCoverage(Contigs, Information, param):

It looks like something is going wrong in this section of the function:

## SMOOTH OUT THE MEAN HERE by removing extreme observations ##

The mean_cov variable is set again here:

mean_cov = sum(filtered_list) / n

It is dependent on the filtered_list object which is calculated by the RemoveOutliers function:

def RemoveOutliers(mean_cov, std_dev, cov_list):

Nothing really jumps out at me in the code as to what is going wrong, but perhaps it will be more obvious to you, Kristoffer! Hopefully my bit of digging around is helpful to you. I'd like to get this fixed ASAP, because I need to finish scaffolding this assembly! :P

@dbrowneup dbrowneup changed the title ZeroDivisionError: float division by zero Another ZeroDivisionError: float division by zero Jul 19, 2016
@ksahlin
Copy link
Owner

ksahlin commented Jul 19, 2016

Very good digging indeed. This is indeed an issue of the filtering of outliers in the coverage histogram of contigs -- something I reported here butnever fixed. It would be helpful to see how your particular contig coverage histogram looks -- Do you have a coverage histogram of the contigs (can be obtained e.g. by running besst with the --plot flag)? One way to estimate some "baseline" coverage is to estimate the mode or median in the cases where the filtering of outliers function converges to 0 for the mean coverage. But again, it would be better to see how such a histogram looks like to find a good approach to prevent it. Actually printing the vector (list?) with coverage values we could work directly with those values.

@dbrowneup
Copy link
Author

Here is the coverage histogram:
contigs_coveragebwa_sxpx bam

The assembly was made with FALCON using only PacBio sequencing data. I'm using BESST to scaffold it with the Illumina libs I've got.

I've done this before on a different PacBio assembly, without issue. Here's the coverage histogram from that run:
contigs_coveragesxpx_pb_denovo_bwa bam

And the resulting gap predictions:

gappredictions

I'm thinking the coverage data is just too skewed in the new assembly, so that essentially everything is considered an outlier?

Maybe I can manually specify the coverage cutoff with -z? I'm curious to try setting -z_min 1 to get rid of everything with 0 coverage. However, something I noticed a while ago is that removing low-coverage sequences actually decreased scaffold quality substantially. But that was dealing with DBG assembly of Illumina data.

@dbrowneup
Copy link
Author

Well, I got it to work with this setup:

#BSUB -J BbB_PB_BESST_v2 -L /bin/bash -W 40:00
#BSUB -n 5 -R "span[ptile=5] rusage[mem=2700] select[nxt]" -M 2700
#BSUB -o OUT_RUN_BESST_v2
##
export PYTHONPATH=$PYTHONPATH:/general/home/dbrowne/.local/lib/python2.7/site-packages
ml Python/2.7.10-intel-2015B
ml HTSlib/1.2.1-intel-2015B
ml SAMtools/1.2-intel-2015B-HTSlib-1.2.1
ml pysam/0.8.3-intel-2015B-Python-2.7.10
ml networkx/1.10-intel-2015B-Python-2.7.10
ml matplotlib/1.5.0-intel-2015B-Python-2.7.10
#
#BESST commit e2aa3abb6ef759f8c08ec60cfe7e0060cbaa0736
#
cd /scratch/user/dbrowne/2016.07_JUL/2016.07.15_BESST_Scaffolding_HA_PB_Assembly
#
/scratch/user/dbrowne/Software/BESST/runBESST -c polished_BOT6.NM2kb.fixed.full.fasta -f BWA_SXPX.bam BWA_NGNB.bam BWA_HOOW.bam -o ./BESST_OUTPUT_v2 --orientation fr rf rf --iter 1500000 -plots --separate_repeats --min_mapq 40 --dfs_traversal --no_score -max_contig_overlap 10000 -z 900 -z_min 1 -filter_contigs 10000

Output attached:
BESST_OUTPUT_v2.zip

@YoannAnselmetti
Copy link

Hi Kristoffer,

I copy-cut my text from issue #42 that is now closed.

I get exactly the same error with version 2.2.5 that Ryan Chikhi in issue #42.
But it's very strange since my long insert size library (~35kb) works but not my MP (1.5kb) and PE (180bp) libraries...

I'm using BESST 2.2.5 with option "-z 10000".

I get two kinds of error:

  • First (equivalent to Ryan):

Traceback (most recent call last):
File "/home/yanselmetti/.local/bin/runBESST", line 415, in
main(args)
File "/home/yanselmetti/.local/bin/runBESST", line 177, in main
(G, G_prime) = CG.PE(Contigs, Scaffolds, Information, C_dict, param, small_contigs, small_scaffolds, bam_file)
#Create graph, single out too short contigs/scaffolds and store them in F
File "/home/yanselmetti/.local/lib/python2.7/site-packages/BESST/CreateGraph.py", line 276, in PE
infer_spurious_link_count_threshold(G_prime, param)
File "/home/yanselmetti/.local/lib/python2.7/site-packages/BESST/CreateGraph.py", line 325, in infer_spurious_lin
k_count_threshold
link_params = e_nr_links.Param(param.mean_ins_size, param.std_dev_ins_size, cov, param.read_len, 0)
File "/home/yanselmetti/.local/lib/python2.7/site-packages/BESST/e_nr_links.py", line 65, in init
self.readfrequency = 2 * self.read_len / self.cov
ZeroDivisionError: float division by zero

  • Second (equivalent to error of this issue):

Traceback (most recent call last):
File "/home/yanselmetti/.local/bin/runBESST", line 415, in
main(args)
File "/home/yanselmetti/.local/bin/runBESST", line 177, in main
(G, G_prime) = CG.PE(Contigs, Scaffolds, Information, C_dict, param, small_contigs, small_scaffolds, bam_file)
#Create graph, single out too short contigs/scaffolds and store them in F
File "/home/yanselmetti/.local/lib/python2.7/site-packages/BESST/CreateGraph.py", line 256, in PE
Contigs, Scaffolds, G = RepeatDetector(Contigs, Scaffolds, G, param, G_prime, small_contigs, small_scaffolds, I
nformation)
File "/home/yanselmetti/.local/lib/python2.7/site-packages/BESST/CreateGraph.py", line 986, in RepeatDetector
GO.repeat_contigs_logger(Repeats, Contigs, param.output_directory, small_contigs, param)
File "/home/yanselmetti/.local/lib/python2.7/site-packages/BESST/GenerateOutput.py", line 63, in repeat_contigs_l
ogger
print >> repeat_logger_file, "{0}\t{1}\t{2}\t{3}\t{4}\t{5}".format(cont_obj.name, cont_obj.length, round(cont_o
bj.coverage,1), round(cont_obj.coverage/param.mean_coverage,0), round(param.mean_ins_size,0), placable)
ZeroDivisionError: float division by zero

What should I do to avoid tehse errors?
Should I increase value of "-z" option?

Thanks you in advance for your answer,
Yoann

@ksahlin
Copy link
Owner

ksahlin commented Mar 2, 2017

Hi Yoann,

I implemented a fix for this in the latest develop version 7a0efa6. Would you mind trying this commit and see if it fixes your issue?

The fix removes all contigs with 0 in coverage before calculating average coverage. Contigs with 0 coverage (no mappings) will, by definition, not be scaffolded anyway. I consequently needed to change the output to be coverage estimates of "all contigs with at least one good mapping".

Hopefully this will also fix Dan's issue with the first histogram shown above. Too many contigs with 0 coverage will make the "Removing outlier" algorithm converge to 0.

@YoannAnselmetti
Copy link

Hi Kristoffer,

Thanks for this fix, I will try it on my experiments that returned a "float division by zero" error.

Before your reply, I modified the parameters use for my experiments:
I added "-z_min 1" option and increased "-z 10000" option to "-z 100000".
And now it doesn't send me anymore a "ZeroDivisionError" for my experiments!

Does "-z_min 1" option is equivalent to not consider contigs with 0 coverage (and then avoid "ZeroDivisionError")?

@ksahlin
Copy link
Owner

ksahlin commented Mar 3, 2017

Ok great!

No it doesn't. Because contigs can have coverage between 0 and 1.

@ksahlin
Copy link
Owner

ksahlin commented Mar 3, 2017

I implemented so that -z_min now can take a float value, with default of 0.001 in 92a81f7

@ksahlin ksahlin closed this as completed Feb 1, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants