fastANI is not working #96
Comments
Yes, you need to install it separately and need to make sure it's in your system PATH. |
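As a minimal sketch of what "in your system PATH" means in practice, the snippet below uses Python's standard `shutil.which` to look up an executable the same way a shell would. The helper names `find_tool` and `report` are hypothetical, and the output format only loosely imitates the dependency report quoted in this thread; this is not dRep's own check.

```python
import shutil


def find_tool(name):
    """Return the absolute path of `name` if it is found on PATH, else None."""
    return shutil.which(name)


def report(names):
    """Build one status line per tool (hypothetical format, not dRep's)."""
    lines = []
    for name in names:
        path = find_tool(name)
        if path:
            status = "all good (location = {})".format(path)
        else:
            status = "not found on PATH"
        lines.append("{} {}".format(name.ljust(12, "."), status))
    return lines


if __name__ == "__main__":
    print("\n".join(report(["mash", "fastANI"])))
```

If `fastANI` is reported as not found, adding the directory that contains the binary to `PATH` before running dRep is the usual fix.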
Hi Dr. Olm, thanks for the reply. I just ran the code and see: mash.................................... all good (location = /install/software/anaconda3.6.b/bin/mash) May I ask: if I want to use fastANI as the S_algorithm with dRep compare or dereplicate, do all three "ERROR" tools need to be installed? fastANI for sure, but what about the other two? If yes, are the following links the correct ones to download from? Many thanks for your reply. |
Hello, Nope, to use fastANI you only need to correct the fastANI option. The other two are used for other S_algorithms |
Hi Dr. Olm, it is great to know that. I have only fixed the fastANI option, and now it looks good. I have run dRep compare with fastANI, and I see that the fastANI flag '--minFraction' is set to 0, although its default is 0.2. May I ask whether '--minFraction' is indeed set to 0, or am I misunderstanding something? 11-09 17:51 DEBUG running cluster 1 Thanks |
Hello, Yes, the |
Hi, thanks for reply. It is good to have the default value used. I will see it when the running is finished. |
Hi Dr. Olm, I have run dRep compare with fastANI on 50 processors, and the log says that 1200.5 min is needed, as shown below. However, the job has already been running for 45 h (2700 min) and is still going. May I ask whether the time given in the log is only an estimate, so that the real run time may be longer? Thanks |
Hello, Yes it is indeed just an estimate. Depending on random factors like genome size, the specifics of the genomes being compared, and the number of cores being used, the actual time can vary a bit. The parallelization also doesn't scale perfectly; 50 cores will take more than half as long as 25 cores, but the simple formula used to estimate how long the job will take doesn't take this into account. Best, |
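To illustrate why a simple estimate can undershoot, here is a rough sketch that contrasts a purely linear estimate with one based on Amdahl's law. This is not dRep's actual formula. The per-comparison time (0.4 s) and the parallel fraction (0.99) are made-up values chosen only for illustration, and both function names are hypothetical.

```python
def linear_estimate_min(n_comparisons, cores, sec_per_comparison):
    """Naive estimate: total work split evenly across cores, in minutes."""
    return n_comparisons * sec_per_comparison / cores / 60.0


def amdahl_estimate_min(n_comparisons, cores, sec_per_comparison,
                        parallel_fraction):
    """Estimate in which only `parallel_fraction` of the work speeds up."""
    serial_min = n_comparisons * sec_per_comparison / 60.0
    speedup = 1.0 / ((1.0 - parallel_fraction) + parallel_fraction / cores)
    return serial_min / speedup


if __name__ == "__main__":
    n = 8999390    # number of comparisons reported in the log above
    sec = 0.4      # assumed seconds per comparison (illustrative only)
    for cores in (25, 50):
        print(cores,
              round(linear_estimate_min(n, cores, sec), 1),
              round(amdahl_estimate_min(n, cores, sec, 0.99), 1))
```

Under the linear model, doubling the cores halves the time exactly; under the second model the 50-core run takes noticeably more than half of the 25-core time, which matches the behaviour described above.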
Hi Dr. Olm,
Running 8999390 fastANI comparisons- should take ~ 1200.5 min In the Cdb.csv, there is no cluster 592 record. May I ask why?
Plotting MDS plot May I ask why? I think these issues have no impact on the clustering, and the files in data_tables are totally fine for subsequent analysis, right?
Sorry for the many questions, and thanks for your time. |
Hello,
Alternatively you could run dRep again in debug mode (add
Best, |
Hi Dr. Olm, Thanks for reply. May I further ask based on your reply?
Thanks. |
Hello,
-Matt |
Hi Dr. Olm, Thank you for reply.
Sorry may I introduce a new question: Thanks |
Hello,
3.1) Yes, your understanding is correct. 3.2) I'm a little confused by the question but can explain a few things. The purpose of primary clustering (done with Mash, which is fast) is to reduce the number of comparisons that have to be done with fastANI (which is slow). The only genomes that are ever compared with FastANI are those that are in the same primary cluster by Mash; genomes in different primary clusters will never be in the same secondary clusters. Mash is not very accurate however (especially with incomplete genomes), so that's why by default dRep casts a wide net with Primary clustering, and then makes secondary clusters with fastANI. If you'd like to compare Mash and fastANI directly, you'll need to run dRep once with the command A bit more detail on this can be found here: https://drep.readthedocs.io/en/latest/choosing_parameters.html -Matt |
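The effect of primary clustering on the workload can be sketched as follows. The example counts ordered genome pairs restricted to each primary cluster and compares that with an all-vs-all run. Counting ordered pairs including self-pairs (n × n per cluster) is an assumption made for this illustration, and the function names and toy data are hypothetical.

```python
from collections import Counter


def comparisons_within_clusters(primary_cluster_of):
    """Count ordered pairs (self-pairs included) inside each primary cluster."""
    sizes = Counter(primary_cluster_of.values())
    return sum(n * n for n in sizes.values())


def comparisons_all_vs_all(primary_cluster_of):
    """Count ordered pairs if every genome were compared with every genome."""
    n = len(primary_cluster_of)
    return n * n


if __name__ == "__main__":
    # Toy mapping of genome -> primary (Mash) cluster.
    toy = {"a": 1, "b": 1, "c": 2, "d": 3, "e": 3, "f": 3}
    print(comparisons_within_clusters(toy), "vs", comparisons_all_vs_all(toy))
```

Genomes in different primary clusters never contribute a pair, which is why the slower algorithm only sees a fraction of the full comparison matrix.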
Hi Dr. Olm,
It ran without error: Reference = [bin.7.fna] INFO [thread 0], skch::main, Count of threads executing parallel_for : 50 However, there is nothing in the fastANI_out_592 output. I think there should be one comparison inside. I ran another cluster that also has only one genome and that worked with dRep; there is one comparison in its fastANI_out, and the log is the same as above. I am confused: is it due to the bin file itself? Could you have a look at where the problem is? Thanks |
...interesting. Are you sure that cluster 592 only has a single genome in it? The size of Bdb.csv (measured with the command Thank you, |
Hi Dr. Olm, Yes, there is only a one-genome difference between Bdb.csv (20001) and Cdb.csv (20000). I ran 20000 genomes together; with the title line, that is why Bdb.csv has 20001 lines. Indeed, I have many fastANI outputs with only one genome in the /data/fastANI_files/ folder. The top 10 lines from Cdb.csv:
The version of fastANI I used is 1.32, and the version of dRep is 2.6.2, and run dRep bonus test --check_dependencies
Sorry about the new problem, and thanks a million for your time. |
Hello, OK interesting- thank you for pointing this out to me. You are indeed correct and apologies for not understanding how my own program works. Unfortunately though I don't think I'll be able to help with this problem, since it seems to be an issue with FastANI. You can try and post an issue on the FastANI page (https://github.com/ParBLiSS/FastANI/issues), but they don't seem to be answered very often. However, in future versions of dRep I will have it assign genomes that are in a primary cluster by themselves to a cluster without needing to run fastANI. Best, |
Hi Dr. Olm,
Thanks |
Hello, Yes, you are correct on both points. FastANI failing doesn't matter in this case, because no matter what that genome would be cluster 592. -Matt |
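The idea of assigning lone genomes without running fastANI could look roughly like the sketch below. It is not dRep's implementation. The function name is hypothetical, and the `<primary>_0` label used for singletons is an assumption made for illustration.

```python
from collections import defaultdict


def split_singletons(primary_cluster_of):
    """Separate one-genome primary clusters from those needing pairwise ANI."""
    members = defaultdict(list)
    for genome, cluster in primary_cluster_of.items():
        members[cluster].append(genome)

    assigned = {}    # genome -> secondary cluster label, decided without ANI
    to_compare = {}  # primary cluster -> genomes that still need pairwise ANI
    for cluster, genomes in members.items():
        if len(genomes) == 1:
            # Assumed naming scheme for a singleton's secondary cluster.
            assigned[genomes[0]] = "{}_0".format(cluster)
        else:
            to_compare[cluster] = sorted(genomes)
    return assigned, to_compare


if __name__ == "__main__":
    toy = {"bin.7.fna": 592, "bin.1.fna": 1, "bin.2.fna": 1}
    print(split_singletons(toy))
```

With this split, a failed comparison on a single-genome cluster can no longer leave that genome out of the final table.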
Hi Dr. Olm, Thank you very much for reply fast and now it does make sense to me. Thanks again. |
Hi Dr. Olm, FYI, I just searched for the answer to my fastANI problem, and I found it here ParBLiSS/FastANI#21. I also changed the fragLen to something like 200, and this genome then worked with fastANI. A small question about the fastANI command in dRep, like the one below: the last 10 letters are just there to distinguish the fastANI outputs in the fastANI_files folder and have no effect on the fastANI run itself, right?
Thanks |
Interesting. Thank you for pointing this out to me. Yes, the last 10 letters are randomly generated and just used to distinguish output files. -Matt |
Hi Dr. Olm, Thank you for reply. Anyway, this is not a problem in the case of one genome in a cluster. Thanks again. |
Hi Dr. Olm, One question about the output of running 30,000 genomes together, with
There is no error reported for Step 1. Cluster
However, in the Cdb, one genome is missing compared to Bdb (there are 30,000 genomes). I also checked the number of unique primary clusters in Cdb, and it is correct at 4362. I also checked the number of comparisons in Mdb, and it is correct at 59,999; however, there is no record in Ndb.
Thanks |
Hi Dr. Olm, I updated my question, can you look at it when you are available? |
Hello, This is an interesting problem and not something that I've encountered before. I don't see any way that the problem genome could impact anything else, so I believe the rest of the clustering output is still correct. I think this is probably an issue with FastANI, though it's not clear to me what the issue is. The only way I can think of to tell which primary cluster the missing genome is in is to re-run the program with the Best, |
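One quick way to at least identify which genome is missing is to compare the genome lists of the two tables. The sketch below assumes both Bdb.csv and Cdb.csv have a `genome` column; the function names are hypothetical. It accepts any iterable of lines, so it works on open file handles as well as on lists of strings.

```python
import csv


def genomes_in(lines, column="genome"):
    """Collect the values of `column` from CSV text given as lines."""
    return {row[column] for row in csv.DictReader(lines)}


def missing_genomes(bdb_lines, cdb_lines):
    """Return genomes listed in Bdb but absent from Cdb, sorted by name."""
    return sorted(genomes_in(bdb_lines) - genomes_in(cdb_lines))


if __name__ == "__main__":
    # Paths are placeholders; adjust them to your own work directory.
    with open("data_tables/Bdb.csv", newline="") as bdb, \
            open("data_tables/Cdb.csv", newline="") as cdb:
        print(missing_genomes(bdb, cdb))
```

This only names the missing genome; it does not by itself say which primary cluster that genome belonged to.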
Hi Matt, Thanks for the reply. It is good that all the other clusters are fine. Best, |
Hi Matt, When I run ~50,000 genomes, there is an error:
I think this is due to the large number of genomes, right? Is there any other way to solve this problem without reducing the number of genomes? In addition, I can shorten my genome names to half their length. Thanks |
Yes, your understanding is correct. And yes, you can still treat the clustering as reliable even when warnings are present; I ignore the warnings in my own research -Matt |
Thanks a million, Matt. Best, Wang |
Hi Matt, I have sent one email to your gmail obtained from your github homepage, which is about the paper https://www.nature.com/articles/s41467-017-02018-w, may I ask whether you know it? Thanks, Wang |
Hi Matt, making plots 1, 2, 3, 4 I think this does not matter for the clustering results, right? |
Yes, that will not impact clustering results. In response to your previous question, yes I know about that paper. It uses methods from the pre-inStrain days. Do you have a question about it? |
Hi Matt, As for the other question, I have a question about the metadata for four samples and sent it to [email protected]; did you receive it? Thanks |
Sorry to follow up on this question: if the plot for primary clustering failed, why do I still find 'Primary_clustering_dendrogram.pdf' in the 'figures' folder? |
The figure in that .pdf will just be broken; when the .pdf creation crashes in the middle of creation (as it did in this case) it still leaves some junk behind. I don't believe I got the email. Maybe try re-sending to [email protected] |
So, the Primary_clustering_dendrogram.pdf in this case is not complete, right? and this failure to make plot is due to too large number of genomes? Thanks |
Hi Matt, do I understand correctly? Thanks, Wang |
I think the figure is not complete, but I'm not entirely sure what happens when matplotlib encounters that problem |
Hi Matt, okay, thanks. I think this failure to make the plot is due to the large number of genomes, right? When I ran fewer genomes, this plot was made without failure. |
Yes exactly |
Good to learn. Anyway, the unaffected clustering results are the most important. Many thanks. |
Hi Matt, May I ask a question about the memory used? If I run 30000 genomes at once (~90G in size) with 40 threads, how much memory and storage (considering the intermediate files) has to be used approximately? Thanks |
Hello, I think I found a bug in dRep related to running it with the -l option set below 3000. I am trying to dereplicate a collection of viral genomes (many of them shorter than 2k), so I run dRep as such: This results in the majority of my genomes failing the fastANI step, and I think it's because the default FragmentLength for fastANI is 3000, which should be lowered when running dRep with -l 1000 (maybe FragmentLength should be set equal to whatever is set for -l if it's less than 3000?). Or there should be an option in the dRep command to set options in the fastANI command. |
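The suggestion above (cap the fragment length at the minimum genome length) might be sketched like this. It reflects the commenter's proposal, not what dRep actually does. The function names are hypothetical, while `-q`, `-r`, `-o`, `-t` and `--fragLen` are fastANI command-line options.

```python
def choose_frag_len(min_genome_length, default_frag_len=3000):
    """Use fastANI's default fragment length unless genomes may be shorter."""
    return min(default_frag_len, min_genome_length)


def build_fastani_cmd(query, reference, output, min_genome_length, threads=1):
    """Assemble a fastANI command line with an adjusted --fragLen."""
    frag_len = choose_frag_len(min_genome_length)
    return ["fastANI", "-q", query, "-r", reference, "-o", output,
            "--fragLen", str(frag_len), "-t", str(threads)]


if __name__ == "__main__":
    # File names are placeholders for illustration.
    print(" ".join(build_fastani_cmd("virus_a.fna", "virus_b.fna",
                                     "out.tsv", min_genome_length=1000)))
```

With a minimum length of 1000 the command carries `--fragLen 1000`; with a minimum length of 50000 it keeps the default of 3000.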
Hello, Can you please confirm you’re running the most up-to-date versions of dRep and fastANI? I remember this being an issue that I believe I fixed -Matt |
Yes, I am running the most up-to-date versions (I think) of dRep (v3.4.0) and fastANI (v1.33).
Ahhh OK I see. I used to have an option to set the fragment length, but it ended up being a problem with different version of inStrain. I will try and address this in the next dRep update- thank you for bringing it to my attention -MO |
Hello,
but fastANI is there
and
What is happening? Maybe it doesn't like the fastANI version? |
Hi @Gian77 - What happens when you try the command FastANI sometimes requires its own dependencies, which could be why this isn't working -Matt |
Hey @MrOlm Thanks for your fast answer on this. This is what comes out:
Gian |
Hi Gian, This is what I suspected. This has something to do with C++ and how fastANI is written, which is something I don't fully understand. Someone else was able to fix the problem here - #146 And a discussion of the problem on fastANI's GitHub can be found here - ParBLiSS/FastANI#96 Best, |
Thanks much! I dug this out... I was able to make fastANI work by installing the right gsl libraries in the conda environment
dRep worked after this fix. Gian |
Following on the discussion. Main problem: I have fastANI installed and it works. I try to run:
But if I do:
Which basically means it returns an error when called this way, so dRep cannot proceed even though fastANI is working. Error:
Tool version:
|
Hi @AlessioMilanese - Thanks for tracking down this issue and posting here- I very much appreciate it! I will address in the next dRep update (or please feel free to submit a pull request if you're inclined). In the meantime, it seems that updating fastANI to new versions fixes this fastANI behavior. Thanks again! |
Thanks for the fast response, Matt. For some reason I cannot install a newer version of FastANI, but I will figure it out.
I'm not sure what is the best solution. I think I would add an option (like |
Hi @AlessioMilanese - yeah that would be a fine solution, or (if possible) just doing a different check for fastANI that doesn't return a 0 error code |
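A check that does not depend on the exit code could look something like the sketch below. It is not dRep's `get_exe`. The helper name `tool_responds` and its `expect` parameter are hypothetical; the standard-library calls are `shutil.which` and `subprocess.run`.

```python
import shutil
import subprocess


def tool_responds(cmd, expect=None, timeout=30):
    """Return True if the tool can be launched, whatever its exit code.

    If `expect` is given, that string must also appear in the output.
    """
    if shutil.which(cmd[0]) is None:
        return False
    try:
        result = subprocess.run(cmd, stdout=subprocess.PIPE,
                                stderr=subprocess.PIPE, timeout=timeout)
    except (OSError, subprocess.TimeoutExpired):
        return False
    output = (result.stdout + result.stderr).decode("utf-8", errors="replace")
    if expect is not None:
        return expect in output
    # Any output at all (for example help text) counts as a response.
    return bool(output.strip())


if __name__ == "__main__":
    print(tool_responds(["fastANI", "-h"]))
```

One caveat: a binary that fails to start because of missing shared libraries may still print a loader message, so passing a string known to occur in the tool's help text as `expect` makes the check stricter.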
Hi Dr. Olm,
I am using version 2.6.2. When running dRep compare with --S_algorithm fastANI, there is an error:
Clustering Step 1. Parse Arguments
Clustering Step 2. Perform MASH (primary) clustering
2a. Run pair-wise MASH clustering
2b. Cluster pair-wise MASH clustering
3355 primary clusters made
Step 3. Perform secondary clustering
Running 8999390 fastANI comparisons- should take ~ 1200.5 min
Traceback (most recent call last):
  File "/install/software/anaconda3.6.b/bin/dRep", line 33, in <module>
    controller.parseArguments(args)
  File "/install/software/anaconda3.6.b/lib/python3.6/site-packages/drep/controller.py", line 146, in parseArguments
    self.compare_operation(**vars(args))
  File "/install/software/anaconda3.6.b/lib/python3.6/site-packages/drep/controller.py", line 91, in compare_operation
    drep.d_workflows.compare_wrapper(kwargs['work_directory'],**kwargs)
  File "/install/software/anaconda3.6.b/lib/python3.6/site-packages/drep/d_workflows.py", line 96, in compare_wrapper
    drep.d_cluster.d_cluster_wrapper(wd, **kwargs)
  File "/install/software/anaconda3.6.b/lib/python3.6/site-packages/drep/d_cluster.py", line 80, in d_cluster_wrapper
    data_folder, wd=workDirectory, **kwargs)
  File "/install/software/anaconda3.6.b/lib/python3.6/site-packages/drep/d_cluster.py", line 215, in cluster_genomes
    ndb = compare_genomes(bdb, algorithm, data_folder, **kwargs)
  File "/install/software/anaconda3.6.b/lib/python3.6/site-packages/drep/d_cluster.py", line 921, in compare_genomes
    df = run_pairwise_fastANI(genome_list, working_data_folder, **kwargs)
  File "/install/software/anaconda3.6.b/lib/python3.6/site-packages/drep/d_cluster.py", line 1096, in run_pairwise_fastANI
    exe_loc = drep.get_exe('fastANI')
  File "/install/software/anaconda3.6.b/lib/python3.6/site-packages/drep/__init__.py", line 100, in get_exe
    assert False, "{0} isn't working- make sure its installed".format(name)
AssertionError: fastANI isn't working- make sure its installed
May I ask whether I should install fastANI separately? If yes, how can I make sure that dRep can call it? We already have FastANI 1.1 installed.
Thanks