Continue crashed analysis from tree inference step #61

diegomarquezp · 2021-08-09T19:36:24Z

Hello Siavash. Hoping you are well.
I'm reaching the final steps to build a tree from the Silva 13.8 dataset.
Unfortunately, it crashed during the tree inference step.
The single iteration of the realignment step took a bit more than 3 weeks. I checked the wiki for a way to continue from the alignment produced by this step, but apparently the --aligned option still runs a realignment step.
I did some time estimations with subsets of the Silva database and it should take about 3 more days before finishing the tree, only if we manage to skip the realignment, otherwise it would be 3 more weeks again.

I was wondering if I'm missing an option from the wiki to continue from this substep of the iteration. Otherwise, I can try to modify the code to provide the last alignment to the first iteration. If that's the case, I will need to kindly ask you to refer me to the involved files in this change or any development documentation to aid in solving this situation.

Update: I found out that the inference step consists of a call to fasttreeMP - the debug output shows the exact args to execute the binary with. I'm thinking that the final steps would involve running a modified version of treeholder.py

Thanks beforehand for your help.


smirarab · 2021-08-09T20:00:03Z

To be clear, you ran PASTA using the alignment from the previous stage as input (-i)? And it still tried to do an alignment? Also, are you planning to do one iteration or more? If you want to have only one iteration, there is no reason to run FastTree inside PASTA. You can just run it outside.


diegomarquezp · 2021-08-09T20:10:40Z

Last month I started pasta with -i pastajob_temp_iteration_initialsearch_seq_alignment.txt --aligned
The last file produced in the folder today was pastajob_temp_iteration_0_seq_alignment.txt

So is it possible to just obtain the final tree with fasttreeMP from ...iteration_0_seq_alignment.txt ?

From the subset tests logs, I'm guessing the command below would be useful for this last step?:

/home/ec2-user/pasta-code/pasta/bin/fasttreeMP -quiet -nt -gtr -gamma -fastest -intree /home/ec2-user/.pasta/pastajob/tempBaXNdl/step0/mincluster/tempfasttreeGOBzhJ/start.tre -log /home/ec2-user/.pasta/pastajob/tempBaXNdl/step0/mincluster/tempfasttreeGOBzhJ/log /home/ec2-user/.pasta/pastajob/tempBaXNdl/step0/mincluster/tempfasttreeGOBzhJ/input.fasta

(assuming input.fasta == ...iteration_0_seq_alignment.txt)

edit: Yes, it did perform a realignment step (one iteration)

Thank you!

smirarab · 2021-08-10T00:31:16Z

Hi Diego,

Yes, you can just obtain the final tree with fasttreeMP from ...iteration_0_seq_alignment.txt. Note that default PASTA does 3 iterations, but most of the advantage comes from the first iteration. The result of running fasttree on this alignment will give you the result of the first iteration. I think given the size of your dataset, one iteration is reasonable and sufficient. If you wanted to do one more iteration, you can, but not necessary (I think).

Three more caveats.

1. ...iteration_0_seq_alignment.txt is already masked to remove super gappy sites. The unmasked file is also available (...temp_iteration_0_seq_unmasked_alignment.gz) but you don't want to give that to FastTree. Also, it is in a format that needs translation (more on that below). However, you may want to see how long ...iteration_0_seq_alignment.txt is and you may decide to mask even a bit more; the default is to mask a site if it is a gap in >99.9% of species.
2. The ...iteration_0_seq_alignment.txt file uses PASTA's internal names for species. These names can be translated back to the original names using a simple text file (._temp_name_translation.txt) and a command that I will send you.
3. If you are going to use the FastTree tree as your final tree, you may want to eliminate the starting tree (/home/ec2-user/.pasta/pastajob/tempBaXNdl/step0/mincluster/tempfasttreeGOBzhJ/start.tre). It will take a bit longer, but that should be fine.

For both 1 and 2, I have scripts that are shipped as part of PASTA. Let me write a quick markdown file and describe these. In the meantime, you can start your FastTree run. I hope to get to this in a day or so.

Thanks
Siavash
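For readers following along, the FastTree run described above could be assembled roughly as follows. This is only a sketch: the flags are copied from the debug output quoted earlier in the thread, while the helper name `fasttree_command`, the bare `fasttreeMP` binary name on the PATH, and the log file name are assumptions made for illustration. Leaving `start_tree` unset drops `-intree`, which corresponds to caveat 3.

```python
import shlex


def fasttree_command(binary, alignment, log_path, start_tree=None):
    """Build a FastTree argument list using the flags seen in the PASTA debug output.

    start_tree is optional: when it is None, -intree is omitted and FastTree
    builds its own starting tree (slower, but independent of PASTA's tree).
    """
    args = [binary, "-quiet", "-nt", "-gtr", "-gamma", "-fastest"]
    if start_tree is not None:
        args += ["-intree", start_tree]
    args += ["-log", log_path, alignment]
    return args


# Hypothetical file names; substitute the real paths from the PASTA job folder.
cmd = fasttree_command(
    "fasttreeMP",
    "pastajob_temp_iteration_0_seq_alignment.txt",
    "fasttree.log",
)
print(shlex.join(cmd))
```

FastTree writes the resulting tree to standard output, so an actual run would redirect it to a file, for example with `subprocess.run(cmd, stdout=open("final.tre", "w"), check=True)`. The tree will still carry PASTA's internal names until they are translated back (caveat 2).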


diegomarquezp · 2021-08-10T03:53:38Z

Thanks so much Siavash. I'll let you know about how fasttree goes.

smirarab · 2021-08-10T22:40:43Z

Diego, I added information about getting the unmasked alignment from the PASTA temporary files and name mapping here: https://github.com/smirarab/pasta/blob/master/pasta-doc/pasta-tutorial.md#step-6-using-run_seqtoolspy and in particular https://github.com/smirarab/pasta/blob/master/pasta-doc/pasta-tutorial.md#restart-pasta-from-the-previous-runs


diegomarquezp · 2021-08-22T19:38:37Z

Hi Siavash.

Thanks for the added steps on the tutorial.
I was able to restart the crashed step using the first and only iteration's alignment and finally obtained a tree this week.
With the tree and aligned sequences, I tried to run SEPP on them through QIIME after importing the PASTA results, but it took far too long (15+ hours) compared with the SEPP reference database that you published for SILVA 12.8 (40 minutes).
I noticed that the aligned sequences contained in the 12.8 QZA file are only a subset of the whole 12.8 reference database.
I was wondering if you used any special criteria to extract the subset. Would restricting the aligned sequence set to 2-3 sequences per species do the job? That would roughly match the number of sequences in 12.8.
That would be the only step needed, as we already have the alignment.

Thank you very much.

smirarab · 2021-08-26T17:54:01Z

Hi Diego,

There are two potential reasons.

1. Default output of PASTA is not masked for super gappy sites. There are many sites that have just a couple of letters in them among millions of species. We need to remove those before using them as input to SEPP. For removing gappy sites, I suggest you use the run_seqtools.py method that you learned about in the tutorial. I would remove sites with 99.9% gaps or 99% gaps. You can try different thresholds and see how many sites are left in the final alignment. You should hopefully have something in the same order as 12.8 (thousands of sites).
2. Once (1) is taken care of, if the running time is still high, we can think about removing sequences that are too similar to each other. For doing that, I would suggest 99% similarity or something like that. You can also use our tool TreeCluster (https://github.com/niemasd/TreeCluster) to find the optimal subset given the tree you already have.
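To make the threshold comparison in (1) concrete, here is a small, self-contained sketch of what masking gappy sites means. It is not the run_seqtools.py implementation; the function names are hypothetical, the alignment is a toy one held in memory, and a real run on millions of sequences should use the PASTA script instead. A site is kept when its gap fraction is at most the threshold, matching the "mask a site if it is a gap in more than X% of species" rule.

```python
def gap_fractions(seqs, gap_chars="-"):
    """Return the fraction of gap characters in each alignment column."""
    if not seqs:
        raise ValueError("alignment is empty")
    length = len(seqs[0])
    counts = [0] * length
    for seq in seqs:
        if len(seq) != length:
            raise ValueError("all aligned sequences must have the same length")
        for i, ch in enumerate(seq):
            if ch in gap_chars:
                counts[i] += 1
    return [c / len(seqs) for c in counts]


def sites_kept(fractions, threshold):
    """Count columns whose gap fraction does not exceed the threshold."""
    return sum(1 for f in fractions if f <= threshold)


def mask_alignment(seqs, fractions, threshold):
    """Drop every column whose gap fraction exceeds the threshold."""
    keep = [i for i, f in enumerate(fractions) if f <= threshold]
    return ["".join(seq[i] for i in keep) for seq in seqs]


# Toy alignment (hypothetical data) with column gap fractions 0, 0.5, 0.75, 1.0.
toy = ["AC--", "A-G-", "A---", "AC--"]
fractions = gap_fractions(toy)
for threshold in (0.5, 0.75):
    print(threshold, sites_kept(fractions, threshold))
masked = mask_alignment(toy, fractions, 0.5)
```

On real data the same comparison would be made at 0.999 and 0.99, checking how many sites remain at each threshold before choosing one.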

Thanks

diegomarquezp · 2021-08-26T21:17:09Z

Hi Siavash, thanks for the response. I will let you know about this

smirarab mentioned this issue Sep 2, 2021

Silva 138? smirarab/sepp-refs#3
