Silva 138? #3

Open
jwdebelius opened this issue Sep 22, 2020 · 12 comments

Comments

@jwdebelius

Is it possible to either get scripts to do the alignment or get the new Silva 138 release? Silva updates about annually, and it would be really nice to be able to update things that rely on sepp and a consistent database alongside that.

@ahalhed

ahalhed commented Jan 27, 2021

Yes please! I am hoping to run a fragment insertion analysis with SILVA but my OTUs were picked using a SILVA 138 reference database prepared for QIIME2.

@ericsson-lab

Yes please! This would be amazing!

@valentynbez

Hello, I've been trying to recreate a SeppReferenceTree artefact pipeline for Silva 138.1 from the repo.

  1. I downloaded Exports/SILVA_138.1_SSURef_NR99_tax_silva_trunc.fasta and Exports/taxonomy/tax_slv_ssu_138.1.tre.gz from the latest SILVA release.
  2. Then I ran nw_topology -bI to prepare the tree.
    I am interested in creating a SeppReferenceDatabase for the V4 region specifically.

For the moment I am particularly puzzled with the masking step from here.

  • should I first filter reference sequences from SILVA to V4 region or do this masking step?
  • how should I choose the masking length properly?

Any help would be much appreciated!
Thank you.

@smirarab
Owner

smirarab commented Sep 2, 2021

It seems @diego92sigma6 has had some luck with this issue: smirarab/pasta#61. Perhaps he can chime in.

For the moment I am particularly puzzled with the masking step from here.

* should I first filter reference sequences from SILVA to V4 region or do this masking step?

* how should I choose the masking length properly?

That masking step is meant to remove super-gappy sites from the alignment (not just retaining V4).

  • If you are interested in only V4, I would first remove everything other than V4, then run that masking step to remove super gappy sites. Don't forget to re-estimate branch lengths after you do masking.
  • How you define super gappy is up to you. I chose to remove sites that are 99.5% or more gaps. With the new version of run_seqtools.py available from https://github.com/smirarab/pasta/ you can provide percentages directly.
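
To make the masking step concrete, here is a minimal sketch of removing super-gappy sites (columns) from an alignment. It is not the run_seqtools.py implementation; the function name `mask_gappy_sites`, the in-memory dict layout, and the choice to treat both `-` and `.` as gap characters are assumptions made for illustration.

```python
def mask_gappy_sites(alignment, max_gap_fraction=0.995, gap_chars="-."):
    """Drop alignment columns whose gap fraction is >= max_gap_fraction.

    `alignment` maps sequence names to equal-length aligned strings.
    Hypothetical helper for illustration; not part of PASTA or SEPP.
    """
    names = list(alignment)
    if not names:
        return {}
    length = len(alignment[names[0]])
    if any(len(alignment[name]) != length for name in names):
        raise ValueError("all aligned sequences must have the same length")

    # Count gap characters per column across all sequences.
    gap_counts = [0] * length
    for seq in alignment.values():
        for i, ch in enumerate(seq):
            if ch in gap_chars:
                gap_counts[i] += 1

    # Keep only the columns that are below the gap threshold.
    n = len(names)
    keep = [i for i, gaps in enumerate(gap_counts) if gaps / n < max_gap_fraction]
    return {name: "".join(seq[i] for i in keep) for name, seq in alignment.items()}


# Toy alignment: the two middle columns are 75% gaps.
toy = {"a": "AC-G", "b": "A--G", "c": "A-TG", "d": "A--G"}
masked = mask_gappy_sites(toy, max_gap_fraction=0.75)
```

For a V4-only reference, the same function would be applied after trimming the alignment to the V4 columns, and branch lengths would then be re-estimated on the masked alignment.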

Is it possible to either get scripts to do the alignment or get the new Silva 138 release? Silva updates about annually and it would be really nice to be able to update things that rely on sepp and a consistent database alongside that

I'd be happy to share the scripts. In fact, I thought everything necessary is here already: https://github.com/smirarab/sepp-refs/tree/master/silva and https://github.com/smirarab/sepp/tree/master/sepp-package/buildref

When I last tried to use SILVA 138, I ran into the issue of non-monophyly of archaea. I didn't have time to follow up further on that.

@diegomarquezp

Hi, @crusher083 hoping you are well.
I'm also trying to compile SILVA 138, so maybe we can share a couple of things.
I manually performed an alignment using PASTA over the non-truncated dataset,
which was successful, but now we are having trouble with SEPP because too many sequences are being used (the whole database, over 2,000,000 sequences). @smirarab advised removing gappy sequences.

Since my dataset is too big to process on a desktop computer (12GB fasta), I had to write a small C++ program for gappy sequence filtering that uses streams to limit memory use. Filtering this 12GB alignment with 2 million sequences takes about 120 seconds. I would be really happy to help with this program if you are in a similar situation with resource availability.
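
The streaming idea can be sketched in a few lines of Python. This is not the C++ program mentioned above, only an illustration of reading one record at a time so that memory use stays bounded by the longest sequence; the function names, the 0.97 default, and the gap characters `-` and `.` are assumptions.

```python
import io


def read_fasta(handle):
    """Yield (header, sequence) pairs one record at a time."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None and line:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def filter_gappy_sequences(in_handle, out_handle, max_gap_fraction=0.97, gap_chars="-."):
    """Write records whose gap fraction is below max_gap_fraction.

    Returns (kept, dropped). Only one record is held in memory at a time.
    """
    kept = dropped = 0
    for header, seq in read_fasta(in_handle):
        gaps = sum(seq.count(c) for c in gap_chars)
        if seq and gaps / len(seq) < max_gap_fraction:
            out_handle.write(f">{header}\n{seq}\n")
            kept += 1
        else:
            dropped += 1
    return kept, dropped


# Toy input: the second record is all gaps and is dropped.
source = io.StringIO(">s1\nAC--\n>s2\n----\n")
result = io.StringIO()
counts = filter_gappy_sequences(source, result)
```

With real files, the two handles would be opened with `open(path)` and `open(path, "w")` instead of `io.StringIO`.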

My current situation is that only 3 out of the 2 million sequences were 97+% gaps, so I'm following a second piece of advice from @smirarab to filter similar sequences which I will write here for convenience.

If the running time is still high, we can think about removing sequences that are too similar to each other. For doing that, I would suggest 99% similarity or something like that. You can also use our tool TreeCluster (https://github.com/niemasd/TreeCluster) to find the optimal subset given the tree you already have.

On my side, filtering to V4 only may be the step I was missing to reduce the dataset. I was wondering if there is any tool you know of to perform this task.

I'm happy to help with anything you need.

@smirarab
Owner

smirarab commented Sep 2, 2021

Diego, you may have misunderstood what I asked for filtering. I was advising removing sites (so columns) not species (rows) that have more than 99.5% gaps. Did you try simply removing gappy sites?

@diegomarquezp

Oh, that's my bad!
It makes sense now. I will remove the gappy sites and see how it goes.

@diegomarquezp

Hi Siavash,
I'm getting very close to having the reference. I was wondering if you could please point me to a resource for understanding the step of rooting on the lowest common ancestor of archaea. I'm honestly a bit lost here. Does this mean associating the RAxML output tree with another preexisting one? And which tools would you use to perform this?
Thank you!

@smirarab
Owner

smirarab commented Sep 22, 2021 via email

@diegomarquezp

Hi Siavash, I think we can use this taxonomy file, which contains the accession and semicolon-separated taxonomy path for each entry. My input for the RAxML steps was the full aligned sequences from SILVA (ended up using these) with sites that are 99.99% gaps removed, plus a tree generated with FastTree. I decided to build the tree manually instead of using this one because some accessions were associated with the same taxa, producing undesired results in the RAxML steps. The produced tree has accessions as nodes. I think this is correct because the sepp-ref for SILVA 128 is also based on an accession tree.
I will share a public folder with the results so far once the branch length step is done. I will let you know.
Thank you!
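
Rooting on the lowest common ancestor of archaea first requires knowing which leaves are archaeal. A minimal sketch of pulling those accessions out of a taxonomy file is shown here. The layout (accession, a tab, then a semicolon-separated path whose first rank is the domain) is an assumption based on the description above, and the accessions in the example are made up.

```python
def archaea_accessions(lines, domain="Archaea"):
    """Return accessions whose taxonomy path starts with `domain`.

    Assumes each line is '<accession><TAB><semicolon-separated path>'.
    Hypothetical helper; adjust the parsing to the actual file layout.
    """
    hits = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        accession, _, path = line.partition("\t")
        ranks = [rank.strip() for rank in path.split(";") if rank.strip()]
        if ranks and ranks[0] == domain:
            hits.append(accession)
    return hits


# Made-up accessions, for illustration only.
example = [
    "X1\tArchaea;Crenarchaeota;Thermoprotei;",
    "X2\tBacteria;Proteobacteria;Gammaproteobacteria;",
    "X3\tArchaea;Euryarchaeota;Halobacteria;",
]
archaea = archaea_accessions(example)
```

The resulting list can then be handed to a tree library or command-line tool to find the LCA of those leaves and reroot there. If the archaeal leaves do not form a clade, the LCA will also contain non-archaeal leaves.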

@smirarab
Owner

smirarab commented Nov 12, 2021 via email

@diegomarquezp

Hi Siavash.
I had to stop this project for a while.
What I have so far is the tree with branch lengths from the full aligned sequences of SILVA, but with many repeated sequences. The only issue left to solve, besides rooting on archaea, is to redo it with the reduced sequences. There are about 40k sequences that are repeated according to RAxML (even with the original, non-trimmed dataset).
I think my server won't be busy for a while, so I can start redoing the tree over the non-repeated dataset (masked-dna-sequences-accession.fasta.reduced).
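
As a rough way to inspect which accessions collapse together, exact duplicates can be grouped by sequence. This sketch is not how RAxML produces its .reduced file; it only groups records whose aligned sequences are identical after upper-casing, and the function name is hypothetical.

```python
from collections import defaultdict


def group_identical_sequences(records):
    """Group record names by identical sequence (case-insensitive).

    `records` is an iterable of (name, sequence) pairs.
    Returns (unique, duplicates): `unique` maps the first name seen for each
    distinct sequence to that sequence (upper-cased); `duplicates` maps that
    first name to the other names sharing the same sequence.
    """
    groups = defaultdict(list)
    for name, seq in records:
        groups[seq.upper()].append(name)
    unique = {names[0]: seq for seq, names in groups.items()}
    duplicates = {names[0]: names[1:] for names in groups.values() if len(names) > 1}
    return unique, duplicates


# Toy records: "b" repeats "a" apart from letter case.
records = [("a", "ACGU"), ("b", "acgu"), ("c", "AC-U")]
unique, duplicates = group_identical_sequences(records)
```

Keeping the `duplicates` mapping makes it possible to add the dropped accessions back next to their representative after the tree is built.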

Here is a public gdrive folder with my work. Hope this helps.

5 participants