Dendogram #75

Open
JChristopherEllis opened this issue Apr 16, 2021 · 2 comments
JChristopherEllis commented Apr 16, 2021

Can you create a dendrogram from the dist results?

Also, could you recommend parameters for large fungal genome comparison?

Owner

dnbaker commented Apr 20, 2021

Hi,

Sure, you can do that.

You'd start with a distance or similarity matrix, and then feed that into a hierarchical clustering algorithm. Good options could include scipy's hierarchical clustering (https://docs.scipy.org/doc/scipy/reference/cluster.hierarchy.html) or HDBSCAN, both of which can work on distance matrices.

For parameter selection, the right k depends on how similar the genomes are. A k of 16-19 seems to work well for generating pairwise distances across all fungal genomes in RefSeq, but if you're working with many related strains you may want something more like 30-100.

An example workflow with SciPy's hierarchical clustering that you might follow:

import numpy as np
import scipy.cluster.hierarchy as sch
import matplotlib.pyplot as plt
from scipy.spatial.distance import squareform

x = ... # Parse distance matrix from file somehow (as a NumPy array)
# linkage() expects a condensed (1-D) distance matrix; a 2-D input would be
# treated as raw observations, so convert a square matrix first.
if x.ndim > 1:
    x = squareform(x)

L = sch.linkage(x)  # default method is 'single'; method='average' gives UPGMA
dn = sch.dendrogram(L)

You can then export the dendrogram or visualize it with matplotlib. (Calling plt.show() after creating the dendrogram should display it, and plt.savefig() writes it to a file.)
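To make the "condensed distance matrix" step in the workflow above more concrete, here is a minimal pure-Python sketch of what that conversion does: it keeps only the upper triangle of a symmetric square matrix, row by row. The function name `square_to_condensed` is hypothetical and only for illustration; in practice `scipy.spatial.distance.squareform` does this for you.

```python
import math


def square_to_condensed(square):
    """Flatten a symmetric n x n distance matrix into its upper triangle.

    Illustrative helper (not part of SciPy): the result lists the entries
    (0,1), (0,2), ..., (0,n-1), (1,2), ... in that order.
    """
    n = len(square)
    if any(len(row) != n for row in square):
        raise ValueError("matrix must be square")
    condensed = []
    for i in range(n):
        for j in range(i + 1, n):
            # A condensed matrix cannot represent asymmetric distances.
            if not math.isclose(square[i][j], square[j][i]):
                raise ValueError(f"matrix is not symmetric at ({i}, {j})")
            condensed.append(square[i][j])
    return condensed


square = [
    [0.0, 0.1, 0.4],
    [0.1, 0.0, 0.3],
    [0.4, 0.3, 0.0],
]
condensed = square_to_condensed(square)
```

A matrix of n genomes yields n * (n - 1) / 2 condensed entries, which is the input shape that `sch.linkage` expects.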

The downside is that SciPy's hierarchical clustering only works for symmetric distances, though you should be able to use containment distance with HDBSCAN. Of course, you can convert any similarity measure x (containment, Jaccard) into a distance by using 1 - x, or you can use the Mash formula to convert a Jaccard index into a distance: -log((2 * x) / (1 + x)) / k, where k is the k-mer length.
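The two conversions described above can be sketched in a few lines of Python. The function names are hypothetical, and returning infinity for a Jaccard index of 0 is an assumption made here (the logarithm diverges at that point, and tools may cap the value instead).

```python
import math


def similarity_to_distance(similarity):
    """Convert a similarity in [0, 1] to a distance via 1 - similarity."""
    if not 0.0 <= similarity <= 1.0:
        raise ValueError("similarity must lie in [0, 1]")
    return 1.0 - similarity


def mash_distance(jaccard, k):
    """Mash-style distance: -ln(2j / (1 + j)) / k for Jaccard index j."""
    if not 0.0 <= jaccard <= 1.0:
        raise ValueError("jaccard must lie in [0, 1]")
    if k <= 0:
        raise ValueError("k must be positive")
    if jaccard == 0.0:
        # Assumption: the formula diverges here, so report infinity and
        # leave any capping to the caller.
        return math.inf
    return -math.log((2.0 * jaccard) / (1.0 + jaccard)) / k


simple = similarity_to_distance(0.25)
mash = mash_distance(0.5, 16)
```

Identical sketches (a Jaccard index of 1) give a distance of zero under both conversions, and the Mash-style distance grows as the Jaccard index shrinks.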

Spectral Clustering, for instance, will use affinities rather than distances.

I hope this helps, and let me know if you have any further questions or problems. Thanks,

Daniel

mihkelvaher commented Apr 23, 2021

Quicktree also performs quite well:

sed -i "1s/.*/$FILECOUNT/" "$dashingDistanceMatrix" # replace the first line with $FILECOUNT (the number of input files)
quicktree -in m "$dashingDistanceMatrix" > "$newick" # NJ-tree, https://github.com/khowe/quicktree
nw_reroot "$newick" > final.nwk # quick and dirty rooting, http://cegg.unige.ch/newick_utils
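The sed step above rewrites the first line of the matrix file so that it holds the number of entries. As a rough sketch of the layout this appears to be aiming for, the following Python builds a PHYLIP-style square distance matrix as text: a count on the first line, then one row per genome. The function name, the six-decimal formatting, and the single-space separator are assumptions made for illustration; check them against what your tree builder accepts.

```python
def format_phylip_matrix(names, matrix):
    """Render names and a square distance matrix as PHYLIP-style text.

    Illustrative helper: the first line is the number of entries, and each
    following line is a name followed by that row's distances.
    """
    n = len(names)
    if len(matrix) != n or any(len(row) != n for row in matrix):
        raise ValueError("matrix must be n x n for n names")
    lines = [str(n)]
    for name, row in zip(names, matrix):
        # Whitespace inside a name would be read as a column separator.
        if not name or any(ch.isspace() for ch in name):
            raise ValueError(f"invalid name: {name!r}")
        values = " ".join(f"{value:.6f}" for value in row)
        lines.append(f"{name} {values}")
    return "\n".join(lines) + "\n"


text = format_phylip_matrix(
    ["genomeA", "genomeB"],
    [[0.0, 0.5], [0.5, 0.0]],
)
```

Writing `text` to a file would give an input of the general shape that the quicktree command above reads with `-in m`.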
