aminoacid distance to AAI? #58

Open
jianshu93 opened this issue Jun 7, 2022 · 2 comments

@jianshu93

Hello Daniel,

For nucleotide sequences, given a Jaccard index J estimated by MinHash (e.g. probminhash), we can follow the Mash paper and apply a log transformation, D = -1/k * log(2J/(1+J)), then approximate ANI as 1 - D. What if it is the Jaccard index of amino acid/protein sequences? Should we make some adjustment to approximate AAI (average amino acid identity)?

Thanks,

Jianshu

@dnbaker
Owner

dnbaker commented Jun 8, 2022

Hi Jianshu,

You should be able to use the same equation that converts k-mer similarity fraction to ANI for AAI as well, substituting the relevant statistics.

Specifically:

1 + log(2*J/(1+J)) / k

For Python code, you might perform something like:

import numpy as np
# amino_jaccards: NumPy array of Jaccard similarities (parsed or otherwise set); k: k-mer length
est_amino_identity = 1. + np.log(2 * amino_jaccards / (1. + amino_jaccards)) / k
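
A slightly fuller, self-contained sketch of the same transformation, for illustration only. The helper name `jaccard_to_identity`, the clamping of the result to [0, 1], the made-up input values, and k = 7 are all assumptions rather than anything provided by the library. A Jaccard of 0 gives log(0), so the sketch maps that case to an identity of 0.

```python
import numpy as np

def jaccard_to_identity(jaccards, k):
    """Estimate identity as 1 + log(2J / (1 + J)) / k (hypothetical helper)."""
    j = np.asarray(jaccards, dtype=float)
    # log(0) evaluates to -inf; silence the warning and clamp afterwards.
    with np.errstate(divide="ignore"):
        est = 1.0 + np.log(2.0 * j / (1.0 + j)) / k
    # Assumption: clamp to [0, 1] so that J = 0 yields identity 0.
    return np.clip(est, 0.0, 1.0)

amino_jaccards = np.array([1.0, 0.5, 0.1, 0.0])  # made-up example values
est_amino_identity = jaccard_to_identity(amino_jaccards, k=7)
print(est_amino_identity)
```

Identical k-mer sets (J = 1) map to an identity of 1, and the estimate decreases monotonically as J falls.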

This transformation is really all you need. Also, in my experiments, weighted Jaccard (probminhash or bagminhash) can yield somewhat more accurate ANI estimates than set-based Jaccard (albeit slower and using more memory); depending on the nature of the data, it might be worth trying weighted extensions.
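
To make the set-based versus weighted distinction concrete, here is a minimal sketch that computes both similarities exactly from amino acid k-mer counts, without any sketching; the MinHash variants named above approximate weighted similarities of this kind from compact sketches instead. The helper names, the toy sequences, k = 3, and the clamping at 0 are assumptions for illustration.

```python
from collections import Counter
import math

def kmer_counts(seq, k):
    """Count overlapping k-mers in a sequence."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

def set_jaccard(a, b):
    """|A & B| / |A | B| over distinct k-mers (multiplicity ignored)."""
    union = a.keys() | b.keys()
    return len(a.keys() & b.keys()) / len(union) if union else 0.0

def weighted_jaccard(a, b):
    """sum(min counts) / sum(max counts) over all k-mers."""
    keys = a.keys() | b.keys()
    den = sum(max(a[x], b[x]) for x in keys)
    return sum(min(a[x], b[x]) for x in keys) / den if den else 0.0

def identity(j, k):
    """1 + log(2J / (1 + J)) / k, clamped at 0 (clamping is an assumption)."""
    return max(0.0, 1.0 + math.log(2.0 * j / (1.0 + j)) / k) if j > 0 else 0.0

k = 3  # toy k-mer length, chosen only to keep the example small
a = kmer_counts("MKVLAAGMKVLAAG", k)  # toy protein fragments
b = kmer_counts("MKVLAAGMKVIAAG", k)
js, jw = set_jaccard(a, b), weighted_jaccard(a, b)
print("set-based:", js, identity(js, k))
print("weighted: ", jw, identity(jw, k))
```

Because the toy sequences contain repeated k-mers, the two similarities differ, and so do the identity estimates derived from them.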

Thanks,

Daniel

@jianshu93
Author

Thanks Daniel. This is very helpful.

jianshu
