aminoacid distance to AAI? #58

jianshu93 · 2022-06-07T14:16:19Z

Hello Daniel,

For the nucleotide Jaccard similarity J, estimated by MinHash (e.g. probminhash), we can follow the Mash paper and apply a log transformation, D = -1/k * log(2J/(1+J)), so that 1 - D approximates ANI. What if it is the Jaccard similarity of amino acid/protein sequences? Should we make some adjustment to it to approximate AAI (average amino acid identity)?

Thanks,

Jianshu

dnbaker · 2022-06-08T17:17:07Z

Hi Jianshu,

You should be able to use the same equation that converts the k-mer similarity fraction to ANI for AAI as well, substituting the relevant statistics.

Specifically:

1 + log(2*J/(1+J)) / k

For Python code, you might perform something like:

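import numpy as np
# Set amino_jaccards to the vector of Jaccard similarities (parsed or otherwise) and k to the amino-acid k-mer length.
amino_jaccards = np.asarray(amino_jaccards, dtype=float)
est_amino_identity = 1. + np.log(2 * amino_jaccards / (1. + amino_jaccards)) / k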
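For a version that can be run as-is, here is a minimal sketch of the same transformation wrapped in a function. The function name `jaccard_to_identity`, the choice to map J = 0 to an identity of 0, and the clamping of negative estimates to 0 are all assumptions made for illustration; they are not taken from this thread or from any library.

```python
import numpy as np

def jaccard_to_identity(jaccard, k):
    """Estimate identity as 1 + ln(2J / (1 + J)) / k (hypothetical helper)."""
    j = np.asarray(jaccard, dtype=float)
    if np.any((j < 0) | (j > 1)):
        raise ValueError("Jaccard similarities must lie in [0, 1]")
    # log(0) evaluates to -inf; silence the warning and let the clip handle it.
    with np.errstate(divide="ignore"):
        identity = 1.0 + np.log(2.0 * j / (1.0 + j)) / k
    # Assumption: very small J (including J == 0) can give a negative
    # estimate, which is clamped to 0 here.
    return np.clip(identity, 0.0, 1.0)

# Example with made-up Jaccard values and an amino-acid k-mer length of 7.
print(jaccard_to_identity([1.0, 0.5, 0.0], k=7))
```

Identical sketches (J = 1) give an identity of exactly 1, and the estimate drops quickly as J shrinks, more steeply for smaller k.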

This transformation is really all you need. Also, in my experiments, weighted Jaccard (probminhash or bagminhash) can yield somewhat more accurate ANI estimates than set-based Jaccard, albeit at the cost of more time and memory; depending on the nature of the data, it might be worth trying the weighted extensions.
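To make the set-based versus weighted distinction concrete, here is a small sketch that computes both similarities exactly from k-mer counts, without any sketching. It uses the usual multiset (min/max) definition of weighted Jaccard as an illustration; the helper names and the toy sequences are hypothetical, and this is not how a sketching method such as probminhash or bagminhash arrives at its estimate.

```python
from collections import Counter

def kmer_counts(seq, k):
    """Count every overlapping k-mer in a sequence."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

def set_jaccard(a, b):
    """Set-based Jaccard: shared distinct k-mers over all distinct k-mers."""
    keys_a, keys_b = set(a), set(b)
    union = keys_a | keys_b
    return len(keys_a & keys_b) / len(union) if union else 0.0

def weighted_jaccard(a, b):
    """Weighted Jaccard: sum of min counts over sum of max counts."""
    keys = set(a) | set(b)
    num = sum(min(a[x], b[x]) for x in keys)  # Counter returns 0 for missing keys
    den = sum(max(a[x], b[x]) for x in keys)
    return num / den if den else 0.0

# Toy sequences (made up) where a repeated k-mer changes the weighted value.
a = kmer_counts("AAAB", 2)  # AA: 2, AB: 1
b = kmer_counts("AABB", 2)  # AA: 1, AB: 1, BB: 1
print(set_jaccard(a, b), weighted_jaccard(a, b))
```

Either value can then be passed through the same log transformation; the two differ only when k-mer multiplicities differ between the sequences.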

Thanks,

Daniel

jianshu93 · 2022-06-09T14:40:12Z

Thanks, Daniel. This is very helpful.

Jianshu
