Hello Daniel,
For the nucleotide Jaccard index (J) estimated by MinHash (e.g. probminhash), we can follow the Mash paper and apply a log transformation, -1/k * log(2J/(1+J)), to get a distance from which ANI can be approximated. What if it is the Jaccard index of amino acid/protein sequences? Should we make some adjustment to approximate AAI (average amino acid identity)?
Thanks,
Jianshu
You should be able to use the same equation that converts the k-mer similarity fraction to ANI for AAI as well, substituting the relevant statistics (the Jaccard similarity and k of the amino acid k-mers).
Specifically, with J the Jaccard similarity, k the k-mer length, and log the natural logarithm:
1 + log(2*J/(1+J)) / k
In Python, you might do something like:
import numpy as np
# amino_jaccards: NumPy array of Jaccard similarities (parsed or computed elsewhere); k: k-mer length
est_amino_identity = 1. + np.log(2 * amino_jaccards / (1. + amino_jaccards)) / k
This transformation is really all you need. Also, in my experiments, weighted Jaccard (probminhash or bagminhash) can yield somewhat more accurate ANI estimates than set-based Jaccard (albeit slower and with more memory); depending on the nature of the data, it might be worth trying weighted extensions.
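To make the transformation concrete, here is a minimal, self-contained sketch in plain Python. It is illustrative only: the helper names (`kmer_set`, `jaccard`, `jaccard_to_identity`) are hypothetical, it uses the exact set-based Jaccard of k-mers rather than a MinHash estimate, and returning 0 when J = 0 (where the logarithm is undefined) is an assumption, not something prescribed in this thread.

```python
import math


def kmer_set(seq, k):
    """Return the set of all length-k substrings (k-mers) of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def jaccard(a, b):
    """Exact Jaccard similarity of two sets (0.0 if both are empty)."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def jaccard_to_identity(j, k):
    """Apply 1 + ln(2J / (1 + J)) / k and clamp the result to [0, 1]."""
    if j <= 0.0:
        # Assumption: the log is undefined at J = 0, so report 0 identity.
        return 0.0
    identity = 1.0 + math.log(2.0 * j / (1.0 + j)) / k
    return max(0.0, min(1.0, identity))


if __name__ == "__main__":
    # Toy protein fragments, made up for illustration.
    seq_a = "MKTAYIAKQRQISFVKSHFSRQ"
    seq_b = "MKTAYIAKQRQISFVKAHFSRQ"
    k = 3
    j = jaccard(kmer_set(seq_a, k), kmer_set(seq_b, k))
    print(f"J = {j:.4f}, estimated identity = {jaccard_to_identity(j, k):.4f}")
```

The same `jaccard_to_identity` call works for a Jaccard value estimated from sketches; only the way J is obtained changes.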