Support for proteomes, not just genomes in nt? #36
Hello Dashing2 developers,

Dashing can be used for genome dereplication in nt format, but I am wondering whether there is an option to have it also work for amino acid sequences (e.g., all genes of a genome in amino acid format). This is widely used by microbiologists.

Thanks,
Jianshu
Comments
Hi Jianshu,

Thanks for your question! Yes -- you can apply this to amino acid sequence collections! You enable this with the --protein flag. You also have the option of compressed amino acid alphabets. Also, to provide more of an example, if you want to de-duplicate, you use:

dashing2 sketch --greedy 0.8 -F pathsfiles.txt -k6 --protein --cmpout deduplicated.txt
# Or
dashing2 sketch --greedy 0.8 -k6 --protein --cmpout deduplicated.txt f1.fasta f2.fasta f3.fasta # <...>

The higher the --greedy parameter is, the more specific each cluster is and the more clusters there will be. If you want to de-duplicate by sequence (rather than by file), you can append the corresponding flag.

I'm happy to answer any further questions you might have, and thanks again,
Daniel
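For intuition, here is a minimal Python sketch of threshold-based greedy clustering in the spirit of --greedy (an illustration of the general technique over raw k-mer sets, not Dashing2's actual implementation; the names and the plain-Jaccard similarity are stand-ins):

```python
# Illustrative greedy clustering by similarity threshold (not Dashing2's code).
# Each item joins the first cluster whose representative it matches at >= threshold;
# otherwise it founds a new cluster. Higher thresholds -> tighter, more numerous clusters.

def jaccard(a: set, b: set) -> float:
    """Plain Jaccard similarity between two k-mer sets."""
    return len(a & b) / len(a | b) if a or b else 1.0

def greedy_cluster(items, threshold):
    reps, clusters = [], []  # cluster representatives and their member lists
    for name, kmers in items:
        for i, rep in enumerate(reps):
            if jaccard(kmers, rep) >= threshold:
                clusters[i].append(name)
                break
        else:  # no representative was similar enough: start a new cluster
            reps.append(kmers)
            clusters.append([name])
    return clusters

items = [("f1", {1, 2, 3, 4}), ("f2", {1, 2, 3, 5}), ("f3", {7, 8, 9})]
print(greedy_cluster(items, threshold=0.5))  # [['f1', 'f2'], ['f3']]
```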
Oh, thanks very much, Daniel. I'm glad that you have this functionality; it will be very useful for dereplication. Is this SetSketch k-mer abundance weighted? I was using ProbMinHash by the same author (a better version of BagMinHash) to account for k-mer multiplicity. I have not read the SetSketch paper yet.

Jianshu
The SetSketch doesn't support multiplicities, but we have two ways to deal with them: ProbMinHash and BagMinHash. If you want strict multiset similarity, you can use either of those options. For sequencing datasets or sufficiently large genomes, you may not want to do exact counting for either ProbMinHash or BagMinHash; if you run into out-of-memory problems using either of these, you can switch to approximate counting.

Thanks,
Daniel
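For reference, the quantity "strict multiset similarity" refers to can be computed exactly as below (an illustrative Python sketch, not Dashing2's code); the exact k-mer count table built here is the kind of key-value map that can exhaust memory on large inputs:

```python
from collections import Counter

def kmer_counts(seq: str, k: int) -> Counter:
    """Exact k-mer count table; memory grows with the number of distinct k-mers."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

def weighted_jaccard(a: Counter, b: Counter) -> float:
    """Multiset (weighted) Jaccard: sum of min counts over sum of max counts."""
    keys = a.keys() | b.keys()
    num = sum(min(a[x], b[x]) for x in keys)  # Counter returns 0 for missing keys
    den = sum(max(a[x], b[x]) for x in keys)
    return num / den if den else 1.0

ca = kmer_counts("ACGTACGTAC", k=4)
cb = kmer_counts("ACGTACGAAC", k=4)
print(weighted_jaccard(ca, cb))
```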
Hello Daniel,

I am comparing microbial genomes, so the size of each genome is not so big, but there are a lot of genomes (millions of genome files), so I think the greedy search should be the limiting step. Plus, genomes binned from environmental samples are often incomplete, which makes absolute k-mer occurrence counts less useful but relative frequencies very useful when comparing against database genomes that are sometimes complete. If we are comparing metagenomes, I agree the memory will be a big issue, and weighted sets will take a lot of memory. But in biology, k-mer multiplicity actually happens a lot; for example, about 5-10% of microbial genes are multi-copy, especially the very conserved ones like the 16S gene, ribosomal proteins, etc. So does using a weighted set make more sense, to have those differences in copy number counted in the final distance? Do you have any suggestions for users for whom weighted sets are preferable when comparing real-world metagenomes or genomes, even though they take a lot of memory?

Many thanks for the explanation.

Jianshu
Hello Daniel,

Thanks,
Jianshu
I'd lean toward using BagMinHash, then. On the genome pairs I've compared with fastANI, it's been very slightly better at estimating ANI than ProbMinHash, and sometimes even better than exact k-mer Jaccard (or weighted Jaccard) similarity. My intuition suggests that BagMinHash is better for assemblies/proteomes, but that ProbMinHash would be better for sequencing datasets.

And lastly, you're right that you'd often want to calculate the ANI/AAI precisely, but hopefully you can filter down your candidate close relatives pretty reliably using k-mers first.

I'll look forward to hearing how it goes, and good luck!

Best,
Daniel
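Some quantitative intuition behind that split (my illustration, not from the thread): as I understand Ertl's papers, BagMinHash estimates the weighted Jaccard of raw k-mer counts, while ProbMinHash estimates the probability Jaccard of the normalized count distributions, so ProbMinHash is insensitive to overall depth. A small exact computation:

```python
def weighted_jaccard(x: dict, y: dict) -> float:
    """Weighted Jaccard over raw counts: sum(min) / sum(max)."""
    keys = x.keys() | y.keys()
    return (sum(min(x.get(k, 0), y.get(k, 0)) for k in keys)
            / sum(max(x.get(k, 0), y.get(k, 0)) for k in keys))

def prob_jaccard(x: dict, y: dict) -> float:
    """Probability Jaccard: J_P = sum over shared keys i of 1 / sum_j max(x_j/x_i, y_j/y_i)."""
    keys = x.keys() | y.keys()
    jp = 0.0
    for i in keys:
        if x.get(i, 0) > 0 and y.get(i, 0) > 0:
            jp += 1.0 / sum(max(x.get(j, 0) / x[i], y.get(j, 0) / y[i]) for j in keys)
    return jp

a = {"k1": 2, "k2": 4, "k3": 2}
b = {"k1": 1, "k2": 2, "k3": 1}  # same composition as a, half the depth
print(weighted_jaccard(a, b))  # 0.5: penalized for the depth difference
print(prob_jaccard(a, b))      # 1.0: identical relative frequencies
```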
Hello Daniel,

I ran a test last night with --greedy 0.8, and nearly every one of my proteomes formed its own cluster: 47609 clusters out of 47894 (bacterial proteomes). Here 0.8 is similarity, right, i.e. 1 - distance? Or is it actually a distance? I am a little bit confused:

dashing2 sketch --threads 48 --enable-protein --greedy 0.8 -k7 --cmpout deduplicated_fast.txt *.faa

I have a relationship between ProbMinHash distance and ANI (blastn-based ANI, very close to fastANI) at k=27, S=12000; to have very high specificity, I use a large k. It seems distance 0.93 corresponds to ANI 95%. I feel that for amino-acid-based distance, k=7 and S=1024 should be enough to have this resolution, up to 90% AAI. I want to dereplicate at 90% AAI, and I am trying to figure out the distance value corresponding to 90% AAI. I will also try the ProbMinHash distance, not just the default SetSketch.

Many thanks,
Jianshu
Hello Daniel,

When using --greedy 0.1 with ProbMinHash, I get only 24415 clusters out of 47894 proteomes. It seems similarity is similarity and distance is distance; we should supply similarity values for similarity options and distance values for distance options. I expect that 0.1 similarity (0.9 distance) corresponds to 90% or so AAI, very similar to ANI, while 0.8 similarity would be like 0.2 distance, so actually a very high AAI, maybe 99% or even higher. Does that make any sense to you? Plus, this is really fast!!! A few minutes with 24 threads.

Jianshu
Hi Jianshu,

Great to hear -- glad it's fast. You're right: the default output is similarity, so points are assigned to clusters where their similarity is over that threshold. If you want to switch to a distance, add the corresponding flag.

Note that if you choose a distance, your threshold for inclusion will be the largest allowed distance instead of the smallest allowed similarity. And you're right that in the high k-mer similarity regime you only get very high-ANI items; even if relatively few mutations are spread out over a genome, they can still change a lot of k-mers, especially since ANI/AAI is computed over homologous regions and the k-mer methods aren't being that precise.

Thanks,
Daniel
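For a sense of scale on that last point, here is the standard Mash-style relation between k-mer Jaccard and ANI (Ondov et al., 2016) -- an illustration of the general relationship, not necessarily the exact mapping Dashing2 uses:

```python
import math

def mash_distance(jaccard: float, k: int) -> float:
    """Mash distance: D = -(1/k) * ln(2j / (1 + j)); ANI is approximately 1 - D."""
    return -math.log(2 * jaccard / (1 + jaccard)) / k

for j in (0.8, 0.5, 0.1):
    d = mash_distance(j, k=27)
    print(f"k=27, Jaccard {j:.2f} -> ANI ~ {100 * (1 - d):.2f}%")
# k=27, Jaccard 0.80 -> ANI ~ 99.56%
# k=27, Jaccard 0.50 -> ANI ~ 98.50%
# k=27, Jaccard 0.10 -> ANI ~ 93.69%  (even low Jaccard at large k implies high ANI)
```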
Hello Daniel,

nohup time dashing2 sketch --threads 24 --greedy 0.069 --pminhash -k28 -S 12000 --cmpout deduplicated_k28S12000g0069.txt *.fna &

I am using it for dereplication of 47894 microbial genomes in FASTA format; I am guessing that 0.93 (1 - 0.069) should be very close to fastANI 95%. I notice that the memory reached 250 G. I think ProbMinHash needs significantly less memory than BagMinHash, right? Did you compress the DNA alphabet to 2 bits? In my experience with Rust and bitvec, compressing to 2 bits and then decompressing requires significantly less memory when sketching the same number of genomes (47894). Is that because of the greedy clustering? And then this error after 30 minutes:

Command terminated by signal 9

Jianshu
Hey Jianshu,

A signal 9 with dashing2 usually means that it ran out of memory and was killed by a process that terminates jobs hitting that kind of limit. PMH and BMH should have effectively equivalent memory requirements. The primary memory requirement when using ProbMinHash and BagMinHash, however, is the space required for the key-value map from k-mers to counts. You can reduce this somewhat by using approximate counting rather than an exact map.

And you ask about compressing the alphabet into 2 bits. Dashing2 uses 2 bits per base for DNA sequences as long as the k-mers fit in 64-bit registers, at which point it switches to a cyclic hash that's packed into 64 bits. (This is different for protein: the switch happens after 7 amino acids, since a 20-letter alphabet takes more space per character.) But that really shouldn't matter, since these 64-bit representations are at a fixed size of 8 bytes, and if k > 6 you're already using a 64-bit hash of the k-mer.

There is one other option for reducing the memory footprint as well.

Let me know if I can help more, and thanks again,
Daniel
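As a side note on the 2-bit representation, here is a minimal, hypothetical sketch of the general technique (not Dashing2's actual code): packing a DNA k-mer into one 64-bit word at 2 bits per base, which is why the switch away from exact packing happens once k exceeds 32 for DNA:

```python
# Illustrative 2-bit DNA k-mer packing (k <= 32 fits in a single 64-bit word).
ENCODE = {"A": 0, "C": 1, "G": 2, "T": 3}

def pack_kmer(kmer: str) -> int:
    """Pack a DNA k-mer into a single integer, 2 bits per base."""
    assert len(kmer) <= 32, "longer k-mers no longer fit in a 64-bit register"
    code = 0
    for base in kmer:
        code = (code << 2) | ENCODE[base]
    return code

def unpack_kmer(code: int, k: int) -> str:
    """Invert pack_kmer: read bases back out, most significant pair first."""
    bases = "ACGT"
    return "".join(bases[(code >> (2 * (k - 1 - i))) & 3] for i in range(k))

kmer = "ACGTACGTACGTACGTACGTACGTACGT"  # k = 28, as in the command above
code = pack_kmer(kmer)
assert unpack_kmer(code, len(kmer)) == kmer
print(hex(code))
```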
Hello Daniel,

As you suggested, with about 700 G of memory, dereplication of the genomes in nt format works! About 400 G of memory was needed for the entire GTDB database (47804 genomes):

dashing2 sketch --threads 24 --greedy 0.069 --pminhash -k28 -S 12000 --cmpout deduplicated_fna_k28S12000g0069.txt *.fna

Did you actually implement the ProbMinHash3a algorithm from the paper, or just ProbMinHash3? 3a is more memory efficient.

Jianshu
Great news -- thanks for letting me know! In the newest version, I've introduced some options you can use to lower peak memory.

We're currently using PMH2a, mostly because it could be implemented to run quickly, and sketching speed is particularly important. We can probably implement PMH3a in an efficient way, but it will take some time/engineering. I'll let you know if that changes.
Hello Daniel,

Do you have the Dashing2 paper preprint ready? I would be very happy to read it.

Thanks,
Jianshu
Hi Jianshu,

We're still writing the preprint. I expect we'll submit it within the next two weeks.

Thanks for checking in,
Daniel
Hello Daniel,

I come back to this after a few tries. I realized that genome assemblies, especially those assembled from metagenomes (many genome species per metagenome), can be incomplete; e.g., 80% completeness is very common (in effect a sampling of the original genome). So would it be reasonable to use ProbMinHash in this situation, in the sense that, when comparing two genomes, incompleteness will not strongly interfere with how similar they appear?

Thanks,
Jianshu
Hi Jianshu,

Sorry for taking so long to reply -- I defended and moved over the new year/holiday.

It's possible that incompleteness can affect how similar two genomes seem to be. You could try using symmetric containment to work around it, hoping that it accounts for potentially lost content. With more depth of coverage, it becomes easier to be sure you'll be able to recover meaningful signatures. I would expect ProbMinHash to help allow comparisons of k-mer sets of different coverage levels.

Happy to discuss further, and happy new year!

Thanks,
Daniel
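To illustrate why symmetric containment helps with incompleteness (a toy example under the usual definitions, not Dashing2 code): if one assembly is an incomplete subset of the other, its containment stays near 1 while plain Jaccard drops with completeness.

```python
def jaccard(a: set, b: set) -> float:
    return len(a & b) / len(a | b)

def symmetric_containment(a: set, b: set) -> float:
    """max of |A & B| / |A| and |A & B| / |B|: robust when one genome is incomplete."""
    inter = len(a & b)
    return max(inter / len(a), inter / len(b))

full = set(range(10_000))     # k-mer set of a complete genome
partial = set(range(8_000))   # an 80%-complete assembly of the same genome
print(jaccard(full, partial))                # 0.8: dragged down by the missing k-mers
print(symmetric_containment(full, partial))  # 1.0: the partial set is fully contained
```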
Hello Daniel,

I am also sorry for the late response, and congrats, Dr. Baker!

A few questions related to MinHash, weighted MinHash, ProbMinHash, and contamination MinHash, comparing MinHash (the Mash software) with ProbMinHash. It looks like Mash is not sensitive to set size (total k-mer count): when two database genomes have sets of very different sizes, whichever shares more k-mers with the query is considered the best match, no matter how different its size is from the query's. ProbMinHash, on the other hand, has a lever: when the set sizes are too different, the number of shared k-mers is penalized in the final distance by the very different genome sizes (k-mer multiplicity/set size). I do not know which one is better, but I read the contamination Jaccard paper (https://www.sciencedirect.com/science/article/pii/S009630031930116X?casa_token=OIODVypDwEMAAAAA:CRygZY25CsDdlj16LrPod1RgE3DczMOCeQbdEPQL8mYSshi65aARICnTC0o0UzykKE7ad6u9Hw), which told me that with a sketch size of 12000, in our case a bacterial proteome (size ~10^6) at k=9 has a total k-mer set about 100x larger than the sketch. Randomly sampling 12000 from a set 100x larger, versus from a set 50x or 70x larger, leads to a loss of accuracy, because the size difference gives elements of the smaller set a better chance of being sampled than elements of the larger set. Since for performance we cannot simply increase the sketch size to something like 50 x 12000, this problem will remain. What do you think of this? Finch does output a contamination index, and if it is high the distance should not be trusted, but Finch only supports nt sequences.

With metagenomes and genomes, we have the same problem of differing set sizes, coming from sequencing depth and genome incompleteness respectively. I find the contamination Jaccard a very good idea: it tells you that when set sizes are very different, the distance will not be that reliable.

Looking forward to your reply.

Thanks,
Jianshu
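To make the set-size argument concrete, here is a small, self-contained simulation (my sketch, using a textbook bottom-k MinHash and scaled-down sizes, not Mash's or Dashing2's actual schemes): even when the smaller set is entirely contained in the larger one, the fixed-size sketches, and the Jaccard they estimate, are dominated by the set-size ratio.

```python
# Bottom-k MinHash with a simple deterministic universal hash (illustrative only).
P = (1 << 61) - 1  # a Mersenne prime
A_COEF, B_COEF = 0x9E3779B97F4A7C15, 0xC2B2AE3D27D4EB4F

def h(x: int) -> int:
    """One fixed 'random permutation' shared by all sets."""
    return (A_COEF * x + B_COEF) % P

def bottom_k_sketch(s: set, k: int) -> list:
    """Keep the k smallest hash values of the set."""
    return sorted(h(x) for x in s)[:k]

def jaccard_estimate(sa: list, sb: list, k: int) -> float:
    """Standard bottom-k estimator: fraction of the union's bottom-k seen in both sketches."""
    union_bottom = sorted(set(sa) | set(sb))[:k]
    shared = set(sa) & set(sb)
    return sum(1 for v in union_bottom if v in shared) / k

small = set(range(10_000))    # e.g. an incomplete or low-depth k-mer set
big = set(range(100_000))     # a 10x larger superset of it
true_j = len(small & big) / len(small | big)
est = jaccard_estimate(bottom_k_sketch(small, 1_000), bottom_k_sketch(big, 1_000), 1_000)
print(f"true Jaccard {true_j:.3f}, bottom-k estimate {est:.3f}")
# Containment of the small set is 1.0, yet both the true and estimated Jaccard are ~0.1.
```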