
support for proteome? but not just genome in nt #36

Open
jianshu93 opened this issue Nov 4, 2021 · 18 comments

@jianshu93

Hello Dashing2 developer,

Dashing can be used for genome dereplication in nucleotide format, but I am wondering whether there is an option to have it also work for amino acid sequences (e.g., all genes of a genome in amino acid format). This is widely used by microbiologists.

Thanks,

Jianshu

@dnbaker
Owner

dnbaker commented Nov 4, 2021

Hi Jianshu,

Thanks for your question! Yes -- you can apply this to amino acid sequence collections!

You enable this with the flag --protein, which uses a 20-letter amino acid alphabet.

You also have the option of compressed amino acid alphabets with --protein14, --protein8, and --protein6, which group similar amino acids together into smaller alphabets for longer-range comparisons.

Also, to give a fuller example: if you want to de-duplicate, use the --greedy flag. Here's an example:

dashing2 sketch --greedy 0.8 -F pathsfiles.txt -k6 --protein --cmpout deduplicated.txt
# Or
dashing2 sketch --greedy 0.8 -k6 --protein --cmpout deduplicated.txt f1.fasta f2.fasta f3.fasta # <...>

The higher the --greedy parameter is, the more specific each cluster is and the more clusters there will be.

If you want to de-duplicate by sequence (rather than by file), you can append F to the --greedy argument as of the newest commit; I'll have it in the binary releases soon.

I'm happy to answer any further questions you might have, and thanks again -

Daniel

@jianshu93
Author

Oh, thanks very much, Daniel. I'm glad you have this functionality; it will be very useful for dereplication. Is this SetSketch k-mer-abundance weighted? I was using ProbMinHash by the same author (a better version of BagMinHash) to account for k-mer multiplicity. I have not read the SetSketch paper yet.

Jianshu

@dnbaker
Owner

dnbaker commented Nov 4, 2021

The SetSketch doesn't support multiplicities, but we have two ways to deal with them.

If you add the --probminhash flag, you will sketch with ProbMinHash rather than the SetSketch, generating sketches for the probability Jaccard similarity. This effectively normalizes both sets by their total counts.

If you want strict multiset similarity, you can use either of the --bagminhash/--multiset flags, which will give you sketches that estimate the weighted Jaccard similarity. This doesn't normalize the sets by total counts, so the absolute number of occurrences is important, as opposed to the fraction of occurrences in the dataset. The downside is that BagMinHash takes about 10x as long as ProbMinHash.

For sequencing datasets or sufficiently large genomes, you may not want to do exact counting for either ProbMinHash or BagMinHash. If you run into out-of-memory problems using either of these, you can use the --countsketch-size/-c flag to set the number of features you're accumulating into, which caps the memory usage at the expense of some approximation of the count map. 5 million seems to be sufficient for prokaryotic genomes.
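
For concreteness, a rough sketch of how those flags might be combined (parameter values here are only illustrative, not recommendations):

# ProbMinHash: probability Jaccard, normalizing each input by its total k-mer count
dashing2 sketch --probminhash -k28 -S 12000 --cmpout pmh_similarities.txt *.fna
# BagMinHash: weighted Jaccard, where absolute multiplicities matter; cap counting memory with a count-sketch
dashing2 sketch --bagminhash --countsketch-size 5000000 -k28 -S 12000 --cmpout bmh_similarities.txt *.fna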

Thanks,

Daniel

@jianshu93
Author

Hello Daniel,

I am comparing microbial genomes, so each genome is not very large, but there are a lot of genomes (millions of genome files); I think the greedy search should be the limiting step. Also, genomes binned from environmental samples are often incomplete, which makes absolute occurrence counts less useful but relative frequencies very useful when comparing against genomes in the database that are sometimes complete. If we are comparing metagenomes, I agree that memory will be a big issue and a weighted set will take a lot of memory. But k-mer multiplicity actually happens a lot in biology; for example, about 5-10% of microbial genes are multi-copy, especially very conserved ones like the 16S gene, ribosomal proteins, etc. So would using a weighted set make more sense, so that those differences in copy number are counted in the final distance? Do you have any suggestions for when users should prefer a weighted set when comparing real-world metagenomes or genomes, even though it takes a lot of memory?

Many thanks for the explanation.

Jianshu

@jianshu93
Author

Hello Daniel,
My understanding is that for dereplication, k-mer-based methods do not work very well at very high resolution; e.g., if two proteomes share 95% average amino acid identity, can it still tell the difference? And will ProbMinHash help with this, i.e., does a weighted set help with nearly identical proteomes? So, much like genome dereplication, it can help to filter out very distant comparisons, like pairs at 50% identity to each other, but for exact comparisons of highly identical ones we still need to calculate average amino acid identity with a local alignment method such as blastp. In my experience with ProbMinHash on genome nucleotide sequences, a very large sketch size, 12000, can differentiate closely related genomes that are 99% identical to each other. I will test very close genomes to see what happens.

Thanks,

Jianshu

@dnbaker
Owner

dnbaker commented Nov 5, 2021

I'd lean toward using BagMinHash, then. On the genome pairs I've compared with fastANI, it's been very slightly better at estimating ANI than ProbMinHash, and sometimes even better than exact k-mer Jaccard (or weighted Jaccard) similarity.

My intuition suggests that BagMinHash is better for assemblies/proteomes, but that ProbMinHash would be better for
metagenomics and RNA-seq, since you're getting a specific amount of coverage distributed by some sampling.

And lastly, you're right that you'd often want to calculate the ANI/AAI precisely, but hopefully you can filter down your candidate close relatives pretty reliably using k-mers first.
If you're dealing with a lot of very similar items and need to distinguish rather close neighbors, you may want to increase the --nLSH parameter. The default is 2, but if you increase this to 3 or 5, for instance, it becomes easier to differentiate the very close neighbors.
You mention increasing sketch size for close neighbors. You might also want to try increasing k-mer size to make the sketching more specific. There's no limit on k-mer length since we use a rolling hash function if k is sufficiently large.
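
As a rough sketch of what that might look like (values are only illustrative):

# Larger k plus more LSH tables to better separate near-identical genomes during greedy clustering
dashing2 sketch --greedy 0.9 --nLSH 5 -k31 -S 4096 --cmpout close_neighbors.txt *.fna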

I'll look forward to hearing how it goes, and good luck!

Best,

Daniel

@jianshu93
Author

Hello Daniel,

I ran a test last night with --greedy 0.8; for all my proteomes, nearly every one forms its own cluster, 47609 clusters out of 47894 (bacterial proteomes). Here 0.8 is a similarity, right, i.e., 1 - distance? Or is it actually a distance? I am a little bit confused:

dashing2 sketch --threads 48 --enable-protein --greedy 0.8 -k7 --cmpout deduplicated_fast.txt *.faa

I have a relationship between ProbMinHash distance and ANI (blastn-based ANI, very close to fastANI) with k = 27 and S = 12000; I use a large k to get very high specificity. It seems a distance of 0.93 corresponds to an ANI of 95%:

[figure: probminhash distance vs. blastn-based ANI (ANIb)]

I feel that for amino-acid-based distance, k = 7 and S = 1024 should be enough to reach this resolution, up to 90% AAI. I want to dereplicate at 90% AAI and am trying to figure out the distance value corresponding to 90% AAI. I will also try the ProbMinHash distance, not just the default SetSketch.

Many thanks,

Jianshu

@jianshu93
Author

Hello Daniel.

When using --greedy 0.1 with ProbMinHash, I get only 24415 clusters out of 47894 proteomes. It seems the threshold really is a similarity rather than a distance, so we should pass a similarity (1 - distance) for this option. I expect that 0.1 similarity (0.9 distance) corresponds to roughly 90% AAI, very similar to ANI; 0.8 similarity would be like 0.2 distance, so actually a very high AAI, maybe 99% or even higher.

Does that make sense to you? Plus, this is really fast: only a few minutes with 24 threads.

Jianshu

@dnbaker
Owner

dnbaker commented Nov 5, 2021

Hi Jianshu,

Great to hear - glad it's fast. You're right - the default output is similarity, so you'll assign points to clusters where their similarity is over that threshold.

If you want to switch to a distance, add --poisson-distance to use the Mash distance formula, which converts the similarities to distances. This isn't a straightforward subtraction; it converts the Jaccard x into -log((2x) / (1 + x)) / k, though the two can sometimes be close. For a distance of 0.8 with k = 7, this means a similarity of 0.00185, or about 1 in 539.

Note that if you choose a distance, your threshold for inclusion will be the largest distance instead of the smallest similarity.
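
As a rough example of the distance-based mode (assuming --poisson-distance combines with the clustering flags as described above):

# Cluster on Mash-style distance; the --greedy value now acts as a maximum distance rather than a minimum similarity
dashing2 sketch --greedy 0.1 --poisson-distance -k7 --protein --cmpout clusters_by_distance.txt *.faa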

And you're right that in the high k-mer similarity regime you can get to very high ANI items; even relatively few mutations, spread out, can still change a lot of k-mers, especially since ANI/AAI is computed over homologous regions and the k-mer methods aren't being that precise.

Thanks,

Daniel

@jianshu93
Author

jianshu93 commented Nov 5, 2021

Hello Daniel,

nohup time dashing2 sketch --threads 24 --greedy 0.069 --pminhash -k28 -S 12000 --cmpout deduplicated_k28S12000g0069.txt *.fna &

I am using it for dereplication of 47894 microbial genomes in FASTA format; I am guessing that 0.93 (1 - 0.069) should be very close to fastANI 95%. I noticed that memory use reached 250 GB. I think ProbMinHash needs significantly less memory than BagMinHash, right? Did you compress the DNA alphabet to 2 bits? In my experience with Rust and bitvec, compressing to 2 bits and then decompressing requires significantly less memory when sketching the same number of genomes (47894). Is the memory use because of the greedy clustering?

And then this error after 30 minutes:

Command terminated by signal 9
60357.47user 1834.59system 51:01.50elapsed 2031%CPU (0avgtext+0avgdata 378286548maxresident)k
324195936inputs+9198272outputs (367major+761085034minor)pagefaults 0swaps
Thanks,

Jianshu

@dnbaker
Owner

dnbaker commented Nov 7, 2021

Hey Jianshu -

A signal 9 with dashing2 usually means that it ran out of memory, and it was killed by a process that terminates jobs that hit that kind of limit.

PMH and BMH should have effectively equivalent memory requirements. The primary memory requirement when using ProbMinHash and BagMinhash, however, is the space required for the key-value map from k-mers to counts. You can reduce this somewhat by using the --countsketch-size flag, which approximates that map with fixed memory usage.

You also ask about compressing the alphabet to 2 bits. Dashing2 uses 2 bits per base for DNA k-mers as long as they fit in 64-bit registers, at which point it switches to a cyclic hash packed into 64 bits. (This is different for protein: the switch happens after 7 residues, since a 20-letter alphabet takes more space per character.)

But... that really shouldn't matter, since these 64-bit representations are at the fixed size of 8 bytes, and if k > 6 you're already using a 64-bit hash of the k-mer.

You could add -o signature_matrix.bin as an argument. This would cause Dashing2 to store the matrix directly on disk and then memory-map it, which can reduce the memory footprint at the cost of a little runtime.

The other way to reduce the memory footprint is to use --countsketch-size <int>, e.g., 5000000, which caps the memory usage at <int> * 8 * nthreads bytes, rather than letting it grow with the total number of unique k-mers across the input files.
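
As a rough illustration of combining the two (values are only examples): with --countsketch-size 5000000 and 24 threads, the counting structures are capped around 5,000,000 * 8 * 24 bytes, i.e., just under 1 GB.

# Memory-map the signature matrix to disk and cap the k-mer counting memory
dashing2 sketch --threads 24 --greedy 0.069 --pminhash -k28 -S 12000 --countsketch-size 5000000 -o signature_matrix.bin --cmpout deduplicated_k28S12000g0069.txt *.fna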

Let me know if I can help more, and thanks again,

Daniel

@jianshu93
Author

Hello Daniel,

As you suggested, with about 700 GB of memory, dereplication of the genomes in nucleotide format works! About 400 GB of memory was needed for the entire GTDB database (47804 genomes):

dashing2 sketch --threads 24 --greedy 0.069 --pminhash -k28 -S 12000 --cmpout deduplicated_fna_k28S12000g0069.txt *.fna

Did you actually implement the ProbMinHash3a algorithm from the paper, or just ProbMinHash3? 3a is more memory efficient.

Jianshu

@dnbaker
Owner

dnbaker commented Nov 13, 2021

Great news - thanks for letting me know!

In the newest version, I've introduced some options you can use to lower peak memory: --fastcmp-shorts should cut the comparison memory requirements by about 4x, and --fastcmp-bytes by about 8x. (Though that's currently only available for the SetSketch, not ProbMinHash.) A good portion of the memory requirement is likely from the k-mer:count maps, because ProbMinHash with exact counting requires that dictionary. You could reduce this with --countmin-size <param>, although that step adds some further approximation.
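
As a rough example (illustrative only; this uses the default SetSketch, since the compressed comparison registers don't apply to ProbMinHash):

# Byte-sized comparison registers to shrink all-pairs comparison memory by roughly 8x
dashing2 sketch --greedy 0.8 -k7 --protein --fastcmp-bytes --cmpout dedup_protein.txt *.faa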

We're currently using PMH2a, mostly because it could be run quickly and sketching speed is particularly important. We can probably implement PMH3a in an efficient way, but it will take some time/engineering. I'll let you know if that changes.

@jianshu93
Author

Hello Daniel,

Do you have the dashing2 paper preprint ready? I would be very happy to read it.

Thanks,

Jianshu

@dnbaker
Owner

dnbaker commented Nov 16, 2021

Hi Jianshu,

We're still writing the preprint currently. I expect we'll submit it within the next two weeks.

Thanks for checking in,

Daniel

@jianshu93
Author

Hello Daniel,

I come back to this after a few tries. I realized that genome assemblies, especially those assembled from metagenomes (many genome species per metagenome), can be incomplete; e.g., 80% completeness is very common (effectively some level of sampling of the original genome). So would it be reasonable to use ProbMinHash in this situation, in the sense that, when comparing two genomes, incompleteness will not strongly interfere with how similar they appear?

Thanks,

Jianshu

@dnbaker
Owner

dnbaker commented Jan 12, 2022

Hi Jianshu,

Sorry for taking so long - I defended and moved over the new year/holiday.

It's possible that incompleteness can affect how similar they seem to be. You could try using symmetric containment to work around it, hoping that it accounts for potentially lost content. With more depth of coverage, it becomes easier to be sure you'll be able to recover meaningful signatures. I would expect ProbMinHash to help allow comparisons of k-mer sets at different coverage levels.

Happy to discuss further, and happy new year!

Thanks,

Daniel

@jianshu93
Author

Hello Daniel,

I am also sorry for the late response, and congrats, Dr. Baker!

A few questions related to MinHash, weighted MinHash, ProbMinHash, and containment MinHash:

When comparing MinHash (the Mash software) with ProbMinHash: it looks like Mash is not sensitive to set size (total k-mer count), so when two database genomes have very different set sizes, whichever one shares more k-mers with the query will be considered the best match, no matter how different its size is from the query. ProbMinHash, in contrast, has a lever: when set sizes are very different, the number of shared k-mers is penalized by the difference in genome size (k-mer multiplicity/set size) in the final distance. I do not know which one is better, but I read the containment Jaccard paper (https://www.sciencedirect.com/science/article/pii/S009630031930116X?casa_token=OIODVypDwEMAAAAA:CRygZY25CsDdlj16LrPod1RgE3DczMOCeQbdEPQL8mYSshi65aARICnTC0o0UzykKE7ad6u9Hw), which told me that with a sketch size of 12000, in our case a bacterial proteome (size ~10^6) and k = 9, the total k-mer set of a genome will be about 100x larger than the sketch size of 12000. Randomly sampling 12000 from a total set 100x larger, compared to sampling from one 50x or 70x larger, will lead to a loss of accuracy, because the size difference gives elements of the smaller set a better chance of being sampled than those of the larger set. To get the best performance we cannot simply increase the sketch size to, say, 50 x 12000, so this problem will remain. What do you think of this? Finch does output a contamination index, and if it is high the distance should not be trusted, but it only supports nucleotide sequences.

In the case of metagenomes and genomes, we have the same problem of different set sizes, arising from sequencing depth and genome incompleteness respectively. I find the containment Jaccard a very good idea: it tells you that when set sizes are very different, the distance will not be that reliable.

Looking forward to your reply.

Thanks,

Jianshu
