Limiting kmer coverage gives better fit. #149

d00bin · 2024-12-23T17:29:16Z

First, I would like to thank you for creating this software, and especially making the web tool available! It is amazingly useful!

I'm a bit confused about max kmer coverages, how to choose them, and whether there is even a need to choose them.

I'm using PacBio HiFi reads, and the coverage should be ~30-40%. There seems to be some contamination from bacteria, however.
BUSCO is 94% complete and 2%fragmented for assembly 397mb

Here Is my example:

Parameters Used:

GenomeScope version 2.0
input file = user_uploads/bHO3mNkRYTDjpeuUJUrL
output directory = user_data/bHO3mNkRYTDjpeuUJUrL
p = 2
k = 31
max_kmercov = 25

property	min	max
Homozygous (aa)	99.2346%	99.5236%
Heterozygous (ab)	0.476407%	0.76544%
Genome Haploid Length	342,463,429 bp	370,390,321 bp
Genome Repeat Length	0 bp	0 bp
Genome Unique Length	342,463,429 bp	370,390,321 bp
Model Fit	99.0663%	99.0663%
Read Error Rate	0.465775%	0.465775%

And here I set limit to 1000000 for k31:

Parameters Used:

GenomeScope version 2.0
input file = user_uploads/9q1g6l9n9oSvG3jylFCI
output directory = user_data/9q1g6l9n9oSvG3jylFCI
p = 2
k = 31
max_kmercov = 1000000

Analysis Results:

property	min	max
Homozygous (aa)	99.2179%	99.2439%
Heterozygous (ab)	0.75612%	0.782069%
Genome Haploid Length	621,267,086 bp	622,690,743 bp
Genome Repeat Length	109,355,991 bp	109,606,585 bp
Genome Unique Length	511,911,095 bp	513,084,158 bp
Model Fit	79.6629%	96.8566%
Read Error Rate	0.275908%	0.275908%

From a sister species and Busco results I expect the genome size to be ~400-420mb. Why in order to get a better model fit and seemingly more realistic estimation I need to lower the max coverage to 25?

Because if the high coverage result is true, then I'm lacking ~200mb which surely should not result in assembly BUSCO ~94%.

Also, is it normal to have uniq value - 100%?

Thank you for all your work!!

mschatz · 2024-12-23T19:49:09Z

Thanks for your interest. The main reason for the max kmer frequency is to exclude very high frequency kmers as these are usually contaminates like phiX or other unnatural sequences. It looks like you might have some as you have an abnormal peak in coverage at around 7,000x to 9,000x coverage. It is possible that these are really part of the genome, but it seems somewhat unlikely. So I would recommend a cutoff about about 5000x to exclude these, which is what I ran here for you: http://genomescope.org/genomescope2/analysis.php?code=u3qung3ryHjPCr3Fn21u This estimates the haploid genome size to be about 600Mbp. This is still larger than the published genome (which I assume is this: https://pmc.ncbi.nlm.nih.gov/articles/PMC10038202/) so you may have additional bacterial contamination or other contamination issues present. For this I would try aligning your assembly to the reference (with minimap or mummer) to see how well it matches. You can also try screening the reads or contigs using kraken or blast to see which are likely to be from bacteria. But I would not set the max kmer coverage to only 25 as this will exclude too many real kmers from the genome. I did notice that you have a much lower rate of heterozygosity (0.8% compared to 1.2% in the paper) so perhaps your sample is quite different from the published genome. For context, a human genome has a heterozygosity rate of about 0.1% so this is a major difference Good luck Mike

…

On Mon, Dec 23, 2024 at 12:29 PM d00bin ***@***.***> wrote: First, I would like to thank you for creating this software, and especially making the web tool available! It is amazingly useful! I'm a bit confused about max kmer coverages, how to choose them, and whether there is even a need to choose them. I'm using PacBio HiFi reads, and the coverage should be ~30-40%. There seems to be some contamination from bacteria, however. BUSCO is 94% complete and 2%fragmented for assembly 397mb Here Is my example: *Parameters Used:* - GenomeScope version 2.0 - input file = user_uploads/bHO3mNkRYTDjpeuUJUrL - output directory = user_data/bHO3mNkRYTDjpeuUJUrL - p = 2 - k = 31 - max_kmercov = 25 property min max Homozygous (aa) 99.2346% 99.5236% Heterozygous (ab) 0.476407% 0.76544% Genome Haploid Length 342,463,429 bp 370,390,321 bp Genome Repeat Length 0 bp 0 bp Genome Unique Length 342,463,429 bp 370,390,321 bp Model Fit 99.0663% 99.0663% Read Error Rate 0.465775% 0.465775% Syngnathus_typhle_K31.png (view on web) <https://github.com/user-attachments/assets/f7e49cc0-ab83-47c8-b57c-3cffcc8831d6> And here I set limit to 1000000 for k31: *Parameters Used:* - GenomeScope version 2.0 - input file = user_uploads/9q1g6l9n9oSvG3jylFCI - output directory = user_data/9q1g6l9n9oSvG3jylFCI - p = 2 - k = 31 - max_kmercov = 1000000 *Analysis Results:* property min max Homozygous (aa) 99.2179% 99.2439% Heterozygous (ab) 0.75612% 0.782069% Genome Haploid Length 621,267,086 bp 622,690,743 bp Genome Repeat Length 109,355,991 bp 109,606,585 bp Genome Unique Length 511,911,095 bp 513,084,158 bp Model Fit 79.6629% 96.8566% Read Error Rate 0.275908% 0.275908% linear_plot.png (view on web) <https://github.com/user-attachments/assets/c8b8f3fb-3f30-477d-9821-4d9c3b4d584c> From a sister species and Busco results I expect the genome size to be ~400-420mb. Why in order to get a better model fit and seemingly more realistic estimation I need to lower the max coverage to 25? Because if the high coverage result is true, then I'm lacking ~200mb which surely should not result in assembly BUSCO ~94%. Also, is it normal to have uniq value - 100%? *Thank you for all your work!!* — Reply to this email directly, view it on GitHub <#149>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/AABP3435A2RDHTI4JCHJRWL2HBCAFAVCNFSM6AAAAABUDNEUQSVHI2DSMVQWIX3LMV43ASLTON2WKOZSG42TMNBWG44DQNY> . You are receiving this because you are subscribed to this thread.Message ID: ***@***.***>

d00bin · 2024-12-23T20:11:26Z

Thanks for such a fast response, Mike!

This estimates the haploid genome size to be about 600Mbp. This is still
larger than the published genome (which I assume is this:
https://pmc.ncbi.nlm.nih.gov/articles/PMC10038202/) so you may have
additional bacterial contamination or other contamination issues present.

You assumed right! I'm sequencing a sister species. The thing is, Feulgen staining estimations for my species and sister species (S.scovelli)are around ~500 mb (https://doi.org/10.1186/s13059-016-1126-6)
In both cases assemblies are much smaller, and BUSCO results support the assembly size.
Notice, they also have this weird peak on the left side of the plot.

I have then another question.
For the second species I'm estimating genome size for, I had to limit coverage to 22x to get a good model fit, and the genome size I expect to see, both based on Feulgen staining and other sequencing data/publications.

Parameters Used:

GenomeScope version 2.0
input file = user_uploads/rDlrrs6Ixupcu2ml71yN
output directory = user_data/rDlrrs6Ixupcu2ml71yN
p = 2
k = 31
max_kmercov = 22

property	min	max
Homozygous (aa)	98.9585%	98.9735%
Heterozygous (ab)	1.02646%	1.04151%
Genome Haploid Length	1,750,614,073 bp	1,754,767,627 bp
Genome Repeat Length	138,686,621 bp	139,015,672 bp
Genome Unique Length	1,611,927,453 bp	1,615,751,955 bp
Model Fit	99.6728%	99.6728%
Read Error Rate	0.380185%	0.380185%

What could be the case here? Could it be because of relatively low coverage by HIFi?
If I remove the limit, then the genome size is much smaller than expected.

mschatz · 2024-12-23T23:43:46Z

For this second species, yes I think the low coverage is an issue. This shows there is only ~5x coverage per haplotype which is too low for a reliable estimate of genome size and too low for a good assembly. If at all possible I would recommend another sequencing run (or two) Good luck! Mike

…

On Mon, Dec 23, 2024 at 3:11 PM d00bin ***@***.***> wrote: Thanks for such a fast response, Mike! This estimates the haploid genome size to be about 600Mbp. This is still larger than the published genome (which I assume is this: https://pmc.ncbi.nlm.nih.gov/articles/PMC10038202/) so you may have additional bacterial contamination or other contamination issues present. You assumed right! I'm sequencing a sister species. The thing is, Feulgen staining estimations for my species and sister species (S.scovelli)are around ~500 mb (https://doi.org/10.1186/s13059-016-1126-6) In both cases assemblies are much smaller, and BUSCO results support the assembly size. I have then another question. For the second species I'm estimating genome size for I had to limit coverage to 22x to get good fit and the genome size I expect to see, both based on Feulgen staining and other sequencing data/publications. *Parameters Used:* - GenomeScope version 2.0 - input file = user_uploads/rDlrrs6Ixupcu2ml71yN - output directory = user_data/rDlrrs6Ixupcu2ml71yN - p = 2 - k = 31 - max_kmercov = 22 property min max Homozygous (aa) 98.9585% 98.9735% Heterozygous (ab) 1.02646% 1.04151% Genome Haploid Length 1,750,614,073 bp 1,754,767,627 bp Genome Repeat Length 138,686,621 bp 139,015,672 bp Genome Unique Length 1,611,927,453 bp 1,615,751,955 bp Model Fit 99.6728% 99.6728% Read Error Rate 0.380185% 0.380185% N.png (view on web) <https://github.com/user-attachments/assets/e77696af-5897-42da-9390-ed1f461ce48f> What could be the case here? Could it be because of relatively low coverage by HIFi? — Reply to this email directly, view it on GitHub <#149 (comment)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/AABP34YYUL2I7GDQ6HB7CIL2HBVAJAVCNFSM6AAAAABUDNEUQSVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDKNRQGI2DINZYHA> . You are receiving this because you commented.Message ID: ***@***.***>

d00bin · 2024-12-24T10:23:44Z

and too low for a good assembly. If at all
possible I would recommend another sequencing run (or two)

Not really possible. But it assembled quite well, and we are planning to combine it with some HiC data.

Thanks a lot for your advice!

d00bin · 2024-12-24T15:36:55Z

So, a quick update

I decided to try counting k-mers using KMC instead of jellyfish. My commands were:

kmc -fq -k31 -t32 -m120 -ci1 -cs10000 hifi_reads.fastq kmc_mer31 ./
kmc_tools transform kmc_mer31 histogram kmc31.histo

Species 1 :

Input File: user_uploads/3iAZHENvlZWekAy7Qdb3
Output Directory: user_data/3iAZHENvlZWekAy7Qdb3
Parameters:
- p = 2
- k = 31

Property Statistics

Property	Min	Max
Homozygous (aa)	99.2971%	99.3159%
Heterozygous (ab)	0.684124%	0.702948%
Genome Haploid Length	300,873,367 bp	301,469,212 bp
Genome Repeat Length	41,285,600 bp	41,367,361 bp
Genome Unique Length	259,587,768 bp	260,101,851 bp
Model Fit	80.5608%	94.3123%
Read Error Rate	0.291992%	0.291992%

Species 2 :

Input File: user_uploads/0k6rrojoYBVqtv91GXXj
Output Directory: user_data/0k6rrojoYBVqtv91GXXj
Parameters:
- p = 2
- k = 31

Property Statistics

Property	Min	Max
Homozygous (aa)	98.9544%	99.0176%
Heterozygous (ab)	0.982437%	1.04558%
Genome Haploid Length	1,340,636,981 bp	1,348,657,949 bp
Genome Repeat Length	528,026,187 bp	531,185,343 bp
Genome Unique Length	812,610,794 bp	817,472,606 bp
Model Fit	65.4274%	98.6014%
Read Error Rate	0.257065%	0.257065%

This time no thresholds. Not sure what to make of it.

mschatz · 2025-01-02T05:25:17Z

Interesting - This shows much better coverage. Both Jellyfish and KMC should give exactly the same output (they just count up how often different kmers occur in the reads) so Im guessing there was an error when you ran jellyfish - either the software didnt process all of the reads or maybe you missed the -C flag for jellyfish to count canonical kmers. But these results look good and would explain why the assembly came out as good as it did Good luck! Mike

…

On Tue, Dec 24, 2024 at 10:37 AM d00bin ***@***.***> wrote: Quick Update I decided to try counting k-mers using *KMC* instead of *jellyfish*. My commands were: kmc -fq -k31 -t32 -m120 -ci1 -cs10000 hifi_reads.fastq kmc_mer31 ./ kmc_tools transform kmc_mer31 histogram kmc31.histo Species 1 Input and Output Details - *Input File*: user_uploads/3iAZHENvlZWekAy7Qdb3 - *Output Directory*: user_data/3iAZHENvlZWekAy7Qdb3 - *Parameters*: - p = 2 - k = 31 Property Statistics Property Min Max Homozygous (aa) 99.2971% 99.3159% Heterozygous (ab) 0.684124% 0.702948% Genome Haploid Length 300,873,367 bp 301,469,212 bp Genome Repeat Length 41,285,600 bp 41,367,361 bp Genome Unique Length 259,587,768 bp 260,101,851 bp Model Fit 80.5608% 94.3123% Read Error Rate 0.291992% 0.291992% sp1.png (view on web) <https://github.com/user-attachments/assets/968c0db8-05ac-41ae-9478-773f56dd7081> Species 2: Input and Output Details - *Input File*: user_uploads/0k6rrojoYBVqtv91GXXj - *Output Directory*: user_data/0k6rrojoYBVqtv91GXXj - *Parameters*: - p = 2 - k = 31 Property Statistics Property Min Max Homozygous (aa) 98.9544% 99.0176% Heterozygous (ab) 0.982437% 1.04558% Genome Haploid Length 1,340,636,981 bp 1,348,657,949 bp Genome Repeat Length 528,026,187 bp 531,185,343 bp Genome Unique Length 812,610,794 bp 817,472,606 bp Model Fit 65.4274% 98.6014% Read Error Rate 0.257065% 0.257065% sp2.png (view on web) <https://github.com/user-attachments/assets/2da175f1-e52a-44ed-b910-864d26b23322> This time no thresholds. Not sure what to make of it. — Reply to this email directly, view it on GitHub <#149 (comment)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/AABP345AECECU6OA65B2T5L2HF5S3AVCNFSM6AAAAABUDNEUQSVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDKNRRGIZTSMJYGY> . You are receiving this because you commented.Message ID: ***@***.***>

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Limiting kmer coverage gives better fit. #149

Limiting kmer coverage gives better fit. #149

d00bin commented Dec 23, 2024

mschatz commented Dec 23, 2024 via email

d00bin commented Dec 23, 2024 •

edited

Loading

mschatz commented Dec 23, 2024 via email

d00bin commented Dec 24, 2024

d00bin commented Dec 24, 2024 •

edited

Loading

mschatz commented Jan 2, 2025 via email

Limiting kmer coverage gives better fit. #149

Limiting kmer coverage gives better fit. #149

Comments

d00bin commented Dec 23, 2024

And here I set limit to 1000000 for k31:

mschatz commented Dec 23, 2024 via email

d00bin commented Dec 23, 2024 • edited Loading

mschatz commented Dec 23, 2024 via email

d00bin commented Dec 24, 2024

d00bin commented Dec 24, 2024 • edited Loading

Species 1 :

Property Statistics

Species 2 :

Property Statistics

mschatz commented Jan 2, 2025 via email

d00bin commented Dec 23, 2024 •

edited

Loading

d00bin commented Dec 24, 2024 •

edited

Loading