Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Limiting kmer coverage gives better fit. #149

Open
d00bin opened this issue Dec 23, 2024 · 6 comments
Open

Limiting kmer coverage gives better fit. #149

d00bin opened this issue Dec 23, 2024 · 6 comments

Comments

@d00bin
Copy link

d00bin commented Dec 23, 2024

First, I would like to thank you for creating this software, and especially making the web tool available! It is amazingly useful!

I'm a bit confused about max kmer coverages, how to choose them, and whether there is even a need to choose them.

I'm using PacBio HiFi reads, and the coverage should be ~30-40%. There seems to be some contamination from bacteria, however.
BUSCO is 94% complete and 2%fragmented for assembly 397mb

Here Is my example:

Parameters Used:

  • GenomeScope version 2.0
  • input file = user_uploads/bHO3mNkRYTDjpeuUJUrL
  • output directory = user_data/bHO3mNkRYTDjpeuUJUrL
  • p = 2
  • k = 31
  • max_kmercov = 25
property min max
Homozygous (aa) 99.2346% 99.5236%
Heterozygous (ab) 0.476407% 0.76544%
Genome Haploid Length 342,463,429 bp 370,390,321 bp
Genome Repeat Length 0 bp 0 bp
Genome Unique Length 342,463,429 bp 370,390,321 bp
Model Fit 99.0663% 99.0663%
Read Error Rate 0.465775% 0.465775%

T1_K31

And here I set limit to 1000000 for k31:

Parameters Used:

  • GenomeScope version 2.0
  • input file = user_uploads/9q1g6l9n9oSvG3jylFCI
  • output directory = user_data/9q1g6l9n9oSvG3jylFCI
  • p = 2
  • k = 31
  • max_kmercov = 1000000

Analysis Results:

property min max
Homozygous (aa) 99.2179% 99.2439%
Heterozygous (ab) 0.75612% 0.782069%
Genome Haploid Length 621,267,086 bp 622,690,743 bp
Genome Repeat Length 109,355,991 bp 109,606,585 bp
Genome Unique Length 511,911,095 bp 513,084,158 bp
Model Fit 79.6629% 96.8566%
Read Error Rate 0.275908% 0.275908%

linear_plot

From a sister species and Busco results I expect the genome size to be ~400-420mb. Why in order to get a better model fit and seemingly more realistic estimation I need to lower the max coverage to 25?

Because if the high coverage result is true, then I'm lacking ~200mb which surely should not result in assembly BUSCO ~94%.

Also, is it normal to have uniq value - 100%?

Thank you for all your work!!

@mschatz
Copy link
Contributor

mschatz commented Dec 23, 2024 via email

@d00bin
Copy link
Author

d00bin commented Dec 23, 2024

Thanks for such a fast response, Mike!

This estimates the haploid genome size to be about 600Mbp. This is still
larger than the published genome (which I assume is this:
https://pmc.ncbi.nlm.nih.gov/articles/PMC10038202/) so you may have
additional bacterial contamination or other contamination issues present.

You assumed right! I'm sequencing a sister species. The thing is, Feulgen staining estimations for my species and sister species (S.scovelli)are around ~500 mb (https://doi.org/10.1186/s13059-016-1126-6)
In both cases assemblies are much smaller, and BUSCO results support the assembly size.
Notice, they also have this weird peak on the left side of the plot.

I have then another question.
For the second species I'm estimating genome size for, I had to limit coverage to 22x to get a good model fit, and the genome size I expect to see, both based on Feulgen staining and other sequencing data/publications.

Parameters Used:

  • GenomeScope version 2.0
  • input file = user_uploads/rDlrrs6Ixupcu2ml71yN
  • output directory = user_data/rDlrrs6Ixupcu2ml71yN
  • p = 2
  • k = 31
  • max_kmercov = 22
property min max
Homozygous (aa) 98.9585% 98.9735%
Heterozygous (ab) 1.02646% 1.04151%
Genome Haploid Length 1,750,614,073 bp 1,754,767,627 bp
Genome Repeat Length 138,686,621 bp 139,015,672 bp
Genome Unique Length 1,611,927,453 bp 1,615,751,955 bp
Model Fit 99.6728% 99.6728%
Read Error Rate 0.380185% 0.380185%

N

What could be the case here? Could it be because of relatively low coverage by HIFi?
If I remove the limit, then the genome size is much smaller than expected.

@mschatz
Copy link
Contributor

mschatz commented Dec 23, 2024 via email

@d00bin
Copy link
Author

d00bin commented Dec 24, 2024

and too low for a good assembly. If at all
possible I would recommend another sequencing run (or two)

Not really possible. But it assembled quite well, and we are planning to combine it with some HiC data.

Thanks a lot for your advice!

@d00bin
Copy link
Author

d00bin commented Dec 24, 2024

So, a quick update

I decided to try counting k-mers using KMC instead of jellyfish. My commands were:

kmc -fq -k31 -t32 -m120 -ci1 -cs10000 hifi_reads.fastq kmc_mer31 ./
kmc_tools transform kmc_mer31 histogram kmc31.histo

Species 1 :

  • Input File: user_uploads/3iAZHENvlZWekAy7Qdb3
  • Output Directory: user_data/3iAZHENvlZWekAy7Qdb3
  • Parameters:
    • p = 2
    • k = 31

Property Statistics

Property Min Max
Homozygous (aa) 99.2971% 99.3159%
Heterozygous (ab) 0.684124% 0.702948%
Genome Haploid Length 300,873,367 bp 301,469,212 bp
Genome Repeat Length 41,285,600 bp 41,367,361 bp
Genome Unique Length 259,587,768 bp 260,101,851 bp
Model Fit 80.5608% 94.3123%
Read Error Rate 0.291992% 0.291992%

sp1

Species 2 :

  • Input File: user_uploads/0k6rrojoYBVqtv91GXXj
  • Output Directory: user_data/0k6rrojoYBVqtv91GXXj
  • Parameters:
    • p = 2
    • k = 31

Property Statistics

Property Min Max
Homozygous (aa) 98.9544% 99.0176%
Heterozygous (ab) 0.982437% 1.04558%
Genome Haploid Length 1,340,636,981 bp 1,348,657,949 bp
Genome Repeat Length 528,026,187 bp 531,185,343 bp
Genome Unique Length 812,610,794 bp 817,472,606 bp
Model Fit 65.4274% 98.6014%
Read Error Rate 0.257065% 0.257065%

sp2

This time no thresholds. Not sure what to make of it.

@mschatz
Copy link
Contributor

mschatz commented Jan 2, 2025 via email

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants