Limiting kmer coverage gives better fit. #149
Comments
Thanks for your interest. The main reason for the max kmer frequency is to
exclude very high frequency kmers, as these are usually contaminants like
phiX or other unnatural sequences. It looks like you might have some, as you
have an abnormal peak at around 7,000x to 9,000x coverage. It
is possible that these are really part of the genome, but it seems somewhat
unlikely. So I would recommend a cutoff of about 5,000x to exclude these,
which is what I ran here for you:
http://genomescope.org/genomescope2/analysis.php?code=u3qung3ryHjPCr3Fn21u
This estimates the haploid genome size to be about 600 Mbp. This is still
larger than the published genome (which I assume is this:
https://pmc.ncbi.nlm.nih.gov/articles/PMC10038202/), so you may have
additional bacterial contamination or other contamination issues present.
To check this, I would try aligning your assembly to the reference (with minimap
or mummer) to see how well it matches. You can also try screening the reads
or contigs using kraken or blast to see which are likely to be from
bacteria. But I would not set the max kmer coverage to only 25, as this will
exclude too many real kmers from the genome. I did notice that you have a
much lower rate of heterozygosity (0.8% compared to 1.2% in the paper), so
perhaps your sample is quite different from the published genome. For
context, a human genome has a heterozygosity rate of about 0.1%, so this is
a major difference.
Good luck
Mike
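To illustrate why a cutoff near 5,000x removes a contaminant spike while a cutoff of 25 removes the genome itself, here is a minimal sketch on a toy histogram. It is not GenomeScope's model: `naive_genome_size` is a hypothetical helper that simply divides the total k-mer count by the coverage of the tallest peak, and all histogram values are made up for illustration.

```python
def truncate_histogram(hist, max_kmercov):
    """Keep only (coverage, count) rows at or below the cutoff."""
    return [(cov, count) for cov, count in hist if cov <= max_kmercov]


def naive_genome_size(hist, min_cov=5):
    """Rough size estimate: total k-mers / coverage of the tallest peak.

    Rows below min_cov are treated as sequencing errors and ignored.
    This is an illustrative shortcut, not the GenomeScope fit.
    """
    usable = [(cov, count) for cov, count in hist if cov >= min_cov]
    if not usable:
        return 0
    peak_cov = max(usable, key=lambda row: row[1])[0]
    total_kmers = sum(cov * count for cov, count in usable)
    return round(total_kmers / peak_cov)


# Toy histogram (made-up values): an error peak at 1x, a genomic peak
# around 30x, and a small contaminant spike at 8,000x.
toy_hist = [(1, 1000), (29, 80), (30, 100), (31, 80), (8000, 2)]

no_cutoff = naive_genome_size(toy_hist)                                # 793
cutoff_5000 = naive_genome_size(truncate_histogram(toy_hist, 5000))    # 260
cutoff_25 = naive_genome_size(truncate_histogram(toy_hist, 25))        # 0

print(no_cutoff, cutoff_5000, cutoff_25)
```

In this toy case two contaminant k-mers at 8,000x triple the estimate, while a cutoff below the genomic peak leaves nothing to fit.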
On Mon, Dec 23, 2024 at 12:29 PM d00bin wrote:
First, I would like to thank you for creating this software, and
especially making the web tool available! It is amazingly useful!
I'm a bit confused about max kmer coverages, how to choose them, and
whether there is even a need to choose them.
I'm using PacBio HiFi reads, and the coverage should be ~30-40x. There
seems to be some contamination from bacteria, however.
BUSCO is 94% complete and 2% fragmented for a 397 Mb assembly.
Here is my example:
*Parameters Used:*
- GenomeScope version 2.0
- input file = user_uploads/bHO3mNkRYTDjpeuUJUrL
- output directory = user_data/bHO3mNkRYTDjpeuUJUrL
- p = 2
- k = 31
- max_kmercov = 25
| Property | Min | Max |
|---|---|---|
| Homozygous (aa) | 99.2346% | 99.5236% |
| Heterozygous (ab) | 0.476407% | 0.76544% |
| Genome Haploid Length | 342,463,429 bp | 370,390,321 bp |
| Genome Repeat Length | 0 bp | 0 bp |
| Genome Unique Length | 342,463,429 bp | 370,390,321 bp |
| Model Fit | 99.0663% | 99.0663% |
| Read Error Rate | 0.465775% | 0.465775% |
Syngnathus_typhle_K31.png (view on web)
<https://github.com/user-attachments/assets/f7e49cc0-ab83-47c8-b57c-3cffcc8831d6>
And here I set the limit to 1000000 for k = 31:
*Parameters Used:*
- GenomeScope version 2.0
- input file = user_uploads/9q1g6l9n9oSvG3jylFCI
- output directory = user_data/9q1g6l9n9oSvG3jylFCI
- p = 2
- k = 31
- max_kmercov = 1000000
*Analysis Results:*
| Property | Min | Max |
|---|---|---|
| Homozygous (aa) | 99.2179% | 99.2439% |
| Heterozygous (ab) | 0.75612% | 0.782069% |
| Genome Haploid Length | 621,267,086 bp | 622,690,743 bp |
| Genome Repeat Length | 109,355,991 bp | 109,606,585 bp |
| Genome Unique Length | 511,911,095 bp | 513,084,158 bp |
| Model Fit | 79.6629% | 96.8566% |
| Read Error Rate | 0.275908% | 0.275908% |
linear_plot.png (view on web)
<https://github.com/user-attachments/assets/c8b8f3fb-3f30-477d-9821-4d9c3b4d584c>
From a sister species and BUSCO results, I expect the genome size to be
~400-420 Mb. Why do I need to lower the max coverage to 25 in order to get
a better model fit and a seemingly more realistic estimate?
If the high coverage result is true, then I'm lacking ~200 Mb, which
surely should not result in an assembly with BUSCO ~94%.
Also, is it normal for the unique length to be 100% of the genome?
*Thank you for all your work!!*
Thanks for such a fast response, Mike!
You assumed right! I'm sequencing a sister species. The thing is, Feulgen staining estimates for my species and the sister species (S. scovelli) are around 500 Mb (https://doi.org/10.1186/s13059-016-1126-6).
I then have another question.
What could be the case here? Could it be because of relatively low coverage by HiFi?
For this second species, yes, I think the low coverage is an issue. This
shows there is only ~5x coverage per haplotype, which is too low for a
reliable estimate of genome size and too low for a good assembly. If at all
possible, I would recommend another sequencing run (or two).
Good luck!
Mike
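As a rough illustration of what "coverage per haplotype" means, the sketch below converts read coverage to expected k-mer coverage with the standard relation C_k = C * (L - k + 1) / L and then divides by ploidy. The read coverage (10x) and read length (15,000 bp) are hypothetical values chosen only for the example; they are not taken from this dataset.

```python
def kmer_coverage(read_coverage, read_length, k):
    """Expected k-mer coverage for error-free reads: C * (L - k + 1) / L."""
    return read_coverage * (read_length - k + 1) / read_length


def per_haplotype_coverage(total_kmer_coverage, ploidy=2):
    """Split total k-mer coverage evenly across haplotypes."""
    return total_kmer_coverage / ploidy


# Hypothetical inputs: 10x total read coverage, 15 kb HiFi reads, k = 31.
total = kmer_coverage(read_coverage=10, read_length=15_000, k=31)
hap = per_haplotype_coverage(total, ploidy=2)
print(f"total k-mer coverage ~{total:.2f}x, per haplotype ~{hap:.2f}x")
```

For long reads the k-mer correction is small, so per-haplotype coverage is close to the read coverage divided by the ploidy.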
On Mon, Dec 23, 2024 at 3:11 PM d00bin wrote:
Thanks for such a fast response, Mike!
> This estimates the haploid genome size to be about 600Mbp. This is still
> larger than the published genome (which I assume is this:
> https://pmc.ncbi.nlm.nih.gov/articles/PMC10038202/) so you may have
> additional bacterial contamination or other contamination issues present.
You assumed right! I'm sequencing a sister species. The thing is, Feulgen
staining estimates for my species and the sister species (S. scovelli) are
around 500 Mb (https://doi.org/10.1186/s13059-016-1126-6).
In both cases the assemblies are much smaller, and BUSCO results support the
assembly size.
I then have another question.
For the second species whose genome size I'm estimating, I had to limit
coverage to 22x to get a good fit and the genome size I expect to see, based
on both Feulgen staining and other sequencing data/publications.
*Parameters Used:*
- GenomeScope version 2.0
- input file = user_uploads/rDlrrs6Ixupcu2ml71yN
- output directory = user_data/rDlrrs6Ixupcu2ml71yN
- p = 2
- k = 31
- max_kmercov = 22
| Property | Min | Max |
|---|---|---|
| Homozygous (aa) | 98.9585% | 98.9735% |
| Heterozygous (ab) | 1.02646% | 1.04151% |
| Genome Haploid Length | 1,750,614,073 bp | 1,754,767,627 bp |
| Genome Repeat Length | 138,686,621 bp | 139,015,672 bp |
| Genome Unique Length | 1,611,927,453 bp | 1,615,751,955 bp |
| Model Fit | 99.6728% | 99.6728% |
| Read Error Rate | 0.380185% | 0.380185% |
N.png (view on web)
<https://github.com/user-attachments/assets/e77696af-5897-42da-9390-ed1f461ce48f>
What could be the case here? Could it be because of relatively low
coverage by HiFi?
Not really possible. But it assembled quite well, and we are planning to combine it with some Hi-C data. Thanks a lot for your advice!
So, a quick update: I decided to try counting k-mers using KMC instead of Jellyfish. My commands were:
kmc -fq -k31 -t32 -m120 -ci1 -cs10000 hifi_reads.fastq kmc_mer31 ./
kmc_tools transform kmc_mer31 histogram kmc31.histo
Species 1: Property Statistics
Species 2: Property Statistics
This time no thresholds. Not sure what to make of it.
Interesting. This shows much better coverage. Both Jellyfish and KMC
should give exactly the same output (they just count up how often different
kmers occur in the reads), so I'm guessing there was an error when you ran
Jellyfish: either the software didn't process all of the reads, or maybe you
missed the -C flag for Jellyfish to count canonical kmers. But these
results look good and would explain why the assembly came out as good as it
did.
Good luck!
Mike
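To show why missing canonical counting can lower apparent coverage, here is a small self-contained sketch. It is a toy counter written for illustration, not how Jellyfish or KMC are implemented; the function names and the two short reads are assumptions made up for the example. A canonical k-mer is taken here as the lexicographically smaller of a k-mer and its reverse complement, so reads from opposite strands add to the same count.

```python
from collections import Counter

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def count_kmers(reads, k, canonical):
    """Count k-mers; with canonical=True, merge each k-mer with its reverse complement."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            if canonical:
                kmer = min(kmer, reverse_complement(kmer))
            counts[kmer] += 1
    return counts


# Two toy reads covering the same locus from opposite strands.
forward = "ACGGT"
reads = [forward, reverse_complement(forward)]  # second read is "ACCGT"

stranded = count_kmers(reads, k=3, canonical=False)
merged = count_kmers(reads, k=3, canonical=True)

print(len(stranded), sorted(stranded.values()))  # 6 distinct k-mers, each seen once
print(len(merged), sorted(merged.values()))      # 3 distinct k-mers, each seen twice
```

Without canonical counting the same locus is split across twice as many k-mers at half the count, which shifts the coverage peaks in the histogram downward.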
On Tue, Dec 24, 2024 at 10:37 AM d00bin wrote:
Quick Update
I decided to try counting k-mers using *KMC* instead of *jellyfish*. My
commands were:
kmc -fq -k31 -t32 -m120 -ci1 -cs10000 hifi_reads.fastq kmc_mer31 ./
kmc_tools transform kmc_mer31 histogram kmc31.histo
Species 1 Input and Output Details
- *Input File*: user_uploads/3iAZHENvlZWekAy7Qdb3
- *Output Directory*: user_data/3iAZHENvlZWekAy7Qdb3
- *Parameters*:
- p = 2
- k = 31
Property Statistics
| Property | Min | Max |
|---|---|---|
| Homozygous (aa) | 99.2971% | 99.3159% |
| Heterozygous (ab) | 0.684124% | 0.702948% |
| Genome Haploid Length | 300,873,367 bp | 301,469,212 bp |
| Genome Repeat Length | 41,285,600 bp | 41,367,361 bp |
| Genome Unique Length | 259,587,768 bp | 260,101,851 bp |
| Model Fit | 80.5608% | 94.3123% |
| Read Error Rate | 0.291992% | 0.291992% |
sp1.png (view on web)
<https://github.com/user-attachments/assets/968c0db8-05ac-41ae-9478-773f56dd7081>
Species 2: Input and Output Details
- *Input File*: user_uploads/0k6rrojoYBVqtv91GXXj
- *Output Directory*: user_data/0k6rrojoYBVqtv91GXXj
- *Parameters*:
- p = 2
- k = 31
Property Statistics
| Property | Min | Max |
|---|---|---|
| Homozygous (aa) | 98.9544% | 99.0176% |
| Heterozygous (ab) | 0.982437% | 1.04558% |
| Genome Haploid Length | 1,340,636,981 bp | 1,348,657,949 bp |
| Genome Repeat Length | 528,026,187 bp | 531,185,343 bp |
| Genome Unique Length | 812,610,794 bp | 817,472,606 bp |
| Model Fit | 65.4274% | 98.6014% |
| Read Error Rate | 0.257065% | 0.257065% |
sp2.png (view on web)
<https://github.com/user-attachments/assets/2da175f1-e52a-44ed-b910-864d26b23322>
This time no thresholds. Not sure what to make of it.