understanding cause of poor model fit of heterozygote peak #144

Open
andrbern8000 opened this issue Sep 30, 2024 · 4 comments

Comments

@andrbern8000

Good afternoon,

I am assembling fish genomes de novo from HiFi data and have run into issues with a few of my target species (all diploid).
First, to better understand the size and heterozygosity of each genome and to confirm our estimates of sequence coverage, I ran meryl (default settings for 'count' and 'histogram', k = 21) and GenomeScope2 (default settings, k = 21).

The model fit reported in the GenomeScope2 summary was not too bad (~73-89%; see below), but in the plot the observed k-mer frequencies (blue line) for the 'heterozygote' peak do not match the distribution estimated by the full model (black line). Specifically, the observed peak spans a much wider coverage range than the full-model peak.

I am wondering what may be driving the difference between the observed profile and the full model (e.g., sequencing errors?) and whether it is a cause for concern (e.g., a data issue that needs to be addressed before assembly). Should I adjust any of the GenomeScope2 parameters?

I am very new to genome assembly and would appreciate any advice you (or anyone else) might have.

Thanks,
Andrea

GenomeScope version 2
p = 2
k = 21

property; min; max
Homozygous (aa); 98.04%; 98.10%
Heterozygous (ab); 1.90%; 1.96%
Genome Haploid Length; 377413934 bp; 379528391 bp
Genome Repeat Length; 61537310 bp; 61882072 bp
Genome Unique Length; 315876624 bp; 317646318 bp
Model Fit; 73.1021%; 88.551%
Read Error Rate; 0.460545%; 0.460545%
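
As a quick sanity check on the summary above, the sketch below parses the semicolon-separated rows and confirms that the reported repeat and unique lengths add up to the haploid length, and that the homozygous and heterozygous percentages are complementary. It is only an illustration: the helper names `parse_summary` and `length_gap` are made up for this example, the values are copied by hand from the table, and GenomeScope2 itself is not called.

```python
# Values copied from the GenomeScope2 summary above (min; max columns).
SUMMARY = """\
Homozygous (aa); 98.04%; 98.10%
Heterozygous (ab); 1.90%; 1.96%
Genome Haploid Length; 377413934 bp; 379528391 bp
Genome Repeat Length; 61537310 bp; 61882072 bp
Genome Unique Length; 315876624 bp; 317646318 bp
"""


def parse_summary(text):
    """Return {property: (min, max)} with '%' and 'bp' stripped (hypothetical helper)."""
    rows = {}
    for line in text.strip().splitlines():
        name, low, high = [field.strip() for field in line.split(";")]
        rows[name] = tuple(
            float(value.replace("%", "").replace("bp", "")) for value in (low, high)
        )
    return rows


def length_gap(rows, col):
    """Haploid length minus (repeat + unique) for column col (0 = min, 1 = max)."""
    return (
        rows["Genome Haploid Length"][col]
        - rows["Genome Repeat Length"][col]
        - rows["Genome Unique Length"][col]
    )


rows = parse_summary(SUMMARY)

for col, label in enumerate(("min", "max")):
    haploid = rows["Genome Haploid Length"][col]
    repeat = rows["Genome Repeat Length"][col]
    print(f"{label}: haploid = {haploid / 1e6:.1f} Mbp, "
          f"repeat fraction = {repeat / haploid:.3f}, "
          f"haploid - (repeat + unique) = {length_gap(rows, col):.0f} bp")

# The min homozygous value pairs with the max heterozygous value, and vice versa.
hom, het = rows["Homozygous (aa)"], rows["Heterozygous (ab)"]
print(f"hom(min) + het(max) = {hom[0] + het[1]:.2f}%")
print(f"hom(max) + het(min) = {hom[1] + het[0]:.2f}%")
```

The lengths agree to within 1 bp (presumably rounding in the report), so the summary is internally consistent; the check says nothing about whether the fitted model matches the observed k-mer profile.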

[Image: cc_meryl_genomescope2_k21]

@andrbern8000 changed the title from "understanding cause of poor model fit of heterozygosity peak" to "understanding cause of poor model fit of heterozygote peak" on Sep 30, 2024
@mschatz
Contributor

mschatz commented Oct 2, 2024 via email

@andrbern8000
Author

Hi Mike,

Thanks for your response. I have gone through the VGP pipeline on Galaxy using the sample data. It is a wonderful training manual/tutorial. Thank you! I’ve also been using hifiasm to run preliminary assemblies of the HiFi datasets that show clean k-mer profiles in GenomeScope2.

We will likely be obtaining Hi-C data to assist with the assemblies, and I’m hoping this will help improve their quality.
I’m sorry to trouble you, but I have another fish whose HiFi data produce problematic k-mer profiles in GenomeScope2. I’d appreciate any feedback on this issue as well.

First, genome size estimates (C-values) for congeners suggest these fishes should have a haploid genome size of ~500-600 Mbp.

Regarding the k-mer profile:

The heterozygous peak is a poor fit for the data (which I now know is okay), but the real issue is that as I increase the k-mer size from 21 to 31, the estimated haploid genome size almost doubles. I assume the cause of this change is the high estimated heterozygosity of the genome.

For instance, could it be that as the k-mer size increases (k = 21 to 31), more k-mers are identified as ‘unique’ rather than as the heterozygous counterpart of an existing (previously identified) k-mer, so that the haploid genome size estimate increases? Is this potentially the issue I’m experiencing, or is there another (more obvious) issue that I’ve missed? Will this lead to assembly issues that I should look out for?
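
To make the k-dependence in the question above concrete, here is a minimal sketch of how the share of heterozygous k-mers grows with k in an idealized diploid genome. It assumes heterozygous sites are independent and occur at a uniform per-base rate `r`, so a k-mer is heterozygous when at least one of its k bases is a variant site, with probability 1 - (1 - r)^k. The function name `het_kmer_fraction` and the rates in the loop are hypothetical illustrations, not values estimated from these data, and the sketch is not a diagnosis of the genome size change.

```python
def het_kmer_fraction(r, k):
    """Probability that a k-mer spans at least one heterozygous site.

    Assumes independent heterozygous sites at per-base rate r
    (a simplification; real variants cluster and include indels).
    """
    return 1.0 - (1.0 - r) ** k


# Hypothetical heterozygosity rates, chosen only for illustration.
for r in (0.005, 0.02, 0.05):
    f21 = het_kmer_fraction(r, 21)
    f31 = het_kmer_fraction(r, 31)
    print(f"r = {r:.3f}: heterozygous k-mer fraction "
          f"k=21 -> {f21:.3f}, k=31 -> {f31:.3f}")
```

Under these assumptions, a larger k moves more k-mers from the homozygous peak into the heterozygous peak, and the shift is larger at higher heterozygosity.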

Again, help/advice would be appreciated.
Andrea

See linear kmer plots and summary stats below.

[Image: elongatus_K21_linear_plot]

[Image: elongatus_K31_linear_plot]

@mschatz
Contributor

mschatz commented Oct 31, 2024 via email

@andrbern8000
Author

Hi Mike,
Thank you so much for all your help and guidance.
All the best. AMB
