understanding cause of poor model fit of heterozygote peak #144

andrbern8000 · 2024-09-30T20:34:55Z

Good afternoon,

I am assembling fish genomes de novo using HiFi data and have run into issues with a few of my target species (all diploid).
First, to better understand the size and heterozygosity of the genome and to confirm our estimates of sequence coverage, I ran meryl (default settings for 'count' and 'histogram', k = 21) and genomescope2 (default settings, k = 21).
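
For reference, the k-mer counting and model-fitting steps described above could be scripted roughly as follows. This is a minimal sketch, not the exact commands used in this thread: the file names (`reads.fastq.gz`, `reads.meryl`, `reads.hist`, `genomescope_out`) and the helper names are hypothetical, and it assumes `meryl` and the GenomeScope 2.0 `genomescope.R` script are installed and on the PATH.

```python
import subprocess
from pathlib import Path


def build_commands(reads, k=21, ploidy=2, db="reads.meryl",
                   hist="reads.hist", outdir="genomescope_out"):
    """Return the three command lines as argument lists (nothing is run here)."""
    # Count k-mers in the reads and store them in a meryl database.
    count = ["meryl", "count", f"k={k}", str(reads), "output", db]
    # Dump the k-mer frequency histogram; meryl writes it to stdout.
    histogram = ["meryl", "histogram", db]
    # Fit the GenomeScope 2.0 model to the histogram.
    genomescope = ["genomescope.R", "-i", hist, "-o", outdir,
                   "-k", str(k), "-p", str(ploidy)]
    return count, histogram, genomescope


def run_pipeline(reads, hist="reads.hist", **kwargs):
    """Run the three steps in order; requires the external tools to exist."""
    count, histogram, genomescope = build_commands(reads, hist=hist, **kwargs)
    subprocess.run(count, check=True)
    with Path(hist).open("w") as handle:
        subprocess.run(histogram, check=True, stdout=handle)
    subprocess.run(genomescope, check=True)


if __name__ == "__main__":
    # Print the commands instead of running them.
    for cmd in build_commands("reads.fastq.gz"):
        print(" ".join(cmd))
```

Calling `run_pipeline("reads.fastq.gz", k=31)` would repeat the analysis at a different k-mer size, provided the tools are available.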

The model fit reported in the genomescope2 summary was not too bad (~73-89%, see below), but when the results were visualized, the observed kmer frequencies (blue line) for the 'heterozygote' peak did not match the distribution estimated by the full model (black line). Basically, the observed peak spans a much wider coverage range than the full model peak.

I am wondering what may be driving this difference between the observed data and the full model (e.g., sequencing errors) and whether it is a cause for concern (i.e., a data issue that needs to be addressed prior to assembly). Should I adjust some of the genomescope2 parameters?

I am very new to genome assembly and would appreciate any advice you (or anyone else) might have.

Thanks,
Andrea

GenomeScope version 2
p = 2
k = 21

property; min; max
Homozygous (aa); 98.04%; 98.10%
Heterozygous (ab); 1.90%; 1.96%
Genome Haploid Length; 377413934 bp; 379528391 bp
Genome Repeat Length; 61537310 bp; 61882072 bp
Genome Unique Length; 315876624 bp; 317646318 bp
Model Fit; 73.1021%; 88.551%
Read Error Rate; 0.460545%; 0.460545%
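
As a quick sanity check on the summary above, the reported lengths and percentages can be compared against each other. The sketch below uses only the numbers in the table; the variable and function names are illustrative, and the one-base-pair tolerance is an assumption to allow for rounding in the report.

```python
# Values copied from the GenomeScope summary above, as (min, max).
haploid_bp = (377_413_934, 379_528_391)
repeat_bp = (61_537_310, 61_882_072)
unique_bp = (315_876_624, 317_646_318)
homozygous_pct = (98.04, 98.10)
heterozygous_pct = (1.90, 1.96)


def repeat_fraction(repeat, haploid):
    """Share of the haploid genome reported as repetitive."""
    return repeat / haploid


for label, rep, uniq, hap in zip(("min", "max"), repeat_bp, unique_bp, haploid_bp):
    gap = hap - (rep + uniq)  # difference between haploid and unique + repeat
    print(f"{label}: haploid - (unique + repeat) = {gap} bp, "
          f"repeat fraction = {repeat_fraction(rep, hap):.1%}")

# The lowest homozygosity pairs with the highest heterozygosity, and vice versa.
for hom, het in zip(homozygous_pct, reversed(heterozygous_pct)):
    print(f"aa {hom:.2f}% + ab {het:.2f}% = {hom + het:.2f}%")
```

The unique and repeat lengths add up to the haploid length to within rounding, and the repeat fraction comes out at roughly 16% for both bounds.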

mschatz · 2024-10-02T16:37:44Z

Thanks for your interest. I agree the model fit diverges from the observed data, but this is not uncommon as the modeling expects an idealized coverage distribution. We have seen that some fish genomes have certain repeats that can be a little tricky for HiFi, but given the level of coverage you have here I would nevertheless expect a good assembly.

For HiFi data I would recommend using the hifiasm genome assembler. You may also want to check out the pipeline we developed for VGP and the associated workflow we have in Galaxy that uses hifiasm plus a few pre- and post-assembly tools for QC and packaging: https://www.nature.com/articles/s41587-023-02100-3

Good luck!
Mike

Attachment to the original post: cc_meryl_genomescope2_k21.png <https://github.com/user-attachments/assets/77c6a160-317b-4c39-af2c-7792cc3b993e>

andrbern8000 · 2024-10-04T18:52:28Z

Hi Mike,

Thanks for your response. I have reviewed/gone through the VGP pipeline on Galaxy using the sample data. It is a wonderful training manual/tutorial. Thank you! I’ve also been using hifiasm to perform some preliminary assemblies on the hifi datasets that show clean kmer profiles using genomescope2.

We will likely be obtaining Hi-C data to assist in the assemblies and I’m hoping this will help improve the quality.
I’m sorry to trouble you, but I have another fish that has generated problematic kmer profiles with hifi data and using genomescope2. I’d appreciate any feedback on this issue as well.

First: Genome size estimates of congeners (c-values) suggest these fishes should have a haploid genome size of ~500-600 Mbp.

Regarding the kmer profile:

The heterozygous peak is a poor fit for the data (which I now know is okay), but the real issue is that as I adjust the kmer size from 21 to 31, the estimated haploid genome size almost doubles. I am assuming the cause of this change is the high estimated heterozygosity of the genome.

For instance, could it be that as the kmer size increases (k = 21 to 31), more kmers are being identified as ‘unique’ rather than simply as the heterozygous counterpart to an existing (and previously identified) kmer, so that the haploid genome size increases? Is this potentially the issue I’m experiencing? Or is there another (more obvious) issue that I’ve missed? Will this lead to assembly issues that I should look out for?
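
One way to make this intuition concrete is a simple model in which heterozygous sites are independent and occur at a per-base rate r, so that a kmer of length k avoids every such site with probability (1 - r)^k. The sketch below is only an illustration under that assumption, not GenomeScope's full model; the rate of 2% is an illustrative value and the function name is hypothetical.

```python
def het_kmer_fraction(rate, k):
    """Fraction of k-mers that overlap at least one heterozygous site,
    assuming independent sites at a uniform per-base rate."""
    if not 0.0 <= rate <= 1.0:
        raise ValueError("rate must be between 0 and 1")
    return 1.0 - (1.0 - rate) ** k


if __name__ == "__main__":
    rate = 0.02  # illustrative heterozygosity, not a measured value
    for k in (21, 31):
        share = het_kmer_fraction(rate, k)
        print(f"k = {k}: {share:.1%} of k-mers span a heterozygous site")
```

Under these assumptions a larger share of kmers is haplotype-specific at k = 31 than at k = 21, which moves kmers from the homozygous peak to the heterozygous peak. It does not by itself establish why the genome size estimate changes.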

Again, help/advice would be appreciated.
Andrea

See linear kmer plots and summary stats below.

mschatz · 2024-10-31T04:23:17Z

This comes up in tricky cases where it is ambiguous whether the genome has a larger haploid genome size with a lower rate of heterozygosity or a smaller genome size with a higher rate of heterozygosity. In your data the options are 1 Gbp / 0.66% with an average coverage of 17, or 532 Mbp / 2.42% with an average coverage of 31. GenomeScope uses a heuristic to decide, and it is sensitive to the shape of the peaks. Changing the kmer size distorts the shape of the peaks, so it flips between these two estimates. You can also force it to pick one of these versions by setting the parameter "Average k-mer coverage for polyploid genome" (here you could set this to either 17 or 31 to force it into one of these modes).

Fortunately, you know the haploid genome size is about 500 Mbp, so we can assume the 532 Mbp / 2.42% version is correct. With this rate of heterozygosity the two haplotypes will largely be assembled separately by an assembler like hifiasm, and you should see widespread gene duplicates in BUSCO. You can further confirm this estimate by aligning the duplicate genes to each other and confirming the divergence rate is about 2.42%.

Hope this helps!
Mike
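
The ambiguity can be illustrated with back-of-the-envelope arithmetic: for a fixed set of reads, the total number of kmers is roughly proportional to genome size times kmer coverage, so both readings of the same histogram should imply about the same product. The sketch below uses only the figures quoted in this thread; treating the product as a simple proportionality (ignoring sequencing errors and any constant factor in how coverage is defined) is an assumption, and the names are illustrative.

```python
# (haploid genome size in bp, average k-mer coverage, heterozygosity in %)
candidates = {
    "larger genome, lower heterozygosity": (1_000_000_000, 17, 0.66),
    "smaller genome, higher heterozygosity": (532_000_000, 31, 2.42),
}

# Expected haploid size from the c-values of congeners, in bp.
expected_range = (500_000_000, 600_000_000)


def kmer_mass(genome_bp, coverage):
    """Genome size times coverage, proportional to the total k-mer count."""
    return genome_bp * coverage


masses = {name: kmer_mass(size, cov) for name, (size, cov, _het) in candidates.items()}
ratio = max(masses.values()) / min(masses.values())

for name, (size, cov, het) in candidates.items():
    in_range = expected_range[0] <= size <= expected_range[1]
    print(f"{name}: size x coverage = {masses[name]:.3e}, "
          f"heterozygosity = {het}%, within expected size range: {in_range}")
print(f"ratio between the two products: {ratio:.2f}")
```

The two products differ by only a few percent, which is why the histogram alone cannot separate the two readings; the independent size estimate is what selects the 532 Mbp interpretation.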

Attachments to the Oct 4 post (linear kmer plots and summary stats):
elongatus_K21_linear_plot.png <https://github.com/user-attachments/assets/e3caffae-161d-493b-857f-fc384ab856b2>
image.png <https://github.com/user-attachments/assets/992c62d5-c117-41db-bbca-34e546ad8b77>
elongatus_K31_linear_plot.png <https://github.com/user-attachments/assets/8bfd5de0-a563-41ed-9f19-688ae3fa154d>
image.png <https://github.com/user-attachments/assets/490cf9d7-203b-4afc-a1d8-454bd991351a>

andrbern8000 · 2024-11-15T00:22:36Z

Hi Mike,
Thank you so much for all your help and guidance.
All the best. AMB

andrbern8000 changed the title ~~understanding cause of poor model fit of heterozygosity peak~~ understanding cause of poor model fit of heterozygote peak Sep 30, 2024
