genomescope2 underestimates the genome size of an autotetraploid (a fish from the Tibetan Plateau, China) #150

valeryzhu · 2024-12-28T07:05:46Z

Thank you for your fantastic bioinformatics tool! 😄
However, when I used "KMC + genomescope2" to estimate the genome size from Illumina sequence data, I obtained a haploid size of about 750M, which would suggest a tetraploid size of about 3G. 😭 There is evidence, though, that the tetraploid genome should be about 4G:

- We used PacBio long-read sequencing and Hi-C to assemble the diploid genome, which is about 2.2G in size, implying that the tetraploid genome should be around 4.4G.
- Another study on a closely related fish species assembled a diploid genome of about 4G. They used genomescope1 to evaluate the tetraploid genome size and obtained a result of 2G, which suggests that the total genome size is 4G, consistent with the assembled result.
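To make the size mismatch explicit, here is a small arithmetic sketch using only the figures quoted above. It assumes "M" and "G" mean Mb and Gb, and that a diploid assembly represents two of the four genome copies; the function names are illustrative only.

```python
# Figures quoted in the post; "M"/"G" are assumed to mean Mb/Gb.
HAPLOID_ESTIMATE_GB = 0.75   # GenomeScope2 haploid estimate (about 750M)
DIPLOID_ASSEMBLY_GB = 2.2    # PacBio + Hi-C diploid assembly
PLOIDY = 4                   # autotetraploid


def total_from_haploid(haploid_gb, ploidy):
    """Total genome size implied by a haploid estimate."""
    return haploid_gb * ploidy


def total_from_diploid(diploid_gb, ploidy):
    """Total genome size implied by a diploid assembly (two of `ploidy` copies)."""
    return diploid_gb * ploidy / 2


kmer_total = total_from_haploid(HAPLOID_ESTIMATE_GB, PLOIDY)
assembly_total = total_from_diploid(DIPLOID_ASSEMBLY_GB, PLOIDY)
shortfall = 1 - kmer_total / assembly_total

print(f"k-mer based total:    {kmer_total:.1f} Gb")
print(f"assembly based total: {assembly_total:.1f} Gb")
print(f"k-mer estimate is {shortfall:.0%} lower")
```

Under these assumptions the k-mer based total is roughly a third lower than the assembly based total, which is the gap in question.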

I have been confused by this problem for a few weeks. I tried many parameters, but the estimated size is still within 700M~800M.
Here is my code:
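```
# Count 21-mers with 16 threads and 64 GB RAM; keep k-mers seen at least once, cap counters at 100000
kmc -k21 -t16 -m64 -ci1 -cs100000 @kmc_FILES kmcdb kmc_tmp
# Export the k-mer histogram up to a count of 10000
kmc_tools transform kmcdb histogram sample.histo -cx10000
# Fit the GenomeScope2 model with k=21 and ploidy 4
genomescope2 -i sample.histo -o histolpot -k 21 -p 4
```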

Here is my result (my tutor asked us to produce PDF files, so I had to change the plot function 😭):

linear_plot.pdf

mschatz · 2025-01-02T04:52:45Z

Thanks for your interest. The fit here looks pretty good, but it only shows the kmers that have relatively low frequency. Can you send the full sample.histo file?

I have also encountered issues with HiFi data in fishes before where the repeats are underestimated, especially GA repeats. This is due to how the PacBio basecaller processes these types of data. Do you have Illumina data available? That might be more reliable for assessing the genome.

Another idea is to align your HiFi reads to your assembly, and then you can measure the coverage you have in different sequence contexts. If the coverage substantially deviates from a Poisson or negative binomial distribution, then the genome size estimate will be too small.

Good luck!

Mike
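The coverage check described in this reply could be sketched roughly as below. This is not part of GenomeScope; it assumes per-window read depths have already been extracted from the alignment (for example with a tool such as `samtools depth`) and labelled by sequence context. The labels and the toy depths are made up for illustration. The variance-to-mean ratio is only a crude test against a Poisson distribution, for which it is close to 1.

```python
from statistics import mean, pvariance


def dispersion_index(depths):
    """Variance-to-mean ratio of read depth; about 1 for Poisson-like coverage."""
    m = mean(depths)
    if m == 0:
        raise ValueError("mean depth is zero")
    return pvariance(depths) / m


def coverage_by_context(records):
    """Mean depth per context label from (label, depth) pairs."""
    groups = {}
    for label, depth in records:
        groups.setdefault(label, []).append(depth)
    return {label: mean(values) for label, values in groups.items()}


# Toy data (hypothetical): windows labelled by sequence context.
toy = [
    ("GA_repeat", 8), ("GA_repeat", 10),
    ("other", 30), ("other", 32), ("other", 28),
]

means = coverage_by_context(toy)
index = dispersion_index([depth for _, depth in toy])
print(means)
print(f"dispersion index: {index:.2f}")
```

A dispersion index well above 1, or one context with much lower mean depth than the rest, would be a hint that some sequence is under-represented in the reads.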

valeryzhu · 2025-01-04T08:04:52Z

Here are my histo file and the other plots:

sample.histo.txt
linear_plot.pdf
![log_plot](https://github.com/user-attachments/assets/3820062d-bd7d-4f00-b3af-03995f228070)
Just as you said, the model fits very well.

Maybe I didn't make myself clear, because I'm not a native English speaker.
I used Illumina data for the analysis (KMC + genomescope2), and the genome size **(haploid)** was estimated to be 718M.
My tutor used the PacBio data to assemble the genome, and the genome size **(diploid)** was estimated to be 2.2G.

I will look into the GA repeats and the alignment approach. Your answer was genuinely helpful.

mschatz · 2025-01-04T20:19:04Z

I just tried your sample.histo.txt file with genomescope2 using ploidy=4, and it estimates the haploid genome size to be 812Mb: http://genomescope.org/genomescope2/analysis.php?code=272CQJaEkjzkAtXQBErg

I noticed the histogram file cuts out at 100,000 so is missing some of the very high frequency repeats, which will tend to cause the genome size to be underestimated, so the haploid genome size could be closer to 1Gb.

On the other hand, assemblers are prone to duplicate sequences in high ploidy genomes, so your diploid estimate could be inflated. How do the BUSCO results look? I would not be surprised if you had an excess of duplicated BUSCOs in your diploid assembly.

Good luck!

Mike
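To illustrate why a truncated histogram pulls the estimate down, here is a deliberately simplified sketch. It is not GenomeScope's model fit: it only divides the total number of counted k-mers by an assumed haploid k-mer coverage. The histogram values, `haploid_cov`, and the cutoffs are toy numbers chosen to exaggerate the effect.

```python
def parse_histo(text):
    """Parse 'multiplicity count' lines into a list of (multiplicity, count) pairs."""
    pairs = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) == 2:
            pairs.append((int(fields[0]), int(fields[1])))
    return pairs


def naive_genome_size(histo, haploid_cov, min_cov=1, max_cov=None):
    """Total k-mers within [min_cov, max_cov] divided by the haploid coverage."""
    total = sum(
        m * c
        for m, c in histo
        if m >= min_cov and (max_cov is None or m <= max_cov)
    )
    return total / haploid_cov


# Toy histogram (hypothetical): an error bin, two genomic bins, one high-copy repeat bin.
toy_histo = parse_histo("1 1000\n20 500\n40 100\n50000 2\n")

full = naive_genome_size(toy_histo, haploid_cov=20, min_cov=2)
truncated = naive_genome_size(toy_histo, haploid_cov=20, min_cov=2, max_cov=10000)
print(f"full histogram:      {full:.0f}")
print(f"truncated histogram: {truncated:.0f}")
```

Only two distinct k-mers sit in the high-frequency bin, yet they account for most of the total, so dropping that bin shrinks the naive estimate sharply. Real data will be far less extreme, but the direction of the bias is the same.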
