genomescope2 underestimate the genome size of autotetraploid (a fish from Tibet Plateau,China ) #150

Open
valeryzhu opened this issue Dec 28, 2024 · 3 comments

Comments

@valeryzhu

Thank you for your fantastic bioinformatics tool! 😄
However, when I used “KMC + genomescope2” to estimate the genome size from Illumina sequencing data, I obtained a haploid size of about 750M, which would imply a tetraploid size of about 3G. 😭 There is evidence, though, that the tetraploid genome should be about 4G:

  • We assembled a diploid genome from PacBio long reads and Hi-C data. It is about 2.2G in size, implying that the tetraploid genome should be around 4.4G.

  • Another study of a closely related fish species assembled a diploid genome of about 4G. They used genomescope1 to estimate the tetraploid genome size and obtained 2G, which suggests a total genome size of 4G, consistent with the assembly.

I was confused by this problem for a few weeks. I tried many parameters, but the estimated size is still within 700M~800M.
Here is my code:

    kmc -k21 -t16 -m64 -ci1 -cs100000 @kmc_FILES kmcdb kmc_tmp
    kmc_tools transform kmcdb histogram sample.histo -cx10000
    genomescope2 -i sample.histo -o histolpot -k 21 -p 4
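As a rough cross-check on the fitted model, a naive size estimate can be computed directly from the histogram: the total number of k-mer occurrences divided by the 1n (per-haplotype-copy) coverage gives the combined length of all copies, and dividing again by the ploidy gives a haploid size. The sketch below assumes the two-column histogram (multiplicity, number of distinct k-mers) written by `kmc_tools transform ... histogram`. The names `read_histo` and `naive_haploid_size`, the `min_count` error cutoff, and the toy numbers are hypothetical, not part of KMC or GenomeScope, and the result is only a back-of-the-envelope figure.

```python
from io import StringIO


def read_histo(handle):
    """Parse a two-column k-mer histogram: <multiplicity> <distinct k-mers>."""
    histo = []
    for line in handle:
        parts = line.split()
        if len(parts) != 2:
            continue  # skip blank or malformed lines
        histo.append((int(parts[0]), int(parts[1])))
    return histo


def naive_haploid_size(histo, one_n_coverage, ploidy, min_count=5):
    """Total k-mer occurrences / (1n coverage * ploidy).

    min_count is an assumed cutoff that drops low-multiplicity k-mers,
    which are mostly sequencing errors.
    """
    total_kmers = sum(count * n for count, n in histo if count >= min_count)
    return total_kmers / (one_n_coverage * ploidy)


# Toy histogram (made-up numbers), only to show the call pattern.
toy = read_histo(StringIO("1\t1000\n10\t100\n20\t50\n"))
print(naive_haploid_size(toy, one_n_coverage=10, ploidy=2))
print(naive_haploid_size(toy, one_n_coverage=20, ploidy=2))
```

Because the estimate scales as 1/coverage, picking a peak at twice the true 1n coverage halves the reported size, so it is worth checking which peak the model treats as 1n.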

Here is my result. (My tutor asked us to produce PDF files, so I had to change the plot function. 😭)

linear_plot.pdf

@mschatz
Contributor

mschatz commented Jan 2, 2025 via email

@valeryzhu
Author

Here is my histo file and other plot:

sample.histo.txt
linear_plot.pdf
![log_plot](https://github.com/user-attachments/assets/3820062d-bd7d-4f00-b3af-03995f228070)
transformed_log_plot
transformed_linear_plot

Just as you said, the model fits very well.

Maybe I didn't make myself clear, because I'm not a native English speaker.
I used Illumina data for the analysis (KMC + genomescope2), and the **haploid** genome size was estimated to be 718M.
My tutor used the PacBio data to assemble the genome, and the **diploid** genome size came out at 2.2G.
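To make the mismatch explicit, the two figures can be put on the same scale. This is only a sketch of the arithmetic, assuming (as in the original post) that the diploid assembly represents two of the four genome copies, so that it is scaled by ploidy / 2. The variable names are illustrative.

```python
PLOIDY = 4

genomescope_haploid_bp = 718_000_000   # haploid estimate from KMC + genomescope2
assembly_diploid_bp = 2_200_000_000    # diploid PacBio + Hi-C assembly

# Scale both figures to the full tetraploid genome.
total_from_kmers = genomescope_haploid_bp * PLOIDY
total_from_assembly = assembly_diploid_bp * PLOIDY // 2  # assumes diploid = 2 of 4 copies

ratio = total_from_kmers / total_from_assembly
print(f"k-mer based total:    {total_from_kmers / 1e9:.2f} G")
print(f"assembly based total: {total_from_assembly / 1e9:.2f} G")
print(f"k-mer / assembly:     {ratio:.2f}")
```

Under that assumption the k-mer based total is roughly two thirds of the assembly based total.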

I will look into the GA repeats and the "alignment way". Your answer was genuinely helpful.

@mschatz
Contributor

mschatz commented Jan 4, 2025 via email
