genomescope2 underestimate the genome size of autotetraploid (a fish from Tibet Plateau,China ) #150
Thanks for your interest. The fit here looks pretty good, but it only shows
the kmers that have relatively low frequency. Can you send the full
sample.histo file?
I have also encountered issues with HiFi data in fishes before where the
repeats are underestimated, especially GA repeats. This is due to how the
PacBio basecaller processes these types of data. Do you have Illumina data
available? That might be more reliable for assessing the genome. Another
idea is to align your HiFi reads to your assembly, and then you can
measure the coverage you have in different sequence contexts. If the
coverage substantially deviates from a Poisson or negative binomial
distribution, then the genome size estimate will be too small.
Good luck!
Mike
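A rough way to make the coverage check suggested above concrete is to compare the variance of per-window read depth with its mean. For Poisson-distributed coverage the ratio is close to 1, and values well above 1 point to overdispersion. This is only a minimal sketch: it assumes the per-window depths have already been extracted from the alignment by some other tool, the function name is hypothetical, and it tests against Poisson only, not a negative binomial fit.

```python
from statistics import mean, pvariance

def dispersion_index(depths):
    """Variance-to-mean ratio of per-window depths (about 1 for Poisson)."""
    m = mean(depths)
    if m == 0:
        raise ValueError("mean depth is zero")
    return pvariance(depths) / m

# Hypothetical depths for illustration only, not data from this thread.
print(dispersion_index([28, 31, 30, 29, 32, 30]))
```

Running it separately on windows from different sequence contexts (for example repeat-rich versus unique regions) would show whether some contexts are systematically under-covered.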
…On Sat, Dec 28, 2024 at 2:06 AM valeryzhu ***@***.***> wrote:
Thank you for your fantastic bioinformatics tool! 😄
However, when I used “KMC + genomescope2” to estimate the genome size from
Illumina sequence data, I obtained a haploid size of about 750M, which
would suggest a tetraploid size of about 3G. 😭 There is some evidence
that the tetraploid genome should be about 4G:
- We used PacBio long-read sequencing and Hi-C to assemble the diploid
  genome, which is about 2.2G in size, implying that the tetraploid
  genome should be around 4.4G.
- Another study on a closely related fish species assembled a diploid
  genome of about 4G. Although they used genomescope1 to evaluate
  the tetraploid genome size and obtained a result of 2G, this suggests that
  the total genome size is 4G, which is consistent with the assembled result.
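
Spelled out, the comparison behind the two figures for this fish is the following, taking the haploid estimate as roughly 0.75 Gb and the diploid assembly as 2.2 Gb, as stated above:

$$
4 \times 0.75\ \text{Gb} = 3.0\ \text{Gb}
\qquad\text{versus}\qquad
2 \times 2.2\ \text{Gb} = 4.4\ \text{Gb}
$$

So the k-mer based estimate is about 1.4 Gb lower than the size implied by the assembly.
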
I was confused by this problem for a few weeks. I tried many parameters,
but the size is still within 700M~800M.
Here is my code:
```
kmc -k21 -t16 -m64 -ci1 -cs100000 @kmc_FILES kmcdb kmc_tmp
kmc_tools transform kmcdb histogram sample.histo -cx10000
genomescope2 -i sample.histo -o histolpot -k 21 -p 4
```
Here is my result (my tutor asked us to produce PDF files, so I had to change the plot function 😭):
linear_plot.pdf
<https://github.com/user-attachments/files/18265350/linear_plot.pdf>
Here is my histo file and other plots: sample.histo.txt

Maybe I didn't make myself clear, because I'm not a native English speaker. I will try to consider the GA repeats and the "alignment way". I believe your answer was genuinely helpful.
I just tried your sample.histo.txt file with genomescope2 using ploidy=4,
and it estimates the haploid genome size to be 812Mb:
http://genomescope.org/genomescope2/analysis.php?code=272CQJaEkjzkAtXQBErg
I noticed the histogram file cuts out at 100,000, so it is missing some of the
very high frequency repeats, which will tend to cause the genome size to be
underestimated, so the haploid genome size could be closer to 1Gb. On the
other hand, assemblers are prone to duplicate sequences in high ploidy
genomes, so your diploid estimate could be inflated. How do the BUSCO
results look? I would not be surprised if you had an excess of duplicated
BUSCOs in your diploid assembly.
Good luck!
Mike
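To see how much weight sits at the cutoff mentioned above, one can measure what share of all k-mer instances falls in the highest bin of the histogram. The sketch below assumes the usual two-column layout (count, then number of distinct k-mers with that count); the function name is hypothetical. Whether the final bin pools all higher counts depends on the counting tool, so a large share is a hint that the histogram is truncated rather than proof.

```python
def last_bin_share(histo_lines):
    """Return (max_count, share of all k-mer instances held by the last bin)."""
    rows = []
    for line in histo_lines:
        parts = line.split()
        if len(parts) != 2:
            continue  # skip blank or malformed lines
        rows.append((int(parts[0]), int(parts[1])))
    if not rows:
        raise ValueError("empty histogram")
    # Each row contributes count * number_of_kmers instances.
    total = sum(count * n for count, n in rows)
    max_count, n_at_max = max(rows)
    return max_count, (max_count * n_at_max) / total

# Usage, assuming the file from this thread is in the working directory:
# with open("sample.histo") as fh:
#     print(last_bin_share(fh))
```

If the share is not negligible, regenerating the histogram with a higher maximum count before rerunning genomescope2 would be the natural next step.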
…On Sat, Jan 4, 2025 at 3:05 AM valeryzhu ***@***.***> wrote:
Here is my histo file and other plots:
sample.histo.txt
<https://github.com/user-attachments/files/18306028/sample.histo.txt>
linear_plot.pdf
<https://github.com/user-attachments/files/18306033/linear_plot.pdf>
![log_plot](https://github.com/user-attachments/assets/3820062d-bd7d-4f00-b3af-03995f228070)
transformed_log_plot.png
<https://github.com/user-attachments/assets/dc5b6467-bf43-442b-aa24-d7dce3da9cb9>
transformed_linear_plot.png
<https://github.com/user-attachments/assets/28f2acb1-d6d9-4deb-9952-3e78b5727bbd>
*Just as you said, the model fit very well.*
Maybe I didn't make myself clear, because I'm not a native English speaker.
I used Illumina data to do the analysis (KMC + genomescope2), and the genome
size **(haploid)** was estimated to be 718M.
My tutor used the PacBio data to assemble the genome, and the genome
size **(diploid)** was estimated to be 2.2G.
I will try to consider the GA repeats and the "alignment way". I believe
your answer was genuinely helpful.