Suggested coverage for the tetraploid genome #145

Open
hungweichen0327 opened this issue Oct 15, 2024 · 16 comments

Comments

@hungweichen0327

Dear Community,

I would like to ask about the suggested coverage for the tetraploid genome.
(The target species is a plant. I understand that the suggested coverage probably also depends on heterozygosity and the proportion of repeats in the genome.)

Thank you for the help.

@mschatz
Contributor

mschatz commented Oct 15, 2024 via email

@hungweichen0327
Author

Dear Mike,

Thank you for the quick reply.

You mentioned: "I'd recommend a combination of HiFi and HiC (and ONT ultralong if possible) to get the best possible results." I would like to confirm that this recommendation is for genome assembly, right? It is not related to Genomescope?

@mschatz
Contributor

mschatz commented Oct 15, 2024 via email

@hungweichen0327
Author

hungweichen0327 commented Nov 14, 2024

Dear @mschatz,

This is the Genomescope2 result I ran recently.

For k = 21,
linear_plot
transformed_linear_plot

For k = 25,
linear_plot
transformed_linear_plot

This is the smudgeplot result (K=25).
image

I have ~54 Gb of Illumina data (180X coverage for the expected haploid genome size of 300 Mb) for the Genomescope2 analysis, but the highest peak is at <40X coverage. Do you suggest I obtain more Illumina data? (I suspect this plant species is tetraploid.)
I ask because the Genomescope2 result for the tetraploid Meloidogyne javanica shows 130X coverage at the highest peak (Figure S22, shown below) in your published Genomescope2 paper (https://www.nature.com/articles/s41467-020-14998-3).

image

Any suggestions or comments are appreciated. Thank you!
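
A rough sanity check of the numbers in this comment, as a minimal sketch rather than anything Genomescope2 itself runs. It assumes 150 bp reads (the read length is not stated in the thread), ignores sequencing errors, and uses the standard relation that a read of length L contributes L - k + 1 k-mers. Under those assumptions, 180X over a 300 Mb haploid size becomes roughly 45X per haplotype in a tetraploid, which lands the 1n k-mer peak just under 40X.

```python
def base_coverage(total_bases: float, genome_size: float) -> float:
    """Sequenced bases divided by the genome size they are spread over."""
    return total_bases / genome_size


def kmer_coverage(base_cov: float, read_length: int, k: int) -> float:
    """Convert base coverage to k-mer coverage (error-free approximation)."""
    return base_cov * (read_length - k + 1) / read_length


total_bases = 54e9      # ~54 Gb of Illumina data (from the thread)
haploid_size = 300e6    # expected haploid genome size (from the thread)
ploidy = 4              # suspected tetraploid (from the thread)
read_length = 150       # ASSUMPTION: not stated in the thread

cov_haploid_size = base_coverage(total_bases, haploid_size)  # 180X
cov_per_haplotype = cov_haploid_size / ploidy                # 45X

for k in (21, 25):
    print(k, round(kmer_coverage(cov_per_haplotype, read_length, k), 1))
# prints:
# 21 39.0
# 25 37.8
```

Real data will sit somewhat lower than this because erroneous k-mers do not contribute to the peak.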

@mschatz
Contributor

mschatz commented Dec 6, 2024 via email

@hungweichen0327
Author

Dear @mschatz Mike,
Thank you for the suggestions. I have used ONT data and Hi-C data to generate a good genome assembly (scaffold N50 is 30 Mb, and there are ~133 scaffolds in the 1.18 Gb assembly, which represents the tetraploid genome).

Based on my results generated by Genomescope 2 above, is it clear enough to show that it is a tetraploid species?

Besides, based on the proportions of aaaa, aaab, aabc, abcd, could I say:
(1) This plant species is allotetraploid, since aabb% > aaab%
(2) The divergence between haplotypes is high, since the first peak is much higher than the other three peaks

Thank you.
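
The comparison in point (1) can be written down as a tiny helper. This is only a sketch of the aabb-versus-aaab rule of thumb mentioned in this comment; the function name and the input values are hypothetical (the actual percentages are in the plots, not in the text), and the result is a hint, not proof of auto- or allopolyploidy.

```python
def tetraploid_hint(aaab: float, aabb: float) -> str:
    """Rule-of-thumb label from two Genomescope2 tetraploid proportions."""
    if aabb > aaab:
        return "allotetraploid-like"
    if aaab > aabb:
        return "autotetraploid-like"
    return "ambiguous"


# HYPOTHETICAL values in percent, for illustration only
print(tetraploid_hint(aaab=1.0, aabb=3.0))
```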

@mschatz
Contributor

mschatz commented Dec 16, 2024 via email

@hungweichen0327
Author

Dear @mschatz Mike,

This is the summary of the BUSCO results:

## lineage: embryophyta_odb10
S:7.62%, 123
D:89.10%, 1438
F:0.31%, 5
I:0.00%, 0
M:2.97%, 48
N:1614

I also checked the copy number of these 1614 genes based on the output of BUSCO.
image

The smudgeplot was done (k=25)
image

Thank you.
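
For readers who want to check the BUSCO summary above, here is a small sketch that recomputes the percentages from the reported counts. The counts are taken from the comment; the variable and function names are illustrative only.

```python
# Counts as reported in the BUSCO summary (embryophyta_odb10)
counts = {"S": 123, "D": 1438, "F": 5, "I": 0, "M": 48}
total = sum(counts.values())  # should equal N


def busco_percentages(counts: dict) -> dict:
    """Percentage of each BUSCO category, rounded to two decimals."""
    n = sum(counts.values())
    return {key: round(100.0 * value / n, 2) for key, value in counts.items()}


percentages = busco_percentages(counts)
complete = round(100.0 * (counts["S"] + counts["D"]) / total, 2)
print(total, percentages, complete)
```

Complete BUSCOs (S + D) come to about 96.7%, with most of them duplicated, which is what one would expect if several haplotypes or subgenomes are present in the assembly.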

@mschatz
Contributor

mschatz commented Dec 17, 2024 via email

@KamilSJaron

KamilSJaron commented Dec 17, 2024

Hi, I would put my money on an AA'BB' type of genome, where both the A and B subgenomes are very heterozygous (hence the dominant 1n peak in the genomescope), and A <-> B is even more diverged, so there are fewer tetraploid k-mers than diploid ones. So, I would guess allo-.

In your assembly, you will see all of A and B uncollapsed, and some bits and pieces of A' and B' here and there, therefore nearly always >2 BUSCO copies, but not always. Some of this can be because of ongoing rediploidisation and loss of BUSCOs in one or the other subgenome.

What kind of plant is it?

So, I would not be confident claiming all this just based on the spectra and smudgeplot. You can run Merqury (FYI, there is a version of Merqury called MERQURY.FK that you can run directly on the FastK k-mer database you already have) to tell how right I was about the assembly (it should give you a general idea of how collapsed it is). I would try to purge duplicates, but be careful to keep all A and B contigs, you want to get rid of the uncollapsed duplicates. Then you can map back all the reads and call variants; you should see a substantial amount of heterozygosity. All the places that don't have the right amount of coverage (250x) are either unpurged (if 125x) or overpurged (if 500x). With that you should be able to piece together what the genome looks like.

Hope this helps.

K
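
The coverage check described above can be sketched as a simple classifier. This is an illustration under stated assumptions, not part of any tool mentioned here: the function name, the labels, and the 25% tolerance are made up for the example, and the expected depth of 250x is just the figure used in the comment.

```python
def classify_depth(depth: float, expected: float, tolerance: float = 0.25) -> str:
    """Label a region by how its read depth compares to the expected depth.

    ~1.0x expected -> looks fine
    ~0.5x expected -> possibly unpurged (duplicate copies kept, reads split)
    ~2.0x expected -> possibly overpurged (copies collapsed, reads pile up)
    """
    ratio = depth / expected
    if abs(ratio - 1.0) <= tolerance:
        return "expected"
    if abs(ratio - 0.5) <= tolerance * 0.5:
        return "possibly unpurged"
    if abs(ratio - 2.0) <= tolerance * 2.0:
        return "possibly overpurged"
    return "unclassified"


for depth in (250, 125, 500):
    print(depth, classify_depth(depth, expected=250))
# prints:
# 250 expected
# 125 possibly unpurged
# 500 possibly overpurged
```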

@mschatz
Contributor

mschatz commented Dec 17, 2024 via email

@hungweichen0327
Author

Dear @mschatz and @KamilSJaron,

Many thanks for your kind feedback. This is a carnivorous plant species, and it has been reported that both diploid and tetraploid populations exist worldwide. (We speculated our targeted population is tetraploid)

@KamilSJaron, there are some points I don't fully understand and would like to ask about:

Q1:

I would try to purge duplicates, but be careful to keep all A and B contigs, you want to get rid of the uncollapsed duplicates.

"keep all A and B contigs" > do you mean keep contigs from 4 subgenomes (AA'BB')?

Q2:
For Merqury, do you have any information on how to run this tool? It seems that it needs k-mer counts of the paternal, maternal, and child haplotypes as input.

Q3:

All the places that don't have the right amount of coverage (250x) are either unpurged (if 125x) or overpurged (if 500x)

How do I determine the right amount of coverage (250x)? Sorry, I am still new to this field.

Thank you.

@KamilSJaron

Q1: No, I meant to have a reference A + B (no A', no B'); so when you map reads, two haplotypes map to every position. That reference is expected to be a bit under 150 Mbp.

Q2: Merqury lets you plot the k-mer spectra against the assembly. If you have maternal and paternal datasets, you can also generate phased haplotypes, but that's not necessary. I think the plot you want is called CNplot, but I would double check the manual. https://github.com/thegenemyers/MERQURY.FK

Q3: You map reads to your assembly, then you look at the read depth. Alternatively, you can call variants and then look at the coverage supporting them. All of that requires some elementary bioinformatics skills, and I think you would be better off asking someone locally or looking to one of the online bioinformatics communities for help (this really is out of scope of genomescope support).
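
One possible way to read off the typical depth after mapping, as a minimal sketch: take the most common (binned) non-zero depth across positions. It assumes three-column, tab-separated input of the form chromosome, position, depth, which is what `samtools depth` prints for a single BAM file; the function name, the bin width, and the toy lines below are illustrative only.

```python
from collections import Counter
from typing import Iterable


def modal_depth(depth_lines: Iterable[str], bin_width: int = 5) -> int:
    """Return the centre of the most populated depth bin, ignoring zero depth."""
    bins = Counter()
    for line in depth_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:
            continue  # skip malformed or empty lines
        depth = int(fields[2])
        if depth == 0:
            continue  # uncovered positions say nothing about typical depth
        bins[depth // bin_width] += 1
    if not bins:
        raise ValueError("no covered positions found")
    top_bin, _ = bins.most_common(1)[0]
    return top_bin * bin_width + bin_width // 2


# Toy input, NOT real data
example = [
    "scaffold_1\t1\t248",
    "scaffold_1\t2\t251",
    "scaffold_1\t3\t252",
    "scaffold_1\t4\t125",
    "scaffold_1\t5\t500",
]
print(modal_depth(example))
```

On a real genome the input would be streamed from a file rather than held in a list, and regions far from this typical value are the ones worth inspecting.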

@KamilSJaron

@mschatz All good, thanks for tagging me. I am always down for checking a funky spectrum/smudgeplot. How do you like the updated version? (Besides the missing histograms, which we still need to put back. The core plot is just so much nicer that we decided not to hold it back.)

@mschatz
Contributor

mschatz commented Dec 19, 2024 via email

@hungweichen0327
Author

Dear @KamilSJaron,

Thank you for the kind feedback.

Q1: I expect the reference to be about 290 Mb for each haplotype, based on the Genomescope2 results and the literature. (I also tried setting ploidy = 2 in Genomescope2, and the predicted genome size is 583 Mb.)

Q2: Thank you for the information. I will look into the CNplot.

Q3: Understand. What you mentioned is already helpful enough for me.
