Suggested coverage for the tetraploid genome #145

hungweichen0327 · 2024-10-15T00:59:42Z

Dear Community,

I would like to ask about the suggested sequencing coverage for a tetraploid genome.
(The target species is a plant. I understand the suggested coverage probably also depends on the heterozygosity and the proportion of repeats in the genome.)

Thank you for the help.

mschatz · 2024-10-15T02:57:05Z

My general recommendation is to aim for 15x per haplotype, so for a tetraploid I would recommend a total of 60x coverage. If the genome is particularly repetitive and/or complex, you might need to go even higher. I'd also recommend a combination of HiFi and HiC (and ONT ultralong if possible) to get the best possible results. Good luck, Mike
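
As a rough worked example of this rule of thumb: total coverage is the per-haplotype target multiplied by the ploidy, and the sequencing yield needed is that coverage multiplied by the haploid genome size. The sketch below assumes a hypothetical 300 Mb haploid genome purely for illustration, and the function names are made up for this example rather than taken from any tool.

```python
def total_coverage(per_haplotype_cov: float, ploidy: int) -> float:
    """Total coverage (relative to the haploid genome size) for a given ploidy."""
    return per_haplotype_cov * ploidy


def required_bases(total_cov: float, haploid_size_bp: float) -> float:
    """Sequencing yield in bases needed to reach total_cov."""
    return total_cov * haploid_size_bp


# Assumption for illustration only: 300 Mb haploid genome, tetraploid.
cov = total_coverage(per_haplotype_cov=15, ploidy=4)
bases = required_bases(cov, haploid_size_bp=300e6)
print(f"{cov:.0f}x total, about {bases / 1e9:.0f} Gb of reads")
```

A more repetitive or complex genome would raise the per-haplotype target, which scales the yield linearly.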


hungweichen0327 · 2024-10-15T09:16:30Z

Dear Mike,

Thank you for the quick reply.

You mentioned that you'd recommend a combination of HiFi and HiC (and ONT ultralong if possible) to get the best possible results. I would like to confirm: this recommendation is for genome assembly, right? It is not related to GenomeScope?

mschatz · 2024-10-15T14:43:23Z

Correct, I'd recommend HiFi and HiC for the assembly, but for GenomeScope you can use any high-quality read type (HiFi, Illumina, Element, etc.). Good luck, Mike


hungweichen0327 · 2024-11-14T03:53:16Z

Dear @mschatz,

This is the Genomescope2 result I ran recently.

For k = 21,

For k = 25,

This is the smudgeplot result (K=25).

I have ~54 Gb of Illumina data (180X coverage for the expected haploid genome size of 300 Mb) for the GenomeScope2 analysis, but the highest peak is at <40X coverage. Do you suggest I obtain more Illumina data? (I would say this plant species might be tetraploid.)
I ask because the GenomeScope2 result for the tetraploid Meloidogyne javanica showed 130X coverage at the highest peak (Figure S22) in your published GenomeScope2 paper (https://www.nature.com/articles/s41467-020-14998-3).

Any suggestions or comments are appreciated. Thank you!

mschatz · 2024-12-06T16:45:20Z

This looks good in terms of coverage: the first peak (the haploid coverage) is at about 40x, which should be enough for genome profiling with GenomeScope. But this will be very challenging to assemble with just Illumina data. There are a few assemblers designed for this (e.g. https://pubmed.ncbi.nlm.nih.gov/24755901/), but the contigs tend to be quite short; a contig N50 of 50 kbp is pretty typical for this genome size and complexity. If at all possible, I would highly recommend generating HiFi data and/or ONT data. Good luck, Mike
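
One way to see why a first peak near 40x is consistent with 180x of raw data: the raw depth is shared across four haplotypes, and k-mer coverage is lower than base coverage by a factor of (L - k + 1) / L for reads of length L. The sketch below assumes 150 bp reads (the read length is not stated in this thread) and ignores sequencing errors and trimming, so it is only a rough check.

```python
def kmer_coverage(base_cov: float, read_len: int, k: int) -> float:
    """Approximate k-mer coverage from base coverage for error-free reads."""
    return base_cov * (read_len - k + 1) / read_len


total_bases = 54e9     # ~54 Gb of Illumina data (from the post above)
haploid_size = 300e6   # expected haploid genome size (from the post above)
ploidy = 4             # tetraploid, as suspected in the thread
read_len = 150         # ASSUMPTION: not given in the thread

total_cov = total_bases / haploid_size   # 180x relative to the haploid size
per_haplotype_cov = total_cov / ploidy   # 45x per haplotype

for k in (21, 25):
    print(k, round(kmer_coverage(per_haplotype_cov, read_len, k), 1))
# 21 39.0
# 25 37.8
```

Under these assumptions the expected 1n k-mer peak lands just under 40x for both values of k.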


hungweichen0327 · 2024-12-16T15:37:08Z

Dear @mschatz Mike,
Thank you for the suggestions. I have used ONT data and Hi-C data to generate a good genome assembly (scaffold N50 is 30 Mb and there are ~133 scaffolds in the 1.18 Gb assembly, which represents the tetraploid genome).

Based on my results generated by Genomescope 2 above, is it clear enough to show that it is a tetraploid species?

Also, based on the proportions of aaaa, aaab, aabc, abcd, could I say:
(1) This plant genome is allotetraploid, since aabb% > aaab%
(2) The divergence between haplotypes is high, since the first peak is much higher than the other three peaks

Thank you.

mschatz · 2024-12-16T15:52:01Z

What do you see in your BUSCO results? If it is tetraploid, I would expect a large number of duplicated genes, especially 4-copy genes. Have you tried smudgeplots? They are very helpful for ploidy assessment. Good luck! Mike


hungweichen0327 · 2024-12-17T08:32:05Z

Dear @mschatz Mike,

This is the summary of the BUSCO results:

## lineage: embryophyta_odb10
S:7.62%, 123
D:89.10%, 1438
F:0.31%, 5
I:0.00%, 0
M:2.97%, 48
N:1614

I also checked the copy number of these 1614 genes based on the output of BUSCO (image: https://github.com/user-attachments/assets/5fbc3a9c-051b-49f3-9dc4-37bb49616d7f).
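
A copy-number tally like this can be sketched in a few lines, assuming the tab-separated layout of BUSCO's `full_table.tsv`, in which header lines start with `#` and each copy of a duplicated BUSCO is listed on its own row with status `Duplicated`. That layout is an assumption to check against your BUSCO version; the BUSCO ids and rows in the example are made up.

```python
from collections import Counter


def busco_copy_numbers(lines):
    """Return a Counter mapping copy number -> number of BUSCO ids with that many copies."""
    copies = Counter()
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue  # skip header/comment lines and blanks
        fields = line.rstrip("\n").split("\t")
        busco_id, status = fields[0], fields[1]
        # Missing and Fragmented BUSCOs are not counted as copies here.
        if status in ("Complete", "Duplicated"):
            copies[busco_id] += 1
    return Counter(copies.values())


# Made-up rows for illustration; real input would be the lines of full_table.tsv.
example = [
    "# Busco id\tStatus\tSequence\tGene Start\tGene End\tStrand\tScore\tLength",
    "id_A\tComplete\tscaffold_1\t100\t900\t+\t500.0\t300",
    "id_B\tDuplicated\tscaffold_1\t2000\t2900\t+\t480.0\t300",
    "id_B\tDuplicated\tscaffold_2\t5000\t5900\t-\t470.0\t300",
    "id_C\tDuplicated\tscaffold_1\t100\t900\t+\t500.0\t300",
    "id_C\tDuplicated\tscaffold_2\t100\t900\t+\t500.0\t300",
    "id_C\tDuplicated\tscaffold_3\t100\t900\t+\t500.0\t300",
    "id_C\tDuplicated\tscaffold_4\t100\t900\t+\t500.0\t300",
    "id_D\tMissing",
]

histogram = busco_copy_numbers(example)
for n in sorted(histogram):
    print(f"{n} copies: {histogram[n]} BUSCOs")
```

For a real run, `busco_copy_numbers(open("full_table.tsv"))` would give the same kind of histogram, where the file path is whatever your BUSCO run produced.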

The smudgeplot was done with k = 25 (image: https://github.com/user-attachments/assets/10b6bcf7-87bd-4a7c-8bb5-f215f90bcf81).

Thank you.

mschatz · 2024-12-17T16:56:30Z

The large fraction of duplicated genes is expected, but I'm a little surprised there aren't more 4-copy genes. Kamil, do you have any thoughts on diploid vs tetraploid for this species, especially with the smudgeplot showing AB as the most common pattern of heterogeneity? Good luck, Mike


KamilSJaron · 2024-12-17T19:46:18Z

Hi, I would put my money on an AA'BB' type of genome, where both the A and B subgenomes are very heterozygous (hence the dominant 1n peak in the GenomeScope plot), and A <-> B is even more diverged, so there are fewer tetraploid k-mers than diploid ones. So, I would guess allo-.

In your assembly, you will see all of A and B uncollapsed, and some bits and pieces of A' and B' here and there, therefore nearly always >2 BUSCO copies, but not always. Some of this can be because of ongoing rediploidisation and loss of BUSCOs in one or the other subgenome.

What kind of plant is it?

So, I would not be confident claiming all this just based on the spectra and smudgeplot. You can run Merqury (FYI, there is a branch of Merqury called MERQURY.FK that you can run directly on the FastK k-mer database you already have) to tell how right I was about the assembly (it should give you a general idea of how collapsed it is). I would try to purge duplicates, but be careful to keep all A and B contigs, you want to get rid of the uncollapsed duplicates. Then you can map back all the reads and call variants; you should see a substantial amount of heterozygosity. All the places that don't have the right amount of coverage (250x) are either unpurged (if 125x) or overpurged (if 500x). With that you should be able to piece together what the genome looks like.

Hope this helps.
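
The coverage rule of thumb above can be written as a small classifier: a region at about half the expected depth suggests a duplicate still present in the assembly (reads split between two copies), and a region at about twice the expected depth suggests two sequences merged into one. This is only a sketch; the 250x value is taken from the paragraph above rather than derived here, and the function name and tolerance are hypothetical choices.

```python
def classify_depth(depth: float, expected: float, tolerance: float = 0.25) -> str:
    """Label a region by its read depth relative to the expected depth."""
    ratio = depth / expected
    if abs(ratio - 1.0) <= tolerance:
        return "expected"
    if abs(ratio - 0.5) <= tolerance / 2:
        return "unpurged"      # duplicate copy still in the assembly
    if abs(ratio - 2.0) <= tolerance * 2:
        return "overpurged"    # distinct copies merged into one
    return "other"


for depth in (250, 125, 500, 60):
    print(depth, classify_depth(depth, expected=250))
```

In practice the expected depth would be read off the main peak of the depth distribution rather than fixed in advance.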

K

mschatz · 2024-12-17T21:23:45Z

Thanks Kamil! Mike


hungweichen0327 · 2024-12-17T23:47:23Z

Dear @mschatz and @KamilSJaron,

Many thanks for your kind feedback. This is a carnivorous plant species, and both diploid and tetraploid populations have been reported worldwide. (We suspect our target population is tetraploid.)

@KamilSJaron, there are some points I don't fully understand and would like to ask about:

Q1:

"I would try to purge duplicates, but be careful to keep all A and B contigs, you want to get rid of the uncollapsed duplicates."

By "keep all A and B contigs", do you mean keeping contigs from all 4 subgenomes (AA'BB')?

Q2:
For Merqury, do you have any information on how to run this tool? It seems that it needs k-mer counts of the paternal, maternal, and child haplotypes as input.

Q3:

"All the places that don't have the right amount of coverage (250x) are either unpurged (if 125x) or overpurged (if 500x)"

How do I determine the right amount of coverage (250x)? Sorry, I am still new to this field.

Thank you.

KamilSJaron · 2024-12-18T09:22:49Z

Q1: No, I meant to have a reference A + B (no A', no B'); so when you map reads, two haplotypes map to every position. That reference is expected to be a bit under 150 Mbp.

Q2: Merqury lets you plot the k-mer spectra against the assembly. If you have maternal and paternal datasets, you can also generate phased haplotypes, but that's not necessary. I think the plot you want is called CNplot, but I would double-check the manual: https://github.com/thegenemyers/MERQURY.FK

Q3: You map reads to your assembly, then you look at the read depth. Alternatively, you can call variants and then look at the coverage supporting them. All of that requires some elementary bioinformatics skills, and I think you would be better off asking someone locally or looking at some of the online bioinformatics communities for help (this really is out of scope for GenomeScope support).
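
As a minimal sketch of the "look at the read depth" step: assuming the reads are already mapped and per-base depth was written with `samtools depth` for a single BAM (tab-separated contig, 1-based position, depth), the depth can be averaged in fixed windows. The function name and the 10 kb window are hypothetical choices, and the example rows are made up.

```python
from collections import defaultdict


def window_mean_depth(lines, window=10_000):
    """Mean depth per (contig, window index) from `samtools depth`-style lines."""
    sums = defaultdict(int)
    counts = defaultdict(int)
    for line in lines:
        if not line.strip():
            continue
        contig, pos, depth = line.rstrip("\n").split("\t")[:3]
        key = (contig, (int(pos) - 1) // window)  # positions are 1-based
        sums[key] += int(depth)
        counts[key] += 1
    # Note: positions absent from the input (e.g. zero depth when samtools depth
    # is run without -a) are not counted, so means are over reported positions.
    return {key: sums[key] / counts[key] for key in sums}


# Made-up rows for illustration.
example = ["ctg1\t1\t250", "ctg1\t2\t250", "ctg1\t10001\t125"]
print(window_mean_depth(example))
```

Windows whose mean sits far from the main depth peak are the places worth inspecting.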

KamilSJaron · 2024-12-18T09:23:50Z

@mschatz All good, thanks for tagging me. I am always down for checking a funky spectrum/smudgeplot. How do you like the updated version? (Besides the missing histograms, which we still need to put back. The core plot is just so much nicer that we decided not to hold it back.)

mschatz · 2024-12-19T18:49:33Z

Looks great!


hungweichen0327 · 2024-12-20T04:26:45Z

Dear @KamilSJaron,

Thank you for the kind feedback.

Q1: I expect the reference to be about 290 Mb for each haplotype, based on the GenomeScope2 results and the literature. (I also tried setting ploidy = 2 in GenomeScope2, and the predicted genome size is 583 Mb.)

Q2: Thank you for the information. I will look into the CNplot.

Q3: Understood. What you mentioned is already helpful enough for me.

hungweichen0327 closed this as completed Dec 20, 2024

hungweichen0327 reopened this Dec 20, 2024
