Suggested coverage for the tetraploid genome #145
Comments
My general recommendation is to aim for 15x per haplotype, so for a
tetraploid I would recommend a total of 60x coverage. But if the genome is
particularly repetitive and/or complex, you might need to go even higher. And I'd
recommend a combination of HiFi and HiC (and ONT ultralong if possible) to
get the best possible results.
Good luck
Mike
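The arithmetic behind this recommendation can be sketched as follows. This is a minimal illustration, not part of GenomeScope: the function names are hypothetical, and the 300 Mb haploid genome size is only an example value (it is the figure quoted later in this thread).

```python
def total_coverage(per_haplotype_cov: float, ploidy: int) -> float:
    """Total coverage, measured against the haploid genome size."""
    return per_haplotype_cov * ploidy


def bases_needed_gb(haploid_size_mb: float, total_cov: float) -> float:
    """Sequencing yield (in Gb) needed to reach total_cov over the haploid size."""
    return haploid_size_mb * 1e6 * total_cov / 1e9


# 15x per haplotype in a tetraploid gives 60x in total.
cov = total_coverage(15, 4)
# For an assumed 300 Mb haploid genome, 60x corresponds to 18 Gb of reads.
yield_gb = bases_needed_gb(300, cov)
```

A more repetitive or complex genome would call for a higher per-haplotype value, as noted above.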
…On Mon, Oct 14, 2024 at 9:00 PM Hung-Wei ***@***.***> wrote:
Dear Community,
I would like to ask about the suggested coverage for the tetraploid genome.
(The target species is a plant. I know the suggested coverage probably
also depends on the heterozygosity and the proportion of repeated
regions in the genome.)
Thank you for the help.
|
Correct, I'd recommend HiFi and HiC for the assembly, but for GenomeScope
you can use any high-quality read type (HiFi, Illumina, Element, etc.).
Good luck
Mike
…On Tue, Oct 15, 2024 at 5:16 AM Hung-Wei ***@***.***> wrote:
Dear Mike,
Thank you for the quick reply.
You mentioned: "I'd recommend a combination of HiFi and HiC (and ONT
ultralong if possible) to get the best possible results". I would like to
confirm that this recommendation is for genome assembly, right? It's not
related to GenomeScope?
|
Dear @mschatz, This is the GenomeScope2 result I ran recently. This is the smudgeplot result (k=25). I have ~54 Gb of Illumina data (180x coverage for the expected haploid genome size of 300 Mb) for the GenomeScope2 analysis, but the highest peak is at <40x coverage. Do you suggest that I obtain more Illumina data? (I would say this plant species might be tetraploid.) Any suggestions or comments are appreciated. Thank you!
This looks good in terms of coverage: the first peak (the haploid
coverage) is at about 40x, which should be enough for genome profiling with
GenomeScope. But this will be very challenging to assemble with just
Illumina data. There are a few assemblers designed for this (e.g.
https://pubmed.ncbi.nlm.nih.gov/24755901/), but the contigs tend to be quite
short. A contig N50 of 50 kbp is pretty typical for this genome size and
complexity. If at all possible, I would highly recommend generating HiFi
data and/or ONT data.
Good luck
Mike
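One way to see why 180x of data can produce a first peak near 40x is to spread the reads over all four haplotypes and then convert base coverage to k-mer coverage. The sketch below is an illustration under stated assumptions: the genome is tetraploid with about 300 Mb per haplotype, the reads are 150 bp (the read length is not given in this thread), and k=21 (the k used for this GenomeScope run is not stated either). It ignores sequencing errors and trimming, which lower the observed peak further.

```python
def base_coverage(total_bases: float, haploid_size: float, ploidy: int = 1) -> float:
    """Average base coverage over `ploidy` copies of the haploid genome."""
    return total_bases / (haploid_size * ploidy)


def kmer_coverage(base_cov: float, read_len: int, k: int) -> float:
    """Standard conversion: a read of length L contains L - k + 1 k-mers."""
    return base_cov * (read_len - k + 1) / read_len


# 54 Gb over a 300 Mb haploid size is 180x in total ...
total_cov = base_coverage(54e9, 300e6)
# ... but only 45x per haplotype if the genome is tetraploid.
per_hap_cov = base_coverage(54e9, 300e6, ploidy=4)
# With the assumed 150 bp reads and k=21, that is about 39x k-mer coverage.
per_hap_kmer_cov = kmer_coverage(per_hap_cov, read_len=150, k=21)
```

Under these assumptions, the per-haplotype k-mer coverage lands close to the observed first peak, which is consistent with the coverage being adequate for profiling.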
…On Wed, Nov 13, 2024 at 10:53 PM Hung-Wei ***@***.***> wrote:
Dear @mschatz <https://github.com/mschatz>,
This is the Genomescope2 result I ran recently.
linear_plot.png (view on web)
<https://github.com/user-attachments/assets/3c9424c8-88b9-4945-8238-eda3299dcf63>
transformed_linear_plot.png (view on web)
<https://github.com/user-attachments/assets/65051f3a-8bf9-4f71-91f2-77c5af6aa7b2>
This is the smudgeplot result.
image.png (view on web)
<https://github.com/user-attachments/assets/70650e57-9e6f-4d51-88db-817b4c873fe2>
I have ~54 Gb of Illumina data (180x coverage for the expected genome size
of 300 Mb) for the GenomeScope2 analysis, but the highest peak is at <40x
coverage. Do you suggest that I obtain more Illumina data?
I ask because the GenomeScope2 result for tetraploid Meloidogyne
javanica showed 130x coverage at the highest peak (Figure S22) in your
published GenomeScope2 paper (
https://www.nature.com/articles/s41467-020-14998-3)
image.png (view on web)
<https://github.com/user-attachments/assets/23f50922-c168-4025-9f50-aa2aa15b2d4a>
Any suggestions or comments are welcome. Thank you!
|
What do you see in your BUSCO results? If it is tetraploid, I would expect
a large number of duplicated genes, especially 4-copy genes. Have you tried
smudgeplot? It is very helpful for ploidy assessment.
Good luck!
Mike
…On Mon, Dec 16, 2024 at 10:37 AM Hung-Wei ***@***.***> wrote:
Dear @mschatz <https://github.com/mschatz> Mike,
Thank you for the suggestions. I have used ONT data and Hi-C data to
generate a good genome assembly. (The scaffold N50 is 30 Mb and there are
~133 scaffolds in the 1.18 Gb assembly, which represents the tetraploid genome.)
Based on my GenomeScope 2 results above, is it clear enough
to show that it is a tetraploid species?
Besides, based on the proportions of aaaa, aaab, aabc, abcd, could I say:
(1) This tetraploid plant genome is allotetraploid, since aabb%
> aaab%
(2) The divergence between haplotypes is high, since the first peak is much
higher than the other three peaks
Thank you.
|
The large fraction of duplicated genes is expected, but I'm a little
surprised there aren't more 4-copy genes. Kamil, do you have any thoughts on
diploid vs tetraploid for this species, especially with the smudgeplot
showing AB as the most common pattern of heterogeneity?
Good luck
Mike
…On Tue, Dec 17, 2024 at 3:32 AM Hung-Wei ***@***.***> wrote:
Dear @mschatz <https://github.com/mschatz> Mike,
This is the summary of the BUSCO results:
## lineage: embryophyta_odb10
S:7.62%, 123
D:89.10%, 1438
F:0.31%, 5
I:0.00%, 0
M:2.97%, 48
N:1614
I also checked the copy number of these 1614 genes based on the output of
BUSCO.
image.png (view on web)
<https://github.com/user-attachments/assets/5fbc3a9c-051b-49f3-9dc4-37bb49616d7f>
The smudgeplot was done (k=25)
image.png (view on web)
<https://github.com/user-attachments/assets/10b6bcf7-87bd-4a7c-8bb5-f215f90bcf81>
Thank you.
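A copy-number check like the one described above can be sketched by counting how many rows each complete BUSCO has in BUSCO's `full_table.tsv`. This is a minimal sketch, not the script used in this thread: it assumes the usual tab-separated layout with the BUSCO ID in the first column and the status in the second, with one row per copy for duplicated genes. The function name is hypothetical.

```python
from collections import Counter


def busco_copy_histogram(full_table_text: str) -> Counter:
    """Map copy number -> number of BUSCO genes found with that many copies.

    Only rows with status Complete or Duplicated are counted; Missing and
    Fragmented genes are ignored.
    """
    copies_per_gene = Counter()
    for line in full_table_text.splitlines():
        # Skip blank lines and the '#' header lines.
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.split("\t")
        busco_id, status = fields[0], fields[1]
        if status in ("Complete", "Duplicated"):
            copies_per_gene[busco_id] += 1
    # Invert: how many genes have 1 copy, 2 copies, 3 copies, ...
    return Counter(copies_per_gene.values())
```

Reading the file with `open(path).read()` and passing the text to this function gives a histogram whose 4-copy bin can be compared against the 2-copy bin when weighing diploid against tetraploid.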
|
Hi, I would put my money on an AA'BB' type of genome, where both the A and B subgenomes are very heterozygous (hence the dominant 1n peak in the GenomeScope plot), and A <-> B is even more diverged, so there are fewer tetraploid k-mers than diploid ones. So, I would guess allo-.
In your assembly, you will see all of A and B uncollapsed, and some bits and pieces of A' and B' here and there, therefore nearly always >2 BUSCO copies, but not always. Some of this can be because of ongoing rediploidisation and loss of BUSCOs in one or the other subgenome.
What kind of plant is it?
So, I would not be confident claiming all this just based on the spectra and smudgeplot. You can run Merqury (FYI, there is a variant of Merqury called MERQURY.FK that you can run directly on the FastK k-mer database you already have) to tell how right I was about the assembly (it should give you a general idea of how collapsed it is). I would try to purge duplicates, but be careful to keep all A and B contigs; you want to get rid of the uncollapsed duplicates. Then you can map back all the reads and call variants; you should see a substantial amount of heterozygosity. All the places that don't have the right amount of coverage (250x) are either unpurged (if 125x) or overpurged (if 500x). With that you should be able to piece together what the genome looks like.
Hope this helps. K
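The coverage rule of thumb in this comment can be sketched as a small classifier. This is an illustration only: the 250x expected depth is the value quoted above, while the function name and the tolerance are assumptions chosen so that the three classes do not overlap.

```python
def classify_depth(depth: float, expected: float, tolerance: float = 0.25) -> str:
    """Label a region by its read depth relative to the expected depth.

    About 1x the expected depth -> "expected"
    About 0.5x (e.g. 125x vs 250x) -> "unpurged"
    About 2x (e.g. 500x vs 250x) -> "overpurged"
    Anything else -> "other"
    """
    ratio = depth / expected
    if abs(ratio - 1.0) <= tolerance:
        return "expected"
    # Scale the tolerance with the target ratio so the bands stay disjoint.
    if abs(ratio - 0.5) <= tolerance * 0.5:
        return "unpurged"
    if abs(ratio - 2.0) <= tolerance * 2.0:
        return "overpurged"
    return "other"
```

Applied to per-window mean depths, this gives a rough map of which parts of the assembly still carry uncollapsed duplicates and which were purged too aggressively.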
Thanks Kamil!
Mike
…On Tue, Dec 17, 2024 at 2:46 PM Kamil S. Jaron ***@***.***> wrote:
Hi, I would put my money on an AA'BB' type of genome, where both the A and B
subgenomes are very heterozygous (hence the dominant 1n peak in the
GenomeScope plot), and A <-> B is also quite far apart, so there are fewer
tetraploid k-mers than diploid ones. So, I would guess allo-.
In your assembly, you will see all of A and B uncollapsed, and some bits and
pieces of A' and B' here and there.
What kind of plant is it?
So, I would not be confident claiming all this just based on the spectra
and smudgeplot. You can run Merqury (FYI, there is a variant of Merqury called
MERQURY.FK that you can run directly on the FastK k-mer database you already
have) to tell how right I was about the assembly (it should give you a
general idea of how collapsed it is). I would try to purge
duplicates, but be careful to keep all A and B contigs; you want to get rid
of the uncollapsed duplicates. Then you can map back all the reads and call
variants; you should see a substantial amount of heterozygosity. All the
places that don't have the right amount of coverage (250x) are either
unpurged (if 125x) or overpurged (if 500x). With that you should be able to
piece together what the genome looks like.
Hope this helps.
K
|
Dear @mschatz and @KamilSJaron, Many thanks for your kind feedback. This is a carnivorous plant species, and it has been reported that both diploid and tetraploid populations exist worldwide. (We speculate that our target population is tetraploid.) @KamilSJaron, there are some points I don't fully understand and would like to ask about:
Q1: "keep all A and B contigs" > do you mean keeping contigs from all 4 subgenomes (AA'BB')?
Q2:
Q3: How do I know the value of the right amount of coverage (250x)?
Sorry, I am still new to this field. Thank you. |
Q1: No, I meant to have a reference A + B (no A', no B'), so that when you map reads, two haplotypes map to every position. That reference is expected to be a bit under 150 Mbp.
Q2: Merqury plots let you compare the k-mer spectra against the assembly. If you have maternal and paternal datasets, you can also generate phased haplotypes, but that's not necessary. I think the plot you want is called CNplot, but I would double-check the manual: https://github.com/thegenemyers/MERQURY.FK
Q3: You map reads to your assembly, then you look at the read depth. Alternatively, you can call variants and then look at the coverage supporting them. All of that requires some elementary bioinformatics skills, but I think you would be better off asking someone locally or looking at some of the online bioinformatics communities for help (this really is out of scope of GenomeScope support). |
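For Q3, one possible way to "look at the read depth" is to average per-base depth in fixed windows. The sketch below assumes three-column, tab-separated input (contig, 1-based position, depth) such as the output of `samtools depth`; the function name and the 10 kb window size are illustrative choices, not something prescribed in this thread.

```python
from collections import defaultdict


def window_mean_depth(depth_lines, window: int = 10_000) -> dict:
    """Mean depth per (contig, window index) from 'contig<TAB>pos<TAB>depth' lines.

    Positions are 1-based. Only positions present in the input are averaged,
    so zero-depth positions must be included in the input if they should count.
    """
    sums = defaultdict(int)
    counts = defaultdict(int)
    for line in depth_lines:
        line = line.rstrip("\n")
        if not line:
            continue
        contig, pos, depth = line.split("\t")
        key = (contig, (int(pos) - 1) // window)
        sums[key] += int(depth)
        counts[key] += 1
    return {key: sums[key] / counts[key] for key in sums}
```

The function accepts any iterable of lines, so an open file handle can be passed directly. The most common window depth is then a reasonable estimate of the "right" coverage, and windows near half or double that value are the ones to inspect.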
@mschatz All good, thanks for tagging me. I am always down for checking a funky spectrum/smudgeplot. How do you like the updated version? (Besides the missing histograms, which we still need to put back. The core plot is just so much nicer that we decided not to hold it back.) |
Looks great!
…On Wed, Dec 18, 2024 at 4:24 AM Kamil S. Jaron ***@***.***> wrote:
@mschatz <https://github.com/mschatz> All good, thanks for tagging me. I
am always down for checking a funky spectrum/smudgeplot. How do you like the
updated version? (Besides the missing histograms, which we still need to put
back. The core plot is just so much nicer that we decided not to hold it back.)
|
Dear @KamilSJaron, Thank you for the kind feedback.
Q1: I expect the reference to be about 290 Mb for each haplotype, based on the GenomeScope2 results and the literature. (I also tried setting ploidy = 2 in GenomeScope2, and the predicted genome size is 583 Mb.)
Q2: Thank you for the information. I will look into the CNplot.
Q3: Understood. What you mentioned is already helpful enough for me. |