Use SVDB merge for merging samples to case #424

Closed
jemten opened this issue Oct 11, 2024 · 5 comments · Fixed by #428

@jemten
Collaborator

jemten commented Oct 11, 2024

Hola! Merging sample SV calls into a case should ideally be done by a tool that can handle the imprecise locations of SVs. bcftools merge will only merge exact matches. One option is SVDB merge. Others are Jasmine or SURVIVOR. Maybe a check with @J35P312 could be beneficial.

BCFTOOLS_MERGE ( ch_bcftools_merge_in, ch_fasta, ch_fai, ch_bed )
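
To make the concern concrete, here is a toy Python sketch. It is not how bcftools or SVDB are implemented; the records, the helper names and the 500 bp tolerance are all made up for illustration. Two samples carry what is plausibly the same deletion, but the breakpoints are placed a few bases apart, so a merge keyed on exact position keeps them as separate records.

```python
from collections import defaultdict

# Hypothetical per-sample SV calls: (sample, chrom, pos, end, svtype).
calls = [
    ("child",  "chr1", 10_000, 15_000, "DEL"),
    ("mother", "chr1", 10_004, 15_002, "DEL"),  # same event, shifted breakpoints
    ("father", "chr2", 50_000, 50_300, "INS"),
]

def merge_exact(calls):
    """Merge only calls that agree exactly on chromosome, position and type."""
    merged = defaultdict(list)
    for sample, chrom, pos, _end, svtype in calls:
        merged[(chrom, pos, svtype)].append(sample)
    return dict(merged)

def merge_tolerant(calls, tolerance=500):
    """Greedy merge: a call joins the first cluster of the same type whose
    representative has both breakpoints within `tolerance` bp (assumed value)."""
    clusters = []  # list of (representative_call, [samples])
    for call in calls:
        sample, chrom, pos, end, svtype = call
        for rep, samples in clusters:
            _, r_chrom, r_pos, r_end, r_type = rep
            if ((chrom, svtype) == (r_chrom, r_type)
                    and abs(pos - r_pos) <= tolerance
                    and abs(end - r_end) <= tolerance):
                samples.append(sample)
                break
        else:
            clusters.append((call, [sample]))
    return clusters

exact = merge_exact(calls)
tolerant = merge_tolerant(calls)
print(len(exact))     # 3 records: the shared deletion is reported twice
print(len(tolerant))  # 2 records: child and mother share one deletion
```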

@fellen31
Collaborator

Hm, my question would be: what if you have a sample with a call that has good, exact breakpoints, and then you merge it with 50 other samples and the result becomes less exact?

My idea was that the annotation with SVDB query would be imprecise (and annotate SVs that are the same but not exact matches with the same annotations), but I understand that this would lead to the same SV being reported twice in a "family" / CG case.

@J35P312

J35P312 commented Oct 14, 2024

In general, the "preciseness" of SVs varies across the genome, even within high-quality data. There are also biological reasons that complicate the positioning of SVs, such as microhomology.

BCFtools is nice for small SVs; they behave like indels, so they can be merged based on the ALT sequence. For large SVs you need to take the start, end and SV type into account. BCFtools does not look at the END tag, so it will treat the SV as a single point. In that case you are better off setting bnd_distance to 1 in SVDB.
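
As a rough sketch of what "taking the start, end and SV type into account" could look like, the Python function below compares two calls on both breakpoints plus a reciprocal overlap fraction. It is an assumption-laden illustration, not SVDB's actual implementation: the function name `same_sv`, the dict layout and the exact overlap formula are hypothetical, and the defaults simply reuse the 0.6 and 10,000 figures mentioned in this thread, reading 0.6 as an overlap fraction.

```python
def same_sv(a, b, bnd_distance=10_000, overlap=0.6):
    """Return True if two calls look like the same SV.

    Each call is a dict with keys: chrom, start, end, svtype.
    """
    if a["chrom"] != b["chrom"] or a["svtype"] != b["svtype"]:
        return False
    # Both breakpoints must lie within bnd_distance of each other.
    if abs(a["start"] - b["start"]) > bnd_distance:
        return False
    if abs(a["end"] - b["end"]) > bnd_distance:
        return False
    # Bases shared by both calls divided by the span covered by either call.
    shared = min(a["end"], b["end"]) - max(a["start"], b["start"])
    span = max(a["end"], b["end"]) - min(a["start"], b["start"])
    if span == 0:
        return True  # identical single-position calls
    return shared / span >= overlap

big_a = {"chrom": "chr1", "start": 100_000, "end": 200_000, "svtype": "DEL"}
big_b = {"chrom": "chr1", "start": 100_300, "end": 199_800, "svtype": "DEL"}
big_c = {"chrom": "chr1", "start": 100_000, "end": 150_000, "svtype": "DEL"}

print(same_sv(big_a, big_b))                  # True: close breakpoints, high overlap
print(same_sv(big_a, big_b, bnd_distance=1))  # False: near-exact matching only
print(same_sv(big_a, big_c))                  # False: same start, very different end
```

The last comparison is the case a position-only merge gets wrong: `big_a` and `big_c` share a start coordinate, so treating the SV as a single point would merge a 100 kb and a 50 kb deletion.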

But in truth, it's probably better to apply some custom approach for population genomics projects. I would recommend merging the Sniffles2 files directly using Sniffles2, for instance.

"but I understand that this would lead to the same SV being reported twice in a "family" / CG case."

Not only that! It's important to merge the SVs to get the correct inheritance patterns.
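
A small Python sketch of the inheritance point, using a made-up trio: the coordinates, the 500 bp tolerance and the helper names are hypothetical. If the child's and mother's calls are not recognised as the same event, the child's deletion appears to have no carrier parent and looks de novo.

```python
# Hypothetical trio calls; the mother's breakpoints differ by a few bases.
trio = {
    "child":  [{"chrom": "chr3", "start": 70_000, "end": 72_000, "svtype": "DEL"}],
    "mother": [{"chrom": "chr3", "start": 70_012, "end": 71_995, "svtype": "DEL"}],
    "father": [],
}

def exact(a, b):
    """Calls match only if every field is identical."""
    return a == b

def close(a, b, tolerance=500):
    """Calls match if type and chromosome agree and both breakpoints
    are within `tolerance` bp (assumed value)."""
    return (
        (a["chrom"], a["svtype"]) == (b["chrom"], b["svtype"])
        and abs(a["start"] - b["start"]) <= tolerance
        and abs(a["end"] - b["end"]) <= tolerance
    )

def carriers(variant, family, match):
    """Names of family members with at least one call matching `variant`."""
    return sorted(
        sample for sample, calls in family.items()
        if any(match(variant, call) for call in calls)
    )

child_call = trio["child"][0]
print(carriers(child_call, trio, exact))  # ['child']: looks de novo
print(carriers(child_call, trio, close))  # ['child', 'mother']: maternally inherited
```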

@fellen31
Collaborator

> In general, the "preciseness" of SVs varies across the genome, even within high-quality data. There are also biological reasons that complicate the positioning of SVs, such as microhomology.
>
> BCFtools is nice for small SVs; they behave like indels, so they can be merged based on the ALT sequence. For large SVs you need to take the start, end and SV type into account. BCFtools does not look at the END tag, so it will treat the SV as a single point. In that case you are better off setting bnd_distance to 1 in SVDB.
>
> But in truth, it's probably better to apply some custom approach for population genomics projects. I would recommend merging the Sniffles2 files directly using Sniffles2, for instance.

Thanks Jesper. If we do want to use SVDB rather than Sniffles2 for merging calls, do you think the defaults of 0.6 and 10,000 BND distance are reasonable both for, say, creating a small dataset of 100-1000 samples and for a CG case?

We should also merge calls within-sample from HiFiCNV with calls from Severus/Sniffles, same question there :)

> Not only that! It's important to merge the SVs to get the correct inheritance patterns.

Yes, definitely!

@adameur
Collaborator

adameur commented Oct 14, 2024

In my opinion, what should be considered the same SV is a philosophical question, and most likely we'll never find a tool that works perfectly. Maybe one thing could be to look at what is being done in big projects around the world, so that we use an approach that facilitates international collaboration? For example, if we're using ColorsDB for filtering, maybe it would make sense to use an approach similar to theirs. But I don't know, maybe there are good reasons to choose some other option. In any case, I think it's a really interesting and important question. Maybe graph genomes can improve this at some point, but that feels quite far in the future.

fellen31 added this to the 0.4 milestone Oct 14, 2024
@fellen31
Collaborator

Seems like the most appropriate action is to separate the building and exporting of a VCF for larger population calling/building in-house databases (#372), and exporting a merged case/project VCF (this issue).
