Question about DeepVariant evaluation on HG003 #925

Closed
rhpvorderman opened this issue Jan 13, 2025 · 2 comments

Comments

@rhpvorderman

I have a question about the DeepVariant evaluation on HG003 as is used here.

HG003 is left out of the training dataset, which is good. However, if I understand correctly, HG002 is in the training dataset. Since HG002 is the son of HG003, they share a lot of variants. Does this not introduce a risk of overfitting? Or have I misunderstood something?

@pichuan
Collaborator

pichuan commented Jan 14, 2025

Thanks for raising this question! You're correct that HG002 and HG003 are related.
While HG002 and HG003 share more variant positions than unrelated individuals would, their genotypes at those positions can differ (e.g., 0/1 in HG002 vs. 1/1 in HG003). This actually makes it harder for the model to simply memorize the training data, because it has to tell these genotypes apart even at shared sites.
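
To make the point about shared sites concrete, here is a minimal Python sketch that counts the sites two samples have in common and lists those where the genotypes differ. The function name `compare_shared_sites` and the records are hypothetical, made-up toy data (not real HG002/HG003 calls); it only illustrates the idea and is not part of DeepVariant.

```python
def compare_shared_sites(child, parent):
    """Count shared variant sites and list those with differing genotypes.

    Both arguments map (chrom, pos, ref, alt) -> genotype string such as "0/1".
    """
    # A site is "shared" when both samples carry the same variant key.
    shared = child.keys() & parent.keys()
    # Among shared sites, keep those where the genotype call is not identical.
    differing = sorted(k for k in shared if child[k] != parent[k])
    return len(shared), differing


# Toy records for illustration only (assumed values, not real calls).
hg002 = {
    ("chr1", 1000, "A", "G"): "0/1",
    ("chr1", 2000, "C", "T"): "0/1",
    ("chr1", 3000, "G", "A"): "1/1",
}
hg003 = {
    ("chr1", 1000, "A", "G"): "1/1",
    ("chr1", 2000, "C", "T"): "0/1",
    ("chr1", 4000, "T", "C"): "0/1",
}

n_shared, differing = compare_shared_sites(hg002, hg003)
print(n_shared, differing)
```

In this toy example two sites are shared, and one of them has a different genotype in each sample, so knowing the training sample's genotype at a site would not be enough to call the other sample correctly.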

Also, in DeepVariant training, we hold out chromosome 20 from the training data.
You can evaluate on chr20 to get an unbiased evaluation, if you're worried about overfitting.
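
As a rough illustration of restricting an evaluation to the held-out chromosome, the sketch below keeps only VCF header lines and records on one contig. The helper name `filter_vcf_to_chrom` is hypothetical, the records are made-up toy data, and the contig name `chr20` assumes GRCh38-style naming (some references use `20` instead). In practice a benchmarking tool's own region restriction would normally be used rather than hand-filtering.

```python
def filter_vcf_to_chrom(lines, chrom="chr20"):
    """Keep header lines and records whose CHROM column equals `chrom`."""
    kept = []
    for line in lines:
        if line.startswith("#"):
            # Header and meta lines are always kept so the output stays a valid VCF.
            kept.append(line)
        elif line.split("\t", 1)[0] == chrom:
            # CHROM is the first tab-separated column of a VCF record.
            kept.append(line)
    return kept


# Toy VCF content for illustration only (assumed values).
vcf_lines = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tHG003",
    "chr2\t500\t.\tA\tG\t50\tPASS\t.\tGT\t0/1",
    "chr20\t1000\t.\tC\tT\t60\tPASS\t.\tGT\t1/1",
    "chr20\t2000\t.\tG\tA\t55\tPASS\t.\tGT\t0/1",
]

chr20_only = filter_vcf_to_chrom(vcf_lines)
print(len(chr20_only))
```

Matching on the exact contig name (rather than a prefix) avoids accidentally keeping records from similarly named contigs.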

@rhpvorderman
Author

Hi, thanks for this detailed answer! Great to know that there are proper ways to have an unbiased comparison.
