I have a question about the DeepVariant evaluation on HG003 as is used here.
HG003 is left out of the training dataset, which is good. However, if I understand correctly, HG002 is in the training dataset. Since HG002 is the son of HG003, they share a lot of variants. Does this not introduce a risk of overfitting? Or have I misunderstood something?
Thanks for raising this question! You're correct that HG002 and HG003 are related.
While HG002 and HG003 share more variant positions than unrelated individuals would, their genotypes at those positions can differ (e.g., 0/1 in HG002 vs. 1/1 in HG003). This actually makes it harder for the model to simply memorize the training data, because it has to learn to tell these genotypes apart even at shared sites.
Also, in DeepVariant training, we hold out chromosome 20 from the training data.
You can evaluate on chr20 to get an unbiased evaluation, if you're worried about overfitting.
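As a rough illustration of restricting an evaluation to the held-out chromosome, here is a minimal Python sketch that keeps only the header lines and chr20 records of a VCF. The helper names `open_vcf` and `filter_vcf_lines` are hypothetical and not part of DeepVariant, and the contig name `chr20` assumes a GRCh38-style reference (a GRCh37-style reference would typically use `20`).

```python
import gzip


def open_vcf(path):
    """Open a plain or gzip-compressed VCF as text (hypothetical helper)."""
    return gzip.open(path, "rt") if path.endswith(".gz") else open(path, "r")


def filter_vcf_lines(lines, contig="chr20"):
    """Yield header lines and records whose CHROM column equals `contig`.

    `contig` is an assumption: use the name that matches your reference.
    """
    for line in lines:
        # Header lines start with '#'; records have CHROM as the first
        # tab-separated field. Exact match, so 'chr2' does not match 'chr20'.
        if line.startswith("#") or line.split("\t", 1)[0] == contig:
            yield line
```

Applying the same filter to both the call set and the truth set before comparing them (for example with a benchmarking tool such as hap.py) would limit the evaluation to sites the model never saw during training.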