Question about DeepVariant evaluation on HG003 #925

rhpvorderman · 2025-01-13T07:56:24Z

I have a question about the DeepVariant evaluation on HG003 as is used here.

HG003 is left out of the training dataset, which is good. However, if I understand correctly, HG002 is in the training dataset. Since HG002 is the son of HG003, they share a lot of variants. Does this not introduce a risk of overfitting? Or have I misunderstood something?

pichuan · 2025-01-14T00:19:48Z

Thanks for raising this question! You're correct that HG002 and HG003 are related.
While HG002 and HG003 share more variant positions than unrelated samples would, their genotypes at those positions can differ (e.g., 0/1 in HG002 vs. 1/1 in HG003). This actually makes it harder for the model to simply memorize the training data, because it has to learn to tell these genotypes apart even at shared sites.
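
To make the shared-site point concrete, here is a minimal Python sketch. The positions and genotypes are made-up toy values, not real HG002/HG003 calls, and `compare_shared_sites` is a hypothetical helper rather than anything from DeepVariant. It counts how many sites two samples share and how many of those carry different genotypes.

```python
# Toy genotypes keyed by (chromosome, position).
# All values are invented for illustration; they are not real calls.
hg002 = {("chr1", 1000): "0/1", ("chr1", 2000): "1/1", ("chr1", 3000): "0/1"}
hg003 = {("chr1", 1000): "1/1", ("chr1", 2000): "1/1", ("chr1", 4000): "0/1"}


def compare_shared_sites(child, parent):
    """Return (number of shared sites, number of shared sites with differing genotypes)."""
    shared = child.keys() & parent.keys()
    differing = sum(1 for site in shared if child[site] != parent[site])
    return len(shared), differing


shared, differing = compare_shared_sites(hg002, hg003)
print(f"{shared} shared sites, {differing} with different genotypes")
```

A site that shows up in both samples but with a different genotype cannot be called correctly just by remembering the training label at that position.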

Also, in DeepVariant training, we hold out chromosome 20 from the training data.
If you're worried about overfitting, you can evaluate on chr20 to get an unbiased result.
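
One way to restrict a comparison to the held-out chromosome is to filter the VCF records before benchmarking. The sketch below is only an illustration in plain Python: `keep_contig` is a hypothetical helper, the example records are invented, and the contig name `chr20` is an assumption that holds for references using the `chr` prefix (a reference that names it `20` would need that name instead). Benchmarking tools usually offer their own region options, which would be the more natural choice in practice.

```python
def keep_contig(vcf_lines, contig="chr20"):
    """Yield VCF header lines plus the records whose CHROM column equals `contig`."""
    for line in vcf_lines:
        # Header lines start with '#'; in data lines, CHROM is the first tab-separated field.
        if line.startswith("#") or line.split("\t", 1)[0] == contig:
            yield line


# Invented example records, for illustration only.
example = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tHG003",
    "chr2\t500\t.\tA\tG\t50\tPASS\t.\tGT\t0/1",
    "chr20\t100\t.\tC\tT\t60\tPASS\t.\tGT\t1/1",
    "chr20\t250\t.\tG\tA\t55\tPASS\t.\tGT\t0/1",
]

held_out = list(keep_contig(example))
print(len(held_out) - 2, "records on the held-out chromosome")
```

The exact string match on the first column keeps `chr2` from being mistaken for `chr20`.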

rhpvorderman · 2025-01-14T11:04:27Z

Hi, thanks for this detailed answer! Great to know that there are proper ways to have an unbiased comparison.

rhpvorderman closed this as completed Jan 14, 2025
