
Commit

Merge pull request #274 from r-causal/overfitting
Discuss overfitting from causal perspective
malcolmbarrett authored Oct 12, 2024
2 parents b9d19f9 + 5ae73a2 commit 8d1c6c8
Showing 2 changed files with 70 additions and 6 deletions.
44 changes: 38 additions & 6 deletions chapters/08-building-ps-models.qmd
@@ -217,13 +217,11 @@ Here are some questions to ask to gain diagnostic insights from @fig-mir
<!-- TODO: This section needs to be clarified. -->

1. Look for lack of overlap as a potential positivity problem (see the sketch after this list).
But too much overlap may indicate a poor model <!-- (TODO: not necessarily. depends on the relative covariate distributions in the two groups). -->

2. The average treatment effect among the treated is easier to estimate with precision (because of higher counts) than the effect in the control group.

3. A single outlier in either group's range could be a problem and warrant data inspection <!-- (TODO: do this here?) *look at the model coefs here*. -->
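
For reference, here is a minimal ggplot2 sketch (not the chapter's code) of a mirrored propensity score histogram like the one referenced above; it assumes a data frame `dat` with a propensity score column `ps` and a binary `exposure` column, both illustrative names:

```{r}
library(ggplot2)

# assumes a data frame `dat` with a propensity score column `ps`
# and a binary `exposure` column (illustrative names)
ggplot(mapping = aes(x = ps)) +
  geom_histogram(
    data = subset(dat, exposure == 1),
    aes(y = after_stat(count)), fill = "steelblue", alpha = 0.6
  ) +
  geom_histogram(
    data = subset(dat, exposure == 0),
    aes(y = -after_stat(count)), fill = "orange", alpha = 0.6
  ) +
  labs(x = "propensity score", y = "count (exposed up, unexposed down)")
```

Gaps where one group has counts and the other has none suggest a positivity problem; near-identical mirrored shapes may instead suggest a model that barely discriminates between the groups.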

<!-- *TODO* This may be a "nice" example -- should we also show a bad model (maybe only with ticket season or possibly overspecifying it) -->

@@ -242,7 +240,10 @@ Conversely, including variables that are predictors of the *exposure but not the
Luckily, this bias seems relatively negligible in practice, especially compared to the risk of confounding bias [@Myers2011].

::: callout-note
Some estimates, such as the odds and hazard ratios, have a property called *non-collapsibility*.
This means that marginal odds and hazard ratios are not weighted averages of their conditional versions.
In other words, the results might differ depending on the variable added or removed, even when the variable is not a confounder.
We'll explore this more in @sec-non-collapse.
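
As a minimal simulated sketch of non-collapsibility (not from the chapter; all names and values here are illustrative), consider a covariate `z` that strongly predicts the outcome but is independent of the exposure `x`, so it cannot be a confounder:

```{r}
set.seed(1)
n <- 1e6
x <- rbinom(n, 1, 0.5)                    # exposure, randomized
z <- rbinom(n, 1, 0.5)                    # outcome risk factor, independent of x
y <- rbinom(n, 1, plogis(-1 + x + 2 * z)) # binary outcome

# marginal odds ratio for x
exp(coef(glm(y ~ x, family = binomial()))["x"])

# conditional odds ratio for x, adjusting for z
exp(coef(glm(y ~ x + z, family = binomial()))["x"])
```

Even though `z` is not a confounder, the marginal odds ratio is noticeably smaller than the conditional one; the coefficient on `x` changes simply because the odds ratio does not collapse over `z`.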
:::

Another variable to be wary of is a *collider*, a descendant of both the exposure and outcome.
@@ -301,7 +302,38 @@ Then, we model `y ~ x + z` and see how much the coefficient on `x` has changed.
A common rule is to add a variable if it changes the coefficient of `x` by 10%.

Unfortunately, this technique is unreliable.
As we've discussed, controlling for mediators, colliders, and instrumental variables each affects the estimate of the relationship between `x` and `y`, and usually, the result is bias.
Additionally, the non-collapsibility of the odds and hazard ratios means they may change with the addition or subtraction of a variable without representing an improvement or worsening in bias.
In other words, many types of variables besides confounders can cause a change in the coefficient of the exposure.
As discussed above, confounding bias is often the most crucial to address, but systematically searching your variables for anything that changes the exposure coefficient can compound many types of bias.
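
As a quick simulated sketch of why the 10% rule can mislead (the variable names here are illustrative, not from the chapter), adding a collider `m` of `x` and `y` changes the coefficient on `x` by far more than 10%, yet adjusting for it introduces bias rather than removing it:

```{r}
set.seed(2)
n <- 1e4
x <- rnorm(n)           # exposure
y <- 0.5 * x + rnorm(n) # outcome; the true effect of x is 0.5
m <- x + y + rnorm(n)   # collider: caused by both x and y

coef(lm(y ~ x))["x"]     # close to the true value of 0.5
coef(lm(y ~ x + m))["x"] # shifts dramatically, but is now biased
```

By the change-in-estimate rule we would keep `m`, even though it is precisely the kind of variable we should not adjust for.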

::: callout-note
## Can you overfit a causal model?

In predictive modeling, data scientists often have to prevent overfitting their models to chance patterns in the data.
When a model captures those chance patterns, it doesn't predict as well on other data sets.
So, can you overfit a causal model?

The short answer is yes, although it's easier to do it with machine learning techniques than with logistic regression and friends.
An overfit model is, essentially, a misspecified model [@Gelman_2017].
A misspecified model will lead to residual confounding and, thus, a biased causal effect.
Overfitting can also exacerbate stochastic positivity violations [@zivich2022positivity].
The correct causal model (the functional form that matches the data-generating mechanism) cannot be overfit.
The same is true for the correct predictive model.
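
Here is a rough simulated sketch (not from the chapter; the deliberately over-flexible specification is for illustration only) of how an overfit propensity score model can manufacture positivity problems:

```{r}
set.seed(3)
n <- 150
z <- rnorm(n)                              # a true confounder of exposure
noise <- matrix(rnorm(n * 20), ncol = 20)  # 20 covariates unrelated to exposure
x <- rbinom(n, 1, plogis(z))               # exposure depends only on z
dat <- data.frame(x, z, noise)

# correctly specified propensity score model
ps_right <- fitted(glm(x ~ z, data = dat, family = binomial()))

# over-flexible model: every covariate plus all pairwise interactions,
# more parameters than observations (expect separation/convergence warnings)
ps_over <- fitted(glm(x ~ .^2, data = dat, family = binomial()))

# the overfit scores pile up near 0 and 1, an apparent lack of overlap
# (a stochastic positivity problem) created by the model, not the data
summary(ps_right)
summary(ps_over)
```

The correctly specified model recovers moderate propensity scores, while the overfit model pushes them toward 0 and 1, which looks like a severe lack of overlap even though, by construction, the true propensity scores are moderate.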

There's some nuance to this answer, though.
Overfitting plays out differently in causal inference than in prediction; we're not applying the causal estimate to another dataset (the closest analogs are transportability and generalizability, issues we'll discuss in [Chapter -@sec-evidence]).
It remains true that a causal model doesn't need to predict particularly well to be unbiased.

In predictive modeling, people often use the bias-variance trade-off to improve out-of-sample predictions.
In short, some bias is introduced in the sample in order to reduce the variance of the model fit and make better predictions out of the sample.
However, we must be careful: the word *bias* here refers to the discrepancy between the model estimates and the true value of the dependent variable *in the dataset*.
Let's call this statistical bias.
It is not necessarily the same as the difference between the model estimate and the true causal effect *in the population*.
Let's call this causal bias.
If we apply the bias-variance trade-off to causal models, we introduce statistical bias in an attempt to reduce causal bias.
Another subtlety is that overfitting can inflate the standard error of the estimate in the sample, which is not the same variance as in the bias-variance trade-off [@schuster2016].
From a frequentist standpoint, the confidence intervals will also not have nominal coverage (see @sec-appendix-bootstrap) because of the causal bias in the estimate.

In practice, cross-validation, a technique to reduce overfitting, is often used with causal models that use machine learning, as we'll discuss in [Chapter -@sec-causal-ml].
:::
32 changes: 32 additions & 0 deletions citations.bib
@@ -942,3 +942,35 @@ @article{Whitcomb2021
url = {http://dx.doi.org/10.1093/aje/kwaa267},
langid = {en}
}

@misc{Gelman_2017,
  title = {What is “overfitting,” exactly?},
  author = {Gelman, Andrew},
  journal = {Statistical Modeling, Causal Inference, and Social Science},
  year = {2017},
  month = {7},
  url = {https://statmodeling.stat.columbia.edu/2017/07/15/what-is-overfitting-exactly/}
}

@misc{zivich2022positivity,
title={Positivity: Identifiability and Estimability},
author={Paul N Zivich and Stephen R Cole and Daniel Westreich},
year={2022},
eprint={2207.05010},
archivePrefix={arXiv},
primaryClass={stat.ME},
url={https://arxiv.org/abs/2207.05010},
}

@article{schuster2016,
title = {Propensity score model overfitting led to inflated variance of estimated odds ratios},
author = {Schuster, Tibor and Lowe, Wilfrid Kouokam and Platt, Robert W.},
year = {2016},
month = {12},
date = {2016-12},
journal = {Journal of Clinical Epidemiology},
pages = {97--106},
volume = {80},
doi = {10.1016/j.jclinepi.2016.05.017},
url = {http://dx.doi.org/10.1016/j.jclinepi.2016.05.017},
langid = {en}
}
