
Commit

Merge pull request #274 from r-causal/overfitting
Discuss overfitting from causal perspective
malcolmbarrett authored Oct 12, 2024
2 parents b9d19f9 + 5ae73a2 commit 8d1c6c8
Showing 2 changed files with 70 additions and 6 deletions.
44 changes: 38 additions & 6 deletions chapters/08-building-ps-models.qmd
@@ -217,13 +217,11 @@ Here are some questions to ask to gain diagnostic insights from @fig-mir
<!-- TODO: This section needs to be clarified. -->

1. Look for lack of overlap as a potential positivity problem (see the sketch after this list).
But too much overlap may indicate a poor model <!-- (TODO: not necessarily. depends on the relative covariate distributions in the two groups). -->

2. The average treatment effect among the treated is easier to estimate with precision (because of higher counts) than the effect in the control group.

3. A single outlier in either group's range could be a problem and warrant data inspection <!-- (TODO: do this here?) *look at the model coefs here*. -->
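
For reference, here is a minimal ggplot2 sketch (not the chapter's code) of a mirrored propensity score histogram like the one referenced above; it assumes a data frame `dat` with a propensity score column `ps` and a binary `exposure` column, both illustrative names:

```{r}
library(ggplot2)

# assumes a data frame `dat` with a propensity score column `ps`
# and a binary `exposure` column (illustrative names)
ggplot(mapping = aes(x = ps)) +
  geom_histogram(
    data = subset(dat, exposure == 1),
    aes(y = after_stat(count)), fill = "steelblue", alpha = 0.6
  ) +
  geom_histogram(
    data = subset(dat, exposure == 0),
    aes(y = -after_stat(count)), fill = "orange", alpha = 0.6
  ) +
  labs(x = "propensity score", y = "count (exposed up, unexposed down)")
```

Gaps where one group has counts and the other has none suggest a positivity problem; near-identical mirrored shapes may instead suggest a model that barely discriminates between the groups.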

<!-- *TODO* This may be a "nice" example -- should we also show a bad model (maybe only with ticket season or possibly overspecifying it) -->

@@ -242,7 +240,10 @@ Conversely, including variables that are predictors of the *exposure but not the
Luckily, this bias seems relatively negligible in practice, especially compared to the risk of confounding bias [@Myers2011].

::: callout-note
Some estimates, such as the odds and hazard ratios, have a property called *non-collapsibility*.
This means that marginal odds and hazard ratios are not weighted averages of their conditional versions.
In other words, the results might differ depending on the variable added or removed, even when the variable is not a confounder.
We'll explore this more in @sec-non-collapse.
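
As a minimal simulated sketch of non-collapsibility (not from the chapter; all names and values here are illustrative), consider a covariate `z` that strongly predicts the outcome but is independent of the exposure `x`, so it cannot be a confounder:

```{r}
set.seed(1)
n <- 1e6
x <- rbinom(n, 1, 0.5)                    # exposure, randomized
z <- rbinom(n, 1, 0.5)                    # outcome risk factor, independent of x
y <- rbinom(n, 1, plogis(-1 + x + 2 * z)) # binary outcome

# marginal odds ratio for x
exp(coef(glm(y ~ x, family = binomial()))["x"])

# conditional odds ratio for x, adjusting for z
exp(coef(glm(y ~ x + z, family = binomial()))["x"])
```

Even though `z` is not a confounder, the marginal odds ratio is noticeably smaller than the conditional one; the coefficient on `x` changes simply because the odds ratio does not collapse over `z`.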
:::

Another variable to be wary of is a *collider*, a descendant of both the exposure and outcome.
@@ -301,7 +302,38 @@ Then, we model `y ~ x + z` and see how much the coefficient on `x` has changed.
A common rule is to add a variable if it changes the coefficient of `x` by 10%.

Unfortunately, this technique is unreliable.
As we've discussed, controlling for mediators, colliders, and instrumental variables each affects the estimate of the relationship between `x` and `y`, and usually, the result is bias.
Additionally, the non-collapsibility of the odds and hazard ratios means they may change with the addition or subtraction of a variable without representing an improvement or worsening in bias.
In other words, many types of variables besides confounders can cause a change in the coefficient of the exposure.
As discussed above, confounding bias is often the most crucial to address, but systematically searching your variables for anything that changes the exposure coefficient can compound many types of bias.
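
As a quick simulated sketch of why the 10% rule can mislead (the variable names here are illustrative, not from the chapter), adding a collider `m` of `x` and `y` changes the coefficient on `x` by far more than 10%, yet adjusting for it introduces bias rather than removing it:

```{r}
set.seed(2)
n <- 1e4
x <- rnorm(n)           # exposure
y <- 0.5 * x + rnorm(n) # outcome; the true effect of x is 0.5
m <- x + y + rnorm(n)   # collider: caused by both x and y

coef(lm(y ~ x))["x"]     # close to the true value of 0.5
coef(lm(y ~ x + m))["x"] # shifts dramatically, but is now biased
```

By the change-in-estimate rule we would keep `m`, even though it is precisely the kind of variable we should not adjust for.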

::: callout-note
## Can you overfit a causal model?

In predictive modeling, data scientists often have to prevent overfitting their models to chance patterns in the data.
When a model captures those chance patterns, it doesn't predict as well on other data sets.
So, can you overfit a causal model?

The short answer is yes, although it's easier to do it with machine learning techniques than with logistic regression and friends.
An overfit model is, essentially, a misspecified model [@Gelman_2017].
A misspecified model will lead to residual confounding and, thus, a biased causal effect.
Overfitting can also exacerbate stochastic positivity violations [@zivich2022positivity].
The correct causal model (the functional form that matches the data-generating mechanism) cannot be overfit.
The same is true for the correct predictive model.
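
Here is a rough simulated sketch (not from the chapter; the deliberately over-flexible specification is for illustration only) of how an overfit propensity score model can manufacture positivity problems:

```{r}
set.seed(3)
n <- 150
z <- rnorm(n)                              # a true confounder of exposure
noise <- matrix(rnorm(n * 20), ncol = 20)  # 20 covariates unrelated to exposure
x <- rbinom(n, 1, plogis(z))               # exposure depends only on z
dat <- data.frame(x, z, noise)

# correctly specified propensity score model
ps_right <- fitted(glm(x ~ z, data = dat, family = binomial()))

# over-flexible model: every covariate plus all pairwise interactions,
# more parameters than observations (expect separation/convergence warnings)
ps_over <- fitted(glm(x ~ .^2, data = dat, family = binomial()))

# the overfit scores pile up near 0 and 1, an apparent lack of overlap
# (a stochastic positivity problem) created by the model, not the data
summary(ps_right)
summary(ps_over)
```

The correctly specified model recovers moderate propensity scores, while the overfit model pushes them toward 0 and 1, which looks like a severe lack of overlap even though, by construction, the true propensity scores are moderate.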

There's some nuance to this answer, though.
Overfitting plays out differently in causal inference than in prediction; we're not applying the causal estimate to another dataset (the closest analogs are transportability and generalizability, issues we'll discuss in [Chapter -@sec-evidence]).
It remains true that a causal model doesn't need to predict particularly well to be unbiased.

In predictive modeling, people often use the bias-variance trade-off to improve out-of-sample predictions.
In short, some bias is introduced in the sample in order to reduce the variance of the model fit and make better predictions out of the sample.
However, we must be careful: the word *bias* here refers to the discrepancy between the model estimates and the true value of the dependent variable *in the dataset*.
Let's call this statistical bias.
It is not necessarily the same as the difference between the model estimate and the true causal effect *in the population*.
Let's call this causal bias.
If we apply the bias-variance trade-off to causal models, we introduce statistical bias in an attempt to reduce causal bias.
Another subtlety is that overfitting can inflate the standard error of the estimate in the sample, which is not the same variance as in the bias-variance trade-off [@schuster2016].
From a frequentist standpoint, the confidence intervals will also not have nominal coverage (see @sec-appendix-bootstrap) because of the causal bias in the estimate.

In practice, cross-validation, a technique to reduce overfitting, is often used with causal models that use machine learning, as we'll discuss in [Chapter -@sec-causal-ml].
:::
32 changes: 32 additions & 0 deletions citations.bib
@@ -942,3 +942,35 @@ @article{Whitcomb2021
url = {http://dx.doi.org/10.1093/aje/kwaa267},
langid = {en}
}

@misc{Gelman_2017,
  title = {What is “overfitting,” exactly?},
  author = {Gelman, Andrew},
  journal = {Statistical Modeling, Causal Inference, and Social Science},
  year = {2017},
  month = {7},
  url = {https://statmodeling.stat.columbia.edu/2017/07/15/what-is-overfitting-exactly/}
}

@misc{zivich2022positivity,
title={Positivity: Identifiability and Estimability},
author={Paul N Zivich and Stephen R Cole and Daniel Westreich},
year={2022},
eprint={2207.05010},
archivePrefix={arXiv},
primaryClass={stat.ME},
url={https://arxiv.org/abs/2207.05010},
}

@article{schuster2016,
title = {Propensity score model overfitting led to inflated variance of estimated odds ratios},
author = {Schuster, Tibor and Lowe, Wilfrid Kouokam and Platt, Robert W.},
year = {2016},
month = {12},
date = {2016-12},
journal = {Journal of Clinical Epidemiology},
pages = {97--106},
volume = {80},
doi = {10.1016/j.jclinepi.2016.05.017},
url = {http://dx.doi.org/10.1016/j.jclinepi.2016.05.017},
langid = {en}
}
