From 75626f54cbdd96b235f4bc172b735cbfce225eb1 Mon Sep 17 00:00:00 2001 From: Malcolm Barrett Date: Mon, 15 Jan 2024 16:27:55 -0500 Subject: [PATCH] start on pred section --- .../execute-results/html.json | 4 +- chapters/06-not-just-a-stats-problem.qmd | 40 ++++++++++++++++++- 2 files changed, 41 insertions(+), 3 deletions(-) diff --git a/_freeze/chapters/06-not-just-a-stats-problem/execute-results/html.json b/_freeze/chapters/06-not-just-a-stats-problem/execute-results/html.json index e11e954..357a828 100644 --- a/_freeze/chapters/06-not-just-a-stats-problem/execute-results/html.json +++ b/_freeze/chapters/06-not-just-a-stats-problem/execute-results/html.json @@ -1,7 +1,7 @@ { - "hash": "a27be79395f8da11754f15a0f2d7ce72", + "hash": "8d78e5b223e925d85dda679f1aa6a618", "result": { - "markdown": "# Causal inference is not (just) a statistical problem {#sec-quartets}\n\n\n\n\n\n## The Causal Quartet\n\nWe now have the tools to look at a detail of causal inference alluded to thus far in the book: causal inference is not (just) a statistical problem.\nOf course, we use statistics to answer causal questions.\nIt's necessary to answer most questions, even if the statistics are basic (as they often are in randomized designs).\nBut statistics alone do not allow us to address all of the assumptions of causal inference.\n\nIn 1973, Francis Anscombe introduced a set of four datasets known as *Anscombe's Quartet*.\nThese data illustrated an important lesson: summary statistics alone cannot help you understand data; you also need to visualize your data.\nIn the plots in @fig-anscombe, each data set has remarkably similar summary statistics, including means and correlations that are nearly identical.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(quartets)\n\nanscombe_quartet |> \n ggplot(aes(x, y)) + \n geom_point() + \n geom_smooth(method = \"lm\", se = FALSE) + \n facet_wrap(~ dataset)\n```\n\n::: {.cell-output-display}\n![Anscombe's Quartet, a set of four datasets with nearly identical summary statistics. Anscombe's point was that one must visualize the data to understand it.](06-not-just-a-stats-problem_files/figure-html/fig-anscombe-1.png){#fig-anscombe width=672}\n:::\n:::\n\n\nThe Datasaurus Dozen is a modern take on Anscombe's Quartet. The mean, standard deviation, and correlation are nearly identical in each dataset, but the visualizations are very different.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(datasauRus)\n\n# roughly the same correlation in each dataset\ndatasaurus_dozen |> \n group_by(dataset) |> \n summarize(cor = round(cor(x, y), 3))\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n# A tibble: 13 × 2\n dataset cor\n \n 1 away -0.064\n 2 bullseye -0.069\n 3 circle -0.068\n 4 dino -0.064\n 5 dots -0.06 \n 6 h_lines -0.062\n 7 high_lines -0.069\n 8 slant_down -0.069\n 9 slant_up -0.069\n10 star -0.063\n11 v_lines -0.069\n12 wide_lines -0.067\n13 x_shape -0.066\n```\n\n\n:::\n:::\n\n::: {.cell}\n\n```{.r .cell-code}\ndatasaurus_dozen |> \n ggplot(aes(x, y)) + \n geom_point() + \n facet_wrap(~ dataset)\n```\n\n::: {.cell-output-display}\n![The Datasaurus Dozen, a set of datasets with nearly identical summary statistics. The Datasaurus Dozen is a modern version of Anscombe's Quartet. 
It's actually a baker's dozen, but who's counting?](06-not-just-a-stats-problem_files/figure-html/fig-datasaurus-1.png){#fig-datasaurus width=672}\n:::\n:::\n\n\nIn causal inference, though, even visualization is not enough to untangle causal effects.\nBackground knowledge, as we visualized in DAGs in @sec-dags, is required to meet the assumptions of causal inference.\n\nIn 2023, we introduced the *causal quartet* [@dagostinomcgowan2023].\nThe causal quartet has many of the same properties of Anscombe's quartet and the Datasaurus Dozen: the numerical summaries of the variables in the dataset are basically the same.\nUnlike these data, the causal quartet also *look* the same as each other.\nThe difference is the causal structure that generated each dataset.\n@fig-causal_quartet_hidden shows four datasets where the observational relationship between `exposure` and `outcome` is virtually identical.\n\n\n::: {.cell}\n\n```{.r .cell-code}\ncausal_quartet |> \n # hide the dataset names\n mutate(dataset = as.integer(factor(dataset))) |> \n ggplot(aes(exposure, outcome)) + \n geom_point() + \n geom_smooth(method = \"lm\", se = FALSE) + \n facet_wrap(~ dataset)\n```\n\n::: {.cell-output-display}\n![The Causal Quartet, four data sets with nearly identical summary statistics and visualizations. The causal structure of each dataset is different, and data alone cannot tell us which is which.](06-not-just-a-stats-problem_files/figure-html/fig-causal_quartet_hidden-1.png){#fig-causal_quartet_hidden width=672}\n:::\n:::\n\n\nThe question for each dataset is whether to adjust for a third variable, `covariate`.\nIs `covariate` a confounder?\nA mediator?\nA collider?\nWe can't use data to figure this problem out.\nIn @tbl-quartet_lm, it's not clear which effect is right.\nLikewise, the correlation between `exposure` and `covariate` is no help: it's the same!\n\n\n::: {#tbl-quartet_lm .cell tbl-cap='The causal quartet, with the estimated effect of `exposure` on `outcome` with and without adjustment for `covariate`. The unadjusted estimate is identical for all four datasets, as is the correlation between `exposure` and `covariate`. The adjusted estimate varies. Without background knowledge, it\\'s not clear which is right.'}\n::: {.cell-output-display}\n\n```{=html}\n
<table>\n<thead>\n<tr>\n<th>Dataset</th>\n<th>Not adjusting for `covariate`</th>\n<th>Adjusting for `covariate`</th>\n<th>Correlation of `exposure` and `covariate`</th>\n</tr>\n</thead>\n<tbody>\n
<tr><td>1</td><td>1.00</td><td>0.55</td><td>0.70</td></tr>\n
<tr><td>2</td><td>1.00</td><td>0.50</td><td>0.70</td></tr>\n
<tr><td>3</td><td>1.00</td><td>0.00</td><td>0.70</td></tr>\n
<tr><td>4</td><td>1.00</td><td>0.88</td><td>0.70</td></tr>\n
</tbody>\n</table>\n
\n```\n\n:::\n:::\n\n\n::: {.callout-warning}\n\n## The ten percent rule\n\nThe ten percent rule is a common technique in epidemiology and other fields to determine whether a variable is a confounder. The problem is, it doesn't work. The ten percent rule says that you should include a variable in your model if including it changes the effect estimate by more than ten percent. However, *every* example in the causal quartet causes a change of more than ten percent. As we know, this leads to the wrong answer in some of the datasets. Even the reverse technique, *disincluding* a variable when it's *less* than ten percent, can cause trouble because many small confounding effects can add up to larger bias. \n\n\n::: {#tbl-quartet_ten_percent .cell tbl-cap='The percent change in the coefficient for `exposure` when including `covariate` in the model.'}\n::: {.cell-output-display}\n\n```{=html}\n
<table>\n<thead>\n<tr>\n<th>Dataset</th>\n<th>Percent Change</th>\n</tr>\n</thead>\n<tbody>\n
<tr><td>1</td><td>44.6%</td></tr>\n
<tr><td>2</td><td>49.7%</td></tr>\n
<tr><td>3</td><td>99.8%</td></tr>\n
<tr><td>4</td><td>12.5%</td></tr>\n
</tbody>\n</table>\n
\n```\n\n:::\n:::\n\n\n:::\n\nWhile the visual relationship between `covariate` and `exposure` is not identical between datasets, all have the same correlation.\nin @fig-causal_quartet_covariate, dataset 4 seems to have more variance in `covariate`, but that's not actionable information.\n\n\n::: {.cell}\n\n```{.r .cell-code}\ncausal_quartet |> \n # hide the dataset names\n mutate(dataset = as.integer(factor(dataset))) |> \n ggplot(aes(covariate, exposure)) + \n geom_point() + \n geom_smooth(method = \"lm\", se = FALSE) + \n facet_wrap(~ dataset)\n```\n\n::: {.cell-output-display}\n![The correlation is the same in each dataset, but the visual relationship is not. This is not enough information to determine whether `covariate` is a confounder, mediator, or collider, however.](06-not-just-a-stats-problem_files/figure-html/fig-causal_quartet_covariate-1.png){#fig-causal_quartet_covariate width=672}\n:::\n:::\n\n\nRevealing the labels in @fig-causal_quartet, `covariate` plays a different role in each dataset.\nIn 1 and 4, it's a collider (we *shouldn't* adjust for it).\nIn 2, it's a confounder (we *should* adjust for it).\nIn 3, it's a mediator (it depends on the research question).\nIf the data can't tell us the answer, what can we do?\n\n\n::: {.cell}\n\n```{.r .cell-code}\ncausal_quartet |> \n ggplot(aes(exposure, outcome)) + \n geom_point() + \n geom_smooth(method = \"lm\", se = FALSE) + \n facet_wrap(~ dataset)\n```\n\n::: {.cell-output-display}\n![The Causal Quartet, revealed. The first and last datasets are types of collider-bias; we should *not* control for `covariate`. In the second dataset, `covariate` is a confounder, and we *should* control for it. In the third dataset, `covariate` is a mediator, and we should control for it if we want the total effect, but not if we want the direct effect.](06-not-just-a-stats-problem_files/figure-html/fig-causal_quartet-1.png){#fig-causal_quartet width=672}\n:::\n:::\n\n\nThe best answer is to have a good sense of the data generating mechanism.\nIn @fig-quartet-dag, we show the DAG for each dataset. Once we compile a DAG for each dataset, we only need to query the DAG for the correct adjustment set, assuming the DAG is right.\n\n\n::: {.cell layout-ncol=\"2\"}\n::: {.cell-output-display}\n![The DAG for dataset 1, where `covariate` is a collider. We should *not* adjust for `covariate`, which is a descendant of `exposure` and `outcome`.](06-not-just-a-stats-problem_files/figure-html/fig-quartet-dag-1.png){#fig-quartet-dag-1 width=288}\n:::\n\n::: {.cell-output-display}\n![The DAG for dataset 2, where `covariate` is a confounder. `covariate` is a mutual cause of `exposure` and `outcome`, representing a backdoor path, so we *must* adjust for it to get the right answer.](06-not-just-a-stats-problem_files/figure-html/fig-quartet-dag-2.png){#fig-quartet-dag-2 width=288}\n:::\n\n::: {.cell-output-display}\n![The DAG for dataset 3, where `covariate` is a mediator. `covariate` is a descendant of `exposure` and a cause of `outcome`. The path through `covariate` is the indirect path, and the path through `exposure` is the direct path. We should adjust for `covariate` if we want the direct effect, but not if we want the total effect.](06-not-just-a-stats-problem_files/figure-html/fig-quartet-dag-3.png){#fig-quartet-dag-3 width=288}\n:::\n\n::: {.cell-output-display}\n![The DAG for dataset 4, where `covariate` is a collider via M-Bias. Although `covariate` happens before both `outcome` and `exposure`, it's still a collider. 
We should *not* adjust for `covariate`, particularly since we can't control for the bias via `u1` and `u2`, which are unmeasured.](06-not-just-a-stats-problem_files/figure-html/fig-quartet-dag-4.png){#fig-quartet-dag-4 width=288}\n:::\n:::\n\n\nThe data generating mechanism[^1] in the DAGs matches what was actually used to generate the datasets, so we can use the DAGs to determine the correct effect: `unadjusted` in datasets 1 and 4 and `adjusted` in dataset 2. For dataset 3, it depends on which mediation effect we want: `adjusted` for the direct effect and `unadjusted` for the total effect.\n\n[^1]: See @dagostinomcgowan2023 for the models that generated the datasets.\n\n\n::: {#tbl-quartets_true_effects .cell tbl-cap='The data generating mechanism and true causal effects in each dataset. Sometimes, the unadjusted effect is the same, and sometimes it is not, depending on the mechanism and question.'}\n::: {.cell-output-display}\n\n```{=html}\n
<table>\n<thead>\n<tr>\n<th>Data generating mechanism</th>\n<th>Correct causal model</th>\n<th>Correct causal effect</th>\n</tr>\n</thead>\n<tbody>\n
<tr><td>(1) Collider</td><td>Y ~ X</td><td>1</td></tr>\n
<tr><td>(2) Confounder</td><td>Y ~ X ; Z</td><td>0.5</td></tr>\n
<tr><td>(3) Mediator</td><td>Direct effect: Y ~ X ; Z, Total Effect: Y ~ X</td><td>Direct effect: 0, Total effect: 1</td></tr>\n
<tr><td>(4) M-Bias</td><td>Y ~ X</td><td>1</td></tr>\n
</tbody>\n</table>\n
\n```\n\n:::\n:::\n\n\n## Time as a hueristic for causal structure\n\nHopefully, we have convinced of the usefulness of DAGs. But how do we know what the DAG is? In the causal quartet, we knew the DAGs because we generated the data. In real life, we need to use background knowledge to assemble a candidate causal structure. For some questions, such background knowledge is not available. For others, we may worry about the complexity of the causal structure, particularly when variables mutually evolve with each other, as in @fig-feedback-loop.\n\nThere is one hueristic that is particularly useful when a DAG is incomplete or uncertain: time. Because causality is temporal in nature, a causal variable must precede an effect. Many, but not all, problems in deciding if we should adjust for a confounder are solved by simply putting the variables in order by time. Time ordering is also one of the most critical assumptions you visualize in a DAG, so it's a good place to start regardless of the completeness of the DAG.\n\nConsider @fig-quartets_time_ordered, a time-ordered version of the collider DAG where the covariate is measured at baseline and follow-up. The original DAG actually represents the *second* measurement, where the covariate is a descendant of both the outcome and exposure. If, however, we control for the same covariate as measured at the start of the study, it cannot be a descendant of the outcome at follow-up because it hasn't happened yet. Thus, when you are missing background knowledge as to the causal structure of the covariate, you can use time-ordering as a defensive measure to avoid bias. Only control for variables that precede the outcome.\n\n\n::: {.cell layout-ncol=\"2\"}\n::: {.cell-output-display}\n![In a time-ordered version of the collider DAG, controlling for the covariate at follow-up induces bias.](06-not-just-a-stats-problem_files/figure-html/fig-quartets_time_ordered-1.png){#fig-quartets_time_ordered-1 width=384}\n:::\n\n::: {.cell-output-display}\n![On the other hand, controlling for the covariate as measured at baseline does not induce bias because it is not a descendant of the outcome.](06-not-just-a-stats-problem_files/figure-html/fig-quartets_time_ordered-2.png){#fig-quartets_time_ordered-2 width=384}\n:::\n:::\n\n\n::: {.callout-warning}\n## Don't adjust for the future\n\nThe time-ordering heuristic relies on a simple rule: don't adjust for the future. \n:::\n\nThe quartet package's `causal_quartet_time` has time-ordered measurements of each variable for the four datasets. Each has a `*_baseline` and `*_followup` measurement.\n\n\n::: {.cell}\n\n```{.r .cell-code}\ncausal_quartet_time\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n# A tibble: 400 × 12\n covariate_baseline exposure_baseline\n \n 1 -0.0963 -1.43 \n 2 -1.11 0.0593 \n 3 0.647 0.370 \n 4 0.755 0.00471\n 5 1.19 0.340 \n 6 -0.588 -3.61 \n 7 -1.13 1.44 \n 8 0.689 1.02 \n 9 -1.49 -2.43 \n10 -2.78 -1.26 \n# ℹ 390 more rows\n# ℹ 10 more variables: outcome_baseline ,\n# covariate_followup , exposure_followup ,\n# outcome_followup , exposure_mid ,\n# covariate_mid , outcome_mid , u1 ,\n# u2 , dataset \n```\n\n\n:::\n:::\n\n\nUsing the formula `outcome_followup ~ exposure_baseline + covariate_baseline` works for three out of four datasets. Even though `covariate_baseline` is only in the adjustment set for the second dataset, it's not a collider in the two of the other datasets, so it's not a problem. 
\n\n\n::: {#tbl-quartet_time_adjusted .cell tbl-cap='The adjusted effect of `exposure_baseline` on `outcome_followup` in each dataset. The effect adjusted for `covariate_baseline` is correct for three out of four datasets.'}\n::: {.cell-output-display}\n\n```{=html}\n
<table>\n<thead>\n<tr>\n<th>Dataset</th>\n<th>Adjusted effect</th>\n<th>Truth</th>\n</tr>\n</thead>\n<tbody>\n
<tr><td>(1) Collider</td><td>1.00</td><td>1.00</td></tr>\n
<tr><td>(2) Confounder</td><td>0.50</td><td>0.50</td></tr>\n
<tr><td>(3) Mediator</td><td>1.00</td><td>1.00</td></tr>\n
<tr><td>(4) M-Bias</td><td>0.88</td><td>1.00</td></tr>\n
</tbody>\n</table>\n
\n```\n\n:::\n:::\n\n\nWhere it fails is in dataset 4, the M-bias example. In this case, `covariate_baseline` is still a collider, because the collision occurs prior to both the exposure and outcome. As we discussed in @sec-m-bias, however, if you are in doubt whether something is truly M-bias, it is better to adjust for it than not. Confounding bias tends to be worse, and meaningful M-bias is probably rare in real life. As the true causal structure deviates from perfect M-bias, the severity of the bias tends to decrease. So, if it is clearly M-bias, don't adjust for the variable. If it's not clear, adjust for it. \n\n::: {.callout-tip}\nRemember as well that it is possible to block bias induced by adjusting for a collider in certain circumstances because collider bias is just another open path. If we had `u1` and `u2`, we could both control for `covariate` and block any potential collider bias. In other words, sometimes when we open a path, we can close it again.\n\n\n::: {.cell}\n::: {.cell-output-display}\n![The DAG for dataset 4, where the covariate is a collider via M-Bias. If we know the structure, we shouldn't control for the covariate, but if we do, we can theoretically close the opened pathway via `u1` and `u2`, assuming we have them measured. In other words, thorough measurement and control of pre-exposure variables will still save the day.](06-not-just-a-stats-problem_files/figure-html/fig-m_bias_adjust-1.png){#fig-m_bias_adjust width=672}\n:::\n:::\n\n:::\n\n## Causal and Predictive Models, Revisited {#sec-causal-pred-revisit}\n\nPredictive measurements also fail to distinguish between the four datasets. \n\n\n::: {#tbl-quartet_time_predictive .cell}\n\n:::\n\n\n\n\n\n\n\n\n\n\n\n", + "markdown": "# Causal inference is not (just) a statistical problem {#sec-quartets}\n\n\n\n\n\n## The Causal Quartet\n\nWe now have the tools to look at a detail of causal inference alluded to thus far in the book: causal inference is not (just) a statistical problem.\nOf course, we use statistics to answer causal questions.\nIt's necessary to answer most questions, even if the statistics are basic (as they often are in randomized designs).\nBut statistics alone do not allow us to address all of the assumptions of causal inference.\n\nIn 1973, Francis Anscombe introduced a set of four datasets known as *Anscombe's Quartet*.\nThese data illustrated an important lesson: summary statistics alone cannot help you understand data; you also need to visualize your data.\nIn the plots in @fig-anscombe, each data set has remarkably similar summary statistics, including means and correlations that are nearly identical.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(quartets)\n\nanscombe_quartet |> \n ggplot(aes(x, y)) + \n geom_point() + \n geom_smooth(method = \"lm\", se = FALSE) + \n facet_wrap(~ dataset)\n```\n\n::: {.cell-output-display}\n![Anscombe's Quartet, a set of four datasets with nearly identical summary statistics. Anscombe's point was that one must visualize the data to understand it.](06-not-just-a-stats-problem_files/figure-html/fig-anscombe-1.png){#fig-anscombe width=672}\n:::\n:::\n\n\nThe Datasaurus Dozen is a modern take on Anscombe's Quartet. 
The mean, standard deviation, and correlation are nearly identical in each dataset, but the visualizations are very different.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(datasauRus)\n\n# roughly the same correlation in each dataset\ndatasaurus_dozen |> \n group_by(dataset) |> \n summarize(cor = round(cor(x, y), 3))\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n# A tibble: 13 × 2\n dataset cor\n \n 1 away -0.064\n 2 bullseye -0.069\n 3 circle -0.068\n 4 dino -0.064\n 5 dots -0.06 \n 6 h_lines -0.062\n 7 high_lines -0.069\n 8 slant_down -0.069\n 9 slant_up -0.069\n10 star -0.063\n11 v_lines -0.069\n12 wide_lines -0.067\n13 x_shape -0.066\n```\n\n\n:::\n:::\n\n::: {.cell}\n\n```{.r .cell-code}\ndatasaurus_dozen |> \n ggplot(aes(x, y)) + \n geom_point() + \n facet_wrap(~ dataset)\n```\n\n::: {.cell-output-display}\n![The Datasaurus Dozen, a set of datasets with nearly identical summary statistics. The Datasaurus Dozen is a modern version of Anscombe's Quartet. It's actually a baker's dozen, but who's counting?](06-not-just-a-stats-problem_files/figure-html/fig-datasaurus-1.png){#fig-datasaurus width=672}\n:::\n:::\n\n\nIn causal inference, though, even visualization is not enough to untangle causal effects.\nBackground knowledge, as we visualized in DAGs in @sec-dags, is required to meet the assumptions of causal inference.\n\nIn 2023, we introduced the *causal quartet* [@dagostinomcgowan2023].\nThe causal quartet has many of the same properties as Anscombe's quartet and the Datasaurus Dozen: the numerical summaries of the variables in the dataset are basically the same.\nUnlike those datasets, the datasets in the causal quartet also *look* the same as each other.\nThe difference is the causal structure that generated each dataset.\n@fig-causal_quartet_hidden shows four datasets where the observational relationship between `exposure` and `outcome` is virtually identical.\n\n\n::: {.cell}\n\n```{.r .cell-code}\ncausal_quartet |> \n # hide the dataset names\n mutate(dataset = as.integer(factor(dataset))) |> \n ggplot(aes(exposure, outcome)) + \n geom_point() + \n geom_smooth(method = \"lm\", se = FALSE) + \n facet_wrap(~ dataset)\n```\n\n::: {.cell-output-display}\n![The Causal Quartet, four datasets with nearly identical summary statistics and visualizations. The causal structure of each dataset is different, and data alone cannot tell us which is which.](06-not-just-a-stats-problem_files/figure-html/fig-causal_quartet_hidden-1.png){#fig-causal_quartet_hidden width=672}\n:::\n:::\n\n\nThe question for each dataset is whether to adjust for a third variable, `covariate`.\nIs `covariate` a confounder?\nA mediator?\nA collider?\nWe can't use data to figure this problem out.\nIn @tbl-quartet_lm, it's not clear which effect is right.\nLikewise, the correlation between `exposure` and `covariate` is no help: it's the same!\n\n\n::: {#tbl-quartet_lm .cell tbl-cap='The causal quartet, with the estimated effect of `exposure` on `outcome` with and without adjustment for `covariate`. The unadjusted estimate is identical for all four datasets, as is the correlation between `exposure` and `covariate`. The adjusted estimate varies. Without background knowledge, it\'s not clear which is right.'}\n::: {.cell-output-display}\n\n```{=html}\n
<table>\n<thead>\n<tr>\n<th>Dataset</th>\n<th>Not adjusting for `covariate`</th>\n<th>Adjusting for `covariate`</th>\n<th>Correlation of `exposure` and `covariate`</th>\n</tr>\n</thead>\n<tbody>\n
<tr><td>1</td><td>1.00</td><td>0.55</td><td>0.70</td></tr>\n
<tr><td>2</td><td>1.00</td><td>0.50</td><td>0.70</td></tr>\n
<tr><td>3</td><td>1.00</td><td>0.00</td><td>0.70</td></tr>\n
<tr><td>4</td><td>1.00</td><td>0.88</td><td>0.70</td></tr>\n
</tbody>\n</table>\n
\n```\n\n:::\n:::\n\n\n::: {.callout-warning}\n\n## The ten percent rule\n\nThe ten percent rule is a common technique in epidemiology and other fields to determine whether a variable is a confounder. The problem is, it doesn't work. The ten percent rule says that you should include a variable in your model if including it changes the effect estimate by more than ten percent. However, *every* example in the causal quartet causes a change of more than ten percent. As we know, this leads to the wrong answer in some of the datasets. Even the reverse technique, *excluding* a variable when the change is *less* than ten percent, can cause trouble because many small confounding effects can add up to larger bias. \n\n\n::: {#tbl-quartet_ten_percent .cell tbl-cap='The percent change in the coefficient for `exposure` when including `covariate` in the model.'}\n::: {.cell-output-display}\n\n```{=html}\n
<table>\n<thead>\n<tr>\n<th>Dataset</th>\n<th>Percent Change</th>\n</tr>\n</thead>\n<tbody>\n
<tr><td>1</td><td>44.6%</td></tr>\n
<tr><td>2</td><td>49.7%</td></tr>\n
<tr><td>3</td><td>99.8%</td></tr>\n
<tr><td>4</td><td>12.5%</td></tr>\n
</tbody>\n</table>\n
\n```\n\n:::\n:::\n\n\n:::\n\nWhile the visual relationship between `covariate` and `exposure` is not identical between datasets, all have the same correlation.\nIn @fig-causal_quartet_covariate, dataset 4 seems to have more variance in `covariate`, but that's not actionable information.\n\n\n::: {.cell}\n\n```{.r .cell-code}\ncausal_quartet |> \n # hide the dataset names\n mutate(dataset = as.integer(factor(dataset))) |> \n ggplot(aes(covariate, exposure)) + \n geom_point() + \n geom_smooth(method = \"lm\", se = FALSE) + \n facet_wrap(~ dataset)\n```\n\n::: {.cell-output-display}\n![The correlation is the same in each dataset, but the visual relationship is not. This is not enough information to determine whether `covariate` is a confounder, mediator, or collider, however.](06-not-just-a-stats-problem_files/figure-html/fig-causal_quartet_covariate-1.png){#fig-causal_quartet_covariate width=672}\n:::\n:::\n\n\nRevealing the labels in @fig-causal_quartet, `covariate` plays a different role in each dataset.\nIn 1 and 4, it's a collider (we *shouldn't* adjust for it).\nIn 2, it's a confounder (we *should* adjust for it).\nIn 3, it's a mediator (it depends on the research question).\nIf the data can't tell us the answer, what can we do?\n\n\n::: {.cell}\n\n```{.r .cell-code}\ncausal_quartet |> \n ggplot(aes(exposure, outcome)) + \n geom_point() + \n geom_smooth(method = \"lm\", se = FALSE) + \n facet_wrap(~ dataset)\n```\n\n::: {.cell-output-display}\n![The Causal Quartet, revealed. The first and last datasets are types of collider bias; we should *not* control for `covariate`. In the second dataset, `covariate` is a confounder, and we *should* control for it. In the third dataset, `covariate` is a mediator, and we should control for it if we want the direct effect, but not if we want the total effect.](06-not-just-a-stats-problem_files/figure-html/fig-causal_quartet-1.png){#fig-causal_quartet width=672}\n:::\n:::\n\n\nThe best answer is to have a good sense of the data generating mechanism.\nIn @fig-quartet-dag, we show the DAG for each dataset. Once we compile a DAG for each dataset, we only need to query the DAG for the correct adjustment set, assuming the DAG is right.\n\n\n::: {.cell layout-ncol=\"2\"}\n::: {.cell-output-display}\n![The DAG for dataset 1, where `covariate` is a collider. We should *not* adjust for `covariate`, which is a descendant of `exposure` and `outcome`.](06-not-just-a-stats-problem_files/figure-html/fig-quartet-dag-1.png){#fig-quartet-dag-1 width=288}\n:::\n\n::: {.cell-output-display}\n![The DAG for dataset 2, where `covariate` is a confounder. `covariate` is a mutual cause of `exposure` and `outcome`, representing a backdoor path, so we *must* adjust for it to get the right answer.](06-not-just-a-stats-problem_files/figure-html/fig-quartet-dag-2.png){#fig-quartet-dag-2 width=288}\n:::\n\n::: {.cell-output-display}\n![The DAG for dataset 3, where `covariate` is a mediator. `covariate` is a descendant of `exposure` and a cause of `outcome`. The path through `covariate` is the indirect path, and the path through `exposure` is the direct path. We should adjust for `covariate` if we want the direct effect, but not if we want the total effect.](06-not-just-a-stats-problem_files/figure-html/fig-quartet-dag-3.png){#fig-quartet-dag-3 width=288}\n:::\n\n::: {.cell-output-display}\n![The DAG for dataset 4, where `covariate` is a collider via M-Bias. Although `covariate` happens before both `outcome` and `exposure`, it's still a collider. 
We should *not* adjust for `covariate`, particularly since we can't control for the bias via `u1` and `u2`, which are unmeasured.](06-not-just-a-stats-problem_files/figure-html/fig-quartet-dag-4.png){#fig-quartet-dag-4 width=288}\n:::\n:::\n\n\nThe data generating mechanism[^1] in the DAGs matches what was actually used to generate the datasets, so we can use the DAGs to determine the correct effect: `unadjusted` in datasets 1 and 4 and `adjusted` in dataset 2. For dataset 3, it depends on which mediation effect we want: `adjusted` for the direct effect and `unadjusted` for the total effect.\n\n[^1]: See @dagostinomcgowan2023 for the models that generated the datasets.\n\n\n::: {#tbl-quartets_true_effects .cell tbl-cap='The data generating mechanism and true causal effects in each dataset. Sometimes, the unadjusted effect is the same, and sometimes it is not, depending on the mechanism and question.'}\n::: {.cell-output-display}\n\n```{=html}\n
<table>\n<thead>\n<tr>\n<th>Data generating mechanism</th>\n<th>Correct causal model</th>\n<th>Correct causal effect</th>\n</tr>\n</thead>\n<tbody>\n
<tr><td>(1) Collider</td><td>Y ~ X</td><td>1</td></tr>\n
<tr><td>(2) Confounder</td><td>Y ~ X ; Z</td><td>0.5</td></tr>\n
<tr><td>(3) Mediator</td><td>Direct effect: Y ~ X ; Z, Total effect: Y ~ X</td><td>Direct effect: 0, Total effect: 1</td></tr>\n
<tr><td>(4) M-Bias</td><td>Y ~ X</td><td>1</td></tr>\n
</tbody>\n</table>\n
\n```\n\n:::\n:::\n\n\n## Time as a heuristic for causal structure\n\nHopefully, we have convinced you of the usefulness of DAGs. But how do we know what the DAG is? In the causal quartet, we knew the DAGs because we generated the data. In real life, we need to use background knowledge to assemble a candidate causal structure. For some questions, such background knowledge is not available. For others, we may worry about the complexity of the causal structure, particularly when variables mutually evolve with each other, as in @fig-feedback-loop.\n\nThere is one heuristic that is particularly useful when a DAG is incomplete or uncertain: time. Because causality is temporal in nature, a causal variable must precede an effect. Many, but not all, problems in deciding whether to adjust for a potential confounder are solved by simply putting the variables in order by time. Time ordering is also one of the most critical assumptions you visualize in a DAG, so it's a good place to start regardless of the completeness of the DAG.\n\nConsider @fig-quartets_time_ordered, a time-ordered version of the collider DAG where the covariate is measured at baseline and follow-up. The original DAG actually represents the *second* measurement, where the covariate is a descendant of both the outcome and exposure. If, however, we control for the same covariate as measured at the start of the study, it cannot be a descendant of the outcome at follow-up because it hasn't happened yet. Thus, when you are missing background knowledge as to the causal structure of the covariate, you can use time-ordering as a defensive measure to avoid bias. Only control for variables that precede the outcome.\n\n\n::: {.cell layout-ncol=\"2\"}\n::: {.cell-output-display}\n![In a time-ordered version of the collider DAG, controlling for the covariate at follow-up induces bias.](06-not-just-a-stats-problem_files/figure-html/fig-quartets_time_ordered-1.png){#fig-quartets_time_ordered-1 width=384}\n:::\n\n::: {.cell-output-display}\n![On the other hand, controlling for the covariate as measured at baseline does not induce bias because it is not a descendant of the outcome.](06-not-just-a-stats-problem_files/figure-html/fig-quartets_time_ordered-2.png){#fig-quartets_time_ordered-2 width=384}\n:::\n:::\n\n\n::: {.callout-warning}\n## Don't adjust for the future\n\nThe time-ordering heuristic relies on a simple rule: don't adjust for the future. \n:::\n\nThe quartet package's `causal_quartet_time` has time-ordered measurements of each variable for the four datasets. Each has a `*_baseline` and `*_followup` measurement.\n\n\n::: {.cell}\n\n```{.r .cell-code}\ncausal_quartet_time\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n# A tibble: 400 × 12\n covariate_baseline exposure_baseline\n \n 1 -0.0963 -1.43 \n 2 -1.11 0.0593 \n 3 0.647 0.370 \n 4 0.755 0.00471\n 5 1.19 0.340 \n 6 -0.588 -3.61 \n 7 -1.13 1.44 \n 8 0.689 1.02 \n 9 -1.49 -2.43 \n10 -2.78 -1.26 \n# ℹ 390 more rows\n# ℹ 10 more variables: outcome_baseline ,\n# covariate_followup , exposure_followup ,\n# outcome_followup , exposure_mid ,\n# covariate_mid , outcome_mid , u1 ,\n# u2 , dataset \n```\n\n\n:::\n:::\n\n\nUsing the formula `outcome_followup ~ exposure_baseline + covariate_baseline` works for three out of four datasets. Even though `covariate_baseline` is only in the adjustment set for the second dataset, it's not a collider in two of the other datasets, so it's not a problem. 
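\n\nAs a minimal sketch of how you might check this yourself (an illustration rather than part of the rendered chapter code; it assumes dplyr is loaded, and `adjusted_effect` is just a label we chose), you could fit the time-ordered model within each dataset:\n\n::: {.cell}\n\n```{.r .cell-code}\n# a sketch: fit outcome_followup ~ exposure_baseline + covariate_baseline\n# within each dataset, then pull the coefficient for exposure_baseline\ncausal_quartet_time |> \n nest_by(dataset) |> \n summarize(\n adjusted_effect = coef(\n lm(\n outcome_followup ~ exposure_baseline + covariate_baseline,\n data = data\n )\n )[\"exposure_baseline\"],\n .groups = \"drop\"\n )\n```\n\n:::\n\nFor the first three datasets, this should recover the effects shown in the table below (@tbl-quartet_time_adjusted); dataset 4 still shows the M-bias distortion.\n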
\n\n\n::: {#tbl-quartet_time_adjusted .cell tbl-cap='The adjusted effect of `exposure_baseline` on `outcome_followup` in each dataset. The effect adjusted for `covariate_baseline` is correct for three out of four datasets.'}\n::: {.cell-output-display}\n\n```{=html}\n
<table>\n<thead>\n<tr>\n<th>Dataset</th>\n<th>Adjusted effect</th>\n<th>Truth</th>\n</tr>\n</thead>\n<tbody>\n
<tr><td>(1) Collider</td><td>1.00</td><td>1.00</td></tr>\n
<tr><td>(2) Confounder</td><td>0.50</td><td>0.50</td></tr>\n
<tr><td>(3) Mediator</td><td>1.00</td><td>1.00</td></tr>\n
<tr><td>(4) M-Bias</td><td>0.88</td><td>1.00</td></tr>\n
</tbody>\n</table>\n
\n```\n\n:::\n:::\n\n\nWhere it fails is in dataset 4, the M-bias example. In this case, `covariate_baseline` is still a collider, because the collision occurs prior to both the exposure and outcome. As we discussed in @sec-m-bias, however, if you are in doubt about whether something is truly M-bias, it is better to adjust for it than not. Confounding bias tends to be worse, and meaningful M-bias is probably rare in real life. As the true causal structure deviates from perfect M-bias, the severity of the bias tends to decrease. So, if it is clearly M-bias, don't adjust for the variable. If it's not clear, adjust for it. \n\n::: {.callout-tip}\nRemember as well that it is possible to block bias induced by adjusting for a collider in certain circumstances because collider bias is just another open path. If we had `u1` and `u2`, we could both control for `covariate` and block any potential collider bias. In other words, sometimes when we open a path, we can close it again.\n\n\n::: {.cell}\n::: {.cell-output-display}\n![The DAG for dataset 4, where the covariate is a collider via M-Bias. If we know the structure, we shouldn't control for the covariate, but if we do, we can theoretically close the opened pathway via `u1` and `u2`, assuming we have them measured. In other words, thorough measurement and control of pre-exposure variables will still save the day.](06-not-just-a-stats-problem_files/figure-html/fig-m_bias_adjust-1.png){#fig-m_bias_adjust width=672}\n:::\n:::\n\n:::\n\n## Causal and Predictive Models, Revisited {#sec-causal-pred-revisit}\n\nPredictive measurements also fail to distinguish between the four datasets. In @tbl-quartet_time_predictive, we show the difference in a couple of common predictive metrics when we add `covariate` to the model. In each dataset, `covariate` adds information to the model because it contains associational information about the outcome. The RMSE goes down, indicating a better fit, and the R^2^ goes up, indicating more variance explained. The coefficient for `covariate` represents the information about `outcome` that it contains, not where that information comes from. In the case of the collider dataset, `covariate` is not even a useful prediction tool, because you wouldn't have it at the time of prediction, given that it happens after the exposure and outcome.\n\n\n::: {#tbl-quartet_time_predictive .cell tbl-cap='The difference in predictive metrics on `outcome` in each dataset with and without `covariate`. In each dataset, `covariate` adds information to the model, but this offers little guidance as to the proper causal model.'}\n::: {.cell-output-display}\n\n```{=html}\n
<table>\n<thead>\n<tr>\n<th>Dataset</th>\n<th>RMSE difference</th>\n<th>R<sup>2</sup> difference</th>\n</tr>\n</thead>\n<tbody>\n
<tr><td>(1) Collider</td><td>−0.14</td><td>0.12</td></tr>\n
<tr><td>(2) Confounder</td><td>−0.20</td><td>0.14</td></tr>\n
<tr><td>(3) Mediator</td><td>−0.48</td><td>0.37</td></tr>\n
<tr><td>(4) M-Bias</td><td>−0.01</td><td>0.01</td></tr>\n
</tbody>\n</table>\n
\n```\n\n:::\n:::\n\n\nRelatedly, coefficients besides those for causal effects of interest are difficult to interpret. \n",
    "supporting": [
      "06-not-just-a-stats-problem_files"
    ],
diff --git a/chapters/06-not-just-a-stats-problem.qmd b/chapters/06-not-just-a-stats-problem.qmd
index 4db8696..02c2511 100644
--- a/chapters/06-not-just-a-stats-problem.qmd
+++ b/chapters/06-not-just-a-stats-problem.qmd
@@ -448,13 +448,51 @@ d_mbias |>
 
 ## Causal and Predictive Models, Revisited {#sec-causal-pred-revisit}
 
-Predictive measurements also fail to distinguish between the four datasets. 
+Predictive measurements also fail to distinguish between the four datasets. In @tbl-quartet_time_predictive, we show the difference in a couple of common predictive metrics when we add `covariate` to the model. In each dataset, `covariate` adds information to the model because it contains associational information about the outcome. The RMSE goes down, indicating a better fit, and the R^2^ goes up, indicating more variance explained. The coefficient for `covariate` represents the information about `outcome` that it contains, not where that information comes from. In the case of the collider dataset, `covariate` is not even a useful prediction tool, because you wouldn't have it at the time of prediction, given that it happens after the exposure and outcome.
 
 ```{r}
 #| label: tbl-quartet_time_predictive
+#| echo: false
+#| tbl-cap: "The difference in predictive metrics on `outcome` in each dataset with and without `covariate`. In each dataset, `covariate` adds information to the model, but this offers little guidance as to the proper causal model."
+
+get_rmse <- function(data, model) {
+  sqrt(mean((data$outcome - predict(model, data)) ^ 2))
+}
+get_r_squared <- function(model) {
+  summary(model)$r.squared
+}
+
+causal_quartet |>
+  nest_by(dataset) |>
+  mutate(
+    rmse1 = get_rmse(
+      data,
+      lm(outcome ~ exposure, data = data)
+    ),
+    rmse2 =
+      get_rmse(
+        data,
+        lm(outcome ~ exposure + covariate, data = data)
+      ),
+    rmse_diff = rmse2 - rmse1,
+    r_squared1 = get_r_squared(lm(outcome ~ exposure, data = data)),
+    r_squared2 = get_r_squared(lm(outcome ~ exposure + covariate, data = data)),
+    r_squared_diff = r_squared2 - r_squared1
+  ) |>
+  select(dataset, rmse = rmse_diff, r_squared = r_squared_diff) |>
+  ungroup() |>
+  gt() |>
+  fmt_number() |>
+  cols_label(
+    dataset = "Dataset",
+    rmse = "RMSE difference",
+    r_squared = md("R^2^ difference")
+  )
 ```
 
+Relatedly, coefficients besides those for causal effects of interest are difficult to interpret. 
+