diff --git a/R/ggdag-mask.R b/R/ggdag-mask.R
new file mode 100644
index 0000000..b7e7c9e
--- /dev/null
+++ b/R/ggdag-mask.R
@@ -0,0 +1,37 @@
+# this is all a hack to make this work with quick plotting
+# TODO: when `geom_dag_label_repel2` exists, add to namespace as 1 then delete this first bit
+# copied from source to avoid a recursion issue when overriding in the ggdag namespace
+ggdag_geom_dag_label_repel <- function(
+  mapping = NULL, data = NULL, parse = FALSE, ...,
+  box.padding = grid::unit(0.35,"lines"), label.padding = grid::unit(0.25, "lines"),
+  point.padding = grid::unit(1.5, "lines"), label.r = grid::unit(0.15, "lines"),
+  label.size = 0.25, segment.color = "grey50", segment.size = 0.5, arrow = NULL,
+  force = 1, max.iter = 2000, nudge_x = 0, nudge_y = 0, na.rm = FALSE,
+  show.legend = NA, inherit.aes = TRUE) {
+  ggplot2::layer(data = data, mapping = mapping, stat = ggdag:::StatNodesRepel,
+    geom = ggrepel::GeomLabelRepel, position = "identity",
+    show.legend = show.legend, inherit.aes = inherit.aes,
+    params = list(parse = parse, box.padding = box.padding,
+      label.padding = label.padding, point.padding = point.padding,
+      label.r = label.r, label.size = label.size, segment.colour = segment.color %||%
+        segment.colour, segment.size = segment.size,
+      arrow = arrow, na.rm = na.rm, force = force, max.iter = max.iter,
+      nudge_x = nudge_x, nudge_y = nudge_y, segment.alpha = 1, ...))
+}
+
+geom_dag_label_repel_internal <- function(..., seed = 10) {
+  ggdag_geom_dag_label_repel(
+    mapping = aes(x, y, label = label),
+    # TODO: make sure this looks ok. slightly different from above
+    box.padding = 2,
+    max.overlaps = Inf,
+    inherit.aes = FALSE,
+    family = getOption("book.base_family"),
+    seed = seed,
+    label.size = NA,
+    label.padding = 0.1
+  )
+}
+
+# apply to quick functions as well
+assignInNamespace("geom_dag_label_repel", geom_dag_label_repel_internal, ns = "ggdag")
\ No newline at end of file
diff --git a/R/setup.R b/R/setup.R
index 30f22cb..04047dc 100644
--- a/R/setup.R
+++ b/R/setup.R
@@ -31,7 +31,7 @@ theme_dag <- function() {
 }
 
 geom_dag_label_repel <- function(..., seed = 10) {
-  ggdag::geom_dag_label_repel(
+  ggdag_geom_dag_label_repel(
     aes(x, y, label = label),
     box.padding = 3.5,
     inherit.aes = FALSE,
diff --git a/_freeze/chapters/chapter-05/execute-results/html.json b/_freeze/chapters/chapter-05/execute-results/html.json
index f0fa776..f2da6da 100644
--- a/_freeze/chapters/chapter-05/execute-results/html.json
+++ b/_freeze/chapters/chapter-05/execute-results/html.json
@@ -1,7 +1,7 @@
 {
-  "hash": "99b120943a993da92395d2e9c6806eba",
+  "hash": "18212c992d7646da91cff7153d518dfb",
   "result": {
-    "markdown": "# Expressing causal questions as DAGs {#sec-dags}\n\n\n\n\n\n## Visualizing Causal Assumptions\n\n> Draw your assumptions before your conclusions --@hernan2021\n\nCausal diagrams are a tool to visualize your assumptions about the causal structure of the questions you're trying to answer. In a randomized experiment, the causal structure is quite simple: while there may be many causes of an outcome, the only cause of the exposure is the randomization process itself (we hope!). In many non-randomized settings, however, the structure of your question can be a complex web of causality. Causal diagrams help communicate what we think this structure looks like. 
In addition to being open about what we think the causal structure is, causal diagrams have incredible mathematics properties that allow us to identify a way to estimate unbiased causal effects even with observational data.\n\nThe type of causal diagrams we use are also called directed acyclic graphs (DAGs). These graphs are directed because they include arrows going in a specific direction. They're acyclic because they don't go in circles; a variable can't cause itself, for instance. DAGs are used for a wide variety of problems, but we're specificaly concerned with *causal* DAGs. This class of DAGs is also sometimes referred to as Structural Causal Models (SCMs) because they are a model of the causal structure of a question. \n\n::: {.callout-tip}\n## DAGs down under\n\nWe highly recommend asking an Australian friend about DAGs. \n:::\n\nDAGs depict causal relationships between variables. Visually, the way they depict variables is as *edges* and *nodes*. Edges are the arrows going from one variable to another, also sometimes called arcs or just arrows. Nodes are the variables themselves, sometimes called vertices, points, or just variables. in @fig-dag-basic, there are two nodes: `x` and `y` and one edge going from `x` to `y`. Here, we are saying that `x` causes `y`. In some capacity, `y` \"listens\" to `x`.\n\n\n::: {.cell}\n::: {.cell-output-display}\n![](chapter-05_files/figure-html/fig-dag-basic-1.png){#fig-dag-basic width=288}\n:::\n:::\n\n\nIf we're interested in the causal effect of `x` on `y`, we're trying to estimate a numeric representation of that arrow. Usually, though, there are many other variables and arrows in the causal structure of a given question. A series of arrows is called a *path*. There are three types of paths you'll see in DAGs: forks, chains, and colliders (sometimes called inverse forks).\n\n\n\n::: {.cell}\n::: {.cell-output-display}\n![](chapter-05_files/figure-html/fig-dag-path-types-1.png){#fig-dag-path-types width=672}\n:::\n:::\n\n\nForks represent a common cause of two variables. Here, we're saying that `q` causes both `x` and `y`. This is the traditional definition of a confounder. They're called forks because the arrows from `x` to `y` are in different directions. Chains, on the other hand, represent a series of arrows going in the same direction. Here `q` is called a *mediator*: it is along the causal path from `x` to `y`. In this diagram, the only path from `x` to `y` is the one mediated through `q`. Finally, a collider is a path where two arrowheads meet at a variable. Because causality always goes forward in time, this naturally means that the collider variable is caused by two other variables. Here, we're saying that `x` and `y` both cause `q`.\n\n::: {.callout-tip}\n## Are DAGs SEMs?\nIf you're familiar with structural equation models (SEMs), a modeling technique commonly used in psychology and other social science settings, you may notice some similarities between SEMs and DAGs. In fact, DAGs are a form of *non-parametric* SEM. SEMs estimate entire graphs using parametric assumptions. Causal DAGs, on the other hand, don't estimate anything; an arrow going from one variable to another says nothing about the strength or functional form of that relationship, only that we think it exists.\n:::\n\nOne of the major benefits of DAGs is that they help us identify sources of bias and, often, provide clues in how to address them. However, talking about an unbiased effect estimate only makes sense when we have a specific causal question in mind. 
Since each arrow represents a cause, it's causality all the way down, so no individual arrow is inherently problematic. Here, we're interested in the effect of `x` on `y`. This question defines which paths we're interested in and which we're not. \n\nThese three types of paths have different implications for the statistical relationship between `x` and `y`. If we only look at the correlation between the two variables under these assumptions:\n\n1. In the fork, `x` and `y` will be associated, despite there being no arrow from `x` to `y`.\n2. In the chain, `x` and `y` are related but only through `q`.\n3. In the collider, `x` and `y` will *not* be related.\n\nPaths that transmit association are called *open paths*. Paths that do not transmit association are called *closed paths*. Forks and chains are open while colliders are closed. \n\nSo, should we adjust for `q`? That depends on the nature of the path. Forks are confounding paths. Because `q` causes both `x` and `y`, `x` and `y` will have a spurious association. They both contain information from `q`, their mutual cause, and that mutual causal relationship makes `x` and `y` associated statistically. Adjusting for `q` will *block* the bias from confounding and give us the true relationship between `x` and `y`. \n\n::: {.callout-tip}\n## Adjustment\nWe can use a variety of techniques to account for a variable. We use the term \"adjustment\" to generally refer to any technique that removes the effect of variables we're not interested in.\n:::\n\n@fig-confounder-scatter depicts this effect visually. Here, `x` and `y` are continuous, and by definition of the DAG, they are unrelated. `q`, however, causes both. The unadjusted effect is biased because it includes information about the open path from `x` to `y` via `z`. Within levels of `z`, however, `x` and `y` are unrelated. \n\n\n::: {.cell}\n::: {.cell-output-display}\n![](chapter-05_files/figure-html/fig-confounder-scatter-1.png){#fig-confounder-scatter width=672}\n:::\n:::\n\n\nFor chains, whether or not we adjusting for mediators depends on the research question. Here, adjusting for `q` would result in a null estimate of the effect of `x` on `y`. Because the only effect of `x` on `y` is via `q`, no other effect remains. The effect of `x` on `y` mediated by `q` is called the *indirect* effect, while the effect of `x` on `y` directly is called the *direct* effect. If we're only interested in the direct effect, controlling for `q` might be what we want. If we want to know about both effects, we shouldn't try to adjust for `q`. We'll learn more estimating different mediation effects in @sec-mediation.\n\n@fig-mediator-scatter shows this effect visually. The unadjusted effect of `x` on `y` represents the total effect. Since the total effect is due entirely to the path mediated by `q`, when we adjust for `q`, no relationship remains. This is the direct effect. Neither of these effects is due to bias but rather each answers a different research question. \n\n\n::: {.cell}\n::: {.cell-output-display}\n![](chapter-05_files/figure-html/fig-mediator-scatter-1.png){#fig-mediator-scatter width=672}\n:::\n:::\n\n\nColliders are different. In the collider DAG of @fig-dag-path-types, `x` and `y` are *not* associated, but both cause `q`. Adjusting for `q` has the opposite effect than with confounding: it *opens* a biasing pathway. Sometimes, people draw the path opened up by conditioning on a collider connecting `x` and `y`. 
\n\nVisually, we can see this happen when `x` and `y` are continuous and `q` is binary. In @fig-collider-scatter, when we don't include `q`, we find there is no relationship between `x` and `y`. That's the correct result. However, when we include `q`, we can detect information about both `x` and `y`, and they appear correlated: across levels of `x`, those with `q = 0` have lower levels of `y`. Association seemingly flows back in time. Of course, that can't happen from a causal perspective, so controlling for `q` is the wrong thing to do. We end up with a biased effect of `x` on `y`.\n\n\n::: {.cell}\n::: {.cell-output-display}\n![](chapter-05_files/figure-html/fig-collider-scatter-1.png){#fig-collider-scatter width=672}\n:::\n:::\n\n\nHow can this be? Since `x` and `y` happen before `q`, `q` can't have an impact on them. Let's turn the DAG on its side and consider @fig-collider-time. If we break down the two time points, at time point 1, `q` hasn't happened yet, and `x` and `y` are unrelated. At time point 2, `q` happens due to `x` and `y`. But causality only goes forward in time. `q` happening later can't change the fact that `x` and `y` happened independently at time point 1.\n\n\n::: {.cell}\n::: {.cell-output-display}\n![](chapter-05_files/figure-html/fig-collider-time-1.png){#fig-collider-time width=672}\n:::\n:::\n\n\nCausality only goes forward. Association, however, is time-agnostic. It's just an observation about the numerical relationships between variables. When we control for the future, we run the risk of introducing bias. It's challenging to develop an intuition for this. Consider a case where `x` and `y` are the only causes of `q`, and all three variables are binary. When *either* `x` or `y` equals 1, then `m` happens. If we know `q = 1` and `x = 0` then logically it must be that `y = 1`. Thus, knowing about `q` gives us information about `y` via `x`. This is an extreme example, but it shows how this type of bias, sometimes called *collider-stratification bias* or *selection bias*, occurs: conditioning on `q` provides statistical information about `x` and `y` and distorts their relationship.\n\n::: {.callout-tip}\n## Exchangability revisited\nWe commonly refer exchangability as the assumption of no confounding. Actually, this isn't quite right. It's the assumption of no *open, non-causal* paths. Many times, these are confounding pathways. However, paths can also be opened by conditioning on a collider. Even though these aren't confounders, it creates non-exchangability between the two groups: they are different in a way that matters to the exposure and outcome.\n:::\n\nCorrectly identifying the causal structure between the exposure and outcome thus helps us 1) communicate the assumptions we're making about the relationships between variables and 2) identify sources of bias. Importantly, in doing 2), we are also often able to identify ways to prevent bias based on the assumptions in 1). In the simple case of three DAGs in @fig-dag-path-types, we know whether or not to control for `q` depending on the nature of the causal structure. The set or sets of variables we need to adjust for is called the *adjustment set*. DAGs can help us identify adjustment sets even in complex settings.\n\n::: {.callout-tip}\n## What about interaction?\n\nDAGs don't make a statement about interaction or effect estimate modification even though they are an important part of inference. Technically, interaction is a matter of the functional form of the relationships in the DAG. 
Much as we don't need to specify how we're going to model a variable in the DAG (e.g., with splines), we don't need to specify how variables statistically interact with one another. That's a matter for the modeling stage.\n\nThere are several ways we use interactions in causal inference. In one extreme, they are simply a matter of functional form: interaction terms are included in models but marginalized over to get an overall causal effect. In the other extreme, we're interested in *joint causal effects*, where the two variables interacting with each other are both causal. In between, we can use interaction terms to identify *heterogenous causal effects*, effects which vary by a second variable that is not assumed to be causal. As with many tools in causal inference, we use the same statistical technique many ways to answer different questions. \n\n\n\nMany people have tried ways of expressing interaction in DAGs using different types of arcs, nodes, and other annotations, but no approach has taken off as the preferred way. \n:::\n\nLet's take a look at an example in R. We'll learn how to build DAGs, visualize them, and identify important information like adjustment sets.\n\n## DAGs in R\n\nFirst, let's consider the a research question: Does listening to a comedy podcast the morning before an exam improve graduate students test scores? We can diagram this using the method describe in @sec-diag (@fig-diagram-podcast).\n\n\n::: {.cell}\n::: {.cell-output-display}\n![Does listening to a comedy podcast the morning before an exam improve graduate student test scores?](../images/podcast-diagram.png){#fig-diagram-podcast width=2267}\n:::\n:::\n\n\nThe main tool we'll use for making DAGs is ggdag. ggdag is a package that connects ggplot2, the most powerful visualization tool in R, to dagitty, an R package with sophisticated algorithms for querying DAGs. \n\nTo create a DAG object, we'll use the `dagify()` function. The `dagify()` function takes formulas, separated by commas, that specify causes and effects, with the left element of of the formula specifying the effect and the right all of the factors that cause it. This is just like the type of formula we specify for most regression models in R. `dagify()` returns a `dagitty` object that works with both the dagitty and ggdag packages.\n\nWhat are all of the factors that cause graduate students to listen to a podcast the morning before an exam? What are all of the factors that could cause a graduate student to do well on a test? Let’s posit some here.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(ggdag)\ndagify(\n podcast ~ mood + humor + prepared,\n exam ~ mood + prepared\n)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\ndag {\nexam\nhumor\nmood\npodcast\nprepared\nhumor -> podcast\nmood -> exam\nmood -> podcast\nprepared -> exam\nprepared -> podcast\n}\n```\n\n\n:::\n:::\n\n\nIn the code above, we assume that a graduate student's mood, sense of humor, and how prepared they feel for the exam could influence whether they listened to a podcast the morning of the test. Likewise, we assume that their mood and how prepared they are also influences their exam score. 
Notice we *do not* see podcast in the exam equation; this means that we assume that there is no causal relationship between podcast and the exam score.\n\nThere are some other useful arguments you'll often find yourself supplying to `dagify()`: \n\n* `exposure` and `outcome`: Telling ggdag the variables that are the exposure and outcome of your research question is required for many of the most useful queries we can make of DAGs. \n* `latent`: This argument lets us tell ggdag that some variables in the DAG are unmeasured. This is really useful for identifying valid adjustment sets with the data we actually have.\n* `coords`: Coordinates for the variables. You can choose between algorithmic or manual layouts, as discussed in below. We'll use `time_ordered_coords()` here.\n* `labels`: A character vector of labels for the variables. \n\nLet's create a DAG object, `podcast_dag`, that has some of these attributes, then visualize the DAG with `ggdag()`. `ggdag()` returns a ggplot object, so we can add additional layers to the plot like themes.\n\n\n::: {.cell}\n\n```{.r .cell-code}\npodcast_dag <- dagify(\n podcast ~ mood + humor + prepared,\n exam ~ mood + prepared,\n coords = time_ordered_coords(),\n exposure = \"podcast\",\n outcome = \"exam\",\n labels = c(\n podcast = \"podcast\",\n exam = \"exam score\",\n mood = \"mood\",\n humor = \"humor\",\n prepared = \"prepared\"\n )\n)\nggdag(podcast_dag, use_labels = \"label\", text = FALSE) +\n theme_dag()\n```\n\n::: {.cell-output-display}\n![Proposed DAG to answer the question: Does listening to a comedy podcast the morning before an exam improve graduate students test scores?](chapter-05_files/figure-html/fig-dag-podcast-1.png){#fig-dag-podcast width=384}\n:::\n:::\n\n\n::: {.callout-note}\nFor the rest of the chapter, we'll use `theme_dag()`, a ggplot theme from ggdag meant for DAGs.\n\n\n::: {.cell}\n\n```{.r .cell-code}\ntheme_set(\n theme_dag() %+replace%\n theme(legend.position = \"bottom\")\n)\n```\n:::\n\n:::\n\n::: {.callout-tip}\n## DAG coordinates \nYou don't need to specify coordinates to ggdag. If you don't, it uses algorithms designed for automatic layouts. There are many such algorithms, and they focus on different aspects of the layout, e.g. the shape, the space between the nodes, minimizing how many edges cross, etc. These layout algorithms usually have a component of randomness, so it's good to use a seed if you want to get the same result.\n\n\n::: {.cell}\n\n```{.r .cell-code}\n# no coordinates specified\nset.seed(123)\npod_dag <- dagify(\n podcast ~ mood + humor + prepared,\n exam ~ mood + prepared\n)\n\n# automatically determine layouts\npod_dag |> \n ggdag(text_size = 2.8)\n```\n\n::: {.cell-output-display}\n![](chapter-05_files/figure-html/unnamed-chunk-12-1.png){width=672}\n:::\n:::\n\n\nWe can also ask for a specific layout, e.g. the popular Sugiyama algorithm for DAGs\n\n\n::: {.cell}\n\n```{.r .cell-code}\npod_dag |> \n ggdag(layout = \"sugiyama\", text_size = 2.8)\n```\n\n::: {.cell-output-display}\n![](chapter-05_files/figure-html/unnamed-chunk-13-1.png){width=672}\n:::\n:::\n\n\nFor causal DAGs, the time-ordered layout algorithm is often best, which we can specify with `time_ordered_coords()` or `layout = \"time_ordered\"`. 
We'll discuss time ordering in greater detail in @sec-time-ordered\n\n\n::: {.cell}\n\n```{.r .cell-code}\npod_dag |> \n ggdag(layout = \"time_ordered\", text_size = 2.8)\n```\n\n::: {.cell-output-display}\n![](chapter-05_files/figure-html/unnamed-chunk-14-1.png){width=672}\n:::\n:::\n\n\nYou can also manually specify coordinates using a list or data frame and provide them to the `coords` argument of `dagify()`. \n\nAdditionally, because ggdag is based on dagitty, you can use `dagitty.net` to create and organize a DAG using a graphical interface, then export the result as dagitty code for ggdag to consume.\n\nAlgorithmic layouts are often nice for fast visualization of DAGs or particularly complex graphs. Once you want to share your DAG, it's usually best to be a little more intentional about the layout, perhaps by specifying the coordinates manually. `time_ordered_coords()` is often the best of both worlds, and we'll use it for most DAGs in this book.\n:::\n\n\n::: {.cell}\n\n```{.r .cell-code}\n# TODO: Why aren't okabe-ito colors propgating here and other spots in ggdag?\npodcast_dag |> \n ggdag_paths(shadow = TRUE, text_size = 2.8)\n```\n\n::: {.cell-output-display}\n![](chapter-05_files/figure-html/fig-paths-1.png){#fig-paths width=672}\n:::\n:::\n\n::: {.cell}\n\n```{.r .cell-code}\nggdag_adjustment_set(podcast_dag, text_size = 2.8)\n```\n\n::: {.cell-output-display}\n![](chapter-05_files/figure-html/unnamed-chunk-16-1.png){width=672}\n:::\n:::\n\n::: {.cell}\n\n```{.r .cell-code}\ndagitty::adjustmentSets(podcast_dag)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n{ mood, prepared }\n```\n\n\n:::\n:::\n\n\n\n- building\n- popout: coordinates. show manual. mention dagitty.net. mention `time_ordered`.\n- paths\n- adjustment sets\n- with ggplot\n\n## Common Structures of Bias\n\n- advanced forms of confounding, e.g. L happens after X\n- Selection bias, M-Bias, Butterfly bias. L2FU later.\n- instrumental variables, precision/competing exposure variables\n\n## Recommendations in building DAGs\n\n### Iterate early and often\n\n- Ideally before when designing your research, at least before analyzing data (avoid overfitting)\n\n### Consider your question\n\n- estimand\n- population and context\n\n### Order nodes by time {#sec-time-ordered}\n\n- Time ordering algorithm\n- Feedback loops: global warming and A/C use\n\n### Consider the whole data collection process\n\n- race/shooting\n- healthy worker bias\n\n### Include variables you don't have\n\n- Examples where you can and can't adjust depending\n\n### Saturate your DAG by default\n\n### Include instruments and competing exposures\n\n### Focus on the causal structure, then consider measurement bias\n\n### Be accurate, but focus on clarity\n\n### Pick adjustment sets most likely to be succesful\n\n- measurement error, certainty\n- use the path with the most observed variables\n\n### Use robustness checks\n\n- Negative controls/Falsification end-points, dag-data consitency, alternate adjustment sets\n\n## Causal Inference is not (just) a statistical problem {#sec-quartets}\n\n* This may be better off as the next chapter\n\n### Causal and Predictive Models, Revisited {#sec-causal-pred-revisit}\n\n- Probably too long, but if possible, condense to a popout\n- DAGs showing examples where prediction can lean on measured confounders, colliders. 
It's the amount of information a variable brings, not whether the coeffecient is unbiased affect of variable on outcome.\n- Not practical to fit a prediction model with future variable\n- Table 2 Bias examples. Unmeasure confounding of Z-Y relationship. Mediation example.\n\n", + "markdown": "# Expressing causal questions as DAGs {#sec-dags}\n\n\n\n\n\n## Visualizing Causal Assumptions\n\n> Draw your assumptions before your conclusions --@hernan2021\n\nCausal diagrams are a tool to visualize your assumptions about the causal structure of the questions you're trying to answer. In a randomized experiment, the causal structure is quite simple: while there may be many causes of an outcome, the only cause of the exposure is the randomization process itself (we hope!). In many non-randomized settings, however, the structure of your question can be a complex web of causality. Causal diagrams help communicate what we think this structure looks like. In addition to being open about what we think the causal structure is, causal diagrams have incredible mathematics properties that allow us to identify a way to estimate unbiased causal effects even with observational data.\n\nThe type of causal diagrams we use are also called directed acyclic graphs (DAGs). These graphs are directed because they include arrows going in a specific direction. They're acyclic because they don't go in circles; a variable can't cause itself, for instance. DAGs are used for a wide variety of problems, but we're specificaly concerned with *causal* DAGs. This class of DAGs is also sometimes referred to as Structural Causal Models (SCMs) because they are a model of the causal structure of a question. \n\n::: {.callout-tip}\n## DAGs down under\n\nWe highly recommend asking an Australian friend about DAGs. \n:::\n\nDAGs depict causal relationships between variables. Visually, the way they depict variables is as *edges* and *nodes*. Edges are the arrows going from one variable to another, also sometimes called arcs or just arrows. Nodes are the variables themselves, sometimes called vertices, points, or just variables. in @fig-dag-basic, there are two nodes: `x` and `y` and one edge going from `x` to `y`. Here, we are saying that `x` causes `y`. In some capacity, `y` \"listens\" to `x`.\n\n\n::: {.cell}\n::: {.cell-output-display}\n![](chapter-05_files/figure-html/fig-dag-basic-1.png){#fig-dag-basic width=288}\n:::\n:::\n\n\nIf we're interested in the causal effect of `x` on `y`, we're trying to estimate a numeric representation of that arrow. Usually, though, there are many other variables and arrows in the causal structure of a given question. A series of arrows is called a *path*. There are three types of paths you'll see in DAGs: forks, chains, and colliders (sometimes called inverse forks).\n\n\n\n::: {.cell}\n::: {.cell-output-display}\n![](chapter-05_files/figure-html/fig-dag-path-types-1.png){#fig-dag-path-types width=672}\n:::\n:::\n\n\nForks represent a common cause of two variables. Here, we're saying that `q` causes both `x` and `y`. This is the traditional definition of a confounder. They're called forks because the arrows from `x` to `y` are in different directions. Chains, on the other hand, represent a series of arrows going in the same direction. Here `q` is called a *mediator*: it is along the causal path from `x` to `y`. In this diagram, the only path from `x` to `y` is the one mediated through `q`. Finally, a collider is a path where two arrowheads meet at a variable. 
Because causality always goes forward in time, this naturally means that the collider variable is caused by two other variables. Here, we're saying that `x` and `y` both cause `q`.\n\n::: {.callout-tip}\n## Are DAGs SEMs?\nIf you're familiar with structural equation models (SEMs), a modeling technique commonly used in psychology and other social science settings, you may notice some similarities between SEMs and DAGs. In fact, DAGs are a form of *non-parametric* SEM. SEMs estimate entire graphs using parametric assumptions. Causal DAGs, on the other hand, don't estimate anything; an arrow going from one variable to another says nothing about the strength or functional form of that relationship, only that we think it exists.\n:::\n\nOne of the major benefits of DAGs is that they help us identify sources of bias and, often, provide clues in how to address them. However, talking about an unbiased effect estimate only makes sense when we have a specific causal question in mind. Since each arrow represents a cause, it's causality all the way down, so no individual arrow is inherently problematic. Here, we're interested in the effect of `x` on `y`. This question defines which paths we're interested in and which we're not. \n\nThese three types of paths have different implications for the statistical relationship between `x` and `y`. If we only look at the correlation between the two variables under these assumptions:\n\n1. In the fork, `x` and `y` will be associated, despite there being no arrow from `x` to `y`.\n2. In the chain, `x` and `y` are related but only through `q`.\n3. In the collider, `x` and `y` will *not* be related.\n\nPaths that transmit association are called *open paths*. Paths that do not transmit association are called *closed paths*. Forks and chains are open while colliders are closed. \n\nSo, should we adjust for `q`? That depends on the nature of the path. Forks are confounding paths. Because `q` causes both `x` and `y`, `x` and `y` will have a spurious association. They both contain information from `q`, their mutual cause, and that mutual causal relationship makes `x` and `y` associated statistically. Adjusting for `q` will *block* the bias from confounding and give us the true relationship between `x` and `y`. \n\n::: {.callout-tip}\n## Adjustment\nWe can use a variety of techniques to account for a variable. We use the term \"adjustment\" to generally refer to any technique that removes the effect of variables we're not interested in.\n:::\n\n@fig-confounder-scatter depicts this effect visually. Here, `x` and `y` are continuous, and by definition of the DAG, they are unrelated. `q`, however, causes both. The unadjusted effect is biased because it includes information about the open path from `x` to `y` via `z`. Within levels of `z`, however, `x` and `y` are unrelated. \n\n\n::: {.cell}\n::: {.cell-output-display}\n![](chapter-05_files/figure-html/fig-confounder-scatter-1.png){#fig-confounder-scatter width=672}\n:::\n:::\n\n\nFor chains, whether or not we adjusting for mediators depends on the research question. Here, adjusting for `q` would result in a null estimate of the effect of `x` on `y`. Because the only effect of `x` on `y` is via `q`, no other effect remains. The effect of `x` on `y` mediated by `q` is called the *indirect* effect, while the effect of `x` on `y` directly is called the *direct* effect. If we're only interested in the direct effect, controlling for `q` might be what we want. 
If we want to know about both effects, we shouldn't try to adjust for `q`. We'll learn more estimating different mediation effects in @sec-mediation.\n\n@fig-mediator-scatter shows this effect visually. The unadjusted effect of `x` on `y` represents the total effect. Since the total effect is due entirely to the path mediated by `q`, when we adjust for `q`, no relationship remains. This is the direct effect. Neither of these effects is due to bias but rather each answers a different research question. \n\n\n::: {.cell}\n::: {.cell-output-display}\n![](chapter-05_files/figure-html/fig-mediator-scatter-1.png){#fig-mediator-scatter width=672}\n:::\n:::\n\n\nColliders are different. In the collider DAG of @fig-dag-path-types, `x` and `y` are *not* associated, but both cause `q`. Adjusting for `q` has the opposite effect than with confounding: it *opens* a biasing pathway. Sometimes, people draw the path opened up by conditioning on a collider connecting `x` and `y`. \n\nVisually, we can see this happen when `x` and `y` are continuous and `q` is binary. In @fig-collider-scatter, when we don't include `q`, we find there is no relationship between `x` and `y`. That's the correct result. However, when we include `q`, we can detect information about both `x` and `y`, and they appear correlated: across levels of `x`, those with `q = 0` have lower levels of `y`. Association seemingly flows back in time. Of course, that can't happen from a causal perspective, so controlling for `q` is the wrong thing to do. We end up with a biased effect of `x` on `y`.\n\n\n::: {.cell}\n::: {.cell-output-display}\n![](chapter-05_files/figure-html/fig-collider-scatter-1.png){#fig-collider-scatter width=672}\n:::\n:::\n\n\nHow can this be? Since `x` and `y` happen before `q`, `q` can't have an impact on them. Let's turn the DAG on its side and consider @fig-collider-time. If we break down the two time points, at time point 1, `q` hasn't happened yet, and `x` and `y` are unrelated. At time point 2, `q` happens due to `x` and `y`. But causality only goes forward in time. `q` happening later can't change the fact that `x` and `y` happened independently at time point 1.\n\n\n::: {.cell}\n::: {.cell-output-display}\n![](chapter-05_files/figure-html/fig-collider-time-1.png){#fig-collider-time width=672}\n:::\n:::\n\n\nCausality only goes forward. Association, however, is time-agnostic. It's just an observation about the numerical relationships between variables. When we control for the future, we run the risk of introducing bias. It's challenging to develop an intuition for this. Consider a case where `x` and `y` are the only causes of `q`, and all three variables are binary. When *either* `x` or `y` equals 1, then `m` happens. If we know `q = 1` and `x = 0` then logically it must be that `y = 1`. Thus, knowing about `q` gives us information about `y` via `x`. This is an extreme example, but it shows how this type of bias, sometimes called *collider-stratification bias* or *selection bias*, occurs: conditioning on `q` provides statistical information about `x` and `y` and distorts their relationship.\n\n::: {.callout-tip}\n## Exchangability revisited\nWe commonly refer exchangability as the assumption of no confounding. Actually, this isn't quite right. It's the assumption of no *open, non-causal* paths. Many times, these are confounding pathways. However, paths can also be opened by conditioning on a collider. 
Even though these aren't confounders, it creates non-exchangability between the two groups: they are different in a way that matters to the exposure and outcome.\n:::\n\nCorrectly identifying the causal structure between the exposure and outcome thus helps us 1) communicate the assumptions we're making about the relationships between variables and 2) identify sources of bias. Importantly, in doing 2), we are also often able to identify ways to prevent bias based on the assumptions in 1). In the simple case of three DAGs in @fig-dag-path-types, we know whether or not to control for `q` depending on the nature of the causal structure. The set or sets of variables we need to adjust for is called the *adjustment set*. DAGs can help us identify adjustment sets even in complex settings.\n\n::: {.callout-tip}\n## What about interaction?\n\nDAGs don't make a statement about interaction or effect estimate modification even though they are an important part of inference. Technically, interaction is a matter of the functional form of the relationships in the DAG. Much as we don't need to specify how we're going to model a variable in the DAG (e.g., with splines), we don't need to specify how variables statistically interact with one another. That's a matter for the modeling stage.\n\nThere are several ways we use interactions in causal inference. In one extreme, they are simply a matter of functional form: interaction terms are included in models but marginalized over to get an overall causal effect. In the other extreme, we're interested in *joint causal effects*, where the two variables interacting with each other are both causal. In between, we can use interaction terms to identify *heterogenous causal effects*, effects which vary by a second variable that is not assumed to be causal. As with many tools in causal inference, we use the same statistical technique many ways to answer different questions. \n\n\n\nMany people have tried ways of expressing interaction in DAGs using different types of arcs, nodes, and other annotations, but no approach has taken off as the preferred way. \n:::\n\nLet's take a look at an example in R. We'll learn how to build DAGs, visualize them, and identify important information like adjustment sets.\n\n## DAGs in R\n\nFirst, let's consider the a research question: Does listening to a comedy podcast the morning before an exam improve graduate students test scores? We can diagram this using the method describe in @sec-diag (@fig-diagram-podcast).\n\n\n::: {.cell}\n::: {.cell-output-display}\n![Does listening to a comedy podcast the morning before an exam improve graduate student test scores?](../images/podcast-diagram.png){#fig-diagram-podcast width=2267}\n:::\n:::\n\n\nThe main tool we'll use for making DAGs is ggdag. ggdag is a package that connects ggplot2, the most powerful visualization tool in R, to dagitty, an R package with sophisticated algorithms for querying DAGs. \n\nTo create a DAG object, we'll use the `dagify()` function. The `dagify()` function takes formulas, separated by commas, that specify causes and effects, with the left element of of the formula specifying the effect and the right all of the factors that cause it. This is just like the type of formula we specify for most regression models in R. `dagify()` returns a `dagitty` object that works with both the dagitty and ggdag packages.\n\nWhat are all of the factors that cause graduate students to listen to a podcast the morning before an exam? 
What are all of the factors that could cause a graduate student to do well on a test? Let’s posit some here.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(ggdag)\ndagify(\n podcast ~ mood + humor + prepared,\n exam ~ mood + prepared\n)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\ndag {\nexam\nhumor\nmood\npodcast\nprepared\nhumor -> podcast\nmood -> exam\nmood -> podcast\nprepared -> exam\nprepared -> podcast\n}\n```\n\n\n:::\n:::\n\n\nIn the code above, we assume that a graduate student's mood, sense of humor, and how prepared they feel for the exam could influence whether they listened to a podcast the morning of the test. Likewise, we assume that their mood and how prepared they are also influences their exam score. Notice we *do not* see podcast in the exam equation; this means that we assume that there is no causal relationship between podcast and the exam score.\n\nThere are some other useful arguments you'll often find yourself supplying to `dagify()`: \n\n* `exposure` and `outcome`: Telling ggdag the variables that are the exposure and outcome of your research question is required for many of the most useful queries we can make of DAGs. \n* `latent`: This argument lets us tell ggdag that some variables in the DAG are unmeasured. This is really useful for identifying valid adjustment sets with the data we actually have.\n* `coords`: Coordinates for the variables. You can choose between algorithmic or manual layouts, as discussed in below. We'll use `time_ordered_coords()` here.\n* `labels`: A character vector of labels for the variables. \n\nLet's create a DAG object, `podcast_dag`, that has some of these attributes, then visualize the DAG with `ggdag()`. `ggdag()` returns a ggplot object, so we can add additional layers to the plot like themes.\n\n\n::: {.cell}\n\n```{.r .cell-code}\npodcast_dag <- dagify(\n podcast ~ mood + humor + prepared,\n exam ~ mood + prepared,\n coords = time_ordered_coords(),\n exposure = \"podcast\",\n outcome = \"exam\",\n labels = c(\n podcast = \"podcast\",\n exam = \"exam score\",\n mood = \"mood\",\n humor = \"humor\",\n prepared = \"prepared\"\n )\n)\nggdag(podcast_dag, use_labels = \"label\", text = FALSE) +\n theme_dag()\n```\n\n::: {.cell-output-display}\n![Proposed DAG to answer the question: Does listening to a comedy podcast the morning before an exam improve graduate students test scores?](chapter-05_files/figure-html/fig-dag-podcast-1.png){#fig-dag-podcast width=384}\n:::\n:::\n\n\n::: {.callout-note}\nFor the rest of the chapter, we'll use `theme_dag()`, a ggplot theme from ggdag meant for DAGs.\n\n\n::: {.cell}\n\n```{.r .cell-code}\ntheme_set(\n theme_dag() %+replace%\n theme(legend.position = \"bottom\")\n)\n```\n:::\n\n:::\n\n::: {.callout-tip}\n## DAG coordinates \nYou don't need to specify coordinates to ggdag. If you don't, it uses algorithms designed for automatic layouts. There are many such algorithms, and they focus on different aspects of the layout, e.g. the shape, the space between the nodes, minimizing how many edges cross, etc. 
These layout algorithms usually have a component of randomness, so it's good to use a seed if you want to get the same result.\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\n# no coordinates specified\nset.seed(123)\npod_dag <- dagify(\n podcast ~ mood + humor + prepared,\n exam ~ mood + prepared\n)\n\n# automatically determine layouts\npod_dag |> \n ggdag(text_size = 2.8)\n```\n\n::: {.cell-output-display}\n![](chapter-05_files/figure-html/unnamed-chunk-12-1.png){fig-align='center' width=384}\n:::\n:::\n\n\nWe can also ask for a specific layout, e.g. the popular Sugiyama algorithm for DAGs\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\npod_dag |> \n ggdag(layout = \"sugiyama\", text_size = 2.8)\n```\n\n::: {.cell-output-display}\n![](chapter-05_files/figure-html/unnamed-chunk-13-1.png){fig-align='center' width=384}\n:::\n:::\n\n\nFor causal DAGs, the time-ordered layout algorithm is often best, which we can specify with `time_ordered_coords()` or `layout = \"time_ordered\"`. We'll discuss time ordering in greater detail in @sec-time-ordered\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\npod_dag |> \n ggdag(layout = \"time_ordered\", text_size = 2.8)\n```\n\n::: {.cell-output-display}\n![](chapter-05_files/figure-html/unnamed-chunk-14-1.png){fig-align='center' width=384}\n:::\n:::\n\n\nYou can also manually specify coordinates using a list or data frame and provide them to the `coords` argument of `dagify()`. \n\nAdditionally, because ggdag is based on dagitty, you can use `dagitty.net` to create and organize a DAG using a graphical interface, then export the result as dagitty code for ggdag to consume.\n\nAlgorithmic layouts are often nice for fast visualization of DAGs or particularly complex graphs. Once you want to share your DAG, it's usually best to be a little more intentional about the layout, perhaps by specifying the coordinates manually. `time_ordered_coords()` is often the best of both worlds, and we'll use it for most DAGs in this book.\n:::\n\n\n::: {.cell}\n\n```{.r .cell-code}\n# TODO: Why aren't okabe-ito colors propgating here and other spots in ggdag?\npodcast_dag |> \n ggdag_paths(shadow = TRUE, text = FALSE, use_labels = \"label\")\n```\n\n::: {.cell-output-display}\n![](chapter-05_files/figure-html/fig-paths-1.png){#fig-paths width=672}\n:::\n:::\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nggdag_adjustment_set(\n podcast_dag, \n text = FALSE, \n use_labels = \"label\"\n)\n```\n\n::: {.cell-output-display}\n![](chapter-05_files/figure-html/unnamed-chunk-16-1.png){fig-align='center' width=384}\n:::\n:::\n\n::: {.cell}\n\n```{.r .cell-code}\ndagitty::adjustmentSets(podcast_dag)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n{ mood, prepared }\n```\n\n\n:::\n:::\n\n\n\n- paths\n- adjustment sets\n- with ggplot\n\n## Common Structures of Bias\n\n- advanced forms of confounding, e.g. L happens after X\n- Selection bias, M-Bias, Butterfly bias. 
L2FU later.\n- instrumental variables, precision/competing exposure variables\n\n## Recommendations in building DAGs\n\n### Iterate early and often\n\n- Ideally before when designing your research, at least before analyzing data (avoid overfitting)\n\n### Consider your question\n\n- estimand\n- population and context\n\n### Order nodes by time {#sec-time-ordered}\n\n- Time ordering algorithm\n- Feedback loops: global warming and A/C use\n\n### Consider the whole data collection process\n\n- race/shooting\n- healthy worker bias\n\n### Include variables you don't have\n\n- Examples where you can and can't adjust depending\n\n### Saturate your DAG by default\n\n### Include instruments and competing exposures\n\n### Focus on the causal structure, then consider measurement bias\n\n### Be accurate, but focus on clarity\n\n### Pick adjustment sets most likely to be succesful\n\n- measurement error, certainty\n- use the path with the most observed variables\n\n### Use robustness checks\n\n- Negative controls/Falsification end-points, dag-data consitency, alternate adjustment sets\n\n## Causal Inference is not (just) a statistical problem {#sec-quartets}\n\n* This may be better off as the next chapter\n\n### Causal and Predictive Models, Revisited {#sec-causal-pred-revisit}\n\n- Probably too long, but if possible, condense to a popout\n- DAGs showing examples where prediction can lean on measured confounders, colliders. It's the amount of information a variable brings, not whether the coeffecient is unbiased affect of variable on outcome.\n- Not practical to fit a prediction model with future variable\n- Table 2 Bias examples. Unmeasure confounding of Z-Y relationship. Mediation example.\n\n",
     "supporting": [
       "chapter-05_files"
     ],
diff --git a/_freeze/chapters/chapter-05/figure-html/fig-dag-podcast-1.png b/_freeze/chapters/chapter-05/figure-html/fig-dag-podcast-1.png
index 1b75243..bb54db0 100644
Binary files a/_freeze/chapters/chapter-05/figure-html/fig-dag-podcast-1.png and b/_freeze/chapters/chapter-05/figure-html/fig-dag-podcast-1.png differ
diff --git a/_freeze/chapters/chapter-05/figure-html/fig-paths-1.png b/_freeze/chapters/chapter-05/figure-html/fig-paths-1.png
index 598755d..73d49ef 100644
Binary files a/_freeze/chapters/chapter-05/figure-html/fig-paths-1.png and b/_freeze/chapters/chapter-05/figure-html/fig-paths-1.png differ
diff --git a/_freeze/chapters/chapter-05/figure-html/unnamed-chunk-12-1.png b/_freeze/chapters/chapter-05/figure-html/unnamed-chunk-12-1.png
index 380e01d..24bc661 100644
Binary files a/_freeze/chapters/chapter-05/figure-html/unnamed-chunk-12-1.png and b/_freeze/chapters/chapter-05/figure-html/unnamed-chunk-12-1.png differ
diff --git a/_freeze/chapters/chapter-05/figure-html/unnamed-chunk-13-1.png b/_freeze/chapters/chapter-05/figure-html/unnamed-chunk-13-1.png
index 71cf8df..8d38e80 100644
Binary files a/_freeze/chapters/chapter-05/figure-html/unnamed-chunk-13-1.png and b/_freeze/chapters/chapter-05/figure-html/unnamed-chunk-13-1.png differ
diff --git a/_freeze/chapters/chapter-05/figure-html/unnamed-chunk-14-1.png b/_freeze/chapters/chapter-05/figure-html/unnamed-chunk-14-1.png
index a44e72c..e18ed7a 100644
Binary files a/_freeze/chapters/chapter-05/figure-html/unnamed-chunk-14-1.png and b/_freeze/chapters/chapter-05/figure-html/unnamed-chunk-14-1.png differ
diff --git a/_freeze/chapters/chapter-05/figure-html/unnamed-chunk-16-1.png b/_freeze/chapters/chapter-05/figure-html/unnamed-chunk-16-1.png
index 5738da0..11db7e7 100644
Binary files a/_freeze/chapters/chapter-05/figure-html/unnamed-chunk-16-1.png and b/_freeze/chapters/chapter-05/figure-html/unnamed-chunk-16-1.png differ
diff --git a/chapters/00-setup.qmd b/chapters/00-setup.qmd
index cfb4974..4e6f544 100644
--- a/chapters/00-setup.qmd
+++ b/chapters/00-setup.qmd
@@ -1,5 +1,6 @@
 ```{r}
 #| include: false
+source(here::here("R/ggdag-mask.R"))
 source(here::here("R/setup.R"))
 library(tidyverse)
 ```
diff --git a/chapters/chapter-05.qmd b/chapters/chapter-05.qmd
index 2d1765d..50973aa 100644
--- a/chapters/chapter-05.qmd
+++ b/chapters/chapter-05.qmd
@@ -361,6 +361,9 @@ theme_set(
 You don't need to specify coordinates to ggdag. If you don't, it uses algorithms designed for automatic layouts. There are many such algorithms, and they focus on different aspects of the layout, e.g. the shape, the space between the nodes, minimizing how many edges cross, etc. These layout algorithms usually have a component of randomness, so it's good to use a seed if you want to get the same result.
 
 ```{r}
+#| fig-width: 4
+#| fig-height: 4
+#| fig-align: center
 # no coordinates specified
 set.seed(123)
 pod_dag <- dagify(
@@ -376,6 +379,9 @@ pod_dag |>
 We can also ask for a specific layout, e.g. the popular Sugiyama algorithm for DAGs
 
 ```{r}
+#| fig-width: 4
+#| fig-height: 4
+#| fig-align: center
 pod_dag |>
   ggdag(layout = "sugiyama", text_size = 2.8)
 ```
@@ -383,6 +389,9 @@ pod_dag |>
 For causal DAGs, the time-ordered layout algorithm is often best, which we can specify with `time_ordered_coords()` or `layout = "time_ordered"`. We'll discuss time ordering in greater detail in @sec-time-ordered
 
 ```{r}
+#| fig-width: 4
+#| fig-height: 4
+#| fig-align: center
 pod_dag |>
   ggdag(layout = "time_ordered", text_size = 2.8)
 ```
@@ -398,11 +407,18 @@ Algorithmic layouts are often nice for fast visualization of DAGs or particularl
 #| label: fig-paths
 # TODO: Why aren't okabe-ito colors propgating here and other spots in ggdag?
 podcast_dag |>
-  ggdag_paths(shadow = TRUE, text_size = 2.8)
+  ggdag_paths(shadow = TRUE, text = FALSE, use_labels = "label")
 ```
 
 ```{r}
-ggdag_adjustment_set(podcast_dag, text_size = 2.8)
+#| fig-width: 4
+#| fig-height: 4
+#| fig-align: center
+ggdag_adjustment_set(
+  podcast_dag,
+  text = FALSE,
+  use_labels = "label"
+)
 ```
 
 
@@ -411,8 +427,6 @@ dagitty::adjustmentSets(podcast_dag)
 ```
 
 
-- building
-- popout: coordinates. show manual. mention dagitty.net. mention `time_ordered`.
 - paths
 - adjustment sets
 - with ggplot