diff --git a/DESCRIPTION b/DESCRIPTION
index 3ce98cd..e2b61e7 100644
--- a/DESCRIPTION
+++ b/DESCRIPTION
@@ -27,7 +27,6 @@ Imports:
     kableExtra,
     lubridate,
     MatchIt,
-    patchwork,
    propensity (>= 0.0.0.9000),
    PSW,
    ragg,
diff --git a/R/ggdag-mask.R b/R/ggdag-mask.R
deleted file mode 100644
index dc9f1d0..0000000
--- a/R/ggdag-mask.R
+++ /dev/null
@@ -1,42 +0,0 @@
-# this is all a hack to make work with quick plotting
-# TODO: when `geom_dag_label_repel2` exists, add to namespace as 1 then delete this first bit
-# copied from source to avoid recursion issue in overriding in ggdag namsespace
-ggdag_geom_dag_label_repel <- function(
-    mapping = NULL, data = NULL, parse = FALSE, ...,
-    box.padding = grid::unit(0.35,"lines"), label.padding = grid::unit(0.25, "lines"),
-    point.padding = grid::unit(1.5, "lines"), label.r = grid::unit(0.15, "lines"),
-    label.size = 0.25, segment.color = "grey50", segment.size = 0.5, arrow = NULL,
-    force = 1, max.iter = 2000, nudge_x = 0, nudge_y = 0, na.rm = FALSE,
-    show.legend = NA, inherit.aes = TRUE) {
-  ggplot2::layer(data = data, mapping = mapping, stat = ggdag:::StatNodesRepel,
-    geom = ggrepel::GeomLabelRepel, position = "identity",
-    show.legend = show.legend, inherit.aes = inherit.aes,
-    params = list(parse = parse, box.padding = box.padding,
-      label.padding = label.padding, point.padding = point.padding,
-      label.r = label.r, label.size = label.size, segment.colour = segment.color %||%
-        segment.colour, segment.size = segment.size,
-      arrow = arrow, na.rm = na.rm, force = force, max.iter = max.iter,
-      nudge_x = nudge_x, nudge_y = nudge_y, segment.alpha = 1, ...))
-}
-
-geom_dag_label_repel_internal <- function(..., seed = 10) {
-  ggdag_geom_dag_label_repel(
-    mapping = aes(x, y, label = label),
-    # TODO: make sure this looks ok. slightly different from above
-    box.padding = 2,
-    max.overlaps = Inf,
-    inherit.aes = FALSE,
-    family = getOption("book.base_family"),
-    seed = seed,
-    label.size = NA,
-    label.padding = 0.1
-  )
-}
-
-# apply to quick functions as well
-assignInNamespace("geom_dag_label_repel", geom_dag_label_repel_internal, ns = "ggdag")
-
-# override some other clumsy internals in ggdag until addressed
-
-assignInNamespace("scale_color_hue", ggplot2::scale_color_discrete, ns = "ggplot2")
-assignInNamespace("scale_edge_colour_hue", \(...) ggraph::scale_edge_colour_manual(..., values = ggokabeito::palette_okabe_ito()), ns = "ggraph")
diff --git a/R/setup.R b/R/setup.R
index 04047dc..30f22cb 100644
--- a/R/setup.R
+++ b/R/setup.R
@@ -31,7 +31,7 @@ theme_dag <- function() {
 }
 
 geom_dag_label_repel <- function(..., seed = 10) {
-  ggdag_geom_dag_label_repel(
+  ggdag::geom_dag_label_repel(
    aes(x, y, label = label),
    box.padding = 3.5,
    inherit.aes = FALSE,
diff --git a/_freeze/chapters/chapter-01/execute-results/html.json b/_freeze/chapters/chapter-01/execute-results/html.json
index 0f4e26c..13c6dc9 100644
--- a/_freeze/chapters/chapter-01/execute-results/html.json
+++ b/_freeze/chapters/chapter-01/execute-results/html.json
@@ -1,7 +1,7 @@
 {
-  "hash": "ccfc706d9bab0bbf8fafa9b700653ee4",
+  "hash": "ff9e98faf79959ed0a580bcf113af565",
   "result": {
-    "markdown": "\\mainmatter\n\n# What is a causal question? 
{#sec-causal-question}\n\n\n\n\n\n## Schrödinger's Causality\n\nThe heart of causal analysis is the causal question; it dictates what data we analyze, how we analyze it, and to which populations our inferences apply.\nThis book, being applied in nature, deals primarily with the analysis stage of causal inference.\nRelative to the complexity of specifying a good causal question, the analysis stage is considerably more straightforward.\nIn the first six chapters of this book, we'll discuss what a causal question is, how to improve our questions, and consider some examples.\n\nCausal questions are part of a broader set of questions we can ask with statistical techniques related to the primary tasks of data science: description, prediction, and causal inference [@hernan2019].\nUnfortunately, these tasks are often muddled by the techniques we use (regression, for instance, is helpful for all three tasks) and how we talk about them.\nWhen researchers are interested in causal inference from non-randomized data, we often use euphemistic language like \"association\" instead of declaring our intent to estimate a causal effect [@Hernan2018].\n\nIn a recent study of the language of analyses in epidemiologic research, for instance, the most common root word describing the estimated effect was \"associate,\" but many researchers also felt that \"associate\" implied at least *some* causal effect (@fig-word-ranking) [@haber_causal_language].\nOnly around 1% of the studies analyzed used the root word \"cause\" at all.\nYet, a third of studies had action recommendations, and researchers rated 80% of these recommendations as having at least some causal implication.\nOften, these studies have stronger action recommendations (alluding to causal effects) than those implied by the description of the effect (root words like \"associate\" and \"compare\").\nDespite how many studies implied that the goal was causal inference, only about 4% used formal causal models like those discussed in this book.\nHowever, most discussed how such a cause might be justified by previous research or theory.\n\n\n::: {.cell}\n::: {.cell-output-display}\n![Rankings of causal strength of root words used by researchers. Root words with more Strong rankings have stronger causal implications than those with many None or Weak rankings. 
Data from Haber et al.](chapter-01_files/figure-html/fig-word-ranking-1.png){#fig-word-ranking width=672}\n:::\n:::\n\n\nInstead of clear questions with obvious assumptions and goals, we end up with \"Schrödinger's causal inference\":\n\n> Our results suggest that \"Schrödinger's causal inference,\" — where studies avoid stating (or even explicitly deny) an interest in estimating causal effects yet are otherwise embedded with causal intent, inference, implications, and recommendations — is common.\n>\n> --- @haber_causal_language\n\n## Description, prediction, and explanation\n\nAn excellent first step to address this problem is recognizing that questions about description, prediction, and explanation are fundamentally different.\nData science in industry isn't quite as burdened by Schrödinger's causal inference as the academic sciences, but being explicit in the differences in analytic intent is still helpful.\nFor instance, when a stakeholder asks for \"drivers\" of a particular event, what are they asking?\nFor a model to predict the event?\nFor a deeper understanding of what causes the event?\nIt's a vague request, but it smacks of causal interest to us.\nWhen we're clear about our goals, we can use all three approaches more effectively (and, as we'll see, both descriptive analysis and prediction models are still helpful when the goal is to make causal inferences).\nMoreover, all three approaches are useful decision-making tools.\n\n### Description\n\nDescriptive analysis aims to describe the distribution of variables, often stratified by key variables of interest.\nA closely related idea is exploratory data analysis (EDA), although descriptive studies often have more explicit goals than those in EDA.\n\nDescriptive analyses are usually based on statistical summaries such as measures of centrality (means, medians) and spread (minimums, maximums, quartiles), but they also occasionally use techniques like regression modeling.\nThe goal of applying more advanced techniques like regression is different in descriptive analyses than in either predictive or causal studies.\n\"Adjusting\" for a variable in descriptive analyses means that we are removing its associational effect (and thus changing our question), *not* that we are controlling for confounding.\n\nIn epidemiology, a valuable concept for descriptive analyses is \"person, place, and time\" -- who has what disease, where, and when.\nThis concept is also a good template for descriptive analyses in other fields.\nUsually, we want to be clear about what population we're trying to describe, so we need to be as specific as possible.\nFor human health, describing the people involved, the location, and the period are all critical.\nIn other words, focus on the first principles of generating understanding of your data and describe your data accordingly.\n\n#### Examples\n\nCounting things is one of the best things we can do with data.\nEDA benefits both predictive and causal analyses, but descriptive analyses are valuable independent of the other analysis tasks.\nAsk any data scientist who thought they'd be developing complex machine learning models and found themselves spending most of their time on dashboards.\nUnderstanding the distributions of the data, particularly for key analysis goals (say, KPIs in industry or disease incidence in epidemiology), is critical for many types of decision-making.\n\nOne of the best recent examples of descriptive analyses arose from the COVID-19 pandemic [@Fox2022].\nIn 2020, particularly in the early 
months of the pandemic, descriptive analyses were vital to understanding risk and allocating resources.\nSince the coronavirus is similar to other respiratory diseases, we had many public health tools to reduce risk (e.g., distancing and, later, face masks).\nDescriptive statistics of cases by region were vital for deciding local policies and the strength of those policies.\n\nA great example of a more complex descriptive analysis during the pandemic was an [ongoing report by the Financial Times of expected deaths vs. observed deaths](https://www.ft.com/content/a2901ce8-5eb7-4633-b89c-cbdf5b386938) in various countries and regions[^3].\nWhile the calculation of expected deaths is slightly more sophisticated than most descriptive statistics, it provided a tremendous amount of information about current deaths without needing to untangle causal effects (e.g., were they due to COVID-19 directly? Inaccessible healthcare? Cardiovascular events post-COVID?).\nIn this (simplified) recreation of their plot from July 2020, you can see the staggering effect of the pandemic's early months.\n\n[^3]: John Burn-Murdoch was responsible for many of these presentations and gave a [fascinating talk on the subject](https://cloud.rstudio.com/resources/rstudioglobal-2021/reporting-on-and-visualising-the-pandemic/).\n\n\n::: {.cell}\n::: {.cell-output-display}\n![2020 excess deaths vs. historical expected deaths from any cause. Data from the Financial Times.](chapter-01_files/figure-html/fig-ft-chart-1.png){#fig-ft-chart width=960}\n:::\n:::\n\n\nHere are some other great examples of descriptive analyses.\n\n- Deforestation around the world. Our World in Data [@owidforestsanddeforestation] is a data journalism organization that produces thoughtful, usually descriptive reports on various topics. In this report, they present data visualizations of both absolute change in forest coverage (forest transitions) and relative change (deforestation or reforestation), using basic statistics and forestry theory to present helpful information about the state of forests over time.\n- The prevalence of chlamydial and gonococcal infections [@Miller2004]. Measuring the prevalence of disease (how many people currently have a disease, usually expressed as a rate per number of people) is helpful for basic public health (resources, prevention, education) and scientific understanding. In this study, the authors conducted a complex survey meant to be representative of all high schools in the United States (the target population); they used survey weights to address a variety of factors related to their question, then calculated prevalence rates and other statistics. As we'll see, weights are helpful in causal inference for the same reason: targeting a particular population. That said, not all weighting techniques are causal in nature, and they were not here.\n- Estimating race and ethnicity-specific hysterectomy inequalities [@Gartner2020]. Descriptive techniques also help us understand disparities in areas like economics and epidemiology. In this study, the authors asked: Does the risk of hysterectomy differ by racial or ethnic background? Although the analysis is stratified by a key variable, it's still descriptive. Another interesting aspect of this paper is the authors' work ensuring the research answered questions about the right target population. Their analysis combined several data sources to better estimate the true population prevalence (instead of the prevalence among those in hospitals, as commonly presented). 
They also adjusted for the prevalence of hysterectomy, e.g., they calculated the incidence (new case) rate only among those who could actually have a hysterectomy (e.g., they hadn't had one yet).\n\n#### Validity\n\nThere are two critical validity issues in descriptive analyses: measurement and sampling errors.\n\nMeasurement error is when we have mismeasured one or more variables in some capacity.\nFor descriptive analyses, mismeasuring things means we may not get the answer to our question.\nHowever, the degree to which that is the case depends on both the severity of the measurement error and the question itself.\n\nSampling error is a more nuanced topic in descriptive analyses.\nIt's related to the population we're analyzing (who should the analysis produce descriptions about) and uncertainty (how certain are we that the descriptions of those we have data for represent the population we're trying to describe.).\n\nThe population from which our data come and the population we're trying to describe must be the same for us to provide valid descriptions.\nConsider a dataset generated by an online survey.\nWho are the people who are answering these questions, and how do they relate to the people we want to describe?\nFor many analyses, the people who take the time to answer surveys are different than the people we want to describe, e.g., the group of people who fill out surveys may have a different distribution of variables than those who don't.\nResults from data like these are not technically biased because, outside of sample size-related uncertainty and measurement error, the descriptions are accurate---they're just not for the right group of people!\nIn other words, *we've gotten an answer to the wrong question*.\n\nNotably, sometimes our data represent the entire population (or close enough) that sampling error is irrelevant.\nConsider a company with certain data on every customer using their service.\nFor many analyses, this represents the entire population (current customers) about whom we want information.\nSimilarly, in countries with population-wide health registries, the data available for specific practical purposes is close enough to the entire population that no sampling is needed (although researchers might use sampling for simpler computations).\nIn these cases, there's not really such a thing as uncertainty.\nAssuming everything is well-measured, the summary statistics we generate *are inherently unbiased and precise* because we have information from everyone in the population.\nOf course, in practice, we usually have some mixture of measurement error, missing data, and so on, even in the best of circumstances.\n\nOne crucial detail of descriptive analysis is that confounding bias, one of the chief concerns of this book, is undefined.\nThat's because confounding is a causal concern.\nDescriptive analyses cannot be confounded because they are a statistical description of relationships *as-is*, not the mechanisms behind those relationships.\n\n#### Relationship to causal inference\n\nHumans see patterns very well.\nPattern-finding is a helpful feature of our brains, but it can also lead down a slippery slope of inference when we're not working with data or methods that can allow us to do that validly.\nThe biggest thing to be cautious of when your goal is to describe is making the leap from description to causation (implicitly or explicitly).\n\nBut, of course, descriptive analysis is helpful when we *are* estimating causal effects.\nIt helps us understand the population 
we're working with, the distribution of the outcomes, exposures (variables we think might be causal), and confounders (variables we need to account for to get unbiased causal effects for the exposure).\nIt also helps us be sure that the data structure we're using matches the question we're trying to answer, as we'll see in [Chapter -@sec-data-causal].\nYou should always do descriptive analyses of your data when conducting causal research.\n\nFinally, as we'll see in [Chapter -@sec-trials-std], there are certain circumstances where we can make causal inferences with basic statistics.\nBe cautious about the distinction between the causal question and the descriptive component here, too: just because we're using the same calculation (e.g., a difference in means) doesn't mean that all descriptions you can generate are causal. \nWhether a descriptive analysis overlaps with a causal analysis is a function of the data and the question.\n\n### Prediction\n\nThe goal of prediction is to use data to make accurate predictions about variables, usually on new data.\nWhat this means depends on the question, domain, and so on.\nPrediction models are used in about every setting imaginable, from peer-reviewed clinical models to bespoke machine learning models embedded in consumer devices.\nEven large language models like the ones ChatGPT is based on are prediction models: they predict what a response to a prompt would look like.\n\nPredictive modeling generally uses a different workflow than the workflow for causal modeling we'll present in this book.\nSince the goal of prediction is usually related to making predictions on new data, the workflow of this type of modeling focuses on maximizing predictive accuracy while retaining generalization to new data, sometimes called the bias-variance trade-off.\nIn practice, this means splitting your data into training sets (the part of the data you build your model on) and test sets (the part you assess your model with, a proxy for how it would perform on new data).\nUsually, data scientists use cross-validation or other sampling techniques to reduce further the risk of overfitting your model to the training set.\n\nThere are many excellent texts on predictive modeling, and so we refer you to those for a deeper exploration of the goals and methods of these techniques [@kuhn2013a; @harrell2001; @Kuhn_Silge_2022; @James_Witten_Hastie_Tibshirani_2022].\n\n#### Examples\n\nPrediction is the most popular topic in data science, largely thanks to machine learning applications in industry.\nPrediction, of course, has a long history in statistics, and many models popular today have been used for decades in and outside academia.\n\nLet's look at an example of prediction about COVID-19 [^chapter-01-1].\nIn 2021, researchers published the ISARIC 4C Deterioration model, a clinical prognostic model for predicting severe adverse outcomes for acute COVID-19 [@Gupta2021].\nThe authors included a descriptive analysis to understand the population from which this model was developed, particularly the distribution of the outcome and candidate predictors.\nOne helpful aspect of this model is that it uses items commonly measured on day one of COVID-related hospitalization.\nThe authors built this model using cross-validation by region of the UK and then tested the model on data from a hold-out region.\nThe final model included eleven items and a description of their model attributes, relation with the outcome, and so on.\nNotably, the authors used clinical domain knowledge to select 
candidate variables but did not fall into the temptation of interpreting the model coefficients as causal.\nWithout question, some of the predictive value of this model stems from the causal structure of the variables as they relate to the outcome, but the researchers had a different goal entirely for this model and stuck to it.\n\n[^chapter-01-1]: A natural model here is predicting cases, but infectious disease modeling is complex and usually uses techniques outside the usual predictive modeling workflow.\n\nHere are other good examples from the predictive space:\n\n- Some of the most exciting work in predictive modeling is in industry.\n Netflix regularly shares details on their modeling success and novel strategies in their [research blog](https://research.netflix.com/).\n They also recently published a paper reviewing their use of deep learning models for recommender systems (in this case, recommending shows and movies to users) [@steck2021].\n The authors explain their experimentation with models, the details of those models, and many of the challenges they faced, resulting in a practical guide on using such models.\n\n- In early 2020, researchers experienced with predictive and prognostic modeling in health research published a review of models for diagnosis and prognosis of COVID-19 [@Wynants2020].\n This review is interesting not just for its breadth but also the astounding number of models that were rated as poor quality: \"\\[232\\] models were rated at high or unclear risk of bias, mostly because of non-representative selection of control patients, exclusion of patients who had not experienced the event of interest by the end of the study, high risk of model overfitting, and unclear reporting.\" This research is also a [living review](https://www.covprecise.org/).\n\n#### Validity\n\nThe key measure of validity in prediction modeling is predictive accuracy, which can be measured in several ways, such as root mean squared error (RMSE), mean absolute error (MAE), area under the curve (AUC), and many more.\nA convenient detail about predictive modeling is that we can often assess if we're right, which is not true of descriptive statistics for which we only have a subset of data or causal inference for which we don't know the true causal structure.\nWe aren't always able to assess against the truth, but it's almost always required for fitting the initial predictive model [^chapter-01-2].\n\n[^chapter-01-2]: We say model singular, but usually data scientists fit many models for experimentation, and often the best prediction models are some combination of predictions from several models, called a stacked model\n\nMeasurement error is also a concern for predictive modeling because we usually need accurate data for accurate predictions.\nInterestingly, measurement error and missingness can be informative in predictive settings.\nIn a causal setting, this might induce bias, but predictive models can consume that information with no issue.\nFor instance, in the famous Netflix Prize, winning models leveraged information about whether or not a customer rated a movie at all to improve recommendation systems.\n\nLike descriptive analysis, confounding is undefined for predictive modeling.\nA coefficient in a prediction model cannot be confounded; we only care about whether or not the variable provides predictive information, not if that information is because of a causal relationship or something else.\n\n#### Relationship to causal inference\n\nThe single biggest risk in prediction is to 
assume that a given coefficient in a model has a causal interpretation.\nThere is a good chance that this isn't so.\nA model may predict well but may also have completely biased coefficients from a causal point of view.\nWe'll see more about this in @sec-pred-or-explain and the rest of the book.\n\nOften, people mistakenly use methods for selecting features (variables) for prediction models to select confounders in causal models.\nAside from their risk of overfitting, these methods are appropriate for prediction models but not for causal models.\nPrediction metrics cannot determine the causal structure of your question, and predictive value for the outcome does not make a variable a confounder.\nIn general, background knowledge (not prediction or associational statistics) should help you select variables for causal models @robinsImpossible; we'll discuss this process in detail in [Chapter -@sec-dags] and [Chapter -@sec-building-models].\n\nPrediction is nevertheless crucial to causal inference.\nFrom a philosophical perspective, we're comparing predictions from different *what if* scenarios: What would the outcome had one thing happened vs. if another thing happened?\nWe'll spend much time on this subject, particularly in [Chapter -@sec-counterfactuals].\nWe'll also talk a lot about prediction from a practical perspective: just like in prediction and some description, we'll use modeling techniques to answer causal questions.\nTechniques like propensity score methods and g-computation use model predictions to answer causal questions, but the workflow for building and interpreting the models themselves are quite different.\n\n### Causal Inference\n\nThe goal of causal inference is to understand the impact that a variable, sometimes called an exposure, has on another variable, sometimes called an outcome.\n\"Exposure\" and \"outcome\" are the terms we'll use in this book to describe the causal relationship we're interested in.\nImportantly, our goal is to answer this question clearly and precisely.\nIn practice, this means using techniques like study design (e.g., a randomized trial) or statistical methods (like propensity scores) to calculate an unbiased effect of the exposure on the outcome.\n\nAs with prediction and description, it's better to start with a clear, precise question to get a clear, precise answer.\nIn statistics and data science, particularly as we swim through the ocean of data of the modern world, we often end up with an answer without a question (e.g., `42`).\nThis, of course, makes interpretation of the answer difficult.\nIn @sec-diag, we'll discuss the structure of causal questions.\nWe'll discuss philosophical and practical ways to sharpen our questions in [Chapter -@sec-counterfactuals] and [Chapter -@sec-trials-std].\n\n::: callout-note\n## Causal inference and explanation\n\nSome authors use the phrases \"causal inference\" and \"explanation\" interchangeably.\nWe're a little more cautious about that.\nCausal inference always has a relationship to explanation, but we can accurately estimate the effect of one thing on another without understanding how it happens.\n\nConsider John Snow, the so-called father of epidemiology.\nIn 1854, Snow famously investigated a cholera outbreak in London and identified that specific water sources were to blame for the disease.\nHe was right: contaminated water was a mechanism for cholera transmission.\nYet, he didn't have enough information to explain how: *Vibrio cholerae*, the bacteria responsible for cholera, wasn't identified 
until nearly thirty years later.\n:::\n\n#### Examples\n\nWe'll see many examples of causal inference in this book, but let's continue with an example related to COVID-19.\nAs the pandemic continued and tools like vaccines and anti-viral treatments became available, policies like universal masking also began to change.\nIn February 2022, the US state of Massachusetts rescinded a statewide policy that required universal masking in public schools [@Cowger2022].\nIn the greater Boston area, some school districts continued the policy while others discontinued it; those that discontinued it also did so at different times over the following weeks after the policy change.\nThis difference in policy allowed researchers to take advantage of the differences in district policies over this period to study the impact of universal masking on COVID-19 cases.\nThe researchers included a descriptive analysis of the school districts to understand the distribution of factors related to COVID-19 and other determinants of health.\nTo estimate the effect of universal masking on cases, the authors used a technique common in policy-related causal inference called difference-in-differences to estimate this effect.\nTheir design alleviates some problematic assumptions needed for causal inference, but they also wisely controlled for potential confounders despite that advantage.\nThe authors found that districts that continued masking saw a drastically lower caseload than those that didn't; their analysis concluded that almost 12,000 additional cases occurred due to the policy change, nearly 30% of the cases in the districts during the 15 weeks of the study.\n\n\n\nHere are a few other interesting examples:\n\n- Netflix regularly uses causal inference in their work.\n In 2022, they published a [blog post summarizing some causal tasks](https://netflixtechblog.com/a-survey-of-causal-inference-applications-at-netflix-b62d25175e6f) they have engaged with.\n One interesting example is localization.\n Netflix, being worldwide, localizes content through subtitles and dubbing.\n Randomized experiments were a bad idea because they meant withholding content from users, so researchers at Netflix used several approaches to understand the value of localization while addressing potential confounding.\n One example is studying the impact of pandemic-related delays in dubbing.\n Researchers used synthetic controls to simulate the impact on viewership with and without these delays.\n Presumably, the timing of the pandemic-related delays was unrelated to many factors that would typically be related to dubbing processes, thus reducing some of the potential confounding.\n\n- The Tuskegee Study is one of modern history's most infamous examples of medical abuse.\n It is commonly pointed to as a source of distrust in the medical community from Black Americans.\n Health economics researchers used a variation of difference-in-difference techniques to assess the effect of the Tuskegee Study on distrust and life expectancy in older Black men [@Alsan2018].\n The results are important and disturbing: \"We find that the disclosure of the study in 1972 is correlated with increases in medical mistrust and mortality and decreases in both outpatient and inpatient physician interactions for older black men. 
Our estimates imply life expectancy at age 45 for black men fell by up to 1.5 years in response to the disclosure, accounting for approximately 35% of the 1980 life expectancy gap between black and white men and 25% of the gap between black men and women.\"\n\n#### Validity\n\nMaking valid causal inferences requires several assumptions that we'll discuss in @sec-assump.\nUnlike prediction, we generally cannot confirm that our causal models are correct.\nIn other words, most assumptions we need to make are unverifiable.\nWe'll come back to this topic time and time again in the book---from the basics of these assumptions to practical decision-making to probing our models for problems.\n\n### Why isn't the right causal model just the best prediction model? {#sec-pred-or-explain}\n\nAt this point, you may wonder why the right causal model isn't just the best prediction model.\nIt makes sense that the two would be related: naturally, things that cause other things would be predictors.\nIt's causality all the way down, so any predictive information *is* related, in some capacity, to the causal structure of the thing we're predicting. \nDoesn't it stand to reason that a model that predicts well is causal, too?\nIt's true that *some* predictive models can be great causal models and vice versa.\nUnfortunately, this is not always the case; causal effects needn't predict particularly well, and good predictors needn't be causally unbiased [@shmueli2010a]. \nThere is no way to know using data alone.\n\nLet's look at the causal perspective first because it's a bit simpler. \nConsider a causally unbiased model for an exposure but only includes variables related to the outcome *and* the exposure. \nIn other words, this model provides us with the correct answer for the exposure of interest but doesn't include other predictors of the outcome (which can sometimes be a good idea, as discussed in @sec-data-causal).\nIf an outcome has many causes, a model that accurately describes the relationship with the exposure likely won't predict the outcome very well.\nLikewise, if a true causal effect of the exposure on the outcome is small, it will bring little predictive value.\nIn other words, the predictive ability of a model, whether high or small, can't help us distinguish if the model is giving us the correct answer.\nOf course, low predictive power might also indicate that a causal effect isn't much use from an applied perspective, although that depends on several statistical factors.\n\nThere are two more complex reasons that predictive models won't always be unbiased causal models.\nFor the first reason, let's consider an accurate model from a causal perspective: it estimates effects on an outcome, and all of these effects are unbiased.\nEven in this ideal setting, you might get better predictions using a different model.\nThe reason has to do with the bias-variance trade-off in predictive modeling.\nWhen effects are small, data are noisy, predictors are highly correlated, or there's not much data, using a biased model like penalized regression might make sense.\nThese models intentionally introduce bias in favor of improved variance in out-of-data predictions.\nSince the goals of prediction and causal inference are different (accurate predictions, usually for out-of-data observations vs. 
an unbiased effect), the best model for inference is not necessarily the best prediction model.\n\n\n\nSecondly, variables that are biased from a causal perspective often bring along with them predictive power.\nWe'll discuss which variables to include and *not* include in your models in @sec-dags, but let's consider a simple example.\nOne of the famous examples of confounded relationships is ice cream sales and crime in the summer.\n[Descriptively, ice cream sales and crime are related](https://slate.com/news-and-politics/2013/07/warm-weather-homicide-rates-when-ice-cream-sales-rise-homicides-rise-coincidence.html), but this relationship is confounded by weather, e.g., both ice cream sales and crime increases when it's warmer.\n(This is simplistic, of course, as weather itself doesn't cause crime, but it's on the causal pathway.)\n\nConsider a thought experiment where you are in a dark room.\nYour goal is to predict crime, but you don't know the weather or time of year.\nYou do, however, have information on ice cream sales.\nA model with ice cream sales on crime would be biased from a causal perspective---ice cream sales do not cause crime, even though the model would show an effect---but would provide some predictive value to your crime model.\nThe reason for both of these conditions is the same: weather and ice cream sales are correlated, and so are weather and crime.\nIce cream sales can successfully, if imperfectly, serve as a proxy for weather.\nThat results in a biased effect estimate of the causal impact of ice cream sales on crime but a partially effective prediction of crime.\nOther variables, too, which are invalid from a causal perspective, either by being biased themselves or by introducing bias into the causal effect estimate, often bring good predictive value.\nThus, predictive accuracy is not a good measure of causality.\n\nA closely related idea is the *Table Two Fallacy*, so-called because, in health research papers, descriptive analyses are often presented in Table 1, and regression models are often presented in Table 2 [@Westreich2013].\nThe Table Two Fallacy is when a researcher presents confounders and other non-effect variables, particularly when they interpret those coefficients as if they, too, were causal.\nThe problem is that in some situations, the model to estimate an unbiased effect of one variable may not be the same model to estimate an unbiased effect of another variable.\nIn other words, we can't interpret the effects of confounders as causal because they might *themselves* be confounded by another variable unrelated to the original exposure.\n\nDescriptive, predictive, and causal analyses will always contain some aspect of each other.\nA predictive model gains some of its predictive power from the causal structure of the outcome, and a causal model has some predictive power because it contains information about the outcome.\nHowever, the same model in the same data with different goals will have different usefulness depending on those goals.\n\n\n## Diagraming a causal claim {#sec-diag}\n\nEach analysis task, whether descriptive, predictive, or inferential, should start with a clear, precise question. \nLet's diagram them to understand better the structure of causal questions (to which we'll return our focus). 
\nDiagramming sentences is a grammatical method used to visually represent the structure of a sentence, occasionally taught in grammar school.\nIn this technique, sentences are deconstructed into their constituent parts, such as subjects, verbs, objects, and modifiers, and then displayed using a series of lines and symbols.\nThe arrangement of these elements on the diagram reflects their syntactic roles and how they interact within the sentence's overall structure.\nBy breaking down sentences into these visual representations, diagramming can help learners grasp the nuances of sentence construction, identify grammatical errors, and appreciate the intricate connections between words.\nWe can apply a similar idea to *causal claims*.\nHere is an example of how one might diagram a causal claim.\nWe've pulled out the *cause*, the *effect*, the *subject* (for whom?), and the *timing* (when?).\n\n\n::: {.cell}\n::: {.cell-output-display}\n![Example of diagraming a causal claim.](../images/sentence-diagram-1.png){#fig-diagram-1 width=2219}\n:::\n:::\n\n\nLet's start with a basic causal question: **Does smoking cause lung cancer?**\n\nThe causal claim here could be that *smoking causes lung cancer*.\n@fig-diagram-2 shows a potential diagram of this causal claim.\n\n\n::: {.cell}\n::: {.cell-output-display}\n![Diagram of the causal claim \"smoking causes lung cancer\".](../images/sentence-diagram-2.png){#fig-diagram-2 width=2219}\n:::\n:::\n\n\nLet's get more specific.\nA study was published in *JAMA* (the Journal of the American Medical Association) in 2005 titled \"Effect of Smoking Reduction on Lung Cancer Risk.\"\nThis study concluded: \"Among individuals who smoke 15 or more cigarettes per day, smoking reduction by 50% significantly reduces the risk of lung cancer\".\n[@godtfredsen2005effect] The study describes the time frame studied as 5-10 years.\nLet's diagram this causal claim.\nHere, we assume that the eligibility criteria and the target population for the estimated causal effect are the same (individuals who smoke 15 or more cigarettes per day); this need not always be the case.\nIn @sec-estimands, we will discuss other potential target populations.\n\n\n::: {.cell}\n::: {.cell-output-display}\n![Example diagram of a more specific causal claim based on results from @godtfredsen2005effect.](../images/sentence-diagram-3.png){#fig-diagram-3 width=2222}\n:::\n:::\n\n\nTranslating this idea into asking good causal questions, we can map the following terms that you will see throughout this book to these diagrams: *exposure* (the cause), *outcome* (the effect), *eligibility criteria* (for whom?), *time zero* (when did the participants begin to be followed?), *target population*, (who can we estimate an outcome effect for?) 
and *follow-up period* (when?).\n\n\n::: {.cell}\n::: {.cell-output-display}\n![Example diagram mapped to causal analysis terminology](../images/sentence-diagram-4.png){#fig-diagram-4 width=2219}\n:::\n:::\n\n\nAsking good causal questions means we map the *question* to the observable *evidence*.\nLet's return to the smoking example.\nOur initial question was: *Does smoking cause lung cancer?*; The evidence in the study shows: *For people who smoke 15+ cigarettes a day, reducing smoking by 50% reduces the risk of lung cancer over 5-10 years*.\nDoes the answer match the question?\nNot quite.\nLet's update our question to match what the study actually showed: *For people who smoke 15+ cigarettes a day, does reducing smoking by 50% reduce the lung cancer risk over 5-10 years?*\nHoning this skill — asking answerable causal questions — is essential and one we will discuss throughout this book.\n", + "markdown": "\\mainmatter\n\n# What is a causal question? {#sec-causal-question}\n\n\n\n\n\n## Schrödinger's Causality\n\nThe heart of causal analysis is the causal question; it dictates what data we analyze, how we analyze it, and to which populations our inferences apply.\nThis book, being applied in nature, deals primarily with the analysis stage of causal inference.\nRelative to the complexity of specifying a good causal question, the analysis stage is considerably more straightforward.\nIn the first six chapters of this book, we'll discuss what a causal question is, how to improve our questions, and consider some examples.\n\nCausal questions are part of a broader set of questions we can ask with statistical techniques related to the primary tasks of data science: description, prediction, and causal inference [@hernan2019].\nUnfortunately, these tasks are often muddled by the techniques we use (regression, for instance, is helpful for all three tasks) and how we talk about them.\nWhen researchers are interested in causal inference from non-randomized data, we often use euphemistic language like \"association\" instead of declaring our intent to estimate a causal effect [@Hernan2018].\n\nIn a recent study of the language of analyses in epidemiologic research, for instance, the most common root word describing the estimated effect was \"associate,\" but many researchers also felt that \"associate\" implied at least *some* causal effect (@fig-word-ranking) [@haber_causal_language].\nOnly around 1% of the studies analyzed used the root word \"cause\" at all.\nYet, a third of studies had action recommendations, and researchers rated 80% of these recommendations as having at least some causal implication.\nOften, these studies have stronger action recommendations (alluding to causal effects) than those implied by the description of the effect (root words like \"associate\" and \"compare\").\nDespite how many studies implied that the goal was causal inference, only about 4% used formal causal models like those discussed in this book.\nHowever, most discussed how such a cause might be justified by previous research or theory.\n\n\n::: {.cell}\n::: {.cell-output-display}\n![Rankings of causal strength of root words used by researchers. Root words with more Strong rankings have stronger causal implications than those with many None or Weak rankings. 
Data from Haber et al.](chapter-01_files/figure-html/fig-word-ranking-1.png){#fig-word-ranking width=672}\n:::\n:::\n\n\nInstead of clear questions with obvious assumptions and goals, we end up with \"Schrödinger's causal inference\":\n\n> Our results suggest that \"Schrödinger's causal inference,\" — where studies avoid stating (or even explicitly deny) an interest in estimating causal effects yet are otherwise embedded with causal intent, inference, implications, and recommendations — is common.\n>\n> --- @haber_causal_language\n\n## Description, prediction, and explanation\n\nAn excellent first step to address this problem is recognizing that questions about description, prediction, and explanation are fundamentally different.\nData science in industry isn't quite as burdened by Schrödinger's causal inference as the academic sciences, but being explicit in the differences in analytic intent is still helpful.\nFor instance, when a stakeholder asks for \"drivers\" of a particular event, what are they asking?\nFor a model to predict the event?\nFor a deeper understanding of what causes the event?\nIt's a vague request, but it smacks of causal interest to us.\nWhen we're clear about our goals, we can use all three approaches more effectively (and, as we'll see, both descriptive analysis and prediction models are still helpful when the goal is to make causal inferences).\nMoreover, all three approaches are useful decision-making tools.\n\n### Description\n\nDescriptive analysis aims to describe the distribution of variables, often stratified by key variables of interest.\nA closely related idea is exploratory data analysis (EDA), although descriptive studies often have more explicit goals than those in EDA.\n\nDescriptive analyses are usually based on statistical summaries such as measures of centrality (means, medians) and spread (minimums, maximums, quartiles), but they also occasionally use techniques like regression modeling.\nThe goal of applying more advanced techniques like regression is different in descriptive analyses than in either predictive or causal studies.\n\"Adjusting\" for a variable in descriptive analyses means that we are removing its associational effect (and thus changing our question), *not* that we are controlling for confounding.\n\nIn epidemiology, a valuable concept for descriptive analyses is \"person, place, and time\" -- who has what disease, where, and when.\nThis concept is also a good template for descriptive analyses in other fields.\nUsually, we want to be clear about what population we're trying to describe, so we need to be as specific as possible.\nFor human health, describing the people involved, the location, and the period are all critical.\nIn other words, focus on the first principles of generating understanding of your data and describe your data accordingly.\n\n#### Examples\n\nCounting things is one of the best things we can do with data.\nEDA benefits both predictive and causal analyses, but descriptive analyses are valuable independent of the other analysis tasks.\nAsk any data scientist who thought they'd be developing complex machine learning models and found themselves spending most of their time on dashboards.\nUnderstanding the distributions of the data, particularly for key analysis goals (say, KPIs in industry or disease incidence in epidemiology), is critical for many types of decision-making.\n\nOne of the best recent examples of descriptive analyses arose from the COVID-19 pandemic [@Fox2022].\nIn 2020, particularly in the early 
months of the pandemic, descriptive analyses were vital to understanding risk and allocating resources.\nSince the coronavirus is similar to other respiratory diseases, we had many public health tools to reduce risk (e.g., distancing and, later, face masks).\nDescriptive statistics of cases by region were vital for deciding local policies and the strength of those policies.\n\nA great example of a more complex descriptive analysis during the pandemic was an [ongoing report by the Financial Times of expected deaths vs. observed deaths](https://www.ft.com/content/a2901ce8-5eb7-4633-b89c-cbdf5b386938) in various countries and regions[^3].\nWhile the calculation of expected deaths is slightly more sophisticated than most descriptive statistics, it provided a tremendous amount of information about current deaths without needing to untangle causal effects (e.g., were they due to COVID-19 directly? Inaccessible healthcare? Cardiovascular events post-COVID?).\nIn this (simplified) recreation of their plot from July 2020, you can see the staggering effect of the pandemic's early months.\n\n[^3]: John Burn-Murdoch was responsible for many of these presentations and gave a [fascinating talk on the subject](https://cloud.rstudio.com/resources/rstudioglobal-2021/reporting-on-and-visualising-the-pandemic/).\n\n\n::: {.cell}\n::: {.cell-output-display}\n![2020 excess deaths vs. historical expected deaths from any cause. Data from the Financial Times.](chapter-01_files/figure-html/fig-ft-chart-1.png){#fig-ft-chart width=960}\n:::\n:::\n\n\nHere are some other great examples of descriptive analyses.\n\n- Deforestation around the world. Our World in Data [@owidforestsanddeforestation] is a data journalism organization that produces thoughtful, usually descriptive reports on various topics. In this report, they present data visualizations of both absolute change in forest coverage (forest transitions) and relative change (deforestation or reforestation), using basic statistics and forestry theory to present helpful information about the state of forests over time.\n- The prevalence of chlamydial and gonococcal infections [@Miller2004]. Measuring the prevalence of disease (how many people currently have a disease, usually expressed as a rate per number of people) is helpful for basic public health (resources, prevention, education) and scientific understanding. In this study, the authors conducted a complex survey meant to be representative of all high schools in the United States (the target population); they used survey weights to address a variety of factors related to their question, then calculated prevalence rates and other statistics. As we'll see, weights are helpful in causal inference for the same reason: targeting a particular population. That said, not all weighting techniques are causal in nature, and they were not here.\n- Estimating race and ethnicity-specific hysterectomy inequalities [@Gartner2020]. Descriptive techniques also help us understand disparities in areas like economics and epidemiology. In this study, the authors asked: Does the risk of hysterectomy differ by racial or ethnic background? Although the analysis is stratified by a key variable, it's still descriptive. Another interesting aspect of this paper is the authors' work ensuring the research answered questions about the right target population. Their analysis combined several data sources to better estimate the true population prevalence (instead of the prevalence among those in hospitals, as commonly presented). 
They also adjusted for the prevalence of hysterectomy, e.g., they calculated the incidence (new case) rate only among those who could actually have a hysterectomy (e.g., they hadn't had one yet).\n\n#### Validity\n\nThere are two critical validity issues in descriptive analyses: measurement and sampling errors.\n\nMeasurement error is when we have mismeasured one or more variables in some capacity.\nFor descriptive analyses, mismeasuring things means we may not get the answer to our question.\nHowever, the degree to which that is the case depends on both the severity of the measurement error and the question itself.\n\nSampling error is a more nuanced topic in descriptive analyses.\nIt's related to the population we're analyzing (who should the analysis produce descriptions about?) and uncertainty (how certain are we that the descriptions of those we have data for represent the population we're trying to describe?).\n\nThe population from which our data come and the population we're trying to describe must be the same for us to provide valid descriptions.\nConsider a dataset generated by an online survey.\nWho are the people who are answering these questions, and how do they relate to the people we want to describe?\nFor many analyses, the people who take the time to answer surveys are different from the people we want to describe, e.g., the group of people who fill out surveys may have a different distribution of variables than those who don't.\nResults from data like these are not technically biased because, outside of sample size-related uncertainty and measurement error, the descriptions are accurate---they're just not for the right group of people!\nIn other words, *we've gotten an answer to the wrong question*.\n\nNotably, sometimes our data represent the entire population (or close enough) that sampling error is irrelevant.\nConsider a company with certain data on every customer using their service.\nFor many analyses, this represents the entire population (current customers) about whom we want information.\nSimilarly, in countries with population-wide health registries, the data available for specific practical purposes is close enough to the entire population that no sampling is needed (although researchers might use sampling for simpler computations).\nIn these cases, there's not really such a thing as uncertainty.\nAssuming everything is well-measured, the summary statistics we generate *are inherently unbiased and precise* because we have information from everyone in the population.\nOf course, in practice, we usually have some mixture of measurement error, missing data, and so on, even in the best of circumstances.\n\nOne crucial detail of descriptive analysis is that confounding bias, one of the chief concerns of this book, is undefined.\nThat's because confounding is a causal concern.\nDescriptive analyses cannot be confounded because they are a statistical description of relationships *as-is*, not the mechanisms behind those relationships.\n\n#### Relationship to causal inference\n\nHumans see patterns very well.\nPattern-finding is a helpful feature of our brains, but it can also lead down a slippery slope of inference when we're not working with data or methods that can allow us to do that validly.\nThe biggest thing to be cautious of when your goal is to describe is making the leap from description to causation (implicitly or explicitly).\n\nBut, of course, descriptive analysis is helpful when we *are* estimating causal effects.\nIt helps us understand the population 
we're working with, the distribution of the outcomes, exposures (variables we think might be causal), and confounders (variables we need to account for to get unbiased causal effects for the exposure).\nIt also helps us be sure that the data structure we're using matches the question we're trying to answer, as we'll see in [Chapter -@sec-data-causal].\nYou should always do descriptive analyses of your data when conducting causal research.\n\nFinally, as we'll see in [Chapter -@sec-trials-std], there are certain circumstances where we can make causal inferences with basic statistics.\nBe cautious about the distinction between the causal question and the descriptive component here, too: just because we're using the same calculation (e.g., a difference in means) doesn't mean that all descriptions you can generate are causal. \nWhether a descriptive analysis overlaps with a causal analysis is a function of the data and the question.\n\n### Prediction\n\nThe goal of prediction is to use data to make accurate predictions about variables, usually on new data.\nWhat this means depends on the question, domain, and so on.\nPrediction models are used in about every setting imaginable, from peer-reviewed clinical models to bespoke machine learning models embedded in consumer devices.\nEven large language models like the ones ChatGPT is based on are prediction models: they predict what a response to a prompt would look like.\n\nPredictive modeling generally uses a different workflow than the workflow for causal modeling we'll present in this book.\nSince the goal of prediction is usually related to making predictions on new data, the workflow of this type of modeling focuses on maximizing predictive accuracy while retaining generalization to new data, sometimes called the bias-variance trade-off.\nIn practice, this means splitting your data into training sets (the part of the data you build your model on) and test sets (the part you assess your model with, a proxy for how it would perform on new data).\nUsually, data scientists use cross-validation or other sampling techniques to reduce further the risk of overfitting your model to the training set.\n\nThere are many excellent texts on predictive modeling, and so we refer you to those for a deeper exploration of the goals and methods of these techniques [@kuhn2013a; @harrell2001; @Kuhn_Silge_2022; @James_Witten_Hastie_Tibshirani_2022].\n\n#### Examples\n\nPrediction is the most popular topic in data science, largely thanks to machine learning applications in industry.\nPrediction, of course, has a long history in statistics, and many models popular today have been used for decades in and outside academia.\n\nLet's look at an example of prediction about COVID-19 [^chapter-01-1].\nIn 2021, researchers published the ISARIC 4C Deterioration model, a clinical prognostic model for predicting severe adverse outcomes for acute COVID-19 [@Gupta2021].\nThe authors included a descriptive analysis to understand the population from which this model was developed, particularly the distribution of the outcome and candidate predictors.\nOne helpful aspect of this model is that it uses items commonly measured on day one of COVID-related hospitalization.\nThe authors built this model using cross-validation by region of the UK and then tested the model on data from a hold-out region.\nThe final model included eleven items and a description of their model attributes, relation with the outcome, and so on.\nNotably, the authors used clinical domain knowledge to select 
\nThere are many excellent texts on predictive modeling, and so we refer you to those for a deeper exploration of the goals and methods of these techniques [@kuhn2013a; @harrell2001; @Kuhn_Silge_2022; @James_Witten_Hastie_Tibshirani_2022].\n\n#### Examples\n\nPrediction is the most popular topic in data science, largely thanks to machine learning applications in industry.\nPrediction, of course, has a long history in statistics, and many models popular today have been used for decades in and outside academia.\n\nLet's look at an example of prediction about COVID-19 [^chapter-01-1].\nIn 2021, researchers published the ISARIC 4C Deterioration model, a clinical prognostic model for predicting severe adverse outcomes for acute COVID-19 [@Gupta2021].\nThe authors included a descriptive analysis to understand the population from which this model was developed, particularly the distribution of the outcome and candidate predictors.\nOne helpful aspect of this model is that it uses items commonly measured on day one of COVID-related hospitalization.\nThe authors built this model using cross-validation by region of the UK and then tested the model on data from a hold-out region.\nThe final model included eleven items and a description of their model attributes, relation with the outcome, and so on.\nNotably, the authors used clinical domain knowledge to select candidate variables but did not fall into the temptation of interpreting the model coefficients as causal.\nWithout question, some of the predictive value of this model stems from the causal structure of the variables as they relate to the outcome, but the researchers had a different goal entirely for this model and stuck to it.\n\n[^chapter-01-1]: A natural model here is predicting cases, but infectious disease modeling is complex and usually uses techniques outside the usual predictive modeling workflow.\n\nHere are other good examples from the predictive space:\n\n- Some of the most exciting work in predictive modeling is in industry.\n Netflix regularly shares details on their modeling success and novel strategies in their [research blog](https://research.netflix.com/).\n They also recently published a paper reviewing their use of deep learning models for recommender systems (in this case, recommending shows and movies to users) [@steck2021].\n The authors explain their experimentation with models, the details of those models, and many of the challenges they faced, resulting in a practical guide on using such models.\n\n- In early 2020, researchers experienced with predictive and prognostic modeling in health research published a review of models for diagnosis and prognosis of COVID-19 [@Wynants2020].\n This review is interesting not just for its breadth but also for the astounding number of models that were rated as poor quality: \"\\[232\\] models were rated at high or unclear risk of bias, mostly because of non-representative selection of control patients, exclusion of patients who had not experienced the event of interest by the end of the study, high risk of model overfitting, and unclear reporting.\" This research is also a [living review](https://www.covprecise.org/).\n\n#### Validity\n\nThe key measure of validity in prediction modeling is predictive accuracy, which can be measured in several ways, such as root mean squared error (RMSE), mean absolute error (MAE), area under the curve (AUC), and many more.\nA convenient detail about predictive modeling is that we can often assess if we're right, which is not true of descriptive statistics, for which we only have a subset of data, or causal inference, for which we don't know the true causal structure.\nWe aren't always able to assess against the truth, but it's almost always required for fitting the initial predictive model [^chapter-01-2].\n\n[^chapter-01-2]: We say model, singular, but usually data scientists fit many models for experimentation, and often the best prediction models are some combination of predictions from several models, called a stacked model.\n\nMeasurement error is also a concern for predictive modeling because we usually need accurate data for accurate predictions.\nInterestingly, measurement error and missingness can be informative in predictive settings.\nIn a causal setting, this might induce bias, but predictive models can consume that information with no issue.\nFor instance, in the famous Netflix Prize, winning models leveraged information about whether or not a customer rated a movie at all to improve recommendation systems.\n\nLike descriptive analysis, confounding is undefined for predictive modeling.\nA coefficient in a prediction model cannot be confounded; we only care about whether or not the variable provides predictive information, not whether that information is due to a causal relationship or something else.\n\n#### Relationship to causal inference\n\n
The single biggest risk in prediction is to assume that a given coefficient in a model has a causal interpretation.\nThere is a good chance that this isn't so.\nA model may predict well but may also have completely biased coefficients from a causal point of view.\nWe'll see more about this in @sec-pred-or-explain and the rest of the book.\n\nOften, people mistakenly use methods for selecting features (variables) for prediction models to select confounders in causal models.\nAside from their risk of overfitting, these methods are appropriate for prediction models but not for causal models.\nPrediction metrics cannot determine the causal structure of your question, and predictive value for the outcome does not make a variable a confounder.\nIn general, background knowledge (not prediction or associational statistics) should help you select variables for causal models [@robinsImpossible]; we'll discuss this process in detail in [Chapter -@sec-dags] and [Chapter -@sec-building-models].\n\nPrediction is nevertheless crucial to causal inference.\nFrom a philosophical perspective, we're comparing predictions from different *what if* scenarios: What would the outcome have been had one thing happened vs. had another thing happened?\nWe'll spend much time on this subject, particularly in [Chapter -@sec-counterfactuals].\nWe'll also talk a lot about prediction from a practical perspective: just like in prediction and some description, we'll use modeling techniques to answer causal questions.\nTechniques like propensity score methods and g-computation use model predictions to answer causal questions, but the workflow for building and interpreting the models themselves is quite different.\n\n### Causal Inference\n\nThe goal of causal inference is to understand the impact that a variable, sometimes called an exposure, has on another variable, sometimes called an outcome.\n\"Exposure\" and \"outcome\" are the terms we'll use in this book to describe the causal relationship we're interested in.\nImportantly, our goal is to answer this question clearly and precisely.\nIn practice, this means using techniques like study design (e.g., a randomized trial) or statistical methods (like propensity scores) to calculate an unbiased effect of the exposure on the outcome.\n\nAs with prediction and description, it's better to start with a clear, precise question to get a clear, precise answer.\nIn statistics and data science, particularly as we swim through the ocean of data of the modern world, we often end up with an answer without a question (e.g., `42`).\nThis, of course, makes interpretation of the answer difficult.\nIn @sec-diag, we'll discuss the structure of causal questions.\nWe'll discuss philosophical and practical ways to sharpen our questions in [Chapter -@sec-counterfactuals] and [Chapter -@sec-trials-std].\n\n::: callout-note\n## Causal inference and explanation\n\nSome authors use the phrases \"causal inference\" and \"explanation\" interchangeably.\nWe're a little more cautious about that.\nCausal inference always has a relationship to explanation, but we can accurately estimate the effect of one thing on another without understanding how it happens.\n\nConsider John Snow, the so-called father of epidemiology.\nIn 1854, Snow famously investigated a cholera outbreak in London and identified that specific water sources were to blame for the disease.\nHe was right: contaminated water was a mechanism for cholera transmission.\nYet, he didn't have enough information to explain how: *Vibrio cholerae*, the bacteria responsible for cholera, wasn't identified 
until nearly thirty years later.\n:::\n\n#### Examples\n\nWe'll see many examples of causal inference in this book, but let's continue with an example related to COVID-19.\nAs the pandemic continued and tools like vaccines and anti-viral treatments became available, policies like universal masking also began to change.\nIn February 2022, the US state of Massachusetts rescinded a statewide policy that required universal masking in public schools [@Cowger2022].\nIn the greater Boston area, some school districts continued the policy while others discontinued it; those that discontinued it also did so at different times over the following weeks after the policy change.\nThese differences in district policies over this period allowed researchers to study the impact of universal masking on COVID-19 cases.\nThe researchers included a descriptive analysis of the school districts to understand the distribution of factors related to COVID-19 and other determinants of health.\nTo estimate the effect of universal masking on cases, the authors used difference-in-differences, a technique common in policy-related causal inference.\nTheir design alleviates some problematic assumptions needed for causal inference, but they also wisely controlled for potential confounders despite that advantage.\nThe authors found that districts that continued masking saw a drastically lower caseload than those that didn't; their analysis concluded that almost 12,000 additional cases occurred due to the policy change, nearly 30% of the cases in the districts during the 15 weeks of the study.\n\n\n\nHere are a few other interesting examples:\n\n- Netflix regularly uses causal inference in their work.\n In 2022, they published a [blog post summarizing some causal tasks](https://netflixtechblog.com/a-survey-of-causal-inference-applications-at-netflix-b62d25175e6f) they have engaged with.\n One interesting example is localization.\n Netflix, being worldwide, localizes content through subtitles and dubbing.\n Randomized experiments were a bad idea because they meant withholding content from users, so researchers at Netflix used several approaches to understand the value of localization while addressing potential confounding.\n One example is studying the impact of pandemic-related delays in dubbing.\n Researchers used synthetic controls to simulate the impact on viewership with and without these delays.\n Presumably, the timing of the pandemic-related delays was unrelated to many factors that would typically be related to dubbing processes, thus reducing some of the potential confounding.\n\n- The Tuskegee Study is one of modern history's most infamous examples of medical abuse.\n It is commonly pointed to as a source of distrust of the medical community among Black Americans.\n Health economics researchers used a variation of difference-in-differences techniques to assess the effect of the Tuskegee Study on distrust and life expectancy in older Black men [@Alsan2018].\n The results are important and disturbing: \"We find that the disclosure of the study in 1972 is correlated with increases in medical mistrust and mortality and decreases in both outpatient and inpatient physician interactions for older black men. 
Our estimates imply life expectancy at age 45 for black men fell by up to 1.5 years in response to the disclosure, accounting for approximately 35% of the 1980 life expectancy gap between black and white men and 25% of the gap between black men and women.\"\n\n#### Validity\n\nMaking valid causal inferences requires several assumptions that we'll discuss in @sec-assump.\nUnlike prediction, we generally cannot confirm that our causal models are correct.\nIn other words, most assumptions we need to make are unverifiable.\nWe'll come back to this topic time and time again in the book---from the basics of these assumptions to practical decision-making to probing our models for problems.\n\n### Why isn't the right causal model just the best prediction model? {#sec-pred-or-explain}\n\nAt this point, you may wonder why the right causal model isn't just the best prediction model.\nIt makes sense that the two would be related: naturally, things that cause other things would be predictors.\nIt's causality all the way down, so any predictive information *is* related, in some capacity, to the causal structure of the thing we're predicting.\nDoesn't it stand to reason that a model that predicts well is causal, too?\nIt's true that *some* predictive models can be great causal models and vice versa.\nUnfortunately, this is not always the case; causal effects needn't predict particularly well, and good predictors needn't be causally unbiased [@shmueli2010a].\nThere is no way to know using data alone.\n\nLet's look at the causal perspective first because it's a bit simpler.\nConsider a model that is causally unbiased for an exposure but includes only variables related to both the outcome *and* the exposure.\nIn other words, this model provides us with the correct answer for the exposure of interest but doesn't include other predictors of the outcome (which can sometimes be a good idea, as discussed in @sec-data-causal).\nIf an outcome has many causes, a model that accurately describes the relationship with the exposure likely won't predict the outcome very well.\nLikewise, if the true causal effect of the exposure on the outcome is small, it will bring little predictive value.\nIn other words, the predictive ability of a model, whether high or low, can't tell us whether the model is giving us the correct answer.\nOf course, low predictive power might also indicate that a causal effect isn't of much use from an applied perspective, although that depends on several statistical factors.\n\nThere are two more complex reasons that predictive models won't always be unbiased causal models.\nFor the first reason, let's consider an accurate model from a causal perspective: it estimates effects on an outcome, and all of these effects are unbiased.\nEven in this ideal setting, you might get better predictions using a different model.\nThe reason has to do with the bias-variance trade-off in predictive modeling.\nWhen effects are small, data are noisy, predictors are highly correlated, or there's not much data, using a biased model like penalized regression might make sense.\nThese models intentionally introduce bias in exchange for reduced variance in out-of-data predictions.\nSince the goals of prediction and causal inference are different (accurate predictions, usually for out-of-data observations, vs. an unbiased effect), the best model for inference is not necessarily the best prediction model.\n
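\nAs a toy illustration (with made-up variable names and effect sizes, not data from this book), here's what that trade-off can look like with the glmnet package: a lasso tuned for prediction may shrink a small but real effect toward zero, especially when the exposure is correlated with a stronger predictor.\n\n``` r\nlibrary(glmnet)\n\nset.seed(1)\nn <- 500\nconfounder <- rnorm(n)\n# the exposure is correlated with the confounder\nexposure <- confounder + rnorm(n)\n# the true causal effect of the exposure on the outcome is small (0.1)\noutcome <- 0.1 * exposure + 2 * confounder + rnorm(n)\n\nx <- cbind(exposure, confounder)\n# cross-validated lasso: tuned for out-of-sample prediction,\n# not for unbiased coefficients\ncv_fit <- cv.glmnet(x, outcome)\n# at the stronger penalty, the exposure coefficient may be shrunk\n# toward zero (or dropped) even though its true effect is 0.1\ncoef(cv_fit, s = \"lambda.1se\")\n```\n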
\nSecondly, variables that are biased from a causal perspective often bring along with them predictive power.\nWe'll discuss which variables to include and *not* include in your models in @sec-dags, but let's consider a simple example.\nOne of the famous examples of confounded relationships is ice cream sales and crime in the summer.\n[Descriptively, ice cream sales and crime are related](https://slate.com/news-and-politics/2013/07/warm-weather-homicide-rates-when-ice-cream-sales-rise-homicides-rise-coincidence.html), but this relationship is confounded by weather, i.e., both ice cream sales and crime increase when it's warmer.\n(This is simplistic, of course, as weather itself doesn't cause crime, but it's on the causal pathway.)\n\nConsider a thought experiment where you are in a dark room.\nYour goal is to predict crime, but you don't know the weather or time of year.\nYou do, however, have information on ice cream sales.\nA model regressing crime on ice cream sales would be biased from a causal perspective---ice cream sales do not cause crime, even though the model would show an effect---but would provide some predictive value to your crime model.\nThe reason for both of these conditions is the same: weather and ice cream sales are correlated, and so are weather and crime.\nIce cream sales can successfully, if imperfectly, serve as a proxy for weather.\nThat results in a biased estimate of the causal impact of ice cream sales on crime but a partially effective prediction of crime.\nOther variables, too, which are invalid from a causal perspective, either by being biased themselves or by introducing bias into the causal effect estimate, often bring good predictive value.\nThus, predictive accuracy is not a good measure of causality.\n\nA closely related idea is the *Table Two Fallacy*, so-called because, in health research papers, descriptive analyses are often presented in Table 1, and regression models are often presented in Table 2 [@Westreich2013].\nThe Table Two Fallacy occurs when a researcher presents the coefficients of confounders and other adjustment variables alongside the effect of interest, particularly when they interpret those coefficients as if they, too, were causal.\nThe problem is that, in some situations, the model that estimates an unbiased effect of one variable may not be the same model that estimates an unbiased effect of another variable.\nIn other words, we can't interpret the effects of confounders as causal because they might *themselves* be confounded by another variable unrelated to the original exposure.\n\nDescriptive, predictive, and causal analyses will always contain some aspect of each other.\nA predictive model gains some of its predictive power from the causal structure of the outcome, and a causal model has some predictive power because it contains information about the outcome.\nHowever, the same model in the same data with different goals will have different usefulness depending on those goals.\n\n\n## Diagraming a causal claim {#sec-diag}\n\nEach analysis task, whether descriptive, predictive, or inferential, should start with a clear, precise question.\nLet's diagram them to better understand the structure of causal questions (to which we'll return our focus). 
\nDiagramming sentences is a grammatical method used to visually represent the structure of a sentence, occasionally taught in grammar school.\nIn this technique, sentences are deconstructed into their constituent parts, such as subjects, verbs, objects, and modifiers, and then displayed using a series of lines and symbols.\nThe arrangement of these elements on the diagram reflects their syntactic roles and how they interact within the sentence's overall structure.\nBy breaking down sentences into these visual representations, diagramming can help learners grasp the nuances of sentence construction, identify grammatical errors, and appreciate the intricate connections between words.\nWe can apply a similar idea to *causal claims*.\nHere is an example of how one might diagram a causal claim.\nWe've pulled out the *cause*, the *effect*, the *subject* (for whom?), and the *timing* (when?).\n\n\n::: {.cell}\n::: {.cell-output-display}\n![Example of diagraming a causal claim.](../images/sentence-diagram-1.png){#fig-diagram-1 width=2219}\n:::\n:::\n\n\nLet's start with a basic causal question: **Does smoking cause lung cancer?**\n\nThe causal claim here could be that *smoking causes lung cancer*.\n@fig-diagram-2 shows a potential diagram of this causal claim.\n\n\n::: {.cell}\n::: {.cell-output-display}\n![Diagram of the causal claim \"smoking causes lung cancer\".](../images/sentence-diagram-2.png){#fig-diagram-2 width=2219}\n:::\n:::\n\n\nLet's get more specific.\nA study was published in *JAMA* (the Journal of the American Medical Association) in 2005 titled \"Effect of Smoking Reduction on Lung Cancer Risk.\"\nThis study concluded: \"Among individuals who smoke 15 or more cigarettes per day, smoking reduction by 50% significantly reduces the risk of lung cancer\".\n[@godtfredsen2005effect] The study describes the time frame studied as 5-10 years.\nLet's diagram this causal claim.\nHere, we assume that the eligibility criteria and the target population for the estimated causal effect are the same (individuals who smoke 15 or more cigarettes per day); this need not always be the case.\nIn @sec-estimands, we will discuss other potential target populations.\n\n\n::: {.cell}\n::: {.cell-output-display}\n![Example diagram of a more specific causal claim based on results from @godtfredsen2005effect.](../images/sentence-diagram-3.png){#fig-diagram-3 width=2222}\n:::\n:::\n\n\nTranslating this idea into asking good causal questions, we can map the following terms that you will see throughout this book to these diagrams: *exposure* (the cause), *outcome* (the effect), *eligibility criteria* (for whom?), *time zero* (when did the participants begin to be followed?), *target population*, (who can we estimate an outcome effect for?) 
and *follow-up period* (when?).\n\n\n::: {.cell}\n::: {.cell-output-display}\n![Example diagram mapped to causal analysis terminology](../images/sentence-diagram-4.png){#fig-diagram-4 width=2219}\n:::\n:::\n\n\nAsking good causal questions means we map the *question* to the observable *evidence*.\nLet's return to the smoking example.\nOur initial question was: *Does smoking cause lung cancer?*; The evidence in the study shows: *For people who smoke 15+ cigarettes a day, reducing smoking by 50% reduces the risk of lung cancer over 5-10 years*.\nDoes the answer match the question?\nNot quite.\nLet's update our question to match what the study actually showed: *For people who smoke 15+ cigarettes a day, does reducing smoking by 50% reduce the lung cancer risk over 5-10 years?*\nHoning this skill — asking answerable causal questions — is essential and one we will discuss throughout this book.", "supporting": [ "chapter-01_files" ], diff --git a/_freeze/chapters/chapter-01/figure-html/fig-diag-1-1.png b/_freeze/chapters/chapter-01/figure-html/fig-diag-1-1.png new file mode 100644 index 0000000..329deff Binary files /dev/null and b/_freeze/chapters/chapter-01/figure-html/fig-diag-1-1.png differ diff --git a/_freeze/chapters/chapter-01/figure-html/fig-diag-1.png b/_freeze/chapters/chapter-01/figure-html/fig-diag-1.png new file mode 100644 index 0000000..669a5d9 Binary files /dev/null and b/_freeze/chapters/chapter-01/figure-html/fig-diag-1.png differ diff --git a/_freeze/chapters/chapter-01/figure-html/fig-diag-2-1.png b/_freeze/chapters/chapter-01/figure-html/fig-diag-2-1.png new file mode 100644 index 0000000..168f974 Binary files /dev/null and b/_freeze/chapters/chapter-01/figure-html/fig-diag-2-1.png differ diff --git a/_freeze/chapters/chapter-01/figure-html/fig-diagram-1-1.png b/_freeze/chapters/chapter-01/figure-html/fig-diagram-1-1.png new file mode 100644 index 0000000..772c570 Binary files /dev/null and b/_freeze/chapters/chapter-01/figure-html/fig-diagram-1-1.png differ diff --git a/_freeze/chapters/chapter-01/figure-html/fig-diagram-2-1.png b/_freeze/chapters/chapter-01/figure-html/fig-diagram-2-1.png new file mode 100644 index 0000000..b0dbd25 Binary files /dev/null and b/_freeze/chapters/chapter-01/figure-html/fig-diagram-2-1.png differ diff --git a/_freeze/chapters/chapter-01/figure-html/fig-diagram-3-1.png b/_freeze/chapters/chapter-01/figure-html/fig-diagram-3-1.png new file mode 100644 index 0000000..b0d4069 Binary files /dev/null and b/_freeze/chapters/chapter-01/figure-html/fig-diagram-3-1.png differ diff --git a/_freeze/chapters/chapter-01/figure-html/fig-diagram-4-1.png b/_freeze/chapters/chapter-01/figure-html/fig-diagram-4-1.png new file mode 100644 index 0000000..2cd2f5c Binary files /dev/null and b/_freeze/chapters/chapter-01/figure-html/fig-diagram-4-1.png differ diff --git a/_freeze/chapters/chapter-01/figure-html/fig-ft-chart-1.png b/_freeze/chapters/chapter-01/figure-html/fig-ft-chart-1.png index d3c45d2..1de6b27 100644 Binary files a/_freeze/chapters/chapter-01/figure-html/fig-ft-chart-1.png and b/_freeze/chapters/chapter-01/figure-html/fig-ft-chart-1.png differ diff --git a/_freeze/chapters/chapter-01/figure-html/unnamed-chunk-2-1.png b/_freeze/chapters/chapter-01/figure-html/unnamed-chunk-2-1.png new file mode 100644 index 0000000..329deff Binary files /dev/null and b/_freeze/chapters/chapter-01/figure-html/unnamed-chunk-2-1.png differ diff --git a/_freeze/chapters/chapter-01/figure-html/unnamed-chunk-3-1.png 
b/_freeze/chapters/chapter-01/figure-html/unnamed-chunk-3-1.png new file mode 100644 index 0000000..54339ae Binary files /dev/null and b/_freeze/chapters/chapter-01/figure-html/unnamed-chunk-3-1.png differ diff --git a/_freeze/chapters/chapter-01/figure-html/unnamed-chunk-4-1.png b/_freeze/chapters/chapter-01/figure-html/unnamed-chunk-4-1.png new file mode 100644 index 0000000..c647367 Binary files /dev/null and b/_freeze/chapters/chapter-01/figure-html/unnamed-chunk-4-1.png differ diff --git a/_freeze/chapters/chapter-01/figure-html/unnamed-chunk-5-1.png b/_freeze/chapters/chapter-01/figure-html/unnamed-chunk-5-1.png new file mode 100644 index 0000000..37bb535 Binary files /dev/null and b/_freeze/chapters/chapter-01/figure-html/unnamed-chunk-5-1.png differ diff --git a/_freeze/chapters/chapter-02/execute-results/html.json b/_freeze/chapters/chapter-02/execute-results/html.json index 66ce316..7c54681 100644 --- a/_freeze/chapters/chapter-02/execute-results/html.json +++ b/_freeze/chapters/chapter-02/execute-results/html.json @@ -1,7 +1,7 @@ { - "hash": "f26c583c5584e8efcf53b9af2193907f", + "hash": "6c6039a1b05d8b00daa81f6bebff9013", "result": { - "markdown": "# The whole game: mosquito nets and malaria {#sec-whole-game}\n\n\n\n\n\nIn this chapter, we'll analyze data using techniques we learn in this book.\nWe'll play the [whole game](https://www.gse.harvard.edu/news/uk/09/01/education-bat-seven-principles-educators) of causal analysis using a few key steps:\n\n1. Specify a causal question\n2. Draw our assumptions using a causal diagram\n3. Model our assumptions\n4. Diagnose our models\n5. Estimate the causal effect\n6. Conduct sensitivity analysis on the effect estimate\n\nWe'll focus on the broader ideas behind each step and what they look like all together; however, we don't expect you to fully digest each idea. 
We'll spend the rest of the book taking up each step in detail.\n\n## Specify a causal question\n\nIn this guided exercise, we'll attempt to answer a causal question: does using a bed net reduce the risk of malaria?\n\nMalaria remains a serious public health issue.\nWhile malaria incidence has decreased since 2000, 2020 and the COVID-19 pandemic saw an increase in cases and deaths due primarily to service interruption [@worldma].\nAbout 86% of malaria deaths occurred in 29 countries.\nNearly half of all malaria deaths occurred in just six of those countries: Nigeria (27%), the Democratic Republic of the Congo (12%), Uganda (5%), Mozambique (4%), Angola (3%), and Burkina Faso (3%).\nMost of these deaths occurred in children under 5 [@mosquito].\nMalaria also poses severe health risks to pregnant women and worsens birth outcomes, including early delivery and low birth weight.\n\nBed nets prevent morbidity and mortality due to malaria by providing a barrier against infective bites by the chief host of malaria parasites, the mosquito.\nHumans have used bed nets since ancient times.\nHerodotus, the 5th century BC Greek author of *The Histories*, observed Egyptians using their fishing nets as bed nets:\n\n> Against the gnats, which are very abundant, they have contrived as follows:---those who dwell above the fen-land are helped by the towers, to which they ascend when they go to rest; for the gnats by reason of the winds are not able to fly up high: but those who dwell in the fen-land have contrived another way instead of the towers, and this is it:---every man of them has got a casting net, with which by day he catches fish, but in the night he uses it for this purpose, that is to say he puts the casting-net round about the bed in which he sleeps, and then creeps in under it and goes to sleep: and the gnats, if he sleeps rolled up in a garment or a linen sheet, bite through these, but through the net they do not even attempt to bite [@thehist].\n\nMany modern nets are also treated with insecticide, dating back to Russian soldiers in World War II [@nevill1996], although some people still use them as fishing nets [@gettleman2015].\n\nIt's easy to imagine a randomized trial that deals with this question: participants in a study are randomly assigned to use a bed net, and we follow them over time to see if there is a difference in malaria risk between groups.\nRandomization is often the best way to estimate a causal effect of an intervention because it reduces the number of assumptions we need to make for that estimate to be valid (we will discuss these assumptions in @sec-assump).\nIn particular, randomization addresses confounding very well, accounting for confounders about which we may not even know.\n\nSeveral landmark trials have studied the effects of bed net use on malaria risk, with several essential studies in the 1990s.\nA 2004 meta-analysis found that insecticide-treated nets reduced childhood mortality by 17%, malarial parasite prevalence by 13%, and cases of uncomplicated and severe malaria by about 50% (compared to no nets) [@lengeler2004].\nSince the World Health Organization began recommending insecticide-treated nets, insecticide resistance has been a big concern.\nHowever, a follow-up analysis of trials found that it has yet to impact the public health benefits of bed nets [@pryce2018].\n\nTrials have also been influential in determining the economics of bed net programs.\nFor instance, one trial compared free net distribution versus a cost-share program (where participants pay a 
subsidized fee for nets).\nThe study's authors found that net uptake was similar between the groups and that free net distribution --- because it was easier to access --- saved more lives, and was cheaper per life saved than the cost-sharing program [@cohen2010].\n\nThere are several reasons we might not be able to conduct a randomized trial, including ethics, cost, and time.\nWe have substantial, robust evidence in favor of bed net use, but let's consider some conditions where observational causal inference could help.\n\n- Imagine we are at a time before trials on this subject, and let's say people have started to use bed nets for this purpose on their own.\n Our goal may still be to conduct a randomized trial, but we can answer questions more quickly with observed data.\n In addition, this study's results might guide trials' design or intermediary policy suggestions.\n\n- Sometimes, it is also not ethical to conduct a trial.\n An example of this in malaria research is a question that arose in the study of bed net effectiveness: does malaria control in early childhood result in delayed immunity to the disease, resulting in severe malaria or death later in life?\n Since we now know bed net use is very effective, *withholding* nets would be unethical.\n A recent observational study found that the benefits of bed net use in childhood on all-cause mortality persist into adulthood [@mosquito].\n\n- We may also want to estimate a different effect or the effect for another population than in previous trials.\n For example, both randomized and observational studies helped us better understand that insecticide-based nets improve malaria resistance in the entire community, not just among those who use nets, so long as net usage is high enough [@howard2000; @hawley2003].\n\nAs we'll see in @sec-trials-std and @sec-g-comp, the causal inference techniques that we'll discuss in this book are often beneficial even when we're able to randomize.\n\nWhen we conduct an observational study, it's still helpful to think through the randomized trial we would run were it possible.\nThe trial we're trying to emulate in this causal analysis is the *target trial.* Considering the target trial helps us make our causal question more accurate.\nLet's consider the causal question posed earlier: does using a bed net (a mosquito net) reduce the risk of malaria?\nThis question is relatively straightforward, but it is still vague.\nAs we saw in @sec-causal-question, we need to clarify some key areas:\n\n- What do we mean by \"bed net\"?\n There are several types of nets: untreated bed nets, insecticide-treated bed nets, and newer long-lasting insecticide-treated bed nets.\n\n- Risk compared to what?\n Are we, for instance, comparing insecticide-treated bed nets to *no* net?\n Untreated nets?\n Or are we comparing a new type of net, like long-lasting insecticide-treated bed nets, to nets that are already in use?\n\n- Risk as defined by what?\n Whether or not a person contracted malaria?\n Whether a person died of malaria?\n\n- Risk among whom?\n What is the population to which we're trying to apply this knowledge?\n Who is it practical to include in our study?\n Who might we need to exclude?\n\nWe will use simulated data to answer a more specific question: Does using insecticide-treated bed nets compared to no nets decrease the risk of contracting malaria after 1 year?\nIn this particular data, [simulated by Dr. 
Andrew Heiss](https://evalsp21.classes.andrewheiss.com/example/matching-ipw/#program-background):\n\n> researchers are interested in whether using mosquito nets decreases an individual's risk of contracting malaria.\n> They have collected data from 1,752 households in an unnamed country and have variables related to environmental factors, individual health, and household characteristics.\n> The data is **not experimental**---researchers have no control over who uses mosquito nets, and individual households make their own choices over whether to apply for free nets or buy their own nets, as well as whether they use the nets if they have them.\n\nBecause we're using simulated data, we'll have direct access to a variable that measures the likelihood of contracting malaria, something we wouldn't likely have in real life.\nWe'll stick with this measure because we know the actual effect size.\nWe can also safely assume that the population in our dataset represents the population we want to make inferences about (the unnamed country) because the data are simulated as such.\nWe can find the simulated data in `net_data` from the {[causalworkshop](https://github.com/r-causal/causalworkshop)} package, which includes ten variables:\n\n\n\n`id`\n\n: an ID variable\n\n`net` and `net_num`\n\n: a binary variable indicating if the participant used a net (1) or didn't use a net (0)\n\n`malaria_risk`\n\n: risk of malaria scale ranging from 0-100\n\n`income`\n\n: weekly income, measured in dollars\n\n`health`\n\n: a health score scale ranging from 0--100\n\n`household`\n\n: number of people living in the household\n\n`eligible`\n\n: a binary variable indicating if the household is eligible for the free net program.\n\n`temperature`\n\n: the average temperature at night, in Celsius\n\n`resistance`\n\n: Insecticide resistance of local mosquitoes.\n A scale of 0--100, with higher values indicating higher resistance.\n\nThe distribution of malaria risk appears to be quite different by net usage.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(tidyverse)\nlibrary(causalworkshop)\nnet_data |>\n ggplot(aes(malaria_risk, fill = net)) +\n geom_density(color = NA, alpha = .8)\n```\n\n::: {.cell-output-display}\n![A density plot of malaria risk for those who did and did not use nets. The risk of malaria is lower for those who use nets.](chapter-02_files/figure-html/fig-malaria-risk-density-1.png){#fig-malaria-risk-density width=672}\n:::\n:::\n\n::: {.cell}\n\n:::\n\n\nIn @fig-malaria-risk-density, the density of those who used nets is to the left of those who did not use nets.\nThe mean difference in malaria risk is about 16.4, suggesting net use might be protective against malaria.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nnet_data |>\n group_by(net) |>\n summarize(malaria_risk = mean(malaria_risk))\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n# A tibble: 2 × 2\n net malaria_risk\n \n1 FALSE 43.9\n2 TRUE 27.5\n```\n\n\n:::\n:::\n\n\nAnd that's what we see with simple linear regression, as well, as we would expect.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(broom)\nnet_data |>\n lm(malaria_risk ~ net, data = _) |>\n tidy()\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n# A tibble: 2 × 5\n term estimate std.error statistic p.value\n \n1 (Intercept) 43.9 0.377 116. 
0 \n2 netTRUE -16.4 0.741 -22.1 1.10e-95\n```\n\n\n:::\n:::\n\n\n## Draw our assumptions using a causal diagram\n\nThe problem that we face is that other factors may be responsible for the effect we're seeing.\nIn this example, we'll focus on confounding: a common cause of net usage and malaria will bias the effect we see unless we account for it somehow.\nOne of the best ways to determine which variables we need to account for is to use a causal diagram.\nThese diagrams, also called causal directed acyclic graphs (DAGs), visualize the assumptions that we're making about the causal relationships between the exposure, outcome, and other variables we think might be related.\n\nHere's the DAG that we're proposing for this question.\n\n\n::: {.cell}\n::: {.cell-output-display}\n![A proposed causal diagram of the effect of bed net use on malaria. This directed acyclic graph (DAG) states our assumption that bed net use causes a reduction in malaria risk. It also says that we assume: malaria risk is impacted by net usage, income, health, temperature, and insecticide resistance; net usage is impacted by income, health, temperature, eligibility for the free net program, and the number of people in a household; eligibility for the free net programs is impacted by income and the number of people in a household; and health is impacted by income.](chapter-02_files/figure-html/fig-net-data-dag-1.png){#fig-net-data-dag width=672}\n:::\n:::\n\n\n(We'll explore how to create and analyze DAGs in R in @sec-dags).\n\nIn DAGs, each point represents a variable, and each arrow represents a cause.\nIn other words, this diagram declares what we think the causal relationships are between these variables.\nIn @fig-net-data-dag, we're saying that we believe:\n\n- Malaria risk is causally impacted by net usage, income, health, temperature, and insecticide resistance.\n- Net usage is causally impacted by income, health, temperature, eligibility for the free net program, and the number of people in a household.\n- Eligibility for the free net programs is determined by income and the number of people in a household.\n- Health is causally impacted by income.\n\nYou may agree or disagree with some of these assertions.\nThat's a good thing!\nLaying bare our assumptions allows us to consider the scientific credibility of our analysis.\nAnother benefit of using DAGs is that, thanks to their mathematics, we can determine precisely the subset of variables we need to account for if we assume this DAG is correct.\n\n::: callout-tip\n## Assembling DAGs\n\nIn this exercise, we're providing you with a reasonable DAG based on knowledge of how the data were generated.\nIn real life, setting up a DAG is a challenge requiring deep thought, domain expertise, and (often) collaboration between several experts.\n:::\n\nThe chief problem we're dealing with is that, when we analyze the data we're working with, we see the impact of net usage on malaria risk *and of all these other relationships*.\nIn DAG terminology, we have more than one open causal pathway.\nIf this DAG is correct, we have *eight* causal pathways: the path between net usage and malaria risk and seven other *confounding* pathways.\n\n\n::: {.cell}\n::: {.cell-output-display}\n![In the proposed DAG, there are eight open pathways that contribute to the causal effect seen in the naive regression: the true effect (in green) of net usage on malaria risk and seven other confounding pathways (in orange). 
The naive estimate is wrong because it is a composite of all these effects.](chapter-02_files/figure-html/fig-net-data-confounding-1.png){#fig-net-data-confounding width=1344}\n:::\n:::\n\n\nWhen we calculate a naive linear regression that only includes net usage and malaria risk, the effect we see is incorrect because the seven other confounding pathways in @fig-net-data-confounding distort it.\nIn DAG terminology, we need to *block* these open pathways that distort the causal estimate we're after.\n(We can block paths through several techniques, including stratification, matching, weighting, and more. We'll see several methods throughout the book.) Luckily, by specifying a DAG, we can precisely determine the variables we need to control for.\nFor this DAG, we need to control for three variables: health, income, and temperature.\nThese three variables are a *minimal adjustment set*, the minimum set (or sets) of variables you need to block all confounding pathways.\nWe'll discuss adjustment sets further in @sec-dags.\n\n## Model our assumptions\n\nWe'll use a technique called Inverse Probability Weighting (IPW) to control for these variables, which we'll discuss in detail in @sec-using-ps.\nWe'll use logistic regression to predict the probability of treatment---the propensity score.\nThen, we'll calculate inverse probability weights to apply to the linear regression model we fit above.\nThe propensity score model includes the exposure---net use---as the dependent variable and the minimal adjustment set as the independent variables.\n\n::: callout-tip\n## Modeling the functional form\n\nGenerally speaking, we want to lean on domain expertise and good modeling practices to fit the propensity score model.\nFor instance, we may want to allow continuous confounders to be non-linear using splines, or we may want to add essential interactions between confounders.\nBecause these are simulated data, we know we don't need these extra parameters (so we'll skip them), but in practice, you often do.\nWe'll discuss this more in @sec-using-ps.\n:::\n\nThe propensity score model is a logistic regression model with the formula `net ~ income + health + temperature`, which predicts the probability of bed net usage based on the confounders income, health, and temperature.\n\n\n::: {.cell}\n\n```{.r .cell-code}\npropensity_model <- glm(\n net ~ income + health + temperature,\n data = net_data,\n family = binomial()\n)\n\n# the first six propensity scores\nhead(predict(propensity_model, type = \"response\"))\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n 1 2 3 4 5 6 \n0.2464 0.2178 0.3230 0.2307 0.2789 0.3060 \n```\n\n\n:::\n:::\n\n\nWe can use propensity scores to control for confounding in various ways.\nIn this example, we'll focus on weighting.\nIn particular, we'll compute the inverse probability weight for the *average treatment effect* (ATE).\nThe ATE represents a particular causal question: what if *everyone* in the study used bed nets vs. 
what if *no one* in the study used bed nets?\n\nTo calculate the ATE, we'll use the broom and propensity packages.\nbroom's `augment()` function extracts prediction-related information from the model and joins it to the data.\npropensity's `wt_ate()` function calculates the inverse probability weight given the propensity score and exposure.\n\nFor inverse probability weighting, the ATE weight is the inverse of the probability of receiving the treatment you actually received.\nIn other words, if you used a bed net, the ATE weight is the inverse of the probability that you used a net, and if you did *not* use a net, it is the inverse of the probability that you did *not* use a net.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(broom)\nlibrary(propensity)\nnet_data_wts <- propensity_model |>\n augment(newdata = net_data, type.predict = \"response\") |>\n # .fitted is the value predicted by the model\n # for a given observation\n mutate(wts = wt_ate(.fitted, net))\n\nnet_data_wts |>\n select(net, .fitted, wts) |>\n head()\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n# A tibble: 6 × 3\n net .fitted wts\n \n1 FALSE 0.246 1.33\n2 FALSE 0.218 1.28\n3 FALSE 0.323 1.48\n4 FALSE 0.231 1.30\n5 FALSE 0.279 1.39\n6 FALSE 0.306 1.44\n```\n\n\n:::\n:::\n\n\n`wts` represents the amount each observation will be up-weighted or down-weighted in the outcome model we will soon fit.\nFor instance, the 16th household used a bed net and had a predicted probability of 0.41.\nThat's a pretty low probability considering they did, in fact, use a net, so their weight is higher at 2.42.\nIn other words, this household will be up-weighted compared to the naive linear model we fit above.\nThe first household did *not* use a bed net; their predicted probability of net use was 0.25 (or put differently, a predicted probability of *not* using a net of 0.75).\nThat's more in line with their observed value of `net`, but there's still some predicted probability of using a net, so their weight is 1.33.\n\n## Diagnose our models\n\nThe goal of propensity score weighting is to weight the population of observations such that the distribution of confounders is balanced between the exposure groups.\nPut another way, we are, in principle, removing the arrows between the confounders and exposure in the DAG, so that the confounding paths no longer distort our estimates.\nHere's the distribution of the propensity score by group, created by `geom_mirror_histogram()` from the halfmoon package for assessing balance in propensity score models:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(halfmoon)\nggplot(net_data_wts, aes(.fitted)) +\n geom_mirror_histogram(\n aes(fill = net),\n bins = 50\n ) +\n scale_y_continuous(labels = abs) +\n labs(x = \"propensity score\")\n```\n\n::: {.cell-output-display}\n![A mirrored histogram of the propensity scores of those who used nets (top, blue) versus those who did not use nets (bottom, orange). 
The range of propensity scores is similar between groups, with those who used nets slightly to the left of those who didn't, but the shapes of the distribution are different.](chapter-02_files/figure-html/fig-mirror-histogram-net-data-unweighted-1.png){#fig-mirror-histogram-net-data-unweighted width=672}\n:::\n:::\n\n\nThe weighted propensity score creates a pseudo-population where the distributions are much more similar:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nggplot(net_data_wts, aes(.fitted)) +\n geom_mirror_histogram(\n aes(group = net),\n bins = 50\n ) +\n geom_mirror_histogram(\n aes(fill = net, weight = wts),\n bins = 50,\n alpha = .5\n ) +\n scale_y_continuous(labels = abs) +\n labs(x = \"propensity score\")\n```\n\n::: {.cell-output-display}\n![A mirrored histogram of the propensity scores of those who used nets (top, blue) versus those who who did not use nets (bottom, orange). The shaded region represents the unweighted distribution, and the colored region represents the weighted distributions. The ATE weights up-weight the groups to be similar in range and shape of the distribution of propensity scores.](chapter-02_files/figure-html/fig-mirror-histogram-net-data-weighted-1.png){#fig-mirror-histogram-net-data-weighted width=672}\n:::\n:::\n\n\nIn this example, the unweighted distributions are not awful---the shapes are somewhat similar here, and the overlap quite a bit---but the weighted distributions in @fig-mirror-histogram-net-data-weighted are much more similar.\n\n::: callout-caution\n## Unmeasured confounding\n\nPropensity score weighting and most other causal inference techniques only help with *observed* confounders---ones that we model correctly, at that.\nUnfortunately, we still may have unmeasured confounding, which we'll discuss below.\n\nRandomization is one causal inference technique that *does* deal with unmeasured confounding, one of the reasons it is so powerful.\n:::\n\nWe might also want to know how well-balanced the groups are by each confounder.\nOne way to do this is to calculate the standardized mean differences (SMDs) for each confounder with and without weights.\nWe'll calculate the SMDs with `tidy_smd()` then plot them with `geom_love()`.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nplot_df <- tidy_smd(\n net_data_wts,\n c(income, health, temperature),\n .group = net,\n .wts = wts\n)\n\nggplot(\n plot_df,\n aes(\n x = abs(smd),\n y = variable,\n group = method,\n color = method\n )\n) +\n geom_love()\n```\n\n::: {.cell-output-display}\n![A love plot representing the standardized mean differences (SMD) between exposure groups of three confounders: temperature, income, and health. Before weighting, there are considerable differences in the groups. 
After weighting, the confounders are much more balanced between groups.](chapter-02_files/figure-html/fig-love-plot-net-data-1.png){#fig-love-plot-net-data width=672}\n:::\n:::\n\n\nA standard guideline is that balanced confounders should have an SMD of less than 0.1 on the absolute scale.\n0.1 is just a rule of thumb, but if we follow it, the variables in @fig-love-plot-net-data are well-balanced after weighting (and unbalanced before weighting).\n\nBefore we apply the weights to the outcome model, let's check their overall distribution for extreme weights.\nExtreme weights can destabilize the estimate and variance in the outcome model, so we want to be aware of it.\nWe'll also discuss several other types of weights that are less prone to this issue in @sec-estimands.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nnet_data_wts |>\n ggplot(aes(wts)) +\n geom_density(fill = \"#CC79A7\", color = NA, alpha = 0.8)\n```\n\n::: {.cell-output-display}\n![A density plot of the average treatment effect (ATE) weights. The plot is skewed, with higher values towards 8. This may indicate a problem with the model, but the weights aren't so extreme to destabilize the variance of the estimate.](chapter-02_files/figure-html/fig-ate-density-net-data-1.png){#fig-ate-density-net-data width=672}\n:::\n:::\n\n\nThe weights in @fig-ate-density-net-data are skewed, but there are no outrageous values.\nIf we saw extreme weights, we might try trimming or stabilizing them, or consider calculating an effect for a different estimand, which we'll discuss in @sec-estimands.\nIt doesn't look like we need to do that here, however.\n\n## Estimate the causal effect\n\nWe're now ready to use the ATE weights to (attempt to) account for confounding in the naive linear regression model.\nFitting such a model is pleasantly simple in this case: we fit the same model as before but with `weights = wts`, which will incorporate the inverse probability weights.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nnet_data_wts |>\n lm(malaria_risk ~ net, data = _, weights = wts) |>\n tidy(conf.int = TRUE)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n# A tibble: 2 × 7\n term estimate std.error statistic p.value conf.low\n \n1 (Inte… 42.7 0.442 96.7 0 41.9\n2 netTR… -12.5 0.624 -20.1 5.50e-81 -13.8\n# ℹ 1 more variable: conf.high \n```\n\n\n:::\n:::\n\n\n\n\nThe estimate for the average treatment effect is -12.5 (95% CI -13.8, -11.3).\nUnfortunately, the confidence intervals we're using are wrong because they don't account for the dependence within the weights!\nGenerally, confidence intervals for propensity score weighted models will be too narrow unless we correct for this dependence.\nThe nominal coverage of the confidence intervals will thus be wrong (they aren't 95% CIs because their coverage is much lower than 95%) and may lead to misinterpretation.\n\nWe've got several ways to address this problem, which we'll discuss in detail in @sec-outcome-model, including the bootstrap, robust standard errors, and manually accounting for the dependence with empirical sandwich estimators.\nFor this example, we'll use the bootstrap, a flexible tool that calculates distributions of parameters using re-sampling.\nWe'll use the rsample package from the tidymodels ecosystem to work with bootstrap samples.\n\n::: callout-note\n## The Bootstrap\n\nThe bootstrap is a simple but flexible algorithm for calculating statistics using re-sampling with replacement.\nIt's handy when a closed-form solution doesn't exist to calculate something, as is commonly the case in 
causal inference (particularly for standard errors), and when we suspect the parametric calculations are not valid for a given situation.\n\nBootstrapping in R has a long tradition of writing functions to calculate the statistic of interest, starting with the classic boot package.\nThroughout the book, we'll use rsample, a more modern alternative for re-sampling, but generally, we start with writing a function to calculate the estimate we're interested in.\n\nLet's say we want to calculate `some_statistic()` for `this_data`.\nTo bootstrap for *R* samples, we:\n\n1. Re-sample `this_data` with replacement.\n The same row may appear multiple (or no) times in a given bootstrap sample, simulating the sampling process in the underlying population.\n\n ``` r\n indices <- sample(\n # create a vector of indices:\n # 1 through the number of rows\n seq_len(nrow(this_data)), \n # sample a vector indices \n # that's the same length as `this_data`\n size = nrow(this_data), \n replace = TRUE\n )\n bootstrap_sample <- this_data[indices, ]\n ```\n\n2. Fit `some_statistic()` on the `bootstrap_sample`\n\n ``` r\n estimate <- some_statistic(bootstrap_sample)\n ```\n\n3. Repeat *R* times\n\nWe then end up with a distribution of `estimate`s, with which we can calculate population statistics, such as point estimates, standard errors, and confidence intervals.\n:::\n\nBecause the bootstrap is so flexible, we need to think carefully about the sources of uncertainty in the statistic we're calculating.\nIt might be tempting to write a function like this to fit the statistic we're interested in (the point estimate for `netTRUE`):\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(rsample)\n\nfit_ipw_not_quite_rightly <- function(split, ...) {\n # get bootstrapped data sample with `rsample::analysis()`\n .df <- analysis(split)\n\n # fit ipw model\n lm(malaria_risk ~ net, data = .df, weights = wts) |>\n tidy()\n}\n```\n:::\n\n\nHowever, this function won't give us the correct confidence intervals because it treats the inverse probability weights as fixed values.\nThey're not, of course; we just estimated them using logistic regression!\nWe need to account for this uncertainty by bootstrapping the *entire modeling process*.\nFor every bootstrap sample, we need to fit the propensity score model, calculate the inverse probability weights, then fit the weighted outcome model.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(rsample)\n\nfit_ipw <- function(split, ...) 
{\n # get bootstrapped data sample with `rsample::analysis()`\n .df <- analysis(split)\n\n # fit propensity score model\n propensity_model <- glm(\n net ~ income + health + temperature,\n data = .df,\n family = binomial()\n )\n\n # calculate inverse probability weights\n .df <- propensity_model |>\n augment(type.predict = \"response\", data = .df) |>\n mutate(wts = wt_ate(.fitted, net))\n\n # fit correctly bootstrapped ipw model\n lm(malaria_risk ~ net, data = .df, weights = wts) |>\n tidy()\n}\n```\n:::\n\n\nNow that we know precisely how to calculate the estimate for each iteration let's create the bootstrapped dataset with rsample's `bootstraps()` function.\nThe `times` argument determines how many bootstrapped datasets to create; we'll do 1,000.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nbootstrapped_net_data <- bootstraps(\n net_data,\n times = 1000,\n # required to calculate CIs later\n apparent = TRUE\n)\n\nbootstrapped_net_data\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n# Bootstrap sampling with apparent sample \n# A tibble: 1,001 × 2\n splits id \n \n 1 Bootstrap0001\n 2 Bootstrap0002\n 3 Bootstrap0003\n 4 Bootstrap0004\n 5 Bootstrap0005\n 6 Bootstrap0006\n 7 Bootstrap0007\n 8 Bootstrap0008\n 9 Bootstrap0009\n10 Bootstrap0010\n# ℹ 991 more rows\n```\n\n\n:::\n:::\n\n\nThe result is a nested data frame: each `splits` object contains metadata that rsample uses to subset the bootstrap samples for each of the 1,000 samples.\nNext, we'll run `fit_ipw()` 1,000 times to create a distribution for `estimate`.\nAt its heart, the calculation we're doing is\n\n``` r\nfit_ipw(bootstrapped_net_data$splits[[n]])\n```\n\nWhere *n* is one of 1,000 indices.\nWe'll use purrr's `map()` function to iterate across each `split` object.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nipw_results <- bootstrapped_net_data |>\n mutate(boot_fits = map(splits, fit_ipw))\n\nipw_results\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n# Bootstrap sampling with apparent sample \n# A tibble: 1,001 × 3\n splits id boot_fits \n \n 1 Bootstrap0001 \n 2 Bootstrap0002 \n 3 Bootstrap0003 \n 4 Bootstrap0004 \n 5 Bootstrap0005 \n 6 Bootstrap0006 \n 7 Bootstrap0007 \n 8 Bootstrap0008 \n 9 Bootstrap0009 \n10 Bootstrap0010 \n# ℹ 991 more rows\n```\n\n\n:::\n:::\n\n\nThe result is another nested data frame with a new column, `boot_fits`.\nEach element of `boot_fits` is the result of the IPW for the bootstrapped dataset.\nFor example, in the first bootstrapped data set, the IPW results were:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nipw_results$boot_fits[[1]]\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n# A tibble: 2 × 5\n term estimate std.error statistic p.value\n \n1 (Intercept) 42.6 0.446 95.4 0 \n2 netTRUE -11.7 0.630 -18.5 6.95e-70\n```\n\n\n:::\n:::\n\n\nNow we have a distribution of estimates:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nipw_results |>\n mutate(\n estimate = map_dbl(\n boot_fits,\n # pull the `estimate` for `netTRUE` for each fit\n \\(.fit) .fit |>\n filter(term == \"netTRUE\") |>\n pull(estimate)\n )\n ) |>\n ggplot(aes(estimate)) +\n geom_histogram(fill = \"#D55E00FF\", color = \"white\", alpha = 0.8)\n```\n\n::: {.cell-output-display}\n![\"A histogram of 1,000 bootstrapped estimates of the effect of net use on malaria risk. 
The spread of these estimates accounts for the dependency and uncertainty in the use of IPW weights.\"](chapter-02_files/figure-html/fig-bootstrap-estimates-net-data-1.png){#fig-bootstrap-estimates-net-data width=672}\n:::\n:::\n\n\n@fig-bootstrap-estimates-net-data gives a sense of the variation in `estimate`, but let's calculate 95% confidence intervals from the bootstrapped distribution using rsample's `int_t()` :\n\n\n::: {.cell}\n\n```{.r .cell-code}\nboot_estimate <- ipw_results |>\n # calculate T-statistic-based CIs\n int_t(boot_fits) |>\n filter(term == \"netTRUE\")\n\nboot_estimate\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n# A tibble: 1 × 6\n term .lower .estimate .upper .alpha .method \n \n1 netTRUE -13.4 -12.5 -11.7 0.05 student-t\n```\n\n\n:::\n:::\n\n\nNow we have a confounder-adjusted estimate with correct standard errors.\nThe estimate of the effect of *all* households using bed nets versus *no* households using bed nets on malaria risk is -12.5 (95% CI -13.4, -11.7).\nBed nets do indeed seem to reduce malaria risk in this study.\n\n## Conduct sensitivity analysis on the effect estimate\n\nWe've laid out a roadmap for taking observational data, thinking critically about the causal question we want to ask, identifying the assumptions we need to get there, then applying those assumptions to a statistical model.\nGetting the correct answer to the causal question relies on getting our assumptions more or less right.\nBut what if we're more on the less correct side?\n\nSpoiler alert: the answer we just calculated is **wrong**.\nAfter all that effort!\n\nWhen conducting a causal analysis, it's a good idea to use sensitivity analyses to test your assumptions.\nThere are many potential sources of bias in any study and many sensitivity analyses to go along with them; we'll focus on the assumption of no confounding.\n\nLet's start with a broad sensitivity analysis; then, we'll ask questions about specific unmeasured confounders.\nWhen we have less information about unmeasured confounders, we can use tipping point analysis to ask how much confounding it would take to tip my estimate to the null.\nIn other words, what would the strength of the unmeasured confounder have to be to explain our results away?\nThe tipr package is a toolkit for conducting sensitivity analyses.\nLet's examine the tipping point for an unknown, normally-distributed confounder.\nThe `tip_coef()` function takes an estimate (a beta coefficient from a regression model, or the upper or lower bound of the coefficient).\nIt further requires either the 1) scaled differences in means of the confounder between exposure groups or 2) effect of the confounder on the outcome.\nFor the estimate, we'll use `conf.high`, which is closer to 0 (the null), and ask: how much would the confounder have to affect malaria risk to have an unbiased upper confidence interval of 0?\nWe'll use tipr to calculate this answer for 5 scenarios, where the mean difference in the confounder between exposure groups is 1, 2, 3, 4, or 5.\n\n\n::: {.cell}\n\n:::\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(tipr)\ntipping_points <- tip_coef(boot_estimate$.upper, exposure_confounder_effect = 1:5)\n```\n\n::: {.cell-output .cell-output-stderr}\n\n```\nThe observed effect (-11.72) WOULD be tipped by 1 unmeasured confounder\nwith the following specifications:\n * estimated difference in scaled means between the unmeasured confounder\n in the exposed population and unexposed population: 1\n * estimated relationship between the unmeasured 
confounder and the outcome: -11.72\nThe observed effect (-11.72) WOULD be tipped by 1 unmeasured confounder\nwith the following specifications:\n * estimated difference in scaled means between the unmeasured confounder\n in the exposed population and unexposed population: 2\n * estimated relationship between the unmeasured confounder and the outcome: -5.86\nThe observed effect (-11.72) WOULD be tipped by 1 unmeasured confounder\nwith the following specifications:\n * estimated difference in scaled means between the unmeasured confounder\n in the exposed population and unexposed population: 3\n * estimated relationship between the unmeasured confounder and the outcome: -3.91\nThe observed effect (-11.72) WOULD be tipped by 1 unmeasured confounder\nwith the following specifications:\n * estimated difference in scaled means between the unmeasured confounder\n in the exposed population and unexposed population: 4\n * estimated relationship between the unmeasured confounder and the outcome: -2.93\nThe observed effect (-11.72) WOULD be tipped by 1 unmeasured confounder\nwith the following specifications:\n * estimated difference in scaled means between the unmeasured confounder\n in the exposed population and unexposed population: 5\n * estimated relationship between the unmeasured confounder and the outcome: -2.34\n```\n\n\n:::\n\n```{.r .cell-code}\ntipping_points |>\n ggplot(aes(confounder_outcome_effect, exposure_confounder_effect)) +\n geom_line(color = \"#009E73\", linewidth = 1.1) +\n geom_point(fill = \"#009E73\", color = \"white\", size = 2.5, shape = 21) +\n labs(\n x = \"Confounder-Outcome Effect\",\n y = \"Scaled mean differences in\\n confounder between exposure groups\"\n )\n```\n\n::: {.cell-output-display}\n![A tipping point analysis under several confounding scenarios where the unmeasured confounder is a normally-distributed continuous variable. The line represents the strength of confounding necessary to tip the upper confidence interval of the causal effect estimate to 0. The x-axis represents the coefficient of the confounder-outcome relationship adjusted for the exposure and the set of measured confounders. The y-axis represents the scaled mean difference of the confounder between exposure groups.](chapter-02_files/figure-html/fig-tip-coef-net-1.png){#fig-tip-coef-net width=672}\n:::\n:::\n\n\nIf we had an unmeasured confounder where the standardized mean difference between exposure groups was 1, the confounder would need to decrease malaria risk by about -11.7.\nThat's pretty strong relative to other effects, but it may be feasible if we have an idea of something we might have missed.\nConversely, suppose the relationship between net use and the unmeasured confounder is very strong, with a mean scaled difference of 5.\nIn that case, the confounder-malaria relationship only needs to be 5.\nNow we have to consider: which of these scenarios are plausible given our domain knowledge and the effects we see in this analysis?\n\nNow let's consider a much more specific sensitivity analysis.\nSome ethnic groups, such as the Fulani, have a genetic resistance to malaria [@arama2015].\nLet's say that in our simulated data, an unnamed ethnic group in the unnamed country shares this genetic resistance to malaria.\nFor historical reasons, bed net use in this group is also very high.\nWe don't have this variable in `net_data`, but let's say we know from the literature that in this sample, we can estimate at:\n\n1. 
People with this genetic resistance have, on average, a lower malaria risk by about 10.\n2. About 26% of people who use nets in our study have this genetic resistance.\n3. About 5% of people who don't use nets have this genetic resistance.\n\nWith this amount of information, we can use tipr to adjust the estimates we calculated for the unmeasured confounder.\nWe'll use `adjust_coef_with_binary()` to calculate the adjusted estimates.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nadjusted_estimates <- boot_estimate |>\n select(.estimate, .lower, .upper) |>\n unlist() |>\n adjust_coef_with_binary(\n exposed_confounder_prev = 0.26,\n unexposed_confounder_prev = 0.05,\n confounder_outcome_effect = -10\n )\n```\n\n::: {.cell-output .cell-output-stderr}\n\n```\nThe observed effect (-12.55) is updated to -12.3 by a confounder with the following specifications:\n* estimated prevalence of the unmeasured confounder in the exposed population: 0.26\n* estimated prevalence of the unmeasured confounder in the unexposed population: 0.05\n* estimated relationship between the unmeasured confounder and the outcome: -10The observed effect (-13.37) is updated to -13.12 by a confounder with the following specifications:\n* estimated prevalence of the unmeasured confounder in the exposed population: 0.26\n* estimated prevalence of the unmeasured confounder in the unexposed population: 0.05\n* estimated relationship between the unmeasured confounder and the outcome: -10The observed effect (-11.72) is updated to -11.47 by a confounder with the following specifications:\n* estimated prevalence of the unmeasured confounder in the exposed population: 0.26\n* estimated prevalence of the unmeasured confounder in the unexposed population: 0.05\n* estimated relationship between the unmeasured confounder and the outcome: -10\n```\n\n\n:::\n\n```{.r .cell-code}\nadjusted_estimates\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n# A tibble: 3 × 5\n effect_adjusted effect_observed\n \n1 -12.3 -12.5\n2 -13.1 -13.4\n3 -11.5 -11.7\n# ℹ 3 more variables: exposed_confounder_prev ,\n# unexposed_confounder_prev ,\n# confounder_outcome_effect \n```\n\n\n:::\n:::\n\n\nThe adjusted estimate for a situation where genetic resistance to malaria is a confounder is -12.3 (95% CI -13.1, -11.5).\n\nIn fact, these data were simulated with just such a confounder.\nThe true effect of net use on malaria is about -10, and the true DAG that generated these data is:\n\n\n::: {.cell}\n::: {.cell-output-display}\n![The true causal diagram for `net_data`. This DAG is identical to the one we proposed with one addition: genetic resistance to malaria causally reduces the risk of malaria and impacts net use. It's thus a confounder and a part of the minimal adjustment set required to get an unbiased effect estimate. 
In otherwords, by not including it, we've calculated the wrong effect.](chapter-02_files/figure-html/fig-net-data-true-dag-1.png){#fig-net-data-true-dag width=672}\n:::\n:::\n\n\n\n\nThe unmeasured confounder in @fig-net-data-true-dag is available in the dataset `net_data_full` as `genetic_resistance`.\nIf we recalculate the IPW estimate of the average treatment effect of nets on malaria risk, we get -10.2 (95% CI -11.2, -9.3), much closer to the actual answer of -10.\n\nWhat do you think?\nIs this estimate reliable?\nDid we do a good job addressing the assumptions we need to make for a causal effect, mainly that there is no confounding?\nHow might you criticize this model, and what would you do differently?\nOk, we know that -10 is the correct answer because the data are simulated, but in practice, we can never be sure, so we need to continue probing our assumptions until we're confident they are robust. We'll explore these techniques and others in @sec-sensitivity.\n\n\nTo calculate this effect, we:\n\n1. Specified a causal question (for the average treatment effect)\n2. Drew our assumptions using a causal diagram (using DAGs)\n3. Modeled our assumptions (using propensity score weighting)\n4. Diagnosed our models (by checking confounder balance after weighting)\n5. Estimated the causal effect (using inverse probability weighting)\n6. Conducted sensitivity analysis on the effect estimate (using tipping point analysis)\n\nThroughout the rest of the book, we'll follow these broad steps in several examples from medicine, economics, and industry.\nWe'll dive more deeply into propensity score techniques, explore other methods for estimating causal effects, and, most importantly, make sure, over and over again, that the assumptions we're making are reasonable---even if we'll never know for sure.\n", + "markdown": "# The whole game: mosquito nets and malaria\n\n\n\n\n\nIn this chapter, we'll analyze data using techniques we learn in this book.\nWe'll play the [whole game](https://www.gse.harvard.edu/news/uk/09/01/education-bat-seven-principles-educators) of causal analysis using a few key steps:\n\n1. Specify a causal question\n2. Draw our assumptions using a causal diagram\n3. Model our assumptions\n4. Diagnose our models\n5. Estimate the causal effect\n6. Conduct sensitivity analysis on the effect estimate\n\nWe'll focus on the broader ideas behind each step and what they look like all together; however, we don't expect you to fully digest each idea. 
We'll spend the rest of the book taking up each step in detail.\n\n## Specify a causal question\n\nIn this guided exercise, we'll attempt to answer a causal question: does using a bed net reduce the risk of malaria?\n\nMalaria remains a serious public health issue.\nAdditionally, while malaria incidence has decreased since 2000, 2020 and the COVID-19 pandemic saw an increase in cases and deaths due primarily to service interruption [@worldma].\nAbout 86% of malaria deaths occurred in 29 countries.\nStill, nearly half of all malaria deaths occurred in just six countries: Nigeria (27%), the Democratic Republic of the Congo (12%), Uganda (5%), Mozambique (4%), Angola (3%), and Burkina Faso (3%).\nMost of these deaths occurred in children under 5 [@mosquito].\nMalaria also poses severe health risks to pregnant women and worsens birth outcomes, including early delivery and low birth weight.\n\nBed nets prevent morbidity and mortality due to malaria by providing a barrier against infective bites by the chief host of malaria parasites, the mosquito.\nHumans have used bed nets since ancient times.\nHerodotus, the 5th century BC Greek author of *The Histories*, observed Egyptians using their fishing nets as bed nets:\n\n> Against the gnats, which are very abundant, they have contrived as follows:---those who dwell above the fen-land are helped by the towers, to which they ascend when they go to rest; for the gnats by reason of the winds are not able to fly up high: but those who dwell in the fen-land have contrived another way instead of the towers, and this is it:---every man of them has got a casting net, with which by day he catches fish, but in the night he uses it for this purpose, that is to say he puts the casting-net round about the bed in which he sleeps, and then creeps in under it and goes to sleep: and the gnats, if he sleeps rolled up in a garment or a linen sheet, bite through these, but through the net they do not even attempt to bite [@thehist].\n\nMany modern nets are also treated with insecticide, dating back to Russian soldiers in World War II [@nevill1996], although some people still use them as fishing nets [@gettleman2015].\n\nIt's easy to imagine a randomized trial that deals with this question: participants in a study are randomly assigned to use a bed net, and we follow them over time to see if there is a difference in malaria risk between groups.\nRandomization is often the best way to estimate a causal effect of an intervention because it reduces the number of assumptions we need to make for that estimate to be valid (we will discuss these assumptions in @sec-assump).\nIn particular, randomization addresses confounding very well, accounting for confounders about which we may not even know.\n\nSeveral landmark trials have studied the effects of bed net use on malaria risk, with several essential studies in the 1990s.\nA 2004 meta-analysis found that insecticide-treated nets reduced childhood mortality by 17%, malarial parasite prevalence by 13%, and cases of uncomplicated and severe malaria by about 50% (compared to no nets) [@lengeler2004].\nSince the World Health Organization began recommending insecticide-treated nets, insecticide resistance has been a big concern.\nStill, a follow-up analysis of trials found that it has yet to impact the public health benefits of bed nets [@pryce2018].\n\nTrials have also been influential in determining the economics of bed net programs.\nFor instance, one trial compared free net distribution versus a cost-share program (where 
participants pay a subsidized fee for nets).\nThe study's authors found that net uptake was similar between the groups and that free net distribution --- because it was easier to access --- saved more lives, and was cheaper per life saved than the cost-sharing program [@cohen2010].\n\nThere are several reasons we might not be able to conduct a randomized trial, including ethics, cost, and time.\nWe have substantial, robust evidence in favor of bed net use.\nStill, let's consider some conditions where observational causal inference helps answer questions about bed nets and malaria prevention.\n\n- Imagine we are at a time before trials on this subject, and let's say people have started to use bed nets for this purpose on their own.\n Our goal may still be to conduct a randomized trial, but we can answer questions more quickly with observed data.\n In addition, this study's results might guide trials' design or intermediary policy suggestions.\n\n- Sometimes, it is also not ethical to conduct a trial.\n An example of this in malaria research is a question that arose in the study of bed net effectiveness: does malaria control in early childhood result in delayed immunity to the disease, resulting in severe malaria or death later in life?\n Since we now know bed net use is very effective, *withholding* nets would be unethical.\n A recent observational study found that the benefits of bed net use in childhood on all-cause mortality persist into adulthood [@mosquito].\n\n- We may also want to estimate a different effect or the effect for another population than in previous trials.\n For example, both randomized and observational studies helped us better understand that insecticide-based nets improve malaria resistance in the entire community, not just among those who use nets, so long as net usage is high enough [@howard2000; @hawley2003].\n\nAs we saw in @sec-causal-question and we'll see in @sec-g-comp, the causal inference techniques that we'll discuss in this book are often beneficial even when we're able to randomize.\n\nWhen we conduct an observational study, it's still helpful to think through the randomized trial we would run were it possible.\nThe trial we're trying to emulate in this causal analysis is the *target trial.* Considering the target trial helps us make our causal question more accurate.\nLet's consider the causal question posed earlier: does using a bed net (a mosquito net) reduce the risk of malaria?\nThis question is relatively straightforward, but it is still vague.\nIn conducting an analysis, we'll need to address several key questions:\n\n- What do we mean by \"bed net\"?\n There are several types of nets: untreated bed nets, insecticide-treated bed nets, and newer long-lasting insecticide-treated bed nets.\n\n- Risk compared to what?\n Are we, for instance, comparing insecticide-treated bed nets to *no* net?\n Untreated nets?\n Or are we comparing a new type of net, like long-lasting insecticide-treated bed nets, to nets that are already in use?\n\n- Risk as defined by what?\n Whether or not a person contracted malaria?\n Whether a person died of malaria?\n\n- Risk among whom?\n What is the population to which we're trying to apply this knowledge?\n Who is it practical to include in our study?\n Who might we need to exclude?\n\nWe will use simulated data to answer a more specific question: Does using insecticide-treated bed nets decrease the risk of contracting malaria?\nIn this particular data, [simulated by Dr. 
Andrew Heiss](https://evalsp21.classes.andrewheiss.com/example/matching-ipw/#program-background):\n\n> researchers are interested in whether using mosquito nets decreases an individual's risk of contracting malaria.\n> They have collected data from 1,752 households in an unnamed country and have variables related to environmental factors, individual health, and household characteristics.\n> The data is **not experimental**---researchers have no control over who uses mosquito nets, and individual households make their own choices over whether to apply for free nets or buy their own nets, as well as whether they use the nets if they have them.\n\nBecause we're using simulated data, we'll have direct access to a variable that measures the likelihood of contracting malaria, something we wouldn't likely have in real life.\nWe'll stick with this measure because we know the actual effect size.\nWe'll use simulated data, `net_data`, from the {[causalworkshop](https://github.com/r-causal/causalworkshop)} package, which includes ten variables:\n\n\n\n`id`\n\n: an ID variable\n\n`net` and `net_num`\n\n: a binary variable indicating if the participant used a net (1) or didn't use a net (0)\n\n`malaria_risk`\n\n: risk of malaria scale ranging from 0-100\n\n`income`\n\n: weekly income, measured in dollars\n\n`health`\n\n: a health score scale ranging from 0--100\n\n`household`\n\n: number of people living in the household\n\n`eligible`\n\n: a binary variable indicating if the household is eligible for the free net program.\n\n`temperature`\n\n: the average temperature at night, in Celsius\n\n`resistance`\n\n: Insecticide resistance of local mosquitoes.\n A scale of 0--100, with higher values indicating higher resistance.\n\nThe distribution of malaria risk appears to be quite different by net usage.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(tidyverse)\nlibrary(causalworkshop)\nnet_data |>\n ggplot(aes(malaria_risk, fill = net)) +\n geom_density(color = NA, alpha = .8)\n```\n\n::: {.cell-output-display}\n![A density plot of malaria risk for those who did and did not use nets. The risk of malaria is lower for those who use nets.](chapter-02_files/figure-html/fig-malaria-risk-density-1.png){#fig-malaria-risk-density width=672}\n:::\n:::\n\n::: {.cell}\n\n:::\n\n\nIn @fig-malaria-risk-density, the density of those who used nets is to the left of those who did not use nets.\nThe mean difference in malaria risk is about 16.4, suggesting net use might be protective against malaria.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nnet_data |>\n group_by(net) |>\n summarize(malaria_risk = mean(malaria_risk))\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n# A tibble: 2 × 2\n net malaria_risk\n \n1 FALSE 43.9\n2 TRUE 27.5\n```\n\n\n:::\n:::\n\n\nAnd that's what we see with simple linear regression, as well, as we would expect.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(broom)\nnet_data |>\n lm(malaria_risk ~ net, data = _) |>\n tidy()\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n# A tibble: 2 × 5\n term estimate std.error statistic p.value\n \n1 (Intercept) 43.9 0.377 116. 
0 \n2 netTRUE -16.4 0.741 -22.1 1.10e-95\n```\n\n\n:::\n:::\n\n\n## Draw our assumptions using a causal diagram\n\nThe problem that we face is that other factors may be responsible for the effect we're seeing.\nIn this example, we'll focus on confounding: a common cause of net usage and malaria will bias the effect we see unless we account for it somehow.\nOne of the best ways to determine which variables we need to account for is to use a causal diagram.\nThese diagrams, also called causal directed acyclic graphs (DAGs), visualize the assumptions that we're making about the causal relationships between the exposure, outcome, and other variables we think might be related.\n\nHere's the DAG that we're proposing for this question.\n\n\n::: {.cell}\n::: {.cell-output-display}\n![A proposed causal diagram of the effect of bed net use on malaria. This directed acyclic graph (DAG) states our assumption that bed net use causes a reduction in malaria risk. It also says that we assume: malaria risk is impacted by net usage, income, health, temperature, and insecticide resistance; net usage is impacted by income, health, temperature, eligibility for the free net program, and the number of people in a household; eligibility for the free net programs is impacted by income and the number of people in a household; and health is impacted by income.](chapter-02_files/figure-html/fig-net-data-dag-1.png){#fig-net-data-dag width=672}\n:::\n:::\n\n\n(We'll explore how to create and analyze DAGs in R in @sec-dags).\n\nIn DAGs, each point represents a variable, and each arrow represents a cause.\nIn other words, this diagram declares what we think the causal relationships are between these variables.\nIn @fig-net-data-dag, we're saying that we believe:\n\n- Malaria risk is causally impacted by net usage, income, health, temperature, and insecticide resistance.\n- Net usage is causally impacted by income, health, temperature, eligibility for the free net program, and the number of people in a household.\n- Eligibility for the free net programs is determined by income and the number of people in a household.\n- Health is causally impacted by income.\n\nYou may agree or disagree with some of these assertions.\nThat's a good thing!\nLaying bare our assumptions allows us to consider the scientific credibility of our analysis.\nAnother benefit of using DAGs is that, thanks to their mathematics, we can determine precisely the subset of variables we need to account for if we assume this DAG is correct.\n\n::: callout-tip\n## Assembling DAGs\n\nIn this exercise, we're providing you with a reasonable DAG based on knowledge of how the data were generated.\nIn real life, setting up a DAG is a challenge requiring deep thought, domain expertise, and (often) collaboration between several experts.\n:::\n\nThe chief problem we're dealing with is that, when we analyze the data we're working with, we see the impact of net usage on malaria risk *and of all these other relationships*.\nIn DAG terminology, we have more than one open causal pathway.\nIf this DAG is correct, we have *eight* causal pathways: the path between net usage and malaria risk and seven other *confounding* pathways.\n\n\n::: {.cell}\n::: {.cell-output-display}\n![In the proposed DAG, there are eight open pathways that contribute to the causal effect seen in the naive regression: the true effect (in green) of net usage on malaria risk and seven other confounding pathways (in orange). 
The naive estimate is wrong because it is a composite of all these effects.](chapter-02_files/figure-html/fig-net-data-confounding-1.png){#fig-net-data-confounding width=1344}\n:::\n:::\n\n\nWhen we calculate a naive linear regression that only includes net usage and malaria risk, the effect we see is incorrect because the seven other confounding pathways in @fig-net-data-confounding distort it.\nIn DAG terminology, we need to *block* these open pathways that distort the causal estimate we're after.\n(We can block paths through several techniques, including stratification, matching, weighting, and more. We'll see several methods throughout the book.) Luckily, by specifying a DAG, we can precisely determine the variables we need to control for.\nFor this DAG, we need to control for three variables: health, income, and temperature.\nThese three variables are a *minimal adjustment set*, the minimum set (or sets) of variables you need to block all confounding pathways.\nWe'll discuss adjustment sets further in @sec-dags.\n\n## Model our assumptions\n\nWe'll use a technique called Inverse Probability Weighting (IPW) to control for these variables, which we'll discuss in detail in @sec-using-ps.\nWe'll use logistic regression to predict the probability of treatment---the propensity score.\nThen, we'll calculate inverse probability weights to apply to the linear regression model we fit above.\nThe propensity score model includes the exposure---net use---as the dependent variable and the minimal adjustment set as the independent variables.\n\n::: callout-tip\n## Modeling the functional form\n\nGenerally speaking, we want to lean on domain expertise and good modeling practices to fit the propensity score model.\nFor instance, we may want to allow continuous confounders to be non-linear using splines, or we may want to add essential interactions between confounders.\nBecause these are simulated data, we know we don't need these extra parameters (so we'll skip them), but in practice, you often do.\nWe'll discuss this more in @sec-using-ps.\n:::\n\nThe propensity score model is a logistic regression model with the formula `net ~ income + health + temperature`, which predicts the probability of bed net usage based on the confounders income, health, and temperature.\n\n\n::: {.cell}\n\n```{.r .cell-code}\npropensity_model <- glm(\n net ~ income + health + temperature,\n data = net_data,\n family = binomial()\n)\n\n# the first six propensity scores\nhead(predict(propensity_model, type = \"response\"))\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n 1 2 3 4 5 6 \n0.2464 0.2178 0.3230 0.2307 0.2789 0.3060 \n```\n\n\n:::\n:::\n\n\nWe can use propensity scores to control for confounding in various ways.\nIn this example, we'll focus on weighting.\nIn particular, we'll compute the inverse probability weight for the *average treatment effect* (ATE).\nThe ATE represents a particular causal question: what if *everyone* in the study used bed nets vs. 
what if *no one* in the study used bed nets?\n\nTo calculate the ATE, we'll use the broom and propensity packages.\nbroom's `augment()` function extracts prediction-related information from the model and joins it to the data.\npropensity's `wt_ate()` function calculates the inverse probability weight given the propensity score and exposure.\n\nFor inverse probability weighting, the ATE weight is the inverse of the probability of receiving the treatment you actually received.\nIn other words, if you used a bed net, the ATE weight is 1 divided by the probability that you used a net, and if you did *not* use a net, it is 1 divided by the probability that you did not use a net.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(broom)\nlibrary(propensity)\nnet_data_wts <- propensity_model |>\n augment(newdata = net_data, type.predict = \"response\") |>\n # .fitted is the value predicted by the model\n # for a given observation\n mutate(wts = wt_ate(.fitted, net))\n\nnet_data_wts |>\n select(net, .fitted, wts) |>\n head()\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n# A tibble: 6 × 3\n net .fitted wts\n <lgl> <dbl> <dbl>\n1 FALSE 0.246 1.33\n2 FALSE 0.218 1.28\n3 FALSE 0.323 1.48\n4 FALSE 0.231 1.30\n5 FALSE 0.279 1.39\n6 FALSE 0.306 1.44\n```\n\n\n:::\n:::\n\n\n`wts` represents the amount each observation will be up-weighted or down-weighted in the outcome model we will soon fit.\nFor instance, the first household did *not* use a bed net, and the model predicted a probability of 0.25 that they would.\nThe probability of what they actually did (not using a net) is therefore 0.75, so their weight is 1 / 0.75, or about 1.33.\nThe second household also did *not* use a bed net; their predicted probability of net use was 0.22 (or put differently, a predicted probability of *not* using a net of 0.78), so their weight is slightly smaller at 1.28.\nThe more surprising an observation's exposure is given its confounders, the more it is up-weighted relative to the naive linear model we fit above.\n
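\nTo make the weighting rule concrete, here is a minimal by-hand sketch of the ATE weight for a binary exposure; this is our own illustration (the name `wts_by_hand` is hypothetical, not part of the analysis), and `wt_ate()` handles this and other cases for us.\n\n``` r\n# a sketch: the ATE weight is 1 / P(the exposure you actually received)\nnet_data_wts_by_hand <- propensity_model |>\n augment(newdata = net_data, type.predict = \"response\") |>\n mutate(\n # 1 / .fitted for net users, 1 / (1 - .fitted) for non-users\n wts_by_hand = if_else(net, 1 / .fitted, 1 / (1 - .fitted))\n )\n```\n\nFor the first household, `net` is `FALSE` and `.fitted` is about 0.25, so this works out to 1 / (1 - 0.25), matching the 1.33 that `wt_ate()` gave us above.\n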
\n## Diagnose our models\n\nThe goal of propensity score weighting is to weight the population of observations such that the distribution of confounders is balanced between the exposure groups.\nPut another way, we are, in principle, removing the associational arrows between confounders and exposure in the DAG, so that the confounding paths no longer affect our estimates.\nHere's the distribution of the propensity score by group, created by `geom_mirror_histogram()` from the halfmoon package for assessing balance in propensity score models (as well as visualizing the pseudo-population the weights simulate):\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(halfmoon)\nggplot(net_data_wts, aes(.fitted)) +\n geom_mirror_histogram(\n aes(fill = net),\n bins = 50\n ) +\n scale_y_continuous(labels = abs) +\n labs(x = \"propensity score\")\n```\n\n::: {.cell-output-display}\n![A mirrored histogram of the propensity scores of those who used nets (top, blue) versus those who did not use nets (bottom, orange). The range of propensity scores is similar between groups, with those who used nets slightly to the left of those who didn't, but the shapes of the distribution are different.](chapter-02_files/figure-html/fig-mirror-histogram-net-data-unweighted-1.png){#fig-mirror-histogram-net-data-unweighted width=672}\n:::\n:::\n\n\nThe weighted propensity score creates a pseudo-population where the distributions are much more similar:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nggplot(net_data_wts, aes(.fitted)) +\n geom_mirror_histogram(\n aes(group = net),\n bins = 50\n ) +\n geom_mirror_histogram(\n aes(fill = net, weight = wts),\n bins = 50,\n alpha = .5\n ) +\n scale_y_continuous(labels = abs) +\n labs(x = \"propensity score\")\n```\n\n::: {.cell-output-display}\n![A mirrored histogram of the propensity scores of those who used nets (top, blue) versus those who did not use nets (bottom, orange). The shaded region represents the unweighted distribution, and the colored region represents the weighted distributions. The ATE weights up-weight the groups to be similar in range and shape of the distribution of propensity scores.](chapter-02_files/figure-html/fig-mirror-histogram-net-data-weighted-1.png){#fig-mirror-histogram-net-data-weighted width=672}\n:::\n:::\n\n\nIn this example, the unweighted distributions are not awful---the shapes are fairly similar here---but the weighted distributions in @fig-mirror-histogram-net-data-weighted are much more similar.\n\n::: callout-caution\n## Unmeasured confounding\n\nPropensity score weighting and most other causal inference techniques only help with *observed* confounders---ones that we model correctly, at that.\nUnfortunately, we still may have unmeasured confounding, which we'll discuss below.\n\nRandomization is one causal inference technique that *does* deal with unmeasured confounding, one of the reasons it is so powerful.\n:::\n\nWe might also want to know how well-balanced the groups are by each confounder.\nOne way to do this is to calculate the standardized mean differences (SMDs) for each confounder with and without weights.\nWe'll calculate the SMDs with `tidy_smd()`, then plot them with `geom_love()`.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nplot_df <- tidy_smd(\n net_data_wts,\n c(income, health, temperature),\n .group = net,\n .wts = wts\n)\n\nggplot(\n plot_df,\n aes(\n x = abs(smd),\n y = variable,\n group = method,\n color = method\n )\n) +\n geom_love()\n```\n\n::: {.cell-output-display}\n![A love plot representing the standardized mean differences (SMD) between exposure groups of three confounders: temperature, income, and health. Before weighting, there are considerable differences in the groups. 
After weighting, the confounders are much more balanced between groups.](chapter-02_files/figure-html/fig-love-plot-net-data-1.png){#fig-love-plot-net-data width=672}\n:::\n:::\n\n\nA standard guideline is that balanced confounders should have an SMD of less than 0.1 on the absolute scale.\n0.1 is just a rule of thumb, but if we follow it, the variables in @fig-love-plot-net-data are well-balanced after weighting (and unbalanced before weighting).\n\nBefore we apply the weights to the outcome model, let's check their overall distribution for extreme weights.\nExtreme weights can destabilize the estimate and variance in the outcome model, so we want to be aware of it.\nWe'll also discuss several other types of weights that are less prone to this issue in @sec-estimands.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nnet_data_wts |>\n ggplot(aes(wts)) +\n geom_density(fill = \"#CC79A7\", color = NA, alpha = 0.8)\n```\n\n::: {.cell-output-display}\n![A density plot of the average treatment effect (ATE) weights. The plot is skewed, with higher values towards 8. This may indicate a problem with the model, but the weights aren't so extreme to destabilize the variance of the estimate.](chapter-02_files/figure-html/fig-ate-density-net-data-1.png){#fig-ate-density-net-data width=672}\n:::\n:::\n\n\nThe weights in @fig-ate-density-net-data are skewed, but there are no outrageous values.\nIf we saw extreme weights, we might try trimming or stabilizing them, or consider calculating an effect for a different estimand, which we'll discuss in @sec-estimands.\nIt doesn't look like we need to do that here, however.\n\n## Estimate the causal effect\n\nWe're now ready to use the ATE weights to (attempt to) account for confounding in the naive linear regression model.\nFitting such a model is pleasantly simple in this case: we fit the same model as before but with `weights = wts`, which will incorporate the inverse probability weights.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nnet_data_wts |>\n lm(malaria_risk ~ net, data = _, weights = wts) |>\n tidy(conf.int = TRUE)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n# A tibble: 2 × 7\n term estimate std.error statistic p.value conf.low\n \n1 (Inte… 42.7 0.442 96.7 0 41.9\n2 netTR… -12.5 0.624 -20.1 5.50e-81 -13.8\n# ℹ 1 more variable: conf.high \n```\n\n\n:::\n:::\n\n\n\n\nThe estimate for the average treatment effect is -12.5 (95% CI -13.8, -11.3).\nUnfortunately, the confidence intervals we're using are wrong because they don't account for the dependence within the weights!\nGenerally, confidence intervals for propensity score weighted models will be too narrow unless we correct for this dependence.\nThe nominal coverage of the confidence intervals will thus be wrong (they aren't 95% CIs because their coverage is much lower than 95%) and may lead to misinterpretation.\n\nWe've got several ways to address this problem, which we'll discuss in detail in @sec-outcome-model, including the bootstrap, robust standard errors, and manually accounting for the dependence with empirical sandwich estimators.\nFor this example, we'll use the bootstrap, a flexible tool that calculates distributions of parameters using re-sampling.\nWe'll use the rsample package from the tidymodels ecosystem to work with bootstrap samples.\n\n::: callout-note\n## The Bootstrap\n\nThe bootstrap is a simple but flexible algorithm for calculating statistics using re-sampling with replacement.\nIt's handy when a closed-form solution doesn't exist to calculate something, as is commonly the case in 
causal inference (particularly for standard errors), and when we suspect the parametric calculations are not valid for a given situation.\n\nBootstrapping in R has a long tradition of writing functions to calculate the statistic of interest, starting with the classic boot package.\nThroughout the book, we'll use rsample, a more modern alternative for re-sampling, but generally, we start with writing a function to calculate the estimate we're interested in.\n\nLet's say we want to calculate `some_statistic()` for `this_data`.\nTo bootstrap for *R* samples, we:\n\n1. Re-sample `this_data` with replacement.\n The same row may appear multiple (or no) times in a given bootstrap sample, simulating the sampling process in the underlying population.\n\n ``` r\n indices <- sample(\n # create a vector of indices:\n # 1 through the number of rows\n seq_len(nrow(this_data)), \n # sample a vector indices \n # that's the same length as `this_data`\n size = nrow(this_data), \n replace = TRUE\n )\n bootstrap_sample <- this_data[indices, ]\n ```\n\n2. Fit `some_statistic()` on the `bootstrap_sample`\n\n ``` r\n estimate <- some_statistic(bootstrap_sample)\n ```\n\n3. Repeat *R* times\n\nWe then end up with a distribution of `estimate`s, with which we can calculate population statistics, such as point estimates, standard errors, and confidence intervals.\n:::\n\nBecause the bootstrap is so flexible, we need to think carefully about the sources of uncertainty in the statistic we're calculating.\nIt might be tempting to write a function like this to fit the statistic we're interested in (the point estimate for `netTRUE`):\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(rsample)\n\nfit_ipw_not_quite_rightly <- function(split, ...) {\n # get bootstrapped data sample with `rsample::analysis()`\n .df <- analysis(split)\n\n # fit ipw model\n lm(malaria_risk ~ net, data = .df, weights = wts) |>\n tidy()\n}\n```\n:::\n\n\nHowever, this function won't give us the correct confidence intervals because it treats the inverse probability weights as fixed values.\nThey're not, of course; we just estimated them using logistic regression!\nWe need to account for this uncertainty by bootstrapping the *entire modeling process*.\nFor every bootstrap sample, we need to fit the propensity score model, calculate the inverse probability weights, then fit the weighted outcome model.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(rsample)\n\nfit_ipw <- function(split, ...) 
{\n # get bootstrapped data sample with `rsample::analysis()`\n .df <- analysis(split)\n\n # fit propensity score model\n propensity_model <- glm(\n net ~ income + health + temperature,\n data = .df,\n family = binomial()\n )\n\n # calculate inverse probability weights\n .df <- propensity_model |>\n augment(type.predict = \"response\", data = .df) |>\n mutate(wts = wt_ate(.fitted, net))\n\n # fit correctly bootstrapped ipw model\n lm(malaria_risk ~ net, data = .df, weights = wts) |>\n tidy()\n}\n```\n:::\n\n\nNow that we know precisely how to calculate the estimate for each iteration let's create the bootstrapped dataset with rsample's `bootstraps()` function.\nThe `times` argument determines how many bootstrapped datasets to create; we'll do 1,000.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nbootstrapped_net_data <- bootstraps(\n net_data,\n times = 1000,\n # required to calculate CIs later\n apparent = TRUE\n)\n\nbootstrapped_net_data\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n# Bootstrap sampling with apparent sample \n# A tibble: 1,001 × 2\n splits id \n \n 1 Bootstrap0001\n 2 Bootstrap0002\n 3 Bootstrap0003\n 4 Bootstrap0004\n 5 Bootstrap0005\n 6 Bootstrap0006\n 7 Bootstrap0007\n 8 Bootstrap0008\n 9 Bootstrap0009\n10 Bootstrap0010\n# ℹ 991 more rows\n```\n\n\n:::\n:::\n\n\nThe result is a nested data frame: each `splits` object contains metadata that rsample uses to subset the bootstrap samples for each of the 1,000 samples.\nNext, we'll run `fit_ipw()` 1,000 times to create a distribution for `estimate`.\nAt its heart, the calculation we're doing is\n\n``` r\nfit_ipw(bootstrapped_net_data$splits[[n]])\n```\n\nWhere *n* is one of 1,000.\nWe'll use purrr's `map()` function to iterate across each `split` object.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nipw_results <- bootstrapped_net_data |>\n mutate(boot_fits = map(splits, fit_ipw))\n\nipw_results\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n# Bootstrap sampling with apparent sample \n# A tibble: 1,001 × 3\n splits id boot_fits \n \n 1 Bootstrap0001 \n 2 Bootstrap0002 \n 3 Bootstrap0003 \n 4 Bootstrap0004 \n 5 Bootstrap0005 \n 6 Bootstrap0006 \n 7 Bootstrap0007 \n 8 Bootstrap0008 \n 9 Bootstrap0009 \n10 Bootstrap0010 \n# ℹ 991 more rows\n```\n\n\n:::\n:::\n\n\nThe result is another nested data frame with a new column, `boot_fits`.\nEach element of `boot_fits` is the result of the IPW for the bootstrapped dataset.\nFor example, in the first bootstrapped data set, the IPW results were:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nipw_results$boot_fits[[1]]\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n# A tibble: 2 × 5\n term estimate std.error statistic p.value\n \n1 (Intercept) 42.0 0.426 98.5 0 \n2 netTRUE -12.5 0.601 -20.9 1.99e-86\n```\n\n\n:::\n:::\n\n\nNow we have a distribution of estimates:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nipw_results |>\n mutate(\n estimate = map_dbl(\n boot_fits,\n # pull the `estimate` for `netTRUE` for each fit\n \\(.fit) .fit |>\n filter(term == \"netTRUE\") |>\n pull(estimate)\n )\n ) |>\n ggplot(aes(estimate)) +\n geom_histogram(fill = \"#D55E00FF\", color = \"white\", alpha = 0.8)\n```\n\n::: {.cell-output-display}\n![\"A histogram of 1,000 bootstrapped estimates of the effect of net use on malaria risk. 
The spread of these estimates accounts for the dependency and uncertainty in the use of IPW weights.\"](chapter-02_files/figure-html/fig-bootstrap-estimates-net-data-1.png){#fig-bootstrap-estimates-net-data width=672}\n:::\n:::\n\n\n@fig-bootstrap-estimates-net-data gives a sense of the variation in `estimate`, but let's calculate 95% confidence intervals from the bootstrapped distribution using rsample's `int_t()` :\n\n\n::: {.cell}\n\n```{.r .cell-code}\nboot_estimate <- ipw_results |>\n # calculate T-statistic-based CIs\n int_t(boot_fits) |>\n filter(term == \"netTRUE\")\n\nboot_estimate\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n# A tibble: 1 × 6\n term .lower .estimate .upper .alpha .method \n \n1 netTRUE -13.4 -12.5 -11.7 0.05 student-t\n```\n\n\n:::\n:::\n\n\nNow we have a confounder-adjusted estimate with correct standard errors.\nThe estimate of the effect of *all* households using bed nets versus *no* households using bed nets on malaria risk is -12.5 (95% CI -13.4, -11.7).\nBed nets do indeed seem to reduce malaria risk in this study.\n\n## Conduct sensitivity analysis on the effect estimate\n\nWe've laid out a roadmap for taking observational data, thinking critically about the causal question we want to ask, identifying the assumptions we need to get there, then applying those assumptions to a statistical model.\nGetting the correct answer to the causal question relies on getting our assumptions more or less right.\nBut what if we're more on the less correct side?\n\nSpoiler alert: the answer we just calculated is **wrong**.\nAfter all that effort!\n\nWhen conducting a causal analysis, it's a good idea to use sensitivity analyses to test your assumptions.\nThere are many potential sources of bias in any study and many sensitivity analyses to go along with them; we'll focus on the assumption of no confounding.\n\nLet's start with a broad sensitivity analysis; then, we'll ask questions about specific unmeasured confounders.\nWhen we have less information about unmeasured confounders, we can use tipping point analysis to ask how much confounding it would take to tip my estimate to the null.\nIn other words, what would the strength of the unmeasured confounder have to be to explain our results away?\nThe tipr package is a toolkit for conducting sensitivity analyses.\nLet's examine the tipping point for an unknown, normally-distributed confounder.\nThe `tip_coef()` function takes an estimate (a beta coefficient from a regression model, or the upper or lower bound of the coefficient).\nIt further requires either the 1) scaled differences in means of the confounder between exposure groups or 2) effect of the confounder on the outcome.\nFor the estimate, we'll use `conf.high`, which is closer to 0 (the null), and ask: how much would the confounder have to affect malaria risk to have an unbiased upper confidence interval of 0?\nWe'll use tipr to calculate this answer for 5 scenarios, where the mean difference in the confounder between exposure groups is 1, 2, 3, 4, or 5.\n\n\n::: {.cell}\n\n:::\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(tipr)\ntipping_points <- tip_coef(boot_estimate$.upper, exposure_confounder_effect = 1:5)\n\ntipping_points |>\n ggplot(aes(confounder_outcome_effect, exposure_confounder_effect)) +\n geom_line(color = \"#009E73\", linewidth = 1.1) +\n geom_point(fill = \"#009E73\", color = \"white\", size = 2.5, shape = 21) +\n labs(\n x = \"Confounder-Outcome Effect\",\n y = \"Scaled mean differences in\\n confounder between exposure groups\"\n 
)\n```\n\n::: {.cell-output-display}\n![A tipping point analysis under several confounding scenarios where the unmeasured confounder is a normally-distributed continuous variable. The line represents the strength of confounding necessary to tip the upper confidence interval of the causal effect estimate to 0. The x-axis represents the coefficient of the confounder-outcome relationship adjusted for the exposure and the set of measured confounders. The y-axis represents the scaled mean difference of the confounder between exposure groups.](chapter-02_files/figure-html/fig-tip-coef-net-1.png){#fig-tip-coef-net width=672}\n:::\n:::\n\n\nIf we had an unmeasured confounder where the standardized mean difference between exposure groups was 1, the confounder would need to decrease malaria risk by about 11.7.\nThat's pretty strong relative to other effects, but it may be feasible if we have an idea of something we might have missed.\nConversely, suppose the relationship between net use and the unmeasured confounder is very strong, with a mean scaled difference of 5.\nIn that case, the confounder-malaria relationship only needs to be about -2.3.\nNow we have to consider: which of these scenarios are plausible given our domain knowledge and the effects we see in this analysis?\n\nNow let's consider a much more specific sensitivity analysis.\nSome ethnic groups, such as the Fulani, have a genetic resistance to malaria [@arama2015].\nLet's say that in our simulated data, an unnamed ethnic group shares this genetic resistance to malaria.\nFor historical reasons, bed net use in this fictional group is also very high.\nWe don't have this variable in `net_data`, but let's say we know from the literature that in this sample, we can estimate that:\n\n1. People with this genetic resistance have, on average, a malaria risk about 10 points lower.\n2. About 26% of people who use nets in our study have this genetic resistance.\n3. About 5% of people who don't use nets have this genetic resistance.\n\nWith this amount of information, we can use tipr to adjust the estimates we calculated for the unmeasured confounder.\nWe'll use `adjust_coef_with_binary()` to calculate the adjusted estimates.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nadjusted_estimates <- boot_estimate |>\n select(.estimate, .lower, .upper) |>\n unlist() |>\n adjust_coef_with_binary(\n exposed_confounder_prev = 0.26,\n unexposed_confounder_prev = 0.05,\n confounder_outcome_effect = -10\n )\n\nadjusted_estimates\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n# A tibble: 3 × 4\n effect_adjusted effect_observed\n <dbl> <dbl>\n1 -10.4 -12.5\n2 -11.3 -13.4\n3 -9.59 -11.7\n# ℹ 2 more variables:\n# exposure_confounder_effect <dbl>,\n# confounder_outcome_effect <dbl>\n```\n\n\n:::\n:::\n\n\nThe adjusted estimate for a situation where genetic resistance to malaria is a confounder is -10.4 (95% CI -11.3, -9.6).\n\nIn fact, these data were simulated with just such a confounder.\nThe true effect of net use on malaria is about -10, and the true DAG that generated these data is:\n\n\n::: {.cell}\n::: {.cell-output-display}\n![The true causal diagram for `net_data`. This DAG is identical to the one we proposed with one addition: genetic resistance to malaria causally reduces the risk of malaria and impacts net use. It's thus a confounder and a part of the minimal adjustment set required to get an unbiased effect estimate. In other words, by not including it, we've calculated the wrong effect.](chapter-02_files/figure-html/fig-net-data-true-dag-1.png){#fig-net-data-true-dag width=672}\n:::\n:::\n\n\n\n\nThe unmeasured confounder in @fig-net-data-true-dag is available in the dataset `net_data_full` as `genetic_resistance`.\nIf we recalculate the IPW estimate of the average treatment effect of nets on malaria risk, we get -10.2 (95% CI -11.2, -9.4), much closer to the actual answer of -10.\n
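\nThe recalculation follows the same workflow as `fit_ipw()` above. Here is a minimal sketch of the re-fit, assuming `net_data_full` shares the columns of `net_data` plus `genetic_resistance` (the object names here are ours, for illustration only):\n\n``` r\n# re-fit the propensity score model, now including the\n# previously unmeasured confounder\npropensity_model_full <- glm(\n net ~ income + health + temperature + genetic_resistance,\n data = net_data_full,\n family = binomial()\n)\n\n# recalculate the ATE weights and re-fit the weighted outcome model\npropensity_model_full |>\n augment(newdata = net_data_full, type.predict = \"response\") |>\n mutate(wts_full = wt_ate(.fitted, net)) |>\n lm(malaria_risk ~ net, data = _, weights = wts_full) |>\n tidy()\n```\n\nThe point estimate from this sketch should be close to the -10.2 above; for the confidence interval, we would bootstrap the entire procedure, just as we did with `fit_ipw()`.\n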
\nWhat do you think?\nIs this estimate reliable?\nDid we do a good job addressing the assumptions we need to make for a causal effect, mainly that there is no confounding?\nHow might you criticize this model, and what would you do differently?\nOK, we know that -10 is the correct answer because the data are simulated, but in practice, we can never be sure, so we need to continue probing our assumptions until we're confident they are robust.\n\n\nTo calculate this effect, we:\n\n1. Specified a causal question (for the average treatment effect)\n2. Drew our assumptions using a causal diagram (using DAGs)\n3. Modeled our assumptions (using propensity score weighting)\n4. Diagnosed our models (by checking confounder balance after weighting)\n5. Estimated the causal effect (using inverse probability weighting)\n6. Conducted sensitivity analysis on the effect estimate (using tipping point analysis)\n\nThroughout the rest of the book, we'll follow these broad steps in several examples from medicine, economics, and industry.\nWe'll dive more deeply into propensity score techniques, explore alternative methods for calculating causal effects, and, most importantly, make sure, over and over again, that the assumptions we're making are reasonable---even if we'll never know for sure.\n", "supporting": [ "chapter-02_files" ], diff --git a/_freeze/chapters/chapter-02/figure-html/fig-ate-density-net-data-1.png b/_freeze/chapters/chapter-02/figure-html/fig-ate-density-net-data-1.png index 73e8031..cd2b73e 100644 Binary files a/_freeze/chapters/chapter-02/figure-html/fig-ate-density-net-data-1.png and b/_freeze/chapters/chapter-02/figure-html/fig-ate-density-net-data-1.png differ diff --git a/_freeze/chapters/chapter-02/figure-html/fig-bootstrap-estimates-net-data-1.png b/_freeze/chapters/chapter-02/figure-html/fig-bootstrap-estimates-net-data-1.png index a619446..c4e55f5 100644 Binary files a/_freeze/chapters/chapter-02/figure-html/fig-bootstrap-estimates-net-data-1.png and b/_freeze/chapters/chapter-02/figure-html/fig-bootstrap-estimates-net-data-1.png differ diff --git a/_freeze/chapters/chapter-02/figure-html/fig-malaria-risk-density-1.png b/_freeze/chapters/chapter-02/figure-html/fig-malaria-risk-density-1.png index ed0bd3d..98261fd 100644 Binary files a/_freeze/chapters/chapter-02/figure-html/fig-malaria-risk-density-1.png and b/_freeze/chapters/chapter-02/figure-html/fig-malaria-risk-density-1.png differ diff --git a/_freeze/chapters/chapter-02/figure-html/fig-mirror-histogram-net-data-weighted-1.png b/_freeze/chapters/chapter-02/figure-html/fig-mirror-histogram-net-data-weighted-1.png index 3cd7fe2..a28d3ef 100644 Binary files a/_freeze/chapters/chapter-02/figure-html/fig-mirror-histogram-net-data-weighted-1.png and b/_freeze/chapters/chapter-02/figure-html/fig-mirror-histogram-net-data-weighted-1.png differ diff --git a/_freeze/chapters/chapter-02/figure-html/fig-net-data-confounding-1.png b/_freeze/chapters/chapter-02/figure-html/fig-net-data-confounding-1.png index cfedf59..9fd002d 100644 Binary files 
a/_freeze/chapters/chapter-02/figure-html/fig-net-data-confounding-1.png and b/_freeze/chapters/chapter-02/figure-html/fig-net-data-confounding-1.png differ diff --git a/_freeze/chapters/chapter-02/figure-html/fig-tip-coef-net-1.png b/_freeze/chapters/chapter-02/figure-html/fig-tip-coef-net-1.png index 5817e69..d60e6dd 100644 Binary files a/_freeze/chapters/chapter-02/figure-html/fig-tip-coef-net-1.png and b/_freeze/chapters/chapter-02/figure-html/fig-tip-coef-net-1.png differ diff --git a/_freeze/chapters/chapter-03/execute-results/html.json b/_freeze/chapters/chapter-03/execute-results/html.json index 57e9c9d..038d92d 100644 --- a/_freeze/chapters/chapter-03/execute-results/html.json +++ b/_freeze/chapters/chapter-03/execute-results/html.json @@ -1,8 +1,10 @@ { - "hash": "4e09aee172d6a4be4bd0de6a0280898e", + "hash": "300f61afecef2057d3d6fba3f50b3bd3", "result": { - "markdown": "# Estimating counterfactuals {#sec-counterfactuals}\n\n\n\n\n\n## Potential Outcomes {#sec-potential}\n\nLet's begin by thinking about the philosophical concept of a *potential outcome.* Prior to some \"cause\" occurring, for example receiving some exposure, the *potential outcomes* are all of the potential things that could occur depending on what you are exposed to.\nFor simplicity, let's assume an exposure has two levels:\n\n- $X=1$ if you are exposed\n\n- $X=0$ if you are not exposed\n\nUnder this simple scenario, there are two potential outcomes:\n\n- $Y(1)$ the potential outcome if you are exposed\n\n- $Y(0)$ the potential outcome if you are not exposed\n\nOnly *one* of these potential outcomes will actually be realized, the one corresponding to the exposure that actually occurred, and therefore only one is observable.\nIt is important to remember that these exposures are defined at a particular instance in time, so only one can happen to any individual.\nIn the case of a binary exposure, this leaves one potential outcome as *observable* and one *missing.* In fact, early causal inference methods were often framed as missing data problems; we need to make certain assumptions about the *missing counterfactuals*, the value of the potential outcome corresponding to the exposure(s) that did not occur.\n\nOur causal effect of interest is often some difference in potential outcomes $Y(1) - Y(0)$, averaged over a particular population.\n\n## Counterfactuals\n\nConceptually, the missing counterfactual outcome is one that would have occurred under a different set of circumstances.\nIn causal inference, we *wish* we could observe the conterfactual outcome that would have occurred in an alternate universe where the exposure status for a given observation was flipped.\nTo do this, we attempt to control for all factors that are related to an exposure and outcome such that we can *construct* (or estimate) such a counterfactual outcome.\n\nLet's think about a specific example.\nIce-T, best known as an American rapper and Fin on Law and Order: SVU, co-authored a book titled \"Split Decision: Life Stories\", published in 2022.\nHere is the synopsis:\n\n> **Award-winning actor, rapper, and producer Ice-T unveils a compelling memoir of his early life robbing jewelry stores until he found fame and fortune---while a handful of bad choices sent his former crime partner down an incredibly different path.**\\\n> \\\n> Ice-T rose to fame in the late 1980s, earning acclaim for his music before going on to enthrall television audiences as Odafin \"Fin\" Tutuola in *Law & Order: Special Victims Unit*.\n> But it could have gone much 
differently.\\\n>\n> \\\n> In this \"poignant and powerful\" (*Library Journal*, starred review) memoir, Ice-T and Spike, his former crime partner---collaborating with *New York Times* bestselling author Douglas Century---relate the shocking stories of their shared pasts, and how just a handful of decisions led to their incredibly different lives.\n> Both grew up in violent, gang-controlled Los Angeles neighborhoods and worked together to orchestrate a series of jewelry heists.\\\n>\n> \\\n> But while Ice-T was discovered rapping in a club and got his first record deal, Spike was caught for a jewelry robbery and did three years in prison.\n> As his music career began to take off, Ice made the decision to abandon the criminal life; Spike continued to plan increasingly ingenious and risky jewel heists.\n> And in 1992, after one of Spike's robberies ended tragically, he was sentenced to thirty-five years to life.\n> While he sat behind bars, he watched his former partner rise to fame in music, movies, and television.\\\n>\n> \\\n> \"Propulsive\" (*Publishers Weekly*, starred review), timely, and thoughtful, two men with two very different lives reveal how their paths might have very well been reversed if they made different choices.\n> All it took was a *split decision*.\n> [@split]\n\nThis premise is compelling because it implies that we are observing a *counterfactual*.\nThe book begins by setting up all the ways Ice-T and his friend Spike were similar prior to some important moment (both grew up in Los Angeles neighborhoods, both were involved with gangs, both worked together to orchestrate a series of jewelry heists, etc).\nThen something happens -- Ice-T makes a decision to abandon criminal life and Spike makes the opposite decision.\nWhat happens next for Ice-T includes fame and fortune, while Spike ends up with 35 years to life in prison.\nThis book is attempting a small study, two people who prior to some event were the same and after were different -- Spike's outcomes serve as the counterfactual to Ice-T's.\n\n::: {#tbl-causal-map layout-ncol=\"1\"}\n\n```{mermaid}\n%%| echo: false\nflowchart LR\nA{Ice-T} --> |observed| B(Abandons criminal life)\nA -.-> |missing counterfactual| C(Does one more heist)\nC -.-> D[35 years in prison]\nB --> E[Fame & Fortune]\n\nclassDef grey fill:#ddd\nclass D,C grey\n```\n\n```{mermaid}\n%%| echo: false\nflowchart LR\nA{Spike} -.-> |missing counterfactual| B(Abandons criminal life)\nA --> |observed| C(Does one more heist)\nC --> D[35 years in prison]\nB -.-> E[Fame & Fortune]\nclassDef grey fill:#ddd\nclass E,B grey\n```\n\n\nIce-T and Spike Causal Map\n:::\n\nIn practice, this is what we attempt to do with causal inference techniques.\nEven randomized trials are limited to a single factual world, so we compare the average effects of groups with different exposures.\nNow, having this as a concrete example of an attempt to construct a counterfactual scenario in the \"real-world\" there are several issues that we can immediately see, highlighting the difficulty in drawing such inference.\nFirst, while the synopsis implies that the two individuals were similar prior to the precipitating event that dictated their future opposite directions, we can easily identify factors in which perhaps they differed.\nIce-T decided to leave his life of crime, but that wasn't the only factor in his success: he had enough musical talent to make a career of it.\nDid Spike have Ice-T's musical talent?\nCan we really conclude that his life would have turned out exactly like 
Ice-T's if he had made the exact same choices as Ice-T?\nIf we want to truly estimate the causal effect of the decision to leave criminal life on Ice-T's future outcomes, we would need to observe his ultimate course both under making the decision and not.\nOf course this is not possible, so what can we do?\nPerhaps we can find someone else who is exactly like Ice-T who did not make the same decision and see how they fare.\nOf course, Ice-T is unique, it would be challenging to find someone exactly like him.\nAgain, this is attempted with Spike, and even so presents challenges.\nOften, instead of relying on a single individual, we rely on many individuals.\nWe could conduct an experiment where we *randomize* many individuals to leave criminal life (or not) and see how this impacts their outcomes *on average* (this randomized trial seems to present some ethical issues, perhaps we need to look to *observational* studies to help answer this question).\nIn any case, we must rely on statistical techniques to help construct these unobservable counterfactuals.\n\n### Potential Outcomes Simulation {#sec-po-sim}\n\nLet's suppose some happiness index, from 1-10 exists.\nWe are interested in assessing whether eating chocolate ice cream versus vanilla will increase happiness.\nWe have 10 individuals with two potential outcomes for each, one is what their happiness would be if they ate chocolate ice cream, (defined as `y_chocolate` in the code below), and one is what their happiness would be if they ate vanilla ice cream (defined as `y_vanilla` in the code below). We can define the true causal effect of eating chocolate ice cream (versus vanilla) on happiness for each individual as the difference between the two (@tbl-po).\n\n\n::: {.cell}\n\n:::\n\n::: {.cell}\n\n```{.r .cell-code}\ndata <- data.frame(\n id = 1:10,\n y_chocolate = c(4, 4, 6, 5, 6, 5, 6, 7, 5, 6),\n y_vanilla = c(1, 3, 4, 5, 5, 6, 8, 6, 3, 5)\n)\n\ndata <- data |>\n mutate(causal_effect = y_chocolate - y_vanilla)\n\ndata\n```\n:::\n\n::: {#tbl-po .cell tbl-cap='Potential Outcomes Simulation: The causal effect of eating chocolate (versus vanilla) ice cream on happiness'}\n::: {.cell-output-display}\n`````{=html}\n\n \n\n\n\n\n\n \n \n \n \n \n \n \n\n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n\n
| id | Potential Outcome $$Y_i(\\textrm{chocolate})$$ | Potential Outcome $$Y_i(\\textrm{vanilla})$$ | Causal Effect $$Y_i(\\textrm{chocolate}) - Y_i(\\textrm{vanilla})$$ |
|----|----|----|----|
| 1 | 4 | 1 | 3 |
| 2 | 4 | 3 | 1 |
| 3 | 6 | 4 | 2 |
| 4 | 5 | 5 | 0 |
| 5 | 6 | 5 | 1 |
| 6 | 5 | 6 | -1 |
| 7 | 6 | 8 | -2 |
| 8 | 7 | 6 | 1 |
| 9 | 5 | 3 | 2 |
| 10 | 6 | 5 | 1 |
\n\n`````\n:::\n:::\n\n::: {.cell}\n\n```{.r .cell-code}\ndata |>\n summarize(\n avg_chocolate = mean(y_chocolate),\n avg_vanilla = mean(y_vanilla),\n avg_causal_effect = mean(causal_effect)\n )\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n avg_chocolate avg_vanilla avg_causal_effect\n1 5.4 4.6 0.8\n```\n\n\n:::\n:::\n\n\nFor example, examining @tbl-po, the causal effect of eating chocolate ice cream (versus vanilla) for individual `4` is 0, whereas the causal effect for individual `9` is 2. The *average* potential happiness after eating chocolate is 5.4 and the *average* potential happiness after eating vanilla is 4.6. The *average* treatment effect of eating chocolate (versus vanilla) ice cream among the ten individuals in this study is 0.8. \n\nIn reality, we cannot observe both potential outcomes, in any moment in time, each individual in our study can only eat *one* flavor of ice cream. Suppose we let our participants choose which ice cream they wanted to eat and each choose their favorite (i.e. they knew which would make them \"happier\" and picked that one. Now what we *observe* is shown in @tbl-obs.\n\n\n::: {.cell}\n\n```{.r .cell-code}\ndata_observed <- data |>\n mutate(\n exposure = case_when(\n # people who like chocolate more chose that\n y_chocolate > y_vanilla ~ \"chocolate\",\n # people who like vanilla more chose that\n y_vanilla >= y_chocolate ~ \"vanilla\"\n ),\n observed_outcome = case_when(\n exposure == \"chocolate\" ~ y_chocolate,\n exposure == \"vanilla\" ~ y_vanilla\n )\n ) |>\n # we can only observe the exposure and one potential outcome\n select(id, exposure, observed_outcome)\ndata_observed\n```\n:::\n\n::: {#tbl-obs .cell tbl-cap='Potential Outcomes Simulation: The observed exposure and outcome used to estimate the effect of eating chocolate (versus vanilla) ice cream on happiness'}\n::: {.cell-output-display}\n`````{=html}\n\n \n\n\n\n\n\n \n \n \n \n \n \n\n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n\n
| id | Exposure $$X_i$$ | Observed Outcome $$Y_i$$ |
|----|----|----|
| 1 | chocolate | 4 |
| 2 | chocolate | 4 |
| 3 | chocolate | 6 |
| 4 | vanilla | 5 |
| 5 | chocolate | 6 |
| 6 | vanilla | 6 |
| 7 | vanilla | 8 |
| 8 | chocolate | 7 |
| 9 | chocolate | 5 |
| 10 | chocolate | 6 |
\n\n`````\n:::\n:::\n\n::: {.cell}\n\n```{.r .cell-code}\ndata_observed |>\n group_by(exposure) |>\n summarise(avg_outcome = mean(observed_outcome))\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n# A tibble: 2 × 2\n exposure avg_outcome\n \n1 chocolate 5.43\n2 vanilla 6.33\n```\n\n\n:::\n:::\n\n\nNow, the *observed* average outcome among those who ate chocolate ice cream is 5.4 (the same as the true average potential outcome), while the *observed* average outcome among those who ate vanilla is 6.3 -- quite different from the *actual* average (4.6). The estimated causal effect here could be calculated as 5.4 - 6.3 = -0.9. \n\nIt turns out here, these 10 participants *chose* which ice cream they wanted to eat and they always chose to eat their favorite! This artificially made it look like eating vanilla ice cream would increase the happiness in this population when in fact we know the opposite is true. The next section will discuss which assumptions need to be true in order to allow us to *accurately* estimate causal effects using observed data. As a sneak peak, our issue here was that how the exposure was decided, if instead we *randomized* who ate chocolate versus vanilla ice cream we would (on average, with a large enough sample) recover the true causal effect.\n\n\n::: {.cell}\n\n```{.r .cell-code}\n## we are doing something *random* so let's set a seed so we always observe the\n## same result each time we run the code\nset.seed(11)\ndata_observed <- data |>\n mutate(\n # change the exposure to randomized, generate from a binomial distribution\n # with a probability 0.5 for being in either group\n exposure = case_when(\n rbinom(10, 1, 0.5) == 1 ~ \"chocolate\",\n TRUE ~ \"vanilla\"\n ),\n observed_outcome = case_when(\n exposure == \"chocolate\" ~ y_chocolate,\n exposure == \"vanilla\" ~ y_vanilla\n )\n ) |>\n # we can only observe the exposure and one potential outcome\n select(id, exposure, observed_outcome)\ndata_observed |>\n group_by(exposure) |>\n summarise(avg_outcome = mean(observed_outcome))\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n# A tibble: 2 × 2\n exposure avg_outcome\n \n1 chocolate 5.33\n2 vanilla 4.71\n```\n\n\n:::\n:::\n\n\n## Causal Assumptions {#sec-assump}\n\nLike most statistical approaches, the validity of a causal analysis depends on how well certain assumptions are met.\nAs mentioned in @sec-potential, the potential outcomes framework envisions that each individual possesses a range of potential outcomes for every conceivable value of some input.\nFor instance, as in the scenario previously described with two exposure levels (exposed: 1 and unexposed: 0), we can define potential outcomes for exposure ($Y(1)$) and no exposure ($Y(0)$), and subsequently analyze the difference between these outcomes, i.e., $Y(1) - Y(0)$, to comprehend the impact of the input (the exposure) on the outcome, $Y$.\nAt any given time, only one of these *potential outcomes* is observable -- namely, the outcome tied to the actual exposure the individual underwent.\nUnder certain assumptions, we can leverage data from individuals exposed to different inputs to compare the average differences in their observed outcomes.\nThe most common assumptions across the approaches we describe in this book are:\n\n1. **Consistency**: We assume that the causal question you claim you are answering is consistent with the one you are *actually* answering with your analysis. 
Mathematically, this means that $Y_{obs} = (X)Y(1) + (1 - X)Y(0)$, in other words, the outcome you observe is exactly equal to the potential outcome under the exposure you received. Two common ways to discuss this assumption are: \n * **Well defined exposure**: We assume that for each value of the exposure, there is no difference between subjects in the delivery of that exposure.\nPut another way, multiple versions of the treatment do not exist. \n * **No interference**: We assume that the outcome (technically all *potential* outcomes, regardless of whether they are observed) for any subject does not depend on another subject's exposure.\n \n::: callout-tip\n## Jargon\n\nAssumption 1 is sometimes referred to as *stable-unit-treatment-value-assumption* or SUTVA [@imbens2015causal].\nLikewise, these assumptions are sometimes referred to as *identifiability conditions* since we need them to hold in order to identify causal estimates.\n:::\n\n2. **Exchangeability**: We assume that within levels of relevant variables (confounders), exposed and unexposed subjects have an equal likelihood of experiencing any outcome prior to exposure; i.e. the exposed and unexposed subjects are exchangeable.\nThis assumption is sometimes referred to as **no unmeasured confounding**.\n\n3. **Positivity**: We assume that within each level and combination of the study variables used to achieve exchangeability, there are exposed and unexposed subjects.\nSaid differently, each individual has some chance of experiencing every available exposure level.\nSometimes this is referred to as the **probabilistic** assumption.\n\n\n\n\n::: callout-note\n## Apples-to-apples\n\nPractically, most of the assumptions we need to make for causal inference are so we can make an *apples-to-apples* comparison: we want to make sure we're comparing individuals that are similar --- who would serve as good proxies for each other's counterfactuals. \n\nThe phrase *apples-to-apples* stems from the saying \"comparing apples to oranges\", e.g. comparing two things that are incomparable. \n\nThat's only one way to say it. [There are a lot of variations worldwide](https://en.wikipedia.org/wiki/Apples_and_oranges). Here are some other things people incorrectly compare:\n\n* Cheese and chalk (UK English)\n* Apples and pears (German)\n* Potatoes and sweet potatoes (Latin American Spanish)\n* Grandmothers and toads (Serbian)\n* Horses and donkeys (Hindi)\n:::\n\n### Causal Assumptions Simulation\n\nLet's bring back our simulation from @sec-po-sim. Recall that we have individuals who will either eat chocolate or vanilla ice cream and we are interested in assessing the causal effect of this exposure on their happiness. Let's see how violations of each assumption may impact the estimation of the causal effect.\n\n#### Consistency violation\n\nTwo ways the consistency assumption can be violated is (1) lack of a well defined exposure and (2) interference. Let's see how these impact our ability to accurately estimate a causal effect.\n\n##### Well defined exposure\n\nSuppose that there were in fact two containers of chocolate ice cream, one of which was spoiled. Therefore, despite the fact that having an exposure \"chocolate\" could mean different things depending on where the individual's scoop came from (regular chocolate ice cream, or spoiled chocolate ice cream), we are lumping them all together under a single umbrella (hence the violation, we have \"multiple versions of treatment\"). 
You can see how this falls under consistency because the issue here is that the potential outcome we think we are estimating is not the one we are actually observing.\n\n\n::: {.cell}\n\n```{.r .cell-code}\ndata <- data.frame(\n id = 1:10,\n y_spoiledchocolate = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0),\n y_chocolate = c(4, 4, 6, 5, 6, 5, 6, 7, 5, 6),\n y_vanilla = c(1, 3, 4, 5, 5, 6, 8, 6, 3, 5)\n) |>\n mutate(causal_effect = y_chocolate - y_vanilla)\n\nset.seed(11)\ndata_observed <- data |>\n mutate(\n exposure_unobserved = case_when(\n rbinom(10, 1, 0.25) == 1 ~ \"chocolate (spoiled)\",\n rbinom(10, 1, 0.25) == 1 ~ \"chocolate\",\n TRUE ~ \"vanilla\"\n ),\n observed_outcome = case_when(\n exposure_unobserved == \"chocolate (spoiled)\" ~ y_spoiledchocolate,\n exposure_unobserved == \"chocolate\" ~ y_chocolate,\n exposure_unobserved == \"vanilla\" ~ y_vanilla\n ),\n exposure = case_when(\n exposure_unobserved %in% c(\"chocolate (spoiled)\", \"chocolate\") ~ \"chocolate\",\n exposure_unobserved == \"vanilla\" ~ \"vanilla\"\n )\n ) |>\n select(id, exposure, observed_outcome)\n\ndata_observed |>\n group_by(exposure) |>\n summarise(avg_outcome = mean(observed_outcome))\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n# A tibble: 2 × 2\n exposure avg_outcome\n \n1 chocolate 2.75\n2 vanilla 4.67\n```\n\n\n:::\n:::\n\n\nWe know the *true* average causal effect of (unspoiled) chocolate in the sample is 0.8, however our estimated causal effect (because our data are not consistent with the question we are asking) is -1.9. This demonstrates what can go wrong when *well defined exposure* is violated.\n\n##### Interference \n\nInterference would mean that an individual's exposure impacts another's potential outcome. For example, let's say each individual has a partner, and their potential outcome depends on both what flavor of ice cream they ate *and* what flavor their partner ate. For example, in the simulation below, having a partner that received a different flavor of ice cream increases the happiness by two units.\n\n\n::: {.cell}\n\n```{.r .cell-code}\ndata <- data.frame(\n id = 1:10,\n partner_id = c(1, 1, 2, 2, 3, 3, 4, 4, 5, 5),\n y_chocolate_chocolate = c(4, 4, 6, 5, 6, 5, 6, 7, 5, 6),\n y_chocolate_vanilla = c(6, 6, 8, 7, 8, 7, 8, 9, 7, 8),\n y_vanilla_chocolate = c(3, 5, 6, 7, 7, 8, 10, 8, 5, 7),\n y_vanilla_vanilla = c(1, 3, 4, 5, 5, 6, 8, 6, 3, 5)\n)\n\nset.seed(11)\ndata_observed <- data |>\n mutate(\n exposure = case_when(\n rbinom(10, 1, 0.5) == 1 ~ \"chocolate\",\n TRUE ~ \"vanilla\"\n ),\n exposure_partner =\n c(\"vanilla\", \"vanilla\", \"vanilla\", \"chocolate\", \"chocolate\", \"vanilla\", \"vanilla\", \"vanilla\", \"vanilla\", \"chocolate\"),\n observed_outcome = case_when(\n exposure == \"chocolate\" & exposure_partner == \"chocolate\" ~ y_chocolate_chocolate,\n exposure == \"chocolate\" & exposure_partner == \"vanilla\" ~ y_chocolate_vanilla,\n exposure == \"vanilla\" & exposure_partner == \"chocolate\" ~ y_vanilla_chocolate,\n exposure == \"vanilla\" & exposure_partner == \"vanilla\" ~ y_vanilla_vanilla\n )\n ) |>\n select(id, exposure, observed_outcome)\n\ndata_observed |>\n group_by(exposure) |>\n summarise(avg_outcome = mean(observed_outcome))\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n# A tibble: 2 × 2\n exposure avg_outcome\n \n1 chocolate 7.33\n2 vanilla 5.57\n```\n\n\n:::\n:::\n\nNow our estimated causal effect (because interference exists) is 1.8. This demonstrates what can go wrong when *interference* occurs. 
One of the main ways to combat interference is change the *unit* under consideration. Here, each individual, each unique *id*, is considered a unit, and there is interference between units (i.e. between partners). If instead we consider each *partner* as a unit and randomize the partners rather than the individuals, we solve the interference issue, as there is not interference *between* different partner sets. This is sometimes referred to as a *cluster randomized trial*. What we decide to do within each cluster may depend on the causal question at hand. For example, if we want to know what would happen if *everyone* at chocolate ice cream versus if *everyone* at vanilla, we would want to randomize both partners to either chocolate or vanilla, as seen below.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nset.seed(11)\n\n## we are now randomizing the *partners* not the individuals\npartners <- data.frame(\n partner_id = 1:5,\n exposure = case_when(\n rbinom(5, 1, 0.5) == 1 ~ \"chocolate\",\n TRUE ~ \"vanilla\"\n )\n)\ndata_observed <- data |>\n left_join(partners, by = \"partner_id\") |>\n mutate(\n # all partners have the same exposure\n exposure_partner = exposure,\n observed_outcome = case_when(\n exposure == \"chocolate\" & exposure_partner == \"chocolate\" ~ y_chocolate_chocolate,\n exposure == \"vanilla\" & exposure_partner == \"vanilla\" ~ y_vanilla_vanilla\n )\n ) |>\n select(id, exposure, observed_outcome)\n\ndata_observed |>\n group_by(exposure) |>\n summarise(avg_outcome = mean(observed_outcome))\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n# A tibble: 2 × 2\n exposure avg_outcome\n \n1 chocolate 5.5 \n2 vanilla 4.38\n```\n\n\n:::\n:::\n\n\n#### Exchangeability violation\n\nWe have actually already seen an example of an exchangeability violation in @sec-po-sim. In that example, participants were able to choose the ice cream that they wanted to eat, so people who were more likely to have a positive effect from eating chocolate chose that, and those more likely to have a positive effect from eating vanilla chose that. \n\n\n::: {.cell}\n\n```{.r .cell-code}\ndata <- data.frame(\n id = 1:10,\n y_chocolate = c(4, 4, 6, 5, 6, 5, 6, 7, 5, 6),\n y_vanilla = c(1, 3, 4, 5, 5, 6, 8, 6, 3, 5)\n)\ndata_observed <- data |>\n mutate(\n exposure = case_when(\n # people who like chocolate more chose that\n y_chocolate > y_vanilla ~ \"chocolate\",\n # people who like vanilla more chose that\n y_vanilla >= y_chocolate ~ \"vanilla\"\n ),\n observed_outcome = case_when(\n exposure == \"chocolate\" ~ y_chocolate,\n exposure == \"vanilla\" ~ y_vanilla\n )\n ) |>\n select(id, exposure, observed_outcome)\n\ndata_observed |>\n group_by(exposure) |>\n summarise(avg_outcome = mean(observed_outcome))\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n# A tibble: 2 × 2\n exposure avg_outcome\n \n1 chocolate 5.43\n2 vanilla 6.33\n```\n\n\n:::\n:::\n\n\nHow could we correct this? If we had some people who preferred chocolate ice cream but ended up taking vanilla instead, we could *adjust* for the preference, and the effect conditioned on this would no longer have an exchangeability issue. It turns out that this example as we have constructed it doesn't lend itself to this solution because participants chose their preferred flavor 100% of the time making this *also* a positivity violation. \n\n#### Positivity violation\n\nAs stated above, the previous example violates both *exchangeability* and *positivity*. How could we fix it? 
As long as *some* people chose outside their preference with some probability (even if it is small!) we can remove this violation. Let's say instead of everyone picking their flavor of preference 100% of the time, they just had a 80% chance of picking that flavor.\n\n\n::: {.cell}\n\n```{.r .cell-code}\ndata <- data.frame(\n id = 1:10,\n y_chocolate = c(4, 4, 6, 5, 6, 5, 6, 7, 5, 6),\n y_vanilla = c(1, 3, 4, 5, 5, 6, 8, 6, 3, 5)\n)\n\nset.seed(11)\ndata_observed <- data |>\n mutate(\n prefer_chocolate = y_chocolate > y_vanilla,\n exposure = case_when(\n # people who like chocolate more chose that 80% of the time\n prefer_chocolate ~ ifelse(rbinom(10, 1, 0.8), \"chocolate\", \"vanilla\"),\n # people who like vanilla more chose that 80% of the time\n !prefer_chocolate ~ ifelse(rbinom(10, 1, 0.8), \"vanilla\", \"chocolate\")\n ),\n observed_outcome = case_when(\n exposure == \"chocolate\" ~ y_chocolate,\n exposure == \"vanilla\" ~ y_vanilla\n )\n ) |>\n select(id, prefer_chocolate, exposure, observed_outcome)\n\nlm(\n observed_outcome ~ I(exposure == \"chocolate\") + prefer_chocolate,\n data_observed\n)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n\nCall:\nlm(formula = observed_outcome ~ I(exposure == \"chocolate\") + \n prefer_chocolate, data = data_observed)\n\nCoefficients:\n (Intercept) \n 6.156 \nI(exposure == \"chocolate\")TRUE \n 0.531 \n prefer_chocolateTRUE \n -1.469 \n```\n\n\n:::\n:::\n\nAfter *adjusting* for this variable (chocolate preference), we recover the correct causal effect. This value is not exactly the same as the truth we obtain with the (unobservable) potential outcomes because we are dealing with a small sample -- as our sample size increases this will get closer to the truth.\n\nCausal assumptions can be difficult to verify and may not hold for many data collection strategies.\nWe cannot overstate the importance of checking these criteria to the extent possible!\nFollowing any of the recipes in this book are unlikely to give valid answers if the causal assumptions are badly violated.\n", - "supporting": [], + "markdown": "# Estimating counterfactuals {#sec-counterfactuals}\n\n\n\n\n\n## Potential Outcomes {#sec-potential}\n\nLet's begin by thinking about the philosophical concept of a *potential outcome.* Prior to some \"cause\" occurring, for example receiving some exposure, the *potential outcomes* are all of the potential things that could occur depending on what you are exposed to.\nFor simplicity, let's assume an exposure has two levels:\n\n- $X=1$ if you are exposed\n\n- $X=0$ if you are not exposed\n\nUnder this simple scenario, there are two potential outcomes:\n\n- $Y(1)$ the potential outcome if you are exposed\n\n- $Y(0)$ the potential outcome if you are not exposed\n\nOnly *one* of these potential outcomes will actually be realized, the one corresponding to the exposure that actually occurred, and therefore only one is observable.\nIt is important to remember that these exposures are defined at a particular instance in time, so only one can happen to any individual.\nIn the case of a binary exposure, this leaves one potential outcome as *observable* and one *missing.* In fact, early causal inference methods were often framed as missing data problems; we need to make certain assumptions about the *missing counterfactuals*, the value of the potential outcome corresponding to the exposure(s) that did not occur.\n\nOur causal effect of interest is often some difference in potential outcomes $Y(1) - Y(0)$, averaged over a particular population.\n\n## 
Counterfactuals\n\nConceptually, the missing counterfactual outcome is one that would have occurred under a different set of circumstances.\nIn causal inference, we *wish* we could observe the conterfactual outcome that would have occurred in an alternate universe where the exposure status for a given observation was flipped.\nTo do this, we attempt to control for all factors that are related to an exposure and outcome such that we can *construct* (or estimate) such a counterfactual outcome.\n\nLet's think about a specific example.\nIce-T, best known as an American rapper and Fin on Law and Order: SVU, co-authored a book titled \"Split Decision: Life Stories\", published in 2022.\nHere is the synopsis:\n\n> **Award-winning actor, rapper, and producer Ice-T unveils a compelling memoir of his early life robbing jewelry stores until he found fame and fortune---while a handful of bad choices sent his former crime partner down an incredibly different path.**\\\n> \\\n> Ice-T rose to fame in the late 1980s, earning acclaim for his music before going on to enthrall television audiences as Odafin \"Fin\" Tutuola in *Law & Order: Special Victims Unit*.\n> But it could have gone much differently.\\\n>\n> \\\n> In this \"poignant and powerful\" (*Library Journal*, starred review) memoir, Ice-T and Spike, his former crime partner---collaborating with *New York Times* bestselling author Douglas Century---relate the shocking stories of their shared pasts, and how just a handful of decisions led to their incredibly different lives.\n> Both grew up in violent, gang-controlled Los Angeles neighborhoods and worked together to orchestrate a series of jewelry heists.\\\n>\n> \\\n> But while Ice-T was discovered rapping in a club and got his first record deal, Spike was caught for a jewelry robbery and did three years in prison.\n> As his music career began to take off, Ice made the decision to abandon the criminal life; Spike continued to plan increasingly ingenious and risky jewel heists.\n> And in 1992, after one of Spike's robberies ended tragically, he was sentenced to thirty-five years to life.\n> While he sat behind bars, he watched his former partner rise to fame in music, movies, and television.\\\n>\n> \\\n> \"Propulsive\" (*Publishers Weekly*, starred review), timely, and thoughtful, two men with two very different lives reveal how their paths might have very well been reversed if they made different choices.\n> All it took was a *split decision*.\n> [@split]\n\nThis premise is compelling because it implies that we are observing a *counterfactual*.\nThe book begins by setting up all the ways Ice-T and his friend Spike were similar prior to some important moment (both grew up in Los Angeles neighborhoods, both were involved with gangs, both worked together to orchestrate a series of jewelry heists, etc).\nThen something happens -- Ice-T makes a decision to abandon criminal life and Spike makes the opposite decision.\nWhat happens next for Ice-T includes fame and fortune, while Spike ends up with 35 years to life in prison.\nThis book is attempting a small study, two people who prior to some event were the same and after were different -- Spike's outcomes serve as the counterfactual to Ice-T's.\n\n::: {#tbl-causal-map layout-ncol=\"1\"}\n\n```{mermaid}\n%%| echo: false\nflowchart LR\nA{Ice-T} --> |observed| B(Abandons criminal life)\nA -.-> |missing counterfactual| C(Does one more heist)\nC -.-> D[35 years in prison]\nB --> E[Fame & Fortune]\n\nclassDef grey fill:#ddd\nclass D,C 
grey\n```\n\n```{mermaid}\n%%| echo: false\nflowchart LR\nA{Spike} -.-> |missing counterfactual| B(Abandons criminal life)\nA --> |observed| C(Does one more heist)\nC --> D[35 years in prison]\nB -.-> E[Fame & Fortune]\nclassDef grey fill:#ddd\nclass E,B grey\n```\n\n\nIce-T and Spike Causal Map\n:::\n\nIn practice, this is what we attempt to do with causal inference techniques.\nEven randomized trials are limited to a single factual world, so we compare the average effects of groups with different exposures.\nNow, having this as a concrete example of an attempt to construct a counterfactual scenario in the \"real-world\" there are several issues that we can immediately see, highlighting the difficulty in drawing such inference.\nFirst, while the synopsis implies that the two individuals were similar prior to the precipitating event that dictated their future opposite directions, we can easily identify factors in which perhaps they differed.\nIce-T decided to leave his life of crime, but that wasn't the only factor in his success: he had enough musical talent to make a career of it.\nDid Spike have Ice-T's musical talent?\nCan we really conclude that his life would have turned out exactly like Ice-T's if he had made the exact same choices as Ice-T?\nIf we want to truly estimate the causal effect of the decision to leave criminal life on Ice-T's future outcomes, we would need to observe his ultimate course both under making the decision and not.\nOf course this is not possible, so what can we do?\nPerhaps we can find someone else who is exactly like Ice-T who did not make the same decision and see how they fare.\nOf course, Ice-T is unique, it would be challenging to find someone exactly like him.\nAgain, this is attempted with Spike, and even so presents challenges.\nOften, instead of relying on a single individual, we rely on many individuals.\nWe could conduct an experiment where we *randomize* many individuals to leave criminal life (or not) and see how this impacts their outcomes *on average* (this randomized trial seems to present some ethical issues, perhaps we need to look to *observational* studies to help answer this question).\nIn any case, we must rely on statistical techniques to help construct these unobservable counterfactuals.\n\n### Potential Outcomes Simulation {#sec-po-sim}\n\nLet's suppose some happiness index, from 1-10 exists.\nWe are interested in assessing whether eating chocolate ice cream versus vanilla will increase happiness.\nWe have 10 individuals with two potential outcomes for each, one is what their happiness would be if they ate chocolate ice cream, (defined as `y_chocolate` in the code below), and one is what their happiness would be if they ate vanilla ice cream (defined as `y_vanilla` in the code below). 
We can define the true causal effect of eating chocolate ice cream (versus vanilla) on happiness for each individual as the difference between the two (@tbl-po).\n\n\n::: {.cell}\n\n:::\n\n::: {.cell}\n\n```{.r .cell-code}\ndata <- data.frame(\n id = 1:10,\n y_chocolate = c(4, 4, 6, 5, 6, 5, 6, 7, 5, 6),\n y_vanilla = c(1, 3, 4, 5, 5, 6, 8, 6, 3, 5)\n) \n\ndata <- data |>\n mutate(causal_effect = y_chocolate - y_vanilla)\n\ndata\n```\n:::\n\n::: {#tbl-po .cell tbl-cap='Potential Outcomes Simulation: The causal effect of eating chocolate (versus vanilla) ice cream on happiness'}\n::: {.cell-output-display}\n`````{=html}\n\n \n\n\n\n\n\n \n \n \n \n \n \n \n\n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n\n
| id | Potential Outcome $$Y_i(\\textrm{chocolate})$$ | Potential Outcome $$Y_i(\\textrm{vanilla})$$ | Causal Effect $$Y_i(\\textrm{chocolate}) - Y_i(\\textrm{vanilla})$$ |
|----|----|----|----|
| 1 | 4 | 1 | 3 |
| 2 | 4 | 3 | 1 |
| 3 | 6 | 4 | 2 |
| 4 | 5 | 5 | 0 |
| 5 | 6 | 5 | 1 |
| 6 | 5 | 6 | -1 |
| 7 | 6 | 8 | -2 |
| 8 | 7 | 6 | 1 |
| 9 | 5 | 3 | 2 |
| 10 | 6 | 5 | 1 |
\n\n`````\n:::\n:::\n\n::: {.cell}\n\n```{.r .cell-code}\ndata |>\n summarize(\n avg_chocolate = mean(y_chocolate),\n avg_vanilla = mean(y_vanilla),\n avg_causal_effect = mean(causal_effect)\n )\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n avg_chocolate avg_vanilla avg_causal_effect\n1 5.4 4.6 0.8\n```\n\n\n:::\n:::\n\n\nFor example, examining @tbl-po, the causal effect of eating chocolate ice cream (versus vanilla) for individual `4` is 0, whereas the causal effect for individual `9` is 2. The *average* potential happiness after eating chocolate is 5.4 and the *average* potential happiness after eating vanilla is 4.6. The *average* treatment effect of eating chocolate (versus vanilla) ice cream among the ten individuals in this study is 0.8. \n\nIn reality, we cannot observe both potential outcomes, in any moment in time, each individual in our study can only eat *one* flavor of ice cream. Suppose we let our participants choose which ice cream they wanted to eat and each choose their favorite (i.e. they knew which would make them \"happier\" and picked that one. Now what we *observe* is shown in @tbl-obs.\n\n\n::: {.cell}\n\n```{.r .cell-code}\ndata_observed <- data |>\n mutate(\n exposure = case_when(\n # people who like chocolate more chose that\n y_chocolate > y_vanilla ~ \"chocolate\",\n # people who like vanilla more chose that\n y_vanilla >= y_chocolate ~ \"vanilla\"\n ),\n observed_outcome = case_when(\n exposure == \"chocolate\" ~ y_chocolate,\n exposure == \"vanilla\" ~ y_vanilla\n )\n ) |>\n # we can only observe the exposure and one potential outcome\n select(id, exposure, observed_outcome)\ndata_observed\n```\n:::\n\n::: {#tbl-obs .cell tbl-cap='Potential Outcomes Simulation: The observed exposure and outcome used to estimate the effect of eating chocolate (versus vanilla) ice cream on happiness'}\n::: {.cell-output-display}\n`````{=html}\n\n \n\n\n\n\n\n \n \n \n \n \n \n\n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n\n
| id | Exposure $$X_i$$ | Observed Outcome $$Y_i$$ |
|----|----|----|
| 1 | chocolate | 4 |
| 2 | chocolate | 4 |
| 3 | chocolate | 6 |
| 4 | vanilla | 5 |
| 5 | chocolate | 6 |
| 6 | vanilla | 6 |
| 7 | vanilla | 8 |
| 8 | chocolate | 7 |
| 9 | chocolate | 5 |
| 10 | chocolate | 6 |
\n\n`````\n:::\n:::\n\n::: {.cell}\n\n```{.r .cell-code}\ndata_observed |>\n group_by(exposure) |>\n summarise(avg_outcome = mean(observed_outcome))\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n# A tibble: 2 × 2\n exposure avg_outcome\n \n1 chocolate 5.43\n2 vanilla 6.33\n```\n\n\n:::\n:::\n\n\nNow, the *observed* average outcome among those who ate chocolate ice cream is 5.4 (the same as the true average potential outcome), while the *observed* average outcome among those who ate vanilla is 6.3 -- quite different from the *actual* average (4.6). The estimated causal effect here could be calculated as 5.4 - 6.3 = -0.9. \n\nIt turns out here, these 10 participants *chose* which ice cream they wanted to eat and they always chose to eat their favorite! This artificially made it look like eating vanilla ice cream would increase the happiness in this population when in fact we know the opposite is true. The next section will discuss which assumptions need to be true in order to allow us to *accurately* estimate causal effects using observed data. As a sneak peak, our issue here was that how the exposure was decided, if instead we *randomized* who ate chocolate versus vanilla ice cream we would (on average, with a large enough sample) recover the true causal effect.\n\n\n::: {.cell}\n\n```{.r .cell-code}\n## we are doing something *random* so let's set a seed so we always observe the\n## same result each time we run the code\nset.seed(11)\ndata_observed <- data |>\n mutate(\n # change the exposure to randomized, generate from a binomial distribution\n # with a probability 0.5 for being in either group \n exposure = case_when(\n rbinom(10, 1, 0.5) == 1 ~ \"chocolate\",\n TRUE ~ \"vanilla\"\n ),\n observed_outcome = case_when(\n exposure == \"chocolate\" ~ y_chocolate,\n exposure == \"vanilla\" ~ y_vanilla\n )\n ) |>\n # we can only observe the exposure and one potential outcome\n select(id, exposure, observed_outcome)\ndata_observed |>\n group_by(exposure) |>\n summarise(avg_outcome = mean(observed_outcome))\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n# A tibble: 2 × 2\n exposure avg_outcome\n \n1 chocolate 5.33\n2 vanilla 4.71\n```\n\n\n:::\n:::\n\n\n## Causal Assumptions {#sec-assump}\n\nLike most statistical approaches, the validity of a causal analysis depends on how well certain assumptions are met.\nAs mentioned in @sec-potential, the potential outcomes framework envisions that each individual possesses a range of potential outcomes for every conceivable value of some input.\nFor instance, as in the scenario previously described with two exposure levels (exposed: 1 and unexposed: 0), we can define potential outcomes for exposure ($Y(1)$) and no exposure ($Y(0)$), and subsequently analyze the difference between these outcomes, i.e., $Y(1) - Y(0)$, to comprehend the impact of the input (the exposure) on the outcome, $Y$.\nAt any given time, only one of these *potential outcomes* is observable -- namely, the outcome tied to the actual exposure the individual underwent.\nUnder certain assumptions, we can leverage data from individuals exposed to different inputs to compare the average differences in their observed outcomes.\nThe most common assumptions across the approaches we describe in this book are:\n\n1. **Consistency**: We assume that the causal question you claim you are answering is consistent with the one you are *actually* answering with your analysis. 
Mathematically, this means that $Y_{obs} = (X)Y(1) + (1 - X)Y(0)$, in other words, the outcome you observe is exactly equal to the potential outcome under the exposure you received. Two common ways to discuss this assumption are: \n * **Well defined exposure**: We assume that for each value of the exposure, there is no difference between subjects in the delivery of that exposure.\nPut another way, multiple versions of the treatment do not exist. \n * **No interference**: We assume that the outcome (technically all *potential* outcomes, regardless of whether they are observed) for any subject does not depend on another subject's exposure.\n \n::: callout-tip\n## Jargon\n\nAssumption 1 is sometimes referred to as *stable-unit-treatment-value-assumption* or SUTVA [@imbens2015causal].\nLikewise, these assumptions are sometimes referred to as *identifiability conditions* since we need them to hold in order to identify causal estimates.\n:::\n\n2. **Exchangeability**: We assume that within levels of relevant variables (confounders), exposed and unexposed subjects have an equal likelihood of experiencing any outcome prior to exposure; i.e. the exposed and unexposed subjects are exchangeable.\nThis assumption is sometimes referred to as **no unmeasured confounding**.\n\n3. **Positivity**: We assume that within each level and combination of the study variables used to achieve exchangeability, there are exposed and unexposed subjects.\nSaid differently, each individual has some chance of experiencing every available exposure level.\nSometimes this is referred to as the **probabilistic** assumption.\n\n\n\n\n::: callout-note\n## Apples-to-apples\n\nPractically, most of the assumptions we need to make for causal inference are so we can make an *apples-to-apples* comparison: we want to make sure we're comparing individuals that are similar --- who would serve as good proxies for each other's counterfactuals. \n\nThe phrase *apples-to-apples* stems from the saying \"comparing apples to oranges\", e.g. comparing two things that are incomparable. \n\nThat's only one way to say it. [There are a lot of variations worldwide](https://en.wikipedia.org/wiki/Apples_and_oranges). Here are some other things people incorrectly compare:\n\n* Cheese and chalk (UK English)\n* Apples and pears (German)\n* Potatoes and sweet potatoes (Latin American Spanish)\n* Grandmothers and toads (Serbian)\n* Horses and donkeys (Hindi)\n:::\n\n### Causal Assumptions Simulation\n\nLet's bring back our simulation from @sec-po-sim. Recall that we have individuals who will either eat chocolate or vanilla ice cream and we are interested in assessing the causal effect of this exposure on their happiness. Let's see how violations of each assumption may impact the estimation of the causal effect.\n\n#### Consistency violation\n\nTwo ways the consistency assumption can be violated is (1) lack of a well defined exposure and (2) interference. Let's see how these impact our ability to accurately estimate a causal effect.\n\n##### Well defined exposure\n\nSuppose that there were in fact two containers of chocolate ice cream, one of which was spoiled. Therefore, despite the fact that having an exposure \"chocolate\" could mean different things depending on where the individual's scoop came from (regular chocolate ice cream, or spoiled chocolate ice cream), we are lumping them all together under a single umbrella (hence the violation, we have \"multiple versions of treatment\"). 
You can see how this falls under consistency because the issue here is that the potential outcome we think we are estimating is not the one we are actually observing.\n\n\n::: {.cell}\n\n```{.r .cell-code}\ndata <- data.frame(\n id = 1:10,\n y_spoiledchocolate = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0),\n y_chocolate = c(4, 4, 6, 5, 6, 5, 6, 7, 5, 6),\n y_vanilla = c(1, 3, 4, 5, 5, 6, 8, 6, 3, 5)\n) |>\n mutate(causal_effect = y_chocolate - y_vanilla)\n\nset.seed(11)\ndata_observed <- data |>\n mutate(\n exposure_unobserved = case_when(\n rbinom(10, 1, 0.25) == 1 ~ \"chocolate (spoiled)\",\n rbinom(10, 1, 0.25) == 1 ~ \"chocolate\",\n TRUE ~ \"vanilla\"\n ),\n observed_outcome = case_when(\n exposure_unobserved == \"chocolate (spoiled)\" ~ y_spoiledchocolate,\n exposure_unobserved == \"chocolate\" ~ y_chocolate,\n exposure_unobserved == \"vanilla\" ~ y_vanilla\n ),\n exposure = case_when(\n exposure_unobserved %in% c(\"chocolate (spoiled)\", \"chocolate\") ~ \"chocolate\",\n exposure_unobserved == \"vanilla\" ~ \"vanilla\"\n )\n ) |>\n select(id, exposure, observed_outcome)\n\ndata_observed |>\n group_by(exposure) |>\n summarise(avg_outcome = mean(observed_outcome))\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n# A tibble: 2 × 2\n exposure avg_outcome\n \n1 chocolate 2.75\n2 vanilla 4.67\n```\n\n\n:::\n:::\n\n\nWe know the *true* average causal effect of (unspoiled) chocolate in the sample is 0.8, however our estimated causal effect (because our data are not consistent with the question we are asking) is -1.9. This demonstrates what can go wrong when *well defined exposure* is violated.\n\n##### Interference \n\nInterference would mean that an individual's exposure does not impact another's potential outcome. For example, let's say each individual has a partner, and their potential outcome depends on both what flavor of ice cream they ate *and* what flavor their partner ate. For example, in the simulation below, having a partner that received a different flavor of ice cream increases the happiness by two units.\n\n\n::: {.cell}\n\n```{.r .cell-code}\ndata <- data.frame(\n id = 1:10,\n partner_id = c(1, 1, 2, 2, 3, 3, 4, 4, 5, 5),\n y_chocolate_chocolate = c(4, 4, 6, 5, 6, 5, 6, 7, 5, 6),\n y_chocolate_vanilla = c(6, 6, 8, 7, 8, 7, 8, 9, 7, 8),\n y_vanilla_chocolate = c(3, 5, 6, 7, 7, 8, 10, 8, 5, 7),\n y_vanilla_vanilla = c(1, 3, 4, 5, 5, 6, 8, 6, 3, 5)\n)\n\nset.seed(11)\ndata_observed <- data |>\n mutate(\n exposure = case_when(\n rbinom(10, 1, 0.5) == 1 ~ \"chocolate\",\n TRUE ~ \"vanilla\"\n ),\n exposure_partner =\n c(\"vanilla\", \"vanilla\", \"vanilla\", \"chocolate\", \"chocolate\", \"vanilla\", \"vanilla\", \"vanilla\", \"vanilla\", \"chocolate\"),\n observed_outcome = case_when(\n exposure == \"chocolate\" & exposure_partner == \"chocolate\" ~ y_chocolate_chocolate,\n exposure == \"chocolate\" & exposure_partner == \"vanilla\" ~ y_chocolate_vanilla,\n exposure == \"vanilla\" & exposure_partner == \"chocolate\" ~ y_vanilla_chocolate,\n exposure == \"vanilla\" & exposure_partner == \"vanilla\" ~ y_vanilla_vanilla\n )\n ) |>\n select(id, exposure, observed_outcome)\n\ndata_observed |>\n group_by(exposure) |>\n summarise(avg_outcome = mean(observed_outcome))\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n# A tibble: 2 × 2\n exposure avg_outcome\n \n1 chocolate 7.33\n2 vanilla 5.57\n```\n\n\n:::\n:::\n\nNow our estimated causal effect (because interference exists) is 1.8. This demonstrates what can go wrong when *interference* occurs. 
One of the main ways to combat interference is change the *unit* under consideration. Here, each individual, each unique *id*, is considered a unit, and there is interference between units (i.e. between partners). If instead we consider each *partner* as a unit and randomize the partners rather than the individuals, we solve the interference issue, as there is not interference *between* different partner sets. This is sometimes referred to as a *cluster randomized trial*. What we decide to do within each cluster may depend on the causal question at hand. For example, if we want to know what would happen if *everyone* at chocolate ice cream versus if *everyone* at vanilla, we would want to randomize both partners to either chocolate or vanilla, as seen below.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nset.seed(11)\n\n## we are now randomizing the *partners* not the individuals\npartners <- data.frame(\n partner_id = 1:5,\n exposure = case_when(\n rbinom(5, 1, 0.5) == 1 ~ \"chocolate\",\n TRUE ~ \"vanilla\"\n )\n)\ndata_observed <- data |>\n left_join(partners, by = \"partner_id\") |>\n mutate(\n # all partners have the same exposure\n exposure_partner = exposure,\n observed_outcome = case_when(\n exposure == \"chocolate\" & exposure_partner == \"chocolate\" ~ y_chocolate_chocolate,\n exposure == \"vanilla\" & exposure_partner == \"vanilla\" ~ y_vanilla_vanilla\n )\n ) |>\n select(id, exposure, observed_outcome)\n\ndata_observed |>\n group_by(exposure) |>\n summarise(avg_outcome = mean(observed_outcome))\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n# A tibble: 2 × 2\n exposure avg_outcome\n \n1 chocolate 5.5 \n2 vanilla 4.38\n```\n\n\n:::\n:::\n\n\n#### Exchangeability violation\n\nWe have actually already seen an example of an exchangeability violation in @sec-po-sim. In that example, participants were able to choose the ice cream that they wanted to eat, so people who were more likely to have a positive effect from eating chocolate chose that, and those more likely to have a positive effect from eating vanilla chose that. \n\n\n::: {.cell}\n\n```{.r .cell-code}\ndata <- data.frame(\n id = 1:10,\n y_chocolate = c(4, 4, 6, 5, 6, 5, 6, 7, 5, 6),\n y_vanilla = c(1, 3, 4, 5, 5, 6, 8, 6, 3, 5)\n) \ndata_observed <- data |>\n mutate(\n exposure = case_when(\n # people who like chocolate more chose that\n y_chocolate > y_vanilla ~ \"chocolate\",\n # people who like vanilla more chose that\n y_vanilla >= y_chocolate ~ \"vanilla\"\n ),\n observed_outcome = case_when(\n exposure == \"chocolate\" ~ y_chocolate,\n exposure == \"vanilla\" ~ y_vanilla\n )\n ) |>\n select(id, exposure, observed_outcome)\n\ndata_observed |>\n group_by(exposure) |>\n summarise(avg_outcome = mean(observed_outcome))\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n# A tibble: 2 × 2\n exposure avg_outcome\n \n1 chocolate 5.43\n2 vanilla 6.33\n```\n\n\n:::\n:::\n\n\nHow could we correct this? If we had some people who preferred chocolate ice cream but ended up taking vanilla instead, we could *adjust* for the preference, and the effect conditioned on this would no longer have an exchangeability issue. It turns out that this example as we have constructed it doesn't lend itself to this solution because participants chose their preferred flavor 100% of the time making this *also* a positivity violation. \n\n#### Positivity violation\n\nAs stated above, the previous example violates both *exchangeability* and *positivity*. How could we fix it? 
As long as *some* people chose outside their preference with some probability (even if it is small!) we can remove this violation. Let's say instead of everyone picking their flavor of preference 100% of the time, they just had a 80% chance of picking that flavor.\n\n\n::: {.cell}\n\n```{.r .cell-code}\ndata <- data.frame(\n id = 1:10,\n y_chocolate = c(4, 4, 6, 5, 6, 5, 6, 7, 5, 6),\n y_vanilla = c(1, 3, 4, 5, 5, 6, 8, 6, 3, 5)\n) \n\nset.seed(11)\ndata_observed <- data |>\n mutate(\n prefer_chocolate = y_chocolate > y_vanilla ,\n exposure = case_when(\n # people who like chocolate more chose that 80% of the time\n prefer_chocolate ~ ifelse(rbinom(10, 1, 0.8), \"chocolate\", \"vanilla\"),\n # people who like vanilla more chose that 80% of the time\n !prefer_chocolate ~ ifelse(rbinom(10, 1, 0.8), \"vanilla\", \"chocolate\")\n ),\n observed_outcome = case_when(\n exposure == \"chocolate\" ~ y_chocolate,\n exposure == \"vanilla\" ~ y_vanilla\n )\n ) |>\n select(id, prefer_chocolate, exposure, observed_outcome)\n\nlm(observed_outcome ~ I(exposure == \"chocolate\") + prefer_chocolate, \n data_observed)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n\nCall:\nlm(formula = observed_outcome ~ I(exposure == \"chocolate\") + \n prefer_chocolate, data = data_observed)\n\nCoefficients:\n (Intercept) \n 6.156 \nI(exposure == \"chocolate\")TRUE \n 0.531 \n prefer_chocolateTRUE \n -1.469 \n```\n\n\n:::\n:::\n\nAfter *adjusting* for this variable (chocolate preference), we recover the correct causal effect. This value is not exactly the same as the truth we obtain with the (unobservable) potential outcomes because we are dealing with a small sample -- as our sample size increases this will get closer to the truth.\n\nCausal assumptions can be difficult to verify and may not hold for many data collection strategies.\nWe cannot overstate the importance of checking these criteria to the extent possible!\nFollowing any of the recipes in this book are unlikely to give valid answers if the causal assumptions are badly violated.\n", + "supporting": [ + "chapter-03_files" + ], "filters": [ "rmarkdown/pagebreak.lua" ], diff --git a/_freeze/chapters/chapter-04/execute-results/html.json b/_freeze/chapters/chapter-04/execute-results/html.json index 3ea179e..c6fcb03 100644 --- a/_freeze/chapters/chapter-04/execute-results/html.json +++ b/_freeze/chapters/chapter-04/execute-results/html.json @@ -1,7 +1,7 @@ { - "hash": "b3d388ce27239686e821c3825a1c7df3", + "hash": "4c86b93e4742ed048489d6cac7992ebc", "result": { - "markdown": "# Target Trials and Standard Methods {#sec-trials-std}\n\n\n\n\n\n\n## Randomized trials {#sec-rand-trials}\n\nA *randomized trial* is one where the exposure (cause) of interest is *randomly assigned*.\n\n::: callout-note\nIn this book, we refer to analyses where the exposure is randomly assigned as a *randomized trial*, sometimes this is called an A/B test.\n:::\n\nWhy does randomization help?\nLooking at our assumptions in @sec-assump, randomized trials solve the well defined exposure portion of consistency by default -- the exposure is exactly what is randomized.\nLikewise for positivity; if we have randomly assigned folks to either the exposed or unexposed groups, we know the probability of assignment (and we know it is not exactly 0 or 1).\nRandomization alone does not solve the interference portion of consistency (for example, if we randomize some people to receive a vaccine for a communicable disease, their receiving it could lower the chance of those around them contracting the 
infectious disease because it changes the probability of exposure).\n*Ideal* randomized trials resolves the issue of exchangeability because the exposed and unexposed populations (in the limit) are inherently the same since their exposure status was determined by a random process (not by any factors that might make them different from each other).\nGreat!\nIn reality, we often see this assumption violated by issues such as *drop out* or *non-adherence* in randomized trials.\nIf there is differential drop out between exposure groups (for example, if participants randomly assigned to the treatment are more likely to drop out of a study, and thus we don't observe their outcome), then the observed exposure groups are no longer *exchangeable*.\nTherefore, in @tbl-assump-solved we have two columns, one for the *ideal* randomized trial (where adherence is assumed to be perfect and no participants drop out) and one for *realistic* randomized trials where this may not be so.\n\n| Assumption | Ideal Randomized Trial | Realistic Randomized Trial | Observational Study |\n|-----------------|-----------------|---------------------|-------------------|\n| Consistency (Well defined exposure) | 😄 | 😄 | 🤷 |\n| Consistency (No interference) | 🤷 | 🤷 | 🤷 |\n| Positivity | 😄 | 😄 | 🤷 |\n| Exchangeability | 😄 | 🤷 | 🤷 |\n\n: Assumptions solved by study design. 😄 indicates it is solved by default, 🤷 indicates that it is *solvable* but not solved by default. {#tbl-assump-solved}\n\nWhen designing a study, the first step is asking an appropriate *causal question*. We then can map this question to a *protocol*, consisting of the following seven elements, as defined by @hernan2016using:\n\n* Eligibility criteria\n* Exposure definition\n* Assignment procedures\n* Follow-up period\n* Outcome definition\n* Causal contrast of interest\n* Analysis plan\n\nIn @tbl-protocol we map each of these elements to the corresponding assumption that it can address. For example, exchangeability can be addressed by the eligibility criteria (we can restrict our study to only participants for whom exposure assignment is exchangeable), assignment procedure (we could use random exposure assignment to ensure exchangeability), follow-up period (we can be sure to choose an appropriate start time for our follow-up period to ensure that we are not inducing bias -- we'll think more about this in a future chapter), and/or the analysis plan (we can adjust for any factors that would cause those in different exposure groups to lack exchangeability). \n\n\n\nAssumption | Eligibility Criteria | Exposure Definition| Assignment Procedures | Follow-up Period | Outcome Definition | Causal contrast | Analysis Plan\n------------|----------------- | ------------------|---------|----------|----------|--------- | -------\nConsistency (Well defined exposure) | |✔️|| | |\nConsistency (No interference) | | ✔️|✔️ | | ✔️ | | ✔️\nPositivity |✔️||✔️|| | | ✔️\nExchangeability |✔️||✔️|✔️|| | ✔️\n: Mapping assumptions to elements of a study protocol {#tbl-protocol}\n\nRecall our diagrams from @sec-diag (@fig-diagram-4); several of these protocol elements can be mapped to these diagrams when we are attempting to define our causal question.\n\n\n::: {.cell}\n::: {.cell-output-display}\n![Example diagram mapped to causal analysis terminology](chapter-04_files/figure-html/fig-diagram-4-1.png){#fig-diagram-4 width=672}\n:::\n:::\n\n\n\n## Target Trials\n\nThere are many reasons why randomization may not be possible. 
For example, it might not be ethical to randomly assign people to a particular exposure, there may not be funding available to run a randomized trial, or there might not be enough time to conduct a full trial. In these situations, we rely on observational data to help us answer causal questions by implementing a *target trial*. A *target trial* answers: What experiment would you design if you could? \n\n## Causal inference with `group_by()` and `summarize()` {#sec-group-sum}\n\nLet's suppose we are trying to estimate a causal effect of an exposure on an outcome, but the exposure is not *randomized*, in fact, there is a common cause of the exposure and outcome, making the exposed and unexposed groups *not exchangeable* without adjustment (violating the fourth assumption in @sec-assump).\n\n::: callout-note\nA **confounder** is a common cause of exposure and outcome.\n:::\n\n### One binary confounder\n\nLet's suppose this confounder is binary, see the simulation below:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nset.seed(1)\nn <- 10000\nsim <- tibble(\n # generate the confounder from a binomial distribution\n # with a probability 0.5 for being in either group\n confounder = rbinom(n, 1, 0.5),\n # make the probability of exposure dependent on the\n # confounder value\n p_exposure = case_when(\n confounder == 1 ~ 0.75,\n confounder == 0 ~ 0.25\n ),\n # generate the exposure from a binomial distribution\n # with the probability of exposure dependent on the confounder\n exposure = rbinom(n, 1, p_exposure),\n # generate the \"true\" average treatment effect of 0\n # to do this, we are going to generate the potential outcomes, first\n # the potential outcome if exposure = 0\n # (notice exposure is not in the equation below, only the confounder)\n # we use rnorm(n) to add the random error term that is normally\n # distributed with a mean of 0 and a standard deviation of 1\n y0 = confounder + rnorm(n),\n # because the true effect is 0, the potential outcome if exposure = 1\n # is identical\n y1 = y0,\n # now, in practice we will only see one of these, outcome is what is\n # observed\n outcome = (1 - exposure) * y0 + exposure * y1,\n observed_potential_outcome = case_when(\n exposure == 0 ~ \"y0\",\n exposure == 1 ~ \"y1\"\n )\n)\n```\n:::\n\n\nHere we have one binary `confounder`, the probability that `confounder = 1` is `0.5`.\nThe probability of the being exposed is `0.75` for those for whom `confounder = 1` `0.25` for those for whom `confounder = 0`.\nThere is no effect of the `exposure` on the `outcome` (the true causal effect is 0); the `outcome` effect is fully dependent on the `confounder`. In this simulation we generate the potential outcomes to drive home our assumptions; many of our simulations in this book will skip this step.\nLet's look at this generated data frame.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nsim |>\n select(confounder, exposure, outcome, observed_potential_outcome)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n# A tibble: 10,000 × 4\n confounder exposure outcome observed_potential_out…¹\n \n 1 0 0 -0.804 y0 \n 2 0 0 -1.06 y0 \n 3 1 1 -0.0354 y1 \n 4 1 1 -0.186 y1 \n 5 0 0 -0.500 y0 \n 6 1 1 0.475 y1 \n 7 1 0 0.698 y0 \n 8 1 0 1.47 y0 \n 9 1 0 0.752 y0 \n10 0 0 1.26 y0 \n# ℹ 9,990 more rows\n# ℹ abbreviated name: ¹​observed_potential_outcome\n```\n\n\n:::\n:::\n\n\nGreat! Let's begin by proving to ourselves that this violates the exchangeability assumption. 
Recall from @sec-assump:\n\n> **Exchangeability**: We assume that within levels of relevant variables (confounders), exposed and unexposed subjects have an equal likelihood of experiencing any outcome prior to exposure; i.e. the exposed and unexposed subjects are exchangeable. This assumption is sometimes referred to as **no unmeasured confounding**.\n\nNow, let's try to estimate the effect of the `exposure` on the `outcome` assuming the two exposed groups are exchangeable.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nsim |>\n group_by(exposure) |>\n summarise(avg_outcome = mean(outcome))\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n# A tibble: 2 × 2\n exposure avg_outcome\n \n1 0 0.228\n2 1 0.756\n```\n\n\n:::\n:::\n\n\nThe average outcome among the exposed is `0.76` and among the unexposed `0.23`, yielding an average effect of the exposure of `0.76-0.23=0.53`.\nLet's do a little R work to get there.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nsim |>\n group_by(exposure) |>\n summarise(avg_outcome = mean(outcome)) |>\n pivot_wider(\n names_from = exposure,\n values_from = avg_outcome,\n names_prefix = \"x_\"\n ) |>\n summarise(estimate = x_1 - x_0)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n# A tibble: 1 × 1\n estimate\n \n1 0.528\n```\n\n\n:::\n:::\n\n\nOk, so assuming the exposure groups are exchangeable (and assuming the rest of the assumptions from @sec-assump hold), we estimate the effect of the exposure on the outcome to be 0.53.\nWe *know* the exchaneability assumption is violated based on how we simulated the data.\nHow can we estimate an unbiased effect?\nThe easiest way to do so is to estimate the effect within each confounder class.\nThis will work because folks with the same value of the confounder have an equal probability of exposure.\nInstead of just grouping by the exposure, let's group by the confounder as well:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nsim |>\n group_by(confounder, exposure) |>\n summarise(avg_outcome = mean(outcome))\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n# A tibble: 4 × 3\n# Groups: confounder [2]\n confounder exposure avg_outcome\n \n1 0 0 -0.0185 \n2 0 1 0.00954\n3 1 0 0.994 \n4 1 1 1.01 \n```\n\n\n:::\n:::\n\n\nWe can now calculate the average effect within each confounder class as well as the overall causal effect.\n\n\n::: {.cell}\n\n```{.r .cell-code}\n## Average effect within each confounder class\nsim |>\n group_by(confounder, exposure) |>\n summarise(avg_outcome = mean(outcome)) |>\n pivot_wider(\n names_from = exposure,\n values_from = avg_outcome,\n names_prefix = \"x_\"\n ) |>\n summarise(estimate = x_1 - x_0)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n# A tibble: 2 × 2\n confounder estimate\n \n1 0 0.0280\n2 1 0.0122\n```\n\n\n:::\n\n```{.r .cell-code}\n## Overall average effect\n\nsim |>\n group_by(confounder, exposure) |>\n summarise(avg_outcome = mean(outcome)) |>\n pivot_wider(\n names_from = exposure,\n values_from = avg_outcome,\n names_prefix = \"x_\"\n ) |>\n summarise(estimate = x_1 - x_0) |>\n summarise(mean(estimate)) # note, we would need to weight this if the confounder groups were not equal sized\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n# A tibble: 1 × 1\n `mean(estimate)`\n \n1 0.0201\n```\n\n\n:::\n:::\n\n\nGreat!\nNow our estimate is much closer to the true value (0).\n\n::: callout\nThe method we are using to solve the fact that our two groups are not exchangeable is known as **stratification**. 
We are *stratifying* by the confounder(s) and estimating the causal effect within each stratum. To get an overall average causal effect we are averaging across the strata. This can be a great tool if there are very few confounders, however it can suffer from the curse of dimensionality as the number of confounders as well as the number of levels within each confounder increases.\n:::\n\n### Two binary confounders\n\nLet's extend this to two binary confounders.\nThe simulation below now has two binary confounders.\nThe true causal effect of the `exposure` on the `outcome` is still 0.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nset.seed(1)\nn <- 10000\nsim2 <- tibble(\n confounder_1 = rbinom(n, 1, 0.5),\n confounder_2 = rbinom(n, 1, 0.5),\n p_exposure = case_when(\n confounder_1 == 1 & confounder_2 == 1 ~ 0.75,\n confounder_1 == 0 & confounder_2 == 1 ~ 0.9,\n confounder_1 == 1 & confounder_2 == 0 ~ 0.2,\n confounder_1 == 0 & confounder_2 == 0 ~ 0.1,\n ),\n exposure = rbinom(n, 1, p_exposure),\n outcome = confounder_1 + confounder_2 + rnorm(n)\n)\n```\n:::\n\n\nNow we are going to group by both confounders and estimate the causal effect within each stratum.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nsim2 |>\n group_by(confounder_1, confounder_2, exposure) |>\n summarise(avg_y = mean(outcome)) |>\n pivot_wider(\n names_from = exposure,\n values_from = avg_y,\n names_prefix = \"x_\"\n ) |>\n summarise(estimate = x_1 - x_0)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n# A tibble: 4 × 3\n# Groups: confounder_1 [2]\n confounder_1 confounder_2 estimate\n \n1 0 0 -0.122 \n2 0 1 0.0493\n3 1 0 0.0263\n4 1 1 0.0226\n```\n\n\n:::\n:::\n\n\nWe can also estimate the overall causal effect:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nsim2 |>\n group_by(confounder_1, confounder_2, exposure) |>\n summarise(avg_outcome = mean(outcome)) |>\n pivot_wider(\n names_from = exposure,\n values_from = avg_outcome,\n names_prefix = \"x_\"\n ) |>\n summarise(estimate = x_1 - x_0, .groups = \"drop\") |>\n summarise(mean(estimate))\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n# A tibble: 1 × 1\n `mean(estimate)`\n \n1 -0.00594\n```\n\n\n:::\n:::\n\n\n### Continuous confounder\n\nSo far our strategy has been to estimate the causal effect within strata where the individuals are *exchangeable*.\nHow could we extend this to a continuous confounder?\nLet's create another simulation:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nset.seed(1)\nn <- 10000\nsim3 <- tibble(\n confounder = rnorm(n),\n p_exposure = exp(confounder) / (1 + exp(confounder)),\n exposure = rbinom(n, 1, p_exposure),\n outcome = confounder + rnorm(n)\n)\n```\n:::\n\n::: {.cell}\n\n```{.r .cell-code}\nsim3\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n# A tibble: 10,000 × 4\n confounder p_exposure exposure outcome\n \n 1 -0.626 0.348 0 -0.840 \n 2 0.184 0.546 1 0.0769\n 3 -0.836 0.302 0 -1.30 \n 4 1.60 0.831 1 0.911 \n 5 0.330 0.582 1 -0.461 \n 6 -0.820 0.306 0 -1.16 \n 7 0.487 0.620 1 -0.780 \n 8 0.738 0.677 1 -0.656 \n 9 0.576 0.640 1 0.995 \n10 -0.305 0.424 0 2.91 \n# ℹ 9,990 more rows\n```\n\n\n:::\n:::\n\n\nIf we want to still use `group_by` and `summarise`, we could bin the continuous confounder, for example using it's quintiles, and estimate the causal effect within each bin:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nsim3 |>\n mutate(confounder_q = ntile(confounder, 5)) |>\n group_by(confounder_q, exposure) |>\n summarise(avg_y = mean(outcome)) |>\n pivot_wider(\n names_from = exposure,\n values_from = avg_y,\n names_prefix = \"x_\"\n ) |>\n 
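# within each quintile, take the difference in mean outcome (exposed minus unexposed)\n  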
summarise(estimate = x_1 - x_0)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n# A tibble: 5 × 2\n confounder_q estimate\n \n1 1 0.237 \n2 2 0.0622\n3 3 0.0264\n4 4 0.133 \n5 5 0.178 \n```\n\n\n:::\n\n```{.r .cell-code}\nsim3 |>\n mutate(confounder_q = ntile(confounder, 5)) |>\n group_by(confounder_q, exposure) |>\n summarise(avg_y = mean(outcome)) |>\n pivot_wider(\n names_from = exposure,\n values_from = avg_y,\n names_prefix = \"x_\"\n ) |>\n summarise(estimate = x_1 - x_0) |>\n summarise(estimate = mean(estimate))\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n# A tibble: 1 × 1\n estimate\n \n1 0.127\n```\n\n\n:::\n:::\n\n\nGreat!\nNow, in practice we have many more than one or two binary confounders, we often have many many confounders, both continuous and binary.\nAttempting to estimate the causal effect within each unique stratum would be very inefficient (and often not possible!).\nWhat if instead we could create a single summary score for all of the confounders?\nHold onto this thought, we'll bring it up again when we introduce *propensity scores*.\n\n## When do standard methods succeed and fail? {#sec-standard}\n\nWhen teaching these topics, we are often asked when \"standard methods\" will succeed, i.e.: when can we just fit a linear regression model to estimate a causal effect?\nLet's start with the easiest example: the exposure is *randomized*.\nAs discussed in @sec-rand-trials, randomization ensures comparability and can simplify the methods needed to estimate a causal effect.\nIn the presence of a randomized exposure (assuming perfect adherence to the exposure assigned, no one dropped out of the study, etc.), simple tools like regression can be used to estimate a causal effect.\n\n### When correlation is causation\n\nWhen you have no confounders and there is a linear relationship between the exposure and the outcome, that *correlation is a causal relationship*.\nEven in these cases, using the methods you will learn in this book can help.\n\n1. Adjusting for baseline covariates can make an estimate *more efficient*\n2. Propensity score weighting is *more efficient* that direct adjustment\n3. 
Sometimes we are *more comfortable with the functional form of the propensity score* (predicting exposure) than the outcome model\n\nLet's look at an example.\nI am going to simulate 100 observations.\nHere the treatment is randomly assigned and there are two baseline covariates: `age` and `weight`.\nOn average, the treatment causes a one unit increase in the outcome (this is called the *average treatment effect*, we will talk more about this quantity in future chapters).\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(tidyverse)\nset.seed(10)\nn <- 100\nd <- tibble(\n age = rnorm(n, 55, 20),\n weight = rnorm(n),\n # generate the treatment from a binomial distribution\n # with the probability of treatment = 0.5\n treatment = rbinom(n, 1, 0.5),\n # generate the average causal effect of treatment: 1\n y = 1 * treatment + 0.2 * age + 0.2 * weight + rnorm(n)\n)\n```\n:::\n\n\nWe can draw a causal diagram of the relationship described in the code above (@fig-diag).\n@sec-dags contains more information on these causal diagrams, but briefly, the arrows denote causal relationships, so since we have established that the treatment causes an increase in the outcome (an average treatment effect of 1), we see an arrow from `trt` to `y` in this diagram.\n\n\n::: {.cell}\n::: {.cell-output-display}\n![Causal Diagram of Example Randomized Study](chapter-04_files/figure-html/fig-diag-1.png){#fig-diag width=672}\n:::\n:::\n\n\nLet's examine three models: (1) an unadjusted model (@tbl-panel-1), (2) a linear model that adjusts for the baseline covariates (@tbl-panel-2), and (3) a propensity score weighted model (@tbl-panel-3).\n\n::: {#tbl-panel layout-ncol=\"2\"}\n\n::: {.cell}\n\n```{.r .cell-code code-fold=\"true\"}\nlibrary(gtsummary)\nlm(y ~ treatment, d) |>\n tbl_regression() |>\n modify_column_unhide(column = std.error) |>\n modify_caption(\"Unadjusted regression\")\n```\n\n::: {.cell-output-display}\n\n```{=html}\n
[gtsummary table: Unadjusted regression. treatment: Beta 0.93, SE 0.803, 95% CI -0.66, 2.5, p-value 0.2. SE = Standard Error, CI = Confidence Interval]
\n```\n\n:::\n:::\n\n::: {.cell}\n\n```{.r .cell-code code-fold=\"true\"}\nlm(y ~ treatment + age + weight, d) |>\n tbl_regression() |>\n modify_column_unhide(column = std.error) |>\n modify_caption(\"Adjusted regression\")\n```\n\n::: {.cell-output-display}\n\n```{=html}\n
[gtsummary table: Adjusted regression. treatment: Beta 1.0, SE 0.204, 95% CI 0.59, 1.4, p-value <0.001; age: Beta 0.20, SE 0.005, 95% CI 0.19, 0.22, p-value <0.001; weight: Beta 0.34, SE 0.106, 95% CI 0.13, 0.55, p-value 0.002. SE = Standard Error, CI = Confidence Interval]
\n```\n\n:::\n:::\n\n::: {.cell}\n\n```{.r .cell-code code-fold=\"true\"}\nd |>\n mutate(\n p = glm(treatment ~ weight + age, data = d) |> predict(type = \"response\"),\n ate = treatment / p + (1 - treatment) / (1 - p)\n ) |>\n as.data.frame() -> d\nlibrary(PSW)\ndf <- as.data.frame(d)\nx <- psw(df,\n \"treatment ~ weight + age\",\n weight = \"ATE\", wt = TRUE,\n out.var = \"y\"\n)\ntibble(\n Characteristic = \"treatment\",\n Beta = round(x$est.wt, 1),\n SE = round(x$std.wt, 3),\n `95% CI` = glue::glue(\"{round(x$est.wt - 1.96 * x$std.wt, 1)}, {round(x$est.wt + 1.96 * x$std.wt, 1)}\"),\n `p-value` = \"<0.001\"\n) |>\n knitr::kable(caption = \"Propensity score weighted regression\")\n```\n\n::: {.cell-output-display}\n\n\nTable: Propensity score weighted regression\n\n|Characteristic | Beta| SE|95% CI |p-value |\n|:--------------|----:|-----:|:--------|:-------|\n|treatment | 1| 0.202|0.6, 1.4 |<0.001 |\n\n\n:::\n:::\n\n\nThree ways to estimate the causal effect.\n:::\n\nLooking at the three outputs in @tbl-panel, we can first notice that all three are \"unbiased\" estimates of the causal effect (we know the true average treatment effect is 1, based on our simulation) -- the estimated causal effect in each table is in the `Beta` column.\nGreat, so all methods give us an unbiased estimate.\nNext, let's look at the `SE` (standard error) column along with the `95% CI` (confidence interval) column.\nNotice the unadjusted model has a *wider* confidence interval (in fact, in this case the confidence interval includes the null, 0) -- this means if we were to use this method, even though we were able to estimate an unbiased causal effect, we would often conclude that we *fail to reject the null* that relationship between the treatment and outcome is 0.\nIn statistical terms, we refer to this as a *lack of efficiency*.\nLooking at the adjusted analysis in @tbl-panel-2, we see that the standard error is quite a bit smaller (and likewise the confidence interval is tighter, no longer including the null).\nEven though our baseline covariates `age` and `weight` were not *confounders* adjusting from them *increased the precision* of our result (this is a good thing! 
We want estimates that are both unbiased *and* precise).\nFinally, looking at the propensity score weighted estimate we can see that our precision was slightly improved compared to the adjusted result (0.202 compared to 0.204).\nThe magnitude of this improvement will depend on several factors, but it has been shown mathematically that using propensity scores like this to adjust for baseline factors in a randomized trial will *always* improve precision [@williamson2014variance].\nWhat can we learn from this small demonstration?\nEven in the perfect scenario, where we can estimate unbiased results without using propensity scores, the methods we will show here can be useful.\nThe utility of these methods only increases when exploring more complex examples, such as situations where the effect is *not* randomized, the introduction of time-varying confounders, etc.\n\nWhat if we did not have a randomized exposure?\nThere are many cases where randomization to a treatment is not ethical or feasible.\nStandard methods can still estimate unbiased effects, but more care needs to be given to the previously mentioned assumptions (@tbl-assump-solved).\nFor example, we need the exposed an unexposed groups to be *exchangeable*; this means we must adjust for *all confounders* with their correct functional form.\nIf everything is simple and linear (and there is no effect heterogeneity, that is everyone's causal effect is the same regardless of their baseline factors), then a simple regression model that adjusts for the confounders can give you an unbiased result.\nLet's look at a simple example such as this.\nNotice in the simulation below, the main difference compared to the above simulation is that the probability of treatment assignment is no longer 0.5 as it was above, but now dependent on the participants `age` and `weight`. For example, maybe doctors tend to prescribe a certain treatment to patients who are older and who weigh more.\nThe true causal effect is still 1, but now we have two confounders, `age` and `weight` (@fig-diag-2).\n\n\n::: {.cell}\n\n```{.r .cell-code}\nset.seed(7)\nn <- 100000\nd <- tibble(\n age = rnorm(n, 55, 20),\n weight = rnorm(n),\n # generate the treatment from a binomial distribution\n # with the probability of treatment dependent on the age and weight\n treatment = rbinom(n, 1, 1 / (1 + exp(-0.01 * age - weight))),\n ## generate the true average causal effect of the treatment: 1\n y = 1 * treatment + 0.2 * age + 0.2 * weight + rnorm(n)\n)\n```\n:::\n\n::: {.cell}\n::: {.cell-output-display}\n![Causal Diagram of Example Observation Study](chapter-04_files/figure-html/fig-diag-2-1.png){#fig-diag-2 width=672}\n:::\n:::\n\n\n::: {#tbl-panel-2 layout-ncol=\"2\"}\n\n::: {.cell}\n\n```{.r .cell-code code-fold=\"true\"}\nlm(y ~ treatment, d) |>\n tbl_regression() |>\n modify_column_unhide(column = std.error) |>\n modify_caption(\"Unadjusted regression\")\n```\n\n::: {.cell-output-display}\n\n```{=html}\n
[gtsummary table: Unadjusted regression. treatment: Beta 1.8, SE 0.027, 95% CI 1.8, 1.9, p-value <0.001. SE = Standard Error, CI = Confidence Interval]
\n```\n\n:::\n:::\n\n::: {.cell}\n\n```{.r .cell-code code-fold=\"true\"}\nlm(y ~ treatment + age + weight, d) |>\n tbl_regression() |>\n modify_column_unhide(column = std.error) |>\n modify_caption(\"Adjusted regression\")\n```\n\n::: {.cell-output-display}\n\n```{=html}\n
[gtsummary table: Adjusted regression. treatment: Beta 0.99, SE 0.007, 95% CI 0.98, 1.0, p-value <0.001; age: Beta 0.20, SE 0.000, 95% CI 0.20, 0.20, p-value <0.001; weight: Beta 0.20, SE 0.003, 95% CI 0.20, 0.21, p-value <0.001. SE = Standard Error, CI = Confidence Interval]
\n```\n\n:::\n:::\n\n::: {.cell}\n\n```{.r .cell-code code-fold=\"true\"}\nd |>\n mutate(\n p = glm(treatment ~ weight + age, data = d, family = binomial) |> predict(type = \"response\"),\n ate = treatment / p + (1 - treatment) / (1 - p)\n ) |>\n as.data.frame() -> d\nlibrary(PSW)\ndf <- as.data.frame(d)\nx <- psw(df,\n \"treatment ~ weight + age\",\n weight = \"ATE\", wt = TRUE,\n out.var = \"y\"\n)\ntibble(\n Characteristic = \"treatment\",\n Beta = round(x$est.wt, 1),\n SE = round(x$std.wt, 3),\n `95% CI` = glue::glue(\"{round(x$est.wt - 1.96 * x$std.wt, 1)}, {round(x$est.wt + 1.96 * x$std.wt, 1)}\"),\n `p-value` = \"<0.001\"\n) |>\n knitr::kable(caption = \"Propensity score weighted regression\")\n```\n\n::: {.cell-output-display}\n\n\nTable: Propensity score weighted regression\n\n|Characteristic | Beta| SE|95% CI |p-value |\n|:--------------|----:|-----:|:------|:-------|\n|treatment | 1| 0.014|1, 1 |<0.001 |\n\n\n:::\n:::\n\n\nThree ways to estimate a causal effect in a non-randomized setting\n:::\n\nFirst, let's look at @tbl-panel-2-1.\nHere, we see that the unadjusted effect is *biased* (it differs from the true effect, 1, and the true effect is *not* contained in the reported 95% confidence interval).\nNow lets compare @tbl-panel-2-2 and @tbl-panel-2-3.\nTechnically, both are estimating unbiased causal effects.\nThe output in the `Beta` column of @tbl-panel-2-2 is technically a *conditional* effect (and often in causal inference we want marginal effects), but because there is no treatment heterogeneity in this simulation, the conditional and marginal effects are equal.\n@tbl-panel-2-3, using the propensity score, also estimates an unbiased effect, but it is no longer the most *efficient* (that was true when the baseline covariates were merely causal for `y`, now that they are `confounders` the efficiency gains for using propensity score weighting are not as clear cut).\nSo why would we ever use propensity scores in this case?\nSometimes we have a better understanding of the functional form of the propensity score model compared to the outcome model.\nAlternatively, sometimes the outcome model is difficult to fit (for example, if the outcome is rare).\n\n::: callout-tip\n## Marginal versus conditional effects\n\nIn causal inference, we are often interested in *marginal* effects, mathematically, this means that we want to *marginalize* the effect of interest across the distribution of factors in a particular population that we are trying to estimate a causal effect for.\nIn an adjusted regression model, the coefficients are *conditional*, in other words, when describing the estimated coefficient, we often say something like \"a one-unit change in the exposure results in a `coefficient` change in the outcome *holding all other variables in the model constant*. In the case where the outcome is continuous, the effect is linear, and there are no interactions between the exposure effect and other factors about the population, the distinction between an conditional and a marginal effect is largely semantic. If there *is* an interaction in the model, that is, if the exposure has a different impact on the outcome depending on some other factor, we no longer have a single coefficient to interpret. We would want to estimate a *marginal* effect, taking into account the distribution of that factor in the population of interest. Why? 
We are ultimately trying to determine whether we should suggest the exposure to the target population, so we want to know *on average* whether it will be beneficial. Let's look at quick example: suppose that you are designing an online shopping site. Currently, the\"Purchase\" button is grey. Changing the button to red increases revenue by \\$10 for people who are *not* colorblind and decreases revenue by \\$10 for those who *are* colorblind -- *the effect is heterogeneous*. Whether you change the color of the button will depend on the *distribution* of colorblind folks that visit your website. For example, if 50% of the visitors are colorblind, your average effect of changing the color would be \\$0. If instead, 100% are colorblind, the average effect of changing the color would be -\\$10. Likewise, if 0% are colorblind, the average effect of changing the color to red would be \\$10. Your decision, therefore, needs to be based on the *marginal* effect, the effect that takes into account the distribution of colorblind online customers.\n:::\n", + "markdown": "# Target Trials and Standard Methods {#sec-trials-std}\n\n\n\n\n\n\n## Randomized trials {#sec-rand-trials}\n\nA *randomized trial* is one where the exposure (cause) of interest is *randomly assigned*.\n\n::: callout-note\nIn this book, we refer to analyses where the exposure is randomly assigned as a *randomized trial*, sometimes this is called an A/B test.\n:::\n\nWhy does randomization help?\nLooking at our assumptions in @sec-assump, randomized trials solve the well defined exposure portion of consistency by default -- the exposure is exactly what is randomized.\nLikewise for positivity; if we have randomly assigned folks to either the exposed or unexposed groups, we know the probability of assignment (and we know it is not exactly 0 or 1).\nRandomization alone does not solve the interference portion of consistency (for example, if we randomize some people to receive a vaccine for a communicable disease, their receiving it could lower the chance of those around them contracting the infectious disease because it changes the probability of exposure).\n*Ideal* randomized trials resolves the issue of exchangeability because the exposed and unexposed populations (in the limit) are inherently the same since their exposure status was determined by a random process (not by any factors that might make them different from each other).\nGreat!\nIn reality, we often see this assumption violated by issues such as *drop out* or *non-adherence* in randomized trials.\nIf there is differential drop out between exposure groups (for example, if participants randomly assigned to the treatment are more likely to drop out of a study, and thus we don't observe their outcome), then the observed exposure groups are no longer *exchangeable*.\nTherefore, in @tbl-assump-solved we have two columns, one for the *ideal* randomized trial (where adherence is assumed to be perfect and no participants drop out) and one for *realistic* randomized trials where this may not be so.\n\n| Assumption | Ideal Randomized Trial | Realistic Randomized Trial | Observational Study |\n|-----------------|-----------------|---------------------|-------------------|\n| Consistency (Well defined exposure) | 😄 | 😄 | 🤷 |\n| Consistency (No interference) | 🤷 | 🤷 | 🤷 |\n| Positivity | 😄 | 😄 | 🤷 |\n| Exchangeability | 😄 | 🤷 | 🤷 |\n\n: Assumptions solved by study design. 😄 indicates it is solved by default, 🤷 indicates that it is *solvable* but not solved by default. 
{#tbl-assump-solved}\n\nWhen designing a study, the first step is asking an appropriate *causal question*. We then can map this question to a *protocol*, consisting of the following seven elements, as defined by @hernan2016using:\n\n* Eligibility criteria\n* Exposure definition\n* Assignment procedures\n* Follow-up period\n* Outcome definition\n* Causal contrast of interest\n* Analysis plan\n\nIn @tbl-protocol we map each of these elements to the corresponding assumption that it can address. For example, exchangeability can be addressed by the eligibility criteria (we can restrict our study to only participants for whom exposure assignment is exchangeable), assignment procedure (we could use random exposure assignment to ensure exchangeability), follow-up period (we can be sure to choose an appropriate start time for our follow-up period to ensure that we are not inducing bias -- we'll think more about this in a future chapter), and/or the analysis plan (we can adjust for any factors that would cause those in different exposure groups to lack exchangeability). \n\n\n\nAssumption | Eligibility Criteria | Exposure Definition| Assignment Procedures | Follow-up Period | Outcome Definition | Causal contrast | Analysis Plan\n------------|----------------- | ------------------|---------|----------|----------|--------- | -------\nConsistency (Well defined exposure) | |✔️|| | |\nConsistency (No interference) | | ✔️|✔️ | | ✔️ | | ✔️\nPositivity |✔️||✔️|| | | ✔️\nExchangeability |✔️||✔️|✔️|| | ✔️\n: Mapping assumptions to elements of a study protocol {#tbl-protocol}\n\nRecall our diagrams from @sec-diag (@fig-diagram-4); several of these protocol elements can be mapped to these diagrams when we are attempting to define our causal question.\n\n\n::: {.cell}\n::: {.cell-output-display}\n![Example diagram mapped to causal analysis terminology](chapter-04_files/figure-html/fig-diagram-4-1.png){#fig-diagram-4 width=672}\n:::\n:::\n\n\n\n## Target Trials\n\nThere are many reasons why randomization may not be possible. For example, it might not be ethical to randomly assign people to a particular exposure, there may not be funding available to run a randomized trial, or there might not be enough time to conduct a full trial. In these situations, we rely on observational data to help us answer causal questions by implementing a *target trial*. A *target trial* answers: What experiment would you design if you could? 
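\n\nTo make this concrete, here is a minimal sketch (our illustration, not code from the original analysis) of writing the seven protocol elements down as a checklist for the trial you would design; the element names come from @hernan2016using above, while the object name `target_trial_protocol` and the placeholder specifications are hypothetical.\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(tibble)\n\n# hypothetical sketch: a checklist of the seven target trial protocol elements\ntarget_trial_protocol <- tribble(\n  ~element, ~specification,\n  \"Eligibility criteria\", \"<to be specified>\",\n  \"Exposure definition\", \"<to be specified>\",\n  \"Assignment procedures\", \"<to be specified>\",\n  \"Follow-up period\", \"<to be specified>\",\n  \"Outcome definition\", \"<to be specified>\",\n  \"Causal contrast of interest\", \"<to be specified>\",\n  \"Analysis plan\", \"<to be specified>\"\n)\n\ntarget_trial_protocol\n```\n:::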
\n\n## Causal inference with `group_by()` and `summarize()` {#sec-group-sum}\n\nLet's suppose we are trying to estimate a causal effect of an exposure on an outcome, but the exposure is not *randomized*, in fact, there is a common cause of the exposure and outcome, making the exposed and unexposed groups *not exchangeable* without adjustment (violating the fourth assumption in @sec-assump).\n\n::: callout-note\nA **confounder** is a common cause of exposure and outcome.\n:::\n\n### One binary confounder\n\nLet's suppose this confounder is binary, see the simulation below:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nset.seed(1)\nn <- 10000 \nsim <- tibble(\n # generate the confounder from a binomial distribution\n # with a probability 0.5 for being in either group \n confounder = rbinom(n, 1, 0.5),\n # make the probability of exposure dependent on the \n # confounder value\n p_exposure = case_when(\n confounder == 1 ~ 0.75,\n confounder == 0 ~ 0.25\n ),\n # generate the exposure from a binomial distribution\n # with the probability of exposure dependent on the confounder\n exposure = rbinom(n, 1, p_exposure),\n # generate the \"true\" average treatment effect of 0 \n # to do this, we are going to generate the potential outcomes, first \n # the potential outcome if exposure = 0\n # (notice exposure is not in the equation below, only the confounder)\n # we use rnorm(n) to add the random error term that is normally\n # distributed with a mean of 0 and a standard deviation of 1\n y0 = confounder + rnorm(n),\n # because the true effect is 0, the potential outcome if exposure = 1\n # is identical\n y1 = y0,\n # now, in practice we will only see one of these, outcome is what is \n # observed\n outcome = (1 - exposure) * y0 + exposure * y1,\n observed_potential_outcome = case_when(\n exposure == 0 ~ \"y0\",\n exposure == 1 ~ \"y1\"\n )\n)\n```\n:::\n\n\nHere we have one binary `confounder`, the probability that `confounder = 1` is `0.5`.\nThe probability of the being exposed is `0.75` for those for whom `confounder = 1` `0.25` for those for whom `confounder = 0`.\nThere is no effect of the `exposure` on the `outcome` (the true causal effect is 0); the `outcome` effect is fully dependent on the `confounder`. In this simulation we generate the potential outcomes to drive home our assumptions; many of our simulations in this book will skip this step.\nLet's look at this generated data frame.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nsim |>\n select(confounder, exposure, outcome, observed_potential_outcome)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n# A tibble: 10,000 × 4\n confounder exposure outcome observed_potential_out…¹\n \n 1 0 0 -0.804 y0 \n 2 0 0 -1.06 y0 \n 3 1 1 -0.0354 y1 \n 4 1 1 -0.186 y1 \n 5 0 0 -0.500 y0 \n 6 1 1 0.475 y1 \n 7 1 0 0.698 y0 \n 8 1 0 1.47 y0 \n 9 1 0 0.752 y0 \n10 0 0 1.26 y0 \n# ℹ 9,990 more rows\n# ℹ abbreviated name: ¹​observed_potential_outcome\n```\n:::\n:::\n\n\nGreat! Let's begin by proving to ourselves that this violates the exchangeability assumption. Recall from @sec-assump:\n\n> **Exchangeability**: We assume that within levels of relevant variables (confounders), exposed and unexposed subjects have an equal likelihood of experiencing any outcome prior to exposure; i.e. the exposed and unexposed subjects are exchangeable. 
This assumption is sometimes referred to as **no unmeasured confounding**.\n\nNow, let's try to estimate the effect of the `exposure` on the `outcome` assuming the two exposed groups are exchangeable.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nsim |>\n group_by(exposure) |>\n summarise(avg_outcome = mean(outcome))\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n# A tibble: 2 × 2\n exposure avg_outcome\n \n1 0 0.228\n2 1 0.756\n```\n:::\n:::\n\n\nThe average outcome among the exposed is `0.76` and among the unexposed `0.23`, yielding an average effect of the exposure of `0.76-0.23=0.53`.\nLet's do a little R work to get there.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nsim |>\n group_by(exposure) |>\n summarise(avg_outcome = mean(outcome)) |>\n pivot_wider(\n names_from = exposure, \n values_from = avg_outcome, \n names_prefix = \"x_\"\n ) |>\n summarise(estimate = x_1 - x_0) \n```\n\n::: {.cell-output .cell-output-stdout}\n```\n# A tibble: 1 × 1\n estimate\n \n1 0.528\n```\n:::\n:::\n\n\nOk, so assuming the exposure groups are exchangeable (and assuming the rest of the assumptions from @sec-assump hold), we estimate the effect of the exposure on the outcome to be 0.53.\nWe *know* the exchaneability assumption is violated based on how we simulated the data.\nHow can we estimate an unbiased effect?\nThe easiest way to do so is to estimate the effect within each confounder class.\nThis will work because folks with the same value of the confounder have an equal probability of exposure.\nInstead of just grouping by the exposure, let's group by the confounder as well:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nsim |>\n group_by(confounder, exposure) |>\n summarise(avg_outcome = mean(outcome))\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n# A tibble: 4 × 3\n# Groups: confounder [2]\n confounder exposure avg_outcome\n \n1 0 0 -0.0185 \n2 0 1 0.00954\n3 1 0 0.994 \n4 1 1 1.01 \n```\n:::\n:::\n\n\nWe can now calculate the average effect within each confounder class as well as the overall causal effect.\n\n\n::: {.cell}\n\n```{.r .cell-code}\n## Average effect within each confounder class\nsim |>\n group_by(confounder, exposure) |>\n summarise(avg_outcome = mean(outcome)) |>\n pivot_wider(\n names_from = exposure, \n values_from = avg_outcome, \n names_prefix = \"x_\"\n ) |>\n summarise(estimate = x_1 - x_0) \n```\n\n::: {.cell-output .cell-output-stdout}\n```\n# A tibble: 2 × 2\n confounder estimate\n \n1 0 0.0280\n2 1 0.0122\n```\n:::\n\n```{.r .cell-code}\n## Overall average effect\n\nsim |>\n group_by(confounder, exposure) |>\n summarise(avg_outcome = mean(outcome)) |>\n pivot_wider(\n names_from = exposure, \n values_from = avg_outcome, \n names_prefix = \"x_\"\n ) |>\n summarise(estimate = x_1 - x_0) |>\n summarise(mean(estimate)) # note, we would need to weight this if the confounder groups were not equal sized\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n# A tibble: 1 × 1\n `mean(estimate)`\n \n1 0.0201\n```\n:::\n:::\n\n\nGreat!\nNow our estimate is much closer to the true value (0).\n\n::: callout\nThe method we are using to solve the fact that our two groups are not exchangeable is known as **stratification**. We are *stratifying* by the confounder(s) and estimating the causal effect within each stratum. To get an overall average causal effect we are averaging across the strata. 
This can be a great tool if there are very few confounders, however it can suffer from the curse of dimensionality as the number of confounders as well as the number of levels within each confounder increases.\n:::\n\n### Two binary confounders\n\nLet's extend this to two binary confounders.\nThe simulation below now has two binary confounders.\nThe true causal effect of the `exposure` on the `outcome` is still 0.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nset.seed(1)\nn <- 10000\nsim2 <- tibble(\n confounder_1 = rbinom(n, 1, 0.5),\n confounder_2 = rbinom(n, 1, 0.5), \n \n p_exposure = case_when(\n confounder_1 == 1 & confounder_2 == 1 ~ 0.75,\n confounder_1 == 0 & confounder_2 == 1 ~ 0.9,\n confounder_1 == 1 & confounder_2 == 0 ~ 0.2,\n confounder_1 == 0 & confounder_2 == 0 ~ 0.1,\n ),\n exposure = rbinom(n, 1, p_exposure),\n outcome = confounder_1 + confounder_2 + rnorm(n) \n)\n```\n:::\n\n\nNow we are going to group by both confounders and estimate the causal effect within each stratum.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nsim2 |>\n group_by(confounder_1, confounder_2, exposure) |>\n summarise(avg_y = mean(outcome)) |>\n pivot_wider(\n names_from = exposure,\n values_from = avg_y, \n names_prefix = \"x_\"\n ) |>\n summarise(estimate = x_1 - x_0) \n```\n\n::: {.cell-output .cell-output-stdout}\n```\n# A tibble: 4 × 3\n# Groups: confounder_1 [2]\n confounder_1 confounder_2 estimate\n \n1 0 0 -0.122 \n2 0 1 0.0493\n3 1 0 0.0263\n4 1 1 0.0226\n```\n:::\n:::\n\n\nWe can also estimate the overall causal effect:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nsim2 |>\n group_by(confounder_1, confounder_2, exposure) |>\n summarise(avg_outcome = mean(outcome)) |>\n pivot_wider(\n names_from = exposure, \n values_from = avg_outcome, \n names_prefix = \"x_\"\n ) |>\n summarise(estimate = x_1 - x_0, .groups = \"drop\") |>\n summarise(mean(estimate)) \n```\n\n::: {.cell-output .cell-output-stdout}\n```\n# A tibble: 1 × 1\n `mean(estimate)`\n \n1 -0.00594\n```\n:::\n:::\n\n\n### Continuous confounder\n\nSo far our strategy has been to estimate the causal effect within strata where the individuals are *exchangeable*.\nHow could we extend this to a continuous confounder?\nLet's create another simulation:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nset.seed(1)\nn <- 10000 \nsim3 <- tibble(\n confounder = rnorm(n), \n p_exposure = exp(confounder) / (1 + exp(confounder)),\n exposure = rbinom(n, 1, p_exposure),\n outcome = confounder + rnorm(n) \n)\n```\n:::\n\n::: {.cell}\n\n```{.r .cell-code}\nsim3\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n# A tibble: 10,000 × 4\n confounder p_exposure exposure outcome\n \n 1 -0.626 0.348 0 -0.840 \n 2 0.184 0.546 1 0.0769\n 3 -0.836 0.302 0 -1.30 \n 4 1.60 0.831 1 0.911 \n 5 0.330 0.582 1 -0.461 \n 6 -0.820 0.306 0 -1.16 \n 7 0.487 0.620 1 -0.780 \n 8 0.738 0.677 1 -0.656 \n 9 0.576 0.640 1 0.995 \n10 -0.305 0.424 0 2.91 \n# ℹ 9,990 more rows\n```\n:::\n:::\n\n\nIf we want to still use `group_by` and `summarise`, we could bin the continuous confounder, for example using it's quintiles, and estimate the causal effect within each bin:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nsim3 |>\n mutate(confounder_q = ntile(confounder, 5)) |>\n group_by(confounder_q, exposure) |>\n summarise(avg_y = mean(outcome)) |>\n pivot_wider(\n names_from = exposure,\n values_from = avg_y, \n names_prefix = \"x_\"\n ) |>\n summarise(estimate = x_1 - x_0) \n```\n\n::: {.cell-output .cell-output-stdout}\n```\n# A tibble: 5 × 2\n confounder_q estimate\n \n1 1 0.237 \n2 2 0.0622\n3 3 0.0264\n4 4 
0.133 \n5 5 0.178 \n```\n:::\n\n```{.r .cell-code}\nsim3 |>\n mutate(confounder_q = ntile(confounder, 5)) |>\n group_by(confounder_q, exposure) |>\n summarise(avg_y = mean(outcome)) |>\n pivot_wider(\n names_from = exposure,\n values_from = avg_y, \n names_prefix = \"x_\"\n ) |>\n summarise(estimate = x_1 - x_0) |>\n summarise(estimate = mean(estimate))\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n# A tibble: 1 × 1\n estimate\n \n1 0.127\n```\n:::\n:::\n\n\nGreat!\nNow, in practice we have many more than one or two binary confounders, we often have many many confounders, both continuous and binary.\nAttempting to estimate the causal effect within each unique stratum would be very inefficient (and often not possible!).\nWhat if instead we could create a single summary score for all of the confounders?\nHold onto this thought, we'll bring it up again when we introduce *propensity scores*.\n\n## When do standard methods succeed and fail? {#sec-standard}\n\nWhen teaching these topics, we are often asked when \"standard methods\" will succeed, i.e.: when can we just fit a linear regression model to estimate a causal effect?\nLet's start with the easiest example: the exposure is *randomized*.\nAs discussed in @sec-rand-trials, randomization ensures comparability and can simplify the methods needed to estimate a causal effect.\nIn the presence of a randomized exposure (assuming perfect adherence to the exposure assigned, no one dropped out of the study, etc.), simple tools like regression can be used to estimate a causal effect.\n\n### When correlation is causation\n\nWhen you have no confounders and there is a linear relationship between the exposure and the outcome, that *correlation is a causal relationship*.\nEven in these cases, using the methods you will learn in this book can help.\n\n1. Adjusting for baseline covariates can make an estimate *more efficient*\n2. Propensity score weighting is *more efficient* that direct adjustment\n3. 
Sometimes we are *more comfortable with the functional form of the propensity score* (predicting exposure) than the outcome model\n\nLet's look at an example.\nI am going to simulate 100 observations.\nHere the treatment is randomly assigned and there are two baseline covariates: `age` and `weight`.\nOn average, the treatment causes a one unit increase in the outcome (this is called the *average treatment effect*, we will talk more about this quantity in future chapters).\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(tidyverse)\nset.seed(10)\nn <- 100\nd <- tibble(\n age = rnorm(n, 55, 20),\n weight = rnorm(n),\n # generate the treatment from a binomial distribution\n # with the probability of treatment = 0.5\n treatment = rbinom(n, 1, 0.5),\n # generate the average causal effect of treatment: 1\n y = 1 * treatment + 0.2 * age + 0.2 * weight + rnorm(n)\n)\n```\n:::\n\n\nWe can draw a causal diagram of the relationship described in the code above (@fig-diag).\n@sec-dags contains more information on these causal diagrams, but briefly, the arrows denote causal relationships, so since we have established that the treatment causes an increase in the outcome (an average treatment effect of 1), we see an arrow from `trt` to `y` in this diagram.\n\n\n::: {.cell}\n::: {.cell-output-display}\n![Causal Diagram of Example Randomized Study](chapter-04_files/figure-html/fig-diag-1.png){#fig-diag width=672}\n:::\n:::\n\n\nLet's examine three models: (1) an unadjusted model (@tbl-panel-1), (2) a linear model that adjusts for the baseline covariates (@tbl-panel-2), and (3) a propensity score weighted model (@tbl-panel-3).\n\n::: {#tbl-panel layout-ncol=\"2\"}\n\n::: {.cell}\n\n```{.r .cell-code code-fold=\"true\"}\nlibrary(gtsummary)\nlm(y ~ treatment, d) |>\n tbl_regression() |>\n modify_column_unhide(column = std.error) |>\n modify_caption(\"Unadjusted regression\")\n```\n\n::: {.cell-output-display}\n```{=html}\n
[gtsummary table: Unadjusted regression. treatment: Beta 0.93, SE 0.803, 95% CI -0.66, 2.5, p-value 0.2. SE = Standard Error, CI = Confidence Interval]
\n```\n:::\n:::\n\n::: {.cell}\n\n```{.r .cell-code code-fold=\"true\"}\nlm(y ~ treatment + age + weight, d) |>\n tbl_regression() |>\n modify_column_unhide(column = std.error) |>\n modify_caption(\"Adjusted regression\")\n```\n\n::: {.cell-output-display}\n```{=html}\n
[gtsummary table: Adjusted regression. treatment: Beta 1.0, SE 0.204, 95% CI 0.59, 1.4, p-value <0.001; age: Beta 0.20, SE 0.005, 95% CI 0.19, 0.22, p-value <0.001; weight: Beta 0.34, SE 0.106, 95% CI 0.13, 0.55, p-value 0.002. SE = Standard Error, CI = Confidence Interval]
\n```\n:::\n:::\n\n::: {.cell}\n\n```{.r .cell-code code-fold=\"true\"}\nd |>\n mutate(\n p = glm(treatment ~ weight + age, data = d) |> predict(type = \"response\"),\n ate = treatment / p + (1 - treatment) / (1 - p)\n ) |>\n as.data.frame() -> d\nlibrary(PSW)\ndf <- as.data.frame(d)\nx <- psw(df, \n \"treatment ~ weight + age\", \n weight = \"ATE\", wt = TRUE,\n out.var = \"y\")\ntibble(\n Characteristic = \"treatment\",\n Beta = round(x$est.wt, 1),\n SE = round(x$std.wt, 3),\n `95% CI` = glue::glue(\"{round(x$est.wt - 1.96 * x$std.wt, 1)}, {round(x$est.wt + 1.96 * x$std.wt, 1)}\"),\n `p-value` = \"<0.001\"\n) |>\n knitr::kable(caption = \"Propensity score weighted regression\")\n```\n\n::: {.cell-output-display}\nTable: Propensity score weighted regression\n\n|Characteristic | Beta| SE|95% CI |p-value |\n|:--------------|----:|-----:|:--------|:-------|\n|treatment | 1| 0.202|0.6, 1.4 |<0.001 |\n:::\n:::\n\n\nThree ways to estimate the causal effect.\n:::\n\nLooking at the three outputs in @tbl-panel, we can first notice that all three are \"unbiased\" estimates of the causal effect (we know the true average treatment effect is 1, based on our simulation) -- the estimated causal effect in each table is in the `Beta` column.\nGreat, so all methods give us an unbiased estimate.\nNext, let's look at the `SE` (standard error) column along with the `95% CI` (confidence interval) column.\nNotice the unadjusted model has a *wider* confidence interval (in fact, in this case the confidence interval includes the null, 0) -- this means if we were to use this method, even though we were able to estimate an unbiased causal effect, we would often conclude that we *fail to reject the null* that relationship between the treatment and outcome is 0.\nIn statistical terms, we refer to this as a *lack of efficiency*.\nLooking at the adjusted analysis in @tbl-panel-2, we see that the standard error is quite a bit smaller (and likewise the confidence interval is tighter, no longer including the null).\nEven though our baseline covariates `age` and `weight` were not *confounders* adjusting from them *increased the precision* of our result (this is a good thing! 
We want estimates that are both unbiased *and* precise).\nFinally, looking at the propensity score weighted estimate we can see that our precision was slightly improved compared to the adjusted result (0.202 compared to 0.204).\nThe magnitude of this improvement will depend on several factors, but it has been shown mathematically that using propensity scores like this to adjust for baseline factors in a randomized trial will *always* improve precision [@williamson2014variance].\nWhat can we learn from this small demonstration?\nEven in the perfect scenario, where we can estimate unbiased results without using propensity scores, the methods we will show here can be useful.\nThe utility of these methods only increases when exploring more complex examples, such as situations where the effect is *not* randomized, the introduction of time-varying confounders, etc.\n\nWhat if we did not have a randomized exposure?\nThere are many cases where randomization to a treatment is not ethical or feasible.\nStandard methods can still estimate unbiased effects, but more care needs to be given to the previously mentioned assumptions (@tbl-assump-solved).\nFor example, we need the exposed an unexposed groups to be *exchangeable*; this means we must adjust for *all confounders* with their correct functional form.\nIf everything is simple and linear (and there is no effect heterogeneity, that is everyone's causal effect is the same regardless of their baseline factors), then a simple regression model that adjusts for the confounders can give you an unbiased result.\nLet's look at a simple example such as this.\nNotice in the simulation below, the main difference compared to the above simulation is that the probability of treatment assignment is no longer 0.5 as it was above, but now dependent on the participants `age` and `weight`. For example, maybe doctors tend to prescribe a certain treatment to patients who are older and who weigh more.\nThe true causal effect is still 1, but now we have two confounders, `age` and `weight` (@fig-diag-2).\n\n\n::: {.cell}\n\n```{.r .cell-code}\nset.seed(7)\nn <- 100000\nd <- tibble(\n age = rnorm(n, 55, 20),\n weight = rnorm(n),\n # generate the treatment from a binomial distribution\n # with the probability of treatment dependent on the age and weight\n treatment = rbinom(n, 1, 1 / (1 + exp(-0.01 * age - weight))),\n ## generate the true average causal effect of the treatment: 1\n y = 1 * treatment + 0.2 * age + 0.2 * weight + rnorm(n)\n)\n```\n:::\n\n::: {.cell}\n::: {.cell-output-display}\n![Causal Diagram of Example Observation Study](chapter-04_files/figure-html/fig-diag-2-1.png){#fig-diag-2 width=672}\n:::\n:::\n\n\n::: {#tbl-panel-2 layout-ncol=\"2\"}\n\n::: {.cell}\n\n```{.r .cell-code code-fold=\"true\"}\nlm(y ~ treatment, d) |>\n tbl_regression() |>\n modify_column_unhide(column = std.error) |>\n modify_caption(\"Unadjusted regression\")\n```\n\n::: {.cell-output-display}\n```{=html}\n
[gtsummary table: Unadjusted regression. treatment: Beta 1.8, SE 0.027, 95% CI 1.8, 1.9, p-value <0.001. SE = Standard Error, CI = Confidence Interval]
\n```\n:::\n:::\n\n::: {.cell}\n\n```{.r .cell-code code-fold=\"true\"}\nlm(y ~ treatment + age + weight, d) |>\n tbl_regression() |>\n modify_column_unhide(column = std.error) |>\n modify_caption(\"Adjusted regression\")\n```\n\n::: {.cell-output-display}\n```{=html}\n
[gtsummary table: Adjusted regression. treatment: Beta 0.99, SE 0.007, 95% CI 0.98, 1.0, p-value <0.001; age: Beta 0.20, SE 0.000, 95% CI 0.20, 0.20, p-value <0.001; weight: Beta 0.20, SE 0.003, 95% CI 0.20, 0.21, p-value <0.001. SE = Standard Error, CI = Confidence Interval]
\n```\n:::\n:::\n\n::: {.cell}\n\n```{.r .cell-code code-fold=\"true\"}\nd |>\n mutate(\n p = glm(treatment ~ weight + age, data = d, family = binomial) |> predict(type = \"response\"),\n ate = treatment / p + (1 - treatment) / (1 - p)\n ) |>\n as.data.frame() -> d\nlibrary(PSW)\ndf <- as.data.frame(d)\nx <- psw(df, \n \"treatment ~ weight + age\", \n weight = \"ATE\", wt = TRUE,\n out.var = \"y\")\ntibble(\n Characteristic = \"treatment\",\n Beta = round(x$est.wt, 1),\n SE = round(x$std.wt, 3),\n `95% CI` = glue::glue(\"{round(x$est.wt - 1.96 * x$std.wt, 1)}, {round(x$est.wt + 1.96 * x$std.wt, 1)}\"),\n `p-value` = \"<0.001\"\n) |>\n knitr::kable(caption = \"Propensity score weighted regression\")\n```\n\n::: {.cell-output-display}\nTable: Propensity score weighted regression\n\n|Characteristic | Beta| SE|95% CI |p-value |\n|:--------------|----:|-----:|:------|:-------|\n|treatment | 1| 0.014|1, 1 |<0.001 |\n:::\n:::\n\n\nThree ways to estimate a causal effect in a non-randomized setting\n:::\n\nFirst, let's look at @tbl-panel-2-1.\nHere, we see that the unadjusted effect is *biased* (it differs from the true effect, 1, and the true effect is *not* contained in the reported 95% confidence interval).\nNow lets compare @tbl-panel-2-2 and @tbl-panel-2-3.\nTechnically, both are estimating unbiased causal effects.\nThe output in the `Beta` column of @tbl-panel-2-2 is technically a *conditional* effect (and often in causal inference we want marginal effects), but because there is no treatment heterogeneity in this simulation, the conditional and marginal effects are equal.\n@tbl-panel-2-3, using the propensity score, also estimates an unbiased effect, but it is no longer the most *efficient* (that was true when the baseline covariates were merely causal for `y`, now that they are `confounders` the efficiency gains for using propensity score weighting are not as clear cut).\nSo why would we ever use propensity scores in this case?\nSometimes we have a better understanding of the functional form of the propensity score model compared to the outcome model.\nAlternatively, sometimes the outcome model is difficult to fit (for example, if the outcome is rare).\n\n::: callout-tip\n## Marginal versus conditional effects\n\nIn causal inference, we are often interested in *marginal* effects, mathematically, this means that we want to *marginalize* the effect of interest across the distribution of factors in a particular population that we are trying to estimate a causal effect for.\nIn an adjusted regression model, the coefficients are *conditional*, in other words, when describing the estimated coefficient, we often say something like \"a one-unit change in the exposure results in a `coefficient` change in the outcome *holding all other variables in the model constant*. In the case where the outcome is continuous, the effect is linear, and there are no interactions between the exposure effect and other factors about the population, the distinction between an conditional and a marginal effect is largely semantic. If there *is* an interaction in the model, that is, if the exposure has a different impact on the outcome depending on some other factor, we no longer have a single coefficient to interpret. We would want to estimate a *marginal* effect, taking into account the distribution of that factor in the population of interest. Why? We are ultimately trying to determine whether we should suggest the exposure to the target population, so we want to know *on average* whether it will be beneficial. 
Let's look at quick example: suppose that you are designing an online shopping site. Currently, the\"Purchase\" button is grey. Changing the button to red increases revenue by \\$10 for people who are *not* colorblind and decreases revenue by \\$10 for those who *are* colorblind -- *the effect is heterogeneous*. Whether you change the color of the button will depend on the *distribution* of colorblind folks that visit your website. For example, if 50% of the visitors are colorblind, your average effect of changing the color would be \\$0. If instead, 100% are colorblind, the average effect of changing the color would be -\\$10. Likewise, if 0% are colorblind, the average effect of changing the color to red would be \\$10. Your decision, therefore, needs to be based on the *marginal* effect, the effect that takes into account the distribution of colorblind online customers.\n:::\n", "supporting": [ "chapter-04_files" ], diff --git a/_freeze/chapters/chapter-05/execute-results/html.json b/_freeze/chapters/chapter-05/execute-results/html.json index 9c21544..ae63b20 100644 --- a/_freeze/chapters/chapter-05/execute-results/html.json +++ b/_freeze/chapters/chapter-05/execute-results/html.json @@ -1,7 +1,7 @@ { - "hash": "7a1a06427dcc094e0606c80041bcc449", + "hash": "fb7b3693149671381024a90b5463dc2e", "result": { - "markdown": "# Expressing causal questions as DAGs {#sec-dags}\n\n\n\n\n\n## Visualizing Causal Assumptions\n\n> Draw your assumptions before your conclusions --@hernan2021\n\nCausal diagrams are a tool to visualize your assumptions about the causal structure of the questions you're trying to answer.\nIn a randomized experiment, the causal structure is quite simple.\nWhile there may be many causes of an outcome, the only cause of the exposure is the randomization process itself (we hope!).\nIn many non-randomized settings, however, the structure of your question can be a complex web of causality.\nCausal diagrams help communicate what we think this structure looks like.\nIn addition to being open about what we think the causal structure is, causal diagrams have incredible mathematical properties that allow us to identify a way to estimate unbiased causal effects even with observational data.\n\nCausal diagrams are also increasingly common.\nData collected as a review of causal diagrams in applied health research papers show a drastic increase in use over time [@Tennant2021].\n\n\n::: {.cell}\n::: {.cell-output-display}\n![Percentage of health research papers using causal diagrams over time.](chapter-05_files/figure-html/fig-dag-usage-1.png){#fig-dag-usage width=672}\n:::\n:::\n\n\nThe type of causal diagrams we use are also called directed acyclic graphs (DAGs)[^chapter-05-1].\nThese graphs are directed because they include arrows going in a specific direction.\nThey're acyclic because they don't go in circles; a variable can't cause itself, for instance.\nDAGs are used for various problems, but we're specifically concerned with *causal* DAGs.\nThis class of DAGs is sometimes called Structural Causal Models (SCMs) because they are a model of the causal structure of a question [@hernan2021; @Pearl_Glymour_Jewell_2021].\n\n[^chapter-05-1]: An essential but rarely observed detail of DAGs is that dag is also an [affectionate Australian insult](https://en.wikipedia.org/wiki/Dag_(slang)) referring to the dung-caked fur of a sheep, a *daglock*.\n\nDAGs depict causal relationships between variables.\nVisually, the way they depict variables is as *edges* and *nodes*.\nEdges are the arrows going 
from one variable to another, sometimes called arcs or just arrows.\nNodes are the variables themselves, sometimes called vertices, points, or just variables.\nIn @fig-dag-basic, there are two nodes, `x` and `y`, and one edge going from `x` to `y`.\nHere, we are saying that `x` causes `y`.\n`y` \"listens\" to `x` [@Pearl_Glymour_Jewell_2021].\n\n\n::: {.cell}\n::: {.cell-output-display}\n![A causal directed acyclic graph (DAG). DAGs depict causal relationships. In this DAG, the assumption is that `x` causes `y`.](chapter-05_files/figure-html/fig-dag-basic-1.png){#fig-dag-basic width=288}\n:::\n:::\n\n\nIf we're interested in the causal effect of `x` on `y`, we're trying to estimate a numeric representation of that arrow.\nUsually, though, there are many other variables and arrows in the causal structure of a given question.\nA series of arrows is called a *path*.\nThere are three types of paths you'll see in DAGs: forks, chains, and colliders (sometimes called inverse forks).\n\n\n::: {.cell}\n::: {.cell-output-display}\n![Three types of causal relationships: forks, chains, and colliders. The direction of the arrows and the relationships of interest dictate which type of path a series of variables is. Forks represent a mutual cause, chains represent direct causes, and colliders represent a mutual descendant.](chapter-05_files/figure-html/fig-dag-path-types-1.png){#fig-dag-path-types width=672}\n:::\n:::\n\n\nForks represent a common cause of two variables.\nHere, we're saying that `q` causes both `x` and `y`, the traditional definition of a confounder.\nThey're called forks because the arrows from `x` to `y` are in different directions.\nChains, on the other hand, represent a series of arrows going in the same direction.\nHere, `q` is called a *mediator*: it is along the causal path from `x` to `y`.\nIn this diagram, the only path from `x` to `y` is mediated through `q`.\nFinally, a collider is a path where two arrowheads meet at a variable.\nBecause causality always goes forward in time, this naturally means that the collider variable is caused by two other variables.\nHere, we're saying that `x` and `y` both cause `q`.\n\n::: callout-tip\n## Are DAGs SEMs?\n\nIf you're familiar with structural equation models (SEMs), a modeling technique commonly used in psychology and other social science settings, you may notice some similarities between SEMs and DAGs.\nDAGs are a form of *non-parametric* SEM.\nSEMs estimate entire graphs using parametric assumptions.\nCausal DAGs, on the other hand, don't estimate anything; an arrow going from one variable to another says nothing about the strength or functional form of that relationship, only that we think it exists.\n:::\n\nOne of the significant benefits of DAGs is that they help us identify sources of bias and, often, provide clues on how to address them.\nHowever, talking about an unbiased effect estimate only makes sense when we have a specific causal question in mind.\nSince each arrow represents a cause, it's causality all the way down; no individual arrow is inherently problematic.\nHere, we're interested in the effect of `x` on `y`.\nThis question defines which paths we're interested in and which we're not.\n\nThese three types of paths have different implications for the statistical relationship between `x` and `y`.\nIf we only look at the correlation between the two variables under these assumptions:\n\n1. In the fork, `x` and `y` will be associated, despite there being no arrow from `x` to `y`.\n2. 
In the chain, `x` and `y` are related only through `q`.\n3. In the collider, `x` and `y` will *not* be related.\n\nPaths that transmit association are called *open paths*.\nPaths that do not transmit association are called *closed paths*.\nForks and chains are open, while colliders are closed.\n\nSo, should we adjust for `q`?\nThat depends on the nature of the path.\nForks are confounding paths.\nBecause `q` causes both `x` and `y`, `x` and `y` will have a spurious association.\nThey both contain information from `q`, their mutual cause.\nThat mutual causal relationship makes `x` and `y` associated statistically.\nAdjusting for `q` will *block* the bias from confounding and give us the true relationship between `x` and `y`.\n\n::: callout-tip\n## Adjustment\n\nWe can use a variety of techniques to account for a variable.\nWe use the term \"adjustment\" or \"controlling for\" to refer to any technique that removes the effect of variables we're not interested in.\n:::\n\n@fig-confounder-scatter depicts this effect visually.\nHere, `x` and `y` are continuous, and by definition of the DAG, they are unrelated.\n`q`, however, causes both.\nThe unadjusted effect is biased because it includes information about the open path from `x` to `y` via `q`.\nWithin levels of `q`, however, `x` and `y` are unrelated.\n\n\n::: {.cell}\n::: {.cell-output-display}\n![Two scatterplots of the relationship between `x` and `y`. With forks, the relationship is biased by `q`. When accounting for `q`, we see the true null relationship.](chapter-05_files/figure-html/fig-confounder-scatter-1.png){#fig-confounder-scatter width=672}\n:::\n:::\n\n\nFor chains, whether or not we adjust for mediators depends on the research question.\nHere, adjusting for `q` would result in a null estimate of the effect of `x` on `y`.\nBecause the only effect of `x` on `y` is via `q`, no other effect remains.\nThe effect of `x` on `y` mediated by `q` is called the *indirect* effect, while the effect of `x` on `y` directly is called the *direct* effect.\nIf we're only interested in the direct effect, controlling for `q` might be what we want.\nIf we want to know about both effects, we shouldn't try to adjust for `q`.\nWe'll learn more about estimating these and other mediation effects in @sec-mediation.\n\n@fig-mediator-scatter shows this effect visually.\nThe unadjusted effect of `x` on `y` represents the total effect.\nSince the total effect is due entirely to the path mediated by `q`, when we adjust for `q`, no relationship remains.\nThis null effect is the direct effect.\nNeither of these effects is due to bias, but each answers a different research question.\n\n\n::: {.cell}\n::: {.cell-output-display}\n![Two scatterplots of the relationship between `x` and `y`. With chains, whether and how we should account for `q` depends on the research question. Without doing so, we see the impact of the total effect of `x` and `y`, including the indirect effect via `q`. 
When accounting for `q`, we see the direct (null) effect of `x` on `y`.](chapter-05_files/figure-html/fig-mediator-scatter-1.png){#fig-mediator-scatter width=672}\n:::\n:::\n\n\nColliders are different.\nIn the collider DAG of @fig-dag-path-types, `x` and `y` are *not* associated, but both cause `q`.\nAdjusting for `q` has the opposite effect than with confounding: it *opens* a biasing pathway.\nSometimes, people draw the path opened up by conditioning on a collider connecting `x` and `y`.\n\nVisually, we can see this happen when `x` and `y` are continuous and `q` is binary.\nIn @fig-collider-scatter, when we don't include `q`, we find no relationship between `x` and `y`.\nThat's the correct result.\nHowever, when we include `q`, we can detect information about both `x` and `y`, and they appear correlated: across levels of `x`, those with `q = 0` have lower levels of `y`.\nAssociation seemingly flows back in time.\nOf course, that can't happen from a causal perspective, so controlling for `q` is the wrong thing to do.\nWe end up with a biased effect of `x` on `y`.\n\n\n::: {.cell}\n::: {.cell-output-display}\n![Two scatterplots of the relationship between `x` and `y`. The unadjusted relationship between the two is unbiased. When accounting for `q`, we open a colliding backdoor path and bias the relationship between `x` and `y`.](chapter-05_files/figure-html/fig-collider-scatter-1.png){#fig-collider-scatter width=672}\n:::\n:::\n\n\nHow can this be?\nSince `x` and `y` happen before `q`, `q` can't impact them.\nLet's turn the DAG on its side and consider @fig-collider-time.\nIf we break down the two time points, at time point 1, `q` hasn't happened yet, and `x` and `y` are unrelated.\nAt time point 2, `q` happens due to `x` and `y`.\n*But causality only goes forward in time*.\n`q` happening later can't change the fact that `x` and `y` happened independently in the past.\n\n\n::: {.cell}\n::: {.cell-output-display}\n![A collider relationship over two points in time. At time point one, there is no relationship between `x` and `y`. 
Both cause `q` by time point two, but this does not change what already happened at time point one.](chapter-05_files/figure-html/fig-collider-time-1.png){#fig-collider-time width=672}\n:::\n:::\n\n\nCausality only goes forward.\nAssociation, however, is time-agnostic.\nIt's just an observation about the numerical relationships between variables.\nWhen we control for the future, we risk introducing bias.\nIt takes time to develop an intuition for this.\nConsider a case where `x` and `y` are the only causes of `q`, and all three variables are binary.\nWhen *either* `x` or `y` equals 1, then `q` happens.\nIf we know `q = 1` and `x = 0` then logically it must be that `y = 1`.\nThus, knowing about `q` gives us information about `y` via `x`.\nThis example is extreme, but it shows how this type of bias, sometimes called *collider-stratification bias* or *selection bias*, occurs: conditioning on `q` provides statistical information about `x` and `y` and distorts their relationship [@Banack2023].\n\n::: callout-tip\n## Exchangeability revisited\n\nWe commonly refer to exchangeability as the assumption of no confounding.\nActually, this isn't quite right.\nIt's the assumption of no *open, non-causal* paths [@hernan2021].\nMany times, these are confounding pathways.\nHowever, conditioning on a collider can also open paths.\nEven though these aren't confounders, doing so creates non-exchangeability between the two groups: they are different in a way that matters to the exposure and outcome.\n\nOpen, non-causal paths are also called *backdoor paths*.\nWe'll use this terminology often because it captures the idea well: these are any open paths biasing the effect we're interested in estimating.\n:::\n\nCorrectly identifying the causal structure between the exposure and outcome thus helps us 1) communicate the assumptions we're making about the relationships between variables and 2) identify sources of bias.\nImportantly, in doing 2), we are also often able to identify ways to prevent bias based on the assumptions in 1).\nIn the simple case of the three DAGs in @fig-dag-path-types, we know whether or not to control for `q` depending on the nature of the causal structure.\nThe set or sets of variables we need to adjust for is called the *adjustment set*.\nDAGs can help us identify adjustment sets even in complex settings [@vanderzander2019].\n\n::: callout-tip\n## What about interaction?\n\nDAGs don't make a statement about interaction or effect estimate modification, even though they are an important part of inference.\nTechnically, interaction is a matter of the functional form of the relationships in the DAG.\nMuch as we don't need to specify how we will model a variable in the DAG (e.g., with splines), we don't need to determine how variables statistically interact.\nThat's a matter for the modeling stage.\n\nThere are several ways we use interactions in causal inference.\nIn one extreme, they are simply a matter of functional form: interaction terms are included in models but marginalized to get an overall causal effect.\nAt the other extreme, we're interested in *joint causal effects*, where the two variables interacting are both causal.\nIn between, we can use interaction terms to identify *heterogeneous causal effects*, which vary by a second variable that is not assumed to be causal.\nAs with many tools in causal inference, we use the same statistical technique in many ways to answer different questions.\nWe'll revisit this topic in detail in [Chapter -@sec-interaction].\n\nMany people have tried expressing 
interaction in DAGs using different types of arcs, nodes, and other annotations, but no approach has taken off as the preferred way [@weinberg2007; @Nilsson2021].\n:::\n\nLet's take a look at an example in R.\nWe'll learn to build DAGs, visualize them, and identify important information like adjustment sets.\n\n## DAGs in R\n\nFirst, consider a research question: Does listening to a comedy podcast the morning before an exam improve graduate students' test scores?\nWe can diagram this using the method described in @sec-diag (@fig-diagram-podcast).\n\n\n::: {.cell}\n::: {.cell-output-display}\n![A sentence diagram for the question: Does listening to a comedy podcast the morning before an exam improve graduate student test scores? The population is graduate students. The start time is morning, and the outcome time is after the exam.](../images/podcast-diagram.png){#fig-diagram-podcast width=2267}\n:::\n:::\n\n\nThe tool we'll use for making DAGs is ggdag.\nggdag is a package that connects ggplot2, the most powerful visualization tool in R, to dagitty, an R package with sophisticated algorithms for querying DAGs.\n\nTo create a DAG object, we'll use the `dagify()` function.`dagify()` returns a `dagitty` object that works with both the dagitty and ggdag packages.\nThe `dagify()` function takes formulas, separated by commas, that specify causes and effects, with the left element of the formula defining the effect and the right all of the factors that cause it.\nThis is just like the type of formula we specify for most regression models in R.\n\n\n::: {.cell}\n\n```{.r .cell-code}\ndagify(\n effect1 ~ cause1 + cause2 + cause3,\n effect2 ~ cause1 + cause4,\n ...\n)\n```\n:::\n\n\nWhat are all of the factors that cause graduate students to listen to a podcast the morning before an exam?\nWhat are all of the factors that could cause a graduate student to do well on a test?\nLet's posit some here.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(ggdag)\ndagify(\n podcast ~ mood + humor + prepared,\n exam ~ mood + prepared\n)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\ndag {\nexam\nhumor\nmood\npodcast\nprepared\nhumor -> podcast\nmood -> exam\nmood -> podcast\nprepared -> exam\nprepared -> podcast\n}\n```\n\n\n:::\n:::\n\n\nIn the code above, we assume that:\n\n- a graduate student's mood, sense of humor, and how prepared they feel for the exam could influence whether they listened to a podcast the morning of the test\n- their mood and how prepared they are also influence their exam score\n\nNotice we *do not* see podcast in the exam equation; this means that we assume that there is **no** causal relationship between podcast and the exam score.\n\nThere are some other useful arguments you'll often find yourself supplying to `dagify()`:\n\n- `exposure` and `outcome`: Telling ggdag the variables that are the exposure and outcome of your research question is required for many of the most valuable queries we can make of DAGs.\n- `latent`: This argument lets us tell ggdag that some variables in the DAG are unmeasured. `latent` helps identify valid adjustment sets with the data we actually have.\n- `coords`: Coordinates for the variables. You can choose between algorithmic or manual layouts, as discussed below. 
We'll use `time_ordered_coords()` here.\n- `labels`: A character vector of labels for the variables.\n\nLet's create a DAG object, `podcast_dag`, with some of these attributes, then visualize the DAG with `ggdag()`.\n`ggdag()` returns a ggplot object, so we can add additional layers to the plot, like themes.\n\n\n::: {.cell}\n\n```{.r .cell-code}\npodcast_dag <- dagify(\n podcast ~ mood + humor + prepared,\n exam ~ mood + prepared,\n coords = time_ordered_coords(\n list(\n # time point 1\n c(\"prepared\", \"humor\", \"mood\"), \n # time point 2\n \"podcast\", \n # time point 3\n \"exam\"\n )\n ),\n exposure = \"podcast\",\n outcome = \"exam\",\n labels = c(\n podcast = \"podcast\",\n exam = \"exam score\",\n mood = \"mood\",\n humor = \"humor\",\n prepared = \"prepared\"\n )\n)\nggdag(podcast_dag, use_labels = \"label\", text = FALSE) +\n theme_dag()\n```\n\n::: {.cell-output-display}\n![Proposed DAG to answer the question: Does listening to a comedy podcast the morning before an exam improve graduate students' test scores?](chapter-05_files/figure-html/fig-dag-podcast-1.png){#fig-dag-podcast width=384}\n:::\n:::\n\n\n::: callout-note\nFor the rest of the chapter, we'll use `theme_dag()`, a ggplot theme from ggdag meant for DAGs.\n\n\n::: {.cell}\n\n```{.r .cell-code}\ntheme_set(\n theme_dag() %+replace%\n # also add some additional styling\n theme(\n legend.position = \"bottom\",\n strip.text.x = element_text(margin = margin(2, 0, 2, 0, \"mm\"))\n )\n)\n```\n:::\n\n:::\n\n::: callout-tip\n## DAG coordinates\n\nYou don't need to specify coordinates to ggdag.\nIf you don't, it uses algorithms designed for automatic layouts.\nThere are many such algorithms, and they focus on different aspects of the layout, e.g., the shape, the space between the nodes, minimizing how many edges cross, etc.\nThese layout algorithms usually have a component of randomness, so it's good to use a seed if you want to get the same result.\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\n# no coordinates specified\nset.seed(123)\npod_dag <- dagify(\n podcast ~ mood + humor + prepared,\n exam ~ mood + prepared\n)\n\n# automatically determine layouts\npod_dag |> \n ggdag(text_size = 2.8)\n```\n\n::: {.cell-output-display}\n![](chapter-05_files/figure-html/unnamed-chunk-14-1.png){fig-align='center' width=384}\n:::\n:::\n\n\nWe can also ask for a specific layout, e.g., the popular Sugiyama algorithm for DAGs [@sugiyama1981].\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\npod_dag |> \n ggdag(layout = \"sugiyama\", text_size = 2.8)\n```\n\n::: {.cell-output-display}\n![](chapter-05_files/figure-html/unnamed-chunk-15-1.png){fig-align='center' width=384}\n:::\n:::\n\n\nFor causal DAGs, the time-ordered layout algorithm is often best, which we can specify with `time_ordered_coords()` or `layout = \"time_ordered\"`.\nWe'll discuss time ordering in greater detail in @sec-time-ordered.\nEarlier, we explicitly told ggdag which variables were at which time points, but we don't need to.\nNotice, though, that the time ordering algorithm puts `podcast` and `exam` at the same time point since one doesn't cause another (and thus predate it).\nWe know that's not the case: listening to the podcast happened before taking the exam.\n\n\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\npod_dag |> \n ggdag(layout = \"time_ordered\", text_size = 2.8)\n```\n\n::: {.cell-output-display}\n![](chapter-05_files/figure-html/unnamed-chunk-16-1.png){fig-align='center' width=384}\n:::\n:::\n\n\nYou can 
manually specify coordinates using a list or data frame and provide them to the `coords` argument of `dagify()`.\nAdditionally, because ggdag is based on dagitty, you can use `dagitty.net` to create and organize a DAG using a graphical interface, then export the result as dagitty code for ggdag to consume.\n\nAlgorithmic layouts are lovely for fast visualization of DAGs or particularly complex graphs.\nOnce you want to share your DAG, it's usually best to be more intentional about the layout, perhaps by specifying the coordinates manually.\n`time_ordered_coords()` is often the best of both worlds, and we'll use it for most DAGs in this book.\n:::\n\nWe've specified the DAG for this question and told ggdag what the exposure and outcome of interest are.\nAccording to the DAG, there is no direct causal relationship between listening to a podcast and exam scores.\nAre there any other open paths?\n`ggdag_paths()` takes a DAG and visualizes the open paths.\nIn @fig-paths-podcast, we see two open paths: `podcast <- mood -> exam` and `podcast <- prepared -> exam`. These are both forks---*confounding pathways*. Since there is no causal relationship between listening to a podcast and exam scores, the only open paths are *backdoor* paths, these two confounding pathways.\n\n\n::: {.cell}\n\n```{.r .cell-code}\npodcast_dag |> \n # show the whole dag as a light gray \"shadow\" \n # rather than just the paths\n ggdag_paths(shadow = TRUE, text = FALSE, use_labels = \"label\")\n```\n\n::: {.cell-output-display}\n![`ggdag_paths()` visualizes open paths in a DAG. There are two open paths in `podcast_dag`: the fork from `mood` and the fork from `prepared`.](chapter-05_files/figure-html/fig-paths-podcast-1.png){#fig-paths-podcast width=672}\n:::\n:::\n\n\n::: callout-tip\n`dagify()` returns a `dagitty()` object, but underneath the hood, ggdag converts `dagitty` objects to tidy DAGs, a structure that holds both the `dagitty` object and a `dataframe` about the DAG.\nThis is handy if you want to manipulate the DAG programmatically.\n\n\n::: {.cell}\n\n```{.r .cell-code}\npodcast_dag_tidy <- podcast_dag |> \n tidy_dagitty()\n\npodcast_dag_tidy\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n# A DAG with 5 nodes and 5 edges\n#\n# Exposure: podcast\n# Outcome: exam\n#\n# A tibble: 7 × 9\n name x y direction to xend yend\n \n1 exam 3 0 NA NA\n2 humor 1 0 -> podcast 2 0\n3 mood 1 1 -> exam 3 0\n4 mood 1 1 -> podcast 2 0\n5 podcast 2 0 NA NA\n6 prepared 1 -1 -> exam 3 0\n7 prepared 1 -1 -> podcast 2 0\n# ℹ 2 more variables: circular , label \n```\n\n\n:::\n:::\n\n\nMost of the quick plotting functions transform the `dagitty` object to a tidy DAG if it's not already, then manipulate the data in some capacity.\nFor instance, `dag_paths()` underlies `ggdag_paths()`; it returns a tidy DAG with data about the paths.\nYou can use several dplyr functions on these objects directly.\n\n\n::: {.cell}\n\n```{.r .cell-code}\npodcast_dag_tidy |> \n dag_paths() |> \n filter(set == 2, path == \"open path\")\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n# A DAG with 3 nodes and 2 edges\n#\n# Exposure: podcast\n# Outcome: exam\n#\n# A tibble: 4 × 11\n set name x y direction to xend yend\n \n1 2 exam 3 0 NA NA\n2 2 podcast 2 0 NA NA\n3 2 prepar… 1 -1 -> exam 3 0\n4 2 prepar… 1 -1 -> podc… 2 0\n# ℹ 3 more variables: circular , label ,\n# path \n```\n\n\n:::\n:::\n\n\nTidy DAGs are not pure data frames, but you can retrieve either the `dataframe` or `dagitty` object to work with them directly using `pull_dag_data()` or 
`pull_dag()`.\n`pull_dag()` can be useful when you want to work with dagitty functions:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(dagitty)\npodcast_dag_tidy |> \n pull_dag() |> \n paths()\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n$paths\n[1] \"podcast <- mood -> exam\" \n[2] \"podcast <- prepared -> exam\"\n\n$open\n[1] TRUE TRUE\n```\n\n\n:::\n:::\n\n:::\n\nBackdoor paths pollute the statistical association between `podcast` and `exam`, so we must account for them.\n`ggdag_adjustment_set()` visualizes any valid adjustment sets implied by the DAG.\n@fig-podcast-adustment-set shows adjusted variables as squares.\nAny arrows coming out of adjusted variables are removed from the DAG because the path is no longer open at that variable.\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nggdag_adjustment_set(\n podcast_dag, \n text = FALSE, \n use_labels = \"label\"\n)\n```\n\n::: {.cell-output-display}\n![A visualization of the minimal adjustment set for the podcast-exam DAG. If this DAG is correct, two variables are required to block the backdoor paths: `mood` and `prepared`.](chapter-05_files/figure-html/fig-podcast-adustment-set-1.png){#fig-podcast-adustment-set fig-align='center' width=384}\n:::\n:::\n\n\n@fig-podcast-adustment-set shows the *minimal adjustment set*.\nBy default, ggdag returns the set(s) that can close all backdoor paths with the fewest number of variables possible.\nIn this DAG, that's just one set: `mood` and `prepared`.\nThis set makes sense because there are two backdoor paths, and the only other variables on them besides the exposure and outcome are these two variables.\nSo, at minimum, we must account for both to get a valid estimate.\n\n::: callout-tip\n`ggdag()` and friends usually use `tidy_dagitty()` and `dag_*()` or `node_*()` functions to change the underlying data frame.\nSimilarly, the quick plotting functions use ggdag's geoms to visualize the resulting DAG(s).\nIn other words, you can use the same data manipulation and visualization strategies that you use day-to-day directly with ggdag.\n\nHere's a condensed version of what `ggdag_adjustment_set()` is doing:\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\npodcast_dag_tidy |> \n # add adjustment sets to data\n dag_adjustment_sets() |>\n ggplot(aes(\n x = x, y = y, xend = xend, yend = yend,\n color = adjusted, shape = adjusted\n )) + \n # ggdag's custom geoms: add nodes, edges, and labels\n geom_dag_point() + \n # remove adjusted paths\n geom_dag_edges_link(data = \\(.df) filter(.df, adjusted != \"adjusted\")) + \n geom_dag_label_repel() + \n # you can use any ggplot function, too\n facet_wrap(~ set) +\n scale_shape_manual(values = c(adjusted = 15, unadjusted = 19))\n```\n\n::: {.cell-output-display}\n![](chapter-05_files/figure-html/unnamed-chunk-22-1.png){fig-align='center' width=432}\n:::\n:::\n\n:::\n\nMinimal adjustment sets are only one type of valid adjustment set [@vanderzander2019].\nSometimes, other combinations of variables can get us an unbiased effect estimate.\nTwo other options available in ggdag are full adjustment sets and canonical adjustment sets.\nFull adjustment sets are every combination of variables that results in a valid set.\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nggdag_adjustment_set(\n podcast_dag, \n text = FALSE, \n use_labels = \"label\",\n # get full adjustment sets\n type = \"all\"\n)\n```\n\n::: {.cell-output-display}\n![All valid adjustment sets for 
`podcast_dag`.](chapter-05_files/figure-html/fig-adustment-set-all-1.png){#fig-adustment-set-all fig-align='center' width=624}\n:::\n:::\n\n\nIt turns out that we can also control for `humor`.\n\nCanonical adjustment sets are a bit more complex: they are all possible ancestors of the exposure and outcome minus any likely descendants.\nIn fully saturated DAGs (DAGs where every node causes anything that comes after it in time), the canonical adjustment set is the minimal adjustment set.\n\n::: callout-tip\nMost of the functions in ggdag use dagitty underneath the hood.\nIt's often helpful to call dagitty functions directly.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nadjustmentSets(podcast_dag, type = \"canonical\")\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n{ humor, mood, prepared }\n```\n\n\n:::\n:::\n\n:::\n\nUsing our proposed DAG, let's simulate some data to see how accounting for the minimal adjustment set might occur in practice.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nset.seed(10)\nsim_data <- podcast_dag |>\n simulate_data()\n```\n:::\n\n::: {.cell}\n\n```{.r .cell-code}\nsim_data\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n# A tibble: 500 × 5\n exam humor mood podcast prepared\n \n 1 -0.435 0.263 -0.100 -0.630 1.07 \n 2 -0.593 0.317 0.143 -1.55 0.0640\n 3 0.786 1.97 -0.591 -0.318 -0.439 \n 4 -0.103 2.86 -0.139 1.07 0.754 \n 5 -0.614 -2.39 0.702 0.464 0.356 \n 6 1.01 1.21 0.910 0.769 0.561 \n 7 0.167 -1.37 -0.559 -0.866 0.214 \n 8 1.16 0.164 -0.743 0.969 -1.67 \n 9 0.650 0.215 -0.248 0.691 -0.303 \n10 0.156 0.713 1.19 -1.02 -0.219 \n# ℹ 490 more rows\n```\n\n\n:::\n:::\n\n\nSince we have simulated this data, we know that this is a case where *standard methods will succeed* (see @sec-standard) and, therefore, can estimate the causal effect using a basic linear regression model.\n@fig-dag-sim shows a forest plot of the simulated data based on our DAG.\nNotice the model that only included the exposure resulted in a spurious effect (an estimate of -0.1 when we know the truth is 0).\nIn contrast, the model that adjusted for the two variables as suggested by `ggdag_adjustment_set()` is not spurious (much closer to 0).\n\n\n::: {.cell}\n\n```{.r .cell-code}\n## Model that does not close backdoor paths\nunadjusted_model <- lm(exam ~ podcast, sim_data) |>\n broom::tidy(conf.int = TRUE) |>\n dplyr::filter(term == \"podcast\") |>\n mutate(formula = \"podcast\")\n\n## Model that closes backdoor paths\nadjusted_model <- lm(exam ~ podcast + mood + prepared, sim_data) |>\n broom::tidy(conf.int = TRUE) |>\n dplyr::filter(term == \"podcast\") |>\n mutate(formula = \"podcast + mood + prepared\")\n\nbind_rows(\n unadjusted_model,\n adjusted_model\n) |>\n ggplot(aes(x = estimate, y = formula, xmin = conf.low, xmax = conf.high)) +\n geom_vline(xintercept = 0, linewidth = 1, color = \"grey80\") +\n geom_pointrange(fatten = 3, size = 1) +\n theme_minimal(18) +\n labs(\n y = NULL,\n caption = \"correct effect size: 0\"\n )\n```\n\n::: {.cell-output-display}\n![Forest plot of simulated data based on the DAG described in @fig-dag-podcast.](chapter-05_files/figure-html/fig-dag-sim-1.png){#fig-dag-sim width=672}\n:::\n:::\n\n\n## Structures of Causality\n\n### Advanced Confounding\n\nIn `podcast_dag`, `mood` and `prepared` were *direct* confounders: an arrow was going directly from them to `podcast` and `exam`.\nOften, backdoor paths are more complex.\nLet's consider such a case by adding two new variables: `alertness` and `skills_course`.\n`alertness` represents the feeling of alertness 
from a good mood, thus the arrow from `mood` to `alertness`.\n`skills_course` represents whether the student took a College Skills Course and learned time management techniques.\nNow, `skills_course` is what frees up the time both to listen to a podcast and to prepare for the exam.\n`mood` and `prepared` are no longer direct confounders: they are two variables along a more complex backdoor path.\nAdditionally, we've added an arrow going from `humor` to `mood`.\nLet's take a look at @fig-podcast_dag2.\n\n\n::: {.cell}\n\n```{.r .cell-code}\npodcast_dag2 <- dagify(\n podcast ~ mood + humor + skills_course,\n alertness ~ mood,\n mood ~ humor,\n prepared ~ skills_course,\n exam ~ alertness + prepared,\n coords = time_ordered_coords(),\n exposure = \"podcast\",\n outcome = \"exam\",\n labels = c(\n podcast = \"podcast\",\n exam = \"exam score\",\n mood = \"mood\",\n alertness = \"alertness\",\n skills_course = \"college\\nskills course\",\n humor = \"humor\",\n prepared = \"prepared\"\n )\n)\n\nggdag(podcast_dag2, use_labels = \"label\", text = FALSE)\n```\n\n::: {.cell-output-display}\n![An expanded version of `podcast_dag` that includes two additional variables: `skills_course`, representing a College Skills Course, and `alertness`.](chapter-05_files/figure-html/fig-podcast_dag2-1.png){#fig-podcast_dag2 width=480}\n:::\n:::\n\n::: {.cell}\n\n:::\n\n\nNow there are *three* backdoor paths we need to close: `podcast <- humor -> mood -> alertness -> exam`, `podcast <- mood -> alertness -> exam`, and `podcast <- skills_course -> prepared -> exam`.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nggdag_paths(podcast_dag2, use_labels = \"label\", text = FALSE, shadow = TRUE)\n```\n\n::: {.cell-output-display}\n![Three open paths in `podcast_dag2`. Since there is no effect of `podcast` on `exam`, all three are backdoor paths that must be closed to get the correct effect.](chapter-05_files/figure-html/fig-podcast_dag2-paths-1.png){#fig-podcast_dag2-paths width=1056}\n:::\n:::\n\n\nThere are four minimal adjustment sets to close all three paths (and eighteen full adjustment sets!).\nThe minimal adjustment sets are `alertness + prepared`, `alertness + skills_course`, `mood + prepared`, and `mood + skills_course`.\nWe can now block the open paths in several ways.\n`mood` and `prepared` still work, but we've got other options now.\nNotably, `prepared` and `alertness` could happen at the same time or even after `podcast`.\n`skills_course` and `mood` still happen before both `podcast` and `exam`, so the idea is still the same: the confounding pathway starts before the exposure and outcome.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nggdag_adjustment_set(podcast_dag2, use_labels = \"label\", text = FALSE)\n```\n\n::: {.cell-output-display}\n![Valid minimal adjustment sets that will close the backdoor paths in @fig-podcast_dag2-paths.](chapter-05_files/figure-html/fig-podcast_dag2-set-1.png){#fig-podcast_dag2-set width=672}\n:::\n:::\n\n\n
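We can also query the DAG for these sets programmatically rather than reading them off the plot. As a quick sketch, dagitty's `adjustmentSets()` returns minimal sets by default, so it should list the same four combinations:\n\n\n::: {.cell}\n\n```{.r .cell-code}\n# list the minimal adjustment sets implied by podcast_dag2\ndagitty::adjustmentSets(podcast_dag2)\n```\n:::\n\n\nDeciding between these adjustment sets is a matter of judgment: if all data are perfectly measured, the DAG is correct, and we've modeled them correctly, then it doesn't matter which we use.\nEach adjustment set will result in an unbiased estimate.\nAll three of those assumptions are usually untrue to some degree.\nLet's consider the path via `skills_course` and `prepared`.\nIt may be that we are better able to assess whether or not someone took the College Skills Course than how prepared for the exam they are.\nIn that case, an adjustment set with `skills_course` is a better 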
option.\nBut perhaps we better understand the relationship between preparedness and exam results.\nIf we have it measured, controlling for that might be better.\nWe could get the best of both worlds by including both variables: between the better measurement of `skills_course` and the better modeling of `prepared`, we might have a better chance of minimizing confounding from this path.\n\n### Selection Bias and Mediation\n\nSelection bias is another name for the type of bias that is induced by adjusting for a collider [@lu2022].\nIt's called \"selection bias\" because a common form of collider-induced bias is a variable inherently stratified upon by the design of the study---selection *into* the study.\nLet's consider a case based on the original `podcast_dag` but with one additional variable: whether or not the student showed up to the exam.\nNow, there is an indirect effect of `podcast` on `exam`: listening to a podcast influences whether or not the students attend the exam.\nThe true result of `exam` is missing for those who didn't show up; by studying the group of people who *did* show up, we are inherently stratifying on this variable.\n\n\n::: {.cell}\n\n```{.r .cell-code}\npodcast_dag3 <- dagify(\n podcast ~ mood + humor + prepared,\n exam ~ mood + prepared + showed_up,\n showed_up ~ podcast + mood + prepared,\n coords = time_ordered_coords(\n list(\n # time point 1\n c(\"prepared\", \"humor\", \"mood\"), \n # time point 2\n \"podcast\", \n \"showed_up\", \n # time point 3\n \"exam\"\n )\n ),\n exposure = \"podcast\",\n outcome = \"exam\",\n labels = c(\n podcast = \"podcast\",\n exam = \"exam score\",\n mood = \"mood\",\n humor = \"humor\",\n prepared = \"prepared\",\n showed_up = \"showed up\"\n )\n)\nggdag(podcast_dag3, use_labels = \"label\", text = FALSE)\n```\n\n::: {.cell-output-display}\n![Another variant of `podcast_dag`, this time including the inherent stratification on those who appear for the exam. There is still no direct effect of `podcast` on `exam`, but there is an indirect effect via `showed_up`.](chapter-05_files/figure-html/fig-podcast_dag3-1.png){#fig-podcast_dag3 width=432}\n:::\n:::\n\n\nThe problem is that `showed_up` is both a collider and a mediator: stratifying on it induces a relationship between many of the variables in the DAG but blocks the indirect effect of `podcast` on `exam`.\nLuckily, the adjustment sets can handle the first problem; because `showed_up` happens *before* `exam`, we're less at risk of collider bias between the exposure and outcome.\nUnfortunately, we cannot calculate the total effect of `podcast` on `exam` because part of the effect is missing: the indirect effect is closed at `showed_up`.\n\n\n::: {.cell}\n\n```{.r .cell-code}\npodcast_dag3 |> \n adjust_for(\"showed_up\") |> \n ggdag_adjustment_set(text = FALSE, use_labels = \"label\")\n```\n\n::: {.cell-output-display}\n![The adjustment set for `podcast_dag3` given that the data are inherently conditioned on showing up to the exam. 
In this case, there is no way to recover an unbiased estimate of the total effect of `podcast` on `exam`.](chapter-05_files/figure-html/fig-podcast_dag3-as-1.png){#fig-podcast_dag3-as width=432}\n:::\n:::\n\n\nSometimes, you can still estimate effects in this situation by changing the estimate you wish to calculate.\nWe can't calculate the total effect because we are missing the indirect effect, but we can still calculate the direct effect of `podcast` on `exam`.\n\n\n::: {.cell}\n\n```{.r .cell-code}\npodcast_dag3 |> \n adjust_for(\"showed_up\") |> \n ggdag_adjustment_set(effect = \"direct\", text = FALSE, use_labels = \"label\")\n```\n\n::: {.cell-output-display}\n![The adjustment set for `podcast_dag3` when targeting a different effect. There is one minimal adjustment set that we can use to estimate the direct effect of `podcast` on `exam`.](chapter-05_files/figure-html/fig-podcast_dag3-direct-1.png){#fig-podcast_dag3-direct width=432}\n:::\n:::\n\n\n#### M-Bias and Butterfly Bias\n\nA particular case of selection bias that you'll often see people talk about is *M-bias*.\nIt's called M-bias because it looks like an M when arranged top to bottom.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nm_bias() |> \n ggdag()\n```\n\n::: {.cell-output-display}\n![A DAG representing M-Bias, a situation where a collider predates the exposure and outcome.](chapter-05_files/figure-html/fig-m-bias-1.png){#fig-m-bias width=384}\n:::\n:::\n\n\n::: callout-tip\nggdag has several quick-DAGs for demonstrating basic causal structures, including `confounder_triangle()`, `collider_triangle()`, `m_bias()`, and `butterfly_bias()`.\n:::\n\nWhat's theoretically interesting about M-bias is that `m` is a collider but occurs before `x` and `y`.\nRemember that association is blocked at a collider, so there is no open path between `x` and `y`.\n\n\n::: {.cell}\n\n```{.r .cell-code}\npaths(m_bias())\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n$paths\n[1] \"x <- a -> m <- b -> y\"\n\n$open\n[1] FALSE\n```\n\n\n:::\n:::\n\n\nLet's focus on the `mood` path of the podcast-exam DAG.\nWhat if we were wrong about mood, and the actual relationship was M-shaped?\nLet's say that, rather than causing `podcast` and `exam`, `mood` was itself caused by two mutual causes of `podcast` and `exam`, `u1` and `u2`, as in @fig-podcast_dag4.\nWe don't know what `u1` and `u2` are, and we don't have them measured.\nAs above, there are no open paths in this subset of the DAG.\n\n\n::: {.cell}\n\n```{.r .cell-code}\npodcast_dag4 <- dagify(\n podcast ~ u1,\n exam ~ u2,\n mood ~ u1 + u2,\n coords = time_ordered_coords(list(\n c(\"u1\", \"u2\"),\n \"mood\",\n \"podcast\", \n \"exam\"\n )),\n exposure = \"podcast\",\n outcome = \"exam\",\n labels = c(\n podcast = \"podcast\",\n exam = \"exam score\",\n mood = \"mood\",\n u1 = \"unmeasured\",\n u2 = \"unmeasured\"\n ),\n # we don't have them measured\n latent = c(\"u1\", \"u2\")\n)\n\nggdag(podcast_dag4, use_labels = \"label\", text = FALSE)\n```\n\n::: {.cell-output-display}\n![A reconfiguration of @fig-dag-podcast where `mood` is a collider on an M-shaped path.](chapter-05_files/figure-html/fig-podcast_dag4-1.png){#fig-podcast_dag4 width=528}\n:::\n:::\n\n\nThe problem arises when we think our original DAG is the right DAG: `mood` is in the adjustment set, so we control for it.\nBut this induces bias!\nIt opens up a path between `u1` and `u2`, thus creating a path from `podcast` to `exam`.\nIf we had either `u1` or `u2` measured, we could adjust for them to close this path, but we don't.\nThere 
is no way to close this open path.\n\n\n::: {.cell}\n\n```{.r .cell-code}\npodcast_dag4 |> \n adjust_for(\"mood\") |> \n ggdag_adjustment_set(use_labels = \"label\", text = FALSE)\n```\n\n::: {.cell-output-display}\n![The adjustment set where `mood` is a collider. If we control for `mood` and don't know about or have the unmeasured causes of `mood`, we have no means of closing the backdoor path opened by adjusting for a collider.](chapter-05_files/figure-html/fig-podcast_dag4-as-1.png){#fig-podcast_dag4-as width=528}\n:::\n:::\n\n\nOf course, the best thing to do here is not control for `mood` at all.\nSometimes, though, that is not an option.\nImagine if, instead of `mood`, this turned out to be the real structure for `showed_up`: since we inherently control for `showed_up`, and we don't have the unmeasured variables, our study results will always be biased.\nIt's essential to understand if we're in that situation so we can address it with sensitivity analysis to understand just how biased the effect would be.\n\nLet's consider a variation on M-bias where `mood` causes `podcast` and `exam` and `u1` and `u2` are mutual causes of `mood` and the exposure and outcome.\nThis arrangement is sometimes called butterfly or bowtie bias, again because of its shape.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nbutterfly_bias(x = \"podcast\", y = \"exam\", m = \"mood\", a = \"u1\", b = \"u2\") |> \n ggdag(text = FALSE, use_labels = \"label\")\n```\n\n::: {.cell-output-display}\n![In butterfly bias, `mood` is both a collider and a confounder. Controlling for the bias induced by `mood` opens a new pathway because we've also conditioned on a collider. We can't properly close all backdoor paths without either `u1` or `u2`.](chapter-05_files/figure-html/fig-butterfly_bias-1.png){#fig-butterfly_bias width=480}\n:::\n:::\n\n\nNow, we're in a challenging position: we need to control for `mood` because it's a confounder, but controlling for `mood` opens up the pathway from `u1` to `u2`.\nBecause we don't have either variable measured, we can't then close the path opened from conditioning on `mood`.\nWhat should we do?\nIt turns out that, when in doubt, controlling for `mood` is the better of the two options: confounding bias tends to be worse than collider bias, and M-shaped structures of colliders are sensitive to slight deviations (e.g., if this is not the exact structure, often the bias isn't as bad) [@DingMiratrix2015].\n\nAnother common form of selection bias is from *loss to follow-up*: people drop out of a study in a way that is related to the exposure and outcome.\nWe'll come back to this topic in [Chapter -@sec-longitudinal].\n\n### Causes of the exposure, causes of the outcome\n\nLet's consider one other type of causal structure that's important: causes of the exposure and not the outcome, and their opposites, causes of the outcome and not the exposure.\nLet's add a variable, `grader_mood`, to the original DAG.\n\n\n::: {.cell}\n\n```{.r .cell-code}\npodcast_dag5 <- dagify(\n podcast ~ mood + humor + prepared,\n exam ~ mood + prepared + grader_mood,\n coords = time_ordered_coords(\n list(\n # time point 1\n c(\"prepared\", \"humor\", \"mood\"), \n # time point 2\n c(\"podcast\", \"grader_mood\"), \n # time point 3\n \"exam\"\n )\n ),\n exposure = \"podcast\",\n outcome = \"exam\",\n labels = c(\n podcast = \"podcast\",\n exam = \"exam score\",\n mood = \"student\\nmood\",\n humor = \"humor\",\n prepared = \"prepared\",\n grader_mood = \"grader\\nmood\"\n )\n)\nggdag(podcast_dag5, use_labels = \"label\", 
text = FALSE)\n```\n\n::: {.cell-output-display}\n![A DAG containing a cause of the exposure that is not the cause of the outcome (`humor`) and a cause of the outcome that is not a cause of the exposure (`grader_mood`).](chapter-05_files/figure-html/fig-podcast_dag5-1.png){#fig-podcast_dag5 width=480}\n:::\n:::\n\n\nThere are now two variables that aren't related to *both* the exposure and the outcome: `humor`, which causes `podcast` but not `exam`, and `grader_mood`, which causes `exam` but not `podcast`.\nLet's start with `humor`.\n\nVariables that cause the exposure but not the outcome are also called *instrumental variables* (IVs).\nIVs are an unusual circumstance where, under certain conditions, controlling for them can make other types of bias worse.\nWhat's unique about this is that IVs can *also* be used to conduct an entirely different approach to estimating an unbiased effect of the exposure on the outcome.\nIVs are commonly used this way in econometrics and are increasingly popular in other areas.\nIn short, IV analysis allows us to estimate the causal effect using a different set of assumptions than the approaches we've talked about thus far.\nSometimes, a problem intractable using propensity score methods can be addressed using IVs and vice versa.\nWe'll talk more about IVs in @sec-iv-friends.\n\nSo, if you're *not* using IV methods, should you include an IV in a model meant to address confounding?\nIf you're unsure if the variable is an IV or not, you should probably add it to your model: it's more likely to be a confounder than an IV, and, it turns out, the bias from adding an IV is usually small in practice.\nSo, like adjusting for a potential M-structure variable, the risk of bias is worse from confounding [@Myers2011].\n\nNow, let's talk about the opposite of an IV: a cause of the outcome that is not the cause of the exposure.\nThese variables are sometimes called *competing exposures* (because they also cause the outcome) or *precision variables* (because, as we'll see, they increase the precision of causal estimates).\nWe'll call them precision variables because we're concerned about the relationship to the research question at hand, not to another research question where they are exposures [@Brookhart2006].\n\nLike IVs, precision variables do not occur along paths from the exposure to the outcome.\nThus, including them is not necessary.\nUnlike IVs, including precision variables is beneficial.\nIncluding other causes of the outcomes helps a statistical model capture some of its variation.\nThis doesn't impact the point estimate of the effect, but it does reduce the variance, resulting in smaller standard errors and narrower confidence intervals.\nThus, we recommend including them when possible.\n\nSo, even though we don't need to control for `grader_mood`, if we have it in the data set, we should.\nSimilarly, `humor` is not a good addition to the model unless we think it really might be a confounder; if it is a valid instrument, we might want to consider using IV methods to estimate the effect instead.\n\n### Measurement Error and Missingness\n\nDAGs can also help us understand the bias arising from mismeasurements in the data, including the worst mismeasurement: not measuring it at all.\nWe'll cover these topics in [Chapter -@sec-missingness], but the basic idea is that by separating the actual value from the observed value, we can better understand how such biases may behave [@Hernán2009].\nHere's a basic example of a bias called *recall bias*.\nRecall bias is when 
the outcome influences a participant's memory of exposure, so it's a particular problem in retrospective studies where the earlier exposure is not recorded until after the outcome happens.\nAn example of when this can occur is a case-control study of cancer.\nSomeone *with* cancer may be more motivated to ruminate on their past exposures than someone *without* cancer.\nSo, their memory about a given exposure may be more refined than someone without.\nBy conditioning on the observed version of the exposure, we open up many collider paths.\nUnfortunately, there is no way to close them all.\nIf this is the case, we must investigate how severe the bias would be in practice.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nerror_dag <- dagify(\n exposure_observed ~ exposure_real + exposure_error,\n outcome_observed ~ outcome_real + outcome_error,\n outcome_real ~ exposure_real,\n exposure_error ~ outcome_real,\n labels = c(\n exposure_real = \"Exposure\\n(truth)\",\n exposure_error = \"Measurement Error\\n(exposure)\",\n exposure_observed = \"Exposure\\n(observed)\",\n outcome_real = \"Outcome\\n(truth)\",\n outcome_error = \"Measurement Error\\n(outcome)\",\n outcome_observed = \"Outcome\\n(observed)\"\n ),\n exposure = \"exposure_real\",\n outcome = \"outcome_real\",\n coords = time_ordered_coords()\n)\n\nerror_dag |> \n ggdag(text = FALSE, use_labels = \"label\")\n```\n\n::: {.cell-output-display}\n![A DAG representing measurement error in observing the exposure and outcome. In this case, the outcome impacts the participant's memory of the exposure, also known as recall bias.](chapter-05_files/figure-html/fig-error_dag-1.png){#fig-error_dag width=528}\n:::\n:::\n\n\n## Recommendations in building DAGs\n\nIn principle, using DAGs is easy: specify the causal relationships you think exist and then query the DAG for information like valid adjustment sets.\nIn practice, assembling DAGs takes considerable time and thought.\nNext to defining the research question itself, it's one of the most challenging steps in making causal inferences.\nVery little guidance exists on best practices in assembling DAGs.\n@Tennant2021 collected data on DAGs in applied health research to better understand how researchers used them.\n@tbl-dag-properties shows some information they collected: the median number of nodes and arcs in a DAG, their ratio, the saturation percent of the DAG, and how many were fully saturated.\nSaturating DAGs means adding all possible arrows going forward in time, e.g., in a fully saturated DAG, any given variable at time point 1 has arrows going to all variables in future time points, and so on.\nMost DAGs were only about half saturated, and very few were fully saturated.\n\nOnly about half of the papers using DAGs reported the adjustment set used.\nIn other words, researchers presented their assumptions about the research question but not the implications about how they should handle the modeling stage or if they did use a valid adjustment set.\nSimilarly, the majority of studies did not report the estimand of interest.\n\n::: callout-note\nThe estimand is the target of interest in terms of what we're trying to estimate, as discussed briefly in [Chapter -@sec-whole-game].\nWe'll discuss estimands in detail in [Chapter -@sec-estimands].\n:::\n\n\n::: {#tbl-dag-properties .cell tbl-cap='A table of DAG properties measured by @Tennant2021. Number of nodes and arcs are the median number of variables and arrows in the analyzed DAGs, while the Node to Arc ratio is their ratio. 
Saturation proportion is the proportion of all possible arrows going forward in time to other included variables. Fully saturated DAGs are those that include all such arrows. @Tennant2021 also analyzed whether studies reported their estimands and adjustment sets.'}\n::: {.cell-output-display}\n\n```{=html}\n
<table>\n <thead>\n <tr>\n <th>Characteristic</th>\n <th>N = 144<sup>1</sup></th>\n </tr>\n </thead>\n <tbody>\n <tr><td colspan=\"2\"><strong>DAG properties</strong></td></tr>\n <tr><td>Number of Nodes</td><td>12 (9, 16)</td></tr>\n <tr><td>Number of Arcs</td><td>29 (19, 41)</td></tr>\n <tr><td>Node to Arc Ratio</td><td>2.30 (1.78, 3.00)</td></tr>\n <tr><td>Saturation Proportion</td><td>0.46 (0.31, 0.67)</td></tr>\n <tr><td>Fully Saturated</td><td></td></tr>\n <tr><td>&nbsp;&nbsp;&nbsp;&nbsp;Yes</td><td>4 (3%)</td></tr>\n <tr><td>&nbsp;&nbsp;&nbsp;&nbsp;No</td><td>140 (97%)</td></tr>\n <tr><td colspan=\"2\"><strong>Reporting</strong></td></tr>\n <tr><td>Reported Estimand</td><td></td></tr>\n <tr><td>&nbsp;&nbsp;&nbsp;&nbsp;Yes</td><td>40 (28%)</td></tr>\n <tr><td>&nbsp;&nbsp;&nbsp;&nbsp;No</td><td>104 (72%)</td></tr>\n <tr><td>Reported Adjustment Set</td><td></td></tr>\n <tr><td>&nbsp;&nbsp;&nbsp;&nbsp;Yes</td><td>80 (56%)</td></tr>\n <tr><td>&nbsp;&nbsp;&nbsp;&nbsp;No</td><td>64 (44%)</td></tr>\n </tbody>\n <tfoot>\n <tr><td colspan=\"2\"><sup>1</sup> Median (IQR); n (%)</td></tr>\n </tfoot>\n</table>
\n```\n\n:::\n:::\n\n\nIn this section, we'll offer some advice from @Tennant2021 and our own experience assembling DAGs.\n\n### Iterate early and often\n\nOne of the best things you can do for the quality of your results is to make the DAG before you conduct the study, ideally before you even collect the data.\nIf you're already working with your data, at minimum, build your DAG before doing data analysis.\nThis advice is similar in spirit to pre-registered analysis plans: declaring your assumptions ahead of time can help clarify what you need to do, reduce the risk of overfitting (e.g., determining confounders incorrectly from the data), and give you time to get feedback on your DAG.\n\nThis last benefit is significant: you should ideally democratize your DAG.\nShare it early and often with others who are experts on the data, domain, and models.\nIt's natural to create a DAG, present it to your colleagues, and realize you have missed something important.\nSometimes, you will only agree on some details of the structure.\nThat's a good thing: you know now where there is uncertainty in your DAG.\nYou can then examine the results from multiple plausible DAGs or address the uncertainty with sensitivity analyses.\n\nIf you have more than one candidate DAG, check their adjustment sets.\nIf two DAGs have overlapping adjustment sets, focus on those sets; then, you can move forward in a way that satisfies the plausible assumptions you have.\n\n### Consider your question\n\nAs we saw in @fig-podcast_dag3, some questions can be challenging to answer with certain data, while others are more approachable.\nYou should consider precisely what it is you want to estimate.\nDefining your target estimate is an important topic and the subject of [Chapter -@sec-estimands].\n\nAnother important detail about how your DAG relates to your question is the population and time.\nMany causal structures are not static over time and space.\nConsider lung cancer: the distribution of causes of lung cancer was considerably different before the spread of smoking.\nIn medieval Japan, before the spread of tobacco from the Americas centuries later, the causal structure for lung cancer would have been practically different from what it is in Japan today, both in terms of tobacco use and other factors (age of the population, etc.).\n\nThe same is true for confounders.\nEven if something *can* cause the exposure and outcome, if the prevalence of that thing is zero in the population you're analyzing, it's irrelevant to the causal question.\nIt may also be that, in some populations, it doesn't affect one of the two.\nThe reverse is also true: something might be unique to the target population.\nThe use of tobacco in North America several centuries ago was unique among the world population, even though ceremonial tobacco use was quite different from modern recreational use.\nMany changes won't happen as dramatically as across centuries, but sometimes, they do, e.g., if regulation in one country effectively eliminates the population's exposure to something.\n\n### Order nodes by time {#chapter-05-chapter-05-chapter-05-sec-time-ordered}\n\nAs discussed earlier, we recommend ordering your variables by time, either left-to-right or up-to-down.\nThere are two reasons for this.\nFirst, time ordering is an integral part of your assumptions.\nAfter all, something happening before another thing is a requirement for it to be a cause.\nThinking this through carefully will clarify your DAG and the variables you need to address.\n\nSecond, 
after a certain level of complexity, it's easier to read a DAG when arranged by time because you have to think less about that dimension; it's inherent to the layout.\nThe time ordering algorithm in ggdag automates much of this for you, although, as we saw earlier, it's sometimes helpful to give it more information about the order.\n\nA related topic is feedback loops [@murray2022].\nOften, we think about two things that mutually cause each other as happening in a circle, like global warming and A/C use (A/C use increases global warming, which makes it hotter, which increases A/C use, and so on).\nIt's tempting to visualize that relationship like this:\n\n\n::: {.cell}\n\n```{.r .cell-code}\ndagify(\n ac_use ~ global_temp,\n global_temp ~ ac_use,\n labels = c(ac_use = \"A/C use\", global_temp = \"Global\\ntemperature\")\n) |> \n ggdag(layout = \"circle\", edge_type = \"arc\", text = FALSE, use_labels = \"label\")\n```\n\n::: {.cell-output-display}\n![A DAG representing the reciprocal relationship between A/C use and global temperature because of global warming. Feedback loops are useful mental shorthands to describe variables that impact each other over time compactly, but they are not true causal diagrams.](chapter-05_files/figure-html/fig-feedback-loop-1.png){#fig-feedback-loop width=432}\n:::\n:::\n\n\nFrom a DAG perspective, this is a problem because of the *A* part of *DAG*: it's cyclic!\nImportantly, though, it's also not correct from a causal perspective.\nFeedback loops are a shorthand for what really happens, which is that the two variables mutually affect each other *over time*.\nCausality only goes forward in time, so it doesn't make sense to go back and forth like in @fig-feedback-loop.\n\nThe real DAG looks something like this:\n\n\n::: {.cell}\n\n```{.r .cell-code}\ndagify(\n global_temp_2000 ~ ac_use_1990 + global_temp_1990,\n ac_use_2000 ~ ac_use_1990 + global_temp_1990,\n global_temp_2010 ~ ac_use_2000 + global_temp_2000,\n ac_use_2010 ~ ac_use_2000 + global_temp_2000,\n global_temp_2020 ~ ac_use_2010 + global_temp_2010,\n ac_use_2020 ~ ac_use_2010 + global_temp_2010,\n coords = time_ordered_coords(),\n labels = c(\n ac_use_1990 = \"A/C use\\n(1990)\", \n global_temp_1990 = \"Global\\ntemperature\\n(1990)\",\n ac_use_2000 = \"A/C use\\n(2000)\", \n global_temp_2000 = \"Global\\ntemperature\\n(2000)\",\n ac_use_2010 = \"A/C use\\n(2010)\", \n global_temp_2010 = \"Global\\ntemperature\\n(2010)\",\n ac_use_2020 = \"A/C use\\n(2020)\", \n global_temp_2020 = \"Global\\ntemperature\\n(2020)\"\n )\n) |> \n ggdag(text = FALSE, use_labels = \"label\")\n```\n\n::: {.cell-output-display}\n![A DAG showing the relationship between A/C use and global temperature over time. 
The true causal relationship in a feedback loop goes *forward*.](chapter-05_files/figure-html/fig-feedforward-1.png){#fig-feedforward width=480}\n:::\n:::\n\n\nThe two variables, rather than being in a feed*back* loop, are actually in a feed*forward* loop: they co-evolve over time.\nHere, we only show four discrete moments in time (the decades from 1990 to 2020), but of course, we could get much finer depending on the question and data.\n\nAs with any DAG, the proper analysis approach depends on the question.\nThe effect of A/C use in 2000 on the global temperature in 2020 produces a different adjustment set than the global temperature in 2000 on A/C use in 2020.\nSimilarly, whether we also model this change over time or just those two time points depends on the question.\nOften, these feedforward relationships require you to address *time-varying* confounding, which we'll discuss in [Chapter -@sec-longitudinal].\n\n### Consider the whole data collection process\n\nAs @fig-podcast_dag3 showed us, it's essential to consider the *way* we collected data as much as the causal structure of the question.\nConsidering the whole data collection process is particularly true if you're working with \"found\" data---a data set not intentionally collected to answer the research question.\nWe are always inherently conditioning on the data we have vs. the data we don't have.\nIf other variables influenced the data collection process in the causal structure, you need to consider the impact.\nDo you need to control for additional variables?\nDo you need to change the effect you are trying to estimate?\nCan you answer the question at all?\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n::: callout-tip\n## What about case-control studies?\n\nA standard study design in epidemiology is the case-control study.\nCase-control studies are beneficial when the outcome under study is rare or takes a very long time to happen (like many types of cancer).\nParticipants are selected into the study based on their outcome: once a person has an event, they are entered as a case and matched with a control who hasn't had the event.\nOften, they are matched on other factors as well.\n\nMatched case-control studies are selection biased by design [@mansournia2013].\nIn @fig-case-control, when we condition on selection into the study, we lose the ability to close all backdoor paths, even if we control for `confounder`.\nFrom the DAG, it would appear that the entire design is invalid!\n\n\n::: {.cell}\n\n```{.r .cell-code}\ndagify(\n outcome ~ confounder + exposure,\n selection ~ outcome + confounder,\n exposure ~ confounder,\n exposure = \"exposure\",\n outcome = \"outcome\",\n coords = time_ordered_coords()\n) |> \n ggdag(edge_type = \"arc\", text_size = 2.2)\n```\n\n::: {.cell-output-display}\n![A DAG representing a matched case-control study. In such a study, selection is determined by outcome status and any matched confounders. Selection into the study is thus a collider. 
Since it is inherently stratified on who is actually in the study, such data are limited in the types of causal effects they can estimate.](chapter-05_files/figure-html/fig-case-control-1.png){#fig-case-control width=432}\n:::\n:::\n\n\nLuckily, this isn't wholly true.\nCase-control studies are limited in the type of causal effects they can estimate (causal odds ratios, which under some circumstances approximate causal risk ratios).\nWith careful study design and sampling, the math works out such that these estimates are still valid.\nExactly how and why case-control studies work is beyond the scope of this book, but they are a remarkably clever design.\n:::\n\n### Include variables you don't have\n\nIt's critical that you include *all* variables important to the causal structure, not just the variables you have measured in your data.\nggdag can mark variables as unmeasured (\"latent\"); it will then return only usable adjustment sets, e.g., those without the unmeasured variables.\nOf course, the best thing to do is to use DAGs to help you understand what to measure in the first place, but there are many reasons why your data might be different.\nEven data intentionally collected for the research question might not have a variable discovered to be a confounder after data collection.\n\nFor instance, if we have a DAG where `exposure` and `outcome` have a confounding pathway consisting of `confounder1` and `confounder2`, we can control for either to successfully debias the estimate:\n\n\n::: {.cell}\n\n```{.r .cell-code}\ndagify(\n outcome ~ exposure + confounder1,\n exposure ~ confounder2,\n confounder2 ~ confounder1,\n exposure = \"exposure\",\n outcome = \"outcome\"\n) |> \n adjustmentSets()\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n{ confounder1 }\n{ confounder2 }\n```\n\n\n:::\n:::\n\n\nThus, if just one is missing (`latent`), then we are ok:\n\n\n::: {.cell}\n\n```{.r .cell-code}\ndagify(\n outcome ~ exposure + confounder1,\n exposure ~ confounder2,\n confounder2 ~ confounder1,\n exposure = \"exposure\",\n outcome = \"outcome\",\n latent = \"confounder1\"\n) |> \n adjustmentSets()\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n{ confounder2 }\n```\n\n\n:::\n:::\n\n\nBut if both are missing, there are no valid adjustment sets.\n\nWhen you don't have a variable measured, you still have a few options.\nAs mentioned above, you may be able to identify alternate adjustment sets.\nIf the missing variable is required to close all backdoor paths completely, you can and should conduct a sensitivity analysis to understand the impact of not having it.\nThis is the subject of [Chapter -@sec-sensitivity].\n\nUnder some lucky circumstances, you can also use a *proxy* confounder [@miao2018].\nA proxy confounder is a variable closely related to the confounder such that controlling for it controls for some of the effects of the missing variable.\nConsider an expansion of the fundamental confounding relationship where `q` has a cause, `p`, as in @fig-proxy-confounder.\nTechnically, if we don't have `q`, we can't close the backdoor path, and our effect will be biased.\nPractically, though, if `p` is highly correlated with `q`, it can serve as a method to reduce the confounding from `q`.\nYou can think of `p` as a mismeasured version of `q`; it will seldom wholly control for the bias via `q`, but it can help minimize it.\n\n\n::: {.cell}\n\n```{.r .cell-code}\ndagify(\n y ~ x + q,\n x ~ q,\n q ~ p,\n coords = time_ordered_coords()\n) |> \n ggdag(edge_type = \"arc\")\n```\n\n::: 
{.cell-output-display}\n![A DAG with a confounder, `q`, and a proxy confounder, `p`. The true adjustment set is `q`. Since `p` causes `q`, it contains information about `q` and can reduce the bias if we don't have `q` measured.](chapter-05_files/figure-html/fig-proxy-confounder-1.png){#fig-proxy-confounder width=432}\n:::\n:::\n\n\n### Saturate your DAG, then prune\n\nIn discussing @tbl-dag-properties, we mentioned *saturated* DAGs.\nThese are DAGs where all possible arrows are included based on the time ordering, e.g., every variable causes variables that come after it in time.\n\n*Not* including an arrow is a bigger assumption than including one.\nIn other words, your default should be to have an arrow from one variable to a future variable.\nThis default is counterintuitive to many people.\nHow can it be that we need to be so careful about assessing causal effects yet be so liberal in applying causal assumptions in the DAG?\nThe answer to this lies in the strength and prevalence of the cause.\nTechnically, an arrow present means that *for at least a single observation*, the prior node causes the following node.\nThe arrow similarly says nothing about the strength of the relationship.\nSo, a minuscule causal effect on a single individual justifies the presence of an arrow.\nIn practice, such a case is probably not relevant.\nThere is *effectively* no arrow.\n\nThe more significant point, though, is that you should feel confident to add an arrow.\nThe bar for justification is much lower than you think.\nInstead, it's helpful to 1) determine your time ordering, 2) saturate the DAG, and 3) prune out implausible arrows.\n\nLet's experiment by working through a saturated version of the podcast-exam DAG.\n\nFirst, the time-ordering.\nPresumably, the student's sense of humor far predates the day of the exam.\nMood in the morning, too, predates listening to the podcast or exam score, as does preparation.\nThe saturated DAG given this ordering is:\n\n\n::: {.cell}\n::: {.cell-output-display}\n![A saturated version of `podcast_dag`: variables have all possible arrows going forward to other variables over time.](chapter-05_files/figure-html/fig-podcast_dag_sat-1.png){#fig-podcast_dag_sat width=528}\n:::\n:::\n\n\nThere are a few new arrows here.\nHumor now causes the other two confounders, as well as exam score.\nSome of them make sense.\nSense of humor probably affects mood for some people.\nWhat about preparedness?\nThis relationship seems a little less plausible.\nSimilarly, we know that a sense of humor does not affect exam scores in this case because the grading is blinded.\nLet's prune those two.\n\n\n::: {.cell}\n::: {.cell-output-display}\n![A pruned version of @fig-podcast_dag_sat: we've removed implausible arrows from the fully saturated DAGs.](chapter-05_files/figure-html/fig-podcast_dag_pruned-1.png){#fig-podcast_dag_pruned width=528}\n:::\n:::\n\n\nThis DAG seems more reasonable.\nSo, was our original DAG wrong?\nThat depends on several factors.\nNotably, both DAGs produce the same adjustment set: controlling for `mood` and `prepared` will give us an unbiased effect if either DAG is correct.\nEven if the new DAG were to produce a different adjustment set, whether the result is meaningfully different depends on the strength of the confounding.\n\n### Include instruments and precision variables\n\nTechnically, you do not need to include instrumental and precision variables in your DAG.\nThe adjustment sets will be the same with and without them.\nHowever, adding them is helpful for two 
reasons.\nFirstly, they demonstrate your assumptions about their relationships and the variables under study.\nAs discussed above, *not* including an arrow is a more significant assumption than including one, so it's valuable information about how you think the causal structure operates.\nSecondly, it impacts your modeling decision.\nYou should always include precision variables in your model to reduce variability in your estimate so it helps you identify those.\nInstruments are also helpful to see because they may guide alternative or complementary modeling strategies, as we'll discuss in @sec-evidence.\n\n### Focus on the causal structure, then consider measurement bias\n\nAs we saw above, missingness and measurement error can be a source of bias.\nAs we'll see in [Chapter -@sec-missingness], we have several strategies to approach such a situation.\nYet, almost everything we measure is inaccurate to some degree.\nThe true DAG for the data at hand inherently conditions on the measured version of variables.\nIn that sense, your data are always subtly-wrong, a sort of unreliable narrator.\nWhen should we include this information in the DAG?\nWe recommend first focusing on the causal structure of the DAG as if you had perfectly measured each variable [@hernan2021].\nThen, consider how mismeasurement and missingness might affect the realized data, particularly related to the exposure, outcome, and critical confounders.\nYou may prefer to present this as an alternative DAG to consider strategies for addressing the bias arising from those sources, e.g., imputation or sensitivity analyses.\nAfter all, the DAG in \\@ fig-error_dag makes you think the question is unanswerable because we have no method to close all backdoor paths.\nAs with all open paths, that depends on the severity of the bias and our ability to reckon with it.\n\n\n\n\n\n### Pick adjustment sets most likely to be successful\n\nOne area where measurement error is an important consideration is when picking an adjustment set.\nIn theory, if a DAG is correct, any adjustment set will work to create an unbiased result.\nIn practice, variables have different levels of quality.\nPick an adjustment set most likely to succeed because it contains accurate variables.\nSimilarly, non-minimal adjustment sets are helpful to consider because, together, several variables with measurement error along a backdoor path may be enough to minimize the practical bias resulting from that path.\n\nWhat if you don't have certain critical variables measured and thus do not have a valid adjustment set?\nIn that case, you should pick the adjustment set with the best chance of minimizing the bias from other backdoor paths.\nAll is not lost if you don't have every confounder measured: get the highest quality estimate you can, then conduct a sensitivity analysis about the unmeasured variables to understand the impact.\n\n### Use robustness checks\n\nFinally, we recommend checking your DAG for robustness.\nYou can never verify the correctness of your DAG under most conditions, but you can use the implications in your DAG to support it.\nThree types of robustness checks can be helpful depending on the circumstances.\n\n1. **Negative controls** [@Lipsitch2010]. These come in two flavors: negative exposure controls and negative outcome controls. The idea is to find something associated with one but not the other, e.g., the outcome but not the exposure, so there should be no effect. 
Since there should be no effect, you now have a measurement for how well you control for *other* effects (e.g., the difference from null). Ideally, the confounders for negative controls are similar to the research question.\n2. **DAG-data consistency** [@Textor2016]. Negative controls are an implication of your DAG. An extension of this idea is that there are *many* such implications. Because blocking a path removes statistical dependencies from that path, you can check those assumptions in several places in your DAG.\n3. **Alternate adjustment sets**. Adjustment sets should give roughly the same answer because, outside of random and measurement errors, they are all sets that block backdoor paths. If more than one adjustment set seems reasonable, you can use that as a sensitivity analysis by checking multiple models.\n\nWe'll discuss these in detail in [Chapter -@sec-sensitivity].\nThe caveat here is that these should be complementary to your initial DAG, not a way of *replacing* it.\nIn fact, if you use more than one adjustment set during your analysis, you should report the results from all of them to avoid overfitting your results to your data.\n", + "markdown": "# Expressing causal questions as DAGs {#sec-dags}\n\n\n\n\n\n## Visualizing Causal Assumptions\n\nSo you think correlation isn't causation?\nYou better be ready to name some proposed confounders!\nIn @sec-assump we discuss several assumptions that allow us to estimate unbiased causal effects with our current statistical tools; one of the main assumptions is *exchangeability*, also known as \"no unmeasured confounders\".\nThere is no statistical test that can confirm whether this assumption is met.\nInstead, we often use domain knowledge to construct an assumed world view of how different measured (or unmeasured) factors interact, and then *assuming that world view is correct* determine whether the proposed analysis included any unmeasured confounders.\nOne way to communicate one's world view with respect to how different factors interact is via a directed acyclic graph (DAG).\n\n::: callout-tip\n## Jargon\n\nWhy is it called a *directed acyclic graph*?\nLet's start with the third word: **graph**.\nA graph, as defined here, is a collection of *nodes* (sometimes these are called *vertices*) and *edges* that connect the nodes.\nIn mathematics, there is a whole field called *graph theory* which studies these graphs.\nFor our use, each *node* is a variable or factor, for example the exposure would be a node and likewise the outcome would be a node.\nEdges between nodes are depicted with *arrows* that imply causality.\nSo if we think the exposure causes the outcome we would draw an arrow from the exposure to the outcome.\nThis one way connection between nodes (*from* causes *to* effects) makes this graph **directed**.\nFinally, **acyclic** refers to the fact that there are no cycles or loops in the graph.\nThis makes sense because when thinking about causes and effects, loops are not possible without breaking the space-time continuum.\nOften when a DAG erroneously includes a \"loop\" it is because the analyst did not appropriately consider the timing of the factors in question.\n:::\n\nFor example, @fig-dag-ex, adapted from @mcgowan2023causal shows a sample DAG that suggests that `cause` causes `effect` and `other cause` causes both cause and effect.\n\n\n::: {.cell}\n::: {.cell-output-display}\n![Example DAG adapted from @mcgowan2023causal. Here, there are three nodes representing three factors: `cause`, `other cause`, and `effect`. 
The arrows demonstrate the causal relationships between these factors such that `cause` causes `effect` and `other cause` causes both `cause` and `effect`.](chapter-05_files/figure-html/fig-dag-ex-1.png){#fig-dag-ex width=672}\n:::\n:::\n\n\nThinking about any three nodes in a DAG, there are three ways they can be connected: via a *fork*, a *chain*, or a *collider*.\nExamining @fig-dag-3, the fork describes the scenario where the `q` node has two arrows extending from it, one pointing to `x` and one pointing to `y`.\nThis implies that `q` causes both `x` and `y`.\nIf `x` were an exposure and `y` an outcome, `q` would be a classic *confounder*.\nIf we were trying to quantify the causal relationship between `x` and `y`, the forking caused by `q` provides a potential *backdoor path* from `x` to `y` that could lead to a spurious estimate of the relationship between `x` and `y` if not accounted for (sometimes we refer to this \"accounting\" as closing the backdoor path).\nAssociations can flow through forks.\nThe second panel of @fig-dag-3 displays a chain.\nHere, the `x` node has an arrow to the `q` node, which in turn has an arrow to the `y` node.\nIf `x` were an exposure and `y` an outcome, `q` would be a classic *mediator*.\nIn the final panel of @fig-dag-3, we see the collider.\nHere `y` has an arrow to `q` and `x` has an arrow to `q`.\nColliders *block* backdoor paths.\nOpposite from confounders, adjusting for a collider can actually *open* a backdoor path.\n\n\n::: {.cell}\n::: {.cell-output-display}\n![Three types of paths connecting three nodes, `x`, `y`, and `q`: a fork, a chain, and a collider.](chapter-05_files/figure-html/fig-dag-3-1.png){#fig-dag-3 width=960}\n:::\n:::\n\n\nOften, the goal in causal inference is to quantify the relationship between some exposure and some outcome.\nOne way this estimate can be biased is if we are actually reporting the *correlation* between these two factors rather than the *causal relationship*.\nThinking about these DAGs, spurious correlations can often be attributed to *open backdoor paths*, i.e., relationships between other factors and our exposure and outcome of interest.\nOne way to think about these paths is to conceptualize associations *flowing* along the paths.\nWithout adjustment, associations flow through forks and chains, and are *blocked* by colliders.\nWith adjustment, the opposite is true: associations are blocked by forks and chains if the node in question is adjusted for; however, an association *will* flow through a collider if it is adjusted for.\n\nWhile not strictly necessary, we recommend that DAGs are *time-ordered* from left to right.\nThis helps the analyst ensure that they are not erroneously making assumptions that violate the space-time continuum (for example, it is impossible for something in the future to cause something from the past).\n\nHow do we use these in practice?\nThe basic idea is:\n\n1. Specify your causal question\n2. 
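Using domain knowledge:\n\n- Write all relevant variables as *nodes*\n- Draw causal pathways as arrows (*edges*)\n\nAs a preview of the next section, here is a minimal sketch of how the three structures in @fig-dag-3 could be written down in code; the object names (`fork`, `chain`, and `collider`) are our own, and the quick default plots stand in for the polished figure above.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(ggdag)\n\n# fork: q causes both x and y (q is a confounder of the x-y relationship)\nfork <- dagify(x ~ q, y ~ q)\n\n# chain: x causes q, which in turn causes y (q is a mediator)\nchain <- dagify(q ~ x, y ~ q)\n\n# collider: both x and y cause q\ncollider <- dagify(q ~ x + y)\n\nggdag(fork)\nggdag(chain)\nggdag(collider)\n```\n:::\n\n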
## DAGs in R\n\nLet's begin by specifying our causal question: Does listening to a comedy podcast the morning before an exam improve graduate students' test scores?\nWe can diagram this using the method described in @sec-diag (@fig-podcast).\n\n\n::: {.cell}\n::: {.cell-output-display}\n![Does listening to a comedy podcast the morning before an exam improve graduate students' test scores?](chapter-05_files/figure-html/fig-podcast-1.png){#fig-podcast width=672}\n:::\n:::\n\n\nThe first step is to specify your DAG.\nIn R, we can use the {ggdag} package for this along with the `dagify()` function.\nThe `dagify()` function takes formulas, separated by commas, that specify cause and effect, with the left element of the formula specifying the effect and the right element specifying all of the factors that cause it.\nWhat are all of the factors that \"cause\" graduate students to listen to a podcast the morning before an exam?\nWhat are all of the factors that could \"cause\" a graduate student to do well on a test?\nLet's posit some here.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(ggdag)\ndagify(\n  podcast ~ mood + humor + prepared,\n  exam ~ mood + prepared\n)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\ndag {\nexam\nhumor\nmood\npodcast\nprepared\nhumor -> podcast\nmood -> exam\nmood -> podcast\nprepared -> exam\nprepared -> podcast\n}\n```\n:::\n:::\n\n\nIn the code chunk above, we have posited that a graduate student's mood, sense of humor, and how prepared they feel for the exam could influence whether they listened to a podcast the morning of the exam.\nLikewise, we posit that their mood and how prepared they are also influence their exam score.\nNotice we *do not* see `podcast` in the `exam` equation -- this means that we assume that there is no causal relationship between `podcast` and the exam score.\nWe can add additional arguments to `dagify()`, for example, we can time-order the coordinates, tag the exposure and outcome, and add labels.\nWe can save this `dagify` object and use the `ggdag()` function to visualize this DAG.\nThis function is a wrapper for a `ggplot2` call, meaning we can add layers like we would to a ggplot object.\nFor example, we can update the theme by adding `+ theme_dag()`.\n\n\n::: {.cell}\n\n```{.r .cell-code}\npodcast_dag <- dagify(\n  podcast ~ mood + humor + prepared,\n  exam ~ mood + prepared,\n  coords = time_ordered_coords(),\n  exposure = \"podcast\",\n  outcome = \"exam\",\n  labels = c(\n    podcast = \"podcast\",\n    exam = \"exam score\",\n    mood = \"mood\",\n    humor = \"humor\",\n    prepared = \"prepared\"\n  )\n)\nggdag(podcast_dag, use_labels = \"label\", text = FALSE) + \n  theme_dag()\n```\n\n::: {.cell-output-display}\n![Proposed DAG to answer the question: Does listening to a comedy podcast the morning before an exam improve graduate students' test scores?](chapter-05_files/figure-html/fig-dag-podcast-1.png){#fig-dag-podcast width=672}\n:::\n:::\n\n\nIn the previous section, we discussed *backdoor paths*.\nThese are paths between factors that could potentially lead us to draw spurious conclusions about the relationship between our exposure and outcome.\nThe `ggdag_paths()` function will help us identify potential backdoor paths.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nggdag_paths(podcast_dag)\n```\n\n::: {.cell-output-display}\n![](chapter-05_files/figure-html/fig-paths-1.png){#fig-paths width=672}\n:::\n:::\n\n\nIn @fig-paths we see two open paths, one 
through `mood` and one through `prepared`.\nThis tells us we need to find a way to account for these open, non-causal paths.\nSome ways to do this include:\n\n- Randomization\n- Stratification, adjustment, weighting, matching, etc.\n\nIn this particular scenario, randomization is likely not possible.\nIt would be challenging to justify randomizing students to listening to a full podcast prior to taking an exam (and it would likely have lots of issues with adherence!).\nStratification is similar to what we demonstrated in @sec-group-sum.\nWe could stratify the students into all possible mood and prepared categories and analyze the causal effect within each stratum -- again, this might pose challenges depending on the sample size and the number of categories we believe exists in each of these factors.\nThe next section will dive into some of these other tools we could use to account for these potential backdoor paths.\n\nThe {ggdag} package can also help us identify adjustment sets.\nIn this particular example, this yields the same result as above, since we need to adjust for both `mood` and `prepared` in order to close the backdoor paths.\nIt is possible, however, to have different ways to close backdoor paths depending on the number of factors and complexity causal relationships assumed.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nggdag_adjustment_set(\n podcast_dag,\n use_labels = \"label\",\n text = FALSE\n)\n```\n\n::: {.cell-output-display}\n![](chapter-05_files/figure-html/unnamed-chunk-8-1.png){width=672}\n:::\n:::\n\n\nUsing our proposed DAG, let's simulate some data to see how this might occur in practice!\n\n\n::: {.cell}\n\n```{.r .cell-code}\nset.seed(10)\nsim_data <- podcast_dag |>\n simulate_data()\n```\n:::\n\n::: {.cell}\n\n```{.r .cell-code}\nsim_data\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n# A tibble: 500 × 5\n exam humor mood podcast prepared\n \n 1 -0.435 0.263 -0.100 -0.630 1.07 \n 2 -0.593 0.317 0.143 -1.55 0.0640\n 3 0.786 1.97 -0.591 -0.318 -0.439 \n 4 -0.103 2.86 -0.139 1.07 0.754 \n 5 -0.614 -2.39 0.702 0.464 0.356 \n 6 1.01 1.21 0.910 0.769 0.561 \n 7 0.167 -1.37 -0.559 -0.866 0.214 \n 8 1.16 0.164 -0.743 0.969 -1.67 \n 9 0.650 0.215 -0.248 0.691 -0.303 \n10 0.156 0.713 1.19 -1.02 -0.219 \n# ℹ 490 more rows\n```\n:::\n:::\n\n\nSince we have simulated this data, we know that this is a case where *standard methods will succeed* (see @sec-standard), and therefore can estimate the causal effect using a basic linear regression model.\n@fig-dag-sim shows a forest plot of the simulated data based on our DAG.\nNotice the model that only included the exposure resulted in a spurious effect (an estimate of -0.1 when we know the truth is 0), whereas the model that adjusted for the two variables as suggested by `ggdag_adjustment_set()` is not spurious (0.0).\n\n\n::: {.cell}\n\n```{.r .cell-code}\n## Model that does not close backdoor paths\nunadjusted_model <- lm(exam ~ podcast, sim_data) |>\n broom::tidy(conf.int = TRUE) |>\n dplyr::filter(term == \"podcast\") |>\n mutate(formula = \"podcast\")\n\n## Model that closes backdoor paths\nadjusted_model <- lm(exam ~ podcast + mood + prepared, sim_data) |>\n broom::tidy(conf.int = TRUE) |>\n dplyr::filter(term == \"podcast\") |>\n mutate(formula = \"podcast + mood + prepared\")\n\nbind_rows(\n unadjusted_model,\n adjusted_model\n) |> \n ggplot(aes(x = estimate, y = formula, xmin = conf.low, xmax = conf.high)) +\n geom_vline(xintercept = 0, linewidth = 1, color = \"grey80\") + \n geom_pointrange(fatten = 3, size = 1) +\n 
theme_minimal(18) +\n labs(\n y = NULL,\n caption = \"correct effect size: 0\"\n )\n```\n\n::: {.cell-output-display}\n![Forest plot of simulated data based on the DAG described in @fig-dag-podcast](chapter-05_files/figure-html/fig-dag-sim-1.png){#fig-dag-sim width=672}\n:::\n:::\n\n\n\n\n\n\n\n", "supporting": [ "chapter-05_files" ], diff --git a/_freeze/chapters/chapter-05/figure-html/fig-adustment-set-all-1.png b/_freeze/chapters/chapter-05/figure-html/fig-adustment-set-all-1.png deleted file mode 100644 index 7cde600..0000000 Binary files a/_freeze/chapters/chapter-05/figure-html/fig-adustment-set-all-1.png and /dev/null differ diff --git a/_freeze/chapters/chapter-05/figure-html/fig-butterfly_bias-1.png b/_freeze/chapters/chapter-05/figure-html/fig-butterfly_bias-1.png deleted file mode 100644 index 3627507..0000000 Binary files a/_freeze/chapters/chapter-05/figure-html/fig-butterfly_bias-1.png and /dev/null differ diff --git a/_freeze/chapters/chapter-05/figure-html/fig-case-control-1.png b/_freeze/chapters/chapter-05/figure-html/fig-case-control-1.png deleted file mode 100644 index 14929a7..0000000 Binary files a/_freeze/chapters/chapter-05/figure-html/fig-case-control-1.png and /dev/null differ diff --git a/_freeze/chapters/chapter-05/figure-html/fig-collider-scatter-1.png b/_freeze/chapters/chapter-05/figure-html/fig-collider-scatter-1.png deleted file mode 100644 index 95714a7..0000000 Binary files a/_freeze/chapters/chapter-05/figure-html/fig-collider-scatter-1.png and /dev/null differ diff --git a/_freeze/chapters/chapter-05/figure-html/fig-collider-time-1.png b/_freeze/chapters/chapter-05/figure-html/fig-collider-time-1.png deleted file mode 100644 index 277b63d..0000000 Binary files a/_freeze/chapters/chapter-05/figure-html/fig-collider-time-1.png and /dev/null differ diff --git a/_freeze/chapters/chapter-05/figure-html/fig-confounder-scatter-1.png b/_freeze/chapters/chapter-05/figure-html/fig-confounder-scatter-1.png deleted file mode 100644 index c55b6e8..0000000 Binary files a/_freeze/chapters/chapter-05/figure-html/fig-confounder-scatter-1.png and /dev/null differ diff --git a/_freeze/chapters/chapter-05/figure-html/fig-dag-3-1.png b/_freeze/chapters/chapter-05/figure-html/fig-dag-3-1.png new file mode 100644 index 0000000..b220025 Binary files /dev/null and b/_freeze/chapters/chapter-05/figure-html/fig-dag-3-1.png differ diff --git a/_freeze/chapters/chapter-05/figure-html/fig-dag-basic-1.png b/_freeze/chapters/chapter-05/figure-html/fig-dag-basic-1.png deleted file mode 100644 index 6d87ef6..0000000 Binary files a/_freeze/chapters/chapter-05/figure-html/fig-dag-basic-1.png and /dev/null differ diff --git a/_freeze/chapters/chapter-05/figure-html/fig-dag-ex-1.png b/_freeze/chapters/chapter-05/figure-html/fig-dag-ex-1.png new file mode 100644 index 0000000..87a5e9e Binary files /dev/null and b/_freeze/chapters/chapter-05/figure-html/fig-dag-ex-1.png differ diff --git a/_freeze/chapters/chapter-05/figure-html/fig-dag-path-types-1.png b/_freeze/chapters/chapter-05/figure-html/fig-dag-path-types-1.png deleted file mode 100644 index 686fcee..0000000 Binary files a/_freeze/chapters/chapter-05/figure-html/fig-dag-path-types-1.png and /dev/null differ diff --git a/_freeze/chapters/chapter-05/figure-html/fig-dag-podcast-1.png b/_freeze/chapters/chapter-05/figure-html/fig-dag-podcast-1.png index 4845ec0..d82a3a4 100644 Binary files a/_freeze/chapters/chapter-05/figure-html/fig-dag-podcast-1.png and b/_freeze/chapters/chapter-05/figure-html/fig-dag-podcast-1.png differ diff --git 
a/_freeze/chapters/chapter-05/figure-html/fig-dag-usage-1.png b/_freeze/chapters/chapter-05/figure-html/fig-dag-usage-1.png deleted file mode 100644 index f0ae309..0000000 Binary files a/_freeze/chapters/chapter-05/figure-html/fig-dag-usage-1.png and /dev/null differ diff --git a/_freeze/chapters/chapter-05/figure-html/fig-diag-1.png b/_freeze/chapters/chapter-05/figure-html/fig-diag-1.png new file mode 100644 index 0000000..669a5d9 Binary files /dev/null and b/_freeze/chapters/chapter-05/figure-html/fig-diag-1.png differ diff --git a/_freeze/chapters/chapter-05/figure-html/fig-diag-2-1.png b/_freeze/chapters/chapter-05/figure-html/fig-diag-2-1.png new file mode 100644 index 0000000..53fa42c Binary files /dev/null and b/_freeze/chapters/chapter-05/figure-html/fig-diag-2-1.png differ diff --git a/_freeze/chapters/chapter-05/figure-html/fig-error_dag-1.png b/_freeze/chapters/chapter-05/figure-html/fig-error_dag-1.png deleted file mode 100644 index ced6423..0000000 Binary files a/_freeze/chapters/chapter-05/figure-html/fig-error_dag-1.png and /dev/null differ diff --git a/_freeze/chapters/chapter-05/figure-html/fig-feedback-loop-1.png b/_freeze/chapters/chapter-05/figure-html/fig-feedback-loop-1.png deleted file mode 100644 index 6119267..0000000 Binary files a/_freeze/chapters/chapter-05/figure-html/fig-feedback-loop-1.png and /dev/null differ diff --git a/_freeze/chapters/chapter-05/figure-html/fig-feedforward-1.png b/_freeze/chapters/chapter-05/figure-html/fig-feedforward-1.png deleted file mode 100644 index db5fb99..0000000 Binary files a/_freeze/chapters/chapter-05/figure-html/fig-feedforward-1.png and /dev/null differ diff --git a/_freeze/chapters/chapter-05/figure-html/fig-m-bias-1.png b/_freeze/chapters/chapter-05/figure-html/fig-m-bias-1.png deleted file mode 100644 index ebf8d27..0000000 Binary files a/_freeze/chapters/chapter-05/figure-html/fig-m-bias-1.png and /dev/null differ diff --git a/_freeze/chapters/chapter-05/figure-html/fig-mediator-scatter-1.png b/_freeze/chapters/chapter-05/figure-html/fig-mediator-scatter-1.png deleted file mode 100644 index ad8b7f0..0000000 Binary files a/_freeze/chapters/chapter-05/figure-html/fig-mediator-scatter-1.png and /dev/null differ diff --git a/_freeze/chapters/chapter-05/figure-html/fig-paths-1.png b/_freeze/chapters/chapter-05/figure-html/fig-paths-1.png new file mode 100644 index 0000000..1dac6ff Binary files /dev/null and b/_freeze/chapters/chapter-05/figure-html/fig-paths-1.png differ diff --git a/_freeze/chapters/chapter-05/figure-html/fig-paths-podcast-1.png b/_freeze/chapters/chapter-05/figure-html/fig-paths-podcast-1.png deleted file mode 100644 index ac37e60..0000000 Binary files a/_freeze/chapters/chapter-05/figure-html/fig-paths-podcast-1.png and /dev/null differ diff --git a/_freeze/chapters/chapter-05/figure-html/fig-podcast-1.png b/_freeze/chapters/chapter-05/figure-html/fig-podcast-1.png new file mode 100644 index 0000000..d42e03a Binary files /dev/null and b/_freeze/chapters/chapter-05/figure-html/fig-podcast-1.png differ diff --git a/_freeze/chapters/chapter-05/figure-html/fig-podcast-adustment-set-1.png b/_freeze/chapters/chapter-05/figure-html/fig-podcast-adustment-set-1.png deleted file mode 100644 index a56733e..0000000 Binary files a/_freeze/chapters/chapter-05/figure-html/fig-podcast-adustment-set-1.png and /dev/null differ diff --git a/_freeze/chapters/chapter-05/figure-html/fig-podcast_dag2-1.png b/_freeze/chapters/chapter-05/figure-html/fig-podcast_dag2-1.png deleted file mode 100644 index 2b99a4c..0000000 Binary 
files a/_freeze/chapters/chapter-05/figure-html/fig-podcast_dag2-1.png and /dev/null differ diff --git a/_freeze/chapters/chapter-05/figure-html/fig-podcast_dag2-paths-1.png b/_freeze/chapters/chapter-05/figure-html/fig-podcast_dag2-paths-1.png deleted file mode 100644 index f01e7fe..0000000 Binary files a/_freeze/chapters/chapter-05/figure-html/fig-podcast_dag2-paths-1.png and /dev/null differ diff --git a/_freeze/chapters/chapter-05/figure-html/fig-podcast_dag2-set-1.png b/_freeze/chapters/chapter-05/figure-html/fig-podcast_dag2-set-1.png deleted file mode 100644 index ca9ce85..0000000 Binary files a/_freeze/chapters/chapter-05/figure-html/fig-podcast_dag2-set-1.png and /dev/null differ diff --git a/_freeze/chapters/chapter-05/figure-html/fig-podcast_dag3-1.png b/_freeze/chapters/chapter-05/figure-html/fig-podcast_dag3-1.png deleted file mode 100644 index 1ecae7a..0000000 Binary files a/_freeze/chapters/chapter-05/figure-html/fig-podcast_dag3-1.png and /dev/null differ diff --git a/_freeze/chapters/chapter-05/figure-html/fig-podcast_dag3-as-1.png b/_freeze/chapters/chapter-05/figure-html/fig-podcast_dag3-as-1.png deleted file mode 100644 index e6d52ac..0000000 Binary files a/_freeze/chapters/chapter-05/figure-html/fig-podcast_dag3-as-1.png and /dev/null differ diff --git a/_freeze/chapters/chapter-05/figure-html/fig-podcast_dag3-direct-1.png b/_freeze/chapters/chapter-05/figure-html/fig-podcast_dag3-direct-1.png deleted file mode 100644 index 6f6d49e..0000000 Binary files a/_freeze/chapters/chapter-05/figure-html/fig-podcast_dag3-direct-1.png and /dev/null differ diff --git a/_freeze/chapters/chapter-05/figure-html/fig-podcast_dag4-1.png b/_freeze/chapters/chapter-05/figure-html/fig-podcast_dag4-1.png deleted file mode 100644 index 8df035c..0000000 Binary files a/_freeze/chapters/chapter-05/figure-html/fig-podcast_dag4-1.png and /dev/null differ diff --git a/_freeze/chapters/chapter-05/figure-html/fig-podcast_dag4-as-1.png b/_freeze/chapters/chapter-05/figure-html/fig-podcast_dag4-as-1.png deleted file mode 100644 index d0d643d..0000000 Binary files a/_freeze/chapters/chapter-05/figure-html/fig-podcast_dag4-as-1.png and /dev/null differ diff --git a/_freeze/chapters/chapter-05/figure-html/fig-podcast_dag5-1.png b/_freeze/chapters/chapter-05/figure-html/fig-podcast_dag5-1.png deleted file mode 100644 index bb0461c..0000000 Binary files a/_freeze/chapters/chapter-05/figure-html/fig-podcast_dag5-1.png and /dev/null differ diff --git a/_freeze/chapters/chapter-05/figure-html/fig-podcast_dag_pruned-1.png b/_freeze/chapters/chapter-05/figure-html/fig-podcast_dag_pruned-1.png deleted file mode 100644 index 6e21b21..0000000 Binary files a/_freeze/chapters/chapter-05/figure-html/fig-podcast_dag_pruned-1.png and /dev/null differ diff --git a/_freeze/chapters/chapter-05/figure-html/fig-podcast_dag_sat-1.png b/_freeze/chapters/chapter-05/figure-html/fig-podcast_dag_sat-1.png deleted file mode 100644 index f993a6b..0000000 Binary files a/_freeze/chapters/chapter-05/figure-html/fig-podcast_dag_sat-1.png and /dev/null differ diff --git a/_freeze/chapters/chapter-05/figure-html/fig-proxy-confounder-1.png b/_freeze/chapters/chapter-05/figure-html/fig-proxy-confounder-1.png deleted file mode 100644 index 7ffbda6..0000000 Binary files a/_freeze/chapters/chapter-05/figure-html/fig-proxy-confounder-1.png and /dev/null differ diff --git a/_freeze/chapters/chapter-05/figure-html/unnamed-chunk-14-1.png b/_freeze/chapters/chapter-05/figure-html/unnamed-chunk-14-1.png deleted file mode 100644 index 
24bc661..0000000 Binary files a/_freeze/chapters/chapter-05/figure-html/unnamed-chunk-14-1.png and /dev/null differ diff --git a/_freeze/chapters/chapter-05/figure-html/unnamed-chunk-15-1.png b/_freeze/chapters/chapter-05/figure-html/unnamed-chunk-15-1.png deleted file mode 100644 index 8d38e80..0000000 Binary files a/_freeze/chapters/chapter-05/figure-html/unnamed-chunk-15-1.png and /dev/null differ diff --git a/_freeze/chapters/chapter-05/figure-html/unnamed-chunk-16-1.png b/_freeze/chapters/chapter-05/figure-html/unnamed-chunk-16-1.png deleted file mode 100644 index e18ed7a..0000000 Binary files a/_freeze/chapters/chapter-05/figure-html/unnamed-chunk-16-1.png and /dev/null differ diff --git a/_freeze/chapters/chapter-05/figure-html/unnamed-chunk-22-1.png b/_freeze/chapters/chapter-05/figure-html/unnamed-chunk-22-1.png deleted file mode 100644 index 4590d31..0000000 Binary files a/_freeze/chapters/chapter-05/figure-html/unnamed-chunk-22-1.png and /dev/null differ diff --git a/_freeze/chapters/chapter-05/figure-html/unnamed-chunk-6-1.png b/_freeze/chapters/chapter-05/figure-html/unnamed-chunk-6-1.png new file mode 100644 index 0000000..6c3aa10 Binary files /dev/null and b/_freeze/chapters/chapter-05/figure-html/unnamed-chunk-6-1.png differ diff --git a/_freeze/chapters/chapter-05/figure-html/unnamed-chunk-7-1.png b/_freeze/chapters/chapter-05/figure-html/unnamed-chunk-7-1.png new file mode 100644 index 0000000..5723ea0 Binary files /dev/null and b/_freeze/chapters/chapter-05/figure-html/unnamed-chunk-7-1.png differ diff --git a/_freeze/chapters/chapter-05/figure-html/unnamed-chunk-8-1.png b/_freeze/chapters/chapter-05/figure-html/unnamed-chunk-8-1.png new file mode 100644 index 0000000..f483af8 Binary files /dev/null and b/_freeze/chapters/chapter-05/figure-html/unnamed-chunk-8-1.png differ diff --git a/_freeze/chapters/chapter-06/execute-results/html.json b/_freeze/chapters/chapter-06/execute-results/html.json index a520434..e0797ff 100644 --- a/_freeze/chapters/chapter-06/execute-results/html.json +++ b/_freeze/chapters/chapter-06/execute-results/html.json @@ -1,8 +1,10 @@ { - "hash": "c07f319cac90f08ee55712f765baf9e6", + "hash": "40666dd337f67cd3f560e2c5d5a445c0", "result": { - "markdown": "# Causal Inference is not (just) a statistical problem {#sec-quartets}\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nrnorm(5)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] 1.0683 -1.2592 1.4183 -0.4592 0.4267\n```\n\n\n:::\n:::\n\n\n## Causal and Predictive Models, Revisited {#sec-causal-pred-revisit}\n\n\n\n\n\n\n", - "supporting": [], + "markdown": "# Preparing data to answer causal questions {#sec-data-causal}\n\n\n\n\n\n## Introduction to the data {#sec-data}\n\nThroughout this book we will be using data obtained from [Touring Plans](https://touringplans.com).\nTouring Plans is a company that helps folks plan their trips to Disney and Universal theme parks.\nOne of their goals is to accurately predict attraction wait times at these theme parks by leveraging data and statistical modeling.\nThe `{touringplans}` R package includes several datasets containing information about Disney theme park attractions.\nA summary of the attractions included in the package can be found by running the following:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(touringplans)\nattractions_metadata\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n# A tibble: 14 × 8\n dataset_name name short_name park land opened_on \n \n 1 alien_sauce… Alie… Alien Sau… Disn… Toy … 2018-06-30\n 2 dinosaur DINO… DINOSAUR 
Disn… Dino… 1998-04-22\n 3 expedition_… Expe… Expeditio… Disn… Asia 2006-04-07\n 4 flight_of_p… Avat… Flight of… Disn… Pand… 2017-05-27\n 5 kilimanjaro… Kili… Kilimanja… Disn… Afri… 1998-04-22\n 6 navi_river Na'v… Na'vi Riv… Disn… Pand… 2017-05-27\n 7 pirates_of_… Pira… Pirates o… Magi… Adve… 1973-12-17\n 8 rock_n_roll… Rock… Rock Coas… Disn… Suns… 1999-07-29\n 9 seven_dwarf… Seve… 7 Dwarfs … Magi… Fant… 2014-05-28\n10 slinky_dog Slin… Slinky Dog Disn… Toy … 2018-06-30\n11 soarin Soar… Soarin' Epcot Worl… 2005-05-05\n12 spaceship_e… Spac… Spaceship… Epcot Worl… 1982-10-01\n13 splash_moun… Spla… Splash Mo… Magi… Fron… 1992-07-17\n14 toy_story_m… Toy … Toy Story… Disn… Toy … 2008-05-31\n# ℹ 2 more variables: duration ,\n# average_wait_per_hundred \n```\n:::\n:::\n\n\nAdditionally, this package contains a dataset with raw metadata about the parks, with observations recorded daily.\nThis metadata includes information like the Walt Disney World ticket season on the particular day (was it high season -- think Christmas -- or low season -- think right when school started), what the historic temperatures were in the park on that day, and whether there was a special event, such as \"extra magic hours\" in the park on that day (did the park open early to guests staying in the Walt Disney World resorts?).\n\n\n::: {.cell}\n\n```{.r .cell-code}\nparks_metadata_raw\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n# A tibble: 2,079 × 181\n date wdw_ticket_season dayofweek dayofyear\n \n 1 2015-01-01 5 0\n 2 2015-01-02 6 1\n 3 2015-01-03 7 2\n 4 2015-01-04 1 3\n 5 2015-01-05 2 4\n 6 2015-01-06 3 5\n 7 2015-01-07 4 6\n 8 2015-01-08 5 7\n 9 2015-01-09 6 8\n10 2015-01-10 7 9\n# ℹ 2,069 more rows\n# ℹ 177 more variables: weekofyear ,\n# monthofyear , year , season ,\n# holidaypx , holidaym , holidayn ,\n# holiday , wdwticketseason ,\n# wdwracen , wdweventn , wdwevent ,\n# wdwrace , wdwseason , …\n```\n:::\n:::\n\n\nSuppose the causal question of interest is:\n\n**Is there a relationship between whether there were \"Extra Magic Hours\" in the morning at Magic Kingdom and the average wait time for an attraction called the \"Seven Dwarfs Mine Train\" the same day between 9am and 10am in 2018?**\n\nLet's begin by diagramming this causal question (@fig-seven-diag).\n\n\n::: {.cell}\n::: {.cell-output-display}\n![Diagram of the causal question \"Is there a relationship between whether there were \"Extra Magic Hours\" in the morning at Magic Kingdom and the average wait time for an attraction called the \"Seven Dwarfs Mine Train\" the same day between 9am and 10am in 2018?\"](chapter-06_files/figure-html/fig-seven-diag-1.png){#fig-seven-diag width=672}\n:::\n:::\n\n\nHistorically, guests who stayed in a Walt Disney World resort hotel could access the park during \"Extra Magic Hours,\" during which the park was closed to all other guests.\nThese extra hours could be in the morning or evening.\nThe Seven Dwarfs Mine Train is a ride at Walt Disney World's Magic Kingdom.\nMagic Kingdom may or may not be selected each day to have these \"Extra Magic Hours.\" We are interested in examining the relationship between whether there were \"Extra Magic Hours\" in the morning and the average wait time for the Seven Dwarfs Mine Train on the same day between 9 am and 10 am.\nBelow is a proposed DAG for this question.\n\n\n::: {.cell}\n::: {.cell-output-display}\n![Proposed DAG for the relationship between Extra Magic Hours in the morning at a particular park and the average wait time between 9 am and 10 am. 
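Here we are saying that we believe 1) Extra Magic Hours impacts average wait time and 2) both Extra Magic Hours and average wait time are determined by the time the park closes, historic high temperatures, and ticket season.](chapter-06_files/figure-html/fig-dag-magic-1.png){#fig-dag-magic width=672}\n:::\n:::\n\n\nThis proposed DAG can be written down with the same `dagify()` workflow introduced in @sec-dags.\nThe code below is a minimal sketch of that specification; the shorthand node names (for example, `extra_magic_morning` and `wait_minutes`) are our own stand-ins for the labelled nodes in @fig-dag-magic.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(ggdag)\n\n# sketch of the proposed DAG: the exposure (Extra Magic Hours in the morning)\n# and the outcome (average posted wait time) share three proposed confounders:\n# park close time, historic high temperature, and ticket season\nemm_wait_dag <- dagify(\n  wait_minutes ~ extra_magic_morning + park_close + temperature_high + ticket_season,\n  extra_magic_morning ~ park_close + temperature_high + ticket_season,\n  coords = time_ordered_coords(),\n  exposure = \"extra_magic_morning\",\n  outcome = \"wait_minutes\"\n)\n\nggdag(emm_wait_dag)\n\n# the implied adjustment set: all three proposed confounders\nggdag_adjustment_set(emm_wait_dag)\n```\n:::\n\n\n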
Since we are not in charge of Walt Disney World's operations, we cannot randomize dates to have (or not) \"Extra Magic Hours\"; therefore, we need to rely on previously collected observational data and do our best to emulate the *target trial* that we would have created, had it been possible.\nHere, our observations are *days*.\nLooking at the diagram above, we can map each element of the causal question to elements of our target trial protocol:\n\n- **Eligibility criteria**: Days must be from 2018\n- **Exposure definition**: Magic Kingdom had \"Extra Magic Hours\" in the morning\n- **Assignment procedures**: Observed -- if the historic data suggests there were \"Extra Magic Hours\" in the morning on a particular day, that day is classified as \"exposed\"; otherwise it is \"unexposed\"\n- **Follow-up period**: From park open to 10am\n- **Outcome definition**: The average posted wait time between 9am and 10am\n- **Causal contrast of interest**: Average treatment effect (we will discuss this in @sec-estimands)\n- **Analysis plan**: We use inverse probability weighting after fitting a propensity score model to estimate the average treatment effect of the exposure on the outcome of interest. We will adjust for variables as determined by our DAG (@fig-dag-magic)\n\n## Data wrangling and recipes\n\nMost of our data manipulation tools come from the `{dplyr}` package (@tbl-dplyr).\nWe will also use `{lubridate}` to help us manipulate dates.\n\n| Target trial protocol element | {dplyr} functions |\n|-------------------------------|---------------------------------------------|\n| Eligibility criteria | `filter()` |\n| Exposure definition | `mutate()` |\n| Assignment procedures | `mutate()` |\n| Follow-up period | `mutate()` `pivot_longer()` `pivot_wider()` |\n| Outcome definition | `mutate()` |\n| Analysis plan | `select()` `mutate()` |\n\n: Mapping target trial protocol elements to commonly used `{dplyr}` functions {#tbl-dplyr}\n\nTo answer this question, we are going to need to manipulate both the `seven_dwarfs_train` dataset as well as the `parks_metadata_raw` dataset.\nLet's start with the `seven_dwarfs_train` dataset.\nThe Seven Dwarfs Mine Train ride is an attraction at Walt Disney World's Magic Kingdom.\nThe `seven_dwarfs_train` dataset in the {touringplans} package contains information about the date a particular wait time was recorded (`park_date`), the time of the wait time (`wait_datetime`), the actual wait time (`wait_minutes_actual`), and the posted wait time (`wait_minutes_posted`).\nLet's take a look at this dataset.\nThe {skimr} package is great for getting a quick glimpse at a new dataset.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(skimr)\nskim(seven_dwarfs_train)\n```\n\n::: {.cell-output-display}\nTable: Data summary\n\n| | |\n|:------------------------|:------------------|\n|Name |seven_dwarfs_train |\n|Number of rows |321631 |\n|Number of columns |4 |\n|_______________________ | |\n|Column type frequency: | |\n|Date |1 |\n|numeric |2 |\n|POSIXct |1 |\n|________________________ | |\n|Group variables |None |\n\n\n**Variable type: Date**\n\n|skim_variable | n_missing| complete_rate|min |max |median | 
n_unique|\n|:-------------|---------:|-------------:|:----------|:----------|:----------|--------:|\n|park_date | 0| 1|2015-01-01 |2021-12-28 |2018-04-07 | 2334|\n\n\n**Variable type: numeric**\n\n|skim_variable | n_missing| complete_rate| mean| sd| p0| p25| p50| p75| p100|hist |\n|:-------------------|---------:|-------------:|-----:|-------:|------:|---:|---:|---:|----:|:-----|\n|wait_minutes_actual | 313996| 0.02| 23.99| 1064.06| -92918| 21| 31| 46| 217|▁▁▁▁▇ |\n|wait_minutes_posted | 30697| 0.90| 76.96| 33.99| 0| 50| 70| 95| 300|▆▇▁▁▁ |\n\n\n**Variable type: POSIXct**\n\n|skim_variable | n_missing| complete_rate|min |max |median | n_unique|\n|:-------------|---------:|-------------:|:-------------------|:-------------------|:-------------------|--------:|\n|wait_datetime | 0| 1|2015-01-01 07:51:12 |2021-12-28 22:57:34 |2018-04-07 23:14:06 | 321586|\n:::\n:::\n\n\nExamining the output above, we learn that this dataset contains four columns and 321,631 rows.\nWe also learn that the dates span from 2015 to 2021.\nWe can also examine the distribution of each of the variables to detect any potential anomalies.\nNotice anything strange?\nLook at the `p0` (that is the minimum value) for `wait_minutes_actual`.\nIt is `-92918`!\nWe are not using this variable for this analysis, but we will for future analyses, so this is good to keep in mind.\n\nWe need this dataset to calculate our *outcome*.\nRecall from above that our outcome is defined as the average posted wait time between 9am and 10am.\nAdditionally, recall our eligibility criteria states that we need to restrict our analysis to days in 2018.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(dplyr)\nlibrary(lubridate)\nseven_dwarfs_train_2018 <- seven_dwarfs_train |>\n filter(year(park_date) == 2018) |> # eligibility criteria \n mutate(hour = hour(wait_datetime)) |> # get hour from wait\n group_by(park_date, hour) |> # group by date\n summarise(\n wait_minutes_posted_avg = mean(wait_minutes_posted, na.rm = TRUE), \n .groups = \"drop\") |> # get average wait time\n mutate(wait_minutes_posted_avg = \n case_when(is.nan(wait_minutes_posted_avg) ~ NA,\n TRUE ~ wait_minutes_posted_avg)) |> # if it is NAN make it NA \n filter(hour == 9) # only keep the average wait time between 9 and 10\n```\n:::\n\n::: {.cell}\n\n```{.r .cell-code}\nseven_dwarfs_train_2018\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n# A tibble: 362 × 3\n park_date hour wait_minutes_posted_avg\n \n 1 2018-01-01 9 60 \n 2 2018-01-02 9 60 \n 3 2018-01-03 9 60 \n 4 2018-01-04 9 68.9\n 5 2018-01-05 9 70.6\n 6 2018-01-06 9 33.3\n 7 2018-01-07 9 46.4\n 8 2018-01-08 9 69.5\n 9 2018-01-09 9 64.3\n10 2018-01-10 9 74.3\n# ℹ 352 more rows\n```\n:::\n:::\n\n\nNow that we have our outcome settled, we need to get our exposure variable, as well as any other park-specific variables about the day in question that may be used as variables that we adjust for.\nExamining @fig-dag-magic, we see that we need data for three proposed confounders: the ticket season, the time the park closed, and the historic high temperature.\nThese are in the `parks_metadata_raw` dataset.\nThis data will require extra cleaning, since the names are in the original format.\n\n::: callout-tip\nWe like to have our variable names follow a clean convention -- one way to do this is to follow Emily Riederer's \"Column Names as Contracts\" format [@Riederer_2020].\nThe basic idea is to predefine a set of words, phrases, or stubs with clear meanings to index information, and use these consistently when naming 
variables.\nFor example, in these data, variables that are specific to a particular wait time are prepended with the term `wait` (e.g. `wait_datetime` and `wait_minutes_actual`), variables that are specific to the park on a particular day, acquired from parks metadata, are prepended with the term `park` (e.g. `park_date` or `park_temperature_high`).\n:::\n\nLet's first decide what variables we will need.\nIn practice, this decision may involve an iterative process.\nFor example, after drawing our DAG or after conducting diagnostic, we may determine that we need more variables than what we originally cleaned.\nLet's start by skimming this dataframe.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nskim(parks_metadata_raw)\n```\n\n::: {.cell-output-display}\nTable: Data summary\n\n| | |\n|:------------------------|:------------------|\n|Name |parks_metadata_raw |\n|Number of rows |2079 |\n|Number of columns |181 |\n|_______________________ | |\n|Column type frequency: | |\n|character |42 |\n|Date |1 |\n|difftime |46 |\n|logical |6 |\n|numeric |86 |\n|________________________ | |\n|Group variables |None |\n\n\n**Variable type: character**\n\n|skim_variable | n_missing| complete_rate| min| max| empty| n_unique| whitespace|\n|:---------------------|---------:|-------------:|---:|---:|-----:|--------:|----------:|\n|wdw_ticket_season | 861| 0.59| 4| 7| 0| 3| 0|\n|season | 253| 0.88| 4| 29| 0| 17| 0|\n|holidayn | 1865| 0.10| 3| 7| 0| 43| 0|\n|wdwticketseason | 761| 0.63| 4| 7| 0| 3| 0|\n|wdwracen | 1992| 0.04| 4| 6| 0| 5| 0|\n|wdweventn | 1832| 0.12| 3| 12| 0| 8| 0|\n|wdwseason | 87| 0.96| 4| 29| 0| 17| 0|\n|mkeventn | 1546| 0.26| 3| 11| 0| 10| 0|\n|epeventn | 868| 0.58| 4| 5| 0| 4| 0|\n|hseventn | 1877| 0.10| 4| 7| 0| 5| 0|\n|akeventn | 2010| 0.03| 4| 4| 0| 2| 0|\n|holidayj | 2037| 0.02| 5| 15| 0| 8| 0|\n|insession | 105| 0.95| 2| 3| 0| 95| 0|\n|insession_enrollment | 105| 0.95| 2| 4| 0| 100| 0|\n|insession_wdw | 105| 0.95| 2| 4| 0| 94| 0|\n|insession_dlr | 105| 0.95| 2| 4| 0| 94| 0|\n|insession_sqrt_wdw | 105| 0.95| 2| 4| 0| 97| 0|\n|insession_sqrt_dlr | 105| 0.95| 2| 4| 0| 97| 0|\n|insession_california | 105| 0.95| 2| 4| 0| 89| 0|\n|insession_dc | 105| 0.95| 2| 4| 0| 86| 0|\n|insession_central_fl | 105| 0.95| 2| 4| 0| 71| 0|\n|insession_drive1_fl | 105| 0.95| 2| 4| 0| 85| 0|\n|insession_drive2_fl | 105| 0.95| 2| 4| 0| 95| 0|\n|insession_drive_ca | 105| 0.95| 2| 4| 0| 91| 0|\n|insession_florida | 105| 0.95| 2| 4| 0| 86| 0|\n|insession_mardi_gras | 105| 0.95| 2| 4| 0| 82| 0|\n|insession_midwest | 105| 0.95| 2| 4| 0| 75| 0|\n|insession_ny_nj | 105| 0.95| 2| 4| 0| 8| 0|\n|insession_ny_nj_pa | 105| 0.95| 2| 4| 0| 19| 0|\n|insession_new_england | 105| 0.95| 2| 4| 0| 45| 0|\n|insession_new_jersey | 105| 0.95| 2| 4| 0| 2| 0|\n|insession_nothwest | 105| 0.95| 2| 4| 0| 17| 0|\n|insession_planes | 105| 0.95| 2| 4| 0| 81| 0|\n|insession_socal | 105| 0.95| 2| 4| 0| 80| 0|\n|insession_southwest | 105| 0.95| 2| 4| 0| 86| 0|\n|mkprddn | 183| 0.91| 33| 41| 0| 2| 0|\n|mkprdnn | 1358| 0.35| 29| 38| 0| 2| 0|\n|mkfiren | 134| 0.94| 18| 65| 0| 8| 0|\n|epfiren | 126| 0.94| 13| 35| 0| 2| 0|\n|hsfiren | 485| 0.77| 24| 66| 0| 6| 0|\n|hsshwnn | 164| 0.92| 10| 28| 0| 2| 0|\n|akshwnn | 883| 0.58| 15| 33| 0| 2| 0|\n\n\n**Variable type: Date**\n\n|skim_variable | n_missing| complete_rate|min |max |median | n_unique|\n|:-------------|---------:|-------------:|:----------|:----------|:----------|--------:|\n|date | 0| 1|2015-01-01 |2021-08-31 |2017-11-05 | 2079|\n\n\n**Variable type: difftime**\n\n|skim_variable | n_missing| 
complete_rate|min |max |median | n_unique|\n|:-------------|---------:|-------------:|:----------|:-----------|:--------|--------:|\n|mkopen | 0| 1.00|21600 secs |32400 secs |09:00:00 | 4|\n|mkclose | 0| 1.00|54000 secs |107940 secs |22:00:00 | 13|\n|mkemhopen | 0| 1.00|21600 secs |32400 secs |09:00:00 | 5|\n|mkemhclose | 0| 1.00|54000 secs |107940 secs |23:00:00 | 14|\n|mkopenyest | 0| 1.00|21600 secs |32400 secs |09:00:00 | 4|\n|mkcloseyest | 0| 1.00|54000 secs |107940 secs |22:00:00 | 13|\n|mkopentom | 0| 1.00|21600 secs |32400 secs |09:00:00 | 4|\n|mkclosetom | 0| 1.00|54000 secs |107940 secs |22:00:00 | 13|\n|epopen | 0| 1.00|25200 secs |43200 secs |09:00:00 | 6|\n|epclose | 0| 1.00|61200 secs |90000 secs |21:00:00 | 9|\n|epemhopen | 0| 1.00|25200 secs |43200 secs |09:00:00 | 6|\n|epemhclose | 0| 1.00|61200 secs |90000 secs |21:00:00 | 12|\n|epopenyest | 0| 1.00|25200 secs |43200 secs |09:00:00 | 6|\n|epcloseyest | 0| 1.00|61200 secs |90000 secs |21:00:00 | 9|\n|epopentom | 0| 1.00|25200 secs |43200 secs |09:00:00 | 6|\n|epclosetom | 0| 1.00|61200 secs |90000 secs |21:00:00 | 9|\n|hsopen | 0| 1.00|21600 secs |36000 secs |09:00:00 | 6|\n|hsclose | 0| 1.00|50400 secs |86400 secs |21:00:00 | 14|\n|hsemhopen | 0| 1.00|21600 secs |36000 secs |09:00:00 | 7|\n|hsemhclose | 0| 1.00|50400 secs |93600 secs |21:00:00 | 18|\n|hsopenyest | 0| 1.00|21600 secs |36000 secs |09:00:00 | 6|\n|hscloseyest | 0| 1.00|50400 secs |86400 secs |21:00:00 | 14|\n|hsopentom | 0| 1.00|21600 secs |36000 secs |09:00:00 | 6|\n|hsclosetom | 0| 1.00|50400 secs |86400 secs |21:00:00 | 14|\n|akopen | 0| 1.00|25200 secs |32400 secs |09:00:00 | 3|\n|akclose | 0| 1.00|50400 secs |86400 secs |20:00:00 | 16|\n|akemhopen | 0| 1.00|25200 secs |32400 secs |09:00:00 | 3|\n|akemhclose | 0| 1.00|50400 secs |90000 secs |20:00:00 | 17|\n|akopenyest | 0| 1.00|25200 secs |32400 secs |09:00:00 | 3|\n|akcloseyest | 0| 1.00|50400 secs |86400 secs |20:00:00 | 16|\n|akopentom | 0| 1.00|25200 secs |32400 secs |09:00:00 | 3|\n|akclosetom | 0| 1.00|50400 secs |86400 secs |20:00:00 | 16|\n|mkprddt1 | 183| 0.91|39600 secs |61200 secs |15:00:00 | 5|\n|mkprddt2 | 1851| 0.11|50400 secs |73800 secs |15:30:00 | 5|\n|mkprdnt1 | 1358| 0.35|68400 secs |82800 secs |21:00:00 | 11|\n|mkprdnt2 | 1480| 0.29|0 secs |84600 secs |23:00:00 | 8|\n|mkfiret1 | 134| 0.94|66600 secs |80100 secs |21:15:00 | 12|\n|mkfiret2 | 2069| 0.00|85800 secs |85800 secs |23:50:00 | 1|\n|epfiret1 | 126| 0.94|64800 secs |81000 secs |21:00:00 | 6|\n|epfiret2 | 2074| 0.00|85200 secs |85200 secs |23:40:00 | 1|\n|hsfiret1 | 485| 0.77|0 secs |82800 secs |21:00:00 | 17|\n|hsfiret2 | 2045| 0.02|0 secs |81000 secs |21:00:00 | 5|\n|hsshwnt1 | 164| 0.92|65100 secs |79200 secs |20:30:00 | 10|\n|hsshwnt2 | 1369| 0.34|72000 secs |82800 secs |21:30:00 | 11|\n|akshwnt1 | 883| 0.58|65700 secs |76500 secs |20:30:00 | 13|\n|akshwnt2 | 1149| 0.45|70200 secs |81000 secs |21:45:00 | 13|\n\n\n**Variable type: logical**\n\n|skim_variable | n_missing| complete_rate| mean|count |\n|:-------------|---------:|-------------:|----:|:-----|\n|hsprddt1 | 2079| 0| NaN|: |\n|hsprddn | 2079| 0| NaN|: |\n|akprddt1 | 2079| 0| NaN|: |\n|akprddt2 | 2079| 0| NaN|: |\n|akprddn | 2079| 0| NaN|: |\n|akfiren | 2079| 0| NaN|: |\n\n\n**Variable type: numeric**\n\n|skim_variable | n_missing| complete_rate| mean| sd| p0| p25| p50| p75| p100|hist |\n|:------------------|---------:|-------------:|-----------:|----------:|-----------:|-----------:|-----------:|-----------:|-----------:|:-----|\n|dayofweek | 0| 1| 4.00| 2.00| 1.00| 
2.00| 4.00| 6.00| 7.00|▇▃▃▃▇ |\n|dayofyear | 0| 1| 181.84| 106.34| 0.00| 89.00| 184.00| 273.00| 365.00|▇▇▇▇▇ |\n|weekofyear | 0| 1| 26.09| 15.20| 0.00| 13.00| 26.00| 39.00| 53.00|▇▇▇▇▇ |\n|monthofyear | 0| 1| 6.51| 3.48| 1.00| 3.00| 7.00| 10.00| 12.00|▇▅▆▅▇ |\n|year | 0| 1| 2017.41| 1.74| 2015.00| 2016.00| 2017.00| 2019.00| 2021.00|▇▃▃▃▃ |\n|holidaypx | 0| 1| 7.85| 6.89| 0.00| 3.00| 6.00| 10.00| 33.00|▇▅▂▁▁ |\n|holidaym | 0| 1| 0.54| 1.35| 0.00| 0.00| 0.00| 0.00| 5.00|▇▁▁▁▁ |\n|holiday | 0| 1| 0.10| 0.30| 0.00| 0.00| 0.00| 0.00| 1.00|▇▁▁▁▁ |\n|wdwevent | 0| 1| 0.12| 0.32| 0.00| 0.00| 0.00| 0.00| 1.00|▇▁▁▁▁ |\n|wdwrace | 0| 1| 0.04| 0.20| 0.00| 0.00| 0.00| 0.00| 1.00|▇▁▁▁▁ |\n|wdwmaxtemp | 5| 1| 82.80| 8.53| 51.11| 78.29| 84.50| 89.54| 97.72|▁▁▃▇▆ |\n|wdwmintemp | 6| 1| 65.50| 10.18| 27.48| 59.03| 68.35| 74.10| 81.28|▁▂▃▆▇ |\n|wdwmeantemp | 6| 1| 74.15| 9.06| 39.75| 68.76| 76.37| 81.61| 87.76|▁▂▃▆▇ |\n|mkevent | 0| 1| 0.26| 0.44| 0.00| 0.00| 0.00| 1.00| 1.00|▇▁▁▁▃ |\n|epevent | 0| 1| 0.58| 0.49| 0.00| 0.00| 1.00| 1.00| 1.00|▆▁▁▁▇ |\n|hsevent | 0| 1| 0.10| 0.30| 0.00| 0.00| 0.00| 0.00| 1.00|▇▁▁▁▁ |\n|akevent | 0| 1| 0.03| 0.18| 0.00| 0.00| 0.00| 0.00| 1.00|▇▁▁▁▁ |\n|mkemhmorn | 0| 1| 0.19| 0.40| 0.00| 0.00| 0.00| 0.00| 1.00|▇▁▁▁▂ |\n|mkemhmyest | 0| 1| 0.19| 0.40| 0.00| 0.00| 0.00| 0.00| 1.00|▇▁▁▁▂ |\n|mkemhmtom | 0| 1| 0.19| 0.40| 0.00| 0.00| 0.00| 0.00| 1.00|▇▁▁▁▂ |\n|mkemheve | 0| 1| 0.13| 0.33| 0.00| 0.00| 0.00| 0.00| 1.00|▇▁▁▁▁ |\n|mkhoursemh | 0| 1| 13.64| 1.98| 7.50| 13.00| 14.00| 15.00| 23.98|▁▇▅▁▁ |\n|mkhoursemhyest | 0| 1| 13.65| 1.98| 7.50| 13.00| 14.00| 15.00| 23.98|▁▇▅▁▁ |\n|mkhoursemhtom | 0| 1| 13.64| 1.98| 7.50| 13.00| 14.00| 15.00| 23.98|▁▇▅▁▁ |\n|mkemheyest | 0| 1| 0.13| 0.33| 0.00| 0.00| 0.00| 0.00| 1.00|▇▁▁▁▁ |\n|mkemhetom | 0| 1| 0.13| 0.33| 0.00| 0.00| 0.00| 0.00| 1.00|▇▁▁▁▁ |\n|epemhmorn | 0| 1| 0.13| 0.34| 0.00| 0.00| 0.00| 0.00| 1.00|▇▁▁▁▁ |\n|epemhmyest | 0| 1| 0.13| 0.34| 0.00| 0.00| 0.00| 0.00| 1.00|▇▁▁▁▁ |\n|epemhmtom | 0| 1| 0.13| 0.34| 0.00| 0.00| 0.00| 0.00| 1.00|▇▁▁▁▁ |\n|epemheve | 0| 1| 0.13| 0.34| 0.00| 0.00| 0.00| 0.00| 1.00|▇▁▁▁▁ |\n|epemheyest | 0| 1| 0.13| 0.34| 0.00| 0.00| 0.00| 0.00| 1.00|▇▁▁▁▁ |\n|epemhetom | 0| 1| 0.13| 0.34| 0.00| 0.00| 0.00| 0.00| 1.00|▇▁▁▁▁ |\n|ephoursemh | 0| 1| 12.41| 0.96| 9.00| 12.00| 12.00| 13.00| 17.00|▁▇▃▂▁ |\n|ephoursemhyest | 0| 1| 12.41| 0.96| 9.00| 12.00| 12.00| 13.00| 17.00|▁▇▃▂▁ |\n|ephoursemhtom | 0| 1| 12.41| 0.96| 9.00| 12.00| 12.00| 13.00| 17.00|▁▇▃▂▁ |\n|hsemhmorn | 0| 1| 0.18| 0.38| 0.00| 0.00| 0.00| 0.00| 1.00|▇▁▁▁▂ |\n|hsemhmyest | 0| 1| 0.18| 0.38| 0.00| 0.00| 0.00| 0.00| 1.00|▇▁▁▁▂ |\n|hsemhmtom | 0| 1| 0.18| 0.38| 0.00| 0.00| 0.00| 0.00| 1.00|▇▁▁▁▂ |\n|hsemheve | 0| 1| 0.06| 0.25| 0.00| 0.00| 0.00| 0.00| 1.00|▇▁▁▁▁ |\n|hsemheyest | 0| 1| 0.06| 0.25| 0.00| 0.00| 0.00| 0.00| 1.00|▇▁▁▁▁ |\n|hsemhetom | 0| 1| 0.06| 0.25| 0.00| 0.00| 0.00| 0.00| 1.00|▇▁▁▁▁ |\n|hshoursemh | 0| 1| 12.32| 1.49| 8.00| 11.00| 12.00| 13.00| 18.00|▁▇▇▂▁ |\n|hshoursemhyest | 0| 1| 12.32| 1.49| 8.00| 11.00| 12.00| 13.00| 18.00|▁▇▇▂▁ |\n|hshoursemhtom | 0| 1| 12.32| 1.49| 8.00| 11.00| 12.00| 13.00| 18.00|▁▇▇▂▁ |\n|akemhmorn | 0| 1| 0.30| 0.46| 0.00| 0.00| 0.00| 1.00| 1.00|▇▁▁▁▃ |\n|akemhmyest | 0| 1| 0.30| 0.46| 0.00| 0.00| 0.00| 1.00| 1.00|▇▁▁▁▃ |\n|akemhmtom | 0| 1| 0.30| 0.46| 0.00| 0.00| 0.00| 1.00| 1.00|▇▁▁▁▃ |\n|akemheve | 0| 1| 0.04| 0.20| 0.00| 0.00| 0.00| 0.00| 1.00|▇▁▁▁▁ |\n|akemheyest | 0| 1| 0.04| 0.20| 0.00| 0.00| 0.00| 0.00| 1.00|▇▁▁▁▁ |\n|akemhetom | 0| 1| 0.04| 0.20| 0.00| 0.00| 0.00| 0.00| 1.00|▇▁▁▁▁ |\n|akhoursemh | 0| 
1| 11.77| 1.80| 7.00| 11.00| 12.00| 13.00| 17.00|▂▇▇▅▁ |\n|akhoursemhyest | 0| 1| 11.77| 1.80| 7.00| 11.00| 12.00| 13.00| 17.00|▂▇▇▅▁ |\n|akhoursemhtom | 0| 1| 11.76| 1.80| 7.00| 11.00| 12.00| 13.00| 17.00|▂▇▇▅▁ |\n|mkhours | 0| 1| 13.26| 2.01| 7.00| 12.00| 13.00| 15.00| 23.98|▂▇▇▁▁ |\n|mkhoursyest | 0| 1| 13.26| 2.01| 7.00| 12.00| 13.00| 15.00| 23.98|▂▇▇▁▁ |\n|mkhourstom | 0| 1| 13.26| 2.00| 7.00| 12.00| 13.00| 15.00| 23.98|▂▇▇▁▁ |\n|ephours | 0| 1| 12.02| 0.64| 8.00| 12.00| 12.00| 12.00| 17.00|▁▁▇▁▁ |\n|ephoursyest | 0| 1| 12.03| 0.64| 8.00| 12.00| 12.00| 12.00| 17.00|▁▁▇▁▁ |\n|ephourstom | 0| 1| 12.02| 0.64| 8.00| 12.00| 12.00| 12.00| 17.00|▁▁▇▁▁ |\n|hshours | 0| 1| 11.92| 1.19| 5.00| 11.00| 12.00| 12.50| 18.00|▁▁▇▂▁ |\n|hshoursyest | 0| 1| 11.93| 1.20| 5.00| 11.00| 12.00| 12.50| 18.00|▁▁▇▂▁ |\n|hshourstom | 0| 1| 11.92| 1.19| 5.00| 11.00| 12.00| 12.50| 18.00|▁▁▇▂▁ |\n|akhours | 0| 1| 11.46| 1.68| 6.00| 10.50| 11.00| 12.50| 17.00|▁▃▇▃▁ |\n|akhoursyest | 0| 1| 11.47| 1.68| 6.00| 10.50| 11.00| 12.50| 17.00|▁▃▇▃▁ |\n|akhourstom | 0| 1| 11.46| 1.68| 6.00| 10.50| 11.00| 12.50| 17.00|▁▃▇▃▁ |\n|weather_wdwhigh | 0| 1| 82.35| 7.86| 70.20| 74.60| 82.80| 90.60| 92.30|▅▃▂▂▇ |\n|weather_wdwlow | 0| 1| 64.10| 9.26| 49.20| 55.80| 63.60| 74.00| 76.10|▅▅▃▂▇ |\n|weather_wdwprecip | 0| 1| 0.15| 0.08| 0.03| 0.08| 0.12| 0.23| 0.35|▇▆▃▅▁ |\n|capacitylost_mk | 0| 1| 422110.61| 36458.81| 352865.00| 385812.00| 433857.00| 456055.00| 473553.00|▅▂▁▇▆ |\n|capacitylost_ep | 0| 1| 366897.04| 24019.96| 325168.00| 338367.00| 380763.00| 381963.00| 394662.00|▅▁▁▂▇ |\n|capacitylost_hs | 0| 1| 287485.76| 33198.89| 203780.00| 279573.00| 301871.00| 311870.00| 321869.00|▂▁▁▁▇ |\n|capacitylost_ak | 0| 1| 228193.83| 14967.82| 210779.00| 220778.00| 223178.00| 232777.00| 273873.00|▇▅▁▁▂ |\n|capacitylostwgt_mk | 0| 1| 41374025.71| 3621097.96| 34661635.00| 37641738.00| 42585643.00| 44577245.00| 46545047.00|▅▂▂▇▆ |\n|capacitylostwgt_ep | 0| 1| 35344939.61| 2201138.89| 31692832.00| 32692333.00| 36666337.00| 36668737.00| 37678138.00|▅▁▁▁▇ |\n|capacitylostwgt_hs | 0| 1| 27528647.53| 3049291.37| 19812520.00| 26772627.00| 28771129.00| 29761030.00| 30750931.00|▂▁▁▁▇ |\n|capacitylostwgt_ak | 0| 1| 22386447.82| 1398263.45| 20790321.00| 21780222.00| 21799422.00| 22780123.00| 26739827.00|▇▅▁▁▂ |\n|mkprdday | 0| 1| 1.07| 0.59| 0.00| 1.00| 1.00| 1.00| 3.00|▁▇▁▁▁ |\n|mkprdngt | 0| 1| 0.64| 0.90| 0.00| 0.00| 0.00| 2.00| 3.00|▇▁▁▃▁ |\n|mkfirewk | 0| 1| 0.94| 0.26| 0.00| 1.00| 1.00| 1.00| 2.00|▁▁▇▁▁ |\n|epfirewk | 0| 1| 0.94| 0.25| 0.00| 1.00| 1.00| 1.00| 3.00|▁▇▁▁▁ |\n|hsprdday | 0| 1| 0.00| 0.00| 0.00| 0.00| 0.00| 0.00| 0.00|▁▁▇▁▁ |\n|hsfirewk | 0| 1| 0.78| 0.45| 0.00| 1.00| 1.00| 1.00| 2.00|▂▁▇▁▁ |\n|hsshwngt | 0| 1| 1.28| 0.62| 0.00| 1.00| 1.00| 2.00| 3.00|▁▇▁▅▁ |\n|hsfirewks | 0| 1| 1.00| 0.00| 1.00| 1.00| 1.00| 1.00| 1.00|▁▁▇▁▁ |\n|akprdday | 0| 1| 0.00| 0.00| 0.00| 0.00| 0.00| 0.00| 0.00|▁▁▇▁▁ |\n|akshwngt | 0| 1| 1.04| 0.97| 0.00| 0.00| 1.00| 2.00| 3.00|▇▂▁▇▁ |\n:::\n:::\n\n\nThis dataset contains many more variables than the one we worked with previously.\nFor this analysis, we are going to select `date` (the observation date), `wdw_ticket_season` (the ticket season for the observation), `wdwmaxtemp` (the maximum temperature), `mkclose` (the time Magic Kingdom closed), `mkemhmorn` (whether Magic Kingdom had an \"Extra Magic Hour\" in the morning).\n\n\n::: {.cell}\n\n```{.r .cell-code}\nparks_metadata_clean <- parks_metadata_raw |>\n ## based on our analysis plan, we will select the following variables\n select(date, wdw_ticket_season, 
wdwmaxtemp, mkclose, mkemhmorn) |>\n ## based on eligibility criteria, limit to 2018\n filter(year(date) == 2018) |>\n ## rename variables\n rename(\n park_date = date,\n park_ticket_season = wdw_ticket_season,\n park_temperature_high = wdwmaxtemp,\n park_close = mkclose,\n park_extra_magic_morning = mkemhmorn\n )\n```\n:::\n\n\n## Working with multiple data sources\n\nWhen attempting to answer causal questions, we frequently find ourselves merging data from multiple sources to ensure that all of the necessary factors are accounted for.\nThe way we can combine datasets is via *joins* -- joining two or more datasets based on a set or sets of common variables.\nWe can think of three main types of *joins*: left, right, and inner.\nA *left* join combines data from two datasets based on a common variable and includes all records from the *left* dataset along with matching records from the *right* dataset (in `{dplyr}`, `left_join()`).\nA *right* join includes all records from the *right* dataset and their corresponding matches from the *left* dataset (in `{dplyr}`, `right_join()`).\nAn *inner* join, on the other hand, includes only the records with matching values in *both* datasets, excluding non-matching records (in `{dplyr}`, `inner_join()`).\nFor this analysis, we need to use a left join to pull in the cleaned parks metadata.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nseven_dwarfs_train_2018 <- seven_dwarfs_train_2018 |>\n left_join(parks_metadata_clean, by = \"park_date\")\n```\n:::\n\n\n## Recognizing missing data\n\nIt is important to recognize whether we have any missing data in our variables.\nThe `{visdat}` package is great for getting a quick sense of whether we have any missing data.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(visdat)\nvis_miss(seven_dwarfs_train_2018)\n```\n\n::: {.cell-output-display}\n![](chapter-06_files/figure-html/unnamed-chunk-12-1.png){width=672}\n:::\n:::\n\n\nIt looks like we only have a few observations (2%) missing our outcome of interest.\nThis is not too bad.\nFor this first analysis we will ignore the missing values.\nWe can explicitly drop them using the `drop_na()` function from `{dplyr}`.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nseven_dwarfs_train_2018 <- seven_dwarfs_train_2018 |>\n drop_na()\n```\n:::\n\n\n## Exploring and visualizing data and assumptions\n\nThe *positivity* assumption requires that within each level and combination of the study variables used to achieve exchangeability, there are exposed and unexposed subjects (@sec-assump).\nWe can explore this by visualizing the distribution of each of our proposed confounders stratified by the exposure.\n\n### Single variable checks for positivity violations\n\n@fig-close shows the distribution of Magic Kingdom park closing time by whether the date had extra magic hours in the morning.\nThere is no clear evidence of a lack of positivity here, as both exposure levels span the majority of the covariate space.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nggplot(seven_dwarfs_train_2018, aes(x = factor(park_close), \n group = factor(park_extra_magic_morning),\n fill = factor(park_extra_magic_morning))) +\n geom_bar(position = position_dodge2(width = 0.9, preserve = \"single\")) + \n labs(fill = \"Extra Magic Morning\",\n x = \"Time of Park Close\")\n```\n\n::: {.cell-output-display}\n![Distribution of Magic Kingdom park closing time by whether the date had extra magic hours in the morning](chapter-06_files/figure-html/fig-close-1.png){#fig-close width=672}\n:::\n:::\n\n\nTo examine the 
distribution of historic temperature high at Magic Kingdom by whether the date had extra magic hours in the morning, we can use a mirrored histogram.\nWe'll use the {halfmoon} package's `geom_mirror_histogram()` to create one.\nExamining @fig-temp, it does look like there are very few days in the exposed group with maximum temperatures less than 60 degrees.\nWhile this is not necessarily a positivity violation, it is worth keeping an eye on, particularly because the dataset is not very large, so it could be difficult to estimate an average exposure effect across this whole space.\nIf we found this to be particularly difficult, we could posit changing our causal question to instead restrict the analysis to warmer days.\nThis of course would also restrict which days we could draw conclusions about for the future.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(halfmoon)\nggplot(seven_dwarfs_train_2018, aes(x = park_temperature_high, \n group = factor(park_extra_magic_morning),\n fill = factor(park_extra_magic_morning))) +\n geom_mirror_histogram(bins = 20) + \n labs(fill = \"Extra Magic Morning\",\n x = \"Historic maximum temperature (degrees F)\")\n```\n\n::: {.cell-output-display}\n![Distribution of historic temperature high at Magic Kingdom by whether the date had extra magic hours in the morning](chapter-06_files/figure-html/fig-temp-1.png){#fig-temp width=672}\n:::\n:::\n\n\nFinally, let's look at the distribution of ticket season by whether there were extra magic hours in the morning.\nExamining @fig-ticket, we do not see any positivity violations.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nggplot(seven_dwarfs_train_2018, aes(x = park_ticket_season, \n group = factor(park_extra_magic_morning),\n fill = factor(park_extra_magic_morning))) +\n geom_bar(position = \"dodge\") + \n labs(fill = \"Extra Magic Morning\",\n x = \"Magic Kingdom Ticket Season\")\n```\n\n::: {.cell-output-display}\n![Distribution of Magic Kingdom ticket season by whether the date had extra magic hours in the morning](chapter-06_files/figure-html/fig-ticket-1.png){#fig-ticket width=672}\n:::\n:::\n\n\n### Multiple variable checks for positivity violations\n\nWe have confirmed that for each of the three confounders, we do not see strong evidence of positivity violations.\nBecause we have so few variables here, we can examine this a bit more closely.\nLet's start by discretizing the `park_temperature_high` variable a bit (we will cut it into tertiles).\n\n\n::: {.cell}\n\n```{.r .cell-code}\nseven_dwarfs_train_2018 |>\n ## cut park_temperature_high into tertiles\n mutate(park_temperature_high_bin = cut(park_temperature_high, breaks = 3)) |>\n ## bin park close time\n mutate(park_close_bin = case_when(\n hour(park_close) < 19 & hour(park_close) > 12 ~ \"(1) early\",\n hour(park_close) >= 19 & hour(park_close) < 24 ~ \"(2) standard\",\n hour(park_close) >= 24 | hour(park_close) < 12 ~ \"(3) late\"\n )) |>\n group_by(park_close_bin, park_temperature_high_bin, park_ticket_season) |>\n ## calculate the proportion exposed in each bin\n summarise(prop_exposed = mean(park_extra_magic_morning), .groups = \"drop\") |>\n  ggplot(aes(x = park_close_bin, y = park_temperature_high_bin, fill = prop_exposed)) + \n geom_tile() + \n scale_fill_gradient2(midpoint = 0.5) +\n facet_wrap(~park_ticket_season) + \n labs(y = \"Historic Maximum Temperature (F)\",\n x = \"Magic Kingdom Park Close Time\",\n fill = \"Proportion of Days Exposed\")\n```\n\n::: {.cell-output-display}\n![Check for positivity violations across three confounders: 
historic high temperature, park close time, and ticket season.](chapter-06_files/figure-html/fig-positivity-1.png){#fig-positivity width=864}\n:::\n:::\n\n\nInteresting!\n@fig-positivity shows an interesting potential violation.\nIt looks like 100% of days with lower temperatures (historic highs between 51 and 65 degrees) that are in the peak ticket season have extra magic hours in the morning.\nThis actually makes sense if we think a bit about this data set.\nThe only days with cold temperatures in Florida that would also be considered a \"peak\" time to visit Walt Disney World would be over Christmas / New Years.\nDuring this time there historically were always extra magic hours.\n\nWe are going to proceed with the analysis, but we will keep these observations in mind.\n\n## Presenting descriptive statistics\n\nLet's examine a table of the variables of interest in this data frame.\nTo do so, we are going to use the `tbl_summary()` function from the `{gtsummary}` package.\n(We'll also use the `{labelled}` package to clean up the variable names for the table.)\n\n\n::: {#tbl-unweighted-gtsummary .cell tbl-cap='A descriptive table of Extra Magic Morning in the touringplans dataset. This table shows the distributions of these variables in the observed population.'}\n\n```{.r .cell-code}\nlibrary(gtsummary)\nlibrary(labelled)\nseven_dwarfs_train_2018 <- seven_dwarfs_train_2018 |>\n set_variable_labels(\n park_ticket_season = \"Ticket Season\",\n park_close = \"Close Time\",\n park_temperature_high = \"Historic High Temperature\"\n )\n\ntbl_summary(\n seven_dwarfs_train_2018,\n by = park_extra_magic_morning,\n include = c(park_ticket_season, park_close, park_temperature_high)\n) |>\n # add an overall column to the table\n add_overall(last = TRUE)\n```\n\n::: {.cell-output-display}\n```{=html}\n
| Characteristic | 0, N = 294¹ | 1, N = 60¹ | Overall, N = 354¹ |
|:---|:---|:---|:---|
| Ticket Season |  |  |  |
|     peak | 60 (20%) | 18 (30%) | 78 (22%) |
|     regular | 158 (54%) | 35 (58%) | 193 (55%) |
|     value | 76 (26%) | 7 (12%) | 83 (23%) |
| Close Time |  |  |  |
|     16:30:00 | 1 (0.3%) | 0 (0%) | 1 (0.3%) |
|     18:00:00 | 37 (13%) | 18 (30%) | 55 (16%) |
|     20:00:00 | 18 (6.1%) | 2 (3.3%) | 20 (5.6%) |
|     21:00:00 | 28 (9.5%) | 0 (0%) | 28 (7.9%) |
|     22:00:00 | 91 (31%) | 11 (18%) | 102 (29%) |
|     23:00:00 | 78 (27%) | 11 (18%) | 89 (25%) |
|     24:00:00 | 40 (14%) | 17 (28%) | 57 (16%) |
|     25:00:00 | 1 (0.3%) | 1 (1.7%) | 2 (0.6%) |
| Historic High Temperature | 84 (78, 89) | 83 (76, 87) | 84 (78, 89) |

¹ n (%); Median (IQR)
\n```\n:::\n:::\n", + "supporting": [ + "chapter-06_files" + ], "filters": [ "rmarkdown/pagebreak.lua" ], diff --git a/_freeze/chapters/chapter-07/figure-html/fig-close-1.png b/_freeze/chapters/chapter-06/figure-html/fig-close-1.png similarity index 100% rename from _freeze/chapters/chapter-07/figure-html/fig-close-1.png rename to _freeze/chapters/chapter-06/figure-html/fig-close-1.png diff --git a/_freeze/chapters/chapter-07/figure-html/fig-dag-magic-1.png b/_freeze/chapters/chapter-06/figure-html/fig-dag-magic-1.png similarity index 100% rename from _freeze/chapters/chapter-07/figure-html/fig-dag-magic-1.png rename to _freeze/chapters/chapter-06/figure-html/fig-dag-magic-1.png diff --git a/_freeze/chapters/chapter-07/figure-html/fig-seven-diag-1.png b/_freeze/chapters/chapter-06/figure-html/fig-diagram-3-1.png similarity index 100% rename from _freeze/chapters/chapter-07/figure-html/fig-seven-diag-1.png rename to _freeze/chapters/chapter-06/figure-html/fig-diagram-3-1.png diff --git a/_freeze/chapters/chapter-07/figure-html/fig-positivity-1.png b/_freeze/chapters/chapter-06/figure-html/fig-positivity-1.png similarity index 100% rename from _freeze/chapters/chapter-07/figure-html/fig-positivity-1.png rename to _freeze/chapters/chapter-06/figure-html/fig-positivity-1.png diff --git a/_freeze/chapters/chapter-06/figure-html/fig-seven-diag-1.png b/_freeze/chapters/chapter-06/figure-html/fig-seven-diag-1.png new file mode 100644 index 0000000..696fc6b Binary files /dev/null and b/_freeze/chapters/chapter-06/figure-html/fig-seven-diag-1.png differ diff --git a/_freeze/chapters/chapter-07/figure-html/fig-temp-1.png b/_freeze/chapters/chapter-06/figure-html/fig-temp-1.png similarity index 100% rename from _freeze/chapters/chapter-07/figure-html/fig-temp-1.png rename to _freeze/chapters/chapter-06/figure-html/fig-temp-1.png diff --git a/_freeze/chapters/chapter-07/figure-html/fig-ticket-1.png b/_freeze/chapters/chapter-06/figure-html/fig-ticket-1.png similarity index 100% rename from _freeze/chapters/chapter-07/figure-html/fig-ticket-1.png rename to _freeze/chapters/chapter-06/figure-html/fig-ticket-1.png diff --git a/_freeze/chapters/chapter-07/figure-html/unnamed-chunk-12-1.png b/_freeze/chapters/chapter-06/figure-html/unnamed-chunk-12-1.png similarity index 100% rename from _freeze/chapters/chapter-07/figure-html/unnamed-chunk-12-1.png rename to _freeze/chapters/chapter-06/figure-html/unnamed-chunk-12-1.png diff --git a/_freeze/chapters/chapter-07/execute-results/html.json b/_freeze/chapters/chapter-07/execute-results/html.json index d13a024..c4382c7 100644 --- a/_freeze/chapters/chapter-07/execute-results/html.json +++ b/_freeze/chapters/chapter-07/execute-results/html.json @@ -1,7 +1,7 @@ { - "hash": "4f88b20b3b97dcff76810a8b0616f1ef", + "hash": "a9b2204a68f6670a68f7890b9dd580f5", "result": { - "markdown": "# Preparing data to answer causal questions {#sec-data-causal}\n\n\n\n\n\n## Introduction to the data {#sec-data}\n\nThroughout this book we will be using data obtained from [Touring Plans](https://touringplans.com).\nTouring Plans is a company that helps folks plan their trips to Disney and Universal theme parks.\nOne of their goals is to accurately predict attraction wait times at these theme parks by leveraging data and statistical modeling.\nThe `{touringplans}` R package includes several datasets containing information about Disney theme park attractions.\nA summary of the attractions included in the package can be found by running the following:\n\n\n::: {.cell}\n\n```{.r 
.cell-code}\nlibrary(touringplans)\nattractions_metadata\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n# A tibble: 14 × 8\n dataset_name name short_name park land opened_on \n \n 1 alien_sauce… Alie… Alien Sau… Disn… Toy … 2018-06-30\n 2 dinosaur DINO… DINOSAUR Disn… Dino… 1998-04-22\n 3 expedition_… Expe… Expeditio… Disn… Asia 2006-04-07\n 4 flight_of_p… Avat… Flight of… Disn… Pand… 2017-05-27\n 5 kilimanjaro… Kili… Kilimanja… Disn… Afri… 1998-04-22\n 6 navi_river Na'v… Na'vi Riv… Disn… Pand… 2017-05-27\n 7 pirates_of_… Pira… Pirates o… Magi… Adve… 1973-12-17\n 8 rock_n_roll… Rock… Rock Coas… Disn… Suns… 1999-07-29\n 9 seven_dwarf… Seve… 7 Dwarfs … Magi… Fant… 2014-05-28\n10 slinky_dog Slin… Slinky Dog Disn… Toy … 2018-06-30\n11 soarin Soar… Soarin' Epcot Worl… 2005-05-05\n12 spaceship_e… Spac… Spaceship… Epcot Worl… 1982-10-01\n13 splash_moun… Spla… Splash Mo… Magi… Fron… 1992-07-17\n14 toy_story_m… Toy … Toy Story… Disn… Toy … 2008-05-31\n# ℹ 2 more variables: duration ,\n# average_wait_per_hundred \n```\n\n\n:::\n:::\n\n\nAdditionally, this package contains a dataset with raw metadata about the parks, with observations recorded daily.\nThis metadata includes information like the Walt Disney World ticket season on the particular day (was it high season -- think Christmas -- or low season -- think right when school started), what the historic temperatures were in the park on that day, and whether there was a special event, such as \"extra magic hours\" in the park on that day (did the park open early to guests staying in the Walt Disney World resorts?).\n\n\n::: {.cell}\n\n```{.r .cell-code}\nparks_metadata_raw\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n# A tibble: 2,079 × 181\n date wdw_ticket_season dayofweek dayofyear\n \n 1 2015-01-01 5 0\n 2 2015-01-02 6 1\n 3 2015-01-03 7 2\n 4 2015-01-04 1 3\n 5 2015-01-05 2 4\n 6 2015-01-06 3 5\n 7 2015-01-07 4 6\n 8 2015-01-08 5 7\n 9 2015-01-09 6 8\n10 2015-01-10 7 9\n# ℹ 2,069 more rows\n# ℹ 177 more variables: weekofyear ,\n# monthofyear , year , season ,\n# holidaypx , holidaym , holidayn ,\n# holiday , wdwticketseason ,\n# wdwracen , wdweventn , wdwevent ,\n# wdwrace , wdwseason , …\n```\n\n\n:::\n:::\n\n\nSuppose the causal question of interest is:\n\n**Is there a relationship between whether there were \"Extra Magic Hours\" in the morning at Magic Kingdom and the average wait time for an attraction called the \"Seven Dwarfs Mine Train\" the same day between 9am and 10am in 2018?**\n\nLet's begin by diagramming this causal question (@fig-seven-diag).\n\n\n::: {.cell}\n::: {.cell-output-display}\n![Diagram of the causal question \"Is there a relationship between whether there were \"Extra Magic Hours\" in the morning at Magic Kingdom and the average wait time for an attraction called the \"Seven Dwarfs Mine Train\" the same day between 9am and 10am in 2018?\"](chapter-07_files/figure-html/fig-seven-diag-1.png){#fig-seven-diag width=672}\n:::\n:::\n\n\nHistorically, guests who stayed in a Walt Disney World resort hotel could access the park during \"Extra Magic Hours,\" during which the park was closed to all other guests.\nThese extra hours could be in the morning or evening.\nThe Seven Dwarfs Mine Train is a ride at Walt Disney World's Magic Kingdom.\nMagic Kingdom may or may not be selected each day to have these \"Extra Magic Hours.\" We are interested in examining the relationship between whether there were \"Extra Magic Hours\" in the morning and the average wait time for the Seven Dwarfs Mine Train on the same day 
between 9 am and 10 am.\nBelow is a proposed DAG for this question.\n\n\n::: {.cell}\n::: {.cell-output-display}\n![Proposed DAG for the relationship between Extra Magic Hours in the morning at a particular park and the average wait time between 9 am and 10 am. Here we are saying that we believe 1) Extra Magic Hours impacts average wait time and 2) both Extra Magic Hours and average wait time are determined by the time the park closes, historic high temperatures, and ticket season.](chapter-07_files/figure-html/fig-dag-magic-1.png){#fig-dag-magic width=672}\n:::\n:::\n\n\nSince we are not in charge of Walt Disney World's operations, we cannot randomize dates to have (or not) \"Extra Magic Hours\"; therefore, we need to rely on previously collected observational data and do our best to emulate the *target trial* that we would have created, should it have been possible.\nHere, our observations are *days*.\nLooking at the diagram above, we can map each element of the causal question to elements of our target trial protocol:\n\n- **Eligibility criteria**: Days must be from 2018\n- **Exposure definition**: Magic Kingdom had \"Extra Magic Hours\" in the morning\n- **Assignment procedures**: Observed -- if the historic data suggests there were \"Extra Magic Hours\" in the morning on a particular day, that day is classified as \"exposed\"; otherwise it is \"unexposed\"\n- **Follow-up period**: From park open to 10am.\n- **Outcome definition**: The average posted wait time between 9am and 10am\n- **Causal contrast of interest**: Average treatment effect (we will discuss this in @sec-estimands)\n- **Analysis plan**: We use inverse probability weighting after fitting a propensity score model to estimate the average treatment effect of the exposure on the outcome of interest. 
We will adjust for variables as determined by our DAG (@fig-dag-magic)\n\n## Data wrangling and recipes\n\nMost of our data manipulation tools come from the `{dplyr}` package (@tbl-dplyr).\nWe will also use `{lubridate}` to help us manipulate dates.\n\n| Target trial protocol element | {dplyr} functions |\n|-------------------------------|---------------------------------------------|\n| Eligibility criteria | `filter()` |\n| Exposure definition | `mutate()` |\n| Assignment procedures | `mutate()` |\n| Follow-up period | `mutate()` `pivot_longer()` `pivot_wider()` |\n| Outcome definition | `mutate()` |\n| Analysis plan | `select()` `mutate()` |\n\n: Mapping target trial protocol elements to commonly used `{dplyr}` functions {#tbl-dplyr}\n\nTo answer this question, we are going to need to manipulate both the `seven_dwarfs_train` dataset as well as the `parks_metadata_raw` dataset.\nLet's start with the `seven_dwarfs_train` data set.\nThe Seven Dwarfs Mine Train ride is an attraction at Walt Disney World's Magic Kingdom.\nThe `seven_dwarfs_train` dataset in the {touringplans} package contains information about the date a particular wait time was recorded (`park_date`), the time of the wait time (`wait_datetime`), the actual wait time (`wait_minutes_actual`), and the posted wait time (`wait_minutes_posted`).\nLet's take a look at this dataset.\nThe {skimr} package is great for getting a quick glimpse at a new dataset.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(skimr)\nskim(seven_dwarfs_train)\n```\n\n::: {.cell-output-display}\n\nTable: Data summary\n\n| | |\n|:------------------------|:------------------|\n|Name |seven_dwarfs_train |\n|Number of rows |321631 |\n|Number of columns |4 |\n|_______________________ | |\n|Column type frequency: | |\n|Date |1 |\n|numeric |2 |\n|POSIXct |1 |\n|________________________ | |\n|Group variables |None |\n\n\n**Variable type: Date**\n\n|skim_variable | n_missing| complete_rate|min |max |median | n_unique|\n|:-------------|---------:|-------------:|:----------|:----------|:----------|--------:|\n|park_date | 0| 1|2015-01-01 |2021-12-28 |2018-04-07 | 2334|\n\n\n**Variable type: numeric**\n\n|skim_variable | n_missing| complete_rate| mean| sd| p0| p25| p50| p75| p100|hist |\n|:-------------------|---------:|-------------:|-----:|-------:|------:|---:|---:|---:|----:|:-----|\n|wait_minutes_actual | 313996| 0.02| 23.99| 1064.06| -92918| 21| 31| 46| 217|▁▁▁▁▇ |\n|wait_minutes_posted | 30697| 0.90| 76.96| 33.99| 0| 50| 70| 95| 300|▆▇▁▁▁ |\n\n\n**Variable type: POSIXct**\n\n|skim_variable | n_missing| complete_rate|min |max |median | n_unique|\n|:-------------|---------:|-------------:|:-------------------|:-------------------|:-------------------|--------:|\n|wait_datetime | 0| 1|2015-01-01 07:51:12 |2021-12-28 22:57:34 |2018-04-07 23:14:06 | 321586|\n\n\n:::\n:::\n\n\nExamining the output above, we learn that this dataset contains four columns and 321,631 rows.\nWe also learn that the dates span from 2015 to 2021.\nWe can also examine the distribution of each of the variables to detect any potential anomalies.\nNotice anything strange?\nLook at the `p0` (that is the minimum value) for `wait_minutes_actual`.\nIt is `-92918`!\nWe are not using this variable for this analysis, but we will for future analyses, so this is good to keep in mind.\n\nWe need this dataset to calculate our *outcome*.\nRecall from above that our outcome is defined as the average posted wait time between 9am and 10am.\nAdditionally, recall our eligibility criteria states that we need to 
restrict our analysis to days in 2018.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(dplyr)\nlibrary(lubridate)\nseven_dwarfs_train_2018 <- seven_dwarfs_train |>\n filter(year(park_date) == 2018) |> # eligibility criteria\n mutate(hour = hour(wait_datetime)) |> # get hour from wait\n group_by(park_date, hour) |> # group by date\n summarise(\n wait_minutes_posted_avg = mean(wait_minutes_posted, na.rm = TRUE),\n .groups = \"drop\"\n ) |> # get average wait time\n mutate(\n wait_minutes_posted_avg =\n case_when(\n is.nan(wait_minutes_posted_avg) ~ NA,\n TRUE ~ wait_minutes_posted_avg\n )\n ) |> # if it is NAN make it NA\n filter(hour == 9) # only keep the average wait time between 9 and 10\n```\n:::\n\n::: {.cell}\n\n```{.r .cell-code}\nseven_dwarfs_train_2018\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n# A tibble: 362 × 3\n park_date hour wait_minutes_posted_avg\n \n 1 2018-01-01 9 60 \n 2 2018-01-02 9 60 \n 3 2018-01-03 9 60 \n 4 2018-01-04 9 68.9\n 5 2018-01-05 9 70.6\n 6 2018-01-06 9 33.3\n 7 2018-01-07 9 46.4\n 8 2018-01-08 9 69.5\n 9 2018-01-09 9 64.3\n10 2018-01-10 9 74.3\n# ℹ 352 more rows\n```\n\n\n:::\n:::\n\n\nNow that we have our outcome settled, we need to get our exposure variable, as well as any other park-specific variables about the day in question that may be used as variables that we adjust for.\nExamining @fig-dag-magic, we see that we need data for three proposed confounders: the ticket season, the time the park closed, and the historic high temperature.\nThese are in the `parks_metadata_raw` dataset.\nThis data will require extra cleaning, since the names are in the original format.\n\n::: callout-tip\nWe like to have our variable names follow a clean convention -- one way to do this is to follow Emily Riederer's \"Column Names as Contracts\" format [@Riederer_2020].\nThe basic idea is to predefine a set of words, phrases, or stubs with clear meanings to index information, and use these consistently when naming variables.\nFor example, in these data, variables that are specific to a particular wait time are prepended with the term `wait` (e.g. `wait_datetime` and `wait_minutes_actual`), variables that are specific to the park on a particular day, acquired from parks metadata, are prepended with the term `park` (e.g. 
`park_date` or `park_temperature_high`).\n:::\n\nLet's first decide what variables we will need.\nIn practice, this decision may involve an iterative process.\nFor example, after drawing our DAG or after conducting diagnostic, we may determine that we need more variables than what we originally cleaned.\nLet's start by skimming this dataframe.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nskim(parks_metadata_raw)\n```\n\n::: {.cell-output-display}\n\nTable: Data summary\n\n| | |\n|:------------------------|:------------------|\n|Name |parks_metadata_raw |\n|Number of rows |2079 |\n|Number of columns |181 |\n|_______________________ | |\n|Column type frequency: | |\n|character |42 |\n|Date |1 |\n|difftime |46 |\n|logical |6 |\n|numeric |86 |\n|________________________ | |\n|Group variables |None |\n\n\n**Variable type: character**\n\n|skim_variable | n_missing| complete_rate| min| max| empty| n_unique| whitespace|\n|:---------------------|---------:|-------------:|---:|---:|-----:|--------:|----------:|\n|wdw_ticket_season | 861| 0.59| 4| 7| 0| 3| 0|\n|season | 253| 0.88| 4| 29| 0| 17| 0|\n|holidayn | 1865| 0.10| 3| 7| 0| 43| 0|\n|wdwticketseason | 761| 0.63| 4| 7| 0| 3| 0|\n|wdwracen | 1992| 0.04| 4| 6| 0| 5| 0|\n|wdweventn | 1832| 0.12| 3| 12| 0| 8| 0|\n|wdwseason | 87| 0.96| 4| 29| 0| 17| 0|\n|mkeventn | 1546| 0.26| 3| 11| 0| 10| 0|\n|epeventn | 868| 0.58| 4| 5| 0| 4| 0|\n|hseventn | 1877| 0.10| 4| 7| 0| 5| 0|\n|akeventn | 2010| 0.03| 4| 4| 0| 2| 0|\n|holidayj | 2037| 0.02| 5| 15| 0| 8| 0|\n|insession | 105| 0.95| 2| 3| 0| 95| 0|\n|insession_enrollment | 105| 0.95| 2| 4| 0| 100| 0|\n|insession_wdw | 105| 0.95| 2| 4| 0| 94| 0|\n|insession_dlr | 105| 0.95| 2| 4| 0| 94| 0|\n|insession_sqrt_wdw | 105| 0.95| 2| 4| 0| 97| 0|\n|insession_sqrt_dlr | 105| 0.95| 2| 4| 0| 97| 0|\n|insession_california | 105| 0.95| 2| 4| 0| 89| 0|\n|insession_dc | 105| 0.95| 2| 4| 0| 86| 0|\n|insession_central_fl | 105| 0.95| 2| 4| 0| 71| 0|\n|insession_drive1_fl | 105| 0.95| 2| 4| 0| 85| 0|\n|insession_drive2_fl | 105| 0.95| 2| 4| 0| 95| 0|\n|insession_drive_ca | 105| 0.95| 2| 4| 0| 91| 0|\n|insession_florida | 105| 0.95| 2| 4| 0| 86| 0|\n|insession_mardi_gras | 105| 0.95| 2| 4| 0| 82| 0|\n|insession_midwest | 105| 0.95| 2| 4| 0| 75| 0|\n|insession_ny_nj | 105| 0.95| 2| 4| 0| 8| 0|\n|insession_ny_nj_pa | 105| 0.95| 2| 4| 0| 19| 0|\n|insession_new_england | 105| 0.95| 2| 4| 0| 45| 0|\n|insession_new_jersey | 105| 0.95| 2| 4| 0| 2| 0|\n|insession_nothwest | 105| 0.95| 2| 4| 0| 17| 0|\n|insession_planes | 105| 0.95| 2| 4| 0| 81| 0|\n|insession_socal | 105| 0.95| 2| 4| 0| 80| 0|\n|insession_southwest | 105| 0.95| 2| 4| 0| 86| 0|\n|mkprddn | 183| 0.91| 33| 41| 0| 2| 0|\n|mkprdnn | 1358| 0.35| 29| 38| 0| 2| 0|\n|mkfiren | 134| 0.94| 18| 65| 0| 8| 0|\n|epfiren | 126| 0.94| 13| 35| 0| 2| 0|\n|hsfiren | 485| 0.77| 24| 66| 0| 6| 0|\n|hsshwnn | 164| 0.92| 10| 28| 0| 2| 0|\n|akshwnn | 883| 0.58| 15| 33| 0| 2| 0|\n\n\n**Variable type: Date**\n\n|skim_variable | n_missing| complete_rate|min |max |median | n_unique|\n|:-------------|---------:|-------------:|:----------|:----------|:----------|--------:|\n|date | 0| 1|2015-01-01 |2021-08-31 |2017-11-05 | 2079|\n\n\n**Variable type: difftime**\n\n|skim_variable | n_missing| complete_rate|min |max |median | n_unique|\n|:-------------|---------:|-------------:|:----------|:-----------|:--------|--------:|\n|mkopen | 0| 1.00|21600 secs |32400 secs |09:00:00 | 4|\n|mkclose | 0| 1.00|54000 secs |107940 secs |22:00:00 | 13|\n|mkemhopen | 0| 1.00|21600 secs |32400 secs |09:00:00 | 
5|\n|mkemhclose | 0| 1.00|54000 secs |107940 secs |23:00:00 | 14|\n|mkopenyest | 0| 1.00|21600 secs |32400 secs |09:00:00 | 4|\n|mkcloseyest | 0| 1.00|54000 secs |107940 secs |22:00:00 | 13|\n|mkopentom | 0| 1.00|21600 secs |32400 secs |09:00:00 | 4|\n|mkclosetom | 0| 1.00|54000 secs |107940 secs |22:00:00 | 13|\n|epopen | 0| 1.00|25200 secs |43200 secs |09:00:00 | 6|\n|epclose | 0| 1.00|61200 secs |90000 secs |21:00:00 | 9|\n|epemhopen | 0| 1.00|25200 secs |43200 secs |09:00:00 | 6|\n|epemhclose | 0| 1.00|61200 secs |90000 secs |21:00:00 | 12|\n|epopenyest | 0| 1.00|25200 secs |43200 secs |09:00:00 | 6|\n|epcloseyest | 0| 1.00|61200 secs |90000 secs |21:00:00 | 9|\n|epopentom | 0| 1.00|25200 secs |43200 secs |09:00:00 | 6|\n|epclosetom | 0| 1.00|61200 secs |90000 secs |21:00:00 | 9|\n|hsopen | 0| 1.00|21600 secs |36000 secs |09:00:00 | 6|\n|hsclose | 0| 1.00|50400 secs |86400 secs |21:00:00 | 14|\n|hsemhopen | 0| 1.00|21600 secs |36000 secs |09:00:00 | 7|\n|hsemhclose | 0| 1.00|50400 secs |93600 secs |21:00:00 | 18|\n|hsopenyest | 0| 1.00|21600 secs |36000 secs |09:00:00 | 6|\n|hscloseyest | 0| 1.00|50400 secs |86400 secs |21:00:00 | 14|\n|hsopentom | 0| 1.00|21600 secs |36000 secs |09:00:00 | 6|\n|hsclosetom | 0| 1.00|50400 secs |86400 secs |21:00:00 | 14|\n|akopen | 0| 1.00|25200 secs |32400 secs |09:00:00 | 3|\n|akclose | 0| 1.00|50400 secs |86400 secs |20:00:00 | 16|\n|akemhopen | 0| 1.00|25200 secs |32400 secs |09:00:00 | 3|\n|akemhclose | 0| 1.00|50400 secs |90000 secs |20:00:00 | 17|\n|akopenyest | 0| 1.00|25200 secs |32400 secs |09:00:00 | 3|\n|akcloseyest | 0| 1.00|50400 secs |86400 secs |20:00:00 | 16|\n|akopentom | 0| 1.00|25200 secs |32400 secs |09:00:00 | 3|\n|akclosetom | 0| 1.00|50400 secs |86400 secs |20:00:00 | 16|\n|mkprddt1 | 183| 0.91|39600 secs |61200 secs |15:00:00 | 5|\n|mkprddt2 | 1851| 0.11|50400 secs |73800 secs |15:30:00 | 5|\n|mkprdnt1 | 1358| 0.35|68400 secs |82800 secs |21:00:00 | 11|\n|mkprdnt2 | 1480| 0.29|0 secs |84600 secs |23:00:00 | 8|\n|mkfiret1 | 134| 0.94|66600 secs |80100 secs |21:15:00 | 12|\n|mkfiret2 | 2069| 0.00|85800 secs |85800 secs |23:50:00 | 1|\n|epfiret1 | 126| 0.94|64800 secs |81000 secs |21:00:00 | 6|\n|epfiret2 | 2074| 0.00|85200 secs |85200 secs |23:40:00 | 1|\n|hsfiret1 | 485| 0.77|0 secs |82800 secs |21:00:00 | 17|\n|hsfiret2 | 2045| 0.02|0 secs |81000 secs |21:00:00 | 5|\n|hsshwnt1 | 164| 0.92|65100 secs |79200 secs |20:30:00 | 10|\n|hsshwnt2 | 1369| 0.34|72000 secs |82800 secs |21:30:00 | 11|\n|akshwnt1 | 883| 0.58|65700 secs |76500 secs |20:30:00 | 13|\n|akshwnt2 | 1149| 0.45|70200 secs |81000 secs |21:45:00 | 13|\n\n\n**Variable type: logical**\n\n|skim_variable | n_missing| complete_rate| mean|count |\n|:-------------|---------:|-------------:|----:|:-----|\n|hsprddt1 | 2079| 0| NaN|: |\n|hsprddn | 2079| 0| NaN|: |\n|akprddt1 | 2079| 0| NaN|: |\n|akprddt2 | 2079| 0| NaN|: |\n|akprddn | 2079| 0| NaN|: |\n|akfiren | 2079| 0| NaN|: |\n\n\n**Variable type: numeric**\n\n|skim_variable | n_missing| complete_rate| mean| sd| p0| p25| p50| p75| p100|hist |\n|:------------------|---------:|-------------:|-----------:|----------:|-----------:|-----------:|-----------:|-----------:|-----------:|:-----|\n|dayofweek | 0| 1| 4.00| 2.00| 1.00| 2.00| 4.00| 6.00| 7.00|▇▃▃▃▇ |\n|dayofyear | 0| 1| 181.84| 106.34| 0.00| 89.00| 184.00| 273.00| 365.00|▇▇▇▇▇ |\n|weekofyear | 0| 1| 26.09| 15.20| 0.00| 13.00| 26.00| 39.00| 53.00|▇▇▇▇▇ |\n|monthofyear | 0| 1| 6.51| 3.48| 1.00| 3.00| 7.00| 10.00| 12.00|▇▅▆▅▇ |\n|year | 0| 1| 2017.41| 1.74| 2015.00| 
2016.00| 2017.00| 2019.00| 2021.00|▇▃▃▃▃ |\n|holidaypx | 0| 1| 7.85| 6.89| 0.00| 3.00| 6.00| 10.00| 33.00|▇▅▂▁▁ |\n|holidaym | 0| 1| 0.54| 1.35| 0.00| 0.00| 0.00| 0.00| 5.00|▇▁▁▁▁ |\n|holiday | 0| 1| 0.10| 0.30| 0.00| 0.00| 0.00| 0.00| 1.00|▇▁▁▁▁ |\n|wdwevent | 0| 1| 0.12| 0.32| 0.00| 0.00| 0.00| 0.00| 1.00|▇▁▁▁▁ |\n|wdwrace | 0| 1| 0.04| 0.20| 0.00| 0.00| 0.00| 0.00| 1.00|▇▁▁▁▁ |\n|wdwmaxtemp | 5| 1| 82.80| 8.53| 51.11| 78.29| 84.50| 89.54| 97.72|▁▁▃▇▆ |\n|wdwmintemp | 6| 1| 65.50| 10.18| 27.48| 59.03| 68.35| 74.10| 81.28|▁▂▃▆▇ |\n|wdwmeantemp | 6| 1| 74.15| 9.06| 39.75| 68.76| 76.37| 81.61| 87.76|▁▂▃▆▇ |\n|mkevent | 0| 1| 0.26| 0.44| 0.00| 0.00| 0.00| 1.00| 1.00|▇▁▁▁▃ |\n|epevent | 0| 1| 0.58| 0.49| 0.00| 0.00| 1.00| 1.00| 1.00|▆▁▁▁▇ |\n|hsevent | 0| 1| 0.10| 0.30| 0.00| 0.00| 0.00| 0.00| 1.00|▇▁▁▁▁ |\n|akevent | 0| 1| 0.03| 0.18| 0.00| 0.00| 0.00| 0.00| 1.00|▇▁▁▁▁ |\n|mkemhmorn | 0| 1| 0.19| 0.40| 0.00| 0.00| 0.00| 0.00| 1.00|▇▁▁▁▂ |\n|mkemhmyest | 0| 1| 0.19| 0.40| 0.00| 0.00| 0.00| 0.00| 1.00|▇▁▁▁▂ |\n|mkemhmtom | 0| 1| 0.19| 0.40| 0.00| 0.00| 0.00| 0.00| 1.00|▇▁▁▁▂ |\n|mkemheve | 0| 1| 0.13| 0.33| 0.00| 0.00| 0.00| 0.00| 1.00|▇▁▁▁▁ |\n|mkhoursemh | 0| 1| 13.64| 1.98| 7.50| 13.00| 14.00| 15.00| 23.98|▁▇▅▁▁ |\n|mkhoursemhyest | 0| 1| 13.65| 1.98| 7.50| 13.00| 14.00| 15.00| 23.98|▁▇▅▁▁ |\n|mkhoursemhtom | 0| 1| 13.64| 1.98| 7.50| 13.00| 14.00| 15.00| 23.98|▁▇▅▁▁ |\n|mkemheyest | 0| 1| 0.13| 0.33| 0.00| 0.00| 0.00| 0.00| 1.00|▇▁▁▁▁ |\n|mkemhetom | 0| 1| 0.13| 0.33| 0.00| 0.00| 0.00| 0.00| 1.00|▇▁▁▁▁ |\n|epemhmorn | 0| 1| 0.13| 0.34| 0.00| 0.00| 0.00| 0.00| 1.00|▇▁▁▁▁ |\n|epemhmyest | 0| 1| 0.13| 0.34| 0.00| 0.00| 0.00| 0.00| 1.00|▇▁▁▁▁ |\n|epemhmtom | 0| 1| 0.13| 0.34| 0.00| 0.00| 0.00| 0.00| 1.00|▇▁▁▁▁ |\n|epemheve | 0| 1| 0.13| 0.34| 0.00| 0.00| 0.00| 0.00| 1.00|▇▁▁▁▁ |\n|epemheyest | 0| 1| 0.13| 0.34| 0.00| 0.00| 0.00| 0.00| 1.00|▇▁▁▁▁ |\n|epemhetom | 0| 1| 0.13| 0.34| 0.00| 0.00| 0.00| 0.00| 1.00|▇▁▁▁▁ |\n|ephoursemh | 0| 1| 12.41| 0.96| 9.00| 12.00| 12.00| 13.00| 17.00|▁▇▃▂▁ |\n|ephoursemhyest | 0| 1| 12.41| 0.96| 9.00| 12.00| 12.00| 13.00| 17.00|▁▇▃▂▁ |\n|ephoursemhtom | 0| 1| 12.41| 0.96| 9.00| 12.00| 12.00| 13.00| 17.00|▁▇▃▂▁ |\n|hsemhmorn | 0| 1| 0.18| 0.38| 0.00| 0.00| 0.00| 0.00| 1.00|▇▁▁▁▂ |\n|hsemhmyest | 0| 1| 0.18| 0.38| 0.00| 0.00| 0.00| 0.00| 1.00|▇▁▁▁▂ |\n|hsemhmtom | 0| 1| 0.18| 0.38| 0.00| 0.00| 0.00| 0.00| 1.00|▇▁▁▁▂ |\n|hsemheve | 0| 1| 0.06| 0.25| 0.00| 0.00| 0.00| 0.00| 1.00|▇▁▁▁▁ |\n|hsemheyest | 0| 1| 0.06| 0.25| 0.00| 0.00| 0.00| 0.00| 1.00|▇▁▁▁▁ |\n|hsemhetom | 0| 1| 0.06| 0.25| 0.00| 0.00| 0.00| 0.00| 1.00|▇▁▁▁▁ |\n|hshoursemh | 0| 1| 12.32| 1.49| 8.00| 11.00| 12.00| 13.00| 18.00|▁▇▇▂▁ |\n|hshoursemhyest | 0| 1| 12.32| 1.49| 8.00| 11.00| 12.00| 13.00| 18.00|▁▇▇▂▁ |\n|hshoursemhtom | 0| 1| 12.32| 1.49| 8.00| 11.00| 12.00| 13.00| 18.00|▁▇▇▂▁ |\n|akemhmorn | 0| 1| 0.30| 0.46| 0.00| 0.00| 0.00| 1.00| 1.00|▇▁▁▁▃ |\n|akemhmyest | 0| 1| 0.30| 0.46| 0.00| 0.00| 0.00| 1.00| 1.00|▇▁▁▁▃ |\n|akemhmtom | 0| 1| 0.30| 0.46| 0.00| 0.00| 0.00| 1.00| 1.00|▇▁▁▁▃ |\n|akemheve | 0| 1| 0.04| 0.20| 0.00| 0.00| 0.00| 0.00| 1.00|▇▁▁▁▁ |\n|akemheyest | 0| 1| 0.04| 0.20| 0.00| 0.00| 0.00| 0.00| 1.00|▇▁▁▁▁ |\n|akemhetom | 0| 1| 0.04| 0.20| 0.00| 0.00| 0.00| 0.00| 1.00|▇▁▁▁▁ |\n|akhoursemh | 0| 1| 11.77| 1.80| 7.00| 11.00| 12.00| 13.00| 17.00|▂▇▇▅▁ |\n|akhoursemhyest | 0| 1| 11.77| 1.80| 7.00| 11.00| 12.00| 13.00| 17.00|▂▇▇▅▁ |\n|akhoursemhtom | 0| 1| 11.76| 1.80| 7.00| 11.00| 12.00| 13.00| 17.00|▂▇▇▅▁ |\n|mkhours | 0| 1| 13.26| 2.01| 7.00| 12.00| 13.00| 15.00| 23.98|▂▇▇▁▁ 
|\n|mkhoursyest | 0| 1| 13.26| 2.01| 7.00| 12.00| 13.00| 15.00| 23.98|▂▇▇▁▁ |\n|mkhourstom | 0| 1| 13.26| 2.00| 7.00| 12.00| 13.00| 15.00| 23.98|▂▇▇▁▁ |\n|ephours | 0| 1| 12.02| 0.64| 8.00| 12.00| 12.00| 12.00| 17.00|▁▁▇▁▁ |\n|ephoursyest | 0| 1| 12.03| 0.64| 8.00| 12.00| 12.00| 12.00| 17.00|▁▁▇▁▁ |\n|ephourstom | 0| 1| 12.02| 0.64| 8.00| 12.00| 12.00| 12.00| 17.00|▁▁▇▁▁ |\n|hshours | 0| 1| 11.92| 1.19| 5.00| 11.00| 12.00| 12.50| 18.00|▁▁▇▂▁ |\n|hshoursyest | 0| 1| 11.93| 1.20| 5.00| 11.00| 12.00| 12.50| 18.00|▁▁▇▂▁ |\n|hshourstom | 0| 1| 11.92| 1.19| 5.00| 11.00| 12.00| 12.50| 18.00|▁▁▇▂▁ |\n|akhours | 0| 1| 11.46| 1.68| 6.00| 10.50| 11.00| 12.50| 17.00|▁▃▇▃▁ |\n|akhoursyest | 0| 1| 11.47| 1.68| 6.00| 10.50| 11.00| 12.50| 17.00|▁▃▇▃▁ |\n|akhourstom | 0| 1| 11.46| 1.68| 6.00| 10.50| 11.00| 12.50| 17.00|▁▃▇▃▁ |\n|weather_wdwhigh | 0| 1| 82.35| 7.86| 70.20| 74.60| 82.80| 90.60| 92.30|▅▃▂▂▇ |\n|weather_wdwlow | 0| 1| 64.10| 9.26| 49.20| 55.80| 63.60| 74.00| 76.10|▅▅▃▂▇ |\n|weather_wdwprecip | 0| 1| 0.15| 0.08| 0.03| 0.08| 0.12| 0.23| 0.35|▇▆▃▅▁ |\n|capacitylost_mk | 0| 1| 422110.61| 36458.81| 352865.00| 385812.00| 433857.00| 456055.00| 473553.00|▅▂▁▇▆ |\n|capacitylost_ep | 0| 1| 366897.04| 24019.96| 325168.00| 338367.00| 380763.00| 381963.00| 394662.00|▅▁▁▂▇ |\n|capacitylost_hs | 0| 1| 287485.76| 33198.89| 203780.00| 279573.00| 301871.00| 311870.00| 321869.00|▂▁▁▁▇ |\n|capacitylost_ak | 0| 1| 228193.83| 14967.82| 210779.00| 220778.00| 223178.00| 232777.00| 273873.00|▇▅▁▁▂ |\n|capacitylostwgt_mk | 0| 1| 41374025.71| 3621097.96| 34661635.00| 37641738.00| 42585643.00| 44577245.00| 46545047.00|▅▂▂▇▆ |\n|capacitylostwgt_ep | 0| 1| 35344939.61| 2201138.89| 31692832.00| 32692333.00| 36666337.00| 36668737.00| 37678138.00|▅▁▁▁▇ |\n|capacitylostwgt_hs | 0| 1| 27528647.53| 3049291.37| 19812520.00| 26772627.00| 28771129.00| 29761030.00| 30750931.00|▂▁▁▁▇ |\n|capacitylostwgt_ak | 0| 1| 22386447.82| 1398263.45| 20790321.00| 21780222.00| 21799422.00| 22780123.00| 26739827.00|▇▅▁▁▂ |\n|mkprdday | 0| 1| 1.07| 0.59| 0.00| 1.00| 1.00| 1.00| 3.00|▁▇▁▁▁ |\n|mkprdngt | 0| 1| 0.64| 0.90| 0.00| 0.00| 0.00| 2.00| 3.00|▇▁▁▃▁ |\n|mkfirewk | 0| 1| 0.94| 0.26| 0.00| 1.00| 1.00| 1.00| 2.00|▁▁▇▁▁ |\n|epfirewk | 0| 1| 0.94| 0.25| 0.00| 1.00| 1.00| 1.00| 3.00|▁▇▁▁▁ |\n|hsprdday | 0| 1| 0.00| 0.00| 0.00| 0.00| 0.00| 0.00| 0.00|▁▁▇▁▁ |\n|hsfirewk | 0| 1| 0.78| 0.45| 0.00| 1.00| 1.00| 1.00| 2.00|▂▁▇▁▁ |\n|hsshwngt | 0| 1| 1.28| 0.62| 0.00| 1.00| 1.00| 2.00| 3.00|▁▇▁▅▁ |\n|hsfirewks | 0| 1| 1.00| 0.00| 1.00| 1.00| 1.00| 1.00| 1.00|▁▁▇▁▁ |\n|akprdday | 0| 1| 0.00| 0.00| 0.00| 0.00| 0.00| 0.00| 0.00|▁▁▇▁▁ |\n|akshwngt | 0| 1| 1.04| 0.97| 0.00| 0.00| 1.00| 2.00| 3.00|▇▂▁▇▁ |\n\n\n:::\n:::\n\n\nThis dataset contains many more variables than the one we worked with previously.\nFor this analysis, we are going to select `date` (the observation date), `wdw_ticket_season` (the ticket season for the observation), `wdwmaxtemp` (the maximum temperature), `mkclose` (the time Magic Kingdom closed), `mkemhmorn` (whether Magic Kingdom had an \"Extra Magic Hour\" in the morning).\n\n\n::: {.cell}\n\n```{.r .cell-code}\nparks_metadata_clean <- parks_metadata_raw |>\n ## based on our analysis plan, we will select the following variables\n select(date, wdw_ticket_season, wdwmaxtemp, mkclose, mkemhmorn) |>\n ## based on eligibility criteria, limit to 2018\n filter(year(date) == 2018) |>\n ## rename variables\n rename(\n park_date = date,\n park_ticket_season = wdw_ticket_season,\n park_temperature_high = wdwmaxtemp,\n park_close = mkclose,\n 
park_extra_magic_morning = mkemhmorn\n )\n```\n:::\n\n\n## Working with multiple data sources\n\nFrequently we find ourselves merging data from multiple sources when attempting to answer causal questions in order to ensure that all of the necessary factors are accounted for.\nThe way we can combine datasets is via *joins* -- joining two or more datasets based on a set or sets of common variables.\nWe can think of three main types of *joins*: left, right, and inner.\nA *left* join combines data from two datasets based on a common variable and includes all records from the *left* dataset along with matching records from the *right* dataset (in `{dplyr}`, `left_join()`), while a *right* join includes all records from the *right* dataset and their corresponding matches from the *left* dataset (in `{dplyr}` `right_join()`); an inner join, on the other hand, includes only the records with matching values in *both* datasets, excluding non-matching records (in `{dplyr}` `inner_join()`.\nFor this analysis, we need to use a left join to pull in the cleaned parks metadata.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nseven_dwarfs_train_2018 <- seven_dwarfs_train_2018 |>\n left_join(parks_metadata_clean, by = \"park_date\")\n```\n:::\n\n\n## Recognizing missing data\n\nIt is important to recognize whether we have any missing data in our variables.\nThe `{visdat}` package is great for getting a quick sense of whether we have any missing data.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(visdat)\nvis_miss(seven_dwarfs_train_2018)\n```\n\n::: {.cell-output-display}\n![](chapter-07_files/figure-html/unnamed-chunk-12-1.png){width=672}\n:::\n:::\n\n\nIt looks like we only have a few observations (2%) missing our outcome of interest.\nThis is not too bad.\nFor this first analysis we will ignore the missing values.\nWe can explicitly drop them using the `drop_na()` function from `{dplyr}`.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nseven_dwarfs_train_2018 <- seven_dwarfs_train_2018 |>\n drop_na()\n```\n:::\n\n\n## Exploring and visualizing data and assumptions\n\nThe *positivity* assumption requires that within each level and combination of the study variables used to achieve exchangeability, there are exposed and unexposed subjects (@sec-assump).\nWe can explore this by visualizing the distribution of each of our proposed confounders stratified by the exposure.\n\n### Single variable checks for positivity violations\n\n@fig-close shows the distribution of Magic Kingdom park closing time by whether the date had extra magic hours in the morning.\nThere is not clear evidence of a lack of positivity here as both exposure levels span the majority of the covariate space.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nggplot(seven_dwarfs_train_2018, aes(\n x = factor(park_close),\n group = factor(park_extra_magic_morning),\n fill = factor(park_extra_magic_morning)\n)) +\n geom_bar(position = position_dodge2(width = 0.9, preserve = \"single\")) +\n labs(\n fill = \"Extra Magic Morning\",\n x = \"Time of Park Close\"\n )\n```\n\n::: {.cell-output-display}\n![Distribution of Magic Kingdom park closing time by whether the date had extra magic hours in the morning](chapter-07_files/figure-html/fig-close-1.png){#fig-close width=672}\n:::\n:::\n\n\nTo examine the distribution of historic temperature high at Magic Kingdom by whether the date had extra magic hours in the morning we can use a mirrored histogram.\nWe'll use the {halfmoon} package's `geom_mirror_histogram()` to create one.\nExamining @fig-temp, it does look like there are 
very few days in the exposed group with maximum temperatures less than 60 degrees, while not necessarily a positivity violation it is worth keeping an eye on, particularly because the dataset is not very large, so this could make it difficult to estimate an average exposure effect across this whole space.\nIf we found this to be particularly difficult, we could posit changing our causal question to instead restrict the analysis to warmer days.\nThis of course would also restrict which days we could draw conclusions about for the future.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(halfmoon)\nggplot(seven_dwarfs_train_2018, aes(\n x = park_temperature_high,\n group = factor(park_extra_magic_morning),\n fill = factor(park_extra_magic_morning)\n)) +\n geom_mirror_histogram(bins = 20) +\n labs(\n fill = \"Extra Magic Morning\",\n x = \"Historic maximum temperature (degrees F)\"\n )\n```\n\n::: {.cell-output-display}\n![Distribution of historic temperature high at Magic Kingdom by whether the date had extra magic hours in the morning](chapter-07_files/figure-html/fig-temp-1.png){#fig-temp width=672}\n:::\n:::\n\n\nFinally, let's look at the distribution of ticket season by whether there were extra magic hours in the morning.\nExamining @fig-ticket, we do not see any positivity violations.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nggplot(seven_dwarfs_train_2018, aes(\n x = park_ticket_season,\n group = factor(park_extra_magic_morning),\n fill = factor(park_extra_magic_morning)\n)) +\n geom_bar(position = \"dodge\") +\n labs(\n fill = \"Extra Magic Morning\",\n x = \"Magic Kingdom Ticket Season\"\n )\n```\n\n::: {.cell-output-display}\n![Distribution of historic temperature high at Magic Kingdom by whether the date had extra magic hours in the morning](chapter-07_files/figure-html/fig-ticket-1.png){#fig-ticket width=672}\n:::\n:::\n\n\n### Multiple variable checks for positivity violations\n\nWe have confirmed that for each of the three confounders, we do not see strong evidence of positivity violations.\nBecause we have so few variables here, we can examine this a bit more closely.\nLet's start by discretizing the `park_temperature_high` variable a bit (we will cut it into tertiles).\n\n\n::: {.cell}\n\n```{.r .cell-code}\nseven_dwarfs_train_2018 |>\n ## cut park_temperature_high into tertiles\n mutate(park_temperature_high_bin = cut(park_temperature_high, breaks = 3)) |>\n ## bin park close time\n mutate(park_close_bin = case_when(\n hour(park_close) < 19 & hour(park_close) > 12 ~ \"(1) early\",\n hour(park_close) >= 19 & hour(park_close) < 24 ~ \"(2) standard\",\n hour(park_close) >= 24 | hour(park_close) < 12 ~ \"(3) late\"\n )) |>\n group_by(park_close_bin, park_temperature_high_bin, park_ticket_season) |>\n ## calculate the proportion exposed in each bin\n summarise(prop_exposed = mean(park_extra_magic_morning), .groups = \"drop\") |>\n ggplot(aes(x = park_close_bin, y = park_temperature_high_bin, fill = prop_exposed)) +\n geom_tile() +\n scale_fill_gradient2(midpoint = 0.5) +\n facet_wrap(~park_ticket_season) +\n labs(\n y = \"Historic Maximum Temperature (F)\",\n x = \"Magic Kingdom Park Close Time\",\n fill = \"Proportion of Days Exposed\"\n )\n```\n\n::: {.cell-output-display}\n![Check for positivity violations across three confounders: historic high temperature, park close time, and ticket season.](chapter-07_files/figure-html/fig-positivity-1.png){#fig-positivity width=864}\n:::\n:::\n\n\nInteresting!\n@fig-positivity shows an interesting potential violation.\nIt looks like 100% of 
days with lower temperatures (historic highs between 51 and 65 degrees) that are in the peak ticket season have extra magic hours in the morning.\nThis actually makes sense if we think a bit about this data set.\nThe only days with cold temperatures in Florida that would also be considered a \"peak\" time to visit Walt Disney World would be over Christmas / New Years.\nDuring this time there historically were always extra magic hours.\n\nWe are going to proceed with the analysis, but we will keep these observations in mind.\n\n## Presenting descriptive statistics\n\nLet's examine a table of the variables of interest in this data frame.\nTo do so, we are going to use the `tbl_summary()` function from the `{gtsummary}` package.\n(We'll also use the `{labelled}` package to clean up the variable names for the table.)\n\n\n::: {#tbl-unweighted-gtsummary .cell tbl-cap='A descriptive table of Extra Magic Morning in the touringplans dataset. This table shows the distributions of these variables in the observed population.'}\n\n```{.r .cell-code}\nlibrary(gtsummary)\n```\n\n::: {.cell-output .cell-output-stderr}\n\n```\n#BlackLivesMatter\n```\n\n\n:::\n\n```{.r .cell-code}\nlibrary(labelled)\nseven_dwarfs_train_2018 <- seven_dwarfs_train_2018 |>\n set_variable_labels(\n park_ticket_season = \"Ticket Season\",\n park_close = \"Close Time\",\n park_temperature_high = \"Historic High Temperature\"\n )\n\ntbl_summary(\n seven_dwarfs_train_2018,\n by = park_extra_magic_morning,\n include = c(park_ticket_season, park_close, park_temperature_high)\n) |>\n # add an overall column to the table\n add_overall(last = TRUE)\n```\n\n::: {.cell-output-display}\n\n```{=html}\n
| Characteristic | 0, N = 294¹ | 1, N = 60¹ | Overall, N = 354¹ |
|:---|:---|:---|:---|
| Ticket Season |  |  |  |
|     peak | 60 (20%) | 18 (30%) | 78 (22%) |
|     regular | 158 (54%) | 35 (58%) | 193 (55%) |
|     value | 76 (26%) | 7 (12%) | 83 (23%) |
| Close Time |  |  |  |
|     16:30:00 | 1 (0.3%) | 0 (0%) | 1 (0.3%) |
|     18:00:00 | 37 (13%) | 18 (30%) | 55 (16%) |
|     20:00:00 | 18 (6.1%) | 2 (3.3%) | 20 (5.6%) |
|     21:00:00 | 28 (9.5%) | 0 (0%) | 28 (7.9%) |
|     22:00:00 | 91 (31%) | 11 (18%) | 102 (29%) |
|     23:00:00 | 78 (27%) | 11 (18%) | 89 (25%) |
|     24:00:00 | 40 (14%) | 17 (28%) | 57 (16%) |
|     25:00:00 | 1 (0.3%) | 1 (1.7%) | 2 (0.6%) |
| Historic High Temperature | 84 (78, 89) | 83 (76, 87) | 84 (78, 89) |

¹ n (%); Median (IQR)
\n```\n\n:::\n:::\n", + "markdown": "# Building propensity score models {#sec-building-models}\n\n\n\n\n\nOften we are interested in how some *exposure* (or treatment) impacts an outcome.\nFor example, we could assess how an ad campaign (exposure) impacts sales (outcome), whether a particular medication (exposure) improves patient survival (outcome), or whether opening a theme park early to some visitors (exposure) reduces wait times later in the day (outcome).\nAs defined in the @sec-counterfactuals, an exposure in the context of this book is often a modifiable event or condition that occurs before the outcome.\nIn an ideal world, we would simply estimate the correlation between the exposure and outcome as the causal effect of the exposure.\nRandomized trials are the best practical examples of this idealized scenario: participants are randomly assigned to exposure groups.\nIf all goes well, this allows for an unbiased estimate of the causal effect between the exposure and outcome.\nIn the \"real world,\" outside this randomized trial setting, we are often *exposed* to something based on other factors.\nFor example, when deciding what medication to give a diabetic patient, a doctor may consider the patient's medical history, their likelihood to adhere to certain medications, and the severity of their disease.\nThe treatment is no longer random; it is *conditional* on factors about that patient, also known as the patient's *covariates*.\nIf these covariates also affect the outcome, they are *confounders*.\n\n::: callout-note\nA **confounder** is a common cause of exposure and outcome.\n:::\n\nSuppose we could collect information about all of these factors.\nIn that case, we could determine each patient's probability of exposure and use this to inform an analysis assessing the relationship between that exposure and some outcome.\nThis probability is the propensity score!\nWhen used appropriately, modeling with a propensity score can simulate what the relationship between exposure and outcome would have looked like if we had run a randomized trial.\nThe correlation between exposure and outcome will estimate the causal effect after applying a propensity score.\nWhen fitting a *propensity score model* we want to condition on all known confounders.\n\n::: callout-note\nA **propensity score** is the probability of being in the exposure group, conditioned on observed covariates.\n:::\n\n@rosenbaum1983central showed in observational studies conditioning on propensity scores can lead to unbiased estimates of the exposure effect as long as certain assumptions hold:\n\n1. There are no unmeasured confounders\n2. 
Every subject has a nonzero probability of receiving either exposure\n\n## Logistic Regression\n\nThere are many ways to estimate the propensity score; typically, people use logistic regression for binary exposures.\nThe logistic regression model predicts the exposure using known confounders.\nEach individual's predicted value is the propensity score.\nThe `glm()` function will fit a logistic regression model in R.\nBelow is pseudo-code.\nThe first argument is the model, with the exposure on the left side and the confounders on the right.\nThe `data` argument takes the data frame, and the `family = binomial()` argument denotes the model should be fit using logistic regression (as opposed to a different generalized linear model).\n\n\n::: {.cell}\n\n```{.r .cell-code}\nglm(\n exposure ~ confounder_1 + confounder_2,\n data = df,\n family = binomial()\n)\n```\n:::\n\n\nWe can extract the propensity scores by pulling out the predictions on the probability scale.\nUsing the `augment()` function from the [{`broom`}](https://broom.tidymodels.org/) package, we can extract these propensity scores and add them to our original data frame.\nThe argument `type.predict` is set to `\"response\"` to indicate that we want to extract the predicted values on the *probability* scale.\nBy default, these will be on the linear logit scale.\nThe `data` argument contains the original data frame.\nThis code will output a new data frame consisting of all components in `df` with six additional columns corresponding to the logistic regression model that was fit.\nThe `.fitted` column is the propensity score.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nglm(\n exposure ~ confounder_1 + confounder_2,\n data = df,\n family = binomial()\n) |>\n augment(type.predict = \"response\", data = df)\n```\n:::\n\n\nLet's look at an example.\n\n### Extra Magic Hours at Magic Kingdom {#sec-prop-dag}\n\nRecall our causal question of interest from @sec-data: **Is there a relationship between whether there were \"Extra Magic Hours\" in the morning at Magic Kingdom and the average wait time for an attraction called the \"Seven Dwarfs Mine Train\" the same day between 9am and 10am in 2018?** Below is a proposed DAG for this question.\n\n\n::: {.cell}\n::: {.cell-output-display}\n![Proposed DAG for the relationship between Extra Magic Hours in the morning at a particular park and the average wait time between 9 am and 10 am. 
Here we are saying that we believe 1) Extra Magic Hours impacts average wait time and 2) both Extra Magic Hours and average wait time are determined by the time the park closes, historic high temperatures, and ticket season.](chapter-07_files/figure-html/fig-dag-magic-hours-wait-1.png){#fig-dag-magic-hours-wait width=672}\n:::\n:::\n\n\nIn @fig-dag-magic-hours-wait, we propose three confounders: the historic high temperature on the day, the time the park closed, and the ticket season: value, regular, or peak.\nWe can build a propensity score model using the `seven_dwarfs_train_2018` data set from the `{touringplans}` package.\nEach row of this dataset contains information about the Seven Dwarfs Mine Train during a particular hour on a given day.\nFirst, we need to subset the data to only include average wait times between 9 and 10 am.\nThen we will use the `glm()` function to fit the propensity score model, predicting `park_extra_magic_morning` using the three confounders specified above.\nWe'll add the propensity scores to the data frame (in a column called `.fitted` as set by the `augment()` function in the `{broom}` package).\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(broom)\nlibrary(touringplans)\n\nseven_dwarfs_9 <- seven_dwarfs_train_2018 |> filter(wait_hour == 9)\n\nseven_dwarfs_9_with_ps <-\n glm(\n park_extra_magic_morning ~ park_ticket_season + park_close + park_temperature_high,\n data = seven_dwarfs_9,\n family = binomial()\n ) |>\n augment(type.predict = \"response\", data = seven_dwarfs_9)\n```\n:::\n\n\nLet's take a look at these propensity scores.\n@tbl-df-ps shows the propensity scores (in the `.fitted` column) for the first six days in the dataset, as well as the values of each day's exposure, outcome, and confounders.\nThe propensity score here is the probability that a given date will have Extra Magic Hours in the morning given the observed confounders, in this case, the historical high temperatures on a given date, the time the park closed, and Ticket Season.\nFor example, on January 1, 2018, there was a 30.2% chance that there would be Extra Magic Hours at the Magic Kingdom given the Ticket Season (peak in this case), time of park closure (11 pm), and the historic high temperature on this date (58.6 degrees).\nOn this particular day, there were *not* Extra Magic Hours in the morning (as indicated by the 0 in the first row of the `park_extra_magic_morning` column).\n\n\n::: {#tbl-df-ps .cell tbl-cap='The first six observations in the `seven_dwarfs_9_with_ps` dataset, including their propensity scores in the `.fitted` column.'}\n\n```{.r .cell-code}\nseven_dwarfs_9_with_ps |>\n select(\n .fitted,\n park_date,\n park_extra_magic_morning,\n park_ticket_season,\n park_close,\n park_temperature_high\n ) |>\n head() |>\n knitr::kable()\n```\n\n::: {.cell-output-display}\n\n\n| .fitted|park_date | park_extra_magic_morning|park_ticket_season |park_close | park_temperature_high|\n|-------:|:----------|------------------------:|:------------------|:----------|---------------------:|\n| 0.3019|2018-01-01 | 0|peak |23:00:00 | 58.63|\n| 0.2815|2018-01-02 | 0|peak |24:00:00 | 53.65|\n| 0.2900|2018-01-03 | 0|peak |24:00:00 | 51.11|\n| 0.1881|2018-01-04 | 0|regular |24:00:00 | 52.66|\n| 0.1841|2018-01-05 | 1|regular |24:00:00 | 54.29|\n| 0.2074|2018-01-06 | 0|regular |23:00:00 | 56.25|\n\n\n:::\n:::\n\n\nWe can examine the distribution of propensity scores by exposure group.\nA nice way to visualize this is via mirrored histograms.\nWe'll use the {halfmoon} package's 
`geom_mirror_histogram()` to create one.\nThe code below creates two histograms of the propensity scores, one on the \"top\" for the exposed group (the dates with Extra Magic Hours in the morning) and one on the \"bottom\" for the unexposed group.\nWe'll also tweak the y-axis labels to use absolute values (rather than negative values for the bottom histogram) via `scale_y_continuous(labels = abs)`.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(halfmoon)\nggplot(\n seven_dwarfs_9_with_ps,\n aes(.fitted, fill = factor(park_extra_magic_morning))\n) +\n geom_mirror_histogram(bins = 50) +\n scale_y_continuous(labels = abs) +\n labs(x = \"propensity score\", fill = \"extra magic morning\")\n```\n\n::: {.cell-output-display}\n![Mirrored histograms of estimated propensity scores for Extra Magic Hour days (exposed group, top) and days without Extra Magic Hours (unexposed group, bottom)](chapter-07_files/figure-html/fig-mirrored-ps-1.png){#fig-mirrored-ps width=672}\n:::\n:::\n\n\nHere are some questions to ask to gain diagnostic insights from @fig-mirrored-ps.\n\n\n\n1. Look for a lack of overlap between the groups, which may signal a positivity problem.\n Too much overlap, however, may indicate a poorly fitting model.\n\n2. The average treatment effect among the treated is easier to estimate with precision (because of higher counts) than the effect in the control group.\n\n3. A single outlier in either group with respect to the range of propensity scores could be a problem and warrants inspecting the data.\n\n\n\n\n## Choosing what variables to include {#sec-choosing-vars}\n\nThe best way to decide what variables to include in your propensity score model is to look at your DAG and have at least a minimal adjustment set of confounders.\nOf course, sometimes essential variables are missing or measured with error.\nIn addition, there is often more than one theoretical adjustment set that debiases your estimate; it may be that one of the minimal adjustment sets is measured well in your data set and another is not.\nIf you have confounders on your DAG that you do not have access to, sensitivity analyses can help quantify the potential impact.\nSee Chapter 11 for an in-depth discussion of sensitivity analyses.\n\nAccurately specifying a DAG improves our ability to add the correct variables to our models.\nHowever, confounders are not the only necessary type of variable to consider.\nFor example, variables that are predictors of the *outcome but not the exposure* can improve the precision of propensity score models.\nConversely, including variables that are predictors of the *exposure but not the outcome* (instrumental variables) can bias the model.\nLuckily, this bias seems relatively negligible in practice, especially compared to the risk of confounding bias [@Myers2011].\n\n::: callout-note\nSome estimates, such as the odds and hazard ratios, suffer from an additional problem called *non-collapsibility*.\nFor these estimates, adding noise variables (variables unrelated to the exposure or outcome) doesn't just reduce precision: they can bias the estimate as well---all the more reason to avoid data-driven approaches to selecting variables for causal models.\n:::\n\nAnother variable to be wary of is a *collider*, a descendant of both the exposure and outcome.\nIf you specify your DAG correctly, you can avoid colliders by only using adjustment sets that completely close backdoor paths from the exposure to the outcome.\nHowever, some circumstances make this difficult: some colliders are inherently stratified by the study's design or the nature of follow-up.\nFor example, loss-to-follow-up is a 
common source of collider-stratification bias; in Chapter XX, we'll discuss this further.\n\nA variable can also be both a confounder and a collider, as in the case of so-called butterfly bias:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nggdag(butterfly_bias()) +\n theme_dag()\n```\n\n::: {.cell-output-display}\n![A causal diagram that displays a variable `m`, which is both a collider and a confounder for `x` and `y`. In situations where you don't have all of the variables in a DAG measured, you may have to make a tough choice about which type of bias is the least bad.\n](chapter-07_files/figure-html/fig-butterfly-dag-1.png){#fig-butterfly-dag width=480}\n:::\n:::\n\n\nConsider @fig-butterfly-dag.\nTo estimate the causal effect of `x` on `y`, we need to account for `m` because it's a confounder.\nHowever, `m` is also a collider between `a` and `b`, so controlling for it will induce a relationship between those variables, creating a second set of confounders.\nIf we have all the variables measured well, we can avoid the bias from adjusting for `m` by adjusting for either `a` or `b` as well.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nggdag_adjustment_set(butterfly_bias()) +\n theme_dag() +\n theme(legend.position = \"bottom\")\n```\n\n::: {.cell-output-display}\n![The adjustment sets in the case of butterfly bias: we can get an unbiased effect by accounting for both `m` and the collider bias induced by adjusting for it. But what if we don't have `a` or `b`?\n](chapter-07_files/figure-html/fig-butterfly-sets-1.png){#fig-butterfly-sets width=480}\n:::\n:::\n\n\nHowever, what should we do if we don't have those variables?\nAdjusting for `m` opens a biasing pathway that we cannot block through `a` and `b` (collider-stratification bias), but `m` is also a confounder for `x` and `y`.\nAs in the case above, it appears that confounding bias is often the worse of the two options, so we should adjust for `m` unless we have reason to believe it will cause more problems than it solves [@DingMiratrix2015].\n\n### Don't use prediction metrics for causal modeling\n\nBy and large, metrics commonly used for building prediction models are inappropriate for building causal models.\nResearchers and data scientists often make decisions about models using metrics like R^2^, AUC, accuracy, and (often inappropriately) p-values.\nHowever, a causal model's goal is not to predict as much about the outcome as possible [@hernan2021]; the goal is to estimate the relationship between the exposure and outcome accurately.\nA causal model needn't predict particularly well to be unbiased.\n\nThese metrics, however, may help identify a model's best *functional form*.\nGenerally, we'll use DAGs and our domain knowledge to build the model itself.\nHowever, we may be unsure of the mathematical relationship between a confounder and the outcome or exposure.\nFor instance, we may not know if the relationship is linear.\nMisspecifying this relationship can lead to residual confounding: we may only partially account for the confounder in question, leaving some bias in the estimate.\nTesting different functional forms using prediction-focused metrics can help improve the model's accuracy, potentially allowing for better control.\n\nAnother technique researchers sometimes use to determine confounders is to add a variable, then calculate the percent change in the coefficient for the exposure.\nFor instance, we first model `y ~ x` to estimate the relationship between `x` and `y`.\nThen, we model `y ~ x + z` and see how much the
coefficient on `x` has changed.\nA common rule is to add a variable if it changes the coefficient of`x` by 10%.\n\nUnfortunately, this technique is unreliable.\nAs we've discussed, controlling for mediators, colliders, and instrumental variables all affect the estimate of the relationship between `x` and `y`, and usually, they result in bias.\nIn other words, there are many different types of variables besides confounders that can cause a change in the coefficient of the exposure.\nAs discussed above, confounding bias is often the most crucial factor, but systematically searching your variables for anything that changes the exposure coefficient can compound many types of bias.\n", "supporting": [ "chapter-07_files" ], diff --git a/_freeze/chapters/chapter-07/figure-html/fig-butterfly-dag-1.png b/_freeze/chapters/chapter-07/figure-html/fig-butterfly-dag-1.png new file mode 100644 index 0000000..3212ab2 Binary files /dev/null and b/_freeze/chapters/chapter-07/figure-html/fig-butterfly-dag-1.png differ diff --git a/_freeze/chapters/chapter-07/figure-html/fig-butterfly-sets-1.png b/_freeze/chapters/chapter-07/figure-html/fig-butterfly-sets-1.png new file mode 100644 index 0000000..cc1c066 Binary files /dev/null and b/_freeze/chapters/chapter-07/figure-html/fig-butterfly-sets-1.png differ diff --git a/_freeze/chapters/chapter-07/figure-html/fig-dag-magic-hours-wait-1.png b/_freeze/chapters/chapter-07/figure-html/fig-dag-magic-hours-wait-1.png new file mode 100644 index 0000000..77ff716 Binary files /dev/null and b/_freeze/chapters/chapter-07/figure-html/fig-dag-magic-hours-wait-1.png differ diff --git a/_freeze/chapters/chapter-07/figure-html/fig-mirrored-ps-1.png b/_freeze/chapters/chapter-07/figure-html/fig-mirrored-ps-1.png new file mode 100644 index 0000000..7e3234e Binary files /dev/null and b/_freeze/chapters/chapter-07/figure-html/fig-mirrored-ps-1.png differ diff --git a/_freeze/chapters/chapter-08/execute-results/html.json b/_freeze/chapters/chapter-08/execute-results/html.json index 82fe525..ba5d4ad 100644 --- a/_freeze/chapters/chapter-08/execute-results/html.json +++ b/_freeze/chapters/chapter-08/execute-results/html.json @@ -1,10 +1,8 @@ { - "hash": "a9b2204a68f6670a68f7890b9dd580f5", + "hash": "568bf0c2251ac748fd5b2fde89e14851", "result": { - "markdown": "# Building propensity score models {#sec-building-models}\n\n\n\n\n\nOften we are interested in how some *exposure* (or treatment) impacts an outcome.\nFor example, we could assess how an ad campaign (exposure) impacts sales (outcome), whether a particular medication (exposure) improves patient survival (outcome), or whether opening a theme park early to some visitors (exposure) reduces wait times later in the day (outcome).\nAs defined in the @sec-counterfactuals, an exposure in the context of this book is often a modifiable event or condition that occurs before the outcome.\nIn an ideal world, we would simply estimate the correlation between the exposure and outcome as the causal effect of the exposure.\nRandomized trials are the best practical examples of this idealized scenario: participants are randomly assigned to exposure groups.\nIf all goes well, this allows for an unbiased estimate of the causal effect between the exposure and outcome.\nIn the \"real world,\" outside this randomized trial setting, we are often *exposed* to something based on other factors.\nFor example, when deciding what medication to give a diabetic patient, a doctor may consider the patient's medical history, their likelihood to adhere to certain 
medications, and the severity of their disease.\nThe treatment is no longer random; it is *conditional* on factors about that patient, also known as the patient's *covariates*.\nIf these covariates also affect the outcome, they are *confounders*.\n\n::: callout-note\nA **confounder** is a common cause of exposure and outcome.\n:::\n\nSuppose we could collect information about all of these factors.\nIn that case, we could determine each patient's probability of exposure and use this to inform an analysis assessing the relationship between that exposure and some outcome.\nThis probability is the propensity score!\nWhen used appropriately, modeling with a propensity score can simulate what the relationship between exposure and outcome would have looked like if we had run a randomized trial.\nThe correlation between exposure and outcome will estimate the causal effect after applying a propensity score.\nWhen fitting a *propensity score model* we want to condition on all known confounders.\n\n::: callout-note\nA **propensity score** is the probability of being in the exposure group, conditioned on observed covariates.\n:::\n\n@rosenbaum1983central showed in observational studies conditioning on propensity scores can lead to unbiased estimates of the exposure effect as long as certain assumptions hold:\n\n1. There are no unmeasured confounders\n2. Every subject has a nonzero probability of receiving either exposure\n\n## Logistic Regression\n\nThere are many ways to estimate the propensity score; typically, people use logistic regression for binary exposures.\nThe logistic regression model predicts the exposure using known confounders.\nEach individual's predicted value is the propensity score.\nThe `glm()` function will fit a logistic regression model in R.\nBelow is pseudo-code.\nThe first argument is the model, with the exposure on the left side and the confounders on the right.\nThe `data` argument takes the data frame, and the `family = binomial()` argument denotes the model should be fit using logistic regression (as opposed to a different generalized linear model).\n\n\n::: {.cell}\n\n```{.r .cell-code}\nglm(\n exposure ~ confounder_1 + confounder_2,\n data = df,\n family = binomial()\n)\n```\n:::\n\n\nWe can extract the propensity scores by pulling out the predictions on the probability scale.\nUsing the `augment()` function from the [{`broom`}](https://broom.tidymodels.org/) package, we can extract these propensity scores and add them to our original data frame.\nThe argument `type.predict` is set to `\"response\"` to indicate that we want to extract the predicted values on the *probability* scale.\nBy default, these will be on the linear logit scale.\nThe `data` argument contains the original data frame.\nThis code will output a new data frame consisting of all components in `df` with six additional columns corresponding to the logistic regression model that was fit.\nThe `.fitted` column is the propensity score.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nglm(\n exposure ~ confounder_1 + confounder_2,\n data = df,\n family = binomial()\n) |>\n augment(type.predict = \"response\", data = df)\n```\n:::\n\n\nLet's look at an example.\n\n### Extra Magic Hours at Magic Kingdom {#sec-prop-dag}\n\nRecall our causal question of interest from @sec-data: **Is there a relationship between whether there were \"Extra Magic Hours\" in the morning at Magic Kingdom and the average wait time for an attraction called the \"Seven Dwarfs Mine Train\" the same day between 9am and 10am in 2018?** Below is a 
proposed DAG for this question.\n\n\n::: {.cell}\n::: {.cell-output-display}\n![Proposed DAG for the relationship between Extra Magic Hours in the morning at a particular park and the average wait time between 9 am and 10 am. Here we are saying that we believe 1) Extra Magic Hours impacts average wait time and 2) both Extra Magic Hours and average wait time are determined by the time the park closes, historic high temperatures, and ticket season.](chapter-08_files/figure-html/fig-dag-magic-hours-wait-1.png){#fig-dag-magic-hours-wait width=672}\n:::\n:::\n\n\nIn @fig-dag-magic-hours-wait, we propose three confounders: the historic high temperature on the day, the time the park closed, and the ticket season: value, regular, or peak.\nWe can build a propensity score model using the `seven_dwarfs_train_2018` data set from the `{touringplans}` package.\nEach row of this dataset contains information about the Seven Dwarfs Mine Train during a particular hour on a given day.\nFirst, we need to subset the data to only include average wait times between 9 and 10 am.\nThen we will use the `glm()` function to fit the propensity score model, predicting `park_extra_magic_morning` using the four confounders specified above.\nWe'll add the propensity scores to the data frame (in a column called `.fitted` as set by the `augment()` function in the `{broom}` package).\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(broom)\nlibrary(touringplans)\n\nseven_dwarfs_9 <- seven_dwarfs_train_2018 |> filter(wait_hour == 9)\n\nseven_dwarfs_9_with_ps <-\n glm(\n park_extra_magic_morning ~ park_ticket_season + park_close + park_temperature_high,\n data = seven_dwarfs_9,\n family = binomial()\n ) |>\n augment(type.predict = \"response\", data = seven_dwarfs_9)\n```\n:::\n\n\nLet's take a look at these propensity scores.\n@tbl-df-ps shows the propensity scores (in the `.fitted` column) for the first six days in the dataset, as well as the values of each day's exposure, outcome, and confounders.\nThe propensity score here is the probability that a given date will have Extra Magic Hours in the morning given the observed confounders, in this case, the historical high temperatures on a given date, the time the park closed, and Ticket Season.\nFor example, on January 1, 2018, there was a 30.2% chance that there would be Extra Magic Hours at the Magic Kingdom given the Ticket Season (peak in this case), time of park closure (11 pm), and the historic high temperature on this date (58.6 degrees).\nOn this particular day, there were *not* Extra Magic Hours in the morning (as indicated by the 0 in the first row of the `park_extra_magic_morning` column).\n\n\n::: {#tbl-df-ps .cell tbl-cap='The first six observations in the `seven_dwarfs_9_with_ps` dataset, including their propensity scores in the `.fitted` column.'}\n\n```{.r .cell-code}\nseven_dwarfs_9_with_ps |>\n select(\n .fitted,\n park_date,\n park_extra_magic_morning,\n park_ticket_season,\n park_close,\n park_temperature_high\n ) |>\n head() |>\n knitr::kable()\n```\n\n::: {.cell-output-display}\n\n\n| .fitted|park_date | park_extra_magic_morning|park_ticket_season |park_close | park_temperature_high|\n|-------:|:----------|------------------------:|:------------------|:----------|---------------------:|\n| 0.3019|2018-01-01 | 0|peak |23:00:00 | 58.63|\n| 0.2815|2018-01-02 | 0|peak |24:00:00 | 53.65|\n| 0.2900|2018-01-03 | 0|peak |24:00:00 | 51.11|\n| 0.1881|2018-01-04 | 0|regular |24:00:00 | 52.66|\n| 0.1841|2018-01-05 | 1|regular |24:00:00 | 54.29|\n| 0.2074|2018-01-06 | 
0|regular |23:00:00 | 56.25|\n\n\n:::\n:::\n\n\nWe can examine the distribution of propensity scores by exposure group.\nA nice way to visualize this is via mirrored histograms.\nWe'll use the {halfmoon} package's `geom_mirror_histogram()` to create one.\nThe code below creates two histograms of the propensity scores, one on the \"top\" for the exposed group (the dates with Extra Magic Hours in the morning) and one on the \"bottom\" for the unexposed group.\nWe'll also tweak the y-axis labels to use absolute values (rather than negative values for the bottom histogram) via `scale_y_continuous(labels = abs)`.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(halfmoon)\nggplot(\n seven_dwarfs_9_with_ps,\n aes(.fitted, fill = factor(park_extra_magic_morning))\n) +\n geom_mirror_histogram(bins = 50) +\n scale_y_continuous(labels = abs) +\n labs(x = \"propensity score\", fill = \"extra magic morning\")\n```\n\n::: {.cell-output-display}\n![Mirrored histograms of estimated propensity scores for Extra Magic Hour days (exposed group, top) and days without Extra Magic hours (unexposed group, bottom)](chapter-08_files/figure-html/fig-mirrored-ps-1.png){#fig-mirrored-ps width=672}\n:::\n:::\n\n\nHere are some questions to ask to gain diagnostic insights we gain from @fig-mirrored-ps.\n\n\n\n1. Look for lack of overlap as a potential positivity problem.\n But too much overlap may indicate a poor model \n \n\n2. Avg treatment effect among treated is easier to estimate with precision (because of higher counts) than in the control group.\n\n3. A single outlier in either group concerning range could be a problem and warrant data inspection\n\n\n\n\n## Choosing what variables to include {#sec-choosing-vars}\n\nThe best way to decide what variables to include in your propensity score model is to look at your DAG and have at least a minimal adjustment set of confounders.\nOf course, sometimes, essential variables are missing or measured with error.\nIn addition, there is often more than one theoretical adjustment set that debiases your estimate; it may be that one of the minimal adjustment sets is measured well in your data set and another is not.\nIf you have confounders on your DAG that you do not have access to, sensitivity analyses can help quantify the potential impact.\nSee Chapter 11 for an in-depth discussion of sensitivity analyses.\n\nAccurately specifying a DAG improves our ability to add the correct variables to our models.\nHowever, confounders are not the only necessary type of variable to consider.\nFor example, variables that are predictors of the *outcome* *but not the exposure* can improve the precision of propensity score models.\nConversely, including variables that are predictors of the *exposure but not the outcome* (instrumental variables) can bias the model.\nLuckily, this bias seems relatively negligible in practice, especially compared to the risk of confounding bias [@Myers2011].\n\n::: callout-note\nSome estimates, such as the odds and hazard ratios, suffer from an additional problem called *non-collapsibility*.\nFor these estimates, adding noise variables (variables unrelated to the exposure or outcome) doesn't reduce precision: they can bias the estimate as well---more the reason to avoid data-driven approaches to selecting variables for causal models.\n:::\n\nAnother variable to be wary of is a *collider*, a descendant of both the exposure and outcome.\nIf you specify your DAG correctly, you can avoid colliders by only using adjustment sets that completely close backdoor paths 
from the exposure to the outcome.\nHowever, some circumstances make this difficult: some colliders are inherently stratified by the study's design or the nature of follow-up.\nFor example, loss-to-follow-up is a common source of collider-stratification bias; in Chapter XX, we'll discuss this further.\n\nA variable can also be both a confounder and a collider, as in the case of so-called butterfly bias:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nggdag(butterfly_bias()) +\n theme_dag()\n```\n\n::: {.cell-output-display}\n![A causal diagram that displays a variable `m`, which is both a collider and a confounder for `x` and `y`. In situations where you don't have all measured variables in a DAG, you may have to make a tough choice about which type of bias is the least bad.\n](chapter-08_files/figure-html/fig-butterfly-dag-1.png){#fig-butterfly-dag width=480}\n:::\n:::\n\n\nConsider @fig-butterfly-dag.\nTo estimate the causal effect of `x` on `y`, we need to account for `m` because it's a counfounder.\nHowever, `m` is also a collider between `a` and `b`, so controlling for it will induce a relationship between those variables, creating a second set of confounders.\nIf we have all the variables measured well, we can avoid the bias from adjusting for `m` by adjusting for either `a` or `b` as well.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nggdag_adjustment_set(butterfly_bias()) +\n theme_dag() +\n theme(legend.position = \"bottom\")\n```\n\n::: {.cell-output-display}\n![The adjustment sets in the case of butterfly bias: we can get an unbiased effect by accounting for both `m` and the collider bias induced by adjusting for it. But what if we don't have `a` or `b`?\n](chapter-08_files/figure-html/fig-butterfly-sets-1.png){#fig-butterfly-sets width=480}\n:::\n:::\n\n\nHowever, what should we do if we don't have those variables?\nAdjusting for `m` opens a biasing pathway that we cannot block through `a` and `b`(collider-stratification bias), but `m` is also a confounder for `x` and `y`.\nAs in the case above, it appears that confounding bias is often the worse of the two options, so we should adjust for `m` unless we have reason to believe it will cause more problems than it solves [@DingMiratrix2015].\n\n### Don't use prediction metrics for causal modeling\n\nBy and large, metrics commonly used for building prediction models are inappropriate for building causal models.\nResearchers and data scientists often make decisions about models using metrics like R^2^, AUC, accuracy, and (often inappropriately) p-values.\nHowever, a causal model's goal is not to predict as much about the outcome as possible [@hernan2021]; the goal is to estimate the relationship between the exposure and outcome accurately.\nA causal model needn't predict particularly well to be unbiased.\n\nThese metrics, however, may help identify a model's best *functional form*.\nGenerally, we'll use DAGs and our domain knowledge to build the model itself.\nHowever, we may be unsure of the mathematical relationship between a confounder and the outcome or exposure.\nFor instance, we may not know if the relationship is linear.\nMisspecifying this relationship can lead to residual confounding: we may only partially account for the confounder in question, leaving some bias in the estimate.\nTesting different functional forms using prediction-focused metrics can help improve the model's accuracy, potentially allowing for better control.\n\nAnother technique researchers sometimes use to determine confounders is to add a variable, then calculate the 
percent change in the coefficient between the outcome and exposure.\nFor instance, we first model `y ~ x` to estimate the relationship between `x` and `y`.\nThen, we model `y ~ x + z` and see how much the coefficient on `x` has changed.\nA common rule is to add a variable if it changes the coefficient of`x` by 10%.\n\nUnfortunately, this technique is unreliable.\nAs we've discussed, controlling for mediators, colliders, and instrumental variables all affect the estimate of the relationship between `x` and `y`, and usually, they result in bias.\nIn other words, there are many different types of variables besides confounders that can cause a change in the coefficient of the exposure.\nAs discussed above, confounding bias is often the most crucial factor, but systematically searching your variables for anything that changes the exposure coefficient can compound many types of bias.\n", - "supporting": [ - "chapter-08_files/figure-html" - ], + "markdown": "# Using the propensity score {#sec-using-ps}\n\n\n\n\n\nThe propensity score is a *balancing* tool -- we use it to help us make our exposure groups *exchangeable*. There are many ways to incorporate the propensity score into an analysis. Commonly used techniques include stratification (estimating the causal effect within propensity score strata), matching, weighting, and direct covariate adjustment. In this section, we will focus on *matching* and *weighting*; other techniques will be discussed once we introduce the *outcome model*. Recall that at this point in the book we are still in the *design* phase. We have not yet incorporated the outcome into our analysis at all. \n\n## Matching\n\nUltimately, we want the exposed and unexposed observations to be *exchangeable* with respect to the confounders we have proposed in our DAG (so we can use the observed effect for one to estimate the counterfactual for the other). One way to do this is to ensure that each observation in our analysis sample has at least one observation of the opposite exposure that has *match*ing values for each of these confounders. If we had a small number of binary confounders, for example, we might be able to construct an *exact match* for observations (and only include those for which such a match exists), but as the number and continuity of confounders increases, exact matching becomes less feasible. This is where the propensity score, a summary measure of all of the confounders, comes into play. \n\nLet's set up the data as we did in @sec-building-models.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(broom)\nlibrary(touringplans)\n\nseven_dwarfs_9 <- seven_dwarfs_train_2018 |> filter(wait_hour == 9)\n```\n:::\n\n\nWe can re-fit the propensity score using the `{MatchIt}` package, as below. Notice here that the `matchit` function fits a logistic regression model for our propensity score, just as we did in @sec-building-models. There were 60 days in 2018 when the Magic Kingdom had Extra Magic Hours in the morning. For each of these 60 exposed days, `matchit` found a comparable unexposed day by implementing a nearest-neighbor match using the constructed propensity score. 
Examining the output, we also see that the target estimand is an \"ATT\" (do not worry about this yet, we will discuss this and several other estimands in @sec-estimands).\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(MatchIt)\nm <- matchit(\n park_extra_magic_morning ~ park_ticket_season + park_close + park_temperature_high,\n data = seven_dwarfs_9)\nm\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\nA matchit object\n - method: 1:1 nearest neighbor matching without replacement\n - distance: Propensity score\n - estimated with logistic regression\n - number of obs.: 354 (original), 120 (matched)\n - target estimand: ATT\n - covariates: park_ticket_season, park_close, park_temperature_high\n```\n\n\n:::\n:::\n\nWe can use the `get_matches` function to create a data frame with the original variables that only consists of those who were matched. Notice here our sample size has been reduced from the original 354 days to 120. \n\n\n::: {.cell}\n\n```{.r .cell-code}\nmatched_data <- get_matches(m)\nglimpse(matched_data)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\nRows: 120\nColumns: 18\n$ id \"5\", \"340\", \"12\", \"1…\n$ subclass 1, 1, 2, 2, 3, 3, 4,…\n$ weights 1, 1, 1, 1, 1, 1, 1,…\n$ park_date 2018-01-05, 2018-12…\n$ wait_hour 9, 9, 9, 9, 9, 9, 9,…\n$ attraction_name \"Seven Dwarfs Mine T…\n$ wait_minutes_actual_avg 33.0, 8.0, 114.0, 32…\n$ wait_minutes_posted_avg 70.56, 80.62, 79.29,…\n$ attraction_park \"Magic Kingdom\", \"Ma…\n$ attraction_land \"Fantasyland\", \"Fant…\n$ park_open