
Commit

Merge pull request #194 from r-causal/outcome
update outcome model chapter
LucyMcGowan authored Oct 30, 2023
2 parents 0634944 + ca168b2 commit c11edc1
Showing 3 changed files with 204 additions and 3 deletions.
4 changes: 2 additions & 2 deletions _freeze/chapters/chapter-11/execute-results/html.json
@@ -1,7 +1,7 @@
{
-   "hash": "52cdc5c043504620e3c65dd4015038d2",
+   "hash": "35aeb2819ee49bb26e30753bbfdeca1e",
    "result": {
-     "markdown": "# Fitting the outcome model {#sec-outcome-model}\n\n\n\n\n\n## Using matched data sets\n\n\n::: {.cell}\n\n```{.r .cell-code}\nrnorm(5)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] -0.5790 -0.5911 0.3102 0.3006 1.4746\n```\n\n\n:::\n:::\n\n\n## Using weights in outcome models\n\n## Estimating uncertainty\n",
+     "markdown": "# Fitting the outcome model {#sec-outcome-model}\n\n… (regenerated chapter content with rendered code output; see the chapters/chapter-11.qmd diff below) …",
    "supporting": [
      "chapter-11_files"
    ],
203 changes: 202 additions & 1 deletion chapters/chapter-11.qmd
@@ -4,10 +4,211 @@

## Using matched data sets

When fitting an outcome model on matched data sets, we can simply subset the original data to only the matched observations and then fit the model on these data as we would otherwise. For example, re-performing the matching as we did in @sec-using-ps, we can extract the matched observations into a data set called `matched_data` as follows.

```{r}
library(broom)
library(touringplans)
library(MatchIt)

seven_dwarfs_9 <- seven_dwarfs_train_2018 |>
  filter(wait_hour == 9)

m <- matchit(
  park_extra_magic_morning ~ park_ticket_season + park_close + park_temperature_high,
  data = seven_dwarfs_9
)
matched_data <- get_matches(m)
```
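
Before fitting the outcome model, it can help to confirm what the matching produced. As an optional check (a sketch, not part of the original analysis), `summary()` on the `matchit` object reports covariate balance and the matched sample sizes:

```{r}
#| eval: false
# optional: inspect covariate balance and how many days were matched
summary(m)
nrow(matched_data)
```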

We can then fit the outcome model on these data. For this analysis, we are interested in the impact of extra magic morning hours on the average posted wait time between 9 and 10am. The linear model below estimates this effect in the matched cohort.

```{r}
lm(wait_minutes_posted_avg ~ park_extra_magic_morning, data = matched_data) |>
  tidy(conf.int = TRUE)
```

Recall that by default `{MatchIt}` estimates the average treatment effect among the treated (ATT). This means that, among days that have extra magic hours, the expected impact of having extra magic hours on the average posted wait time between 9 and 10am is 7.9 minutes (95% CI: 1.2-14.5).
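
If you prefer to be explicit about the target estimand rather than relying on the default, `matchit()` also accepts an `estimand` argument; the sketch below simply restates the default ATT behavior of the call above.

```{r}
#| eval: false
# equivalent to the earlier call: "ATT" is the matchit() default estimand
m <- matchit(
  park_extra_magic_morning ~ park_ticket_season + park_close + park_temperature_high,
  data = seven_dwarfs_9,
  estimand = "ATT"
)
```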

## Using weights in outcome models

Now let's use propensity score weights to estimate this same estimand. We will use the ATT weights so that this analysis is comparable to the matched analysis above.

```{r}
library(propensity)

seven_dwarfs_9_with_ps <-
  glm(
    park_extra_magic_morning ~ park_ticket_season + park_close + park_temperature_high,
    data = seven_dwarfs_9,
    family = binomial()
  ) |>
  augment(type.predict = "response", data = seven_dwarfs_9)
seven_dwarfs_9_with_wt <- seven_dwarfs_9_with_ps |>
  mutate(w_att = wt_att(.fitted, park_extra_magic_morning))
```

We can fit a *weighted* outcome model by using the `weights` argument.

```{r}
lm(wait_minutes_posted_avg ~ park_extra_magic_morning,
   data = seven_dwarfs_9_with_wt,
   weights = w_att) |>
  tidy()
```

Using weighting, we estimate that among days that have extra magic hours, the expected impact of having extra magic hours on the average posted wait time between 9 and 10am is 6.2 minutes.
While this approach gives us the desired point estimate, the default uncertainty output from `lm` (the standard errors and confidence intervals) is not correct: `lm` treats the weights as known precision weights and does not account for how they were estimated.
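
As a quick check (a sketch, not part of the original analysis), the coefficient from this weighted regression is simply the ATT-weighted difference in mean wait times between extra-magic-morning days and other days, which we can compute directly:

```{r}
#| eval: false
# sanity check: the weighted difference in means matches the coefficient above
seven_dwarfs_9_with_wt |>
  summarise(
    att_diff = weighted.mean(
      wait_minutes_posted_avg[park_extra_magic_morning == 1],
      w_att[park_extra_magic_morning == 1]
    ) -
      weighted.mean(
        wait_minutes_posted_avg[park_extra_magic_morning == 0],
        w_att[park_extra_magic_morning == 0]
      )
  )
```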


## Estimating uncertainty

There are three ways to estimate the uncertainty:

1. A bootstrap
2. A sandwich estimator that only takes into account the outcome model
3. A sandwich estimator that takes into account the propensity score model

The first option can be computationally intensive, but it should give you correct estimates of the uncertainty.
The second option is computationally the easiest, but it tends to overestimate the variability. There are not many existing solutions in R for the third (aside from coding it up yourself); however, the `{PSW}` package will do this.

### The bootstrap

1. Create a function to run your analysis once on a sample of your data

```{r}
fit_ipw <- function(split, ...) {
  .df <- analysis(split)

  # fit the propensity score model on the bootstrapped sample
  propensity_model <- glm(
    park_extra_magic_morning ~ park_ticket_season + park_close + park_temperature_high,
    data = .df,
    family = binomial()
  )

  # calculate inverse probability (ATT) weights
  .df <- propensity_model |>
    augment(type.predict = "response", data = .df) |>
    mutate(wts = wt_att(.fitted,
                        park_extra_magic_morning,
                        exposure_type = "binary"))

  # fit the weighted outcome model on the bootstrapped sample
  lm(wait_minutes_posted_avg ~ park_extra_magic_morning,
     data = .df, weights = wts) |>
    tidy()
}
```
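
Before running the full bootstrap, it is worth confirming that the function works on a single resample; a minimal check (a sketch, assuming `{rsample}` is available) might look like this.

```{r}
#| eval: false
library(rsample)

# run the analysis once on a single bootstrap resample
single_split <- bootstraps(seven_dwarfs_9, times = 1)$splits[[1]]
fit_ipw(single_split)
```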


2. Use `{rsample}` to bootstrap our causal effect

```{r}
library(rsample)

# fit ipw model to bootstrapped samples
bootstrapped_seven_dwarfs <- bootstraps(
  seven_dwarfs_9,
  times = 1000,
  apparent = TRUE
)

ipw_results <- bootstrapped_seven_dwarfs |>
  mutate(boot_fits = map(splits, fit_ipw))

ipw_results
```


Let's look at the results.

```{r}
ipw_results |>
  mutate(
    estimate = map_dbl(
      boot_fits,
      \(.fit) .fit |>
        filter(term == "park_extra_magic_morning") |>
        pull(estimate)
    )
  ) |>
  ggplot(aes(estimate)) +
  geom_histogram(bins = 30, fill = "#D55E00FF", color = "white", alpha = 0.8) +
  theme_minimal()
```


3. Pull out the causal effect

```{r}
# get t-based CIs
boot_estimate <- int_t(ipw_results, boot_fits) |>
  filter(term == "park_extra_magic_morning")
boot_estimate
```

We estimate that among days that have extra magic hours, the expected impact of having extra magic hours on the average posted wait time between 9 and 10am is `r round(boot_estimate$.estimate, 1)` minutes, 95% CI (`r round(boot_estimate$.lower, 1)`, `r round(boot_estimate$.upper, 1)`).
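
Percentile-based intervals computed from the same bootstrap fits provide a useful cross-check on the t-based interval above; this is a sketch using `int_pctl()` from `{rsample}`.

```{r}
#| eval: false
# percentile bootstrap CI as a cross-check on the t-based interval
int_pctl(ipw_results, boot_fits) |>
  filter(term == "park_extra_magic_morning")
```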

### The outcome model sandwich

There are two ways to get the sandwich estimator. The first is to use the same weighted outcome model as above along with the `{sandwich}` package. Using the `sandwich` function, we can get the robust estimate of the variance of the parameter of interest, as shown below.

```{r}
library(sandwich)
weighted_mod <- lm(wait_minutes_posted_avg ~ park_extra_magic_morning,
                   data = seven_dwarfs_9_with_wt,
                   weights = w_att)

sandwich(weighted_mod)
```

Here, our robust variance estimate is `r round(sandwich(weighted_mod)[2,2], 3)`. We can then use this to construct a robust confidence interval.

```{r}
robust_var <- sandwich(weighted_mod)[2, 2]
point_est <- coef(weighted_mod)[2]
lb <- point_est - 1.96 * sqrt(robust_var)
ub <- point_est + 1.96 * sqrt(robust_var)
lb
ub
```
We estimate that among days that have extra magic hours, the expected impact of having extra magic hours on the average posted wait time between 9 and 10am is `r round(point_est, 1)` minutes, 95% CI (`r round(lb, 1)`, `r round(ub, 1)`).
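
Equivalently, the `{lmtest}` package can combine the model coefficients with the sandwich variance in a single step; the sketch below assumes `{lmtest}` is installed.

```{r}
#| eval: false
library(lmtest)

# robust coefficient test and confidence interval using the sandwich variance
coeftest(weighted_mod, vcov. = sandwich)
coefci(weighted_mod, vcov. = sandwich)
```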

Alternatively, we could fit the model using the `{survey}` package. To do this, we need to create a design object, like we did when fitting weighted tables.

```{r}
library(survey)
des <- svydesign(
  ids = ~1,
  weights = ~w_att,
  data = seven_dwarfs_9_with_wt
)
```

Then we can use `svyglm` to fit the outcome model.

```{r}
svyglm(wait_minutes_posted_avg ~ park_extra_magic_morning, des) |>
  tidy(conf.int = TRUE)
```
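
The standard error `svyglm` reports is design-based and should be close to the sandwich-based standard error computed above; a quick comparison (a sketch) is shown below.

```{r}
#| eval: false
# compare with the sandwich-based standard error from earlier
sqrt(sandwich(weighted_mod)[2, 2])
```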

### Sandwich estimator that takes into account the propensity score model

The correct sandwich estimator will also take into account the propensity score model. The `{PSW}` package will allow us to do this. This package has some quirks; for example, it doesn't work well with categorical variables, so we need to create dummy variables for `park_ticket_season` to pass into the model. *Actually, the code below isn't working because it seems there is a bug in the package. Stay tuned!*

```{r}
#| eval: false
library(PSW)

seven_dwarfs_9 <- seven_dwarfs_9 |>
  mutate(
    park_ticket_season_regular = ifelse(park_ticket_season == "regular", 1, 0),
    park_ticket_season_value = ifelse(park_ticket_season == "value", 1, 0)
  )
psw(
  data = seven_dwarfs_9,
  form.ps = "park_extra_magic_morning ~ park_ticket_season_regular + park_ticket_season_value + park_close + park_temperature_high",
  weight = "ATT",
  wt = TRUE,
  out.var = "wait_minutes_posted_avg"
)
```
