
Commit

Merge pull request #194 from r-causal/outcome
update outcome model chapter
LucyMcGowan authored Oct 30, 2023
2 parents 0634944 + ca168b2 commit c11edc1
Showing 3 changed files with 204 additions and 3 deletions.
4 changes: 2 additions & 2 deletions _freeze/chapters/chapter-11/execute-results/html.json
@@ -1,7 +1,7 @@
{
-   "hash": "52cdc5c043504620e3c65dd4015038d2",
+   "hash": "35aeb2819ee49bb26e30753bbfdeca1e",
    "result": {
-     "markdown": "# Fitting the outcome model {#sec-outcome-model}\n\n\n\n\n\n## Using matched data sets\n\n\n::: {.cell}\n\n```{.r .cell-code}\nrnorm(5)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] -0.5790 -0.5911 0.3102 0.3006 1.4746\n```\n\n\n:::\n:::\n\n\n## Using weights in outcome models\n\n## Estimating uncertainty\n",
+     "markdown": "# Fitting the outcome model {#sec-outcome-model}\n\n… (regenerated chapter content with rendered code output; see the chapters/chapter-11.qmd diff below) …",
    "supporting": [
      "chapter-11_files"
    ],
203 changes: 202 additions & 1 deletion chapters/chapter-11.qmd
@@ -4,10 +4,211 @@

## Using matched data sets

When fitting an outcome model on matched data sets, we can simply subset the original data to only the matched observations and then fit the model on these data as we would otherwise. For example, re-performing the matching as we did in @sec-using-ps, we can extract the matched observations into a data set called `matched_data` as follows.

```{r}
library(broom)
library(touringplans)
library(MatchIt)

seven_dwarfs_9 <- seven_dwarfs_train_2018 |>
  filter(wait_hour == 9)

m <- matchit(
  park_extra_magic_morning ~ park_ticket_season + park_close + park_temperature_high,
  data = seven_dwarfs_9
)
matched_data <- get_matches(m)
```
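
Before fitting the outcome model, it can help to confirm what the matching produced. As an optional check (a sketch, not part of the original analysis), `summary()` on the `matchit` object reports covariate balance and the matched sample sizes:

```{r}
#| eval: false
# optional: inspect covariate balance and how many days were matched
summary(m)
nrow(matched_data)
```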

We can then fit the outcome model on these data. For this analysis, we are interested in the impact of extra magic morning hours on the average posted wait time between 9 and 10am. The linear model below estimates this effect in the matched cohort.

```{r}
lm(wait_minutes_posted_avg ~ park_extra_magic_morning, data = matched_data) |>
  tidy(conf.int = TRUE)
```

Recall that by default `{MatchIt}` estimates the average treatment effect among the treated (ATT). This means that, among days that have extra magic hours, the expected impact of having extra magic hours on the average posted wait time between 9 and 10am is 7.9 minutes (95% CI: 1.2-14.5).
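
If you prefer to be explicit about the target estimand rather than relying on the default, `matchit()` also accepts an `estimand` argument; the sketch below simply restates the default ATT behavior of the call above.

```{r}
#| eval: false
# equivalent to the earlier call: "ATT" is the matchit() default estimand
m <- matchit(
  park_extra_magic_morning ~ park_ticket_season + park_close + park_temperature_high,
  data = seven_dwarfs_9,
  estimand = "ATT"
)
```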

## Using weights in outcome models

Now let's use propensity score weights to estimate this same estimand. We will use the ATT weights so that this analysis is comparable to the matched analysis above.

```{r}
library(propensity)

seven_dwarfs_9_with_ps <-
  glm(
    park_extra_magic_morning ~ park_ticket_season + park_close + park_temperature_high,
    data = seven_dwarfs_9,
    family = binomial()
  ) |>
  augment(type.predict = "response", data = seven_dwarfs_9)
seven_dwarfs_9_with_wt <- seven_dwarfs_9_with_ps |>
  mutate(w_att = wt_att(.fitted, park_extra_magic_morning))
```

We can fit a *weighted* outcome model by using the `weights` argument.

```{r}
lm(wait_minutes_posted_avg ~ park_extra_magic_morning,
   data = seven_dwarfs_9_with_wt,
   weights = w_att) |>
  tidy()
```

Using weighting, we estimate that among days that have extra magic hours, the expected impact of having extra magic hours on the average posted wait time between 9 and 10am is 6.2 minutes.
While this approach gives us the desired point estimate, the default uncertainty output from `lm` (the standard errors and confidence intervals) is not correct: `lm` treats the weights as known precision weights and does not account for how they were estimated.
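
As a quick check (a sketch, not part of the original analysis), the coefficient from this weighted regression is simply the ATT-weighted difference in mean wait times between extra-magic-morning days and other days, which we can compute directly:

```{r}
#| eval: false
# sanity check: the weighted difference in means matches the coefficient above
seven_dwarfs_9_with_wt |>
  summarise(
    att_diff = weighted.mean(
      wait_minutes_posted_avg[park_extra_magic_morning == 1],
      w_att[park_extra_magic_morning == 1]
    ) -
      weighted.mean(
        wait_minutes_posted_avg[park_extra_magic_morning == 0],
        w_att[park_extra_magic_morning == 0]
      )
  )
```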


## Estimating uncertainty

There are three ways to estimate the uncertainty:

1. A bootstrap
2. A sandwich estimator that only takes into account the outcome model
3. A sandwich estimator that takes into account the propensity score model

The first option can be computationally intensive, but it should give you correct estimates of the uncertainty.
The second option is computationally the easiest, but it tends to overestimate the variability. There are not many existing solutions in R for the third (aside from coding it up yourself); however, the `{PSW}` package will do this.

### The bootstrap

1. Create a function to run your analysis once on a sample of your data

```{r}
fit_ipw <- function(split, ...) {
  .df <- analysis(split)

  # fit the propensity score model on the bootstrapped sample
  propensity_model <- glm(
    park_extra_magic_morning ~ park_ticket_season + park_close + park_temperature_high,
    data = .df,
    family = binomial()
  )

  # calculate inverse probability (ATT) weights
  .df <- propensity_model |>
    augment(type.predict = "response", data = .df) |>
    mutate(wts = wt_att(.fitted,
                        park_extra_magic_morning,
                        exposure_type = "binary"))

  # fit the weighted outcome model on the bootstrapped sample
  lm(wait_minutes_posted_avg ~ park_extra_magic_morning,
     data = .df, weights = wts) |>
    tidy()
}
```
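
Before running the full bootstrap, it is worth confirming that the function works on a single resample; a minimal check (a sketch, assuming `{rsample}` is available) might look like this.

```{r}
#| eval: false
library(rsample)

# run the analysis once on a single bootstrap resample
single_split <- bootstraps(seven_dwarfs_9, times = 1)$splits[[1]]
fit_ipw(single_split)
```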


2. Use `{rsample}` to bootstrap our causal effect

```{r}
library(rsample)

# fit ipw model to bootstrapped samples
bootstrapped_seven_dwarfs <- bootstraps(
  seven_dwarfs_9,
  times = 1000,
  apparent = TRUE
)

ipw_results <- bootstrapped_seven_dwarfs |>
  mutate(boot_fits = map(splits, fit_ipw))

ipw_results
```


Let's look at the results.

```{r}
ipw_results |>
  mutate(
    estimate = map_dbl(
      boot_fits,
      \(.fit) .fit |>
        filter(term == "park_extra_magic_morning") |>
        pull(estimate)
    )
  ) |>
  ggplot(aes(estimate)) +
  geom_histogram(bins = 30, fill = "#D55E00FF", color = "white", alpha = 0.8) +
  theme_minimal()
```


3. Pull out the causal effect

```{r}
# get t-based CIs
boot_estimate <- int_t(ipw_results, boot_fits) |>
  filter(term == "park_extra_magic_morning")
boot_estimate
```

We estimate that among days that have extra magic hours, the expected impact of having extra magic hours on the average posted wait time between 9 and 10am is `r round(boot_estimate$.estimate, 1)` minutes, 95% CI (`r round(boot_estimate$.lower, 1)`, `r round(boot_estimate$.upper, 1)`).
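
Percentile-based intervals computed from the same bootstrap fits provide a useful cross-check on the t-based interval above; this is a sketch using `int_pctl()` from `{rsample}`.

```{r}
#| eval: false
# percentile bootstrap CI as a cross-check on the t-based interval
int_pctl(ipw_results, boot_fits) |>
  filter(term == "park_extra_magic_morning")
```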

### The outcome model sandwich

There are two ways to get the sandwich estimator. The first is to use the same weighted outcome model as above along with the `{sandwich}` package. Using the `sandwich` function, we can get the robust estimate of the variance of the parameter of interest, as shown below.

```{r}
library(sandwich)
weighted_mod <- lm(wait_minutes_posted_avg ~ park_extra_magic_morning,
                   data = seven_dwarfs_9_with_wt,
                   weights = w_att)

sandwich(weighted_mod)
```

Here, our robust variance estimate is `r round(sandwich(weighted_mod)[2,2], 3)`. We can then use this to construct a robust confidence interval.

```{r}
robust_var <- sandwich(weighted_mod)[2, 2]
point_est <- coef(weighted_mod)[2]
lb <- point_est - 1.96 * sqrt(robust_var)
ub <- point_est + 1.96 * sqrt(robust_var)
lb
ub
```
We estimate that among days that have extra magic hours, the expected impact of having extra magic hours on the average posted wait time between 9 and 10am is `r round(point_est, 1)` minutes, 95% CI (`r round(lb, 1)`, `r round(ub, 1)`).
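
Equivalently, the `{lmtest}` package can combine the model coefficients with the sandwich variance in a single step; the sketch below assumes `{lmtest}` is installed.

```{r}
#| eval: false
library(lmtest)

# robust coefficient test and confidence interval using the sandwich variance
coeftest(weighted_mod, vcov. = sandwich)
coefci(weighted_mod, vcov. = sandwich)
```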

Alternatively, we could fit the model using the `{survey}` package. To do this, we need to create a design object, like we did when fitting weighted tables.

```{r}
library(survey)
des <- svydesign(
  ids = ~1,
  weights = ~w_att,
  data = seven_dwarfs_9_with_wt
)
```

Then we can use `svyglm` to fit the outcome model.

```{r}
svyglm(wait_minutes_posted_avg ~ park_extra_magic_morning, des) |>
  tidy(conf.int = TRUE)
```
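
The standard error `svyglm` reports is design-based and should be close to the sandwich-based standard error computed above; a quick comparison (a sketch) is shown below.

```{r}
#| eval: false
# compare with the sandwich-based standard error from earlier
sqrt(sandwich(weighted_mod)[2, 2])
```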

### Sandwich estimator that takes into account the propensity score model

The correct sandwich estimator will also take into account the propensity score model. The `{PSW}` package will allow us to do this. This package has some quirks; for example, it doesn't work well with categorical variables, so we need to create dummy variables for `park_ticket_season` to pass into the model. *Actually, the code below isn't working because it seems there is a bug in the package. Stay tuned!*

```{r}
#| eval: false
library(PSW)

seven_dwarfs_9 <- seven_dwarfs_9 |>
  mutate(
    park_ticket_season_regular = ifelse(park_ticket_season == "regular", 1, 0),
    park_ticket_season_value = ifelse(park_ticket_season == "value", 1, 0)
  )
psw(
  data = seven_dwarfs_9,
  form.ps = "park_extra_magic_morning ~ park_ticket_season_regular + park_ticket_season_value + park_close + park_temperature_high",
  weight = "ATT",
  wt = TRUE,
  out.var = "wait_minutes_posted_avg"
)
```
