Skip to content

Commit

Permalink
Merged origin/main into update-dag-ch
Browse files Browse the repository at this point in the history
  • Loading branch information
malcolmbarrett committed Nov 3, 2023
2 parents 06fad28 + 6b81afc commit 715dd11
Show file tree
Hide file tree
Showing 3 changed files with 214 additions and 3 deletions.
4 changes: 2 additions & 2 deletions _freeze/chapters/chapter-11/execute-results/html.json
Original file line number Diff line number Diff line change
@@ -1,7 +1,7 @@
{
"hash": "52cdc5c043504620e3c65dd4015038d2",
"hash": "2dd7673707674bd2b0a74baf76440102",
"result": {
"markdown": "# Fitting the outcome model {#sec-outcome-model}\n\n\n\n\n\n## Using matched data sets\n\n\n::: {.cell}\n\n```{.r .cell-code}\nrnorm(5)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] -0.5790 -0.5911 0.3102 0.3006 1.4746\n```\n\n\n:::\n:::\n\n\n## Using weights in outcome models\n\n## Estimating uncertainty\n",
"markdown": "# Fitting the outcome model {#sec-outcome-model}\n\n\n\n\n\n## Using matched data sets\n\nWhen fitting an outcome model on matched data sets, we can simply subset the original data to only those who were matched and then fit a model on these data as we would otherwise. For example, re-performing the matching as we did in @sec-using-ps, we can extract the matched observations in a dataset called `matched_data` as follows.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(broom)\nlibrary(touringplans)\nlibrary(MatchIt)\n\nseven_dwarfs_9 <- seven_dwarfs_train_2018 |> \n filter(wait_hour == 9) \n\nm <- matchit(\n park_extra_magic_morning ~ park_ticket_season + park_close + park_temperature_high,\n data = seven_dwarfs_9\n)\nmatched_data <- get_matches(m)\n```\n:::\n\n\nWe can then fit the outcome model on this data. For this analysis, we are interested in the impact of extra magic morning hours on the average posted wait time between 9 and 10am. The linear model below will estimate this in the matched cohort.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlm(wait_minutes_posted_avg ~ park_extra_magic_morning, data = matched_data) |>\n tidy(conf.int = TRUE)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n# A tibble: 2 × 7\n term estimate std.error statistic p.value conf.low\n <chr> <dbl> <dbl> <dbl> <dbl> <dbl>\n1 (Inte… 67.0 2.37 28.3 1.94e-54 62.3 \n2 park_… 7.87 3.35 2.35 2.04e- 2 1.24\n# ℹ 1 more variable: conf.high <dbl>\n```\n\n\n:::\n:::\n\n\nRecall that by default `{MatchIt}` estimates an average treatment effect among the treated. This means among days that have extra magic hours, the expected impact of having extra magic hours on the average posted wait time between 9 and 10am is 7.9 minutes (95% CI: 1.2-14.5). \n\n## Using weights in outcome models\n\nNow let's use propensity score weights to estimate this same estimand. We will use the ATT weights so the analysis matches that for matching above.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(propensity)\n\nseven_dwarfs_9_with_ps <-\n glm(\n park_extra_magic_morning ~ park_ticket_season + park_close + park_temperature_high,\n data = seven_dwarfs_9,\n family = binomial()\n ) |>\n augment(type.predict = \"response\", data = seven_dwarfs_9)\nseven_dwarfs_9_with_wt <- seven_dwarfs_9_with_ps |>\n mutate(w_att = wt_att(.fitted, park_extra_magic_morning))\n```\n:::\n\n\nWe can fit a *weighted* outcome model by using the `weights` argument. \n\n\n::: {.cell}\n\n```{.r .cell-code}\nlm(wait_minutes_posted_avg ~ park_extra_magic_morning, \n data = seven_dwarfs_9_with_wt,\n weights = w_att) |>\n tidy()\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n# A tibble: 2 × 5\n term estimate std.error statistic p.value\n <chr> <dbl> <dbl> <dbl> <dbl>\n1 (Intercept) 68.7 1.45 47.3 1.69e-154\n2 park_extra_ma… 6.23 2.05 3.03 2.62e- 3\n```\n\n\n:::\n:::\n\n\nUsing weighting, we estimate that among days that have extra magic hours, the expected impact of having extra magic hours on the average posted wait time between 9 and 10am is 6.2 minutes. \nWhile this approach will get us the desired estimate for the point estimate, the default output using the `lm` function for the uncertainty (the standard errors and confidence intervals) are not correct.\n\n\n## Estimating uncertainty\n\nThere are three ways to estimate the uncertainty:\n\n1. A bootstrap\n2. A sandwich estimator that only takes into account the outcome model\n3. A sandwich estimator that takes into account the propensity score model\n\nThe first option can be computationally intensive, but should get you the correct estimates.\nThe second option is computationally the easiest, but tends to overestimate the variability. There are not many current solutions in R (aside from coding it up yourself) for the third; however, the `{PSW}` package will do this. \n\n### The bootstrap\n\n1. Create a function to run your analysis once on a sample of your data\n\n\n::: {.cell}\n\n```{.r .cell-code}\nfit_ipw <- function(split, ...) {\n .df <- analysis(split)\n \n # fit propensity score model\n propensity_model <- glm(\n park_extra_magic_morning ~ park_ticket_season + park_close + park_temperature_high,\n data = seven_dwarfs_9,\n family = binomial()\n ) \n \n # calculate inverse probability weights\n .df <- propensity_model |>\n augment(type.predict = \"response\", data = .df) |>\n mutate(wts = wt_att(.fitted, \n park_extra_magic_morning,\n exposure_type = \"binary\"))\n \n # fit correctly bootstrapped ipw model\n lm(wait_minutes_posted_avg ~ park_extra_magic_morning, \n data = .df, weights = wts) |>\n tidy()\n}\n```\n:::\n\n\n\n2. Use {rsample} to bootstrap our causal effect\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(rsample)\n\n# fit ipw model to bootstrapped samples\nbootstrapped_seven_dwarfs <- bootstraps(\n seven_dwarfs_9, \n times = 1000, \n apparent = TRUE\n)\n\nipw_results <- bootstrapped_seven_dwarfs |> \n mutate(boot_fits = map(splits, fit_ipw)) \n\nipw_results\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n# Bootstrap sampling with apparent sample \n# A tibble: 1,001 × 3\n splits id boot_fits \n <list> <chr> <list> \n 1 <split [354/124]> Bootstrap0001 <tibble [2 × 5]>\n 2 <split [354/134]> Bootstrap0002 <tibble [2 × 5]>\n 3 <split [354/126]> Bootstrap0003 <tibble [2 × 5]>\n 4 <split [354/127]> Bootstrap0004 <tibble [2 × 5]>\n 5 <split [354/125]> Bootstrap0005 <tibble [2 × 5]>\n 6 <split [354/125]> Bootstrap0006 <tibble [2 × 5]>\n 7 <split [354/143]> Bootstrap0007 <tibble [2 × 5]>\n 8 <split [354/123]> Bootstrap0008 <tibble [2 × 5]>\n 9 <split [354/134]> Bootstrap0009 <tibble [2 × 5]>\n10 <split [354/134]> Bootstrap0010 <tibble [2 × 5]>\n# ℹ 991 more rows\n```\n\n\n:::\n:::\n\n\n\nLet's look at the results.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nipw_results |>\n mutate(\n estimate = map_dbl(\n boot_fits,\n \\(.fit) .fit |>\n filter(term == \"park_extra_magic_morning\") |>\n pull(estimate)\n )\n ) |>\n ggplot(aes(estimate)) +\n geom_histogram(bins = 30, fill = \"#D55E00FF\", color = \"white\", alpha = 0.8) + \n theme_minimal()\n```\n\n::: {.cell-output-display}\n![](chapter-11_files/figure-html/unnamed-chunk-8-1.png){width=672}\n:::\n:::\n\n\n\n3. Pull out the causal effect\n\n\n::: {.cell}\n\n```{.r .cell-code}\n# get t-based CIs\nboot_estimate <- int_t(ipw_results, boot_fits) |> \n filter(term == \"park_extra_magic_morning\")\nboot_estimate\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n# A tibble: 1 × 6\n term .lower .estimate .upper .alpha .method\n <chr> <dbl> <dbl> <dbl> <dbl> <chr> \n1 park_extra_ma… -0.368 7.10 11.6 0.05 studen…\n```\n\n\n:::\n:::\n\n\nWe estimate that among days that have extra magic hours, the expected impact of having extra magic hours on the average posted wait time between 9 and 10am is 7.1 minutes, 95% CI (-0.4, 11.6). \n\n### The outcome model sandwich\n\nThere are two ways to get the sandwich estimator. The first is to use the same weighted outcome model as above along with the `{sandwich}` package. Using the `sandwich` function, we can get the robust estimate for the variance for the parameter of interest, as shown below.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(sandwich)\nweighted_mod <- lm(wait_minutes_posted_avg ~ park_extra_magic_morning, \n data = seven_dwarfs_9_with_wt,\n weights = w_att) \n\nsandwich(weighted_mod)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n (Intercept)\n(Intercept) 1.488\npark_extra_magic_morning -1.488\n park_extra_magic_morning\n(Intercept) -1.488\npark_extra_magic_morning 8.727\n```\n\n\n:::\n:::\n\n\nHere, our robust variance estimate is 8.727. We can then use this to construct a robust confidence interval.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nrobust_var <- sandwich(weighted_mod)[2, 2]\npoint_est <- coef(weighted_mod)[2]\nlb <- point_est - 1.96 * sqrt(robust_var)\nub <- point_est + 1.96 * sqrt(robust_var)\nlb\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\npark_extra_magic_morning \n 0.4383 \n```\n\n\n:::\n\n```{.r .cell-code}\nub\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\npark_extra_magic_morning \n 12.02 \n```\n\n\n:::\n:::\n\nWe estimate that among days that have extra magic hours, the expected impact of having extra magic hours on the average posted wait time between 9 and 10am is 6.2 minutes, 95% CI (0.4, 12). \n\nAlternatively, we could fit the model using the `{survey}` package. To do this, we need to create a design object, like we did when fitting weighted tables.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(survey)\n\ndes <- svydesign(\n ids = ~1,\n weights = ~w_att,\n data = seven_dwarfs_9_with_wt\n)\n```\n:::\n\n\nThen we can use `svyglm` to fit the outcome model.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nsvyglm(wait_minutes_posted_avg ~ park_extra_magic_morning, des) |>\n tidy(conf.int = TRUE)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n# A tibble: 2 × 7\n term estimate std.error statistic p.value conf.low\n <chr> <dbl> <dbl> <dbl> <dbl> <dbl>\n1 (Int… 68.7 1.22 56.2 6.14e-178 66.3 \n2 park… 6.23 2.96 2.11 3.60e- 2 0.410\n# ℹ 1 more variable: conf.high <dbl>\n```\n\n\n:::\n:::\n\n### Sandwich estimator that takes into account the propensity score model\n\nThe correct sandwich estimator will also take into account the propensity score model. The `{PSW}` will allow us to do this. This package has some quirks, for example it doesn't work well with categorical variables, so we need to create dummy variables for `park_ticket_season` to pass into the model. *Actually, the code below isn't working because it seems there is a bug in the package. Stay tuned!*\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(PSW)\n\nseven_dwarfs_9 <- seven_dwarfs_9 |>\n mutate(park_ticket_season_regular = ifelse(park_ticket_season == \"regular\", 1, 0),\n park_ticket_season_value = ifelse(park_ticket_season == \"value\", 1, 0)\n )\npsw(data = seven_dwarfs_9, \n form.ps = \"park_extra_magic_morning ~ park_ticket_season_regular + park_ticket_season_value + park_close + park_temperature_high\",\n weight = \"ATT\",\n wt = TRUE,\n out.var = \"wait_minutes_posted_avg\")\n```\n:::\n",
"supporting": [
"chapter-11_files"
],
Expand Down
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
213 changes: 212 additions & 1 deletion chapters/chapter-11.qmd
Original file line number Diff line number Diff line change
Expand Up @@ -4,10 +4,221 @@

## Using matched data sets

When fitting an outcome model on matched data sets, we can simply subset the original data to only those who were matched and then fit a model on these data as we would otherwise. For example, re-performing the matching as we did in @sec-using-ps, we can extract the matched observations in a dataset called `matched_data` as follows.

```{r}
rnorm(5)
#| message: false
#| warning: false
library(broom)
library(touringplans)
library(MatchIt)
seven_dwarfs_9 <- seven_dwarfs_train_2018 |>
filter(wait_hour == 9)
m <- matchit(
park_extra_magic_morning ~ park_ticket_season + park_close + park_temperature_high,
data = seven_dwarfs_9
)
matched_data <- get_matches(m)
```

We can then fit the outcome model on this data. For this analysis, we are interested in the impact of extra magic morning hours on the average posted wait time between 9 and 10am. The linear model below will estimate this in the matched cohort.

```{r}
lm(wait_minutes_posted_avg ~ park_extra_magic_morning, data = matched_data) |>
tidy(conf.int = TRUE)
```

Recall that by default `{MatchIt}` estimates an average treatment effect among the treated. This means among days that have extra magic hours, the expected impact of having extra magic hours on the average posted wait time between 9 and 10am is 7.9 minutes (95% CI: 1.2-14.5).

## Using weights in outcome models

Now let's use propensity score weights to estimate this same estimand. We will use the ATT weights so the analysis matches that for matching above.

```{r}
#| message: false
#| warning: false
library(propensity)
seven_dwarfs_9_with_ps <-
glm(
park_extra_magic_morning ~ park_ticket_season + park_close + park_temperature_high,
data = seven_dwarfs_9,
family = binomial()
) |>
augment(type.predict = "response", data = seven_dwarfs_9)
seven_dwarfs_9_with_wt <- seven_dwarfs_9_with_ps |>
mutate(w_att = wt_att(.fitted, park_extra_magic_morning))
```

We can fit a *weighted* outcome model by using the `weights` argument.

```{r}
lm(wait_minutes_posted_avg ~ park_extra_magic_morning,
data = seven_dwarfs_9_with_wt,
weights = w_att) |>
tidy()
```

Using weighting, we estimate that among days that have extra magic hours, the expected impact of having extra magic hours on the average posted wait time between 9 and 10am is 6.2 minutes.
While this approach will get us the desired estimate for the point estimate, the default output using the `lm` function for the uncertainty (the standard errors and confidence intervals) are not correct.


## Estimating uncertainty

There are three ways to estimate the uncertainty:

1. A bootstrap
2. A sandwich estimator that only takes into account the outcome model
3. A sandwich estimator that takes into account the propensity score model

The first option can be computationally intensive, but should get you the correct estimates.
The second option is computationally the easiest, but tends to overestimate the variability. There are not many current solutions in R (aside from coding it up yourself) for the third; however, the `{PSW}` package will do this.

### The bootstrap

1. Create a function to run your analysis once on a sample of your data

```{r}
fit_ipw <- function(split, ...) {
.df <- analysis(split)
# fit propensity score model
propensity_model <- glm(
park_extra_magic_morning ~ park_ticket_season + park_close + park_temperature_high,
data = seven_dwarfs_9,
family = binomial()
)
# calculate inverse probability weights
.df <- propensity_model |>
augment(type.predict = "response", data = .df) |>
mutate(wts = wt_att(.fitted,
park_extra_magic_morning,
exposure_type = "binary"))
# fit correctly bootstrapped ipw model
lm(wait_minutes_posted_avg ~ park_extra_magic_morning,
data = .df, weights = wts) |>
tidy()
}
```


2. Use {rsample} to bootstrap our causal effect

```{r}
#| message: false
#| warning: false
library(rsample)
# fit ipw model to bootstrapped samples
bootstrapped_seven_dwarfs <- bootstraps(
seven_dwarfs_9,
times = 1000,
apparent = TRUE
)
ipw_results <- bootstrapped_seven_dwarfs |>
mutate(boot_fits = map(splits, fit_ipw))
ipw_results
```


Let's look at the results.

```{r}
ipw_results |>
mutate(
estimate = map_dbl(
boot_fits,
\(.fit) .fit |>
filter(term == "park_extra_magic_morning") |>
pull(estimate)
)
) |>
ggplot(aes(estimate)) +
geom_histogram(bins = 30, fill = "#D55E00FF", color = "white", alpha = 0.8) +
theme_minimal()
```


3. Pull out the causal effect

```{r}
# get t-based CIs
boot_estimate <- int_t(ipw_results, boot_fits) |>
filter(term == "park_extra_magic_morning")
boot_estimate
```

We estimate that among days that have extra magic hours, the expected impact of having extra magic hours on the average posted wait time between 9 and 10am is `r round(boot_estimate$.estimate, 1)` minutes, 95% CI (`r round(boot_estimate$.lower, 1)`, `r round(boot_estimate$.upper, 1)`).

### The outcome model sandwich

There are two ways to get the sandwich estimator. The first is to use the same weighted outcome model as above along with the `{sandwich}` package. Using the `sandwich` function, we can get the robust estimate for the variance for the parameter of interest, as shown below.

```{r}
#| message: false
#| warning: false
library(sandwich)
weighted_mod <- lm(wait_minutes_posted_avg ~ park_extra_magic_morning,
data = seven_dwarfs_9_with_wt,
weights = w_att)
sandwich(weighted_mod)
```

Here, our robust variance estimate is `r round(sandwich(weighted_mod)[2,2], 3)`. We can then use this to construct a robust confidence interval.

```{r}
robust_var <- sandwich(weighted_mod)[2, 2]
point_est <- coef(weighted_mod)[2]
lb <- point_est - 1.96 * sqrt(robust_var)
ub <- point_est + 1.96 * sqrt(robust_var)
lb
ub
```
We estimate that among days that have extra magic hours, the expected impact of having extra magic hours on the average posted wait time between 9 and 10am is `r round(point_est, 1)` minutes, 95% CI (`r round(lb, 1)`, `r round(ub, 1)`).

Alternatively, we could fit the model using the `{survey}` package. To do this, we need to create a design object, like we did when fitting weighted tables.

```{r}
#| message: false
#| warning: false
library(survey)
des <- svydesign(
ids = ~1,
weights = ~w_att,
data = seven_dwarfs_9_with_wt
)
```

Then we can use `svyglm` to fit the outcome model.

```{r}
svyglm(wait_minutes_posted_avg ~ park_extra_magic_morning, des) |>
tidy(conf.int = TRUE)
```
### Sandwich estimator that takes into account the propensity score model

The correct sandwich estimator will also take into account the propensity score model. The `{PSW}` will allow us to do this. This package has some quirks, for example it doesn't work well with categorical variables, so we need to create dummy variables for `park_ticket_season` to pass into the model. *Actually, the code below isn't working because it seems there is a bug in the package. Stay tuned!*

```{r}
#| eval: false
library(PSW)
seven_dwarfs_9 <- seven_dwarfs_9 |>
mutate(park_ticket_season_regular = ifelse(park_ticket_season == "regular", 1, 0),
park_ticket_season_value = ifelse(park_ticket_season == "value", 1, 0)
)
psw(data = seven_dwarfs_9,
form.ps = "park_extra_magic_morning ~ park_ticket_season_regular + park_ticket_season_value + park_close + park_temperature_high",
weight = "ATT",
wt = TRUE,
out.var = "wait_minutes_posted_avg")
```

0 comments on commit 715dd11

Please sign in to comment.