diff --git a/_freeze/ancova/execute-results/html.json b/_freeze/ancova/execute-results/html.json
index e8e5ee1..43e72fc 100644
--- a/_freeze/ancova/execute-results/html.json
+++ b/_freeze/ancova/execute-results/html.json
@@ -1,7 +1,7 @@
{
- "hash": "39e30dfec49e66a4ffc46ebb3db688c2",
+ "hash": "1a46e4f6302227d07b496793324ee69b",
"result": {
- "markdown": "---\ntitle: \"Analysis of Covariance\"\n---\n\n\n\n\n## Analysis of covariance\n\n\n* ANOVA: explanatory variables categorical (divide data into groups)\n\n* traditionally, analysis of covariance has categorical $x$'s plus one numerical $x$ (\"covariate\") to be adjusted for.\n\n* `lm` handles this too.\n\n* Simple example: two treatments (drugs) (`a` and `b`), with before and after scores. \n\n\n* Does knowing before score and/or treatment help to predict after score?\n\n* Is after score different by treatment/before score?\n\n\n\n\n\n## Data\n\nTreatment, before, after: \n\n\n\\scriptsize\n```\na 5 20\na 10 23\na 12 30\na 9 25\na 23 34\na 21 40\na 14 27\na 18 38\na 6 24\na 13 31\nb 7 19\nb 12 26\nb 27 33\nb 24 35\nb 18 30\nb 22 31\nb 26 34\nb 21 28\nb 14 23\nb 9 22\n```\n\\normalsize\n\n\n## Packages\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(tidyverse)\nlibrary(broom)\nlibrary(marginaleffects)\n```\n:::\n\n\nthe last of these for predictions.\n\n## Read in data\n\n\n::: {.cell}\n\n```{.r .cell-code}\nurl <- \"http://ritsokiguess.site/datafiles/ancova.txt\"\nprepost <- read_delim(url, \" \")\nprepost\n```\n\n::: {.cell-output-display}\n`````{=html}\n
\n \n
\n`````\n:::\n:::\n\n\n\n\n## Making a plot\n\n::: {.cell}\n\n```{.r .cell-code}\nggplot(prepost, aes(x = before, y = after, colour = drug)) +\n geom_point()\n```\n\n::: {.cell-output-display}\n![](ancova_files/figure-revealjs/ancova-plot-1.png){width=960}\n:::\n:::\n\n\n\n\n## Comments \n\n* As before score goes up, after score goes up.\n\n* Red points (drug A) generally above blue points (drug B), for\ncomparable before score.\n\n* Suggests before score effect *and* drug effect.\n\n\n## The means\n\n::: {.cell}\n\n```{.r .cell-code}\nprepost %>%\n group_by(drug) %>%\n summarize(\n before_mean = mean(before),\n after_mean = mean(after)\n )\n```\n\n::: {.cell-output-display}\n`````{=html}\n\n \n
\n`````\n:::\n:::\n\n \n\n\n* Mean \"after\" score slightly higher for treatment A.\n\n* Mean \"before\" score much higher for treatment B.\n\n* Greater *improvement* on treatment A. \n\n\n\n## Testing for interaction\n\n::: {.cell}\n\n```{.r .cell-code}\nprepost.1 <- lm(after ~ before * drug, data = prepost)\nanova(prepost.1)\n```\n\n::: {.cell-output-display}\n`````{=html}\n\n \n
\n`````\n:::\n:::\n\n \n\n\n* Interaction not significant. Will remove later.\n\n## Predictions\n\nSet up values to predict for:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nsummary(prepost)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n drug before after \n Length:20 Min. : 5.00 Min. :19.00 \n Class :character 1st Qu.: 9.75 1st Qu.:23.75 \n Mode :character Median :14.00 Median :29.00 \n Mean :15.55 Mean :28.65 \n 3rd Qu.:21.25 3rd Qu.:33.25 \n Max. :27.00 Max. :40.00 \n```\n:::\n:::\n\n::: {.cell}\n\n```{.r .cell-code}\nnew <- datagrid(before = c(9.75, 14, 21.25), \n drug = c(\"a\", \"b\"), model = prepost.1)\n```\n:::\n\n\n## and then\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\ncbind(predictions(prepost.1, newdata = new)) %>% \n select(drug, before, estimate)\n```\n\n::: {.cell-output-display}\n`````{=html}\n\n \n
\n`````\n:::\n:::\n\n\\normalsize\n\n## Predictions (with interaction included), plotted\n\n\n::: {.cell}\n\n```{.r .cell-code}\nplot_cap(model = prepost.1, condition = c(\"before\", \"drug\"))\n```\n\n::: {.cell-output-display}\n![](ancova_files/figure-revealjs/unnamed-chunk-4-1.png){width=960}\n:::\n:::\n\n\nLines almost parallel, but not quite.\n\n\n\n## Taking out interaction\n\\small\n\n::: {.cell}\n\n```{.r .cell-code}\nprepost.2 <- update(prepost.1, . ~ . - before:drug)\nanova(prepost.2)\n```\n\n::: {.cell-output-display}\n`````{=html}\n\n \n
\n`````\n:::\n:::\n\n \n\\normalsize\n\n\n* Take out non-significant interaction.\n\n* `before` and `drug` strongly significant.\n\n* Do predictions again and plot them.\n\n## Predictions\n\n\n::: {.cell}\n\n```{.r .cell-code}\ncbind(predictions(prepost.2, newdata = new)) %>% \n select(drug, before, estimate)\n```\n\n::: {.cell-output-display}\n`````{=html}\n\n \n
\n`````\n:::\n:::\n\n\n\n## Plot of predicted values\n\n\n::: {.cell}\n\n```{.r .cell-code}\nplot_cap(prepost.2, condition = c(\"before\", \"drug\"))\n```\n\n::: {.cell-output-display}\n![](ancova_files/figure-revealjs/unnamed-chunk-6-1.png){width=960}\n:::\n:::\n\n\nThis time the lines are *exactly* parallel. No-interaction model forces them\nto have the same slope. \n\n\n\n## Different look at model output\n\n\n* `anova(prepost.2)` tests for significant effect of\nbefore score and of drug, but doesn't help with interpretation.\n\n* `summary(prepost.2)` views as regression with slopes:\n\n\\scriptsize\n\n::: {.cell}\n\n```{.r .cell-code}\nsummary(prepost.2)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n\nCall:\nlm(formula = after ~ before + drug, data = prepost)\n\nResiduals:\n Min 1Q Median 3Q Max \n-3.6348 -2.5099 -0.2038 1.8871 4.7453 \n\nCoefficients:\n Estimate Std. Error t value Pr(>|t|) \n(Intercept) 18.3600 1.5115 12.147 8.35e-10 ***\nbefore 0.8275 0.0955 8.665 1.21e-07 ***\ndrugb -5.1547 1.2876 -4.003 0.000921 ***\n---\nSignif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1\n\nResidual standard error: 2.682 on 17 degrees of freedom\nMultiple R-squared: 0.817,\tAdjusted R-squared: 0.7955 \nF-statistic: 37.96 on 2 and 17 DF, p-value: 5.372e-07\n```\n:::\n:::\n\n\n\\normalsize \n\n\n\n\n## Understanding those slopes\n\n\\footnotesize\n\n::: {.cell}\n\n```{.r .cell-code}\ntidy(prepost.2)\n```\n\n::: {.cell-output-display}\n`````{=html}\n\n \n
\n`````\n:::\n:::\n\n\n\\normalsize\n\n\n\n* `before` ordinary numerical variable; `drug`\ncategorical. \n\n* `lm` uses first category `druga` as baseline.\n\n* Intercept is prediction of after score for before score 0 and\n*drug A*.\n\n* `before` slope is predicted change in after score when\nbefore score increases by 1 (usual slope)\n\n* Slope for `drugb` is *change* in predicted after\nscore for being on drug B rather than drug A. Same for *any*\nbefore score (no interaction).\n\n\n\n## Summary\n\n\n* ANCOVA model: fits different regression line for each group,\npredicting response from covariate.\n\n* ANCOVA model with interaction between factor and covariate\nallows different slopes for each line.\n\n* Sometimes those lines can cross over!\n\n* If interaction not significant, take out. Lines then parallel.\n\n* With parallel lines, groups have consistent effect regardless\nof value of covariate.\n\n",
+ "markdown": "---\ntitle: \"Analysis of Covariance\"\n---\n\n\n\n\n## Analysis of covariance\n\n\n* ANOVA: explanatory variables categorical (divide data into groups)\n\n* traditionally, analysis of covariance has categorical $x$'s plus one numerical $x$ (\"covariate\") to be adjusted for.\n\n* `lm` handles this too.\n\n* Simple example: two treatments (drugs) (`a` and `b`), with before and after scores. \n\n\n* Does knowing before score and/or treatment help to predict after score?\n\n* Is after score different by treatment/before score?\n\n\n\n\n\n## Data\n\nTreatment, before, after: \n\n\n\\scriptsize\n```\na 5 20\na 10 23\na 12 30\na 9 25\na 23 34\na 21 40\na 14 27\na 18 38\na 6 24\na 13 31\nb 7 19\nb 12 26\nb 27 33\nb 24 35\nb 18 30\nb 22 31\nb 26 34\nb 21 28\nb 14 23\nb 9 22\n```\n\\normalsize\n\n\n## Packages\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(tidyverse)\nlibrary(broom)\nlibrary(marginaleffects)\n```\n:::\n\n\nthe last of these for predictions.\n\n## Read in data\n\n\n::: {.cell}\n\n```{.r .cell-code}\nurl <- \"http://ritsokiguess.site/datafiles/ancova.txt\"\nprepost <- read_delim(url, \" \")\nprepost\n```\n\n::: {.cell-output-display}\n`````{=html}\n\n \n
\n`````\n:::\n:::\n\n\n\n\n## Making a plot\n\n::: {.cell}\n\n```{.r .cell-code}\nggplot(prepost, aes(x = before, y = after, colour = drug)) +\n geom_point()\n```\n\n::: {.cell-output-display}\n![](ancova_files/figure-revealjs/ancova-plot-1.png){width=960}\n:::\n:::\n\n\n\n\n## Comments \n\n* As before score goes up, after score goes up.\n\n* Red points (drug A) generally above blue points (drug B), for\ncomparable before score.\n\n* Suggests before score effect *and* drug effect.\n\n\n## The means\n\n::: {.cell}\n\n```{.r .cell-code}\nprepost %>%\n group_by(drug) %>%\n summarize(\n before_mean = mean(before),\n after_mean = mean(after)\n )\n```\n\n::: {.cell-output-display}\n`````{=html}\n\n \n
\n`````\n:::\n:::\n\n \n\n\n* Mean \"after\" score slightly higher for treatment A.\n\n* Mean \"before\" score much higher for treatment B.\n\n* Greater *improvement* on treatment A. \n\n\n\n## Testing for interaction\n\n::: {.cell}\n\n```{.r .cell-code}\nprepost.1 <- lm(after ~ before * drug, data = prepost)\nanova(prepost.1)\n```\n\n::: {.cell-output-display}\n`````{=html}\n\n \n
\n`````\n:::\n:::\n\n \n\n\n* Interaction not significant. Will remove later.\n\n## Predictions\n\nSet up values to predict for:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nsummary(prepost)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n drug before after \n Length:20 Min. : 5.00 Min. :19.00 \n Class :character 1st Qu.: 9.75 1st Qu.:23.75 \n Mode :character Median :14.00 Median :29.00 \n Mean :15.55 Mean :28.65 \n 3rd Qu.:21.25 3rd Qu.:33.25 \n Max. :27.00 Max. :40.00 \n```\n:::\n:::\n\n::: {.cell}\n\n```{.r .cell-code}\nnew <- datagrid(before = c(9.75, 14, 21.25), \n drug = c(\"a\", \"b\"), model = prepost.1)\n```\n:::\n\n\n## and then\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\ncbind(predictions(prepost.1, newdata = new)) %>% \n select(drug, before, estimate)\n```\n\n::: {.cell-output-display}\n`````{=html}\n\n \n
\n`````\n:::\n:::\n\n\\normalsize\n\n## Predictions (with interaction included), plotted\n\n\n::: {.cell}\n\n```{.r .cell-code}\nplot_predictions(model = prepost.1, condition = c(\"before\", \"drug\"))\n```\n\n::: {.cell-output-display}\n![](ancova_files/figure-revealjs/unnamed-chunk-4-1.png){width=960}\n:::\n:::\n\n\nLines almost parallel, but not quite.\n\n\n\n## Taking out interaction\n\\small\n\n::: {.cell}\n\n```{.r .cell-code}\nprepost.2 <- update(prepost.1, . ~ . - before:drug)\nanova(prepost.2)\n```\n\n::: {.cell-output-display}\n`````{=html}\n\n \n
\n`````\n:::\n:::\n\n \n\\normalsize\n\n\n* Take out non-significant interaction.\n\n* `before` and `drug` strongly significant.\n\n* Do predictions again and plot them.\n\n## Predictions\n\n\n::: {.cell}\n\n```{.r .cell-code}\ncbind(predictions(prepost.2, newdata = new)) %>% \n select(drug, before, estimate)\n```\n\n::: {.cell-output-display}\n`````{=html}\n\n \n
\n`````\n:::\n:::\n\n\n\n## Plot of predicted values\n\n\n::: {.cell}\n\n```{.r .cell-code}\nplot_predictions(prepost.2, condition = c(\"before\", \"drug\"))\n```\n\n::: {.cell-output-display}\n![](ancova_files/figure-revealjs/unnamed-chunk-6-1.png){width=960}\n:::\n:::\n\n\nThis time the lines are *exactly* parallel. No-interaction model forces them\nto have the same slope. \n\n\n\n## Different look at model output\n\n\n* `anova(prepost.2)` tests for significant effect of\nbefore score and of drug, but doesn't help with interpretation.\n\n* `summary(prepost.2)` views as regression with slopes:\n\n\\scriptsize\n\n::: {.cell}\n\n```{.r .cell-code}\nsummary(prepost.2)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n\nCall:\nlm(formula = after ~ before + drug, data = prepost)\n\nResiduals:\n Min 1Q Median 3Q Max \n-3.6348 -2.5099 -0.2038 1.8871 4.7453 \n\nCoefficients:\n Estimate Std. Error t value Pr(>|t|) \n(Intercept) 18.3600 1.5115 12.147 8.35e-10 ***\nbefore 0.8275 0.0955 8.665 1.21e-07 ***\ndrugb -5.1547 1.2876 -4.003 0.000921 ***\n---\nSignif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1\n\nResidual standard error: 2.682 on 17 degrees of freedom\nMultiple R-squared: 0.817,\tAdjusted R-squared: 0.7955 \nF-statistic: 37.96 on 2 and 17 DF, p-value: 5.372e-07\n```\n:::\n:::\n\n\n\\normalsize \n\n\n\n\n## Understanding those slopes\n\n\\footnotesize\n\n::: {.cell}\n\n```{.r .cell-code}\ntidy(prepost.2)\n```\n\n::: {.cell-output-display}\n`````{=html}\n\n \n
\n`````\n:::\n:::\n\n\n\\normalsize\n\n\n\n* `before` ordinary numerical variable; `drug`\ncategorical. \n\n* `lm` uses first category `druga` as baseline.\n\n* Intercept is prediction of after score for before score 0 and\n*drug A*.\n\n* `before` slope is predicted change in after score when\nbefore score increases by 1 (usual slope)\n\n* Slope for `drugb` is *change* in predicted after\nscore for being on drug B rather than drug A. Same for *any*\nbefore score (no interaction).\n\n\n\n## Summary\n\n\n* ANCOVA model: fits different regression line for each group,\npredicting response from covariate.\n\n* ANCOVA model with interaction between factor and covariate\nallows different slopes for each line.\n\n* Sometimes those lines can cross over!\n\n* If interaction not significant, take out. Lines then parallel.\n\n* With parallel lines, groups have consistent effect regardless\nof value of covariate.\n\n",
"supporting": [
"ancova_files/figure-revealjs"
],
diff --git a/_freeze/ancova/execute-results/tex.json b/_freeze/ancova/execute-results/tex.json
index 89a1cf8..e0e3a78 100644
--- a/_freeze/ancova/execute-results/tex.json
+++ b/_freeze/ancova/execute-results/tex.json
@@ -1,7 +1,7 @@
{
- "hash": "39e30dfec49e66a4ffc46ebb3db688c2",
+ "hash": "1a46e4f6302227d07b496793324ee69b",
"result": {
- "markdown": "---\ntitle: \"Analysis of Covariance\"\n---\n\n\n\n\n\n## Analysis of covariance\n\n\n* ANOVA: explanatory variables categorical (divide data into groups)\n\n* traditionally, analysis of covariance has categorical $x$'s plus one numerical $x$ (\"covariate\") to be adjusted for.\n\n* `lm` handles this too.\n\n* Simple example: two treatments (drugs) (`a` and `b`), with before and after scores. \n\n\n* Does knowing before score and/or treatment help to predict after score?\n\n* Is after score different by treatment/before score?\n\n\n\n\n\n## Data\n\nTreatment, before, after: \n\n\n\\scriptsize\n```\na 5 20\na 10 23\na 12 30\na 9 25\na 23 34\na 21 40\na 14 27\na 18 38\na 6 24\na 13 31\nb 7 19\nb 12 26\nb 27 33\nb 24 35\nb 18 30\nb 22 31\nb 26 34\nb 21 28\nb 14 23\nb 9 22\n```\n\\normalsize\n\n\n## Packages\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(tidyverse)\nlibrary(broom)\nlibrary(marginaleffects)\n```\n:::\n\n\n\nthe last of these for predictions.\n\n## Read in data\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nurl <- \"http://ritsokiguess.site/datafiles/ancova.txt\"\nprepost <- read_delim(url, \" \")\nprepost\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n# A tibble: 20 x 3\n drug before after\n \n 1 a 5 20\n 2 a 10 23\n 3 a 12 30\n 4 a 9 25\n 5 a 23 34\n 6 a 21 40\n 7 a 14 27\n 8 a 18 38\n 9 a 6 24\n10 a 13 31\n11 b 7 19\n12 b 12 26\n13 b 27 33\n14 b 24 35\n15 b 18 30\n16 b 22 31\n17 b 26 34\n18 b 21 28\n19 b 14 23\n20 b 9 22\n```\n:::\n:::\n\n\n\n\n\n## Making a plot\n\n\n::: {.cell}\n\n```{.r .cell-code}\nggplot(prepost, aes(x = before, y = after, colour = drug)) +\n geom_point()\n```\n\n::: {.cell-output-display}\n![](ancova_files/figure-beamer/ancova-plot-1.pdf)\n:::\n:::\n\n\n\n\n\n## Comments \n\n* As before score goes up, after score goes up.\n\n* Red points (drug A) generally above blue points (drug B), for\ncomparable before score.\n\n* Suggests before score effect *and* drug effect.\n\n\n## The means\n\n\n::: {.cell}\n\n```{.r 
.cell-code}\nprepost %>%\n group_by(drug) %>%\n summarize(\n before_mean = mean(before),\n after_mean = mean(after)\n )\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n# A tibble: 2 x 3\n drug before_mean after_mean\n \n1 a 13.1 29.2\n2 b 18 28.1\n```\n:::\n:::\n\n\n \n\n\n* Mean \"after\" score slightly higher for treatment A.\n\n* Mean \"before\" score much higher for treatment B.\n\n* Greater *improvement* on treatment A. \n\n\n\n## Testing for interaction\n\n\n::: {.cell}\n\n```{.r .cell-code}\nprepost.1 <- lm(after ~ before * drug, data = prepost)\nanova(prepost.1)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\nAnalysis of Variance Table\n\nResponse: after\n Df Sum Sq Mean Sq F value Pr(>F) \nbefore 1 430.92 430.92 62.6894 6.34e-07 ***\ndrug 1 115.31 115.31 16.7743 0.0008442 ***\nbefore:drug 1 12.34 12.34 1.7948 0.1990662 \nResiduals 16 109.98 6.87 \n---\nSignif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1\n```\n:::\n:::\n\n\n \n\n\n* Interaction not significant. Will remove later.\n\n## Predictions\n\nSet up values to predict for:\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nsummary(prepost)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n drug before after \n Length:20 Min. : 5.00 Min. :19.00 \n Class :character 1st Qu.: 9.75 1st Qu.:23.75 \n Mode :character Median :14.00 Median :29.00 \n Mean :15.55 Mean :28.65 \n 3rd Qu.:21.25 3rd Qu.:33.25 \n Max. :27.00 Max. 
:40.00 \n```\n:::\n:::\n\n::: {.cell}\n\n```{.r .cell-code}\nnew <- datagrid(before = c(9.75, 14, 21.25), \n drug = c(\"a\", \"b\"), model = prepost.1)\n```\n:::\n\n\n\n## and then\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\ncbind(predictions(prepost.1, newdata = new)) %>% \n select(drug, before, estimate)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n drug before estimate\n1 a 9.75 25.93250\n2 b 9.75 22.14565\n3 a 14.00 30.07784\n4 b 14.00 25.21304\n5 a 21.25 37.14929\n6 b 21.25 30.44565\n```\n:::\n:::\n\n\n\\normalsize\n\n## Predictions (with interaction included), plotted\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nplot_cap(model = prepost.1, condition = c(\"before\", \"drug\"))\n```\n\n::: {.cell-output-display}\n![](ancova_files/figure-beamer/unnamed-chunk-4-1.pdf)\n:::\n:::\n\n\n\nLines almost parallel, but not quite.\n\n\n\n## Taking out interaction\n\\small\n\n\n::: {.cell}\n\n```{.r .cell-code}\nprepost.2 <- update(prepost.1, . ~ . - before:drug)\nanova(prepost.2)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\nAnalysis of Variance Table\n\nResponse: after\n Df Sum Sq Mean Sq F value Pr(>F) \nbefore 1 430.92 430.92 59.890 5.718e-07 ***\ndrug 1 115.31 115.31 16.025 0.0009209 ***\nResiduals 17 122.32 7.20 \n---\nSignif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 
0.1 ' ' 1\n```\n:::\n:::\n\n\n \n\\normalsize\n\n\n* Take out non-significant interaction.\n\n* `before` and `drug` strongly significant.\n\n* Do predictions again and plot them.\n\n## Predictions\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\ncbind(predictions(prepost.2, newdata = new)) %>% \n select(drug, before, estimate)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n drug before estimate\n1 a 9.75 26.42794\n2 b 9.75 21.27328\n3 a 14.00 29.94473\n4 b 14.00 24.79007\n5 a 21.25 35.94397\n6 b 21.25 30.78931\n```\n:::\n:::\n\n\n\n\n## Plot of predicted values\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nplot_cap(prepost.2, condition = c(\"before\", \"drug\"))\n```\n\n::: {.cell-output-display}\n![](ancova_files/figure-beamer/unnamed-chunk-6-1.pdf)\n:::\n:::\n\n\n\nThis time the lines are *exactly* parallel. No-interaction model forces them\nto have the same slope. \n\n\n\n## Different look at model output\n\n\n* `anova(prepost.2)` tests for significant effect of\nbefore score and of drug, but doesn't help with interpretation.\n\n* `summary(prepost.2)` views as regression with slopes:\n\n\\scriptsize\n\n\n::: {.cell}\n\n```{.r .cell-code}\nsummary(prepost.2)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n\nCall:\nlm(formula = after ~ before + drug, data = prepost)\n\nResiduals:\n Min 1Q Median 3Q Max \n-3.6348 -2.5099 -0.2038 1.8871 4.7453 \n\nCoefficients:\n Estimate Std. Error t value Pr(>|t|) \n(Intercept) 18.3600 1.5115 12.147 8.35e-10 ***\nbefore 0.8275 0.0955 8.665 1.21e-07 ***\ndrugb -5.1547 1.2876 -4.003 0.000921 ***\n---\nSignif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 
0.1 ' ' 1\n\nResidual standard error: 2.682 on 17 degrees of freedom\nMultiple R-squared: 0.817,\tAdjusted R-squared: 0.7955 \nF-statistic: 37.96 on 2 and 17 DF, p-value: 5.372e-07\n```\n:::\n:::\n\n\n\n\\normalsize \n\n\n\n\n## Understanding those slopes\n\n\\footnotesize\n\n\n::: {.cell}\n\n```{.r .cell-code}\ntidy(prepost.2)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n# A tibble: 3 x 5\n term estimate std.error statistic p.value\n \n1 (Intercept) 18.4 1.51 12.1 8.35e-10\n2 before 0.827 0.0955 8.66 1.21e- 7\n3 drugb -5.15 1.29 -4.00 9.21e- 4\n```\n:::\n:::\n\n\n\n\\normalsize\n\n\n\n* `before` ordinary numerical variable; `drug`\ncategorical. \n\n* `lm` uses first category `druga` as baseline.\n\n* Intercept is prediction of after score for before score 0 and\n*drug A*.\n\n* `before` slope is predicted change in after score when\nbefore score increases by 1 (usual slope)\n\n* Slope for `drugb` is *change* in predicted after\nscore for being on drug B rather than drug A. Same for *any*\nbefore score (no interaction).\n\n\n\n## Summary\n\n\n* ANCOVA model: fits different regression line for each group,\npredicting response from covariate.\n\n* ANCOVA model with interaction between factor and covariate\nallows different slopes for each line.\n\n* Sometimes those lines can cross over!\n\n* If interaction not significant, take out. Lines then parallel.\n\n* With parallel lines, groups have consistent effect regardless\nof value of covariate.\n\n",
+ "markdown": "---\ntitle: \"Analysis of Covariance\"\n---\n\n\n\n\n\n## Analysis of covariance\n\n\n* ANOVA: explanatory variables categorical (divide data into groups)\n\n* traditionally, analysis of covariance has categorical $x$'s plus one numerical $x$ (\"covariate\") to be adjusted for.\n\n* `lm` handles this too.\n\n* Simple example: two treatments (drugs) (`a` and `b`), with before and after scores. \n\n\n* Does knowing before score and/or treatment help to predict after score?\n\n* Is after score different by treatment/before score?\n\n\n\n\n\n## Data\n\nTreatment, before, after: \n\n\n\\scriptsize\n```\na 5 20\na 10 23\na 12 30\na 9 25\na 23 34\na 21 40\na 14 27\na 18 38\na 6 24\na 13 31\nb 7 19\nb 12 26\nb 27 33\nb 24 35\nb 18 30\nb 22 31\nb 26 34\nb 21 28\nb 14 23\nb 9 22\n```\n\\normalsize\n\n\n## Packages\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(tidyverse)\nlibrary(broom)\nlibrary(marginaleffects)\n```\n:::\n\n\n\nthe last of these for predictions.\n\n## Read in data\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nurl <- \"http://ritsokiguess.site/datafiles/ancova.txt\"\nprepost <- read_delim(url, \" \")\nprepost\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n# A tibble: 20 x 3\n drug before after\n \n 1 a 5 20\n 2 a 10 23\n 3 a 12 30\n 4 a 9 25\n 5 a 23 34\n 6 a 21 40\n 7 a 14 27\n 8 a 18 38\n 9 a 6 24\n10 a 13 31\n11 b 7 19\n12 b 12 26\n13 b 27 33\n14 b 24 35\n15 b 18 30\n16 b 22 31\n17 b 26 34\n18 b 21 28\n19 b 14 23\n20 b 9 22\n```\n:::\n:::\n\n\n\n\n\n## Making a plot\n\n\n::: {.cell}\n\n```{.r .cell-code}\nggplot(prepost, aes(x = before, y = after, colour = drug)) +\n geom_point()\n```\n\n::: {.cell-output-display}\n![](ancova_files/figure-beamer/ancova-plot-1.pdf)\n:::\n:::\n\n\n\n\n\n## Comments \n\n* As before score goes up, after score goes up.\n\n* Red points (drug A) generally above blue points (drug B), for\ncomparable before score.\n\n* Suggests before score effect *and* drug effect.\n\n\n## The means\n\n\n::: {.cell}\n\n```{.r 
.cell-code}\nprepost %>%\n group_by(drug) %>%\n summarize(\n before_mean = mean(before),\n after_mean = mean(after)\n )\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n# A tibble: 2 x 3\n drug before_mean after_mean\n \n1 a 13.1 29.2\n2 b 18 28.1\n```\n:::\n:::\n\n\n \n\n\n* Mean \"after\" score slightly higher for treatment A.\n\n* Mean \"before\" score much higher for treatment B.\n\n* Greater *improvement* on treatment A. \n\n\n\n## Testing for interaction\n\n\n::: {.cell}\n\n```{.r .cell-code}\nprepost.1 <- lm(after ~ before * drug, data = prepost)\nanova(prepost.1)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\nAnalysis of Variance Table\n\nResponse: after\n Df Sum Sq Mean Sq F value Pr(>F) \nbefore 1 430.92 430.92 62.6894 6.34e-07 ***\ndrug 1 115.31 115.31 16.7743 0.0008442 ***\nbefore:drug 1 12.34 12.34 1.7948 0.1990662 \nResiduals 16 109.98 6.87 \n---\nSignif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1\n```\n:::\n:::\n\n\n \n\n\n* Interaction not significant. Will remove later.\n\n## Predictions\n\nSet up values to predict for:\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nsummary(prepost)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n drug before after \n Length:20 Min. : 5.00 Min. :19.00 \n Class :character 1st Qu.: 9.75 1st Qu.:23.75 \n Mode :character Median :14.00 Median :29.00 \n Mean :15.55 Mean :28.65 \n 3rd Qu.:21.25 3rd Qu.:33.25 \n Max. :27.00 Max. 
:40.00 \n```\n:::\n:::\n\n::: {.cell}\n\n```{.r .cell-code}\nnew <- datagrid(before = c(9.75, 14, 21.25), \n drug = c(\"a\", \"b\"), model = prepost.1)\n```\n:::\n\n\n\n## and then\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\ncbind(predictions(prepost.1, newdata = new)) %>% \n select(drug, before, estimate)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n drug before estimate\n1 a 9.75 25.93250\n2 b 9.75 22.14565\n3 a 14.00 30.07784\n4 b 14.00 25.21304\n5 a 21.25 37.14929\n6 b 21.25 30.44565\n```\n:::\n:::\n\n\n\\normalsize\n\n## Predictions (with interaction included), plotted\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nplot_predictions(model = prepost.1, condition = c(\"before\", \"drug\"))\n```\n\n::: {.cell-output-display}\n![](ancova_files/figure-beamer/unnamed-chunk-4-1.pdf)\n:::\n:::\n\n\n\nLines almost parallel, but not quite.\n\n\n\n## Taking out interaction\n\\small\n\n\n::: {.cell}\n\n```{.r .cell-code}\nprepost.2 <- update(prepost.1, . ~ . - before:drug)\nanova(prepost.2)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\nAnalysis of Variance Table\n\nResponse: after\n Df Sum Sq Mean Sq F value Pr(>F) \nbefore 1 430.92 430.92 59.890 5.718e-07 ***\ndrug 1 115.31 115.31 16.025 0.0009209 ***\nResiduals 17 122.32 7.20 \n---\nSignif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 
0.1 ' ' 1\n```\n:::\n:::\n\n\n \n\\normalsize\n\n\n* Take out non-significant interaction.\n\n* `before` and `drug` strongly significant.\n\n* Do predictions again and plot them.\n\n## Predictions\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\ncbind(predictions(prepost.2, newdata = new)) %>% \n select(drug, before, estimate)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n drug before estimate\n1 a 9.75 26.42794\n2 b 9.75 21.27328\n3 a 14.00 29.94473\n4 b 14.00 24.79007\n5 a 21.25 35.94397\n6 b 21.25 30.78931\n```\n:::\n:::\n\n\n\n\n## Plot of predicted values\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nplot_predictions(prepost.2, condition = c(\"before\", \"drug\"))\n```\n\n::: {.cell-output-display}\n![](ancova_files/figure-beamer/unnamed-chunk-6-1.pdf)\n:::\n:::\n\n\n\nThis time the lines are *exactly* parallel. No-interaction model forces them\nto have the same slope. \n\n\n\n## Different look at model output\n\n\n* `anova(prepost.2)` tests for significant effect of\nbefore score and of drug, but doesn't help with interpretation.\n\n* `summary(prepost.2)` views as regression with slopes:\n\n\\scriptsize\n\n\n::: {.cell}\n\n```{.r .cell-code}\nsummary(prepost.2)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n\nCall:\nlm(formula = after ~ before + drug, data = prepost)\n\nResiduals:\n Min 1Q Median 3Q Max \n-3.6348 -2.5099 -0.2038 1.8871 4.7453 \n\nCoefficients:\n Estimate Std. Error t value Pr(>|t|) \n(Intercept) 18.3600 1.5115 12.147 8.35e-10 ***\nbefore 0.8275 0.0955 8.665 1.21e-07 ***\ndrugb -5.1547 1.2876 -4.003 0.000921 ***\n---\nSignif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 
0.1 ' ' 1\n\nResidual standard error: 2.682 on 17 degrees of freedom\nMultiple R-squared: 0.817,\tAdjusted R-squared: 0.7955 \nF-statistic: 37.96 on 2 and 17 DF, p-value: 5.372e-07\n```\n:::\n:::\n\n\n\n\\normalsize \n\n\n\n\n## Understanding those slopes\n\n\\footnotesize\n\n\n::: {.cell}\n\n```{.r .cell-code}\ntidy(prepost.2)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n# A tibble: 3 x 5\n term estimate std.error statistic p.value\n \n1 (Intercept) 18.4 1.51 12.1 8.35e-10\n2 before 0.827 0.0955 8.66 1.21e- 7\n3 drugb -5.15 1.29 -4.00 9.21e- 4\n```\n:::\n:::\n\n\n\n\\normalsize\n\n\n\n* `before` ordinary numerical variable; `drug`\ncategorical. \n\n* `lm` uses first category `druga` as baseline.\n\n* Intercept is prediction of after score for before score 0 and\n*drug A*.\n\n* `before` slope is predicted change in after score when\nbefore score increases by 1 (usual slope)\n\n* Slope for `drugb` is *change* in predicted after\nscore for being on drug B rather than drug A. Same for *any*\nbefore score (no interaction).\n\n\n\n## Summary\n\n\n* ANCOVA model: fits different regression line for each group,\npredicting response from covariate.\n\n* ANCOVA model with interaction between factor and covariate\nallows different slopes for each line.\n\n* Sometimes those lines can cross over!\n\n* If interaction not significant, take out. Lines then parallel.\n\n* With parallel lines, groups have consistent effect regardless\nof value of covariate.\n\n",
"supporting": [
"ancova_files/figure-beamer"
],
diff --git a/_freeze/ancova/figure-beamer/ancova-plot-1.pdf b/_freeze/ancova/figure-beamer/ancova-plot-1.pdf
index 4634a40..9451afc 100644
Binary files a/_freeze/ancova/figure-beamer/ancova-plot-1.pdf and b/_freeze/ancova/figure-beamer/ancova-plot-1.pdf differ
diff --git a/_freeze/ancova/figure-beamer/unnamed-chunk-4-1.pdf b/_freeze/ancova/figure-beamer/unnamed-chunk-4-1.pdf
index a15d6f6..217fc6a 100644
Binary files a/_freeze/ancova/figure-beamer/unnamed-chunk-4-1.pdf and b/_freeze/ancova/figure-beamer/unnamed-chunk-4-1.pdf differ
diff --git a/_freeze/ancova/figure-beamer/unnamed-chunk-6-1.pdf b/_freeze/ancova/figure-beamer/unnamed-chunk-6-1.pdf
index 4530a59..3d2ae2c 100644
Binary files a/_freeze/ancova/figure-beamer/unnamed-chunk-6-1.pdf and b/_freeze/ancova/figure-beamer/unnamed-chunk-6-1.pdf differ
diff --git a/_freeze/ancova/figure-revealjs/unnamed-chunk-4-1.png b/_freeze/ancova/figure-revealjs/unnamed-chunk-4-1.png
index b6f4101..99604e4 100644
Binary files a/_freeze/ancova/figure-revealjs/unnamed-chunk-4-1.png and b/_freeze/ancova/figure-revealjs/unnamed-chunk-4-1.png differ
diff --git a/_freeze/ancova/figure-revealjs/unnamed-chunk-6-1.png b/_freeze/ancova/figure-revealjs/unnamed-chunk-6-1.png
index d0f2084..5ec9ff2 100644
Binary files a/_freeze/ancova/figure-revealjs/unnamed-chunk-6-1.png and b/_freeze/ancova/figure-revealjs/unnamed-chunk-6-1.png differ
diff --git a/_freeze/asphalt/execute-results/html.json b/_freeze/asphalt/execute-results/html.json
index f838b58..b5bc203 100644
--- a/_freeze/asphalt/execute-results/html.json
+++ b/_freeze/asphalt/execute-results/html.json
@@ -1,7 +1,7 @@
{
- "hash": "25e7e208e7eefbed7de9dbae587937fa",
+ "hash": "6c8eba076b49aa7fbc7a066250547c9f",
"result": {
- "markdown": "---\ntitle: \"Case study: asphalt\"\n---\n\n\n## The asphalt data\n- 31 asphalt pavements prepared under different conditions. How does\nquality of pavement depend on these?\n- Variables:\n - `pct.a.surf` Percentage of asphalt in surface layer\n - `pct.a.base` Percentage of asphalt in base layer\n - `fines` Percentage of fines in surface layer\n - `voids` Percentage of voids in surface layer\n - `rut.depth` Change in rut depth per million vehicle passes\n - `viscosity` Viscosity of asphalt\n - `run` 2 data collection periods: 1 for run 1, 0 for run 2.\n- `rut.depth` response. Depends on other variables, how?\n\n## Packages for this section\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(MASS)\nlibrary(tidyverse)\nlibrary(broom)\nlibrary(leaps)\n```\n:::\n\n\nMake sure to load `MASS` before `tidyverse` (for annoying technical reasons).\n\n## Getting set up \n\n\n::: {.cell}\n\n```{.r .cell-code}\nmy_url <- \"http://ritsokiguess.site/datafiles/asphalt.txt\"\nasphalt <- read_delim(my_url, \" \")\n```\n:::\n\n\n- Quantitative variables with one response: multiple regression.\n- Some issues here that don’t come up in “simple” regression; handle as\nwe go. (STAB27/STAC67 ideas.)\n\n## The data (some)\n\n\n::: {.cell}\n\n```{.r .cell-code}\nasphalt\n```\n\n::: {.cell-output-display}\n`````{=html}\n\n \n
\n`````\n:::\n:::\n\n\n## Plotting response “rut depth” against everything else\n\nSame idea as for plotting separate predictions on one plot:\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nasphalt %>%\n pivot_longer(\n -rut.depth,\n names_to=\"xname\", values_to=\"x\"\n ) %>%\n ggplot(aes(x = x, y = rut.depth)) + geom_point() +\n facet_wrap(~xname, scales = \"free\") -> g\n```\n:::\n\n\n“collect all the x-variables together into one column called x, with another\ncolumn xname saying which x they were, then plot these x’s against\nrut.depth, a separate facet for each x-variable.”\n\nI saved this graph to plot later (on the next page).\n\n## The plot\n\n\n::: {.cell}\n\n```{.r .cell-code}\ng\n```\n\n::: {.cell-output-display}\n![](asphalt_files/figure-revealjs/asphalt-5-1.png){width=960}\n:::\n:::\n\n\n## Interpreting the plots\n- One plot of rut depth against each of the six other variables.\n- Get rough idea of what’s going on.\n- Trends mostly weak.\n- `viscosity` has strong but non-linear trend.\n- `run` has effect but variability bigger when run is 1.\n- Weak but downward trend for `voids`.\n- Non-linearity of `rut.depth`-`viscosity` relationship should concern\nus.\n\n## Log of `viscosity`: more nearly linear?\n\n- Take this back to asphalt engineer: suggests log of `viscosity`:\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nggplot(asphalt, aes(y = rut.depth, x = log(viscosity))) +\n geom_point() + geom_smooth(se = F)\n```\n:::\n\n\n(plot overleaf)\n\n## Rut depth against log-viscosity\n\n\n::: {.cell}\n::: {.cell-output-display}\n![](asphalt_files/figure-revealjs/asphalt-6-1.png){width=960}\n:::\n:::\n\n\n## Comments and next steps\n- Not very linear, but better than before.\n- In multiple regression, hard to guess which x’s affect response. 
So\ntypically start by predicting from everything else.\n- Model formula has response on left, squiggle, explanatories on right\njoined by plusses:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nrut.1 <- lm(rut.depth ~ pct.a.surf + pct.a.base + fines +\n voids + log(viscosity) + run, data = asphalt)\n```\n:::\n\n\n## Regression output: `summary(rut.1)` or:\n\n\\footnotesize\n\n::: {.cell}\n\n```{.r .cell-code}\nglance(rut.1)\n```\n\n::: {.cell-output-display}\n`````{=html}\n\n \n
\n`````\n:::\n\n```{.r .cell-code}\ntidy(rut.1)\n```\n\n::: {.cell-output-display}\n`````{=html}\n\n \n
\n`````\n:::\n:::\n\n\\normalsize\n\n## Comments\n- R-squared 81%, not so bad. \n- P-value in `glance` asserts that something helping to predict\nrut.depth.\n- Table of coefficients says `log(viscosity)`.\n- But confused by clearly non-significant variables: remove those to get\nclearer picture of what is helpful.\n- Before we do anything, look at residual plots:\n - (a) of residuals against fitted values (as usual)\n - (b) of residuals against each explanatory.\n- Problem fixes:\n - with (a): fix response variable; \n - with some plots in (b): fix those explanatory variables.\n\n## Plot fitted values against residuals\n\n\n::: {.cell}\n\n```{.r .cell-code}\nggplot(rut.1, aes(x = .fitted, y = .resid)) + geom_point()\n```\n\n::: {.cell-output-display}\n![](asphalt_files/figure-revealjs/asphalt-9-1.png){width=960}\n:::\n:::\n\n\n## Plotting residuals against $x$ variables\n- Problem here is that residuals are in the fitted model, and the\nobserved $x$-values are in the original data frame `asphalt`. 
\n- Package broom contains a function `augment` that combines these two\ntogether so that they can later be plotted: start with a model first, and then augment with a\ndata frame:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nrut.1 %>% augment(asphalt) -> rut.1a\n```\n:::\n\n\n\n## What does rut.1a contain?\n\n\n::: {.cell}\n\n:::\n\n::: {.cell}\n\n```{.r .cell-code}\nnames(rut.1a)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n [1] \"pct.a.surf\" \"pct.a.base\" \"fines\" \"voids\" \"rut.depth\" \n [6] \"viscosity\" \"run\" \".fitted\" \".resid\" \".hat\" \n[11] \".sigma\" \".cooksd\" \".std.resid\"\n```\n:::\n:::\n\n\n- all the stuff in original data frame, plus:\n- quantities from regression (starting with a dot)\n\n\n## Plotting residuals against $x$-variables \n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nrut.1a %>%\n mutate(log_vis=log(viscosity)) %>% \n pivot_longer(\n c(pct.a.surf:voids, run, log_vis),\n names_to=\"xname\", values_to=\"x\"\n ) %>%\n ggplot(aes(x = x, y = .resid)) +\n geom_point() + facet_wrap(~xname, scales = \"free\") -> g\n```\n:::\n\n\n## The plot\n\n\n::: {.cell}\n\n```{.r .cell-code}\ng\n```\n\n::: {.cell-output-display}\n![](asphalt_files/figure-revealjs/asphalt-14-1.png){width=960}\n:::\n:::\n\n\n## Comments\n- There is serious curve in plot of residuals vs. fitted values. Suggests a\ntransformation of $y$. \n- The residuals-vs-$x$’s plots don’t show any serious trends. Worst\nprobably that potential curve against log-viscosity.\n- Also, large positive residual, 10, that shows up on all plots. Perhaps\ntransformation of $y$ will help with this too.\n- If residual-fitted plot OK, but some residual-$x$ plots not, try\ntransforming those $x$’s, eg. 
by adding $x^2$ to help with curve.\n\n## Which transformation?\n- Best way: consult with person who brought you the data.\n- Can’t do that here!\n- No idea what transformation would be good.\n- Let data choose: “Box-Cox transformation”.\n- Scale is that of “ladder of powers”: power transformation, but 0 is\nlog.\n\n\n## Running Box-Cox\n\nFrom package `MASS`:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nboxcox(rut.depth ~ pct.a.surf + pct.a.base + fines + voids +\n log(viscosity) + run, data = asphalt)\n```\n\n::: {.cell-output-display}\n![](asphalt_files/figure-revealjs/asphalt-15-1.png){width=960}\n:::\n:::\n\n\n## Comments on Box-Cox plot\n- $\\lambda$ represents power to transform $y$ with.\n- Best single choice of transformation parameter $\\lambda$ is peak of curve,\nclose to 0.\n- Vertical dotted lines give CI for $\\lambda$, about (−0.05, 0.2).\n- $\\lambda = 0$ means “log”.\n- Narrowness of confidence interval mean that these not supported by\ndata:\n - No transformation ($\\lambda = 1$)\n - Square root ($\\lambda = 0.5$)\n - Reciprocal ($\\lambda = −1$).\n\n## Relationships with explanatories\n- As before: plot response (now `log(rut.depth)`) against other\nexplanatory variables, all in one shot:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nasphalt %>%\n mutate(log_vis=log(viscosity)) %>% \n pivot_longer(\n c(pct.a.surf:voids, run, log_vis),\n names_to=\"xname\", values_to=\"x\"\n ) %>%\n ggplot(aes(y = log(rut.depth), x = x)) + geom_point() +\n facet_wrap(~xname, scales = \"free\") -> g3\n```\n:::\n\n\n## The new plots\n\n\n::: {.cell}\n\n```{.r .cell-code}\ng3\n```\n\n::: {.cell-output-display}\n![](asphalt_files/figure-revealjs/asphalt-17-1.png){width=960}\n:::\n:::\n\n\n## Modelling with transformed response\n- These trends look pretty straight, especially with `log.viscosity`.\n- Values of `log.rut.depth` for each `run` have same spread.\n- Other trends weak, but are straight if they exist.\n- Start modelling from the beginning again.\n- Model `log.rut.depth` 
in terms of everything else, see what can be\nremoved:\n\n::: {.cell}\n\n```{.r .cell-code}\nrut.2 <- lm(log(rut.depth) ~ pct.a.surf + pct.a.base +\n fines + voids + log(viscosity) + run, data = asphalt)\n```\n:::\n\n\n- use `tidy` from `broom` to display just the coefficients.\n\n## Output\n\n::: {.cell}\n\n```{.r .cell-code}\ntidy(rut.2)\n```\n\n::: {.cell-output-display}\n`````{=html}\n\n \n
\n`````\n:::\n:::\n\n\n## Taking out everything non-significant\n- Try: remove everything but pct.a.surf and log.viscosity:\n\n\\footnotesize\n\n::: {.cell}\n\n```{.r .cell-code}\nrut.3 <- lm(log(rut.depth) ~ pct.a.surf + log(viscosity), data = asphalt)\n```\n:::\n\n\\normalsize\n\n\\footnotesize\n- Check that removing all those variables wasn’t too much:\n\n::: {.cell}\n\n```{.r .cell-code}\nanova(rut.3, rut.2)\n```\n\n::: {.cell-output-display}\n`````{=html}\n\n \n
\n`````\n:::\n:::\n\n\\normalsize\n\n- $H_0$ : two models equally good; $H_a$ : bigger model better.\n- Null not rejected here; small model as good as the big one, so prefer\nsimpler smaller model `rut.3`.\n\n## Find the largest P-value by eye:\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\ntidy(rut.2)\n```\n\n::: {.cell-output-display}\n`````{=html}\n\n \n
\n`````\n:::\n:::\n\n\n- Largest P-value is 0.78 for `pct.a.base`, not significant.\n- So remove this first, re-fit and re-assess.\n- Or, as over.\n\n## Get the computer to find the largest P-value for you\n\n- Output from `tidy` is itself a data frame, thus:\n\n\n::: {.cell}\n\n```{.r .cell-code}\ntidy(rut.2) %>% arrange(p.value)\n```\n\n::: {.cell-output-display}\n`````{=html}\n\n \n
\n`````\n:::\n:::\n\n\n- Largest P-value at the bottom.\n\n## Take out `pct.a.base`\n\n- Copy and paste the `lm` code and remove what you're removing:\n\n\\small\n\n::: {.cell}\n\n```{.r .cell-code}\nrut.4 <- lm(log(rut.depth) ~ pct.a.surf + fines + voids + \n log(viscosity) + run, data = asphalt)\ntidy(rut.4) %>% arrange(p.value)\n```\n\n::: {.cell-output-display}\n`````{=html}\n\n \n
\n`````\n:::\n:::\n\n\\normalsize\n\n- `fines` is next to go, P-value 0.32.\n\n## “Update”\n\nAnother way to do the same thing:\n\n::: {.cell}\n\n```{.r .cell-code}\nrut.4 <- update(rut.2, . ~ . - pct.a.base)\ntidy(rut.4) %>% arrange(p.value)\n```\n\n::: {.cell-output-display}\n`````{=html}\n\n \n
\n`````\n:::\n:::\n\n\n- Again, `fines` is the one to go. (Output identical as it should be.)\n\n## Take out fines:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nrut.5 <- update(rut.4, . ~ . - fines)\ntidy(rut.5) %>% arrange(p.value)\n```\n\n::: {.cell-output-display}\n`````{=html}\n\n \n
\n`````\n:::\n:::\n\n\nCan’t take out intercept, so `run`, with P-value 0.36, goes next.\n\n## Take out run:\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nrut.6 <- update(rut.5, . ~ . - run)\ntidy(rut.6) %>% arrange(p.value)\n```\n\n::: {.cell-output-display}\n`````{=html}\n\n \n
\n`````\n:::\n:::\n\n\nAgain, can’t take out intercept, so largest P-value is for `voids`, 0.044. But\nthis is significant, so we shouldn’t remove `voids`.\n\n## Comments\n- Here we stop: `pct.a.surf`, `voids` and `log.viscosity` would all\nmake fit significantly worse if removed. So they stay.\n- Different final result from taking things out one at a time (top), than\nby taking out 4 at once (bottom):\n\n\n::: {.cell}\n\n```{.r .cell-code}\ncoef(rut.6)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n (Intercept) pct.a.surf voids log(viscosity) \n -1.0207945 0.5554686 0.2447934 -0.6464911 \n```\n:::\n\n```{.r .cell-code}\ncoef(rut.3)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n (Intercept) pct.a.surf log(viscosity) \n 0.9001389 0.3911481 -0.6185628 \n```\n:::\n:::\n\n\n- Point: Can make difference which way we go.\n\n## Comments on variable selection\n- Best way to decide which $x$’s belong: expert knowledge: which of\nthem should be important.\n- Best automatic method: what we did, “backward selection”.\n- Do not learn about “stepwise regression”! [**eg. here**](https://towardsdatascience.com/stopping-stepwise-why-stepwise-selection-is-bad-and-what-you-should-use-instead-90818b3f52df)\n- R has function `step` that does backward selection, like this:\n\n::: {.cell}\n\n```{.r .cell-code}\nstep(rut.2, direction = \"backward\", test = \"F\")\n```\n:::\n\n\nGets same answer as we did (by removing least significant x). \n\n- Removing non-significant $x$’s may remove interesting ones whose\nP-values happened not to reach 0.05. 
Consider using less stringent\ncutoff like 0.20 or even bigger.\n- Can also fit all possible regressions, as over (may need to do\n`install.packages(\"leaps\")` first).\n\n## All possible regressions (output over)\n\nUses package `leaps`:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nleaps <- regsubsets(log(rut.depth) ~ pct.a.surf + \n pct.a.base + fines + voids + \n log(viscosity) + run, \n data = asphalt, nbest = 2)\ns <- summary(leaps)\nwith(s, data.frame(rsq, outmat)) -> d\n```\n:::\n\n\n## The output\n\n\n::: {.cell}\n\n:::\n\n\n\n\\scriptsize\n\n::: {.cell}\n\n```{.r .cell-code}\nd %>% rownames_to_column(\"model\") %>% arrange(desc(rsq))\n```\n\n::: {.cell-output-display}\n`````{=html}\n\n \n
\n`````\n:::\n:::\n\n\\normalsize\n\n\n::: {.cell}\n\n:::\n\n\n\n## Comments\n- Problem: even adding a worthless x increases R-squared. So try for\nline where R-squared stops increasing “too much”, eg. top line (just\nlog.viscosity), first 3-variable line (backwards-elimination model).\nHard to judge.\n- One solution (STAC67): adjusted R-squared, where adding worthless\nvariable makes it go down.\n- `data.frame` rather than `tibble` because there are several columns in\n`outmat`. \n\n## All possible regressions, adjusted R-squared\n\n\n::: {.cell}\n\n:::\n\n\n\\scriptsize\n\n::: {.cell}\n\n```{.r .cell-code}\nwith(s, data.frame(adjr2, outmat)) %>% \n rownames_to_column(\"model\") %>% \n arrange(desc(adjr2))\n```\n\n::: {.cell-output-display}\n`````{=html}\n\n \n
\n`````\n:::\n:::\n\n\\normalsize\n\n\n\n::: {.cell}\n\n:::\n\n\n## Revisiting the best model\n- Best model was our rut.6:\n\n\n::: {.cell}\n\n```{.r .cell-code}\ntidy(rut.6)\n```\n\n::: {.cell-output-display}\n`````{=html}\n\n \n
\n`````\n:::\n:::\n\n\n## Revisiting (2)\n- Regression slopes say that rut depth increases as log-viscosity\ndecreases, `pct.a.surf` increases and `voids` increases. This more or\nless checks out with out scatterplots against `log.viscosity`. \n- We should check residual plots again, though previous scatterplots say\nit’s unlikely that there will be a problem:\n\n\n::: {.cell}\n\n```{.r .cell-code}\ng <- ggplot(rut.6, aes(y = .resid, x = .fitted)) + \ngeom_point()\n```\n:::\n\n\n## Residuals against fitted values\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\ng\n```\n\n::: {.cell-output-display}\n![](asphalt_files/figure-revealjs/asphalt-39-1.png){width=960}\n:::\n:::\n\n\n## Plotting residuals against x’s\n- Do our trick again to put them all on one plot:\n\n\n::: {.cell}\n\n```{.r .cell-code}\naugment(rut.6, asphalt) %>%\n mutate(log_vis=log(viscosity)) %>% \n pivot_longer(\n c(pct.a.surf:voids, run, log_vis),\n names_to=\"xname\", values_to=\"x\",\n ) %>%\n ggplot(aes(y = .resid, x = x)) + geom_point() +\n facet_wrap(~xname, scales = \"free\") -> g2\n```\n:::\n\n\n## Residuals against the x’s\n\n\n::: {.cell}\n\n```{.r .cell-code}\ng2\n```\n\n::: {.cell-output-display}\n![](asphalt_files/figure-revealjs/asphalt-41-1.png){width=960}\n:::\n:::\n\n\n## Comments\n- None of the plots show any sort of pattern. The points all look\nrandom on each plot.\n- On the plot of fitted values (and on the one of log.viscosity), the\npoints seem to form a “left half” and a “right half” with a gap in the\nmiddle. 
This is not a concern.\n- One of the pct.a.surf values is low outlier (4), shows up top left of\nthat plot.\n- Only two possible values of run; the points in each group look\nrandomly scattered around 0, with equal spreads.\n- Residuals seem to go above zero further than below, suggesting a\nmild non-normality, but not enough to be a problem.\n\n## Variable-selection strategies\n- Expert knowledge.\n- Backward elimination.\n- All possible regressions.\n- Taking a variety of models to experts and asking their opinion.\n- Use a looser cutoff to eliminate variables in backward elimination (eg.\nonly if P-value greater than 0.20).\n- If goal is prediction, eliminating worthless variables less important.\n- If goal is understanding, want to eliminate worthless variables where\npossible.\n- Results of variable selection not always reproducible, so caution\nadvised.\n\n",
+ "markdown": "---\ntitle: \"Case study: asphalt\"\neditor: \n markdown: \n wrap: 72\n---\n\n\n## The asphalt data\n\n- 31 asphalt pavements prepared under different conditions. How does\n quality of pavement depend on these?\n- Variables:\n - `pct.a.surf` Percentage of asphalt in surface layer\n - `pct.a.base` Percentage of asphalt in base layer\n - `fines` Percentage of fines in surface layer\n - `voids` Percentage of voids in surface layer\n - `rut.depth` Change in rut depth per million vehicle passes\n - `viscosity` Viscosity of asphalt\n - `run` 2 data collection periods: 1 for run 1, 0 for run 2.\n- `rut.depth` response. Depends on other variables, how?\n\n## Packages for this section\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(MASS)\nlibrary(tidyverse)\nlibrary(broom)\nlibrary(leaps)\n```\n:::\n\n\nMake sure to load `MASS` before `tidyverse` (for annoying technical\nreasons).\n\n## Getting set up\n\n\n::: {.cell}\n\n```{.r .cell-code}\nmy_url <- \"http://ritsokiguess.site/datafiles/asphalt.txt\"\nasphalt <- read_delim(my_url, \" \")\n```\n:::\n\n\n- Quantitative variables with one response: multiple regression.\n- Some issues here that don't come up in \"simple\" regression; handle\n as we go. (STAB27/STAC67 ideas.)\n\n## The data (some)\n\n\n::: {.cell}\n\n```{.r .cell-code}\nasphalt\n```\n\n::: {.cell-output-display}\n`````{=html}\n\n \n
\n`````\n:::\n:::\n\n\n## Plotting response \"rut depth\" against everything else\n\nSame idea as for plotting separate predictions on one plot:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nasphalt %>%\n pivot_longer(\n -rut.depth,\n names_to=\"xname\", values_to=\"x\"\n ) %>%\n ggplot(aes(x = x, y = rut.depth)) + geom_point() +\n facet_wrap(~xname, scales = \"free\") -> g\n```\n:::\n\n\n\"collect all the x-variables together into one column called x, with\nanother column xname saying which x they were, then plot these x's\nagainst rut.depth, a separate facet for each x-variable.\"\n\nI saved this graph to plot later (on the next page).\n\n## The plot\n\n\n::: {.cell}\n\n```{.r .cell-code}\ng\n```\n\n::: {.cell-output-display}\n![](asphalt_files/figure-revealjs/asphalt-5-1.png){width=960}\n:::\n:::\n\n\n## Interpreting the plots\n\n- One plot of rut depth against each of the six other variables.\n- Get rough idea of what's going on.\n- Trends mostly weak.\n- `viscosity` has strong but non-linear trend.\n- `run` has effect but variability bigger when run is 1.\n- Weak but downward trend for `voids`.\n- Non-linearity of `rut.depth`-`viscosity` relationship should concern\n us.\n\n## Log of `viscosity`: more nearly linear?\n\n- Take this back to asphalt engineer: suggests log of `viscosity`:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nggplot(asphalt, aes(y = rut.depth, x = log(viscosity))) +\n geom_point() + geom_smooth(se = F) -> g\n```\n:::\n\n\n(plot overleaf)\n\n## Rut depth against log-viscosity\n\n\n::: {.cell}\n\n:::\n\n\n## Comments and next steps\n\n- Not very linear, but better than before.\n- In multiple regression, hard to guess which x's affect response. 
So\n typically start by predicting from everything else.\n- Model formula has response on left, squiggle, explanatories on right\n joined by plusses:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nrut.1 <- lm(rut.depth ~ pct.a.surf + pct.a.base + fines +\n voids + log(viscosity) + run, data = asphalt)\nsummary(rut.1)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n\nCall:\nlm(formula = rut.depth ~ pct.a.surf + pct.a.base + fines + voids + \n log(viscosity) + run, data = asphalt)\n\nResiduals:\n Min 1Q Median 3Q Max \n-4.1211 -1.9075 -0.7175 1.6382 9.5947 \n\nCoefficients:\n Estimate Std. Error t value Pr(>|t|) \n(Intercept) -12.9937 26.2188 -0.496 0.6247 \npct.a.surf 3.9706 2.4966 1.590 0.1248 \npct.a.base 1.2631 3.9703 0.318 0.7531 \nfines 0.1164 1.0124 0.115 0.9094 \nvoids 0.5893 1.3244 0.445 0.6604 \nlog(viscosity) -3.1515 0.9194 -3.428 0.0022 **\nrun -1.9655 3.6472 -0.539 0.5949 \n---\nSignif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1\n\nResidual standard error: 3.324 on 24 degrees of freedom\nMultiple R-squared: 0.806,\tAdjusted R-squared: 0.7575 \nF-statistic: 16.62 on 6 and 24 DF, p-value: 1.743e-07\n```\n:::\n:::\n\n\n## Regression output: `summary(rut.1)` or:\n\n\\footnotesize\n\n\n::: {.cell}\n\n```{.r .cell-code}\nglance(rut.1)\n```\n\n::: {.cell-output-display}\n`````{=html}\n\n \n
\n`````\n:::\n\n```{.r .cell-code}\ntidy(rut.1)\n```\n\n::: {.cell-output-display}\n`````{=html}\n\n \n
\n`````\n:::\n:::\n\n\n\\normalsize\n\n## Comments\n\n- R-squared 81%, not so bad.\n\n- P-value in `glance` asserts that something helping to predict\n rut.depth.\n\n- Table of coefficients says `log(viscosity)`.\n\n- But confused by clearly non-significant variables: remove those to\n get clearer picture of what is helpful.\n\n- Before we do anything, look at residual plots:\n\n - (a) of residuals against fitted values (as usual)\n - (b) of residuals against each explanatory.\n\n- Problem fixes:\n\n - with (a): fix response variable;\n - with some plots in (b): fix those explanatory variables.\n\n## Plot fitted values against residuals\n\n\n::: {.cell}\n\n```{.r .cell-code}\nggplot(rut.1, aes(x = .fitted, y = .resid)) + geom_point()\n```\n\n::: {.cell-output-display}\n![](asphalt_files/figure-revealjs/asphalt-9-1.png){width=960}\n:::\n:::\n\n\n## Normal quantile plot of residuals\n\n\n::: {.cell}\n\n```{.r .cell-code}\nggplot(rut.1, aes(sample = .resid)) + stat_qq() + stat_qq_line()\n```\n\n::: {.cell-output-display}\n![](asphalt_files/figure-revealjs/unnamed-chunk-1-1.png){width=960}\n:::\n:::\n\n\n## Plotting residuals against $x$ variables\n\n- Problem here is that residuals are in the fitted model, and the\n observed $x$-values are in the original data frame `asphalt`.\n- Package broom contains a function `augment` that combines these two\n together so that they can later be plotted: start with a model\n first, and then augment with a data frame:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nrut.1 %>% augment(asphalt) -> rut.1a\n```\n:::\n\n\n## What does rut.1a contain?\n\n\n::: {.cell}\n\n:::\n\n::: {.cell}\n\n```{.r .cell-code}\nnames(rut.1a)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n [1] \"pct.a.surf\" \"pct.a.base\" \"fines\" \"voids\" \"rut.depth\" \n [6] \"viscosity\" \"run\" \".fitted\" \".resid\" \".hat\" \n[11] \".sigma\" \".cooksd\" \".std.resid\"\n```\n:::\n:::\n\n\n- all the stuff in original data frame, plus:\n- 
quantities from regression (starting with a dot)\n\n## Plotting residuals against $x$-variables\n\n\n::: {.cell}\n\n```{.r .cell-code}\nrut.1a %>%\n mutate(log_vis=log(viscosity)) %>% \n pivot_longer(\n c(pct.a.surf:voids, run, log_vis),\n names_to=\"xname\", values_to=\"x\"\n ) %>%\n ggplot(aes(x = x, y = .resid)) +\n geom_point() + facet_wrap(~xname, scales = \"free\") -> g\n```\n:::\n\n\n## The plot\n\n\n::: {.cell}\n\n```{.r .cell-code}\ng\n```\n\n::: {.cell-output-display}\n![](asphalt_files/figure-revealjs/asphalt-14-1.png){width=960}\n:::\n:::\n\n\n## Comments\n\n- There is serious curve in plot of residuals vs. fitted values.\n Suggests a transformation of $y$.\n- The residuals-vs-$x$'s plots don't show any serious trends. Worst\n probably that potential curve against log-viscosity.\n- Also, large positive residual, 10, that shows up on all plots.\n Perhaps transformation of $y$ will help with this too.\n- If residual-fitted plot OK, but some residual-$x$ plots not, try\n transforming those $x$'s, eg. 
by adding $x^2$ to help with curve.\n\n## Which transformation?\n\n- Best way: consult with person who brought you the data.\n- Can't do that here!\n- No idea what transformation would be good.\n- Let data choose: \"Box-Cox transformation\".\n- Scale is that of \"ladder of powers\": power transformation, but 0 is\n log.\n\n## Running Box-Cox\n\nFrom package `MASS`:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nboxcox(rut.depth ~ pct.a.surf + pct.a.base + fines + voids +\n log(viscosity) + run, data = asphalt)\n```\n\n::: {.cell-output-display}\n![](asphalt_files/figure-revealjs/asphalt-15-1.png){width=960}\n:::\n:::\n\n\n## Comments on Box-Cox plot\n\n- $\\lambda$ represents power to transform $y$ with.\n- Best single choice of transformation parameter $\\lambda$ is peak of\n curve, close to 0.\n- Vertical dotted lines give CI for $\\lambda$, about (−0.05, 0.2).\n- $\\lambda = 0$ means \"log\".\n- Narrowness of confidence interval means that these are not supported\n by data:\n - No transformation ($\\lambda = 1$)\n - Square root ($\\lambda = 0.5$)\n - Reciprocal ($\\lambda = −1$).\n\n## Relationships with explanatories\n\n- As before: plot response (now `log(rut.depth)`) against other\n explanatory variables, all in one shot:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nasphalt %>%\n mutate(log_vis=log(viscosity)) %>% \n pivot_longer(\n c(pct.a.surf:voids, run, log_vis),\n names_to=\"xname\", values_to=\"x\"\n ) %>%\n ggplot(aes(y = log(rut.depth), x = x)) + geom_point() +\n facet_wrap(~xname, scales = \"free\") -> g3\n```\n:::\n\n\n## The new plots\n\n\n::: {.cell}\n\n```{.r .cell-code}\ng3\n```\n\n::: {.cell-output-display}\n![](asphalt_files/figure-revealjs/asphalt-17-1.png){width=960}\n:::\n:::\n\n\n## Modelling with transformed response\n\n- These trends look pretty straight, especially with `log.viscosity`.\n- Values of `log.rut.depth` for each `run` have same spread.\n- Other trends weak, but are straight if they exist.\n- Start modelling from the beginning again.\n- Model 
`log.rut.depth` in terms of everything else, see what can be\n removed:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nrut.2 <- lm(log(rut.depth) ~ pct.a.surf + pct.a.base +\n fines + voids + log(viscosity) + run, data = asphalt)\n```\n:::\n\n\n- use `tidy` from `broom` to display just the coefficients.\n\n## Output\n\n\n::: {.cell}\n\n```{.r .cell-code}\ntidy(rut.2)\n```\n\n::: {.cell-output-display}\n`````{=html}\n\n \n
\n`````\n:::\n\n```{.r .cell-code}\nsummary(rut.2)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n\nCall:\nlm(formula = log(rut.depth) ~ pct.a.surf + pct.a.base + fines + \n voids + log(viscosity) + run, data = asphalt)\n\nResiduals:\n Min 1Q Median 3Q Max \n-0.53072 -0.18563 -0.00003 0.20017 0.55079 \n\nCoefficients:\n Estimate Std. Error t value Pr(>|t|) \n(Intercept) -1.57299 2.43617 -0.646 0.525 \npct.a.surf 0.58358 0.23198 2.516 0.019 * \npct.a.base -0.10337 0.36891 -0.280 0.782 \nfines 0.09775 0.09407 1.039 0.309 \nvoids 0.19885 0.12306 1.616 0.119 \nlog(viscosity) -0.55769 0.08543 -6.528 9.45e-07 ***\nrun 0.34005 0.33889 1.003 0.326 \n---\nSignif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1\n\nResidual standard error: 0.3088 on 24 degrees of freedom\nMultiple R-squared: 0.961,\tAdjusted R-squared: 0.9512 \nF-statistic: 98.47 on 6 and 24 DF, p-value: 1.059e-15\n```\n:::\n:::\n\n\n## Taking out everything non-significant\n\n- Try: remove everything but pct.a.surf and log.viscosity:\n\n\\footnotesize\n\n\n::: {.cell}\n\n```{.r .cell-code}\nrut.3 <- lm(log(rut.depth) ~ pct.a.surf + log(viscosity), data = asphalt)\nsummary(rut.3)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n\nCall:\nlm(formula = log(rut.depth) ~ pct.a.surf + log(viscosity), data = asphalt)\n\nResiduals:\n Min 1Q Median 3Q Max \n-0.61938 -0.21361 0.06635 0.14932 0.63012 \n\nCoefficients:\n Estimate Std. Error t value Pr(>|t|) \n(Intercept) 0.90014 1.08059 0.833 0.4119 \npct.a.surf 0.39115 0.21879 1.788 0.0846 . \nlog(viscosity) -0.61856 0.02713 -22.797 <2e-16 ***\n---\nSignif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 
0.1 ' ' 1\n\nResidual standard error: 0.3208 on 28 degrees of freedom\nMultiple R-squared: 0.9509,\tAdjusted R-squared: 0.9474 \nF-statistic: 270.9 on 2 and 28 DF, p-value: < 2.2e-16\n```\n:::\n:::\n\n\n\\normalsize\n\n\\footnotesize\n\n- Check that removing all those variables wasn't too much:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nanova(rut.3, rut.2)\n```\n\n::: {.cell-output-display}\n`````{=html}\n\n \n
\n`````\n:::\n:::\n\n\n\\normalsize\n\n- $H_0$ : two models equally good; $H_a$ : bigger model better.\n- Null not rejected here; small model as good as the big one, so\n prefer simpler smaller model `rut.3`.\n\n## Find the largest P-value by eye:\n\n\n::: {.cell}\n\n```{.r .cell-code}\ntidy(rut.2)\n```\n\n::: {.cell-output-display}\n`````{=html}\n\n \n
\n`````\n:::\n:::\n\n\n- Largest P-value is 0.78 for `pct.a.base`, not significant.\n- So remove this first, re-fit and re-assess.\n- Or, as over.\n\n## Get the computer to find the largest P-value for you\n\n- Output from `tidy` is itself a data frame, thus:\n\n\n::: {.cell}\n\n```{.r .cell-code}\ntidy(rut.2) %>% arrange(p.value)\n```\n\n::: {.cell-output-display}\n`````{=html}\n\n \n
\n`````\n:::\n:::\n\n\n- Largest P-value at the bottom.\n\n## Take out `pct.a.base`\n\n- Copy and paste the `lm` code and remove what you're removing:\n\n\\small\n\n\n::: {.cell}\n\n```{.r .cell-code}\nrut.4 <- lm(log(rut.depth) ~ pct.a.surf + fines + voids + \n log(viscosity) + run, data = asphalt)\ntidy(rut.4) %>% arrange(p.value) %>% dplyr::select(term, p.value)\n```\n\n::: {.cell-output-display}\n`````{=html}\n\n \n
\n`````\n:::\n:::\n\n\n\\normalsize\n\n- `fines` is next to go, P-value 0.32.\n\n## \"Update\"\n\nAnother way to do the same thing:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nrut.4 <- update(rut.2, . ~ . - pct.a.base)\ntidy(rut.4) %>% arrange(p.value)\n```\n\n::: {.cell-output-display}\n`````{=html}\n\n \n
\n`````\n:::\n:::\n\n\n- Again, `fines` is the one to go. (Output identical as it should be.)\n\n## Take out fines:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nrut.5 <- update(rut.4, . ~ . - fines)\ntidy(rut.5) %>% arrange(p.value) %>% dplyr::select(term, p.value)\n```\n\n::: {.cell-output-display}\n`````{=html}\n\n \n
\n`````\n:::\n:::\n\n\nCan't take out intercept, so `run`, with P-value 0.36, goes next.\n\n## Take out run:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nrut.6 <- update(rut.5, . ~ . - run)\ntidy(rut.6) %>% arrange(p.value) %>% dplyr::select(term, p.value)\n```\n\n::: {.cell-output-display}\n`````{=html}\n\n \n
\n`````\n:::\n:::\n\n\nAgain, can't take out intercept, so largest P-value is for `voids`,\n0.044. But this is significant, so we shouldn't remove `voids`.\n\n## Comments\n\n- Here we stop: `pct.a.surf`, `voids` and `log.viscosity` would all\n make fit significantly worse if removed. So they stay.\n- Different final result from taking things out one at a time (top),\n than by taking out 4 at once (bottom):\n\n\n::: {.cell}\n\n```{.r .cell-code}\nsummary(rut.6)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n\nCall:\nlm(formula = log(rut.depth) ~ pct.a.surf + voids + log(viscosity), \n data = asphalt)\n\nResiduals:\n Min 1Q Median 3Q Max \n-0.53548 -0.20181 -0.01702 0.16748 0.54707 \n\nCoefficients:\n Estimate Std. Error t value Pr(>|t|) \n(Intercept) -1.02079 1.36430 -0.748 0.4608 \npct.a.surf 0.55547 0.22044 2.520 0.0180 * \nvoids 0.24479 0.11560 2.118 0.0436 * \nlog(viscosity) -0.64649 0.02879 -22.458 <2e-16 ***\n---\nSignif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1\n\nResidual standard error: 0.3025 on 27 degrees of freedom\nMultiple R-squared: 0.9579,\tAdjusted R-squared: 0.9532 \nF-statistic: 204.6 on 3 and 27 DF, p-value: < 2.2e-16\n```\n:::\n\n```{.r .cell-code}\ncoef(rut.6)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n (Intercept) pct.a.surf voids log(viscosity) \n -1.0207945 0.5554686 0.2447934 -0.6464911 \n```\n:::\n\n```{.r .cell-code}\ncoef(rut.3)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n (Intercept) pct.a.surf log(viscosity) \n 0.9001389 0.3911481 -0.6185628 \n```\n:::\n:::\n\n\n- Point: Can make difference which way we go.\n\n## Comments on variable selection\n\n- Best way to decide which $x$'s belong: expert knowledge: which of\n them should be important.\n- Best automatic method: what we did, \"backward selection\".\n- Do not learn about \"stepwise regression\"! 
[**eg.\n here**](https://towardsdatascience.com/stopping-stepwise-why-stepwise-selection-is-bad-and-what-you-should-use-instead-90818b3f52df)\n- R has function `step` that does backward selection, like this:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nstep(rut.2, direction = \"backward\", test = \"F\")\n```\n:::\n\n\nGets same answer as we did (by removing least significant x).\n\n- Removing non-significant $x$'s may remove interesting ones whose\n P-values happened not to reach 0.05. Consider using less stringent\n cutoff like 0.20 or even bigger.\n- Can also fit all possible regressions, as over (may need to do\n `install.packages(\"leaps\")` first).\n\n## All possible regressions (output over)\n\nUses package `leaps`:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nleaps <- regsubsets(log(rut.depth) ~ pct.a.surf + \n pct.a.base + fines + voids + \n log(viscosity) + run, \n data = asphalt, nbest = 2)\ns <- summary(leaps)\nwith(s, data.frame(rsq, outmat)) -> d\n```\n:::\n\n\n## The output\n\n\n::: {.cell}\n\n:::\n\n\n\\scriptsize\n\n\n::: {.cell}\n\n```{.r .cell-code}\nd %>% rownames_to_column(\"model\") %>% arrange(desc(rsq))\n```\n\n::: {.cell-output-display}\n`````{=html}\n\n \n
\n`````\n:::\n:::\n\n\n\\normalsize\n\n\n::: {.cell}\n\n:::\n\n\n## Comments\n\n- Problem: even adding a worthless x increases R-squared. So try for\n line where R-squared stops increasing \"too much\", eg. top line (just\n log.viscosity), first 3-variable line (backwards-elimination model).\n Hard to judge.\n- One solution (STAC67): adjusted R-squared, where adding worthless\n variable makes it go down.\n- `data.frame` rather than `tibble` because there are several columns\n in `outmat`.\n\n## All possible regressions, adjusted R-squared\n\n\n::: {.cell}\n\n:::\n\n\n\\scriptsize\n\n\n::: {.cell}\n\n```{.r .cell-code}\nwith(s, data.frame(adjr2, outmat)) %>% \n rownames_to_column(\"model\") %>% \n arrange(desc(adjr2))\n```\n\n::: {.cell-output-display}\n`````{=html}\n\n \n
\n`````\n:::\n:::\n\n\n\\normalsize\n\n\n::: {.cell}\n\n:::\n\n\n## Revisiting the best model\n\n- Best model was our rut.6:\n\n\n::: {.cell}\n\n```{.r .cell-code}\ntidy(rut.6)\n```\n\n::: {.cell-output-display}\n`````{=html}\n\n \n
\n`````\n:::\n:::\n\n\n## Revisiting (2)\n\n- Regression slopes say that rut depth increases as log-viscosity\n decreases, `pct.a.surf` increases and `voids` increases. This more\n or less checks out with our scatterplots against `log.viscosity`.\n- We should check residual plots again, though previous scatterplots\n say it's unlikely that there will be a problem:\n\n\n::: {.cell}\n\n```{.r .cell-code}\ng <- ggplot(rut.6, aes(y = .resid, x = .fitted)) + \ngeom_point()\n```\n:::\n\n\n## Residuals against fitted values\n\n\n::: {.cell}\n\n```{.r .cell-code}\ng\n```\n\n::: {.cell-output-display}\n![](asphalt_files/figure-revealjs/asphalt-39-1.png){width=960}\n:::\n:::\n\n::: {.cell}\n\n```{.r .cell-code}\nggplot(rut.6, aes(sample = .resid)) + stat_qq() + stat_qq_line()\n```\n\n::: {.cell-output-display}\n![](asphalt_files/figure-revealjs/unnamed-chunk-2-1.png){width=960}\n:::\n:::\n\n\n## Plotting residuals against x's\n\n- Do our trick again to put them all on one plot:\n\n\n::: {.cell}\n\n```{.r .cell-code}\naugment(rut.6, asphalt) %>%\n mutate(log_vis=log(viscosity)) %>% \n pivot_longer(\n c(pct.a.surf:voids, run, log_vis),\n names_to=\"xname\", values_to=\"x\",\n ) %>%\n ggplot(aes(y = .resid, x = x)) + geom_point() +\n facet_wrap(~xname, scales = \"free\") -> g2\n```\n:::\n\n\n## Residuals against the x's\n\n\n::: {.cell}\n\n```{.r .cell-code}\ng2\n```\n\n::: {.cell-output-display}\n![](asphalt_files/figure-revealjs/asphalt-41-1.png){width=960}\n:::\n:::\n\n\n## Comments\n\n- None of the plots show any sort of pattern. The points all look\n random on each plot.\n- On the plot of fitted values (and on the one of log.viscosity), the\n points seem to form a \"left half\" and a \"right half\" with a gap in\n the middle. 
This is not a concern.\n- One of the pct.a.surf values is low outlier (4), shows up top left\n of that plot.\n- Only two possible values of run; the points in each group look\n randomly scattered around 0, with equal spreads.\n- Residuals seem to go above zero further than below, suggesting a\n mild non-normality, but not enough to be a problem.\n\n## Variable-selection strategies\n\n- Expert knowledge.\n- Backward elimination.\n- All possible regressions.\n- Taking a variety of models to experts and asking their opinion.\n- Use a looser cutoff to eliminate variables in backward elimination\n (eg. only if P-value greater than 0.20).\n- If goal is prediction, eliminating worthless variables less\n important.\n- If goal is understanding, want to eliminate worthless variables\n where possible.\n- Results of variable selection not always reproducible, so caution\n advised.\n",
"supporting": [
"asphalt_files/figure-revealjs"
],
diff --git a/_freeze/asphalt/execute-results/tex.json b/_freeze/asphalt/execute-results/tex.json
index 014b9f4..2573462 100644
--- a/_freeze/asphalt/execute-results/tex.json
+++ b/_freeze/asphalt/execute-results/tex.json
@@ -1,7 +1,7 @@
{
- "hash": "25e7e208e7eefbed7de9dbae587937fa",
+ "hash": "6c8eba076b49aa7fbc7a066250547c9f",
"result": {
- "markdown": "---\ntitle: \"Case study: asphalt\"\n---\n\n\n\n## The asphalt data\n- 31 asphalt pavements prepared under different conditions. How does\nquality of pavement depend on these?\n- Variables:\n - `pct.a.surf` Percentage of asphalt in surface layer\n - `pct.a.base` Percentage of asphalt in base layer\n - `fines` Percentage of fines in surface layer\n - `voids` Percentage of voids in surface layer\n - `rut.depth` Change in rut depth per million vehicle passes\n - `viscosity` Viscosity of asphalt\n - `run` 2 data collection periods: 1 for run 1, 0 for run 2.\n- `rut.depth` response. Depends on other variables, how?\n\n## Packages for this section\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(MASS)\nlibrary(tidyverse)\nlibrary(broom)\nlibrary(leaps)\n```\n:::\n\n\n\nMake sure to load `MASS` before `tidyverse` (for annoying technical reasons).\n\n## Getting set up \n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nmy_url <- \"http://ritsokiguess.site/datafiles/asphalt.txt\"\nasphalt <- read_delim(my_url, \" \")\n```\n:::\n\n\n\n- Quantitative variables with one response: multiple regression.\n- Some issues here that don’t come up in “simple” regression; handle as\nwe go. 
(STAB27/STAC67 ideas.)\n\n## The data (some)\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nasphalt\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n# A tibble: 31 x 7\n pct.a.surf pct.a.base fines voids rut.depth viscosity run\n \n 1 4.68 4.87 8.4 4.92 6.75 2.8 1\n 2 5.19 4.5 6.5 4.56 13 1.4 1\n 3 4.82 4.73 7.9 5.32 14.8 1.4 1\n 4 4.85 4.76 8.3 4.86 12.6 3.3 1\n 5 4.86 4.95 8.4 3.78 8.25 1.7 1\n 6 5.16 4.45 7.4 4.40 10.7 2.9 1\n 7 4.82 5.05 6.8 4.87 7.28 3.7 1\n 8 4.86 4.7 8.6 4.83 12.7 1.7 1\n 9 4.78 4.84 6.7 4.86 12.6 0.92 1\n10 5.16 4.76 7.7 4.03 20.6 0.68 1\n# i 21 more rows\n```\n:::\n:::\n\n\n\n## Plotting response “rut depth” against everything else\n\nSame idea as for plotting separate predictions on one plot:\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nasphalt %>%\n pivot_longer(\n -rut.depth,\n names_to=\"xname\", values_to=\"x\"\n ) %>%\n ggplot(aes(x = x, y = rut.depth)) + geom_point() +\n facet_wrap(~xname, scales = \"free\") -> g\n```\n:::\n\n\n\n“collect all the x-variables together into one column called x, with another\ncolumn xname saying which x they were, then plot these x’s against\nrut.depth, a separate facet for each x-variable.”\n\nI saved this graph to plot later (on the next page).\n\n## The plot\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\ng\n```\n\n::: {.cell-output-display}\n![](asphalt_files/figure-beamer/asphalt-5-1.pdf)\n:::\n:::\n\n\n\n## Interpreting the plots\n- One plot of rut depth against each of the six other variables.\n- Get rough idea of what’s going on.\n- Trends mostly weak.\n- `viscosity` has strong but non-linear trend.\n- `run` has effect but variability bigger when run is 1.\n- Weak but downward trend for `voids`.\n- Non-linearity of `rut.depth`-`viscosity` relationship should concern\nus.\n\n## Log of `viscosity`: more nearly linear?\n\n- Take this back to asphalt engineer: suggests log of `viscosity`:\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nggplot(asphalt, aes(y = rut.depth, x = log(viscosity))) +\n geom_point() + 
geom_smooth(se = F)\n```\n:::\n\n\n\n(plot overleaf)\n\n## Rut depth against log-viscosity\n\n\n\n::: {.cell}\n::: {.cell-output-display}\n![](asphalt_files/figure-beamer/asphalt-6-1.pdf)\n:::\n:::\n\n\n\n## Comments and next steps\n- Not very linear, but better than before.\n- In multiple regression, hard to guess which x’s affect response. So\ntypically start by predicting from everything else.\n- Model formula has response on left, squiggle, explanatories on right\njoined by plusses:\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nrut.1 <- lm(rut.depth ~ pct.a.surf + pct.a.base + fines +\n voids + log(viscosity) + run, data = asphalt)\n```\n:::\n\n\n\n## Regression output: `summary(rut.1)` or:\n\n\\footnotesize\n\n\n::: {.cell}\n\n```{.r .cell-code}\nglance(rut.1)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n# A tibble: 1 x 12\n r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC\n \n1 0.806 0.758 3.32 16.6 0.000000174 6 -77.3 171. 182.\n# i 3 more variables: deviance , df.residual , nobs \n```\n:::\n\n```{.r .cell-code}\ntidy(rut.1)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n# A tibble: 7 x 5\n term estimate std.error statistic p.value\n \n1 (Intercept) -13.0 26.2 -0.496 0.625 \n2 pct.a.surf 3.97 2.50 1.59 0.125 \n3 pct.a.base 1.26 3.97 0.318 0.753 \n4 fines 0.116 1.01 0.115 0.909 \n5 voids 0.589 1.32 0.445 0.660 \n6 log(viscosity) -3.15 0.919 -3.43 0.00220\n7 run -1.97 3.65 -0.539 0.595 \n```\n:::\n:::\n\n\n\\normalsize\n\n## Comments\n- R-squared 81%, not so bad. 
\n- P-value in `glance` asserts that something helping to predict\nrut.depth.\n- Table of coefficients says `log(viscosity)`.\n- But confused by clearly non-significant variables: remove those to get\nclearer picture of what is helpful.\n- Before we do anything, look at residual plots:\n - (a) of residuals against fitted values (as usual)\n - (b) of residuals against each explanatory.\n- Problem fixes:\n - with (a): fix response variable; \n - with some plots in (b): fix those explanatory variables.\n\n## Plot fitted values against residuals\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nggplot(rut.1, aes(x = .fitted, y = .resid)) + geom_point()\n```\n\n::: {.cell-output-display}\n![](asphalt_files/figure-beamer/asphalt-9-1.pdf)\n:::\n:::\n\n\n\n## Plotting residuals against $x$ variables\n- Problem here is that residuals are in the fitted model, and the\nobserved $x$-values are in the original data frame `asphalt`. \n- Package broom contains a function `augment` that combines these two\ntogether so that they can later be plotted: start with a model first, and then augment with a\ndata frame:\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nrut.1 %>% augment(asphalt) -> rut.1a\n```\n:::\n\n\n\n\n## What does rut.1a contain?\n\n\n\n::: {.cell}\n\n:::\n\n::: {.cell}\n\n```{.r .cell-code}\nnames(rut.1a)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n [1] \"pct.a.surf\" \"pct.a.base\" \"fines\" \"voids\" \"rut.depth\" \n [6] \"viscosity\" \"run\" \".fitted\" \".resid\" \".hat\" \n[11] \".sigma\" \".cooksd\" \".std.resid\"\n```\n:::\n:::\n\n\n\n- all the stuff in original data frame, plus:\n- quantities from regression (starting with a dot)\n\n\n## Plotting residuals against $x$-variables \n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nrut.1a %>%\n mutate(log_vis=log(viscosity)) %>% \n pivot_longer(\n c(pct.a.surf:voids, run, log_vis),\n names_to=\"xname\", values_to=\"x\"\n ) %>%\n ggplot(aes(x = x, y = .resid)) +\n geom_point() + facet_wrap(~xname, scales = \"free\") -> 
g\n```\n:::\n\n\n\n## The plot\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\ng\n```\n\n::: {.cell-output-display}\n![](asphalt_files/figure-beamer/asphalt-14-1.pdf)\n:::\n:::\n\n\n\n## Comments\n- There is serious curve in plot of residuals vs. fitted values. Suggests a\ntransformation of $y$. \n- The residuals-vs-$x$’s plots don’t show any serious trends. Worst\nprobably that potential curve against log-viscosity.\n- Also, large positive residual, 10, that shows up on all plots. Perhaps\ntransformation of $y$ will help with this too.\n- If residual-fitted plot OK, but some residual-$x$ plots not, try\ntransforming those $x$’s, eg. by adding $x^2$ to help with curve.\n\n## Which transformation?\n- Best way: consult with person who brought you the data.\n- Can’t do that here!\n- No idea what transformation would be good.\n- Let data choose: “Box-Cox transformation”.\n- Scale is that of “ladder of powers”: power transformation, but 0 is\nlog.\n\n\n## Running Box-Cox\n\nFrom package `MASS`:\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nboxcox(rut.depth ~ pct.a.surf + pct.a.base + fines + voids +\n log(viscosity) + run, data = asphalt)\n```\n\n::: {.cell-output-display}\n![](asphalt_files/figure-beamer/asphalt-15-1.pdf)\n:::\n:::\n\n\n\n## Comments on Box-Cox plot\n- $\\lambda$ represents power to transform $y$ with.\n- Best single choice of transformation parameter $\\lambda$ is peak of curve,\nclose to 0.\n- Vertical dotted lines give CI for $\\lambda$, about (−0.05, 0.2).\n- $\\lambda = 0$ means “log”.\n- Narrowness of confidence interval mean that these not supported by\ndata:\n - No transformation ($\\lambda = 1$)\n - Square root ($\\lambda = 0.5$)\n - Reciprocal ($\\lambda = −1$).\n\n## Relationships with explanatories\n- As before: plot response (now `log(rut.depth)`) against other\nexplanatory variables, all in one shot:\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nasphalt %>%\n mutate(log_vis=log(viscosity)) %>% \n pivot_longer(\n c(pct.a.surf:voids, run, log_vis),\n 
names_to=\"xname\", values_to=\"x\"\n ) %>%\n ggplot(aes(y = log(rut.depth), x = x)) + geom_point() +\n facet_wrap(~xname, scales = \"free\") -> g3\n```\n:::\n\n\n\n## The new plots\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\ng3\n```\n\n::: {.cell-output-display}\n![](asphalt_files/figure-beamer/asphalt-17-1.pdf)\n:::\n:::\n\n\n\n## Modelling with transformed response\n- These trends look pretty straight, especially with `log.viscosity`.\n- Values of `log.rut.depth` for each `run` have same spread.\n- Other trends weak, but are straight if they exist.\n- Start modelling from the beginning again.\n- Model `log.rut.depth` in terms of everything else, see what can be\nremoved:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nrut.2 <- lm(log(rut.depth) ~ pct.a.surf + pct.a.base +\n fines + voids + log(viscosity) + run, data = asphalt)\n```\n:::\n\n\n\n- use `tidy` from `broom` to display just the coefficients.\n\n## Output\n\n\n::: {.cell}\n\n```{.r .cell-code}\ntidy(rut.2)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n# A tibble: 7 x 5\n term estimate std.error statistic p.value\n \n1 (Intercept) -1.57 2.44 -0.646 0.525 \n2 pct.a.surf 0.584 0.232 2.52 0.0190 \n3 pct.a.base -0.103 0.369 -0.280 0.782 \n4 fines 0.0978 0.0941 1.04 0.309 \n5 voids 0.199 0.123 1.62 0.119 \n6 log(viscosity) -0.558 0.0854 -6.53 0.000000945\n7 run 0.340 0.339 1.00 0.326 \n```\n:::\n:::\n\n\n\n## Taking out everything non-significant\n- Try: remove everything but pct.a.surf and log.viscosity:\n\n\\footnotesize\n\n\n::: {.cell}\n\n```{.r .cell-code}\nrut.3 <- lm(log(rut.depth) ~ pct.a.surf + log(viscosity), data = asphalt)\n```\n:::\n\n\n\\normalsize\n\n\\footnotesize\n- Check that removing all those variables wasn’t too much:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nanova(rut.3, rut.2)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\nAnalysis of Variance Table\n\nModel 1: log(rut.depth) ~ pct.a.surf + log(viscosity)\nModel 2: log(rut.depth) ~ pct.a.surf + pct.a.base + fines + voids + 
log(viscosity) + \n run\n Res.Df RSS Df Sum of Sq F Pr(>F)\n1 28 2.8809 \n2 24 2.2888 4 0.59216 1.5523 0.2191\n```\n:::\n:::\n\n\n\\normalsize\n\n- $H_0$ : two models equally good; $H_a$ : bigger model better.\n- Null not rejected here; small model as good as the big one, so prefer\nsimpler smaller model `rut.3`.\n\n## Find the largest P-value by eye:\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\ntidy(rut.2)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n# A tibble: 7 x 5\n term estimate std.error statistic p.value\n \n1 (Intercept) -1.57 2.44 -0.646 0.525 \n2 pct.a.surf 0.584 0.232 2.52 0.0190 \n3 pct.a.base -0.103 0.369 -0.280 0.782 \n4 fines 0.0978 0.0941 1.04 0.309 \n5 voids 0.199 0.123 1.62 0.119 \n6 log(viscosity) -0.558 0.0854 -6.53 0.000000945\n7 run 0.340 0.339 1.00 0.326 \n```\n:::\n:::\n\n\n\n- Largest P-value is 0.78 for `pct.a.base`, not significant.\n- So remove this first, re-fit and re-assess.\n- Or, as over.\n\n## Get the computer to find the largest P-value for you\n\n- Output from `tidy` is itself a data frame, thus:\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\ntidy(rut.2) %>% arrange(p.value)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n# A tibble: 7 x 5\n term estimate std.error statistic p.value\n \n1 log(viscosity) -0.558 0.0854 -6.53 0.000000945\n2 pct.a.surf 0.584 0.232 2.52 0.0190 \n3 voids 0.199 0.123 1.62 0.119 \n4 fines 0.0978 0.0941 1.04 0.309 \n5 run 0.340 0.339 1.00 0.326 \n6 (Intercept) -1.57 2.44 -0.646 0.525 \n7 pct.a.base -0.103 0.369 -0.280 0.782 \n```\n:::\n:::\n\n\n\n- Largest P-value at the bottom.\n\n## Take out `pct.a.base`\n\n- Copy and paste the `lm` code and remove what you're removing:\n\n\\small\n\n\n::: {.cell}\n\n```{.r .cell-code}\nrut.4 <- lm(log(rut.depth) ~ pct.a.surf + fines + voids + \n log(viscosity) + run, data = asphalt)\ntidy(rut.4) %>% arrange(p.value)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n# A tibble: 6 x 5\n term estimate std.error statistic p.value\n \n1 log(viscosity) -0.552 
0.0818 -6.75 0.000000448\n2 pct.a.surf 0.593 0.225 2.63 0.0143 \n3 voids 0.200 0.121 1.66 0.109 \n4 (Intercept) -2.08 1.61 -1.29 0.208 \n5 run 0.360 0.325 1.11 0.279 \n6 fines 0.0889 0.0870 1.02 0.316 \n```\n:::\n:::\n\n\n\\normalsize\n\n- `fines` is next to go, P-value 0.32.\n\n## “Update”\n\nAnother way to do the same thing:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nrut.4 <- update(rut.2, . ~ . - pct.a.base)\ntidy(rut.4) %>% arrange(p.value)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n# A tibble: 6 x 5\n term estimate std.error statistic p.value\n \n1 log(viscosity) -0.552 0.0818 -6.75 0.000000448\n2 pct.a.surf 0.593 0.225 2.63 0.0143 \n3 voids 0.200 0.121 1.66 0.109 \n4 (Intercept) -2.08 1.61 -1.29 0.208 \n5 run 0.360 0.325 1.11 0.279 \n6 fines 0.0889 0.0870 1.02 0.316 \n```\n:::\n:::\n\n\n\n- Again, `fines` is the one to go. (Output identical as it should be.)\n\n## Take out fines:\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nrut.5 <- update(rut.4, . ~ . - fines)\ntidy(rut.5) %>% arrange(p.value)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n# A tibble: 5 x 5\n term estimate std.error statistic p.value\n \n1 log(viscosity) -0.580 0.0772 -7.52 0.0000000559\n2 pct.a.surf 0.548 0.221 2.48 0.0200 \n3 voids 0.232 0.117 1.99 0.0577 \n4 run 0.295 0.319 0.923 0.365 \n5 (Intercept) -1.26 1.39 -0.902 0.375 \n```\n:::\n:::\n\n\n\nCan’t take out intercept, so `run`, with P-value 0.36, goes next.\n\n## Take out run:\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nrut.6 <- update(rut.5, . ~ . - run)\ntidy(rut.6) %>% arrange(p.value)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n# A tibble: 4 x 5\n term estimate std.error statistic p.value\n \n1 log(viscosity) -0.646 0.0288 -22.5 5.29e-19\n2 pct.a.surf 0.555 0.220 2.52 1.80e- 2\n3 voids 0.245 0.116 2.12 4.36e- 2\n4 (Intercept) -1.02 1.36 -0.748 4.61e- 1\n```\n:::\n:::\n\n\n\nAgain, can’t take out intercept, so largest P-value is for `voids`, 0.044. 
But\nthis is significant, so we shouldn’t remove `voids`.\n\n## Comments\n- Here we stop: `pct.a.surf`, `voids` and `log.viscosity` would all\nmake fit significantly worse if removed. So they stay.\n- Different final result from taking things out one at a time (top), than\nby taking out 4 at once (bottom):\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\ncoef(rut.6)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n (Intercept) pct.a.surf voids log(viscosity) \n -1.0207945 0.5554686 0.2447934 -0.6464911 \n```\n:::\n\n```{.r .cell-code}\ncoef(rut.3)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n (Intercept) pct.a.surf log(viscosity) \n 0.9001389 0.3911481 -0.6185628 \n```\n:::\n:::\n\n\n\n- Point: Can make difference which way we go.\n\n## Comments on variable selection\n- Best way to decide which $x$’s belong: expert knowledge: which of\nthem should be important.\n- Best automatic method: what we did, “backward selection”.\n- Do not learn about “stepwise regression”! [**eg. here**](https://towardsdatascience.com/stopping-stepwise-why-stepwise-selection-is-bad-and-what-you-should-use-instead-90818b3f52df)\n- R has function `step` that does backward selection, like this:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nstep(rut.2, direction = \"backward\", test = \"F\")\n```\n:::\n\n\n\nGets same answer as we did (by removing least significant x). \n\n- Removing non-significant $x$’s may remove interesting ones whose\nP-values happened not to reach 0.05. 
Consider using less stringent\ncutoff like 0.20 or even bigger.\n- Can also fit all possible regressions, as over (may need to do\n`install.packages(\"leaps\")` first).\n\n## All possible regressions (output over)\n\nUses package `leaps`:\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nleaps <- regsubsets(log(rut.depth) ~ pct.a.surf + \n pct.a.base + fines + voids + \n log(viscosity) + run, \n data = asphalt, nbest = 2)\ns <- summary(leaps)\nwith(s, data.frame(rsq, outmat)) -> d\n```\n:::\n\n\n\n## The output\n\n\n\n::: {.cell}\n\n:::\n\n\n\n\n\\scriptsize\n\n\n::: {.cell}\n\n```{.r .cell-code}\nd %>% rownames_to_column(\"model\") %>% arrange(desc(rsq))\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n model rsq pct.a.surf pct.a.base fines voids log.viscosity. run\n1 6 ( 1 ) 0.9609642 * * * * * *\n2 5 ( 1 ) 0.9608365 * * * * *\n3 5 ( 2 ) 0.9593265 * * * * * \n4 4 ( 1 ) 0.9591996 * * * *\n5 4 ( 2 ) 0.9589206 * * * * \n6 3 ( 1 ) 0.9578631 * * * \n7 3 ( 2 ) 0.9534561 * * * \n8 2 ( 1 ) 0.9508647 * * \n9 2 ( 2 ) 0.9479541 * * \n10 1 ( 1 ) 0.9452562 * \n11 1 ( 2 ) 0.8624107 *\n```\n:::\n:::\n\n\n\\normalsize\n\n\n\n::: {.cell}\n\n:::\n\n\n\n\n## Comments\n- Problem: even adding a worthless x increases R-squared. So try for\nline where R-squared stops increasing “too much”, eg. top line (just\nlog.viscosity), first 3-variable line (backwards-elimination model).\nHard to judge.\n- One solution (STAC67): adjusted R-squared, where adding worthless\nvariable makes it go down.\n- `data.frame` rather than `tibble` because there are several columns in\n`outmat`. \n\n## All possible regressions, adjusted R-squared\n\n\n\n::: {.cell}\n\n:::\n\n\n\n\\scriptsize\n\n\n::: {.cell}\n\n```{.r .cell-code}\nwith(s, data.frame(adjr2, outmat)) %>% \n rownames_to_column(\"model\") %>% \n arrange(desc(adjr2))\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n model adjr2 pct.a.surf pct.a.base fines voids log.viscosity. 
run\n1 3 ( 1 ) 0.9531812 * * * \n2 5 ( 1 ) 0.9530038 * * * * *\n3 4 ( 1 ) 0.9529226 * * * *\n4 4 ( 2 ) 0.9526007 * * * * \n5 6 ( 1 ) 0.9512052 * * * * * *\n6 5 ( 2 ) 0.9511918 * * * * * \n7 3 ( 2 ) 0.9482845 * * * \n8 2 ( 1 ) 0.9473550 * * \n9 2 ( 2 ) 0.9442365 * * \n10 1 ( 1 ) 0.9433685 * \n11 1 ( 2 ) 0.8576662 *\n```\n:::\n:::\n\n\n\\normalsize\n\n\n\n\n::: {.cell}\n\n:::\n\n\n\n## Revisiting the best model\n- Best model was our rut.6:\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\ntidy(rut.6)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n# A tibble: 4 x 5\n term estimate std.error statistic p.value\n \n1 (Intercept) -1.02 1.36 -0.748 4.61e- 1\n2 pct.a.surf 0.555 0.220 2.52 1.80e- 2\n3 voids 0.245 0.116 2.12 4.36e- 2\n4 log(viscosity) -0.646 0.0288 -22.5 5.29e-19\n```\n:::\n:::\n\n\n\n## Revisiting (2)\n- Regression slopes say that rut depth increases as log-viscosity\ndecreases, `pct.a.surf` increases and `voids` increases. This more or\nless checks out with out scatterplots against `log.viscosity`. 
\n- We should check residual plots again, though previous scatterplots say\nit’s unlikely that there will be a problem:\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\ng <- ggplot(rut.6, aes(y = .resid, x = .fitted)) + \ngeom_point()\n```\n:::\n\n\n\n## Residuals against fitted values\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\ng\n```\n\n::: {.cell-output-display}\n![](asphalt_files/figure-beamer/asphalt-39-1.pdf)\n:::\n:::\n\n\n\n## Plotting residuals against x’s\n- Do our trick again to put them all on one plot:\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\naugment(rut.6, asphalt) %>%\n mutate(log_vis=log(viscosity)) %>% \n pivot_longer(\n c(pct.a.surf:voids, run, log_vis),\n names_to=\"xname\", values_to=\"x\",\n ) %>%\n ggplot(aes(y = .resid, x = x)) + geom_point() +\n facet_wrap(~xname, scales = \"free\") -> g2\n```\n:::\n\n\n\n## Residuals against the x’s\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\ng2\n```\n\n::: {.cell-output-display}\n![](asphalt_files/figure-beamer/asphalt-41-1.pdf)\n:::\n:::\n\n\n\n## Comments\n- None of the plots show any sort of pattern. The points all look\nrandom on each plot.\n- On the plot of fitted values (and on the one of log.viscosity), the\npoints seem to form a “left half” and a “right half” with a gap in the\nmiddle. 
This is not a concern.\n- One of the pct.a.surf values is low outlier (4), shows up top left of\nthat plot.\n- Only two possible values of run; the points in each group look\nrandomly scattered around 0, with equal spreads.\n- Residuals seem to go above zero further than below, suggesting a\nmild non-normality, but not enough to be a problem.\n\n## Variable-selection strategies\n- Expert knowledge.\n- Backward elimination.\n- All possible regressions.\n- Taking a variety of models to experts and asking their opinion.\n- Use a looser cutoff to eliminate variables in backward elimination (eg.\nonly if P-value greater than 0.20).\n- If goal is prediction, eliminating worthless variables less important.\n- If goal is understanding, want to eliminate worthless variables where\npossible.\n- Results of variable selection not always reproducible, so caution\nadvised.\n\n",
+ "markdown": "---\ntitle: \"Case study: asphalt\"\neditor: \n markdown: \n wrap: 72\n---\n\n\n\n## The asphalt data\n\n- 31 asphalt pavements prepared under different conditions. How does\n quality of pavement depend on these?\n- Variables:\n - `pct.a.surf` Percentage of asphalt in surface layer\n - `pct.a.base` Percentage of asphalt in base layer\n - `fines` Percentage of fines in surface layer\n - `voids` Percentage of voids in surface layer\n - `rut.depth` Change in rut depth per million vehicle passes\n - `viscosity` Viscosity of asphalt\n - `run` 2 data collection periods: 1 for run 1, 0 for run 2.\n- `rut.depth` response. Depends on other variables, how?\n\n## Packages for this section\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(MASS)\nlibrary(tidyverse)\nlibrary(broom)\nlibrary(leaps)\n```\n:::\n\n\n\nMake sure to load `MASS` before `tidyverse` (for annoying technical\nreasons).\n\n## Getting set up\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nmy_url <- \"http://ritsokiguess.site/datafiles/asphalt.txt\"\nasphalt <- read_delim(my_url, \" \")\n```\n:::\n\n\n\n- Quantitative variables with one response: multiple regression.\n- Some issues here that don't come up in \"simple\" regression; handle\n as we go. 
(STAB27/STAC67 ideas.)\n\n## The data (some)\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nasphalt\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n# A tibble: 31 x 7\n pct.a.surf pct.a.base fines voids rut.depth viscosity run\n \n 1 4.68 4.87 8.4 4.92 6.75 2.8 1\n 2 5.19 4.5 6.5 4.56 13 1.4 1\n 3 4.82 4.73 7.9 5.32 14.8 1.4 1\n 4 4.85 4.76 8.3 4.86 12.6 3.3 1\n 5 4.86 4.95 8.4 3.78 8.25 1.7 1\n 6 5.16 4.45 7.4 4.40 10.7 2.9 1\n 7 4.82 5.05 6.8 4.87 7.28 3.7 1\n 8 4.86 4.7 8.6 4.83 12.7 1.7 1\n 9 4.78 4.84 6.7 4.86 12.6 0.92 1\n10 5.16 4.76 7.7 4.03 20.6 0.68 1\n# i 21 more rows\n```\n:::\n:::\n\n\n\n## Plotting response \"rut depth\" against everything else\n\nSame idea as for plotting separate predictions on one plot:\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nasphalt %>%\n pivot_longer(\n -rut.depth,\n names_to=\"xname\", values_to=\"x\"\n ) %>%\n ggplot(aes(x = x, y = rut.depth)) + geom_point() +\n facet_wrap(~xname, scales = \"free\") -> g\n```\n:::\n\n\n\n\"collect all the x-variables together into one column called x, with\nanother column xname saying which x they were, then plot these x's\nagainst rut.depth, a separate facet for each x-variable.\"\n\nI saved this graph to plot later (on the next page).\n\n## The plot\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\ng\n```\n\n::: {.cell-output-display}\n![](asphalt_files/figure-beamer/asphalt-5-1.pdf)\n:::\n:::\n\n\n\n## Interpreting the plots\n\n- One plot of rut depth against each of the six other variables.\n- Get rough idea of what's going on.\n- Trends mostly weak.\n- `viscosity` has strong but non-linear trend.\n- `run` has effect but variability bigger when run is 1.\n- Weak but downward trend for `voids`.\n- Non-linearity of `rut.depth`-`viscosity` relationship should concern\n us.\n\n## Log of `viscosity`: more nearly linear?\n\n- Take this back to asphalt engineer: suggests log of `viscosity`:\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nggplot(asphalt, aes(y = rut.depth, x = log(viscosity))) +\n geom_point() 
+ geom_smooth(se = F) -> g\n```\n:::\n\n\n\n(plot overleaf)\n\n## Rut depth against log-viscosity\n\n\n\n::: {.cell}\n\n:::\n\n\n\n## Comments and next steps\n\n- Not very linear, but better than before.\n- In multiple regression, hard to guess which x's affect response. So\n typically start by predicting from everything else.\n- Model formula has response on left, squiggle, explanatories on right\n joined by plusses:\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nrut.1 <- lm(rut.depth ~ pct.a.surf + pct.a.base + fines +\n voids + log(viscosity) + run, data = asphalt)\nsummary(rut.1)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n\nCall:\nlm(formula = rut.depth ~ pct.a.surf + pct.a.base + fines + voids + \n log(viscosity) + run, data = asphalt)\n\nResiduals:\n Min 1Q Median 3Q Max \n-4.1211 -1.9075 -0.7175 1.6382 9.5947 \n\nCoefficients:\n Estimate Std. Error t value Pr(>|t|) \n(Intercept) -12.9937 26.2188 -0.496 0.6247 \npct.a.surf 3.9706 2.4966 1.590 0.1248 \npct.a.base 1.2631 3.9703 0.318 0.7531 \nfines 0.1164 1.0124 0.115 0.9094 \nvoids 0.5893 1.3244 0.445 0.6604 \nlog(viscosity) -3.1515 0.9194 -3.428 0.0022 **\nrun -1.9655 3.6472 -0.539 0.5949 \n---\nSignif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1\n\nResidual standard error: 3.324 on 24 degrees of freedom\nMultiple R-squared: 0.806,\tAdjusted R-squared: 0.7575 \nF-statistic: 16.62 on 6 and 24 DF, p-value: 1.743e-07\n```\n:::\n:::\n\n\n\n## Regression output: `summary(rut.1)` or:\n\n\\footnotesize\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nglance(rut.1)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n# A tibble: 1 x 12\n r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC\n \n1 0.806 0.758 3.32 16.6 0.000000174 6 -77.3 171. 
182.\n# i 3 more variables: deviance , df.residual , nobs \n```\n:::\n\n```{.r .cell-code}\ntidy(rut.1)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n# A tibble: 7 x 5\n term estimate std.error statistic p.value\n \n1 (Intercept) -13.0 26.2 -0.496 0.625 \n2 pct.a.surf 3.97 2.50 1.59 0.125 \n3 pct.a.base 1.26 3.97 0.318 0.753 \n4 fines 0.116 1.01 0.115 0.909 \n5 voids 0.589 1.32 0.445 0.660 \n6 log(viscosity) -3.15 0.919 -3.43 0.00220\n7 run -1.97 3.65 -0.539 0.595 \n```\n:::\n:::\n\n\n\n\\normalsize\n\n## Comments\n\n- R-squared 81%, not so bad.\n\n- P-value in `glance` asserts that something helping to predict\n rut.depth.\n\n- Table of coefficients says `log(viscosity)`.\n\n- But confused by clearly non-significant variables: remove those to\n get clearer picture of what is helpful.\n\n- Before we do anything, look at residual plots:\n\n - (a) of residuals against fitted values (as usual)\n - (b) of residuals against each explanatory.\n\n- Problem fixes:\n\n - with (a): fix response variable;\n - with some plots in (b): fix those explanatory variables.\n\n## Plot fitted values against residuals\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nggplot(rut.1, aes(x = .fitted, y = .resid)) + geom_point()\n```\n\n::: {.cell-output-display}\n![](asphalt_files/figure-beamer/asphalt-9-1.pdf)\n:::\n:::\n\n\n\n## Normal quantile plot of residuals\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nggplot(rut.1, aes(sample = .resid)) + stat_qq() + stat_qq_line()\n```\n\n::: {.cell-output-display}\n![](asphalt_files/figure-beamer/unnamed-chunk-1-1.pdf)\n:::\n:::\n\n\n\n## Plotting residuals against $x$ variables\n\n- Problem here is that residuals are in the fitted model, and the\n observed $x$-values are in the original data frame `asphalt`.\n- Package broom contains a function `augment` that combines these two\n together so that they can later be plotted: start with a model\n first, and then augment with a data frame:\n\n\n\n::: {.cell}\n\n```{.r 
.cell-code}\nrut.1 %>% augment(asphalt) -> rut.1a\n```\n:::\n\n\n\n## What does rut.1a contain?\n\n\n\n::: {.cell}\n\n:::\n\n::: {.cell}\n\n```{.r .cell-code}\nnames(rut.1a)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n [1] \"pct.a.surf\" \"pct.a.base\" \"fines\" \"voids\" \"rut.depth\" \n [6] \"viscosity\" \"run\" \".fitted\" \".resid\" \".hat\" \n[11] \".sigma\" \".cooksd\" \".std.resid\"\n```\n:::\n:::\n\n\n\n- all the stuff in original data frame, plus:\n- quantities from regression (starting with a dot)\n\n## Plotting residuals against $x$-variables\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nrut.1a %>%\n mutate(log_vis=log(viscosity)) %>% \n pivot_longer(\n c(pct.a.surf:voids, run, log_vis),\n names_to=\"xname\", values_to=\"x\"\n ) %>%\n ggplot(aes(x = x, y = .resid)) +\n geom_point() + facet_wrap(~xname, scales = \"free\") -> g\n```\n:::\n\n\n\n## The plot\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\ng\n```\n\n::: {.cell-output-display}\n![](asphalt_files/figure-beamer/asphalt-14-1.pdf)\n:::\n:::\n\n\n\n## Comments\n\n- There is serious curve in plot of residuals vs. fitted values.\n Suggests a transformation of $y$.\n- The residuals-vs-$x$'s plots don't show any serious trends. Worst\n probably that potential curve against log-viscosity.\n- Also, large positive residual, 10, that shows up on all plots.\n Perhaps transformation of $y$ will help with this too.\n- If residual-fitted plot OK, but some residual-$x$ plots not, try\n transforming those $x$'s, eg. 
by adding $x^2$ to help with curve.\n\n## Which transformation?\n\n- Best way: consult with person who brought you the data.\n- Can't do that here!\n- No idea what transformation would be good.\n- Let data choose: \"Box-Cox transformation\".\n- Scale is that of \"ladder of powers\": power transformation, but 0 is\n log.\n\n## Running Box-Cox\n\nFrom package `MASS`:\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nboxcox(rut.depth ~ pct.a.surf + pct.a.base + fines + voids +\n log(viscosity) + run, data = asphalt)\n```\n\n::: {.cell-output-display}\n![](asphalt_files/figure-beamer/asphalt-15-1.pdf)\n:::\n:::\n\n\n\n## Comments on Box-Cox plot\n\n- $\\lambda$ represents power to transform $y$ with.\n- Best single choice of transformation parameter $\\lambda$ is peak of\n curve, close to 0.\n- Vertical dotted lines give CI for $\\lambda$, about (−0.05, 0.2).\n- $\\lambda = 0$ means \"log\".\n- Narrowness of confidence interval means that these are not supported\n by data:\n - No transformation ($\\lambda = 1$)\n - Square root ($\\lambda = 0.5$)\n - Reciprocal ($\\lambda = −1$).\n\n## Relationships with explanatories\n\n- As before: plot response (now `log(rut.depth)`) against other\n explanatory variables, all in one shot:\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nasphalt %>%\n mutate(log_vis=log(viscosity)) %>% \n pivot_longer(\n c(pct.a.surf:voids, run, log_vis),\n names_to=\"xname\", values_to=\"x\"\n ) %>%\n ggplot(aes(y = log(rut.depth), x = x)) + geom_point() +\n facet_wrap(~xname, scales = \"free\") -> g3\n```\n:::\n\n\n\n## The new plots\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\ng3\n```\n\n::: {.cell-output-display}\n![](asphalt_files/figure-beamer/asphalt-17-1.pdf)\n:::\n:::\n\n\n\n## Modelling with transformed response\n\n- These trends look pretty straight, especially with `log.viscosity`.\n- Values of `log.rut.depth` for each `run` have same spread.\n- Other trends weak, but are straight if they exist.\n- Start modelling from the beginning again.\n- Model 
`log.rut.depth` in terms of everything else, see what can be\n removed:\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nrut.2 <- lm(log(rut.depth) ~ pct.a.surf + pct.a.base +\n fines + voids + log(viscosity) + run, data = asphalt)\n```\n:::\n\n\n\n- use `tidy` from `broom` to display just the coefficients.\n\n## Output\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\ntidy(rut.2)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n# A tibble: 7 x 5\n term estimate std.error statistic p.value\n \n1 (Intercept) -1.57 2.44 -0.646 0.525 \n2 pct.a.surf 0.584 0.232 2.52 0.0190 \n3 pct.a.base -0.103 0.369 -0.280 0.782 \n4 fines 0.0978 0.0941 1.04 0.309 \n5 voids 0.199 0.123 1.62 0.119 \n6 log(viscosity) -0.558 0.0854 -6.53 0.000000945\n7 run 0.340 0.339 1.00 0.326 \n```\n:::\n\n```{.r .cell-code}\nsummary(rut.2)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n\nCall:\nlm(formula = log(rut.depth) ~ pct.a.surf + pct.a.base + fines + \n voids + log(viscosity) + run, data = asphalt)\n\nResiduals:\n Min 1Q Median 3Q Max \n-0.53072 -0.18563 -0.00003 0.20017 0.55079 \n\nCoefficients:\n Estimate Std. Error t value Pr(>|t|) \n(Intercept) -1.57299 2.43617 -0.646 0.525 \npct.a.surf 0.58358 0.23198 2.516 0.019 * \npct.a.base -0.10337 0.36891 -0.280 0.782 \nfines 0.09775 0.09407 1.039 0.309 \nvoids 0.19885 0.12306 1.616 0.119 \nlog(viscosity) -0.55769 0.08543 -6.528 9.45e-07 ***\nrun 0.34005 0.33889 1.003 0.326 \n---\nSignif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 
0.1 ' ' 1\n\nResidual standard error: 0.3088 on 24 degrees of freedom\nMultiple R-squared: 0.961,\tAdjusted R-squared: 0.9512 \nF-statistic: 98.47 on 6 and 24 DF, p-value: 1.059e-15\n```\n:::\n:::\n\n\n\n## Taking out everything non-significant\n\n- Try: remove everything but pct.a.surf and log.viscosity:\n\n\\footnotesize\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nrut.3 <- lm(log(rut.depth) ~ pct.a.surf + log(viscosity), data = asphalt)\nsummary(rut.3)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n\nCall:\nlm(formula = log(rut.depth) ~ pct.a.surf + log(viscosity), data = asphalt)\n\nResiduals:\n Min 1Q Median 3Q Max \n-0.61938 -0.21361 0.06635 0.14932 0.63012 \n\nCoefficients:\n Estimate Std. Error t value Pr(>|t|) \n(Intercept) 0.90014 1.08059 0.833 0.4119 \npct.a.surf 0.39115 0.21879 1.788 0.0846 . \nlog(viscosity) -0.61856 0.02713 -22.797 <2e-16 ***\n---\nSignif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1\n\nResidual standard error: 0.3208 on 28 degrees of freedom\nMultiple R-squared: 0.9509,\tAdjusted R-squared: 0.9474 \nF-statistic: 270.9 on 2 and 28 DF, p-value: < 2.2e-16\n```\n:::\n:::\n\n\n\n\\normalsize\n\n\\footnotesize\n\n- Check that removing all those variables wasn't too much:\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nanova(rut.3, rut.2)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\nAnalysis of Variance Table\n\nModel 1: log(rut.depth) ~ pct.a.surf + log(viscosity)\nModel 2: log(rut.depth) ~ pct.a.surf + pct.a.base + fines + voids + log(viscosity) + \n run\n Res.Df RSS Df Sum of Sq F Pr(>F)\n1 28 2.8809 \n2 24 2.2888 4 0.59216 1.5523 0.2191\n```\n:::\n:::\n\n\n\n\\normalsize\n\n- $H_0$ : two models equally good; $H_a$ : bigger model better.\n- Null not rejected here; small model as good as the big one, so\n prefer simpler smaller model `rut.3`.\n\n## Find the largest P-value by eye:\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\ntidy(rut.2)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n# A tibble: 7 x 5\n term estimate 
std.error statistic p.value\n \n1 (Intercept) -1.57 2.44 -0.646 0.525 \n2 pct.a.surf 0.584 0.232 2.52 0.0190 \n3 pct.a.base -0.103 0.369 -0.280 0.782 \n4 fines 0.0978 0.0941 1.04 0.309 \n5 voids 0.199 0.123 1.62 0.119 \n6 log(viscosity) -0.558 0.0854 -6.53 0.000000945\n7 run 0.340 0.339 1.00 0.326 \n```\n:::\n:::\n\n\n\n- Largest P-value is 0.78 for `pct.a.base`, not significant.\n- So remove this first, re-fit and re-assess.\n- Or, as over.\n\n## Get the computer to find the largest P-value for you\n\n- Output from `tidy` is itself a data frame, thus:\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\ntidy(rut.2) %>% arrange(p.value)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n# A tibble: 7 x 5\n term estimate std.error statistic p.value\n \n1 log(viscosity) -0.558 0.0854 -6.53 0.000000945\n2 pct.a.surf 0.584 0.232 2.52 0.0190 \n3 voids 0.199 0.123 1.62 0.119 \n4 fines 0.0978 0.0941 1.04 0.309 \n5 run 0.340 0.339 1.00 0.326 \n6 (Intercept) -1.57 2.44 -0.646 0.525 \n7 pct.a.base -0.103 0.369 -0.280 0.782 \n```\n:::\n:::\n\n\n\n- Largest P-value at the bottom.\n\n## Take out `pct.a.base`\n\n- Copy and paste the `lm` code and remove what you're removing:\n\n\\small\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nrut.4 <- lm(log(rut.depth) ~ pct.a.surf + fines + voids + \n log(viscosity) + run, data = asphalt)\ntidy(rut.4) %>% arrange(p.value) %>% dplyr::select(term, p.value)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n# A tibble: 6 x 2\n term p.value\n \n1 log(viscosity) 0.000000448\n2 pct.a.surf 0.0143 \n3 voids 0.109 \n4 (Intercept) 0.208 \n5 run 0.279 \n6 fines 0.316 \n```\n:::\n:::\n\n\n\n\\normalsize\n\n- `fines` is next to go, P-value 0.32.\n\n## \"Update\"\n\nAnother way to do the same thing:\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nrut.4 <- update(rut.2, . ~ . 
- pct.a.base)\ntidy(rut.4) %>% arrange(p.value)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n# A tibble: 6 x 5\n term estimate std.error statistic p.value\n \n1 log(viscosity) -0.552 0.0818 -6.75 0.000000448\n2 pct.a.surf 0.593 0.225 2.63 0.0143 \n3 voids 0.200 0.121 1.66 0.109 \n4 (Intercept) -2.08 1.61 -1.29 0.208 \n5 run 0.360 0.325 1.11 0.279 \n6 fines 0.0889 0.0870 1.02 0.316 \n```\n:::\n:::\n\n\n\n- Again, `fines` is the one to go. (Output identical as it should be.)\n\n## Take out fines:\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nrut.5 <- update(rut.4, . ~ . - fines)\ntidy(rut.5) %>% arrange(p.value) %>% dplyr::select(term, p.value)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n# A tibble: 5 x 2\n term p.value\n \n1 log(viscosity) 0.0000000559\n2 pct.a.surf 0.0200 \n3 voids 0.0577 \n4 run 0.365 \n5 (Intercept) 0.375 \n```\n:::\n:::\n\n\n\nCan't take out intercept, so `run`, with P-value 0.36, goes next.\n\n## Take out run:\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nrut.6 <- update(rut.5, . ~ . - run)\ntidy(rut.6) %>% arrange(p.value) %>% dplyr::select(term, p.value)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n# A tibble: 4 x 2\n term p.value\n \n1 log(viscosity) 5.29e-19\n2 pct.a.surf 1.80e- 2\n3 voids 4.36e- 2\n4 (Intercept) 4.61e- 1\n```\n:::\n:::\n\n\n\nAgain, can't take out intercept, so largest P-value is for `voids`,\n0.044. But this is significant, so we shouldn't remove `voids`.\n\n## Comments\n\n- Here we stop: `pct.a.surf`, `voids` and `log.viscosity` would all\n make fit significantly worse if removed. 
So they stay.\n- Different final result from taking things out one at a time (top)\n than from taking out 4 at once (bottom):\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nsummary(rut.6)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n\nCall:\nlm(formula = log(rut.depth) ~ pct.a.surf + voids + log(viscosity), \n data = asphalt)\n\nResiduals:\n Min 1Q Median 3Q Max \n-0.53548 -0.20181 -0.01702 0.16748 0.54707 \n\nCoefficients:\n Estimate Std. Error t value Pr(>|t|) \n(Intercept) -1.02079 1.36430 -0.748 0.4608 \npct.a.surf 0.55547 0.22044 2.520 0.0180 * \nvoids 0.24479 0.11560 2.118 0.0436 * \nlog(viscosity) -0.64649 0.02879 -22.458 <2e-16 ***\n---\nSignif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1\n\nResidual standard error: 0.3025 on 27 degrees of freedom\nMultiple R-squared: 0.9579,\tAdjusted R-squared: 0.9532 \nF-statistic: 204.6 on 3 and 27 DF, p-value: < 2.2e-16\n```\n:::\n\n```{.r .cell-code}\ncoef(rut.6)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n (Intercept) pct.a.surf voids log(viscosity) \n -1.0207945 0.5554686 0.2447934 -0.6464911 \n```\n:::\n\n```{.r .cell-code}\ncoef(rut.3)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n (Intercept) pct.a.surf log(viscosity) \n 0.9001389 0.3911481 -0.6185628 \n```\n:::\n:::\n\n\n\n- Point: Can make a difference which way we go.\n\n## Comments on variable selection\n\n- Best way to decide which $x$'s belong: expert knowledge: which of\n them should be important.\n- Best automatic method: what we did, \"backward selection\".\n- Do not learn about \"stepwise regression\"! 
[**eg.\n here**](https://towardsdatascience.com/stopping-stepwise-why-stepwise-selection-is-bad-and-what-you-should-use-instead-90818b3f52df)\n- R has function `step` that does backward selection, like this:\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nstep(rut.2, direction = \"backward\", test = \"F\")\n```\n:::\n\n\n\nGets same answer as we did (by removing least significant x).\n\n- Removing non-significant $x$'s may remove interesting ones whose\n P-values happened not to reach 0.05. Consider using less stringent\n cutoff like 0.20 or even bigger.\n- Can also fit all possible regressions, as over (may need to do\n `install.packages(\"leaps\")` first).\n\n## All possible regressions (output over)\n\nUses package `leaps`:\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nleaps <- regsubsets(log(rut.depth) ~ pct.a.surf + \n pct.a.base + fines + voids + \n log(viscosity) + run, \n data = asphalt, nbest = 2)\ns <- summary(leaps)\nwith(s, data.frame(rsq, outmat)) -> d\n```\n:::\n\n\n\n## The output\n\n\n\n::: {.cell}\n\n:::\n\n\n\n\\scriptsize\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nd %>% rownames_to_column(\"model\") %>% arrange(desc(rsq))\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n model rsq pct.a.surf pct.a.base fines voids log.viscosity. run\n1 6 ( 1 ) 0.9609642 * * * * * *\n2 5 ( 1 ) 0.9608365 * * * * *\n3 5 ( 2 ) 0.9593265 * * * * * \n4 4 ( 1 ) 0.9591996 * * * *\n5 4 ( 2 ) 0.9589206 * * * * \n6 3 ( 1 ) 0.9578631 * * * \n7 3 ( 2 ) 0.9534561 * * * \n8 2 ( 1 ) 0.9508647 * * \n9 2 ( 2 ) 0.9479541 * * \n10 1 ( 1 ) 0.9452562 * \n11 1 ( 2 ) 0.8624107 *\n```\n:::\n:::\n\n\n\n\\normalsize\n\n\n\n::: {.cell}\n\n:::\n\n\n\n## Comments\n\n- Problem: even adding a worthless x increases R-squared. So try for\n line where R-squared stops increasing \"too much\", eg. 
top line (just\n log.viscosity), first 3-variable line (backwards-elimination model).\n Hard to judge.\n- One solution (STAC67): adjusted R-squared, where adding worthless\n variable makes it go down.\n- `data.frame` rather than `tibble` because there are several columns\n in `outmat`.\n\n## All possible regressions, adjusted R-squared\n\n\n\n::: {.cell}\n\n:::\n\n\n\n\\scriptsize\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nwith(s, data.frame(adjr2, outmat)) %>% \n rownames_to_column(\"model\") %>% \n arrange(desc(adjr2))\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n model adjr2 pct.a.surf pct.a.base fines voids log.viscosity. run\n1 3 ( 1 ) 0.9531812 * * * \n2 5 ( 1 ) 0.9530038 * * * * *\n3 4 ( 1 ) 0.9529226 * * * *\n4 4 ( 2 ) 0.9526007 * * * * \n5 6 ( 1 ) 0.9512052 * * * * * *\n6 5 ( 2 ) 0.9511918 * * * * * \n7 3 ( 2 ) 0.9482845 * * * \n8 2 ( 1 ) 0.9473550 * * \n9 2 ( 2 ) 0.9442365 * * \n10 1 ( 1 ) 0.9433685 * \n11 1 ( 2 ) 0.8576662 *\n```\n:::\n:::\n\n\n\n\\normalsize\n\n\n\n::: {.cell}\n\n:::\n\n\n\n## Revisiting the best model\n\n- Best model was our rut.6:\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\ntidy(rut.6)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n# A tibble: 4 x 5\n term estimate std.error statistic p.value\n \n1 (Intercept) -1.02 1.36 -0.748 4.61e- 1\n2 pct.a.surf 0.555 0.220 2.52 1.80e- 2\n3 voids 0.245 0.116 2.12 4.36e- 2\n4 log(viscosity) -0.646 0.0288 -22.5 5.29e-19\n```\n:::\n:::\n\n\n\n## Revisiting (2)\n\n- Regression slopes say that rut depth increases as log-viscosity\n decreases, `pct.a.surf` increases and `voids` increases. 
This more\n or less checks out with our scatterplots against `log.viscosity`.\n- We should check residual plots again, though previous scatterplots\n say it's unlikely that there will be a problem:\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\ng <- ggplot(rut.6, aes(y = .resid, x = .fitted)) + \ngeom_point()\n```\n:::\n\n\n\n## Residuals against fitted values\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\ng\n```\n\n::: {.cell-output-display}\n![](asphalt_files/figure-beamer/asphalt-39-1.pdf)\n:::\n:::\n\n::: {.cell}\n\n```{.r .cell-code}\nggplot(rut.6, aes(sample = .resid)) + stat_qq() + stat_qq_line()\n```\n\n::: {.cell-output-display}\n![](asphalt_files/figure-beamer/unnamed-chunk-2-1.pdf)\n:::\n:::\n\n\n\n## Plotting residuals against x's\n\n- Do our trick again to put them all on one plot:\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\naugment(rut.6, asphalt) %>%\n mutate(log_vis=log(viscosity)) %>% \n pivot_longer(\n c(pct.a.surf:voids, run, log_vis),\n names_to=\"xname\", values_to=\"x\",\n ) %>%\n ggplot(aes(y = .resid, x = x)) + geom_point() +\n facet_wrap(~xname, scales = \"free\") -> g2\n```\n:::\n\n\n\n## Residuals against the x's\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\ng2\n```\n\n::: {.cell-output-display}\n![](asphalt_files/figure-beamer/asphalt-41-1.pdf)\n:::\n:::\n\n\n\n## Comments\n\n- None of the plots show any sort of pattern. The points all look\n random on each plot.\n- On the plot of fitted values (and on the one of log.viscosity), the\n points seem to form a \"left half\" and a \"right half\" with a gap in\n the middle. 
This is not a concern.\n- One of the pct.a.surf values is a low outlier (4), shows up top left\n of that plot.\n- Only two possible values of run; the points in each group look\n randomly scattered around 0, with equal spreads.\n- Residuals seem to go above zero further than below, suggesting a\n mild non-normality, but not enough to be a problem.\n\n## Variable-selection strategies\n\n- Expert knowledge.\n- Backward elimination.\n- All possible regressions.\n- Taking a variety of models to experts and asking their opinion.\n- Use a looser cutoff to eliminate variables in backward elimination\n (eg. only if P-value greater than 0.20).\n- If goal is prediction, eliminating worthless variables less\n important.\n- If goal is understanding, want to eliminate worthless variables\n where possible.\n- Results of variable selection not always reproducible, so caution\n advised.\n",
"supporting": [
"asphalt_files/figure-beamer"
],
diff --git a/_freeze/asphalt/figure-beamer/asphalt-14-1.pdf b/_freeze/asphalt/figure-beamer/asphalt-14-1.pdf
index 3baafc2..e1b7f38 100644
Binary files a/_freeze/asphalt/figure-beamer/asphalt-14-1.pdf and b/_freeze/asphalt/figure-beamer/asphalt-14-1.pdf differ
diff --git a/_freeze/asphalt/figure-beamer/asphalt-15-1.pdf b/_freeze/asphalt/figure-beamer/asphalt-15-1.pdf
index ec3331e..38d548a 100644
Binary files a/_freeze/asphalt/figure-beamer/asphalt-15-1.pdf and b/_freeze/asphalt/figure-beamer/asphalt-15-1.pdf differ
diff --git a/_freeze/asphalt/figure-beamer/asphalt-17-1.pdf b/_freeze/asphalt/figure-beamer/asphalt-17-1.pdf
index 926dc04..2ec33c3 100644
Binary files a/_freeze/asphalt/figure-beamer/asphalt-17-1.pdf and b/_freeze/asphalt/figure-beamer/asphalt-17-1.pdf differ
diff --git a/_freeze/asphalt/figure-beamer/asphalt-39-1.pdf b/_freeze/asphalt/figure-beamer/asphalt-39-1.pdf
index 219cf50..500ec33 100644
Binary files a/_freeze/asphalt/figure-beamer/asphalt-39-1.pdf and b/_freeze/asphalt/figure-beamer/asphalt-39-1.pdf differ
diff --git a/_freeze/asphalt/figure-beamer/asphalt-41-1.pdf b/_freeze/asphalt/figure-beamer/asphalt-41-1.pdf
index 9caaef5..0413692 100644
Binary files a/_freeze/asphalt/figure-beamer/asphalt-41-1.pdf and b/_freeze/asphalt/figure-beamer/asphalt-41-1.pdf differ
diff --git a/_freeze/asphalt/figure-beamer/asphalt-5-1.pdf b/_freeze/asphalt/figure-beamer/asphalt-5-1.pdf
index 39a1856..6e7e830 100644
Binary files a/_freeze/asphalt/figure-beamer/asphalt-5-1.pdf and b/_freeze/asphalt/figure-beamer/asphalt-5-1.pdf differ
diff --git a/_freeze/asphalt/figure-beamer/asphalt-9-1.pdf b/_freeze/asphalt/figure-beamer/asphalt-9-1.pdf
index e08a3ed..e225e96 100644
Binary files a/_freeze/asphalt/figure-beamer/asphalt-9-1.pdf and b/_freeze/asphalt/figure-beamer/asphalt-9-1.pdf differ
diff --git a/_freeze/asphalt/figure-beamer/unnamed-chunk-1-1.pdf b/_freeze/asphalt/figure-beamer/unnamed-chunk-1-1.pdf
new file mode 100644
index 0000000..a5f420f
Binary files /dev/null and b/_freeze/asphalt/figure-beamer/unnamed-chunk-1-1.pdf differ
diff --git a/_freeze/asphalt/figure-beamer/unnamed-chunk-2-1.pdf b/_freeze/asphalt/figure-beamer/unnamed-chunk-2-1.pdf
new file mode 100644
index 0000000..9a6ddd5
Binary files /dev/null and b/_freeze/asphalt/figure-beamer/unnamed-chunk-2-1.pdf differ
diff --git a/_freeze/asphalt/figure-revealjs/unnamed-chunk-1-1.png b/_freeze/asphalt/figure-revealjs/unnamed-chunk-1-1.png
new file mode 100644
index 0000000..cf883ce
Binary files /dev/null and b/_freeze/asphalt/figure-revealjs/unnamed-chunk-1-1.png differ
diff --git a/_freeze/asphalt/figure-revealjs/unnamed-chunk-2-1.png b/_freeze/asphalt/figure-revealjs/unnamed-chunk-2-1.png
new file mode 100644
index 0000000..5dd9912
Binary files /dev/null and b/_freeze/asphalt/figure-revealjs/unnamed-chunk-2-1.png differ
diff --git a/_freeze/bootstrap_R/execute-results/html.json b/_freeze/bootstrap_R/execute-results/html.json
index 89f90fb..f323c82 100644
--- a/_freeze/bootstrap_R/execute-results/html.json
+++ b/_freeze/bootstrap_R/execute-results/html.json
@@ -1,7 +1,7 @@
{
- "hash": "b39d3bfbe34481136f1fea8772965337",
+ "hash": "7c19f4123ee146ced18cb67d9535aa9a",
"result": {
- "markdown": "---\ntitle: \"Bootstrap for sampling distribution of sample mean\"\n---\n\n\n\n## Assessing assumptions\n\n- Our $t$-tests assume normality of variable being tested\n- but, Central Limit Theorem says that normality matters less if sample is \"large\"\n- in practice \"approximate normality\" is enough, but how do we assess whether what we have is normal enough?\n- so far, use histogram/boxplot and make a call, allowing for sample size.\n\n## What actually has to be normal\n\n- is: **sampling distribution of sample mean**\n- the distribution of sample mean over *all possible samples*\n- but we only have *one* sample!\n- Idea: assume our sample is representative of the population, and draw samples from our sample (!), with replacement.\n- This gives an idea of what different samples from the population might look like.\n- Called *bootstrap*, after expression \"to pull yourself up by your own bootstraps\".\n\n## Packages\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(tidyverse)\n```\n:::\n\n\n\n## Blue Jays attendances\n\n\n::: {.cell}\n\n:::\n\n::: {.cell}\n\n```{.r .cell-code}\njays$attendance\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n [1] 48414 17264 15086 14433 21397 34743 44794 14184 15606 18581 19217 21519\n[13] 21312 30430 42917 42419 29306 15062 16402 19014 21195 33086 37929 15168\n[25] 17276\n```\n:::\n:::\n\n\n- A bootstrap sample:\n\n\n::: {.cell}\n\n```{.r .cell-code}\ns <- sample(jays$attendance, replace = TRUE)\ns\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n [1] 21195 34743 21312 44794 16402 19014 34743 21195 17264 18581 19014 19217\n[13] 34743 19217 14433 15062 16402 15062 34743 15062 15086 15168 15086 48414\n[25] 30430\n```\n:::\n:::\n\n\n## Getting mean of bootstrap sample\n\n- A bootstrap sample is same size as original, but contains repeated values (eg. 
15062) and missing ones (42917).\n- We need the mean of our bootstrap sample:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nmean(s)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] 23055.28\n```\n:::\n:::\n\n\n- This is a little different from the mean of our actual sample:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nmean(jays$attendance)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] 25070.16\n```\n:::\n:::\n\n\n- Want a sense of how the sample mean might vary, if we were able to take repeated samples from our population.\n- Idea: take lots of *bootstrap* samples, and see how *their* sample means vary.\n\n## Setting up bootstrap sampling\n\n- Begin by setting up a dataframe that contains a row for each bootstrap sample. I usually call this column `sim`. Do just 4 to get the idea:\n\n\n::: {.cell}\n\n```{.r .cell-code}\ntibble(sim = 1:4)\n```\n\n::: {.cell-output-display}\n`````{=html}\n\n \n
\n`````\n:::\n:::\n\n\n## Drawing the bootstrap samples\n\n- Then set up to work one row at a time, and draw a bootstrap sample of the attendances in each row:\n\n\\small\n\n::: {.cell}\n\n```{.r .cell-code}\ntibble(sim = 1:4) %>% \n rowwise() %>% \n mutate(sample = list(sample(jays$attendance, replace = TRUE)))\n```\n\n::: {.cell-output-display}\n`````{=html}\n\n \n
\n`````\n:::\n:::\n\n\\normalsize\n\n- Each row of our dataframe contains *all* of a bootstrap sample of 25 observations drawn with replacement from the attendances.\n\n\\normalsize\n\n## Sample means\n\n- Find the mean of each sample:\n\n\\footnotesize\n\n::: {.cell}\n\n```{.r .cell-code}\ntibble(sim = 1:4) %>% \n rowwise() %>% \n mutate(sample = list(sample(jays$attendance, replace = TRUE))) %>% \n mutate(my_mean = mean(sample))\n```\n\n::: {.cell-output-display}\n`````{=html}\n\n \n
\n`````\n:::\n:::\n\n\\normalsize\n\n- These are (four simulated values of) the bootstrapped sampling distribution of the sample mean.\n\n## Make a histogram of them\n\n- rather pointless here, but to get the idea:\n\n\\footnotesize\n\n::: {.cell}\n\n```{.r .cell-code}\ntibble(sim = 1:4) %>% \n rowwise() %>% \n mutate(sample = list(sample(jays$attendance, replace = TRUE))) %>% \n mutate(my_mean = mean(sample)) %>% \n ggplot(aes(x = my_mean)) + geom_histogram(bins = 3) -> g\n```\n:::\n\n\\normalsize\n\n## The (pointless) histogram\n\n\n::: {.cell}\n\n```{.r .cell-code}\ng\n```\n\n::: {.cell-output-display}\n![](bootstrap_R_files/figure-revealjs/bootstrap-R-10-1.png){width=960}\n:::\n:::\n\n\n## Now do again with a decent number of bootstrap samples\n\n- say 1000, and put a decent number of bins on the histogram also:\n\n\\footnotesize\n\n::: {.cell}\n\n```{.r .cell-code}\ntibble(sim = 1:1000) %>% \n rowwise() %>% \n mutate(sample = list(sample(jays$attendance, replace = TRUE))) %>% \n mutate(my_mean = mean(sample)) %>% \n ggplot(aes(x = my_mean)) + geom_histogram(bins = 10) -> g\n```\n:::\n\n\\normalsize\n\n## The (better) histogram\n\n\n::: {.cell}\n\n```{.r .cell-code}\ng\n```\n\n::: {.cell-output-display}\n![](bootstrap_R_files/figure-revealjs/bootstrap-R-12-1.png){width=960}\n:::\n:::\n\n\n## Comments\n\n- This is very close to normal\n- The bootstrap says that the sampling distribution of the sample mean is close to normal, even though the distribution of the data is not\n- A sample size of 25 is big enough to overcome the skewness that we saw\n- This is the Central Limit Theorem in practice\n- It is surprisingly powerful.\n- Thus, the $t$-test is actually perfectly good here.\n\n## Comments on the code 1/2\n\n- You might have been wondering about this:\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\ntibble(sim = 1:4) %>% \n rowwise() %>% \n mutate(sample = list(sample(jays$attendance, replace = TRUE)))\n```\n\n::: {.cell-output-display}\n`````{=html}\n\n \n
\n`````\n:::\n:::\n\n\n## Comments on the code 2/2\n\n\n- how did we squeeze all 25 sample values into one cell?\n - sample is a so-called \"list-column\" that can contain anything.\n- why did we have to put `list()` around the `sample()`?\n - because `sample` produces a collection of numbers, not just a single one\n - the `list()` signals this: \"make a list-column of samples\".\n \n \n## Two samples\n\n- Assumption: *both* samples are from a normal distribution.\n- In practice, each sample is \"normal enough\" given its sample size, since Central Limit Theorem will help.\n- Use bootstrap on each group independently, as above.\n\n## Kids learning to read\n\n\n::: {.cell}\n\n:::\n\n::: {.cell}\n\n```{.r .cell-code}\nggplot(kids, aes(x=group, y=score)) + geom_boxplot()\n```\n\n::: {.cell-output-display}\n![](bootstrap_R_files/figure-revealjs/bootstrap-R-15-1.png){width=960}\n:::\n:::\n\n\n\n## Getting just the control group \n\n- Use `filter` to select rows where something is true:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nkids %>% filter(group==\"c\") -> controls\ncontrols\n```\n\n::: {.cell-output-display}\n`````{=html}\n\n \n
\n`````\n:::\n:::\n\n\n## Bootstrap these\n\n\n::: {.cell}\n\n```{.r .cell-code}\ntibble(sim = 1:1000) %>% \n rowwise() %>% \n mutate(sample = list(sample(controls$score, replace = TRUE))) %>% \n mutate(my_mean = mean(sample)) %>% \n ggplot(aes(x = my_mean)) + geom_histogram(bins = 10) \n```\n\n::: {.cell-output-display}\n![](bootstrap_R_files/figure-revealjs/bootstrap-R-17-1.png){width=960}\n:::\n:::\n\n\n## ... and the treatment group:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nkids %>% filter(group==\"t\") -> treats\ntibble(sim = 1:1000) %>% \n rowwise() %>% \n mutate(sample = list(sample(treats$score, replace = TRUE))) %>% \n mutate(my_mean = mean(sample)) %>% \n ggplot(aes(x = my_mean)) + geom_histogram(bins = 10) \n```\n\n::: {.cell-output-display}\n![](bootstrap_R_files/figure-revealjs/bootstrap-R-19-1.png){width=960}\n:::\n:::\n\n\n## Comments\n\n- sampling distributions of sample means both look pretty normal\n- as we thought, no problems with our two-sample $t$ at all.\n\n",
+ "markdown": "---\ntitle: \"Bootstrap for sampling distribution of sample mean\"\n---\n\n\n## Assessing assumptions\n\n- Our $t$-tests assume normality of variable being tested\n- but, Central Limit Theorem says that normality matters less if\n sample is \"large\"\n- in practice \"approximate normality\" is enough, but how do we assess\n whether what we have is normal enough?\n- so far, use histogram/boxplot and make a call, allowing for sample\n size.\n\n## What actually has to be normal\n\n- is: **sampling distribution of sample mean**\n- the distribution of sample mean over *all possible samples*\n- but we only have *one* sample!\n- Idea: assume our sample is representative of the population, and\n draw samples from our sample (!), with replacement.\n- This gives an idea of what different samples from the population\n might look like.\n- Called *bootstrap*, after expression \"to pull yourself up by your\n own bootstraps\".\n\n## Packages\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(tidyverse)\n```\n:::\n\n\n## Blue Jays attendances\n\n\n::: {.cell}\n\n:::\n\n::: {.cell}\n\n```{.r .cell-code}\njays$attendance\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n [1] 48414 17264 15086 14433 21397 34743 44794 14184 15606 18581 19217 21519\n[13] 21312 30430 42917 42419 29306 15062 16402 19014 21195 33086 37929 15168\n[25] 17276\n```\n:::\n:::\n\n\n- A bootstrap sample:\n\n\n::: {.cell}\n\n```{.r .cell-code}\ns <- sample(jays$attendance, replace = TRUE)\ns\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n [1] 21195 34743 21312 44794 16402 19014 34743 21195 17264 18581 19014 19217\n[13] 34743 19217 14433 15062 16402 15062 34743 15062 15086 15168 15086 48414\n[25] 30430\n```\n:::\n:::\n\n\n- It is easier to see what is happening if we sort both the actual\n attendances and the bootstrap sample:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nsort(jays$attendance)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n [1] 14184 14433 15062 15086 15168 15606 16402 17264 
17276 18581 19014 19217\n[13] 21195 21312 21397 21519 29306 30430 33086 34743 37929 42419 42917 44794\n[25] 48414\n```\n:::\n\n```{.r .cell-code}\nsort(s)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n [1] 14433 15062 15062 15062 15086 15086 15168 16402 16402 17264 18581 19014\n[13] 19014 19217 19217 21195 21195 21312 30430 34743 34743 34743 34743 44794\n[25] 48414\n```\n:::\n:::\n\n\n## Getting mean of bootstrap sample\n\n- A bootstrap sample is same size as original, but contains repeated\n values (eg. 15062) and missing ones (42917).\n- We need the mean of our bootstrap sample:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nmean(s)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] 23055.28\n```\n:::\n:::\n\n\n- This is a little different from the mean of our actual sample:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nmean(jays$attendance)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] 25070.16\n```\n:::\n:::\n\n\n- Want a sense of how the sample mean might vary, if we were able to\n take repeated samples from our population.\n- Idea: take lots of *bootstrap* samples, and see how *their* sample\n means vary.\n\n## Setting up bootstrap sampling\n\n- Begin by setting up a dataframe that contains a row for each\n bootstrap sample. I usually call this column `sim`. Do just 4 to get\n the idea:\n\n\n::: {.cell}\n\n```{.r .cell-code}\ntibble(sim = 1:4)\n```\n\n::: {.cell-output-display}\n`````{=html}\n\n \n
\n`````\n:::\n:::\n\n\n## Drawing the bootstrap samples\n\n- Then set up to work one row at a time, and draw a bootstrap sample\n of the attendances in each row:\n\n\n::: {.cell}\n\n```{.r .cell-code}\ntibble(sim = 1:4) %>% \n rowwise() %>% \n mutate(sample = list(sample(jays$attendance, replace = TRUE)))\n```\n\n::: {.cell-output-display}\n`````{=html}\n\n \n
\n`````\n:::\n:::\n\n\n- Each row of our dataframe contains *all* of a bootstrap sample of 25\n observations drawn with replacement from the attendances.\n\n## Sample means\n\n- Find the mean of each sample:\n\n\n::: {.cell}\n\n```{.r .cell-code}\ntibble(sim = 1:4) %>% \n rowwise() %>% \n mutate(sample = list(sample(jays$attendance, replace = TRUE))) %>% \n mutate(my_mean = mean(sample))\n```\n\n::: {.cell-output-display}\n`````{=html}\n\n \n
\n`````\n:::\n:::\n\n\n- These are (four simulated values of) the bootstrapped sampling\n distribution of the sample mean.\n\n## Make a histogram of them\n\n- rather pointless here, but to get the idea:\n\n\n::: {.cell}\n\n```{.r .cell-code}\ntibble(sim = 1:4) %>% \n rowwise() %>% \n mutate(sample = list(sample(jays$attendance, replace = TRUE))) %>% \n mutate(my_mean = mean(sample)) %>% \n ggplot(aes(x = my_mean)) + geom_histogram(bins = 3) -> g\n```\n:::\n\n\n## The (pointless) histogram\n\n\n::: {.cell}\n\n```{.r .cell-code}\ng\n```\n\n::: {.cell-output-display}\n![](bootstrap_R_files/figure-revealjs/bootstrap-R-10-1.png){width=960}\n:::\n:::\n\n\n## Now do again with a decent number of bootstrap samples\n\n- say 1000, and put a decent number of bins on the histogram also:\n\n\n::: {.cell}\n\n```{.r .cell-code}\ntibble(sim = 1:1000) %>% \n rowwise() %>% \n mutate(sample = list(sample(jays$attendance, replace = TRUE))) %>% \n mutate(my_mean = mean(sample)) %>% \n ggplot(aes(x = my_mean)) + geom_histogram(bins = 10) -> g\n```\n:::\n\n\n## The (better) histogram\n\n\n::: {.cell}\n\n```{.r .cell-code}\ng\n```\n\n::: {.cell-output-display}\n![](bootstrap_R_files/figure-revealjs/bootstrap-R-12-1.png){width=960}\n:::\n:::\n\n\n## Comments\n\n- This is very close to normal\n- The bootstrap says that the sampling distribution of the sample mean\n is close to normal, even though the distribution of the data is not\n- A sample size of 25 is big enough to overcome the skewness that we\n saw\n- This is the Central Limit Theorem in practice\n- It is surprisingly powerful.\n- Thus, the $t$-test is actually perfectly good here.\n\n## Comments on the code 1/2\n\n- You might have been wondering about this:\n\n\n::: {.cell}\n\n```{.r .cell-code}\ntibble(sim = 1:4) %>% \n rowwise() %>% \n mutate(sample = list(sample(jays$attendance, replace = TRUE)))\n```\n\n::: {.cell-output-display}\n`````{=html}\n\n \n
\n`````\n:::\n:::\n\n\n## Comments on the code 2/2\n\n- how did we squeeze all 25 sample values into one cell?\n - sample is a so-called \"list-column\" that can contain anything.\n- why did we have to put `list()` around the `sample()`?\n - because `sample` produces a collection of numbers, not just a\n single one\n - the `list()` signals this: \"make a list-column of samples\".\n\n## Two samples\n\n- Assumption: *both* samples are from a normal distribution.\n- In this case, each sample should be \"normal enough\" given its sample\n size, since Central Limit Theorem will help.\n- Use bootstrap on each group independently, as above.\n\n## Kids learning to read\n\n\n::: {.cell}\n::: {.cell-output-display}\n`````{=html}\n\n \n
\n`````\n:::\n:::\n\n::: {.cell}\n\n```{.r .cell-code}\nggplot(kids, aes(x=group, y=score)) + geom_boxplot()\n```\n\n::: {.cell-output-display}\n![](bootstrap_R_files/figure-revealjs/bootstrap-R-15-1.png){width=960}\n:::\n:::\n\n\n## Getting just the control group\n\n- Use `filter` to select rows where something is true:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nkids %>% filter(group==\"c\") -> controls\ncontrols\n```\n\n::: {.cell-output-display}\n`````{=html}\n\n \n
\n`````\n:::\n:::\n\n\n## Bootstrap these\n\n\n::: {.cell}\n\n```{.r .cell-code}\ntibble(sim = 1:1000) %>% \n rowwise() %>% \n mutate(sample = list(sample(controls$score, replace = TRUE))) %>% \n mutate(my_mean = mean(sample)) %>% \n ggplot(aes(x = my_mean)) + geom_histogram(bins = 10) \n```\n\n::: {.cell-output-display}\n![](bootstrap_R_files/figure-revealjs/bootstrap-R-17-1.png){width=960}\n:::\n:::\n\n\n## ... and the treatment group:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nkids %>% filter(group==\"t\") -> treats\ntibble(sim = 1:1000) %>% \n rowwise() %>% \n mutate(sample = list(sample(treats$score, replace = TRUE))) %>% \n mutate(my_mean = mean(sample)) %>% \n ggplot(aes(x = my_mean)) + geom_histogram(bins = 10) \n```\n\n::: {.cell-output-display}\n![](bootstrap_R_files/figure-revealjs/bootstrap-R-19-1.png){width=960}\n:::\n:::\n\n\n## Comments\n\n- sampling distributions of sample means both look pretty normal\n- as we thought, no problems with our two-sample $t$ at all.\n",
"supporting": [
"bootstrap_R_files/figure-revealjs"
],
diff --git a/_freeze/bootstrap_R/execute-results/tex.json b/_freeze/bootstrap_R/execute-results/tex.json
index efd6dd4..d9ae06d 100644
--- a/_freeze/bootstrap_R/execute-results/tex.json
+++ b/_freeze/bootstrap_R/execute-results/tex.json
@@ -1,7 +1,7 @@
{
- "hash": "b39d3bfbe34481136f1fea8772965337",
+ "hash": "7c19f4123ee146ced18cb67d9535aa9a",
"result": {
- "markdown": "---\ntitle: \"Bootstrap for sampling distribution of sample mean\"\n---\n\n\n\n\n## Assessing assumptions\n\n- Our $t$-tests assume normality of variable being tested\n- but, Central Limit Theorem says that normality matters less if sample is \"large\"\n- in practice \"approximate normality\" is enough, but how do we assess whether what we have is normal enough?\n- so far, use histogram/boxplot and make a call, allowing for sample size.\n\n## What actually has to be normal\n\n- is: **sampling distribution of sample mean**\n- the distribution of sample mean over *all possible samples*\n- but we only have *one* sample!\n- Idea: assume our sample is representative of the population, and draw samples from our sample (!), with replacement.\n- This gives an idea of what different samples from the population might look like.\n- Called *bootstrap*, after expression \"to pull yourself up by your own bootstraps\".\n\n## Packages\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(tidyverse)\n```\n:::\n\n\n\n\n## Blue Jays attendances\n\n\n\n::: {.cell}\n\n:::\n\n::: {.cell}\n\n```{.r .cell-code}\njays$attendance\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n [1] 48414 17264 15086 14433 21397 34743 44794 14184 15606 18581 19217 21519\n[13] 21312 30430 42917 42419 29306 15062 16402 19014 21195 33086 37929 15168\n[25] 17276\n```\n:::\n:::\n\n\n\n- A bootstrap sample:\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\ns <- sample(jays$attendance, replace = TRUE)\ns\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n [1] 21195 34743 21312 44794 16402 19014 34743 21195 17264 18581 19014 19217\n[13] 34743 19217 14433 15062 16402 15062 34743 15062 15086 15168 15086 48414\n[25] 30430\n```\n:::\n:::\n\n\n\n## Getting mean of bootstrap sample\n\n- A bootstrap sample is same size as original, but contains repeated values (eg. 
15062) and missing ones (42917).\n- We need the mean of our bootstrap sample:\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nmean(s)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] 23055.28\n```\n:::\n:::\n\n\n\n- This is a little different from the mean of our actual sample:\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nmean(jays$attendance)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] 25070.16\n```\n:::\n:::\n\n\n\n- Want a sense of how the sample mean might vary, if we were able to take repeated samples from our population.\n- Idea: take lots of *bootstrap* samples, and see how *their* sample means vary.\n\n## Setting up bootstrap sampling\n\n- Begin by setting up a dataframe that contains a row for each bootstrap sample. I usually call this column `sim`. Do just 4 to get the idea:\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\ntibble(sim = 1:4)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n# A tibble: 4 x 1\n sim\n \n1 1\n2 2\n3 3\n4 4\n```\n:::\n:::\n\n\n\n## Drawing the bootstrap samples\n\n- Then set up to work one row at a time, and draw a bootstrap sample of the attendances in each row:\n\n\\small\n\n\n::: {.cell}\n\n```{.r .cell-code}\ntibble(sim = 1:4) %>% \n rowwise() %>% \n mutate(sample = list(sample(jays$attendance, replace = TRUE)))\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n# A tibble: 4 x 2\n# Rowwise: \n sim sample \n \n1 1 \n2 2 \n3 3 \n4 4 \n```\n:::\n:::\n\n\n\\normalsize\n\n- Each row of our dataframe contains *all* of a bootstrap sample of 25 observations drawn with replacement from the attendances.\n\n\\normalsize\n\n## Sample means\n\n- Find the mean of each sample:\n\n\\footnotesize\n\n\n::: {.cell}\n\n```{.r .cell-code}\ntibble(sim = 1:4) %>% \n rowwise() %>% \n mutate(sample = list(sample(jays$attendance, replace = TRUE))) %>% \n mutate(my_mean = mean(sample))\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n# A tibble: 4 x 3\n# Rowwise: \n sim sample my_mean\n \n1 1 28472.\n2 2 28648.\n3 3 23329.\n4 4 
24808.\n```\n:::\n:::\n\n\n\\normalsize\n\n- These are (four simulated values of) the bootstrapped sampling distribution of the sample mean.\n\n## Make a histogram of them\n\n- rather pointless here, but to get the idea:\n\n\\footnotesize\n\n\n::: {.cell}\n\n```{.r .cell-code}\ntibble(sim = 1:4) %>% \n rowwise() %>% \n mutate(sample = list(sample(jays$attendance, replace = TRUE))) %>% \n mutate(my_mean = mean(sample)) %>% \n ggplot(aes(x = my_mean)) + geom_histogram(bins = 3) -> g\n```\n:::\n\n\n\\normalsize\n\n## The (pointless) histogram\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\ng\n```\n\n::: {.cell-output-display}\n![](bootstrap_R_files/figure-beamer/bootstrap-R-10-1.pdf)\n:::\n:::\n\n\n\n## Now do again with a decent number of bootstrap samples\n\n- say 1000, and put a decent number of bins on the histogram also:\n\n\\footnotesize\n\n\n::: {.cell}\n\n```{.r .cell-code}\ntibble(sim = 1:1000) %>% \n rowwise() %>% \n mutate(sample = list(sample(jays$attendance, replace = TRUE))) %>% \n mutate(my_mean = mean(sample)) %>% \n ggplot(aes(x = my_mean)) + geom_histogram(bins = 10) -> g\n```\n:::\n\n\n\\normalsize\n\n## The (better) histogram\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\ng\n```\n\n::: {.cell-output-display}\n![](bootstrap_R_files/figure-beamer/bootstrap-R-12-1.pdf)\n:::\n:::\n\n\n\n## Comments\n\n- This is very close to normal\n- The bootstrap says that the sampling distribution of the sample mean is close to normal, even though the distribution of the data is not\n- A sample size of 25 is big enough to overcome the skewness that we saw\n- This is the Central Limit Theorem in practice\n- It is surprisingly powerful.\n- Thus, the $t$-test is actually perfectly good here.\n\n## Comments on the code 1/2\n\n- You might have been wondering about this:\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\ntibble(sim = 1:4) %>% \n rowwise() %>% \n mutate(sample = list(sample(jays$attendance, replace = TRUE)))\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n# A tibble: 4 
x 2\n# Rowwise: \n sim sample \n \n1 1 \n2 2 \n3 3 \n4 4 \n```\n:::\n:::\n\n\n\n## Comments on the code 2/2\n\n\n- how did we squeeze all 25 sample values into one cell?\n - sample is a so-called \"list-column\" that can contain anything.\n- why did we have to put `list()` around the `sample()`?\n - because `sample` produces a collection of numbers, not just a single one\n - the `list()` signals this: \"make a list-column of samples\".\n \n \n## Two samples\n\n- Assumption: *both* samples are from a normal distribution.\n- In practice, each sample is \"normal enough\" given its sample size, since Central Limit Theorem will help.\n- Use bootstrap on each group independently, as above.\n\n## Kids learning to read\n\n\n\n::: {.cell}\n\n:::\n\n::: {.cell}\n\n```{.r .cell-code}\nggplot(kids, aes(x=group, y=score)) + geom_boxplot()\n```\n\n::: {.cell-output-display}\n![](bootstrap_R_files/figure-beamer/bootstrap-R-15-1.pdf)\n:::\n:::\n\n\n\n\n## Getting just the control group \n\n- Use `filter` to select rows where something is true:\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nkids %>% filter(group==\"c\") -> controls\ncontrols\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n# A tibble: 23 x 2\n group score\n \n 1 c 42\n 2 c 33\n 3 c 46\n 4 c 37\n 5 c 43\n 6 c 41\n 7 c 10\n 8 c 42\n 9 c 55\n10 c 19\n# i 13 more rows\n```\n:::\n:::\n\n\n\n## Bootstrap these\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\ntibble(sim = 1:1000) %>% \n rowwise() %>% \n mutate(sample = list(sample(controls$score, replace = TRUE))) %>% \n mutate(my_mean = mean(sample)) %>% \n ggplot(aes(x = my_mean)) + geom_histogram(bins = 10) \n```\n\n::: {.cell-output-display}\n![](bootstrap_R_files/figure-beamer/bootstrap-R-17-1.pdf)\n:::\n:::\n\n\n\n## ... 
and the treatment group:\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nkids %>% filter(group==\"t\") -> treats\ntibble(sim = 1:1000) %>% \n rowwise() %>% \n mutate(sample = list(sample(treats$score, replace = TRUE))) %>% \n mutate(my_mean = mean(sample)) %>% \n ggplot(aes(x = my_mean)) + geom_histogram(bins = 10) \n```\n\n::: {.cell-output-display}\n![](bootstrap_R_files/figure-beamer/bootstrap-R-19-1.pdf)\n:::\n:::\n\n\n\n## Comments\n\n- sampling distributions of sample means both look pretty normal\n- as we thought, no problems with our two-sample $t$ at all.\n\n",
+ "markdown": "---\ntitle: \"Bootstrap for sampling distribution of sample mean\"\n---\n\n\n\n## Assessing assumptions\n\n- Our $t$-tests assume normality of variable being tested\n- but, Central Limit Theorem says that normality matters less if\n sample is \"large\"\n- in practice \"approximate normality\" is enough, but how do we assess\n whether what we have is normal enough?\n- so far, use histogram/boxplot and make a call, allowing for sample\n size.\n\n## What actually has to be normal\n\n- is: **sampling distribution of sample mean**\n- the distribution of sample mean over *all possible samples*\n- but we only have *one* sample!\n- Idea: assume our sample is representative of the population, and\n draw samples from our sample (!), with replacement.\n- This gives an idea of what different samples from the population\n might look like.\n- Called *bootstrap*, after expression \"to pull yourself up by your\n own bootstraps\".\n\n## Packages\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(tidyverse)\n```\n:::\n\n\n\n## Blue Jays attendances\n\n\n\n::: {.cell}\n\n:::\n\n::: {.cell}\n\n```{.r .cell-code}\njays$attendance\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n [1] 48414 17264 15086 14433 21397 34743 44794 14184 15606 18581 19217 21519\n[13] 21312 30430 42917 42419 29306 15062 16402 19014 21195 33086 37929 15168\n[25] 17276\n```\n:::\n:::\n\n\n\n- A bootstrap sample:\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\ns <- sample(jays$attendance, replace = TRUE)\ns\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n [1] 21195 34743 21312 44794 16402 19014 34743 21195 17264 18581 19014 19217\n[13] 34743 19217 14433 15062 16402 15062 34743 15062 15086 15168 15086 48414\n[25] 30430\n```\n:::\n:::\n\n\n\n- It is easier to see what is happening if we sort both the actual\n attendances and the bootstrap sample:\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nsort(jays$attendance)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n [1] 14184 14433 15062 15086 15168 
15606 16402 17264 17276 18581 19014 19217\n[13] 21195 21312 21397 21519 29306 30430 33086 34743 37929 42419 42917 44794\n[25] 48414\n```\n:::\n\n```{.r .cell-code}\nsort(s)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n [1] 14433 15062 15062 15062 15086 15086 15168 16402 16402 17264 18581 19014\n[13] 19014 19217 19217 21195 21195 21312 30430 34743 34743 34743 34743 44794\n[25] 48414\n```\n:::\n:::\n\n\n\n## Getting mean of bootstrap sample\n\n- A bootstrap sample is same size as original, but contains repeated\n values (eg. 15062) and missing ones (42917).\n- We need the mean of our bootstrap sample:\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nmean(s)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] 23055.28\n```\n:::\n:::\n\n\n\n- This is a little different from the mean of our actual sample:\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nmean(jays$attendance)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] 25070.16\n```\n:::\n:::\n\n\n\n- Want a sense of how the sample mean might vary, if we were able to\n take repeated samples from our population.\n- Idea: take lots of *bootstrap* samples, and see how *their* sample\n means vary.\n\n## Setting up bootstrap sampling\n\n- Begin by setting up a dataframe that contains a row for each\n bootstrap sample. I usually call this column `sim`. 
Do just 4 to get\n the idea:\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\ntibble(sim = 1:4)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n# A tibble: 4 x 1\n sim\n \n1 1\n2 2\n3 3\n4 4\n```\n:::\n:::\n\n\n\n## Drawing the bootstrap samples\n\n- Then set up to work one row at a time, and draw a bootstrap sample\n of the attendances in each row:\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\ntibble(sim = 1:4) %>% \n rowwise() %>% \n mutate(sample = list(sample(jays$attendance, replace = TRUE)))\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n# A tibble: 4 x 2\n# Rowwise: \n sim sample \n \n1 1 \n2 2 \n3 3 \n4 4 \n```\n:::\n:::\n\n\n\n- Each row of our dataframe contains *all* of a bootstrap sample of 25\n observations drawn with replacement from the attendances.\n\n## Sample means\n\n- Find the mean of each sample:\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\ntibble(sim = 1:4) %>% \n rowwise() %>% \n mutate(sample = list(sample(jays$attendance, replace = TRUE))) %>% \n mutate(my_mean = mean(sample))\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n# A tibble: 4 x 3\n# Rowwise: \n sim sample my_mean\n \n1 1 28472.\n2 2 28648.\n3 3 23329.\n4 4 24808.\n```\n:::\n:::\n\n\n\n- These are (four simulated values of) the bootstrapped sampling\n distribution of the sample mean.\n\n## Make a histogram of them\n\n- rather pointless here, but to get the idea:\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\ntibble(sim = 1:4) %>% \n rowwise() %>% \n mutate(sample = list(sample(jays$attendance, replace = TRUE))) %>% \n mutate(my_mean = mean(sample)) %>% \n ggplot(aes(x = my_mean)) + geom_histogram(bins = 3) -> g\n```\n:::\n\n\n\n## The (pointless) histogram\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\ng\n```\n\n::: {.cell-output-display}\n![](bootstrap_R_files/figure-beamer/bootstrap-R-10-1.pdf)\n:::\n:::\n\n\n\n## Now do again with a decent number of bootstrap samples\n\n- say 1000, and put a decent number of bins on the histogram also:\n\n\n\n::: {.cell}\n\n```{.r 
.cell-code}\ntibble(sim = 1:1000) %>% \n rowwise() %>% \n mutate(sample = list(sample(jays$attendance, replace = TRUE))) %>% \n mutate(my_mean = mean(sample)) %>% \n ggplot(aes(x = my_mean)) + geom_histogram(bins = 10) -> g\n```\n:::\n\n\n\n## The (better) histogram\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\ng\n```\n\n::: {.cell-output-display}\n![](bootstrap_R_files/figure-beamer/bootstrap-R-12-1.pdf)\n:::\n:::\n\n\n\n## Comments\n\n- This is very close to normal\n- The bootstrap says that the sampling distribution of the sample mean\n is close to normal, even though the distribution of the data is not\n- A sample size of 25 is big enough to overcome the skewness that we\n saw\n- This is the Central Limit Theorem in practice\n- It is surprisingly powerful.\n- Thus, the $t$-test is actually perfectly good here.\n\n## Comments on the code 1/2\n\n- You might have been wondering about this:\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\ntibble(sim = 1:4) %>% \n rowwise() %>% \n mutate(sample = list(sample(jays$attendance, replace = TRUE)))\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n# A tibble: 4 x 2\n# Rowwise: \n sim sample \n \n1 1 \n2 2 \n3 3 \n4 4 \n```\n:::\n:::\n\n\n\n## Comments on the code 2/2\n\n- how did we squeeze all 25 sample values into one cell?\n - sample is a so-called \"list-column\" that can contain anything.\n- why did we have to put `list()` around the `sample()`?\n - because `sample` produces a collection of numbers, not just a\n single one\n - the `list()` signals this: \"make a list-column of samples\".\n\n## Two samples\n\n- Assumption: *both* samples are from a normal distribution.\n- In this case, each sample should be \"normal enough\" given its sample\n size, since Central Limit Theorem will help.\n- Use bootstrap on each group independently, as above.\n\n## Kids learning to read\n\n\n\n::: {.cell}\n::: {.cell-output .cell-output-stdout}\n```\n# A tibble: 44 x 2\n group score\n \n 1 t 24\n 2 t 61\n 3 t 59\n 4 t 46\n 5 t 43\n 6 t 44\n 7 
t 52\n 8 t 43\n 9 t 58\n10 t 67\n# i 34 more rows\n```\n:::\n:::\n\n::: {.cell}\n\n```{.r .cell-code}\nggplot(kids, aes(x=group, y=score)) + geom_boxplot()\n```\n\n::: {.cell-output-display}\n![](bootstrap_R_files/figure-beamer/bootstrap-R-15-1.pdf)\n:::\n:::\n\n\n\n## Getting just the control group\n\n- Use `filter` to select rows where something is true:\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nkids %>% filter(group==\"c\") -> controls\ncontrols\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n# A tibble: 23 x 2\n group score\n \n 1 c 42\n 2 c 33\n 3 c 46\n 4 c 37\n 5 c 43\n 6 c 41\n 7 c 10\n 8 c 42\n 9 c 55\n10 c 19\n# i 13 more rows\n```\n:::\n:::\n\n\n\n## Bootstrap these\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\ntibble(sim = 1:1000) %>% \n rowwise() %>% \n mutate(sample = list(sample(controls$score, replace = TRUE))) %>% \n mutate(my_mean = mean(sample)) %>% \n ggplot(aes(x = my_mean)) + geom_histogram(bins = 10) \n```\n\n::: {.cell-output-display}\n![](bootstrap_R_files/figure-beamer/bootstrap-R-17-1.pdf)\n:::\n:::\n\n\n\n## ... and the treatment group:\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nkids %>% filter(group==\"t\") -> treats\ntibble(sim = 1:1000) %>% \n rowwise() %>% \n mutate(sample = list(sample(treats$score, replace = TRUE))) %>% \n mutate(my_mean = mean(sample)) %>% \n ggplot(aes(x = my_mean)) + geom_histogram(bins = 10) \n```\n\n::: {.cell-output-display}\n![](bootstrap_R_files/figure-beamer/bootstrap-R-19-1.pdf)\n:::\n:::\n\n\n\n## Comments\n\n- sampling distributions of sample means both look pretty normal\n- as we thought, no problems with our two-sample $t$ at all.\n",
"supporting": [
"bootstrap_R_files/figure-beamer"
],
diff --git a/_freeze/bootstrap_R/figure-beamer/bootstrap-R-10-1.pdf b/_freeze/bootstrap_R/figure-beamer/bootstrap-R-10-1.pdf
index 4ad461b..2f877c4 100644
Binary files a/_freeze/bootstrap_R/figure-beamer/bootstrap-R-10-1.pdf and b/_freeze/bootstrap_R/figure-beamer/bootstrap-R-10-1.pdf differ
diff --git a/_freeze/bootstrap_R/figure-beamer/bootstrap-R-12-1.pdf b/_freeze/bootstrap_R/figure-beamer/bootstrap-R-12-1.pdf
index 1782906..8b5adc1 100644
Binary files a/_freeze/bootstrap_R/figure-beamer/bootstrap-R-12-1.pdf and b/_freeze/bootstrap_R/figure-beamer/bootstrap-R-12-1.pdf differ
diff --git a/_freeze/bootstrap_R/figure-beamer/bootstrap-R-15-1.pdf b/_freeze/bootstrap_R/figure-beamer/bootstrap-R-15-1.pdf
index 7cd2a7c..c0a5e5e 100644
Binary files a/_freeze/bootstrap_R/figure-beamer/bootstrap-R-15-1.pdf and b/_freeze/bootstrap_R/figure-beamer/bootstrap-R-15-1.pdf differ
diff --git a/_freeze/bootstrap_R/figure-beamer/bootstrap-R-17-1.pdf b/_freeze/bootstrap_R/figure-beamer/bootstrap-R-17-1.pdf
index 3bc795e..db21cdd 100644
Binary files a/_freeze/bootstrap_R/figure-beamer/bootstrap-R-17-1.pdf and b/_freeze/bootstrap_R/figure-beamer/bootstrap-R-17-1.pdf differ
diff --git a/_freeze/bootstrap_R/figure-beamer/bootstrap-R-19-1.pdf b/_freeze/bootstrap_R/figure-beamer/bootstrap-R-19-1.pdf
index 4de1a18..f4ae3fd 100644
Binary files a/_freeze/bootstrap_R/figure-beamer/bootstrap-R-19-1.pdf and b/_freeze/bootstrap_R/figure-beamer/bootstrap-R-19-1.pdf differ
diff --git a/_freeze/choosing/execute-results/tex.json b/_freeze/choosing/execute-results/tex.json
index b6df1ec..d682186 100644
--- a/_freeze/choosing/execute-results/tex.json
+++ b/_freeze/choosing/execute-results/tex.json
@@ -1,7 +1,7 @@
{
- "hash": "1a27c83e2f5ea4aee31a32c65b194163",
+ "hash": "8a1b88b16954fb13b78c31ee298ba634",
"result": {
- "markdown": "---\ntitle: \"Choosing things in dataframes\"\n---\n\n\n\n## Packages\n\nThe usual:\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(tidyverse)\n```\n:::\n\n\n\n\n## Doing things with data frames\nLet’s go back to our Australian athletes: \n\n\n\n::: {.cell}\n\n:::\n\n\n\n\\scriptsize\n\n\n\n::: {.cell}\n\n:::\n\n::: {.cell}\n\n```{.r .cell-code}\nathletes\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n# A tibble: 202 x 13\n Sex Sport RCC WCC Hc Hg Ferr BMI SSF `%Bfat` LBM\n \n 1 female Netba~ 4.56 13.3 42.2 13.6 20 19.2 49 11.3 53.1\n 2 female Netba~ 4.15 6 38 12.7 59 21.2 110. 25.3 47.1\n 3 female Netba~ 4.16 7.6 37.5 12.3 22 21.4 89 19.4 53.4\n 4 female Netba~ 4.32 6.4 37.7 12.3 30 21.0 98.3 19.6 48.8\n 5 female Netba~ 4.06 5.8 38.7 12.8 78 21.8 122. 23.1 56.0\n 6 female Netba~ 4.12 6.1 36.6 11.8 21 21.4 90.4 16.9 56.4\n 7 female Netba~ 4.17 5 37.4 12.7 109 21.5 107. 21.3 53.1\n 8 female Netba~ 3.8 6.6 36.5 12.4 102 24.4 157. 26.6 54.4\n 9 female Netba~ 3.96 5.5 36.3 12.4 71 22.6 101. 17.9 56.0\n10 female Netba~ 4.44 9.7 41.4 14.1 64 22.8 126. 
25.0 51.6\n# i 192 more rows\n# i 2 more variables: Ht , Wt \n```\n:::\n:::\n\n\n\\normalsize\n\n## Choosing a column\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nathletes %>% select(Sport)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n# A tibble: 202 x 1\n Sport \n \n 1 Netball\n 2 Netball\n 3 Netball\n 4 Netball\n 5 Netball\n 6 Netball\n 7 Netball\n 8 Netball\n 9 Netball\n10 Netball\n# i 192 more rows\n```\n:::\n:::\n\n\n\n## Choosing several columns\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nathletes %>% select(Sport, Hg, BMI)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n# A tibble: 202 x 3\n Sport Hg BMI\n \n 1 Netball 13.6 19.2\n 2 Netball 12.7 21.2\n 3 Netball 12.3 21.4\n 4 Netball 12.3 21.0\n 5 Netball 12.8 21.8\n 6 Netball 11.8 21.4\n 7 Netball 12.7 21.5\n 8 Netball 12.4 24.4\n 9 Netball 12.4 22.6\n10 Netball 14.1 22.8\n# i 192 more rows\n```\n:::\n:::\n\n\n\n## Choosing consecutive columns\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nathletes %>% select(Sex:WCC)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n# A tibble: 202 x 4\n Sex Sport RCC WCC\n \n 1 female Netball 4.56 13.3\n 2 female Netball 4.15 6 \n 3 female Netball 4.16 7.6\n 4 female Netball 4.32 6.4\n 5 female Netball 4.06 5.8\n 6 female Netball 4.12 6.1\n 7 female Netball 4.17 5 \n 8 female Netball 3.8 6.6\n 9 female Netball 3.96 5.5\n10 female Netball 4.44 9.7\n# i 192 more rows\n```\n:::\n:::\n\n\n\n## Choosing all-but some columns\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nathletes %>% select(-(RCC:LBM))\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n# A tibble: 202 x 4\n Sex Sport Ht Wt\n \n 1 female Netball 177. 59.9\n 2 female Netball 173. 63 \n 3 female Netball 176 66.3\n 4 female Netball 170. 60.7\n 5 female Netball 183 72.9\n 6 female Netball 178. 67.9\n 7 female Netball 177. 67.5\n 8 female Netball 174. 74.1\n 9 female Netball 174. 68.2\n10 female Netball 174. 
68.8\n# i 192 more rows\n```\n:::\n:::\n\n\n\n## Select-helpers\nOther ways to select columns: those whose name:\n\n- `starts_with` something\n- `ends_with` something\n- `contains` something\n- `matches` a “regular expression”\n- `everything()` select all the columns\n\n## Columns whose names begin with S \n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nathletes %>% select(starts_with(\"S\"))\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n# A tibble: 202 x 3\n Sex Sport SSF\n \n 1 female Netball 49 \n 2 female Netball 110. \n 3 female Netball 89 \n 4 female Netball 98.3\n 5 female Netball 122. \n 6 female Netball 90.4\n 7 female Netball 107. \n 8 female Netball 157. \n 9 female Netball 101. \n10 female Netball 126. \n# i 192 more rows\n```\n:::\n:::\n\n\n\n## Columns whose names end with C\n\neither uppercase or lowercase:\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nathletes %>% select(ends_with(\"c\"))\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n# A tibble: 202 x 3\n RCC WCC Hc\n \n 1 4.56 13.3 42.2\n 2 4.15 6 38 \n 3 4.16 7.6 37.5\n 4 4.32 6.4 37.7\n 5 4.06 5.8 38.7\n 6 4.12 6.1 36.6\n 7 4.17 5 37.4\n 8 3.8 6.6 36.5\n 9 3.96 5.5 36.3\n10 4.44 9.7 41.4\n# i 192 more rows\n```\n:::\n:::\n\n\n\n## Case-sensitive\n\nThis works with any of the select-helpers:\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nathletes %>% select(ends_with(\"C\", ignore.case=FALSE))\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n# A tibble: 202 x 2\n RCC WCC\n \n 1 4.56 13.3\n 2 4.15 6 \n 3 4.16 7.6\n 4 4.32 6.4\n 5 4.06 5.8\n 6 4.12 6.1\n 7 4.17 5 \n 8 3.8 6.6\n 9 3.96 5.5\n10 4.44 9.7\n# i 192 more rows\n```\n:::\n:::\n\n\n\n\n## Column names containing letter R\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nathletes %>% select(contains(\"r\"))\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n# A tibble: 202 x 3\n Sport RCC Ferr\n \n 1 Netball 4.56 20\n 2 Netball 4.15 59\n 3 Netball 4.16 22\n 4 Netball 4.32 30\n 5 Netball 4.06 78\n 6 Netball 4.12 21\n 7 Netball 4.17 109\n 8 
Netball 3.8 102\n 9 Netball 3.96 71\n10 Netball 4.44 64\n# i 192 more rows\n```\n:::\n:::\n\n\n\n## Exactly two characters, ending with T\n\nIn regular expression terms, this is `^.t$`:\n\n- `^` means “start of text”\n- `.` means “exactly one character, but could be anything”\n- `$` means “end of text”.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nathletes %>% select(matches(\"^.t$\"))\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n# A tibble: 202 x 2\n Ht Wt\n \n 1 177. 59.9\n 2 173. 63 \n 3 176 66.3\n 4 170. 60.7\n 5 183 72.9\n 6 178. 67.9\n 7 177. 67.5\n 8 174. 74.1\n 9 174. 68.2\n10 174. 68.8\n# i 192 more rows\n```\n:::\n:::\n\n\n\n## Choosing columns by property\n\n- Use `where` as with summarizing several columns\n- eg, to choose text columns:\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nathletes %>% select(where(is.character))\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n# A tibble: 202 x 2\n Sex Sport \n \n 1 female Netball\n 2 female Netball\n 3 female Netball\n 4 female Netball\n 5 female Netball\n 6 female Netball\n 7 female Netball\n 8 female Netball\n 9 female Netball\n10 female Netball\n# i 192 more rows\n```\n:::\n:::\n\n\n\n\n## Choosing rows by number \n\n\\tiny\n\n\n::: {.cell}\n\n```{.r .cell-code}\nathletes %>% slice(16:25)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n# A tibble: 10 x 13\n Sex Sport RCC WCC Hc Hg Ferr BMI SSF `%Bfat` LBM\n \n 1 female Netba~ 4.25 10.7 39.5 13.2 127 24.5 157. 26.5 54.5\n 2 female Netba~ 4.46 10.9 39.7 13.7 102 24.0 116. 23.0 57.2\n 3 female Netba~ 4.4 9.3 40.4 13.6 86 26.2 182. 30.1 54.4\n 4 female Netba~ 4.83 8.4 41.8 13.4 40 20.0 71.6 13.9 57.6\n 5 female Netba~ 4.23 6.9 38.3 12.6 50 25.7 144. 26.6 61.5\n 6 female Netba~ 4.24 8.4 37.6 12.5 58 25.6 201. 35.5 53.5\n 7 female Netba~ 3.95 6.6 38.4 12.8 33 19.9 68.9 15.6 54.1\n 8 female Netba~ 4.03 8.5 37.7 13 51 23.4 104. 19.6 55.4\n 9 female BBall 3.96 7.5 37.5 12.3 60 20.6 109. 19.8 63.3\n10 female BBall 4.41 8.3 38.2 12.7 68 20.7 103. 
21.3 58.6\n# i 2 more variables: Ht , Wt \n```\n:::\n:::\n\n\n\\normalsize\n\n\n\n## Non-consecutive rows \n\n\\tiny\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nathletes %>% \n slice(10, 13, 17, 42)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n# A tibble: 4 x 13\n Sex Sport RCC WCC Hc Hg Ferr BMI SSF `%Bfat` LBM\n \n1 female Netball 4.44 9.7 41.4 14.1 64 22.8 126. 25.0 51.6\n2 female Netball 4.02 9.1 37.7 12.7 107 23.0 77 18.1 57.3\n3 female Netball 4.46 10.9 39.7 13.7 102 24.0 116. 23.0 57.2\n4 female Row 4.37 8.1 41.8 14.3 53 23.5 98 21.8 63.0\n# i 2 more variables: Ht , Wt \n```\n:::\n:::\n\n\n\n\\normalsize\n\n## A random sample of rows\n\n\\tiny\n\n\n::: {.cell}\n\n```{.r .cell-code}\nathletes %>% slice_sample(n=8)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n# A tibble: 8 x 13\n Sex Sport RCC WCC Hc Hg Ferr BMI SSF `%Bfat` LBM\n