-
Notifications
You must be signed in to change notification settings - Fork 41
/
99-23-model_basics.Rmd
377 lines (273 loc) · 10.7 KB
/
99-23-model_basics.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
# (PART\*) Removed chapters {-}
# Model basics {-}
These chapters have been removed from the book since the time that older cohorts covered them. We're keeping the old notes in case anybody finds them useful.
All code blocks for these chapters have been disabled to avoid mystery bugs.
```{r 99-eval-off, include=FALSE}
knitr::opts_chunk$set(eval = FALSE)
```
**Learning objectives:**
- Evaluate a set of **simple linear models.**
- Evaluate **model fits** through **visualization.**
- Use R's **formula syntax** to **specify models.**
- Recognize **model families.**
## A bit of Mathematics:
Our model is a function of the observed data. This is what we aim to achieve:
$$Y=f(x)$$
The function is made of some coefficients and predictors:
$$f(x)=\beta_{0}+\beta_{1}x$$
When we make a model, we attempt to replicate $Y=f(x)$ by applying mathematical models to our observed data.
The final objective could be the prediction of an outcome.
$\hat{Y}$ is the result of our model, and this values contain some noise, or residual values which make the model to be slightly different from real values.
$$\hat{Y}=\hat{\beta_{0}}+\hat{\beta_{1}}x+\epsilon$$
The residuals are identified by the difference between $Y$ and $\hat{Y}$:
$$Y-\hat{Y}=\epsilon$$
What we want is to reduce as much as possible this amount of residuals by selecting different type of models and assessing them on different parameter levels.
For this purpose we use some metrics to identify the residual level, such as:
- the r squared (rsq) $R^2$
- $\mathrm{adjustedR^2}$
- the residual sum of squares $RSS$
- and others
To make the formula in Rmarkdown have a look at this resource: [markdown-extensions](https://bookdown.org/yihui/bookdown/markdown-extensions-by-bookdown.html)
----------
Here is the visualization of simulated data from {modelr} package, we have a look at different slope levels.
```{r 99-23-plot1, message=FALSE, warning=FALSE, paged.print=FALSE}
library(tidyverse)
library(manipulate)
library(modelr)
data(sim1)
mod1 <- lm(y~x,sim1)
ggplot(sim1,aes(x,y))+
geom_point()+
geom_smooth(method="lm")+
theme_bw()
```
We can use the function `manipulate()` from {manipulate} package, to assess the level of the slope to identify the model line:
manipulate(ggplot(sim1,aes(x,y))+
geom_point()+
geom_smooth(method="lm")+
geom_abline(intercept=mod1$coefficients[1],
slope=r)+
theme_bw(),
r=slider(min=1,max=3,step=0.1))
-------------
## Linear models and non linear models
**Linear models** assume a relationship of the form:
y = a_1 * x1 + a_2 * x2 + ... + a_n * xn
and assume that the residuals, the distances between the observed and predicted values, are generally normal distributed or have a normal distribution.
Types of Linear models:
- Linear models - `stats::lm()`
- Generalised linear models - `stats::glm()`
- Generalised additive models - `mgcv::gam()`
- Penalised linear models - `glmnet::glmnet()`
- Robust linear models - `MASS::rlm()`
- Trees - `rpart::rpart()`
**Non linear models** are models with a non-linear trend.
There are some models that require predictors that have been centered and scaled:
- neural networks
- K-nearest neighbors
- support vector machines SVG
while others require a traditional response surface design model expansion (quadratic and two-way interactions).
### Transformations
We can switch between linear and non-linear models with some transformations:
```{r message=FALSE, warning=FALSE, paged.print=FALSE}
# weighted regression
data(sim3)
sim3_2<-sim3%>%mutate(x3=case_when(x2=="a"~1,
x2=="b"~2,
x2=="c"~3,
x2=="d"~4))
mod3w <- lm(y~x1, sim3_2, weights = x3)
p1<- ggplot(sim3_2, aes(x1, y)) +
geom_point()+
geom_smooth()+
labs(title="Linear model")
# polynomial transformation
mod3t <- lm(y~poly(x1,3),sim3_2,weights = x3)
p2 <- ggplot(sim3_2, aes(x1, y)) +
geom_point()+
geom_smooth(method="lm", se=TRUE, fill=NA,
formula=y ~ poly(x, 3, raw=TRUE),colour="red" )+
labs(title="Polynomial transf.")
# splines
library(splines)
mod3s <- lm(y ~ bs(x1,3),sim3_2,weights = x3)
p3 <- ggplot(sim3_2, aes(x1, y)) +
geom_point()+
geom_smooth(method="lm", se=TRUE, fill=NA,
formula=y~splines::bs(x, 3),colour="red" )+
labs(title="Spline transf.")
library(patchwork)
p1+p2+p3
```
When you fit a model, you apply the estimates coefficients of your observed data to the model f
$$y = a_1 + a_2x$$
$$y = 7 + 3x$$
So, the conversion of the linear model formula $y\sim{x}$ is $y = a_1 + a_2x$
Behind the scenes, what happens is:
```{r eval=FALSE, include=T}
model1 <- function(a, data) {
a[1] + data$x * a[2] + a[3]
}
```
And we can see it with the function `model_matrix()`:
```{r}
model_matrix(sim3, y ~ x1)
```
## Prediction
To make the prediction of our model,
```{r}
sim1_mod <- lm(y ~ x, data = sim1)
coef(sim1_mod)
```
we construct a `data_grid()` for simulating new data to be predicted:
```{r}
grid <- sim1 %>%
data_grid(x)
```
and add the prediction with `add_predictions()` function:
```{r}
data_pred <- grid %>%
add_predictions(sim1_mod)
head(data_pred)
```
```{r}
ggplot(sim1, aes(x,y)) +
geom_point() +
geom_line(aes(y = pred), data = data_pred, colour = "red", linewidth = 2)+
geom_smooth(method="lm",se=F)
```
as well as before we can add residuals to our data to visualize their trend with `add_residuals()` function:
```{r}
data_res <- sim1 %>%
add_residuals(sim1_mod)
head(data_res)
```
```{r}
ggplot(data_res, aes(x, resid)) +
geom_ref_line(h = 0) +
geom_point()
```
### Interaction
When we need more investigations:
```{r}
mod1 <- lm(y ~ x1 + x2, data = sim3)
mod2 <- lm(y ~ x1 * x2, data = sim3)
```
In this case, as we have two models to compare we `gather_predictions`:
```{r}
grid2 <- sim3 %>%
data_grid(x1, x2) %>%
gather_predictions(mod1, mod2)
grid2%>%count(model)
head(grid2)
```
```{r}
ggplot(sim3, aes(x1, y, colour = x2)) +
geom_point() +
geom_line(data = grid2, aes(y = pred)) +
facet_wrap(~ model)
```
as well as `gather_residuals()`:
```{r}
sim3 %>%
gather_residuals(mod1, mod2)%>%
ggplot(aes(x1, resid, colour = x2)) +
geom_point() +
facet_grid(model ~ x2)
```
To conclude we look at the output of different transformations:
```{r}
sim5 <- tibble(
x = seq(0, 3.5 * pi, length = 50),
y = 4 * sin(x) + rnorm(length(x))
)
ggplot(sim5, aes(x, y)) +
geom_point()
```
```{r}
mod1 <- lm(y ~ ns(x, 1), data = sim5)
mod2 <- lm(y ~ ns(x, 2), data = sim5)
mod3 <- lm(y ~ ns(x, 3), data = sim5)
mod4 <- lm(y ~ ns(x, 4), data = sim5)
mod5 <- lm(y ~ ns(x, 5), data = sim5)
grid <- sim5 %>%
data_grid(x = seq_range(x, n = 50, expand = 0.1)) %>%
gather_predictions(mod1, mod2, mod3, mod4, mod5, .pred = "y")
ggplot(sim5, aes(x, y)) +
geom_point() +
geom_line(data = grid, colour = "red") +
facet_wrap(~ model)
```
## Resources:
- [tidymodels](https://www.tidymodels.org/start/models/)
- [markdown-extensions-by-bookdown](https://bookdown.org/yihui/bookdown/markdown-extensions-by-bookdown.html)
- [geom_smooth](https://ggplot2.tidyverse.org/reference/geom_smooth.html)
## Meeting Videos
### Cohort 5
`r knitr::include_url("https://www.youtube.com/embed/hEjE7PYLCL8")`
<details>
<summary> Meeting chat log </summary>
```
00:10:00 Jon Harmon (jonthegeek): https://www.youtube.com/c/TidyX_screencast/playlists?app=desktop
00:10:13 Jon Harmon (jonthegeek): https://twitter.com/kierisi
00:10:42 Jon Harmon (jonthegeek): https://twitter.com/FGazzelloni
00:10:48 Federica Gazzelloni: @fgazzelloni
00:11:10 Jon Harmon (jonthegeek): https://twitter.com/search?q=%23TidyTuesday
00:12:17 Jon Harmon (jonthegeek): https://twitter.com/RLadiesGlobal
00:12:36 Federica Gazzelloni: https://rladies.org/
00:13:03 Jon Harmon (jonthegeek): https://twitter.com/RLadiesLima
00:38:04 Njoki Njuki Lucy: for the linear regression model, what's the difference having the interaction effect as * or :? like y~x1*x2/ y~x1:x2
I have seen both being used but I haven't understood the difference yet 😄
Or for one (*) we must include the main effects?
00:39:50 Njoki Njuki Lucy: aah okay :)
00:48:05 Jon Harmon (jonthegeek): * includes : plus the terms without interaction, so in most cases it makes sense to just use *:
y ~ x1*x2 = y ~ x1 + x2 + x1:x2
00:49:50 Njoki Njuki Lucy: thanks!
01:05:06 Ryan Metcalf: Maybe another thought towards differences between `baseR`, `tidyverse`, or `tidy models` is like having a simple toolkit for your home…something small like Hammer, Screwdriver, and wrench set versus having an entire tool box of specific tools. It gives more granular choice for specific jobs.
01:05:28 Ryan Metcalf: Or an entire workshop with power tools too!!!
01:06:59 Becki R. (she/her): That's a helpful description, Ryan, thanks!
01:07:33 Ryan Metcalf: Both kids live on their own. I hate showing up to hang a picture and realize the tool kits we gave for Xmas doesn’t have a tool to make the job easier.
```
</details>
`r knitr::include_url("https://www.youtube.com/embed/9DZV6bvifHA")`
<details>
<summary> Meeting chat log </summary>
```
00:51:07 Becki R. (she/her): i'm fading fast - catch up with you all next week.
00:52:36 Federica Gazzelloni: https://www.tidymodels.org/start/
00:55:00 Federica Gazzelloni: https://rpruim.github.io/s341/S19/from-class/MathinRmd.html
00:55:46 Federica Gazzelloni: https://bookdown.org/yihui/bookdown/markdown-extensions-by-bookdown.html
00:56:00 Federica Gazzelloni: https://ggplot2.tidyverse.org/reference/geom_smooth.html
00:56:49 Ryan Metcalf: Another LaTeX code editor: https://latex.codecogs.com/eqneditor/editor.php
00:57:14 Federica Gazzelloni: thanks ryan
```
</details>
### Cohort 6
`r knitr::include_url("https://www.youtube.com/embed/GDIOmvhOgNk")`
<details>
<summary> Meeting chat log </summary>
```
00:44:13 Adeyemi Olusola: Yeah
00:44:41 Marielena Soilemezidi: great!
00:44:43 Adeyemi Olusola: Thank you Daniel
00:44:55 Marielena Soilemezidi: thanks for the presentation Daniel!
00:45:14 Daniel Adereti: https://personal.math.ubc.ca/~anstee/math104/newtonmethod.pdf
```
</details>
`r knitr::include_url("https://www.youtube.com/embed/LD2xOX_svU0")`
<details>
<summary> Meeting chat log </summary>
```
00:17:56 Daniel Adereti: http://vita.had.co.nz/papers/model-vis.html
00:28:26 Daniel Adereti: https://www.jstor.org/stable/2346786
```
</details>
`r knitr::include_url("https://www.youtube.com/embed/VrFzuaGDiJo")`
<details>
<summary> Meeting chat log </summary>
```
00:16:05 Daniel: http://bit.ly/wilkrog
00:40:08 Adeyemi Olusola: Apologies for coming late.
00:40:30 Marielena Soilemezidi: no worries! :)
```
</details>