99-23-model_basics.Rmd

# (PART\*) Removed chapters {-}

# Model basics {-}

These chapters have been removed from the book since the time that older cohorts covered them. We're keeping the old notes in case anybody finds them useful.

All code blocks for these chapters have been disabled to avoid mystery bugs.

```{r 99-eval-off, include=FALSE}
knitr::opts_chunk$set(eval = FALSE)
```

**Learning objectives:**

- Evaluate a set of **simple linear models.**
- Evaluate **model fits** through **visualization.**
- Use R's **formula syntax** to **specify models.**
- Recognize **model families.**

## A bit of Mathematics:

Our model is a function of the observed data. This is what we aim to achieve:

$$Y=f(x)$$

The function is made of some coefficients and predictors:

$$f(x)=\beta_{0}+\beta_{1}x$$

When we make a model, we attempt to replicate $Y=f(x)$ by applying mathematical models to our observed data. 
The final objective could be the prediction of an outcome. 

$\hat{Y}$ is the result of our model, and this values contain some noise, or residual values which make the model to be slightly different from real values.

$$\hat{Y}=\hat{\beta_{0}}+\hat{\beta_{1}}x+\epsilon$$
The residuals are identified by the difference between $Y$ and $\hat{Y}$:

$$Y-\hat{Y}=\epsilon$$

What we want is to reduce as much as possible this amount of residuals by selecting different type of models and assessing them on different parameter levels.
For this purpose we use some metrics to identify the residual level, such as:

- the r squared (rsq) $R^2$ 
- $\mathrm{adjustedR^2}$
- the residual sum of squares $RSS$ 
- and others

To make the formula in Rmarkdown have a look at this resource: [markdown-extensions](https://bookdown.org/yihui/bookdown/markdown-extensions-by-bookdown.html)


----------

Here is the visualization of simulated data from {modelr} package, we have a look at different slope levels.
```{r 99-23-plot1, message=FALSE, warning=FALSE, paged.print=FALSE}
library(tidyverse)
library(manipulate)
library(modelr)
data(sim1)  
mod1 <- lm(y~x,sim1)

ggplot(sim1,aes(x,y))+
             geom_point()+
             geom_smooth(method="lm")+
             theme_bw()
```

We can use the function `manipulate()` from {manipulate} package, to assess the level of the slope to identify the model line:

    manipulate(ggplot(sim1,aes(x,y))+
             geom_point()+
             geom_smooth(method="lm")+
             geom_abline(intercept=mod1$coefficients[1],
                         slope=r)+
             theme_bw(),
             r=slider(min=1,max=3,step=0.1))

-------------

## Linear models and non linear models


**Linear models** assume a relationship of the form:

      y = a_1 * x1 + a_2 * x2 + ... + a_n * xn

and assume that the residuals, the distances between the observed and predicted values, are generally normal distributed or have a normal distribution.


Types of Linear models:

- Linear models - `stats::lm()`
- Generalised linear models - `stats::glm()`
- Generalised additive models - `mgcv::gam()`
- Penalised linear models - `glmnet::glmnet()`
- Robust linear models - `MASS::rlm()` 
- Trees - `rpart::rpart()`


**Non linear models** are models with a non-linear trend. 

There are some models that require predictors that have been centered and scaled:

- neural networks
- K-nearest neighbors
- support vector machines SVG

while others require a traditional response surface design model expansion (quadratic and two-way interactions).


### Transformations

We can switch between linear and non-linear models with some transformations:

```{r message=FALSE, warning=FALSE, paged.print=FALSE}
# weighted regression
data(sim3)
sim3_2<-sim3%>%mutate(x3=case_when(x2=="a"~1,
                                   x2=="b"~2,
                                   x2=="c"~3,
                                   x2=="d"~4))

mod3w <- lm(y~x1, sim3_2, weights = x3)
p1<- ggplot(sim3_2, aes(x1, y)) + 
  geom_point()+
  geom_smooth()+
  labs(title="Linear model")

# polynomial transformation
mod3t <- lm(y~poly(x1,3),sim3_2,weights = x3)
p2 <- ggplot(sim3_2, aes(x1, y)) + 
  geom_point()+
  geom_smooth(method="lm", se=TRUE, fill=NA,
                formula=y ~ poly(x, 3, raw=TRUE),colour="red" )+
  labs(title="Polynomial transf.")

# splines
library(splines)
mod3s <- lm(y ~ bs(x1,3),sim3_2,weights = x3)

p3 <- ggplot(sim3_2, aes(x1, y)) + 
  geom_point()+
  geom_smooth(method="lm", se=TRUE, fill=NA,
                formula=y~splines::bs(x, 3),colour="red" )+
  labs(title="Spline transf.")

library(patchwork)
p1+p2+p3
```

When you fit a model, you apply the estimates coefficients of your observed data to the model f

$$y = a_1 + a_2x$$
    
$$y = 7 + 3x$$
    
So, the conversion of the linear model formula $y\sim{x}$ is $y = a_1 + a_2x$    

Behind the scenes, what happens is:
```{r eval=FALSE, include=T}
model1 <- function(a, data) {
  a[1] + data$x * a[2] + a[3]
}
```


And we can see it with the function `model_matrix()`:

```{r}
model_matrix(sim3, y ~ x1)
```

## Prediction

To make the prediction of our model,

```{r}
sim1_mod <- lm(y ~ x, data = sim1)
coef(sim1_mod)
```

we construct a `data_grid()` for simulating new data to be predicted:

```{r}
grid <- sim1 %>% 
  data_grid(x) 
```

and add the prediction with `add_predictions()` function:
```{r}
data_pred <- grid %>% 
  add_predictions(sim1_mod)

head(data_pred)
```
    
```{r}
ggplot(sim1, aes(x,y)) +
  geom_point() +
  geom_line(aes(y = pred), data = data_pred, colour = "red", linewidth = 2)+
  geom_smooth(method="lm",se=F)
```
    
as well as before we can add residuals to our data to visualize their trend with `add_residuals()` function:
```{r}
data_res <- sim1 %>% 
  add_residuals(sim1_mod)

head(data_res)
```

```{r}
ggplot(data_res, aes(x, resid)) + 
  geom_ref_line(h = 0) +
  geom_point() 
```


### Interaction

When we need more investigations:

```{r}
mod1 <- lm(y ~ x1 + x2, data = sim3)
mod2 <- lm(y ~ x1 * x2, data = sim3)
```


In this case, as we have two models to compare we `gather_predictions`:
```{r}
grid2 <- sim3 %>% 
  data_grid(x1, x2) %>% 
  gather_predictions(mod1, mod2)


grid2%>%count(model)

head(grid2)
```


```{r}
ggplot(sim3, aes(x1, y, colour = x2)) + 
  geom_point() + 
  geom_line(data = grid2, aes(y = pred)) + 
  facet_wrap(~ model)
```

as well as `gather_residuals()`:
```{r}
sim3 %>% 
  gather_residuals(mod1, mod2)%>%
  ggplot(aes(x1, resid, colour = x2)) + 
  geom_point() + 
  facet_grid(model ~ x2)
```

To conclude we look at the output of different transformations:
```{r}
sim5 <- tibble(
  x = seq(0, 3.5 * pi, length = 50),
  y = 4 * sin(x) + rnorm(length(x))
)

ggplot(sim5, aes(x, y)) +
  geom_point()
```

```{r}
mod1 <- lm(y ~ ns(x, 1), data = sim5)
mod2 <- lm(y ~ ns(x, 2), data = sim5)
mod3 <- lm(y ~ ns(x, 3), data = sim5)
mod4 <- lm(y ~ ns(x, 4), data = sim5)
mod5 <- lm(y ~ ns(x, 5), data = sim5)

grid <- sim5 %>% 
  data_grid(x = seq_range(x, n = 50, expand = 0.1)) %>% 
  gather_predictions(mod1, mod2, mod3, mod4, mod5, .pred = "y")

ggplot(sim5, aes(x, y)) + 
  geom_point() +
  geom_line(data = grid, colour = "red") +
  facet_wrap(~ model)
```


## Resources:

- [tidymodels](https://www.tidymodels.org/start/models/)
- [markdown-extensions-by-bookdown](https://bookdown.org/yihui/bookdown/markdown-extensions-by-bookdown.html)
- [geom_smooth](https://ggplot2.tidyverse.org/reference/geom_smooth.html)

## Meeting Videos

### Cohort 5

`r knitr::include_url("https://www.youtube.com/embed/hEjE7PYLCL8")`

<details>
  <summary> Meeting chat log </summary>
  
```
00:10:00	Jon Harmon (jonthegeek):	https://www.youtube.com/c/TidyX_screencast/playlists?app=desktop
00:10:13	Jon Harmon (jonthegeek):	https://twitter.com/kierisi
00:10:42	Jon Harmon (jonthegeek):	https://twitter.com/FGazzelloni
00:10:48	Federica Gazzelloni:	@fgazzelloni
00:11:10	Jon Harmon (jonthegeek):	https://twitter.com/search?q=%23TidyTuesday
00:12:17	Jon Harmon (jonthegeek):	https://twitter.com/RLadiesGlobal
00:12:36	Federica Gazzelloni:	https://rladies.org/
00:13:03	Jon Harmon (jonthegeek):	https://twitter.com/RLadiesLima
00:38:04	Njoki Njuki Lucy:	for the linear regression model, what's the difference having the interaction effect as * or :? like y~x1*x2/ y~x1:x2
I have seen both being used but I haven't understood the difference yet 😄
Or for one (*) we must include the main effects?
00:39:50	Njoki Njuki Lucy:	aah okay :)
00:48:05	Jon Harmon (jonthegeek):	* includes : plus the terms without interaction, so in most cases it makes sense to just use *:
y ~ x1*x2 = y ~ x1 + x2 + x1:x2
00:49:50	Njoki Njuki Lucy:	thanks!
01:05:06	Ryan Metcalf:	Maybe another thought towards differences between `baseR`, `tidyverse`, or `tidy models` is like having a simple toolkit for your home…something small like Hammer, Screwdriver, and wrench set versus having an entire tool box of specific tools. It gives more granular choice for specific jobs.
01:05:28	Ryan Metcalf:	Or an entire workshop with power tools too!!!
01:06:59	Becki R. (she/her):	That's a helpful description, Ryan, thanks!
01:07:33	Ryan Metcalf:	Both kids live on their own. I hate showing up to hang a picture and realize the tool kits we gave for Xmas doesn’t have a tool to make the job easier.
```
</details>

`r knitr::include_url("https://www.youtube.com/embed/9DZV6bvifHA")`

<details>
  <summary> Meeting chat log </summary>
  
```
00:51:07	Becki R. (she/her):	i'm fading fast - catch up with you all next week.
00:52:36	Federica Gazzelloni:	https://www.tidymodels.org/start/
00:55:00	Federica Gazzelloni:	https://rpruim.github.io/s341/S19/from-class/MathinRmd.html
00:55:46	Federica Gazzelloni:	https://bookdown.org/yihui/bookdown/markdown-extensions-by-bookdown.html
00:56:00	Federica Gazzelloni:	https://ggplot2.tidyverse.org/reference/geom_smooth.html
00:56:49	Ryan Metcalf:	Another LaTeX code editor: https://latex.codecogs.com/eqneditor/editor.php
00:57:14	Federica Gazzelloni:	thanks ryan
```
</details>

### Cohort 6

`r knitr::include_url("https://www.youtube.com/embed/GDIOmvhOgNk")`

<details>
  <summary> Meeting chat log </summary>
  
```
00:44:13	Adeyemi Olusola:	Yeah
00:44:41	Marielena Soilemezidi:	great!
00:44:43	Adeyemi Olusola:	Thank you Daniel
00:44:55	Marielena Soilemezidi:	thanks for the presentation Daniel!
00:45:14	Daniel Adereti:	https://personal.math.ubc.ca/~anstee/math104/newtonmethod.pdf
```
</details>

`r knitr::include_url("https://www.youtube.com/embed/LD2xOX_svU0")`

<details>
  <summary> Meeting chat log </summary>
  
```
00:17:56	Daniel Adereti:	http://vita.had.co.nz/papers/model-vis.html
00:28:26	Daniel Adereti:	https://www.jstor.org/stable/2346786
```
</details>

`r knitr::include_url("https://www.youtube.com/embed/VrFzuaGDiJo")`

<details>
  <summary> Meeting chat log </summary>
  
```
00:16:05	Daniel:	http://bit.ly/wilkrog
00:40:08	Adeyemi Olusola:	Apologies for coming late.
00:40:30	Marielena Soilemezidi:	no worries! :)
```
</details>