Commit 233b9e0

added notes from chapter 5 video (#3)

schafert authored Jul 24, 2024
1 parent 37e68de commit 233b9e0

Showing 5 changed files with 1,874 additions and 5 deletions.
145 changes: 141 additions & 4 deletions 05_from-scratch-model.Rmd
@@ -2,12 +2,149 @@

**Learning objectives:**

- Build a tabular model from "scratch"

## Getting Started {-}

- Titanic data from Kaggle
- Clean notebook from GitHub
- Jeremy uses Paperspace; I uploaded the notebook to Kaggle in the Titanic competition

```{r, message=FALSE}
library(tidyverse)

# Load the Kaggle Titanic training data
df <- read_csv("titanic/train.csv")

# Count missing values per column
# (these counts match only with read_csv()'s default NA handling)
df |>
  is.na() |>
  colSums()
```


## Cleaning the data {-}

- Impute missing values: median for numeric columns, mode for the rest
- Discussion on imputation:
  + good enough for a baseline method
  + better than throwing away data
  + Jeremy "doesn't throw out rows and doesn't throw out columns"

```{r}
# Replace NAs column by column: median for numeric columns,
# mode (most frequent level) for everything else
df <- df |>
  replace_na(map(df, \(x)
    if (is.numeric(x)) {
      median(x, na.rm = TRUE)
    } else {
      table(x) |> which.max() |> names()
    }))

df |>
  is.na() |>
  colSums()
summary(df)
```


- Skewed data is not easily handled by regression; a log transform is suggested

```{r}
hist(df$Fare)
df$LogFare <- log(df$Fare + 1)  # +1 keeps zero fares finite
hist(df$LogFare)
```


- Dummy variables for categorical variables; fastai also creates an "other" level, which allows new levels to show up in the test data (see the note after the next chunk)

```{r, message = FALSE}
unique(df$Pclass) |> sort()
unique(df$Embarked) |> sort()

# One-hot encode the categorical columns
df <- df |>
  fastDummies::dummy_cols(select_columns = c("Sex", "Pclass", "Embarked"))
head(df)
```
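
Note that `fastDummies` has no equivalent of fastai's automatic "other" level. A minimal sketch of one way to get that safety net in R with `forcats` (the `test_embarked` vector and its unseen level `"X"` are made up for illustration):

```{r}
library(forcats)

# Levels seen during training
train_levels <- unique(df$Embarked)

# Hypothetical test vector containing a level ("X") never seen in training
test_embarked <- factor(c("S", "C", "X"))

# Lump anything outside the training levels into an explicit "Other" level,
# then dummy-code as before
fct_other(test_embarked, keep = train_levels, other_level = "Other")
```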

```{r}
# Dependent variable (target) and matrix of independent variables
t_dep <- df$Survived
t_indep <- df |>
  select(Age, SibSp, Parch, LogFare, Sex_female:Embarked_S) |>
  as.matrix()
head(t_indep)
dim(t_indep)
```


## Setting up the linear model {-}

- Initialize random coefficients, setting a seed for reproducibility

```{r}
set.seed(442)
n_coeff <- ncol(t_indep)
coeffs <- runif(n_coeff) - 0.5  # one coefficient per predictor, uniform on (-0.5, 0.5)
```

- Broadcasting in NumPy (recycling in R): more concise, readable, and optimized. I think the rules are stricter in Python than in R

```{r}
# Multiply each column by its coefficient; transposing lets R recycle
# coeffs down the rows of t(t_indep)
(t(t_indep) * coeffs) |>
  t() |>
  head()
```
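
As an aside, base R's `sweep()` expresses the same column-wise multiplication without the double transpose (same result; a matter of taste):

```{r}
# Multiply each column of t_indep by the matching coefficient
sweep(t_indep, 2, coeffs, `*`) |>
  head()
```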

- Normalize columns: the two most common approaches are dividing by the maximum, or subtracting the mean and dividing by the standard deviation

```{r}
# Scale each column by its maximum so all predictors sit in [0, 1]
t_indep <- t(t(t_indep) / apply(t_indep, 2, max))
(t(t_indep) * coeffs) |>
  t() |>
  head()
```

- Decide on a loss function; here, the mean absolute error

```{r}
# Linear predictions and mean absolute error against the target
preds <- t_indep %*% coeffs
loss <- abs(preds - t_dep) |>
  mean()
loss
```

- Save useful functions for repetition

```{r}
calc_preds <- function(coeffs, indeps) {
  indeps %*% coeffs
}

calc_loss <- function(coeffs, indeps, deps) {
  abs(calc_preds(coeffs, indeps) - deps) |>
    mean()
}
```

## Training the linear model {-}

- First, set up the gradient descent step (sketched below)
- Create a validation split
- Use a sigmoid on the final activation, since the dependent variable is binary
- Let's experiment with the deep learning code section as suggested
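
A minimal sketch of what these steps might look like in R, with a hand-derived gradient (no autograd here, unlike the PyTorch notebook). The 80/20 split, the learning rate, and redefining `calc_preds()` to end in a sigmoid are illustrative assumptions, not the notebook's exact code:

```{r}
# Validation split (assumption: 80/20 at random, like fastai's RandomSplitter)
val_idx <- sample(nrow(t_indep), size = round(0.2 * nrow(t_indep)))
trn_indep <- t_indep[-val_idx, ]; trn_dep <- t_dep[-val_idx]
val_indep <- t_indep[val_idx, ];  val_dep <- t_dep[val_idx]

# Sigmoid final activation squashes predictions into (0, 1)
calc_preds <- function(coeffs, indeps) {
  plogis(indeps %*% coeffs)
}

# Hand-derived gradient of mean(|sigmoid(X w) - y|) with respect to w
calc_gradient <- function(coeffs, indeps, deps) {
  preds <- calc_preds(coeffs, indeps)
  t(indeps) %*% (sign(preds - deps) * preds * (1 - preds)) / nrow(indeps)
}

# One gradient descent step (lr is an illustrative learning rate)
one_step <- function(coeffs, indeps, deps, lr = 2) {
  coeffs - lr * calc_gradient(coeffs, indeps, deps)
}

coeffs <- one_step(coeffs, trn_indep, trn_dep)
calc_loss(coeffs, val_indep, val_dep)  # validation MAE after one step
```

Looping `one_step()` for a few dozen iterations while tracking the validation loss gives the full training loop.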

## Jeremy's opinions {-}

- Generally, feature engineering for tabular data requires more thinking than it does for image data
- Start lazy
- Use a framework

## Meeting Videos {-}

4 changes: 3 additions & 1 deletion DESCRIPTION
@@ -6,8 +6,10 @@ Authors@R:
 URL: https://r4ds.github.io/bookclub-pdl,
     https://github.com/r4ds/bookclub-pdl
 Depends:
-    R (>= 3.1.0)
+    R (>= 3.1.0),
+    tidyverse
 Imports:
     bookdown,
+    fastDummies,
     rmarkdown
 Encoding: UTF-8