-
Notifications
You must be signed in to change notification settings - Fork 1
/
Copy pathexercise_01.Rmd
259 lines (201 loc) · 9.94 KB
/
exercise_01.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
---
title: "mlr3 - key concepts"
output:
learnr::tutorial:
progressive: yes
allow_skip : false
runtime: shiny_prerendered
---
```{r setup, include=FALSE}
set.seed(123)
library(learnr)
library(rchallenge)
library(mlr3)
library(mlr3learners)
library(skimr)
library(mlr3viz)
library(rpart.plot)
library(ranger)
source("test_functions.R", local = knitr::knit_global())
tutorial_options(
exercise.checker = checker_endpoint,
exercise.reveal_solution = FALSE,
exercise.completion = FALSE,
)
options(
tutorial.storage = list(
save_object = function(...) { },
get_object = function(...) NULL,
get_objects = function(...) list(),
remove_all_objects = function(...) { }
)
)
tutorial(
allow_skip = FALSE,
progressive = TRUE
)
```
## 1 Data {data-progressive=FALSE}
In this tutorial, we want to give you an idea of how you can use mlr3 to build your own applied machine learning project. mlr3 is an open-source collection of packages providing a unified interface for machine learning in R.
Throughout the exercise, we will use two well-known data sets: "Boston housing", a regression problem, and "German credit", a classification problem. In general, we will use "Boston housing" as an example and ask you to implement something similar for "German credit".
### 1.1 Boston housing
[Boston housing](https://www.cs.toronto.edu/~delve/data/boston/bostonDetail.html) contains information on housing for 506 census tracts of the city of Boston from the 1970 census. In this use case, we want to build a model that predicts the median value (in USD 1,000s) of owner-occupied homes in each census tract (medv). This is a regression problem.
```{r}
data("BostonHousing", package = "mlbench")
skim(BostonHousing)
```
### 1.2 German credit
For the [German credit](https://archive.ics.uci.edu/ml/datasets/statlog+(german+credit+data)) data, we are interested in predicting a person's credit risk (good or bad), by using information contained in 20 features describing an individual's personal, demographic and financial status. This is a binary classification problem.
```{r}
data("german", package = "rchallenge")
skim(german)
```
## 2 Tasks
In mlr3, we usually start by specifying *what* should be modeled. This is called **Task**. A task describes the type of machine learning problem, with the most common ones being regression and classification. Furthermore, the task contains the data, the target variable, and a few other elements.
### 2.1 Defining a task
In order to define a task with mlr3, we need to create an object: for regression problems, this is an object of class **TaskRegr**, for classification problems, an object of class **TaskClassif**. There are several ways to define a task in mlr3. If we wanted to instantiate the task for our regression problem, we could use the following code:
```{r}
task_regr <- as_task_regr(x = BostonHousing, id = "BostonHousing", target = "medv")
```
In a similar manner, we want to define a task for our classification problem. Create a new classification task for the German credit problem!
<div id="task-formal">
**Note:** You can proceed in a way similar to the regression task, but there is more than one option to do this with mlr3. The only thing that matters is that you instantiate an object called `task_classif` of class TaskClassif with the correct data backend and `credit_risk` as target variable. Take a look into the mlr3 documentation of tasks with '?mlr3::Task'
</div>
To test your code click "Submit Answer". To run your code click "Run Code".
```{r task, exercise=TRUE}
task_classif <- "your code"
```
```{r task-check}
checker_endpoint()
```
```{r task-hint-1}
task_classif <- as_task_classif("your code")
```
### 2.2 Questions
```{r quiz_task, echo=FALSE}
question("Which of the following statements are true?",
answer("Tasks objects contain the data backend.", correct = TRUE),
answer("When we instantiate a task, we have to specify the learner we want to solve it with."),
answer("Tasks store meta-information about our data (e.g., the names and types of the feature variables in the data)", correct = TRUE),
allow_retry = TRUE
)
```
### 2.3 Creating a train/test split
Often, it is not a good idea to evaluate a model's performance on the same observations that we use for training the model. In mlr3, we can use the `partition()`-function to assign the ids of a task object to either the train or the test set. Here, we use 80% of the observations for training and 20% for testing (selected randomly):
```{r}
split_regr <- mlr3::partition(task_regr, ratio = 0.8)
```
Now it's your turn. Split the ids of the backend of `task_classif` into ids for training and testing using a ratio of 0.8.
<div id="split-hint">
**Note:** You don't have to use the `partition()`-function!
The only thing that matters is that you have a named list (variable name: `split_classif`) consisting of two integer vectors (names: `train` and `test`) which contain the corresponding ids of the train/test observations.
</div>
```{r data, exercise=TRUE, exercise.reveal_solution = FALSE}
split_classif <- "your code"
```
```{r data-check}
checker_endpoint()
```
## 3 Learner
After we have defined *what* should be modeled with a task, we can now specify *how*. The actual learning is done via the so called **Learner** class in mlr3.
### 3.1 Defining a learner
For this exercise,we will continue with the train-test split provided by us:
```{r, eval=FALSE}
split_classif <- mlr3::partition(task_classif, ratio = 0.8)
```
In the code chunk below, we have already defined our learner for the regression problem. Here, we use `regr.ranger` to specify that we want to fit a random forest with the `ranger` package.
```{r}
lrn_regr <- lrn("regr.ranger")
```
For our classification task we want to fit a classification tree with the `rpart` package. Create a new classification tree learner for the German credit classification problem!
```{r learner, exercise=TRUE}
lrn_classif <- "Your code"
```
```{r learner-hint-1}
#Have a look at the learners available in mlr3 and find the one for rpart: https://mlr3.mlr-org.com/reference/mlr_learners.html
```
```{r learner-check}
checker_endpoint()
```
### 3.2 Setting hyperparameters
Now, we want to further specify certain hyperparameters of our learners. In general, you can specify these hyperparameters during the instantiation of the learner. However, you can also access the values of the parameter set of a learner via `$param_set$values` and set the hyperparameters after you have already defined your learner. For example, we have set the `mtry`-parameter of the regression random forest to 3:
```{r, eval=FALSE}
lrn_regr$param_set$values$mtry <- 3
```
For our classification tree we want to limit the maximum depth of the tree to 3. Set the relevant hyperparameter of our learner to 3!
```{r hp, exercise=TRUE}
"your code"
```
```{r hp-hint-1}
#You can look at the available hyperparameters of a learner via <name_of_the_learner>$param_set$ids()
```
```{r hp-check}
checker_endpoint()
```
### 3.3 Training a model
Now it's time to use our learner to train a model! Again, we have already trained the regression model on the corresponding task, using only the observations reserved for training in our train/test split:
```{r}
lrn_regr$train(task = task_regr, row_ids = split_regr$train)
```
Now it's your turn to use the classification learner we have defined in the previously to fit a model on the corresponding task! For training, make sure to only use the observations you have assigned to the training set.
```{r train, exercise=TRUE}
"your code"
```
```{r train-check}
checker_endpoint()
```
### 3.4 A look into the fitted model
Calling the `train()`-function on a learner produces a fitted model. This model is stored within the learner object itself and can be accessed via the "$model"-field of a learner.
Lets' have a look into our random forest:
```{r pg3, exercise = TRUE}
print(lrn_classif$model)
```
Using the `rpart.plot`-package we can also visualize our fitted tree. Just hit the "Run"-Button and plot the tree!
```{r pg4, exercise=TRUE}
rpart.plot(lrn_classif$model)
```
Look into the information of the fitted classification model and check the correct answer:
```{r quiz_model, echo=FALSE}
question("What percentage of observations is predicted as \"bad\"?",
answer("36%", correct = TRUE),
answer("12%"),
answer("8%"),
answer("9%")
)
```
### 3.5 Prediction
Now it's time to make some predictions using our test data. This can be done through the `predict()`-function of a (trained) learner, which returns an object of class **Prediction**. If we print this object, it returns a `data.table` containing information on the row ids, the true values and predicted response variable.
```{r}
pred_regr <- lrn_regr$predict(task_regr, row_ids = split_regr$test)
print(pred_regr)
```
Generate predictions for the classification problem using the classification tree model trained before! Use the test data defined previously!
```{r predict, exercise=TRUE}
pred_classif <- "your code"
```
```{r predict-check}
checker_endpoint()
```
## 4 Performance evaluation
To estimate the performance of a model on new unseen data, we can use the test data to compare our model's predictions with the true values (holdout splitting). Later in the tutorial, we will consider other (more refined) methods for performance evaluation.
In mlr3, we can use the `score()`-function to retrieve performance measures from a **Prediction** object. For the regression problem, the default measure is the MSE:
```{r}
pred_regr$score()
```
Other measures can be retrieved by specifying the function call. For example, the mean absolute error (MAE):
```{r}
pred_regr$score(measures = msr("regr.mae"))
```
Similarly, evaluate our classification model with respect to the accuracy of its predictions on the test data!
```{r pe, exercise=TRUE}
"your code"
```
```{r pe-check}
checker_endpoint()
```
```{r pe-hint-1}
#Have a look at the measures available in mlr3 and find the one for the accuracy using ?mlr3::Measure
```
```{r pe-solution}
pred_classif$score(measures = msr("classif.acc"))
```