# Notes {-}
## Introduction: Tree-based methods
- Involve **stratifying** or **segmenting** the predictor space into a number of simple regions
- Are simple and useful for interpretation
- However, basic decision trees are NOT competitive with the best supervised learning approaches in terms of prediction accuracy
- Thus, we also discuss **bagging**, **random forests**, and **boosting** (i.e., tree-based ensemble methods) to grow multiple trees which are then combined to yield a single consensus prediction
- These can result in dramatic improvements in prediction accuracy (but some loss of interpretability)
- Can be applied to both regression and classification
## Regression Trees
First, let's take a look at the `Hitters` dataset.
```{r}
#| label: 08-hitters-dataset
#| echo: false
library(dplyr)
library(tidyr)
library(readr)
df <- read_csv('./data/Hitters.csv') %>%
select(Names, Hits, Years, Salary) %>%
drop_na() %>%
mutate(log_Salary = log(Salary))
df
```
```{r}
#| label: 08-reg-trees-intro
#| echo: false
#| out-width: 100%
knitr::include_graphics("images/08_1_salary_data.png")
knitr::include_graphics("images/08_2_basic_tree.png")
```
- For the Hitters data, a regression tree for predicting the log salary of a baseball player based on:
1. number of years that he has played in the major leagues
2. number of hits that he made in the previous year
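As a minimal sketch (not from the book, and not evaluated here), such a tree can be fit in R with the `tree` package, using the `df` built above:
```{r}
#| label: 08-reg-tree-sketch
#| eval: false
# Fit a regression tree for log salary using Years and Hits (assumes the
# `tree` package is installed and `df` from the chunk above is available)
library(tree)

hitters_tree <- tree(log_Salary ~ Years + Hits, data = df)

summary(hitters_tree)           # variables used, number of terminal nodes, deviance
plot(hitters_tree)              # draw the tree
text(hitters_tree, pretty = 0)  # add split rules and leaf means
```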
## Terminology
```{r}
#| label: 08-decision-trees-terminology-1
#| echo: false
#| out-width: 100%
knitr::include_graphics("images/08_3_basic_tree_term.png")
```
```{r}
#| label: 08-decision-trees-terminology-2
#| echo: false
#| fig-cap: The three-region partition for the Hitters data set from the regression tree
#| out-width: 100%
knitr::include_graphics("images/08_4_hitters_predictor_space.png")
```
- Overall, the tree stratifies or segments the players into three regions of predictor space:
  - $R_1 = \{X \mid \text{Years} < 4.5\}$
  - $R_2 = \{X \mid \text{Years} \ge 4.5,\ \text{Hits} < 117.5\}$
  - $R_3 = \{X \mid \text{Years} \ge 4.5,\ \text{Hits} \ge 117.5\}$
  where $R_1$, $R_2$, and $R_3$ are the **terminal nodes** (leaves) and the green lines (the points where the predictor space is split) are the **internal nodes**
- The number in each leaf/terminal node is the mean of the response for the observations that fall there
## Interpretation of results: regression tree (Hitters data)
```{r}
#| label: 08-reg-trees-interpretation
#| echo: false
#| out-width: 100%
knitr::include_graphics("images/08_2_basic_tree.png")
```
1. `Years` is the most important factor in determining `Salary`: players with less experience earn lower salaries than more experienced players
2. Given that a player is less experienced, the number of `Hits` that he made in the previous year seems to play little role in his `Salary`
3. But among players who have been in the major leagues for 5 or more years, the number of Hits made in the previous year does affect Salary: players who made more Hits last year tend to have higher salaries
4. This is surely an over-simplification, but compared to a regression model, it is easy to display, interpret and explain
## Tree-building process (regression)
1. Divide the predictor space --- that is, the set of possible values for $X_1,X_2, . . . ,X_p$ --- into $J$ distinct and **non-overlapping** regions, $R_1,R_2, . . . ,R_J$
- Regions can have ANY shape - they don't have to be boxes
2. For every observation that falls into the region $R_j$, we make the same prediction: the **mean** of the response values in $R_j$
3. The goal is to find regions (here boxes) $R_1, . . . ,R_J$ that **minimize** the $RSS$, given by
$$\mathrm{RSS}=\sum_{j=1}^{J}\sum_{i{\in}R_j}(y_i - \hat{y}_{R_j})^2$$
where $\hat{y}_{R_j}$ is the **mean** response for the training observations within the $j$th box
- Unfortunately, it is **computationally infeasible** to consider every possible partition of the feature space into $J$ boxes.
## Recursive binary splitting
So, we take a top-down, greedy approach known as **recursive binary splitting**:
- **top-down** because it begins at the top of the tree and then successively splits the predictor space
- **greedy** because at each step of the tree-building process, the best split is made at that particular step, rather than looking ahead and picking a split that will lead to a better tree in some future step
1. First, select the predictor $X_j$ and the cutpoint $s$ such that splitting the predictor space into the regions ${\{X|X_j<s\}}$ and ${\{X|X_j{\ge}s}\}$ leads to the greatest possible reduction in RSS
2. Repeat the process looking for the best predictor and best cutpoint to split data further (i.e., split one of the 2 previously identified regions - not the entire predictor space) minimizing the RSS within each of the resulting regions
3. Continue until a stopping criterion is reached, e.g., no region contains more than five observations
4. Again, we predict the response for a given test observation using the **mean of the training observations** in the region to which that test observation belongs
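As an illustration of the greedy step 1 (a toy sketch, not the book's code), the best cutpoint $s$ for a single numeric predictor can be found by scanning all candidate cutpoints and computing the RSS of the two resulting regions:
```{r}
#| label: 08-best-split-sketch
#| eval: false
# Greedy search for the best single split on one predictor: for each
# candidate cutpoint s, predict each of the regions {x < s} and {x >= s}
# by its own mean and record the total RSS
best_split <- function(x, y) {
  cutpoints <- sort(unique(x))[-1]   # skip the minimum so neither region is empty
  rss <- sapply(cutpoints, function(s) {
    left  <- y[x <  s]
    right <- y[x >= s]
    sum((left - mean(left))^2) + sum((right - mean(right))^2)
  })
  list(cutpoint = cutpoints[which.min(rss)], rss = min(rss))
}

# With integer-valued Years, the chosen cutpoint should give a partition
# equivalent to the book's Years < 4.5 split
best_split(df$Years, df$log_Salary)
```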
but ...
- The previous method may result in a tree that **overfits** the data. Why?
- The tree is too leafy (too complex)
- A better strategy is to have a smaller tree with fewer splits, which will reduce variance and lead to better interpretation of results (at the cost of a little bias)
- So we will prune
## Pruning a tree
1. Grow a very large tree $T_0$ as before
2. Apply cost-complexity pruning to $T_0$ to obtain a sequence of BEST subtrees, as a function of $\alpha$
Cost-complexity pruning minimizes (Eq. 8.4)
$$\sum_{m=1}^{|T|}\sum_{x_i{\in}R_m}(y_i-\hat{y}_{R_m})^2 + \alpha|T|$$
where
- $\alpha \geq 0$ is a tuning parameter
- $|T|$ is the number of **terminal nodes** of the subtree $T$
- $R_m$ is the rectangle/region (i.e., the subset of predictor space) corresponding to the $m$th terminal node
- $\hat{y}_{R_m}$ is the **mean** response for the training observations in $R_m$
- the tuning parameter $\alpha$ controls:
a. a trade-off between the subtree's complexity (the number of terminal nodes)
b. the subtree's fit to the training data
3. Choose $\alpha$ using K-fold cross-validation
    - repeat steps 1) and 2) on each $(K-1)/K$ fraction of the training data
    - average the results and pick $\alpha$ to minimize the average MSE
    - recall that in K-fold cross-validation (say K = 5): the model is estimated on 80% of the data five different times, the predictions are made for the remaining 20%, and the test MSEs are averaged
4. Return to the subtree from Step 2) that corresponds to the chosen value of $\alpha$
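A minimal sketch of this grow-then-prune-then-cross-validate workflow with the `tree` package (not evaluated here; the selected tree size will depend on the data and the random folds):
```{r}
#| label: 08-prune-cv-sketch
#| eval: false
library(tree)
set.seed(1)

# 1. Grow a large tree T0 (mindev = 0 allows many splits)
big_tree <- tree(log_Salary ~ Years + Hits, data = df,
                 control = tree.control(nobs = nrow(df), mindev = 0))

# 2.-3. Cost-complexity pruning + K-fold cross-validation:
#       cv.tree() reports deviance for a sequence of subtree sizes (i.e., alphas)
cv_out <- cv.tree(big_tree, K = 10)
best_size <- cv_out$size[which.min(cv_out$dev)]

# 4. Return the subtree corresponding to the chosen alpha / size
pruned <- prune.tree(big_tree, best = best_size)
plot(pruned)
text(pruned, pretty = 0)
```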
## An example: tree pruning (Hitters dataset)
- Results of fitting and pruning a regression tree on the Hitters data using 9 of the features
- Randomly divided the data set in half (132 observations in training, 131 observations in the test set)
- Built large regression tree on training data and varied $\alpha$ in Eq. 8.4 to create subtrees with different numbers of terminal nodes
- Finally, performed 6-fold cross-validation to estimate the cross-validated MSE of the trees as a function of $\alpha$
```{r}
#| label: 08-pruning-a-tree
#| echo: false
#| out-width: 100%
knitr::include_graphics("images/08_5_hitters_unpruned_tree.png")
```
```{r}
#| label: 08-mse-cross-validation
#| echo: false
#| out-width: 100%
#| fig-cap: 'Training, cross-validation, and test MSE are shown as a function of the number of terminal nodes in the pruned tree. Standard error bands are displayed. The minimum cross-validation error occurs at a tree size of 3.'
knitr::include_graphics("images/08_6_hitters_mse.png")
```
## Classification trees
- Very similar to a regression tree except it predicts a qualitative (vs quantitative) response
- We predict that each observation belongs to the **most commonly occurring class** of training observations in the region to which it belongs
- In the classification setting, RSS cannot be used as a criterion for making the binary splits
- A natural alternative to RSS is the classification **error rate**, i.e., the fraction of the training observations in that region that do not belong to the most common class:
$$E = 1 - \max_k(\hat{p}_{mk})$$
where $\hat{p}_{mk}$ is the **proportion of training observations** in the $m$th region that are from the $k$th class
- However, this error rate turns out not to be sufficiently sensitive for growing the tree: $E$ does not change much as splits are added (**lacks sensitivity**)
- So, 2 other measures are preferable:
- The **Gini Index** defined by $$G = \sum_{k=1}^{K}\hat{p}_{mk}(1-\hat{p}_{mk})$$ is a measure of total variance across the K classes
- The Gini index takes on a small value if all of the $\hat{p}_{mk}$'s are close to 0 or 1
- For this reason the Gini index is referred to as a measure of node **purity** - a small value indicates that a node contains predominantly observations from a single class
- An alternative to the Gini index is **cross-entropy** given by
$$D = - \sum_{k=1}^{K}\hat{p}_{mk}\log(\hat{p}_{mk})$$
- The Gini index and cross-entropy are very similar numerically
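As a small illustration (not from the text), all three measures can be computed directly from the class proportions $\hat{p}_{mk}$ of a node:
```{r}
#| label: 08-node-impurity-sketch
# Node impurity measures for a vector of class proportions p_mk
class_error   <- function(p) 1 - max(p)
gini_index    <- function(p) sum(p * (1 - p))
cross_entropy <- function(p) -sum(ifelse(p > 0, p * log(p), 0))

p_pure <- c(0.9, 0.05, 0.05)   # a fairly pure node: small Gini and entropy
c(error = class_error(p_pure), gini = gini_index(p_pure), entropy = cross_entropy(p_pure))

p_mixed <- c(0.4, 0.3, 0.3)    # a much less pure node: all three measures increase
c(error = class_error(p_mixed), gini = gini_index(p_mixed), entropy = cross_entropy(p_mixed))
```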
## Example: classification tree (Heart dataset)
- Data contain a binary outcome HD (heart disease Y or N based on angiographic test) for 303 patients who presented with chest pain
- 13 predictors including Age, Sex, Chol (a cholesterol measurement), and other heart and lung function measurements
- Cross-validation yields a tree with six terminal nodes
```{r}
#| label: 08-heart-dataset-cross-validation
#| echo: false
#| out-width: 100%
#| fig-cap: 'Heart data. Top: The unpruned tree. Bottom Left: Cross-validation error, training, and test error, for different sizes of the pruned tree. Bottom Right: The pruned tree corresponding to the minimal cross-validation error.'
knitr::include_graphics("images/08_7_classif_tree_heart.png")
```
- **Comment**: Classification trees can be constructed when categorical PREDICTORS are present, e.g., in the first split Thal is categorical (the 'a' in Thal:a indicates the first level of that predictor, i.e., Normal)
- Additionally, notice that some of the splits yield two terminal nodes that have the same predicted value (see red box)
- Regardless of the value of RestECG, a response value of *Yes* is predicted for those observations
- Why is the split performed at all?
- Because it leads to increased node purity: all 9 of the observations corresponding to the right-hand leaf have a response value of *Yes*, whereas 7/11 of those corresponding to the left-hand leaf have a response value of *Yes*
- Why is node purity important?
- Suppose that we have a test observation that belongs to the region given by that right-hand leaf. Then we can be pretty certain that its response value is *Yes*. In contrast, if a test observation belongs to the region given by the left-hand leaf, then its response value is **probably** *Yes*, but we are much less certain
- Even though the split RestECG\<1 does not reduce the classification error, it improves the Gini index and the entropy, which are more sensitive to node purity
## Advantages/Disadvantages of decision trees
- Trees can be displayed graphically and are **very easy to explain** to people
- They mirror human decision-making
- Can handle qualitative predictors without the need for dummy variables
but,
- They generally do not have the same level of predictive accuracy as other regression and classification approaches
- Can be very non-robust (i.e., a small change in the data can cause large change in the final estimated tree)
- To improve performance, we can use an **ensemble** method, which combines many simple 'building blocks' (i.e., regression or classification trees) to obtain a single, potentially very powerful model
- **ensemble** methods include: bagging, random forests, boosting, and Bayesian additive regression trees
## Bagging
- Bagging, also known as **bootstrap aggregation**, is a general-purpose procedure for reducing the variance of a statistical learning method
- It's useful and frequently used in the context of decision trees
- Recall that given a set of $n$ independent observations $Z_1,..., Z_n$, each with variance $\sigma^2$, the variance of the mean $\bar{Z}$ of the observations is given by $\sigma^2/n$
- So, **averaging a set of observations** reduces variance
- But, this is not practical because we generally do not have access to multiple training sets!
- What can we do?
- Cue the bootstrap, i.e., take repeated samples from the single training set
    - Generate $B$ different bootstrapped training data sets
    - Then train our method on the $b$th bootstrapped training set to get $\hat{f}^{*b}(x)$, the prediction at a point $x$
- Average all the predictions to obtain $$\hat{f}_{bag}(x) = \frac{1}{B}\sum_{b=1}^B\hat{f}^{*b}(x)$$
- In the case of classification trees:
- for each test observation:
- record the class predicted by each of the $B$ trees
- take a **majority vote**: the overall prediction is the most commonly occurring class among the $B$ predictions
**Comment**: The number of trees $B$ is not a critical parameter with bagging - a large $B$ will not lead to overfitting
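A bare-bones sketch of bagging for regression (illustrative only and not evaluated here; assumes the `rpart` package and the `df` / `log_Salary` data used earlier):
```{r}
#| label: 08-bagging-sketch
#| eval: false
library(rpart)
set.seed(1)

B  <- 100                                  # number of bootstrapped training sets
n  <- nrow(df)
x0 <- data.frame(Years = 5, Hits = 100)    # a point x at which to predict

# Fit a tree to each of B bootstrap samples and average the B predictions
preds <- sapply(seq_len(B), function(b) {
  boot_idx <- sample(n, replace = TRUE)
  fit <- rpart(log_Salary ~ Years + Hits, data = df[boot_idx, ])
  predict(fit, newdata = x0)
})

mean(preds)   # the bagged prediction f_hat_bag(x0)
```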
## Out-of-bag error estimation
- But how do we estimate the test error of a bagged model?
- It's pretty straightforward:
1. Because trees are repeatedly fit to bootstrapped subsets of observations, on average each bagged tree uses about 2/3 of the observations
2. The leftover 1/3 not used to fit a given bagged tree are called **out-of-bag** (OOB) observations
3. We can predict the response for the $i$th observation using each of the trees in which that observation was OOB. Gives around B/3 predictions for the $i$th observation (which we then average)
4. This estimate is essentially the LOO cross-validation error for bagging (if $B$ is large)
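In practice the `randomForest` package reports the OOB error automatically; setting `mtry` to the number of predictors gives bagging (a sketch, not evaluated, using the same toy `df` as before):
```{r}
#| label: 08-oob-sketch
#| eval: false
library(randomForest)
set.seed(1)

bag_fit <- randomForest(log_Salary ~ Years + Hits, data = df,
                        mtry = 2,      # consider all predictors at each split = bagging
                        ntree = 500)

# Out-of-bag MSE after all 500 trees (bag_fit$mse tracks it as trees are added)
tail(bag_fit$mse, 1)
```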
## Variable importance measures
- Bagging results in improved accuracy over prediction using a single tree
- But, it can be difficult to interpret the resulting model:
- we can't represent the statistical learning procedure using a single tree
- it's not clear which variables are most important to the procedure (i.e., we have many trees each of which may give a differing view on the importance of a given predictor)
- So, which predictors are important?
- An overall summary of the importance of each predictor can be obtained by recording how much the $RSS$ (for regression trees) or the Gini index (for classification trees) **decreases** due to splits over a given predictor, averaged over all $B$ trees (see the sketch below)
- a large value = important predictor
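With `randomForest`, these importance scores can be extracted and plotted directly (a sketch, not evaluated; `importance = TRUE` is needed for the permutation-based measure):
```{r}
#| label: 08-var-importance-sketch
#| eval: false
library(randomForest)
set.seed(1)

fit <- randomForest(log_Salary ~ Years + Hits, data = df,
                    importance = TRUE, ntree = 500)

importance(fit)   # %IncMSE and IncNodePurity (total decrease in node impurity) per predictor
varImpPlot(fit)   # plot, similar in spirit to the Heart figure below
```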
```{r}
#| label: 08-variable-importance
#| echo: false
#| out-width: 100%
#| fig-cap: 'A variable importance plot for the Heart data. Variable importance is computed using the mean decrease in Gini index, and expressed relative to the maximum.'
knitr::include_graphics("images/08_8_var_importance.png")
```
## Random forests
- A problem with bagging is that bagged trees may be **highly similar** to each other.
- For example, if there is a strong predictor in the data set, most of the bagged trees will **use this strong predictor** in the top split so that
- the trees will look quite similar
- predictions from the bagged trees will be highly correlated
- Averaging many highly correlated quantities does not lead to as large a reduction in variance as averaging many uncorrelated quantities
## Random forests: advantages over bagging
- Random forests overcome this problem by forcing each split to consider only a **subset** of the predictors (typically a random sample $m \approx \sqrt{p}$)
- Thus at each split, the algorithm is NOT ALLOWED to consider a majority of the available predictors (essentially $(p - m)/p$ of the splits will not even consider the strong predictor, giving other predictors a chance)
- This *decorrelates* the trees and makes the average of the resulting trees less variable (more reliable)
- Only difference between bagging and random forests is the choice of predictor subset size $m$ at each split: if a random forest is built using $m = p$ that's just bagging
- For both, we build a number of decision trees on bootstrapped training samples
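A sketch of that single difference in code (not evaluated; with only two predictors in the toy `df` the contrast is minor, so this just illustrates the `mtry` argument):
```{r}
#| label: 08-rf-vs-bagging-sketch
#| eval: false
library(randomForest)
set.seed(1)

p <- 2   # number of predictors in this toy example

# Bagging: all p predictors are candidates at every split (m = p)
bag <- randomForest(log_Salary ~ Years + Hits, data = df, mtry = p)

# Random forest: only m = sqrt(p) predictors are candidates at each split
rf  <- randomForest(log_Salary ~ Years + Hits, data = df, mtry = floor(sqrt(p)))

c(bagging_oob_mse = tail(bag$mse, 1), rf_oob_mse = tail(rf$mse, 1))
```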
## Example: Random forests versus bagging (gene expression data)
- High-dimensional biological data set: contains gene expression measurements of 4,718 genes measured on tissue samples from 349 patients
- Each of the patient samples has a qualitative label with 15 different levels: *Normal* or one of 14 different cancer types
- Want to predict cancer type based on the 500 genes that have the largest variance in the training set
- Randomly divided the observations into training/test and applied random forests (or bagging) to the training set for 3 different values of $m$ (the number of predictors available at each split)
```{r}
#| label: 08-random-forest
#| echo: false
#| out-width: 100%
#| fig-cap: 'Results from random forests for the 15-class gene expression data set with p = 500 predictors. The test error is displayed as a function of the number of trees. Random forests (m < p) lead to a slight improvement over bagging (m = p). A single classification tree has an error rate of 45.7%.'
knitr::include_graphics("images/08_9_rand_forest_gene_exp.png")
```
## Boosting
- Yet another approach to improve prediction accuracy from a decision tree
- Can also be applied to many statistical learning methods for regression or classification
- Recall that in bagging each tree is built on a bootstrap training data set
- In boosting, each tree is grown sequentially using information from previously grown trees:
- given the current model, we fit a decision tree to the residuals of the model (rather than the outcome *Y*) as the response
- we then add this new decision tree into the fitted function (model) in order to update the residuals
- Why? this way each tree is built on information that the previous trees were unable to 'catch'
## Boosting algorithm
```{r}
#| label: 08-boosting-algo
#| echo: false
#| out-width: 100%
knitr::include_graphics("images/08_10_boosting_algorithm.png")
```
where:
- $\hat{f}(x)$ is the decision tree (model)
- $r$ = residuals
- $d$ = number of splits in each tree (controls the complexity of the boosted ensemble)
- $\lambda$ = shrinkage parameter (a small positive number that controls the rate at which boosting learns; typically 0.01 or 0.001, but the right choice can depend on the problem)
- Each of the trees can be small, with just a few terminal nodes (determined by $d$)
- By fitting small trees to the residuals, we slowly improve our model ($\hat{f}$) in areas where it doesn't perform well
- The shrinkage parameter $\lambda$ slows the process down further, allowing more and different shaped trees to 'attack' the residuals
- Unlike bagging and random forests, boosting can OVERFIT if $B$ is too large. $B$ is selected via cross-validation
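A toy implementation of this algorithm with depth-$d$ `rpart` trees as the base learners (illustrative only and not evaluated; the values of $B$, $\lambda$, and $d$ are arbitrary examples):
```{r}
#| label: 08-boosting-sketch
#| eval: false
library(rpart)
set.seed(1)

B <- 1000; lambda <- 0.01; d <- 1     # number of trees, shrinkage, tree depth
y <- df$log_Salary

f_hat <- rep(0, length(y))            # 1. initialize f_hat(x) = 0
r     <- y                            #    and the residuals r_i = y_i

for (b in seq_len(B)) {               # 2. for b = 1, ..., B:
  #    (a) fit a small (depth-d) tree to the residuals, not to y
  fit <- rpart(r ~ Years + Hits, data = df,
               control = rpart.control(maxdepth = d, cp = 0))
  #    (b) add a shrunken version of the new tree to the model
  f_hat <- f_hat + lambda * predict(fit, df)
  #    (c) update the residuals
  r <- r - lambda * predict(fit, df)
}

mean((y - f_hat)^2)   # 3. training MSE of the boosted model f_hat
```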
## Example: Boosting versus random forests
```{r}
#| label: 08-boosting-vs-rf
#| echo: false
#| out-width: 100%
#| fig-cap: 'Results from performing boosting and random forests on the 15-class gene expression data set in order to predict cancer versus normal. The test error is displayed as a function of the number of trees. For the two boosted models, lambda = 0.01. Depth-1 trees slightly outperform depth-2 trees, and both outperform the random forest, although the standard errors are around 0.02, making none of these differences significant. The test error rate for a single tree is 24 %.'
knitr::include_graphics("images/08_11_boosting_gene_exp_data.png")
```
- Notice that because the growth of a particular tree takes into account the other trees that have already been grown, smaller trees are typically sufficient in boosting (versus random forests)
- Random forests and boosting are among the state-of-the-art methods for supervised learning (but, their results can be difficult to interpret)
## Bayesian additive regression trees (BART)
- Recall that in bagging and random forests, each tree is built on a **random sample of data and/or predictors** and each tree is built **independently** of the others
- BART is related to both - what is new is HOW the new trees are generated
- **NOTE**: only BART for regression is described in the book
## BART notation
- Let $K$ be the total **number of regression trees** and
- $B$ be the **number of iterations** the BART algorithm will run for
- Let $\hat{f}^b_k(x)$ be the **prediction** at $x$ for the $k$th regression tree used in the $b$th iteration of the BART algorithm
- At the end of each iteration, the $K$ trees from that iteration will be summed:
$$\hat{f}^b(x) = \sum_{k=1}^{K}\hat{f}^b_k(x)$$ for $b=1,...,B$
## BART algorithm
- In the first iteration of the BART algorithm, all $K$ trees are initialized to have 1 root node, with $\hat{f}^1_k(x) = \frac{1}{nK}\sum_{i=1}^{n}y_i$
- i.e., the mean of the response values divided by the total number of trees
- Thus, for the first iteration ($b = 1$), the prediction for all $K$ trees is just the mean of the response
$\hat{f}^1(x) = \sum_{k=1}^K\hat{f}^1_k(x) = \sum_{k=1}^K\frac{1}{nK}\sum_{i=1}^{n}y_i = \frac{1}{n}\sum_{i=1}^{n}y_i$
## BART algorithm: iteration 2 and on
- In subsequent iterations, BART updates each of the $K$ trees one at a time
- In the $b$th iteration to update the $k$th tree, we subtract from each response value the predictions from all but the $k$th tree, to obtain a partial residual:
$r_i = y_i - \sum_{k'<k}\hat{f}^b_{k'}(x_i) - \sum_{k'>k}\hat{f}^{b-1}_{k'}(x_i)$
for the $i$th observation, $i = 1, …, n$
- Rather than fitting a new tree to this partial residual, BART chooses a perturbation to the tree from a previous iteration $\hat{f}^{b-1}_{k}$ favoring perturbations that improve the fit to the partial residual
- To perturb trees:
- change the structure of the tree by adding/pruning branches
- change the prediction in each terminal node of the tree
- The output of BART is a collection of prediction models:
$\hat{f}^b(x) = \sum_{k=1}^{K}\hat{f}^b_k(x)$
for $b = 1, 2,…, B$
## BART algorithm: figure
```{r}
#| label: 08-bart-algo
#| echo: false
#| out-width: 100%
knitr::include_graphics("images/08_12_bart_algorithm.png")
```
- **Comment**: the first few prediction models obtained in the early iterations (known as the *burn-in* period, whose length is denoted by $L$) are typically thrown away since they tend not to provide very good results, like you throw away the first pancake of the batch
## BART: additional details
- A key element of BART is that a fresh tree is NOT fit to the current partial residual: instead, we improve the fit to the current partial residual by slightly modifying the tree obtained in the previous iteration (Step 3(a)ii)
- This guards against overfitting since it limits how "hard" the data is fit in each iteration
- Additionally, the individual trees are typically pretty small
- BART, as the name suggests, can be viewed as a *Bayesian* approach to fitting an ensemble of trees:
- each time a tree is randomly perturbed to fit the residuals = drawing a new tree from a *posterior* distribution
## To apply BART:
- We must select the number of trees $K$, the number of iterations $B$ and the number of burn-in iterations $L$
- Typically, large values are chosen for $B$ and $K$ and a moderate value for $L$: e.g. $K$ = 200, $B$ = 1,000 and $L$ = 100
- BART has been shown to have impressive out-of-the-box performance - i.e., it performs well with minimal tuning
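A minimal sketch with the `BART` package's `gbart()` function (not evaluated; under the assumption that its arguments `ntree`, `ndpost`, and `nskip` play the roles of $K$, $B$, and $L$, with defaults close to the values quoted above):
```{r}
#| label: 08-bart-sketch
#| eval: false
library(BART)
set.seed(1)

x <- df[, c("Years", "Hits")]
y <- df$log_Salary

# K = 200 trees, B = 1000 kept iterations, L = 100 burn-in iterations discarded
bart_fit <- gbart(x.train = x, y.train = y,
                  ntree = 200, ndpost = 1000, nskip = 100)

# yhat.train has one row per kept iteration; averaging over iterations
# gives the BART prediction for each training observation
head(colMeans(bart_fit$yhat.train))
```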