# Loss functions {#sec:loss-functions}
The concept of a loss function is essential to machine learning. At any iteration, the current loss value indicates how far the estimate is from the target. It is then used to update the parameters in a direction that will decrease the loss.
In our applied example, we have already made use of a loss function: mean squared error, computed manually as
```{r}
library(torch)
loss <- (y_pred - y)$pow(2)$sum()
```
As you might expect, here is another area where this kind of manual effort is not needed.
In this final conceptual chapter before we re-factor our running examples, we want to talk about two things: First, how to make use of `torch`'s built-in loss functions\index{loss functions (built into torch)}. And second, what function to choose.
## `torch` loss functions
In `torch`, loss functions start with `nn_` or `nnf_`.
Using `nnf_`, you directly *call a function*. Correspondingly, its arguments (estimate and target) are both tensors. For example, here is `nnf_mse_loss()`, the built-in analog to what we coded manually:
```{r}
nnf_mse_loss(torch_ones(2, 2), torch_zeros(2, 2) + 0.1)
```
torch_tensor
0.81
[ CPUFloatType{} ]
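One detail worth noting: our manual version summed the squared errors, while `nnf_mse_loss()`, by default, averages them. If we wanted to reproduce the sum exactly, we could make use of the `reduction` argument:
```{r}
# reduction = "sum" adds up the squared errors instead of
# averaging them, reproducing our manual computation
nnf_mse_loss(torch_ones(2, 2), torch_zeros(2, 2) + 0.1,
  reduction = "sum"
)
```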
With `nn_`, in contrast, you create an object:
```{r}
l <- nn_mse_loss()
```
This object can then be called on tensors to yield the desired loss:
```{r}
l(torch_ones(2, 2), torch_zeros(2, 2) + 0.1)
```
torch_tensor
0.81
[ CPUFloatType{} ]
Whether to choose object or function is mainly a matter of preference and context. In larger models, you may end up combining several loss functions, and then creating loss objects can result in more modular, more maintainable code. In this book, I'll mainly use the function-style (`nnf_`) way, unless there are compelling reasons to do otherwise.
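For illustration (a made-up sketch, not something our running example calls for), a combined loss could be built from two loss objects like this; the weighting factor `alpha` is purely hypothetical:
```{r}
# hypothetical sketch: blend MSE and mean absolute error
# into a single, weighted loss
mse <- nn_mse_loss()
mae <- nn_l1_loss()

combined_loss <- function(input, target, alpha = 0.5) {
  alpha * mse(input, target) + (1 - alpha) * mae(input, target)
}

combined_loss(torch_ones(2, 2), torch_zeros(2, 2) + 0.1)
```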
On to the second question.
## What loss function should I choose?
In deep learning, or machine learning overall, most applications aim to do one (or both) of two things: predict a numerical value, or estimate a probability. The regression task of our running example does the former; real-world applications might forecast temperatures, infer employee churn, or predict sales. In the second group, the prototypical task is *classification*. To categorize, say, an image according to its most salient content, we really compute the respective probabilities. Then, when the probability for "dog" is 0.7, while that for "cat" is 0.3, we say it's a dog.
### Maximum likelihood
In both classification and regression, the most commonly used loss functions are built on the *maximum likelihood* principle. Maximum likelihood means: we want to choose model parameters in such a way that the *data*, the things we have observed or could have observed, are maximally likely. This principle is not "just" fundamental; it is also intuitively appealing. Consider a simple example.
Say we have the values 7.1, 22.14, and 11.3, and we know that the underlying process follows a normal distribution. Then it is much more likely that these data have been generated by a distribution with mean 14 and standard deviation 7 than by one with mean 20 and standard deviation 1.
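To make this concrete, here is a quick check using base R's `dnorm()` (no `torch` needed): we compare the log likelihood of these three values under both candidate distributions.
```{r}
x <- c(7.1, 22.14, 11.3)

# log likelihood under a normal with mean 14 and standard deviation 7
sum(dnorm(x, mean = 14, sd = 7, log = TRUE))

# log likelihood under a normal with mean 20 and standard deviation 1
sum(dnorm(x, mean = 20, sd = 1, log = TRUE))
```
The first log likelihood comes out far higher (about -9.8, versus roughly -126), matching the intuition.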
### Regression
In regression (which implicitly assumes the target distribution to be normal[^loss_functions-1]), to maximize likelihood we just keep using mean squared error -- the loss we've been computing all along. Maximum likelihood estimators have all kinds of desirable statistical properties. However, in concrete applications, there may be reasons to use different loss functions.
[^loss_functions-1]: For cases where that assumption seems unlikely, distribution-adequate loss functions are provided (e.g., Poisson negative log likelihood, available as `nnf_poisson_nll_loss()`).
For example, say a dataset has outliers where, for some reason, prediction and target deviate substantially. Mean squared error will allocate high importance to these outliers. In such cases, possible alternatives are mean absolute error (`nnf_l1_loss()`) and smooth L1 loss (`nnf_smooth_l1_loss()`). The latter is a mixture type that, by default, computes the absolute (L1) error, but switches to squared (L2) error whenever the absolute errors get very small.
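As a quick, made-up illustration of how differently these loss functions react to an outlier, compare their values on a prediction vector where a single estimate is far off:
```{r}
# made-up example: the last prediction is an outlier
preds <- torch_tensor(c(1.1, 1.9, 3.2, 4.0, 25))
truth <- torch_tensor(c(1, 2, 3, 4, 5))

nnf_mse_loss(preds, truth)       # dominated by the outlier
nnf_l1_loss(preds, truth)        # mean absolute error
nnf_smooth_l1_loss(preds, truth) # squared only for small errors
```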
### Classification
In classification, we are comparing two *distributions*. The estimate is a probability by design, and the target can be viewed as one, too. In that light, maximum likelihood estimation is equivalent to minimizing the Kullback-Leibler divergence (KL divergence).
KL divergence is a measure of how two distributions differ. It depends on two things: the likelihood of the data, as determined by some data-generating process, and the likelihood of the data under the model. In the machine learning scenario, however, we are concerned only with the latter. In that case, the criterion to be minimized reduces to the *cross-entropy*\index{cross entropy} between the two distributions. And cross-entropy loss is exactly what is commonly used in classification tasks.
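Written out (with $p$ denoting the target distribution and $q$ the model's estimated distribution), cross-entropy is the expected negative log probability under the model:

$$
H(p, q) = - \sum_{c} p(c) \, \log q(c)
$$

When the target puts all its probability mass on the true class (a one-hot target), this sum reduces to the negative log probability the model assigns to that class -- which is exactly what `nnf_nll_loss()`, introduced below, computes.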
In `torch`, there are several variants of loss functions that calculate cross-entropy. For this topic, it's nice to have a quick reference around, so here is a lookup table (@tbl-loss-funcs-features abbreviates the rather long-ish function names; see @tbl-loss-abbrevs for the mapping):
|        | **Data**: binary | **Data**: multi-class | **Input**: raw scores | **Input**: probabilities | **Input**: log probs |
|--------|------------------|-----------------------|-----------------------|--------------------------|----------------------|
| *BCeL* | Y                |                       | Y                     |                          |                      |
| *Ce*   |                  | Y                     | Y                     |                          |                      |
| *BCe*  | Y                |                       |                       | Y                        |                      |
| *Nll*  |                  | Y                     |                       |                          | Y                    |
: Loss functions, by type of data they work on (binary vs. multi-class) and expected input (raw scores, probabilities, or log probabilities). {#tbl-loss-funcs-features}
| Abbreviation | `torch` function                         |
|--------------|------------------------------------------|
| *BCeL*       | `nnf_binary_cross_entropy_with_logits()` |
| *Ce*         | `nnf_cross_entropy()`                    |
| *BCe*        | `nnf_binary_cross_entropy()`             |
| *Nll*        | `nnf_nll_loss()`                         |
: Abbreviations used to refer to `torch` loss functions. {#tbl-loss-abbrevs}
To pick the function applicable to your use case, there are two things to consider.
First, are there just two possible classes ("dog vs. cat", "person present / person absent", etc.), or are there several?
And second, what is the type of the estimated values? Are they raw scores (in theory, any value between minus and plus infinity)? Are they probabilities (values between 0 and 1)? Or are they log probabilities, that is, probabilities to which a logarithm has been applied? (In that last case, all values should be negative or zero.)
#### Binary data\index{cross entropy!binary}
Starting with binary data, our example classification vector is a sequence of zeros and ones. When thinking in terms of probabilities, it is most intuitive to imagine the ones standing for presence, the zeros for absence of one of the classes in question -- cat or no cat, say.
```{r}
target <- torch_tensor(c(1, 0, 0, 1, 1))
```
The raw scores could be anything. For example:
```{r}
unnormalized_estimate <-
torch_tensor(c(3, 2.7, -1.2, 7.7, 1.9))
```
To turn these into probabilities, all we need to do is pass them to `nnf_sigmoid()`. `nnf_sigmoid()` squishes its argument to values between zero and one:
```{r}
probability_estimate <- nnf_sigmoid(unnormalized_estimate)
probability_estimate
```
torch_tensor
0.9526
0.9370
0.2315
0.9995
0.8699
[ CPUFloatType{5} ]
From the above table, we see that both `unnormalized_estimate` and `probability_estimate` can serve as input to a loss function -- we just have to choose the appropriate one. Provided we do that, the output should be the same in both cases.
Let's see (raw scores first):
```{r}
nnf_binary_cross_entropy_with_logits(
unnormalized_estimate, target
)
```
torch_tensor
0.643351
[ CPUFloatType{} ]
And now, probabilities:\index{\texttt{nnf{\textunderscore}binary{\textunderscore}cross{\textunderscore}entropy()}}
```{r}
nnf_binary_cross_entropy(probability_estimate, target)
```
torch_tensor
0.643351
[ CPUFloatType{} ]
That worked as expected. What does this mean in practice? It means that when we build a model for binary classification, and the final layer computes an un-normalized score, we don't need to attach a sigmoid layer to obtain probabilities. We can just call `nnf_binary_cross_entropy_with_logits()` when training the network. In fact, doing so is the preferred way, not least for reasons of numerical stability.
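As an aside, we can verify manually that this loss is nothing but the mean negative log likelihood of the targets under the estimated probabilities (a quick sketch, for illustration only):
```{r}
# binary cross-entropy, computed by hand:
# -[y * log(p) + (1 - y) * log(1 - p)], averaged over observations
(-(target * torch_log(probability_estimate) +
  (1 - target) * torch_log(1 - probability_estimate)))$mean()
```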
#### Multi-class data\index{cross entropy!multi-class}
Moving on to multi-class data, the most intuitive framing now really is in terms of (several) *classes*, not presence or absence of a single class. Think of classes as class indices (maybe indexing into some look-up table). Being indices, technically, classes start at 1:
```{r}
target <- torch_tensor(c(2, 1, 3, 1, 3), dtype = torch_long())
```
In the multi-class scenario, raw scores are a two-dimensional tensor. Each row contains the scores for one observation, and each column corresponds to one of the classes. Here's how the raw estimates could look:
```{r}
unnormalized_estimate <- torch_tensor(
rbind(c(1.2, 7.7, -1),
c(1.2, -2.1, -1),
c(0.2, -0.7, 2.5),
c(0, -0.3, -1),
c(1.2, 0.1, 3.2)
)
)
```
As per the above table, given this estimate, we should be calling `nnf_cross_entropy()` (and we will, when we compare results below).
So that's the first option, and it works exactly as with binary data. For the second, there is an additional step.
First, we again turn raw scores into probabilities, this time using `nnf_softmax()`. For most practical purposes, `nnf_softmax()` can be seen as the multi-class equivalent of `nnf_sigmoid()`. Strictly speaking, though, their effects are not the same. In a nutshell, `nnf_sigmoid()` treats each score independently of the others, while `nnf_softmax()` normalizes over classes, exaggerating the distance between the top score and the remaining ones ("winner takes all").
```{r}
probability_estimate <- nnf_softmax(unnormalized_estimate,
dim = 2
)
probability_estimate
```
torch_tensor
0.0015 0.9983 0.0002
0.8713 0.0321 0.0965
0.0879 0.0357 0.8764
0.4742 0.3513 0.1745
0.1147 0.0382 0.8472
[ CPUFloatType{5,3} ]
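To see the difference in behavior (just an illustrative aside), we can apply `nnf_sigmoid()` to the same raw scores. Each score gets squashed independently, so the rows don't sum to one, and a moderate score like 1.2 is not pushed down the way `nnf_softmax()` pushes it down in the presence of a much larger competitor:
```{r}
# sigmoid transforms every score independently of the others;
# as a consequence, row sums will generally differ from one
nnf_sigmoid(unnormalized_estimate)

nnf_sigmoid(unnormalized_estimate)$sum(dim = 2)
```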
The second step, the one that was not required in the binary case, consists in transforming the probabilities to log probabilities. In our example, this could be accomplished by calling `torch_log()` on the `probability_estimate` we just computed. Alternatively, both steps together are taken care of by `nnf_log_softmax()`:
```{r}
logprob_estimate <- nnf_log_softmax(unnormalized_estimate,
dim = 2
)
logprob_estimate
```
torch_tensor
-6.5017 -0.0017 -8.7017
-0.1377 -3.4377 -2.3377
-2.4319 -3.3319 -0.1319
-0.7461 -1.0461 -1.7461
-2.1658 -3.2658 -0.1658
[ CPUFloatType{5,3} ]
Now that we have estimates in both possible forms, we can again compare results from applicable loss functions. First, `nnf_cross_entropy()` on the raw scores:\index{\texttt{nnf{\textunderscore}cross{\textunderscore}entropy()}}
```{r}
nnf_cross_entropy(unnormalized_estimate, target)
```
torch_tensor
0.23665
[ CPUFloatType{} ]
And second, `nnf_nll_loss()` on the log probabilities:\index{\texttt{nnf{\textunderscore}nll{\textunderscore}loss()}}
```{r}
nnf_nll_loss(logprob_estimate, target)
```
torch_tensor
0.23665
[ CPUFloatType{} ]
Application-wise, what was said for the binary case applies here as well: In a multi-class classification network, there is no need to have a softmax layer at the end.
Before we end this chapter, let's address a question that might have come to mind. Isn't binary classification just a special case of the multi-class setup? And if so, shouldn't we arrive at the same result, whichever method we choose?
#### Check: Binary data, multi-class method
Let's see. We re-use the binary-classification scenario employed above. Here it is again:
```{r}
target <- torch_tensor(c(1, 0, 0, 1, 1))
unnormalized_estimate <-
torch_tensor(c(3, 2.7, -1.2, 7.7, 1.9))
probability_estimate <- nnf_sigmoid(unnormalized_estimate)
nnf_binary_cross_entropy(probability_estimate, target)
```
torch_tensor
0.64335
[ CPUFloatType{} ]
We hope to get the same value doing things the multi-class way. We already have the probabilities (namely, `probability_estimate`); we just need to put them into the "observation by class" format expected by `nnf_nll_loss()`:
```{r}
# probabilities, arranged as one row per observation and
# one column per class (class 1 = "absent", class 2 = "present")
multiclass_probability <- torch_tensor(rbind(
c(1 - 0.9526, 0.9526),
c(1 - 0.9370, 0.9370),
c(1 - 0.2315, 0.2315),
c(1 - 0.9995, 0.9995),
c(1 - 0.8699, 0.8699)
))
```
Now, we still want to apply the logarithm. And there is one other thing to be taken care of: in the binary setup, classes were coded as 0 and 1; now, we're dealing with class indices, which start at 1. This means we add 1 to the `target` tensor:
```{r}
target <- target + 1
```
Finally, we can call `nnf_nll_loss()`:
```{r}
nnf_nll_loss(
torch_log(multiclass_probability),
target$to(dtype = torch_long())
)
```
torch_tensor
0.643275
[ CPUFloatType{} ]
There we go. Up to the small rounding error introduced by copying the probabilities to four decimal places, the results are indeed the same.