Neural Networks Notes


List of Symbols

$\large s$ = sample
$\large S$ = number of samples in training batch
$\large l$ = layer
$\large L$ = number of layers
$\large n_l$ = neuron at layer $l$
$\large N_l$ = number of neurons in layer $l$
$\large w_{n_{l-1}n_l}$ = weight between neurons $n_{l-1}$ and $n_l$
$\large b_{n_l}$ = bias of neuron $n_l$
$\large z_{n_l}$ = intermediate quantity of neuron $n_l$
$\large y_{n_l}$ = output of neuron $n_l$
$\large \hat y_{n_l}$ = target output of neuron $n_l$
$\large A_{n_l}$ = activation function at neuron $n_l$ {Binary Step, Linear, ReLU, Sigmoid, Tanh...}
$\large C$ = cost function {MSE, SSE, WSE, NSE...}
$\large O$ = optimization algorithm {Gradient Descent, ADAM, Quasi-Newton methods...}
$\large α$ = learning rate

Neuron Equations

Neuron Intermediate Quantity

$$ \large z_{n_l} = \sum_{n_{l-1}}^{N_{l-1}} w_{n_{l-1}n_l} \cdot y_{n_{l-1}} + b_{n_l} $$

Neuron Output

$$ \large y_{n_l} = A_{n_l}\big(z_{n_l}\big) $$
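
As a quick illustration of the two equations above, a minimal NumPy sketch of a single neuron's forward pass (the function and variable names, and the NumPy dependency, are illustrative and not part of the notes):

```python
import numpy as np

def neuron_forward(y_prev, w, b, activation):
    """Forward pass of a single neuron n_l.

    y_prev     : outputs y_{n_{l-1}} of the previous layer, shape (N_{l-1},)
    w          : weights w_{n_{l-1} n_l} into this neuron, shape (N_{l-1},)
    b          : bias b_{n_l} (scalar)
    activation : callable A_{n_l}
    """
    z = np.dot(w, y_prev) + b   # intermediate quantity z_{n_l}
    return activation(z)        # neuron output y_{n_l}

# Example: a ReLU neuron with three inputs.
y_prev = np.array([0.5, -1.0, 2.0])
w = np.array([0.1, 0.4, -0.3])
relu = lambda z: np.maximum(z, 0.0)
print(neuron_forward(y_prev, w, b=0.2, activation=relu))
```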

Training Algorithm

In order to reduce the errors of the network, the weights and biases are adjusted to minimize the cost function $C$. This is done by an optimization algorithm $O$, which adjusts the network parameters periodically after running a certain number of training samples. Weights and biases are modified according to their influence on the cost function, measured by the derivatives ${\partial C}/{\partial {w_{n_{l-1}n_l}}}$ and ${\partial C}/{\partial {b_{n_l}}}$.

$$ \large \Delta w_{n_{l-1}n_l} = - α \cdot O\big(\frac {\partial C}{\partial {w_{n_{l-1}n_l}}}\big) $$

$$ \large \Delta b_{n_l} = - α \cdot O\big(\frac {\partial C}{\partial {b_{n_l}}}\big) $$
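
A sketch of how these two update rules translate into code, assuming the derivatives and the optimizer output are already available as NumPy arrays (all names are illustrative):

```python
import numpy as np

def update_parameters(w, b, dC_dw, dC_db, optimizer, alpha):
    """Apply Δw = -α·O(∂C/∂w) and Δb = -α·O(∂C/∂b).

    w, b         : current weights and biases of one layer
    dC_dw, dC_db : cost derivatives with the same shapes
    optimizer    : callable O, e.g. the identity for plain gradient descent
    alpha        : learning rate α
    """
    w_new = w - alpha * optimizer(dC_dw)
    b_new = b - alpha * optimizer(dC_db)
    return w_new, b_new

# Plain gradient descent: O is the identity applied to the (averaged) gradient.
identity = lambda g: g
w, b = np.array([0.5, -0.2]), np.array([0.1])
dC_dw, dC_db = np.array([0.3, -0.1]), np.array([0.05])
print(update_parameters(w, b, dC_dw, dC_db, identity, alpha=0.01))
```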

Chain Rule

The chain rule allows the derivatives of the cost function to be separated into components:

$$ \large \frac {\partial C}{\partial {w_{n_{l-1}n_l}}} = \frac{\partial C}{\partial z_{n_l}} \cdot \frac{\partial z_{n_l}}{\partial {w_{n_{l-1}n_l}}} = \frac{\partial C}{\partial z_{n_l}} \cdot y_{n_{l-1}} = \frac{\partial C}{\partial y_{n_l}} \cdot \frac{\partial y_{n_l}}{\partial z_{n_l}} \cdot y_{n_{l-1}} = \dot C\big(y_{n_l}, \hat y_{n_l}\big) \cdot \dot A_{n_l}\big(z_{n_l}\big) \cdot y_{n_{l-1}} $$

$$ \large \frac {\partial C}{\partial {b_{n_l}}} = \frac{\partial C}{\partial z_{n_l}} = \frac{\partial C}{\partial y_{n_l}} \cdot \frac{\partial y_{n_l}}{\partial z_{n_l}} = \dot C\big(y_{n_l}, \hat y_{n_l}\big) \cdot \dot A_{n_l}\big(z_{n_l}\big) $$
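
The same decomposition written out for one neuron, assuming the caller supplies the derivative of the cost with respect to the output, $\dot C$, and of the activation, $\dot A$ (a sketch under those assumptions, not a reference implementation):

```python
import numpy as np

def gradients_single_neuron(y_prev, z, y, y_hat, dC_dy, dA_dz):
    """Chain-rule gradients for one neuron n_l.

    y_prev : previous-layer outputs y_{n_{l-1}}, shape (N_{l-1},)
    z, y   : intermediate quantity z_{n_l} and output y_{n_l} (scalars)
    y_hat  : target output ŷ_{n_l}
    dC_dy  : callable Ċ(y, ŷ) = ∂C/∂y
    dA_dz  : callable Ȧ(z)   = ∂y/∂z
    """
    delta = dC_dy(y, y_hat) * dA_dz(z)   # ∂C/∂z_{n_l} = Ċ(y, ŷ) · Ȧ(z)
    dC_dw = delta * y_prev               # ∂C/∂w_{n_{l-1}n_l} = ∂C/∂z · y_{n_{l-1}}
    dC_db = delta                        # ∂C/∂b_{n_l} = ∂C/∂z
    return dC_dw, dC_db

# Example with the quadratic cost and a sigmoid activation.
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
cost_prime = lambda y, y_hat: y - y_hat
act_prime = lambda z: sigmoid(z) * (1.0 - sigmoid(z))
print(gradients_single_neuron(np.array([0.2, 0.7]), z=0.3, y=sigmoid(0.3),
                              y_hat=1.0, dC_dy=cost_prime, dA_dz=act_prime))
```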

Backpropagation

The terms $\dot C (y_{n_l}, \hat y_{n_l})$ depend on the target output value of each neuron, $\hat y_{n_l}$. However, a training data set only provides the value of $\hat y_{n_l}$ for the last layer $l = L$. For all previous layers $l < L$, the components $\dot C ( y_{n_l}, \hat y_{n_l})$ are instead computed as a weighted sum of the components already calculated for the next layer, $\dot C (y_{n_{l+1}}, \hat y_{n_{l+1}})$, scaled by the activation derivatives $\dot A_{n_{l+1}}(z_{n_{l+1}})$:

$$ \large \dot C \big( y_{n_l}, \hat y_{n_l} \big) = \sum_{n_{l+1}}^{N_{l+1}} w_{n_{l}n_{l+1}} \cdot \dot A_{n_{l+1}}\big(z_{n_{l+1}}\big) \cdot \dot C \big( y_{n_{l+1}}, \hat y_{n_{l+1}} \big) $$
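
A minimal sketch of this backward recursion for one layer, following the formula above (shapes and names are illustrative):

```python
import numpy as np

def backpropagate_dC_dy(W_next, dC_dy_next, dA_dz_next):
    """Propagate Ċ from layer l+1 back to layer l.

    W_next     : weights w_{n_l n_{l+1}}, shape (N_l, N_{l+1})
    dC_dy_next : Ċ(y_{n_{l+1}}, ŷ_{n_{l+1}}) for each neuron of layer l+1, shape (N_{l+1},)
    dA_dz_next : Ȧ_{n_{l+1}}(z_{n_{l+1}}) for each neuron of layer l+1, shape (N_{l+1},)
    Returns Ċ(y_{n_l}, ŷ_{n_l}) for each neuron of layer l, shape (N_l,).
    """
    # Weighted sum over the next layer's neurons, one entry per neuron n_l.
    return W_next @ (dA_dz_next * dC_dy_next)

# Toy example: 2 neurons in layer l, 3 neurons in layer l+1.
W_next = np.array([[0.1, -0.2, 0.3],
                   [0.4,  0.0, -0.1]])
dC_dy_next = np.array([0.5, -0.3, 0.2])
dA_dz_next = np.array([0.9, 1.0, 0.7])
print(backpropagate_dC_dy(W_next, dC_dy_next, dA_dz_next))
```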

Activation Functions

Binary Step


$$ \large \begin{split}A \big(z\big) = \begin{Bmatrix} 1 & z ≥ 0 \\ 0 & z < 0 \end{Bmatrix}\end{split} $$

$$ \large \dot A \big(z\big) = 0 $$

Linear


$$ \large A \big(z\big) = z $$

$$ \large \dot A \big(z\big) = 1 $$

ReLU (Rectified Linear Unit)


$$ \large \begin{split}A \big(z\big) = \begin{Bmatrix} z & z > 0 \\ 0 & z ≤ 0 \end{Bmatrix}\end{split} $$

$$ \large \begin{split}\dot A \big(z\big) = \begin{Bmatrix} 1 & z > 0 \\ 0 & z ≤ 0 \end{Bmatrix}\end{split} $$

Leaky ReLU


$$ \large \begin{split}A \big(z, \tau \big) = \begin{Bmatrix} z & z > 0 \\ \tau \cdot z & z ≤ 0 \end{Bmatrix}\end{split} $$

$$ \large \begin{split}\dot A \big(z, \tau \big) = \begin{Bmatrix} 1 & z > 0 \\ \tau & z ≤ 0 \end{Bmatrix}\end{split} $$

where typically:

$$ \large \tau=0.01 $$

Sigmoid


$$ \large A \big(z\big) = \frac{1} {1 + e^{-z}} $$

$$ \large \dot A \big(z\big) = A(z) \cdot (1-A(z)) $$

Tanh (Hyperbolic Tangent)


$$ \large A \big(z\big) = \frac{e^{z} - e^{-z}}{e^{z} + e^{-z}} $$

$$ \large \dot A \big(z\big) = 1 - {A(z)}^2 $$
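
The activation functions listed above and their derivatives, collected in one place as a minimal NumPy sketch (function names are illustrative):

```python
import numpy as np

def binary_step(z):       return np.where(z >= 0, 1.0, 0.0)
def binary_step_prime(z): return np.zeros_like(z)   # 0 everywhere except z = 0, where it is undefined

def linear(z):            return z
def linear_prime(z):      return np.ones_like(z)

def relu(z):              return np.where(z > 0, z, 0.0)
def relu_prime(z):        return np.where(z > 0, 1.0, 0.0)

def leaky_relu(z, tau=0.01):       return np.where(z > 0, z, tau * z)
def leaky_relu_prime(z, tau=0.01): return np.where(z > 0, 1.0, tau)

def sigmoid(z):           return 1.0 / (1.0 + np.exp(-z))
def sigmoid_prime(z):     return sigmoid(z) * (1.0 - sigmoid(z))

def tanh(z):              return np.tanh(z)
def tanh_prime(z):        return 1.0 - np.tanh(z) ** 2

z = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(relu(z), relu_prime(z))
print(sigmoid(z), sigmoid_prime(z))
```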

Cost Functions

Quadratic Cost

$$\large C\big(y, \hat y\big) = 1/2 \cdot {\big(y - \hat y\big)^{\small 2}}$$

$$\large\dot C\big(y, \hat y\big) = \big(y - \hat y\big)$$

Cross Entropy Cost

$$ \large C\big(y, \hat y\big) = -\big({\hat y} \cdot \ln y + (1 - {\hat y}) \cdot \ln(1-y)\big) $$

$$ \large \dot C\big(y, \hat y\big) = \frac{y - \hat y}{(1-y) \cdot y} $$
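
Both cost functions and their derivatives $\dot C = \partial C / \partial y$ as a small NumPy sketch (names are illustrative):

```python
import numpy as np

def quadratic_cost(y, y_hat):        return 0.5 * (y - y_hat) ** 2
def quadratic_cost_prime(y, y_hat):  return y - y_hat

def cross_entropy_cost(y, y_hat):
    return -(y_hat * np.log(y) + (1.0 - y_hat) * np.log(1.0 - y))

def cross_entropy_cost_prime(y, y_hat):
    return (y - y_hat) / ((1.0 - y) * y)

y, y_hat = np.array([0.8, 0.3]), np.array([1.0, 0.0])
print(quadratic_cost(y, y_hat), quadratic_cost_prime(y, y_hat))
print(cross_entropy_cost(y, y_hat), cross_entropy_cost_prime(y, y_hat))
```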

Regularization

Extra terms are added to the cost function in order to address overfitting.

L2

$$ \large C_{n_l} \big(y, \hat y\big) = C \big(y, \hat y\big) + \frac{\lambda}{2 \cdot N_{l-1}} \cdot \sum_{n_{l-1}}^{N_{l-1}} w_{n_{l-1}n_l}^2 $$

$$ \large \dot C_{n_l} \big(y, \hat y\big) = \dot C \big(y, \hat y\big) + \frac{\lambda}{N_{l-1}} \cdot w_{n_{l-1}n_l} $$

L1

$$ \large C_{n_l} \big(y, \hat y\big) = C \big(y, \hat y\big) + \lambda \cdot \sum_{n_{l-1}}^{N_{l-1}} |w_{n_{l-1}n_l}| $$

$$ \large \dot C_{n_l} \big(y, \hat y\big) = \dot C \big(y, \hat y\big) + \lambda \cdot \text{sgn}\big(w_{n_{l-1}n_l}\big) $$
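
A sketch of how the per-weight penalty derivatives are added to an existing weight gradient, using the normalization of the L2 expression above (names are illustrative):

```python
import numpy as np

def l2_penalty_grad(w, lam):
    """Per-weight gradient of the L2 penalty λ/(2·N_{l-1})·Σw², i.e. λ·w/N_{l-1}."""
    return lam * w / w.shape[0]

def l1_penalty_grad(w, lam):
    """Per-weight gradient of the L1 penalty λ·Σ|w|, i.e. λ·sgn(w)."""
    return lam * np.sign(w)

# The penalty gradient is simply added to the unregularized weight gradient.
w = np.array([0.5, -0.2, 0.0])
dC_dw = np.array([0.1, 0.3, -0.4])
print(dC_dw + l2_penalty_grad(w, lam=0.01))
print(dC_dw + l1_penalty_grad(w, lam=0.01))
```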

Optimization Algorithms

Gradient Descent

Network parameters are updated after every training batch of $S$ samples, averaging the derivatives over all samples in the batch.

$$ \large O \big( \frac{\partial C}{\partial {w_{n_{l-1}n_l}}} \big) = \frac{1}{S} \cdot \sum_{s}^S{\frac{\partial C}{\partial {w_{n_{l-1}n_l}}}} $$

Stochastic Gradient Descent

Gradient descent performed after every single training sample $s$, i.e. with a batch size of one.

$$ \large O \big( \frac{\partial C}{\partial {w_{n_{l-1}n_l}}} \big) = \frac{\partial C}{\partial {w_{n_{l-1}n_l}}} $$
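
A sketch of the optimizer function $O$ for both variants, assuming the per-sample derivatives of a batch are stacked along the first axis of a NumPy array (names are illustrative):

```python
import numpy as np

def gradient_descent_O(per_sample_grads):
    """Batch gradient descent: average ∂C/∂w over the S samples of the batch."""
    return np.mean(per_sample_grads, axis=0)

def stochastic_gradient_descent_O(per_sample_grads):
    """Stochastic gradient descent: the same rule applied to a batch of one sample (S = 1)."""
    return per_sample_grads[0]

# per_sample_grads has shape (S, ...), one gradient per training sample.
per_sample_grads = np.array([[0.2, -0.1], [0.4, 0.3], [0.0, 0.1]])
print(gradient_descent_O(per_sample_grads))
print(stochastic_gradient_descent_O(per_sample_grads[:1]))
```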

ADAM (Adaptive Moment Estimation)

Network parameters are updated after every training batch of $S$ samples, using an adapted value of the cost function derivatives based on running estimates of their first and second moments.

$$ \large O \big( \frac{\partial C}{\partial {w_{n_{l-1}n_l}}} \big) = \frac{m_t}{\sqrt{v_t}+\epsilon} $$

where:

$$ \large m_t = \beta_1 \cdot m_{t-1} + (1-\beta_1) \cdot \big( \frac{1}{S} \cdot \sum_{s}^S{\frac{\partial C}{\partial {w_{n_{l-1}n_l}}}} \big) $$

$$ \large v_t = \beta_2 \cdot v_{t-1} + (1-\beta_2) \cdot \big( \frac{1}{S} \cdot \sum_{s}^S{\frac{\partial C}{\partial {w_{n_{l-1}n_l}}}} \big)^2 $$

where typically:

$$ \large m_0 = 0 $$

$$ \large v_0 = 0 $$

$$ \large \epsilon = 10^{-8} $$

$$ \large \beta_1 = 0.9 $$

$$ \large \beta_2 = 0.999 $$
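
A sketch of the ADAM update exactly as written above, i.e. without the bias-correction terms of the original paper (names are illustrative):

```python
import numpy as np

def adam_O(batch_grad, state, beta1=0.9, beta2=0.999, eps=1e-8):
    """One ADAM step applied to a batch-averaged gradient.

    batch_grad : (1/S)·Σ ∂C/∂w for the current batch
    state      : dict holding the running moments m and v (both initialized to 0)
    Returns the adapted gradient m_t / (sqrt(v_t) + ε), as in the formula above.
    """
    state["m"] = beta1 * state["m"] + (1.0 - beta1) * batch_grad
    state["v"] = beta2 * state["v"] + (1.0 - beta2) * batch_grad ** 2
    return state["m"] / (np.sqrt(state["v"]) + eps)

state = {"m": np.zeros(2), "v": np.zeros(2)}
for batch_grad in [np.array([0.3, -0.1]), np.array([0.2, -0.2])]:
    print(adam_O(batch_grad, state))
```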

References