[TOC]
Machine learning is a highly iterative process: based on the outcome, you change your hyperparameters until you find the best configuration. Because of this, you need a held-out dataset to evaluate the model.
It is convenient to divide your dataset into:
- Train: for training the model
- Hold-out/cross-validation/development set: used to see which of many different models performs best
- Test set: when you have finished tuning, take the best model you have found and evaluate it on the test set
| Dataset size | Train | Dev | Test |
|---|---|---|---|
| 100 to 10,000 | 60% | 20% | 20% |
| 1,000,000 or more | 98% | 1% | 1% |
- Make sure the dev and test sets come from the same distribution.
- It may be OK not to have a test set: train on the training set and evaluate on the dev set.
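As a minimal sketch of such a split (the array names, synthetic data, and the 98/1/1 ratio are assumptions for illustration): shuffle once, then slice. Shuffling a single dataset this way also guarantees that dev and test come from the same distribution.

```python
import numpy as np

# synthetic stand-in data: 1,000,000 examples, 10 features (an assumption)
m = 1_000_000
X = np.random.randn(m, 10)
y = np.random.randint(0, 2, size=m)

# shuffle once, then carve out 98% train / 1% dev / 1% test
idx = np.random.permutation(m)
n_train, n_dev = int(0.98 * m), int(0.01 * m)

X_train, y_train = X[idx[:n_train]], y[idx[:n_train]]
X_dev, y_dev = X[idx[n_train:n_train + n_dev]], y[idx[n_train:n_train + n_dev]]
X_test, y_test = X[idx[n_train + n_dev:]], y[idx[n_train + n_dev:]]
```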
- High bias: underfitting; the model fits neither the training set nor the test set well
- High variance: overfitting; the model fits the training set well but not the test set
| | High variance | High bias | High bias & variance | Low bias & variance |
|---|---|---|---|---|
| Training error | 1% | 15% | 15% | 0.5% |
| Test error | 11% | 16% | 30% | 1% |
To avoid overfitting, there are some strategies that can be applied, such as:
- Penalize model complexity -> regularization
- Dropout
- Early stopping: stop training before you fully converge on the training data
- Data augmentation
How do we penalize model complexity? We use L1 or L2 regularization. From a Bayesian viewpoint, this corresponds to a prior that the weights are centered at 0 and not too big. Given a cost function, we add a regularization term: $$ J(w,b)=\frac{1}{m}\sum_{i=1}^m \mathcal{L}(\hat{y}^{(i)},y^{(i)})+\frac{\lambda}{2m}||w||^2_2 $$ where $||w||^2_2$ is the squared L2 norm. We can also use the L1 norm, $||w||_1$ (less frequent). With L1 regularization, $w$ will be sparse: it will have a lot of zeros.
Normally we don't add a regularization term for $b$: $w$ is a high-dimensional vector, which is where the high-variance problem lives, whereas $b$ is just a single number, so including it makes practically no difference.
When we implement the cost function for a neural network, the regularization term sums over all the weight matrices: $$ J(w^{[1]},b^{[1]},\ldots,w^{[L]},b^{[L]})=\frac{1}{m}\sum_{i=1}^m \mathcal{L}(\hat{y}^{(i)},y^{(i)})+\frac{\lambda}{2m}\sum_{l=1}^{L}||w^{[l]}||^2_F $$ where $||w^{[l]}||^2_F$ is the Frobenius norm: the sum of the squares of all entries of $w^{[l]}$.
Gradient descent
The original backward propagation gives $dw^{[l]}_{orig}$; with L2 regularization we add the derivative of the penalty term:
$dw^{[l]}= dw^{[l]}_{orig}+ \frac{\lambda}{m}w^{[l]}$ - Update weight: $w^{[l]} = w^{[l]} - \alpha\, dw^{[l]} = w^{[l]} - \alpha \left( dw^{[l]}_{orig}+\frac{\lambda}{m}w^{[l]}\right) = \left(1-\frac{\alpha \lambda}{m}\right)w^{[l]} - \alpha\, dw^{[l]}_{orig}$
Observe in the previous equation that $dw^{[l]}$ is increased by the term $\frac{\lambda}{m}w^{[l]}$, so every update multiplies $w^{[l]}$ by $\left(1-\frac{\alpha\lambda}{m}\right) < 1$; for this reason L2 regularization is also called weight decay.
L2 regularization for a neural network: $$ J_{regularized} = \small \underbrace{-\frac{1}{m} \sum\limits_{i = 1}^{m} \large{(}\small y^{(i)}\log\left(a^{[L](i)}\right) + (1-y^{(i)})\log\left(1- a^{[L](i)}\right) \large{)} }_\text{cross-entropy cost} + \underbrace{\frac{1}{m} \frac{\lambda}{2} \sum\limits_l\sum\limits_k\sum\limits_j W_{k,j}^{[l]2} }_\text{L2 regularization cost} $$
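A minimal NumPy sketch of the two formulas above (the `parameters`/`grads` dict layout with keys `'W1'`, `'dW1'`, … is an assumption of this sketch, not a fixed API):

```python
import numpy as np

def l2_regularized_cost(AL, Y, parameters, lambd):
    """Cross-entropy cost plus (lambda/2m) * sum of squared weights."""
    m = Y.shape[1]
    cross_entropy = -np.sum(Y * np.log(AL) + (1 - Y) * np.log(1 - AL)) / m
    L = len(parameters) // 2  # parameters holds W1, b1, ..., WL, bL
    l2_cost = (lambd / (2 * m)) * sum(
        np.sum(np.square(parameters["W" + str(l)])) for l in range(1, L + 1))
    return cross_entropy + l2_cost

def update_with_weight_decay(parameters, grads, alpha, lambd, m):
    """One gradient-descent step with the extra (lambda/m) * W term."""
    L = len(parameters) // 2
    for l in range(1, L + 1):
        W = parameters["W" + str(l)]
        dW = grads["dW" + str(l)] + (lambd / m) * W  # regularized gradient
        parameters["W" + str(l)] = W - alpha * dW    # == (1 - alpha*lambd/m)*W - alpha*dW_orig
        parameters["b" + str(l)] -= alpha * grads["db" + str(l)]
    return parameters
```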
Why does regularization reduce overfitting?
- If $\lambda$ is large, the weight matrices $W$ are incentivized to be reasonably close to zero (since large weights are penalized in the cost function); as a result the network behaves like a simpler model.
Dropout: with a certain probability, each unit is kept and the others are removed. As a result, you are effectively training a smaller network on each training pass.
Implement dropout using inverted dropout:
```python
import numpy as np

keep_prob = 0.8  # 0 <= keep_prob <= 1: probability that a unit is kept
# this code is only for layer 3; a3 is that layer's activation from forward prop
# units whose random value is below keep_prob are kept: ~80% stay, ~20% dropped
d3 = np.random.rand(a3.shape[0], a3.shape[1]) < keep_prob
a3 = np.multiply(a3, d3)  # zero out the dropped units
# scale a3 up so the expected value of the output is unchanged
# (this solves the scaling problem)
a3 = a3 / keep_prob
```
The last step, `a3 /= keep_prob`, ensures the expected value of $a^{[3]}$ is not altered; this is the "inverted" part of inverted dropout, and it means no extra scaling is needed at test time.
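A self-contained toy version of the forward step above together with the matching backward step (the shapes and the cached mask `d3` are assumptions of this sketch; in backprop the same units must be dropped and the same rescaling applied):

```python
import numpy as np

np.random.seed(1)
keep_prob = 0.8

# toy activations and upstream gradients for layer 3 (arbitrary shapes)
a3 = np.random.randn(5, 4)
da3 = np.random.randn(5, 4)

# forward: build the mask, drop units, rescale (inverted dropout)
d3 = np.random.rand(a3.shape[0], a3.shape[1]) < keep_prob
a3 = np.multiply(a3, d3) / keep_prob

# backward: reuse the cached mask with the same rescaling so the
# gradient is consistent with the forward computation
da3 = np.multiply(da3, d3) / keep_prob
```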
Do not use dropout at test time: it would add noise and make the predictions random.
Understanding dropout
- Dropout works because a unit can't rely on any single input feature, since that feature may be dropped; the unit has to spread out its weights, which shrinks $W$ (an effect similar to L2 regularization).
- We can set a different dropout ratio for each layer. A layer with many units is more prone to overfitting, so its dropout ratio can be high (for example 0.5); for layers with fewer units, we set a lower dropout ratio, or set it to 0.
- A downside of dropout is that the cost function J is no longer well defined, which makes it hard to debug by plotting J per iteration.
- To solve that, turn off dropout by setting all the `keep_prob`s to 1, run the code and check that J decreases monotonically, and then turn dropout back on.
It may be difficult to get more data. To improve model performance anyway, we can use data augmentation, such as:
- Horizontal/vertical flipping
- Random transformation/distortion
- Zoom
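A rough NumPy sketch of these transformations (`augment` and its crop-based "zoom" are illustrative assumptions; real pipelines typically use a library such as torchvision or tf.image):

```python
import numpy as np

def augment(image, rng=None):
    """Return a randomly flipped/cropped copy of an (H, W, C) image array."""
    rng = rng if rng is not None else np.random.default_rng()
    out = image
    if rng.random() < 0.5:
        out = out[:, ::-1, :]  # horizontal flip
    if rng.random() < 0.5:
        out = out[::-1, :, :]  # vertical flip
    if rng.random() < 0.5:
        # crude "zoom": keep the central 80%; the input pipeline is
        # assumed to resize the crop back to the original size
        h, w, _ = out.shape
        dh, dw = int(0.1 * h), int(0.1 * w)
        out = out[dh:h - dh, dw:w - dw, :]
    return out

augmented = augment(np.random.rand(32, 32, 3))
```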
Early stopping: with more iterations, the model fits the training examples better, but the error on the validation set may start to increase (overfitting). To get the model that generalizes best, we stop the training process before reaching the best fit on the training set.
Downside
In machine learning we want to optimize the cost function J, and we also want to avoid overfitting. Early stopping couples these two tasks, so we can no longer work on them independently: by stopping early, you interrupt the optimization of J while simultaneously trying not to overfit.
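A sketch of the idea in code (`train_step`, `dev_error`, `parameters`, and the patience value are all hypothetical stand-ins for a real training loop):

```python
import copy

max_epochs, patience = 100, 5

def train_step():
    """One epoch of gradient descent on the training set (stub)."""

def dev_error():
    """Error of the current model on the dev set (stub)."""
    return 0.0

parameters = {}
best_dev_error, bad_epochs = float("inf"), 0
for epoch in range(max_epochs):
    train_step()
    err = dev_error()
    if err < best_dev_error:
        best_dev_error = err
        best_params = copy.deepcopy(parameters)  # remember the best model so far
        bad_epochs = 0
    else:
        bad_epochs += 1
        if bad_epochs >= patience:
            break  # dev error stopped improving: stop before convergence on train
```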
Normalizing the training data can speed up training because it makes the cost function easier to optimize. When features are on very different scales, gradient descent cannot take large steps because it oscillates a lot, whereas with normalized data we can use a larger step size.
Steps for normalization:
- Calculate the mean: $\mu = \frac{1}{m} \sum{x}$
- Subtract the mean: $x = x-\mu$; then $x$ is zero-centered.
- Calculate the variance: $\sigma^2 = \frac{1}{m} \sum x^2$ (element-wise square, on the zero-centered data)
- Normalize by the variance: $x = \frac{x}{\sigma^2}$
We apply the same $\mu$ and $\sigma^2$, computed on the training set, to normalize the test set, so both sets go through the identical transformation.
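Putting the steps above into NumPy (the feature-per-column layout and the toy data are assumptions; note that, following the notes, we divide by the variance $\sigma^2$):

```python
import numpy as np

# toy data: rows are examples, columns are features on very different scales
X_train = np.random.randn(1000, 3) * np.array([1.0, 50.0, 0.01]) + 5.0
X_test = np.random.randn(200, 3) * np.array([1.0, 50.0, 0.01]) + 5.0

mu = X_train.mean(axis=0)               # mean of each feature
X_train = X_train - mu                  # zero-center
sigma2 = (X_train ** 2).mean(axis=0)    # variance of each feature
X_train = X_train / sigma2              # normalize by the variance

# reuse the *training* mu and sigma2 for the test set
X_test = (X_test - mu) / sigma2
```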
The vanishing/exploding effect occurs when your activations, and with them your derivatives, become very small or very large. Suppose a deep network with linear activations and $b^{[l]}=0$, so that $\hat{y}=W^{[L]}W^{[L-1]}\cdots W^{[1]}X$:
- If $W^{[l]}>I$ (where $I$ is the identity matrix), the result will explode; in other words, $\hat{y}$ increases exponentially with the depth $L$. For example $$ W^{[l]}=\left[ \begin{array}{ll}1.5 & 0 \\ 0 & 1.5\end{array}\right] \Rightarrow \hat{y} = \left[ \begin{array}{ll}1.5 & 0 \\ 0 & 1.5\end{array}\right]^{L-1}X $$
- If $W^{[l]}<I$, the activations decrease exponentially and $\hat{y}$ vanishes: $$ W^{[l]}=\left[ \begin{array}{ll}0.5 & 0 \\ 0 & 0.5\end{array}\right] \Rightarrow \hat{y} = \left[ \begin{array}{ll}0.5 & 0 \\ 0 & 0.5\end{array}\right]^{L-1}X $$
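A quick numerical check of the two cases above, for a hypothetical 50-layer linear network:

```python
import numpy as np

X = np.array([[1.0], [1.0]])
for scale, label in [(1.5, "explodes"), (0.5, "vanishes")]:
    W = np.array([[scale, 0.0], [0.0, scale]])
    y_hat = np.linalg.matrix_power(W, 50) @ X  # 50 identical layers
    print(label, y_hat.ravel())  # ~6.4e8 for 1.5, ~8.9e-16 for 0.5
```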
As we saw in the previous video, an incorrect $W$ can lead to vanishing/exploding gradients; to avoid this problem, we choose the initial weights carefully.
In a single-neuron model with $n$ inputs, $z = w_1x_1 + \cdots + w_nx_n$: the larger $n$ is, the smaller each $w_i$ should be, so we scale the initial weights by the input size:
`W[l] = np.random.randn(shape) * np.sqrt(1/n[l-1])`
where `n[l-1]` is the number of units in the previous layer.
For the ReLU activation, it is better to use $\sqrt{\frac{2}{n^{[l-1]}}}$ (He initialization).
Other variations:
- tanh: $\sqrt{\frac{1}{n^{[l-1]}}}$ (Xavier initialization)
- Another variant: $\sqrt{\frac{2}{n^{[l-1]}+n^{[l]}}}$
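A sketch tying the variants together (the function name and the parameter-dict layout are assumptions, consistent with the earlier snippets):

```python
import numpy as np

def initialize_parameters(layer_dims, activation="relu"):
    """layer_dims = [n_x, n_1, ..., n_L]; He init for ReLU, Xavier for tanh."""
    rng = np.random.default_rng(0)
    parameters = {}
    for l in range(1, len(layer_dims)):
        fan_in = layer_dims[l - 1]
        scale = np.sqrt((2.0 if activation == "relu" else 1.0) / fan_in)
        parameters["W" + str(l)] = rng.standard_normal((layer_dims[l], fan_in)) * scale
        parameters["b" + str(l)] = np.zeros((layer_dims[l], 1))
    return parameters

params = initialize_parameters([5, 4, 3, 1])
```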
To check that our backward propagation is correctly implemented, we can compare its gradient with a numerical approximation of the gradient.
Definition of the derivative: $$ f'(\theta) = \lim_{\epsilon \to 0} \frac{f(\theta + \epsilon)-f(\theta - \epsilon)}{2\epsilon} $$ Implement gradient check:
- Take $W$ and $b$ and reshape them into a big vector $\theta$.
- Take $dW$ and $db$ and reshape them into $d\theta$.
- The cost function becomes $J(\theta) = J(\theta_1, \theta_2, \ldots)$.
- For each $i$: $$ d \theta_{\text {approx}}[i]=\frac{J\left(\theta_{1}, \theta_{2}, \ldots, \theta_{i}+\varepsilon, \ldots\right)-J\left(\theta_{1}, \theta_{2}, \ldots, \theta_{i}-\varepsilon, \ldots\right)}{2 \varepsilon} $$
- Then check: $$ \frac{{||d\theta_{approx} - d\theta||}_2}{{||d\theta_{approx}||}_2+{||d\theta||}_2} $$ If this is $\approx 10^{-7}$, great; if it is around $10^{-5}$, it can be OK, but inspect whether any component of $d\theta_{approx} - d\theta$ is particularly large. If it is $> 10^{-3}$, that is bad; there is probably a bug in the backpropagation.
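A compact sketch of the whole procedure (here `J` is any cost that takes the flat vector $\theta$; the toy quadratic at the end is only to show usage):

```python
import numpy as np

def gradient_check(J, theta, dtheta, epsilon=1e-7):
    """Compare the analytic gradient dtheta with a centered-difference estimate."""
    approx = np.zeros_like(theta)
    for i in range(theta.size):
        plus, minus = theta.copy(), theta.copy()
        plus[i] += epsilon
        minus[i] -= epsilon
        approx[i] = (J(plus) - J(minus)) / (2 * epsilon)
    return (np.linalg.norm(approx - dtheta)
            / (np.linalg.norm(approx) + np.linalg.norm(dtheta)))

# toy usage: J(theta) = theta . theta has gradient 2*theta
theta = np.array([1.0, -2.0, 3.0])
print(gradient_check(lambda t: t @ t, theta, 2 * theta))  # ~1e-10: correct
```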
Tips:
- Don't run grad check during training; use it only to debug.
- If the algorithm fails grad check, look at the individual components to try to identify the bug.
- Remember to include the regularization term in both $J$ and the gradients.
- Grad check doesn't work with dropout, because dropout makes $J$ random; turn dropout off first.
- Run it at random initialization, and perhaps again after some training, because some bugs only show up after the weights have grown away from their initial values.