Practical aspects of Deep Learning

[TOC]

Train/dev/test set

Machine learning is a highly iterative process: based on the outcome, you change your hyperparameters until you find the best configuration. Because of that, you need a held-out dataset to evaluate the model.

It is convenient to divide your dataset into:

  • Train set: for training the model
  • Hold-out/cross-validation/development set: used to see which of many different models performs best
  • Test set: once you have the final model, take the best model you have found and evaluate it on the test set
| Dataset size | Train | Dev | Test |
| --- | --- | --- | --- |
| 100 to 10,000 | 60% | 20% | 20% |
| more than 1,000,000 | 98% | 1% | 1% |
  • Make sure the dev and test sets come from the same distribution.
  • It may be OK not to have a test set: train on the training set and evaluate on the dev set.
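A minimal sketch of such a split (the `split_dataset` helper is an illustrative assumption; examples are stacked as columns, following the course convention):

```python
import numpy as np

def split_dataset(X, Y, dev_frac=0.01, test_frac=0.01, seed=0):
    """Shuffle examples (stored as columns) and split into train/dev/test."""
    m = X.shape[1]
    perm = np.random.default_rng(seed).permutation(m)
    X, Y = X[:, perm], Y[:, perm]
    n_dev, n_test = int(m * dev_frac), int(m * test_frac)
    n_train = m - n_dev - n_test
    return ((X[:, :n_train], Y[:, :n_train]),                                # train
            (X[:, n_train:n_train + n_dev], Y[:, n_train:n_train + n_dev]),  # dev
            (X[:, n_train + n_dev:], Y[:, n_train + n_dev:]))                # test
```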

Bias/Variance

  • High bias: underfitting — the model fits neither the training set nor the test set well
  • High variance: overfitting — the model fits the training set well but not the test set
| | High variance | High bias | High bias & variance | Low bias & variance |
| --- | --- | --- | --- | --- |
| Training error | 1% | 15% | 15% | 0.5% |
| Test error | 11% | 16% | 30% | 1% |

(These diagnoses assume the human/Bayes error is close to 0%.)

Basic recipe for ML

*(Figure: the basic recipe flowchart — first check for high bias and fix it with a bigger network, longer training, or a different architecture; then check for high variance and fix it with more data, regularization, or a different architecture.)*

Regularizing your neural network

To avoid overfitting, there are several strategies that can be applied, such as:

  • Penalize model complexity → regularization
  • Dropout
  • Early stopping: stop training before you fully converge on the training data
  • Data augmentation

Regularization

How do we evaluate model complexity? We use L1 or L2 regularization. From a Bayesian viewpoint, this corresponds to a prior that the weights are centered at 0 and not too large. Given a cost function, we add a regularization term: $$ J(w,b)=\frac{1}{m}\sum_{i=1}^m \mathcal{L}(\hat{y}^{(i)},y^{(i)})+\frac{\lambda}{2m}||w||^2_2 $$ where $||w||^2_2$ is the squared L2 norm. We can also use the L1 norm (less frequent), $||w||_1$. With L1, $w$ will be sparse: it will have a lot of zeros.

Normally we don't add a regularization term for $b$: $w$ is a high-dimensional vector and accounts for almost all of the high-variance problem, whereas $b$ is just a single number per unit. Adding it just doesn't make a difference.

When we implement the cost function for a neural network, $w^{[l]}$ has shape $(n^{[l]}, n^{[l-1]})$, so $$ ||w^{[l]}||^2_F = \sum_{i=1}^{n^{[l]}} \sum_{j=1}^{n^{[l-1]}} \left(w_{ij}^{[l]}\right)^2 $$ This is called the Frobenius norm of the matrix, rather than the L2 norm of the matrix.
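A minimal sketch of computing the penalized cost, assuming a `parameters` dict holding the weight matrices under keys `W1, W2, ...` and a precomputed `cross_entropy_cost`:

```python
import numpy as np

def cost_with_l2(cross_entropy_cost, parameters, lambd, m):
    """Add the (lambda / 2m) * sum of squared Frobenius norms of all W^[l]."""
    l2_term = sum(np.sum(np.square(W))
                  for key, W in parameters.items() if key.startswith("W"))
    return cross_entropy_cost + (lambd / (2 * m)) * l2_term
```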

Gradient descent

The original backward propagation gives $dw^{[l]} = \frac{\partial J}{\partial w^{[l]}}$, but now we also have to differentiate the regularization term:

  • $dw^{[l]} = dw^{[l]}_{orig} + \frac{\lambda}{m}w^{[l]}$
  • Update the weights: $w^{[l]} := w^{[l]} - \alpha\, dw^{[l]} = w^{[l]} - \alpha \left( dw^{[l]}_{orig}+\frac{\lambda}{m}w^{[l]} \right) = \left(1-\frac{\alpha \lambda}{m}\right) w^{[l]} - \alpha\, dw^{[l]}_{orig}$

Observe in the previous equation that $w^{[l]}$ gets multiplied by $\left(1-\frac{\alpha \lambda}{m}\right) < 1$, so every update shrinks the weights slightly before applying the usual gradient step; for this reason, L2 regularization is also called weight decay.
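As a minimal sketch of that update for one layer (assuming `dW_orig` is the unregularized gradient from backprop):

```python
def update_with_weight_decay(W, dW_orig, lambd, m, learning_rate):
    """One gradient step with L2 regularization ("weight decay")."""
    dW = dW_orig + (lambd / m) * W   # derivative of the (lambda/2m)*||W||_F^2 term
    # equivalently: W *= (1 - learning_rate*lambd/m); W -= learning_rate * dW_orig
    return W - learning_rate * dW
```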

For reference, the full L2-regularized cost: $$ J_{regularized} = \underbrace{-\frac{1}{m} \sum_{i=1}^{m} \left( y^{(i)}\log\left(a^{[L](i)}\right) + (1-y^{(i)})\log\left(1- a^{[L](i)}\right) \right)}_{\text{cross-entropy cost}} + \underbrace{\frac{\lambda}{2m} \sum_{l}\sum_{k}\sum_{j} \left(W_{k,j}^{[l]}\right)^2}_{\text{L2 regularization cost}} $$

Why does regularization reduce overfitting?

  • If $\lambda$ is large, the weight matrices $W$ are incentivized to be reasonably close to zero (large weights are penalized in the cost function); as a result the network behaves like a simpler model. For example, with tanh activations, small weights keep $z$ in the near-linear region of tanh, so each layer is roughly linear and the whole network computes a function close to linear.

Dropout

With a certain probability, keep some units and remove the others. As a result, on each training example you are effectively training a smaller network.

Implement dropout using inverted dropout:

```python
import numpy as np

keep_prob = 0.8   # 0 <= keep_prob <= 1: probability of keeping a unit
# this code is only for layer 3
# entries of d3 that are less than keep_prob are True: 80% of units stay, 20% are dropped
d3 = np.random.rand(a3.shape[0], a3.shape[1]) < keep_prob

a3 = np.multiply(a3, d3)   # zero out the dropped units

# scale a3 up so the expected value of the output is unchanged
# (this is the "inverted" part of inverted dropout)
a3 = a3 / keep_prob
```

The last step, `a3 /= keep_prob`, is used so as not to alter the expected value of $Z^{[l+1]} = W^{[l+1]}a^{[l]}+b^{[l+1]}$.

Do not use dropout at test time: it would add noise, making the predictions random.

Understanding dropout

  • Dropout works because a unit can't rely on any one input feature, since that feature may go away, so it has to spread out the weights, shrinking $W$ (an effect similar to L2 regularization).
  • We can set a different dropout rate for each layer. If a layer has a lot of units, it is more prone to overfitting, so its dropout rate can be high (for example 0.5); in layers with fewer units we set a lower dropout rate, or 0 (no dropout).
  • A downside of dropout is that the cost function $J$ is no longer well defined, which makes it hard to debug (e.g., plotting $J$ against the iteration number).
    • To solve that, turn off dropout (set all the keep_probs to 1), run the code and check that $J$ decreases monotonically, then turn dropout back on.
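During backpropagation the same mask and scaling should be applied, so that gradients flow only through the units that were kept. A sketch continuing the layer-3 code above (reusing `d3` and `keep_prob` from the forward pass):

```python
dA3 = dA3 * d3           # zero the gradients of dropped units with the same mask
dA3 = dA3 / keep_prob    # apply the same inverted-dropout scaling as in the forward pass
```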

Data augmentation

It may be difficult or expensive to get more data; to improve model performance we can instead use data augmentation, such as:

  • Horizontal/vertical flipping
  • Random transformation/distortion
  • Zoom
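For images stored as numpy arrays, some of these transformations are one-liners. A minimal sketch (the `augment` helper and its cropping scheme are illustrative assumptions, not from the course):

```python
import numpy as np

def augment(img, rng=np.random.default_rng()):
    """Randomly flip and crop an image of shape (height, width, channels)."""
    if rng.random() < 0.5:
        img = img[:, ::-1, :]                  # horizontal flip
    h, w, _ = img.shape
    dy = int(rng.integers(0, h // 10 + 1))     # crop away up to 10% of each border,
    dx = int(rng.integers(0, w // 10 + 1))     # a crude stand-in for zoom/distortion
    return img[dy:h - dy, dx:w - dx, :]
```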

Early stopping

*(Figure: training error keeps decreasing with iterations while dev error starts to rise; early stopping halts at the minimum dev error.)*

With more iterations, the model fits the training examples better, but the error on the validation set may start to increase (overfitting). To get the model that generalizes best, we stop the training process before it reaches the best fit on the training set.
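A minimal sketch of the loop (`train_one_epoch`, `dev_error`, an initialized `params`, and the patience value are all illustrative assumptions):

```python
import copy

max_epochs, patience = 100, 5
best_error, best_params, bad_epochs = float("inf"), None, 0
for epoch in range(max_epochs):
    params = train_one_epoch(params)       # one pass of gradient descent
    err = dev_error(params)                # evaluate on the dev set
    if err < best_error:
        best_error, best_params, bad_epochs = err, copy.deepcopy(params), 0
    else:
        bad_epochs += 1
        if bad_epochs >= patience:         # dev error stopped improving: stop early
            break
```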

Downside

In machine learning we want to (1) optimize the cost function $J$ and (2) avoid overfitting. By using early stopping we couple these two tasks and can no longer work on them independently: stopping early interrupts whatever you were doing to optimize $J$ while simultaneously trying not to overfit.

Setting up the optimization problem

Normalizing the training set

Normalizing the training data can speed up training because it makes the cost function easier to optimize. When the features are on very different scales, gradient descent cannot take large steps because it oscillates a lot, whereas with normalized data we can use a larger step size.

*(Figure: cost contours are elongated with unnormalized features and roughly circular after normalization.)*

Steps for normalization:

  1. Calculate the mean: $\mu = \frac{1}{m} \sum_{i=1}^{m} x^{(i)}$
  2. Subtract the mean: $x := x-\mu$; now $x$ is zero-centered.
  3. Calculate the variance (elementwise, on the centered data): $\sigma^2 = \frac{1}{m} \sum_{i=1}^{m} \left(x^{(i)}\right)^2$
  4. Normalize by the standard deviation: $x := x/\sigma$

We apply the same $\mu, \sigma^2$ (computed on the training set) to normalize the training, validation and test sets.
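A minimal numpy sketch, using the convention that examples are stacked as columns:

```python
import numpy as np

def fit_normalizer(X_train):
    """Compute mu and sigma^2 per feature from the training set only."""
    mu = np.mean(X_train, axis=1, keepdims=True)
    sigma2 = np.var(X_train, axis=1, keepdims=True)
    return mu, sigma2

def normalize(X, mu, sigma2):
    """Apply the training-set statistics to any split (train/dev/test)."""
    return (X - mu) / np.sqrt(sigma2 + 1e-8)   # epsilon guards against zero variance
```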

Vanishing /exploding gradients

The vanishing/exploding gradient effect occurs when your derivatives become very small or very large. Suppose that $g(z)=z$ (linear activation) and $b^{[l]}=0$; then the output is: $$ \hat{y} = W^{[L]}W^{[L-1]}\cdots W^{[2]}W^{[1]}X $$ If we have 2 hidden units per layer and $x_1 = x_2 = 1$:

  • If $W^{[l]} > I$ (where $I$ is the identity matrix), the result will explode; in other words, $\hat{y}$ increases exponentially with depth. For example $$ W^{[l]}=\begin{bmatrix} 1.5 & 0 \\ 0 & 1.5 \end{bmatrix} \implies \hat{y} = \begin{bmatrix} 1.5 & 0 \\ 0 & 1.5 \end{bmatrix}^{L-1}X $$

  • If $W^{[l]} < I$, it will decrease exponentially and $\hat{y}$ will vanish:

    $$ W^{[l]}=\begin{bmatrix} 0.5 & 0 \\ 0 & 0.5 \end{bmatrix} \implies \hat{y} = \begin{bmatrix} 0.5 & 0 \\ 0 & 0.5 \end{bmatrix}^{L-1}X $$
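This is easy to verify numerically; a toy sketch with 100 such linear layers:

```python
import numpy as np

X = np.ones((2, 1))
for scale in (1.5, 0.5):
    W = scale * np.eye(2)
    y_hat = np.linalg.matrix_power(W, 100) @ X   # 100 identical linear layers
    print(scale, y_hat[0, 0])   # 1.5 -> ~4e17 (explodes), 0.5 -> ~8e-31 (vanishes)
```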

Weight initialization for deep networks

As we have seen in the previous video, a badly scaled $W$ can lead to vanishing/exploding gradients; to mitigate this problem, we initialize the weights carefully.

In a single-neuron model, $z = w_1 x_1 + w_2 x_2 + \dots + w_n x_n$. The larger $n$ is, the smaller each $w_i$ should be so that $z$ does not blow up. One thing we can do is set the variance $Var(w_i) = \frac{1}{n}$, so that $Var(z) \approx 1$.

```python
W[l] = np.random.randn(n[l], n[l-1]) * np.sqrt(1 / n[l-1])
```

where $n^{[l-1]}$ is the number of inputs fed into each unit of layer $l$. We draw $W^{[l]}$ from a Gaussian distribution, which is zero-centered with variance 1; multiplying by $\sqrt{1/n^{[l-1]}}$ gives $z$ a similar scale across layers, helping with the exploding/vanishing gradient problem.

For ReLU activations, it is better to use $Var(w_i) = \frac{2}{n}$ (He initialization).

Other variations:

  • tanh: $\sqrt{\frac{1}{n^{[l-1]}}}$ (Xavier initialization)
  • Another variant: $\sqrt{\frac{2}{n^{[l-1]}+n^{[l]}}}$
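A sketch of these schemes for one layer (the `variant` names are illustrative, not standard API names):

```python
import numpy as np

def initialize_layer(n_out, n_in, variant="he", rng=np.random.default_rng()):
    """Initialize W with shape (n^[l], n^[l-1]) as a scaled Gaussian."""
    if variant == "he":             # ReLU layers: Var(w) = 2/n_in
        scale = np.sqrt(2.0 / n_in)
    elif variant == "xavier":       # tanh layers: Var(w) = 1/n_in
        scale = np.sqrt(1.0 / n_in)
    else:                           # variant using both fan-in and fan-out
        scale = np.sqrt(2.0 / (n_in + n_out))
    W = rng.standard_normal((n_out, n_in)) * scale
    b = np.zeros((n_out, 1))        # biases can simply start at zero
    return W, b
```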

Gradient checking

To check that our backward propagation is correctly implemented, we can compare its output against a numerical approximation of the gradient.

Definition of the derivative: $$ f'(\theta) = \lim_{\epsilon \to 0} \frac{f(\theta + \epsilon)-f(\theta - \epsilon)}{2\epsilon} $$ To implement gradient checking:

  1. Take W and b and reshape into a big vector $\theta$.

  2. Take dW and db and reshape into $d\theta$.

  3. The cost function will be $J(\theta) = J(\theta_1, \theta_2….)$.

  4. For each i, $$ d \theta_{\text {approx}}[i]=\frac{J\left(\theta_{1}, \theta_{2}, \ldots, \theta_{i}+\varepsilon, \ldots\right)-J\left(\theta_{1}, \theta_{2}, \ldots, \theta_{i}-\varepsilon, \ldots\right)}{2 \varepsilon} $$

  5. Then check: $$ \frac{||d\theta_{approx} - d\theta||_2}{||d\theta_{approx}||_2+||d\theta||_2} $$ If this is $\approx 10^{-7}$, great; if it is around $10^{-5}$, it can be OK, but inspect whether any components of $d\theta_{approx} - d\theta$ are particularly large. If it is $> 10^{-3}$, that is bad: there is probably a bug in the backpropagation.
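A minimal sketch of the whole check, assuming `J` is a function of the flattened parameter vector and `grad` is the backprop gradient reshaped the same way:

```python
import numpy as np

def gradient_check(J, theta, grad, epsilon=1e-7):
    """Compare the backprop gradient to a centered-difference approximation."""
    grad_approx = np.zeros_like(theta)
    for i in range(theta.size):
        theta_plus, theta_minus = theta.copy(), theta.copy()
        theta_plus[i] += epsilon
        theta_minus[i] -= epsilon
        grad_approx[i] = (J(theta_plus) - J(theta_minus)) / (2 * epsilon)
    diff = (np.linalg.norm(grad_approx - grad)
            / (np.linalg.norm(grad_approx) + np.linalg.norm(grad)))
    return diff   # ~1e-7 great, ~1e-5 inspect, > 1e-3 probably a bug
```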

Tips:

  • Don't use gradient checking in training, only to debug.
  • If the algorithm fails grad check, look at the individual components to try to identify the bug.
  • Remember regularization: the regularization term must be included in both $J$ and the gradients.
  • It doesn't work with dropout, because dropout makes $J$ random; turn dropout off first.
  • Run it at random initialization, and perhaps again after some training, because some bugs only show up once the weights have grown away from their initial values.