[TOC]
Machine learning is a highly iterative process: based on the outcome, you change your hyperparameters until you find the best configuration. Because of this, you need a held-out dataset to evaluate the model.
It is convenient to divide your dataset into:
- Train: for training the model
- Hold-out/cross-validation/development set: used to see which of many different models performs best
- Test set: when you have finished tuning, take the best model you have found and evaluate it on the test set
| Dataset size | Train | Dev | Test |
|---|---|---|---|
| 100 to 10,000 | 60% | 20% | 20% |
| 1,000,000 or more | 98% | 1% | 1% |
- Make sure the dev and test sets come from the same distribution.
- It may be OK not to have a test set: train on the training set and evaluate on the dev set.
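As a minimal sketch of such a split (the array names, synthetic data, and the 98/1/1 ratio are assumptions for illustration): shuffle once, then slice. Shuffling a single dataset this way also guarantees that dev and test come from the same distribution.

```python
import numpy as np

# synthetic stand-in data: 1,000,000 examples, 10 features (an assumption)
m = 1_000_000
X = np.random.randn(m, 10)
y = np.random.randint(0, 2, size=m)

# shuffle once, then carve out 98% train / 1% dev / 1% test
idx = np.random.permutation(m)
n_train, n_dev = int(0.98 * m), int(0.01 * m)

X_train, y_train = X[idx[:n_train]], y[idx[:n_train]]
X_dev, y_dev = X[idx[n_train:n_train + n_dev]], y[idx[n_train:n_train + n_dev]]
X_test, y_test = X[idx[n_train + n_dev:]], y[idx[n_train + n_dev:]]
```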
- High bias: underfitting; the model fits neither the training set nor the test set well
- High variance: overfitting; the model fits the training set well but not the test set
| | High variance | High bias | High bias & variance | Low bias & variance |
|---|---|---|---|---|
| Training error | 1% | 15% | 15% | 0.5% |
| Test error | 11% | 16% | 30% | 1% |
To avoid overfitting, there are some strategies that can be applied, such as:
- Penalize model complexity -> regularization
- Dropout
- Early stopping: stop training before you fully converge on the training data
- Data augmentation
How do we penalize model complexity? We use L1 or L2 regularization. From a Bayesian viewpoint, this corresponds to a prior that the weights are centered at 0 and not too big. Given a cost function, we add a regularization term: $$ J(w,b)=\frac{1}{m}\sum_{i=1}^m \mathcal{L}(\hat{y}^{(i)},y^{(i)})+\frac{\lambda}{2m}||w||^2_2 $$ where $||w||^2_2$ is the squared L2 norm. We can also use the L1 norm, $||w||_1$ (less frequent). With L1 regularization, $w$ will be sparse: it will have a lot of zeros.
Normally we don't add a regularization term for $b$: $w$ is a high-dimensional vector, which is where the high-variance problem lives, whereas $b$ is just a single number, so including it makes practically no difference.
When we implement the cost function for a neural network, the regularization term sums over all the weight matrices: $$ J(w^{[1]},b^{[1]},\ldots,w^{[L]},b^{[L]})=\frac{1}{m}\sum_{i=1}^m \mathcal{L}(\hat{y}^{(i)},y^{(i)})+\frac{\lambda}{2m}\sum_{l=1}^{L}||w^{[l]}||^2_F $$ where $||w^{[l]}||^2_F$ is the Frobenius norm: the sum of the squares of all entries of $w^{[l]}$.
Gradient descent
The original backward propagation gives $dw^{[l]}_{orig}$; with L2 regularization we add the derivative of the penalty term:
$dw^{[l]}= dw^{[l]}_{orig}+ \frac{\lambda}{m}w^{[l]}$ - Update weight: $w^{[l]} = w^{[l]} - \alpha\, dw^{[l]} = w^{[l]} - \alpha \left( dw^{[l]}_{orig}+\frac{\lambda}{m}w^{[l]}\right) = \left(1-\frac{\alpha \lambda}{m}\right)w^{[l]} - \alpha\, dw^{[l]}_{orig}$
Observe in the previous equation that $dw^{[l]}$ is increased by the term $\frac{\lambda}{m}w^{[l]}$, so every update multiplies $w^{[l]}$ by $\left(1-\frac{\alpha\lambda}{m}\right) < 1$; for this reason L2 regularization is also called weight decay.
L2 regularization for a neural network: $$ J_{regularized} = \small \underbrace{-\frac{1}{m} \sum\limits_{i = 1}^{m} \large{(}\small y^{(i)}\log\left(a^{[L](i)}\right) + (1-y^{(i)})\log\left(1- a^{[L](i)}\right) \large{)} }_\text{cross-entropy cost} + \underbrace{\frac{1}{m} \frac{\lambda}{2} \sum\limits_l\sum\limits_k\sum\limits_j W_{k,j}^{[l]2} }_\text{L2 regularization cost} $$
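A minimal NumPy sketch of the two formulas above (the `parameters`/`grads` dict layout with keys `'W1'`, `'dW1'`, … is an assumption of this sketch, not a fixed API):

```python
import numpy as np

def l2_regularized_cost(AL, Y, parameters, lambd):
    """Cross-entropy cost plus (lambda/2m) * sum of squared weights."""
    m = Y.shape[1]
    cross_entropy = -np.sum(Y * np.log(AL) + (1 - Y) * np.log(1 - AL)) / m
    L = len(parameters) // 2  # parameters holds W1, b1, ..., WL, bL
    l2_cost = (lambd / (2 * m)) * sum(
        np.sum(np.square(parameters["W" + str(l)])) for l in range(1, L + 1))
    return cross_entropy + l2_cost

def update_with_weight_decay(parameters, grads, alpha, lambd, m):
    """One gradient-descent step with the extra (lambda/m) * W term."""
    L = len(parameters) // 2
    for l in range(1, L + 1):
        W = parameters["W" + str(l)]
        dW = grads["dW" + str(l)] + (lambd / m) * W  # regularized gradient
        parameters["W" + str(l)] = W - alpha * dW    # == (1 - alpha*lambd/m)*W - alpha*dW_orig
        parameters["b" + str(l)] -= alpha * grads["db" + str(l)]
    return parameters
```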
Why does regularization reduce overfitting?
- If $\lambda$ is large, the weight matrices $W$ are incentivized to be reasonably close to zero (since large weights are penalized in the cost function); as a result the network behaves like a simpler model.
Dropout: with a certain probability, each unit is kept and the others are removed. As a result, you are effectively training a smaller network on each training pass.
Implement dropout using inverted dropout:
```python
import numpy as np

keep_prob = 0.8  # 0 <= keep_prob <= 1: probability that a unit is kept
# this code is only for layer 3; a3 is that layer's activation from forward prop
# units whose random value is below keep_prob are kept: ~80% stay, ~20% dropped
d3 = np.random.rand(a3.shape[0], a3.shape[1]) < keep_prob
a3 = np.multiply(a3, d3)  # zero out the dropped units
# scale a3 up so the expected value of the output is unchanged
# (this solves the scaling problem)
a3 = a3 / keep_prob
```
The last step, `a3 /= keep_prob`, ensures the expected value of $a^{[3]}$ is not altered; this is the "inverted" part of inverted dropout, and it means no extra scaling is needed at test time.
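A self-contained toy version of the forward step above together with the matching backward step (the shapes and the cached mask `d3` are assumptions of this sketch; in backprop the same units must be dropped and the same rescaling applied):

```python
import numpy as np

np.random.seed(1)
keep_prob = 0.8

# toy activations and upstream gradients for layer 3 (arbitrary shapes)
a3 = np.random.randn(5, 4)
da3 = np.random.randn(5, 4)

# forward: build the mask, drop units, rescale (inverted dropout)
d3 = np.random.rand(a3.shape[0], a3.shape[1]) < keep_prob
a3 = np.multiply(a3, d3) / keep_prob

# backward: reuse the cached mask with the same rescaling so the
# gradient is consistent with the forward computation
da3 = np.multiply(da3, d3) / keep_prob
```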
Do not use dropout at test time: it would add noise and make the predictions random.
Understanding dropout
- Dropout works because a unit can't rely on any single input feature, since that feature may be dropped; the unit has to spread out its weights, which shrinks $W$ (an effect similar to L2 regularization).
- We can set a different dropout ratio for each layer. A layer with many units is more prone to overfitting, so its dropout ratio can be high (for example 0.5); for layers with fewer units, we set a lower dropout ratio, or set it to 0.
- A downside of dropout is that the cost function J is no longer well defined, which makes it hard to debug by plotting J per iteration.
- To solve that, turn off dropout by setting all the `keep_prob`s to 1, run the code and check that J decreases monotonically, and then turn dropout back on.
It may be difficult to get more data. To improve model performance anyway, we can use data augmentation, such as:
- Horizontal/vertical flipping
- Random transformation/distortion
- Zoom
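A rough NumPy sketch of these transformations (`augment` and its crop-based "zoom" are illustrative assumptions; real pipelines typically use a library such as torchvision or tf.image):

```python
import numpy as np

def augment(image, rng=None):
    """Return a randomly flipped/cropped copy of an (H, W, C) image array."""
    rng = rng if rng is not None else np.random.default_rng()
    out = image
    if rng.random() < 0.5:
        out = out[:, ::-1, :]  # horizontal flip
    if rng.random() < 0.5:
        out = out[::-1, :, :]  # vertical flip
    if rng.random() < 0.5:
        # crude "zoom": keep the central 80%; the input pipeline is
        # assumed to resize the crop back to the original size
        h, w, _ = out.shape
        dh, dw = int(0.1 * h), int(0.1 * w)
        out = out[dh:h - dh, dw:w - dw, :]
    return out

augmented = augment(np.random.rand(32, 32, 3))
```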
Early stopping: with more iterations, the model fits the training examples better, but the error on the validation set may start to increase (overfitting). To get the model that generalizes best, we stop the training process before reaching the best fit on the training set.
Downside
In machine learning we want to optimize the cost function J, and we also want to avoid overfitting. Early stopping couples these two tasks, so we can no longer work on them independently: by stopping early, you interrupt the optimization of J while simultaneously trying not to overfit.
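A sketch of the idea in code (`train_step`, `dev_error`, `parameters`, and the patience value are all hypothetical stand-ins for a real training loop):

```python
import copy

max_epochs, patience = 100, 5

def train_step():
    """One epoch of gradient descent on the training set (stub)."""

def dev_error():
    """Error of the current model on the dev set (stub)."""
    return 0.0

parameters = {}
best_dev_error, bad_epochs = float("inf"), 0
for epoch in range(max_epochs):
    train_step()
    err = dev_error()
    if err < best_dev_error:
        best_dev_error = err
        best_params = copy.deepcopy(parameters)  # remember the best model so far
        bad_epochs = 0
    else:
        bad_epochs += 1
        if bad_epochs >= patience:
            break  # dev error stopped improving: stop before convergence on train
```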
Normalizing the training data can speed up training because it makes the cost function easier to optimize. When features are on very different scales, gradient descent cannot take large steps because it oscillates a lot, whereas with normalized data we can use a larger step size.
Steps for normalization:
- Calculate the mean: $\mu = \frac{1}{m} \sum{x}$
- Subtract the mean: $x = x-\mu$; then $x$ is zero-centered.
- Calculate the variance: $\sigma^2 = \frac{1}{m} \sum x^2$ (element-wise square, on the zero-centered data)
- Normalize by the variance: $x = \frac{x}{\sigma^2}$
We apply the same $\mu$ and $\sigma^2$, computed on the training set, to normalize the test set, so both sets go through the identical transformation.
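Putting the steps above into NumPy (the feature-per-column layout and the toy data are assumptions; note that, following the notes, we divide by the variance $\sigma^2$):

```python
import numpy as np

# toy data: rows are examples, columns are features on very different scales
X_train = np.random.randn(1000, 3) * np.array([1.0, 50.0, 0.01]) + 5.0
X_test = np.random.randn(200, 3) * np.array([1.0, 50.0, 0.01]) + 5.0

mu = X_train.mean(axis=0)               # mean of each feature
X_train = X_train - mu                  # zero-center
sigma2 = (X_train ** 2).mean(axis=0)    # variance of each feature
X_train = X_train / sigma2              # normalize by the variance

# reuse the *training* mu and sigma2 for the test set
X_test = (X_test - mu) / sigma2
```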
The vanishing/exploding effect occurs when your activations, and with them your derivatives, become very small or very large. Suppose a deep network with linear activations and $b^{[l]}=0$, so that $\hat{y}=W^{[L]}W^{[L-1]}\cdots W^{[1]}X$:
- If $W^{[l]}>I$ (where $I$ is the identity matrix), the result will explode; in other words, $\hat{y}$ increases exponentially with the depth $L$. For example $$ W^{[l]}=\left[ \begin{array}{ll}1.5 & 0 \\ 0 & 1.5\end{array}\right] \Rightarrow \hat{y} = \left[ \begin{array}{ll}1.5 & 0 \\ 0 & 1.5\end{array}\right]^{L-1}X $$
- If $W^{[l]}<I$, the activations decrease exponentially and $\hat{y}$ vanishes: $$ W^{[l]}=\left[ \begin{array}{ll}0.5 & 0 \\ 0 & 0.5\end{array}\right] \Rightarrow \hat{y} = \left[ \begin{array}{ll}0.5 & 0 \\ 0 & 0.5\end{array}\right]^{L-1}X $$
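A quick numerical check of the two cases above, for a hypothetical 50-layer linear network:

```python
import numpy as np

X = np.array([[1.0], [1.0]])
for scale, label in [(1.5, "explodes"), (0.5, "vanishes")]:
    W = np.array([[scale, 0.0], [0.0, scale]])
    y_hat = np.linalg.matrix_power(W, 50) @ X  # 50 identical layers
    print(label, y_hat.ravel())  # ~6.4e8 for 1.5, ~8.9e-16 for 0.5
```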
As we saw in the previous video, an incorrect $W$ can lead to vanishing/exploding gradients; to avoid this problem, we choose the initial weights carefully.
In a single-neuron model with $n$ inputs, $z = w_1x_1 + \cdots + w_nx_n$: the larger $n$ is, the smaller each $w_i$ should be, so we scale the initial weights by the input size:
`W[l] = np.random.randn(shape) * np.sqrt(1/n[l-1])`
where `n[l-1]` is the number of units in the previous layer.
For the ReLU activation, it is better to use $\sqrt{\frac{2}{n^{[l-1]}}}$ (He initialization).
Other variations:
- tanh: $\sqrt{\frac{1}{n^{[l-1]}}}$ (Xavier initialization)
- Another variant: $\sqrt{\frac{2}{n^{[l-1]}+n^{[l]}}}$
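A sketch tying the variants together (the function name and the parameter-dict layout are assumptions, consistent with the earlier snippets):

```python
import numpy as np

def initialize_parameters(layer_dims, activation="relu"):
    """layer_dims = [n_x, n_1, ..., n_L]; He init for ReLU, Xavier for tanh."""
    rng = np.random.default_rng(0)
    parameters = {}
    for l in range(1, len(layer_dims)):
        fan_in = layer_dims[l - 1]
        scale = np.sqrt((2.0 if activation == "relu" else 1.0) / fan_in)
        parameters["W" + str(l)] = rng.standard_normal((layer_dims[l], fan_in)) * scale
        parameters["b" + str(l)] = np.zeros((layer_dims[l], 1))
    return parameters

params = initialize_parameters([5, 4, 3, 1])
```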
To check that our backward propagation is correctly implemented, we can compare its gradient with a numerical approximation of the gradient.
Definition of the derivative: $$ f'(\theta) = \lim_{\epsilon \to 0} \frac{f(\theta + \epsilon)-f(\theta - \epsilon)}{2\epsilon} $$ Implement gradient check:
- Take $W$ and $b$ and reshape them into a big vector $\theta$.
- Take $dW$ and $db$ and reshape them into $d\theta$.
- The cost function becomes $J(\theta) = J(\theta_1, \theta_2, \ldots)$.
- For each $i$: $$ d \theta_{\text {approx}}[i]=\frac{J\left(\theta_{1}, \theta_{2}, \ldots, \theta_{i}+\varepsilon, \ldots\right)-J\left(\theta_{1}, \theta_{2}, \ldots, \theta_{i}-\varepsilon, \ldots\right)}{2 \varepsilon} $$
- Then check: $$ \frac{{||d\theta_{approx} - d\theta||}_2}{{||d\theta_{approx}||}_2+{||d\theta||}_2} $$ If this is $\approx 10^{-7}$, great; if it is around $10^{-5}$, it can be OK, but inspect whether any component of $d\theta_{approx} - d\theta$ is particularly large. If it is $> 10^{-3}$, that is bad; there is probably a bug in the backpropagation.
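A compact sketch of the whole procedure (here `J` is any cost that takes the flat vector $\theta$; the toy quadratic at the end is only to show usage):

```python
import numpy as np

def gradient_check(J, theta, dtheta, epsilon=1e-7):
    """Compare the analytic gradient dtheta with a centered-difference estimate."""
    approx = np.zeros_like(theta)
    for i in range(theta.size):
        plus, minus = theta.copy(), theta.copy()
        plus[i] += epsilon
        minus[i] -= epsilon
        approx[i] = (J(plus) - J(minus)) / (2 * epsilon)
    return (np.linalg.norm(approx - dtheta)
            / (np.linalg.norm(approx) + np.linalg.norm(dtheta)))

# toy usage: J(theta) = theta . theta has gradient 2*theta
theta = np.array([1.0, -2.0, 3.0])
print(gradient_check(lambda t: t @ t, theta, 2 * theta))  # ~1e-10: correct
```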
Tips:
- Don't run grad check during training; use it only to debug.
- If the algorithm fails grad check, look at the individual components to try to identify the bug.
- Remember to include the regularization term in both $J$ and the gradients.
- Grad check doesn't work with dropout, because dropout makes $J$ random; turn dropout off first.
- Run it at random initialization, and perhaps again after some training, because some bugs only show up after the weights have grown away from their initial values.