[TOC]
Why RNN?
There is a lot of sequential data, and to understand it we need information from previous time steps. In other words, humans don't start their thinking from scratch every time; they build on previous information.
Examples of sequential data in applications:
- Speech recognition
- Music generation
- Sentiment classification
- DNA sequence analysis
- Machine translation
- Video activity recognition
- Named entity recognition:
  - X: "Harry Potter and Hermione Granger invented a new spell."
  - Y: 1 1 0 1 1 0 0 0 0, where 1 means the word is part of a name and 0 means it is not
There are some problems when we try to use a traditional network for sequential data:
- Inputs and outputs can have different lengths in different examples (for example in sentiment classification, the input can be a sentence, but the output is a number between 1 and 5)
- The traditional network doesn't share features learned across different positions of the text
  - For example, if the network has learned in some position that "Harry" is a name, it would be nice if a name appearing in another position could also be recognized.
- Language usually has dependencies, for example the verb form depends on whether the subject is singular or plural.
Notation
- Superscript $[l]$ denotes an object associated with the $l^{th}$ layer.
  - Example: $a^{[4]}$ is the $4^{th}$ layer activation. $W^{[5]}$ and $b^{[5]}$ are the $5^{th}$ layer parameters.
- Superscript $(i)$ denotes an object associated with the $i^{th}$ example.
  - Example: $x^{(i)}$ is the $i^{th}$ training example input.
- Superscript $\langle t \rangle$ denotes an object at the $t^{th}$ time step.
  - Example: $x^{\langle t \rangle}$ is the input x at the $t^{th}$ time step. $x^{(i)\langle t \rangle}$ is the input at the $t^{th}$ time step of example $i$.
- Subscript $i$ denotes the $i^{th}$ entry of a vector.
  - Example: $a^{[l]}_i$ denotes the $i^{th}$ entry of the activations in layer $l$.
- $T_x$ is the length of the input $x$; $T_y$ is the length of the output $y$.
NLP
In natural language processing, we have to build a dictionary (vocabulary) to represent words.
For modern applications the dictionary size is normally between 30,000 and 50,000 words, but some large companies use even bigger dictionaries. A dictionary looks like:
- [a ... And ... Harry ... Potter ... Zulu]
Then we create a one-hot encoding for each word.
In addition, we add `<UNK>` for words that can't be found in our dictionary and `<EOS>` to represent the end of a sentence.
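For illustration, here is a minimal one-hot encoding sketch against a tiny made-up vocabulary (the words and helper name are placeholders, not part of any library):

```python
import numpy as np

# Hypothetical toy vocabulary; a real one would have 30,000-50,000 entries.
vocab = ["a", "and", "harry", "potter", "zulu", "<UNK>", "<EOS>"]
word_to_index = {w: i for i, w in enumerate(vocab)}

def one_hot(word, word_to_index, vocab_size):
    """Return a (vocab_size, 1) one-hot column vector for `word`."""
    x = np.zeros((vocab_size, 1))
    idx = word_to_index.get(word.lower(), word_to_index["<UNK>"])  # fall back to <UNK>
    x[idx] = 1.0
    return x

x = one_hot("Harry", word_to_index, len(vocab))      # 1 at the index of "harry"
x_unk = one_hot("Hermione", word_to_index, len(vocab))  # out of vocabulary -> <UNK>
```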
The RNN model tries to address the shortcomings of the traditional network by adding the activation of the previous time step: it receives the input $x^{\langle t \rangle}$ and the activation of the previous time step $a^{\langle t-1 \rangle}$, then produces an output $\hat y^{\langle t \rangle}$ and an activation $a^{\langle t \rangle}$.
In some research papers it is represented by a more compact diagram, in which the recurrent net is drawn as a loop.
For example, given a sentence "Harry Potter is a boy", we have to predict the output for each word. If we read the sentence from left to right, the activation computed at each word is passed on to the next time step.
As we can observe in the diagram, one weakness of this RNN is that it only uses the previous information, not the following information.
The parameters that govern the connection from the input to the hidden layer ($W_{ax}$), from one hidden state to the next ($W_{aa}$), and from the hidden layer to the output ($W_{ya}$) are shared across all time steps.
An RNN is composed of RNN cells, which can be represented as:
- The RNN cell takes as input $x^{\langle t \rangle}$ and $a^{\langle t-1 \rangle}$, and outputs $a^{\langle t \rangle}$ and $\hat y^{\langle t \rangle}$.
- Notice that in the notation $W_{ax}$, the second index means the matrix multiplies an $x$-like quantity, and the first index means the result is used to compute an $a$-like quantity.
- `tanh` is commonly used for the activation (`relu` is also used).
- For the output, the activation function depends on the type of output, for example `sigmoid` for binary classification.
- It can also be written as $$ a^{\langle t \rangle} = g\left( W_a[a^{\langle t-1 \rangle}, x^{\langle t \rangle}]+b_a \right) $$ where $W_a=[W_{aa} | W_{ax}]$ and $[a^{\langle t-1 \rangle}, x^{\langle t \rangle}]=\begin{bmatrix} a^{\langle t-1 \rangle} \\ x^{\langle t \rangle}\end{bmatrix}$
- The output can be written as $\hat y^{\langle t \rangle} = g(W_{y}a^{\langle t \rangle} + b_y)$
Shapes:
- $x^{\langle t \rangle}$: input data at time step "t", of shape $(n_x, m)$
- $a_{prev}$: hidden state at time step "t-1", of shape $(n_a, m)$
- $W_{ax}$: weight matrix multiplying the input, of shape $(n_a, n_x)$
- $W_{aa}$: weight matrix multiplying the hidden state, of shape $(n_a, n_a)$
- $W_{ya}$: weight matrix relating the hidden state to the output, of shape $(n_y, n_a)$
- $b_a$: bias, of shape $(n_a, 1)$
- $b_y$: bias relating the hidden state to the output, of shape $(n_y, 1)$
- $a_{next}$: next hidden state, of shape $(n_a, m)$
- $\hat y^{\langle t \rangle}_{pred}$: prediction at time step "t", numpy array of shape $(n_y, m)$
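The equations and shapes above translate directly into a small numpy sketch of one forward step of the cell (the function name and the random example values below are just for illustration):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z, axis=0, keepdims=True))
    return e / np.sum(e, axis=0, keepdims=True)

def rnn_cell_forward(x_t, a_prev, Wax, Waa, Wya, ba, by):
    """One time step: x_t (n_x, m), a_prev (n_a, m) -> a_next (n_a, m), y_pred (n_y, m)."""
    a_next = np.tanh(Waa @ a_prev + Wax @ x_t + ba)   # same as Wa @ [a_prev; x_t] + ba with Wa = [Waa | Wax]
    y_pred = softmax(Wya @ a_next + by)               # output activation depends on the task
    return a_next, y_pred

# Example with random parameters
n_x, n_a, n_y, m = 3, 5, 2, 10
rng = np.random.default_rng(1)
a_next, y_pred = rnn_cell_forward(
    rng.standard_normal((n_x, m)), np.zeros((n_a, m)),
    rng.standard_normal((n_a, n_x)), rng.standard_normal((n_a, n_a)),
    rng.standard_normal((n_y, n_a)), np.zeros((n_a, 1)), np.zeros((n_y, 1)),
)
```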
Cross-entropy loss function for each step: $$ \mathcal{L}^{\langle t \rangle}(\hat{y}^{\langle t \rangle}, y^{\langle t \rangle}) = -y^{\langle t \rangle} \log \hat{y}^{\langle t \rangle} - (1-y^{\langle t \rangle}) \log(1-\hat{y}^{\langle t \rangle}) $$ Overall loss: $$ J = \mathcal{L}(\hat y, y) = \sum^{T_y}_{t=1} \mathcal{L}^{\langle t \rangle}(\hat y^{\langle t \rangle}, y^{\langle t \rangle}) $$ Computing gradients backward through the unrolled time steps is also called backpropagation through time.
For a simple unit,
$$
a^{\langle t \rangle} = \tanh(W_{aa}a^{\langle t-1 \rangle} + W_{ax}x^{\langle t \rangle} + b_a)
$$
where $W_{aa}$, $W_{ax}$ and $b_a$ are the parameters shared across all time steps.
Until now we have seen examples in which the input and output lengths are the same ($T_x = T_y$), but that is not always the case:
- In sentiment classification, the input may be a sentence but the output is a number between 1 and 5 -> many to one
- In machine translation, the input sentence and the output sentence may not have the same length -> many to many
- In music generation, we can input the first note and it has to generate the output melody -> one to many
From left to right (summary of RNN architectures):
- One to one: for example image classification
- One to many: image captioning, input an image and output a sentence
- Many to one: sentiment classification
- Many to many: machine translation
- Many to many: video classification, where we wish to label each frame of the video
What is language modelling
The basic job of a language model is, given a sequence of words, to predict its probability.
For example, in a speech recognition problem someone says:
- The apple and pair salad
- The apple and pear salad
"pair" and "pear" sound exactly the same, but the language model should give the probability of each of these two sentences based on the rest of the context.
Building the model
Training set: a large corpus of English text.
- Tokenize the sentence: given a sentence, one-hot encode it against your dictionary:
  - "The Egyptian Mau is a breed of cat `<EOS>`" -> $y^{<1>}, y^{<2>}, \dots, y^{<9>}$
  - We add `<EOS>` to indicate the end of the sentence.
  - When a word is not in the dictionary, for example `Mau`, it will be encoded as `<UNK>`, which means unknown.
- Then the model tries to figure out what the next word is. For the sentence "Cats average 15 hours of sleep a day `<EOS>`":
  - At time 1, $a^{<0>}=\vec{0}, x^{<1>}=\vec{0}$; the model predicts the probability of every word in the dictionary: is the first word `a`, `aaron`, `cats`, ...? Since the first word is "cats", Pr(cats) should be the largest.
  - Then it tries to figure out the second word, knowing the first word is `cats`: Pr( _ | "cats")
  - The next one is to predict Pr( _ | "cats average")
  - And so on, until it reaches `<EOS>`.
- The loss function is defined as
$$
\begin{array}{lcl}
\mathcal{L}\left(\hat{y}^{\langle t \rangle}, y^{\langle t \rangle}\right) & = & -\sum_{i} y_{i}^{\langle t \rangle} \log \hat{y}_{i}^{\langle t \rangle}\\
\mathcal{L} & = & \sum_{t} \mathcal{L}^{\langle t \rangle}\left(\hat{y}^{\langle t \rangle}, y^{\langle t \rangle}\right)
\end{array}
$$
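As a small sketch of this loss (the helper names are mine, assuming `y_t` is a one-hot target and `y_hat_t` the softmax output at step t):

```python
import numpy as np

def step_loss(y_hat_t, y_t, eps=1e-12):
    """Cross-entropy at one time step: -sum_i y_i * log(y_hat_i), with y_t one-hot."""
    return -np.sum(y_t * np.log(y_hat_t + eps))

def sequence_loss(y_hat, y):
    """Sum the per-step losses over the whole sequence (lists of (vocab_size, 1) arrays)."""
    return sum(step_loss(y_hat_t, y_t) for y_hat_t, y_t in zip(y_hat, y))
```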
After a sequence model is trained as a language model, to check what the model has learned we can sample novel sequences from it.
Given the model:
- Input $a^{<0>}=\vec{0}$, $x^{<1>}=\vec{0}$. Notice that x's shape is (dict_size, 1) because it will be one-hot encoded against the dictionary; a's shape is (n_a, 1).
- Run forward propagation to get $a^{\langle t \rangle}$ and $\hat y^{\langle t \rangle}$. Note that $\hat y^{\langle t \rangle}_i$ represents the probability that the word (or character) indexed by `i` is the next one.
- Randomly choose the next index according to the probability distribution specified by $\hat y^{\langle t \rangle}$. This means that if $\hat y^{\langle t \rangle}_i=0.16$, you will pick the index `i` with 16% probability. For example, `np.random.seed(0); p = np.array([0.1, 0.0, 0.7, 0.2]); index = np.random.choice([0, 1, 2, 3], p=p.ravel())` picks the `index` according to the distribution $P(index = 0) = 0.1, P(index = 1) = 0.0, P(index = 2) = 0.7, P(index = 3) = 0.2$.
- Use the last predicted word as the input, together with the activation, to compute the next word.
- Keep sampling until we get `<EOS>` or reach a fixed length.
We can reject `<UNK>` samples or keep them in the output.
If it is a character-level model, the output will be characters from [a-z, A-Z, 0-9].
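Putting the sampling steps together, a sketch of the loop (here `forward_step` and the parameter dictionary are hypothetical placeholders standing in for your trained model's forward pass):

```python
import numpy as np

def sample(parameters, forward_step, word_to_index, max_len=50):
    """Sample a sequence of indices until <EOS> or max_len is reached.

    `forward_step(x, a_prev, parameters)` is assumed to return (y_hat, a_next),
    where y_hat is the (vocab_size, 1) distribution over the next token.
    """
    vocab_size = len(word_to_index)
    x = np.zeros((vocab_size, 1))               # x^<1> = 0
    a_prev = np.zeros((parameters["n_a"], 1))   # a^<0> = 0
    indices = []
    eos = word_to_index["<EOS>"]
    idx = -1
    while idx != eos and len(indices) < max_len:
        y_hat, a_prev = forward_step(x, a_prev, parameters)
        idx = np.random.choice(vocab_size, p=y_hat.ravel())  # sample from the distribution
        indices.append(idx)
        x = np.zeros((vocab_size, 1))           # feed the sampled token back as the next input
        x[idx] = 1.0
    return indices
```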
In standard deep neural networks we have already seen the vanishing and exploding gradient problems. An RNN is effectively a very deep network: an RNN that processes a sequence with 10,000 time steps behaves like a 10,000-layer network, which is very hard to optimize.
When the dependencies it needs are close together, it is easier for an RNN to learn them; for example, in "the clouds are in the sky", "clouds" and "sky" are quite close.
However, there are more complicated situations. For example:
- "The cat, which already ate ..., was full"
- "The cats, which already ate ..., were full"
For an RNN it is quite difficult to capture long-term dependencies like that: with a lot of words in the middle, the gradient struggles to propagate back far enough to correct a "was/were" mistake.
To address vanishing gradients one can use the following techniques:
- Weight initialization
- Use LSTM/GRU networks
Exploding gradient
RNNs also have the exploding gradient problem, which can be catastrophic. Exploding gradients are easy to spot: your weight values become `nan`. It can be resolved by gradient clipping: rescale the gradients so the gradient vector is not too big.
There are different ways to clip gradients; one of them is to clip each element to [-N, N].
We can use `numpy.clip`: `np.clip(a, a_min, a_max, out)` (for in-place clipping, specify `out`).
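A minimal sketch of element-wise clipping applied to a dictionary of gradients (the gradient names below are illustrative):

```python
import numpy as np

def clip_gradients(gradients, max_value):
    """Clip every gradient array element-wise to [-max_value, max_value], in place."""
    for grad in gradients.values():
        np.clip(grad, -max_value, max_value, out=grad)
    return gradients

# Example
grads = {"dWaa": np.array([[15.0, -3.0], [0.5, -20.0]]), "dba": np.array([[7.0], [-0.1]])}
clip_gradients(grads, max_value=10.0)   # values now lie within [-10, 10]
```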
The GRU (Gated Recurrent Unit) is a modification of the RNN hidden layer that captures long-range connections better and helps with the vanishing gradient problem.
The GRU has a memory cell, whose role is to memorize information; in the cat example, it memorizes whether the cat is singular or plural.
- At time t, the memory cell will have value $c^{\langle t \rangle}$.
- The GRU unit outputs an activation value equal to $c^{\langle t \rangle}$: $a^{\langle t \rangle} = c^{\langle t \rangle}$, so they have the same shape.
- At each time step, we compute a candidate $\tilde c^{\langle t \rangle}$ that may replace $c^{\langle t \rangle}$.
- The gate $\Gamma_u$ ranges between 0 and 1, where u means update. The gate decides whether we update $c^{\langle t \rangle}$.
- Then we replace $c^{\langle t \rangle}$ depending on the value of the gate: if the gate is 0, keep the previous value, otherwise replace it by $\tilde c^{\langle t \rangle}$: $$ c^{\langle t \rangle} = \Gamma_u \times \tilde c^{\langle t \rangle} + (1 - \Gamma_u) \times c^{\langle t-1 \rangle} $$
Suppose that c is 0 or 1 depending on whether the subject of the sentence is singular or plural: if it is singular it is set to 1, else 0. The GRU unit would memorize the value of $c^{\langle t \rangle}$ all the way until it is needed (at "was/were" in the cat example). As $\Gamma_u$ can be very close to 0, the value of $c^{\langle t \rangle}$ is maintained almost unchanged across many time steps, which is what helps with the vanishing gradient problem.
Full GRU:
$$
\begin{array}{ccl}
\tilde c^{\langle t \rangle} & = & \tanh(W_c[\Gamma_r \times c^{\langle t-1 \rangle}, x^{\langle t \rangle}] + b_c) \\
\Gamma_u & = & \sigma(W_u[c^{\langle t-1 \rangle}, x^{\langle t \rangle}] + b_u)\\
\Gamma_r & = & \sigma(W_r[c^{\langle t-1 \rangle}, x^{\langle t \rangle}] + b_r)\\
c^{\langle t \rangle} & = & \Gamma_u \times \tilde c^{\langle t \rangle} + (1 - \Gamma_u) \times c^{\langle t-1 \rangle}
\end{array}
$$
- where $\Gamma_r$ represents the relevance gate: how relevant $c^{\langle t-1 \rangle}$ is for computing the candidate $\tilde c^{\langle t \rangle}$.
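A numpy sketch of one full GRU step following these equations (a sketch under the notation of these notes, not a particular framework's implementation; the gate matrices $W_c$, $W_u$, $W_r$ have shape $(n_a, n_a + n_x)$):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_cell_forward(x_t, c_prev, Wc, Wu, Wr, bc, bu, br):
    """One GRU step: c_prev (n_a, m), x_t (n_x, m) -> c_next (= a_next)."""
    concat_ux = np.concatenate([c_prev, x_t], axis=0)           # [c^<t-1>, x^<t>]
    gamma_u = sigmoid(Wu @ concat_ux + bu)                      # update gate
    gamma_r = sigmoid(Wr @ concat_ux + br)                      # relevance gate
    concat_cx = np.concatenate([gamma_r * c_prev, x_t], axis=0) # [Gamma_r * c^<t-1>, x^<t>]
    c_tilde = np.tanh(Wc @ concat_cx + bc)                      # candidate memory cell
    c_next = gamma_u * c_tilde + (1.0 - gamma_u) * c_prev       # gated update; a^<t> = c^<t>
    return c_next
```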
LSTM is a more powerful and general version of GRU.
- Decide what information we're going to throw away from the cell state; this decision is made by the forget gate. $\Gamma_f$ is a sigmoid, so it ranges between 0 and 1. If $\Gamma_f^{\langle t \rangle} \approx 0$ then the LSTM should forget that piece of information (e.g. the singular subject) in the corresponding component of $c^{\langle t-1 \rangle}$; if $\Gamma_f^{\langle t \rangle} \approx 1$ then it keeps the information.
  $$ \Gamma_f^{\langle t \rangle} = \sigma(W_f[a^{\langle t-1 \rangle}, x^{\langle t \rangle}] + b_f) $$
  For example, in a language model trying to predict the next word based on the previous ones, the cell state might include whether the present subject is singular or plural. Once a new subject is found (for example a plural subject), we want to forget the old subject.
- The next step is to decide what new information we are going to store in the cell state. This has 2 parts:
  - Update gate: a sigmoid layer which decides which values we will update. $$ \Gamma_u^{\langle t \rangle} = \sigma(W_u[a^{\langle t-1 \rangle}, x^{\langle t \rangle}] + b_u) $$ Once we forget that the subject being discussed is singular, we need a way to update the state to reflect that the new subject is now plural.
  - Updating the cell: the candidate to replace the previous cell state is $\tilde{c}^{\langle t \rangle}$. When the update gate is open (update gate = 1), the candidate enters the new cell state $c^{\langle t \rangle}$ through an element-wise multiplication with $\Gamma_u^{\langle t \rangle}$: $$ \tilde{c}^{\langle t \rangle} = \tanh(W_c[a^{\langle t-1 \rangle}, x^{\langle t \rangle}] + b_c) $$ The new $c^{\langle t \rangle}$ is the sum of what wasn't forgotten plus what is updated: $$ c^{\langle t \rangle} = \Gamma_f^{\langle t \rangle} * c^{\langle t-1 \rangle} + \Gamma_u^{\langle t \rangle} * \tilde{c}^{\langle t \rangle} $$
- After that, we need to decide what we are going to output. This output will be based on our cell state, but will be a filtered version.
  - First, we run a sigmoid layer which decides what parts of the cell state we're going to output; this is the output gate: $$ \Gamma_o^{\langle t \rangle} = \sigma(W_o[a^{\langle t-1 \rangle}, x^{\langle t \rangle}] + b_o) $$
  - Then, we put the cell state through tanh (to push the values to be between −1 and 1) and multiply it by the output gate, so that we only output the parts we decided to: $$ a^{\langle t \rangle} = \Gamma_o^{\langle t \rangle} * \tanh(c^{\langle t \rangle}) $$
- Finally, the prediction: $$ \hat y^{\langle t \rangle} = \sigma(W_y a^{\langle t \rangle} + b_y) $$
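Collecting the LSTM equations above into one numpy sketch (again under these notes' notation, not a library implementation; each gate matrix has shape $(n_a, n_a + n_x)$ and the output activation depends on the task):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_cell_forward(x_t, a_prev, c_prev, Wf, Wu, Wc, Wo, Wy, bf, bu, bc, bo, by):
    """One LSTM step: x_t (n_x, m), a_prev/c_prev (n_a, m) -> a_next, c_next, y_pred."""
    concat = np.concatenate([a_prev, x_t], axis=0)       # [a^<t-1>, x^<t>]
    gamma_f = sigmoid(Wf @ concat + bf)                  # forget gate
    gamma_u = sigmoid(Wu @ concat + bu)                  # update gate
    c_tilde = np.tanh(Wc @ concat + bc)                  # candidate cell state
    c_next = gamma_f * c_prev + gamma_u * c_tilde        # new cell state
    gamma_o = sigmoid(Wo @ concat + bo)                  # output gate
    a_next = gamma_o * np.tanh(c_next)                   # new hidden state
    y_pred = sigmoid(Wy @ a_next + by)                   # prediction (activation depends on task)
    return a_next, c_next, y_pred
```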
Shapes
When we use an LSTM in an RNN:
- x's shape = $(m, T_x, n_{values})$, where $n_{values}$ is the number of unique values in the dataset
- $a^{<0>}=\vec{0}$; a's shape is $(n_a, 1)$
- $c^{<0>}$'s shape = $a^{<0>}$'s shape
Some of the gate derivatives used in the LSTM backward pass (backpropagation through time):
$$d\Gamma_o^{\langle t \rangle} = da_{next} * \tanh(c_{next}) * \Gamma_o^{\langle t \rangle}(1-\Gamma_o^{\langle t \rangle})$$
$$d\Gamma_u^{\langle t \rangle} = \left(dc_{next} * \tilde c^{\langle t \rangle} + \Gamma_o^{\langle t \rangle} (1-\tanh(c_{next})^2) * \tilde c^{\langle t \rangle} * da_{next}\right) * \Gamma_u^{\langle t \rangle}*(1-\Gamma_u^{\langle t \rangle})$$
$$d\Gamma_f^{\langle t \rangle} = \left(dc_{next} * c_{prev} + \Gamma_o^{\langle t \rangle} (1-\tanh(c_{next})^2) * c_{prev} * da_{next}\right) * \Gamma_f^{\langle t \rangle}*(1-\Gamma_f^{\langle t \rangle})$$
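These derivatives translate directly into numpy; a sketch assuming the forward-pass values have been cached in a dictionary (the cache keys are my own naming):

```python
import numpy as np

def lstm_gate_gradients(da_next, dc_next, cache):
    """Compute dGamma_o, dGamma_u, dGamma_f from the formulas above."""
    gamma_o, gamma_u, gamma_f = cache["gamma_o"], cache["gamma_u"], cache["gamma_f"]
    c_tilde, c_next, c_prev = cache["c_tilde"], cache["c_next"], cache["c_prev"]

    dgamma_o = da_next * np.tanh(c_next) * gamma_o * (1 - gamma_o)
    dgamma_u = (dc_next * c_tilde
                + gamma_o * (1 - np.tanh(c_next) ** 2) * c_tilde * da_next) * gamma_u * (1 - gamma_u)
    dgamma_f = (dc_next * c_prev
                + gamma_o * (1 - np.tanh(c_next) ** 2) * c_prev * da_next) * gamma_f * (1 - gamma_f)
    return dgamma_o, dgamma_u, dgamma_f
```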
As previously mentioned, an RNN only learns from previous information; a bidirectional RNN (BRNN) is designed to also learn from the following information.
For example:
- He said, "Teddy bears are on sale"
- He said, "Teddy Roosevelt was a great president"
In this case, knowing only the previous information we can't tell whether "Teddy" refers to the bear or the president, but with the following information there is no doubt.
Note that the BRNN is an acyclic graph; it learns from both directions.
For a lot of NLP or text processing problems, a BRNN with LSTM units is commonly used.
The disadvantage of the BRNN is that you need the entire sequence before you can process it.
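For illustration, a hypothetical Keras sketch of a BRNN with LSTM units for a per-token labelling task (the layer sizes, vocabulary size, and sequence length below are placeholders):

```python
import tensorflow as tf

vocab_size, n_a = 10000, 128   # placeholder sizes

model = tf.keras.Sequential([
    tf.keras.layers.Embedding(vocab_size, 64),
    # The Bidirectional wrapper runs the LSTM both forward and backward over the sequence
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(n_a, return_sequences=True)),
    tf.keras.layers.Dense(1, activation="sigmoid"),  # e.g. name / not-a-name for each token
])
model.compile(optimizer="adam", loss="binary_crossentropy")
```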
In a lot of applications a one-layer RNN will solve your problem, but in some cases it is useful to use deep RNNs.
A 3-layer RNN looks like:
3 layers is already a lot for an RNN because of the time dimension.
Further reading