LSTM model equations #9
Michael, you read those diagrams closely :) Yes, indeed, it appears that referencing Alex Graves (2013) is not as precise as I'd hoped. In any case, it is informative to see how Andrej Karpathy describes LSTMs (in JavaScript) in https://github.com/karpathy/recurrentjs, and how Zaremba describes LSTMs (in Lua) in https://github.com/wojciechz/learning_to_execute. To be fair, the most common implementation is the one present here, but the one you describe is potentially better. If you cross-validate one against the other, I'd be very interested in hearing whether there's a major difference.
Michael, quick follow-up. I ran a couple of models with the two different versions. Using the version you describe, most models hit a local minimum much sooner in training, and in most cases it takes two to three times as long to exit it, while the version implemented here (where the memory cell does not feed back into the gates) reaches a lower local minimum and exits it more quickly. There may be some coupling with the type of gradient descent used (Adadelta vs. Adam vs. RMSProp, or something else). If you find a way of training them easily, or some combination that works well, I'd be curious to hear about it; for now it appears that the two formulations cannot be used interchangeably without understanding where the optimisation trouble comes from.
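In case it helps anyone reproduce the comparison, here is a minimal NumPy sketch of a single LSTM step with the peephole connections behind a flag. All parameter names are mine, not taken from this repo or from the implementations linked above:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, p, peephole=False):
    # One LSTM step. `p` is a dict of parameters; `peephole` toggles
    # whether the gates also see the memory cell, as in Graves (2013).
    i = p["W_xi"] @ x_t + p["W_hi"] @ h_prev + p["b_i"]
    f = p["W_xf"] @ x_t + p["W_hf"] @ h_prev + p["b_f"]
    if peephole:
        # In the paper the cell-to-gate matrices are diagonal, so an
        # elementwise product with a weight vector is equivalent.
        i = i + p["w_ci"] * c_prev
        f = f + p["w_cf"] * c_prev
    i, f = sigmoid(i), sigmoid(f)
    g = np.tanh(p["W_xc"] @ x_t + p["W_hc"] @ h_prev + p["b_c"])
    c = f * c_prev + i * g
    o = p["W_xo"] @ x_t + p["W_ho"] @ h_prev + p["b_o"]
    if peephole:
        # The output gate sees the *current* cell state c_t in the paper.
        o = o + p["w_co"] * c
    o = sigmoid(o)
    h = o * np.tanh(c)
    return h, c
```

With `peephole=False` this matches the formulation implemented here; with `peephole=True` it matches the Graves (2013) gate equations discussed in this thread, so the same training loop can be run against both variants.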
Thanks for your very detailed reply! I'll let you know if I find anything else useful related to this.
You might be interested in a more thorough discussion in last week's arXiv paper.
That's a very useful reference. Thanks!
The code says it implements the version of the LSTM from Graves et al. (2013), which I assume is http://www.cs.toronto.edu/~graves/icassp_2013.pdf or http://www.cs.toronto.edu/~graves/asru_2013.pdf. However, the LSTM equations in those papers feed both the output values and the memory cell values from the previous time step into the gates.
E.g., in equation 3 of http://www.cs.toronto.edu/~graves/icassp_2013.pdf:
i_t = σ(W_xi x_t + W_hi h_{t−1} + W_ci c_{t−1} + b_i)
However, it looks like the code is doing the following:
i_t = σ(W_xi x_t + W_hi h_{t−1} + b_i)
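To make the difference concrete, here is a rough NumPy sketch of the two input-gate computations (function and parameter names are mine, not from the repo):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Graves (2013), eq. 3: the gate also sees the previous cell state.
# W_ci is diagonal in the paper, so an elementwise product with a
# peephole weight vector w_ci is equivalent.
def input_gate_peephole(x_t, h_prev, c_prev, W_xi, W_hi, w_ci, b_i):
    return sigmoid(W_xi @ x_t + W_hi @ h_prev + w_ci * c_prev + b_i)

# What the code here appears to compute: no cell-state term.
def input_gate_no_peephole(x_t, h_prev, W_xi, W_hi, b_i):
    return sigmoid(W_xi @ x_t + W_hi @ h_prev + b_i)
```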
Am I missing something here? Is there another LSTM paper this is based on?
I doubt there's much of a practical difference between these two formulations, but it would be good if the documentation were accurate. Sorry if I'm misunderstanding something here (and sorry for the messy equations above).