Adding a positive bias to the LSTM forget gate #840
base: master
Conversation
According to "Rafal Jozefowicz, Wojciech Zaremba, Ilya Sutskever, An Empirical Exploration of Recurrent Network Architectures, JMLR 2015", the LSTM with a large forget-gate bias outperformed all available recurrent units in almost all tasks: "But most importantly, we determined that adding a positive bias to the forget gate greatly improves the performance of the LSTM. Given that this technique is the simplest to implement, we recommend it for every LSTM implementation." This pull request contains an implementation of the technique.
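For illustration, a minimal, framework-agnostic sketch of the technique (hypothetical helper names, not the actual Blocks code): a constant positive offset is added to the forget gate's pre-activation, so the gate starts out mostly open and the cell state is preserved early in training.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h_prev, c_prev, W, U, b, forget_bias=1.0):
    """One LSTM step with a positive bias added to the forget gate.

    Gates are assumed to be concatenated in the order
    [input, forget, cell candidate, output].
    """
    z = W @ x + U @ h_prev + b        # pre-activations for all four gates
    i, f, g, o = np.split(z, 4)
    i = sigmoid(i)
    f = sigmoid(f + forget_bias)      # extra positive forget-gate bias
    g = np.tanh(g)
    o = sigmoid(o)
    c = f * c_prev + i * g            # updated cell state
    h = o * np.tanh(c)                # new hidden state
    return h, c
```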
…LSTM':D504: Docstring exceeds 75 characters" was fixed.
…tinuation line under-indented for visual indent" was fixed.
```diff
@@ -354,6 +354,9 @@ class LSTM(BaseRecurrent, Initializable):
        networks*, arXiv preprint arXiv:1308.0850 (2013).
     .. [HS97] Sepp Hochreiter, and Jürgen Schmidhuber, *Long Short-Term
        Memory*, Neural Computation 9(8) (1997), pp. 1735-1780.
+    .. [Jozefowicz15] Jozefowicz R., Zaremba W. and Sutskever I., *An
+       Empirical Exploration of Recurrent Network Architectures*, Journal
+       of Machine Learning Research 37 (2015).
```
Sphinx doesn't like citations that are not referenced from anywhere. And actually, I see no reason to mention this paper; we cannot cite everyone who has worked with LSTMs.
Can you either add some documentation that references this paper, or delete the citation?
Thank you! Have you tested this implementation? Does it outperform the standard one?
… gate was fixed. Two more LSTM tests were added: with and without the bias.
… continuation line under-indented for visual indent" was fixed.
Thank you, Dmitriy, for the comprehensive review of my code. I have fixed all the issues you mentioned. I haven't tested this implementation of the LSTM yet, but I think I will use it. As soon as I have some results, I will let you know.
If this is really helpful, we should make it mandatory, I think. Just as currently all the initial states of recurrent networks are trainable, which is not a completely standard technique, but at least never hurts.
So, what are the next steps for me for this pull request? To my mind, if we set "has_bias" to False, the LSTM will work as previously (without the forget-gate bias) and you can merge this pull request. And as soon as we have strong (practical) evidence that it works better with the bias, we can set "has_bias" to True and the new functionality will be in the framework. What do you think?
My opinion is that we can just add these trainable biases without … But it would really help if you tried out how it works in practice.
Any results?
Sorry, Dzmitry, but due to being busy I didn't have enough time to fully test the proposed approach. On the task I tried, it didn't show a large improvement over the traditional LSTM. However, that doesn't mean the LSTM with the bias doesn't outperform the traditional one; most probably, my experiment just wasn't good and reliable enough. To really prove something, we need experiments with real metrics. Also, if you really want to merge these changes, your valuable comment about …
Sorry, closed by mistake... reopening.
No rush; if you are still planning to test this change, I will not close it.
|
This is a bit of a random comment, but the biases introduced here will be redundant; the LSTM takes its biases from the inputs. However, the improvement reported in the paper is due to the initial bias. We don't need a new trainable parameter to implement that. All we need is an extra term in the expression for the forget gate:

```python
forget_gate = tensor.nnet.sigmoid(slice_last(activation, 1) +
                                  cells * self.W_cell_to_forget +
                                  self.initial_forget_gate_bias)
```

where …
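Roughly speaking, with weights near zero at the start of training the forget gate's pre-activation is close to 0, so the gate outputs sigmoid(0) = 0.5 and half of the cell state decays at every step; a constant offset of 1 raises this to sigmoid(1) ≈ 0.73, so more of the cell state (and its gradient) survives across time steps.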
Thanks for this insight, Tim! Do you have any experience with this trick, …
|
I use it and it seems to work, but I don't have a simple experiment to back it up and I may be confounding it with the effect of identity initialization for the h-to-h matrix. It's worth noting that the originator of forget gates used positive initialization for the bias ("Learning to Forget: Continual Prediction with LSTM", ftp://ftp.idsia.ch/pub/felix/GersFA-NIPS.ps).
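A minimal sketch of that kind of initialization (hypothetical helper, assuming the biases for the four gates are concatenated in [input, forget, cell, output] order; not the actual Blocks code):

```python
import numpy as np

def init_lstm_biases(hidden_dim, forget_bias=1.0):
    # All gate biases start at zero except the forget-gate slice,
    # which starts at a positive value so the gate is open initially.
    b = np.zeros(4 * hidden_dim)
    b[hidden_dim:2 * hidden_dim] = forget_bias
    return b
```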
According to the "Rafal Jozefowicz, Wojciech Zaremba, Ilya Sutskever, An Empirical Exploration of Recurrent Network Architectures, JMLR 2015" the LSTM with large forget bias outperformed all available recurrent units in almost all tasks: "But most importantly, we determined that adding a positive bias to the forget gate greatly improves the performance of the LSTM. Given that this technique the simplest to implement, we recommend it for every LSTM implementation". The pull request contains implementation of the technique.