
Adding a positive bias to the LSTM forget gate #840

Open

wants to merge 5 commits into base: master

Conversation

ablavatski

According to "Rafal Jozefowicz, Wojciech Zaremba, Ilya Sutskever, An Empirical Exploration of Recurrent Network Architectures, JMLR 2015", the LSTM with a large forget-gate bias outperformed all the other recurrent units on almost all tasks: "But most importantly, we determined that adding a positive bias to the forget gate greatly improves the performance of the LSTM. Given that this technique is the simplest to implement, we recommend it for every LSTM implementation". This pull request contains an implementation of the technique.
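The technique can be sketched in plain NumPy, independent of any framework. This is a hypothetical minimal LSTM step (the gate order `[input, forget, cell, output]` and the names `lstm_step`, `forget_bias` are assumptions for illustration, not the API of this PR):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h, c, W, b, forget_bias=1.0):
    # Concatenated input/recurrent projection; gate order assumed to be
    # [input, forget, cell, output], as in many LSTM implementations.
    d = h.shape[0]
    z = np.concatenate([x, h]) @ W + b
    i = sigmoid(z[:d])
    f = sigmoid(z[d:2 * d] + forget_bias)  # the positive forget-gate bias
    g = np.tanh(z[2 * d:3 * d])
    o = sigmoid(z[3 * d:])
    c_new = f * c + i * g
    h_new = o * np.tanh(c_new)
    return h_new, c_new

# With zero weights and forget_bias=1, the forget gate starts at
# sigmoid(1), roughly 0.73 instead of 0.5, so the cell state is
# mostly preserved at the start of training.
d = 2
h, c = np.zeros(d), np.ones(d)
x = np.zeros(1)
W, b = np.zeros((1 + d, 4 * d)), np.zeros(4 * d)
h_new, c_new = lstm_step(x, h, c, W, b, forget_bias=1.0)
print(c_new)
```

The point of the bias is visible in the last lines: an untrained forget gate already lets most of the cell state through instead of halving it every step.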

Artsiom added 3 commits September 28, 2015 13:08
… An Empirical Exploration of Recurrent Network Architectures, JMLR 2015" the LSTM with large forget bias outperformed all available recurrent units in almost all tasks: "But most importantly, we determined that adding a positive bias to the forget gate greatly improves the performance of the LSTM. Given that this technique the simplest to implement, we recommend it for every LSTM implementation". The pull request contains implementation of the technique.
…LSTM':D504: Docstring exceeds 75 characters" was fixed.
…tinuation line under-indented for visual indent" was fixed.
@@ -354,6 +354,9 @@ class LSTM(BaseRecurrent, Initializable):
networks*, arXiv preprint arXiv:1308.0850 (2013).
.. [HS97] Sepp Hochreiter, and Jürgen Schmidhuber, *Long Short-Term
Memory*, Neural Computation 9(8) (1997), pp. 1735-1780.
.. [Jozefowicz15] Jozefowicz R., Zaremba W. and Sutskever I., *An
Empirical Exploration of Recurrent Network Architectures*, Journal
of Machine Learning Research 37 (2015).
Contributor


Sphinx doesn't like citations that are not referenced from anywhere. And actually, I see no reason to mention this paper; we cannot cite everyone who has worked with LSTMs.

Could you either add some documentation concerning this paper or delete the reference?

@dmitriy-serdyuk
Contributor

Thank you! Have you tested this implementation? Does it outperform the standard one?

Artsiom added 2 commits September 29, 2015 09:47
… gate was fixed. 2 more LSTM tests were added: for the cases of using and not using the bias.
… continuation line under-indented for visual indent" was fixed.
@ablavatski
Author

Thank you, Dmitriy, for the comprehensive review of my code. I fixed all the issues you mentioned. I haven't tested this implementation of the LSTM yet, but I think I will use it. As soon as I have some results, I will let you know.

@rizar
Contributor

rizar commented Sep 29, 2015

If this is really helpful, we should make it mandatory, I think. Just as all the initial states of recurrent networks are currently trainable: not a completely standard technique, but one that at least never hurts.

@ablavatski
Author

So, what are the next steps for me on this pull request? To my mind, if we set "has_bias" to False, the LSTM will work as before (without a bias on the forget gate), and you can merge this pull request. And as soon as we have strong (practical) evidence that it works better with the bias, we can set "has_bias" to True and the new functionality will be in the framework. What do you think?

@rizar
Contributor

rizar commented Sep 29, 2015

My opinion is that we can just add these trainable biases without a use_bias option. However, I have a bad feeling about using the biases_init initialization scheme for that, because what we want is to set the biases of the forget gates specifically, not of, e.g., the reset gates.

But it would really help if you try how it works in practice.
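The concern above is that a blanket initialization scheme touches every gate's bias, while only the forget-gate slice should get the positive offset. One way to sketch this (gate order `[input, forget, cell, output]` and the name `init_gate_biases` are assumptions; `generic_init` stands in for a scheme like biases_init):

```python
import numpy as np

def init_gate_biases(dim, generic_init, forget_bias=1.0):
    # Apply the generic scheme to the whole concatenated bias vector,
    # then overwrite only the forget-gate slice with the positive
    # constant. Assumed gate order: [input, forget, cell, output].
    b = generic_init(4 * dim)
    b[dim:2 * dim] = forget_bias
    return b

b = init_gate_biases(2, np.zeros, forget_bias=1.5)
print(b.tolist())  # [0.0, 0.0, 1.5, 1.5, 0.0, 0.0, 0.0, 0.0]
```

This keeps whatever scheme the user configured for the other gates while still guaranteeing the positive forget-gate offset.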

@rizar
Contributor

rizar commented Oct 26, 2015

Any results?

@ablavatski
Author

Sorry, Dzmitry, but being busy, I didn't have enough time to fully test the proposed approach. On the task I tried, it didn't show a large improvement over the traditional LSTM. However, that doesn't mean the LSTM with the bias doesn't outperform the traditional one; most probably, my experiment simply wasn't good and reliable enough. To really prove something, we need experiments with real metrics. Also, if you really want to merge these changes, your valuable comment about biases_init should be addressed first. As soon as I have some free time, I will try to check the two versions on at least some examples, and we will see. I really apologize.

@ablavatski ablavatski closed this Oct 28, 2015
@ablavatski ablavatski reopened this Oct 28, 2015
@ablavatski
Author

Sorry, closed by mistake. Reopening.

@rizar
Contributor

rizar commented Oct 28, 2015

No rush, if you are still planning to test this change, I will not close
the PR.


@cooijmanstim

This is a bit of a random comment, but the biases introduced here will be redundant; the LSTM takes its biases from the inputs. However, the improvement reported in the paper is due to the initial bias value. We don't need a new trainable parameter to implement that. All we need is an extra term in the expression for the forget gate:

forget_gate = tensor.nnet.sigmoid(slice_last(activation, 1) +
                                  cells * self.W_cell_to_forget +
                                  self.initial_forget_gate_bias)

where self.initial_forget_gate_bias is a constant that can be specified by the user in the constructor (or lazily assigned).
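The effect of that constant term can be checked numerically: around a zero pre-activation, a positive offset moves the gate from 0.5 toward 1, so the cell state is mostly retained early in training. A minimal check in plain NumPy (nothing framework-specific is assumed here):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# At initialization the pre-activation is roughly zero, so without the
# offset the forget gate sits at 0.5 and halves the cell state every
# step; with the offset it starts close to 1.
for bias in (0.0, 1.0, 2.0):
    print(bias, round(sigmoid(0.0 + bias), 3))
# 0.0 -> 0.5, 1.0 -> 0.731, 2.0 -> 0.881
```

Since the constant is folded into the pre-sigmoid sum, the trainable input bias can still learn to cancel it later if that turns out to be optimal.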

@rizar
Contributor

rizar commented Jan 8, 2016

Thanks for this insight, Tim! Do you have any experience with this trick? Does it actually help?


@cooijmanstim

I use it and it seems to work, but I don't have a simple experiment to back it up and I may be confounding it with the effect of identity initialization for the h-to-h matrix. It's worth noting that the originator of forget gates used positive initialization for the bias ("Learning to Forget: Continual Prediction with LSTM", ftp://ftp.idsia.ch/pub/felix/GersFA-NIPS.ps ).
