
Adding a positive bias to the LSTM forget gate #840

Open

wants to merge 5 commits into base: master

Conversation

ablavatski

According to "Rafal Jozefowicz, Wojciech Zaremba, Ilya Sutskever, An Empirical Exploration of Recurrent Network Architectures, JMLR 2015", the LSTM with a large forget-gate bias outperformed all the other recurrent units on almost all tasks: "But most importantly, we determined that adding a positive bias to the forget gate greatly improves the performance of the LSTM. Given that this technique is the simplest to implement, we recommend it for every LSTM implementation". This pull request contains an implementation of the technique.
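The technique can be sketched in plain NumPy, independent of any framework. This is a hypothetical minimal LSTM step (the gate order `[input, forget, cell, output]` and the names `lstm_step`, `forget_bias` are assumptions for illustration, not the API of this PR):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h, c, W, b, forget_bias=1.0):
    # Concatenated input/recurrent projection; gate order assumed to be
    # [input, forget, cell, output], as in many LSTM implementations.
    d = h.shape[0]
    z = np.concatenate([x, h]) @ W + b
    i = sigmoid(z[:d])
    f = sigmoid(z[d:2 * d] + forget_bias)  # the positive forget-gate bias
    g = np.tanh(z[2 * d:3 * d])
    o = sigmoid(z[3 * d:])
    c_new = f * c + i * g
    h_new = o * np.tanh(c_new)
    return h_new, c_new

# With zero weights and forget_bias=1, the forget gate starts at
# sigmoid(1), roughly 0.73 instead of 0.5, so the cell state is
# mostly preserved at the start of training.
d = 2
h, c = np.zeros(d), np.ones(d)
x = np.zeros(1)
W, b = np.zeros((1 + d, 4 * d)), np.zeros(4 * d)
h_new, c_new = lstm_step(x, h, c, W, b, forget_bias=1.0)
print(c_new)
```

The point of the bias is visible in the last lines: an untrained forget gate already lets most of the cell state through instead of halving it every step.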

Artsiom added 3 commits September 28, 2015 13:08
… An Empirical Exploration of Recurrent Network Architectures, JMLR 2015" the LSTM with large forget bias outperformed all available recurrent units in almost all tasks: "But most importantly, we determined that adding a positive bias to the forget gate greatly improves the performance of the LSTM. Given that this technique the simplest to implement, we recommend it for every LSTM implementation". The pull request contains implementation of the technique.
…LSTM':D504: Docstring exceeds 75 characters" was fixed.
…tinuation line under-indented for visual indent" was fixed.
@@ -354,6 +354,9 @@ class LSTM(BaseRecurrent, Initializable):
networks*, arXiv preprint arXiv:1308.0850 (2013).
.. [HS97] Sepp Hochreiter, and Jürgen Schmidhuber, *Long Short-Term
Memory*, Neural Computation 9(8) (1997), pp. 1735-1780.
.. [Jozefowicz15] Jozefowicz R., Zaremba W. and Sutskever I., *An
Empirical Exploration of Recurrent Network Architectures*, Journal
of Machine Learning Research 37 (2015).
Contributor


Sphinx doesn't like citations that are not referenced from anywhere. And actually, I see no reason to mention this paper; we cannot cite everyone who has worked with LSTMs.

Could you either add some documentation concerning this paper or delete the reference?

@dmitriy-serdyuk
Contributor

Thank you! Have you tested this implementation? Does it outperform the standard one?

Artsiom added 2 commits September 29, 2015 09:47
… gate was fixed. 2 more LSTM tests were added: for the cases of using and not using the bias.
… continuation line under-indented for visual indent" was fixed.
@ablavatski
Author

Thank you, Dmitriy, for the comprehensive review of my code. I fixed all the issues you mentioned. I haven't tested this implementation of the LSTM yet, but I think I will use it. As soon as I have some results, I will let you know.

@rizar
Contributor

rizar commented Sep 29, 2015

If this is really helpful, we should make it mandatory, I think. Just as all the initial states of recurrent networks are currently trainable: not a completely standard technique, but one that at least never hurts.

@ablavatski
Author

So, what are the next steps for me on this pull request? To my mind, if we set "has_bias" to False, the LSTM will work as before (without a bias on the forget gate), and you can merge this pull request. And as soon as we have strong (practical) evidence that it works better with the bias, we can set "has_bias" to True and the new functionality will be in the framework. What do you think?

@rizar
Contributor

rizar commented Sep 29, 2015

My opinion is that we can just add these trainable biases without a use_bias option. However, I have a bad feeling about using the biases_init initialization scheme for that, because what we want is to set the biases of the forget gates specifically, not of, e.g., the reset gates.

But it would really help if you try how it works in practice.
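The concern above is that a blanket initialization scheme touches every gate's bias, while only the forget-gate slice should get the positive offset. One way to sketch this (gate order `[input, forget, cell, output]` and the name `init_gate_biases` are assumptions; `generic_init` stands in for a scheme like biases_init):

```python
import numpy as np

def init_gate_biases(dim, generic_init, forget_bias=1.0):
    # Apply the generic scheme to the whole concatenated bias vector,
    # then overwrite only the forget-gate slice with the positive
    # constant. Assumed gate order: [input, forget, cell, output].
    b = generic_init(4 * dim)
    b[dim:2 * dim] = forget_bias
    return b

b = init_gate_biases(2, np.zeros, forget_bias=1.5)
print(b.tolist())  # [0.0, 0.0, 1.5, 1.5, 0.0, 0.0, 0.0, 0.0]
```

This keeps whatever scheme the user configured for the other gates while still guaranteeing the positive forget-gate offset.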

@rizar
Contributor

rizar commented Oct 26, 2015

Any results?

@ablavatski
Author

Sorry, Dzmitry, but being busy, I didn't have enough time to fully test the proposed approach. On the task I tried, it didn't show a large improvement over the traditional LSTM. However, that doesn't mean the LSTM with the bias doesn't outperform the traditional one; most probably, my experiment simply wasn't good and reliable enough. To really prove something, we need experiments with real metrics. Also, if you really want to merge these changes, your valuable comment about biases_init should be addressed first. As soon as I have some free time, I will try to check the two versions on at least some examples, and we will see. I really apologize.

@ablavatski ablavatski closed this Oct 28, 2015
@ablavatski ablavatski reopened this Oct 28, 2015
@ablavatski
Author

Sorry, closed by mistake. Reopening.

@rizar
Contributor

rizar commented Oct 28, 2015

No rush, if you are still planning to test this change, I will not close
the PR.


@cooijmanstim

This is a bit of a random comment, but the biases introduced here will be redundant; the LSTM takes its biases from the inputs. However, the improvement reported in the paper is due to the initial bias value. We don't need a new trainable parameter to implement that. All we need is an extra term in the expression for the forget gate:

forget_gate = tensor.nnet.sigmoid(slice_last(activation, 1) +
                                  cells * self.W_cell_to_forget +
                                  self.initial_forget_gate_bias)

where self.initial_forget_gate_bias is a constant that can be specified by the user in the constructor (or lazily assigned).
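The effect of that constant term can be checked numerically: around a zero pre-activation, a positive offset moves the gate from 0.5 toward 1, so the cell state is mostly retained early in training. A minimal check in plain NumPy (nothing framework-specific is assumed here):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# At initialization the pre-activation is roughly zero, so without the
# offset the forget gate sits at 0.5 and halves the cell state every
# step; with the offset it starts close to 1.
for bias in (0.0, 1.0, 2.0):
    print(bias, round(sigmoid(0.0 + bias), 3))
# 0.0 -> 0.5, 1.0 -> 0.731, 2.0 -> 0.881
```

Since the constant is folded into the pre-sigmoid sum, the trainable input bias can still learn to cancel it later if that turns out to be optimal.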

@rizar
Contributor

rizar commented Jan 8, 2016

Thanks for this insight, Tim! Do you have any experience with this trick? Does it actually help?


@cooijmanstim

I use it and it seems to work, but I don't have a simple experiment to back it up and I may be confounding it with the effect of identity initialization for the h-to-h matrix. It's worth noting that the originator of forget gates used positive initialization for the bias ("Learning to Forget: Continual Prediction with LSTM", ftp://ftp.idsia.ch/pub/felix/GersFA-NIPS.ps ).
