
Chapter 6.1 The Generative LSTM


11.1 The Generative LSTM

11.1.1 Generative Model

LSTMs can be used as a generative model. Given a large corpus of sequence data, such as text documents, LSTM models can be designed to learn the general structural properties of the corpus, and when given a seed input, can generate new sequences that are representative of the original corpus.

The problem of developing a model to generalize a corpus of text is called language modeling in the field of natural language processing. A language model may work at the word level and learn the probabilistic relationships between words in a document in order to accurately complete a sentence and generate entirely new sentences. At its most challenging, language models work at the character level, learning from sequences of characters, and generating new sequences one character at a time.

The goal of character-level language modeling is to predict the next character in a sequence. -- Generating Text with Recurrent Neural Networks, 2011.

Although more challenging, the added flexibility of a character-level model allows new words to be generated, punctuation to be added, and any other structures that exist in the text data to be reproduced.

... predicting one character at a time is more interesting from the perspective of sequence generation, because it allows the network to invent novel words and strings. -- Generating Sequences With Recurrent Neural Networks, 2013.

Language modeling is by far the most studied application of Generative LSTMs, perhaps because of the use of standard datasets on which model performance can be quantified and compared. The approach has been used to generate text for a suite of interesting language modeling problems, such as:

  • Generating Wikipedia articles (including markup).
  • Generating snippets from great authors like Shakespeare.
  • Generating technical manuscripts (including markup).
  • Generating computer source code.
  • Generating article headlines.

The quality of the results varies; for example, the markup or source code may require manual intervention to render or compile. Nevertheless, the results are impressive. The approach has also been applied to different domains where a large corpus of existing sequence information is available and new sequences can be generated one step at a time, such as:

  • Handwriting generation.
  • Music generation.
  • Speech generation.

11.1.2 Architecture and Implementation

A Generative LSTM is not really an architecture; it is more a change in perspective about what an LSTM predictive model learns and how the model is used. We could conceivably use any LSTM architecture as a generative model. In this case, we will use a simple Vanilla LSTM.
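
As a concrete starting point, the sketch below shows what such a Vanilla LSTM might look like in Keras for character-level prediction; the alphabet size and the number of memory units are illustrative assumptions rather than values prescribed here.

# A minimal Vanilla LSTM for character-level prediction (Keras).
# The vocabulary size and the 128 memory units are assumptions
# chosen for illustration only.
from keras.models import Sequential
from keras.layers import Dense, LSTM

vocab_size = 30   # assumed size of the fixed character alphabet
timesteps = 1     # one-to-one framing: one input time step per sample

model = Sequential()
model.add(LSTM(128, input_shape=(timesteps, vocab_size)))
model.add(Dense(vocab_size, activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam')

The output layer has one node per character in the alphabet, and the softmax activation turns the output into a probability distribution over the next character.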

In the case of a character-level language model, the alphabet of all possible characters is fixed. A one-hot encoding is used both for learning input sequences and for predicting output sequences. A one-to-one model is used where one time step is predicted for each input time step. This means that input sequences may require specialized handling in order to be vectorized or formatted for efficiently training a supervised model. For example, given the sequence:

"hello world"
Listing 11.1: Example of a character sequence.

A dataset of input-output pairs would need to be constructed, such as:

'h' => 'e'
'e' => 'l'
'l' => 'l'
...
Listing 11.2: Example of a character sequence as a one-to-one model.
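
A minimal sketch of this preparation in Python is shown below, assuming standalone Keras; the character-to-integer mapping is an implementation choice made for illustration rather than a fixed recipe.

# Build one-to-one (input character -> output character) pairs from a
# string and one hot encode them for a supervised LSTM.
from keras.utils import to_categorical

text = 'hello world'
chars = sorted(set(text))                         # the fixed alphabet
char_to_int = {c: i for i, c in enumerate(chars)}

# each character is used to predict the character that follows it
pairs = [(text[i], text[i + 1]) for i in range(len(text) - 1)]

# one hot encode and reshape inputs to [samples, time steps, features]
X = to_categorical([char_to_int[a] for a, _ in pairs], num_classes=len(chars))
X = X.reshape(len(pairs), 1, len(chars))
y = to_categorical([char_to_int[b] for _, b in pairs], num_classes=len(chars))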

These pairs could be presented as-is as a dataset of one-time-step samples, which could be quite limiting to the network (e.g. no Backpropagation Through Time, or BPTT, across the sequence). Alternatively, the text could be vectorized into fixed-length input sequences for a many-to-one time step model, such as:

['h', 'e', 'l'] => 'l'
['e', 'l', 'l'] => 'o'
['l', 'l', 'o'] => ' '
...
Listing 11.3: Example of a character sequence as a many-to-one model.
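
A sliding window over the text produces these samples; in the sketch below the window length of 3 simply mirrors the listing above and is otherwise arbitrary.

# Slide a fixed-length window over the text to create many-to-one
# samples: 3 input characters are used to predict 1 output character.
text = 'hello world'
length = 3

sequences = []
for i in range(length, len(text)):
    seq_in = list(text[i - length:i])   # e.g. ['h', 'e', 'l']
    seq_out = text[i]                   # e.g. 'l'
    sequences.append((seq_in, seq_out))

for sample in sequences[:3]:
    print(sample)

Each input window would then be one hot encoded in the same way as before, giving input arrays with the shape [samples, 3, features].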

Alternatively, fixed-length output sequences could be used for a one-to-many time step model:

'h' => ['e', 'l', 'l']
'e' => ['l', 'l', 'o']
'l' => ['l', 'o', ' ']
...
Listing 11.4: Example of a character sequence as a one-to-many model.
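
The one-to-many framing can be prepared with the same kind of loop, sketched here for completeness:

# One input character is used to predict a fixed-length sequence
# of output characters.
text = 'hello world'
out_length = 3

samples = []
for i in range(len(text) - out_length):
    seq_in = text[i]                                  # e.g. 'h'
    seq_out = list(text[i + 1:i + 1 + out_length])    # e.g. ['e', 'l', 'l']
    samples.append((seq_in, seq_out))

for sample in samples[:3]:
    print(sample)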

Some variation on these approaches could also be used. Note that the same vectorized representation would be required when making predictions, meaning that predicted characters would need to be presented as input for subsequent samples. This can be quite clumsy to implement. The internal state of the network may need careful management, perhaps reset at chosen locations in the input sequence (e.g. the end of a paragraph, page, or chapter) rather than at the end of each input sequence.
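
To make that feedback loop concrete, a generation step might look like the sketch below. It assumes the trained one-to-one model and the char_to_int mapping from the earlier sketches, and greedy argmax sampling is just one simple way to pick the next character.

# Generate new characters one at a time, feeding each prediction back
# in as the input for the next step. Assumes the model, chars and
# char_to_int objects defined in the earlier sketches.
from numpy import argmax
from keras.utils import to_categorical

int_to_char = {i: c for c, i in char_to_int.items()}

generated = 'h'                     # seed character
for _ in range(10):
    x = to_categorical([char_to_int[generated[-1]]], num_classes=len(chars))
    x = x.reshape(1, 1, len(chars))
    yhat = model.predict(x, verbose=0)
    generated += int_to_char[argmax(yhat)]
print(generated)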
