Language Modeling with Gated Convolutional Networks #16

flrngel commented Jun 10, 2018

https://arxiv.org/abs/1612.08083

Abstract

  • proposes a gating mechanism (the gated linear unit, GLU)
  • evaluates on WikiText-103 and the Google Billion Word benchmark
  • the proposed model is competitive with strong recurrent models on large-scale language tasks

1. Introduction

  • convolutional networks have a parallelization benefit
    • but cuDNN is not optimized for 1d convolutions yet
  • gated linear units mitigate the vanishing gradient problem
  • compared with PixelCNN (Oord et al., 2016), GLU works better than LSTM-style gating (GTU)

2. Approach

  • convolutional networks have no temporal dependencies between positions, unlike recurrent models, so computation can be parallelized over the sequence
  • recurrent models have unbounded context, but the paper's experiments show that infinite context is not necessary
  • GLU
    h_l(X) = (X * W + b) \otimes \sigma(X * V + c), i.e. the element-wise product of a linear path and a sigmoid gate
  • abstract model
    (figure: overall architecture; word embeddings feed a stack of gated convolutional blocks, followed by an adaptive softmax output)
  • the model uses an adaptive softmax, which assigns higher capacity to very frequent words and lower capacity to rare words (see the sketch after this list)
    • this results in faster computation and lower memory usage
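
A minimal sketch of the adaptive softmax idea using PyTorch's built-in `nn.AdaptiveLogSoftmaxWithLoss` as a stand-in (the paper has its own implementation; the hidden size, vocabulary size, and frequency cutoffs here are illustrative, not the paper's):

```python
# Minimal adaptive softmax sketch; all sizes are illustrative placeholders.
import torch
import torch.nn as nn

hidden_size, vocab_size = 512, 100_000
# Frequent words live in the full-capacity head; rare words fall into
# tail clusters with progressively smaller projections.
adaptive = nn.AdaptiveLogSoftmaxWithLoss(
    in_features=hidden_size,
    n_classes=vocab_size,
    cutoffs=[2_000, 10_000, 50_000],  # bucket boundaries by frequency rank
)

hidden = torch.randn(32, hidden_size)          # 32 positions of hidden states
targets = torch.randint(0, vocab_size, (32,))  # gold next-word ids
out = adaptive(hidden, targets)
print(out.loss)  # average negative log-likelihood
```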

3. Gating Mechanisms

  • the purpose of a gating mechanism is to control what information is propagated through the hierarchy of layers
  • compared with GTU (the LSTM-style gate tanh(X * W + b) \otimes \sigma(X * V + c)), the gradient of GTU gradually vanishes because of the downscaling factors tanh'(X) and \sigma'(X); GLU's linear path carries no downscaling factor
    • this linear path can be thought of as a multiplicative skip connection that helps gradients flow through the layers (a sketch contrasting the two gates follows this list)
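
A minimal PyTorch sketch contrasting the two gates in a causal 1-d convolution; the module name `GatedConv1d` and all sizes are illustrative, and the paper's residual blocks are omitted:

```python
import torch
import torch.nn as nn

class GatedConv1d(nn.Module):
    """One gated convolutional layer; glu=True gives GLU, glu=False gives GTU."""
    def __init__(self, channels, kernel_size, glu=True):
        super().__init__()
        # A single convolution produces both the candidate values and the gates.
        # Padding by k-1 and trimming the right end keeps the convolution
        # causal (no position sees future tokens).
        self.conv = nn.Conv1d(channels, 2 * channels, kernel_size,
                              padding=kernel_size - 1)
        self.glu = glu

    def forward(self, x):                        # x: (batch, channels, time)
        y = self.conv(x)[..., :x.size(-1)]       # trim back to input length
        a, b = y.chunk(2, dim=1)                 # candidate values, gates
        if self.glu:
            return a * torch.sigmoid(b)          # GLU: linear path, no downscaling
        return torch.tanh(a) * torch.sigmoid(b)  # GTU: tanh' downscales gradients

x = torch.randn(8, 64, 100)                      # batch 8, 64 channels, 100 steps
print(GatedConv1d(64, 3)(x).shape)               # torch.Size([8, 64, 100])
```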

4. Experimental Setup

4.2. Training

  • uses gradient clipping during training, which works well in practice (a minimal sketch follows)
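
A hedged sketch of one training step with norm-based clipping via `torch.nn.utils.clip_grad_norm_`; the model, data, optimizer settings, and threshold are placeholders, not the paper's exact setup:

```python
# One training step with gradient norm clipping; all values are illustrative.
import torch
import torch.nn as nn

model = nn.Linear(512, 512)
optimizer = torch.optim.SGD(model.parameters(), lr=1.0, momentum=0.99)

x = torch.randn(32, 512)
target = torch.randn(32, 512)

optimizer.zero_grad()
loss = nn.functional.mse_loss(model(x), target)
loss.backward()
# Rescale the full gradient vector if its norm exceeds the threshold, so
# occasional large gradients do not destabilize training.
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=0.1)
optimizer.step()
```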

4.3. Hyper-parameters

  • layers are initialized with Kaiming initialization (a minimal sketch follows)
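
A minimal sketch of applying Kaiming initialization with `torch.nn.init.kaiming_normal_`; the layer sizes are illustrative:

```python
# Kaiming initialization over the conv layers of a small illustrative stack.
import torch.nn as nn

model = nn.Sequential(nn.Conv1d(64, 128, 3), nn.Conv1d(128, 128, 3))

def init_weights(module):
    if isinstance(module, nn.Conv1d):
        nn.init.kaiming_normal_(module.weight)  # variance scaled by fan-in
        if module.bias is not None:
            nn.init.zeros_(module.bias)

model.apply(init_weights)
```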

5. Results

5.3. Non-linear Modeling

TODO
