
A question about inference #2

Open
twwwwx opened this issue Jan 19, 2023 · 1 comment

Comments

twwwwx commented Jan 19, 2023

Hi, thank you for your impressive work :)
As is mentioned in your paper, "For MHA, heads with mask ’0’ will not be executed. For FFN, as matrix-matrix multiplication can be transformed to multiple matrix vector multiplications, we only need to complete part of computations where vector’s mask is not zero."
However, it seems that in modeling_ebert.py you simply multiply the mask with the hidden states or attention probs, so the computation is not actually reduced and the inference FLOPs are only computed theoretically. Is that correct?
But if you actually pruned the channels and heads, the feature dimensions (e.g. 768) of the hidden states would shrink, causing a mismatch with the linear layers (e.g. in the FFN 768 -> 3072 -> 768, the first weight matrix is (3072, 768), so if the intermediate dimension is < 3072 the multiplication is invalid). How did you deal with this mismatch?
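
For concreteness, here is a rough sketch of the masking I mean for the attention part (just my reading of the approach, not the actual modeling_ebert.py code, and the shapes are hypothetical):

```python
import torch

# Hypothetical illustration of head masking by multiplication: masked heads
# produce zeros, but their attention/value products are still fully computed,
# so the FLOPs are unchanged.
batch, heads, seq, head_dim = 1, 12, 128, 64
attn_probs = torch.softmax(torch.randn(batch, heads, seq, seq), dim=-1)
value = torch.randn(batch, heads, seq, head_dim)
head_mask = torch.tensor([1.0] * 8 + [0.0] * 4).view(1, heads, 1, 1)  # hypothetical predicted mask

attn_probs = attn_probs * head_mask   # heads with mask 0 are zeroed out
context = attn_probs @ value          # still multiplied for all 12 heads
```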

@zejiangp
Collaborator

Thank you for your interest in our work.

  1. Yes, our implementation does not reduce the actual inference FLOPs; that is why we do not report real-time speed-ups in the paper.
  2. For example, if we reduce 3072 to 3000, then we need a weight matrix of shape (3000, 768), selected from the original (3072, 768) matrix according to the predicted masks. This selection is not easy to implement efficiently, so we just set the unused parts to zero, which achieves the same result but cannot bring a real speed-up (see the sketch below this list).
  3. If we want to translate the reduction in FLOPs into a real speed-up, we would need to implement some functions in CUDA for GPUs, or use specialized accelerators.
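
To make point 2 concrete, here is a minimal PyTorch sketch (illustrative only, with hypothetical shapes matching a standard BERT FFN, not our repository code) showing that zeroing the unused channels gives the same output as selecting the corresponding rows and columns of the weight matrices:

```python
import torch

torch.manual_seed(0)
x = torch.randn(4, 768)                              # a few token vectors
W1, b1 = torch.randn(3072, 768), torch.randn(3072)   # FFN up-projection
W2, b2 = torch.randn(768, 3072), torch.randn(768)    # FFN down-projection
mask = torch.rand(3072) > 0.3                        # hypothetical predicted channel mask

# Masking approach: compute everything, then zero the unused channels.
inter_masked = torch.relu(x @ W1.t() + b1) * mask.float()
out_masked = inter_masked @ W2.t() + b2

# Real pruning: keep only the rows/columns selected by the mask.
inter_kept = torch.relu(x @ W1[mask].t() + b1[mask])
out_pruned = inter_kept @ W2[:, mask].t() + b2

# Same result (up to float rounding), but the second path performs fewer FLOPs.
print(torch.allclose(out_masked, out_pruned, rtol=1e-4, atol=1e-3))
```

The index-select path is what a dedicated CUDA kernel or accelerator (point 3) would effectively implement to turn the FLOP reduction into a wall-clock speed-up.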
