Another question on masking: in a normal transformer the mask is applied after QK^T and before the softmax, but since we compute KV first here, when is the mask applied? After the full attention output is computed?
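To make sure I understand the alternative, here is a rough sketch (not taken from this repo) of how I imagine a padding mask could be folded into non-causal linear attention: zero out the masked keys and values before the K^T V contraction, so padded positions contribute nothing to either the numerator or the normalizer. The function name is mine, and q/k are assumed to have already gone through the feature map phi(.):

import torch

def linear_attention_with_padding_mask(q, k, v, mask):
    # q, k: (batch, heads, seq, dim), already mapped through phi(.) so entries are non-negative
    # v:    (batch, heads, seq, dim_v)
    # mask: (batch, seq) bool, True = real token, False = padding
    mask = mask[:, None, :, None]                 # (batch, 1, seq, 1) for broadcasting
    k = k.masked_fill(~mask, 0.)                  # padded keys contribute nothing
    v = v.masked_fill(~mask, 0.)                  # padded values contribute nothing

    kv = torch.einsum('bhnd,bhne->bhde', k, v)    # sum_j phi(k_j) v_j^T
    z = 1. / (torch.einsum('bhnd,bhd->bhn', q, k.sum(dim=2)) + 1e-6)  # per-query normalizer
    out = torch.einsum('bhnd,bhde,bhn->bhne', q, kv, z)
    return out

For the causal case I assume the same idea applies, except the sums over j become prefix sums along the sequence.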
Hi, thanks for the wonderful repo. I am new to BERT, so I'd like to make sure about your example:
import torch
from performer_pytorch import PerformerLM

model = PerformerLM(num_tokens = 20000, max_seq_len = 2048, dim = 512, depth = 12, heads = 8)
x = torch.randint(0, 20000, (1, 2048))   # token ids
mask = torch.ones_like(x).bool()         # every position marked True
model(x, mask = mask)                    # (1, 2048, 20000)
Is this 'mask' an attention mask, i.e., True (1) for normal tokens and False (0) for padding tokens? Or does 1 indicate a padding token?
Thanks a lot!
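To make the question concrete, here is how I would build the mask under the convention I am guessing at (True = real token, False = padding, as the torch.ones_like(x).bool() in the README example seems to suggest); pad_id is just an illustrative padding token id:

import torch

pad_id = 0                                              # hypothetical padding token id
x = torch.tensor([[ 5, 17, 42,  9, pad_id, pad_id ]])   # a padded sequence of token ids
mask = (x != pad_id)                                    # True for real tokens, False for padding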