Bert Attention Masking from Huggingface repo

Hey everyone,

I’m relatively new to transformer models, and I was looking through how the BERT models are used in allennlp and huggingface. This may not be the best place to ask, since the code I’m asking about is actually in huggingface’s repo, but I figured you would know the answer.

I was wondering why the attention mask is added to the attention scores on line 215 instead of multiplying the scores by it. The masked probabilities will be small after the softmax is applied, but they won’t be exactly zero.

This is just a guess, but if the attention scores are log values (logits) at that point, it would make sense to mask with 0 and -inf instead of 1 and 0 and use addition rather than multiplication, since adding -inf in log space corresponds to multiplying by zero in probability space.
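To make my question concrete, here’s a toy NumPy sketch of the two masking styles (my own example, not the actual huggingface code, which uses a large negative number like -10000 rather than -inf):

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over the last axis.
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

scores = np.array([2.0, 1.0, 3.0])  # pretend the last position is padding

# Multiplicative 1/0 mask: the masked score becomes 0,
# but softmax still assigns it a nonzero probability.
mult_mask = np.array([1.0, 1.0, 0.0])
p_mult = softmax(scores * mult_mask)

# Additive mask in the huggingface style: add a large negative
# value to masked positions, so their probability is effectively 0.
add_mask = np.array([0.0, 0.0, -10000.0])
p_add = softmax(scores + add_mask)

print(p_mult)  # masked position still gets noticeable probability
print(p_add)   # masked position's probability is essentially 0
```

So the additive mask does zero out the masked positions for all practical purposes; I’m just trying to confirm that the log-space interpretation is the reason for the design.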