Custom attention vs allennlp attention

Hi,
I have implemented the following attention function, adapting it from an "attention bi-lstm" class I found (https://github.com/littleflow3r/attention-bilstm-for-relation-classification/blob/master/model.py):

import torch
import torch.nn.functional as F

def attention(out, hidden):
    # out: (seq_len, batch, hidden_dim); hidden: (1, batch, hidden_dim) or already (batch, hidden_dim)
    out = out.permute(1, 0, 2)                                       # (batch, seq_len, hidden_dim)
    hidden = hidden.squeeze(0)                                       # (batch, hidden_dim)
    attn_weights = torch.einsum('pqr,pr->pq', [out, hidden])         # dot product with each timestep -> (batch, seq_len)
    soft_attn_weights = F.softmax(attn_weights, 1)                   # normalize over seq_len
    new_hid = torch.einsum('pqr,pq->pr', [out, soft_attn_weights])   # weighted sum of timesteps -> (batch, hidden_dim)
    return new_hid

As I mentioned, this is intended to be used after an LSTM, and it works in my code.
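
For context, it sits after the LSTM roughly like this (a sketch with hypothetical sizes, assuming a single-layer, unidirectional LSTM and using the attention() above):

lstm = torch.nn.LSTM(input_size=4, hidden_size=4)    # hypothetical sizes
x = torch.randn(3, 2, 4)                             # (seq_len, batch, input_size)
out, (hn, _) = lstm(x)                               # out: (3, 2, 4), hn: (1, 2, 4)
new_hid = attention(out, hn)                         # (2, 4): one attended vector per batch item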

Calling it with

out: torch.Size([2, 2, 3])
hidden: torch.Size([2, 3])

gives torch.Size([2, 3])

Or with
hn.shape: torch.Size([2, 4])
out.shape: torch.Size([3, 2, 4])

attention(out, hn).shape: torch.Size([2, 4])

(example:
hn = torch.FloatTensor([[0, 0, 0, 0], [1, 1, 1, 1]])
out = torch.FloatTensor([[[1, 2, 3, 4], [4, 5, 6, 7]],
                         [[1, 2, 3, 4], [4, 5, 6, 7]],
                         [[7, 8, 9, 10], [10, 11, 12, 13]]])
)

But if I use allennlp's attention implementations, I get:

linear = CosineAttention(normalize=False)
output = linear(
    torch.FloatTensor([[0, 0, 0], [1, 1, 1]]),
    torch.FloatTensor([[[1, 2, 3], [4, 5, 6]], [[7, 8, 9], [10, 11, 12]]]),
)
output.shape
torch.Size([2, 2])

[2, 3], [2, 2, 3] -> [2, 2]

linear = DotProductAttention(normalize=False)
output = linear(
    torch.FloatTensor([[0, 0, 0], [1, 1, 1]]),
    torch.FloatTensor([[[1, 2, 3], [4, 5, 6]], [[7, 8, 9], [10, 11, 12]]]),
)
output.shape
torch.Size([2, 2])

[2, 3], [2, 2, 3] -> [2, 2]

And if I use the second example (hn: [2, 4], out: [3, 2, 4]), I get an error, because the allennlp attention works differently and expects different dimensions:
RuntimeError: Expected tensor to have size 2 at dimension 0, but got size 3 for argument #2 'batch2' (while checking arguments for bmm)
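
The call that fails looks roughly like this (with the tensors from the second example above):

linear = CosineAttention(normalize=False)
# hn: (2, 4); out: (3, 2, 4), i.e. still (seq_len, batch, dim)
output = linear(hn, out)   # raises the RuntimeError above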

Can you provide some feedback? Is my implementation wrong?
I would like to replace my implementation with allennlp’s attention functions!
But are those implementations appropriate for applying to an LSTM output?
Your help would be much appreciated.

It’s hard to know what your intended operations are without having more descriptive code. Our code is computing attention weights, not the weighted sum. So our inputs are (batch_size, num_elements, embedding_dim) and (batch_size, embedding_dim), and our output is (batch_size, num_elements). We then have a weighted_sum() function that will compute the (batch_size, embedding_dim) attention-weighted vector. It looks like your function is doing both at the same time.
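
For example, something like this (a quick, untested sketch using your second set of tensors, assuming the usual allennlp.modules.attention and allennlp.nn.util imports) should give the same (batch_size, embedding_dim) result as your function:

import torch
from allennlp.modules.attention import DotProductAttention
from allennlp.nn.util import weighted_sum

hn = torch.FloatTensor([[0, 0, 0, 0], [1, 1, 1, 1]])             # (batch_size, embedding_dim) = (2, 4)
out = torch.FloatTensor([[[1, 2, 3, 4], [4, 5, 6, 7]],
                         [[1, 2, 3, 4], [4, 5, 6, 7]],
                         [[7, 8, 9, 10], [10, 11, 12, 13]]])      # (seq_len, batch_size, dim) = (3, 2, 4)
out = out.permute(1, 0, 2)                                        # (batch_size, num_elements, dim) = (2, 3, 4)

attn = DotProductAttention()           # normalize=True by default, so the weights are softmaxed
weights = attn(hn, out)                # attention weights: (batch_size, num_elements) = (2, 3)
attended = weighted_sum(out, weights)  # weighted sum: (batch_size, embedding_dim) = (2, 4)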

OK, it works, thanks for the help.
Specifically, the last four lines replaced my custom function, and now I can experiment with different attention functions:
outputs_prem, hidden_prem = self.rnn(translated_prem)
# instead of the following four lines we could use this:
# hidden_prem = attention(outputs_prem, hidden_prem)

outputs_prem = outputs_prem.permute(1, 0, 2)                     # (batch, seq_len, hidden_dim)
hidden_prem = hidden_prem.squeeze(0)                             # (batch, hidden_dim)
hidden_prem_att = cosineAttention(hidden_prem, outputs_prem)     # attention weights: (batch, seq_len); cosineAttention is a CosineAttention() instance
hidden_prem = weighted_sum(outputs_prem, hidden_prem_att)        # attention-weighted sum of the LSTM outputs: (batch, hidden_dim)

Thanks