Hi,

I have implemented the following attention function, adapting it from an "attention BiLSTM" class I found (https://github.com/littleflow3r/attention-bilstm-for-relation-classification/blob/master/model.py):

```
import torch
import torch.nn.functional as F

def attention(out, hidden):
    # out: (seq_len, batch, hidden) -> (batch, seq_len, hidden)
    out = out.permute(1, 0, 2)
    # hidden: (1, batch, hidden) -> (batch, hidden)
    hidden = hidden.squeeze(0)
    # score each timestep against the hidden state -> (batch, seq_len)
    attn_weights = torch.einsum('pqr,pr->pq', out, hidden)
    soft_attn_weights = F.softmax(attn_weights, dim=1)
    # weighted sum over the timesteps -> (batch, hidden)
    new_hid = torch.einsum('pqr,pq->pr', out, soft_attn_weights)
    return new_hid
```

As I mentioned, this is intended to be applied to the output of an LSTM, and it works in my code.
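For context, this is roughly how I call it after the LSTM (the sizes and names here are made up just to illustrate the shapes, not my real model):

```
import torch
import torch.nn as nn

# Toy setup: batch_first=False, so out is (seq_len, batch, hidden)
lstm = nn.LSTM(input_size=5, hidden_size=4)
x = torch.randn(3, 2, 5)                 # (seq_len=3, batch=2, input_size=5)

out, (hn, cn) = lstm(x)                  # out: (3, 2, 4), hn: (1, 2, 4)
context = attention(out, hn)             # -> (2, 4): one attended vector per batch element
```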

Calling it with `out: torch.Size([2, 2, 3])` and `hidden: torch.Size([2, 3])` gives `torch.Size([2, 3])`,

or with `hn: torch.Size([2, 4])` and `out: torch.Size([3, 2, 4])`, where `attention(out, hn).shape` is `torch.Size([2, 4])`. For example:

```
hn = torch.FloatTensor([[0, 0, 0, 0], [1, 1, 1, 1]])
out = torch.FloatTensor([[[1, 2, 3, 4], [4, 5, 6, 7]],
                         [[1, 2, 3, 4], [4, 5, 6, 7]],
                         [[7, 8, 9, 10], [10, 11, 12, 13]]])
```
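For what it's worth, my understanding is that the two einsum calls are just a batched dot product followed by a batched weighted sum; I believe something like this bmm-based version is equivalent (a sketch on my side, the name is mine):

```
def attention_bmm(out, hidden):
    out = out.permute(1, 0, 2)            # (batch, seq_len, hidden)
    hidden = hidden.squeeze(0)            # (batch, hidden)
    # score each timestep against the hidden state: (batch, seq_len)
    attn_weights = torch.bmm(out, hidden.unsqueeze(2)).squeeze(2)
    soft_attn_weights = F.softmax(attn_weights, dim=1)
    # weighted sum over timesteps: (batch, hidden)
    new_hid = torch.bmm(out.transpose(1, 2), soft_attn_weights.unsqueeze(2)).squeeze(2)
    return new_hid
```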

But if I use the allennlp attention implementations instead, I get:

```
from allennlp.modules.attention import CosineAttention, DotProductAttention

linear = CosineAttention(normalize=False)
output = linear(
    torch.FloatTensor([[0, 0, 0], [1, 1, 1]]),
    torch.FloatTensor([[[1, 2, 3], [4, 5, 6]], [[7, 8, 9], [10, 11, 12]]]),
)
output.shape   # torch.Size([2, 2])
```

i.e. [2, 3], [2, 2, 3] -> [2, 2]

```
linear = DotProductAttention(normalize=False)
output = linear(
    torch.FloatTensor([[0, 0, 0], [1, 1, 1]]),
    torch.FloatTensor([[[1, 2, 3], [4, 5, 6]], [[7, 8, 9], [10, 11, 12]]]),
)
output.shape   # torch.Size([2, 2])
```

i.e. [2, 3], [2, 2, 3] -> [2, 2]
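If I understand the allennlp API correctly, these Attention modules only return the (optionally normalized) scores over the rows of the matrix, so to get something shaped like my `new_hid` I would still have to do the weighted sum separately. I assume the intended pattern is something like this (using `allennlp.nn.util.weighted_sum`, if that is the right helper):

```
import torch
from allennlp.modules.attention import DotProductAttention
from allennlp.nn.util import weighted_sum

attn = DotProductAttention()   # normalize=True by default, i.e. softmax over the rows

vector = torch.FloatTensor([[0, 0, 0], [1, 1, 1]])                               # (batch=2, dim=3)
matrix = torch.FloatTensor([[[1, 2, 3], [4, 5, 6]], [[7, 8, 9], [10, 11, 12]]])  # (batch=2, rows=2, dim=3)

weights = attn(vector, matrix)           # (batch, rows) = (2, 2)
context = weighted_sum(matrix, weights)  # (batch, dim)  = (2, 3), analogous to my new_hid
```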

And if I use the second example from above (hn: [2, 4], out: [3, 2, 4]) with these modules, I get an error, because they work differently and expect different dimensions:

RuntimeError: Expected tensor to have size 2 at dimension 0, but got size 3 for argument #2 'batch2' (while checking arguments for bmm)
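For completeness, I think the failing call boils down to passing `out` in its original (seq_len, batch, hidden) layout instead of the (batch, rows, dim) layout the module expects, i.e. something like:

```
# hn: (2, 4), out: (3, 2, 4) -- same tensors as in my second example above
linear = DotProductAttention(normalize=False)
output = linear(hn, out)   # raises a bmm size-mismatch RuntimeError, since out is not (batch, rows, dim)
```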

Can you provide some feedback? Is my implementation wrong?

I would like to replace my implementation with allennlp’s attention functions!

But are those implementations appropriate for applying to an LSTM output?

Your help would be much appreciated.