Difference between BilinearMatrixAttention and torch.nn.Bilinear?

I’m looking through the implementation of BilinearMatrixAttention as used in the BiaffineDependencyParser model, but I’m having trouble finding what the differences are between it and torch’s standard Bilinear layer.

Since both a Bilinear layer and a BilinearMatrixAttention layer are used in the BiaffineDependencyParser model, I assume they’re deliberately separate, but I don’t understand why torch’s standard Bilinear layer couldn’t be used for both.

The output of BilinearMatrixAttention has shape (batch_size, sequence_1_length, sequence_2_length). The output of Bilinear has shape (batch_size, sequence_1_length, sequence_2_length, out_features).
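
For concreteness, here’s a minimal sketch of the BilinearMatrixAttention side (assuming the `allennlp.modules.matrix_attention` import path; the batch size and dimensions are arbitrary):

```python
import torch
from allennlp.modules.matrix_attention import BilinearMatrixAttention

batch_size, len_1, len_2, dim = 2, 5, 7, 16
matrix_1 = torch.randn(batch_size, len_1, dim)
matrix_2 = torch.randn(batch_size, len_2, dim)

# BilinearMatrixAttention scores every pair of rows across the two
# matrices, producing one scalar score per pair.
attention = BilinearMatrixAttention(matrix_1_dim=dim, matrix_2_dim=dim)
scores = attention(matrix_1, matrix_2)
print(scores.shape)  # torch.Size([2, 5, 7])
```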

Ah, that makes sense. So Bilinear could work the same way, but its output would have to be squeezed afterward (if out_features == 1), as in the sketch below.
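
Something like this, I think. nn.Bilinear requires both inputs to share their leading dimensions, so each sequence has to be expanded against the other first; I’m also passing bias=False on the assumption that BilinearMatrixAttention’s default has no additive bias term:

```python
import torch

batch_size, len_1, len_2, dim = 2, 5, 7, 16
matrix_1 = torch.randn(batch_size, len_1, dim)
matrix_2 = torch.randn(batch_size, len_2, dim)

# bias=False to (presumably) mirror BilinearMatrixAttention's default.
bilinear = torch.nn.Bilinear(dim, dim, out_features=1, bias=False)

# Broadcast each sequence against the other so both inputs have shape
# (batch_size, len_1, len_2, dim), as nn.Bilinear expects matching
# leading dimensions.
expanded_1 = matrix_1.unsqueeze(2).expand(-1, -1, len_2, -1).contiguous()
expanded_2 = matrix_2.unsqueeze(1).expand(-1, len_1, -1, -1).contiguous()

scores = bilinear(expanded_1, expanded_2)  # (batch, len_1, len_2, 1)
scores = scores.squeeze(-1)                # (batch, len_1, len_2)
print(scores.shape)  # torch.Size([2, 5, 7])
```

If that’s right, one plausible reason for the dedicated module is efficiency: this route materializes two (batch, len_1, len_2, dim) tensors, whereas BilinearMatrixAttention can get the same pairwise scores with batched matrix multiplications.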

Thanks for explaining!