Sequence-of-sequences encoding for text classification


I am working on a document classification task similar to the one discussed in a previous post.

Each document is composed of sentences (max = N, but the number varies), and each sentence is composed of many words.

As in the previous post, I would like to create an encoder for each sentence (a “sentence” encoder) using a Seq2VecEncoder, and then pass these sentence encodings to a “document” encoder.

Based on the discussions in the previous post, my DatasetReader returns a ListField of TextFields (each TextField = one sentence).
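To make the shapes concrete, here is a minimal sketch (plain PyTorch, not AllenNLP) of what a padded batch built from a ListField of TextFields looks like. The token ids and padding index 0 are illustrative assumptions:

```python
import torch

# Hypothetical token-id batch: 2 documents, padded to 3 sentences of 4 words.
# Document 1 has 3 real sentences; document 2 has 2 (its third row is all padding).
tokens = torch.tensor([
    [[5, 8, 2, 0], [7, 1, 0, 0], [3, 9, 4, 6]],
    [[2, 2, 1, 0], [4, 0, 0, 0], [0, 0, 0, 0]],
])

# With padding index 0, the word-level mask (what
# get_text_field_mask(text, num_wrapping_dims=1) would return) is:
word_mask = tokens != 0
print(word_mask.shape)  # torch.Size([2, 3, 4])
```

So the batch tensor is (batch_size, num_sentences, num_words), with whole rows of zeros where a document has fewer sentences than the padded maximum.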

I have two TimeDistributed encoders: one for sentences and one for documents. Each encoder is an LSTM.

  • self.sentence_encoder = TimeDistributed(sentence_encoder)
  • self.document_encoder = TimeDistributed(document_encoder)

The sentence_encoder receives the embedded text, and I get its mask using get_text_field_mask(text, num_wrapping_dims=1).

I am currently stuck on implementing Model.forward(): specifically, on getting the mask for the input of the document_encoder. How can I get that mask from the sentence_encoder output? When I create the ListField, should I keep track of the number of sentences and manually build a mask vector?
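For reference, the forward pass I have in mind can be sketched like this in plain PyTorch, with ordinary LSTMs standing in for AllenNLP's Seq2VecEncoders and the "time-distributed" step written out by hand (all dimension sizes are illustrative):

```python
import torch
import torch.nn as nn

batch, num_sents, num_words = 2, 3, 4
emb_dim, sent_dim, doc_dim = 8, 16, 32

# Plain LSTMs stand in for the sentence- and document-level Seq2VecEncoders.
sentence_lstm = nn.LSTM(emb_dim, sent_dim, batch_first=True)
document_lstm = nn.LSTM(sent_dim, doc_dim, batch_first=True)

embedded = torch.randn(batch, num_sents, num_words, emb_dim)

# "TimeDistributed": fold the sentence dimension into the batch dimension,
# run the word-level encoder, then unfold back.
flat = embedded.view(batch * num_sents, num_words, emb_dim)
_, (h, _) = sentence_lstm(flat)            # h: (1, batch*num_sents, sent_dim)
sentence_vecs = h[-1].view(batch, num_sents, sent_dim)

# The document encoder consumes the sequence of sentence vectors directly.
_, (h, _) = document_lstm(sentence_vecs)
doc_vec = h[-1]                            # (batch, doc_dim)
print(doc_vec.shape)  # torch.Size([2, 32])
```

The open question is what mask to pass alongside sentence_vecs into the document encoder.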

Thanks a lot.

The sentence mask will have shape (batch_size, num_sentences, num_words). For the document encoder, you want something of shape (batch_size, num_sentences). If the document encoder operates on a sequence of sentence vectors, you don’t want to wrap it in TimeDistributed, because (batch_size, num_sentences, sentence_encoding_dim) is already the input shape a Seq2VecEncoder expects. To get a proper document mask, just sum the sentence mask along the num_words dimension and keep the entries that are nonzero; that gives you a mask of shape (batch_size, num_sentences).
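That mask reduction is a one-liner in PyTorch. A small self-contained example, using a hand-written word-level mask:

```python
import torch

# word_mask: (batch_size, num_sentences, num_words); False = padding.
# The second document has only two real sentences.
word_mask = torch.tensor([
    [[1, 1, 1, 0], [1, 1, 0, 0], [1, 1, 1, 1]],
    [[1, 1, 1, 0], [1, 0, 0, 0], [0, 0, 0, 0]],
]).bool()

# A sentence is real iff it contains at least one real word: sum over the
# num_words dimension and keep the nonzero entries.
sentence_mask = word_mask.sum(dim=-1) > 0
print(sentence_mask)
# tensor([[ True,  True,  True],
#         [ True,  True, False]])
```

The resulting (batch_size, num_sentences) mask is exactly what the document-level Seq2VecEncoder needs.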

Thanks so much. It works now. :slight_smile: