Understanding max_length for pretrained transformers

There are three different max_length parameters that you can set when developing a BERT model using AllenNLP:

  • in token_indexers
  • in token_embedders
  • in tokenizer

From the docs, it seems that the token_indexers and token_embedders settings work together to split the document into segments of this many tokens, while the tokenizer setting truncates the sequence. Can all three be combined? If so, how? What is the returned dictionary that overflowing tokens are added to (mentioned here)? And what is the default behavior when max_length in the token_indexer is None? Thanks!

From the tokenizer docs: If set to a number, will limit the total sequence returned so that it has a maximum length. If there are overflowing tokens, those will be added to the returned dictionary
From the indexer docs: If not None, split the document into segments of this many tokens (including special tokens) before feeding into the embedder. The embedder embeds these segments independently and concatenate the results to get the original document representation. Should be set to the same value as the max_length option on the PretrainedTransformerEmbedder.

In the tokenizer, max_length just truncates the sequence if it is too long.
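As a rough illustration of what truncation means here, this is a plain-Python sketch, not AllenNLP code. The `truncate` helper and the toy wordpieces are hypothetical, and I am assuming the limit counts the [CLS] and [SEP] special tokens.

```python
def truncate(wordpieces, max_length, cls="[CLS]", sep="[SEP]"):
    """Keep at most max_length tokens in total, special tokens included (assumption)."""
    if max_length < 2:
        raise ValueError("max_length must leave room for [CLS] and [SEP]")
    kept = wordpieces[: max_length - 2]  # everything past this point is simply dropped
    return [cls] + kept + [sep]


pieces = ["the", "quick", "brown", "fox", "jumps", "over", "the", "dog"]
truncated = truncate(pieces, max_length=6)
print(truncated)  # ['[CLS]', 'the', 'quick', 'brown', 'fox', '[SEP]']
```

The dropped tokens never reach the model, which is the main difference from the segment-splitting approach.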

The indexer and embedder have a feature where they split sequences that are too long for the transformer into smaller segments, embed them separately, and then combine the representations at the end. To turn this on, you have to set max_length on both of them to the same value. Going by the indexer docs quoted above, leaving max_length as None means no splitting happens. I don’t know what happens when the two are set to different values, but I suspect it won’t do anything useful.
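For reference, here is how the settings might line up in a config, written as a Python dict that mirrors the usual JSON/Jsonnet layout. This is only a fragment (no dataset reader type, no model type). The model name, the value 512, and the surrounding key layout are assumptions about a typical setup. As far as I know, `pretrained_transformer` is the registered type name for all three components.

```python
MODEL_NAME = "bert-base-uncased"  # assumption: any Hugging Face model name
MAX_LENGTH = 512                  # assumption: segment size, special tokens included

config = {
    "dataset_reader": {
        # No max_length on the tokenizer here: we want splitting, not truncation.
        "tokenizer": {
            "type": "pretrained_transformer",
            "model_name": MODEL_NAME,
        },
        "token_indexers": {
            "tokens": {
                "type": "pretrained_transformer",
                "model_name": MODEL_NAME,
                "max_length": MAX_LENGTH,
            },
        },
    },
    "model": {
        "text_field_embedder": {
            "token_embedders": {
                # Same key ("tokens") and same max_length as the indexer.
                "tokens": {
                    "type": "pretrained_transformer",
                    "model_name": MODEL_NAME,
                    "max_length": MAX_LENGTH,
                },
            },
        },
    },
}

indexer_cfg = config["dataset_reader"]["token_indexers"]["tokens"]
embedder_cfg = config["model"]["text_field_embedder"]["token_embedders"]["tokens"]
# Cheap sanity check before training: the two values must agree.
assert indexer_cfg["max_length"] == embedder_cfg["max_length"]
```

Defining the value once and reusing it (a local variable in Jsonnet would do the same job) keeps the two settings from drifting apart.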

This is a way to process sequences longer than 512 tokens with standard BERT.
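To make the arithmetic concrete, here is a toy sketch of the split, embed, concatenate idea in plain Python. It is not the AllenNLP implementation: `fold`, `fake_embed`, and `embed_long` are hypothetical helpers, the "embedding" is a stand-in, and I am assuming each segment gets its own [CLS]/[SEP] pair whose vectors are dropped again before concatenation.

```python
def fold(wordpieces, max_length, num_special=2):
    """Split wordpieces into segments that fit max_length once special tokens are added."""
    step = max_length - num_special
    if step <= 0:
        raise ValueError("max_length must exceed the number of special tokens")
    return [
        ["[CLS]"] + wordpieces[i : i + step] + ["[SEP]"]
        for i in range(0, len(wordpieces), step)
    ]


def fake_embed(segment):
    """Stand-in for the transformer: one 1-d 'vector' per token."""
    return [[float(len(token))] for token in segment]


def embed_long(wordpieces, max_length):
    """Embed each segment independently, then concatenate the wordpiece vectors."""
    vectors = []
    for segment in fold(wordpieces, max_length):
        embedded = fake_embed(segment)
        vectors.extend(embedded[1:-1])  # drop the [CLS]/[SEP] vectors of this segment
    return vectors


document = [f"w{i}" for i in range(1200)]  # a 1200-wordpiece document
segments = fold(document, max_length=512)
vectors = embed_long(document, max_length=512)
print(len(segments), [len(s) for s in segments], len(vectors))
# 3 [512, 512, 182] 1200
```

Each segment only sees its own tokens, so in this scheme nothing attends across segment boundaries.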