PretrainedTransformerIndexer with pre-tokenized text

PretrainedTransformer{Tokenizer/Indexer/Embedder} expects a string of untokenized text as input. Is it possible to use these classes with pre-tokenized sentences, as was possible with the old PretrainedBertEmbedder? I want this because I am applying RoBERTa to a tagging task, where there is a one-to-one correspondence between the original tokenization and the predicted sequence. The desired behavior would be:

  1. Retokenize each sentence into word pieces
  2. Pass the word pieces through the pretrained transformer
  3. Return a sequence with the same length as the original tokenization, by taking the first or last word piece of each word
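To make the bookkeeping in steps 1 and 3 concrete, here is a minimal sketch of the alignment logic. The `wordpiece_tokenize` function below is a toy stand-in (it just splits words into 3-character chunks) for a real subword tokenizer such as RoBERTa's BPE, and the random vectors stand in for the transformer outputs of step 2:

```python
import random

def wordpiece_tokenize(word):
    """Toy stand-in for a real subword tokenizer: 3-character chunks."""
    return [word[i:i + 3] for i in range(0, len(word), 3)]

def align(words):
    """Flatten words into pieces, recording each word's first-piece index."""
    pieces, first_piece_index = [], []
    for word in words:
        first_piece_index.append(len(pieces))
        pieces.extend(wordpiece_tokenize(word))
    return pieces, first_piece_index

words = ["unbelievable", "tagging", "task"]
pieces, first_idx = align(words)
# pieces   -> ['unb', 'eli', 'eva', 'ble', 'tag', 'gin', 'g', 'tas', 'k']
# first_idx -> [0, 4, 7]

# Pretend these came from the transformer: one vector per word piece.
piece_embeddings = [[random.random()] * 4 for _ in pieces]

# Step 3: keep only the first piece's vector for each original word,
# yielding exactly len(words) vectors for the tagging head.
word_embeddings = [piece_embeddings[i] for i in first_idx]
assert len(word_embeddings) == len(words)
```

Taking the last piece instead would just mean recording `len(pieces) - 1` after extending, rather than `len(pieces)` before.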

PretrainedTransformerTokenizer does not seem able to support this, because it expects string input instead of tokens. Attempting to use PretrainedTransformerIndexer without the tokenizer will understandably throw errors.

This is planned for the near future: We call this “mismatched” tokenization and modeling, and we need to update our old BERT code to handle this case for all transformers in the new transformers repo. No one has actually started working on that yet, so contributions welcome!