Using LanguageModelingReader with LanguageModel

I’m attempting to use an `allennlp.data.dataset_readers.language_modeling.LanguageModelingReader` to read data to train an `allennlp.models.language_model.LanguageModel`.

However, it appears that the APIs of these two classes do not quite agree: the `LanguageModelingReader` produces instances with two keys, `input_tokens` and `output_tokens`, corresponding to the input and label sequences, whereas the `LanguageModel` expects a single field, `source`, which I assume should contain all the tokens in the document.

It seems these classes are not meant to be used together. What is the best-practice workaround? Should I just write my own basic reader?

Confusingly, you should actually use the `SimpleLanguageModelingDatasetReader` (https://github.com/allenai/allennlp/blob/master/allennlp/data/dataset_readers/simple_language_modeling.py) with the `LanguageModel`; it emits instances with the single `source` field the model expects. Take a peek at the config here: https://github.com/allenai/allennlp/blob/master/training_config/bidirectional_language_model.jsonnet#L5. I hope that helps!
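For anyone landing here later, a minimal config sketch pairing the two (all paths and hyperparameters below are illustrative placeholders, not taken from the linked config):

```jsonnet
{
  // Reader that emits a single "source" field, matching LanguageModel's API
  "dataset_reader": {
    "type": "simple_language_modeling"
  },
  "train_data_path": "path/to/train.txt",  // placeholder path
  "model": {
    "type": "language_model",
    "text_field_embedder": {
      "token_embedders": {
        "tokens": {"type": "embedding", "embedding_dim": 16}
      }
    },
    // Any Seq2SeqEncoder works as the contextualizer
    "contextualizer": {
      "type": "lstm",
      "input_size": 16,
      "hidden_size": 16,
      "num_layers": 1
    }
  },
  "iterator": {"type": "basic", "batch_size": 32},
  "trainer": {"optimizer": "adam", "num_epochs": 1}
}
```

The linked bidirectional config is the authoritative example; the sketch above just shows that the reader and model line up on the `source` field.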