I have a little bit of a wacky setup where my `DatasetReader.text_to_instance`, at train time, performs a bunch of extra steps beyond tokenization/indexing. The problem is that, before training, the `Vocabulary` object runs through the whole dataset to fit the token dictionary. Because my dataset is large (~500K documents), this incurs a large cost (over 1.5 hrs).
I have thought of the following solutions but don't know how to approach them; any guidance would be appreciated:
- Somehow determine when the `Vocabulary` object is fitting the token dictionary from the dataset, and disable the extra steps beyond tokenizing/indexing in my `text_to_instance` during that pass.
- Save and load the vocabulary from disk so I only pay this 1.5hr cost once.
- Skip the token-dictionary fitting step altogether? Since I use a pre-trained transformer, my vocabulary is already fixed by the pre-trained tokenizer.
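For the second option, the general pattern is to pay the fitting cost once and cache the result on disk (AllenNLP's `Vocabulary` exposes `save_to_files()` and `Vocabulary.from_files()` for this). Here is a minimal sketch of that save-once/load pattern, using a plain dict vocabulary as a stand-in for the real `Vocabulary` object; the cache path and helper names are hypothetical:

```python
import json
from pathlib import Path

VOCAB_PATH = Path("vocab_cache.json")  # hypothetical cache location

def build_vocab(documents):
    """The expensive pass over the whole dataset: fit a token -> id mapping."""
    vocab = {}
    for doc in documents:
        for token in doc.split():
            vocab.setdefault(token, len(vocab))
    return vocab

def get_vocab(documents):
    """Load the cached vocabulary if present; otherwise fit it and cache it."""
    if VOCAB_PATH.exists():
        return json.loads(VOCAB_PATH.read_text())
    vocab = build_vocab(documents)  # the ~1.5 hr pass, paid only once
    VOCAB_PATH.write_text(json.dumps(vocab))
    return vocab

docs = ["the cat sat", "the dog ran"]
vocab = get_vocab(docs)
print(sorted(vocab))  # → ['cat', 'dog', 'ran', 'sat', 'the']
```

On subsequent runs `get_vocab` short-circuits to the cached file, so the dataset is never re-scanned; the same idea applies whether the cache is JSON or AllenNLP's own vocabulary directory format.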
Some possibly relevant details:
- My DatasetReader is lazy
- I use a