Mitigate the computational burden of fitting the token dictionary?

I have a little bit of a wacky setup where my DatasetReader.text_to_instance, at train time, performs a bunch of extra steps beyond tokenization/indexing. The problem is, before training, the Vocabulary object runs through the whole dataset, fitting the token dictionary from the dataset. Because my dataset is large (~500K documents), this incurs a large burden (over 1.5hrs).

I have thought of the following solutions but don’t know how to approach them; any guidance would be appreciated:

  1. Somehow determine when the Vocabulary object is fitting a token dictionary from the dataset, and disable the extra steps beyond tokenizing/indexing in my DatasetReader.text_to_instance during that pass.
  2. Save and load the vocabulary from disk so I only pay this 1.5hr cost once.
  3. Skip the token dictionary fitting step altogether? I use a pre-trained transformer, so my vocabulary already comes with the pre-trained model.

Some possibly relevant details:

  • My DatasetReader is lazy
  • I use a PretrainedTransformerTokenizer and PretrainedTransformerIndexer

Relevant: Probably your best option is to use 2 for now (which also skips the vocabulary fitting step). This is very easily done by using the "from_files" vocabulary type. If you’ve already trained a model, you should already have these files computed somewhere. And if you don’t have any labels, then it would be easy to just create it manually - an otherwise empty directory with a non_padded_namespaces.txt file is enough.

We might have a better story for this at some point, but nothing is coming particularly soon.
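For anyone who wants to try the manual route described above, here is a minimal sketch of creating such a directory. The helper name make_empty_vocab_dir is made up for illustration, and writing "*tags" and "*labels" into the file is an assumption that mirrors AllenNLP's default non-padded namespaces; adjust the contents to your own setup.

```python
import tempfile
from pathlib import Path
from typing import Union


def make_empty_vocab_dir(directory: Union[str, Path]) -> Path:
    """Create a vocabulary directory containing only non_padded_namespaces.txt.

    Hypothetical helper, not part of AllenNLP.
    """
    vocab_dir = Path(directory)
    vocab_dir.mkdir(parents=True, exist_ok=True)
    # One namespace pattern per line. These two patterns are an assumption
    # (they mirror AllenNLP's defaults); change them if your model differs.
    (vocab_dir / "non_padded_namespaces.txt").write_text(
        "*tags\n*labels\n", encoding="utf-8"
    )
    return vocab_dir


# Demo in a temporary location so running this does not touch real data.
demo_dir = make_empty_vocab_dir(Path(tempfile.mkdtemp()) / "vocabulary")
print(sorted(p.name for p in demo_dir.iterdir()))
```

In a real run you would pass the path you intend to reference from the "directory" field of the "from_files" vocabulary.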

Ah I see, I didn’t realize there were several vocab types (never interacted with the object directly) but this set me on the right path. Thanks a lot!

Ah sorry, is there an example config anywhere on how to specify the "from_files" Vocabulary class? I tried to do this in my config, specifying

        "vocab": {
            "type": "from_files",
            "directory": "datasets/openwebtext"
        }

as an argument to my DatasetReader but I get an error

allennlp.common.checks.ConfigurationError: Extra parameters passed to ContrastiveDatasetReader: {'vocab': {'type': 'from_files', 'directory': 'datasets/openwebtext'}}

(ContrastiveDatasetReader inherits from DatasetReader and calls super().__init__(**kwargs) in its constructor).

Never mind, figured it out! The key is vocabulary, not vocab, and it needs to be provided at the top level of the config.

It’s a top-level key, not a parameter to the dataset reader class.
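To make the placement concrete, here is a rough sketch of the config layout, written as a Python dict and dumped to JSON. The "vocabulary" block uses the values from this thread; the dataset reader's registered name "contrastive" is a guess for illustration, and the remaining sections of a real config are left out.

```python
import json

config = {
    "dataset_reader": {
        # Hypothetical registered name for ContrastiveDatasetReader.
        # Note: no "vocab"/"vocabulary" entry belongs in here.
        "type": "contrastive",
    },
    # The vocabulary is a sibling of "dataset_reader", at the top level.
    "vocabulary": {
        "type": "from_files",
        "directory": "datasets/openwebtext",
    },
    # Other top-level sections (data paths, model, trainer, ...) are omitted.
}

# Print the fragment as it would appear in a JSON config file.
print(json.dumps(config, indent=4))
```

With this layout the dataset reader no longer receives an unexpected vocab argument, which is what triggered the ConfigurationError above.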
