Allennlp predict with unseen labels at test time

Hi!

I’d like to use allennlp predict with --use-dataset-reader to get predictions for my dev set. However, my dev set includes labels that weren’t seen in training, and when I run allennlp predict I get the following error:

0it [00:00, ?it/s]2019-10-01 13:46:31,391 - INFO - streusle_tagger.dataset_readers.streusle - Reading instances from lines in file at: /home/nfliu/.allennlp/cache/0f77b39476d1e5ebe4debcba134064d28c5f25369b0112104b70621ef9fda188.ab4304b164deed035290f36ee7441a992c3378777277e1137b01c25247e451dd
554it [00:00, 8871.99it/s]
2019-10-01 13:46:50,895 - ERROR - allennlp.data.vocabulary - Namespace: labels
2019-10-01 13:46:50,896 - ERROR - allennlp.data.vocabulary - Token: B-!@
Traceback (most recent call last):
  File "/home/nfliu/miniconda3/envs/streusle/bin/allennlp", line 10, in <module>
    sys.exit(run())
  File "/home/nfliu/miniconda3/envs/streusle/lib/python3.6/site-packages/allennlp/run.py", line 18, in run
    main(prog="allennlp")
  File "/home/nfliu/miniconda3/envs/streusle/lib/python3.6/site-packages/allennlp/commands/__init__.py", line 102, in main
    args.func(args)
  File "/home/nfliu/miniconda3/envs/streusle/lib/python3.6/site-packages/allennlp/commands/predict.py", line 226, in _predict
    manager.run()
  File "/home/nfliu/miniconda3/envs/streusle/lib/python3.6/site-packages/allennlp/commands/predict.py", line 200, in run
    for model_input_instance, result in zip(batch, self._predict_instances(batch)):
  File "/home/nfliu/miniconda3/envs/streusle/lib/python3.6/site-packages/allennlp/commands/predict.py", line 158, in _predict_instances
    results = [self._predictor.predict_instance(batch_data[0])]
  File "/home/nfliu/miniconda3/envs/streusle/lib/python3.6/site-packages/allennlp/predictors/predictor.py", line 181, in predict_instance
    outputs = self._model.forward_on_instance(instance)
  File "/home/nfliu/miniconda3/envs/streusle/lib/python3.6/site-packages/allennlp/models/model.py", line 124, in forward_on_instance
    return self.forward_on_instances([instance])[0]
  File "/home/nfliu/miniconda3/envs/streusle/lib/python3.6/site-packages/allennlp/models/model.py", line 151, in forward_on_instances
    dataset.index_instances(self.vocab)
  File "/home/nfliu/miniconda3/envs/streusle/lib/python3.6/site-packages/allennlp/data/dataset.py", line 155, in index_instances
    instance.index_fields(vocab)
  File "/home/nfliu/miniconda3/envs/streusle/lib/python3.6/site-packages/allennlp/data/instance.py", line 72, in index_fields
    field.index(vocab)
  File "/home/nfliu/miniconda3/envs/streusle/lib/python3.6/site-packages/allennlp/data/fields/sequence_label_field.py", line 98, in index
    for label in self.labels]
  File "/home/nfliu/miniconda3/envs/streusle/lib/python3.6/site-packages/allennlp/data/fields/sequence_label_field.py", line 98, in <listcomp>
    for label in self.labels]
  File "/home/nfliu/miniconda3/envs/streusle/lib/python3.6/site-packages/allennlp/data/vocabulary.py", line 637, in get_token_index
    return self._token_to_index[namespace][self._oov_token]
KeyError: '@@UNKNOWN@@'

Seems like it’s choking because it wants to index the label, but the label hasn’t been seen? Is there an easy way of fixing this, short of converting the dataset I want predictions for to JSONL and not using --use-dataset-reader?
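
If it helps, I think the failure reduces to the vocabulary lookup itself. Here is a minimal sketch of what seems to be going on (the label strings here are made up, not from my actual data):

from allennlp.data import Vocabulary

# "labels" matches AllenNLP's default non_padded_namespaces ("*labels", "*tags"),
# so this namespace is built without an @@UNKNOWN@@ (OOV) entry.
vocab = Vocabulary(tokens_to_add={"labels": ["O", "B-X", "I-X"]})

print(vocab.get_token_index("B-X", "labels"))   # fine: this label is in the vocabulary
print(vocab.get_token_index("B-!@", "labels"))  # unseen label: the lookup falls back to
                                                # the OOV token, which doesn't exist in
                                                # this namespace, so KeyError: '@@UNKNOWN@@'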

What do you want to happen when you get dev set labels that weren’t seen at training time?

I just want to get the model’s prediction anyway. Intuitively, “predict” shouldn’t rely on the dataset having labels, but in this case the dataset being labeled actually breaks things.

How about modifying your dataset reader to have an option, controlled by a config setting, to not include labels? Then just change that setting with --overrides when calling allennlp predict.
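
Concretely, something along these lines. This is only a sketch, not the actual STREUSLE reader; the reader name, field names, data format, and the include_labels flag are placeholders:

from typing import Dict, List, Optional

from allennlp.data.dataset_readers import DatasetReader
from allennlp.data.fields import SequenceLabelField, TextField
from allennlp.data.instance import Instance
from allennlp.data.token_indexers import SingleIdTokenIndexer, TokenIndexer
from allennlp.data.tokenizers import Token


@DatasetReader.register("my_tagging_reader")  # placeholder name
class MyTaggingReader(DatasetReader):
    def __init__(self,
                 token_indexers: Dict[str, TokenIndexer] = None,
                 include_labels: bool = True,  # the new config switch
                 lazy: bool = False) -> None:
        super().__init__(lazy)
        self._token_indexers = token_indexers or {"tokens": SingleIdTokenIndexer()}
        self._include_labels = include_labels

    def _read(self, file_path: str):
        # Placeholder format: one sentence per line, "token/TAG token/TAG ...".
        with open(file_path) as data_file:
            for line in data_file:
                pairs = [item.rsplit("/", 1) for item in line.split()]
                yield self.text_to_instance([token for token, _ in pairs],
                                            [tag for _, tag in pairs])

    def text_to_instance(self,  # type: ignore
                         tokens: List[str],
                         tags: Optional[List[str]] = None) -> Instance:
        text_field = TextField([Token(t) for t in tokens], self._token_indexers)
        fields = {"tokens": text_field}
        # Only attach gold tags when the switch is on, so predicting on a labeled
        # dev file never tries to index labels the model's vocabulary hasn't seen.
        if self._include_labels and tags is not None:
            fields["tags"] = SequenceLabelField(tags, text_field)
        return Instance(fields)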

I don’t see a clear way to do what you suggest without making labels special in some way, and that feels brittle. (Essentially, a label could be a true input and not just a target.) Alternatively, we’d need Haskell levels of laziness to infer that we don’t actually need specific labels in the predict case.

How about modifying your dataset reader to have an option, controlled by a config setting, to not include labels? Then just change that setting with --overrides when calling allennlp predict.

That’s a great idea, I’ll just do that—thanks!
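
For anyone who finds this later, the resulting call ends up looking something like the following. Paths are placeholders, and this assumes the archived config defines the reader under "dataset_reader" and that the reader exposes an include_labels option like the one sketched above:

allennlp predict /path/to/model.tar.gz /path/to/dev_file \
    --use-dataset-reader \
    --output-file dev_predictions.jsonl \
    --overrides '{"dataset_reader": {"include_labels": false}}'

Keeping include_labels defaulted to true means training and evaluation still see the gold tags; the override only turns them off for prediction.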