Best Practice to Use External Evaluators

I am curious what the best practice is for incorporating external evaluators. Many datasets/tasks (e.g. BioNLP shared tasks) have their own official offline evaluators. The usual pipeline for using these evaluators is to run prediction on the test set, store the predictions in local files, and then pass those files to the official evaluator, roughly like the sketch below.
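Concretely, the offline step looks something like this (untested sketch; the prediction file format and the evaluator command `eval.pl` are just placeholders for whatever the shared task provides, and `predictor` is an AllenNLP `Predictor`):

```python
import json
import subprocess

def evaluate_externally(predictor, instances, pred_path="preds.jsonl",
                        evaluator_cmd=("perl", "eval.pl")):
    # 1. Predict the validation/test set and dump the predictions to disk
    #    in whatever format the official evaluator expects.
    with open(pred_path, "w") as f:
        for instance in instances:
            output = predictor.predict_instance(instance)
            f.write(json.dumps(output) + "\n")

    # 2. Hand the file to the official offline evaluator and capture its output.
    result = subprocess.run([*evaluator_cmd, pred_path],
                            capture_output=True, text=True, check=True)
    return result.stdout  # parse the official score(s) out of this
```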

Using the scores from these official evaluators as the validation metric / early-stopping criterion seems challenging due to the design of AllenNLP:

  1. Evaluation is run at each step rather than once per epoch.
  2. The trainer and predictor are separate components.
  3. The training loop is abstracted away.

I am wondering whether there is an easy way to integrate these external evaluators into the training pipeline of AllenNLP.

I’m not sure I understand the setting correctly. You want to make predictions on the validation set, pass them off to an external offline evaluator, and use the results for early stopping?

Yes, that is exactly what I was trying to do! I wonder how this procedure is typically implemented in pipelines based on AllenNLP.

I don’t think we have a “typical” way of approaching this. It’s somewhat unusual for our workflow.

I would probably just train one epoch at a time, and then use allennlp evaluate or allennlp predict to get predictions on the validation set. Then get the scores from the external evaluator, and then train for the second epoch, starting from the weights of the first. It might be easier to do this if you skip the configuration files and command line and use AllenNLP as a Python library instead. The only thing you lose that way is the built-in validation and patience (early stopping) logic, but that is not hard to implement yourself; see the sketch below.
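A rough, untested sketch of that outer loop, using AllenNLP as a library. `build_trainer`, `predict_to_file`, and `run_official_evaluator` are placeholders you would write for your own model, data, and evaluator script; I'm also assuming the external score is "higher is better":

```python
import torch

def train_with_external_metric(model, max_epochs=50, patience=5):
    best_score, epochs_without_improvement = float("-inf"), 0

    for epoch in range(max_epochs):
        # Placeholder: build a trainer (e.g. a GradientDescentTrainer) with
        # num_epochs=1. The model keeps its weights across loop iterations.
        trainer = build_trainer(model, num_epochs=1)
        trainer.train()

        # Placeholders: predict the validation set, write the file the official
        # evaluator expects, and parse the score it reports.
        predict_to_file(model, "validation_preds.txt")
        score = run_official_evaluator("validation_preds.txt")

        if score > best_score:
            best_score, epochs_without_improvement = score, 0
            torch.save(model.state_dict(), "model_best.th")  # keep best weights
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break  # early stopping on the external metric

    return best_score
```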

I hope this helps?

I see. Using AllenNLP as a Python library instead sounds like a much easier way to do it. Thank you very much for the help!

No problem, let me know if you run into any roadblocks!