I am curious what the best practice is for incorporating external evaluators. Many datasets/tasks (e.g. the BioNLP shared tasks) come with their own official offline evaluators. The usual pipeline is to predict the test set, store the predictions in local files, and then pass those files to the official evaluator.
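To make that concrete, the offline pipeline I mean looks roughly like this (the evaluator script and predictor name are just placeholders, not anything from a real task):

```python
import subprocess

# Generate predictions with a trained model via the standard AllenNLP CLI.
subprocess.run([
    "allennlp", "predict",
    "model.tar.gz", "test.jsonl",
    "--output-file", "predictions.jsonl",
    "--predictor", "my_predictor",  # hypothetical predictor name
], check=True)

# Pass the prediction file to the task's official evaluator
# ("evaluate.pl" stands in for e.g. a BioNLP shared-task script).
result = subprocess.run(
    ["perl", "evaluate.pl", "gold/", "predictions.jsonl"],
    capture_output=True, text=True, check=True,
)
print(result.stdout)  # the evaluator prints the official scores
```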
It seems challenging to use the scores from these official evaluators as the validation metric / early-stopping criterion, due to the design of AllenNLP:
- Evaluation is run on each step rather than once per epoch.
- Trainer and predictor are separate.
- The training loop is abstracted.
I am wondering whether there is an easy way to integrate these external evaluators into AllenNLP's training pipeline.
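For what it's worth, here is a minimal sketch of the kind of hook I have in mind, assuming the `TrainerCallback` API from `allennlp.training.callbacks`; `run_official_evaluator`, the metric key, and the file path are all placeholders I made up:

```python
from typing import Any, Dict

from allennlp.training.callbacks import TrainerCallback


def run_official_evaluator(model, prediction_file: str) -> float:
    """Hypothetical helper: predict the dev set with the current model,
    write the predictions to `prediction_file`, shell out to the official
    evaluator, and parse the score from its output."""
    raise NotImplementedError


@TrainerCallback.register("external-evaluator")
class ExternalEvaluatorCallback(TrainerCallback):
    """Runs the official offline evaluator at the end of every epoch."""

    def on_epoch(
        self,
        trainer,
        metrics: Dict[str, Any],
        epoch: int,
        is_primary: bool = True,
        **kwargs: Any,
    ) -> None:
        if not is_primary:
            return
        # "dev_predictions.txt" is a placeholder path.
        score = run_official_evaluator(trainer.model, "dev_predictions.txt")
        # Expose the external score alongside AllenNLP's own metrics.
        metrics["external_score"] = score
```

Even with a hook like this, though, it is not clear to me that the injected score could actually drive early stopping, since the `validation_metric` the trainer tracks comes from the model's own `get_metrics` rather than from callbacks, which is exactly why I am asking.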