Evaluate multiple saved models when the parameter file name is different from best.th

Hi,

When training a model, we can keep multiple versions of its parameters.
For example, you can save a checkpoint after each epoch. The best-performing checkpoint is also saved to the “best.th” file.

Suppose I want to evaluate different model checkpoints on a test set (I know this might be bad practice, but still). AFAIK, when using the evaluate command, you must specify a path to the model directory, and then only the “best.th” version is loaded. Is there a way to load different checkpoints other than renaming them to “best.th”?

Also, is there a way to evaluate many saved parameter files one after another?

Hi @saboof, I would like to caution you against doing this. Have you already split your dataset into train, dev and test partitions? If you have, allennlp train will use the dev set to avoid overfitting on train. This is how we select the best model. It would be rather surprising for another weight file to perform significantly better on test. Further, even if one somehow did, your results on test wouldn’t necessarily transfer to another test set drawn from the same distribution. Could you elaborate on why you want to do this?


Hi @brendan,

Thanks for the quick reply.
Yes, I’ve already split the data into train, dev and test.
I study the few-shot learning setup, and in this scenario the learning curve on the dev set is often volatile: it changes rapidly between epochs.
To better understand this phenomenon, I generated two dev sets to avoid peeking at the test set. Now I would like to see whether the performance curves on the two dev sets are correlated.

I hope this is clearer now.

Hi @saboof,

If you look at the code for our evaluate command, you’ll see it’s not actually very much code. I would recommend copying that code into your own script, then changing this line, which lets you set a particular weight file when loading a model.
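In case it helps, here is roughly what that change looks like in a standalone script (a minimal sketch, assuming an AllenNLP version where load_archive accepts a weights_file argument; the paths are placeholders):

```python
from allennlp.models.archival import load_archive

# Load the trained config and model, but point at a specific checkpoint
# instead of the default best.th packaged inside the archive.
archive = load_archive(
    "serialization_dir/model.tar.gz",                         # placeholder path
    weights_file="serialization_dir/model_state_epoch_7.th",  # placeholder checkpoint
    cuda_device=-1,
)
model = archive.model
model.eval()
# From here you can follow the rest of the evaluate command: build the dataset
# reader and data loader from archive.config and run the evaluation loop.
```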

You could also just use the --weights-file argument to evaluate, but for your use case, I’d imagine you’d be better off writing a script that loops over the checkpoints, compares the results, and so on.
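If it’s useful, a loop like this is roughly what I have in mind (just a sketch that shells out to allennlp evaluate for each checkpoint; the directory layout, the epoch-checkpoint file names, and the --output-file flag are assumptions to check against your version and setup):

```python
import glob
import json
import subprocess

serialization_dir = "serialization_dir"        # placeholder: your training output dir
archive = f"{serialization_dir}/model.tar.gz"
eval_data = "data/dev2.jsonl"                  # placeholder: e.g. your second dev set

results = {}
# AllenNLP typically writes one weight file per epoch, e.g. model_state_epoch_3.th.
for weights in sorted(glob.glob(f"{serialization_dir}/model_state_epoch_*.th")):
    metrics_file = weights.replace(".th", ".metrics.json")
    subprocess.run(
        [
            "allennlp", "evaluate", archive, eval_data,
            "--weights-file", weights,
            "--output-file", metrics_file,     # assumed flag; check allennlp evaluate --help
        ],
        check=True,
    )
    with open(metrics_file) as f:
        results[weights] = json.load(f)

# Compare the checkpoints, e.g. print their metrics side by side.
for weights, metrics in sorted(results.items()):
    print(weights, metrics)
```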