I was training KnowBert following the knowbert_wordnet.jsonnet configuration file from the GitHub repository, which trains for a single epoch with unlimited iterations. I intended to stop training after 500,000 iterations, but my computer crashed after roughly 300,000. I reran the train command with the -r flag to recover the model and continue training, but since there was no completed epoch to restore, the command instead produced a model.tar.gz archive containing the best weights so far.
I then ran train again, this time specifying the newly created model.tar.gz as the model archive. Training did resume, but the metrics (such as NSP accuracy and loss) are now far off from where they were when the crash happened. It has been an hour since I restarted, and nothing has improved: the metric values I see are similar to those from the very beginning of the original run, as if training had started over from scratch.
Is it actually possible to recover a run that trains for one epoch with unlimited iterations? What would be the best approach in this situation?