How to set model_size in NoamLR for pre-trained transformers?

Hi, I am interested in using the NoamLR scheduler with a pre-trained transformer, but I am not exactly sure how to set the required model_size param.

My guesses are:

  • the size of the model’s hidden dimension (hidden_size) e.g. 768 for BERT-base (and other “base” transformers)
  • the size of the intermediate feed-forward layer (intermediate_size), e.g. 3072 for BERT-base (and other “base” transformers)

but I can’t convince myself which (if either) is correct. Does anyone happen to know? Thanks in advance!
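For context, the schedule NoamLR implements comes from “Attention Is All You Need” (Vaswani et al., 2017), where the learning rate is scaled by d_model ** -0.5 (d_model being the model’s hidden dimension). A minimal sketch of that formula, independent of any particular library (the function name and factor argument here are my own, not part of NoamLR’s API):

```python
def noam_lr(step: int, model_size: int, warmup_steps: int, factor: float = 1.0) -> float:
    """Noam learning-rate schedule from 'Attention Is All You Need':
    lr = factor * model_size^(-0.5) * min(step^(-0.5), step * warmup_steps^(-1.5)).
    Rises linearly for warmup_steps, then decays as step^(-0.5)."""
    return factor * model_size ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)

# With model_size=768 (BERT-base hidden size) the peak lr at the end of warmup
# is 1 / sqrt(768 * warmup_steps); a larger model_size uniformly lowers the lr.
for step in (1, 2000, 4000, 8000):
    print(step, noam_lr(step, model_size=768, warmup_steps=4000))
```

Note that the schedule peaks exactly at step == warmup_steps, and the model_size term is a constant multiplier on the whole curve.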