Hi, I am interested in using the NoamLR scheduler with a pre-trained transformer, but I am not exactly sure how to set the required model-size parameter.
My guesses are:
- the size of the model’s hidden dimension (`hidden_size`), e.g. 768 for BERT-base (and other “base” transformers)
- the size of the feed-forward intermediate layer (`intermediate_size`), e.g. 3072 for BERT-base (and other “base” transformers)
but I can’t convince myself which (if either) is correct. Does anyone happen to know? Thanks in advance!
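For context, my understanding is that the schedule comes from “Attention Is All You Need,” where the learning rate is `d_model**-0.5 * min(step**-0.5, step * warmup_steps**-1.5)`. Here is a minimal sketch of that formula (the `768` below is just my guess that the model size should be `hidden_size`, which is exactly what I’m asking about):

```python
def noam_lr(step: int, model_size: int, warmup_steps: int = 4000, factor: float = 1.0) -> float:
    """Noam schedule from 'Attention Is All You Need':
    lr = factor * model_size**-0.5 * min(step**-0.5, step * warmup_steps**-1.5)
    The LR rises linearly for `warmup_steps` steps, then decays as step**-0.5.
    """
    step = max(step, 1)  # avoid division by zero at step 0
    return factor * model_size ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)

# Assuming model_size == hidden_size (768 for BERT-base) -- unverified, hence my question:
print(noam_lr(step=100, model_size=768))
```

If `model_size` here really is `d_model`, then `hidden_size` seems like the natural match, but I’d appreciate confirmation.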