Hi!

I’ve tried to replace the stacked-BiLSTM encoder of the graph-parser model (actually a slightly different model based on it) with a BERT encoder.

I’m using the `pretrained_transformer_mismatched` indexer and embedder, and the `pass_through` encoder. After a few iterations, I get NaN while computing the loss (`BCEWithLogitsLoss`).

It seems like `PretrainedTransformerMismatchedEmbedder` returns some NaN values in the embeddings of the given tokens (and hence NaN logits and a NaN loss). Any idea how to debug/fix this? I’m already using `grad_norm`…
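For reference, this is roughly how I’m checking which module first emits NaNs — a generic PyTorch forward-hook sketch (the toy model below is just for illustration, not the actual parser):

```python
import torch
import torch.nn as nn

def find_nan_modules(model: nn.Module) -> list:
    """Register forward hooks on every submodule and record the names of
    modules whose output contains NaNs. Returns the (live) list of names."""
    nan_modules = []

    def make_hook(name):
        def hook(module, inputs, output):
            outs = output if isinstance(output, tuple) else (output,)
            for t in outs:
                if isinstance(t, torch.Tensor) and torch.isnan(t).any():
                    nan_modules.append(name)
        return hook

    for name, module in model.named_modules():
        if name:  # skip the root module itself
            module.register_forward_hook(make_hook(name))
    return nan_modules

# Toy demo: feed a NaN input through a small model; both layers report NaNs.
model = nn.Sequential(nn.Linear(4, 4), nn.ReLU())
nans = find_nan_modules(model)
model(torch.tensor([[1.0, float("nan"), 0.0, 2.0]]))
print(nans)  # names of the offending submodules, in forward order
```

Alternatively, `torch.autograd.set_detect_anomaly(True)` can localize NaNs that first appear in the backward pass.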

Here are the relevant changes in the configuration (the BASE is very similar to this one):

```
local BASE = import 'biaffine-graph-parser.jsonnet';
local bert_model = "bert-base-uncased";
local max_length = 128;
local bert_dim = 768;
BASE + {
  "dataset_reader"+: {
    "token_indexers": {
      "tokens": {
        "type": "pretrained_transformer_mismatched",
        "model_name": bert_model,
        "max_length": max_length
      },
    },
  },
  "model"+: {
    "text_field_embedder": {
      "token_embedders": {
        "tokens": {
          "type": "pretrained_transformer_mismatched",
          "model_name": bert_model,
          "max_length": max_length
        }
      }
    },
    "pos_tag_embedding"+: {
      "sparse": false  # huggingface_adamw cannot work with sparse embeddings
    },
    // "pos_tag_embedding": null,
    "encoder": {
      "type": "pass_through",
      "input_dim": bert_dim + $.model.pos_tag_embedding.embedding_dim
    },
  },
  "trainer"+: {
    "grad_norm": 1.0,
    "optimizer": {
      "type": "huggingface_adamw",
      "lr": 1e-3,
      "weight_decay": 0.01,
      "parameter_groups": [
        [[".*transformer.*"], {"lr": 1e-5}]
      ]
    }
  }
}
```

Thanks!