Is it Possible to Combine Representations from Pre-trained Transformer and Token Characters?

I have recently created a new AllenNLP environment from master. I have some old configuration files where the token_indexers are a combination of bert and token_characters similar to this older NER example.

I’ve updated to the new usage of pretrained_transformer_mismatched in place of the bert-specific indexing:

 "token_indexers": {
    "tokens": {
      "type": "pretrained_transformer_mismatched",
      "model_name": "bert-base-uncased"
    },

and have kept the old configuration for the token characters:

"token_characters": {
      "type": "characters",
      "min_padding_length": 1
    }

I have looked through all of the example configuration files, and tokens and token_characters are only ever used together when tokens is of type embedding. Whenever pretrained_transformer_mismatched is used, it is never combined with regular character embeddings, so I have not been able to find a correct use case. Is it possible to do this, as in the older NER example above? The main reason is that I want my model to operate at a lower level than word-piece units. I have included my current configuration below, where I have naively merged the behaviour of configurations that use i) representations from transformers and ii) regular token and character embeddings. Perhaps I am missing something obvious? The current configuration results in RuntimeError: CUDA error: device-side assert triggered, but if I remove all of the token-character code my model trains fine.

I can post further detail if necessary, but my main question is: how can we combine representations from a pre-trained transformer with regular character embeddings? Or is it not necessary to do this? Thanks

local bert_embedding_dim = 768;
local char_embedding_dim = 64;
local tag_embedding_dim = 50;
local tag_combined_dim = 150;
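// char_embedding_dim is counted twice in embedding_dim below because the
// bidirectional character LSTM encoder outputs 2 * hidden_size.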
local embedding_dim = bert_embedding_dim + tag_combined_dim + char_embedding_dim + char_embedding_dim;
//local embedding_dim = bert_embedding_dim + tag_combined_dim;
local hidden_dim = 600;
local num_epochs = 75;
local patience = 10;
local learning_rate = 0.001;
local dropout = 0.5;
local input_dropout = 0.5;
local recurrent_dropout_probability = 0.5;

{
  "random_seed": std.parseInt(std.extVar("RANDOM_SEED")),
  "pytorch_seed": std.parseInt(std.extVar("PYTORCH_SEED")),
  "numpy_seed": std.parseInt(std.extVar("NUMPY_SEED")),
  "dataset_reader": {
    "type": "universal_dependencies_enhanced",
    "token_indexers": {
      "tokens": {
        "type": "pretrained_transformer_mismatched",
        "model_name": "bert-base-uncased"
      },
      // can we still combine transformer and character representations?
      "token_characters": {
        "type": "characters",
        "min_padding_length": 1
      }
    }
  },
  "train_data_path": std.extVar("TRAIN_DATA_PATH"),
  "validation_data_path": std.extVar("DEV_DATA_PATH"),
  //"test_data_path": std.extVar("TEST_DATA_PATH"),
  "model": {
    "type": "enhanced_dm_parser",
    "text_field_embedder": {
      "token_embedders": {
        "tokens": {
          "type": "pretrained_transformer_mismatched",
          "model_name": "bert-base-uncased"
        },
        "token_characters": {
          "type": "character_encoding",
          "embedding": {
            "embedding_dim": char_embedding_dim
          },
          "encoder": {
            "type": "lstm",
            "input_size": char_embedding_dim,
            "hidden_size": char_embedding_dim,
            "num_layers": 2,
            "bidirectional": true
          }
        }
      }
    },
    "lemma_tag_embedding": {
      "embedding_dim": tag_embedding_dim,
      "vocab_namespace": "lemmas",
      "sparse": true
    },
    "upos_tag_embedding": {
      "embedding_dim": tag_embedding_dim,
      "vocab_namespace": "upos",
      "sparse": true
    },
    "xpos_tag_embedding": {
      "embedding_dim": tag_embedding_dim,
      "vocab_namespace": "xpos",
      "sparse": true
    },
    "encoder": {
      "type": "stacked_bidirectional_lstm",
      "input_size": embedding_dim,
      "hidden_size": hidden_dim,
      "num_layers": 3,
      "recurrent_dropout_probability": 0.5,
      "use_highway": true
    },
    "arc_representation_dim": 500,
    "tag_representation_dim": 100,
    "dropout": 0.33,
    "input_dropout": 0.33
  },
  "data_loader": {
    "batch_sampler": {
      "type": "bucket",
      "sorting_keys": ["tokens"],
      "batch_size": std.parseInt(std.extVar("BATCH_SIZE"))
    }
  },
  "evaluate_on_test": false,
  "trainer": {
    "num_epochs": std.parseInt(std.extVar("NUM_EPOCHS")),
    "grad_norm": 5.0,
    "patience": 10,
    "cuda_device": std.parseInt(std.extVar("CUDA_DEVICE")),
    "validation_metric": "+labeled_f1",
    "num_gradient_accumulation_steps": std.parseInt(std.extVar("GRAD_ACCUM_BATCH_SIZE")),
    "optimizer": {
      "type": "dense_sparse_adam",
      "betas": [0.9, 0.9]
    }
  }
}

Yes, this should work just fine (at least in the code; I have no intuition for whether this is a good modeling idea or not). There is one issue with your configuration, where you need to pass the vocabulary namespace to the token characters embedding; see the third bullet under “config file changes” here. If you still get an error after fixing that, can you open an issue with more details on github? As I said, this is supposed to just work, and if it doesn’t, there’s a bug we need to fix.

Thanks Matt! That worked; I just needed to add "vocab_namespace": "token_characters" as below:

{"type": "character_encoding", "embedding": {"embedding_dim": 25, "vocab_namespace": "token_characters"}}
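For completeness, this is what the corrected token_characters embedder looks like in my config above; the only change from the failing version is the added vocab_namespace:

"token_characters": {
  "type": "character_encoding",
  "embedding": {
    "embedding_dim": char_embedding_dim,
    "vocab_namespace": "token_characters"
  },
  "encoder": {
    "type": "lstm",
    "input_size": char_embedding_dim,
    "hidden_size": char_embedding_dim,
    "num_layers": 2,
    "bidirectional": true
  }
}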

May I also ask what the model_name should be if we are using our own BERT model, e.g. a model downloaded from HuggingFace that has been trained for extra steps?

In the old config, this was as below for the token_indexer and token_embedder respectively:

// N.B. old usage:
"pretrained_model": std.extVar("BERT_VOCAB")
"pretrained_model": std.extVar("BERT_WEIGHTS")

Here is the config file I used, in case it is helpful for anyone (I copied the dependency parser config so that any failure wouldn't be due to a bug in my own model):

// copies dependency_parser.jsonnet but includes transformer + chars
    local transformer_model = "bert-base-uncased";
    local embedding_dim = 768 + 100 + 16 + 16;
    {
        "dataset_reader":{
            "type":"universal_dependencies_copy", // copy is copied file from allennlp-models registered with altered name
            "token_indexers": {
            "tokens": {
              "type": "pretrained_transformer_mismatched",
              "model_name": transformer_model
            },
            "token_characters": {
              "type": "characters",
              "min_padding_length": 1
            }
          }
        },
        "train_data_path": std.extVar("TRAIN_DATA_PATH"),
        "validation_data_path": std.extVar("DEV_DATA_PATH"),
        "model": {
          "type": "biaffine_parser_copy", // copy is copied file from allennlp-models registered with altered name
          "text_field_embedder": {
            "token_embedders": {
              "tokens": {
                "type": "pretrained_transformer_mismatched",
                "model_name": transformer_model
              },
              "token_characters": {
                "type": "character_encoding",
                "embedding": {
                "embedding_dim": 16,
                "vocab_namespace": "token_characters"
                },
                "encoder": {
                "type": "lstm",
                "input_size": 16,
                "hidden_size": 16,
                "num_layers": 1,
                "bidirectional": true
                }
              }
            }
          },
          "pos_tag_embedding":{
            "embedding_dim": 100,
            "vocab_namespace": "pos",
            "sparse": true
          },
          "encoder": {
            "type": "stacked_bidirectional_lstm",
            "input_size": embedding_dim,
            "hidden_size": 400,
            "num_layers": 3,
            "recurrent_dropout_probability": 0.3,
            "use_highway": true
          },
          "use_mst_decoding_for_validation": true,
          "arc_representation_dim": 500,
          "tag_representation_dim": 100,
          "dropout": 0.3,
          "input_dropout": 0.3,
          "initializer": {
            "regexes": [
              [".*projection.*weight", {"type": "xavier_uniform"}],
              [".*projection.*bias", {"type": "zero"}],
              [".*tag_bilinear.*weight", {"type": "xavier_uniform"}],
              [".*tag_bilinear.*bias", {"type": "zero"}],
              [".*weight_ih.*", {"type": "xavier_uniform"}],
              [".*weight_hh.*", {"type": "orthogonal"}],
              [".*bias_ih.*", {"type": "zero"}],
              [".*bias_hh.*", {"type": "lstm_hidden_bias"}]
            ]
          }
        },
        "data_loader": {
          "batch_sampler": {
            "type": "bucket",
            "batch_size" : 8
          }
        },
        "trainer": {
          "num_epochs": 50,
          "grad_norm": 5.0,
          "patience": 50,
          "cuda_device": 0,
          "validation_metric": "+LAS",
          "optimizer": {
            "type": "dense_sparse_adam",
            "betas": [0.9, 0.9]
          }
        }
    }

I don’t know the answer to that, other than that we use huggingface’s AutoModel and AutoTokenizer to instantiate things. Whatever you would pass to those methods for instantiating your model should work for our code, too.

If it’s not possible to use your own local model with huggingface’s AutoModel and AutoTokenizer, then maybe we should consider another way of getting the model into our code. But it might be better to just contribute something upstream to huggingface to make that work.
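Concretely, since model_name is just what gets handed to those from_pretrained calls, something like the following should work, assuming huggingface accepts a local directory there (the path below is hypothetical):

// hypothetical local directory containing the usual huggingface files
// (config.json, pytorch_model.bin, tokenizer/vocab files)
local transformer_model = "/path/to/my-further-trained-bert";

// in the dataset_reader:
"token_indexers": {
  "tokens": {
    "type": "pretrained_transformer_mismatched",
    "model_name": transformer_model
  }
},

// and in the model's text_field_embedder:
"token_embedders": {
  "tokens": {
    "type": "pretrained_transformer_mismatched",
    "model_name": transformer_model
  }
}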


Thanks, it turns out most of the models I had stored locally were just ones I had downloaded from huggingface and converted to weights.tar.gz so I could use them in AllenNLP. The way things are now makes it really easy because it obviates the need for that step. I also agree that an upstream change would be better if locally-trained models are not currently usable with huggingface's tools, but I can't confirm whether that's the case yet.