Customization in Using Word Embeddings

I’m using GloVe embeddings, and before moving to AllenNLP I handled them as follows:

  1. If a word is not present in the GloVe embeddings but its lowercased form is, I still use that lowercased form’s vector to initialize the word’s embedding.
  2. During evaluation (on the development set), if a word is not in the training vocabulary but has a representation in the embedding file (GloVe), I still use that embedding as the word’s representation.

Is it possible for me to do this in AllenNLP? I don’t mind modifying any part of the code.

Is your intention in point 1 specifically to keep case distinctions when they exist, but fall back on the uncased form where the cased token doesn’t have a match in the embedding vocabulary? If just using an uncased model isn’t enough, you could subclass your tokenizer and conditionally downcase any token that isn’t in your GloVe vocabulary set. There may be a more elegant way to get this vocabulary, but I imagine you could pass the embedding file path as a parameter to the tokenizer and build a set when the tokenizer is initialized.
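
To make that concrete, here is a rough sketch of the kind of tokenizer I have in mind (the class name, the registration key, the embedding_path parameter, and the WordTokenizer fallback are all just illustrative; it assumes a plain-text GloVe file whose first whitespace-separated field on each line is the token):

from typing import List

from allennlp.data.tokenizers import Token, Tokenizer, WordTokenizer


@Tokenizer.register("glove_aware")
class GloveAwareTokenizer(Tokenizer):
    """Delegates to a base tokenizer, then downcases any token missing from the GloVe vocabulary."""

    def __init__(self, embedding_path: str, base_tokenizer: Tokenizer = None) -> None:
        self._base_tokenizer = base_tokenizer or WordTokenizer()
        # Collect the surface forms present in the embedding file (first field of each line).
        with open(embedding_path, "r", encoding="utf-8") as embedding_file:
            self._glove_vocab = {line.split(" ", 1)[0] for line in embedding_file}

    def tokenize(self, text: str) -> List[Token]:
        tokens = self._base_tokenizer.tokenize(text)
        # Keep the cased form when GloVe knows it; otherwise fall back to the lowercased form.
        return [token if token.text in self._glove_vocab else Token(token.text.lower())
                for token in tokens]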

For point 2, my understanding is that including your dev and test data in your configuration file will make your vocabulary automatically include all of those tokens, so they get mapped to their original GloVe embeddings during evaluation. I could be wrong, though. I’ll stay tuned and hope somebody can confirm or correct this.

From my reading of the code (https://github.com/allenai/allennlp/blob/master/allennlp/training/trainer_pieces.py#L46), @Kevin_H is correct that including the dev and test data in your config file (with the keys “validation_data_path” and “test_data_path”) will add the tokens they contain to your vocabulary. But I strongly recommend that you verify this yourself by looking at the vocabulary directory in your serialization directory. I’ve been bitten by vocab bugs in the past, so I suggest being paranoid here.
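
For example, beyond just opening the tokens.txt file in that directory, a quick check along these lines (just a sketch; adjust the path and namespace to your setup) will tell you whether a token that only appears in your dev or test data made it into the saved vocabulary:

from allennlp.data import Vocabulary

# The trainer writes the vocabulary it actually used to <serialization_dir>/vocabulary.
vocab = Vocabulary.from_files("/path/to/serialization_dir/vocabulary")
token_to_index = vocab.get_token_to_index_vocabulary(namespace="tokens")

# Pick a token that occurs only in your dev/test data and check that it got its own id.
print("dev-only token present:", "SomeDevOnlyToken" in token_to_index)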

EDIT: I originally replied via email and Discourse choked on the reply.

@Kevin_H Thanks. For the first point, my input is actually already tokenized. But I’m not quite sure what you mean by extending the GloVe vocabulary.

My suggestion was to check whether the cased version of the token is in the embedding vocabulary and, if it isn’t, downcase it. A custom token_indexer seems like a promising place to start. You could expand on the logic here to cover your case: https://github.com/allenai/allennlp/blob/master/allennlp/data/token_indexers/single_id_token_indexer.py#L69
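
Something like the sketch below is what I mean (the class name, the registration key, and the embedding_path parameter are made up, and the method signatures follow the 0.x code linked above):

from typing import Dict, List

from allennlp.data import Vocabulary
from allennlp.data.token_indexers import SingleIdTokenIndexer, TokenIndexer
from allennlp.data.tokenizers import Token


@TokenIndexer.register("glove_fallback_single_id")
class GloveFallbackTokenIndexer(SingleIdTokenIndexer):
    """Indexes a token by its cased form when GloVe has it, and by its lowercased form otherwise."""

    def __init__(self, embedding_path: str, namespace: str = "tokens") -> None:
        super().__init__(namespace=namespace)
        # Collect the surface forms present in the embedding file (first field of each line).
        with open(embedding_path, "r", encoding="utf-8") as embedding_file:
            self._glove_vocab = {line.split(" ", 1)[0] for line in embedding_file}

    def _normalize(self, text: str) -> str:
        return text if text in self._glove_vocab else text.lower()

    def count_vocab_items(self, token: Token, counter: Dict[str, Dict[str, int]]) -> None:
        counter[self.namespace][self._normalize(token.text)] += 1

    def tokens_to_indices(self,
                          tokens: List[Token],
                          vocabulary: Vocabulary,
                          index_name: str) -> Dict[str, List[int]]:
        indices = [vocabulary.get_token_index(self._normalize(token.text), self.namespace)
                   for token in tokens]
        return {index_name: indices}

If you register it like that, you can refer to it from your config with "type": "glove_fallback_single_id" under your dataset reader’s token_indexers.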

Thanks a lot. I’m actually using GloVe embeddings whose pretrained vocabulary is entirely lowercase.

I understand that a customized indexer can help me extend our own vocabulary. But am I right that if the embedder meets a token like “Word”, it will still just get a random embedding, because GloVe’s vocabulary is all lowercased?

Sure, I can downcase “Word” to “word”. But I want “Word” to remain a separate trainable entry, so that although “Word” and “word” are initialized with the same vector, they can be fine-tuned into different vectors. I hope this is clear.

I think I solved this by adding a new token_embedder; most of the code is the same as the existing Embedding token_embedder.

I changed the following code

if token in embeddings:
    embedding_matrix[i] = torch.FloatTensor(embeddings[token])
    num_tokens_found += 1
else:
    logger.debug(
        "Token %s was not found in the embedding file. Initialising randomly.", token
    )

to

if token in embeddings:
    embedding_matrix[i] = torch.FloatTensor(embeddings[token])
    num_tokens_found += 1
elif token.lower() in embeddings:
    # Fall back to the lowercased form's pretrained vector; the cased token
    # keeps its own row, so it can still be fine-tuned separately.
    embedding_matrix[i] = torch.FloatTensor(embeddings[token.lower()])
    num_tokens_found += 1
else:
    logger.debug(
        "Token %s was not found in the embedding file. Initialising randomly.", token
    )

Ah, sorry I hadn’t quite understood what you were going for earlier. This makes sense, though.
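
One more thought: an alternative to copying most of the Embedding code is to build the weight matrix yourself, with the same lowercase fallback, and hand it to the stock Embedding through its weight argument. A rough sketch (the helper name is mine, it assumes a plain-text GloVe file, and it uses a plain normal(0, 1) initialisation for unmatched tokens rather than AllenNLP's mean/std-matched one):

import torch

from allennlp.data import Vocabulary


def build_weight_with_lowercase_fallback(embedding_path: str,
                                         vocab: Vocabulary,
                                         embedding_dim: int,
                                         namespace: str = "tokens") -> torch.FloatTensor:
    # Read the plain-text GloVe file into a dict: token -> vector.
    embeddings = {}
    with open(embedding_path, "r", encoding="utf-8") as embedding_file:
        for line in embedding_file:
            fields = line.rstrip().split(" ")
            if len(fields) == embedding_dim + 1:
                embeddings[fields[0]] = [float(x) for x in fields[1:]]

    vocab_size = vocab.get_vocab_size(namespace)
    # Random init for tokens with no cased or lowercased match in the file.
    weight = torch.FloatTensor(vocab_size, embedding_dim).normal_(0.0, 1.0)
    for index, token in vocab.get_index_to_token_vocabulary(namespace).items():
        if token in embeddings:
            weight[index] = torch.FloatTensor(embeddings[token])
        elif token.lower() in embeddings:
            # Same initial vector as the lowercased form, but a separate row,
            # so "Word" and "word" can still be fine-tuned into different vectors.
            weight[index] = torch.FloatTensor(embeddings[token.lower()])
    return weight

You can then pass the result to allennlp.modules.token_embedders.Embedding via its weight argument with trainable=True, so the cased duplicates stay trainable, and leave the rest of the model unchanged.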