Stuck in Embedding Tokens

I am trying to learn from and re-implement the name classification tutorial on the PyTorch website using AllenNLP. However, I'm stuck in the embedding process.

The data consists of names with language labels, such as Daher:Arabic, Abraham:French, and so on. I want to tokenize each name into characters and pass them to an LSTM, as in the following code:

import glob
import os
import string
import unicodedata
from typing import Dict, Iterable

import torch
import torch.optim as optim

from allennlp.common import Params
from allennlp.data import Instance, Token, Vocabulary
from allennlp.data.dataset_readers import DatasetReader
from allennlp.data.fields import LabelField, TextField
from allennlp.data.iterators import BucketIterator
from allennlp.data.token_indexers import TokenCharactersIndexer
from allennlp.data.tokenizers import CharacterTokenizer
from allennlp.models import Model
from allennlp.modules.seq2seq_encoders import PytorchSeq2SeqWrapper
from allennlp.modules.text_field_embedders import BasicTextFieldEmbedder
from allennlp.modules.token_embedders import BagOfWordCountsTokenEmbedder
from allennlp.nn.util import get_text_field_mask
from allennlp.training.trainer import Trainer


def unicode_to_ascii(s):
    return ''.join(c for c in unicodedata.normalize('NFD', s)
                   if unicodedata.category(c) != 'Mn' and c in string.ascii_letters + ".,;'")


def read_lines(filename):
    lines = open(filename, encoding='utf-8').read().strip().split('\n')
    return [unicode_to_ascii(line) for line in lines]


class NamesDatasetReader(DatasetReader):

    def _read(self, file_path: str) -> Iterable[Instance]:
        for filename in glob.glob(file_path):
            category = os.path.splitext(os.path.basename(filename))[0]
            lines = read_lines(filename)
            for line in lines:
                yield self.text_to_instance(line, filename, category)

    def text_to_instance(self, line, filename, category) -> Instance:
        token_char_indexer = TokenCharactersIndexer('token_character',
                                                    CharacterTokenizer(lowercase_characters=False))
        text_field = TextField([Token(character) for character in line],
                               {'token_character_indexer': token_char_indexer})
        label_field = LabelField(filename)
        fields = {'name': text_field, 'language': label_field}
        return Instance(fields)


class RNNClassifier(Model):

    def __init__(self, vocab: Vocabulary, name_embedding, encoder) -> None:
        super().__init__(vocab)
        self.name_embedding = name_embedding
        self.encoder = encoder
        self.hidden_to_output = torch.nn.Linear(in_features=self.encoder.get_output_dim(),
                                                out_features=self.vocab.get_vocab_size('language'))
        self.criterion = torch.nn.NLLLoss()
        self.log_softmax = torch.nn.LogSoftmax(dim=1)

    def forward(self, name, language) -> Dict[str, torch.Tensor]:
        mask = get_text_field_mask(name)
        embeddings = self.name_embedding(name)
        encoder_out = self.encoder(torch.cat(embeddings), mask)
        output = self.hidden_to_output(encoder_out)

        output = {'language_dist': self.log_softmax(output)}
        output['loss'] = self.criterion(output['language_dist'], language)
        return output


reader = NamesDatasetReader()
dataset = reader.read('data/names/*.txt')
vocab = Vocabulary.from_instances(dataset)
bow_embedding = BagOfWordCountsTokenEmbedder.from_params(vocab=vocab,
                                                         params=Params({
                                                             "ignore_oov": True,
                                                             'vocab_namespace': 'token_character'}))
name_embedding = BasicTextFieldEmbedder({'token_character_indexer': bow_embedding})
EMBEDDING_DIM = bow_embedding.get_output_dim()
print(EMBEDDING_DIM)
HIDDEN_DIM = 128

encoder = PytorchSeq2SeqWrapper(torch.nn.LSTM(EMBEDDING_DIM, HIDDEN_DIM, batch_first=True))
model = RNNClassifier(vocab=vocab, name_embedding=name_embedding, encoder=encoder)

iterator = BucketIterator(batch_size=2, sorting_keys=[("name", "num_tokens")])
iterator.index_with(vocab)
optimizer = optim.SGD(model.parameters(), lr=0.1)
trainer = Trainer(model=model, optimizer=optimizer, train_dataset=dataset, iterator=iterator)
trainer.train()

I got this error at line 72 of bag_of_word_counts_token_embedder.py:

RuntimeError: The size of tensor a (2) must match the size of tensor b (6) at non-singleton dimension 1

Where did I go wrong?

Thanks

I’m not sure on this without trying it out myself, but when I see errors related to tensors with size 2, I suspect vocabulary namespace problems. By default, a vocabulary namespace includes a padding and unknown token, so if nothing is being loaded into the namespace, it will have size 2. Could you try setting the namespace of your TextField in the dataset_reader to match the ‘token_character’ namespace of the BOW embedder? If that doesn’t settle it, you could try throwing a print statement in your forward method to see what your name and embeddings tensors look like. You can also print out your vocabulary statistics (https://github.com/allenai/allennlp/blob/master/allennlp/data/vocabulary.py#L714) at some point to see if it looks like you expect.
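
Something like this (untested, reusing the vocab built in your script) would cover both checks:

print(vocab.get_vocab_size('token_character'))  # 2 would mean only padding and OOV made it in
vocab.print_statistics()  # per-namespace statistics, if the counter was retained

# Inside forward(), to see what the embedder receives and produces:
# print({key: tensor.shape for key, tensor in name.items()})
# print(embeddings.shape)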

Thanks for the reply. But the size 2 is actually the batch size. I have checked the vocabulary, and it looks fine. I'm afraid I've taken the wrong approach to the embedding process, but I don't know where to look.

Are you able to post any more of the trace context? It could help to see exactly where that error is coming from.

Sure. Here it is; I changed the batch size to 3:

Traceback (most recent call last):
  File "../pycharm-2019.1.1/helpers/pydev/pydevd.py", line 1415, in _exec
    pydev_imports.execfile(file, globals, locals)  # execute the script
  File "../pycharm-2019.1.1/helpers/pydev/_pydev_imps/_pydev_execfile.py", line 18, in execfile
    exec(compile(contents+"\n", file, 'exec'), glob, loc)
  File "../run.py", line 98, in <module>
    trainer.train()
  File "../venv/lib/python3.7/site-packages/allennlp/training/trainer.py", line 478, in train
    train_metrics = self._train_epoch(epoch)
  File "../venv/lib/python3.7/site-packages/allennlp/training/trainer.py", line 320, in _train_epoch
    loss = self.batch_loss(batch_group, for_training=True)
  File "../venv/lib/python3.7/site-packages/allennlp/training/trainer.py", line 261, in batch_loss
    output_dict = self.model(**batch)
  File "../venv/lib/python3.7/site-packages/torch/nn/modules/module.py", line 541, in __call__
    result = self.forward(*input, **kwargs)
  File "../run.py", line 69, in forward
    embeddings = self.name_embedding(name)
  File "../venv/lib/python3.7/site-packages/torch/nn/modules/module.py", line 541, in __call__
    result = self.forward(*input, **kwargs)
  File "../venv/lib/python3.7/site-packages/allennlp/modules/text_field_embedders/basic_text_field_embedder.py", line 131, in forward
    token_vectors = embedder(*tensors, **forward_params_values)
  File "../venv/lib/python3.7/site-packages/torch/nn/modules/module.py", line 541, in __call__
    result = self.forward(*input, **kwargs)
  File "../venv/lib/python3.7/site-packages/allennlp/modules/token_embedders/bag_of_word_counts_token_embedder.py", line 72, in forward
    mask *= (inputs != self._oov_idx).long()
RuntimeError: The size of tensor a (3) must match the size of tensor b (7) at non-singleton dimension 1

Two issues:

  1. The TokenCharactersIndexer is intended for when you want to model words by their sequence of characters. The data type you get out of it is not one id per character; it's one sequence of character ids per token. But you're passing in individual characters as tokens, so you're not getting what you expect. I'd recommend either using a regular tokenizer (instead of taking each character in the line as your tokens), or using a SingleIdTokenIndexer, which will give you one id per character, as it looks like you're expecting.
  2. I think you just want a simple Embedding TokenEmbedder instead of a BagOfWordCounts embedding. The word-counts one gives you something very different from a single vector for every character, which I believe is what you're looking for (see the sketch below).
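
Something along these lines is what I have in mind. It's untested and just a sketch: it reuses line, vocab, and the 'token_character' namespace from your script, while the 'tokens' key and the EMBEDDING_DIM value are arbitrary choices for illustration.

from allennlp.data.token_indexers import SingleIdTokenIndexer
from allennlp.modules.text_field_embedders import BasicTextFieldEmbedder
from allennlp.modules.token_embedders import Embedding

# In text_to_instance: one id per character, stored in the 'token_character' namespace.
char_indexer = SingleIdTokenIndexer(namespace='token_character')
text_field = TextField([Token(character) for character in line],
                       {'tokens': char_indexer})

# After building the vocabulary: one learned vector per character id.
EMBEDDING_DIM = 64  # arbitrary size for illustration
char_embedding = Embedding(num_embeddings=vocab.get_vocab_size('token_character'),
                           embedding_dim=EMBEDDING_DIM)
name_embedding = BasicTextFieldEmbedder({'tokens': char_embedding})

The key you use in the TextField's indexer dict ('tokens' here) just has to match the key you give BasicTextFieldEmbedder.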