Bert Seq2Seq via config file - tying bert_indexer and target

I am trying to get a simple_seq2seq model to work with BERT as the encoder. However, I am unable at the moment to figure out the right way to tie the source and target embeddings (the source and target are in the same language).
I get this error for the config file below:

File "/nlp/git/allennlp/allennlp/models/encoder_decoders/simple_seq2seq.py", line 218, in forward
    output_dict = self._forward_loop(state, target_tokens)
File "/nlp/git/allennlp/allennlp/models/encoder_decoders/simple_seq2seq.py", line 310, in _forward_loop
    targets = target_tokens["tokens"]
KeyError: 'tokens'

Any help is much appreciated. Thanks.

{
    "dataset_reader": {
        "type": "seq2seq",
        "source_token_indexers": {
            "bert": {
                "type": "pretrained_transformer",
                "model_name": "bert-base-cased",
                "do_lowercase": false
            }
        },
        "source_tokenizer": {
            "type": "pretrained_transformer",
            "model_name": "bert-base-cased",
            "do_lowercase": false
        },
        "target_token_indexers": {
            "bert": {
                "type": "pretrained_transformer",
                "model_name": "bert-base-cased",
                "do_lowercase": false,
                "namespace": "target_tokens"
            }
        },
        "target_tokenizer": {
            "type": "word",
            "word_splitter": {
                "type": "bert-basic",
                "do_lower_case": false
            }
        }
    },
    "train_data_path": "/nlp/Data/small_sent.tsv",
    "validation_data_path": "/nlp/Data/small_sent.tsv",
    "model": {
        "type": "simple_seq2seq",
        "source_embedder": {
            "token_embedders": {
                "bert": {
                    "type": "bert-pretrained",
                    "pretrained_model": "bert-base-cased",
                    "top_layer_only": true,
                    "requires_grad": false
                }
            }
        },
        "encoder": {
            "type": "lstm",
            "num_layers": 1,
            "input_size": 768,
            "hidden_size": 768
        },
        "target_namespace": "target_tokens",
        "target_embedding_dim": 256,
        "max_decoding_steps": 80,
        "attention": {
            "type": "dot_product"
        },
        "beam_size": 5
    },
    "iterator": {
        "type": "basic",
        "batch_size": 32
    },
    "trainer": {
        "optimizer": {
            "type": "adam",
            "lr": 0.001
        },
        "patience": 10,
        "num_epochs": 100,
        "num_serialized_models_to_keep": 5,
        "should_log_learning_rate": true
    }
}

The error that you’re seeing is because the simple seq2seq model expects you to use the key “tokens” where you are using the key “bert”. If you switch the name, you should at least get past that line.

I’m still not sure that will really do what you expect, though. It’s been a while since I looked at the simple seq2seq code, but it would probably end up encoding the whole thing with BERT, which would be cheating for the decoder, not just tying the source and target embeddings. You probably need to pull out just the initial embedding layer and use that. We have a utility function that will probably work to get this for you: https://github.com/allenai/allennlp/blob/ba6297d30314fe43f2b143dcca9e4809135512db/allennlp/nn/util.py#L1475
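To make the "initial embedding layer" idea concrete, here is a rough sketch that grabs the wordpiece embedding matrix from the huggingface BertModel directly. The utility linked above may differ in its exact interface, so treat this as an illustration rather than the exact call:

# Illustration only: grab BERT's token embedding matrix so the decoder can
# reuse it, instead of running the full BERT encoder over the targets.
from pytorch_pretrained_bert import BertModel  # or transformers.BertModel in newer code

bert = BertModel.from_pretrained("bert-base-cased")

# An nn.Embedding whose weight is the (vocab_size, 768) wordpiece embedding matrix.
word_embeddings = bert.embeddings.word_embeddings
print(word_embeddings.weight.shape)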

@brendan, you’ve touched this code more recently than I have. Does this sound right?


@EhsanK, I don’t think SimpleSeq2Seq has the ability to tie the embeddings easily. You’d need to make some modifications to the model code. One option would be to try using the new ComposedSeq2Seq (search for composed_seq2seq.py in our repo; Discourse is strangely not letting me link it right now…), which does have some support for this. See the tied_source_embedder_key option. Be aware, though, that that code is quite new and there are likely some rough edges. You’d be doing us a favor by helping test it out! :slight_smile:

@mattg, could you elaborate a bit on why using BERT as-is would be cheating? I’m afraid I must be missing something obvious here. Also, in the config above the BERT weights would only be “tied” essentially by accident as we cache the BERT models, correct? To be honest, despite working on the seq2seq code a bit I don’t feel too expert in it, so your help and clarifications are appreciated!


Weight tying means using the same weights for your input embedding matrix as you use for your output softmax. In practice that means that you do a dot product between the hidden state and the embedding matrix at each step of the decoder, then a softmax over those values (or something more fancy like an adaptive or hierarchical softmax).
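In code, a minimal sketch of the basic version looks something like this (plain PyTorch, not any particular AllenNLP class; TiedProjection is just a made-up name for illustration):

import torch
import torch.nn as nn

class TiedProjection(nn.Module):
    """Input embedding and output projection share one weight matrix."""

    def __init__(self, vocab_size: int, embedding_dim: int):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embedding_dim)

    def embed(self, token_ids: torch.LongTensor) -> torch.Tensor:
        # Input side of the decoder: ordinary embedding lookup.
        return self.embedding(token_ids)

    def logits(self, hidden: torch.Tensor) -> torch.Tensor:
        # Output side: dot product of the hidden state with every embedding
        # vector, giving (batch, vocab_size) scores to softmax over.
        return hidden @ self.embedding.weight.t()

head = TiedProjection(vocab_size=1000, embedding_dim=768)
hidden_state = torch.randn(4, 768)             # one decoder step, batch of 4
next_token_logits = head.logits(hidden_state)  # shape (4, 1000)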

You could try to simplify this somehow by embedding the targets up front and doing an L2 loss between the hidden state and the target embeddings, or something (I know something similar has been done with language modeling). If you do this with BERT, which has seen the whole context, it feels really weird, like you must be cheating. This is why transformer decoders use some crazy masking: to be sure that at each timestep you’re only using information the decoder should have already seen. If you only do it at training time it’s maybe ok, but then you have a mismatch between training and test…
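For reference, that masking is just an upper-triangular "causal" mask over the self-attention scores, so position i can never look at positions after i. Roughly, as a sketch in plain PyTorch:

import torch

seq_len = 5
# True marks the "future" positions each timestep is forbidden to attend to.
causal_mask = torch.triu(torch.ones(seq_len, seq_len), diagonal=1).bool()

scores = torch.randn(seq_len, seq_len)                   # raw attention scores
masked = scores.masked_fill(causal_mask, float("-inf"))  # hide the future
attention = masked.softmax(dim=-1)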

In any event, it’s not really clear to me what weight tying means in a seq2seq model with BERT. It could mean just using BERT’s initial embedding, and then grabbing the embedding layer as I mentioned above should work. If you want to do something more fancy in the decoder… :man_shrugging: There are several papers on non-autoregressive decoding using masked predictions and such, and that’s maybe a good place to start if you want to do something like this.


@mattg Thanks immensely for the tips. I will go for an autoregressive decoder and, as you suggested, tie it with an L2 loss to the embedding layer of the encoder BERT. Is there any pre-trained decoder available in AllenNLP?

@brendan Thanks a lot, I am trying both composed_seq2seq and copynet on my data. I will report back if I hit a bug or anything else.

We don’t have any pretrained decoder available, as far as I know. If you get something working and want to contribute a model / code, we’d love to hear about it. There will sometime soonish be a new allennlp-seq2seq repo with less onerous contribution requirements than we have on the main allennlp repo.