What data format do I need to train or fine-tune the BERT SRL model?

Hi there,

I am trying to figure out how to train an in-domain Semantic Role Labeling (SRL) model based on BERT for my use case, starting from bert_base_srl.jsonnet. I have two questions:

1. What format does the training data need to be in?

This doesn’t seem to be documented anywhere, and I’m at a loss as to what kind of data I need to put in the training folder. I can generate the data in any format I want; I just could not figure out what format the model takes in. Would it be possible to publish a couple of lines from the model’s training files, so I can reverse-engineer the format I need to output?

The jsonnet file says it uses the “srl” dataset reader, but I could not figure out what that maps to. Even if I had, that still wouldn’t tell me which parts of these files the model really uses (my naive assumption is that most fields would be unused, since the model only takes the input sentence directly, plus the labels for the output).

For reference, here is a copy of the jsonnet file in the repo:

{
    "dataset_reader": {
      "type": "srl",
      "bert_model_name": "bert-base-uncased",
    },

// ...

    "train_data_path": std.extVar("SRL_TRAIN_DATA_PATH"),
    "validation_data_path": std.extVar("SRL_VALIDATION_DATA_PATH"),

    "model": {
        "type": "srl_bert",
        "embedding_dropout": 0.1,
        "bert_model": "bert-base-uncased",
    },

//...

}

In an ideal world, the input data format would look something like

My/B-ARG0 family/I-ARG0 loves/V you/B-ARG1 ./O

Or it would be something I could output from a data structure containing this information.
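To make the wished-for format concrete, here is a minimal Python sketch that parses such a line into parallel token and tag lists. The function name parse_tagged_sentence is hypothetical, and the word/TAG layout is only my proposed format, not something the library is known to accept.

```python
from typing import List, Tuple


def parse_tagged_sentence(line: str) -> Tuple[List[str], List[str]]:
    """Split 'word/TAG word/TAG ...' into parallel token and tag lists."""
    tokens: List[str] = []
    tags: List[str] = []
    for item in line.split():
        # rpartition splits on the LAST slash, so a token that itself
        # contains a slash (e.g. "and/or/O") keeps its text intact.
        word, sep, tag = item.rpartition("/")
        if not sep or not word or not tag:
            raise ValueError(f"expected word/TAG, got {item!r}")
        tokens.append(word)
        tags.append(tag)
    return tokens, tags


tokens, tags = parse_tagged_sentence(
    "My/B-ARG0 family/I-ARG0 loves/V you/B-ARG1 ./O"
)
print(tokens)
print(tags)
```

Keeping tokens and tags as two aligned lists should make it straightforward to hand them to whatever reader ends up consuming them.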

2. How can I specify my own fine-tuned BERT model?

Let’s say I have fine-tuned BERT on my in-domain data; how can I use it here? The BERT model in use seems to be specified in two places, both times by identifier. What would I need to change to use my own BERT model instead?
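One thing I might try, sketched below in Python, is overriding both fields with a local directory instead of an identifier. This rests on an untested assumption: that both "bert_model_name" and "bert_model" are handed to a Hugging Face from_pretrained call, which accepts a local directory as well as a model name. The helper name bert_overrides and the path are hypothetical.

```python
import json


def bert_overrides(model_dir: str) -> str:
    """Build a JSON overrides string pointing both BERT fields at model_dir."""
    overrides = {
        # Assumption: the reader loads its tokenizer from this value.
        "dataset_reader": {"bert_model_name": model_dir},
        # Assumption: the model loads its weights from this value.
        "model": {"bert_model": model_dir},
    }
    return json.dumps(overrides)


# Hypothetical path to a directory holding a fine-tuned BERT checkpoint.
overrides_json = bert_overrides("/path/to/my-finetuned-bert")
print(overrides_json)
```

The resulting string could then be passed to the --overrides option of allennlp train, which would leave the jsonnet file in the repo untouched.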

Thank you for your insights!

Best regards,
François

To partially answer question 1: I found where the ‘srl’ dataset reader is defined. It is inside the allennlp_models repository. Based on that, the easiest solution for me is probably to implement a new dataset reader that returns instances of a similar shape.

However, the base format seems to be the OntoNotes 5.0 format, which I am not familiar with, and the dataset itself does not seem to be freely available.
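From my reading of that reader, each instance boils down to the tokens, a 0/1 indicator marking the predicate, and BIO tags. The sketch below builds that shape in plain Python from a token list and a tag list. The field names "tokens", "verb_indicator" and "tags" are my assumption of what the reader produces, to_srl_example is a hypothetical helper, and the bare "V" tag is normalised to "B-V" on the assumption that the model expects BIO-style verb tags.

```python
from typing import Dict, List


def to_srl_example(tokens: List[str], tags: List[str]) -> Dict[str, list]:
    """Build the pieces a custom SRL reader would wrap into one instance."""
    if len(tokens) != len(tags):
        raise ValueError("tokens and tags must have the same length")
    # Normalise a bare "V" tag to BIO style (assumed to be what is expected).
    bio_tags = ["B-V" if tag == "V" else tag for tag in tags]
    # Mark predicate positions: 1 where the tag ends in "-V", else 0.
    verb_indicator = [1 if tag.endswith("-V") else 0 for tag in bio_tags]
    return {
        "tokens": tokens,
        "verb_indicator": verb_indicator,
        "tags": bio_tags,
    }


example = to_srl_example(
    ["My", "family", "loves", "you", "."],
    ["B-ARG0", "I-ARG0", "V", "B-ARG1", "O"],
)
print(example["verb_indicator"])
print(example["tags"])
```

A sentence with several predicates would need one such example per predicate, since the indicator marks a single verb at a time.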