I’m trying to create a DatasetReader for the CNN/DailyMail RC dataset. In this setting, each passage has a bunch of entity tokens, and the model has to predict which entity token is the correct answer (multiclass classification).
A few questions:
(1) The output layer of the model needs to project to a label space whose size is the number of unique entities across all paragraphs in the dataset. How would I get this number from the DatasetReader when I’m creating my model? Ideally, I’d like to have some sort of vocabulary field containing only the entities…
(2) The model needs a mask that tells it, among all the entities in all the paragraphs in the dataset, which entities actually occur in the current paragraph. I was just going to use an ArrayField wrapping a manually constructed NumPy mask. But building this is hard, because when you’re processing only a single instance at a time (in text_to_instance), you don’t know how many other entities there are, and more importantly, you don’t have the mapping from each entity to the index that the label indexer assigns it.
(3) I’d like to be able to handle UNK entities / have an UNK entity label, so that test-time passages containing novel entities I never saw during training can still be handled. I have no clue how I’d go about adding this to the DatasetReader, though…
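To make (1)–(3) concrete, here’s a rough sketch in plain Python of the behaviour I’m after, not of how AllenNLP does it. Everything in it is made up for illustration: the `build_entity_vocab` and `entity_mask` helpers, the `UNK_ENTITY` placeholder, and the toy `@entityN` tokens. It also assumes I can make a full pass over the training entities before building any masks, which is exactly the part I don’t know how to do inside text_to_instance.

```python
from typing import Dict, Iterable, List

# Hypothetical placeholder label for entities never seen in training.
UNK_ENTITY = "@@UNKNOWN@@"


def build_entity_vocab(passages: Iterable[List[str]]) -> Dict[str, int]:
    """Map each training entity to a label index; index 0 is reserved for UNK."""
    vocab = {UNK_ENTITY: 0}
    for entities in passages:
        for entity in entities:
            if entity not in vocab:
                vocab[entity] = len(vocab)
    return vocab


def entity_mask(entities: List[str], vocab: Dict[str, int]) -> List[int]:
    """Return 1 at every label index whose entity occurs in this passage, else 0."""
    mask = [0] * len(vocab)
    for entity in entities:
        # Unseen entities fall back to the UNK index.
        mask[vocab.get(entity, vocab[UNK_ENTITY])] = 1
    return mask


# Toy data: each passage is represented only by its list of entity tokens.
train = [["@entity1", "@entity2"], ["@entity2", "@entity3"]]
vocab = build_entity_vocab(train)

num_labels = len(vocab)  # size of the output projection, including UNK
mask = entity_mask(["@entity3", "@entity9"], vocab)  # "@entity9" is unseen
```

Here `num_labels` is the number I need in (1), `mask` is what I’d wrap in an ArrayField for (2), and the fallback to index 0 is the UNK handling from (3).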
Any thoughts on the right Field schema for this sort of dataset?