Handling labels in a DatasetReader for CNN/DailyMail RC

Hi!
I’m trying to create a DatasetReader for the CNN/DailyMail RC dataset. In this setting, each passage has a bunch of entity tokens, and the model has to predict which entity token is the correct answer (multiclass classification).

A few questions:
(1) The output layer of the model needs to project to a label space whose size is the number of unique entities across all paragraphs in the dataset. How would I get this number from the DatasetReader when I’m creating my model? Ideally, I’d like to have some sort of vocabulary field containing only the entities…
(2) The model needs a mask that tells it, among all the entities in all the paragraphs in the dataset, which entities actually occur in the current paragraph. I was just going to use an ArrayField wrapping a manually constructed NumPy mask. But building this is hard because when you’re processing only a single instance at a time (in text_to_instance), you don’t know how many other entities there are and, more importantly, you don’t have a mapping from each entity to the index that the label indexer assigns it.
(3) I’d like to be able to handle UNK entities / have an UNK entity label, so that even passages containing novel entities at test time (ones I never saw during training) can be handled. I have no clue how I’d go about adding this to the DatasetReader, though…

Any thoughts on the right Field schema for this sort of dataset?

I’m not sure I have all of your requirements, but I’m imagining an EntityField that looks like this: you get a list of entities from the current document, with their string ID (“ent213”, or whatever), and their position in the document. This field has a count_vocab_items method that adds all of these string IDs to the vocab (and picks a namespace that has padding and OOV tokens, to easily handle UNK entities - you’ll have to be careful at evaluation if there are actually multiple UNK entities, though). You output the set of all string IDs seen in the document (after indexing them).

Seems straightforward? You could do this just fine with a ListField[LabelField] if you wanted - construct the unique set in your dataset reader beforehand, then just add the elements of the set to the list. They will get indexed and put in the vocabulary without issue (you’ll need to use a non-default namespace with the LabelField).
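To make the vocabulary idea concrete, here is a minimal library-free sketch of what a padded namespace with an OOV entry gives you. It is not AllenNLP code: the function names, the "ent…" IDs, and the PAD/UNK strings are placeholders chosen for illustration, and a plain dict stands in for the vocabulary.

```python
from typing import Dict, Iterable, List

# Placeholder names for the padding and OOV entries of the entity namespace.
PAD, UNK = "@@PADDING@@", "@@UNKNOWN@@"


def build_entity_vocab(documents: Iterable[List[str]]) -> Dict[str, int]:
    """One pass over the training data: give every entity ID an index."""
    vocab = {PAD: 0, UNK: 1}
    for entities in documents:
        for entity in entities:
            vocab.setdefault(entity, len(vocab))
    return vocab


def index_document_entities(entities: List[str], vocab: Dict[str, int]) -> List[int]:
    """Index the unique entities of one document; unseen IDs fall back to UNK."""
    return sorted({vocab.get(entity, vocab[UNK]) for entity in entities})


def label_space_mask(indices: List[int], vocab_size: int) -> List[int]:
    """1 for entities that occur in the current document, 0 everywhere else."""
    present = set(indices)
    return [1 if i in present else 0 for i in range(vocab_size)]


train_docs = [["ent1", "ent2", "ent1"], ["ent2", "ent3"]]
vocab = build_entity_vocab(train_docs)
num_labels = len(vocab)  # size of the output projection, PAD and UNK included

test_doc = ["ent3", "ent99", "ent3"]  # "ent99" never appeared in training
indices = index_document_entities(test_doc, vocab)
mask = label_space_mask(indices, num_labels)
print(num_labels, indices, mask)
```

The point is that the mask can only be built once the whole vocabulary exists, so it is easier to emit the indexed entity set per instance and expand it to a mask inside the model. Note also that two different unseen entities would collapse onto the same UNK index here, which is the evaluation caveat mentioned above.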

It’s still a little unclear to me how you handle the representation of paragraph tokens that are these entities, though. I was expecting you to say something about needing a mask that lets you replace entity word vectors with these vectors, or something. Or maybe I’ve misunderstood something, and you want the entity IDs to still be part of the regular vocab, you just want them to also be in some separate vocab. One option there is to make sure that all of the entity IDs can be matched by a single regex, and then loop through the vocab in the model and count / store them. For getting the set of entities in the paragraph, compute the set in the dataset reader, then put them in a TextField. Then, in the model, you can call util.get_token_ids_from_text_field_tensors on that field to get the IDs.
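A rough sketch of the regex option, assuming entity tokens look like "@entity12" (adjust the pattern to whatever your data uses). The dict below is a stand-in for the index-to-token mapping of the token namespace; in AllenNLP I believe you would get it from the Vocabulary object, but treat that as an assumption and check against your version.

```python
import re
from typing import Dict, List

# Assumed entity ID format; change this to match the actual tokens.
ENTITY_PATTERN = re.compile(r"^@entity\d+$")


def entity_token_ids(index_to_token: Dict[int, str]) -> List[int]:
    """Return the vocabulary indices whose token matches the entity pattern."""
    return sorted(
        index for index, token in index_to_token.items()
        if ENTITY_PATTERN.match(token)
    )


# Hypothetical vocabulary contents, for illustration only.
index_to_token = {
    0: "@@PADDING@@",
    1: "@@UNKNOWN@@",
    2: "the",
    3: "@entity12",
    4: "said",
    5: "@entity7",
}

entity_ids = entity_token_ids(index_to_token)
num_entities = len(entity_ids)  # candidate size for a separate entity label space
print(entity_ids, num_entities)
```

Doing this once in the model constructor and storing the result avoids rescanning the vocabulary on every forward pass.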

I’m not totally clear on how exactly you want to use this, though. Why do you need to know how many entities there are across the whole dataset? Wouldn’t it be easier to have a mask for entity tokens in the current document, and sum up probabilities for unique tokens in your model (projecting to document tokens, instead of to a separate label space)? You shouldn’t need a global list of entities, which removes the need for UNK entities, also.
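Here is a small sketch of that last suggestion, written in plain Python so the arithmetic is easy to follow; in a real model this would be a masked softmax over tensors. The function name, token strings, and scores are made up for illustration.

```python
import math
from collections import defaultdict
from typing import Dict, List


def entity_probabilities(
    tokens: List[str], scores: List[float], entity_mask: List[int]
) -> Dict[str, float]:
    """Softmax over entity positions only, then sum the mass per unique entity."""
    kept = [s for s, m in zip(scores, entity_mask) if m]
    if not kept:
        return {}
    max_score = max(kept)  # subtract the max for numerical stability
    exps = [math.exp(s - max_score) if m else 0.0
            for s, m in zip(scores, entity_mask)]
    total = sum(exps)
    probs: Dict[str, float] = defaultdict(float)
    for token, e, m in zip(tokens, exps, entity_mask):
        if m:
            probs[token] += e / total
    return dict(probs)


tokens = ["@entity1", "met", "@entity2", "and", "@entity1"]
scores = [0.0, 5.0, 0.0, 5.0, 0.0]  # per-token scores from the model
entity_mask = [1, 0, 1, 0, 1]       # 1 where the token is an entity

probs = entity_probabilities(tokens, scores, entity_mask)
prediction = max(probs, key=probs.get)
print(probs, prediction)
```

Because the prediction is over tokens of the current document, an entity never seen in training is handled like any other, and no global entity count is needed for the output layer.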