Ensuring Batches of Homogeneous Instances

I am working on a model that iterates over multiple datasets. I am creating multiple vocabularies: a source vocabulary, which is the union of all the dataset-specific vocabularies, as well as the dataset-specific vocabularies themselves. As a result, some of the fields have different keys depending on which dataset an instance is read from, which causes an error in the following file: https://github.com/allenai/allennlp/blob/0e33b0bac5282f2f1ac64bf36bc58ff94902a851/allennlp/data/batch.py#L38

Below is an example of two instances from my dataset with different field names, e.g. 'words_en_lines' vs. 'words_en_ewt':

# Instance 1
{'words_source': 'TextField', 'pos_tags_source': 'SequenceLabelField', 'head_tags_source': 'SequenceLabelField', 'head_indices_source': 'SequenceLabelField', 'words_en_lines': 'TextField', 'pos_tags_en_lines': 'SequenceLabelField', 'head_tags_en_lines': 'SequenceLabelField', 'head_indices_en_lines': 'SequenceLabelField', 'metadata': 'MetadataField'}

# Instance 2
{'words_source': 'TextField', 'pos_tags_source': 'SequenceLabelField', 'head_tags_source': 'SequenceLabelField', 'head_indices_source': 'SequenceLabelField', 'words_en_ewt': 'TextField', 'pos_tags_en_ewt': 'SequenceLabelField', 'head_tags_en_ewt': 'SequenceLabelField', 'head_indices_en_ewt': 'SequenceLabelField', 'metadata': 'MetadataField'}
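For context on the failure: a Batch requires every instance to share the same field names (and field types), and the dictionaries printed above are essentially what that check compares. A simplified version of the idea, not the actual library code, looks like this:

from typing import Iterable


def check_same_fields(instances: Iterable) -> None:
    """Simplified sketch of the failing check: every instance in a batch must
    have exactly the same field names and field types."""
    signatures = [
        {name: field.__class__.__name__ for name, field in instance.fields.items()}
        for instance in instances
    ]
    if any(signature != signatures[0] for signature in signatures):
        raise ValueError(f"Non-homogeneous instances in batch: {signatures}")

So any batch that mixes an 'en_lines' instance with an 'en_ewt' instance will trip this check.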

There were previously iterators that sorted instances by a particular key, but they have since been removed:

  1. https://github.com/allenai/allennlp/blob/da16ad13f891a9a91a55a1a5eefd404fbc7d1b70/allennlp/data/iterators/same_language_iterator.py
  2. https://github.com/joelgrus/allennlp/blob/2510fa1f6ca57766c7d204994a85d41a957b37d4/allennlp/data/iterators/homogeneous_batch_iterator.py

I wanted to get your opinion before jumping into it: would the best approach be to create my own version of data/samplers/bucket_batch_sampler.py, e.g. a homogeneous_bucket_batch_sampler.py, which first sorts my instances by a certain partition key, as in the old homogeneous_batch_iterator.py? Thanks
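Something along these lines is roughly what I have in mind. It is only a sketch operating on a plain list of instances: the function name and the 'dataset' metadata key are placeholders, it assumes each instance's MetadataField holds a dict with the dataset name under that key, and wiring it up as a registered BatchSampler would follow the existing bucket_batch_sampler.py.

import random
from collections import defaultdict
from typing import Iterator, List


def homogeneous_batches(
    instances: List, batch_size: int, partition_key: str = "dataset"
) -> Iterator[List[int]]:
    """Yield batches of instance indices so that every batch comes from a
    single partition (e.g. one treebank), similar in spirit to the old
    homogeneous_batch_iterator.py."""
    # Bucket instance indices by the partition value stored in their metadata.
    by_partition = defaultdict(list)
    for index, instance in enumerate(instances):
        partition = instance.fields["metadata"].metadata[partition_key]
        by_partition[partition].append(index)

    # Chunk each partition into batches, then shuffle the batches so the
    # partitions are interleaved between gradient updates.
    batches = []
    for indices in by_partition.values():
        for start in range(0, len(indices), batch_size):
            batches.append(indices[start : start + batch_size])
    random.shuffle(batches)
    yield from batches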


The plan that you mention of recreating the homogeneous samplers is probably what I would recommend for now, and copying the old code into your repo is probably a good starting place.

Longer term, we’re working on more fully-fledged support for multi-task training, which will be included in our 2.0 release. You can see a couple of open PRs for that here and here. It’s not working yet, but it’s not too far off. We’re hoping to have a beta release of 2.0 by the end of September, but it might slip by a few weeks. In that setup, you’d probably just implement a Scheduler that ensures homogeneous batches, or make it so that you can batch non-homogeneous instances and then split them apart as needed for the separate heads.
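To make the first option concrete, a scheduler that ensures homogeneous batches could look roughly like this. All of the names here are illustrative only, not the actual 2.0 API, which isn’t finalized:

import itertools
from typing import Dict, Iterator, List


def round_robin_homogeneous_batches(
    batches_per_dataset: Dict[str, Iterator[List]],
) -> Iterator[List]:
    """Illustrative scheduler logic: cycle through datasets, emitting one
    already-homogeneous batch from each in turn until all are exhausted.

    `batches_per_dataset` maps a dataset name (e.g. 'en_lines', 'en_ewt')
    to an iterator over batches built from that dataset alone."""
    iterators = dict(batches_per_dataset)
    for name in itertools.cycle(list(iterators)):
        if not iterators:
            break  # every dataset has been exhausted
        try:
            yield next(iterators[name])
        except KeyError:
            continue  # this dataset was already exhausted and removed
        except StopIteration:
            del iterators[name]

Fed with per-dataset batch iterators (e.g. built from something like the sampler sketch above), this alternates homogeneous batches between datasets.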

The latter case is probably better as far as optimization goes, since you get more diversity in each gradient update, but figuring out how to do it right in a general way is a bit tricky, and it probably won’t be implemented in the initial version of the multi-task code (unless we get some help from someone).
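One rough way to picture that splitting, assuming the shared fields have already been padded together into a single batch of batch-first tensors and the per-instance dataset name is available from metadata (everything below is hypothetical, not existing library code):

from typing import Dict, List

import torch


def split_by_dataset(
    tensors: Dict[str, torch.Tensor], dataset_names: List[str]
) -> Dict[str, Dict[str, torch.Tensor]]:
    """Slice a mixed batch so each head only receives the rows (dimension 0)
    that belong to its dataset."""
    per_head: Dict[str, Dict[str, torch.Tensor]] = {}
    for name in set(dataset_names):
        rows = torch.tensor([i for i, d in enumerate(dataset_names) if d == name])
        per_head[name] = {key: t.index_select(0, rows) for key, t in tensors.items()}
    return per_head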

Hi Matt,

Thank you very much. I went ahead and partitioned the instances based on a metadata key, as was done previously (links 1 and 2 above), and it’s going fine so far.
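Roughly, the idea is that each reader tags its instances with the dataset name in the metadata; a simplified example (the field names and the 'dataset' key are just placeholders):

from allennlp.data import Instance
from allennlp.data.fields import MetadataField, TextField
from allennlp.data.token_indexers import SingleIdTokenIndexer
from allennlp.data.tokenizers import Token

# Each dataset reader records which dataset the instance came from, so the
# sampler can partition on that key later.
tokens = [Token("A"), Token("sentence")]
fields = {
    "words_en_ewt": TextField(tokens, {"tokens": SingleIdTokenIndexer()}),
    "metadata": MetadataField({"dataset": "en_ewt"}),
}
instance = Instance(fields)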

Thanks for the updates regarding the MTL progress. I’ll take a look at the pull requests to get a better idea of what’s going on. I have been following this thread and the updates look exciting: https://github.com/allenai/allennlp/issues/4422. These changes will be very helpful for me, as I aim to have a base model class, a MultitaskModel, which orchestrates the training of three models or heads.

I agree that non-homogeneous batches provide greater diversity with respect to gradient updates. If I see anywhere I can help out with that or any of the MTL stuff, I’ll be sure to let you know.