Vocab counts for class weights

I have a model with self.loss = torch.nn.CrossEntropyLoss(weight=weight) set in the __init__ method. I’m hoping that setting these weights will help address the class imbalance in the data. To calculate the weights, I’m currently starting with labels_with_counts = list(self.vocab._retained_counter["labels"].items()), but the hackiness of this solution caught up with me. Once I train the model and want to run a predictor or evaluator, that _retained_counter is no longer available on the loaded vocabulary (https://github.com/allenai/allennlp/blob/master/allennlp/data/vocabulary.py#L717). Is there a “built-in” way to get these counts, or is this the kind of thing I’d be better off calculating externally, as something like a data preprocessing step, and loading in from a file or config?

This kind of thing has been on our wish list for a long time, but we don’t currently have a good mechanism for handling it. There are two main problems: (1) how do we pass global dataset statistics or other parameters to the model? And (2) how do we save this reliably in the model archive so that it gets loaded correctly at test time? We don’t currently have solutions for either of these, though contributions for them would be welcome.

Oh, and sorry, yeah, the best way currently is to pre-compute it in a data processing step, then pass it in as a parameter to the model. Ugly, but that’s what you have to do right now.
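For anyone landing here later, a minimal sketch of what that pre-compute step might look like. Everything in it is an assumption for illustration: the compute_class_weights helper, the toy labels, and the choice of inverse-frequency weighting (total / (num_classes * count)) are not something AllenNLP provides or prescribes.

```python
import json
from collections import Counter


def compute_class_weights(labels):
    """Inverse-frequency class weights (hypothetical helper, not an AllenNLP API).

    weight_c = total / (num_classes * count_c), so rarer classes get larger
    weights and the weights average to 1 over the training examples.
    """
    counts = Counter(labels)
    total = sum(counts.values())
    num_classes = len(counts)
    return {
        label: total / (num_classes * count)
        for label, count in sorted(counts.items())
    }


# Toy labels standing in for the labels read from the training set.
train_labels = ["neg", "neg", "neg", "pos"]
class_weights = compute_class_weights(train_labels)

# Serialize so the weights can be stored in a file or pasted into a config
# and handed to the model's constructor as an ordinary parameter.
weights_json = json.dumps(class_weights)
```

The model would then turn these into a tensor for torch.nn.CrossEntropyLoss(weight=...). One thing to watch: the tensor has to be ordered by the label indices the vocabulary assigns (for example by looking each label up with vocab.get_token_index(label, "labels")), not by the order of the dictionary above.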

Cool–thanks for the quick reply!

If we’re only concerned with label weights when computing the training loss, would it be possible to encapsulate the loss computation in forward() with if self.training: and still use this approach?
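To make the question concrete, here is a rough sketch of the idea in plain PyTorch. The WeightedClassifier class and its arguments are made up for illustration; it is not an AllenNLP Model and has not been checked against the archiving and loading code.

```python
import torch
import torch.nn.functional as F


class WeightedClassifier(torch.nn.Module):
    """Hypothetical model that applies class weights only while training."""

    def __init__(self, input_dim, num_labels, class_weights=None):
        super().__init__()
        self.linear = torch.nn.Linear(input_dim, num_labels)
        # Stored as a plain attribute rather than a registered buffer, so it
        # is not part of state_dict and is not needed to load the model.
        # The trade-off: it will not follow the model in .to(device) calls.
        self.class_weights = class_weights

    def forward(self, inputs, labels=None):
        logits = self.linear(inputs)
        output = {"logits": logits}
        if labels is not None:
            # Weighted loss in training mode, unweighted loss otherwise.
            weight = self.class_weights if self.training else None
            output["loss"] = F.cross_entropy(logits, labels, weight=weight)
        return output
```

With this layout the model can be constructed with class_weights=None at prediction or evaluation time. The loss reported in eval mode is then unweighted, so it is not directly comparable to the training loss.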