Hi, I’m working on a neural IR model that can be summarized as:
- A transformer that encodes input A (e.g., BERT/RoBERTa)
- A transformer that encodes input B
- Compute the inner product of the two encodings and make a prediction from that score
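For concreteness, here's a minimal sketch of that two-tower setup. The `Encoder` class is a toy stand-in for a real pretrained transformer, and all names here are illustrative, not from my actual code:

```python
import torch
import torch.nn as nn

class Encoder(nn.Module):
    """Toy stand-in for a transformer encoder: embed and mean-pool."""
    def __init__(self, vocab_size: int, dim: int):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        return self.embed(token_ids).mean(dim=1)  # (batch, dim)

class TwoTowerModel(nn.Module):
    def __init__(self, vocab_size: int = 1000, dim: int = 16):
        super().__init__()
        self.transformer_a = Encoder(vocab_size, dim)
        self.transformer_b = Encoder(vocab_size, dim)

    def forward(self, tokens_a: torch.Tensor, tokens_b: torch.Tensor) -> torch.Tensor:
        enc_a = self.transformer_a(tokens_a)
        enc_b = self.transformer_b(tokens_b)
        # The prediction is driven by the inner product of the encodings.
        return (enc_a * enc_b).sum(dim=-1)  # (batch,)

model = TwoTowerModel()
scores = model(torch.randint(0, 1000, (4, 7)), torch.randint(0, 1000, (4, 5)))
print(scores.shape)  # torch.Size([4])
```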
I’m having some issues with GPU memory usage (on 1080Tis) so thought I might:
- Put transformer_a on GPU0
- Put transformer_b on GPU1
- Modify the device placement for my two corresponding text fields to match
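In plain PyTorch the placement I'm after looks roughly like this (with a CPU fallback so it runs without two GPUs, and `nn.Linear` standing in for each transformer; the attribute names are assumptions):

```python
import torch
import torch.nn as nn

# Put each encoder on its own GPU when two are available, else fall back to CPU.
dev_a = torch.device("cuda:0" if torch.cuda.device_count() >= 2 else "cpu")
dev_b = torch.device("cuda:1" if torch.cuda.device_count() >= 2 else "cpu")

class TwoTowerParallel(nn.Module):
    def __init__(self, dim: int = 16):
        super().__init__()
        # Each sub-model lives on its own device.
        self.transformer_a = nn.Linear(dim, dim).to(dev_a)
        self.transformer_b = nn.Linear(dim, dim).to(dev_b)

    def forward(self, a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
        # Each input must be moved to the device of the sub-model consuming it.
        enc_a = self.transformer_a(a.to(dev_a))
        enc_b = self.transformer_b(b.to(dev_b))
        # Bring both encodings onto one device before the inner product.
        return (enc_a * enc_b.to(dev_a)).sum(dim=-1)

model = TwoTowerParallel()
out = model(torch.randn(4, 16), torch.randn(4, 16))
print(out.shape)  # torch.Size([4])
```

The cross-device copy of `enc_b` is the only inter-GPU traffic per step, which is why this split looked attractive for the memory problem.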
I originally tried distributed training, but ran into similar memory issues, since each GPU still holds a full copy of both transformers. If I can’t get this to work, I’ll probably fall back to that with smaller batch sizes.
I poked around the source code/docs/examples, and there doesn’t seem to be an “out of the box” way to do this. If there is, would love to hear about it, otherwise do these steps seem reasonable?
- Override the TokenIndexer class to accept a device placement
- Override the trainer so that when it places the tensors on a GPU, it uses the defined device placement, otherwise falling back to the current assignment
- Override the trainer so that, when it moves the model to the GPU, it moves each sub-model to its assigned device (perhaps via a method that takes a device map)
- Iterate on this and fix the device-placement mismatches that will inevitably turn up (I'm guessing saving/loading the model will be one of them).
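For the trainer-side step above, the core change I have in mind is roughly this: instead of moving the whole batch to one device, move each field to the device its encoder lives on. This is a hypothetical sketch; the field names and `device_map` are mine, not from any library:

```python
import torch

def move_batch(batch: dict, device_map: dict, default: torch.device) -> dict:
    """Move each top-level field to its mapped device, else to `default`."""
    moved = {}
    for field, value in batch.items():
        device = device_map.get(field, default)
        if isinstance(value, torch.Tensor):
            moved[field] = value.to(device)
        elif isinstance(value, dict):
            # e.g. a text field that indexes to a dict of tensors
            moved[field] = {k: v.to(device) for k, v in value.items()}
        else:
            moved[field] = value  # leave metadata untouched
    return moved

# Hypothetical usage (CPU devices here so the sketch runs anywhere):
device_map = {"text_a": torch.device("cpu"), "text_b": torch.device("cpu")}
batch = {"text_a": {"tokens": torch.zeros(2, 5, dtype=torch.long)},
         "text_b": {"tokens": torch.ones(2, 5, dtype=torch.long)},
         "label": torch.tensor([0, 1])}
moved = move_batch(batch, device_map, default=torch.device("cpu"))
```

If the indexer/field overrides carry a device attribute, the trainer override would just consult that to build `device_map`.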