Distributed training across multiple GPUs

I’m trying to train a model across multiple GPUs.
I read that “the batch size should be larger than the number of GPUs used locally.”
However, I can’t fit a larger batch size with 4 GPUs than I can with one.
While debugging, I discovered that each GPU receives the full configured batch size, which explains why I couldn’t increase it: with multiple GPUs, the effective batch size is already 4 times larger. BUT, it also seems that every process (i.e. every GPU) receives the exact same batch, with the same data, in every iteration, which doesn’t make sense.
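For reference, the behavior I expected: each worker should read a disjoint slice of the dataset, typically every world_size-th instance starting at its own rank. A minimal, framework-agnostic sketch of that sharding (the names here are illustrative, not AllenNLP's API):

```python
from itertools import islice

def shard_instances(instances, rank, world_size):
    """Yield only this worker's share of the data:
    every world_size-th item, starting at index `rank`."""
    return list(islice(instances, rank, None, world_size))

# With 8 instances and 4 workers, each rank gets 2 distinct instances
# and the shards are disjoint, covering the whole dataset.
data = list(range(8))
shards = [shard_instances(data, rank, 4) for rank in range(4)]
# shards == [[0, 4], [1, 5], [2, 6], [3, 7]]
```

If every rank instead sees the identical batch, this slicing step is being skipped somewhere in the data pipeline.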

Can you help me figure it out?

My distributed related configuration:

    "distributed": {"cuda_devices": [0, 1, 2, 3]},
    "trainer": {
        "distributed": true,
        "world_size": 4
    }

Hi, which version of AllenNLP are you using? I know there was a bug (each GPU getting the exact same batches) that I believe we fixed a while ago.

But it’s possible that this could still happen if your dataset reader isn’t configured properly.

I’m using 1.1.0rc1. I believe you’re talking about https://github.com/allenai/allennlp/pull/4241, so I have that commit.

Regarding your second comment about the misconfiguration:
Should I make changes to the dataset reader when using a distributed configuration? If so, can you tell me exactly what to change, or point me to a reference, please?

I have a custom dataset reader, which overrides _read, _instances_from_cache_file, _instances_to_cache_file and implements text_to_instance.


I see. And when you noticed that different GPU workers were seeing the same instances, were you utilizing a dataset cache?

Yes, I have a cache_directory set in my dataset reader class.

I’ve deleted the cache directory and now it’s working as expected!
Thank you

How did you implement _instances_from_cache_file? Are you utilizing the _multi_worker_islice method like here:
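To spell out the failure mode: if the cache-reading path yields every cached instance to every worker, all GPUs see identical data, so the same rank-based slicing must be applied when reading from the cache as when reading raw data. A hypothetical sketch in plain Python (not the actual AllenNLP implementation; `rank` and `world_size` stand in for what `_multi_worker_islice` derives from the distributed context):

```python
from itertools import islice

def instances_from_cache_file(cache_path, rank, world_size):
    """Read cached instances, keeping only this worker's shard.

    Without the islice step, every worker would yield every line,
    reproducing the "same batch on every GPU" symptom."""
    with open(cache_path) as cache_file:
        # One serialized instance per line; deserialization omitted.
        for line in islice(cache_file, rank, None, world_size):
            yield line.strip()
```

A cache written before the sharding fix (or read without this slicing) would hand all workers the full dataset, which is consistent with deleting the cache directory resolving the problem.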