How to properly use multiple GPUs?

I am just wondering if there is documentation anywhere on how to use multiple GPUs and/or how to distribute training across multiple nodes. I noticed that a PR on distributed training was recently merged (#3529) but wasn’t able to find instructions.

Thanks!

Perhaps there are no instructions because the latest changes are not yet part of a proper release. However, you can configure distributed training with a `distributed` block like the one below:

"trainer": {"num_epochs": 2, "optimizer": "adam"},
"distributed": {
    "cuda_devices": [0, 1],
    "master_addr": <ip>,
    "master_port": <port number>,
    "num_nodes": 2
}, 
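For context, here is a sketch of how that block might fit into a full `experiment.jsonnet`. The reader and model types, data path, and the address/port values are placeholders; substitute your own:

```jsonnet
{
  // Placeholder reader/model — replace with your actual components.
  "dataset_reader": {"type": "my_reader"},
  "train_data_path": "data/train.jsonl",
  "model": {"type": "my_model"},
  "trainer": {"num_epochs": 2, "optimizer": "adam"},
  "distributed": {
    "cuda_devices": [0, 1],        // GPUs to use on each node
    "master_addr": "192.168.1.10", // reachable IP of the rank-0 node
    "master_port": 29500,          // any free port on that node
    "num_nodes": 2
  }
}
```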

To start the training on the first node:
allennlp train experiment.jsonnet -s output --node-rank 0
Second node:
allennlp train experiment.jsonnet -s output --node-rank 1

In the single-node case, the num_nodes, master_addr, and master_port attributes are not needed, and --node-rank can be omitted.
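So for a single machine with two GPUs, the distributed block reduces to something like this (a minimal sketch), and you run `allennlp train experiment.jsonnet -s output` with no --node-rank:

```jsonnet
"distributed": {
    "cuda_devices": [0, 1]
}
```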

Note that the datasets need to be placed on each machine beforehand. Also make sure the dataset size divides evenly across the GPUs.


Awesome. Thanks a lot, @Ananda_Seelan. When you say:

In a single-node case, num_nodes, master_addr & master_port attributes are not needed and --node-rank will not be useful.

Does that mean I do not need to provide --node-rank in the single-node case?

And one final question, in the distributed case, does the batch_size argument correspond to total batch size or batch size per GPU?

Thanks again!

That’s right. It defaults to 0 in a single-node case.

does the batch_size argument correspond to total batch size or batch size per GPU?

It corresponds to the per-GPU batch size, so the effective batch size is the total number of GPUs (across all nodes) times batch_size. For example, 2 nodes with 2 GPUs each and a batch_size of 16 give an effective batch size of 64.

Great, thanks a lot. Really appreciate it.