Failed KnowBert fine-tuning

Hello, I am trying to fine-tune BERT after pre-training the entity linker in KnowBert. I am using the same configuration as in training_config/pretraining/knowbert_wordnet.jsonnet, with the paths changed to point to local files and folders.
However, when I run the training command as specified in the KnowBert repository:

allennlp train -s OUTPUT_DIRECTORY --file-friendly-logging --include-package kb.include_all training_config/pretraining/knowbert_wordnet.jsonnet

training stops and I get multiple errors:

2020-01-20 16:28:36,147 - INFO - allennlp.training.trainer - Beginning training.
2020-01-20 16:28:36,147 - INFO - allennlp.training.trainer - Epoch 0/0
2020-01-20 16:28:36,147 - INFO - allennlp.training.trainer - Peak CPU memory usage MB: 3975.84
2020-01-20 16:28:36,226 - INFO - allennlp.training.trainer - GPU 0 memory usage MB: 66
2020-01-20 16:28:36,226 - INFO - allennlp.training.trainer - GPU 1 memory usage MB: 1757
2020-01-20 16:28:36,227 - INFO - allennlp.training.trainer - Training
  0%|          | 0/842 [00:00<?, ?it/s]
[INFO/Process-3] starting worker 0
[INFO/Process-3:1] child process calling self.run()
[INFO/Process-3:1] reading instances from data/data_2.txt
Traceback (most recent call last):
  File "/home/livia/.local/bin/allennlp", line 10, in <module>
    sys.exit(run())
  File "/home/livia/.local/lib/python3.7/site-packages/allennlp/run.py", line 18, in run
    main(prog="allennlp")
  File "/home/livia/.local/lib/python3.7/site-packages/allennlp/commands/__init__.py", line 101, in main
    args.func(args)
  File "/home/livia/.local/lib/python3.7/site-packages/allennlp/commands/train.py", line 103, in train_model_from_args
    args.force)
  File "/home/livia/.local/lib/python3.7/site-packages/allennlp/commands/train.py", line 136, in train_model_from_file
    return train_model(params, serialization_dir, file_friendly_logging, recover, force)
  File "/home/livia/.local/lib/python3.7/site-packages/allennlp/commands/train.py", line 204, in train_model
    metrics = trainer.train()
  File "/home/livia/.local/lib/python3.7/site-packages/allennlp/training/trainer.py", line 538, in train
    train_metrics = self._train_epoch(epoch)
  File "/home/livia/.local/lib/python3.7/site-packages/allennlp/training/trainer.py", line 367, in _train_epoch
    loss = self.batch_loss(this_batch, for_training=True)
  File "/home/livia/.local/lib/python3.7/site-packages/allennlp/training/trainer.py", line 278, in batch_loss
    output_dict = self.model(**batch)
  File "/home/livia/.local/lib/python3.7/site-packages/torch/nn/modules/module.py", line 541, in __call__
    result = self.forward(*input, **kwargs)
  File "./kb/knowbert.py", line 916, in forward
    **soldered_kwargs)
  File "/home/livia/.local/lib/python3.7/site-packages/torch/nn/modules/module.py", line 541, in __call__
    result = self.forward(*input, **kwargs)
  File "./kb/knowbert.py", line 737, in forward
    candidate_segment_ids, **kwargs)
  File "/home/livia/.local/lib/python3.7/site-packages/torch/nn/modules/module.py", line 541, in __call__
    result = self.forward(*input, **kwargs)
  File "./kb/knowbert.py", line 623, in forward
    **kwargs
  File "/home/livia/.local/lib/python3.7/site-packages/torch/nn/modules/module.py", line 541, in __call__
    result = self.forward(*input, **kwargs)
  File "./kb/knowbert.py", line 493, in forward
    candidate_entity_embeddings = self.entity_embeddings(candidate_entities)
  File "/home/livia/.local/lib/python3.7/site-packages/torch/nn/modules/module.py", line 541, in __call__
    result = self.forward(*input, **kwargs)
  File "./kb/wordnet.py", line 807, in forward
    projected_entity_and_pos = self.dropout(self.proj_feed_forward(entity_and_pos.contiguous()))
  File "/home/livia/.local/lib/python3.7/site-packages/torch/nn/modules/module.py", line 541, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/livia/.local/lib/python3.7/site-packages/torch/nn/modules/linear.py", line 87, in forward
    return F.linear(input, self.weight, self.bias)
  File "/home/livia/.local/lib/python3.7/site-packages/torch/nn/functional.py", line 1370, in linear
    ret = torch.addmm(bias, input, weight.t())
RuntimeError: CUDA error: device-side assert triggered
[INFO/Process-3] process shutting down
[INFO/Process-3] calling join() for process Process-3:1
[INFO/Process-3:1] process shutting down
Process Process-3:1:
Traceback (most recent call last):
  File "/usr/lib/python3.7/multiprocessing/process.py", line 297, in _bootstrap
    self.run()
  File "/usr/lib/python3.7/multiprocessing/process.py", line 99, in run
    self._target(*self._args, **self._kwargs)
  File "/home/livia/.local/lib/python3.7/site-packages/allennlp/data/dataset_readers/multiprocess_dataset_reader.py", line 50, in _worker
    output_queue.put(instance)
  File "<string>", line 2, in put
  File "/usr/lib/python3.7/multiprocessing/managers.py", line 796, in _callmethod
    kind, result = conn.recv()
  File "/usr/lib/python3.7/multiprocessing/connection.py", line 250, in recv
    buf = self._recv_bytes()
  File "/usr/lib/python3.7/multiprocessing/connection.py", line 407, in _recv_bytes
    buf = self._recv(4)
  File "/usr/lib/python3.7/multiprocessing/connection.py", line 379, in _recv
    chunk = read(handle, remaining)
ConnectionResetError: [Errno 104] Connection reset by peer
[INFO/Process-3:1] process exiting with exitcode 1
Process Process-3:
Traceback (most recent call last):
  File "/usr/lib/python3.7/multiprocessing/process.py", line 297, in _bootstrap
    self.run()
  File "/usr/lib/python3.7/multiprocessing/process.py", line 99, in run
    self._target(*self._args, **self._kwargs)
  File "/home/livia/.local/lib/python3.7/site-packages/allennlp/data/iterators/multiprocess_iterator.py", line 48, in _queuer
    input_queue.put(instance)
  File "<string>", line 2, in put
  File "/usr/lib/python3.7/multiprocessing/managers.py", line 795, in _callmethod
    conn.send((self._id, methodname, args, kwds))
  File "/usr/lib/python3.7/multiprocessing/connection.py", line 206, in send
    self._send_bytes(_ForkingPickler.dumps(obj))
  File "/usr/lib/python3.7/multiprocessing/connection.py", line 397, in _send_bytes
    self._send(header)
  File "/usr/lib/python3.7/multiprocessing/connection.py", line 368, in _send
    n = write(self._handle, buf)
BrokenPipeError: [Errno 32] Broken pipe
[INFO/Process-3] process exiting with exitcode 1
Process Process-4:
Traceback (most recent call last):
  File "/usr/lib/python3.7/multiprocessing/process.py", line 297, in _bootstrap
    self.run()
  File "/usr/lib/python3.7/multiprocessing/process.py", line 99, in run
    self._target(*self._args, **self._kwargs)
  File "/home/livia/.local/lib/python3.7/site-packages/allennlp/data/iterators/multiprocess_iterator.py", line 32, in _create_tensor_dicts
    output_queue.put(tensor_dict)
  File "<string>", line 2, in put
  File "/usr/lib/python3.7/multiprocessing/managers.py", line 796, in _callmethod
    kind, result = conn.recv()
  File "/usr/lib/python3.7/multiprocessing/connection.py", line 250, in recv
    buf = self._recv_bytes()
  File "/usr/lib/python3.7/multiprocessing/connection.py", line 407, in _recv_bytes
    buf = self._recv(4)
  File "/usr/lib/python3.7/multiprocessing/connection.py", line 379, in _recv
    chunk = read(handle, remaining)
ConnectionResetError: [Errno 104] Connection reset by peer
  0%|          | 0/842 [00:09<?, ?it/s]

After this, the command stops and the connection is reset (I am using a remote machine).
Do you have any idea why I am getting these errors, and how I could solve them?

@markn, any ideas here? Looks like Matt Peters doesn’t have an account here that would let me ping him.