How to shuffle the data each epoch when lazy=True

Hi,

Is there a canonical way in AllenNLP to shuffle the data of a lazy dataset reader at the start of each epoch? (I am using the new PyTorch data-loading API, so under the hood this would be an IterableDataset.)

Because I can fit my raw dataset into memory but cannot fit a List[Instance] of that raw data into memory, I naively tried something like the following:

@overrides
def _read(self, file_path: str) -> Iterable[Instance]:
    with open(cached_path(file_path), "r") as data_file:
        logger.info("Reading instances from lines in file at: %s", file_path)
        # The raw lines fit into memory; it's my `List[Instance]` that doesn't
        data_file = list(data_file)
        np.random.shuffle(data_file)

        for text in data_file:
            yield self.text_to_instance(text)

but I now realize this won’t work (confirmed by printing data_file[0] at the start of each epoch and seeing the same line every time).

I feel like this should be straightforward, but I can’t seem to figure it out within the context of an AllenNLP DatasetReader. Would appreciate any suggestions!
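Not an AllenNLP-specific answer, but one generic pattern for shuffling a lazy iterator without materializing every Instance is a bounded shuffle buffer (the same idea as TensorFlow's shuffle buffer). A minimal sketch, with a function name and parameters of my own choosing, not from any library:

```python
import random
from typing import Iterable, Iterator, Optional, TypeVar

T = TypeVar("T")

def shuffled(iterable: Iterable[T],
             buffer_size: int = 1000,
             rng: Optional[random.Random] = None) -> Iterator[T]:
    """Yield items in approximately random order using a bounded buffer.

    Only `buffer_size` items are held in memory at once, so this works
    for streams too large to shuffle globally. Larger buffers give a
    more thorough shuffle.
    """
    rng = rng or random.Random()
    buf: list = []
    for item in iterable:
        buf.append(item)
        if len(buf) > buffer_size:
            # Swap a random buffered element to the end and yield it.
            i = rng.randrange(len(buf))
            buf[i], buf[-1] = buf[-1], buf[i]
            yield buf.pop()
    # Drain whatever is left in a fully random order.
    rng.shuffle(buf)
    yield from buf
```

Inside `_read` you could then write `for text in shuffled(data_file, buffer_size=10_000): yield self.text_to_instance(text)`. Note the shuffle is only local (items can move at most ~`buffer_size` positions), which is usually an acceptable trade-off for streaming data.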

Hmm, I don’t know enough about how IterableDataset behaves to answer this question, but the behavior you’re describing definitely seems like a bug. Can you open an issue on GitHub? If you can make a simple failing test case, that would also be super helpful.


Weird. When I replace np.random.shuffle() with random.shuffle(), I do in fact yield Instances in a random order every epoch. Need to look into the minutiae of these two functions to figure out why.
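For what it's worth, one known gotcha that could explain this (my guess, not verified against this exact setup): when a DataLoader spawns worker processes, PyTorch reseeds Python's `random` module in each worker, but historically it did not reseed NumPy's global RNG, so every worker inherits the same NumPy state from the parent and `np.random.shuffle` produces the same permutation each time (newer PyTorch versions do seed NumPy in workers). The usual workaround is a `worker_init_fn` that seeds NumPy per worker. A sketch, with a hypothetical helper name, deriving the seed from Python's `random` (which PyTorch does reseed per worker):

```python
import random

import numpy as np

def seed_numpy_per_worker(worker_id: int) -> None:
    # Hypothetical helper: pass as `worker_init_fn=seed_numpy_per_worker`
    # when constructing the DataLoader. Since PyTorch reseeds Python's
    # `random` in each worker, drawing from it gives each worker a
    # distinct NumPy seed.
    np.random.seed(random.getrandbits(32))
```

The more common recipe in PyTorch examples derives the seed from `torch.initial_seed()` instead; either way, the point is that NumPy's global RNG must be explicitly re-seeded per worker.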

Sorry, this might be a false alarm. If I determine that the bug is AllenNLP specific I will open an issue!