Is there a canonical way in AllenNLP to shuffle the data of a lazy dataset reader before the beginning of each epoch? (I am using the new PyTorch dataloading API, so under the hood this would be an `IterableDataset`.)
Because my raw dataset fits into memory but a `List[Instance]` built from it does not, I naively tried something like the following:
```python
@overrides
def _read(self, file_path: str) -> Iterable[Instance]:
    with open(cached_path(file_path), "r") as data_file:
        logger.info("Reading instances from lines in file at: %s", file_path)
        # This fits into memory; it's my `List[Instance]` that doesn't
        data_file = list(enumerate(data_file))
        np.random.shuffle(data_file)
        data_file = iter(data_file)
        for idx, text in data_file:
            yield self.text_to_instance(text)
```
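For what it's worth, the shuffling step itself behaves as expected in isolation. Here is a minimal standalone sketch (plain Python, no AllenNLP; the `lines` list is a hypothetical stand-in for the lines of my data file):

```python
import random

# Hypothetical stand-in for the lines of the data file.
lines = [f"example {i}" for i in range(5)]

# Pair each line with its original index, then shuffle the pairs,
# mirroring the list/enumerate/shuffle pattern above.
indexed = list(enumerate(lines))
random.shuffle(indexed)

# Iterating yields the same items, just in a new order.
assert sorted(idx for idx, _ in indexed) == list(range(5))
assert {text for _, text in indexed} == set(lines)
```

So the in-memory shuffle is fine on its own; the problem seems to be how (or whether) `_read` is re-invoked each epoch.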
but I now realize this won't work (confirmed by printing `data_file` at the beginning of each epoch; the order was identical every time).
I feel like this should be straightforward, but I can't seem to figure it out within the context of an AllenNLP `DatasetReader`. I would appreciate any suggestions!