Dataset reader, read, entity markers

I added entity markers to my dataset reader, but I get an error at `text_a = tokens[3]` (IndexError: list index out of range). I don’t know what I’m doing wrong. I’m using the BioRelEx dataset.

This is how I wrote _read:

        def _read(self, file_path):
            with open(cached_path(file_path), "r") as data_file:
                data = json.load(data_file)
            for item in data:
                text = item["text"]
                label = item.get("label")
                tokens = text.split('\t')
                text_a = tokens[3]
                text_b = tokens[4]
                # insert entity markers
                if self.entity_markers:
                    idx1, idx2 = [int(ind) for ind in tokens[2].split('-')]
                    tokens_a = text_a.strip().split()
                    tokens_b = text_b.strip().split()
                    tokens_a.insert(idx1, '[e1start]')
                    tokens_a.insert(idx1 + 2, '[e1end]')
                    tokens_b.insert(idx1, '[e2start]')
                    tokens_b.insert(idx1 + 2, '[e2end]')
                    text_a = ' '.join(tokens_a)
                    text_b = ' '.join(tokens_b)

                if label is not None:
                    if self._skip_label_indexing:
                        try:
                            label = int(label)
                        except ValueError:
                            raise ValueError(
                                "Labels must be integers if skip_label_indexing is True."
                            )
                    else:
                        label = str(label)
                instance = self.text_to_instance(text=text, label=label)
                if instance is not None:
                    yield instance

Does anyone have any idea what I might be doing wrong, or suggestions for what I could try? Thanks in advance for your time!


Your data is apparently not formatted the way you expect. You’re taking each `text` field and splitting it on tabs into what you assume are at least five pieces, but evidently there aren’t that many pieces in your data.
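To see why this raises exactly the error you reported, here is a minimal reproduction (the sentence is made up; any text without tab characters behaves the same way):

```python
# split('\t') on text containing no tabs returns a single piece,
# so tokens[3] is out of range.
text = "BRCA1 interacts with BARD1."
tokens = text.split('\t')
print(len(tokens))            # 1
try:
    text_a = tokens[3]
except IndexError as err:
    print(err)                # list index out of range
```

`str.split` never pads its result, so indexing past the number of actual pieces always raises `IndexError`.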

So in this case, should I not split the line at all?

I don’t know what your data looks like, so I can’t really answer that question for you. But the problem is a mismatch between how your data looks and how you are processing it.
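A quick way to find the mismatch is to inspect what `split('\t')` actually produces before you index into it. This is just a debugging sketch with a hypothetical record standing in for your real BioRelEx file:

```python
import json

# Hypothetical record standing in for the real file; replace with your data.
data = json.loads('[{"text": "BRCA1 interacts with BARD1.", "label": 1}]')

for item in data:
    tokens = item["text"].split('\t')
    print(repr(item["text"]), "->", len(tokens), "piece(s)")
    # tokens[3] and tokens[4] are only safe with at least 5 pieces
    print("safe to index:", len(tokens) >= 5)
```

`repr()` makes any tab characters visible as `\t`, so you can tell at a glance whether the `text` field is really tab-separated or whether the entity offsets live somewhere else in the record.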