Right now the bert-pretrained indexer returns either starting or ending offsets of the input tokens into the wordpiece array. Is there a way to get an array that maps each wordpiece position to its corresponding token index?
For example, if the input tokens are
[This, is, good] and output wordpieces are
[[CLS], This, is, g, ##ood, [SEP]], then right now the bert starting offsets are
[1, 2, 3], which is an array of size
(B, Num tokens).
I am looking for the following array:
[-1, 0, 1, 2, 2, -1] which would be an array of size
(B, Num wordpieces) and map each wordpiece to the index of the token it comes from (here I am assuming the start and end tokens map to -1, but that's not required).
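
To make the request concrete, here is a minimal sketch of the mapping I have in mind, written in plain Python. The function name `wordpiece_to_token_map` and its parameters are hypothetical and not part of the indexer. It assumes both the starting and ending offsets are available for each token and that the ending offsets are inclusive (they point at the last wordpiece of the token).

```python
from typing import List


def wordpiece_to_token_map(
    start_offsets: List[int],
    end_offsets: List[int],
    num_wordpieces: int,
    fill_value: int = -1,
) -> List[int]:
    """Map each wordpiece position to the index of the token it came from.

    Hypothetical helper, not an existing library function. Assumes
    end_offsets are inclusive. Positions not covered by any token
    (e.g. [CLS] and [SEP]) keep fill_value.
    """
    mapping = [fill_value] * num_wordpieces
    for token_index, (start, end) in enumerate(zip(start_offsets, end_offsets)):
        # Every wordpiece between start and end (inclusive) belongs to this token.
        for position in range(start, end + 1):
            mapping[position] = token_index
    return mapping


# Tokens [This, is, good] with 6 wordpieces including [CLS] and [SEP].
example = wordpiece_to_token_map([1, 2, 3], [1, 2, 4], num_wordpieces=6)
print(example)
```

For a batch, the same function could be applied per instance and the results padded to the longest wordpiece sequence to get an array of size (B, Num wordpieces).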