I’m interested in using a BERT model to classify pairs of text sequences, specifically the titles or abstracts of scientific papers. I’ve been working mostly from https://github.com/allenai/scibert so far, which only seems to have examples of classification using the [CLS] token (https://github.com/allenai/scibert/blob/master/scibert/models/bert_text_classifier.py#L76). In particular, is there a way to signal the segment difference to the BERT tokenizer? Unlike the [CLS] token, the [SEP] token isn’t guaranteed to be at a specific index position. Is there a way to find its position programmatically?
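To make the question concrete, here is a minimal sketch of what I have in mind, in plain Python. It assumes the pair has already been tokenized and joined in the usual BERT layout, `[CLS] A [SEP] B [SEP]`. The function names `sep_positions` and `segment_ids` and the example tokens are made up for illustration; none of this comes from SciBERT or AllenNLP.

```python
from typing import List

SEP = "[SEP]"


def sep_positions(tokens: List[str], sep: str = SEP) -> List[int]:
    """Return the index of every separator token in the sequence."""
    return [i for i, tok in enumerate(tokens) if tok == sep]


def segment_ids(tokens: List[str], sep: str = SEP) -> List[int]:
    """Assign 0 to everything up to and including the first separator, 1 afterwards."""
    ids = []
    segment = 0
    for tok in tokens:
        ids.append(segment)
        if tok == sep:
            # BERT only has two segment embeddings, so cap the id at 1.
            segment = min(segment + 1, 1)
    return ids


# Hypothetical, already-tokenized pair of titles.
tokens = ["[CLS]", "deep", "learning", "[SEP]", "neural", "networks", "[SEP]"]

print(sep_positions(tokens))  # [3, 6]
print(segment_ids(tokens))    # [0, 0, 0, 0, 1, 1, 1]
```

If something along these lines is reasonable, what I’m really after is the equivalent hook in the tokenizer or indexer, so that the segment ids reach the model alongside the token ids.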
Also, note that I’m using the same alternate AllenNLP branch as SciBERT, which may complicate things a bit. If anybody has advice pertaining to just the main AllenNLP branch, I’m happy (ok, well not delighted, but willing) to try sorting out whatever differences there are.