SPECTER benchmarking best practice

I enjoyed reading the paper here about AllenAI’s SPECTER model:

Thank you for making the SPECTER model available at:

I benchmarked a number of text models a while back for journal recommendation using cosine similarity. The benchmark takes a published arXiv preprint as the query and retrieves the most cosine-similar preprints from the last two years; the rank of the first retrieved preprint that was published in the same journal serves as the score (sketched below). It's a biased and rough measure, but it gives a practical sense of how well each model recommends journals.
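
For concreteness, here is a minimal sketch of the scoring step. The function and variable names are made up for this post, and I'm assuming p@1 is reported as a percentage of queries:

```python
import numpy as np


def first_same_journal_rank(query_vec, query_journal, cand_vecs, cand_journals):
    """Rank the candidate pool by cosine similarity to the query and return
    the 1-based rank of the first candidate published in the same journal."""
    q = query_vec / np.linalg.norm(query_vec)
    c = cand_vecs / np.linalg.norm(cand_vecs, axis=1, keepdims=True)
    order = np.argsort(-(c @ q))  # most similar first
    for rank, idx in enumerate(order, start=1):
        if cand_journals[idx] == query_journal:
            return rank
    return None  # no same-journal candidate in the pool


def precision_at_1(ranks):
    """Percentage of queries whose top-ranked candidate came from the same journal."""
    ranks = [r for r in ranks if r is not None]
    return 100.0 * sum(r == 1 for r in ranks) / len(ranks)
```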

I have now plugged SPECTER embeddings into that same benchmarking code and, unfortunately, the results do not improve on Doc2Vec (p@1 = 16 for SPECTER vs. p@1 = 18 for Doc2Vec).
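
For context, this is roughly what I mean by "plugging in SPECTER embeddings": a sketch using the Hugging Face `allenai/specter` checkpoint with placeholder titles and abstracts (my actual pipeline may differ in the details):

```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("allenai/specter")
model = AutoModel.from_pretrained("allenai/specter")

# Placeholder papers; in practice these come from the arXiv metadata.
papers = [
    {"title": "Example title 1", "abstract": "Example abstract 1"},
    {"title": "Example title 2", "abstract": "Example abstract 2"},
]

# SPECTER takes "title [SEP] abstract" as input.
title_abs = [p["title"] + tokenizer.sep_token + p["abstract"] for p in papers]
inputs = tokenizer(title_abs, padding=True, truncation=True,
                   max_length=512, return_tensors="pt")
with torch.no_grad():
    output = model(**inputs)

# The [CLS] token embedding is used as the document representation.
embeddings = output.last_hidden_state[:, 0, :].numpy()
```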

Could someone recommend a better test to compare these models?