Best dataset output format for annotation tooling

Hi everyone! I’m a maintainer on the Universal Data Tool project, and I’m trying to build out more export options. While looking for standard NLP output formats, I stumbled upon AllenNLP. I’m curious which export formats users prefer.

Single-label and multi-label classification seem to fit nicely into a TSV, but NER and dependency/relationship annotations seem more awkward and less standardized. Our default JSON export looks like this, but I’d prefer not to force “yet another format” on people.

Any thoughts appreciated!

I’d recommend looking into the CoNLL format, which is pretty widely used for annotating a large number of tasks. You can see an example of that format here, and we have a reader for data formatted this way here, which describes the format in some detail.
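
For a rough idea of what an NER export in this style could look like, here is a minimal sketch. It assumes the CoNLL-2003 layout of four whitespace-separated columns per token (token, POS tag, chunk tag, NER tag) with a blank line between sentences, and BIO tags for entities. The function names, the `(start, end, label)` span representation, and the `_` placeholder for the POS and chunk columns are all assumptions made for illustration; they are not taken from Universal Data Tool or AllenNLP.

```python
from typing import List, Tuple

# Hypothetical span type: (start_token, end_token_exclusive, label).
Span = Tuple[int, int, str]


def spans_to_bio(tokens: List[str], spans: List[Span]) -> List[str]:
    """Turn token-index spans into one BIO tag per token."""
    tags = ["O"] * len(tokens)
    for start, end, label in spans:
        tags[start] = f"B-{label}"
        for i in range(start + 1, end):
            tags[i] = f"I-{label}"
    return tags


def to_conll(sentences: List[Tuple[List[str], List[Span]]]) -> str:
    """Serialize sentences as CoNLL-2003-style text.

    POS and chunk columns are filled with "_" because this sketch
    only has entity annotations (an assumption, not a requirement
    of any particular reader).
    """
    blocks = []
    for tokens, spans in sentences:
        tags = spans_to_bio(tokens, spans)
        lines = [f"{tok} _ _ {tag}" for tok, tag in zip(tokens, tags)]
        blocks.append("\n".join(lines))
    return "\n\n".join(blocks) + "\n"


sample = [
    (["Matt", "works", "at", "AllenNLP", "."], [(0, 1, "PER"), (3, 4, "ORG")]),
]
print(to_conll(sample))
```

Overlapping or nested spans are not handled here; a later span would simply overwrite the tags of an earlier one, so a real exporter would need to decide how to treat them.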


Thanks @mattg! Those links will be perfect for testing. I’ve updated our GitHub issue about CoNLL. Glad to hear it’s well supported in AllenNLP! We can run our output against the dataset reader to verify we’re doing it right.
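
Before wiring up the real dataset reader, a quick local sanity check on exported files might look like the sketch below. This is a plain-Python stand-in and not the AllenNLP reader: it assumes four whitespace-separated columns per token, blank lines between sentences, optional `-DOCSTART-` lines, and BIO tags in the last column. The names `parse_conll` and `check_bio` are hypothetical.

```python
from typing import List, Tuple

Sentence = List[Tuple[str, str]]  # (token, NER tag) pairs


def parse_conll(text: str, n_columns: int = 4) -> List[Sentence]:
    """Split CoNLL-style text into sentences, checking the column count."""
    sentences: List[Sentence] = []
    current: Sentence = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        line = line.strip()
        # Blank lines and document markers both end the current sentence.
        if not line or line.startswith("-DOCSTART-"):
            if current:
                sentences.append(current)
                current = []
            continue
        fields = line.split()
        if len(fields) != n_columns:
            raise ValueError(
                f"line {line_no}: expected {n_columns} columns, got {len(fields)}"
            )
        current.append((fields[0], fields[-1]))
    if current:
        sentences.append(current)
    return sentences


def check_bio(sentence: Sentence) -> None:
    """Raise if an I- tag does not continue a span with the same label."""
    previous = "O"
    for token, tag in sentence:
        if tag.startswith("I-"):
            label = tag[2:]
            if previous not in (f"B-{label}", f"I-{label}"):
                raise ValueError(f"{token!r}: {tag} does not continue a {label} span")
        previous = tag


exported = (
    "-DOCSTART- -X- -X- O\n\n"
    "Universal _ _ B-ORG\nData _ _ I-ORG\nTool _ _ I-ORG\n\n"
    "Hello _ _ O\n"
)
for sent in parse_conll(exported):
    check_bio(sent)
```

Passing this check only shows the file is internally consistent under the assumptions above; running it through the actual reader is still the real test.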