Question
I am using Huggingface's transformers library and want to perform NER using BERT. I have tried to find an explicit example of how to properly format the data for NER using BERT, but it is not entirely clear to me from the paper or the comments I've found.
Let's say we have the following sentence and labels:
sent = "John Johanson lives in Ramat Gan."
labels = ['B-PER', 'I-PER', 'O', 'O', 'B-LOC', 'I-LOC']
Would data that we input to the model be something like this:
sent = ['[CLS]', 'john', 'johan', '##son', 'lives', 'in', 'ramat', 'gan', '.', '[SEP]']
labels = ['O', 'B-PER', 'I-PER', 'I-PER', 'O', 'O', 'B-LOC', 'I-LOC', 'O', 'O']
attention_mask = [0, 1, 1, 1, 1, 1, 1, 1, 1, 0]
sentence_id = [0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
?
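For reference, the subword split above can be reproduced with the library's tokenizer (a minimal sketch; the bert-base-uncased checkpoint is my assumption, and the exact split depends on its vocabulary):

```python
from transformers import BertTokenizer

# bert-base-uncased is an assumption; any BERT checkpoint works the same way
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

sent = "John Johanson lives in Ramat Gan."
tokens = tokenizer.tokenize(sent)
# WordPiece splits out-of-vocabulary words into subword pieces,
# e.g. "Johanson" -> "johan", "##son" with this vocabulary
print(tokens)
```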
Thank you!
Answer 1:
There is actually a great tutorial for the NER example on the huggingface documentation page. It goes into detail about how the provided script does the preprocessing; in particular, it links to an external contributor's preprocess.py script that converts data from the CoNLL 2003 format into whatever the huggingface library requires. I found this to be the easiest way to make sure I had the formatting right, and unless you have specific changes you want to incorporate, it gets you started quickly without worrying about implementation details.
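For a sense of what that conversion starts from, here is a minimal sketch of reading CoNLL 2003-style data (one token per line, whitespace-separated columns with the NER tag last, blank lines between sentences); this is not the linked preprocess.py, just an illustration of the input format:

```python
def read_conll(path):
    """Return a list of (words, labels) pairs, one per sentence."""
    sentences, words, labels = [], [], []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            # Blank lines and -DOCSTART- markers separate sentences/documents
            if not line or line.startswith("-DOCSTART-"):
                if words:
                    sentences.append((words, labels))
                    words, labels = [], []
                continue
            cols = line.split()
            words.append(cols[0])    # the token is the first column
            labels.append(cols[-1])  # the NER tag is the last column
    if words:
        sentences.append((words, labels))
    return sentences
```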
The linked example script also provides more than enough detail on how to feed the respective inputs into the model itself; checking around line 192 basically confirms the feeding pattern you describe above, together with the comment provided by @Jindřich.
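As a rough illustration of that pattern (a minimal sketch, not the script itself; treating the period as its own "O"-labeled word and repeating each word's label on its continuation subwords are my assumptions, whereas the official script assigns continuations the loss's ignore index, e.g. -100, so they are skipped):

```python
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

# Word-level input as it comes out of CoNLL-style preprocessing
words  = ["John", "Johanson", "lives", "in", "Ramat", "Gan", "."]
labels = ["B-PER", "I-PER", "O", "O", "B-LOC", "I-LOC", "O"]

tokens, token_labels = [], []
for word, label in zip(words, labels):
    subwords = tokenizer.tokenize(word)
    tokens.extend(subwords)
    # Repeat the word's label on its continuation subwords here;
    # the provided script masks them out with an ignore index instead
    token_labels.extend([label] * len(subwords))

# Special tokens get a dummy label
tokens = ["[CLS]"] + tokens + ["[SEP]"]
token_labels = ["O"] + token_labels + ["O"]

input_ids = tokenizer.convert_tokens_to_ids(tokens)
attention_mask = [1] * len(input_ids)  # 1 for every real token, including [CLS]/[SEP]
token_type_ids = [0] * len(input_ids)  # all 0 for a single-sentence input
```

Note that the attention mask is 1 for [CLS] and [SEP] as well; it only becomes 0 on padding positions added to reach a fixed sequence length.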
Source: https://stackoverflow.com/questions/60220842/how-should-properly-formatted-data-for-ner-in-bert-look-like