What should properly formatted data for NER in BERT look like?

Submitted by 别来无恙 on 2020-08-09 08:57:28

Question


I am using Huggingface's transformers library and want to perform NER using BERT. I have tried to find an explicit example of how to properly format the data for NER with BERT, but it is not entirely clear to me from the paper and the comments I've found.

Let's say we have the following sentence and labels:

sent = "John Johanson lives in Ramat Gan."
labels = ['B-PER', 'I-PER', 'O', 'O', 'B-LOC', 'I-LOC']

Would the data that we input to the model be something like this:

sent = ['[CLS]', 'john', 'johan',  '##son', 'lives',  'in', 'ramat', 'gan', '.', '[SEP]']
labels = ['O', 'B-PER', 'I-PER', 'I-PER', 'O', 'O', 'B-LOC', 'I-LOC', 'O', 'O']
attention_mask = [0, 1, 1, 1, 1, 1, 1, 1, 1, 0]
sentence_id = [0, 0, 0, 0, 0, 0, 0, 0, 0, 0]

?

Thank you!


Answer 1:


There is actually a great tutorial for the NER example on the huggingface documentation page, and it also explains in detail how the provided script does the preprocessing. In particular, it links to an external contributor's preprocess.py script, which converts the data from the CoNLL 2003 format into whatever the huggingface library requires. I found this the easiest way to verify that my formatting was correct, and unless you have specific changes you want to incorporate, it gets you started quickly without worrying about implementation details. An example of what that pre-processed format looks like is sketched below.
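If you want to sanity-check your files without running the script, the pre-processed input is essentially the CoNLL layout: one token and its tag per line, separated by whitespace, with a blank line between sentences. For your example it would look roughly like this (note the trailing period gets its own O line; this is my reading of the script's expected format, so double-check against preprocess.py itself):

John B-PER
Johanson I-PER
lives O
in O
Ramat B-LOC
Gan I-LOC
. O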

The linked example script also provides more than enough detail on how to feed the respective inputs into the model itself; checking around line 192 basically confirms the feeding pattern you describe above, together with the comment provided by @Jindřich.
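For completeness, here is a minimal sketch of that feeding pattern with a recent transformers version and a fast tokenizer. The model name, label set, and the label-only-the-first-subword convention are my own choices for illustration, not something mandated by the script. Note that the attention mask the tokenizer produces is 1 for every real token, including [CLS] and [SEP] (it masks padding, not special tokens), and that ignored positions get label -100, which PyTorch's cross-entropy loss skips:

import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

# Hypothetical label set for this example; a real dataset defines its own.
label_list = ["O", "B-PER", "I-PER", "B-LOC", "I-LOC"]
label2id = {label: i for i, label in enumerate(label_list)}

words = ["John", "Johanson", "lives", "in", "Ramat", "Gan", "."]
word_labels = ["B-PER", "I-PER", "O", "O", "B-LOC", "I-LOC", "O"]

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
enc = tokenizer(words, is_split_into_words=True, return_tensors="pt")

# Align word-level labels to subword tokens: label the first piece of each
# word; [CLS], [SEP], and continuation pieces get -100 so the loss ignores them.
labels, prev_word_id = [], None
for word_id in enc.word_ids(batch_index=0):
    if word_id is None or word_id == prev_word_id:
        labels.append(-100)
    else:
        labels.append(label2id[word_labels[word_id]])
    prev_word_id = word_id

model = AutoModelForTokenClassification.from_pretrained(
    "bert-base-cased", num_labels=len(label_list))
outputs = model(**enc, labels=torch.tensor([labels]))
print(outputs.loss)

An equally common convention, and the one in your example, is to propagate the word's label (or its I- variant) to the continuation pieces instead of -100; both work, as long as you handle the sub-token predictions consistently at evaluation time.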



Source: https://stackoverflow.com/questions/60220842/how-should-properly-formatted-data-for-ner-in-bert-look-like
