Python NLP Text Tokenization based on custom regex

Submitted by 五迷三道 on 2020-05-09 16:00:03

Question


I am processing a large amount of text for custom Named Entity Recognition (NER) using spaCy. For text pre-processing I am using nltk for tokenization, etc.

I am able to process one of my custom entities, which is based on simple strings. The other custom entity, however, is a combination of a number and certain text ('20 BBLs', for example). The word_tokenize method from nltk.tokenize splits '20' and 'BBLs' into two separate tokens. What I want is to treat the number and the 'BBLs' string as one token.

I am able to extract all the occurrences of this using regex:

re.findall(r'.\d+\s+BBL', Text)
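One possible way to keep the number and the unit together, sketched here with only the standard re module, is to tokenize with an alternation that tries the quantity pattern before falling back to ordinary words and punctuation. This is a minimal sketch under stated assumptions, not a drop-in replacement for word_tokenize: the pattern drops the leading `.` from the expression above (which would also capture the character in front of the number), allows an optional trailing "s", and the function name tokenize_with_units and the sample sentence are hypothetical.

```python
import re

# Assumption: a quantity is digits, whitespace, then "BBL" or "BBLs".
# Alternatives are tried left to right, so the quantity pattern wins
# over the generic word pattern whenever both could match.
TOKEN_PATTERN = re.compile(r"\d+\s+BBLs?\b|\w+|[^\w\s]")


def tokenize_with_units(text):
    """Return tokens, keeping '<number> BBL(s)' together as one token."""
    return TOKEN_PATTERN.findall(text)


# Hypothetical sample sentence for illustration only.
tokens = tokenize_with_units("Pumped 20 BBLs of mud, then 5 BBL more.")
print(tokens)
# ['Pumped', '20 BBLs', 'of', 'mud', ',', 'then', '5 BBL', 'more', '.']
```

The same alternation could likely be passed to nltk's RegexpTokenizer, which tokenizes by a user-supplied pattern. Decimal quantities such as "3.5 BBLs" are not handled by this pattern and would need a wider number expression.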

Note: I am doing this because spaCy's standard English NER model mistakenly recognizes these as 'Money' or 'Cardinal' named entities. I therefore want to re-train my custom model, so I need to feed it this pattern (the number and the 'BBLs' string) as one token that indicates my custom entity.
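For the re-training step, it may be enough to mark each match as a labelled character span rather than forcing it into a single token, since an entity annotation can cover several tokens. The sketch below assumes the (start, end, label) character-offset style used in spaCy training examples. The label "VOLUME", the function name find_entity_spans and the sample text are hypothetical, and the pattern is a simplified variant of the regex above without the leading `.`.

```python
import re

# Assumption: same quantity shape as in the question, digits + whitespace + BBL(s).
QUANTITY = re.compile(r"\d+\s+BBLs?\b")


def find_entity_spans(text, label="VOLUME"):
    """Return (start, end, label) character offsets for every quantity match.

    "VOLUME" is a hypothetical label name chosen for this illustration.
    """
    return [(m.start(), m.end(), label) for m in QUANTITY.finditer(text)]


# Hypothetical sample text for illustration only.
text = "Pumped 20 BBLs of mud."
spans = find_entity_spans(text)
print(spans)                # [(7, 14, 'VOLUME')]
print(text[7:14])           # 20 BBLs

# One training example in the (text, annotations) shape.
train_example = (text, {"entities": spans})
```

Whether the offsets line up with the tokenizer's token boundaries should be checked against the actual pipeline before training.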

Source: https://stackoverflow.com/questions/60414587/python-nlp-text-tokenization-based-on-custom-regex
