Question
I am processing a large amount of text for custom Named Entity Recognition (NER) with spaCy. For text pre-processing I am using nltk for tokenization, etc.
I am able to handle one of my custom entities, which is based on simple strings. But the other custom entity is a combination of a number and certain text ('20 BBLs', for example). The word_tokenize method from nltk.tokenize splits '20' and 'BBLs' into two separate tokens. What I want is to treat them (the number and the 'BBLs' string) as one token.
I am able to extract all the occurrences of this pattern using a regex:

import re
matches = re.findall(r'\d+\s+BBL', Text)
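A unit-preserving tokenizer built on the same idea can be sketched with plain re, without nltk: put the number-plus-unit pattern first in the alternation so it wins over the plain word branch. This is a minimal sketch, not nltk's word_tokenize; the sample sentence and token pattern are my own.

```python
import re

# Alternation order matters: try "<number> BBLs" first, then plain
# words, then single punctuation characters.
TOKEN_RE = re.compile(r'\d+\s+BBLs?|\w+|[^\w\s]')

def tokenize(text):
    """Return tokens, keeping '<number> BBLs' together as one token."""
    return TOKEN_RE.findall(text)

print(tokenize("The well produced 20 BBLs of oil."))
# ['The', 'well', 'produced', '20 BBLs', 'of', 'oil', '.']
```

nltk's RegexpTokenizer accepts the same kind of pattern, so the rule can be dropped into the existing nltk pipeline if preferred.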
Note: I am doing that because spaCy's standard English NER model mistakenly recognizes these as 'MONEY' or 'CARDINAL' named entities. So I want to re-train a custom model, and I need to feed it this pattern (the number and the 'BBLs' string) as one token that indicates my custom entity.
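For re-training, the regex spans can be turned directly into spaCy training annotations, since spaCy entity annotations use character offsets rather than tokens. A minimal sketch, assuming the spaCy v2 (text, annotations) tuple format; the 'VOLUME' label name is a placeholder I chose:

```python
import re

# Hypothetical entity label for the "<number> BBLs" pattern.
LABEL = "VOLUME"
PATTERN = re.compile(r'\d+\s+BBLs?')

def make_example(text):
    """Build one spaCy-v2-style training example from regex matches."""
    entities = [(m.start(), m.end(), LABEL) for m in PATTERN.finditer(text)]
    return (text, {"entities": entities})

print(make_example("Shipped 20 BBLs from the site."))
# ('Shipped 20 BBLs from the site.', {'entities': [(8, 15, 'VOLUME')]})
```

Because the annotation is offset-based, the custom tokenization and the entity span stay consistent as long as '20 BBLs' is a single token in the pipeline that consumes these examples.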
Source: https://stackoverflow.com/questions/60414587/python-nlp-text-tokenization-based-on-custom-regex