Question
My program has a list of words, and among them a few specific multiword terms need to be tokenized as a single word.
My program splits a string into words, e.g.
str="hello my name is vishal, can you please help me with the red blood cells and platelet count. The white blood cell is a single word."
The output will be
list=['hello','my','name','is','vishal','can','you','please','help','me','with','the','red','blood','cells','and','platelet','count','the','white','blood','cell','is','a','single','word']
Now I want terms such as 'red blood cells' to be tokenized as a single token. There are many such terms in my list made up of two or more words, such as 'platelet count', 'white blood cell', etc. Any suggestions for doing that?
Answer 1:
N-grams can be used to group consecutive words that occur together often in a corpus of text. Here's a Wikipedia article about N-grams.
To implement this in scikit-learn, set the ngram_range parameter (on e.g. CountVectorizer or TfidfVectorizer) to the range of N-grams (unigrams, bigrams, trigrams, ...) needed for the task; in your case, ngram_range=(1, 3).
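To illustrate what ngram_range=(1, 3) produces without pulling in scikit-learn, here is a minimal pure-Python sketch of the same N-gram generation (the helper name ngrams is my own, not part of any library):

```python
def ngrams(tokens, n_min=1, n_max=3):
    """Yield all n-grams from n_min to n_max words long,
    mirroring scikit-learn's ngram_range=(1, 3)."""
    for n in range(n_min, n_max + 1):
        for i in range(len(tokens) - n + 1):
            yield " ".join(tokens[i:i + n])

tokens = "the red blood cells".split()
print(list(ngrams(tokens)))
# → ['the', 'red', 'blood', 'cells', 'the red', 'red blood',
#    'blood cells', 'the red blood', 'red blood cells']
```

Note that this only enumerates candidate phrases; scikit-learn's vectorizers additionally count how often each N-gram occurs, which is what lets frequent phrases like 'red blood cells' stand out.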
Answer 2:
You can use NLTK's Multi-Word Expression Tokenizer, MWETokenizer:
from nltk.tokenize import MWETokenizer

text = "hello my name is vishal, can you please help me with the red blood cells and platelet count. The white blood cell is a single word."
tokenizer = MWETokenizer()
# register each multiword expression as a tuple of its tokens
tokenizer.add_mwe(('red', 'blood', 'cells'))
tokenizer.add_mwe(('white', 'blood', 'cell'))
tokenizer.add_mwe(('platelet', 'count.'))  # trailing period matches the whitespace-split token
print(tokenizer.tokenize(text.split()))
Output
`['hello', 'my', 'name', 'is', 'vishal,', 'can', 'you', 'please', 'help', 'me', 'with', 'the', 'red_blood_cells', 'and', 'platelet_count.', 'The', 'white_blood_cell', 'is', 'a', 'single', 'word.']`
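Because the example splits on whitespace, punctuation stays attached to tokens (hence the 'platelet count.' pattern with a trailing period). A hedged pure-Python sketch of the same idea that strips punctuation first, so the multiword patterns can be written cleanly; the helpers clean_tokens and merge_mwes are my own names, and merge_mwes mimics MWETokenizer's greedy left-to-right matching rather than calling NLTK:

```python
import string

def clean_tokens(text):
    # lowercase and strip surrounding punctuation so MWE patterns match uniformly
    words = (w.strip(string.punctuation).lower() for w in text.split())
    return [w for w in words if w]

def merge_mwes(tokens, mwes, sep="_"):
    """Greedily merge multiword expressions left to right,
    joining matched tuples with sep (as MWETokenizer does)."""
    out, i = [], 0
    while i < len(tokens):
        for mwe in mwes:
            if tuple(tokens[i:i + len(mwe)]) == mwe:
                out.append(sep.join(mwe))
                i += len(mwe)
                break
        else:
            out.append(tokens[i])
            i += 1
    return out

text = ("hello my name is vishal, can you please help me with the "
        "red blood cells and platelet count. The white blood cell is a single word.")
mwes = [('red', 'blood', 'cells'), ('white', 'blood', 'cell'), ('platelet', 'count')]
print(merge_mwes(clean_tokens(text), mwes))
# → ['hello', 'my', 'name', 'is', 'vishal', 'can', 'you', 'please', 'help',
#    'me', 'with', 'the', 'red_blood_cells', 'and', 'platelet_count', 'the',
#    'white_blood_cell', 'is', 'a', 'single', 'word']
```

With the punctuation removed up front, 'platelet count' can be registered without the period hack, and the result matches the clean token list from the question.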
Source: https://stackoverflow.com/questions/59530543/word-tokeinizing-from-the-list-of-words-in-python