Word tokenizing from a list of words in Python?

Submitted by 天涯浪子 on 2020-01-15 09:36:30

Question


My program has a list of words, and among them I need a few specific multi-word phrases to be tokenized as single tokens. My program splits a string into words, e.g.

text = "hello my name is vishal, can you please help me with the red blood cells and platelet count. The white blood cell is a single word."

The output will be

words = ['hello', 'my', 'name', 'is', 'vishal', 'can', 'you', 'please', 'help', 'me', 'with', 'the', 'red', 'blood', 'cells', 'and', 'platelet', 'count', 'the', 'white', 'blood', 'cell', 'is', 'a', 'single', 'word']

Now what I want is to tokenize phrases such as 'red blood cells' as a single token. There are many such phrases in my list, made up of three or more words that should be treated as one, such as 'platelet count', 'white blood cell', etc. Any suggestions for doing that?


Answer 1:


N-grams can be used to group together consecutive words that occur together often in a corpus of text. Here's a Wikipedia article about N-grams.

To implement this in scikit-learn, set the `ngram_range` parameter of the vectorizer (e.g. `CountVectorizer`) to the range of N-grams (bi-grams, tri-grams, ...) needed for the task; in your case, that is `ngram_range=(1, 3)`.
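As a quick illustration of what that parameter does, n-gram extraction can be sketched in plain Python (the `ngrams` helper below is illustrative, not part of any library):

```python
def ngrams(tokens, n):
    """Return all contiguous n-grams from a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "the red blood cells and platelet count".split()
# unigrams through trigrams, matching ngram_range=(1, 3)
all_grams = [g for n in (1, 2, 3) for g in ngrams(tokens, n)]
print(('red', 'blood', 'cells') in all_grams)  # True
```

Frequent trigrams like `('red', 'blood', 'cells')` can then be treated as candidate multi-word terms.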




Answer 2:


from nltk.tokenize import MWETokenizer

text = "hello my name is vishal, can you please help me with the red blood cells and platelet count. The white blood cell is a single word."
tokenizer = MWETokenizer()
# register each multi-word expression that should become a single token
tokenizer.add_mwe(('red', 'blood', 'cells'))
tokenizer.add_mwe(('white', 'blood', 'cell'))
# 'count.' keeps its trailing period because the text is split on whitespace only
tokenizer.add_mwe(('platelet', 'count.'))
print(tokenizer.tokenize(text.split()))

Output
`['hello', 'my', 'name', 'is', 'vishal,', 'can', 'you', 'please', 'help', 'me', 'with', 'the', 'red_blood_cells', 'and', 'platelet_count.', 'The', 'white_blood_cell', 'is', 'a', 'single', 'word.']`

You can use the Multi-Word Expression Tokenizer, MWETokenizer, from NLTK.
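Because MWETokenizer matches tokens exactly, the trailing period in `'count.'` above had to be made part of the expression. It is usually easier to lowercase the text and strip punctuation before tokenizing; a minimal sketch using only the standard library:

```python
import re

text = ("hello my name is vishal, can you please help me with the "
        "red blood cells and platelet count. The white blood cell "
        "is a single word.")
# lowercase and keep only alphabetic runs, so 'count.' becomes 'count'
tokens = re.findall(r"[a-z]+", text.lower())
print(tokens[:5])  # ['hello', 'my', 'name', 'is', 'vishal']
```

After this normalization the expressions can be registered without punctuation, e.g. `tokenizer.add_mwe(('platelet', 'count'))`.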



Source: https://stackoverflow.com/questions/59530543/word-tokeinizing-from-the-list-of-words-in-python
