Question
My program has a list of words, and among them a few specific multiword terms need to be tokenized as a single word.
My program splits a string into words, e.g.
str="hello my name is vishal, can you please help me with the red blood cells and platelet count. The white blood cell is a single word."
The output will be
list=['hello','my','name','is','vishal','can','you','please','help','me','with','the','red','blood','cells','and','platelet','count','the','white','blood','cell','is','a','single','word']
Now I want terms such as 'red blood cells' to be tokenized as a single token. There are many such terms in my list made up of two or more words, such as 'platelet count', 'white blood cell', etc. Any suggestions for doing that?
Answer 1:
N-grams can be used to group consecutive words that occur together often in a corpus of text. Here's a Wikipedia article about N-grams.
To implement this in scikit-learn, set the ngram_range parameter (on e.g. CountVectorizer or TfidfVectorizer) to the range of N-grams (unigrams, bigrams, trigrams, ...) needed for the task; in your case, ngram_range=(1, 3).
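To illustrate what ngram_range=(1, 3) produces without pulling in scikit-learn, here is a minimal pure-Python sketch of the same N-gram generation (the helper name ngrams is my own, not part of any library):

```python
def ngrams(tokens, n_min=1, n_max=3):
    """Yield all n-grams from n_min to n_max words long,
    mirroring scikit-learn's ngram_range=(1, 3)."""
    for n in range(n_min, n_max + 1):
        for i in range(len(tokens) - n + 1):
            yield " ".join(tokens[i:i + n])

tokens = "the red blood cells".split()
print(list(ngrams(tokens)))
# → ['the', 'red', 'blood', 'cells', 'the red', 'red blood',
#    'blood cells', 'the red blood', 'red blood cells']
```

Note that this only enumerates candidate phrases; scikit-learn's vectorizers additionally count how often each N-gram occurs, which is what lets frequent phrases like 'red blood cells' stand out.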
Answer 2:
You can use NLTK's Multi-Word Expression Tokenizer, MWETokenizer:
from nltk.tokenize import MWETokenizer

text = "hello my name is vishal, can you please help me with the red blood cells and platelet count. The white blood cell is a single word."
tokenizer = MWETokenizer()
# register each multiword expression as a tuple of its tokens
tokenizer.add_mwe(('red', 'blood', 'cells'))
tokenizer.add_mwe(('white', 'blood', 'cell'))
tokenizer.add_mwe(('platelet', 'count.'))  # trailing period matches the whitespace-split token
print(tokenizer.tokenize(text.split()))
Output
`['hello', 'my', 'name', 'is', 'vishal,', 'can', 'you', 'please', 'help', 'me', 'with', 'the', 'red_blood_cells', 'and', 'platelet_count.', 'The', 'white_blood_cell', 'is', 'a', 'single', 'word.']`
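Because the example splits on whitespace, punctuation stays attached to tokens (hence the 'platelet count.' pattern with a trailing period). A hedged pure-Python sketch of the same idea that strips punctuation first, so the multiword patterns can be written cleanly; the helpers clean_tokens and merge_mwes are my own names, and merge_mwes mimics MWETokenizer's greedy left-to-right matching rather than calling NLTK:

```python
import string

def clean_tokens(text):
    # lowercase and strip surrounding punctuation so MWE patterns match uniformly
    words = (w.strip(string.punctuation).lower() for w in text.split())
    return [w for w in words if w]

def merge_mwes(tokens, mwes, sep="_"):
    """Greedily merge multiword expressions left to right,
    joining matched tuples with sep (as MWETokenizer does)."""
    out, i = [], 0
    while i < len(tokens):
        for mwe in mwes:
            if tuple(tokens[i:i + len(mwe)]) == mwe:
                out.append(sep.join(mwe))
                i += len(mwe)
                break
        else:
            out.append(tokens[i])
            i += 1
    return out

text = ("hello my name is vishal, can you please help me with the "
        "red blood cells and platelet count. The white blood cell is a single word.")
mwes = [('red', 'blood', 'cells'), ('white', 'blood', 'cell'), ('platelet', 'count')]
print(merge_mwes(clean_tokens(text), mwes))
# → ['hello', 'my', 'name', 'is', 'vishal', 'can', 'you', 'please', 'help',
#    'me', 'with', 'the', 'red_blood_cells', 'and', 'platelet_count', 'the',
#    'white_blood_cell', 'is', 'a', 'single', 'word']
```

With the punctuation removed up front, 'platelet count' can be registered without the period hack, and the result matches the clean token list from the question.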
Source: https://stackoverflow.com/questions/59530543/word-tokeinizing-from-the-list-of-words-in-python