How to define special “untokenizable” words for nltk.word_tokenize

Submitted by 我们两清 on 2019-12-08 19:26:38

Question


I'm using nltk.word_tokenize to tokenize sentences that mention programming languages, frameworks, etc., and these names get tokenized incorrectly.

For example:

>>> tokenize.word_tokenize("I work with C#.")
['I', 'work', 'with', 'C', '#', '.']

Is there a way to pass a list of "exceptions" like this to the tokenizer? I have already compiled a list of all the terms (languages, etc.) that I don't want split.


Answer 1:


NLTK's Multi-Word Expression Tokenizer (nltk.tokenize.MWETokenizer) should be what you need.

Add each exception as a tuple of tokens, then pass the tokenizer the already-tokenized sentence:

>>> tokenizer = nltk.tokenize.MWETokenizer()
>>> tokenizer.add_mwe(('C', '#'))
>>> tokenizer.add_mwe(('F', '#'))
>>> tokenizer.tokenize(['I', 'work', 'with', 'C', '#', '.'])
['I', 'work', 'with', 'C_#', '.']
>>> tokenizer.tokenize(['I', 'work', 'with', 'F', '#', '.'])
['I', 'work', 'with', 'F_#', '.']
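
As a variation, you can chain word_tokenize with the MWE step so raw sentences go in and the exception terms come out whole. The sketch below is illustrative, not from the original answer: the EXCEPTIONS list and the tokenize_keeping_exceptions helper are made-up names, while separator (a real MWETokenizer parameter) is set to '' so the merged pieces read 'C#' instead of the default 'C_#':

from nltk.tokenize import MWETokenizer, word_tokenize

# Illustrative list of terms that must survive tokenization intact.
EXCEPTIONS = ["C#", "F#"]

# Pre-tokenize each exception with word_tokenize itself, so each MWE
# tuple matches exactly how word_tokenize splits that term.
mwe_tokenizer = MWETokenizer(
    [tuple(word_tokenize(term)) for term in EXCEPTIONS],
    separator="",  # join the merged pieces with nothing: 'C#' rather than 'C_#'
)

def tokenize_keeping_exceptions(sentence):
    # Hypothetical helper: regular tokenization, then re-merge exceptions.
    return mwe_tokenizer.tokenize(word_tokenize(sentence))

print(tokenize_keeping_exceptions("I work with C#."))
# ['I', 'work', 'with', 'C#', '.']

Note that MWETokenizer only re-merges exact, case-sensitive token sequences, so 'c#' would pass through unmerged unless you add that variant as well.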


Source: https://stackoverflow.com/questions/45618528/how-to-define-special-untokenizable-words-for-nltk-word-tokenize
