nltk tokenization and contractions


执念已碎 2021-02-19 01:14

I'm tokenizing text with nltk, just sentences fed to wordpunct_tokenizer. This splits contractions (e.g. 'don't' to 'don' + "'" + 't') but I want to keep them as one w

3 answers

无人共我 (original poster)

2021-02-19 01:35

Which tokenizer you use really depends on what you want to do next. As inspectorG4dget said, some part-of-speech taggers handle split contractions, and in that case the splitting is a good thing. But maybe that's not what you want. To decide which tokenizer is best, consider what you need for the next step, and then submit your text to http://text-processing.com/demo/tokenize/ to see how each NLTK tokenizer behaves.
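To make the comparison concrete, here is a minimal sketch using only Python's re module rather than NLTK itself. It relies on the fact that NLTK documents WordPunctTokenizer as a regexp tokenizer with the pattern \w+|[^\w\s]+. The second pattern, the tokenize helper, and the sample sentence are illustrative assumptions, not part of NLTK.

```python
import re

# Pattern documented for NLTK's WordPunctTokenizer: runs of word
# characters, or runs of non-space punctuation.
WORDPUNCT = re.compile(r"\w+|[^\w\s]+")

# Hypothetical variant (an assumption for illustration): a word may
# continue across an ASCII apostrophe when more word characters follow.
KEEP_CONTRACTIONS = re.compile(r"\w+(?:'\w+)*|[^\w\s]+")


def tokenize(pattern, text):
    """Return every non-overlapping match of pattern in text, in order."""
    return pattern.findall(text)


sample = "I don't know, it's fine."

split_tokens = tokenize(WORDPUNCT, sample)
joined_tokens = tokenize(KEEP_CONTRACTIONS, sample)

print(split_tokens)   # contractions come out as three tokens each
print(joined_tokens)  # contractions stay whole
```

If the second behaviour is what the next step needs, the same pattern string could probably be handed to nltk.tokenize.RegexpTokenizer instead of re.compile. Note that this sketch only handles the ASCII apostrophe, so typographic apostrophes (’) and trailing possessives such as dogs' would still be split.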
