nltk tokenization and contractions

执念已碎 2021-02-19 01:14

I'm tokenizing text with NLTK, feeding sentences to wordpunct_tokenize. This splits contractions (e.g. "don't" becomes 'don' + "'" + 't'), but I want to keep them as one word.
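
For reference, a minimal snippet (the sample sentence is invented) that reproduces the splitting behaviour described above:

    from nltk.tokenize import wordpunct_tokenize

    # wordpunct_tokenize splits on punctuation, so the apostrophe
    # inside a contraction becomes its own token.
    print(wordpunct_tokenize("I don't like it"))
    # -> ['I', 'don', "'", 't', 'like', 'it']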

3 Answers
  •  Happy的楠姐
    2021-02-19 01:39

    Because the number of contractions is fairly small, one way to do it is to search for each contraction and replace it with its full equivalent (e.g. "don't" to "do not"), and then feed the updated sentences into wordpunct_tokenize, as sketched below.
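
    A minimal sketch of this approach, assuming a small hand-built CONTRACTIONS mapping (the entries and the helper name expand_contractions are illustrative, not taken from the answer):

        import re
        from nltk.tokenize import wordpunct_tokenize

        # Illustrative mapping; extend it to cover the contractions in your text.
        CONTRACTIONS = {
            "don't": "do not",
            "can't": "cannot",
            "won't": "will not",
            "it's": "it is",
            "i'm": "i am",
        }

        def expand_contractions(sentence):
            # Replace each known contraction with its full form, case-insensitively.
            pattern = re.compile(
                r"\b(" + "|".join(re.escape(c) for c in CONTRACTIONS) + r")\b",
                re.IGNORECASE,
            )
            return pattern.sub(lambda m: CONTRACTIONS[m.group(0).lower()], sentence)

        tokens = wordpunct_tokenize(expand_contractions("I don't think it's broken."))
        print(tokens)  # ['I', 'do', 'not', 'think', 'it', 'is', 'broken', '.']

    This keeps each contraction as whole-word tokens; the trade-off is that the original surface form (e.g. "don't") is lost in the output.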
