nltk wordpunct_tokenize vs word_tokenize

上瘾入骨i 2021-02-05 09:07

Does anyone know the difference between nltk's wordpunct_tokenize and word_tokenize? I'm using nltk=3.2.4 and there's nothing in the docstring for wordpunct_tokenize that explains the difference.

2 Answers
  •  失恋的感觉
    2021-02-05 09:33

    wordpunct_tokenize is based on a simple regexp tokenization. It is defined as

    wordpunct_tokenize = WordPunctTokenizer().tokenize
    

    which you can find here. Basically it uses the regular expression \w+|[^\w\s]+ to split the input.
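    If you just want that behaviour without NLTK, the pattern is easy to reproduce with the re module. This is only a minimal sketch of the idea, not the actual NLTK source:

    import re

    def wordpunct_like(text):
        # Runs of word characters, or runs of non-word non-space characters,
        # each become a separate token -- same pattern as WordPunctTokenizer.
        return re.findall(r"\w+|[^\w\s]+", text)

    print(wordpunct_like("Don't tell her!"))
    # ['Don', "'", 't', 'tell', 'her', '!']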

    word_tokenize on the other hand is based on a TreebankWordTokenizer, see the docs here. It basically tokenizes text like in the Penn Treebank. Here is a silly example that should show how the two differ.

    sent = "I'm a dog and it's great! You're cool and Sandy's book is big. Don't tell her, you'll regret it! 'Hey', she'll say!"
    >>> word_tokenize(sent)
    ['I', "'m", 'a', 'dog', 'and', 'it', "'s", 'great', '!', 'You', "'re", 
     'cool', 'and', 'Sandy', "'s", 'book', 'is', 'big', '.', 'Do', "n't", 'tell',
     'her', ',', 'you', "'ll", 'regret', 'it', '!', "'Hey", "'", ',', 'she', "'ll", 'say', '!']
    >>> wordpunct_tokenize(sent)
    ['I', "'", 'm', 'a', 'dog', 'and', 'it', "'", 's', 'great', '!', 'You', "'",
     're', 'cool', 'and', 'Sandy', "'", 's', 'book', 'is', 'big', '.', 'Don',
     "'", 't', 'tell', 'her', ',', 'you', "'", 'll', 'regret', 'it', '!', "'", 
     'Hey', "',", 'she', "'", 'll', 'say', '!']
    

    As we can see, wordpunct_tokenize will split pretty much at all special symbols and treat them as separate units. word_tokenize on the other hand keeps things like 're together. It doesn't seem to be all that smart though, since as we can see it fails to separate the initial single quote from 'Hey'.
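    For context, word_tokenize is (roughly) a thin wrapper that first splits the text into sentences and then applies the Treebank tokenizer to each sentence. A simplified sketch of that behaviour, not the exact library code:

    from nltk.tokenize import sent_tokenize, TreebankWordTokenizer

    _treebank = TreebankWordTokenizer()

    def word_tokenize_sketch(text, language='english'):
        # Sentence-split first, then Treebank-tokenize each sentence.
        return [tok for sent in sent_tokenize(text, language)
                for tok in _treebank.tokenize(sent)]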

    Interestingly, if we write the sentence like this instead (single quotes as string delimiter and double quotes around "Hey"):

    sent = 'I\'m a dog and it\'s great! You\'re cool and Sandy\'s book is big. Don\'t tell her, you\'ll regret it! "Hey", she\'ll say!'
    

    we get

    >>> word_tokenize(sent)
    ['I', "'m", 'a', 'dog', 'and', 'it', "'s", 'great', '!', 'You', "'re", 
     'cool', 'and', 'Sandy', "'s", 'book', 'is', 'big', '.', 'Do', "n't", 
     'tell', 'her', ',', 'you', "'ll", 'regret', 'it', '!', '``', 'Hey', "''", 
     ',', 'she', "'ll", 'say', '!']
    

    So word_tokenize does split off double quotes; however, it also converts them to `` and ''. wordpunct_tokenize doesn't do this:

    >>> wordpunct_tokenize(sent)
    ['I', "'", 'm', 'a', 'dog', 'and', 'it', "'", 's', 'great', '!', 'You', "'", 
     're', 'cool', 'and', 'Sandy', "'", 's', 'book', 'is', 'big', '.', 'Don', 
     "'", 't', 'tell', 'her', ',', 'you', "'", 'll', 'regret', 'it', '!', '"', 
     'Hey', '",', 'she', "'", 'll', 'say', '!']
    
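    If those Treebank-style quote tokens get in the way downstream, they are easy to map back to plain double quotes with a small helper (this is just an illustrative post-processing step, not part of NLTK):

    def restore_quotes(tokens):
        # Map the Treebank conventions `` and '' back to a plain " character.
        return ['"' if tok in ('``', "''") else tok for tok in tokens]

    >>> restore_quotes(word_tokenize(sent))
    ['I', "'m", 'a', 'dog', 'and', 'it', "'s", 'great', '!', 'You', "'re",
     'cool', 'and', 'Sandy', "'s", 'book', 'is', 'big', '.', 'Do', "n't",
     'tell', 'her', ',', 'you', "'ll", 'regret', 'it', '!', '"', 'Hey', '"',
     ',', 'she', "'ll", 'say', '!']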
