Does anyone know the difference between nltk's wordpunct_tokenize and word_tokenize? I'm using nltk=3.2.4 and there's noth
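wordpunct_tokenize is based on simple regexp tokenization. It is defined as
wordpunct_tokenize = WordPunctTokenizer().tokenize
which you can find here. Basically, it uses the regular expression \w+|[^\w\s]+ to split the input: every token is either a run of word characters or a run of characters that are neither word characters nor whitespace.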
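To see what that pattern does on its own, here is a minimal sketch that applies it with Python's re module instead of NLTK. The names WORDPUNCT_PATTERN and regex_wordpunct are made up for this illustration and are not part of NLTK; the sketch only assumes that collecting all matches of the pattern is what the tokenizer effectively does.

```python
import re

# The pattern quoted above: a run of word characters, or a run of
# characters that are neither word characters nor whitespace.
# (WORDPUNCT_PATTERN and regex_wordpunct are hypothetical names.)
WORDPUNCT_PATTERN = re.compile(r"\w+|[^\w\s]+")

def regex_wordpunct(text):
    """Return every non-overlapping match of the pattern, left to right."""
    return WORDPUNCT_PATTERN.findall(text)

print(regex_wordpunct("Don't tell her, you'll regret it! 'Hey', she'll say!"))
```

Note that adjacent punctuation characters such as ', end up in a single token, because [^\w\s]+ matches the whole run.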
word_tokenize, on the other hand, is based on a TreebankWordTokenizer; see the docs here. It basically tokenizes text like in the Penn Treebank. Here is a silly example that should show how the two differ.
>>> from nltk.tokenize import word_tokenize, wordpunct_tokenize
>>> sent = "I'm a dog and it's great! You're cool and Sandy's book is big. Don't tell her, you'll regret it! 'Hey', she'll say!"
>>> word_tokenize(sent)
['I', "'m", 'a', 'dog', 'and', 'it', "'s", 'great', '!', 'You', "'re",
'cool', 'and', 'Sandy', "'s", 'book', 'is', 'big', '.', 'Do', "n't", 'tell',
'her', ',', 'you', "'ll", 'regret', 'it', '!', "'Hey", "'", ',', 'she', "'ll", 'say', '!']
>>> wordpunct_tokenize(sent)
['I', "'", 'm', 'a', 'dog', 'and', 'it', "'", 's', 'great', '!', 'You', "'",
're', 'cool', 'and', 'Sandy', "'", 's', 'book', 'is', 'big', '.', 'Don',
"'", 't', 'tell', 'her', ',', 'you', "'", 'll', 'regret', 'it', '!', "'",
'Hey', "',", 'she', "'", 'll', 'say', '!']
As we can see, wordpunct_tokenize splits at pretty much every special symbol and treats each run of symbols as a separate unit (so adjacent symbols such as ', stay together in one token). word_tokenize, on the other hand, keeps things like 're together. It doesn't seem to be all that smart though, since it fails to separate the initial single quote from 'Hey'.
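The idea of keeping clitics such as 're or n't together can be sketched with a single regular expression. This is a deliberately simplified illustration, not NLTK's actual rule set: the pattern, the list of clitics, and the names CLITIC and split_clitics are assumptions made for this example, and punctuation is not split off at all.

```python
import re

# Hypothetical, simplified rule: a word character followed by a common
# English clitic (n't, 's, 'm, 'd, 'll, 're, 've) at a word boundary.
CLITIC = re.compile(r"(\w)(n't|'(?:s|m|d|ll|re|ve))\b", re.IGNORECASE)

def split_clitics(text):
    """Insert a space before each clitic, then split on whitespace."""
    return CLITIC.sub(r"\1 \2", text).split()

print(split_clitics("Don't tell her you'll regret it"))
```

Because the rule only fires when an apostrophe is followed by a known clitic, a quoted word like 'Hey' is left untouched by this sketch.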
Interestingly, if we write the sentence like this instead (single quotes as string delimiter and double quotes around "Hey"):
>>> sent = 'I\'m a dog and it\'s great! You\'re cool and Sandy\'s book is big. Don\'t tell her, you\'ll regret it! "Hey", she\'ll say!'
we get
>>> word_tokenize(sent)
['I', "'m", 'a', 'dog', 'and', 'it', "'s", 'great', '!', 'You', "'re",
'cool', 'and', 'Sandy', "'s", 'book', 'is', 'big', '.', 'Do', "n't",
'tell', 'her', ',', 'you', "'ll", 'regret', 'it', '!', '``', 'Hey', "''",
',', 'she', "'ll", 'say', '!']
so word_tokenize does split off double quotes; however, it also converts them to `` and ''. wordpunct_tokenize doesn't do this:
>>> wordpunct_tokenize(sent)
['I', "'", 'm', 'a', 'dog', 'and', 'it', "'", 's', 'great', '!', 'You', "'",
're', 'cool', 'and', 'Sandy', "'", 's', 'book', 'is', 'big', '.', 'Don',
"'", 't', 'tell', 'her', ',', 'you', "'", 'll', 'regret', 'it', '!', '"',
'Hey', '",', 'she', "'", 'll', 'say', '!']
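The quote conversion can be imitated with a few substitutions. The following is only a rough sketch of the idea, assuming that a double quote at the start of the text or right after a space or opening bracket is an opening quote and that every other double quote is a closing one; convert_double_quotes is a hypothetical helper, not an NLTK function, and it leaves all other punctuation attached to its word.

```python
import re

def convert_double_quotes(text):
    """Rewrite straight double quotes as `` (opening) and '' (closing)."""
    # Opening quote: at the very start of the text ...
    text = re.sub(r'^"', "`` ", text)
    # ... or right after a space or an opening bracket.
    text = re.sub(r'([ (\[{<])"', r"\1`` ", text)
    # Every double quote still left is treated as a closing quote.
    text = re.sub(r'"', " '' ", text)
    return text.split()

print(convert_double_quotes('regret it! "Hey", she said'))
```

Single quotes are not touched by these substitutions, which is consistent with the first example above, where 'Hey' did not get converted.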
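word_tokenize splits a sentence into word tokens following the Penn Treebank conventions, while wordpunct_tokenize splits it into runs of word characters and runs of punctuation. Neither of them removes non-English words from a sentence.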