nltk wordpunct_tokenize vs word_tokenize

上瘾入骨i  2021-02-05 09:07

Does anyone know the difference between nltk's wordpunct_tokenize and word_tokenize? I'm using nltk 3.2.4 and there's nothing in the docs that explains the difference.

2 Answers
  •  失恋的感觉
     2021-02-05 09:33

    wordpunct_tokenize is based on a simple regexp tokenization. It is defined as

    wordpunct_tokenize = WordPunctTokenizer().tokenize
    

    which you can find in the NLTK source. Basically it uses the regular expression \w+|[^\w\s]+ to split the input.
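    As a quick sanity check (my own sketch, not from the original answer), the same split can be reproduced with a RegexpTokenizer built from that pattern:

    from nltk.tokenize import RegexpTokenizer, WordPunctTokenizer

    sent = "Don't tell her!"

    # Both calls should produce ['Don', "'", 't', 'tell', 'her', '!']
    print(WordPunctTokenizer().tokenize(sent))
    print(RegexpTokenizer(r"\w+|[^\w\s]+").tokenize(sent))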

    word_tokenize, on the other hand, is based on a TreebankWordTokenizer (see the NLTK docs). It basically tokenizes text the way the Penn Treebank does. Here is a small example that shows how the two differ.

    sent = "I'm a dog and it's great! You're cool and Sandy's book is big. Don't tell her, you'll regret it! 'Hey', she'll say!"
    >>> word_tokenize(sent)
    ['I', "'m", 'a', 'dog', 'and', 'it', "'s", 'great', '!', 'You', "'re", 
     'cool', 'and', 'Sandy', "'s", 'book', 'is', 'big', '.', 'Do', "n't", 'tell',
     'her', ',', 'you', "'ll", 'regret', 'it', '!', "'Hey", "'", ',', 'she', "'ll", 'say', '!']
    >>> wordpunct_tokenize(sent)
    ['I', "'", 'm', 'a', 'dog', 'and', 'it', "'", 's', 'great', '!', 'You', "'",
     're', 'cool', 'and', 'Sandy', "'", 's', 'book', 'is', 'big', '.', 'Don',
     "'", 't', 'tell', 'her', ',', 'you', "'", 'll', 'regret', 'it', '!', "'", 
     'Hey', "',", 'she', "'", 'll', 'say', '!']
    

    As we can see, wordpunct_tokenize splits at pretty much every special symbol and treats each piece as a separate unit. word_tokenize, on the other hand, keeps things like 're together. It isn't all that smart though, since it fails to separate the initial single quote from 'Hey'.
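    If you only want the Treebank behaviour shown above, you can also call the underlying tokenizer class directly. Here is a minimal sketch (my own addition, assuming nltk 3.2.4, where word_tokenize first runs the Punkt sentence splitter and then the Treebank tokenizer on each sentence):

    from nltk.tokenize import TreebankWordTokenizer

    # Tokenizes a single sentence without the Punkt sentence-splitting
    # step that word_tokenize performs first.
    tokenizer = TreebankWordTokenizer()
    print(tokenizer.tokenize("Don't tell her, you'll regret it!"))
    # ['Do', "n't", 'tell', 'her', ',', 'you', "'ll", 'regret', 'it', '!']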

    Interestingly, if we write the sentence like this instead (single quotes as string delimiter and double quotes around "Hey"):

    sent = 'I\'m a dog and it\'s great! You\'re cool and Sandy\'s book is big. Don\'t tell her, you\'ll regret it! "Hey", she\'ll say!'
    

    we get

    >>> word_tokenize(sent)
    ['I', "'m", 'a', 'dog', 'and', 'it', "'s", 'great', '!', 'You', "'re", 
     'cool', 'and', 'Sandy', "'s", 'book', 'is', 'big', '.', 'Do', "n't", 
     'tell', 'her', ',', 'you', "'ll", 'regret', 'it', '!', '``', 'Hey', "''", 
     ',', 'she', "'ll", 'say', '!']
    

    So word_tokenize does split off double quotes; however, it also converts them to `` and '' (the Penn Treebank convention for opening and closing quotes). wordpunct_tokenize doesn't do this:

    >>> wordpunct_tokenize(sent)
    ['I', "'", 'm', 'a', 'dog', 'and', 'it', "'", 's', 'great', '!', 'You', "'", 
     're', 'cool', 'and', 'Sandy', "'", 's', 'book', 'is', 'big', '.', 'Don', 
     "'", 't', 'tell', 'her', ',', 'you', "'", 'll', 'regret', 'it', '!', '"', 
     'Hey', '",', 'she', "'", 'll', 'say', '!']
    
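    If the `` and '' tokens get in your way, a simple post-processing step (again just a sketch of my own, not part of NLTK's API) can map them back to plain double quotes:

    # Map the Penn Treebank quote tokens emitted by word_tokenize back to '"'.
    QUOTE_MAP = {'``': '"', "''": '"'}

    tokens = ['``', 'Hey', "''", ',', 'she', "'ll", 'say', '!']
    print([QUOTE_MAP.get(tok, tok) for tok in tokens])
    # ['"', 'Hey', '"', ',', 'she', "'ll", 'say', '!']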
