Provoke the NLTK part-of-speech tagger to report a plural proper noun

Let's try out Python's renowned part-of-speech tagger in the nltk package.

import nltk
# You might also need to run nltk.download('maxent_treebank_pos_tagger') 
#  even after installing nltk

string = 'Buddy Billy went to the moon and came Back with several Vikings.'
nltk.pos_tag(nltk.word_tokenize(string))

This gives me

[('Buddy', 'NNP'), ('Billy', 'NNP'), ('went', 'VBD'), ('to', 'TO'), ('the', 'DT'), ('moon', 'NN'), ('and', 'CC'), ('came', 'VBD'), ('Back', 'NNP'), ('with', 'IN'), ('several', 'JJ'), ('Vikings', 'NNS'), ('.', '.')]

You can interpret the codes here. I'm slightly disappointed that 'Back' got categorized as a proper noun (NNP), although the confusion is understandable. I'm more upset that 'Vikings' got called a simple plural noun (NNS) instead of a plural proper noun (NNPS). Can anyone come up with a single example of a brief input that leads to at least one NNPS tag?

The Brown corpus in NLTK tags plural proper nouns as NPS rather than NNPS. That is because the corpus is annotated with the Brown tagset, which differs from the Penn Treebank tagset that nltk.pos_tag uses (listed at https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html).

Here's an example of plural proper nouns:

>>> from nltk.corpus import brown
>>> for sent in brown.tagged_sents():
...     if any(pos == 'NPS' for word, pos in sent):
...             print(sent)
...             break
... 
[(u'Georgia', u'NP'), (u'Republicans', u'NPS'), (u'are', u'BER'), (u'getting', u'VBG'), (u'strong', u'JJ'), (u'encouragement', u'NN'), (u'to', u'TO'), (u'enter', u'VB'), (u'a', u'AT'), (u'candidate', u'NN'), (u'in', u'IN'), (u'the', u'AT'), (u'1962', u'CD'), (u"governor's", u'NN$'), (u'race', u'NN'), (u',', u','), (u'a', u'AT'), (u'top', u'JJS'), (u'official', u'NN'), (u'said', u'VBD'), (u'Wednesday', u'NR'), (u'.', u'.')]
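The Brown tags in that output line up with the Penn Treebank tags for proper nouns. Here is a minimal sketch of that correspondence. The names `BROWN_TO_PENN_PROPER` and `to_penn` are made up for illustration and are not part of NLTK, and the mapping covers only the two proper-noun tags discussed here, not the full tagsets.

```python
# Illustrative mapping only: it covers the two proper-noun tags discussed
# here, not the full Brown or Penn Treebank tagsets.
BROWN_TO_PENN_PROPER = {
    'NP': 'NNP',    # singular proper noun
    'NPS': 'NNPS',  # plural proper noun
}


def to_penn(tagged_sent):
    """Return a copy of tagged_sent with Brown proper-noun tags renamed.

    Any tag that is not in the mapping is passed through unchanged.
    """
    return [(word, BROWN_TO_PENN_PROPER.get(tag, tag))
            for word, tag in tagged_sent]


# First three tokens of the Brown sentence shown above.
sample = [('Georgia', 'NP'), ('Republicans', 'NPS'), ('are', 'BER')]
converted = to_penn(sample)
```

Tags outside the mapping (such as 'BER') are left alone, so this is only useful for comparing proper-noun tags between the two schemes.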

But if you tag with nltk.pos_tag, you'll get NNPS:

>>> for sent in brown.tagged_sents():
...     if any(pos == 'NPS' for word, pos in sent):
...             print(" ".join([word for word, pos in sent]))
...             break
... 
Georgia Republicans are getting strong encouragement to enter a candidate in the 1962 governor's race , a top official said Wednesday .
>>> from nltk import pos_tag
>>> pos_tag("Georgia Republicans are getting strong encouragement to enter a candidate in the 1962 governor's race , a top official said Wednesday .".split())
[('Georgia', 'NNP'), ('Republicans', 'NNPS'), ('are', 'VBP'), ('getting', 'VBG'), ('strong', 'JJ'), ('encouragement', 'NN'), ('to', 'TO'), ('enter', 'VB'), ('a', 'DT'), ('candidate', 'NN'), ('in', 'IN'), ('the', 'DT'), ('1962', 'CD'), ("governor's", 'NNS'), ('race', 'NN'), (',', ','), ('a', 'DT'), ('top', 'JJ'), ('official', 'NN'), ('said', 'VBD'), ('Wednesday', 'NNP'), ('.', '.')]
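To check a tagged sentence for NNPS without reading the whole list by eye, something like the following could be used. It is a small sketch that works on any list of (word, tag) pairs. The helper name `words_with_tag` is hypothetical, and the sample data is copied from the outputs above rather than produced by calling the tagger.

```python
def words_with_tag(tagged_sent, tag='NNPS'):
    """Return the words in tagged_sent whose tag equals the given tag."""
    return [word for word, pos in tagged_sent if pos == tag]


# Leading tokens of the pos_tag output for the Brown sentence.
republicans = [('Georgia', 'NNP'), ('Republicans', 'NNPS'),
               ('are', 'VBP'), ('getting', 'VBG')]

# Trailing tokens of the pos_tag output from the question.
vikings = [('with', 'IN'), ('several', 'JJ'), ('Vikings', 'NNS'), ('.', '.')]

plural_proper = words_with_tag(republicans)   # words tagged NNPS
none_found = words_with_tag(vikings)          # no NNPS in this output
```

An empty result means the tagger did not assign NNPS to any token in that sentence.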

Source: https://stackoverflow.com/questions/31349851/provoke-the-nltk-part-of-speech-tagger-to-report-a-plural-proper-noun

Tags: python-2.7, nlp, nltk, part-of-speech