I am using nltk to generate n-grams from sentences by first removing given stop words. However, nltk.pos_tag() is extremely slow, taking up to 0.6 s.
NLTK's pos_tag is defined as:
from nltk.tag.perceptron import PerceptronTagger
def pos_tag(tokens, tagset=None):
    tagger = PerceptronTagger()
    return _pos_tag(tokens, tagset, tagger)
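So each call to pos_tag instantiates a new PerceptronTagger, which loads the pretrained model and accounts for much of the computation time. You can save this time by creating the tagger once and calling tagger.tag yourself:
from nltk import word_tokenize
from nltk.tag.perceptron import PerceptronTagger
tagger = PerceptronTagger()  # create once and reuse for every sentence
sentence_pos = tagger.tag(word_tokenize(sentence))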
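If the tagging happens in several places in your code, one way to guarantee the tagger is built only once is to hide it behind a cached accessor. The following is a minimal sketch, not part of NLTK: get_tagger and tag_sentences are hypothetical helper names, and the sentences are assumed to be already tokenized into lists of strings.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def get_tagger():
    """Build the PerceptronTagger on first use and cache it afterwards."""
    # Imported lazily so the model is only loaded when tagging is needed.
    from nltk.tag.perceptron import PerceptronTagger
    return PerceptronTagger()


def tag_sentences(tokenized_sentences, tagger=None):
    """Tag each pre-tokenized sentence, reusing a single tagger instance.

    `tagger` is any object with a tag(tokens) method; it defaults to the
    cached PerceptronTagger (hypothetical convenience, not an NLTK API).
    """
    if tagger is None:
        tagger = get_tagger()
    return [tagger.tag(tokens) for tokens in tokenized_sentences]
```

With this in place, only the first call to tag_sentences pays the model-loading cost; later calls reuse the cached instance.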
Use pos_tag_sents for tagging multiple sentences:
>>> import time
>>> from nltk.corpus import brown
>>> from nltk import pos_tag
>>> from nltk import pos_tag_sents
>>> sents = brown.sents()[:10]
>>> start = time.time(); pos_tag(sents[0]); print(time.time() - start)
>>> start = time.time(); [pos_tag(s) for s in sents]; print(time.time() - start)
>>> start = time.time(); pos_tag_sents(sents); print(time.time() - start)
If you are looking for another fast POS tagger in Python, you might want to try RDRPOSTagger. For example, on English POS tagging, its single-threaded Python implementation tags about 8K words/second on a 2.4 GHz Core 2 Duo machine. You can get faster tagging by using the multi-threaded mode. RDRPOSTagger obtains very competitive accuracy in comparison to state-of-the-art taggers and now supports pre-trained models for 40 languages. See experimental results in this paper.