I\'m trying to lemmatize all of the words in a sentence with NLTK\'s WordNetLemmatizer. I have a bunch of sentences but am just using the first sentence to ensure I\'m doing th
,
at the end of vandalisms,
.To remove this trailing ,
, you could use .strip(',')
or use mutliple delimiters as described here.First tag the sentence, then use the POS tag as the additional parameter input for the lemmatization.
from nltk import pos_tag
from nltk.stem import WordNetLemmatizer
wnl = WordNetLemmatizer()
def penn2morphy(penntag):
""" Converts Penn Treebank tags to WordNet. """
morphy_tag = {'NN':'n', 'JJ':'a',
'VB':'v', 'RB':'r'}
try:
return morphy_tag[penntag[:2]]
except:
return 'n'
def lemmatize_sent(text):
# Text input is string, returns lowercased strings.
return [wnl.lemmatize(word.lower(), pos=penn2morphy(tag))
for word, tag in pos_tag(word_tokenize(text))]
lemmatize_sent('He is walking to school')
For a detailed walkthrough of how and why the POS tag is necessary see https://www.kaggle.com/alvations/basic-nlp-with-nltk
Alternatively, you can use pywsd
tokenizer + lemmatizer, a wrapper of NLTK's WordNetLemmatizer
:
Install:
pip install -U nltk
python -m nltk.downloader popular
pip install -U pywsd
Code:
>>> from pywsd.utils import lemmatize_sentence
Warming up PyWSD (takes ~10 secs)... took 9.307677984237671 secs.
>>> text = "Mary leaves the room"
>>> lemmatize_sentence(text)
['mary', 'leave', 'the', 'room']
>>> text = 'Dew drops fall from the leaves'
>>> lemmatize_sentence(text)
['dew', 'drop', 'fall', 'from', 'the', 'leaf']
(Note to moderators: I can't mark this question as duplicate of nltk: How to lemmatize taking surrounding words into context? because the answer wasn't accepted there but it is a duplicate).