NLTK-based stemming and lemmatization

问题

I am trying to preprocess a string using lemmatizer and then remove the punctuation and digits. I am using the code below to do this. I am not getting any error but the text is not preprocessed appropriately. Only the stop words are removed but the lemmatizing does not work and punctuation and digits also remain.

from nltk.stem import WordNetLemmatizer
import string
import nltk
tweets = "This is a beautiful day16~. I am; working on an exercise45.^^^45 text34."
lemmatizer = WordNetLemmatizer()
tweets = lemmatizer.lemmatize(tweets)
data=[]
stop_words = set(nltk.corpus.stopwords.words('english'))
words = nltk.word_tokenize(tweets)
words = [i for i in words if i not in stop_words]
data.append(' '.join(words))
corpus = " ".join(str(x) for x in data)
p = string.punctuation
d = string.digits
table = str.maketrans(p, len(p) * " ")
corpus.translate(table)
table = str.maketrans(d, len(d) * " ")
corpus.translate(table)
print(corpus)

The final output I get is:

This beautiful day16~ . I ; working exercise45.^^^45 text34 .

And expected output should look like:

This beautiful day I work exercise text

回答1:

No, your current approach does not work, because you must pass one word at a time to the lemmatizer/stemmer, otherwise, those functions won't know to interpret your string as a sentence (they expect words).

import re

__stop_words = set(nltk.corpus.stopwords.words('english'))

def clean(tweet):
    cleaned_tweet = re.sub(r'([^\w\s]|\d)+', '', tweets.lower())
    return ' '.join([lemmatizer.lemmatize(i, 'v') 
                for i in cleaned_tweet.split() if i not in __stop_words])

Alternatively, you can use a PorterStemmer, which does the same thing as lemmatisation, but without context.

from nltk.stem.porter import PorterStemmer  
stemmer = PorterStemmer()

And, call the stemmer like this:

stemmer.stem(i)

回答2:

I think this is what you're looking for, but do this prior to calling the lemmatizer as the commenter noted.

>>>import re
>>>s = "This is a beautiful day16~. I am; working on an exercise45.^^^45text34."
>>>s = re.sub(r'[^A-Za-z ]', '', s)
This is a beautiful day I am working on an exercise text

来源：https://stackoverflow.com/questions/46779116/nltk-based-stemming-and-lemmatization

标签

python

nltk

stemming

lemmatization