Question
I'm trying to pre-process a dataset that contains text data, and I have created a pandas DataFrame from it. How can I apply stemming to the DataFrame and get a stemmed DataFrame as output?
Answer 1:
Given a pandas DataFrame df, you can stem its contents by tokenizing each cell into words and applying a stemming function to every token.
As an example, I used the Snowball stemmer from nltk:
from nltk.stem.snowball import SnowballStemmer
englishStemmer = SnowballStemmer("english")  # create an English Snowball stemmer
And this tokenizer:
from nltk.tokenize import WhitespaceTokenizer
w_tokenizer = WhitespaceTokenizer()  # instantiate it: tokenize() is an instance method
Define your function:
def stemm_texts(text):
    # Split the cell on whitespace and stem each token; returns a list of stems
    return [englishStemmer.stem(w) for w in w_tokenizer.tokenize(str(text))]
Apply the function on your df:
df = df.apply(lambda y: y.map(stemm_texts, na_action='ignore'))
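Note the na_action='ignore' argument: it leaves NaN cells untouched instead of passing them to the function, where str(text) would turn them into the string "nan".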
Each cell now holds a list of stems. To get plain strings back, you might want to detokenize again:
from nltk.tokenize.treebank import TreebankWordDetokenizer
detokenizer = TreebankWordDetokenizer()
df = df.apply(lambda y: y.map(detokenizer.detokenize, na_action='ignore'))
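Putting the steps together, here is a minimal end-to-end sketch. The sample DataFrame and its column names (title, body) are made up for illustration, and stem_text is a hypothetical helper that folds tokenizing, stemming, and detokenizing into one pass instead of two separate apply calls. It assumes every column holds text; numeric cells would be converted to strings by str(text).

```python
import numpy as np
import pandas as pd
from nltk.stem.snowball import SnowballStemmer
from nltk.tokenize import WhitespaceTokenizer
from nltk.tokenize.treebank import TreebankWordDetokenizer

stemmer = SnowballStemmer("english")
tokenizer = WhitespaceTokenizer()
detokenizer = TreebankWordDetokenizer()


def stem_text(text):
    """Tokenize on whitespace, stem each token, and join the stems into one string."""
    tokens = tokenizer.tokenize(str(text))
    return detokenizer.detokenize([stemmer.stem(t) for t in tokens])


# Hypothetical sample data, including one missing value.
df = pd.DataFrame({
    "title": ["running dogs", "cats jumping"],
    "body": ["he was walking", np.nan],
})

# Map the helper over every column; na_action="ignore" skips NaN cells.
stemmed = df.apply(lambda col: col.map(stem_text, na_action="ignore"))
print(stemmed)
```

The result has the same shape, index, and columns as df, with each text cell replaced by its stemmed string and the missing value still NaN. If you only want to stem some columns, apply the same map to a column subset such as df[["title"]] instead of the whole frame.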
Source: https://stackoverflow.com/questions/55482342/how-to-stem-a-pandas-dataframe-using-nltk-the-output-should-be-a-stemmed-dataf