How to stem a pandas DataFrame using nltk? The output should be a stemmed DataFrame

Question

I'm trying to pre-process a dataset that contains text data, and I have created a pandas DataFrame from it. My question is: how can I apply stemming to the DataFrame and get a stemmed DataFrame as output?

Answer 1:

Given a pandas DataFrame df, you can stem its contents by tokenizing the text in each cell and applying a stemming function to every token, across the whole df.

As an example, I used the Snowball stemmer from nltk:

from nltk.stem.snowball import SnowballStemmer
englishStemmer = SnowballStemmer("english")  # create an English Snowball stemmer
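To get a feel for what the stemmer does before applying it to a DataFrame, here is a minimal sketch that stems a few individual words. The sample words are my own illustration, not from the question. Note that the Snowball stemmer also lowercases its input.

```python
from nltk.stem.snowball import SnowballStemmer

# Create an English Snowball stemmer (no corpus download is needed for this).
stemmer = SnowballStemmer("english")

# Hypothetical sample words, chosen only for illustration.
words = ["running", "cats", "jumped", "Dogs"]

# stem() works on one word at a time and returns a lowercase stem.
stems = [stemmer.stem(w) for w in words]
print(stems)  # → ['run', 'cat', 'jump', 'dog']
```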

And this tokenizer:

from nltk.tokenize import WhitespaceTokenizer
w_tokenizer = WhitespaceTokenizer()  # tokenize() is an instance method, so create an instance

Define your function:

def stemm_texts(text):
    return [englishStemmer.stem(w) for w in w_tokenizer.tokenize(str(text))]

Apply the function on your df:

df = df.apply(lambda y: y.map(stemm_texts, na_action='ignore'))

Note that I additionally passed na_action='ignore', so NaN values are left as they are instead of being converted to the string "nan" and stemmed.
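Putting the steps above together, a self-contained sketch could look like the following. The DataFrame, its column names (title, body) and the function name stem_tokens are hypothetical stand-ins for your own data. Each non-missing cell ends up holding a list of stems.

```python
import pandas as pd
from nltk.stem.snowball import SnowballStemmer
from nltk.tokenize import WhitespaceTokenizer

stemmer = SnowballStemmer("english")
tokenizer = WhitespaceTokenizer()  # splits on whitespace only


def stem_tokens(text):
    """Split a cell on whitespace and return the list of stemmed tokens."""
    return [stemmer.stem(tok) for tok in tokenizer.tokenize(str(text))]


# Hypothetical example data; one cell is missing on purpose.
df = pd.DataFrame({
    "title": ["running cats", "jumped dogs"],
    "body": ["barking dogs", float("nan")],
})

# Map the function over every column; missing values are skipped.
stemmed = df.apply(lambda col: col.map(stem_tokens, na_action="ignore"))
print(stemmed)
```

Because str(text) is applied to every cell, numeric columns would be turned into token lists as well, so you may prefer to apply this only to the columns that actually contain text.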

Each cell now holds a list of stems. You might want to detokenize again to get strings back:

from nltk.tokenize.treebank import TreebankWordDetokenizer

detokenizer = TreebankWordDetokenizer()
df = df.apply(lambda y: y.map(detokenizer.detokenize, na_action='ignore'))
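If you want strings rather than token lists, the tokenize, stem and detokenize steps can also be combined into one function and restricted to the text columns. This is only a sketch under my own assumptions: the DataFrame, the column names (title, views), the text_cols list and the function name stem_to_string are all hypothetical.

```python
import pandas as pd
from nltk.stem.snowball import SnowballStemmer
from nltk.tokenize import WhitespaceTokenizer
from nltk.tokenize.treebank import TreebankWordDetokenizer

stemmer = SnowballStemmer("english")
tokenizer = WhitespaceTokenizer()
detokenizer = TreebankWordDetokenizer()


def stem_to_string(text):
    """Tokenize a cell, stem each token, and join the stems into one string."""
    stems = [stemmer.stem(tok) for tok in tokenizer.tokenize(str(text))]
    return detokenizer.detokenize(stems)


# Hypothetical example data with one text column and one numeric column.
df = pd.DataFrame({
    "title": ["running cats", None],
    "views": [10, 20],
})

text_cols = ["title"]  # assumption: you know which columns hold text

result = df.copy()
result[text_cols] = result[text_cols].apply(
    lambda col: col.map(stem_to_string, na_action="ignore")
)
print(result)
```

Here the numeric column is left untouched and the missing title stays missing, so the output has the same shape as the input.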

Source: https://stackoverflow.com/questions/55482342/how-to-stem-a-pandas-dataframe-using-nltk-the-output-should-be-a-stemmed-dataf

Tags

python

pandas

dataframe

nltk

stemming
