Stopword removal with pandas

Submitted by ◇◆丶佛笑我妖孽 on 2019-12-11 00:59:53

Question


I would like to remove stopwords from a column of a data frame. The column contains text that needs to be split into words.

For example my data frame looks like this:

ID   Text
1    eat launch with me
2    go outside have fun

I want to remove stopwords from the Text column, so the text has to be split into words first.

I tried this:

for item in cached_stop_words:
    if item in df_from_each_file[['text']]:
        print(item)
        df_from_each_file['text'] = df_from_each_file['text'].replace(item, '')

So my output should be like this:

ID   Text
1    eat launch 
2    go fun

This means the stopwords have been deleted, but my code does not work correctly. I also tried the reverse approach, converting the data frame to a Series and looping through that, but it did not work either.

Thanks for your help.


Answer 1:


replace (by itself) isn't a good fit here: without regex=True it only matches entire cell values, whereas you want partial string replacement. You want regex-based replacement.
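To illustrate the difference, here is a minimal sketch using the sample data from the question and a single hypothetical stop word, "with". It assumes pandas is installed, and the variable names are illustrative only.

```python
import pandas as pd

# Sample frame from the question; "with" stands in for one stop word (an assumption).
df = pd.DataFrame({"ID": [1, 2],
                   "Text": ["eat launch with me", "go outside have fun"]})

# `in` on a DataFrame tests the column labels, not the cell contents.
found_in_cells = "with" in df[["Text"]]
found_as_label = "Text" in df[["Text"]]

# Series.replace without regex=True only swaps cells that equal the value exactly.
whole_cell = df["Text"].replace("with", "")

# Series.str.replace works on substrings inside each cell.
partial = df["Text"].str.replace("with", "", regex=False)

print(found_in_cells, found_as_label)
print(whole_cell.tolist())
print(partial.tolist())
```

This is probably why the loop in the question prints nothing and leaves the column unchanged: the `if` test looks at column labels, and even if it passed, the plain replace would only touch cells consisting of the stop word alone.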

One simple solution, when you have a manageable number of stop words, is using str.replace.

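import re
# \b restricts matches to whole words; pass regex=True explicitly, since newer pandas versions default to literal matching
p = re.compile(r"\b(?:{})\b".format('|'.join(map(re.escape, cached_stop_words))))
df['Text'] = df['Text'].str.lower().str.replace(p, '', regex=True)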

df
   ID               Text
0   1       eat launch  
1   2   outside have fun
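One detail worth checking with any regex-based approach is word boundaries. The sketch below uses only the standard `re` module and a hypothetical three-word stop list (not the asker's actual `cached_stop_words`) to compare a plain alternation with one wrapped in `\b`.

```python
import re

# Hypothetical stop list, for illustration only.
stop_words = ["me", "go", "with"]
alternation = "|".join(map(re.escape, stop_words))

# Plain alternation: matches the stop words anywhere, even inside longer words.
naive = re.compile(alternation)

# Wrapped in \b: matches the stop words only as whole words.
bounded = re.compile(r"\b(?:{})\b".format(alternation))


def strip_words(pattern, text):
    """Remove every match of `pattern`, then collapse leftover whitespace."""
    return " ".join(pattern.sub("", text).split())


text = "good memes go with me"
naive_result = strip_words(naive, text)      # 'od s'
bounded_result = strip_words(bounded, text)  # 'good memes'

print(naive_result)
print(bounded_result)
```

Without the boundaries, "good" loses its "go" and "memes" loses both occurrences of "me", so the bounded form is usually the safer choice for stopword removal.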

If performance is important, use a list comprehension.

cached_stop_words = set(cached_stop_words)
df['Text'] = [' '.join([w for w in x.lower().split() if w not in cached_stop_words]) 
    for x in df['Text'].tolist()]

df
   ID              Text
0   1        eat launch
1   2  outside have fun
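For reference, here is a self-contained sketch of the list comprehension approach applied to the question's sample data. The stop list is made up so that the result matches the output the asker wanted; in practice `cached_stop_words` would come from whatever source you already use. The helper name `remove_stopwords` is hypothetical.

```python
import pandas as pd

# Hypothetical stop words, chosen to reproduce the desired output in the question.
stop_words = {"with", "me", "outside", "have"}

df = pd.DataFrame({"ID": [1, 2],
                   "Text": ["eat launch with me", "go outside have fun"]})


def remove_stopwords(text, stop_words):
    """Lowercase, split on whitespace, and drop any token found in stop_words."""
    return " ".join(w for w in text.lower().split() if w not in stop_words)


# A set gives constant-time membership tests, which is what makes this fast.
df["Text"] = [remove_stopwords(t, stop_words) for t in df["Text"]]

print(df)
```

Because the text is split on whitespace and rejoined with single spaces, this version leaves no stray spaces behind, unlike the regex replacement.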


Source: https://stackoverflow.com/questions/51914481/stopword-removal-with-pandas
