Question
I would like to remove stop words from a column of a data frame. The column contains text that needs to be split into words.
For example my data frame looks like this:
ID Text
1 eat launch with me
2 go outside have fun
I want to remove stop words from the Text column, so the text has to be split into words first.
I tried this:
for item in cached_stop_words:
    if item in df_from_each_file[['text']]:
        print(item)
        df_from_each_file['text'] = df_from_each_file['text'].replace(item, '')
So my output should be like this:
ID Text
1 eat launch
2 go fun
That is, the stop words should have been deleted, but my code does not work correctly. I also tried the reverse approach, turning the data frame column into a series and looping through that, but it did not work either.
Thanks for your help.
Answer 1:
replace (by itself) isn't a good fit here, because you want to perform partial string replacement. What you want is regex-based replacement.
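To make the difference concrete, here is a minimal sketch comparing the two calls on a made-up series (the sample strings are illustrative, not from the question). With a plain string target, Series.replace only changes cells whose entire value equals the target, while Series.str.replace works inside each string.

```python
import pandas as pd

# Made-up sample data: one sentence and one cell that is exactly "with".
s = pd.Series(["eat launch with me", "with"])

# Series.replace compares each whole cell against the target value,
# so only the cell that is exactly "with" is changed.
whole_cell = s.replace("with", "")

# Series.str.replace operates inside each string, so the substring is
# removed wherever it occurs (note the double space it leaves behind).
substring = s.str.replace("with", "", regex=False)

print(whole_cell.tolist())
print(substring.tolist())
```

This is why the loop in the question leaves the sentences untouched: none of the cells is equal to a single stop word.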
One simple solution, when you have a manageable number of stop words, is to use str.replace.
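import re
# \b stops a stop word from matching inside a longer word; recent pandas versions need regex=True for a compiled pattern
p = re.compile(r"\b({})\b".format('|'.join(map(re.escape, cached_stop_words))))
df['Text'] = df['Text'].str.lower().str.replace(p, '', regex=True)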
df
ID Text
0 1 eat launch
1 2 outside have fun
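For reference, a self-contained sketch of the regex approach follows. The stop word list is hypothetical and stands in for cached_stop_words, whose contents the question does not show. The pattern uses word-boundary anchors so that a short stop word such as "me" cannot match inside a longer word, and the last two steps, which tidy up the whitespace left by the removed words, are an addition for illustration.

```python
import re

import pandas as pd

# Hypothetical stop word list, standing in for cached_stop_words.
stop_words = ["with", "me", "have", "outside"]

df = pd.DataFrame(
    {"ID": [1, 2], "Text": ["eat launch with me", "go outside have fun"]}
)

# Build one alternation of escaped stop words, anchored on word boundaries.
pattern = re.compile(r"\b(?:{})\b".format("|".join(map(re.escape, stop_words))))

cleaned = (
    df["Text"]
    .str.lower()
    .str.replace(pattern, "", regex=True)   # drop the stop words
    .str.replace(r"\s+", " ", regex=True)   # collapse the gaps left behind
    .str.strip()                            # trim leading/trailing spaces
)

print(cleaned.tolist())
```

With this particular list the two rows come out as "eat launch" and "go fun", which is the output the question asks for; a different stop word list will of course give different results.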
If performance is important, use a list comprehension.
cached_stop_words = set(cached_stop_words)
df['Text'] = [' '.join([w for w in x.lower().split() if w not in cached_stop_words])
              for x in df['Text'].tolist()]
df
ID Text
0 1 eat launch
1 2 outside have fun
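The list comprehension can also be wrapped in a small helper, sketched below. The function name remove_stop_words and the stop word list are hypothetical, chosen only for this example. Because the text is split on whitespace and rejoined with single spaces, no leftover gaps need cleaning up, and the set makes each membership check a constant-time lookup.

```python
import pandas as pd


def remove_stop_words(texts, stop_words):
    """Return lowercased texts with every whitespace-separated stop word dropped."""
    stop_set = set(stop_words)  # set lookup is O(1) per word
    return [
        " ".join(w for w in text.lower().split() if w not in stop_set)
        for text in texts
    ]


# Hypothetical stop word list, standing in for cached_stop_words.
stop_words = ["with", "me", "have", "outside"]

df = pd.DataFrame(
    {"ID": [1, 2], "Text": ["eat launch with me", "go outside have fun"]}
)
df["Text"] = remove_stop_words(df["Text"].tolist(), stop_words)

print(df["Text"].tolist())
```

Note that this version only matches whole tokens, so a stop word followed by punctuation (for example "me,") would be kept unless the punctuation is stripped first.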
Source: https://stackoverflow.com/questions/51914481/stopword-removal-with-pandas