Question
I have a set of strings in a dataframe, as shown below:
ID TextColumn
1 This is line number one
2 I love pandas, they are so puffy
3 [This $tring is with specia| characters, yes it is!]
A. I want to format these strings to eliminate all the special characters.
B. Once formatted, I'd like to get a list of unique words (space being the only split).
Here is the code I have written:
The get_df_by_id dataframe holds one selected row, say ID 3.
#replace all special characters
formatted_title = get_df_by_id['title'].str.replace(r'[\-\!\@\#\$\%\^\&\*\(\)\_\+\[\]\;\'\.\,\/\{\}\:\"\<\>\?]' , '')
# then split the words
results = set()
get_df_by_id['title'].str.lower().str.split().apply(results.update)
print results
But when I check the output, I can see that the special characters are still in the list.
Output
set([u'[this', u'is', u'it', u'specia|', u'$tring', u'is!]', u'characters,', u'yes', u'with'])
Intended output should be like below:
set([u'this', u'is', u'it', u'specia', u'tring', u'is', u'characters,', u'yes', u'with'])
Why does formatted dataframe still retain the special characters?
Answer 1:
I think you can first replace the special characters (I added \| to the end of the pattern), then lowercase the text and split by \s+ (arbitrary whitespace). The output is a DataFrame, so you can stack it into a Series, then drop_duplicates, and finally tolist:
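# regex=True is needed on recent pandas versions, where str.replace matches literally by default
print (df['title'].str
       .replace(r'[\-\!\@\#\$\%\^\&\*\(\)\_\+\[\]\;\'\.\,\/\{\}\:\"\<\>\?\|]', '', regex=True)
       .str
       .lower()
       .str
       .split(r'\s+', expand=True)
       .stack()
       .drop_duplicates()
       .tolist())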
['this', 'is', 'line', 'number', 'one', 'i', 'love', 'pandas', 'they', 'are',
'so', 'puffy', 'tring', 'with', 'specia', 'characters', 'yes', 'it']
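For anyone who wants to run the chain above end to end, here is a minimal self-contained sketch. It rebuilds the sample data from the question (the column name 'title' is assumed from the question's code) and makes two substitutions that are not part of this answer: the negated character class [^a-zA-Z\s] instead of the long escaped list, and Series.explode (available in pandas 0.25 and later) instead of split(expand=True) followed by stack.

```python
import pandas as pd

# Sample data rebuilt from the question; 'title' is the column name used in its code.
df = pd.DataFrame({
    "ID": [1, 2, 3],
    "title": [
        "This is line number one",
        "I love pandas, they are so puffy",
        "[This $tring is with specia| characters, yes it is!]",
    ],
})

unique_words = (
    df["title"]
    .str.replace(r"[^a-zA-Z\s]", "", regex=True)  # drop everything except letters and whitespace
    .str.lower()
    .str.split()          # no argument: split on runs of whitespace
    .explode()            # one word per row
    .drop_duplicates()    # keep the first occurrence of each word
    .tolist()
)
print(unique_words)
```

Because drop_duplicates keeps first occurrences in order, the words come out in the order they first appear across the rows.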
Answer 2:
If you want the list of unique words per row:
>>> get_df_by_id['title'].str.replace(r'[^a-zA-Z\s]', '', regex=True).str.lower().str.split(r'\s+').apply(lambda x: list(set(x)))
0 [this, is, one, line, number]
1 [love, i, puffy, so, are, they, pandas]
2 [specia, this, is, it, characters, tring, yes, with]
Name: title, dtype: object
Answer 3:
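You have to assign the formatted values back to the same data frame. str.replace returns a new Series rather than modifying the column in place, and your code splits get_df_by_id['title'], not formatted_title:
get_df_by_id['title'] = get_df_by_id['title'].str.replace(r'[\-\!\@\#\$\%\^\&\*\(\)\_\+\[\]\;\'\.\,\/\{\}\:\"\<\>\?]' , '', regex=True)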
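Put together with the split step from the question, the fix might look like the sketch below. It assumes a one-row frame for ID 3 and borrows the shorter pattern [^a-zA-Z\s] from Answer 2, which also removes the | character that the pattern above does not list. A plain loop replaces apply(results.update) only for readability.

```python
import pandas as pd

# One selected row, as in the question (ID 3); 'title' is the column used in its code.
get_df_by_id = pd.DataFrame({
    "ID": [3],
    "title": ["[This $tring is with specia| characters, yes it is!]"],
})

# str.replace returns a new Series, so keep the result and use it afterwards.
# regex=True is required on recent pandas versions, where the default is a literal match.
formatted_title = get_df_by_id["title"].str.replace(r"[^a-zA-Z\s]", "", regex=True)

# Split the cleaned Series (not the original column) and collect the unique words.
results = set()
for words in formatted_title.str.lower().str.split():
    results.update(words)

print(sorted(results))
# ['characters', 'is', 'it', 'specia', 'this', 'tring', 'with', 'yes']
```

The only change that matters is that the split runs on formatted_title; the original code cleaned the text and then split the untouched column.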
Source: https://stackoverflow.com/questions/37429296/python-pandas-how-to-format-and-split-a-text-in-column