Python Pandas - How to format and split text in a column?

Submitted by 我是研究僧i on 2019-12-24 00:13:11

Question


I have a set of strings in a DataFrame, like below:

ID TextColumn
1 This is line number one
2 I love pandas, they are so puffy
3 [This $tring is with specia| characters, yes it is!]

A. I want to format this string to eliminate all the special characters.
B. Once formatted, I'd like to get a list of unique words (space being the only split).

Here is the code I have written:

The get_df_by_id DataFrame holds one selected row, say ID 3.

#replace all special characters
formatted_title = get_df_by_id['title'].str.replace(r'[\-\!\@\#\$\%\^\&\*\(\)\_\+\[\]\;\'\.\,\/\{\}\:\"\<\>\?]' , '')
# then split the words
results = set()
get_df_by_id['title'].str.lower().str.split().apply(results.update)
print results

But when I check the output, I can see that the special characters are still in the set.

Output

set([u'[this', u'is', u'it', u'specia|', u'$tring', u'is!]', u'characters,', u'yes', u'with'])

Intended output should be like below:

set([u'this', u'is', u'it', u'specia', u'tring', u'characters', u'yes', u'with'])

Why does the formatted DataFrame still retain the special characters?


Answer 1:


I think you can first replace the special characters (I added \| to the end of the character class), then lowercase the text and split by \s+ (arbitrary whitespace). With expand=True the output is a DataFrame, so you can stack it into a Series, call drop_duplicates and finally tolist:

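# regex=True is needed on pandas 2.0+, where str.replace matches literally by default
print (df['title'].str
                  .replace(r'[\-\!\@\#\$\%\^\&\*\(\)\_\+\[\]\;\'\.\,\/\{\}\:\"\<\>\?\|]', '', regex=True)
                  .str
                  .lower()
                  .str
                  .split(r'\s+', expand=True)
                  .stack()
                  # newer pandas versions may keep the padding NaNs after stack(), so drop them explicitly
                  .dropna()
                  .drop_duplicates()
                  .tolist())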

['this', 'is', 'line', 'number', 'one', 'i', 'love', 'pandas', 'they', 'are', 
'so', 'puffy', 'tring', 'with', 'specia', 'characters', 'yes', 'it']
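
For anyone who wants to try this end to end, here is a minimal self-contained sketch. The sample DataFrame is rebuilt from the question, with the text column named 'title' to match the code, and the pattern is the character class from the answer above. As a swapped-in alternative to expand=True plus stack(), it uses Series.explode, which I am assuming is available (pandas 0.25 or later); treat it as one possible variant rather than the only way to do it.

```python
import pandas as pd

# Sample data rebuilt from the question; the column name 'title' is taken from the code there.
df = pd.DataFrame({
    "ID": [1, 2, 3],
    "title": [
        "This is line number one",
        "I love pandas, they are so puffy",
        "[This $tring is with specia| characters, yes it is!]",
    ],
})

# Character class from the answer above, including the added \| at the end.
SPECIAL_CHARS = r'[\-\!\@\#\$\%\^\&\*\(\)\_\+\[\]\;\'\.\,\/\{\}\:\"\<\>\?\|]'

unique_words = (
    df["title"]
    .str.replace(SPECIAL_CHARS, "", regex=True)  # strip the special characters
    .str.lower()
    .str.split()          # no pattern: split on runs of whitespace
    .explode()            # one word per row instead of expand=True + stack()
    .drop_duplicates()    # keeps the first occurrence, so word order is preserved
    .tolist()
)
print(unique_words)
```

Because drop_duplicates keeps first occurrences, the words come out in the order they first appear across the rows.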



Answer 2:


If you want the list of unique words per row:

>>> get_df_by_id['title'].str.replace(r'[^a-zA-Z\s]', '', regex=True).str.lower().str.split(r'\s+').apply(lambda x: list(set(x)))

0                           [this, is, one, line, number]
1                 [love, i, puffy, so, are, they, pandas]
2    [specia, this, is, it, characters, tring, yes, with]
Name: title, dtype: object
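
Note that list(set(x)) gives the words in arbitrary order, so the output can differ between runs. If a stable order matters, a small variation is to sort each set. The sketch below assumes the three sample strings from the question in a Series named titles (a name made up for this example) and is only meant as an illustration:

```python
import pandas as pd

# Hypothetical stand-in for get_df_by_id['title'], using the sample rows from the question.
titles = pd.Series(
    [
        "This is line number one",
        "I love pandas, they are so puffy",
        "[This $tring is with specia| characters, yes it is!]",
    ],
    name="title",
)

per_row = (
    titles
    .str.replace(r"[^a-zA-Z\s]", "", regex=True)  # keep only letters and whitespace
    .str.lower()
    .str.split()                                  # split on runs of whitespace
    .apply(lambda words: sorted(set(words)))      # unique words per row, in sorted order
)
print(per_row.tolist())
```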



Answer 3:


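You have to assign the formatted values back to the same DataFrame. str.replace returns a new Series and does not modify the column in place. In your code, formatted_title is never used, and the split runs on the original column:

get_df_by_id['title'] = get_df_by_id['title'].str.replace(r'[\-\!\@\#\$\%\^\&\*\(\)\_\+\[\]\;\'\.\,\/\{\}\:\"\<\>\?]' , '', regex=True)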
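
To see the effect, here is a small sketch under a few assumptions: get_df_by_id is rebuilt by hand with only the ID 3 row, and the pattern [^a-zA-Z\s] is borrowed from the second answer so that the | character is removed as well (the character class above does not include it).

```python
import pandas as pd

# Hypothetical one-row frame standing in for the asker's get_df_by_id (ID 3).
get_df_by_id = pd.DataFrame(
    {"title": ["[This $tring is with specia| characters, yes it is!]"]}
)

# str.replace returns a new Series; the column itself is not changed yet.
formatted_title = get_df_by_id["title"].str.replace(r"[^a-zA-Z\s]", "", regex=True)
before = get_df_by_id["title"].iloc[0]
print(before)  # still contains the special characters

# Assign the result back, then split the cleaned column rather than the original one.
get_df_by_id["title"] = formatted_title
results = set()
for words in get_df_by_id["title"].str.lower().str.split():
    results.update(words)
print(sorted(results))
```

Alternatively, skip the assignment and call .str.lower().str.split() directly on formatted_title.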


Source: https://stackoverflow.com/questions/37429296/python-pandas-how-to-format-and-split-a-text-in-column
