Separate comma-separated values within individual cells of Pandas Series using regex


I have a CSV file from a database that I've converted into a Pandas DataFrame and am trying to clean up. One of the issues is that multiple values have been input into single cel


1 Answer

时光取名叫无心

2021-01-27 17:15
I would be inclined to use a lookahead; how you do so depends on your expected data.

The first option is a negative lookahead. It says "a comma that is not followed by whitespace". Prefer it if you are sure that every comma inside a comment is followed by whitespace, and you want "red,green" to be treated as something to split.
```
data.str.split(r',(?!\s)').apply(pd.Series)
```
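As a quick sanity check of that pattern, here is a minimal sketch using plain `re` on a few made-up cell values (the sample strings are my assumption, not your data). With ordinary object-dtype strings, I would expect `Series.str.split` to give the same pieces per cell.

```python
import re

# Hypothetical cell values; not taken from the original data.
cells = ["1,2,3", "red,green", "fine, thanks"]

# Split on a comma that is NOT followed by whitespace.
pattern = re.compile(r",(?!\s)")

split_cells = [pattern.split(cell) for cell in cells]
for cell, parts in zip(cells, split_cells):
    print(repr(cell), "->", parts)
```

The comma in "fine, thanks" is followed by a space, so that cell stays in one piece, while "1,2,3" and "red,green" are both split.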
Another option is a positive lookahead for something that looks like a valid value. Your example was numbers, so for instance this would split only on a comma that is followed by a digit:
```
data.str.split(r',(?=\d)').apply(pd.Series)
```
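One thing to watch: a positive lookahead is written `(?=...)`, whereas `(?:...)` is a non-capturing group that consumes what it matches. A small sketch with plain `re` and made-up values (my assumption, not your data) shows the difference:

```python
import re

# Hypothetical cell values; not taken from the original data.
cells = ["1,2,3", "red,green", "ok, see page 4"]

lookahead = re.compile(r",(?=\d)")  # comma followed by a digit; digit is kept
consuming = re.compile(r",(?:\d)")  # comma plus a digit; digit is swallowed

lookahead_parts = [lookahead.split(c) for c in cells]
consuming_parts = [consuming.split(c) for c in cells]

for cell, keep, lose in zip(cells, lookahead_parts, consuming_parts):
    print(repr(cell), "lookahead:", keep, "| non-capturing:", lose)
```

With the lookahead, "1,2,3" becomes three values and the other two cells are left alone. With the non-capturing group, the digits after each comma are lost from "1,2,3".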
Regular expressions are very powerful, but honestly, I am not sure this solution will serve you well if this is a long-term problem. Getting most cases right as a one-time migration should be fine, but longer term I would consider solving the problem upstream, before the data gets here. Anyway, here's Debuggex's Python regex cheat sheet, in case it is useful to you: https://www.debuggex.com/cheatsheet/regex/python