Separate comma-separated values within individual cells of Pandas Series using regex

心在旅途 2021-01-27 16:36

I have a csv file from a database I've converted into a Pandas DataFrame that I'm trying to clean up. One of the issues is that multiple values have been input into single cel

1 Answer
  •  时光取名叫无心
    2021-01-27 17:15

    I would be inclined to use a lookahead; how you do so depends on your expected data.

    This is a negative lookahead. It matches "a comma that is not followed by whitespace". It is the better choice if you are sure that every comma inside a comment is followed by whitespace, and you want "red,green" to be treated as something to split.

    data.str.split(r'[,](?!\s)').apply(pd.Series)
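
    As a rough illustration of the negative lookahead, here is a minimal sketch on made-up data. The Series contents are hypothetical, since the question's actual values are not shown in full, and expand=True is used here as an alternative way to spread the pieces across columns.

```python
import pandas as pd

# Hypothetical sample data; the real column contents are an assumption.
data = pd.Series(["red,green", "blue, with a comment", "1,2,3"])

# Split only on commas that are NOT followed by whitespace.
# expand=True returns a DataFrame with one column per piece.
parts = data.str.split(r",(?!\s)", expand=True)
print(parts)
```

    The second row stays in one piece because its only comma is followed by a space; shorter rows are padded with missing values.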
    

    Another option is a positive lookahead for something that looks like a valid value; your example was numbers, so for instance this would split only on a comma that is followed by a number:

    data.str.split(r'[,](?=\d)').apply(pd.Series)
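
    One detail worth checking with a positive lookahead is that it is written (?=...), which tests the next character without consuming it. A non-capturing group, (?:...), looks similar but makes the digit part of the separator. The sketch below uses the standard re module on a hypothetical cell value to show the difference; the same patterns should behave the same way inside str.split.

```python
import re

# Hypothetical cell value mixing numeric fields and a free-text comment.
cell = "12,34,see note, then 56"

# Positive lookahead: split only on commas directly followed by a digit.
# The digit is checked but not consumed, so it stays in the next field.
lookahead_parts = re.split(r",(?=\d)", cell)
print(lookahead_parts)  # ['12', '34,see note, then 56']

# Non-capturing group, for contrast: the digit is part of the match,
# so the first digit of the following field is lost.
group_parts = re.split(r",(?:\d)", cell)
print(group_parts)  # ['12', '4,see note, then 56']
```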
    

    Regular expressions are very powerful, but honestly, I am not sure this solution will serve you well if this is a long-term problem. Getting most cases right as a one-time migration should be fine, but longer term I would consider trying to solve the problem upstream, before the data gets here. Anyway, here's Debuggex's Python regex cheat sheet, in case it is useful to you: https://www.debuggex.com/cheatsheet/regex/python
