Split (explode) pandas dataframe string entry to separate rows

一向 2020-11-21 05:03

I have a pandas dataframe in which one column of text strings contains comma-separated values. I want to split each CSV field and create a new row per entry (as

22 answers
  •  心在旅途
    2020-11-21 05:11

    TL;DR

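    import pandas as pd
    import numpy as np
    
    def explode_str(df, col, sep):
        """One row per sep-delimited piece of the strings in df[col]."""
        s = df[col]
        # Repeat each row position once per piece (separator count + 1).
        i = np.arange(len(s)).repeat(s.str.count(sep) + 1)
        return df.iloc[i].assign(**{col: sep.join(s).split(sep)})
    
    def explode_list(df, col):
        """One row per element of the lists in df[col]."""
        s = df[col]
        # Repeat each row position once per list element.
        i = np.arange(len(s)).repeat(s.str.len())
        return df.iloc[i].assign(**{col: np.concatenate(s)})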
    

    Demonstration

    explode_str(a, 'var1', ',')
    
      var1  var2
    0    a     1
    0    b     1
    0    c     1
    1    d     2
    1    e     2
    1    f     2
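
    The frame a comes from the question and is not defined in this answer. A minimal sketch that reproduces the output above, assuming a holds the two rows implied by that output (the constructor call is a reconstruction, not necessarily the asker's exact code):

```python
import numpy as np
import pandas as pd

def explode_str(df, col, sep):
    s = df[col]
    i = np.arange(len(s)).repeat(s.str.count(sep) + 1)
    return df.iloc[i].assign(**{col: sep.join(s).split(sep)})

# Assumed input, reconstructed from the demonstration output.
a = pd.DataFrame({'var1': ['a,b,c', 'd,e,f'], 'var2': [1, 2]})

result = explode_str(a, 'var1', ',')
print(result)
```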
    

    Let's create a new dataframe d whose var1 column holds lists:

    d = a.assign(var1=lambda d: d.var1.str.split(','))
    
    explode_list(d, 'var1')
    
      var1  var2
    0    a     1
    0    b     1
    0    c     1
    1    d     2
    1    e     2
    1    f     2
    

    General Comments

    I'll use np.arange with repeat to produce dataframe index positions that I can use with iloc.
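
    As a small illustration of that idea, here is a sketch with made-up lengths (the lengths list is hypothetical, not taken from the question):

```python
import numpy as np

# Hypothetical piece counts: row 0 becomes 2 rows, row 1 stays 1, row 2 becomes 3.
lengths = [2, 1, 3]

# arange gives one position per original row; repeat duplicates each
# position as many times as that row has pieces.
positions = np.arange(len(lengths)).repeat(lengths)
print(positions)  # [0 0 1 2 2 2]
```

    Passing positions to iloc then selects each original row once per piece.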

    FAQ

    Why don't you use loc?

    Because the index may not be unique and using loc will return every row that matches a queried index.
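
    A short sketch of the difference, using a hypothetical frame in which the label 0 appears twice:

```python
import pandas as pd

# Hypothetical frame whose index label 0 is duplicated.
df = pd.DataFrame({'x': ['p', 'q', 'r']}, index=[0, 0, 1])

by_label = df.loc[[0]]      # label lookup: every row labelled 0
by_position = df.iloc[[0]]  # positional lookup: only the first row

print(len(by_label), len(by_position))  # 2 1
```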

    Why don't you use the values attribute and slice that?

    When calling values, if the entirety of the dataframe is in one cohesive "block", Pandas will return a view of the array that is the "block". Otherwise Pandas will have to cobble together a new array. When cobbling, that array must be of a uniform dtype. Often that means returning an array whose dtype is object. By using iloc instead of slicing the values attribute, I spare myself from having to deal with that.
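
    To make the dtype point concrete, here is a sketch with a hypothetical two-column frame of mixed types:

```python
import pandas as pd

# Hypothetical frame with one numeric and one text column.
df = pd.DataFrame({'num': [1, 2], 'txt': ['a', 'b']})

mixed = df.values           # a single array has to hold both columns
taken = df.iloc[[0, 0, 1]]  # repeating rows by position keeps each column's dtype

print(mixed.dtype)                                   # object
print(taken.dtypes.tolist() == df.dtypes.tolist())   # True
```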

    Why do you use assign?

    When I use assign using the same column name that I'm exploding, I overwrite the existing column and maintain its position in the dataframe.
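
    A sketch of that behaviour, with a hypothetical three-column frame and the column name held in a variable:

```python
import pandas as pd

# Hypothetical frame; the middle column is the one being overwritten.
df = pd.DataFrame({'var1': ['a,b'], 'var2': [1], 'var3': [True]})
col = 'var2'  # column name only known at run time

# `**{col: ...}` passes the name stored in `col` as a keyword argument.
out = df.assign(**{col: [10]})

print(out.columns.tolist())  # ['var1', 'var2', 'var3']
```

    assign returns a new frame, so df itself is left unchanged.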

    Why are the index values repeated?

    By virtue of using iloc on repeated positions, the resulting index shows the same repeated pattern: one repeat for each element of the list or string.
    This can be reset with reset_index(drop=True).
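
    For instance, on a hypothetical exploded result (the frame below is made up for illustration):

```python
import pandas as pd

# Hypothetical exploded frame that still carries the repeated index.
exploded = pd.DataFrame(
    {'var1': ['a', 'b', 'c'], 'var2': [1, 1, 2]},
    index=[0, 0, 1],
)

# drop=True discards the old index instead of keeping it as a column.
flat = exploded.reset_index(drop=True)
print(flat.index.tolist())  # [0, 1, 2]
```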


    For Strings

    I don't want to split the strings prematurely. Instead I count the occurrences of the sep argument: if I were to split, the resulting list would be one element longer than the number of separators.

    I then join all the strings with that same sep and split the result once.

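    def explode_str(df, col, sep):
        s = df[col]
        # Pieces per row = separator count + 1; repeat each row position that often.
        i = np.arange(len(s)).repeat(s.str.count(sep) + 1)
        # Join every string with sep, then split once to get all pieces in order.
        return df.iloc[i].assign(**{col: sep.join(s).split(sep)})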
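
    The intermediate values can be inspected with a sketch like the one below, which uses a hypothetical Series. It also covers one caveat: Series.str.count treats its argument as a regular expression, so for a separator such as '|' I would assume escaping it with re.escape is needed before counting.

```python
import re
import pandas as pd

s = pd.Series(['a,b,c', 'd,e,f'])  # hypothetical input column
sep = ','

counts = s.str.count(sep) + 1   # pieces per row
joined = sep.join(s)            # 'a,b,c,d,e,f'
pieces = joined.split(sep)      # all pieces, in row order

print(counts.tolist())  # [3, 3]
print(pieces)           # ['a', 'b', 'c', 'd', 'e', 'f']

# A regex-special separator is escaped before counting.
t = pd.Series(['a|b', 'c'])
safe_counts = t.str.count(re.escape('|')) + 1
print(safe_counts.tolist())  # [2, 1]
```

    The total of counts equals the number of pieces, which is what lets the repeated row positions line up with the split values.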
    

    For Lists

    Similar to strings, except I don't need to count occurrences of sep because the values are already split.

    I use Numpy's concatenate to jam the lists together.

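    import pandas as pd
    import numpy as np
    
    def explode_list(df, col):
        s = df[col]
        # str.len() also works on lists: it gives the element count per row.
        i = np.arange(len(s)).repeat(s.str.len())
        # Concatenate all lists into one flat array, in row order.
        return df.iloc[i].assign(**{col: np.concatenate(s)})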
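
    The two building blocks can be checked on their own with a sketch like this one. The frame is hypothetical, and I pass s.tolist() to np.concatenate here as my own choice so that it receives a plain list of lists. The last lines compare the result with the built-in DataFrame.explode, assuming pandas 0.25 or later is available.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with a default index and a column of lists.
d = pd.DataFrame({'var1': [['a', 'b', 'c'], ['d', 'e', 'f']], 'var2': [1, 2]})
s = d['var1']

lengths = s.str.len()              # element count per row
flat = np.concatenate(s.tolist())  # one flat array of all elements

print(lengths.tolist())  # [3, 3]
print(flat.tolist())     # ['a', 'b', 'c', 'd', 'e', 'f']

# Built-in alternative (pandas >= 0.25), shown only for comparison.
builtin = d.explode('var1')
print(builtin['var1'].tolist() == flat.tolist())  # True
```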
    
