Split (explode) pandas dataframe string entry to separate rows

一向 2020-11-21 05:03

I have a pandas dataframe in which one column of text strings contains comma-separated values. I want to split each CSV field and create a new row per entry (as

22 answers
  • 2020-11-21 05:34

    UPDATE 2: a more generic vectorized function that works with multiple regular columns and multiple list columns:

    import numpy as np
    import pandas as pd

    def explode(df, lst_cols, fill_value='', preserve_index=False):
        # make sure `lst_cols` is list-like
        if (lst_cols is not None
            and len(lst_cols) > 0
            and not isinstance(lst_cols, (list, tuple, np.ndarray, pd.Series))):
            lst_cols = [lst_cols]
        # all columns except `lst_cols`
        idx_cols = df.columns.difference(lst_cols)
        # calculate lengths of lists
        lens = df[lst_cols[0]].str.len()
        # preserve original index values    
        idx = np.repeat(df.index.values, lens)
        # create "exploded" DF
        res = (pd.DataFrame({
                    col:np.repeat(df[col].values, lens)
                    for col in idx_cols},
                    index=idx)
                 .assign(**{col:np.concatenate(df.loc[lens>0, col].values)
                                for col in lst_cols}))
        # append those rows that have empty lists
        if (lens == 0).any():
            # at least one list in cells is empty
            # DataFrame.append was removed in pandas 2.0; pd.concat is the replacement
            res = (pd.concat([res, df.loc[lens==0, idx_cols]], sort=False)
                      .fillna(fill_value))
        # revert the original index order
        res = res.sort_index()
        # reset index if requested
        if not preserve_index:        
            res = res.reset_index(drop=True)
        return res
    

    Demo:

    Multiple list columns - all list columns must have the same # of elements in each row:

    In [134]: df
    Out[134]:
       aaa  myid        num          text
    0   10     1  [1, 2, 3]  [aa, bb, cc]
    1   11     2         []            []
    2   12     3     [1, 2]      [cc, dd]
    3   13     4         []            []
    
    In [135]: explode(df, ['num','text'], fill_value='')
    Out[135]:
       aaa  myid num text
    0   10     1   1   aa
    1   10     1   2   bb
    2   10     1   3   cc
    3   11     2
    4   12     3   1   cc
    5   12     3   2   dd
    6   13     4
    

    Preserving the original index values:

    In [136]: explode(df, ['num','text'], fill_value='', preserve_index=True)
    Out[136]:
       aaa  myid num text
    0   10     1   1   aa
    0   10     1   2   bb
    0   10     1   3   cc
    1   11     2
    2   12     3   1   cc
    2   12     3   2   dd
    3   13     4
    

    Setup:

    df = pd.DataFrame({
     'aaa': {0: 10, 1: 11, 2: 12, 3: 13},
     'myid': {0: 1, 1: 2, 2: 3, 3: 4},
     'num': {0: [1, 2, 3], 1: [], 2: [1, 2], 3: []},
     'text': {0: ['aa', 'bb', 'cc'], 1: [], 2: ['cc', 'dd'], 3: []}
    })
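
    For comparison, on pandas 1.3 or later the built-in DataFrame.explode accepts a list of columns, so the same setup data can be exploded without a custom function. This is a minimal sketch and not part of the original answer. It assumes the lists in each row have matching lengths. Empty lists come back as NaN, so fillna stands in for fill_value, and the original index values are kept (as with preserve_index=True).

```python
import pandas as pd

df = pd.DataFrame({
    "aaa": [10, 11, 12, 13],
    "myid": [1, 2, 3, 4],
    "num": [[1, 2, 3], [], [1, 2], []],
    "text": [["aa", "bb", "cc"], [], ["cc", "dd"], []],
})

# Multi-column explode (pandas >= 1.3): lists in the same row must have equal lengths.
# A row with empty lists yields a single row holding NaN, which fillna replaces here.
result = df.explode(["num", "text"]).fillna("")
print(result)
```

    Chain .reset_index(drop=True) if a fresh 0..n-1 index is preferred.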
    

    CSV column:

    In [46]: df
    Out[46]:
            var1  var2 var3
    0      a,b,c     1   XX
    1  d,e,f,x,y     2   ZZ
    
    In [47]: explode(df.assign(var1=df.var1.str.split(',')), 'var1')
    Out[47]:
      var1  var2 var3
    0    a     1   XX
    1    b     1   XX
    2    c     1   XX
    3    d     2   ZZ
    4    e     2   ZZ
    5    f     2   ZZ
    6    x     2   ZZ
    7    y     2   ZZ
    

    Using this little trick, we can convert a CSV-like column to a list column:

    In [48]: df.assign(var1=df.var1.str.split(','))
    Out[48]:
                  var1  var2 var3
    0        [a, b, c]     1   XX
    1  [d, e, f, x, y]     2   ZZ
    

    UPDATE: generic vectorized approach (will work also for multiple columns):

    Original DF:

    In [177]: df
    Out[177]:
            var1  var2 var3
    0      a,b,c     1   XX
    1  d,e,f,x,y     2   ZZ
    

    Solution:

    First, let's convert the CSV strings to lists:

    In [178]: lst_col = 'var1' 
    
    In [179]: x = df.assign(**{lst_col:df[lst_col].str.split(',')})
    
    In [180]: x
    Out[180]:
                  var1  var2 var3
    0        [a, b, c]     1   XX
    1  [d, e, f, x, y]     2   ZZ
    

    Now we can do this:

    In [181]: pd.DataFrame({
         ...:     col:np.repeat(x[col].values, x[lst_col].str.len())
         ...:     for col in x.columns.difference([lst_col])
         ...: }).assign(**{lst_col:np.concatenate(x[lst_col].values)})[x.columns.tolist()]
         ...:
    Out[181]:
      var1  var2 var3
    0    a     1   XX
    1    b     1   XX
    2    c     1   XX
    3    d     2   ZZ
    4    e     2   ZZ
    5    f     2   ZZ
    6    x     2   ZZ
    7    y     2   ZZ
    

    OLD answer:

    Inspired by @AFinkelstein's solution, I wanted to generalize it so that it can be applied to a DF with more than two columns while staying almost as fast as AFinkelstein's solution:

    In [2]: df = pd.DataFrame(
       ...:    [{'var1': 'a,b,c', 'var2': 1, 'var3': 'XX'},
       ...:     {'var1': 'd,e,f,x,y', 'var2': 2, 'var3': 'ZZ'}]
       ...: )
    
    In [3]: df
    Out[3]:
            var1  var2 var3
    0      a,b,c     1   XX
    1  d,e,f,x,y     2   ZZ
    
    In [4]: (df.set_index(df.columns.drop('var1').tolist())
       ...:    .var1.str.split(',', expand=True)
       ...:    .stack()
       ...:    .reset_index()
       ...:    .rename(columns={0:'var1'})
       ...:    .loc[:, df.columns]
       ...: )
    Out[4]:
      var1  var2 var3
    0    a     1   XX
    1    b     1   XX
    2    c     1   XX
    3    d     2   ZZ
    4    e     2   ZZ
    5    f     2   ZZ
    6    x     2   ZZ
    7    y     2   ZZ
    
  • 2020-11-21 05:35

    Here is a fairly straightforward method that uses the split method from the pandas str accessor and then uses NumPy to flatten the result into a single array.

    The corresponding values are retrieved by repeating the non-split column the correct number of times with np.repeat. This assumes every row splits into the same number of items.

    var1 = df.var1.str.split(',', expand=True).values.ravel()
    # integer division: np.repeat needs an integer repeat count
    var2 = np.repeat(df.var2.values, len(var1) // len(df))
    
    pd.DataFrame({'var1': var1,
                  'var2': var2})
    
      var1  var2
    0    a     1
    1    b     1
    2    c     1
    3    d     2
    4    e     2
    5    f     2
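
    When rows hold differing numbers of items, split(expand=True) pads the shorter rows with None, so a single repeat count no longer fits. A minimal sketch of a variant (not part of the original answer) that repeats each row by its own item count; the names parts and counts are illustrative only:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"var1": ["a,b,c", "d,e,f,x,y"], "var2": [1, 2]})

parts = df["var1"].str.split(",")       # Series of lists, one list per row
counts = parts.str.len().to_numpy()     # number of items in each row

result = pd.DataFrame({
    # flatten the per-row lists into one long array
    "var1": np.concatenate(parts.to_numpy()),
    # repeat each var2 value once per item in its row
    "var2": np.repeat(df["var2"].to_numpy(), counts),
})
print(result)
```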
    
  • 2020-11-21 05:35

    There are a lot of answers here, but I'm surprised no one has mentioned the built-in pandas explode function (DataFrame.explode, available since pandas 0.25). See: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.explode.html#pandas.DataFrame.explode

    For some reason I was unable to access that function, so I used the code below:

    import pandas_explode
    pandas_explode.patch()
    df_zlp_people_cnt3 = df_zlp_people_cnt2.explode('people')
    

    In my data, the people column held a list of people in each row, and I wanted to explode it. The code above works for list-type data, so first convert your comma-separated text into lists. Because it relies on built-in functions, it is also much faster than custom or apply-based functions.

    Note: You may need to install pandas_explode with pip.
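
    If your pandas version does ship DataFrame.explode, the patch package should not be needed. A minimal sketch under that assumption, using the var1/var2/var3 frame from the other answers rather than my own data:

```python
import pandas as pd

df = pd.DataFrame({
    "var1": ["a,b,c", "d,e,f,x,y"],
    "var2": [1, 2],
    "var3": ["XX", "ZZ"],
})

# Turn each comma-separated string into a list, then emit one row per element.
result = (
    df.assign(var1=df["var1"].str.split(","))
      .explode("var1")
      .reset_index(drop=True)   # explode repeats the original index values
)
print(result)
```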

  • 2020-11-21 05:35

    I had a similar problem; my solution was to convert the dataframe to a list of dictionaries first and then do the transformation. Here is the function:

    import re
    import pandas as pd
    
    def separate_row(df, column_name):
        ls = []
        for row_dict in df.to_dict('records'):
            for word in re.split(',', row_dict[column_name]):
                row = row_dict.copy()
                row[column_name]=word
                ls.append(row)
        return pd.DataFrame(ls)
    

    Example:

    >>> from pandas import DataFrame
    >>> import numpy as np
    >>> a = DataFrame([{'var1': 'a,b,c', 'var2': 1},
                   {'var1': 'd,e,f', 'var2': 2}])
    >>> a
        var1  var2
    0  a,b,c     1
    1  d,e,f     2
    >>> separate_row(a, "var1")
      var1  var2
    0    a     1
    1    b     1
    2    c     1
    3    d     2
    4    e     2
    5    f     2
    

    You can also change the function a bit to support separating list type rows.
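
    One way that change might look, as a rough sketch: the name separate_row_any and the sep parameter are hypothetical and not part of the function above. Strings are split on the separator, and any other value is treated as an iterable of items.

```python
import pandas as pd

def separate_row_any(df, column_name, sep=","):
    """Hypothetical variant of separate_row that accepts strings or lists."""
    rows = []
    for row_dict in df.to_dict("records"):
        value = row_dict[column_name]
        # split strings on the separator; treat anything else as an iterable
        items = value.split(sep) if isinstance(value, str) else list(value)
        for item in items:
            row = row_dict.copy()
            row[column_name] = item
            rows.append(row)
    return pd.DataFrame(rows)

b = pd.DataFrame([{"var1": ["a", "b"], "var2": 1},
                  {"var1": "c,d", "var2": 2}])
out = separate_row_any(b, "var1")
print(out)
```

    Note that a row whose list is empty produces no output rows in this sketch.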
