Split (explode) pandas dataframe string entry to separate rows

一向 2020-11-21 05:03

I have a pandas dataframe in which one column of text strings contains comma-separated values. I want to split each CSV field and create a new row per entry (as

22 Answers
  •  清酒与你
    2020-11-21 05:34

    UPDATE 2: a more generic vectorized function that works with multiple regular columns and multiple list columns

    import numpy as np
    import pandas as pd

    def explode(df, lst_cols, fill_value='', preserve_index=False):
        # make sure `lst_cols` is list-like
        if (lst_cols is not None
            and len(lst_cols) > 0
            and not isinstance(lst_cols, (list, tuple, np.ndarray, pd.Series))):
            lst_cols = [lst_cols]
        # all columns except `lst_cols`
        idx_cols = df.columns.difference(lst_cols)
        # calculate lengths of lists
        lens = df[lst_cols[0]].str.len()
        # preserve original index values    
        idx = np.repeat(df.index.values, lens)
        # create "exploded" DF
        res = (pd.DataFrame({
                    col:np.repeat(df[col].values, lens)
                    for col in idx_cols},
                    index=idx)
                 .assign(**{col:np.concatenate(df.loc[lens>0, col].values)
                                for col in lst_cols}))
        # append those rows that have empty lists
        if (lens == 0).any():
            # at least one list in cells is empty
            # (DataFrame.append was removed in pandas 2.0, so use pd.concat)
            res = (pd.concat([res, df.loc[lens==0, idx_cols]], sort=False)
                      .fillna(fill_value))
        # revert the original index order
        res = res.sort_index()
        # reset index if requested
        if not preserve_index:        
            res = res.reset_index(drop=True)
        return res
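
As a minimal sketch of the idea behind the function above (the two-column frame and the variable names here are made up for illustration): `np.repeat` duplicates each scalar value once per list element, while `np.concatenate` flattens the lists, so the two arrays line up element by element.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: one scalar column, one list column.
df = pd.DataFrame({
    "myid": [1, 2],
    "num": [[1, 2, 3], [4, 5]],
})

# Length of each list; .str.len() also works on list-valued cells.
lens = df["num"].str.len()

# Repeat each scalar value once per list element ...
repeated_ids = np.repeat(df["myid"].values, lens)
# ... and flatten the lists into one long array of the same total length.
flat_nums = np.concatenate(df["num"].values)

result = pd.DataFrame({"myid": repeated_ids, "num": flat_nums})
print(result)
```

Rows with empty lists contribute nothing to either array, which is why the full function appends them back separately.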
    

    Demo:

    Multiple list columns (within each row, all list columns must have the same number of elements):

    In [134]: df
    Out[134]:
       aaa  myid        num          text
    0   10     1  [1, 2, 3]  [aa, bb, cc]
    1   11     2         []            []
    2   12     3     [1, 2]      [cc, dd]
    3   13     4         []            []
    
    In [135]: explode(df, ['num','text'], fill_value='')
    Out[135]:
       aaa  myid num text
    0   10     1   1   aa
    1   10     1   2   bb
    2   10     1   3   cc
    3   11     2
    4   12     3   1   cc
    5   12     3   2   dd
    6   13     4
    

    Preserving the original index values:

    In [136]: explode(df, ['num','text'], fill_value='', preserve_index=True)
    Out[136]:
       aaa  myid num text
    0   10     1   1   aa
    0   10     1   2   bb
    0   10     1   3   cc
    1   11     2
    2   12     3   1   cc
    2   12     3   2   dd
    3   13     4
    

    Setup:

    df = pd.DataFrame({
     'aaa': {0: 10, 1: 11, 2: 12, 3: 13},
     'myid': {0: 1, 1: 2, 2: 3, 3: 4},
     'num': {0: [1, 2, 3], 1: [], 2: [1, 2], 3: []},
     'text': {0: ['aa', 'bb', 'cc'], 1: [], 2: ['cc', 'dd'], 3: []}
    })
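
For comparison, and assuming pandas 1.3 or newer is available, the built-in `DataFrame.explode` accepts a list of columns and can be applied to the same setup frame. This is a sketch rather than a drop-in replacement for the function above: the built-in method turns empty lists into NaN, so the `fillna('')` call is added here to mimic `fill_value=''`.

```python
import pandas as pd

df = pd.DataFrame({
    'aaa': {0: 10, 1: 11, 2: 12, 3: 13},
    'myid': {0: 1, 1: 2, 2: 3, 3: 4},
    'num': {0: [1, 2, 3], 1: [], 2: [1, 2], 3: []},
    'text': {0: ['aa', 'bb', 'cc'], 1: [], 2: ['cc', 'dd'], 3: []},
})

# Explode both list columns together (multi-column support needs pandas >= 1.3).
# Rows whose lists are empty are kept, with NaN in the exploded columns.
exploded = df.explode(['num', 'text'])

# Replace the NaN placeholders and drop the repeated index labels.
exploded = exploded.fillna('').reset_index(drop=True)
print(exploded)
```

Skipping the `reset_index(drop=True)` call keeps the repeated original index labels, similar to `preserve_index=True`.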
    

    CSV column:

    In [46]: df
    Out[46]:
            var1  var2 var3
    0      a,b,c     1   XX
    1  d,e,f,x,y     2   ZZ
    
    In [47]: explode(df.assign(var1=df.var1.str.split(',')), 'var1')
    Out[47]:
      var1  var2 var3
    0    a     1   XX
    1    b     1   XX
    2    c     1   XX
    3    d     2   ZZ
    4    e     2   ZZ
    5    f     2   ZZ
    6    x     2   ZZ
    7    y     2   ZZ
    

    Using this little trick, we can convert a CSV-like column to a list column:

    In [48]: df.assign(var1=df.var1.str.split(','))
    Out[48]:
                  var1  var2 var3
    0        [a, b, c]     1   XX
    1  [d, e, f, x, y]     2   ZZ
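
On pandas 0.25 or newer, a list column produced this way can also be passed straight to the built-in `DataFrame.explode`. The following is a minimal sketch under that version assumption, not part of the original function:

```python
import pandas as pd

df = pd.DataFrame({
    "var1": ["a,b,c", "d,e,f,x,y"],
    "var2": [1, 2],
    "var3": ["XX", "ZZ"],
})

exploded = (
    df.assign(var1=df["var1"].str.split(","))  # CSV string -> list
      .explode("var1")                         # one row per list element
      .reset_index(drop=True)                  # discard repeated index labels
)
print(exploded)
```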
    

    UPDATE: a generic vectorized approach (also works for multiple columns):

    Original DF:

    In [177]: df
    Out[177]:
            var1  var2 var3
    0      a,b,c     1   XX
    1  d,e,f,x,y     2   ZZ
    

    Solution:

    First, let's convert the CSV strings to lists:

    In [178]: lst_col = 'var1' 
    
    In [179]: x = df.assign(**{lst_col:df[lst_col].str.split(',')})
    
    In [180]: x
    Out[180]:
                  var1  var2 var3
    0        [a, b, c]     1   XX
    1  [d, e, f, x, y]     2   ZZ
    

    Now we can do this:

    In [181]: pd.DataFrame({
         ...:     col:np.repeat(x[col].values, x[lst_col].str.len())
         ...:     for col in x.columns.difference([lst_col])
         ...: }).assign(**{lst_col:np.concatenate(x[lst_col].values)})[x.columns.tolist()]
         ...:
    Out[181]:
      var1  var2 var3
    0    a     1   XX
    1    b     1   XX
    2    c     1   XX
    3    d     2   ZZ
    4    e     2   ZZ
    5    f     2   ZZ
    6    x     2   ZZ
    7    y     2   ZZ
    

    OLD answer:

    Inspired by @AFinkelstein's solution, I wanted to generalize it so that it can be applied to a DF with more than two columns while being almost as fast as AFinkelstein's solution:

    In [2]: df = pd.DataFrame(
       ...:    [{'var1': 'a,b,c', 'var2': 1, 'var3': 'XX'},
       ...:     {'var1': 'd,e,f,x,y', 'var2': 2, 'var3': 'ZZ'}]
       ...: )
    
    In [3]: df
    Out[3]:
            var1  var2 var3
    0      a,b,c     1   XX
    1  d,e,f,x,y     2   ZZ
    
    In [4]: (df.set_index(df.columns.drop('var1').tolist())
       ...:    .var1.str.split(',', expand=True)
       ...:    .stack()
       ...:    .reset_index()
       ...:    .rename(columns={0:'var1'})
       ...:    .loc[:, df.columns]
       ...: )
    Out[4]:
      var1  var2 var3
    0    a     1   XX
    1    b     1   XX
    2    c     1   XX
    3    d     2   ZZ
    4    e     2   ZZ
    5    f     2   ZZ
    6    x     2   ZZ
    7    y     2   ZZ
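
The chained expression above can be easier to follow when unrolled into named steps. This is a sketch with illustrative variable names (`keys`, `indexed`, `wide`, `long`). The explicit `dropna()` is an addition, so that the padding cells are removed regardless of how `stack()` treats missing values in the installed pandas version.

```python
import pandas as pd

df = pd.DataFrame(
    [{"var1": "a,b,c", "var2": 1, "var3": "XX"},
     {"var1": "d,e,f,x,y", "var2": 2, "var3": "ZZ"}]
)

# 1. Park every column except var1 in the index so it survives the reshape.
keys = df.columns.drop("var1").tolist()
indexed = df.set_index(keys)

# 2. One column per token; shorter rows are padded with missing values.
wide = indexed["var1"].str.split(",", expand=True)

# 3. Pivot the token columns into rows and drop the padding.
long = wide.stack().dropna()

# 4. Restore the key columns, name the token column, and put the
#    columns back in their original order.
result = (
    long.reset_index()
        .rename(columns={0: "var1"})
        .loc[:, df.columns]
)
print(result)
```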
    
