Efficient way to unnest (explode) multiple list columns in a pandas DataFrame

天涯浪人 2020-11-27 14:43

I am reading multiple JSON objects into one DataFrame. The problem is that some of the columns are lists. Also, the data is very big and because of that I cannot use the ava

4 answers
  • 2020-11-27 15:02

    Call set_index on A, then apply pd.Series to each remaining column and stack the resulting values into rows. All of this fits in a one-liner.

    In [1253]: (df.set_index('A')
                  .apply(lambda x: x.apply(pd.Series).stack())
                  .reset_index()
                  .drop(columns='level_1'))
    Out[1253]:
        A   B   C   D   E
    0  x1  v1  c1  d1  e1
    1  x1  v2  c2  d2  e2
    2  x2  v3  c3  d3  e3
    3  x2  v4  c4  d4  e4
    4  x3  v5  c5  d5  e5
    5  x3  v6  c6  d6  e6
    6  x4  v7  c7  d7  e7
    7  x4  v8  c8  d8  e8
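
The input frame is not shown in this thread, so here is a minimal sketch of one that would produce the output above. The data is reconstructed from that output and is an assumption, not the asker's real data. The drop is spelled `drop(columns='level_1')` because the positional axis argument was removed in pandas 2.0.

```python
import pandas as pd

# Hypothetical input, reconstructed from the output shown above:
# column A holds scalars, columns B..E hold two-element lists.
df = pd.DataFrame({
    'A': ['x1', 'x2', 'x3', 'x4'],
    'B': [['v1', 'v2'], ['v3', 'v4'], ['v5', 'v6'], ['v7', 'v8']],
    'C': [['c1', 'c2'], ['c3', 'c4'], ['c5', 'c6'], ['c7', 'c8']],
    'D': [['d1', 'd2'], ['d3', 'd4'], ['d5', 'd6'], ['d7', 'd8']],
    'E': [['e1', 'e2'], ['e3', 'e4'], ['e5', 'e6'], ['e7', 'e8']],
})

# Expand each list column into one column per list position, then stack
# those positions into rows. After reset_index(), `level_1` is the
# leftover list position and can be dropped.
result = (df.set_index('A')
            .apply(lambda col: col.apply(pd.Series).stack())
            .reset_index()
            .drop(columns='level_1'))
print(result)
```

Building a Series per cell is what makes this approach slow on large frames, so it is probably best kept for small inputs.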
    
  • 2020-11-27 15:12

    pandas >= 0.25

    Assuming the lists in each row have the same length across all list columns, you can call Series.explode on each column.

    df.set_index(['A']).apply(pd.Series.explode).reset_index()
    
        A   B   C   D   E
    0  x1  v1  c1  d1  e1
    1  x1  v2  c2  d2  e2
    2  x2  v3  c3  d3  e3
    3  x2  v4  c4  d4  e4
    4  x3  v5  c5  d5  e5
    5  x3  v6  c6  d6  e6
    6  x4  v7  c7  d7  e7
    7  x4  v8  c8  d8  e8
    

    The idea is to set as the index all columns that must NOT be exploded first, then reset the index after.
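
When more than one column has to stay untouched, the same idea should carry over by passing all of them to set_index. A minimal sketch, assuming a hypothetical frame with an extra scalar column `F` that is not part of the original question:

```python
import pandas as pd

# Hypothetical frame: A and F are scalar columns to keep as they are,
# B and C hold lists of equal length within each row.
df = pd.DataFrame({
    'A': ['x1', 'x2'],
    'F': [10, 20],
    'B': [['v1', 'v2'], ['v3', 'v4']],
    'C': [['c1', 'c2'], ['c3', 'c4']],
})

# Park every non-list column in the index, explode what remains,
# then move the index levels back into ordinary columns.
keep = ['A', 'F']
result = df.set_index(keep).apply(pd.Series.explode).reset_index()
print(result)
```

On pandas 1.3 or later, DataFrame.explode also accepts a list of column names, which may make the index round trip unnecessary.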


    It is also roughly four times faster than the apply(pd.Series).stack() one-liner on this small sample frame.

    %timeit df.set_index(['A']).apply(pd.Series.explode).reset_index()
    2.22 ms ± 98.6 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

    %%timeit
    (df.set_index('A')
       .apply(lambda x: x.apply(pd.Series).stack())
       .reset_index()
       .drop(columns='level_1'))
    9.14 ms ± 329 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
    
  • 2020-11-27 15:21

    Building on @cs95's answer, we can use an if clause in the lambda function, instead of setting all the other columns as the index. This has the following advantages:

    • Preserves column order
    • Lets you name the columns explicitly, either the ones to explode (x.name in [...]) or the ones to leave alone (x.name not in [...]).

    df.apply(lambda x: x.explode() if x.name in ['B', 'C', 'D', 'E'] else x)
    
         A   B   C   D   E
    0   x1  v1  c1  d1  e1
    0   x1  v2  c2  d2  e2
    1   x2  v3  c3  d3  e3
    1   x2  v4  c4  d4  e4
    2   x3  v5  c5  d5  e5
    2   x3  v6  c6  d6  e6
    3   x4  v7  c7  d7  e7
    3   x4  v8  c8  d8  e8
    
  • 2020-11-27 15:24
    import numpy as np
    import pandas as pd

    def explode(df, lst_cols, fill_value=''):
        # make sure `lst_cols` is a list
        if lst_cols and not isinstance(lst_cols, list):
            lst_cols = [lst_cols]
        # all columns except `lst_cols`
        idx_cols = df.columns.difference(lst_cols)

        # list lengths per row, taken from the first list column
        lens = df[lst_cols[0]].str.len()

        # repeat each scalar once per list element, then flatten the lists
        res = pd.DataFrame({
            col: np.repeat(df[col].values, lens)
            for col in idx_cols
        }).assign(**{col: np.concatenate(df[col].values) for col in lst_cols})

        if (lens == 0).any():
            # rows with empty lists vanish above; add them back and pad
            # the list columns with `fill_value` (pd.concat replaces
            # DataFrame.append, which was removed in pandas 2.0)
            res = pd.concat([res, df.loc[lens == 0, idx_cols]]).fillna(fill_value)

        # restore the original column order
        return res.loc[:, df.columns]
    

    Usage:

    In [82]: explode(df, lst_cols=list('BCDE'))
    Out[82]:
        A   B   C   D   E
    0  x1  v1  c1  d1  e1
    1  x1  v2  c2  d2  e2
    2  x2  v3  c3  d3  e3
    3  x2  v4  c4  d4  e4
    4  x3  v5  c5  d5  e5
    5  x3  v6  c6  d6  e6
    6  x4  v7  c7  d7  e7
    7  x4  v8  c8  d8  e8
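
The function above comes with no explanation, so here is a minimal sketch of what appears to be its core step: repeat each scalar once per list element with np.repeat, and join the lists end to end with np.concatenate. The two-row frame is hypothetical and every list in it is non-empty.

```python
import numpy as np
import pandas as pd

# Hypothetical two-row frame; the lists have different lengths.
df = pd.DataFrame({
    'A': ['x1', 'x2'],
    'B': [['v1', 'v2'], ['v3', 'v4', 'v5']],
})

# Number of elements in each row's list.
lens = df['B'].str.len().to_numpy()

flat = pd.DataFrame({
    # Repeat each scalar once per list element in its row.
    'A': np.repeat(df['A'].to_numpy(), lens),
    # Join the per-row lists end to end into one flat array.
    'B': np.concatenate(df['B'].to_numpy()),
})
print(flat)
```

Because this works on whole NumPy arrays rather than building a Series per cell, it tends to scale better than the apply-based approaches. It does assume that all list columns share the same per-row lengths.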
    