Pandas column of lists, create a row for each list element

有刺的猬 2020-11-22 06:59

I have a dataframe where some cells contain lists of multiple values. Rather than storing multiple values in a cell, I'd like to expand the dataframe so that each item in t

10 answers
  • 2020-11-22 07:45
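    import pandas as pd
    df = pd.DataFrame([{'Product': 'Coke', 'Prices': [100, 123, 101, 105, 99, 94, 98]},
                       {'Product': 'Pepsi', 'Prices': [101, 104, 104, 101, 99, 99, 99]}])
    print(df)
    # 'Prices' already holds lists, so explode() applies directly; str.split(',')
    # is only needed for comma-separated strings (on lists it yields NaN).
    df = df.explode('Prices')
    print(df)
    

    This requires pandas >= 0.25, the version in which DataFrame.explode was added.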
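
To make the difference between the two input shapes concrete, here is a minimal sketch. The sample values and the names `df_lists` and `df_str` are made up for illustration; the only assumption is pandas >= 0.25.

```python
import pandas as pd

# Case 1: the column already holds Python lists, so explode() works directly.
df_lists = pd.DataFrame({
    'Product': ['Coke', 'Pepsi'],
    'Prices': [[100, 123, 101], [101, 104]],
})
exploded_lists = df_lists.explode('Prices').reset_index(drop=True)

# Case 2: the column holds comma-separated strings, so split them into lists first.
df_str = pd.DataFrame({
    'Product': ['Coke', 'Pepsi'],
    'Prices': ['100,123,101', '101,104'],
})
exploded_str = (
    df_str.assign(Prices=df_str['Prices'].str.split(','))
          .explode('Prices')
          .reset_index(drop=True)
)
# str.split produces strings, so convert the exploded values to integers.
exploded_str['Prices'] = exploded_str['Prices'].astype(int)

print(exploded_lists)
print(exploded_str)
```

Note that explode() repeats the original index label for every row that came from the same list, which is why the sketch calls reset_index(drop=True).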

  • 2020-11-22 07:47

    Also very late, but here is an answer from Karvy1 that worked well for me and does not require pandas >= 0.25: https://stackoverflow.com/a/52511166/10740287

    For the example above you may write:

    data = [(row.subject, row.trial_num, sample) for row in df.itertuples() for sample in row.samples]
    data = pd.DataFrame(data, columns=['subject', 'trial_num', 'samples'])
    

    Speed test:

    %timeit data = pd.DataFrame([(row.subject, row.trial_num, sample) for row in df.itertuples() for sample in row.samples], columns=['subject', 'trial_num', 'samples'])
    

    1.33 ms ± 74.8 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

    %timeit data = df.set_index(['subject', 'trial_num'])['samples'].apply(pd.Series).stack().reset_index()
    

    4.9 ms ± 189 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

    %timeit data = pd.DataFrame({col: np.repeat(df[col].values, df['samples'].str.len()) for col in df.columns.drop('samples')}).assign(**{'samples': np.concatenate(df['samples'].values)})
    

    1.38 ms ± 25 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
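
The timings above depend on a `df` that is not shown here. As a rough sketch, the script below builds a small frame with the same column names (the values are made up), wraps the three timed expressions in functions with hypothetical names, and times them with the standard `timeit` module instead of the IPython `%timeit` magic. Absolute numbers will differ with machine, pandas version and data size.

```python
import timeit

import numpy as np
import pandas as pd

# Made-up sample data using the column names from the answer.
df = pd.DataFrame({
    'subject': [1, 1, 2],
    'trial_num': [1, 2, 1],
    'samples': [[0.1, -0.2, 0.05], [0.25, 1.32, -0.17], [0.64, -0.22, -0.71]],
})

def via_itertuples(df):
    # One output tuple per (row, list element) pair.
    rows = [(row.subject, row.trial_num, sample)
            for row in df.itertuples() for sample in row.samples]
    return pd.DataFrame(rows, columns=['subject', 'trial_num', 'samples'])

def via_stack(df):
    # Lists become columns, then stack() turns them back into rows.
    return (df.set_index(['subject', 'trial_num'])['samples']
              .apply(pd.Series).stack().reset_index())

def via_repeat(df):
    # Repeat scalar columns by list length and flatten the list column.
    lengths = df['samples'].str.len()
    return pd.DataFrame({
        col: np.repeat(df[col].values, lengths)
        for col in df.columns.drop('samples')
    }).assign(samples=np.concatenate(df['samples'].values))

for func in (via_itertuples, via_stack, via_repeat):
    seconds = timeit.timeit(lambda: func(df), number=100)
    print(f'{func.__name__}: {seconds / 100 * 1e3:.3f} ms per call')
```

All three functions return the same exploded values for this input; the stack variant differs only in its column labels, which come from reset_index().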

  • 2020-11-22 07:51
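    import numpy as np
    import pandas as pd
    
    lst_col = 'samples'
    
    # Repeat each scalar column once per list element, then attach the flattened lists.
    r = pd.DataFrame({
          col:np.repeat(df[col].values, df[lst_col].str.len())
          for col in df.columns.drop(lst_col)}
        ).assign(**{lst_col:np.concatenate(df[lst_col].values)})[df.columns]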
    

    Result:

    In [103]: r
    Out[103]:
        samples  subject  trial_num
    0      0.10        1          1
    1     -0.20        1          1
    2      0.05        1          1
    3      0.25        1          2
    4      1.32        1          2
    5     -0.17        1          2
    6      0.64        1          3
    7     -0.22        1          3
    8     -0.71        1          3
    9     -0.03        2          1
    10    -0.65        2          1
    11     0.76        2          1
    12     1.77        2          2
    13     0.89        2          2
    14     0.65        2          2
    15    -0.98        2          3
    16     0.65        2          3
    17    -0.30        2          3
    

    PS here you may find a bit more generic solution


    UPDATE: some explanation. IMO the easiest way to understand this code is to execute it step by step.

    The following line repeats each value in one column N times, where N is the length of the corresponding list:

    In [10]: np.repeat(df['trial_num'].values, df[lst_col].str.len())
    Out[10]: array([1, 1, 1, 2, 2, 2, 3, 3, 3, 1, 1, 1, 2, 2, 2, 3, 3, 3], dtype=int64)
    

    This generalizes to all columns that contain scalar values:

    In [11]: pd.DataFrame({
        ...:           col:np.repeat(df[col].values, df[lst_col].str.len())
        ...:           for col in df.columns.drop(lst_col)}
        ...:         )
    Out[11]:
        trial_num  subject
    0           1        1
    1           1        1
    2           1        1
    3           2        1
    4           2        1
    5           2        1
    6           3        1
    ..        ...      ...
    11          1        2
    12          2        2
    13          2        2
    14          2        2
    15          3        2
    16          3        2
    17          3        2
    
    [18 rows x 2 columns]
    

    Using np.concatenate() we can flatten all the values in the list column (samples) into a 1D vector:

    In [12]: np.concatenate(df[lst_col].values)
    Out[12]: array([-1.04, -0.58, -1.32,  0.82, -0.59, -0.34,  0.25,  2.09,  0.12,  0.83, -0.88,  0.68,  0.55, -0.56,  0.65, -0.04,  0.36, -0.31])
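
The two NumPy building blocks can also be tried in isolation. The tiny arrays below are made up for illustration; the point is that the repeat counts are the list lengths, so the repeated scalars and the flattened list values always line up.

```python
import numpy as np

values = np.array([10, 20, 30])
nested = [[0.1, 0.2], [], [0.3, 0.4, 0.5]]
counts = [len(item) for item in nested]   # [2, 0, 3]

# Each element is repeated by its own count; a count of 0 drops the element.
repeated = np.repeat(values, counts)
print(repeated)   # [10 10 30 30 30]

# Flatten the sequence of lists into a single 1-D array.
flat = np.concatenate(nested)
print(flat)       # [0.1 0.2 0.3 0.4 0.5]
```

One consequence worth noticing: a row whose list is empty disappears from the result, because its scalar values are repeated zero times.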
    

    Putting all this together:

    In [13]: pd.DataFrame({
        ...:           col:np.repeat(df[col].values, df[lst_col].str.len())
        ...:           for col in df.columns.drop(lst_col)}
        ...:         ).assign(**{lst_col:np.concatenate(df[lst_col].values)})
    Out[13]:
        trial_num  subject  samples
    0           1        1    -1.04
    1           1        1    -0.58
    2           1        1    -1.32
    3           2        1     0.82
    4           2        1    -0.59
    5           2        1    -0.34
    6           3        1     0.25
    ..        ...      ...      ...
    11          1        2     0.68
    12          2        2     0.55
    13          2        2    -0.56
    14          2        2     0.65
    15          3        2    -0.04
    16          3        2     0.36
    17          3        2    -0.31
    
    [18 rows x 3 columns]
    

    Selecting with [df.columns] at the end guarantees that the columns come back in their original order.
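
For reuse, the steps can be wrapped in a small function. This is only a sketch: the name `explode_list_column` and the sample data are hypothetical, and it assumes every cell in the list column is a list.

```python
import numpy as np
import pandas as pd

def explode_list_column(df, lst_col):
    """Hypothetical helper: return one row per element of df[lst_col]."""
    lengths = df[lst_col].str.len()
    other_cols = df.columns.drop(lst_col)
    # Repeat every scalar column once per list element.
    result = pd.DataFrame({
        col: np.repeat(df[col].values, lengths) for col in other_cols
    })
    # Attach the flattened list values.
    result[lst_col] = np.concatenate(df[lst_col].values)
    # Restore the column order of the input frame.
    return result[df.columns]

# Made-up data with the list column first, to show that the order is preserved.
df = pd.DataFrame({
    'samples': [[0.1, -0.2], [0.25], [0.64, -0.22, -0.71]],
    'subject': [1, 1, 2],
    'trial_num': [1, 2, 1],
})
r = explode_list_column(df, 'samples')
print(r)
```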

  • 2020-11-22 07:51

    For those looking for a version of Roman Pekar's answer that avoids manual column naming:

    column_to_explode = 'samples'
    res = (df
           .set_index([x for x in df.columns if x != column_to_explode])[column_to_explode]
           .apply(pd.Series)
           .stack()
           .reset_index())
    res = res.rename(columns={
              res.columns[-2]:'exploded_{}_index'.format(column_to_explode),
              res.columns[-1]: '{}_exploded'.format(column_to_explode)})
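
Here is a self-contained run of the snippet above on made-up data, to show what the renamed columns look like. The lists are deliberately of equal length: with unequal lengths, apply(pd.Series) pads the shorter rows with NaN, and whether stack() then drops those entries may depend on the pandas version.

```python
import pandas as pd

# Made-up sample data; all lists have the same length.
df = pd.DataFrame({
    'subject': [1, 2],
    'trial_num': [1, 1],
    'samples': [[0.1, -0.2, 0.05], [0.64, -0.22, -0.71]],
})

column_to_explode = 'samples'
res = (df
       .set_index([x for x in df.columns if x != column_to_explode])[column_to_explode]
       .apply(pd.Series)   # one column per list position
       .stack()            # list position becomes the innermost index level
       .reset_index())
res = res.rename(columns={
    res.columns[-2]: 'exploded_{}_index'.format(column_to_explode),
    res.columns[-1]: '{}_exploded'.format(column_to_explode)})

print(res.columns.tolist())
print(res)
```

The second-to-last column holds the position of each value inside its original list, which the other approaches in this thread do not keep.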
    