Pandas column of lists, create a row for each list element

Asked by 有刺的猬 on 2020-11-22 06:59

I have a dataframe where some cells contain lists of multiple values. Rather than storing multiple values in a cell, I'd like to expand the dataframe so that each item in the list gets its own row (with the same values in all other columns).
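
For reference, the answers below assume a frame with a subject column, a trial_num column, and a list-valued samples column. A minimal sketch of such a frame (the values here are illustrative, not the OP's data):

    import numpy as np
    import pandas as pd

    # Illustrative frame in the shape the answers assume: one list of
    # samples per (subject, trial_num) combination.
    df = pd.DataFrame({
        'subject':   [1, 1, 1, 2, 2, 2],
        'trial_num': [1, 2, 3, 1, 2, 3],
        'samples':   [list(np.random.randn(3)) for _ in range(6)],
    })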

10 Answers
  • 2020-11-22 07:27

    Very late answer but I want to add this:

    This is a fast solution in plain Python that also takes care of the sample_num column from the OP's example. On my own large dataset of over 10 million rows (producing a 28-million-row result) it takes only about 38 seconds, whereas the accepted solution breaks down completely at that scale and raises a memory error on my machine with 128 GB of RAM.

    import pandas as pd

    df = df.reset_index(drop=True)
    lstcol = df.lstcol.values
    lstcollist = []  # flattened list elements
    indexlist = []   # original row index repeated once per element
    countlist = []   # position of each element within its list
    for ii in range(len(lstcol)):
        lstcollist.extend(lstcol[ii])
        indexlist.extend([ii] * len(lstcol[ii]))
        countlist.extend(range(len(lstcol[ii])))
    df = pd.merge(
        df.drop("lstcol", axis=1),
        pd.DataFrame({"lstcol": lstcollist, "lstcol_num": countlist}, index=indexlist),
        left_index=True, right_index=True,
    ).reset_index(drop=True)
    
  • 2020-11-22 07:29

    Trying to work through Roman Pekar's solution step-by-step to understand it better, I came up with my own solution, which uses melt to avoid some of the confusing stacking and index resetting. I can't say that it's obviously a clearer solution though:

    items_as_cols = df.apply(lambda x: pd.Series(x['samples']), axis=1)
    # Keep original df index as a column so it's retained after melt
    items_as_cols['orig_index'] = items_as_cols.index
    
    melted_items = pd.melt(items_as_cols, id_vars='orig_index', 
                           var_name='sample_num', value_name='sample')
    melted_items.set_index('orig_index', inplace=True)
    
    df.merge(melted_items, left_index=True, right_index=True)
    

    Output (obviously we can drop the original samples column now):

                     samples  subject  trial_num sample_num  sample
    0    [1.84, 1.05, -0.66]        1          1          0    1.84
    0    [1.84, 1.05, -0.66]        1          1          1    1.05
    0    [1.84, 1.05, -0.66]        1          1          2   -0.66
    1    [-0.24, -0.9, 0.65]        1          2          0   -0.24
    1    [-0.24, -0.9, 0.65]        1          2          1   -0.90
    1    [-0.24, -0.9, 0.65]        1          2          2    0.65
    2    [1.15, -0.87, -1.1]        1          3          0    1.15
    2    [1.15, -0.87, -1.1]        1          3          1   -0.87
    2    [1.15, -0.87, -1.1]        1          3          2   -1.10
    3   [-0.8, -0.62, -0.68]        2          1          0   -0.80
    3   [-0.8, -0.62, -0.68]        2          1          1   -0.62
    3   [-0.8, -0.62, -0.68]        2          1          2   -0.68
    4    [0.91, -0.47, 1.43]        2          2          0    0.91
    4    [0.91, -0.47, 1.43]        2          2          1   -0.47
    4    [0.91, -0.47, 1.43]        2          2          2    1.43
    5  [-1.14, -0.24, -0.91]        2          3          0   -1.14
    5  [-1.14, -0.24, -0.91]        2          3          1   -0.24
    5  [-1.14, -0.24, -0.91]        2          3          2   -0.91
    
  • 2020-11-22 07:30

    Pandas >= 0.25

    Series and DataFrame define an .explode() method that expands lists into separate rows. See the docs section on Exploding a list-like column.

    import numpy as np
    import pandas as pd

    df = pd.DataFrame({
        'var1': [['a', 'b', 'c'], ['d', 'e'], [], np.nan],
        'var2': [1, 2, 3, 4]
    })
    df
            var1  var2
    0  [a, b, c]     1
    1     [d, e]     2
    2         []     3
    3        NaN     4
    
    df.explode('var1')
    
      var1  var2
    0    a     1
    0    b     1
    0    c     1
    1    d     2
    1    e     2
    2  NaN     3  # empty list converted to NaN
    3  NaN     4  # NaN entry preserved as-is
    
    # to reset the index to be monotonically increasing...
    df.explode('var1').reset_index(drop=True)
    
      var1  var2
    0    a     1
    1    b     1
    2    c     1
    3    d     2
    4    e     2
    5  NaN     3
    6  NaN     4
    

    Note that this also handles mixed columns of lists and scalars, as well as empty lists and NaNs, appropriately (something the repeat-based solutions struggle with).

    However, you should note that explode only works on a single column (for now).

    P.S.: if you are looking to explode a column of strings, you need to split on a separator first, then use explode. See this (very much) related answer by me.
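
    As a minimal sketch of that split-then-explode pattern (the column name, separator, and values below are just for illustration):

    import pandas as pd

    # Hypothetical column of delimiter-separated strings.
    df = pd.DataFrame({'id': [1, 2, 3], 'tags': ['a,b,c', 'd', 'e,f']})

    # Split each string into a list, then explode the lists into rows.
    df.assign(tags=df['tags'].str.split(',')).explode('tags')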

  • 2020-11-22 07:32

    You can also use pd.concat and pd.melt for this:

    >>> objs = [df, pd.DataFrame(df['samples'].tolist())]
    >>> pd.concat(objs, axis=1).drop('samples', axis=1)
       subject  trial_num     0     1     2
    0        1          1 -0.49 -1.00  0.44
    1        1          2 -0.28  1.48  2.01
    2        1          3 -0.52 -1.84  0.02
    3        2          1  1.23 -1.36 -1.06
    4        2          2  0.54  0.18  0.51
    5        2          3 -2.18 -0.13 -1.35
    >>> pd.melt(_, var_name='sample_num', value_name='sample', 
    ...         value_vars=[0, 1, 2], id_vars=['subject', 'trial_num'])
        subject  trial_num sample_num  sample
    0         1          1          0   -0.49
    1         1          2          0   -0.28
    2         1          3          0   -0.52
    3         2          1          0    1.23
    4         2          2          0    0.54
    5         2          3          0   -2.18
    6         1          1          1   -1.00
    7         1          2          1    1.48
    8         1          3          1   -1.84
    9         2          1          1   -1.36
    10        2          2          1    0.18
    11        2          3          1   -0.13
    12        1          1          2    0.44
    13        1          2          2    2.01
    14        1          3          2    0.02
    15        2          1          2   -1.06
    16        2          2          2    0.51
    17        2          3          2   -1.35
    

    Last, if you need to, you can sort the result by the first three columns.
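
    For example (a sketch, where melted is a stand-in name for the DataFrame produced by pd.melt above):

    # 'melted' stands in for the pd.melt result shown above
    melted.sort_values(['subject', 'trial_num', 'sample_num']).reset_index(drop=True)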

  • 2020-11-22 07:34

    I found the easiest way was to:

    1. Convert the samples column into a DataFrame
    2. Join it with the original df
    3. Melt the result

    Shown here:

        (df.samples.apply(lambda x: pd.Series(x))
           .join(df)
           .melt(['subject', 'trial_num'], [0, 1, 2], var_name='sample'))
    
            subject  trial_num sample  value
        0         1          1      0  -0.24
        1         1          2      0   0.14
        2         1          3      0  -0.67
        3         2          1      0  -1.52
        4         2          2      0  -0.00
        5         2          3      0  -1.73
        6         1          1      1  -0.70
        7         1          2      1  -0.70
        8         1          3      1  -0.29
        9         2          1      1  -0.70
        10        2          2      1  -0.72
        11        2          3      1   1.30
        12        1          1      2  -0.55
        13        1          2      2   0.10
        14        1          3      2  -0.44
        15        2          1      2   0.13
        16        2          2      2  -1.44
        17        2          3      2   0.73
    

    It's worth noting that this may have only worked because each trial has the same number of samples (3). Something more clever may be necessary for trials of different sample sizes.
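
    If trials do have different numbers of samples, the explode-based approach from another answer handles ragged lists directly. A small sketch (assuming pandas >= 0.25; the values are made up):

        import pandas as pd

        # Lists of unequal length: explode emits as many rows per original
        # row as that row's list has elements.
        ragged = pd.DataFrame({
            'subject':   [1, 1],
            'trial_num': [1, 2],
            'samples':   [[0.1, 0.2, 0.3], [0.4]],
        })
        ragged.explode('samples').reset_index(drop=True)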

  • 2020-11-22 07:42

    A bit longer than I expected:

    >>> df
                    samples  subject  trial_num
    0  [-0.07, -2.9, -2.44]        1          1
    1   [-1.52, -0.35, 0.1]        1          2
    2  [-0.17, 0.57, -0.65]        1          3
    3  [-0.82, -1.06, 0.47]        2          1
    4   [0.79, 1.35, -0.09]        2          2
    5   [1.17, 1.14, -1.79]        2          3
    >>>
    >>> s = df.apply(lambda x: pd.Series(x['samples']),axis=1).stack().reset_index(level=1, drop=True)
    >>> s.name = 'sample'
    >>>
    >>> df.drop('samples', axis=1).join(s)
       subject  trial_num  sample
    0        1          1   -0.07
    0        1          1   -2.90
    0        1          1   -2.44
    1        1          2   -1.52
    1        1          2   -0.35
    1        1          2    0.10
    2        1          3   -0.17
    2        1          3    0.57
    2        1          3   -0.65
    3        2          1   -0.82
    3        2          1   -1.06
    3        2          1    0.47
    4        2          2    0.79
    4        2          2    1.35
    4        2          2   -0.09
    5        2          3    1.17
    5        2          3    1.14
    5        2          3   -1.79
    

    If you want a sequential index, you can apply reset_index(drop=True) to the result.
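
    For example, continuing the session above (s is the stacked sample Series built earlier):

    >>> # same join as above, but with a fresh 0..n-1 index
    >>> df.drop('samples', axis=1).join(s).reset_index(drop=True)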

    update:

    >>> res = df.set_index(['subject', 'trial_num'])['samples'].apply(pd.Series).stack()
    >>> res = res.reset_index()
    >>> res.columns = ['subject','trial_num','sample_num','sample']
    >>> res
        subject  trial_num  sample_num  sample
    0         1          1           0    1.89
    1         1          1           1   -2.92
    2         1          1           2    0.34
    3         1          2           0    0.85
    4         1          2           1    0.24
    5         1          2           2    0.72
    6         1          3           0   -0.96
    7         1          3           1   -2.72
    8         1          3           2   -0.11
    9         2          1           0   -1.33
    10        2          1           1    3.13
    11        2          1           2   -0.65
    12        2          2           0    0.10
    13        2          2           1    0.65
    14        2          2           2    0.15
    15        2          3           0    0.64
    16        2          3           1   -0.10
    17        2          3           2   -0.76
    