I have a dataframe where some cells contain lists of multiple values. Rather than storing multiple values in a cell, I\'d like to expand the dataframe so that each item in t
Very late answer but I want to add this:
A fast solution using vanilla Python that also takes care of the sample_num
column in OP's example. On my own large dataset with over 10 million rows and a result with 28 million rows this only takes about 38 seconds. The accepted solution completely breaks down with that amount of data and leads to a memory error
on my system that has 128GB of RAM.
df = df.reset_index(drop=True)
lstcol = df.lstcol.values
lstcollist = []
indexlist = []
countlist = []
for ii in range(len(lstcol)):
lstcollist.extend(lstcol[ii])
indexlist.extend([ii]*len(lstcol[ii]))
countlist.extend([jj for jj in range(len(lstcol[ii]))])
df = pd.merge(df.drop("lstcol",axis=1),pd.DataFrame({"lstcol":lstcollist,"lstcol_num":countlist},
index=indexlist),left_index=True,right_index=True).reset_index(drop=True)
Trying to work through Roman Pekar's solution step-by-step to understand it better, I came up with my own solution, which uses melt
to avoid some of the confusing stacking and index resetting. I can't say that it's obviously a clearer solution though:
items_as_cols = df.apply(lambda x: pd.Series(x['samples']), axis=1)
# Keep original df index as a column so it's retained after melt
items_as_cols['orig_index'] = items_as_cols.index
melted_items = pd.melt(items_as_cols, id_vars='orig_index',
var_name='sample_num', value_name='sample')
melted_items.set_index('orig_index', inplace=True)
df.merge(melted_items, left_index=True, right_index=True)
Output (obviously we can drop the original samples column now):
samples subject trial_num sample_num sample
0 [1.84, 1.05, -0.66] 1 1 0 1.84
0 [1.84, 1.05, -0.66] 1 1 1 1.05
0 [1.84, 1.05, -0.66] 1 1 2 -0.66
1 [-0.24, -0.9, 0.65] 1 2 0 -0.24
1 [-0.24, -0.9, 0.65] 1 2 1 -0.90
1 [-0.24, -0.9, 0.65] 1 2 2 0.65
2 [1.15, -0.87, -1.1] 1 3 0 1.15
2 [1.15, -0.87, -1.1] 1 3 1 -0.87
2 [1.15, -0.87, -1.1] 1 3 2 -1.10
3 [-0.8, -0.62, -0.68] 2 1 0 -0.80
3 [-0.8, -0.62, -0.68] 2 1 1 -0.62
3 [-0.8, -0.62, -0.68] 2 1 2 -0.68
4 [0.91, -0.47, 1.43] 2 2 0 0.91
4 [0.91, -0.47, 1.43] 2 2 1 -0.47
4 [0.91, -0.47, 1.43] 2 2 2 1.43
5 [-1.14, -0.24, -0.91] 2 3 0 -1.14
5 [-1.14, -0.24, -0.91] 2 3 1 -0.24
5 [-1.14, -0.24, -0.91] 2 3 2 -0.91
Series and DataFrame methods define a .explode() method that explodes lists into separate rows. See the docs section on Exploding a list-like column.
df = pd.DataFrame({
'var1': [['a', 'b', 'c'], ['d', 'e',], [], np.nan],
'var2': [1, 2, 3, 4]
})
df
var1 var2
0 [a, b, c] 1
1 [d, e] 2
2 [] 3
3 NaN 4
df.explode('var1')
var1 var2
0 a 1
0 b 1
0 c 1
1 d 2
1 e 2
2 NaN 3 # empty list converted to NaN
3 NaN 4 # NaN entry preserved as-is
# to reset the index to be monotonically increasing...
df.explode('var1').reset_index(drop=True)
var1 var2
0 a 1
1 b 1
2 c 1
3 d 2
4 e 2
5 NaN 3
6 NaN 4
Note that this also handles mixed columns of lists and scalars, as well as empty lists and NaNs appropriately (this is a drawback of repeat
-based solutions).
However, you should note that explode
only works on a single column (for now).
P.S.: if you are looking to explode a column of strings, you need to split on a separator first, then use explode
. See this (very much) related answer by me.
you can also use pd.concat and pd.melt for this:
>>> objs = [df, pd.DataFrame(df['samples'].tolist())]
>>> pd.concat(objs, axis=1).drop('samples', axis=1)
subject trial_num 0 1 2
0 1 1 -0.49 -1.00 0.44
1 1 2 -0.28 1.48 2.01
2 1 3 -0.52 -1.84 0.02
3 2 1 1.23 -1.36 -1.06
4 2 2 0.54 0.18 0.51
5 2 3 -2.18 -0.13 -1.35
>>> pd.melt(_, var_name='sample_num', value_name='sample',
... value_vars=[0, 1, 2], id_vars=['subject', 'trial_num'])
subject trial_num sample_num sample
0 1 1 0 -0.49
1 1 2 0 -0.28
2 1 3 0 -0.52
3 2 1 0 1.23
4 2 2 0 0.54
5 2 3 0 -2.18
6 1 1 1 -1.00
7 1 2 1 1.48
8 1 3 1 -1.84
9 2 1 1 -1.36
10 2 2 1 0.18
11 2 3 1 -0.13
12 1 1 2 0.44
13 1 2 2 2.01
14 1 3 2 0.02
15 2 1 2 -1.06
16 2 2 2 0.51
17 2 3 2 -1.35
last, if you need you can sort base on the first the first three columns.
I found the easiest way was to:
samples
column into a DataFrameShown here:
df.samples.apply(lambda x: pd.Series(x)).join(df).\
melt(['subject','trial_num'],[0,1,2],var_name='sample')
subject trial_num sample value
0 1 1 0 -0.24
1 1 2 0 0.14
2 1 3 0 -0.67
3 2 1 0 -1.52
4 2 2 0 -0.00
5 2 3 0 -1.73
6 1 1 1 -0.70
7 1 2 1 -0.70
8 1 3 1 -0.29
9 2 1 1 -0.70
10 2 2 1 -0.72
11 2 3 1 1.30
12 1 1 2 -0.55
13 1 2 2 0.10
14 1 3 2 -0.44
15 2 1 2 0.13
16 2 2 2 -1.44
17 2 3 2 0.73
It's worth noting that this may have only worked because each trial has the same number of samples (3). Something more clever may be necessary for trials of different sample sizes.
A bit longer than I expected:
>>> df
samples subject trial_num
0 [-0.07, -2.9, -2.44] 1 1
1 [-1.52, -0.35, 0.1] 1 2
2 [-0.17, 0.57, -0.65] 1 3
3 [-0.82, -1.06, 0.47] 2 1
4 [0.79, 1.35, -0.09] 2 2
5 [1.17, 1.14, -1.79] 2 3
>>>
>>> s = df.apply(lambda x: pd.Series(x['samples']),axis=1).stack().reset_index(level=1, drop=True)
>>> s.name = 'sample'
>>>
>>> df.drop('samples', axis=1).join(s)
subject trial_num sample
0 1 1 -0.07
0 1 1 -2.90
0 1 1 -2.44
1 1 2 -1.52
1 1 2 -0.35
1 1 2 0.10
2 1 3 -0.17
2 1 3 0.57
2 1 3 -0.65
3 2 1 -0.82
3 2 1 -1.06
3 2 1 0.47
4 2 2 0.79
4 2 2 1.35
4 2 2 -0.09
5 2 3 1.17
5 2 3 1.14
5 2 3 -1.79
If you want sequential index, you can apply reset_index(drop=True)
to the result.
update:
>>> res = df.set_index(['subject', 'trial_num'])['samples'].apply(pd.Series).stack()
>>> res = res.reset_index()
>>> res.columns = ['subject','trial_num','sample_num','sample']
>>> res
subject trial_num sample_num sample
0 1 1 0 1.89
1 1 1 1 -2.92
2 1 1 2 0.34
3 1 2 0 0.85
4 1 2 1 0.24
5 1 2 2 0.72
6 1 3 0 -0.96
7 1 3 1 -2.72
8 1 3 2 -0.11
9 2 1 0 -1.33
10 2 1 1 3.13
11 2 1 2 -0.65
12 2 2 0 0.10
13 2 2 1 0.65
14 2 2 2 0.15
15 2 3 0 0.64
16 2 3 1 -0.10
17 2 3 2 -0.76