I\'m looking to turn a pandas cell containing a list into rows for each of those values.
So, take this:
If I\'d like to unpack and stack the value
Nicer alternative solution with apply(pd.Series):
df = pd.DataFrame({'listcol':[[1,2,3],[4,5,6]]})
# expand df.listcol into its own dataframe
tags = df['listcol'].apply(pd.Series)
# rename each variable is listcol
tags = tags.rename(columns = lambda x : 'listcol_' + str(x))
# join the tags dataframe back to the original dataframe
df = pd.concat([df[:], tags[:]], axis=1)
Instead of using apply(pd.Series) you can flatten the column. This improves performance.
df = (pd.DataFrame({'name': ['A.J. Price'] * 3,
'opponent': ['76ers', 'blazers', 'bobcats'],
'nearest_neighbors': [['Zach LaVine', 'Jeremy Lin', 'Nate Robinson', 'Isaia']] * 3})
.set_index(['name', 'opponent']))
%timeit (pd.DataFrame(df['nearest_neighbors'].values.tolist(), index = df.index)
.stack()
.reset_index(level = 2, drop=True).to_frame('nearest_neighbors'))
1.87 ms ± 9.74 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit (df.nearest_neighbors.apply(pd.Series)
.stack()
.reset_index(level=2, drop=True)
.to_frame('nearest_neighbors'))
2.73 ms ± 16.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Use apply(pd.Series)
and stack
, then reset_index
and to_frame
In [1803]: (df.nearest_neighbors.apply(pd.Series)
.stack()
.reset_index(level=2, drop=True)
.to_frame('nearest_neighbors'))
Out[1803]:
nearest_neighbors
name opponent
A.J. Price 76ers Zach LaVine
76ers Jeremy Lin
76ers Nate Robinson
76ers Isaia
blazers Zach LaVine
blazers Jeremy Lin
blazers Nate Robinson
blazers Isaia
bobcats Zach LaVine
bobcats Jeremy Lin
bobcats Nate Robinson
bobcats Isaia
Details
In [1804]: df
Out[1804]:
nearest_neighbors
name opponent
A.J. Price 76ers [Zach LaVine, Jeremy Lin, Nate Robinson, Isaia]
blazers [Zach LaVine, Jeremy Lin, Nate Robinson, Isaia]
bobcats [Zach LaVine, Jeremy Lin, Nate Robinson, Isaia]
So all of these answers are good but I wanted something ^really simple^ so here's my contribution:
def explode(series):
return pd.Series([x for _list in series for x in _list])
That's it.. just use this when you want a new series where the lists are 'exploded'. Here's an example where we do value_counts() on taco choices :)
In [1]: my_df = pd.DataFrame(pd.Series([['a','b','c'],['b','c'],['c']]), columns=['tacos'])
In [2]: my_df.head()
Out[2]:
tacos
0 [a, b, c]
1 [b, c]
2 [c]
In [3]: explode(my_df['tacos']).value_counts()
Out[3]:
c 3
b 2
a 1
I think this a really good question, in Hive you would use EXPLODE
, I think there is a case to be made that Pandas should include this functionality by default. I would probably explode the list column with a nested generator comprehension like this:
pd.DataFrame({
"name": i[0],
"opponent": i[1],
"nearest_neighbor": neighbour
}
for i, row in df.iterrows() for neighbour in row.nearest_neighbors
).set_index(["name", "opponent"])