I have a large dataframe and I am storing a lot of redundant values that are making it hard to handle my data. I have a dataframe of the form:

import pandas as pd

df = pd.DataFrame([["a", "g", "n1", "y1"],
                   ["a", "g", "n2", "y2"],
                   ["b", "h", "n1", "y3"],
                   ["b", "h", "n2", "y4"]],
                  columns=["meta1", "meta2", "name", "data"])

and I would like to reshape it so that there is one row per (meta1, meta2) pair and one column per name, holding the corresponding data values.
If you group your meta columns into a list then you can do this:
metas = ['meta1', 'meta2']
new_df = df.set_index(['name'] + metas).unstack('name')
print(new_df)

            data
name          n1  n2
meta1 meta2
a     g       y1  y2
b     h       y3  y4
Which gets you most of the way there. Additional tailoring can get you the rest of the way.
print(new_df.data.rename_axis([None], axis=1).reset_index())

  meta1 meta2  n1  n2
0     a     g  y1  y2
1     b     h  y3  y4
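Putting the two steps together, and assuming the df and metas defined above, the whole reshape can be written as one chained expression. This is just a minimal sketch of the same idea, not a different method:

metas = ['meta1', 'meta2']

wide = (df.set_index(['name'] + metas)    # make name/meta1/meta2 the index
          .unstack('name')['data']        # pivot 'name' into columns, keep the 'data' block
          .rename_axis(None, axis=1)      # drop the leftover 'name' label on the columns
          .reset_index())                 # turn meta1/meta2 back into ordinary columns

print(wide)
#   meta1 meta2  n1  n2
# 0     a     g  y1  y2
# 1     b     h  y3  y4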
You can use pivot_table with reset_index and rename_axis (new in pandas 0.18.0):
print(df.pivot_table(index=['meta1','meta2'],
                     columns='name',
                     values='data',
                     aggfunc='first')
        .reset_index()
        .rename_axis(None, axis=1))

  meta1 meta2  n1  n2
0     a     g  y1  y2
1     b     h  y3  y4
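As a side note, when every (meta1, meta2, name) combination occurs only once, as in the sample data above, newer pandas versions also let you skip the aggregation entirely: DataFrame.pivot accepts a list of index columns (from pandas 1.1, if I recall correctly). A minimal sketch with the same df:

print(df.pivot(index=['meta1', 'meta2'],   # list-valued index needs a fairly recent pandas
               columns='name',
               values='data')
        .reset_index()
        .rename_axis(None, axis=1))

Unlike pivot_table, pivot raises a ValueError when the index/columns pairs contain duplicates, which is exactly the situation the aggfunc discussion below is about.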
But it is better to use aggfunc=', '.join:
print(df.pivot_table(index=['meta1','meta2'],
                     columns='name',
                     values='data',
                     aggfunc=', '.join)
        .reset_index()
        .rename_axis(None, axis=1))

  meta1 meta2  n1  n2
0     a     g  y1  y2
1     b     h  y3  y4
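Under the hood this is essentially a groupby followed by unstack, so an equivalent spelling is possible; a sketch under the same assumptions:

print(df.groupby(['meta1', 'meta2', 'name'])['data']
        .agg(', '.join)                 # same aggregation as aggfunc=', '.join
        .unstack('name')                # move the 'name' values into columns
        .reset_index()
        .rename_axis(None, axis=1))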
Explanation of why join is generally better than first:

If you use first, you can lose any data that is not first within each group defined by index, whereas join concatenates it all:
import pandas as pd

df = pd.DataFrame([["a","g","n1","y1"],
                   ["a","g","n2","y2"],
                   ["a","g","n1","y3"],
                   ["b","h","n2","y4"]],
                  columns=["meta1", "meta2", "name", "data"])
print(df)

  meta1 meta2 name data
0     a     g   n1   y1
1     a     g   n2   y2
2     a     g   n1   y3
3     b     h   n2   y4
print(df.pivot_table(index=['meta1','meta2'],
                     columns='name',
                     values='data',
                     aggfunc='first')
        .reset_index()
        .rename_axis(None, axis=1))

  meta1 meta2    n1  n2
0     a     g    y1  y2
1     b     h  None  y4
print(df.pivot_table(index=['meta1','meta2'],
                     columns='name',
                     values='data',
                     aggfunc=', '.join)
        .reset_index()
        .rename_axis(None, axis=1))

  meta1 meta2      n1  n2
0     a     g  y1, y3  y2
1     b     h    None  y4
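If you want to check up front whether first would silently drop anything, looking for duplicated (meta1, meta2, name) combinations is enough; a small sketch against the df defined just above:

# keep=False marks every member of a duplicated group, not just the later ones
dupes = df[df.duplicated(subset=['meta1', 'meta2', 'name'], keep=False)]
print(dupes)
#   meta1 meta2 name data
# 0     a     g   n1   y1
# 2     a     g   n1   y3

If this comes back empty, first and join give the same result.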