I have a data frame with a hierarchical index in axis 1 (columns) (from a groupby.agg
operation):
USAF WBAN year month day s_PC s_CL
After reading through all the answers, I came up with this:
import pandas as pd

def __my_flatten_cols(self, how="_".join, reset_index=True):
    # "last" keeps only the bottom-most label of each column
    how = (lambda labels: list(labels)[-1]) if how == "last" else how
    if isinstance(self.columns, pd.MultiIndex):
        # convert labels to str, drop empty ones, then combine them with `how`
        self.columns = [how(filter(None, map(str, levels))) for levels in self.columns.values]
    return self.reset_index() if reset_index else self

pd.DataFrame.my_flatten_cols = __my_flatten_cols
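If patching pd.DataFrame feels too invasive, the same logic can be written as a plain function and called through DataFrame.pipe. This is only a sketch, not part of the original helper: the name flatten_cols is hypothetical, and unlike the helper above it works on a copy instead of modifying the frame in place.

```python
import pandas as pd

def flatten_cols(df, how="_".join, reset_index=True):
    """Hypothetical non-mutating variant of the helper (name is an assumption)."""
    if how == "last":
        # keep only the bottom-most non-empty label
        how = lambda labels: list(labels)[-1]
    out = df.copy()
    if isinstance(out.columns, pd.MultiIndex):
        # convert labels to str, drop empty ones, then combine them
        out.columns = [how(filter(None, map(str, levels))) for levels in out.columns.values]
    return out.reset_index() if reset_index else out

df = pd.DataFrame({"grouper": ["x", "x", "y", "y"], "val1": [0, 2, 4, 6], 2: [1, 3, 5, 7]})
agg = df.groupby(by="grouper").agg({"val1": "min", 2: ["sum", "size"]})

flat = agg.pipe(flatten_cols)
print(list(flat.columns))
```

The aggregated frame agg keeps its MultiIndex columns, so it can still be flattened in a different way afterwards.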
Given a data frame:
df = pd.DataFrame({"grouper": ["x","x","y","y"], "val1": [0,2,4,6], 2: [1,3,5,7]}, columns=["grouper", "val1", 2])
  grouper  val1  2
0       x     0  1
1       x     2  3
2       y     4  5
3       y     6  7
Single aggregation method: resulting variables named the same as source:
df.groupby(by="grouper").agg("min").my_flatten_cols()
Same as df.groupby(by="grouper", as_index=False).agg(...) or df.groupby(by="grouper").agg(...).reset_index().
----- before -----
         val1  2
grouper
------ after -----
  grouper  val1  2
0       x     0  1
1       y     4  5
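For this single-aggregation case the helper is not strictly needed. A short sketch of the two plain-pandas spellings, using the example frame (rebuilt here so the snippet runs on its own):

```python
import pandas as pd

df = pd.DataFrame({"grouper": ["x", "x", "y", "y"], "val1": [0, 2, 4, 6], 2: [1, 3, 5, 7]})

# Option 1: keep the group key as a regular column from the start
a = df.groupby(by="grouper", as_index=False).agg("min")

# Option 2: aggregate with the key in the index, then move it back to a column
b = df.groupby(by="grouper").agg("min").reset_index()

print(a)
print(b)
```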
Single source variable, multiple aggregations: resulting variables named after statistics:
df.groupby(by="grouper").agg({"val1": [min,max]}).my_flatten_cols("last")
Same as a = df.groupby(..).agg(..); a.columns = a.columns.droplevel(0); a.reset_index().
----- before -----
         val1
          min max
grouper
------ after -----
  grouper  min  max
0       x    0    2
1       y    4    6
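Spelled out without the helper, the droplevel route looks roughly like this. It is a sketch on the example frame; the aggregations are passed as strings here, which name the resulting columns the same way as the builtins.

```python
import pandas as pd

df = pd.DataFrame({"grouper": ["x", "x", "y", "y"], "val1": [0, 2, 4, 6], 2: [1, 3, 5, 7]})

a = df.groupby(by="grouper").agg({"val1": ["min", "max"]})
# Drop the top ("val1") level and keep only the statistic names
a.columns = a.columns.droplevel(0)
a = a.reset_index()
print(a)
```

This only makes sense with a single source variable. With several, the statistic names would collide.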
Multiple variables, multiple aggregations: resulting variables named (varname)_(statname):
df.groupby(by="grouper").agg({"val1": min, 2:[sum, "size"]}).my_flatten_cols()
# you can combine the names in other ways too, e.g. use a different delimiter:
#df.groupby(by="grouper").agg({"val1": min, 2:[sum, "size"]}).my_flatten_cols(" ".join)
This runs a.columns = ["_".join(filter(None, map(str, levels))) for levels in a.columns.values] under the hood (since this form of agg() results in a MultiIndex on columns).
If you don't have the my_flatten_cols helper, it might be easier to type in the solution suggested by @Seigi: a.columns = ["_".join(t).rstrip("_") for t in a.columns.values], which works similarly when all column labels are strings (but raises a TypeError on numeric labels such as the 2 in this example).
To handle numeric labels on columns, you can use the variant suggested in other answers (a.columns = ["_".join(tuple(map(str, t))).rstrip("_") for t in a.columns.values]), but I don't understand why the tuple() call is needed, and I believe rstrip() is only required if some columns have a descriptor like ("colname", "") (which can happen if you reset_index() before trying to fix up .columns).
----- before -----
         val1   2
          min sum size
grouper
------ after -----
  grouper  val1_min  2_sum  2_size
0       x         0      4       2
1       y         4     12       2
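To make the differences between these one-liners concrete, here is a small sketch on the example frame. It shows the plain join failing on the integer label 2, the map(str, ..) variant working without tuple(), and the trailing underscore that appears when reset_index() is called before the columns are fixed up. The variable names are for illustration only.

```python
import pandas as pd

df = pd.DataFrame({"grouper": ["x", "x", "y", "y"], "val1": [0, 2, 4, 6], 2: [1, 3, 5, 7]})
a = df.groupby(by="grouper").agg({"val1": "min", 2: ["sum", "size"]})

# str.join only accepts strings, so the integer label 2 breaks the plain version
join_failed = False
try:
    ["_".join(t) for t in a.columns.values]
except TypeError:
    join_failed = True

# Converting each label to str first handles numeric labels; tuple() is not needed
with_map = ["_".join(map(str, t)) for t in a.columns.values]

# After reset_index() the key column is labelled ("grouper", ""),
# which leaves a trailing "_" unless the empty label is stripped or filtered out
b = a.reset_index()
trailing = ["_".join(map(str, t)) for t in b.columns.values]
cleaned = ["_".join(filter(None, map(str, t))) for t in b.columns.values]

print(join_failed, with_map, trailing, cleaned)
```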
You want to name the resulting variables manually (the nested-dict form below has been deprecated since pandas 0.20.0 and was removed in pandas 1.0; there was no adequate alternative as of 0.23):
df.groupby(by="grouper").agg({"val1": {"sum_of_val1": "sum", "count_of_val1": "count"},
2: {"sum_of_2": "sum", "count_of_2": "count"}}).my_flatten_cols("last")
Other suggestions include setting the columns manually, e.g. res.columns = ['A_sum', 'B_sum', 'count'], or .join()ing multiple groupby statements.
----- before -----
                   val1                  2
          count_of_val1 sum_of_val1 count_of_2 sum_of_2
grouper
------ after -----
  grouper  count_of_val1  sum_of_val1  count_of_2  sum_of_2
0       x              2            2           2         4
1       y              2           10           2        12
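Since pandas 0.25, named aggregation covers this case directly: each keyword is the output column name and each value is a (source column, aggregation) pair. A sketch on the example frame, assuming pandas 0.25 or newer. The result already has flat columns, so no flattening step is needed.

```python
import pandas as pd

df = pd.DataFrame({"grouper": ["x", "x", "y", "y"], "val1": [0, 2, 4, 6], 2: [1, 3, 5, 7]})

res = df.groupby(by="grouper").agg(
    sum_of_val1=("val1", "sum"),
    count_of_val1=("val1", "count"),
    sum_of_2=(2, "sum"),
    count_of_2=(2, "count"),
).reset_index()
print(res)
```

The output names have to be valid Python identifiers when written as keywords; other names can be passed by unpacking a dict with **.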
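The helper applies map(str, ..) because column labels can be non-strings (such as the integer 2 above).
Labels can also be empty, hence filter(None, ..).
For single-level columns (anything other than a MultiIndex), columns.values returns the names (str, not tuples), so the helper leaves them untouched.
Depending on how you used .agg(), you may need to keep the bottom-most label for a column or concatenate multiple labels.
More often than not I want reset_index() to be able to work with the group-by columns in the regular way, so it does that by default.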
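These cases can be checked without pandas by applying the helper's core expression to plain tuples. The combine function and the join/last names below are hypothetical wrappers around how(filter(None, map(str, levels))), written only for illustration.

```python
def combine(levels, how):
    # Same expression the helper applies to each column's tuple of labels
    return how(filter(None, map(str, levels)))

join = "_".join
last = lambda labels: list(labels)[-1]

joined_str = combine(("val1", "min"), join)     # both labels kept and joined
joined_int = combine((2, "sum"), join)          # integer label converted by map(str, ..)
joined_empty = combine(("grouper", ""), join)   # empty label dropped by filter(None, ..)
last_label = combine(("val1", "min"), last)     # "last" keeps the bottom-most label
last_empty = combine(("grouper", ""), last)     # bottom-most non-empty label

# A single-level column name is a plain str, which would be split into characters;
# this is why the helper only rewrites columns when they are a MultiIndex
split_name = combine("val1", join)

print(joined_str, joined_int, joined_empty, last_label, last_empty, split_name)
```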