Pandas - How to flatten a hierarchical index in columns

后端 未结 17 1073
忘掉有多难
忘掉有多难 2020-11-22 02:55

I have a data frame with a hierarchical index in axis 1 (columns) (from a groupby.agg operation):

     USAF   WBAN  year  month  day  s_PC  s_CL         


        
17条回答
  •  盖世英雄少女心
    2020-11-22 03:21

    After reading through all the answers, I came up with this:

    def __my_flatten_cols(self, how="_".join, reset_index=True):
        how = (lambda iter: list(iter)[-1]) if how == "last" else how
        self.columns = [how(filter(None, map(str, levels))) for levels in self.columns.values] \
                        if isinstance(self.columns, pd.MultiIndex) else self.columns
        return self.reset_index() if reset_index else self
    pd.DataFrame.my_flatten_cols = __my_flatten_cols
    

    Usage:

    Given a data frame:

    df = pd.DataFrame({"grouper": ["x","x","y","y"], "val1": [0,2,4,6], 2: [1,3,5,7]}, columns=["grouper", "val1", 2])
    
      grouper  val1  2
    0       x     0  1
    1       x     2  3
    2       y     4  5
    3       y     6  7
    
    • Single aggregation method: resulting variables named the same as source:

      df.groupby(by="grouper").agg("min").my_flatten_cols()
      
      • Same as df.groupby(by="grouper", as_index=False) or .agg(...).reset_index()
      • ----- before -----
                   val1  2
          grouper         
        
        ------ after -----
          grouper  val1  2
        0       x     0  1
        1       y     4  5
        
    • Single source variable, multiple aggregations: resulting variables named after statistics:

      df.groupby(by="grouper").agg({"val1": [min,max]}).my_flatten_cols("last")
      
      • Same as a = df.groupby(..).agg(..); a.columns = a.columns.droplevel(0); a.reset_index().
      • ----- before -----
                    val1    
                   min max
          grouper         
        
        ------ after -----
          grouper  min  max
        0       x    0    2
        1       y    4    6
        
    • Multiple variables, multiple aggregations: resulting variables named (varname)_(statname):

      df.groupby(by="grouper").agg({"val1": min, 2:[sum, "size"]}).my_flatten_cols()
      # you can combine the names in other ways too, e.g. use a different delimiter:
      #df.groupby(by="grouper").agg({"val1": min, 2:[sum, "size"]}).my_flatten_cols(" ".join)
      
      • Runs a.columns = ["_".join(filter(None, map(str, levels))) for levels in a.columns.values] under the hood (since this form of agg() results in MultiIndex on columns).
      • If you don't have the my_flatten_cols helper, it might be easier to type in the solution suggested by @Seigi: a.columns = ["_".join(t).rstrip("_") for t in a.columns.values], which works similarly in this case (but fails if you have numeric labels on columns)
      • To handle the numeric labels on columns, you could use the solution suggested by @jxstanford and @Nolan Conaway (a.columns = ["_".join(tuple(map(str, t))).rstrip("_") for t in a.columns.values]), but I don't understand why the tuple() call is needed, and I believe rstrip() is only required if some columns have a descriptor like ("colname", "") (which can happen if you reset_index() before trying to fix up .columns)
      • ----- before -----
                   val1           2     
                   min       sum    size
          grouper              
        
        ------ after -----
          grouper  val1_min  2_sum  2_size
        0       x         0      4       2
        1       y         4     12       2
        
    • You want to name the resulting variables manually: (this is deprecated since pandas 0.20.0 with no adequate alternative as of 0.23)

      df.groupby(by="grouper").agg({"val1": {"sum_of_val1": "sum", "count_of_val1": "count"},
                                         2: {"sum_of_2":    "sum", "count_of_2":    "count"}}).my_flatten_cols("last")
      
      • Other suggestions include: setting the columns manually: res.columns = ['A_sum', 'B_sum', 'count'] or .join()ing multiple groupby statements.
      • ----- before -----
                           val1                      2         
                  count_of_val1 sum_of_val1 count_of_2 sum_of_2
          grouper                                              
        
        ------ after -----
          grouper  count_of_val1  sum_of_val1  count_of_2  sum_of_2
        0       x              2            2           2         4
        1       y              2           10           2        12
        

    Cases handled by the helper function

    • level names can be non-string, e.g. Index pandas DataFrame by column numbers, when column names are integers, so we have to convert with map(str, ..)
    • they can also be empty, so we have to filter(None, ..)
    • for single-level columns (i.e. anything except MultiIndex), columns.values returns the names (str, not tuples)
    • depending on how you used .agg() you may need to keep the bottom-most label for a column or concatenate multiple labels
    • (since I'm new to pandas?) more often than not, I want reset_index() to be able to work with the group-by columns in the regular way, so it does that by default

提交回复
热议问题