pandas groupby without turning grouped by column into index

粉色の甜心 2020-11-27 17:43

The default behavior of pandas groupby is to turn the group by columns into index and remove them from the list of columns of the dataframe. For instance, say I have a dataF

4 answers
  • 2020-11-27 18:09

    Another way to do this would be:

    df.groupby(['col2', 'col3']).sum().reset_index()
    
  • 2020-11-27 18:15
    df.groupby(['col2','col3'], as_index=False).sum()
    
  • 2020-11-27 18:36

    Not sure, but I think the right answer would be

    df = df.groupby(['col2','col3']).sum()
    df = df.reset_index()
    

    At least, that is what I do all the time to avoid dataframes with a multi-index.

  • 2020-11-27 18:36

    This somewhat detailed answer is for those who are still unsure which variant of the answers to use.

    First, the two suggested solutions to this problem are:

    • Solution 1: df.groupby(['col2', 'col3'], as_index=False).sum()
    • Solution 2: df.groupby(['col2', 'col3']).sum().reset_index()

    Both give the expected result.
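
    As a quick check of that claim, here is a minimal sketch that builds a small, made-up frame (the values are hypothetical; only the col1/col2/col3 naming follows the example further down) and compares the two results:

```python
import pandas as pd

# Hypothetical sample data, not the frame from the answer.
df = pd.DataFrame({
    "col1": ["A", "A", "B", "B"],
    "col2": [1, 1, 2, 2],
    "col3": [0.5, 0.25, 1.0, 2.0],
})

# Solution 1: keep the grouping keys as ordinary columns.
sol1 = df.groupby(["col1", "col2"], as_index=False).sum()

# Solution 2: let the keys become the index, then move them back.
sol2 = df.groupby(["col1", "col2"]).sum().reset_index()

same = sol1.equals(sol2)
print(list(sol1.columns))
print(same)
```

    On this small frame both variants should produce the same columns in the same order, with the grouping keys first.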


    Solution 1:

    As the documentation explains, as_index=False requests SQL-style grouped output: pandas keeps the grouped-by columns as regular columns in the result instead of moving them into the index.

    as_index: bool, default True

    For aggregated output, return object with group labels as the index. Only relevant for DataFrame input. as_index=False is effectively “SQL-style” grouped output.

    Example:

    Given the following DataFrame:

      col1  col2      col3      col4
    0    A     1  0.502130  0.959404
    1    A     3  0.335416  0.087215
    2    B     2  0.067308  0.084595
    3    B     4  0.454158  0.723124
    4    B     4  0.323326  0.895858
    5    C     2  0.672375  0.356736
    6    C     5  0.929655  0.371913
    7    D     5  0.212634  0.540736
    8    D     5  0.471418  0.268270
    9    E     1  0.061270  0.739610
    

    Applying the first solution gives:

    >>> df.groupby(["col1", "col2"], as_index=False).sum()
    
      col1  col2      col3      col4
    0    A     1  0.502130  0.959404
    1    A     3  0.335416  0.087215
    2    B     2  0.067308  0.084595
    3    B     4  0.777483  1.618982
    4    C     2  0.672375  0.356736
    5    C     5  0.929655  0.371913
    6    D     5  0.684052  0.809006
    7    E     1  0.061270  0.739610
    

    The groupby columns are preserved as regular columns.


    Solution 2:

    To understand the second solution, look at the output of the previous command with as_index=True, which is the default behavior of pandas.DataFrame.groupby (check documentation):

    >>> df.groupby(["col1", "col2"], as_index=True).sum()
                   col3      col4
    col1 col2                    
    A    1     0.502130  0.959404
         3     0.335416  0.087215
    B    2     0.067308  0.084595
         4     0.777483  1.618982
    C    2     0.672375  0.356736
         5     0.929655  0.371913
    D    5     0.684052  0.809006
    E    1     0.061270  0.739610
    

    As you can see, the groupby keys become the index of the dataframe. Using pandas.DataFrame.reset_index (check documentation), we can turn the index levels back into columns and fall back to a default integer index, which gives the same result as in the previous step:

    >>> df.groupby(['col1', 'col2']).sum().reset_index()
      col1  col2      col3      col4
    0    A     1  0.502130  0.959404
    1    A     3  0.335416  0.087215
    2    B     2  0.067308  0.084595
    3    B     4  0.777483  1.618982
    4    C     2  0.672375  0.356736
    5    C     5  0.929655  0.371913
    6    D     5  0.684052  0.809006
    7    E     1  0.061270  0.739610
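
    To see what reset_index changes structurally, the following sketch inspects the index before and after the call. The data is a made-up three-row frame, used here only for illustration:

```python
import pandas as pd

# Hypothetical sample data, not the frame from the answer.
df = pd.DataFrame({
    "col1": ["A", "A", "B"],
    "col2": [1, 3, 2],
    "col3": [0.5, 0.3, 0.1],
})

grouped = df.groupby(["col1", "col2"]).sum()

# With the default as_index=True, the keys form a MultiIndex.
is_multi = isinstance(grouped.index, pd.MultiIndex)
index_names = list(grouped.index.names)

# reset_index moves the index levels back into columns
# and installs a default integer index.
flat = grouped.reset_index()
flat_columns = list(flat.columns)

print(is_multi, index_names)
print(flat_columns, type(flat.index).__name__)
```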
    

    Benchmark

    Since the first solution does the job in one step rather than two, it is slightly faster in this measurement, although the gap is small compared with the run-to-run variation on a frame this small:

    %timeit df.groupby(["col1", "col2"], as_index=False).sum()
    3.38 ms ± 21.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
    
    %timeit df.groupby(["col1", "col2"]).sum().reset_index()
    3.9 ms ± 365 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
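
    The %timeit lines above need IPython. Outside a notebook, a rough equivalent can be sketched with the standard-library timeit module. The frame size, random seed, and helper function names below are assumptions chosen for illustration, and the timings will vary by machine and pandas version:

```python
import timeit

import numpy as np
import pandas as pd

# Hypothetical larger frame; size and seed are arbitrary choices.
rng = np.random.default_rng(0)
n = 50_000
df = pd.DataFrame({
    "col1": rng.choice(list("ABCDE"), size=n),
    "col2": rng.integers(1, 6, size=n),
    "col3": rng.random(n),
    "col4": rng.random(n),
})


def with_as_index_false():
    # Solution 1: single step.
    return df.groupby(["col1", "col2"], as_index=False).sum()


def with_reset_index():
    # Solution 2: group, then reset the index.
    return df.groupby(["col1", "col2"]).sum().reset_index()


# Take the best of several repeats to reduce noise.
t1 = min(timeit.repeat(with_as_index_false, number=10, repeat=5))
t2 = min(timeit.repeat(with_reset_index, number=10, repeat=5))

print(f"as_index=False: {t1 / 10 * 1e3:.2f} ms per call")
print(f"reset_index():  {t2 / 10 * 1e3:.2f} ms per call")
```

    Taking the minimum over repeats is one common way to reduce noise; whichever variant comes out ahead, the difference is unlikely to matter next to the cost of the aggregation itself.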
    