Converting a Pandas GroupBy output from Series to DataFrame

广开言路 2020-11-22 09:58

I'm starting with input data like this:

df1 = pandas.DataFrame( { 
    "Name" : ["Alice", "Bob", "Mallory", "Mallory", "Bob" , "Mallory"] , 
    "City" : ["Seattle", "Seattle", "Portland", "Seattle", "Seattle", "Portland"] } )

9 Answers
  • 2020-11-22 10:34

    g1 here is a DataFrame. It has a hierarchical index, though:

    In [19]: type(g1)
    Out[19]: pandas.core.frame.DataFrame
    
    In [20]: g1.index
    Out[20]: 
    MultiIndex([('Alice', 'Seattle'), ('Bob', 'Seattle'), ('Mallory', 'Portland'),
           ('Mallory', 'Seattle')], dtype=object)
    

    Perhaps you want something like this?

    In [21]: g1.add_suffix('_Count').reset_index()
    Out[21]: 
          Name      City  City_Count  Name_Count
    0    Alice   Seattle           1           1
    1      Bob   Seattle           2           2
    2  Mallory  Portland           2           2
    3  Mallory   Seattle           1           1
    

    Or something like:

    In [36]: DataFrame({'count' : df1.groupby( [ "Name", "City"] ).size()}).reset_index()
    Out[36]: 
          Name      City  count
    0    Alice   Seattle      1
    1      Bob   Seattle      2
    2  Mallory  Portland      2
    3  Mallory   Seattle      1
    
  • 2020-11-22 10:38

    I want to slightly change the answer given by Wes, because version 0.16.2 requires as_index=False. If you don't set it, you get an empty dataframe.

    Source:

    Aggregation functions will not return the groups that you are aggregating over if they are named columns, when as_index=True, the default. The grouped columns will be the indices of the returned object.

    Passing as_index=False will return the groups that you are aggregating over, if they are named columns.

    Aggregating functions are ones that reduce the dimension of the returned objects, for example: mean, sum, size, count, std, var, sem, describe, first, last, nth, min, max. This is what happens when you do for example DataFrame.sum() and get back a Series.

    nth can act as a reducer or a filter, see here.

    import pandas as pd
    
    df1 = pd.DataFrame({"Name":["Alice", "Bob", "Mallory", "Mallory", "Bob" , "Mallory"],
                        "City":["Seattle","Seattle","Portland","Seattle","Seattle","Portland"]})
    print(df1)
    #
    #       City     Name
    #0   Seattle    Alice
    #1   Seattle      Bob
    #2  Portland  Mallory
    #3   Seattle  Mallory
    #4   Seattle      Bob
    #5  Portland  Mallory
    #
    g1 = df1.groupby(["Name", "City"], as_index=False).count()
    print(g1)
    #
    #                  City  Name
    #Name    City
    #Alice   Seattle      1     1
    #Bob     Seattle      2     2
    #Mallory Portland     2     2
    #        Seattle      1     1
    #
    

    EDIT:

    In version 0.17.1 and later, you can select a subset of columns before calling count, or call size and pass the name parameter to reset_index:

    print(df1.groupby(["Name", "City"], as_index=False).count())
    #IndexError: list index out of range
    
    print(df1.groupby(["Name", "City"]).count())
    #Empty DataFrame
    #Columns: []
    #Index: [(Alice, Seattle), (Bob, Seattle), (Mallory, Portland), (Mallory, Seattle)]
    
    print(df1.groupby(["Name", "City"])[['Name','City']].count())
    #                  Name  City
    #Name    City                
    #Alice   Seattle      1     1
    #Bob     Seattle      2     2
    #Mallory Portland     2     2
    #        Seattle      1     1
    
    print(df1.groupby(["Name", "City"]).size().reset_index(name='count'))
    #      Name      City  count
    #0    Alice   Seattle      1
    #1      Bob   Seattle      2
    #2  Mallory  Portland      2
    #3  Mallory   Seattle      1
    

    The difference between count and size is that size counts NaN values while count does not.
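
    A minimal sketch of that difference, assuming a made-up Score column with one missing value (the data and the names sizes and counts are illustrative only, not from the question):

```python
import numpy as np
import pandas as pd

# Hypothetical data: one Score is missing (NaN) in the "Bob" group.
df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Bob"],
    "Score": [1.0, np.nan, 3.0],
})

grouped = df.groupby("Name")["Score"]

sizes = grouped.size()    # rows per group, NaN rows included
counts = grouped.count()  # non-NaN values per group

print(sizes)
print(counts)
```

    Here the "Bob" group has two rows but only one non-missing Score, so size and count disagree for that group.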

  • 2020-11-22 10:46

    Maybe I misunderstand the question, but if you want to convert the groupby result back to a dataframe, you can use .to_frame(). I also wanted to reset the index, so I included that step as well.

    Example code (unrelated to the question's data):

    df = df['TIME'].groupby(df['Name']).min()
    df = df.to_frame()
    # After to_frame(), 'Name' is the only index level; 'TIME' is already a column
    df = df.reset_index()
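
    For a runnable version of these steps, here is a small sketch. The rows are invented for illustration, and the names earliest, as_frame, and flat are hypothetical:

```python
import pandas as pd

# Hypothetical data standing in for the 'Name' and 'TIME' columns above.
df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Alice", "Bob"],
    "TIME": [5, 3, 2, 7],
})

earliest = df["TIME"].groupby(df["Name"]).min()  # Series indexed by Name
as_frame = earliest.to_frame()                   # one-column DataFrame, Name is still the index
flat = as_frame.reset_index()                    # Name becomes a regular column

print(type(earliest).__name__)
print(flat)
```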
    
  • 2020-11-22 10:47

    The key is to use the reset_index() method.

    Use:

    import pandas
    
    df1 = pandas.DataFrame( { 
        "Name" : ["Alice", "Bob", "Mallory", "Mallory", "Bob" , "Mallory"] , 
        "City" : ["Seattle", "Seattle", "Portland", "Seattle", "Seattle", "Portland"] } )
    
    g1 = df1.groupby( [ "Name", "City"] ).count().reset_index()
    

    Now you have your new dataframe in g1.
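
    Note that count() only counts columns that are not grouping keys. With just Name and City there may be nothing left to count (another answer here reports an empty DataFrame in 0.17.1). The sketch below assumes an extra column, Visits, which is hypothetical and not part of the question's data:

```python
import pandas as pd

# "Visits" is a hypothetical extra column, not part of the original data.
df1 = pd.DataFrame({
    "Name": ["Alice", "Bob", "Mallory", "Mallory", "Bob", "Mallory"],
    "City": ["Seattle", "Seattle", "Portland", "Seattle", "Seattle", "Portland"],
    "Visits": [3, 1, 4, 1, 5, 9],
})

# count() tallies non-null Visits per (Name, City); reset_index() turns
# the grouping keys back into ordinary columns.
g1 = df1.groupby(["Name", "City"]).count().reset_index()
print(g1)
```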

  • 2020-11-22 10:47

    I aggregated the data by summing Qty per group and stored the result in a dataframe:

    almo_grp_data = pd.DataFrame({'Qty_cnt' :
        almo_slt_models_data.groupby(['orderDate', 'Item', 'State Abv'])['Qty'].sum()}).reset_index()
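
    The same result can probably be reached without wrapping the Series in a dict, by passing name to reset_index. This is only a sketch: the orders frame below is a hypothetical stand-in for almo_slt_models_data, whose real contents are not shown here.

```python
import pandas as pd

# Hypothetical stand-in for almo_slt_models_data.
orders = pd.DataFrame({
    "orderDate": ["2020-01-01", "2020-01-01", "2020-01-02"],
    "Item": ["A", "A", "B"],
    "State Abv": ["WA", "WA", "OR"],
    "Qty": [2, 3, 4],
})

# Sum Qty per group, then name the resulting column while flattening the index.
qty_totals = (
    orders.groupby(["orderDate", "Item", "State Abv"])["Qty"]
    .sum()
    .reset_index(name="Qty_cnt")
)
print(qty_totals)
```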
    
  • 2020-11-22 10:49

    These solutions only partially worked for me because I was doing multiple aggregations on a groupby that I wanted to convert to a dataframe.

    Because I wanted more than the count provided by reset_index(), I wrote a manual method for converting the grouped output into a dataframe. I understand this is not the most pythonic/pandas way of doing it, as it is quite verbose and explicit, but it was all I needed. The steps are:

    1. Use the reset_index() method explained above to start a "scaffolding" dataframe.
    2. Loop through the group pairings in the grouped dataframe and retrieve their indices.
    3. Perform your calculations against the ungrouped dataframe.
    4. Set the values in your new aggregated dataframe.

    df_grouped = df[['Salary Basis', 'Job Title', 'Hourly Rate', 'Male Count', 'Female Count']]
    # as_index is left at its default so that size() below returns a Series, which to_frame() requires
    df_grouped = df_grouped.groupby(['Salary Basis', 'Job Title'])
    
    # Grouped gives us the indices we want for each grouping
    # We cannot convert a groupedby object back to a dataframe, so we need to do it manually
    # Create a new dataframe to work against
    df_aggregated = df_grouped.size().to_frame('Total Count').reset_index()
    df_aggregated['Male Count'] = 0
    df_aggregated['Female Count'] = 0
    df_aggregated['Job Rate'] = 0
    
    def manualAggregations(indices_array):
        temp_df = df.iloc[indices_array]
        return {
            'Male Count': temp_df['Male Count'].sum(),
            'Female Count': temp_df['Female Count'].sum(),
            'Job Rate': temp_df['Hourly Rate'].max()
        }
    
    for name, group in df_grouped:
        ix = df_grouped.indices[name]
        calcDict = manualAggregations(ix)
    
        for key in calcDict:
            #Salary Basis, Job Title
            columns = list(name)
            df_aggregated.loc[(df_aggregated['Salary Basis'] == columns[0]) & 
                              (df_aggregated['Job Title'] == columns[1]), key] = calcDict[key]
    

    If a dictionary isn't your thing, the calculations could be applied inline in the for loop:

        # Select rows and column in a single .loc call to avoid chained assignment
        df_aggregated.loc[(df_aggregated['Salary Basis'] == columns[0]) & 
                          (df_aggregated['Job Title'] == columns[1]), 'Male Count'] = df['Male Count'].iloc[ix].sum()
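
    As a possible alternative to the manual loop, pandas 0.25 and later supports named aggregation through agg, which can compute several aggregations at once. This is a sketch under assumptions, not the method used above: the rows are invented, and only the column names come from this answer.

```python
import pandas as pd

# Hypothetical rows using the column names from the answer above.
df = pd.DataFrame({
    "Salary Basis": ["Hourly", "Hourly", "Salaried"],
    "Job Title": ["Clerk", "Clerk", "Analyst"],
    "Hourly Rate": [15.0, 18.0, 40.0],
    "Male Count": [1, 0, 2],
    "Female Count": [0, 1, 3],
})

# Each entry maps an output column to (input column, aggregation).
# The dict is unpacked with ** because the output names contain spaces.
df_aggregated = (
    df.groupby(["Salary Basis", "Job Title"])
    .agg(**{
        "Total Count": ("Hourly Rate", "size"),
        "Male Count": ("Male Count", "sum"),
        "Female Count": ("Female Count", "sum"),
        "Job Rate": ("Hourly Rate", "max"),
    })
    .reset_index()
)
print(df_aggregated)
```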
    