How to summarize on different groupby combinations?

后端 未结 5 1546
后悔当初
后悔当初 2020-12-04 02:59

I am compiling a table of top-3 crops by county. Some counties have the same crop varieties in the same order. Other counties have the same crop varieties in a different ord

5条回答
  •  有刺的猬
    2020-12-04 03:33

    Method 1:

    Combine the crop columns

    >>> df1['combined_temp'] = df1.apply(lambda x : list([x['Crop1'],
    ...                           x['Crop2'],
    ...                           x['Crop3']]),axis=1)
    >>> df1.head()
           County   Crop1    Crop2    Crop3  Total_pop              combined_temp
    0      Harney   grain   melons   apples       2000    [grain, melons, apples]
    1       Baker  melons    grain   apples       1500    [melons, grain, apples]
    2     Wheeler  melons    grain   apples       3000    [melons, grain, apples]
    3  Hood River  apples   melons    grain       1500    [apples, melons, grain]
    4       Wasco   pears  carrots  raddish       2000  [pears, carrots, raddish]
    

    make it a sorted tuple

    >>> df1['sorted'] = df1.apply(lambda x : tuple(sorted(x['combined_temp'])),axis=1)
    >>> df1.head()
           County   Crop1    Crop2            ...             Total_pop              combined_temp                     sorted
    0      Harney   grain   melons            ...                  2000    [grain, melons, apples]    (apples, grain, melons)
    1       Baker  melons    grain            ...                  1500    [melons, grain, apples]    (apples, grain, melons)
    2     Wheeler  melons    grain            ...                  3000    [melons, grain, apples]    (apples, grain, melons)
    3  Hood River  apples   melons            ...                  1500    [apples, melons, grain]    (apples, grain, melons)
    4       Wasco   pears  carrots            ...                  2000  [pears, carrots, raddish]  (carrots, pears, raddish)
    

    then proceed to your normal group by operation

    >>> df1_grouped = df1.groupby(['sorted'])['Total_pop'].sum().reset_index()
    >>> df1_grouped
                          sorted  Total_pop
    0    (apples, grain, melons)       8000
    1  (carrots, pears, raddish)       9200
    

    Method 2: A shorted version based on the answer by aws-apprentice

    df = df1.copy()
    
    grouping_cols = ['Crop1', 'Crop2', 'Crop3']
    
    df[grouping_cols] = pd.DataFrame(df.loc[:, grouping_cols] \
                                .apply(set, axis=1) \
                                .apply(sorted)            
                                .values \
                                .tolist(), columns=grouping_cols)
    
    >>> df.head()
           County    Crop1  Crop2    Crop3  Total_pop
    0      Harney   apples  grain   melons       2000
    1       Baker   apples  grain   melons       1500
    2     Wheeler   apples  grain   melons       3000
    3  Hood River   apples  grain   melons       1500
    4       Wasco  carrots  pears  raddish       2000
    

    now take group by group by

    >>> df.groupby(grouping_cols).Total_pop.sum()
    Crop1    Crop2  Crop3  
    apples   grain  melons     8000
    carrots  pears  raddish    9200
    Name: Total_pop, dtype: int64
    

    but i personally prefer this answer using numpy

提交回复
热议问题