Pandas percentage of total with groupby

没有蜡笔的小新 2020-11-22 06:41

This is obviously simple, but as a numpy newbie I'm getting stuck.

I have a CSV file that contains 3 columns: the State, the Office ID, and the Sales for that office.

14 answers
  •  情话喂你
    2020-11-22 07:01

    You can divide the whole DataFrame by the per-state totals and keep only the sales column of the result:

    # Setup copied from Paul H's answer
    import numpy as np
    import pandas as pd
    np.random.seed(0)
    df = pd.DataFrame({'state': ['CA', 'WA', 'CO', 'AZ'] * 3,
                       'office_id': list(range(1, 7)) * 2,
                       'sales': [np.random.randint(100000, 999999) for _ in range(12)]})
    # Add a column with each office's sales divided by its state's total sales.
    df['sales_ratio'] = (df / df.groupby(['state']).transform('sum'))['sales']
    
    df
    

    Returns:

        office_id   sales state  sales_ratio
    0           1  405711    CA     0.193319
    1           2  535829    WA     0.347072
    2           3  217952    CO     0.198743
    3           4  252315    AZ     0.192500
    4           5  982371    CA     0.468094
    5           6  459783    WA     0.297815
    6           1  404137    CO     0.368519
    7           2  222579    AZ     0.169814
    8           3  710581    CA     0.338587
    9           4  548242    WA     0.355113
    10          5  474564    CO     0.432739
    11          6  835831    AZ     0.637686
    

    But note that this only works because every column other than state is numeric, so the whole DataFrame can be summed and divided. If office_id holds strings instead, for example, you get an error:

    df.office_id = df.office_id.astype(str)
    df['sales_ratio'] = (df / df.groupby(['state']).transform('sum'))['sales']
    

    TypeError: unsupported operand type(s) for /: 'str' and 'str'
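
One way around that error, sketched here as an assumption rather than as part of the answer above, is to select the sales column before transforming, so the non-numeric columns never take part in the sum or the division. The random data below is generated with `np.random.default_rng`, so the numbers will not match the table above; `state_total` is just an illustrative name.

```python
import numpy as np
import pandas as pd

# Same shape of data as above, but with office_id stored as strings
# (the case that breaks whole-DataFrame division).
rng = np.random.default_rng(0)
df = pd.DataFrame({
    'state': ['CA', 'WA', 'CO', 'AZ'] * 3,
    'office_id': [str(i) for i in list(range(1, 7)) * 2],
    'sales': rng.integers(100000, 999999, size=12),
})

# Select 'sales' before transforming, so only that column is summed;
# transform('sum') broadcasts each state's total back to its rows.
state_total = df.groupby('state')['sales'].transform('sum')
df['sales_ratio'] = df['sales'] / state_total

# Sanity check: the ratios within each state should add up to 1.
print(df.groupby('state')['sales_ratio'].sum())
```

Because only one column is divided, this should keep working no matter what other columns the DataFrame holds.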
