Pandas: Comparing rows within groups

Submitted by 余生颓废 on 2021-02-07 07:26:45

Question


I have a dataframe that is grouped by 'KEY'. Within each group, I need to compare rows to decide whether to keep every row of the group or only one row.

Keep all rows of a group if it contains one row with color 'red', area 12, and shape 'circle' AND another row (within the same group) with color 'green', area 13, and shape 'square'. Otherwise, keep only the row of that group with the largest 'num' value.

import pandas as pd

df = pd.DataFrame({'KEY': ['100000009', '100000009', '100000009', '100000009', '100000009','100000034','100000034', '100000034'],
              'Date1': [20120506, 20120506, 20120507,20120608,20120620,20120206,20120306,20120405],
              'shape': ['circle', 'square', 'circle','circle','circle','circle','circle','circle'],
              'num': [3,4,5,6,7,8,9,10],
              'area': [12, 13, 12,12,12,12,12,12],
              'color': ['red', 'green', 'red','red','red','red','red','red']})


    Date1       KEY        area color   num shape
0   2012-05-06  100000009   12  red     3   circle
1   2012-05-06  100000009   13  green   4   square
2   2012-05-07  100000009   12  red     5   circle
3   2012-06-08  100000009   12  red     6   circle
4   2012-06-20  100000009   12  red     7   circle
5   2012-02-06  100000034   12  red     8   circle
6   2012-03-06  100000034   12  red     9   circle
7   2012-04-05  100000034   12  red     10  circle

Expected result:

    Date1       KEY        area color   num shape
0   2012-05-06  100000009   12  red     3   circle
1   2012-05-06  100000009   13  green   4   square
2   2012-05-07  100000009   12  red     5   circle
3   2012-06-08  100000009   12  red     6   circle
4   2012-06-20  100000009   12  red     7   circle
7   2012-04-05  100000034   12  red     10  circle

I am new to Python, and groupby is throwing me a curveball. This is what I have tried so far:

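# Largest 'num' within each KEY group, aligned to the original rows
maxnum = df.groupby('KEY')['num'].transform('max')
df_max = df.loc[df['num'] == maxnum]

# Boolean row masks for the two patterns (compare columns, not filtered frames)
cond1 = (df['area'] == 12) & (df['color'] == 'red') & (df['shape'] == 'circle')
cond2 = (df['area'] == 13) & (df['color'] == 'green') & (df['shape'] == 'square')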

Answer 1:


Define a custom function called function:

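def function(x):
    # Rows of this group matching the red / 12 / circle pattern
    i = x.query(
        'area == 12 and color == "red" and shape == "circle"'
    )
    # Rows of this group matching the green / 13 / square pattern
    j = x.query(
        'area == 13 and color == "green" and shape == "square"'
    )
    # Both patterns present: keep the whole group. Otherwise keep only
    # the first row holding the group's largest num.
    return x if not (i.empty or j.empty) else x[x.num == x.num.max()].head(1)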

This function checks each group against both conditions. It queries the group for each pattern and uses DataFrame.empty to test whether any rows matched. If both queries return rows, the whole group is returned; otherwise only the first row with the largest num is returned.
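
To see the emptiness test in isolation, here is a minimal sketch on a made-up two-row group. The frame `demo` and the names `red_circle`, `green_square`, `blue_rows`, and `both_present` are assumptions for illustration only; they are not part of the original answer.

```python
import pandas as pd

# Hypothetical two-row group, used only to illustrate the check
demo = pd.DataFrame({
    'area': [12, 13],
    'color': ['red', 'green'],
    'shape': ['circle', 'square'],
    'num': [3, 4],
})

# Each query returns the matching rows (possibly none)
red_circle = demo.query('area == 12 and color == "red" and shape == "circle"')
green_square = demo.query('area == 13 and color == "green" and shape == "square"')
blue_rows = demo.query('color == "blue"')  # matches nothing in this frame

# The group qualifies only when neither result is empty
both_present = not (red_circle.empty or green_square.empty)
```

Here `both_present` is true because each pattern matches one row, while `blue_rows.empty` is true because no row has that color.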

Pass this to groupby + apply:

df.groupby('KEY', group_keys=False).apply(function)


      Date1        KEY  area  color  num   shape
0  20120506  100000009    12    red    3  circle
1  20120506  100000009    13  green    4  square
2  20120507  100000009    12    red    5  circle
3  20120608  100000009    12    red    6  circle
4  20120620  100000009    12    red    7  circle
7  20120405  100000034    12    red   10  circle
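
As an alternative that avoids apply, the same selection can probably be expressed with boolean masks and groupby transforms. This is only a sketch under the assumptions of the question's sample data; the names `is_red`, `is_green`, `keep_group`, `top_rows`, and `result` are hypothetical and do not come from the original answer.

```python
import pandas as pd

df = pd.DataFrame({
    'KEY': ['100000009'] * 5 + ['100000034'] * 3,
    'Date1': [20120506, 20120506, 20120507, 20120608,
              20120620, 20120206, 20120306, 20120405],
    'shape': ['circle', 'square', 'circle', 'circle',
              'circle', 'circle', 'circle', 'circle'],
    'num': [3, 4, 5, 6, 7, 8, 9, 10],
    'area': [12, 13, 12, 12, 12, 12, 12, 12],
    'color': ['red', 'green', 'red', 'red', 'red', 'red', 'red', 'red'],
})

# Row-level masks for the two patterns
is_red = (df['area'] == 12) & (df['color'] == 'red') & (df['shape'] == 'circle')
is_green = (df['area'] == 13) & (df['color'] == 'green') & (df['shape'] == 'square')

# Broadcast "does my group contain such a row?" back onto every row
keep_group = (is_red.groupby(df['KEY']).transform('any')
              & is_green.groupby(df['KEY']).transform('any'))

# Index label of the first row with the largest num in each group
top_rows = df.groupby('KEY')['num'].idxmax()

# Keep whole qualifying groups, plus the top row of every other group
result = df[keep_group | df.index.isin(top_rows)]
```

On the sample data, `result` keeps index labels 0 through 4 and 7, the same rows as the apply-based output above. Like head(1) in the answer, idxmax keeps only the first row when several rows tie for the largest num.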


Source: https://stackoverflow.com/questions/48819644/pandas-comparing-rows-within-groups
