Question
I have a dataframe that is grouped by 'KEY'. I need to compare rows within each group to decide whether to keep every row of the group or just one row.
All rows of a group should be kept when the group contains one row with color 'red', area 12 and shape 'circle' AND another row (within the same group) with color 'green', area 13 and shape 'square'. If that combination does not exist, I want to keep only the row of that group with the largest 'num' value.
import pandas as pd

df = pd.DataFrame({'KEY': ['100000009', '100000009', '100000009', '100000009', '100000009', '100000034', '100000034', '100000034'],
                   'Date1': [20120506, 20120506, 20120507, 20120608, 20120620, 20120206, 20120306, 20120405],
                   'shape': ['circle', 'square', 'circle', 'circle', 'circle', 'circle', 'circle', 'circle'],
                   'num': [3, 4, 5, 6, 7, 8, 9, 10],
                   'area': [12, 13, 12, 12, 12, 12, 12, 12],
                   'color': ['red', 'green', 'red', 'red', 'red', 'red', 'red', 'red']})
Date1 KEY area color num shape
0 2012-05-06 100000009 12 red 3 circle
1 2012-05-06 100000009 13 green 4 square
2 2012-05-07 100000009 12 red 5 circle
3 2012-06-08 100000009 12 red 6 circle
4 2012-06-20 100000009 12 red 7 circle
5 2012-02-06 100000034 12 red 8 circle
6 2012-03-06 100000034 12 red 9 circle
7 2012-04-05 100000034 12 red 10 circle
Expected result:
Date1 KEY area color num shape
0 2012-05-06 100000009 12 red 3 circle
1 2012-05-06 100000009 13 green 4 square
2 2012-05-07 100000009 12 red 5 circle
3 2012-06-08 100000009 12 red 6 circle
4 2012-06-20 100000009 12 red 7 circle
7 2012-04-05 100000034 12 red 10 circle
I am new to Python, and groupby is throwing me a curveball. This is what I have tried so far:
maxnum = df.groupby('KEY')['num'].transform(max)
df = df.loc[df.num == maxnum]
cond1 = (df[df['area'] == 12]) & (df[df['color'] == 'red']) & (df[df['shape'] == 'circle'])
cond2 = (df[df['area'] == 13]) & (df[df['color'] == 'green']) & (df[df['shape'] == 'square'])
Answer 1:
Define a custom function called function:
def function(x):
    i = x.query(
        'area == 12 and color == "red" and shape == "circle"'
    )
    j = x.query(
        'area == 13 and color == "green" and shape == "square"'
    )
    return x if not (i.empty or j.empty) else x[x.num == x.num.max()].head(1)
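This function tests each group against the specified conditions and returns the appropriate rows. In particular, it queries the group once per condition and checks whether either result is empty using df.empty. If both queries match at least one row, the whole group is returned; otherwise only the first row with the largest num is returned.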
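The emptiness check can be looked at in isolation. The following is only a minimal sketch: it assumes a trimmed-down three-row frame with values borrowed from the question, and has_both is a hypothetical helper name that is not part of the original answer.

```python
import pandas as pd

# Trimmed-down sample (assumption: three rows borrowed from the question's data).
df = pd.DataFrame({
    'KEY': ['100000009', '100000009', '100000034'],
    'shape': ['circle', 'square', 'circle'],
    'num': [3, 4, 8],
    'area': [12, 13, 12],
    'color': ['red', 'green', 'red'],
})

def has_both(group):
    """Hypothetical helper: True if the group has a red/12/circle row
    and a green/13/square row."""
    red = group.query('area == 12 and color == "red" and shape == "circle"')
    green = group.query('area == 13 and color == "green" and shape == "square"')
    # .empty is True when a query matched no rows at all.
    return not (red.empty or green.empty)

first = df[df['KEY'] == '100000009']   # contains both kinds of row
second = df[df['KEY'] == '100000034']  # contains only red circles

print(has_both(first))   # True
print(has_both(second))  # False
```

Only the first group satisfies both conditions, so it is the only one that would be kept in full.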
Pass function to groupby + apply:
df.groupby('KEY', group_keys=False).apply(function)
Date1 KEY area color num shape
0 20120506 100000009 12 red 3 circle
1 20120506 100000009 13 green 4 square
2 20120507 100000009 12 red 5 circle
3 20120608 100000009 12 red 6 circle
4 20120620 100000009 12 red 7 circle
7 20120405 100000034 12 red 10 circle
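For larger frames, the same selection can probably be expressed without apply. The sketch below is an alternative, not the method from the answer above: it builds boolean masks, broadcasts them per group with transform('any'), and uses idxmax to find the first row with the largest num in each group. The variable names (is_red, is_green, keep_all, top_rows) are illustrative assumptions.

```python
import pandas as pd

df = pd.DataFrame({
    'KEY': ['100000009'] * 5 + ['100000034'] * 3,
    'Date1': [20120506, 20120506, 20120507, 20120608, 20120620,
              20120206, 20120306, 20120405],
    'shape': ['circle', 'square'] + ['circle'] * 6,
    'num': [3, 4, 5, 6, 7, 8, 9, 10],
    'area': [12, 13, 12, 12, 12, 12, 12, 12],
    'color': ['red', 'green'] + ['red'] * 6,
})

# Row-level masks for the two conditions.
is_red = (df['area'] == 12) & (df['color'] == 'red') & (df['shape'] == 'circle')
is_green = (df['area'] == 13) & (df['color'] == 'green') & (df['shape'] == 'square')

# True on every row of a group that has at least one row of each kind.
keep_all = (is_red.groupby(df['KEY']).transform('any')
            & is_green.groupby(df['KEY']).transform('any'))

# True on the first row holding the largest num within each group.
top_rows = df.index.isin(df.groupby('KEY')['num'].idxmax())

result = df[keep_all | top_rows]
print(result.index.tolist())  # [0, 1, 2, 3, 4, 7]
```

The selected index labels match the output shown above; whether this is faster than apply depends on the number and size of the groups.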
Source: https://stackoverflow.com/questions/48819644/pandas-comparing-rows-within-groups