Question
I am looking for a way to generate a ranking, using the average method for ties, based on multiple columns where one contains strings and the other integers (there could easily be more than 2 columns, but I'm limiting it to 2 for a simpler example).
import pandas as pd
df = pd.DataFrame(data={'String':['a','a','a','a','b','b','c','c','c','c'],'Integer':[1,2,3,3,1,3,6,4,4,4]})
print(df)
  String  Integer
0      a        1
1      a        2
2      a        3
3      a        3
4      b        1
5      b        3
6      c        6
7      c        4
8      c        4
9      c        4
The idea is to create a ranking that orders the rows by String in descending order and by Integer in ascending order, with tied rows sharing the average of their ranks. This would be the output:
   Rank  String  Integer
0     2       c        4
1     2       c        4
2     2       c        4
3     4       c        6
4     5       b        1
5     6       b        3
6     7       a        1
7     8       a        2
8   9.5       a        3
9   9.5       a        3
So far this is what I have managed to do, but I'm having trouble generating the 'average' when a rank is shared.
df['concat_values'] = df['String'] + df['Integer'].astype(str)
df = df.sort_values(['String','Integer'],ascending=[False,True])
df = df.reset_index(drop=True).reset_index()
df['repeated'] = df.groupby('concat_values')['concat_values'].transform('count')
df['pre_rank'] = df['index'] + 1
df = df.sort_values('pre_rank')
df = df.drop('index',axis=1)
print(df)
  String  Integer concat_values  repeated  pre_rank
0      c        4            c4         3         1
1      c        4            c4         3         2
2      c        4            c4         3         3
3      c        6            c6         1         4
4      b        1            b1         1         5
5      b        3            b3         1         6
6      a        1            a1         1         7
7      a        2            a2         1         8
8      a        3            a3         2         9
9      a        3            a3         2        10
I thought of using some filtering or a formula so that, when the 'repeated' column takes a value higher than one, a function is applied to 'pre_rank' that returns the average. But that function can't be generalized to all rows: it works for the first row of a tied group, but it yields a higher value for the second one (because 'pre_rank' is higher there). I believe I am just missing the final step, but I can't work it out. Thanks!
Answer 1:
My method:
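import numpy as np
df = df.sort_values(['String','Integer'], ascending=[False, True])
# Number the sorted rows 1..n, ignoring ties for now
df['rank'] = np.arange(len(df)) + 1
# Rows with the same (String, Integer) pair share the mean of their positions
df['rank'] = df.groupby(['String', 'Integer'])['rank'].transform('mean')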
Output:
  String  Integer  rank
7      c        4   2.0
8      c        4   2.0
9      c        4   2.0
6      c        6   4.0
4      b        1   5.0
5      b        3   6.0
0      a        1   7.0
1      a        2   8.0
2      a        3   9.5
3      a        3   9.5
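Since the question mentions that there could be more than two columns, the same idea can be wrapped in a small helper. This is only a sketch: the function name average_rank_by, its parameters, and the temporary '_position' column are hypothetical and not part of the answer above. It assumes the frame has a unique index and no missing values in the key columns.

```python
import numpy as np
import pandas as pd


def average_rank_by(frame, by, ascending):
    """Rank rows by several columns; tied rows share their mean position.

    Hypothetical helper: `by` is a list of column names and `ascending`
    a list of booleans of the same length, as accepted by sort_values.
    """
    ordered = frame.sort_values(by, ascending=ascending)
    # Positions 1..n in sorted order, ignoring ties for now
    ordered = ordered.assign(_position=np.arange(1, len(ordered) + 1))
    # Rows with identical key values get the mean of their positions
    ranks = ordered.groupby(by)['_position'].transform('mean')
    # Return the ranks aligned with the original row order
    return ranks.reindex(frame.index)


df = pd.DataFrame({
    'String': ['a', 'a', 'a', 'a', 'b', 'b', 'c', 'c', 'c', 'c'],
    'Integer': [1, 2, 3, 3, 1, 3, 6, 4, 4, 4],
})
rank = average_rank_by(df, ['String', 'Integer'], [False, True])
print(rank.tolist())
# [7.0, 8.0, 9.5, 9.5, 5.0, 6.0, 4.0, 2.0, 2.0, 2.0]
```

Because the result is aligned with the original index, it can be assigned straight back with df['rank'] = rank without reordering the frame.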
Answer 2:
sort + ngroup + rank. This requires you to specify sort=False within the groupby so that the ngroup labels are generated in the order you sorted.
df = df.sort_values(['String', 'Integer'], ascending=[False, True])
df['rank'] = df.groupby(['String', 'Integer'], sort=False).ngroup().rank()
  String  Integer  rank
7      c        4   2.0
8      c        4   2.0
9      c        4   2.0
6      c        6   4.0
4      b        1   5.0
5      b        3   6.0
0      a        1   7.0
1      a        2   8.0
2      a        3   9.5
3      a        3   9.5
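To see why sort=False matters here, the sketch below compares the two settings on the same sorted frame. The variable names (ordered, keys, rank_kept_order, rank_default) are made up for illustration. With the default sort=True, groupby numbers the groups by their sorted key values (a before c), so the descending order chosen for String is lost.

```python
import pandas as pd

df = pd.DataFrame({
    'String': ['a', 'a', 'a', 'a', 'b', 'b', 'c', 'c', 'c', 'c'],
    'Integer': [1, 2, 3, 3, 1, 3, 6, 4, 4, 4],
})
keys = ['String', 'Integer']
ordered = df.sort_values(keys, ascending=[False, True])

# sort=False: groups are numbered in the order they appear in `ordered`,
# so rank() (method='average' by default) gives the wanted ranking.
rank_kept_order = ordered.groupby(keys, sort=False).ngroup().rank()

# Default sort=True: groups are numbered by ascending key values instead,
# which discards the descending order requested for 'String'.
rank_default = ordered.groupby(keys).ngroup().rank()

print(rank_kept_order.tolist())
# [2.0, 2.0, 2.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.5, 9.5]
print(rank_default.tolist())
# [8.0, 8.0, 8.0, 10.0, 5.0, 6.0, 1.0, 2.0, 3.5, 3.5]
```

The trick works because every row in a tied group receives the same group number, and rank() then assigns those equal numbers the average of the positions they occupy.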
Source: https://stackoverflow.com/questions/58136881/compute-rank-average-for-multiple-columns-manually