I have a table in pandas:
import pandas as pd
df = pd.DataFrame({
\'LeafID\':[1,1,2,1,3,3,1,6,3,5,1],
\'pidx\':[10,10,300,10,30,40,20,10,30,45,20],
First of all, your output shows you don't want to do a groupby. Read up on what groupby does. What you need is:
df2 = df[df['pidx']<=20]
df2.sort_index(by = 'pidx')
This will give you your exact result. Read up on pandas indexing and functions. In fact go and read the whole introduction on pandas. It will not take much time.
Row operations are also simple using indexing:
df2['final_score']= 0.4*df2['count'] + 0.6*df2['score']
pandas
df[df.groupby('pidx').pidx.transform('count') > 2]
LeafID count pidx pidy score
0 1 10 10 20 10
1 1 20 10 20 10
3 1 40 10 20 22
7 6 50 10 43 5
This is a straightforward application of filter after doing a groupby. In the data you provided, a value of 20 for pidx only occurred twice so it was filtered out.
df.groupby('pidx').filter(lambda x: len(x) > 2)
LeafID count pidx pidy
0 1 10 10 20
1 1 20 10 20
3 1 40 10 20
7 6 50 10 43
You can use value_counts with boolean indexing and isin:
df = pd.DataFrame({
'LeafID':[1,1,2,1,3,3,1,6,3,5,1],
'pidx':[10,10,300,10,30,40,20,10,30,45,20],
'pidy':[20,20,400,20,15,20,12,43,54,112,23],
'count':[10,20,30,40,80,10,20,50,30,10,70],
'score':[10,10,10,22,22,3,4,5,9,0,1]
})
print (df)
LeafID count pidx pidy score
0 1 10 10 20 10
1 1 20 10 20 10
2 2 30 300 400 10
3 1 40 10 20 22
4 3 80 30 15 22
5 3 10 40 20 3
6 1 20 20 12 4
7 6 50 10 43 5
8 3 30 30 54 9
9 5 10 45 112 0
10 1 70 20 23 1
s = df.pidx.value_counts()
idx = s[s>2].index
print (df[df.pidx.isin(idx)])
LeafID count pidx pidy score
0 1 10 10 20 10
1 1 20 10 20 10
3 1 40 10 20 22
7 6 50 10 43 5
Timings:
np.random.seed(123)
N = 1000000
L1 = list('abcdefghijklmnopqrstu')
L2 = list('efghijklmnopqrstuvwxyz')
df = pd.DataFrame({'LeafId':np.random.randint(1000, size=N),
'pidx': np.random.randint(10000, size=N),
'pidy': np.random.choice(L2, N),
'count':np.random.randint(1000, size=N)})
print (df)
print (df.groupby('pidx').filter(lambda x: len(x) > 120))
def jez(df):
s = df.pidx.value_counts()
return df[df.pidx.isin(s[s>120].index)]
print (jez(df))
In [55]: %timeit (df.groupby('pidx').filter(lambda x: len(x) > 120))
1 loop, best of 3: 1.17 s per loop
In [56]: %timeit (jez(df))
10 loops, best of 3: 141 ms per loop
In [62]: %timeit (df[df.groupby('pidx').pidx.transform('size') > 120])
10 loops, best of 3: 102 ms per loop
In [63]: %timeit (df[df.groupby('pidx').pidx.transform(len) > 120])
1 loop, best of 3: 685 ms per loop
In [64]: %timeit (df[df.groupby('pidx').pidx.transform('count') > 120])
10 loops, best of 3: 104 ms per loop
For final_score
you can use:
df['final_score'] = df['count'].mul(.4).add(df.score.mul(.6))