Hi, I would like to know the best way to do operations on columns in Python using pandas.
I have a classical database which I have loaded as a dataframe, and I often have
simplest according to me.
from random import randrange
import pandas as pd
import numpy as np
df = pd.DataFrame({'a':randrange(0,10),'b':randrange(10,20),'c':np.random.randn(10)})
# If c > 0.5, then c = b - a
df.loc[df['c'] > 0.5, 'c'] = df['b'] - df['a']
Tested, it works.
a b c
2 11 -0.576309
2 11 -0.578449
2 11 -1.085822
2 11 9.000000
2 11 9.000000
2 11 -1.081405
Start with:
df = pd.DataFrame({'a':randrange(1,10),'b':randrange(10,20),'c':np.random.randn(10)})
a b c
0 7 12 0.475248
1 7 12 -1.090855
2 7 12 -1.227489
3 7 12 0.163929
End with:
df.loc[df['c'] < 1, 'c'] = df['b'] - df['a']; df
a b c
0 7 12 5.000000
1 7 12 5.000000
2 7 12 5.000000
3 7 12 5.000000
4 7 12 1.813233
You can just use a boolean mask with the .loc indexer of the DataFrame. (The older .ix indexer did the same job, but it was deprecated in pandas 0.20 and removed in 1.0.)
mask = df['A'] > 2
df.loc[mask, 'A'] = df.loc[mask, 'C'] - df.loc[mask, 'D']
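As a self-contained illustration of the mask pattern above, here is a minimal sketch. The data values are made up for the example; the column names A, C and D simply follow the snippet above.

```python
import pandas as pd

# Hypothetical data; only the column names come from the answer above.
df = pd.DataFrame({
    "A": [1, 3, 5, 2],
    "C": [10, 20, 30, 40],
    "D": [1, 2, 3, 4],
})

# Boolean Series: True on the rows where the condition holds.
mask = df["A"] > 2

# Assign only on the masked rows, in a single indexing step.
df.loc[mask, "A"] = df.loc[mask, "C"] - df.loc[mask, "D"]

print(df)
```

Doing the selection and the assignment in one .loc call avoids chained indexing such as df['A'][mask] = ..., which may end up writing to a temporary copy instead of the original frame.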
If you have a lot of branching logic, you can put it in a function and apply it row by row:
def func(row):
    if row['A'] > 0:
        return row['B'] + row['C']
    elif row['B'] < 0:
        return row['D'] + row['A']
    else:
        return row['A']

# axis=1 passes each row to func as a Series
df['A'] = df.apply(func, axis=1)
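apply is usually faster than an explicit for loop over the rows, although it still calls the function once per row, so a boolean-mask assignment will generally be quicker whenever the logic can be expressed that way.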
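If the branches only depend on column comparisons, one possible alternative (not part of the answer above) is numpy.select, which evaluates every branch on whole columns at once. The sketch below uses made-up data and checks that it agrees with the row-wise function; treat it as an illustration rather than a drop-in replacement.

```python
import numpy as np
import pandas as pd

# Hypothetical data chosen so that every branch is exercised.
df = pd.DataFrame({
    "A": [1, -1, -2, 0],
    "B": [10, -5, 3, 4],
    "C": [100, 200, 300, 400],
    "D": [7, 8, 9, 6],
})

def func(row):
    if row["A"] > 0:
        return row["B"] + row["C"]
    elif row["B"] < 0:
        return row["D"] + row["A"]
    else:
        return row["A"]

# Row-wise version: func is called once per row.
via_apply = df.apply(func, axis=1)

# Column-wise version: conditions are checked in order and the
# first one that is True decides which choice is used for that row.
conditions = [df["A"] > 0, df["B"] < 0]
choices = [df["B"] + df["C"], df["D"] + df["A"]]
via_select = np.select(conditions, choices, default=df["A"])

print(via_apply.tolist())
print(via_select.tolist())
```

The order of the conditions list matters in the same way as the order of the if/elif branches, so it should mirror the function.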
There are lots of ways of doing this, but here's the pattern I find easiest to read.
# Assume df is a pandas DataFrame and x is a threshold value
idx = df.loc[:, 'A'] > x
df.loc[idx, 'A'] = df.loc[idx, 'C'] - df.loc[idx, 'D']
Setting the remaining elements (those not greater than x) to 0 is as easy as df.loc[~idx, 'A'] = 0
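Put together, the pattern could look like the sketch below. The data and the threshold x = 3 are invented for the example; only the column names and the idx / ~idx idiom come from the lines above.

```python
import pandas as pd

# Hypothetical data and threshold, for illustration only.
df = pd.DataFrame({
    "A": [1, 5, 3, 8],
    "C": [10, 20, 30, 40],
    "D": [1, 2, 3, 4],
})
x = 3

# Compute the mask once, before A is modified.
idx = df.loc[:, "A"] > x

# Rows where A > x get C - D ...
df.loc[idx, "A"] = df.loc[idx, "C"] - df.loc[idx, "D"]

# ... and every other row (A <= x, including A == x) gets 0.
df.loc[~idx, "A"] = 0

print(df)
```

Because idx is computed before any assignment, the second assignment is not affected by the new values written by the first one.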