I've got a dataframe containing country names & their percentage of energy output. I need to add a new column that assigns a 1 or 0, based on whether the country's energy
@Vaishali explains why pd.DataFrame.where didn't work as you expected and suggests using np.where instead, which is very good advice.
I'll offer up that you could have simply converted your boolean result to integers.
Setup
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'name': ['china', 'america', 'canada'],
    'output': [33.2, 15.0, 5.0]
})
Option 1
df['newcol'] = (df['output'] > df['output'].median()).astype(int)
Option 2
Or, faster yet, use the underlying NumPy arrays:
o = df['output'].values
df['newcol'] = (o > np.median(o)).astype(int)
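To check that the two options agree, here is a minimal sketch using the same sample frame as in Setup. The names opt1, opt2 and same are made up for this illustration only.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'name': ['china', 'america', 'canada'],
    'output': [33.2, 15.0, 5.0],
})

# Option 1: compare the Series with its own median, then cast bool -> int
opt1 = (df['output'] > df['output'].median()).astype(int)

# Option 2: the same comparison on the underlying NumPy array
o = df['output'].values
opt2 = (o > np.median(o)).astype(int)

# Hypothetical helper variable: True when both options give identical flags
same = bool((opt1.values == opt2).all())
print(opt1.tolist(), opt2.tolist(), same)
```

One caveat if your data has missing values: Series.median skips NaN by default, while np.median returns nan when any NaN is present, so the two options can diverge on such data.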
You don't need a loop, since the solution is vectorized:
df['newcol'] = np.where((df['output'] > df['output'].median()), 1, 0)
      name  output  newcol
0    china    33.2       1
1  america    15.0       0
2   canada     5.0       0
As for the error "Wrong number of items passed": df.where works a little differently from np.where. It returns an object of the same shape as self, whose entries are taken from self where cond is True and from other elsewhere. So in your case it returns a DataFrame with two columns instead of a Series, and when you try to assign that DataFrame to a single column, you get the error message.
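The shape difference can be seen in a small sketch. It assumes the same three-row sample frame; the names cond, mask, kept and flags are invented for the illustration, and the mask is built explicitly with the same shape as df rather than reproducing the exact call from the question.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'name': ['china', 'america', 'canada'],
    'output': [33.2, 15.0, 5.0],
})
cond = df['output'] > df['output'].median()

# Illustrative boolean mask with the same shape as df (hypothetical name)
mask = pd.DataFrame({'name': cond, 'output': cond})

# DataFrame.where keeps df's shape: original values where the mask is True,
# `other` (NaN by default) everywhere else
kept = df.where(mask)
print(kept.shape)   # one entry per cell of df, i.e. two columns

# np.where picks between the two supplied values and returns a 1-D array,
# which fits into a single column
flags = np.where(cond, 1, 0)
print(flags.shape)
```

Because kept has as many columns as df, assigning it to df['newcol'] cannot work, whereas flags has one value per row.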