Can someone explain to me why these two statements (the for loop and the comprehension) would return two different answers? I thought they were the same, just different ways of
You are setting the whole column (vector) in each iteration step:
Top152['HighRenew'] = 1
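To make the effect of that line concrete, here is a minimal sketch. It assumes the loop in the question assigned to the whole column as shown above; the three-row frame `df` and its values are made up for illustration and are not the original data.

```python
import pandas as pd

# Hypothetical three-row frame standing in for Top152 (values are invented).
df = pd.DataFrame({'% Renewable': [10.0, 20.0, 30.0]}, index=['A', 'B', 'C'])
median = df['% Renewable'].median()

# Whole-column assignment: every iteration overwrites all rows,
# so only the result of the last iteration survives.
for value in df['% Renewable']:
    if value >= median:
        df['HighRenew'] = 1
    else:
        df['HighRenew'] = 0
buggy = df['HighRenew'].tolist()

# Cell-wise assignment: .loc[idx, col] changes a single row per iteration.
for idx in df.index:
    df.loc[idx, 'HighRenew'] = 1 if df.loc[idx, '% Renewable'] >= median else 0
fixed = df['HighRenew'].astype(int).tolist()

print(buggy)
print(fixed)
```

In the first loop the last value (30.0) is above the median, so every row ends up as 1; the second loop gives each row its own flag.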
Try this vectorized approach instead:
Top152['HighRenew'] = (Top152['% Renewable'] >= Top152['% Renewable'].median()).astype(int)
so your function may be implemented as follows:
def answer_ten():
    return (Top15['% Renewable'] >= Top15['% Renewable'].median()).astype(int)
It is better to convert the boolean mask to int, because pandas works fastest with vectorized functions:
print (Top152['% Renewable']> Top152['% Renewable'].median())
China True
United States False
Japan False
United Kingdom False
Russian Federation True
Canada True
Germany True
India False
France False
South Korea False
Italy True
Spain True
Iran False
Australia False
Brazil True
Name: % Renewable, dtype: bool
def answer_ten():
    return (Top152['% Renewable'] > Top152['% Renewable'].median()).astype(int).rename('HighRenew')
print (answer_ten())
China 1
United States 0
Japan 0
United Kingdom 0
Russian Federation 1
Canada 1
Germany 1
India 0
France 0
South Korea 0
Italy 1
Spain 1
Iran 0
Australia 0
Brazil 1
Name: HighRenew, dtype: int32
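A for loop with iterrows is also possible, but it is very slow; the first (vectorized) solution is faster. Note that this version compares with >=, so the row that holds the median itself (France) gets 1, whereas the > comparison above gives it 0:
def answer_ten():
    for idx, x in Top152.iterrows():
        if Top152.loc[idx, '% Renewable'] >= Top152['% Renewable'].median():
            Top152.loc[idx, 'HighRenew'] = 1
        else:
            Top152.loc[idx, 'HighRenew'] = 0
    return Top152['HighRenew'].astype(int)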
print (answer_ten())
China 1
United States 0
Japan 0
United Kingdom 0
Russian Federation 1
Canada 1
Germany 1
India 0
France 1
South Korea 0
Italy 1
Spain 1
Iran 0
Australia 0
Brazil 1
Name: HighRenew, dtype: int32
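The France row differs between the two outputs above (0 in the first, 1 in the second). A small sketch of why, assuming the only difference is `>` versus `>=`: with an odd number of rows the median is one row's own value, so that row flips between the two comparisons. The series `s` and its labels are hypothetical.

```python
import pandas as pd

# Hypothetical values; what matters is the odd row count,
# which makes the median equal to the middle row's own value.
s = pd.Series([5.0, 17.0, 70.0], index=['low', 'middle', 'high'], name='% Renewable')
median = s.median()

# Strict comparison: the median row itself is not "above" the median.
strict = (s > median).astype(int).rename('HighRenew')

# Inclusive comparison: the median row counts as high.
inclusive = (s >= median).astype(int).rename('HighRenew')

print(strict)
print(inclusive)
```

Which comparison is the right one depends on how the exercise words the threshold ("above" versus "at or above" the median).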
Timings:
#[15000 rows x 1 columns]
Top152 = pd.concat([Top152]*1000).reset_index(drop=True)
def answer_ten1():
    return (Top152['% Renewable']> Top152['% Renewable'].median()).astype(int).rename('HighRenew')
def answer_ten2():
    for idx, x in Top152.iterrows():
        if Top152.loc[idx, '% Renewable'] >= Top152['% Renewable'].median():
            Top152.loc[idx, 'HighRenew'] = 1
        else:
            Top152.loc[idx, 'HighRenew'] = 0
    return Top152['HighRenew'].astype(int)
def answer_ten3():
    Top152['HighRenew'] = [1 if x >= Top152['% Renewable'].median() else 0 for x in Top152['% Renewable']]
    return Top152['HighRenew']
print (answer_ten1())
print (answer_ten2())
print (answer_ten3())
In [169]: %timeit (answer_ten1())
1000 loops, best of 3: 528 µs per loop
In [170]: %timeit answer_ten2()
1 loop, best of 3: 16 s per loop
In [171]: %timeit (answer_ten3())
1 loop, best of 3: 2.67 s per loop
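Part of the cost of the loop and comprehension versions above likely comes from calling `.median()` once per row. The sketch below is not from the original answer: it uses a made-up random frame and hypothetical function names to show that computing the median once gives the same flags, so the comprehension can be made cheaper without changing its result. No timings are claimed here; they would need to be measured with `%timeit` on your own data.

```python
import numpy as np
import pandas as pd

# Hypothetical data: 1001 random percentages (not the original Top152 frame).
rng = np.random.default_rng(0)
df = pd.DataFrame({'% Renewable': rng.uniform(0, 100, size=1001)})

def comprehension_recompute(frame):
    # Recomputes the median for every element, as answer_ten3 does.
    return [1 if x >= frame['% Renewable'].median() else 0
            for x in frame['% Renewable']]

def comprehension_hoisted(frame):
    # Computes the median once, then reuses it inside the comprehension.
    median = frame['% Renewable'].median()
    return [1 if x >= median else 0 for x in frame['% Renewable']]

def vectorized(frame):
    # Single vectorized comparison over the whole column.
    return (frame['% Renewable'] >= frame['% Renewable'].median()).astype(int)

a = comprehension_recompute(df)
b = comprehension_hoisted(df)
c = vectorized(df).tolist()
print(a == b == c)
```

All three return identical flags; the vectorized form remains the one to prefer, as the timings above suggest.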
In the second approach you are editing your vector, while the for loop saves it (in the background) to avoid the unwanted edits.