I am trying to iterate over the rows of a Python Pandas dataframe. Within each row of the dataframe, I am trying to to refer to each value along a row by its column name.
If you really have to iterate a Pandas dataframe, you will probably want to avoid using iterrows(). There are different methods and the usual iterrows()
is far from being the best. itertuples() can be 100 times faster.
In short:
df.itertuples(name=None)
. In particular, when you have a fixed number columns and less than 255 columns. See point (3)df.itertuples()
except if your columns have special characters such as spaces or '-'. See point (2)itertuples()
even if your dataframe has strange columns by using the last example. See point (4)iterrows()
if you cannot the previous solutions. See point (1)Generate a random dataframe with a million rows and 4 columns:
df = pd.DataFrame(np.random.randint(0, 100, size=(1000000, 4)), columns=list('ABCD'))
print(df)
1) The usual iterrows()
is convenient, but damn slow:
start_time = time.clock()
result = 0
for _, row in df.iterrows():
result += max(row['B'], row['C'])
total_elapsed_time = round(time.clock() - start_time, 2)
print("1. Iterrows done in {} seconds, result = {}".format(total_elapsed_time, result))
2) The default itertuples()
is already much faster, but it doesn't work with column names such as My Col-Name is very Strange
(you should avoid this method if your columns are repeated or if a column name cannot be simply converted to a Python variable name).:
start_time = time.clock()
result = 0
for row in df.itertuples(index=False):
result += max(row.B, row.C)
total_elapsed_time = round(time.clock() - start_time, 2)
print("2. Named Itertuples done in {} seconds, result = {}".format(total_elapsed_time, result))
3) The default itertuples()
using name=None is even faster but not really convenient as you have to define a variable per column.
start_time = time.clock()
result = 0
for(_, col1, col2, col3, col4) in df.itertuples(name=None):
result += max(col2, col3)
total_elapsed_time = round(time.clock() - start_time, 2)
print("3. Itertuples done in {} seconds, result = {}".format(total_elapsed_time, result))
4) Finally, the named itertuples()
is slower than the previous point, but you do not have to define a variable per column and it works with column names such as My Col-Name is very Strange
.
start_time = time.clock()
result = 0
for row in df.itertuples(index=False):
result += max(row[df.columns.get_loc('B')], row[df.columns.get_loc('C')])
total_elapsed_time = round(time.clock() - start_time, 2)
print("4. Polyvalent Itertuples working even with special characters in the column name done in {} seconds, result = {}".format(total_elapsed_time, result))
Output:
A B C D
0 41 63 42 23
1 54 9 24 65
2 15 34 10 9
3 39 94 82 97
4 4 88 79 54
... .. .. .. ..
999995 48 27 4 25
999996 16 51 34 28
999997 1 39 61 14
999998 66 51 27 70
999999 51 53 47 99
[1000000 rows x 4 columns]
1. Iterrows done in 104.96 seconds, result = 66151519
2. Named Itertuples done in 1.26 seconds, result = 66151519
3. Itertuples done in 0.94 seconds, result = 66151519
4. Polyvalent Itertuples working even with special characters in the column name done in 2.94 seconds, result = 66151519
This article is a very interesting comparison between iterrows and itertuples
The item from iterrows()
is not a Series, but a tuple of (index, Series), so you can unpack the tuple in the for loop like so:
for (idx, row) in df.iterrows():
print(row.loc['A'])
print(row.A)
print(row.index)
#0.890618586836
#0.890618586836
#Index(['A', 'B', 'C', 'D'], dtype='object')
I also like itertuples()
for row in df.itertuples():
print(row.A)
print(row.Index)
since row is a named tuples, if you meant to access values on each row this should be MUCH faster
speed run :
df = pd.DataFrame([x for x in range(1000*1000)], columns=['A'])
st=time.time()
for index, row in df.iterrows():
row.A
print(time.time()-st)
45.05799984931946
st=time.time()
for row in df.itertuples():
row.A
print(time.time() - st)
0.48400020599365234
for i in range(1,len(na_rm.columns)):
print ("column name:", na_rm.columns[i])
Output :
column name: seretide_price
column name: symbicort_mkt_shr
column name: symbicort_price