Does pandas iterrows have performance issues?

名媛妹妹 2020-11-21 07:04

I have noticed very poor performance when using iterrows from pandas.

Is this something that is experienced by others? Is it specific to iterrows and should this fun

6 Answers
  • 2020-11-21 07:44

    Here's the way to do your problem. This is all vectorized.

    In [58]: df = table1.merge(table2,on='letter')
    
    In [59]: df['calc'] = df['number1']*df['number2']
    
    In [60]: df
    Out[60]: 
      letter  number1  number2  calc
    0      a       50      0.2    10
    1      a       50      0.5    25
    2      b      -10      0.1    -1
    3      b      -10      0.4    -4
    
    In [61]: df.groupby('letter')['calc'].max()
    Out[61]: 
    letter
    a         25
    b         -1
    Name: calc, dtype: float64
    
    In [62]: df.groupby('letter')['calc'].idxmax()
    Out[62]: 
    letter
    a         1
    b         2
    Name: calc, dtype: int64
    
    In [63]: df.loc[df.groupby('letter')['calc'].idxmax()]
    Out[63]: 
      letter  number1  number2  calc
    1      a       50      0.5    25
    2      b      -10      0.1    -1
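
    For readers who want to run the session above end to end, here is a minimal self-contained sketch. The two input frames are an assumption: they are reconstructed from the Out[60] listing, not taken from the original question, and the final column selection into table3 is only illustrative.

```python
import pandas as pd

# Assumed inputs, reconstructed from the merged output shown in Out[60].
table1 = pd.DataFrame({"letter": ["a", "b"], "number1": [50, -10]})
table2 = pd.DataFrame({"letter": ["a", "a", "b", "b"],
                       "number2": [0.2, 0.5, 0.1, 0.4]})

# One merge and one column-wise multiplication replace the nested row loops.
df = table1.merge(table2, on="letter")
df["calc"] = df["number1"] * df["number2"]

# idxmax returns, per letter, the index label of the row with the largest calc.
best_rows = df.loc[df.groupby("letter")["calc"].idxmax()]

# Keep only the columns of interest (illustrative choice).
table3 = best_rows[["letter", "number2"]].reset_index(drop=True)
print(table3)
```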
    
  • 2020-11-21 07:47

    Details in this video

    Benchmark

  • 2020-11-21 07:52

    Vector operations in NumPy and pandas are much faster than scalar operations in vanilla Python for several reasons:

    • Amortized type lookup: Python is a dynamically typed language, so there is runtime overhead for each element in an array. However, NumPy (and thus pandas) performs calculations in C (often via Cython). The type of the array is determined only once, at the start of the iteration; this saving alone is one of the biggest wins.

    • Better caching: Iterating over a C array is cache-friendly and thus very fast. A pandas DataFrame is a "column-oriented table", which means that each column is really just an array. So the native actions you can perform on a DataFrame (like summing all the elements in a column) are going to have few cache misses.

    • More opportunities for parallelism: A simple C array can be operated on via SIMD instructions. Some parts of NumPy enable SIMD, depending on your CPU and installation process. The benefits of parallelism won't be as dramatic as those of static typing and better caching, but they're still a solid win.

    Moral of the story: use the vector operations in NumPy and pandas. They are faster than scalar operations in Python for the simple reason that these operations are exactly what a C programmer would have written by hand anyway. (Except that the array notation is much easier to read than explicit loops with embedded SIMD instructions.)
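
    As a rough illustration of the scalar versus vector distinction, the sketch below computes the same sum of squares twice. The function names and the array size are made up for this example; timing it with %timeit on your own machine is the only reliable way to see the gap.

```python
import numpy as np

def scalar_sum_of_squares(values):
    """Pure-Python loop: each element is handled as a separate Python object."""
    total = 0.0
    for x in values:
        total += x * x
    return total

def vectorized_sum_of_squares(values):
    """Single NumPy call: the loop runs in compiled code over the raw buffer."""
    return float(np.dot(values, values))

values = np.arange(100_000, dtype=np.float64)
loop_result = scalar_sum_of_squares(values)
vector_result = vectorized_sum_of_squares(values)

# Both approaches agree on the answer; only the execution path differs.
print(loop_result, vector_result)
```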

  • 2020-11-21 07:54

    Generally, iterrows should only be used in very, very specific cases. This is the general order of precedence for performance of various operations:

    1) vectorization
    2) using a custom cython routine
    3) apply
        a) reductions that can be performed in cython
        b) iteration in python space
    4) itertuples
    5) iterrows
    6) updating an empty frame (e.g. using loc one-row-at-a-time)
    

    Using a custom Cython routine is usually too complicated, so let's skip that for now.

    1) Vectorization is ALWAYS, ALWAYS the first and best choice. However, there is a small set of cases (usually involving a recurrence) which cannot be vectorized in obvious ways. Furthermore, on a smallish DataFrame, it may be faster to use other methods.

    3) apply can usually be handled by an iterator in Cython space. This is handled internally by pandas, though it depends on what is going on inside the apply expression. For example, df.apply(lambda x: np.sum(x)) will be executed pretty swiftly, though of course the built-in reduction df.sum() is even better. However, something like df.apply(lambda x: x['b'] + 1, axis=1) will be executed in Python space, and consequently is much slower.

    4) itertuples does not box the data into a Series. It just returns the data in the form of tuples.

    5) iterrows DOES box the data into a Series. Unless you really need this, use another method.

    6) Updating an empty frame a single row at a time. I have seen this method used WAY too much. It is by far the slowest. It is probably commonplace (and reasonably fast for some Python structures), but a DataFrame does a fair number of checks on indexing, so updating a row at a time will always be very slow. It is much better to create new structures and concat.
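
    To make point 6 concrete, here is a small sketch contrasting the two patterns. The function names, column names, and row contents are hypothetical; the point is only that the second version touches the DataFrame once.

```python
import pandas as pd

def build_row_at_a_time(n):
    """Slow pattern: enlarge an empty frame with .loc, one row per iteration."""
    frame = pd.DataFrame(columns=["i", "square"])
    for i in range(n):
        frame.loc[i] = [i, i * i]  # each assignment re-checks and grows the frame
    return frame

def build_once(n):
    """Preferred pattern: collect plain Python rows, then construct the frame once."""
    rows = [(i, i * i) for i in range(n)]
    return pd.DataFrame(rows, columns=["i", "square"])

slow = build_row_at_a_time(50)
fast = build_once(50)

# Same values either way; only the construction cost differs.
print(slow["square"].tolist() == fast["square"].tolist())
```

    When the pieces are already DataFrames rather than plain rows, collecting them in a list and calling pd.concat once follows the same idea.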

  • 2020-11-21 07:58

    Another option is to use to_records(), which is faster than both itertuples and iterrows.
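
    For anyone who has not used it, this is roughly what iterating over to_records() looks like. The small frame is an assumption made up for the example; the index is included as the first field by default.

```python
import pandas as pd

# Hypothetical two-row frame, used only to show the record layout.
table1 = pd.DataFrame({"letter": ["a", "b"], "number1": [50, -10]})

# to_records() returns a NumPy record array; the index becomes the first field.
records = table1.to_records()
print(records.dtype.names)

# Each record unpacks like a tuple: (index, letter, number1).
unpacked = []
for index, letter, number1 in records:
    unpacked.append((int(index), letter, int(number1)))
print(unpacked)

# Whole columns are also available as plain NumPy arrays via attribute access.
print(records.number1)
```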

    But for your case, there is much room for other types of improvements.

    Here's my final optimized version

    def iterthrough():
        ret = []
        grouped = table2.groupby('letter', sort=False)
        t2info = table2.to_records()
        for index, letter, n1 in table1.to_records():
            t2 = t2info[grouped.groups[letter].values]
            # np.multiply is the ufunc behind "x * y" for NumPy arrays; any speed difference is small
            maxrow = np.multiply(t2.number2, n1).argmax()
            # `[1:]`  removes the index column
            ret.append(t2[maxrow].tolist()[1:])
        global table3
        table3 = pd.DataFrame(ret, columns=('letter', 'number2'))
    

    Benchmark test:

    -- iterrows() --
    100 loops, best of 3: 12.7 ms per loop
      letter  number2
    0      a      0.5
    1      b      0.1
    2      c      5.0
    3      d      4.0
    
    -- itertuples() --
    100 loops, best of 3: 12.3 ms per loop
    
    -- to_records() --
    100 loops, best of 3: 7.29 ms per loop
    
    -- Use group by --
    100 loops, best of 3: 4.07 ms per loop
      letter  number2
    1      a      0.5
    2      b      0.1
    4      c      5.0
    5      d      4.0
    
    -- Avoid multiplication --
    1000 loops, best of 3: 1.39 ms per loop
      letter  number2
    0      a      0.5
    1      b      0.1
    2      c      5.0
    3      d      4.0
    

    Full code:

    import pandas as pd
    import numpy as np
    
    #%% Create the original tables
    t1 = {'letter':['a','b','c','d'],
          'number1':[50,-10,.5,3]}
    
    t2 = {'letter':['a','a','b','b','c','d','c'],
          'number2':[0.2,0.5,0.1,0.4,5,4,1]}
    
    table1 = pd.DataFrame(t1)
    table2 = pd.DataFrame(t2)
    
    #%% Create the body of the new table
    table3 = pd.DataFrame(np.nan, columns=['letter','number2'], index=table1.index)
    
    
    print('\n-- iterrows() --')
    
    def optimize(t2info, t1info):
        calculation = []
        for index, r in t2info.iterrows():
            calculation.append(r['number2'] * t1info)
        maxrow_in_t2 = calculation.index(max(calculation))
        return t2info.loc[maxrow_in_t2]
    
    #%% Iterate through filtering relevant data, optimizing, returning info
    def iterthrough():
        for row_index, row in table1.iterrows():   
            t2info = table2[table2.letter == row['letter']].reset_index()
            table3.iloc[row_index,:] = optimize(t2info, row['number1'])
    
    %timeit iterthrough()
    print(table3)
    
    print('\n-- itertuples() --')
    def optimize(t2info, n1):
        calculation = []
        for index, letter, n2 in t2info.itertuples():
            calculation.append(n2 * n1)
        maxrow = calculation.index(max(calculation))
        return t2info.iloc[maxrow]
    
    def iterthrough():
        for row_index, letter, n1 in table1.itertuples():   
            t2info = table2[table2.letter == letter]
            table3.iloc[row_index,:] = optimize(t2info, n1)
    
    %timeit iterthrough()
    
    
    print('\n-- to_records() --')
    def optimize(t2info, n1):
        calculation = []
        for index, letter, n2 in t2info.to_records():
            calculation.append(n2 * n1)
        maxrow = calculation.index(max(calculation))
        return t2info.iloc[maxrow]
    
    def iterthrough():
        for row_index, letter, n1 in table1.to_records():   
            t2info = table2[table2.letter == letter]
            table3.iloc[row_index,:] = optimize(t2info, n1)
    
    %timeit iterthrough()
    
    print('\n-- Use group by --')
    
    def iterthrough():
        ret = []
        grouped = table2.groupby('letter', sort=False)
        for index, letter, n1 in table1.to_records():
            t2 = table2.iloc[grouped.groups[letter]]
            calculation = t2.number2 * n1
            maxrow = calculation.argsort().iloc[-1]
            ret.append(t2.iloc[maxrow])
        global table3
        table3 = pd.DataFrame(ret)
    
    %timeit iterthrough()
    print(table3)
    
    print('\n-- Even Faster --')
    def iterthrough():
        ret = []
        grouped = table2.groupby('letter', sort=False)
        t2info = table2.to_records()
        for index, letter, n1 in table1.to_records():
            t2 = t2info[grouped.groups[letter].values]
            maxrow = np.multiply(t2.number2, n1).argmax()
            # `[1:]`  removes the index column
            ret.append(t2[maxrow].tolist()[1:])
        global table3
        table3 = pd.DataFrame(ret, columns=('letter', 'number2'))
    
    %timeit iterthrough()
    print(table3)
    

    The final version is almost 10x faster than the original code. The strategy is:

    1. Use groupby to avoid repeated comparing of values.
    2. Use to_records to access raw numpy.records objects.
    3. Don't operate on DataFrame until you have compiled all the data.
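
    Strategy 1 depends on the lookup tables that groupby builds once. The sketch below shows them for the table2 defined in the full code; the relabeled copy is a hypothetical addition that illustrates why using groups with iloc is only safe when the frame has the default integer index.

```python
import pandas as pd

table2 = pd.DataFrame({"letter": ["a", "a", "b", "b", "c", "d", "c"],
                       "number2": [0.2, 0.5, 0.1, 0.4, 5, 4, 1]})

grouped = table2.groupby("letter", sort=False)

# .groups maps each key to index LABELS, .indices maps it to row POSITIONS.
labels_c = list(grouped.groups["c"])
positions_c = list(grouped.indices["c"])
print(labels_c, positions_c)

# With a non-default index (hypothetical relabeling), labels and positions differ,
# so .indices is the one to pair with iloc or with a record array.
relabeled = table2.set_index(pd.Index(list("tuvwxyz")))
regrouped = relabeled.groupby("letter", sort=False)
relabeled_labels_c = list(regrouped.groups["c"])
relabeled_positions_c = list(regrouped.indices["c"])
print(relabeled_labels_c, relabeled_positions_c)
```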
  • 2020-11-21 08:04

    Yes, pandas itertuples() is faster than iterrows(). See the documentation: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.iterrows.html

    "To preserve dtypes while iterating over the rows, it is better to use itertuples() which returns namedtuples of the values and which is generally faster than iterrows."

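    The dtype point in that quote can be checked with a short sketch. The frame and its column names are invented for the example: iterrows() packs each row into a single Series, so a mixed int/float row is upcast to one common dtype, while itertuples() hands back the values column by column.

```python
import pandas as pd

# Hypothetical frame with one integer and one float column.
df = pd.DataFrame({"qty": [1, 2], "ratio": [0.5, 1.5]})

# iterrows() yields (index, Series); the Series has a single dtype for the whole row.
_, row_series = next(df.iterrows())

# itertuples() yields a namedtuple; each field keeps its own column's type.
row_tuple = next(df.itertuples(index=False))

print(df["qty"].dtype)      # dtype of the original column
print(row_series.dtype)     # dtype of the boxed row
print(type(row_tuple.qty))  # type of the same value coming from itertuples()
```
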