pandas df.loc[z,x]=y how to improve speed?

前端 未结 3 1274
陌清茗
陌清茗 2020-12-31 08:23

I have identified one pandas command

timeseries.loc[z, x] = y

to be responsible for most of the time spent in an iteration. And now I am l

相关标签:
3条回答
  • 2020-12-31 09:05

    I always think at is the fastest, but not. ix is faster:

    import pandas as pd
    
    df = pd.DataFrame({'A':[1,2,3],
                       'B':[4,5,6],
                       'C':[7,8,9],
                       'D':[1,3,5],
                       'E':[5,3,6],
                       'F':[7,4,3]})
    
    print (df)
       A  B  C  D  E  F
    0  1  4  7  1  5  7
    1  2  5  8  3  3  4
    2  3  6  9  5  6  3
    
    print (df.at[2, 'B'])
    6
    print (df.ix[2, 'B'])
    6
    print (df.loc[2, 'B'])
    6
    
    In [77]: %timeit df.at[2, 'B']
    10000 loops, best of 3: 44.6 µs per loop
    
    In [78]: %timeit df.ix[2, 'B']
    10000 loops, best of 3: 40.7 µs per loop
    
    In [79]: %timeit df.loc[2, 'B']
    1000 loops, best of 3: 681 µs per loop
    

    EDIT:

    I try MaxU df and differences are caused random.randint function:

    df = pd.DataFrame(np.random.rand(10**7, 5), columns=list('ABCDE'))
    
    
    In [4]: %timeit (df.ix[2, 'B'])
    The slowest run took 25.80 times longer than the fastest. This could mean that an intermediate result is being cached.
    10000 loops, best of 3: 20.7 µs per loop
    
    In [5]: %timeit (df.ix[random.randint(0, 10**7), 'B'])
    The slowest run took 9.42 times longer than the fastest. This could mean that an intermediate result is being cached.
    10000 loops, best of 3: 28 µs per loop
    
    0 讨论(0)
  • 2020-12-31 09:16

    if you are adding rows inside a loop consider thses performance issues; for around first 1000 to 2000 records "my_df.loc" performance is better and gradually it is become slower by increasing the number of records in loop.

    If you plan to do thins inside a big loop(say 10M‌ records or so) you are better to use a mixture of "iloc" and "append"; fill a temp datframe with iloc untill the size gets around 1000, then append it to the original dataframe, and empy the temp dataframe. this would boost your performance around 10 times

    0 讨论(0)
  • 2020-12-31 09:19

    UPDATE: starting from Pandas 0.20.1 the .ix indexer is deprecated, in favor of the more strict .iloc and .loc indexers.

    =====================================================================

    @jezrael has provided an interesting comparison and i decided to repeat it using more indexing methods and against 10M rows DF (actually the size doesn't matter in this particular case):

    setup:

    In [15]: df = pd.DataFrame(np.random.rand(10**7, 5), columns=list('abcde'))
    
    In [16]: df.info()
    <class 'pandas.core.frame.DataFrame'>
    RangeIndex: 10000000 entries, 0 to 9999999
    Data columns (total 5 columns):
    a    float64
    b    float64
    c    float64
    d    float64
    e    float64
    dtypes: float64(5)
    memory usage: 381.5 MB
    
    In [17]: df.shape
    Out[17]: (10000000, 5)
    

    Timing:

    In [37]: %timeit df.loc[random.randint(0, 10**7), 'b']
    1000 loops, best of 3: 502 µs per loop
    
    In [38]: %timeit df.iloc[random.randint(0, 10**7), 1]
    1000 loops, best of 3: 394 µs per loop
    
    In [39]: %timeit df.at[random.randint(0, 10**7), 'b']
    10000 loops, best of 3: 66.8 µs per loop
    
    In [41]: %timeit df.iat[random.randint(0, 10**7), 1]
    10000 loops, best of 3: 32.9 µs per loop
    
    In [42]: %timeit df.ix[random.randint(0, 10**7), 'b']
    10000 loops, best of 3: 64.8 µs per loop
    
    In [43]: %timeit df.ix[random.randint(0, 10**7), 1]
    1000 loops, best of 3: 503 µs per loop
    

    Results as a bar plot:

    Timing data as DF:

    In [88]: r
    Out[88]:
           method  timing
    0         loc   502.0
    1        iloc   394.0
    2          at    66.8
    3         iat    32.9
    4    ix_label    64.8
    5  ix_integer   503.0
    
    In [89]: r.to_dict()
    Out[89]:
    {'method': {0: 'loc',
      1: 'iloc',
      2: 'at',
      3: 'iat',
      4: 'ix_label',
      5: 'ix_integer'},
     'timing': {0: 502.0,
      1: 394.0,
      2: 66.799999999999997,
      3: 32.899999999999999,
      4: 64.799999999999997,
      5: 503.0}}
    

    Plotting

    ax = sns.barplot(data=r, x='method', y='timing')
    ax.tick_params(labelsize=16)
    [ax.annotate(str(round(p.get_height(),2)), (p.get_x() + 0.2, p.get_height() + 5)) for p in ax.patches]
    ax.set_xlabel('indexing method', size=20)
    ax.set_ylabel('timing (microseconds)', size=20)
    
    0 讨论(0)
提交回复
热议问题