Why does Pandas .loc speed depend on DataFrame initialization? How to make MultiIndex .loc as fast as possible?

说谎 2021-02-09 20:49

I am trying to improve the performance of some code. I use Pandas 0.19.2 and Python 3.5.

I just realized that the .loc writing on a whole bunch of values at a time has very differ

1 answer
  •  孤街浪徒
    2021-02-09 21:44

    One difference I see here is that you have (effectively) initialized df2 & df4 with dtype=int64 but df & df3 with dtype=object. You could initialize df & df3 with empty real (float) values like this:

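    import numpy as np
    import pandas as pd

    # ncols, nlines, columns and lines are defined as in the question

    # df has a MultiIndex; np.empty gives it a float dtype
    df = pd.DataFrame(np.empty([ncols, nlines]),
                      columns=columns, index=lines)

    # df3 is mono-index; np.empty leaves the values uninitialized
    df3 = pd.DataFrame(np.empty([ncols, nlines]),
                       columns=np.arange(ncols), index=np.arange(nlines))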
    

    You could also add dtype=int to initialize as integers rather than reals, but that didn't seem to make a difference to speed.
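
To see which dtype each style of initialization produces, here is a minimal sketch. The sizes and the names df_obj, df_float and df_int are hypothetical and are not the frames from the question.

```python
import numpy as np
import pandas as pd

nlines, ncols = 4, 3  # small hypothetical sizes, for illustration only

# No data supplied: cells are NaN and the columns come out as object dtype.
df_obj = pd.DataFrame(index=np.arange(nlines), columns=np.arange(ncols))

# Backed by a float array (np.empty leaves the values uninitialized).
df_float = pd.DataFrame(np.empty([nlines, ncols]),
                        index=np.arange(nlines), columns=np.arange(ncols))

# Backed by an array with an explicit integer dtype.
df_int = pd.DataFrame(np.zeros([nlines, ncols], dtype=np.int64),
                      index=np.arange(nlines), columns=np.arange(ncols))

# Print the distinct column dtypes of each frame.
for name, frame in [("df_obj", df_obj), ("df_float", df_float), ("df_int", df_int)]:
    print(name, frame.dtypes.unique())
```

Checking `frame.dtypes` like this before timing anything is a quick way to confirm that two frames being compared really differ only in their index.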

    I get a much faster timing than you did for df4 (with no difference in code), so that's a mystery to me. Anyway, with the above changes to df & df3, the timings for df2, df3 and df4 are close, but unfortunately df is still quite slow.

    %timeit df.loc[(0, 0, 0), (0, 0)] = 2*np.arange(ncols)
    1 loop, best of 3: 418 ms per loop
    
    %timeit df2.loc[:,0] = 2*np.arange(ncols)
    10000 loops, best of 3: 185 µs per loop
    
    %timeit df3.loc[0] = 2*np.arange(ncols)
    10000 loops, best of 3: 116 µs per loop
    
    %timeit df4.loc[:,0] = 2*np.arange(ncols)
    10000 loops, best of 3: 196 µs per loop
    

    Edit to add:

    As for your larger problem with the multi-index, I dunno, but two thoughts:

    1) Expanding on @ptrj's comment, I get a very fast timing for his suggestion (about the same as the simple-index methods):

    %timeit df.loc[(0, 0, 0) ] = 2*np.arange(ncols)
    10000 loops, best of 3: 133 µs per loop
    

    So I again get a very different timing from you (?). And FWIW, when you want the whole row with loc/iloc it is recommended to use : rather than leaving the column reference blank:

    %timeit df.loc[(0, 0, 0), :] = 2*np.arange(ncols)
    1000 loops, best of 3: 223 µs per loop
    

    But as you can see it's a bit slower, so I dunno which way to suggest here. I guess you should generally do it as recommended by the documentation, but on the other hand this may be an important difference in speed for you.
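
The two spellings from point 1 can be checked for equivalence on a small frame. A minimal sketch, assuming a hypothetical 3-level row MultiIndex with plain integer columns (not the frame from the question):

```python
import numpy as np
import pandas as pd

# Hypothetical small frame: 3-level row MultiIndex, simple integer columns.
lines = pd.MultiIndex.from_product([range(2), range(2), range(2)])
ncols = 5
df_a = pd.DataFrame(np.zeros([len(lines), ncols]),
                    index=lines, columns=np.arange(ncols))
df_b = df_a.copy()

values = 2 * np.arange(ncols)

df_a.loc[(0, 0, 0)] = values     # column indexer left out
df_b.loc[(0, 0, 0), :] = values  # explicit ":" selects all columns

# Both assignments should have written the same row.
same = df_a.equals(df_b)
print(same)
```

If the two frames compare equal, the choice between the two forms comes down to speed versus following the documented style.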

    2) Alternatively, this is rather brute force-ish, but you could just save your index/columns, reset the index/columns to be simple, then set index/columns back to multi. Although, that's not really any different from just taking df.values and I suspect not that convenient for you.
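
A rough sketch of that brute-force idea, assuming a hypothetical small frame with a MultiIndex on both axes. The names saved_index, saved_columns and row_pos are illustrative only:

```python
import numpy as np
import pandas as pd

# Hypothetical frame with a MultiIndex on rows and on columns.
lines = pd.MultiIndex.from_product([range(2), range(2), range(2)])
columns = pd.MultiIndex.from_product([range(2), range(3)])
df = pd.DataFrame(np.zeros([len(lines), len(columns)]),
                  index=lines, columns=columns)

# Save the MultiIndex objects, then swap in simple integer labels.
saved_index, saved_columns = df.index, df.columns
df.index = pd.RangeIndex(len(saved_index))
df.columns = pd.RangeIndex(len(saved_columns))

# Translate the MultiIndex label into its integer position once,
# then write through the simple index.
row_pos = saved_index.get_loc((0, 0, 0))
df.loc[row_pos] = 2 * np.arange(len(saved_columns))

# Put the original MultiIndex labels back.
df.index = saved_index
df.columns = saved_columns
print(df.iloc[0].tolist())
```

The label-to-position lookup with get_loc is the extra bookkeeping this approach needs, which is why it may not be very convenient in practice.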
