Why does pickle take so much longer than np.save?

北恋 2020-12-22 00:43

I want to save a dict or arrays.

I tried both np.save and pickle, and the former always takes much less time.

3 Answers
  • 2020-12-22 01:02

    Because as long as the written object contains no Python data,

    • numpy objects are represented in memory in a much simpler way than Python objects
    • numpy.save hands the array's buffer to compiled C code for writing
    • numpy.save writes in a supersimple format that needs minimal processing

    meanwhile

    • Python objects have a lot of overhead
    • pickle is written in Python (although CPython 3 uses the C-accelerated _pickle module when it is available)
    • pickle transforms the data considerably from the underlying representation in memory to the bytes being written on the disk

    Note that if a numpy array does contain Python objects, then numpy just pickles the array, and all the win goes out the window.
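    As a rough illustration of the "supersimple format" point, the sketch below saves a plain numeric array to an in-memory buffer and checks that the file is a short header followed by the array's raw bytes. The array `arr` and the variable names are made up for this example; only `np.save` and `ndarray.tobytes` are real NumPy calls.

```python
import io

import numpy as np

# Hypothetical example array; any C-contiguous numeric array should behave the same.
arr = np.arange(12, dtype=np.int64).reshape(3, 4)

# Save into an in-memory buffer instead of a file on disk.
buf = io.BytesIO()
np.save(buf, arr)
raw = buf.getvalue()

# The .npy format starts with a magic string ...
has_magic = raw.startswith(b"\x93NUMPY")
# ... and ends with the unmodified in-memory bytes of the array.
payload_is_raw_buffer = raw.endswith(arr.tobytes())
# Whatever precedes the payload is the header (magic, version, dtype/shape description).
header_size = len(raw) - arr.nbytes

print(has_magic, payload_is_raw_buffer, header_size)
```

    Because the payload is just the array's buffer, there is very little per-element work to do for numeric dtypes.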

  • 2020-12-22 01:07

    This is because pickle works on all sorts of Python objects and is written in pure Python, whereas np.save is designed for arrays and saves them in an efficient format.

    According to the numpy.save documentation, it can fall back to pickle behind the scenes (for object arrays, when allow_pickle is true). This may limit portability between versions of Python, and loading such a file runs the risk of executing arbitrary code (a general risk when unpickling an unknown object).
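    A minimal sketch of that pickle fallback, assuming a reasonably recent NumPy where `np.save` and `np.load` both accept `allow_pickle`. The array `obj_arr` and the helper `try_save` are hypothetical names used only for this illustration.

```python
import io

import numpy as np

# Hypothetical object array holding arbitrary Python objects.
obj_arr = np.empty(2, dtype=object)
obj_arr[0] = {"a": 1}
obj_arr[1] = [1, 2, 3]


def try_save(array, allow_pickle):
    """Return a buffer with the saved array, or None if NumPy refuses to save it."""
    buf = io.BytesIO()
    try:
        np.save(buf, array, allow_pickle=allow_pickle)
    except ValueError:
        # Object arrays cannot be written without pickle.
        return None
    return buf


# With pickling disabled, the object array is rejected.
refused = try_save(obj_arr, allow_pickle=False) is None

# With pickling enabled, the save succeeds; loading must opt in as well.
buf = try_save(obj_arr, allow_pickle=True)
buf.seek(0)
restored = np.load(buf, allow_pickle=True)

print(refused, restored)
```

    Passing `allow_pickle=True` to `np.load` should only be done for files from a trusted source, for the reason given above.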

    Useful reference: This answer

  • 2020-12-22 01:08

    I think you need better timings. I also disagree with the accepted answer.

    b is a dictionary with 9 keys; the values are lists of arrays. That means both pickle.dump and np.save will be using each other - pickle uses save to pickle the arrays, save uses pickle to save the dictionary and list.

    save writes arrays. That means it has to wrap your dictionary in an object dtype array in order to save it.

    In [6]: np.save('test1',b)
    In [7]: d=np.load('test1.npy')
    In [8]: d
    Out[8]: 
    array({0: [array([0, 0, 0, 0])], 1: [array([1, 0, 0, 0]), array([0, 1, 0, 0]), .... array([ 1, -1,  0,  0]), array([ 1,  0, -1,  0]), array([ 1,  0,  0, -1])]},
          dtype=object)
    In [9]: d.shape
    Out[9]: ()
    In [11]: list(d[()].keys())
    Out[11]: [0, 1, 2, 3, 4, 5, 6, 7, 8]
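    The session above can be approximated with a self-contained sketch. The dictionary `b` here is a small hypothetical stand-in, since the original data is not shown, and an in-memory buffer replaces the file. Note that current NumPy versions need `allow_pickle=True` in `np.load` to read an object array back.

```python
import io

import numpy as np

# Hypothetical stand-in for `b`: a dict whose values are lists of small arrays.
b = {k: [np.full(4, k), np.arange(4) + k] for k in range(3)}

buf = io.BytesIO()
np.save(buf, b)  # the dict is wrapped in a 0-d object array before saving
buf.seek(0)

# Object arrays are stored via pickle, so loading has to opt in explicitly.
d = np.load(buf, allow_pickle=True)

print(d.shape, d.dtype)  # 0-d array with object dtype

# Unwrap the single element to get the dictionary back; d[()] works too.
restored = d.item()
print(sorted(restored.keys()))
```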
    

    Some timings:

    In [12]: timeit np.save('test1',b)
    850 µs ± 36.9 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
    In [13]: timeit d=np.load('test1.npy')
    566 µs ± 6.44 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
    
    In [20]: %%timeit 
        ...: with open('testpickle', 'wb') as myfile:
        ...:     pickle.dump(b, myfile)
        ...:     
    505 µs ± 9.24 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
    In [21]: %%timeit 
        ...: with open('testpickle', 'rb') as myfile:
        ...:     g1 = pickle.load(myfile)
        ...:     
    152 µs ± 4.83 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
    

    In my timings pickle is faster.

    The pickle file is slightly smaller:

    In [23]: ll test1.npy testpickle
    -rw-rw-r-- 1 paul 5740 Aug 14 08:40 test1.npy
    -rw-rw-r-- 1 paul 4204 Aug 14 08:43 testpickle
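    To repeat this kind of comparison outside IPython, something like the following sketch could be used. It is not the benchmark from this answer: `b` is a hypothetical stand-in, the data is written to in-memory buffers rather than files, and the helper names are invented, so the numbers will differ from those above.

```python
import io
import pickle
import timeit

import numpy as np

# Hypothetical stand-in for `b`: 9 keys, each mapping to a list of small arrays.
b = {k: [np.arange(4) * k for _ in range(k + 1)] for k in range(9)}


def save_with_numpy():
    """Serialize `b` with np.save (wrapped in an object array, then pickled)."""
    buf = io.BytesIO()
    np.save(buf, b)
    return buf.getvalue()


def save_with_pickle():
    """Serialize `b` directly with pickle."""
    buf = io.BytesIO()
    pickle.dump(b, buf)
    return buf.getvalue()


n = 200  # number of repetitions per measurement
t_numpy = timeit.timeit(save_with_numpy, number=n) / n
t_pickle = timeit.timeit(save_with_pickle, number=n) / n

size_numpy = len(save_with_numpy())
size_pickle = len(save_with_pickle())

print(f"np.save:     {t_numpy * 1e6:.1f} µs per call, {size_numpy} bytes")
print(f"pickle.dump: {t_pickle * 1e6:.1f} µs per call, {size_pickle} bytes")
```

    Timing in memory removes disk effects, which makes the serialization cost itself easier to compare.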
    