Why does pickle take so much longer than np.save?

后端未结


北恋

I want to save a dict of arrays.

I tried both np.save and pickle, and found that the former always takes much less time.


3 Answers

旧时难觅i

2020-12-22 01:02
Because, as long as the object being written contains no Python objects:
- numpy arrays are represented in memory much more simply than Python objects
- numpy.save hands the array's buffer to compiled (C-level) I/O routines
- numpy.save writes a very simple format that needs minimal processing
Meanwhile:
- Python objects carry a lot of overhead
- pickle was originally written in pure Python (Python 3 uses a C accelerator when available, but it still has to walk the object graph one object at a time)
- pickle transforms the data considerably from its in-memory representation to the bytes written to disk
Note that if a numpy array does contain Python objects (dtype=object), numpy simply pickles the array, and the speed advantage disappears.
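The split between plain numeric arrays and object arrays can be checked directly. The following is a minimal sketch, not a benchmark: the variable names and sample data are made up for illustration, and it writes to in-memory buffers rather than to disk. It shows that a numeric array is stored as a short header followed by the raw array bytes, while an object array cannot be saved at all once pickling is disallowed.

```python
import io

import numpy as np

# A plain numeric array: saved as a short header plus the raw buffer.
numeric = np.arange(12, dtype=np.int64).reshape(3, 4)
buf = io.BytesIO()
np.save(buf, numeric, allow_pickle=False)
raw = buf.getvalue()

has_npy_magic = raw.startswith(b"\x93NUMPY")         # .npy magic string
payload_is_raw = raw.endswith(numeric.tobytes())     # data bytes are unchanged

# An object array (illustrative contents): numpy can only store it via pickle.
objects = np.empty(2, dtype=object)
objects[0] = {"a": 1}
objects[1] = [1, 2, 3]

try:
    np.save(io.BytesIO(), objects, allow_pickle=False)
    object_needs_pickle = False
except ValueError:
    # Raised because object arrays require allow_pickle=True.
    object_needs_pickle = True

print(has_npy_magic, payload_is_raw, object_needs_pickle)
```

With `allow_pickle=True` (the default for `np.save`), the second call would succeed, but the payload would then be a pickle stream rather than a raw buffer.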
终归单人心

2020-12-22 01:07

This is because pickle works on all sorts of Python objects and has to serialize them one by one (the original implementation is pure Python; Python 3 adds a C accelerator), whereas np.save is designed for arrays and saves them in an efficient format.

According to the numpy.save documentation, it can actually use pickle behind the scenes: object arrays are pickled when allow_pickle is enabled. This may limit portability between versions of Python, and loading such a file carries the risk of executing arbitrary code (a general risk when unpickling data from an unknown source).
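As a hedged illustration of that pickle fallback, the sketch below saves a small dictionary (hypothetical sample data) and then loads it back. It assumes a NumPy version in which `np.load` defaults to `allow_pickle=False` (1.16.3 or later), so the pickled payload is refused unless the caller opts in explicitly.

```python
import io

import numpy as np

# Hypothetical sample data; np.save wraps the dict in a 0-d object array.
data = {"k": np.zeros(3)}

buf = io.BytesIO()
np.save(buf, data)          # allow_pickle=True is the default for saving

buf.seek(0)
try:
    np.load(buf)            # default allow_pickle=False (assumed NumPy >= 1.16.3)
    refused = False
except ValueError:
    refused = True          # pickled object arrays are rejected by default

buf.seek(0)
# Opt in only for files you trust: unpickling can execute arbitrary code.
restored = np.load(buf, allow_pickle=True).item()

print(refused, list(restored))
```

Opting in with `allow_pickle=True` is only reasonable for files whose origin is known.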

Useful reference: This answer


礼貌的吻别

2020-12-22 01:08

I think you need better timings. I also disagree with the accepted answer.

b is a dictionary with 9 keys; the values are lists of arrays. That means pickle.dump and np.save each end up relying on the other: pickle calls numpy's own array serialization to pickle the arrays, and save uses pickle to save the dictionary and lists.

save writes arrays. That means it has to wrap your dictionary in an object-dtype array in order to save it.

In [6]: np.save('test1',b)
In [7]: d=np.load('test1.npy')
In [8]: d
Out[8]: 
array({0: [array([0, 0, 0, 0])], 1: [array([1, 0, 0, 0]), array([0, 1, 0, 0]), .... array([ 1, -1,  0,  0]), array([ 1,  0, -1,  0]), array([ 1,  0,  0, -1])]},
      dtype=object)
In [9]: d.shape
Out[9]: ()
In [11]: list(d[()].keys())
Out[11]: [0, 1, 2, 3, 4, 5, 6, 7, 8]

Some timings:

In [12]: timeit np.save('test1',b)
850 µs ± 36.9 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [13]: timeit d=np.load('test1.npy')
566 µs ± 6.44 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

In [20]: %%timeit 
    ...: with open('testpickle', 'wb') as myfile:
    ...:     pickle.dump(b, myfile)
    ...:     
505 µs ± 9.24 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [21]: %%timeit 
    ...: with open('testpickle', 'rb') as myfile:
    ...:     g1 = pickle.load(myfile)
    ...:     
152 µs ± 4.83 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

In my timings pickle is faster.
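To repeat this kind of comparison outside IPython, something like the following sketch could be used. The dictionary `b` here is a made-up stand-in (9 keys, each holding a list of small arrays), not the asker's actual data, and the file names and loop count are arbitrary; absolute numbers will vary by machine and library version.

```python
import os
import pickle
import tempfile
import timeit

import numpy as np

# Hypothetical stand-in for the asker's data: 9 keys, lists of small arrays.
b = {k: [np.arange(4) + i for i in range(k + 1)] for k in range(9)}

tmpdir = tempfile.mkdtemp()
npy_path = os.path.join(tmpdir, "test1.npy")
pkl_path = os.path.join(tmpdir, "testpickle")


def save_with_numpy():
    # The dict is wrapped in a 0-d object array and pickled internally.
    np.save(npy_path, b)


def save_with_pickle():
    with open(pkl_path, "wb") as f:
        pickle.dump(b, f)


n = 200  # arbitrary repeat count
t_numpy = timeit.timeit(save_with_numpy, number=n) / n
t_pickle = timeit.timeit(save_with_pickle, number=n) / n

print(f"np.save:     {t_numpy * 1e6:.0f} µs per call")
print(f"pickle.dump: {t_pickle * 1e6:.0f} µs per call")
```

For a fair comparison with plain numeric data, the same harness can be rerun with a single large array in place of `b`, which is the case where np.save avoids pickle entirely.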

The pickle file is slightly smaller:

In [23]: ll test1.npy testpickle
-rw-rw-r-- 1 paul 5740 Aug 14 08:40 test1.npy
-rw-rw-r-- 1 paul 4204 Aug 14 08:43 testpickle
