What are the different use cases of joblib versus pickle?

独厮守ぢ 2020-12-02 07:40

Background: I'm just getting started with scikit-learn, and read a note at the bottom of the page about joblib versus pickle:

it may be more interesting t

3 answers
  • 2020-12-02 08:04

    Thanks to Gunjan for giving us this script! I modified it for Python 3; the results follow the code.

    # compare pickle loaders (Python 3)
    from time import time
    import pickle
    import os
    import _pickle as cPickle  # C accelerator; in Python 3, pickle already uses it when available
    import joblib  # sklearn.externals.joblib was removed in scikit-learn 0.23
    
    file = os.path.join(os.path.dirname(os.path.realpath(__file__)), 'database.clf')
    size = os.path.getsize(file)  # getsize returns bytes, so the "KB" label below is really bytes
    
    # The first load may also pay for reading the file from disk (cold cache).
    t1 = time()
    with open(file, "rb") as f:
        d = pickle.load(f)
    print("time for loading file size with pickle", size, "KB =>", time()-t1)
    
    t1 = time()
    with open(file, "rb") as f:
        cPickle.load(f)
    print("time for loading file size with cpickle", size, "KB =>", time()-t1)
    
    t1 = time()
    joblib.load(file)
    print("time for loading file size joblib", size, "KB =>", time()-t1)
    
    Output:
    
    time for loading file size with pickle 79708 KB => 0.16768312454223633
    time for loading file size with cpickle 79708 KB => 0.0002372264862060547
    time for loading file size joblib 79708 KB => 0.0006849765777587891
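
A single timed call can be misleading: in Python 3, `pickle` and `_pickle` share the same C implementation, so a gap like the one above probably reflects the first read pulling the file from disk rather than the loader itself. Below is a minimal sketch of a fairer harness that warms up once and keeps the best of several runs. The helper names (`load_with_pickle`, `best_load_time`, `compare_loaders`) and the sample file are made up for illustration, and joblib is treated as optional.

```python
import os
import pickle
import tempfile
from time import perf_counter

try:
    import joblib  # optional third-party dependency
except ImportError:
    joblib = None


def load_with_pickle(path):
    """Load a pickle file with the standard library, closing the file afterwards."""
    with open(path, "rb") as f:
        return pickle.load(f)


def best_load_time(loader, path, repeats=5):
    """Return the fastest of `repeats` timed loads, after one untimed warm-up load."""
    loader(path)  # warm-up: brings the file into the OS cache
    timings = []
    for _ in range(repeats):
        start = perf_counter()
        loader(path)
        timings.append(perf_counter() - start)
    return min(timings)


def compare_loaders(path, repeats=5):
    """Time every available loader on the same file; returns {name: seconds}."""
    loaders = {"pickle": load_with_pickle}
    if joblib is not None:
        loaders["joblib"] = joblib.load
    return {name: best_load_time(fn, path, repeats) for name, fn in loaders.items()}


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as workdir:
        # Hypothetical sample data: many small Python objects.
        path = os.path.join(workdir, "sample.pickle")
        with open(path, "wb") as f:
            pickle.dump({str(i): i for i in range(100_000)}, f)
        size = os.path.getsize(path)
        for name, seconds in compare_loaders(path).items():
            print(f"{name}: {seconds:.6f} s for {size} bytes")
```

Taking the minimum over several runs reduces noise from other processes; swap in your own file path to compare loaders on real data.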
    
  • 2020-12-02 08:06

    I came across the same question, so I tried this (with Python 2.7), as I need to load a large pickle file.

    # compare pickle loaders (Python 2.7)
    from time import time
    import pickle
    import os
    try:
        import cPickle
    except ImportError:
        cPickle = None
        print "Cannot import cPickle"
    import joblib
    
    path = "classi.pickle"
    size = os.path.getsize(path)  # getsize returns bytes, so the "KB" label below is really bytes
    
    t1 = time()
    with open(path, "rb") as f:  # pickle files should be opened in binary mode
        d = pickle.load(f)
    print "time for loading file size with pickle", size, "KB =>", time()-t1
    
    if cPickle is not None:  # skip this timing if the import failed
        t1 = time()
        with open(path, "rb") as f:
            cPickle.load(f)
        print "time for loading file size with cpickle", size, "KB =>", time()-t1
    
    t1 = time()
    joblib.load(path)
    print "time for loading file size joblib", size, "KB =>", time()-t1
    

    The output is:

    time for loading file size with pickle 1154320653 KB => 6.75876188278
    time for loading file size with cpickle 1154320653 KB => 52.6876490116
    time for loading file size joblib 1154320653 KB => 6.27503800392
    

    In this test, joblib loaded the file fastest of the three modules (6.28 s, versus 6.76 s for pickle and 52.69 s for cPickle). Thanks.

  • 2020-12-02 08:18
    • joblib is usually significantly faster on large numpy arrays because it has special handling for the array buffers of the numpy data structure. To find out about the implementation details, you can have a look at the source code. It can also compress that data on the fly while pickling, using zlib or lz4.
    • joblib also makes it possible to memory-map the data buffer of an uncompressed joblib-pickled numpy array when loading it, which makes it possible to share memory between processes.
    • If you don't pickle large numpy arrays, then regular pickle can be significantly faster, especially on large collections of small Python objects (e.g. a large dict of str objects), because the pickle module of the standard library is implemented in C while joblib is pure Python.
    • Since PEP 574 (pickle protocol 5) was merged in Python 3.8, it is now much more efficient (memory-wise and CPU-wise) to pickle large numpy arrays using the standard library. Large arrays in this context means 4GB or more.
    • But joblib can still be useful with Python 3.8 to load objects that have nested numpy arrays in memory-mapped mode with mmap_mode="r".
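
The compression and memory-mapping points above can be sketched as follows. This is only a minimal illustration, assuming joblib and numpy are installed; the `roundtrip_demo` helper, the file names, and the sample data are invented for the example.

```python
import os
import tempfile

import joblib
import numpy as np


def roundtrip_demo(workdir):
    """Dump one object with and without compression, then reload it both ways."""
    # Hypothetical "model": a small Python object holding a nested numpy array.
    model = {
        "name": "demo",
        "weights": np.tile(np.arange(8, dtype=np.float64), 50_000),
    }
    raw_path = os.path.join(workdir, "model.joblib")
    packed_path = os.path.join(workdir, "model_compressed.joblib")

    joblib.dump(model, raw_path)                 # uncompressed, can be memory-mapped
    joblib.dump(model, packed_path, compress=3)  # zlib-compressed on the fly

    # mmap_mode="r" maps the nested array read-only instead of copying it into RAM.
    mapped = joblib.load(raw_path, mmap_mode="r")
    restored = joblib.load(packed_path)

    return {
        "raw_bytes": os.path.getsize(raw_path),
        "packed_bytes": os.path.getsize(packed_path),
        "is_memmap": isinstance(mapped["weights"], np.memmap),
        "same_data": bool(np.array_equal(mapped["weights"], restored["weights"])),
        "name": mapped["name"],
    }


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as workdir:
        for key, value in roundtrip_demo(workdir).items():
            print(key, "=", value)
```

Memory mapping only applies to the uncompressed file, so the choice is between a smaller file on disk (`compress`) and a cheaper, shareable load (`mmap_mode`).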