How to reduce the time taken to load a pickle file in Python

星月不相逢 2020-12-08 02:47

I have created a dictionary in Python and dumped it to a pickle file. Its size came to 300 MB. Now I want to load the same pickle.

output = open('myfile.pkl', 'rb


        
3 Answers
  • 2020-12-08 02:49

    I've had good results reading huge files (e.g. a ~750 MB igraph object stored as a binary pickle file) using cPickle itself. The trick is simply to disable the garbage collector around the pickle load call, as mentioned here.

    An example snippet for your case would be something like:

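    import gc
    import timeit

    try:
        import cPickle as pickle  # Python 2: C implementation
    except ImportError:
        import pickle  # Python 3: the C accelerator is used automatically


    def load_cpickle_gc():
        with open('myfile3.pkl', 'rb') as output:
            # Disable the garbage collector while the objects are created.
            gc.disable()
            try:
                mydict = pickle.load(output)
            finally:
                # Re-enable the garbage collector even if loading fails.
                gc.enable()
        return mydict


    if __name__ == '__main__':
        print("cPickle load (with gc workaround): ")
        # Passing the function itself avoids importing this file by name.
        t = timeit.Timer(load_cpickle_gc)
        print(t.timeit(1))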
    

    There may well be better ways to do this, but this workaround reduces the load time drastically. (For me, it went from 843.04 s to 41.28 s, around 20x faster.)

  • 2020-12-08 03:13

    Try using the json library instead of pickle. This should be an option in your case because you're dealing with a dictionary, which is a relatively simple object.

    According to this website,

    JSON is 25 times faster in reading (loads) and 15 times faster in writing (dumps).

    Also see this question: What is faster - Loading a pickled dictionary object or Loading a JSON file - to a dictionary?
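
    As a minimal sketch of the JSON route, assuming the dictionary holds only JSON-representable values: the sample data and the temporary file path below are hypothetical stand-ins for the real 300 MB dictionary. The last lines show a caveat worth checking before switching, namely that JSON object keys always come back as strings.

```python
import json
import os
import tempfile

# Hypothetical stand-in for the large dictionary from the question.
data = {"key_%d" % i: [i, i * 0.5, str(i)] for i in range(1000)}

# Hypothetical location; a temporary directory keeps the sketch self-contained.
path = os.path.join(tempfile.mkdtemp(), "myfile.json")

# Write the dictionary as JSON text.
with open(path, "w") as f:
    json.dump(data, f)

# Read it back into a plain dict.
with open(path) as f:
    loaded = json.load(f)

print(loaded == data)  # prints True

# Caveat: JSON object keys are always strings, so an int key changes type.
print(json.loads(json.dumps({1: "a"})))  # prints {'1': 'a'}
```

    Whether this is actually faster than pickle depends on the Python version and the data, so it is worth timing on the real file.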

    Upgrading Python, or using the marshal module with a fixed Python version (the marshal format is not guaranteed to be compatible between Python versions), also helps boost speed (code adapted from here):

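    try:
        import cPickle  # Python 2: C implementation of pickle
    except ImportError:
        # Python 3 has no cPickle; fall back, so "pickle" is benchmarked twice.
        import pickle as cPickle
    import pickle
    import json, marshal, random
    from time import time
    from hashlib import md5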
    
    test_runs = 1000
    
    if __name__ == "__main__":
        payload = {
            "float": [(random.randrange(0, 99) + random.random()) for i in range(1000)],
            "int": [random.randrange(0, 9999) for i in range(1000)],
            "str": [md5(str(random.random()).encode('utf8')).hexdigest() for i in range(1000)]
        }
        modules = [json, pickle, cPickle, marshal]
    
        for payload_type in payload:
            data = payload[payload_type]
            for module in modules:
                start = time()
                if module.__name__ in ['pickle', 'cPickle']:
                    for i in range(test_runs): serialized = module.dumps(data, protocol=-1)
                else:
                    for i in range(test_runs): serialized = module.dumps(data)
                w = time() - start
                start = time()
                for i in range(test_runs):
                    unserialized = module.loads(serialized)
                r = time() - start
                print("%s %s W %.3f R %.3f" % (module.__name__, payload_type, w, r))
    

    Results:

    C:\Python27\python.exe -u "serialization_benchmark.py"
    json int W 0.125 R 0.156
    pickle int W 2.808 R 1.139
    cPickle int W 0.047 R 0.046
    marshal int W 0.016 R 0.031
    json float W 1.981 R 0.624
    pickle float W 2.607 R 1.092
    cPickle float W 0.063 R 0.062
    marshal float W 0.047 R 0.031
    json str W 0.172 R 0.437
    pickle str W 5.149 R 2.309
    cPickle str W 0.281 R 0.156
    marshal str W 0.109 R 0.047
    
    C:\pypy-1.6\pypy-c -u "serialization_benchmark.py"
    json int W 0.515 R 0.452
    pickle int W 0.546 R 0.219
    cPickle int W 0.577 R 0.171
    marshal int W 0.032 R 0.031
    json float W 2.390 R 1.341
    pickle float W 0.656 R 0.436
    cPickle float W 0.593 R 0.406
    marshal float W 0.327 R 0.203
    json str W 1.141 R 1.186
    pickle str W 0.702 R 0.546
    cPickle str W 0.828 R 0.562
    marshal str W 0.265 R 0.078
    
    c:\Python34\python -u "serialization_benchmark.py"
    json int W 0.203 R 0.140
    pickle int W 0.047 R 0.062
    pickle int W 0.031 R 0.062
    marshal int W 0.031 R 0.047
    json float W 1.935 R 0.749
    pickle float W 0.047 R 0.062
    pickle float W 0.047 R 0.062
    marshal float W 0.047 R 0.047
    json str W 0.281 R 0.187
    pickle str W 0.125 R 0.140
    pickle str W 0.125 R 0.140
    marshal str W 0.094 R 0.078
    

    Python 3.4 uses pickle protocol 3 by default, which showed no difference compared to protocol 4. In Python 2, the highest pickle protocol is 2 (selected when a negative value is passed to dump), which is twice as slow as protocol 3. In the Python 3.4 run, pickle appears twice because the cPickle import falls back to pickle.

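
    To make the protocol choice concrete, here is a small sketch on Python 3, assuming hypothetical sample data. It serializes the same dictionary with protocol 2 and with the newest protocol the interpreter supports, then loads both back. It does not measure speed; timing should be done on the real data.

```python
import pickle

# Hypothetical sample data; the real dictionary would be far larger.
data = {"key_%d" % i: [i, float(i), str(i)] for i in range(1000)}

# Protocol 2 is the highest protocol that Python 2 can read and write.
blob_v2 = pickle.dumps(data, protocol=2)

# HIGHEST_PROTOCOL (the same as passing -1) selects the newest protocol
# supported by the running interpreter.
blob_new = pickle.dumps(data, protocol=pickle.HIGHEST_PROTOCOL)

print(pickle.HIGHEST_PROTOCOL, len(blob_v2), len(blob_new))

# loads() detects the protocol from the data, so no argument is needed.
restored_v2 = pickle.loads(blob_v2)
restored_new = pickle.loads(blob_new)
print(restored_v2 == data and restored_new == data)  # prints True
```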
  • 2020-12-08 03:13

    If you are storing the dictionary in a single file, it's the load time for that large file that is slowing you down. One of the easiest things you can do is to write the dictionary to a directory on disk, with each dictionary entry stored as an individual file. The files can then be pickled and unpickled in multiple threads (or using multiprocessing). For a very large dictionary, this should be much faster than reading from and writing to a single file, regardless of the serializer you choose. Packages such as klepto and joblib already do much, if not all, of the above for you, so I'd check them out. (Note: I am the klepto author. See https://github.com/uqfoundation/klepto).
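
    The one-file-per-entry idea can be sketched with the standard library alone. This is a hand-rolled illustration, not how klepto or joblib implement it: the helper names dump_entries and load_entries, the entry_N.pkl naming scheme, and the sample data are all assumptions made for the example.

```python
import os
import pickle
import tempfile
from concurrent.futures import ThreadPoolExecutor


def dump_entries(mapping, directory):
    """Write each dictionary entry to its own pickle file."""
    for index, (key, value) in enumerate(mapping.items()):
        # Store key and value together so arbitrary keys need no
        # filename escaping; the file name is just a counter.
        path = os.path.join(directory, "entry_%d.pkl" % index)
        with open(path, "wb") as f:
            pickle.dump((key, value), f, protocol=pickle.HIGHEST_PROTOCOL)


def _load_one(path):
    """Load a single (key, value) pair from one entry file."""
    with open(path, "rb") as f:
        return pickle.load(f)


def load_entries(directory, max_workers=4):
    """Read all entry files back, using a pool of worker threads."""
    paths = [os.path.join(directory, name)
             for name in os.listdir(directory) if name.endswith(".pkl")]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return dict(pool.map(_load_one, paths))


# Hypothetical sample data standing in for the large dictionary.
data = {"key_%d" % i: list(range(i)) for i in range(100)}
directory = tempfile.mkdtemp()
dump_entries(data, directory)
restored = load_entries(directory)
print(restored == data)  # prints True
```

    Threads mainly help by overlapping file reads, since unpickling itself holds the interpreter lock. If loading turns out to be CPU-bound, a process pool (the multiprocessing option mentioned above) may be the better fit, at the cost of sending the loaded objects back between processes.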
